Document Intake and Embedding: Millions of Tokens from Audio -- Part 3

tl;dr:
I transcribe audio to get ~20 million words of text, then load this into my ‘knowledgebase system’ with several different chunking configurations and embedding models.

How?

For me, the main headline here was that all of this runs faster than expected.

In short, using an RTX 3060 with flash-attention:

You can generate pretty solid transcriptions of audio at ~100x realtime, and
embed text at 2,000 tok/s, even with a (large-ish) 4B embedding model.

Just thinking about it naively without calculating anything, I had underestimated the gains available from batching on a small-ish GPU like this. Considering the actual performance, I scaled up my experiment a bit. Every document is getting embedded 10 times across the different models and chunking configs, so there’s ~200 million tokens going through the embedding pipeline, but that only took ~24 hours.¹

Previously, I observed excellent retrieval behavior from a rather small set of documents, ~500k tokens. Now, with 40x as much text in the system, will retrieval start to fall off, qualitatively?

Analysis: Same Knowledgebase, More Documents

Effects of Collection Scale on Retrieval of Specific Facts Known to Be in the Collection

I have a very basic rubric which attempts to retrieve specific facts from the knowledgebase. The grading is based on whether a given answer string is contained in the returned chunks, and how those chunks rank.

For example:

Query: “What’s a tool that can help me identify keypresses on Linux?”
Answer: must contain “xev”, and be from a certain document.
Score: 5 points if it’s the first ranked result, 4 points if it’s the second ranked result, and so on.

Adding 40x more documents to my collection doesn’t meaningfully impact the retrieval of that particular fact.²

Of course, one should ask at this point where in semantic space I added all these new documents! If you had one document about handling keypresses in Linux and then added a billion documents about baseball, this result would be uninformative. In this case, there’s some technical content mixed into the corpus, but I can’t rigorously quantify the amount at the moment.³

Aside from that, there are serious limitations to this analysis. First of all, I have chosen correct answers ahead of time based on some needles I happened to know were buried in the haystack. With this current method, a retrieval function would score zero points for returning a chunk mentioning wev, which is another tool that might help me identify keypresses on Linux.

Another less immediately obvious limitation is that, since I am experimenting with a range of chunk sizes, these scores are biased in favor of small chunks when the passage the chunks came from contained multiple instances of the answer string. (There are more right answers when there are more matching chunks.)

For now, I’m mostly just trying to gain some first-hand appreciation of working with knowledgebase systems, so I’m happy to have even the barest wisps of signal to investigate. One thing which my current, weak methodology doesn’t overtly preclude me from attempting is comparison of different models’ behavior with the same chunking configuration.

Some signals of interest:

Smaller or lower dimensionality models seem to be more affected by the increased size of the collection.
Larger models seem to perform better, but this effect is not uniform across tasks.

Neither of these is at all surprising, but it’s nice to see that I’m turning up what you’d expect!

I’m very interested in coming up with a believable, qualitative description of the types of tasks where larger models perform meaningfully better, and hopefully one that I can at least somewhat substantiate.

Vibes About Model Size

One query I tried side-by-side with a small model and a large model was: “What type of insurance is most profitable for the insurer?”.

The smaller model just retrieved a bunch of advertisements about insurance.⁴ The larger model retrieved precisely the passage I was looking for, one in which the speaker describes the loss-ratios of different types of insurance products.

Does the larger model include concepts such as the relation between this “loss ratio” and the profitability of a given insurance product? I don’t strictly know.⁵

Another one I tried side-by-side, after randomly selecting a document containing information about porcelain: “What is a type of earthenware material that’s harder than most?”.

The smaller model returns 5 passages about granite, metal alloys, titanium, and fiberglass. A medium-sized model returns 3 passages about porcelain and a couple about rocks and minerals. The larger model returns 5 passages in a row about porcelain.⁶

Ultimately, these are just vibes, but I feel like I’m starting to get a sense for the kinds of concepts which might exist in larger models but might not exist (or might be poorly defined) in smaller ones.

On Retrieval Relevance

One of my benchmark queries (scored as described above) was seriously affected by adding more documents to the system. This was almost certainly an artifact of scoring based on a rubric.

A rubric makes the assumption that we know every correct answer ahead of time. I violated this assumption by adding hundreds of new documents to the corpus without explicitly reviewing every one to find every possible relevant passage for every query in the rubric.

In reality, the newly added documents happen to contain information relevant to the query, and these new passages are being retrieved (just as you would expect from a system that’s functioning properly).

What I really want to do is measure the relevance of the retrieved passages with respect to the query. What I am measuring is only whether the ‘right’ passages (defined as the relevant passages which I already know about) are retrieved. Now there are multiple ‘right’ passages, and I don’t know about all of them.

How do you benchmark the retrieval performance of a system when you don’t know all the answers?

I have some thoughts about this⁷ , but, frankly, I’ve reached a point in this where it’s probably time for me to review the literature.

What’s Next?

Other than hitting up Arxiv, I’m starting to think of quality-of-life improvements for working with this (and related) systems. Multiple times in this project it would have been handy to have a priority queue setup to manage the GPU workload. I keep thinking there must be some way to approximate pre-emptive scheduling (coarsely), but it may very well be more trouble than it’s worth.

I’m developing a composable system of tooling where each tool is specifically designed to be maximally legible to claude-code. The idea is many small and relatively stateless tools which are each available through a CLI, a JSON-RPC interface, and as a Python library. Ultimately, I want to be able to use my Python utilities from Elixir, which motivated the RPC interface.

For example, I implemented the batched transcription part of this project as a tool, since it will be useful in more than one place. If I’m using another Python project, I can install the tool as a library in that project. The same tool could be used in an Elixir project via the JSON-RPC interface over a Unix domain socket. It can also be used from the shell to do a one-off transcription of a folder full of files, or for testing, or for throwing together a quick job written in Bash.

Yes, I know OpenAI will do it for ~$26. My knowledgebase experiments are decidedly local-first though. It would be pretty annoying to take time exploring all the characteristics of their embeddings models and then go on to build a system using them, only to have them be superceded or discontinued.
Interestingly, they don't mention text embeddings at all on their pricing page. If you click through to "detailed pricing", you might notice that "Embeddings" is way down the page next to undesirables like "Moderation" and "Legacy Models". While it might be foolish to read too far into that, I was already disinclined to develop any reliance on closed-weights models, particularly in a tool that other systems might be built on top of.
↩
On a per-query basis, adding 40x more documents to the collection doesn't change the benchmark results much.
↩
Essentially, we are talking about estimating the density or sparsity of points in subregions of the high-dimensional space where our texts are embedded.
↩
One suggestion I've heard but cannot vouch for (although it feels at least kind of truthy) is that smaller models may, to a degree, conflate words like "profitable" with "good" generally. The insurance advertisements did indeed use a lot of positive words about the quality of their insurance, so, maybe?
↩
In this case, the embedding model was Qwen3-Embedding-4B. That's derived from Qwen3-4B. You can download this model and chat with it. So I asked it: "What determines the profitability of a given insurance business?". It mentions "loss ratio" as the first "Core Profitability Metric". That seems kind of suggestive.
↩
I don't think we should imagine that there is any remotely linear correlation between model size and the presence of any given concept, but it is kind of interesting that this n=1 example appears that way.
↩
In essence, the cybernetic principle. Some kind of feedback. Perhaps a task-specific fitness function, rather than attempting to measure 'retrieval relevance'.
↩

January 1, 2026 · project, software, knowledgebase, embeddings, semantic-search, prototyping, AI, local-inference, audio, transcription, GPUs