Document Intake and Embedding for Semantic Search -- Part 2
tl;dr:
I implemented semantic search over a collection of documents I had handy. In short, it works too well! My benchmark is saturated; every model performs nearly perfectly. Now I want to find out where they actually differ in performance.
Why?
I’m interested in knowledgebase systems.
What is a knowledgebase?
If you ask me, a knowledgebase is simply a collection of documents and some sort of index mechanism which allows them to be selectively retrieved by their relevance to a given objective. Critically, “a collection” implies curation.1
By this definition, a library might count as a knowledgebase,2 although the index is somewhat coarse-grained. Let’s consider how you might use a library to further develop an idea about what a theoretically ideal knowledgebase might look like.
Case Study: Finding Information at the Library
Suppose you wanted to learn about the main exports from a given geographical region, both current and historical, and develop a decently detailed picture of how trade has shaped the economy of the region over time. Let’s consider how you might use the library:
One approach might be to use the Dewey Decimal system. Here are some areas that might contain books that might contain the information you’re looking for:3
- 330s – Economics generally
- 382 – International commerce, specifically trade and exports
- 338.1 – Agricultural products and their economics
- 338.2 – Extraction of minerals and other natural resources
- 338.4 – Secondary industries and services
- 339 – Macroeconomics, which sometimes covers trade balances
- 910s – Geography, travel, and regional descriptions (often includes economic geography)
- 330.9 + region code – Economic history and conditions by place (e.g., 330.94 for Europe)
Another approach might be to search the catalog by keyword. Additionally, you might look to atlases or almanacs like the CIA World Factbook, but these would only give you higher-level summary information.
However you do it, you’ll end up with perhaps dozens of books. Some of these may have their own internal indexes, some may not. Of course, there could be great value in reading (or even just skimming) these dozens of books, but usually our queries are bounded. We might want to know more than a factbook or atlas could tell us, but not necessarily have use for dozens of books’ worth of adjacent information.
Efficiently Retrieving Information from a Knowledgebase
At this point, if you’re familiar with LLMs but don’t happen to know about embeddings, you might think of scanning the full content of every book with instructed chat sessions. You could just chunk each book into sections, run a chat thread over each chunk, and have each return essentially a classification: “relevant” or “not relevant”.
This would work! Supposing there were 20 books of 50k words each, that’d be approximately 1 million tokens. Running this scan would cost you something like $1, at 2025 prices.4 It might take a while, depending on how bursty your inference provider is willing to let you be.5
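For concreteness, here’s a rough sketch of what that scan could look like, assuming the OpenAI Python client; the model name, chunk size, and prompt are placeholder choices, not recommendations:

```python
# Brute-force relevance scan: classify every chunk with a chat model.
# Model, prompt, and chunk size are placeholders, not recommendations.
from openai import OpenAI

client = OpenAI()

def chunk_text(text: str, words_per_chunk: int = 500) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)]

def is_relevant(chunk: str, query: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder: any cheap chat model
        messages=[
            {"role": "system",
             "content": "Reply with exactly RELEVANT or NOT-RELEVANT."},
            {"role": "user", "content": f"Query: {query}\n\nPassage:\n{chunk}"},
        ],
    )
    return "NOT" not in resp.choices[0].message.content.upper()

# hits = [c for c in chunk_text(book_text) if is_relevant(c, my_query)]
```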
A more efficient strategy for semantic search might be to use text embeddings. You could think of this (very approximately) as a similar sort of LLM “reading” all those same chunks of text and then placing them into an index based on what they’re about. Instead of an alphabetically-sorted keyword index like you might find in the back of a reference book, these indexes are more like a multi-dimensional “map”.6
In this case, the index only needs to be created once, and then it can be used for many different queries. Creating the index involves an LLM and is computationally intensive, though it’s still more efficient than dumping documents into chat threads. In practical terms, you might imagine that all those Dewey Decimal areas above contain 5,000 books, or 250 million words. This amount of material could be indexed for ~$25 in AI inference costs. Then, if your typical query was ~100 words, you could run 100,000 queries for another $1 in AI inference costs. Amortized, this is something like $0.0003 per query.7
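Spelling out that arithmetic (treating words as roughly one token apiece, as above, and assuming the ~10¢-per-million-token small-model price from the footnotes):

```python
# Back-of-envelope amortized cost; all prices are assumptions.
price_per_token = 0.10 / 1_000_000               # ~10 cents per million tokens
index_cost = 250_000_000 * price_per_token       # ~$25 to embed 5,000 books
query_cost = 100_000 * 100 * price_per_token     # ~$1 for 100k 100-word queries
print((index_cost + query_cost) / 100_000)       # ~$0.00026 per query
```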
Running queries like this is also fast: you can pretty easily have results in under a second!
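As a sketch of that query path, assuming Postgres with the pgvector extension and an invented chunks table (none of these names are from my actual schema):

```python
# Nearest-neighbor lookup using pgvector's cosine-distance operator (<=>).
# Table and column names are invented for illustration.
import psycopg

def nearest_chunks(conn: psycopg.Connection,
                   query_vec: list[float], k: int = 5):
    with conn.cursor() as cur:
        cur.execute(
            "SELECT url, seq, content,"
            "       embedding <=> %s::vector AS distance"
            " FROM chunks ORDER BY distance LIMIT %s",
            (str(query_vec), k),
        )
        return cur.fetchall()
```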
What I’m Working On
I’ve continued working on my knowledgebase experiments. Now that I have a collection of documents with which I have at least a passing familiarity, I’m well positioned to run queries on them and assess the vibes.
The flow looks like this (a rough sketch in code follows the list):
- I email myself a URL, possibly with some metadata.
- The URLs are fetched, PDFs are rendered for reading, and the text content is stored.
- The documents are broken up into chunks.
- Embedding vectors are computed for the chunks.
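Compressed into code, that flow might look something like this: a sketch assuming httpx, psycopg, and an OpenAI-style embeddings API, with the email intake and PDF rendering steps elided, and the same invented schema as in the query sketch above:

```python
# Minimal ingest sketch: fetch, chunk, embed, store. Names and schema are
# illustrative; real PDF handling and email intake are omitted.
import httpx
import psycopg
from openai import OpenAI

client = OpenAI()

def ingest(conn: psycopg.Connection, url: str) -> None:
    text = httpx.get(url, follow_redirects=True).text   # fetch the text content
    words = text.split()
    chunks = [" ".join(words[i:i + 300])                # naive fixed-size chunks
              for i in range(0, len(words), 300)]
    resp = client.embeddings.create(                    # embedding vectors
        model="text-embedding-3-small", input=chunks)
    with conn.cursor() as cur:
        for seq, (chunk, item) in enumerate(zip(chunks, resp.data)):
            cur.execute(
                "INSERT INTO chunks (url, seq, content, embedding)"
                " VALUES (%s, %s, %s, %s)",
                (url, seq, chunk, str(item.embedding)),
            )
    conn.commit()
```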
The highlights here are that it’s very easy to experiment with different chunking strategies and with different embedding models. They’re basically pluggable.
If I wanted to, for example, experiment with context-enriched chunking8, I could just write the new chunker and specify a config that uses it. The new table will be created automatically in Postgres and populated with chunks on the next batch run. Adding a new embedding model works similarly: once a new embedding configuration is registered, the schema will be updated automatically and all the embeddings will be computed on the next run. The benchmark script will also be aware of the newly registered config.
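To give a feel for the pluggability, here’s roughly the shape of it; every name here is invented for illustration, not lifted from my actual code:

```python
# Illustrative registry: each (chunker, embedding model) pair becomes a
# config, and the batch runner creates/fills a table per config.
from typing import Protocol

class Chunker(Protocol):
    name: str
    def chunk(self, text: str) -> list[str]: ...

class ParagraphChunker:
    name = "paragraphs_v1"
    def chunk(self, text: str) -> list[str]:
        return [p for p in text.split("\n\n") if p.strip()]

CONFIGS: dict[str, dict] = {}

def register_config(chunker: Chunker, embedding_model: str) -> None:
    # The batch runner would derive a table name from this key, create the
    # table if missing, and backfill chunks/embeddings on the next run.
    CONFIGS[f"{chunker.name}__{embedding_model}"] = {
        "chunker": chunker, "model": embedding_model,
    }

register_config(ParagraphChunker(), "text-embedding-3-small")
```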
Benchmark Saturation
Okay, so my current benchmark is just some queries that I noted down while I was reading some of the articles I mailed myself. I tried to write my queries such that they did not include any of the same high-entropy words as the passages/chunks I hoped they would retrieve, in order to stress the semantic representations.
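For illustration, an entry in the rubric has roughly this shape (the query is real, reused from an example below; the URL and field names are placeholders):

```python
# Illustrative benchmark entry: the query deliberately shares no
# distinctive vocabulary with the passage it should retrieve.
RUBRIC = [
    {
        # the target article says "text-based web browser", not "TUI"
        "query": "an app that lets me look at websites with a TUI",
        "expected_url": "https://example.com/ai-text-browser",  # placeholder
    },
]
```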
In short, all the models, even the smallest, retrieve exactly what my rubric expects. So, if I want to learn about the relative capabilities of these models, I’ll have to think of something else to try.
Why is my benchmark saturated?
I think I chose decently tricky queries that should only be answerable when there’s some degree of concept modeling present (rather than queries that you could just answer by doing text search, or a fancy text-search with a thesaurus). I expected some to fail, but none did. Why? Are these embedding models just far better than I’d expected?
One possibility is that there simply aren’t enough documents embedded in my database and so there’s very little possibility for confoundment. That is to say, maybe these queries are actually casting a very wide net through semantic space, but there’s such a sparse distribution of passages in that space that even a wide net will still manage to catch the right passage (and even rank it as the best one), because there is nothing else to get swept up in the net.
For example, I have one article in my collection that describes a text-based web browser that uses AI to rewrite pages. When I search for “an app that lets me look at websites with a TUI”, well, there are only a handful of articles in my collection that even mention browsers in the first place. So, even if retrieval were relatively imprecise, it might still surface the correct answer! It could be that my collection of documents is sufficiently small that something like this happens for all of them.
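One cheap way to probe this would be to look at the gap between the best and second-best match: a tiny gap on a small corpus would suggest the win owes more to sparsity than to precision. A sketch, assuming unit-normalized embeddings in a NumPy array:

```python
# Margin between the top two cosine similarities for a query.
# Assumes `embeddings` has one unit-normalized row per chunk.
import numpy as np

def top2_margin(query_vec: np.ndarray, embeddings: np.ndarray) -> float:
    sims = embeddings @ query_vec        # cosine similarity, given unit norms
    best, second = np.sort(sims)[-2:][::-1]
    return float(best - second)
```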
Scaling Up the Collection
It might be interesting to try indexing some substantially larger collections of documents. However, this creates difficulties for a vibes-based assessment: ideally I would have actually read all of the documents, so that I have a good sense of what is present in the collection and what degree of confoundment to expect.
I can go and get a large corpus of material, but if I don’t read all of it, I cannot know if a given query returns all of the results that I would reasonably have expected it to return. Certainly I can evaluate the results that are returned and judge their relevance to the query, but I have no idea how many relevant results are not being returned.
Anyway, more documents seems like the best next move. Even if I can’t be as rigorous as I’d like, I can still learn something about how recall performance is affected by the amount of material in the collection.
Some Other Ideas: Knowledge-graph Traversal
I include a document ID and document sequence number in my chunks tables because the sequence number turns the chunks into a very simple graph. That is to say, you could follow the sequence number up or down to get additional context around any given passage. Or an LLM could make a tool call to request this if you were doing RAG and the passage retrieved was relevant but insufficient.
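A sketch of that lookup, reusing the invented schema from earlier (with the URL standing in for the document ID):

```python
# Fetch a hit's surrounding chunks by walking the sequence numbers.
# An LLM tool call could invoke this when a passage needs more context.
import psycopg

def neighboring_chunks(conn: psycopg.Connection,
                       url: str, seq: int, radius: int = 1):
    with conn.cursor() as cur:
        cur.execute(
            "SELECT seq, content FROM chunks"
            " WHERE url = %s AND seq BETWEEN %s AND %s ORDER BY seq",
            (url, seq - radius, seq + radius),
        )
        return cur.fetchall()
```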
I’ve been thinking about also including references to other documents in the metadata if, for example, reading one somehow prompts me to discover the other. That is probably too labor-intensive to track, though. Another possibility, which leverages my own judgement less (but is accordingly easier), would be to just parse links in documents. These could theoretically be crawled automatically. Of course, that eventually covers the whole Internet, so there’d need to be some rules to limit the depth, and I don’t know how much bandwidth I’d want to spend on that anyway.
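The link-extraction part, at least, is simple; here’s a stdlib-only sketch, where the same-host filter is an arbitrary placeholder policy (a real crawler loop around this would also enforce the depth limit):

```python
# Extract outbound links from a document's HTML; a crawler around this
# would enforce depth and bandwidth limits.
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def outbound_links(html: str, base_url: str,
                   same_host_only: bool = True) -> list[str]:
    collector = LinkCollector()
    collector.feed(html)
    links = [urljoin(base_url, href) for href in collector.hrefs]
    if same_host_only:  # placeholder policy to keep the crawl bounded
        host = urlparse(base_url).netloc
        links = [l for l in links if urlparse(l).netloc == host]
    return links
```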
Wikipedia has plenty of internal links and is available as a bulk download, so using their data could sidestep these issues and might be a quicker way to create a toy knowledgebase for exploring graph features.
Code
Footnotes

1. In my manner of speaking then, the Internet isn't a knowledgebase. It may be indexed, but it's not "a collection" in any coherent, intentional sense. Admittedly, this distinction can rapidly become blurry if you consider different degrees and means of "curation", like PageRank etc.
2. I would suggest that it depends on where your specific cutoff is for curation. To what degree do you consider every book in the library to be an authoritative source?
3. This list was suggested by Claude. I don't happen to be all that proficient with the Dewey Decimal system, and I'm presuming that these areas are not entirely ridiculous. Maybe one could make a case to remove some of them, but my point is that there are at least a few different areas and a few different books in each.
4. Okay, sure, it depends on which model you use. Without trying to figure out what would be suitable for this use case, just know that a small model might be as cheap as 10¢ per million tokens, and an expensive one might be as much as $10 per million tokens.
5. If you were limited to a single concurrent chat, it might take 3 hours from a traditional provider, or 15 minutes from Cerebras. Though compute-intensive, a scan like this would be embarrassingly parallel, so you can just divide by how many concurrent chats you can run to get the wall time.
6. In keeping with the case study involving geographically-related research, you could imagine a couple different scenarios:
a) Every paragraph of every book could be placed into a 1-dimensional, alphabetically-sorted index of place names. All the paragraphs about Calgary would be listed under "Calgary". And "Canada" might have a much bigger list of paragraphs: anything that pertains to any place in Canada.
b) Every paragraph of every book could be indexed by giving it an area on a 2-dimensional map. In this case, the position of a paragraph in the index has a real meaning in relation to the world. If you want to learn about the areas surrounding Calgary, you can literally just look up, down, left, and right of Calgary on the map-index, and you'll see paragraphs pertaining to the areas that are _actually_ north, south, west, and east of Calgary.
In the same way that scenario (b) is a more powerful way of organizing information than scenario (a), text embeddings are more powerful than keyword indexes.
However, while a 2-dimensional map is intuitive and directly useful to humans, a vector with hundreds of dimensions is not. Placing information into a high-dimensional space does give the index more structure, and positions in it even carry conceptual meaning the way locations on a map do, but we aren't able to make much sense of this space ourselves. One way to think about using these embeddings for retrieval is to 'use the AI model to put our query onto the map' and then 'find the documents which are positioned near our query on the map'.
To put that into real terminology, we have to embed our queries using AI models to get a point in space, and then we can use the index to find the nearest neighbors to that point.
7. This is a simplified story. I neglected to price the storage/database, which is potentially negligible compared to AI inference, and I priced the AI in dollars even though you could run it yourself on your own hardware (at which point you might think only in terms of electricity, if you were going to own the hardware anyway).
8. An example of this might be to create chunks that contain:
an LLM-generated summary of the previous passages in a given document
+
the current passage
+
a summary of the rest of the document.
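A sketch of assembling such a chunk, where summarize() stands in for an LLM summarization call:

```python
# Hypothetical context-enriched chunk assembly (see the structure above).
def summarize(passages: list[str]) -> str:
    # Placeholder: in practice this would be an LLM summarization call.
    return " ".join(passages)[:300]

def enriched_chunk(passages: list[str], i: int) -> str:
    return "\n\n".join([
        summarize(passages[:i]),        # summary of the preceding passages
        passages[i],                    # the current passage, verbatim
        summarize(passages[i + 1:]),    # summary of the rest of the document
    ])
```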