Document Intake and Embedding for Semantic Search — Part 1

tl;dr:
I built a small service that automatically fetches web documents, pushes PDFs to my e-reader, extracts markdown text content, and stores metadata in MongoDB. I will be computing embeddings over these documents to enable semantic search.

What? Why?

This first phase actually came from my desire to automate a manual workflow that I used for long articles I wanted to read later. I’d email them to myself, then load them in Firefox, render a nice printable format with “Reader Mode”, and print PDFs to send to my e-reader.

Separately, I’d been kicking around ideas for prototypes using embeddings for semantic search and retrieval-augmented generation. These ideas came to a pleasing confluence here. If I’m already intending to pass a large chunk of my reading through a single point, that’s the perfect opportunity to build intuitions about semantic processing!

This post is a bit preliminary; it only covers the document intake pipeline.

Overview

Since deciding to build this, I’ve been emailing myself articles and papers using a structured but human-friendly format. I often find articles on my phone, but I don’t like to read them on there, so I use Android’s “Share” UI to send them via email. This action costs about 3 taps and takes less than a second. Sometimes I add comments or tags, which just means typing them with a line prefix (# or ##).
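
To make that concrete, here is a rough sketch of how such a body could be parsed. The exact convention, i.e. which prefix marks tags versus comments, is my assumption here, not necessarily the real format:

```python
def parse_body(body: str) -> dict:
    """Split an intake email body into the shared URL, comment lines,
    and tag words. Prefix meanings (# = comment, ## = tags) are assumed."""
    url, comments, tags = None, [], []
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("##"):           # assumed: tag line
            tags.extend(line[2:].split())
        elif line.startswith("#"):          # assumed: comment line
            comments.append(line[1:].strip())
        elif url is None and line.startswith("http"):
            url = line                      # the shared link itself
    return {"url": url, "comments": comments, "tags": tags}
```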

Everything goes to an email-alias and automatically ends up in a specific IMAP folder.

Although this is my primary email account, scripted access is low-stakes because I can use a read-only credential. The only cost for this safety is that the intake system has to track which messages have been seen already.

I’d planned to do the IMAP interaction myself, but I was able to skip this by just using mbsync to create a local maildir mirroring the server. Python’s standard library supports maildirs and RFC-5322 email messages, so that’s all very easy.

Firefox’s “Reader Mode” is powered by the readability library, so I’ll use the same library to process pages after I fetch them.

For maximum compatibility, I wanted to use a real browser to load and render pages. Playwright is the obvious choice at the moment, and I happened to recall that shot-scraper wraps it in a handy CLI. I created a fork to add some enhancements to the pdf command, so I can capture HTML, JSON, PDF, MD, and PNG outputs without loading the page more than once.

Markdown content and metadata (including, soon, the embeddings) are stored in MongoDB, which I decided to try because they very recently added vector search to the community edition. PDFs are saved to a Syncthing-synced folder that pushes them to my e-reader. The PDFs carry a human-memorable ID¹ in the filename so they can always be easily tied back to the metadata. This also theoretically permits me to file them (on the e-reader) after I’ve read them, and automatically propagate that human judgement back to the database just by, e.g., running an inotify watch on the sync dir.
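
As a sketch of the ID scheme and the metadata record (field names here are illustrative, not the actual schema):

```python
import secrets
import string

# 36 case-insensitive symbols; 36**5 = 60,466,176 possible IDs.
ALPHABET = string.ascii_lowercase + string.digits

def new_doc_id(length: int = 5) -> str:
    """Generate a short, human-memorable document ID (see footnote 1)."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

def make_metadata(url: str, title: str, tags: list) -> dict:
    """Build a metadata record shaped for MongoDB (illustrative fields)."""
    return {
        "_id": new_doc_id(),   # the same ID goes in the PDF filename
        "url": url,
        "title": title,
        "tags": tags,
    }
```

With pymongo, storing it would look something like `collection.insert_one(make_metadata(...))`.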

Code

There’s not much to see yet, but if you’re doing something similar feel free to take a look and borrow whatever: Source on GitLab.

Next steps

Most articles are already turning into well-rendered PDFs, but I will be adding some tooling for easy debugging of ugly renders. Image widths in particular need work.

As an aside, it would be interesting to learn a bit more about IMAP’s IDLE and NOTIFY mechanisms. I had originally expected to maintain a persistent connection to my mailserver and receive new messages via something like one of those. Instead I’m polling, which is totally fine in this case² but isn’t how I thought it would work.

For the main knowledgebase project, I’ll be experimenting with chunking strategies and embedding models next.





  1. I used 5 case-insensitive alphanumeric characters because it's a good balance between having plenty of IDs (36⁵ ≈ 60 million) and retaining the possibility of carrying one or two in my head. My thinking is that a semantic-search query might return these IDs on my computer, but then, to actually access the document, I might prefer to find it on the tablet. It's less work to just remember 5 characters than to try to do something elaborate.

  2. `mbsync` isn't fully naive (it should be downloading only the deltas each time, not fetching all the messages).

· software, knowledgebase, embeddings, semantic-search, prototyping, AI, local-inference