Podcasts - Filtering Advertisements from Transcripts
tl;dr:
The podcast content I’ve been indexing was diluted with advertisements. I’ve eliminated the majority of them fairly quickly with a basic prompt-based classifier.
Design Considerations
The main factors in play here are my time, inference costs, and the filter’s performance in terms of precision/recall.
If I were aiming for the absolute best performance possible from an LLM-based classifier, without regard for my time or the inference cost, I would probably intervene before my chunking process. At that point, any given podcast is a list of timecoded speech segments, like this:
{
"timestamp": [ 4324.08, 4326.5],
"text": "Buy a product from one of our wonderful sponsors today."
}
{
"timestamp": [ 4327.08, 4330.5],
"text": "We're back! The largest squid can weigh in excess of 1000 pounds,"
}

These segment boundaries impart non-negligible information. They correlate strongly with pauses and transitions between different speakers. They’re also more finely grained than my (~512 token) chunks. I am fairly certain that I could get an LLM to process sequences of segments and, from them, extract the ranges that are advertisements. I am uncertain how much this would cost, though. I think it would require a larger model, more thinking/reasoning tokens, and more time spent on human labeling than the path I took instead.
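For a rough idea of what that segment-range approach might look like, here is a sketch of the framing I have in mind; the prompt wording and the expected JSON schema are placeholders, not anything I actually ran:

def build_segment_range_prompt(segments: list[dict]) -> str:
    # Number each segment so the model can answer with index ranges.
    numbered = "\n".join(
        f'{i}: [{s["timestamp"][0]:.1f}-{s["timestamp"][1]:.1f}] {s["text"]}'
        for i, s in enumerate(segments)
    )
    return (
        "Below are numbered transcript segments from a podcast episode.\n"
        'Return JSON of the form {"ad_ranges": [[first_index, last_index], ...]},\n'
        "covering every contiguous run of segments that is advertising.\n\n"
        + numbered
    )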
My actual solution involves no thinking/reasoning tokens, and works well with very small models (2 - 4 billion parameters). Small is economical to begin with, and operating without thinking/reasoning means I can set the maximum output tokens to near-zero. That’s neat, because then the total cost is essentially just input tokens and can be trivially estimated.
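That estimate is just arithmetic over input tokens. A back-of-the-envelope sketch, where the ~4 characters-per-token ratio, the fixed prompt overhead, and the per-million-token price are all placeholder assumptions:

def estimate_input_cost(chunks: list[str],
                        prompt_overhead_tokens: int = 150,
                        usd_per_million_input_tokens: float = 0.05) -> float:
    # Output is capped near zero, so input tokens dominate the bill.
    total_tokens = sum(len(c) // 4 + prompt_overhead_tokens for c in chunks)
    return total_tokens / 1_000_000 * usd_per_million_input_tokens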
To make the most efficient use of my time, I also wanted to use my existing tools wherever possible, and create a labeling process that I could move through quickly. I decided to operate on a per-chunk basis, so every chunk is treated as either “mostly advertisements, nothing important” or “important content”. This made labeling a breeze, and saved development time by letting me use my existing chunk-based filtering and “llm-map” tool, a straightforward batch prompt runner that just applies a prompt to a list of input texts and returns a result for each input.1
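I’m not reproducing llm-map here, but the heart of a chunk-level yes/no classifier like this is only a few lines. A minimal sketch against an OpenAI-compatible endpoint, where the prompt wording, base URL, and model name are placeholders:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

PROMPT = (
    "You label podcast transcript chunks. Reply with exactly one word:\n"
    "AD if the chunk is mostly advertisements and nothing important,\n"
    "CONTENT otherwise.\n\nChunk:\n{chunk}"
)

def classify_chunks(chunks: list[str], model: str = "some-small-model") -> list[str]:
    # Same prompt applied to every chunk; a tiny max_tokens keeps output cost near zero.
    labels = []
    for chunk in chunks:
        resp = client.chat.completions.create(
            model=model,
            temperature=0.0,
            max_tokens=2,
            messages=[{"role": "user", "content": PROMPT.format(chunk=chunk)}],
        )
        word = (resp.choices[0].message.content or "").strip().upper()
        labels.append("advertisement" if word.startswith("AD") else "content")
    return labels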
How’d it go?
This strategy worked even better than expected. With a considered (but not optimized) prompt, I evaluated performance across a few small models. I threw Claude Sonnet into the ring as well, just out of curiosity. Below is a summary of their performance.
That summary was produced by running my chunk-based filtering on a random sample of chunks, exporting those chunks into label-studio, annotating all of them, and then exporting the full dataset with human and machine labels for analysis.
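The analysis step is just confusion-matrix arithmetic over that export. A sketch, assuming the labels come out as a list of (human, model) pairs with "advertisement" as the positive class:

def ad_metrics(pairs: list[tuple[str, str]], positive: str = "advertisement") -> dict:
    # pairs: (human_label, model_label) for each chunk in the sample.
    tp = sum(h == positive and m == positive for h, m in pairs)
    fp = sum(h != positive and m == positive for h, m in pairs)
    fn = sum(h == positive and m != positive for h, m in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"acc": (tp + tn) / len(pairs), "prec": precision, "rec": recall,
            "f1": f1, "tp": tp, "fp": fp, "tn": tn, "fn": fn}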
I chose Qwen3.5-4b, which arguably outperformed Claude Sonnet 4.6 on this task, for ¹⁄₁₀₀th the price.
Any other cool ideas?
Some podcasts include transcripts on the episodes’ landing pages. It has occurred to me that these transcripts will probably be free of dynamically-injected advertisements. Possibly these ad-free transcripts could be used in conjunction with my ad-containing transcripts to create labeled data automatically, with no human input.
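If I try it, the comparison could be as simple as a sequence alignment: text that exists in my transcript but not in the publisher’s is a candidate "ad" label. A sketch using difflib, assuming both transcripts are normalized into word lists first:

import difflib

def candidate_ad_spans(my_words: list[str], publisher_words: list[str],
                       min_words: int = 30) -> list[tuple[int, int]]:
    # Word ranges present in my transcript but missing from the publisher's.
    matcher = difflib.SequenceMatcher(a=my_words, b=publisher_words, autojunk=False)
    spans = []
    for tag, a_start, a_end, _b_start, _b_end in matcher.get_opcodes():
        # "delete"/"replace" opcodes mark material that only exists on my side;
        # the length threshold skips small transcription differences.
        if tag in ("delete", "replace") and (a_end - a_start) >= min_words:
            spans.append((a_start, a_end))
    return spans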
I’d also like to try using prompt optimization techniques (such as GEPA) where I could experiment with guiding feedback.
What’s next?
While I’ve been working on this, the system has continued to ingest ever more podcasts. It should be possible to visualize some trends in the data very soon!
Build Log
Labeling Setup
I enriched2 my sample with ads to make labeling more efficient. In my feeds, it seems like ads are roughly 10% of the total chunks, and I wanted to label at least a few dozen ‘ad’ chunks without labeling several dozen extra ‘content’ chunks unnecessarily.
Labeling on a chunk basis was quick. I have listened to a lot of podcasts, so I know how the ads typically read, and I know that they are usually above a certain length. Thus, biasing my attention heavily towards the first, last, and middle sentences of a chunk gave me good signal very quickly, and made it possible to finish labeling in just a couple hours. As a post-hoc validation, I gave special attention to cases where the models disagreed with my label and double-checked that I had not made any mistakes.
Inference
I used Qwen3.5-2b locally via vLLM and was very pleased with its performance considering its size. If pursuing maximum cost savings at scale, I’d dig a little deeper into trying to optimize its performance. It could be an excellent case to try out DSPy’s prompt optimization.
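For reference, the local run is little more than vLLM’s offline batch API; the model id below is a placeholder rather than the exact checkpoint I used, and prefix caching matters because every request shares the same prompt preamble:

from vllm import LLM, SamplingParams

# Placeholder model id; substitute whichever small checkpoint you're running.
llm = LLM(model="Qwen/Qwen3-4B-Instruct-2507",
          max_model_len=8192,
          enable_prefix_caching=True)
params = SamplingParams(temperature=0.0, max_tokens=2)  # near-zero output budget

chunks = ["We're back! The largest squid can weigh in excess of 1000 pounds,"]
prompts = [
    "Reply AD if this podcast chunk is mostly advertisements, otherwise CONTENT.\n\n" + c
    for c in chunks
]
outputs = llm.generate(prompts, params)
labels = [o.outputs[0].text.strip().upper() for o in outputs]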
For the other models, I used DeepInfra. I’m overall decently happy with them, though I have been getting a lot of transient 500 codes from their API (for minutes at a time). On batch work like this, they’re still averaging ~15-20k TPS, so it’s fine I guess.
It was tempting to run this locally, but I’m already tying up my server’s GPU with transcriptions for a large part of the day, so I was glad to do something that can run concurrently offsite. DeepInfra doesn’t give me preferential pricing on Qwen3.5-4b for cacheable prefixes, however, so I might switch to using ephemeral vLLM instances on RunPod GPUs, where I would be able to take advantage of the caching.
Comparing Models
================================================================================
PER-MODEL METRICS (positive class = advertisement)
================================================================================
Total items: 189 | Human labels: 147 content, 42 advertisement
Model            Acc      Prec     Rec      F1       TP    FP    TN    FN    Err
---------------------------------------------------------------------------------
qwen3.5-2b       78.8%    51.7%    71.4%    60.0%    30    28    119   12    0
qwen3.5-4b       94.2%    94.3%    78.6%    85.7%    33    2     145   9     0
qwen3.5-9b       88.4%    67.9%    90.5%    77.6%    38    18    129   4     0
glm-4.7-flash    81.0%    100.0%   14.3%    25.0%    6     0     147   36    0
sonnet-4.6       87.3%    67.3%    83.3%    74.5%    35    17    130   7     0
================================================================================
PAIRWISE AGREEMENT (%)
================================================================================
                 qwen3.5-2b   qwen3.5-4b   qwen3.5-9b   glm-4.7-flash   sonnet-4.6   human
qwen3.5-2b         100.0%       81.5%        79.9%         72.5%          75.7%      78.8%
qwen3.5-4b          81.5%      100.0%        88.9%         84.7%          85.7%      94.2%
qwen3.5-9b          79.9%       88.9%       100.0%         73.5%          76.7%      88.4%
glm-4.7-flash       72.5%       84.7%        73.5%        100.0%          75.7%      81.0%
sonnet-4.6          75.7%       85.7%        76.7%         75.7%         100.0%      87.3%
human               78.8%       94.2%        88.4%         81.0%          87.3%     100.0%

1. Using this component now will allow my ad-filtering process to benefit later, if/when I extend my llm-map to support the use of GPU spot instances.
2. This was another good task for claude-code subagents. I don't expect the output to be correct; I merely expect that, on average, it will be more effective than having done nothing. In practice, that assumption seems to have held: after labeling, my sample was ~20% ads instead of only ~10%. I could have gone higher, but I still wanted to label plenty of "content" chunks to stay sensitive to false positives.