Podcasts - Filtering Advertisements from Transcripts

tl;dr:
The podcast content I’ve been indexing was diluted with advertisements. I’ve eliminated the majority of them fairly quickly with a basic prompt-based classifier.

Design Considerations

The main factors in play here are my time, inference costs, and the filter’s performance in terms of precision/recall.

If I were aiming for the absolute best performance possible from an LLM-based classifier, without regard for my time or the inference cost, I would probably intervene before my chunking process. At that point, any given podcast is a list of timecoded speech segments, like this:

{
  "timestamp": [4324.08, 4326.5],
  "text": "Buy a product from one of our wonderful sponsors today."
}
{
  "timestamp": [4327.08, 4330.5],
  "text": "We're back! The largest squid can weigh in excess of 1000 pounds,"
}

These segment boundaries impart non-negligible information. They correlate strongly with pauses and transitions between speakers, and they’re more finely grained than my (~512 token) chunks. I am fairly certain that I could get an LLM to process sequences of segments and, from them, extract the ranges that are advertisements. I am uncertain how much this would cost, though. I suspect it would require a larger model, more thinking/reasoning tokens, and more time spent on human labeling than the path I actually took.
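If I ever revisit that approach, a minimal sketch of what it might look like (the prompt wording, model name, and JSON output contract are all assumptions, not something I built):

import json
from openai import OpenAI

client = OpenAI()  # any OpenAI-compatible endpoint

def find_ad_ranges(segments: list[dict]) -> list[tuple[int, int]]:
    """Ask a model for [start, end] segment-index ranges that are ads."""
    numbered = "\n".join(f"{i}: {s['text']}" for i, s in enumerate(segments))
    response = client.chat.completions.create(
        model="some-larger-model",  # placeholder; this path likely needs a bigger model
        messages=[
            {"role": "system", "content": (
                "You are given numbered, consecutive podcast transcript "
                "segments. Reply with only a JSON list of [start, end] "
                "index ranges (inclusive) that are advertisements, e.g. "
                "[[12, 19]]. Reply [] if there are none."
            )},
            {"role": "user", "content": numbered},
        ],
    )
    return [tuple(r) for r in json.loads(response.choices[0].message.content)]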

My actual solution involves no thinking/reasoning tokens and works well with very small models (2-4 billion parameters). Small is economical to begin with, and operating without thinking/reasoning means I can set the maximum output tokens to near zero. That’s neat, because the total cost is then essentially just input tokens and can be trivially estimated.
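For instance, a back-of-the-envelope estimate (every number here is a placeholder, not my actual volume or pricing):

# cost ~= num_chunks * (instruction tokens + chunk tokens) * input price
num_chunks = 100_000              # hypothetical corpus size
tokens_per_request = 150 + 512    # instruction prefix + one ~512-token chunk
usd_per_million_input = 0.05      # placeholder price for a small model
cost = num_chunks * tokens_per_request * usd_per_million_input / 1_000_000
print(f"~${cost:.2f}")            # ~$3.31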

To make the most efficient use of my time, I also wanted to use my existing tools wherever possible, and create a labeling process that I could move through quickly. I decided to operate on a per-chunk basis, so every chunk is treated as either “mostly advertisements, nothing important” or “important content”. This made labeling a breeze, and saved development time by letting me use my existing chunk-based filtering and “llm-map” tool, a straightforward batch prompt runner that just applies a prompt to a list of input texts and returns a result for each input.1 
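The per-chunk call itself is about as simple as LLM usage gets. A sketch of the kind of request llm-map issues for each chunk (the prompt wording, model id, and token cap are illustrative, not my exact setup):

from openai import OpenAI

client = OpenAI()  # any OpenAI-compatible endpoint

# Static instructions come first so every request shares a cacheable prefix;
# only the chunk text varies.
INSTRUCTIONS = (
    "Classify the following podcast transcript chunk. Reply with exactly one "
    "word: AD if it is mostly advertisements and contains nothing important, "
    "or CONTENT otherwise.\n\n"
)

def classify_chunk(chunk_text: str) -> bool:
    """Return True if the chunk is mostly advertisement."""
    response = client.chat.completions.create(
        model="qwen3.5-4b",  # placeholder model id
        max_tokens=4,        # no reasoning tokens; a one-word answer suffices
        messages=[{"role": "user", "content": INSTRUCTIONS + chunk_text}],
    )
    return response.choices[0].message.content.strip().upper().startswith("AD")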

How’d it go?

This strategy worked even better than expected. With a considered (but not optimized) prompt, I evaluated performance across a few small models. I threw Claude Sonnet into the ring as well, just out of curiosity. A summary of their performance appears under Comparing Models in the build log below.

The summary was produced by running my chunk-based filtering on a random sample of chunks, exporting those chunks into label-studio, annotating all of them, and then exporting the full dataset with human and machine labels for analysis.
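For what it’s worth, once the export is flattened to one human label and one model label per chunk, the whole summary reduces to a few lines of counting (the record shape below is my assumption, not label-studio’s native export format):

def metrics(records: list[dict]) -> dict:
    """records look like {"human": "ad", "model": "content"}."""
    tp = sum(r["human"] == "ad" and r["model"] == "ad" for r in records)
    fp = sum(r["human"] == "content" and r["model"] == "ad" for r in records)
    tn = sum(r["human"] == "content" and r["model"] == "content" for r in records)
    fn = sum(r["human"] == "ad" and r["model"] == "content" for r in records)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return {"acc": (tp + tn) / len(records), "prec": prec, "rec": rec, "f1": f1}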

I chose Qwen3.5-4b, which arguably outperformed Claude Sonnet 4.6 on this task at roughly 1/100th the price.

Any other cool ideas?

Some podcasts include transcripts on the episodes’ landing pages. It has occurred to me that these transcripts will probably be free of dynamically-injected advertisements. Possibly these ad-free transcripts could be used in conjunction with my ad-containing transcripts to create labeled data automatically, with no human input.
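A rough sketch of how that could work, assuming the publisher transcript is reasonably close to my ASR text (difflib is just the stdlib option; the coverage threshold is a guess and would need tuning):

import difflib

def auto_label(chunk_text: str, clean_transcript: str) -> str:
    """Label a chunk 'ad' if its text is mostly absent from the
    publisher's ad-free transcript, else 'content'."""
    matcher = difflib.SequenceMatcher(
        a=chunk_text.lower(), b=clean_transcript.lower(), autojunk=False)
    # Fraction of the chunk that aligns to some block of the clean transcript.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    coverage = matched / max(len(chunk_text), 1)
    return "content" if coverage > 0.6 else "ad"  # threshold is a guess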

I’d also like to try prompt optimization techniques (such as GEPA), which would let me experiment with guiding feedback.

What’s next?

While I was working on this, the system continued to ingest ever more podcasts. It should be possible to visualize some trends in the data very soon!

Build Log

Labeling Setup

I enriched2 my sample with ads to make labeling more efficient. In my feeds, ads seem to be roughly 10% of the total chunks, and I wanted to label at least a few dozen ‘ad’ chunks without having to wade through several hundred ‘content’ chunks to find them.

Labeling on a chunk basis was quick. I have listened to a lot of podcasts, so I know how the ads typically read, and I know that they are usually above a certain length. Thus, biasing my attention heavily towards the first, last, and middle sentences of a chunk gave me good signal very quickly, and made it possible to finish labeling in just a couple hours. As a post-hoc validation, I gave special attention to cases where the models disagreed with my label and double-checked that I had not made any mistakes.

Inference

I used Qwen3.5-2b locally via vLLM and was very pleased with its performance considering its size. If I were pursuing maximum cost savings at scale, I’d dig a little deeper into optimizing it; it could be an excellent case for trying out DSPy’s prompt optimization.
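For reference, a minimal offline-batch version of the same classification (the model path is a placeholder; vLLM’s prefix caching means the shared instruction prefix is only computed once):

from vllm import LLM, SamplingParams

llm = LLM(model="models/qwen3.5-2b",   # placeholder local model path
          enable_prefix_caching=True)  # reuse KV cache for the shared prefix
params = SamplingParams(temperature=0.0, max_tokens=4)

INSTRUCTIONS = "Classify the following podcast transcript chunk. ..."  # as in the earlier sketch
chunks = ["..."]                       # chunk texts to classify

outputs = llm.generate([INSTRUCTIONS + c for c in chunks], params)
labels = ["ad" if o.outputs[0].text.strip().upper().startswith("AD")
          else "content" for o in outputs]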

For the other models, I used DeepInfra. I’m decently happy with them overall, though I have been getting a lot of transient HTTP 500s from their API (for minutes at a time). On batch work like this they still average ~15-20k tokens per second, so it’s fine, I guess.

It was tempting to run this locally too, but I’m already tying up my server’s GPU with transcriptions for a large part of the day, so I was glad to do something that could run concurrently offsite. However, DeepInfra doesn’t give me preferential pricing on Qwen3.5-4b for cacheable prefixes, so I might switch to ephemeral vLLM instances on RunPod GPUs, where I could take advantage of the caching.

Comparing Models

================================================================================
PER-MODEL METRICS (positive class = advertisement)
================================================================================
Total items: 189 | Human labels: 147 content, 42 advertisement

Model               Acc   Prec    Rec     F1   TP   FP   TN   FN  Err
---------------------------------------------------------------------
qwen3.5-2b        78.8%  51.7%  71.4%  60.0%   30   28  119   12    0
qwen3.5-4b        94.2%  94.3%  78.6%  85.7%   33    2  145    9    0
qwen3.5-9b        88.4%  67.9%  90.5%  77.6%   38   18  129    4    0
glm-4.7-flash     81.0% 100.0%  14.3%  25.0%    6    0  147   36    0
sonnet-4.6        87.3%  67.3%  83.3%  74.5%   35   17  130    7    0

================================================================================
PAIRWISE AGREEMENT (%)
================================================================================
                  qwen3.5-2b   qwen3.5-4b   qwen3.5-9b  glm-4.7-flash   sonnet-4.6        human
qwen3.5-2b            100.0%        81.5%        79.9%          72.5%        75.7%        78.8%
qwen3.5-4b             81.5%       100.0%        88.9%          84.7%        85.7%        94.2%
qwen3.5-9b             79.9%        88.9%       100.0%          73.5%        76.7%        88.4%
glm-4.7-flash          72.5%        84.7%        73.5%         100.0%        75.7%        81.0%
sonnet-4.6             75.7%        85.7%        76.7%          75.7%       100.0%        87.3%
human                  78.8%        94.2%        88.4%          81.0%        87.3%       100.0%

  1. Using this component now will allow my ad-filtering process to benefit later, if/when I extend my llm-map to support the use of GPU spot instances.

  2. This was another good task for claude-code subagents. I don't expect the output to be correct. I merely expect, on average, that it will be more effective than having done nothing.

    In practice, this assumption seems to have held. After labeling, my sample was ~20% ads instead of only ~10%. I could have gone higher, but I still wanted to label plenty of "content" chunks to stay sensitive to false positives.

· project, podcasts, llm-classifier, classification, data-labeling, remote-inference, label-studio, prototyping, AI, local-inference, audio, transcription