Building a Podcast Automation Platform with Ittybit
Part 4 - Chapters, Thumbnails, and Declarative Workflows
In the last article, we conquered video transcoding and AI transcription. Our pipeline can take a raw podcast recording and spit out pristine audio, optimized video, and accurate transcripts. That's powerful. But here's something I've learned from running my own podcast: having great content isn't enough. You need to make it discoverable.
Think about it. You've got a 90-minute conversation with incredible insights scattered throughout. Your viewer lands on YouTube, sees the runtime, and thinks "90 minutes? I don't have time for this." They close the tab. Meanwhile, buried at minute 47 is the exact answer to their question - but they'll never find it.
This is the discoverability problem, and it's killing long-form content. The solution? Make your content navigable. Chapters. Thumbnails. Key moments. Visual cues that say "here's where we talk about X, here's where Y happens."
And here's where things get really interesting: we've been building our processing pipeline imperatively - create task A, wait for result, create task B. But what if I told you there's a better way? A declarative approach where you describe what you want, and the system figures out how to do it?
Let me show you what I mean.
The Discoverability Problem (And Why It's Not Your Fault)
I used to publish 60-90 minute podcast episodes with zero chapters. Just a title, a description, and a play button. My analytics were depressing. Average view duration: 8 minutes. Completion rate: 4%. People would start watching, realize they couldn't scan the content, and bail.
Then I started adding chapters manually. I'd watch the whole episode, note timestamps where topics changed, and add them to YouTube. My stats immediately improved. Average view duration jumped to 18 minutes. Completion rate hit 12%. But the process was brutal - 45 minutes of work per episode, just to add timestamps.
Here's what manual chapter creation looks like:
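Something like this, typed by hand into the YouTube description (timestamps invented for illustration):

```
00:00 Intro and housekeeping
04:32 How our guest got into databases
12:15 The scaling problem nobody talks about
27:40 Sharding war stories
46:55 Practical advice for small teams
1:08:20 Rapid-fire questions and wrap-up
```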
It's mind-numbing. And if you forget to add a chapter? Too bad. You'd have to re-upload or edit the video description.
We can do better.
Making Videos Navigable: The Three Pillars
There are three things that make long-form video content navigable:
1. Chapters - Timestamped sections with descriptive titles
2. Thumbnails - Visual keyframes for each chapter
3. Subtitles - Searchable, accessible text overlays
We've already built subtitle generation (transcripts → VTT files). Now let's tackle chapters and thumbnails.
Intelligent Chapter Detection
In the last article, I showed you Ittybit's chapter detection task. But I didn't really explain how it works or why it's better than doing it yourself.
Here's what Ittybit does under the hood:
- Scene detection - Analyzes visual changes (new speaker, screen share starts, etc.)
- Audio analysis - Detects topic boundaries from speech patterns and pauses
- Transcript analysis - Uses NLP to identify topic transitions
- Heuristics - Applies rules about minimum chapter length, natural breaks
The result? Chapters that actually make sense. Not just "every 5 minutes," but "when the topic genuinely changes."
Let me show you how to configure this properly:
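Here's a sketch against the REST API. The option names (min_duration, max_duration, scene_threshold) are the settings discussed just below; the endpoint shape follows the task calls from Part 3, and mediaUrl is a placeholder for your uploaded source video. Verify the exact schema against the Ittybit docs.

```typescript
// Sketch: create a chapter-detection task for an uploaded episode.
const mediaUrl = "https://example.com/episodes/ep42-source.mp4"; // placeholder

const response = await fetch("https://api.ittybit.com/tasks", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.ITTYBIT_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    kind: "chapters",     // task kind, as in Part 3
    url: mediaUrl,        // source video
    min_duration: 120,    // no chapter shorter than 2 minutes
    max_duration: 600,    // force a break by 10 minutes
    scene_threshold: 0.4, // sensitivity to visual transitions
  }),
});
const task = await response.json();
console.log(`Chapter task queued: ${task.id}`);
```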
The min_duration and max_duration settings are crucial. I learned this the hard way. My first attempt at automatic chapters gave me 47 chapters in a 60-minute video. That's not helpful—that's just a transcript with timestamps. After experimentation, I settled on:
- Min duration: 2 minutes (120 seconds) - Short enough to be granular, long enough to be meaningful
- Max duration: 10 minutes (600 seconds) - Forces breaks even in marathon discussions
The scene_threshold controls sensitivity. Lower values mean "only create chapters at obvious transitions." Higher values mean "be liberal about what counts as a topic change." I keep mine at 0.4, which seems to hit the sweet spot.
Generating Chapter Thumbnails
Chapters are great, but you know what's even better? Visual previews. YouTube lets you add custom thumbnails for each chapter, and viewers love it. They can scan your video visually and jump to the part that interests them.
Manually creating chapter thumbnails means:
- Scrubbing through the video
- Finding a representative frame for each chapter
- Taking screenshots
- Cropping and formatting
- Uploading to YouTube
For a 10-chapter video, that's 30-45 minutes of work. Let's automate it:
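A sketch of the thumbnail task. from_chapters is the flag described below; the task kind, width, and format are my assumptions, so check them against the docs.

```typescript
// Sketch: extract one thumbnail per detected chapter boundary.
// Reuses mediaUrl from the previous sketch.
const response = await fetch("https://api.ittybit.com/tasks", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.ITTYBIT_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    kind: "image",       // assumed task kind for frame extraction
    url: mediaUrl,       // same source video as before
    from_chapters: true, // one frame per chapter, no timestamps needed
    width: 1280,         // my choice: a YouTube-friendly size
    format: "webp",      // my choice
  }),
});
const thumbnailTask = await response.json();
```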
Here's the beautiful part: set from_chapters: true and Ittybit automatically extracts thumbnails at each chapter boundary. You get a thumbnail for every chapter without specifying timestamps. It just works.
The Problem with Imperative Workflows
Alright, we've been building this system imperatively. Here's what our current ProcessEpisodeJob does:
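Condensed to a sketch; the helper names are hypothetical stand-ins for the code we built in Parts 2 and 3:

```typescript
// Stand-ins for helpers from earlier parts (names hypothetical).
declare function uploadSource(episodeId: string): Promise<{ url: string }>;
declare function createTask(body: object): Promise<{ id: string }>;
declare function waitForWebhook(taskId: string): Promise<void>;
declare function setStatus(episodeId: string, status: string): Promise<void>;

// The waterfall: every step blocks before the next can even start.
async function processEpisodeJob(episodeId: string) {
  const media = await uploadSource(episodeId);
  const audio = await createTask({ kind: "audio", url: media.url });
  await waitForWebhook(audio.id); // wait...
  const video = await createTask({ kind: "video", url: media.url });
  await waitForWebhook(video.id); // wait...
  const speech = await createTask({ kind: "speech", url: media.url }); // transcription kind assumed
  await waitForWebhook(speech.id); // wait...
  const chapters = await createTask({ kind: "chapters", url: media.url });
  await waitForWebhook(chapters.id); // and wait again
  await setStatus(episodeId, "complete");
}
```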
See the problem? It's a waterfall. Each step waits for the previous. If we want to add a new task - say, generating social media clips - we have to modify the job, add more webhook handling, update the status tracking. It's fragile. It's hard to test. And it doesn't scale.
This is where I discovered Ittybit's Automations feature, and it changed everything.
Declarative Workflows: Automations
Here's the big idea: instead of writing code that says "do A, then B, then C," you write a configuration that says "when media is created, here's everything I want done." The system figures out dependencies, parallelization, and ordering.
It's like the difference between:
Imperative (what we've been doing):
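(Function names here are placeholders for the idea, not a real API.)

```typescript
// Imperative: our code owns the ordering, step by step.
const transcript = await transcribe(media);           // step 1
const chapters   = await detectChapters(transcript);  // step 2, blocked on 1
const thumbnails = await extractThumbnails(chapters); // step 3, blocked on 2
```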
Declarative (automations):
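(Schematically; the real config comes next.)

```json
{
  "trigger": "media.created",
  "do": ["audio", "video", "transcript", "chapters", "thumbnails"]
}
```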
The system sees: "Audio, video, and transcript can run in parallel. Chapters needs the transcript. Thumbnails need chapters." It orchestrates everything automatically.
Let me show you how to build this.
Creating Your First Automation
Automations are created once and run automatically every time you upload new media. Let's build a complete podcast processing automation:
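Here's a sketch of the whole thing. The overall shape (a trigger plus a workflow of tasks, with dependencies expressed by nesting under next and outputs labeled with ref) is how I'll describe it from here on; treat the exact field names as assumptions and check them against the Automations docs.

```typescript
// Sketch: one automation describing the entire episode pipeline.
const response = await fetch("https://api.ittybit.com/automations", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.ITTYBIT_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    name: "podcast-episode-pipeline",
    trigger: { event: "media.created" }, // fire on every new upload
    workflow: [
      // These three have no dependencies, so they can run in parallel.
      { kind: "audio", format: "mp3", ref: "podcast_audio" },
      { kind: "video", format: "mp4", width: 1920, ref: "podcast_video" },
      { kind: "image", width: 1920, ref: "hero_thumbnail" },
      {
        kind: "speech", // transcription (kind name assumed)
        ref: "transcript",
        next: [
          {
            // Chapters need the transcript, so they nest under it.
            kind: "chapters",
            ref: "chapters",
            min_duration: 120,
            max_duration: 600,
            scene_threshold: 0.4,
            next: [
              // Thumbnails need chapters, so they nest one level deeper.
              { kind: "image", from_chapters: true, width: 1280, ref: "chapter_thumbs" },
            ],
          },
        ],
      },
    ],
  }),
});
const automation = await response.json();
console.log(`Automation created: ${automation.id}`);
```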
Look at that workflow. Six tasks. Some parallel, some sequential. Zero imperative code. Just a description of what we want.
Here's how Ittybit executes it:
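```
media.created
  ├─ audio (mp3)        ┐
  ├─ video (mp4)        ├─ no dependencies: run in parallel
  ├─ hero thumbnail     ┘
  └─ transcript
       └─ chapters                 (waits for transcript)
            └─ chapter thumbnails  (waits for chapters)
```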
Ittybit handles all the orchestration. It knows chapters need the transcript, so it waits. It knows thumbnails need chapters, so it waits. But audio, video, and transcript can all run simultaneously, so they do.
Conditional Workflows: Smart Routing
Here's where automations get really powerful. You can add conditional logic:
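A sketch of a conditional branch. The fields you can test are listed just below; the conditions syntax itself (prop/value pairs gating a nested next) is my assumption, so verify it against the docs.

```typescript
// Sketch: only run the video branch when the upload is actually video.
const workflow = [
  // Audio output is always produced.
  { kind: "audio", format: "mp3", ref: "podcast_audio" },
  {
    // Gate: everything nested under `next` runs only for video uploads.
    conditions: [{ prop: "media.kind", value: "video" }],
    next: [
      { kind: "video", format: "mp4", ref: "podcast_video" },
      { kind: "image", from_chapters: true, ref: "chapter_thumbs" },
    ],
  },
];
```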
See the conditions block? It says "only do these tasks if the media is video." This is perfect for podcasters who sometimes publish audio-only episodes and sometimes publish video. The automation adapts.
You can check:
- media.kind - video, audio, or image
- media.duration - length in seconds
- media.width / media.height - dimensions
- file.filesize - file size in bytes
Want to skip transcoding for short clips? Add a condition:
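Something like this, with the same caveat that the operator syntax is an assumption:

```typescript
// Sketch: skip transcoding for anything under 60 seconds.
const shortClipGate = {
  conditions: [{ prop: "media.duration", operator: ">", value: 60 }],
  next: [{ kind: "video", format: "mp4", ref: "podcast_video" }],
};
```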
Simplifying Our Application Code
Now for the payoff: with automations set up, our application code becomes trivial. Watch this:
Before (imperative approach):
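Condensed to a sketch (helpers as declared in the waterfall sketch above); the real job also carried status tracking and error handling for every stage, which is where the 50+ lines went:

```typescript
// Before: the job creates the first task, then a webhook handler
// creates the next one, updates status, and so on down the chain.
async function processEpisodeJobBefore(episodeId: string) {
  const media = await uploadSource(episodeId);
  await createTask({ kind: "audio", url: media.url });
  await setStatus(episodeId, "processing_audio");
  // ...webhook fires, handler creates the video task...
  // ...webhook fires, handler creates the speech task...
  // ...webhook fires, handler creates the chapters task...
  // ...webhook fires, handler creates the thumbnails task...
  // Five stages, five statuses, five places to fail.
}
```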
After (with automations):
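And the new shape, assuming media objects can be created from a source URL (check the docs for the exact ingest options; getEpisode is a hypothetical helper):

```typescript
// After: create the media object; the automation handles the rest.
async function processEpisodeJobAfter(episodeId: string) {
  const episode = await getEpisode(episodeId); // hypothetical helper
  const res = await fetch("https://api.ittybit.com/media", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.ITTYBIT_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ url: episode.sourceUrl }), // ingest-from-URL shape assumed
  });
  const media = await res.json();
  await setStatus(episodeId, "processing"); // one status until the workflow webhook
}
```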
From 50+ lines to 15. And the automation runs for every media object you create, not just this one episode. Create a media object, and the entire pipeline executes. That's the power of declarative configuration.
Setting Up Automations: One-Time Setup
Automations are persistent. You create them once, and they run forever (or until you delete/disable them). Here's how I recommend setting them up:
1. Create a setup command:
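One possible shape, assuming the config lives in a JSON file (paths and filenames are mine):

```typescript
// scripts/create-automation.ts (hypothetical path)
// Reads the automation config from disk and registers it once.
import { readFileSync } from "node:fs";

const config = JSON.parse(
  readFileSync("config/podcast-automation.json", "utf8")
);

const res = await fetch("https://api.ittybit.com/automations", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.ITTYBIT_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify(config),
});
const automation = await res.json();
console.log(`Created automation: ${automation.id}`);
```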
2. Run it once during setup:
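(Assuming tsx or a similar runner; the ID in the output is an example.)

```bash
npx tsx scripts/create-automation.ts
# Created automation: auto_abc123
```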
3. Store the automation ID:
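Keep the returned ID with the rest of your configuration so you can disable or update the automation later:

```bash
# .env (example value)
ITTYBIT_AUTOMATION_ID=auto_abc123
```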
Now your automation runs for every media object you create. Forever. Until you explicitly disable or delete it.
Tracking Automation Results
When an automation runs, Ittybit creates a "workflow task" that tracks the entire execution. Each step in the workflow gets its own task ID, but they're all linked to the parent workflow.
Let's update our webhook handler to track automation workflows:
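A sketch using Express. The event name and payload shape (a workflow-completion event carrying the media ID, with outputs labeled by the ref values from our config) are assumptions extrapolated from the task webhooks in Part 3; saveEpisodeOutputs is a hypothetical helper.

```typescript
import express from "express";

declare function saveEpisodeOutputs(mediaId: string, outputs: object): Promise<void>;

const app = express();
app.use(express.json());

app.post("/webhooks/ittybit", async (req, res) => {
  const event = req.body;

  // Event name assumed; adjust to whatever your webhook actually sends.
  if (event.event === "workflow.completed") {
    // Fetch the media object, which now carries every derived file.
    const mediaRes = await fetch(
      `https://api.ittybit.com/media/${event.media_id}`,
      { headers: { Authorization: `Bearer ${process.env.ITTYBIT_API_KEY}` } }
    );
    const media = await mediaRes.json();

    // Pick outputs by the ref values we assigned in the automation.
    const byRef = (ref: string) =>
      media.files.filter((f: { ref?: string }) => f.ref === ref);

    await saveEpisodeOutputs(event.media_id, {
      audioUrl: byRef("podcast_audio")[0]?.url,
      videoUrl: byRef("podcast_video")[0]?.url,
      transcriptUrl: byRef("transcript")[0]?.url,
      heroThumbnailUrl: byRef("hero_thumbnail")[0]?.url,
      chapters: byRef("chapters")[0],      // chapter data output
      thumbnails: byRef("chapter_thumbs"), // one file per chapter
    });
  }

  res.sendStatus(200);
});
```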
Beautiful. One webhook tells us the entire workflow completed. We fetch the media object, grab all the processed files by their ref values, and we're done.
The Database Update: Storing Chapter Data
Let's add columns to store chapter and thumbnail data:
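A sketch assuming Postgres, with the structured data in JSONB:

```sql
-- Chapters and chapter thumbnails stored as JSON arrays;
-- the hero image is a single URL.
ALTER TABLE episodes
  ADD COLUMN chapters JSONB NOT NULL DEFAULT '[]',
  ADD COLUMN thumbnails JSONB NOT NULL DEFAULT '[]',
  ADD COLUMN hero_thumbnail_url TEXT;
```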
Update the Episode model:
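In TypeScript terms, the Episode type grows to match (the shape is mine; adapt it to your ORM):

```typescript
interface Chapter {
  title: string;
  start: number;         // seconds from the beginning
  end: number;           // seconds
  thumbnailUrl?: string; // filled in once chapter thumbs exist
}

interface Episode {
  id: string;
  title: string;
  status: string;
  sourceUrl: string;
  chapters: Chapter[];
  thumbnails: string[];      // chapter thumbnail URLs, in order
  heroThumbnailUrl?: string; // the main preview image
}
```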
Enhanced Episode API Response
Now our episode endpoint can return rich navigational data:
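A sketch of the endpoint (Express again; getEpisode is the same hypothetical helper as before):

```typescript
declare function getEpisode(id: string): Promise<Episode | null>;

app.get("/api/episodes/:id", async (req, res) => {
  const episode = await getEpisode(req.params.id);
  if (!episode) return res.sendStatus(404);

  res.json({
    id: episode.id,
    title: episode.title,
    hero_thumbnail: episode.heroThumbnailUrl,
    // Everything a player needs to render a chapter list.
    chapters: episode.chapters.map((c) => ({
      title: c.title,
      start: c.start,
      end: c.end,
      thumbnail: c.thumbnailUrl,
    })),
  });
});
```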
Example response:
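With illustrative values:

```json
{
  "id": "ep_42",
  "title": "Scaling Postgres with Jane Doe",
  "hero_thumbnail": "https://cdn.example.com/ep_42/hero.webp",
  "chapters": [
    {
      "title": "Intro and housekeeping",
      "start": 0,
      "end": 272,
      "thumbnail": "https://cdn.example.com/ep_42/chapter-1.webp"
    },
    {
      "title": "The scaling problem nobody talks about",
      "start": 272,
      "end": 735,
      "thumbnail": "https://cdn.example.com/ep_42/chapter-2.webp"
    }
  ]
}
```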
The Complete Picture
Let me show you what the entire flow looks like with automations:
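```
episode uploaded
   │
   ▼
POST /media ──► media.created event ──► automation fires
   │
   ├─ audio (mp3)      ┐
   ├─ video (mp4)      ├─ run in parallel
   ├─ hero thumbnail   ┘
   └─ transcript ──► chapters ──► chapter thumbnails
   │
   ▼
completion webhook ──► store file URLs + chapter data ──► episode ready
```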
The imperative complexity is gone. We describe what we want, and Ittybit handles the orchestration.
Why This Matters: The Developer Experience
Here's what I love about declarative workflows: they're testable, versionable, and portable.
Testable: You can test an automation config without running it:
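For example, a unit test over the config file itself, with no API calls. This checks our own invariants, not Ittybit's (their validation happens when you register the automation):

```typescript
import { test } from "node:test";
import assert from "node:assert";
import { readFileSync } from "node:fs";

// Walk the nested workflow and collect every ref label.
function collectRefs(task: { ref?: string; next?: any[] }): string[] {
  const nested = (task.next ?? []).flatMap(collectRefs);
  return task.ref ? [task.ref, ...nested] : nested;
}

test("automation config is well-formed", () => {
  const config = JSON.parse(
    readFileSync("config/podcast-automation.json", "utf8")
  );
  assert.equal(config.trigger.event, "media.created");

  const refs = config.workflow.flatMap(collectRefs);
  assert.equal(new Set(refs).size, refs.length, "refs must be unique");
});
```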
Versionable: Store your automation configs in Git:
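One possible layout:

```
config/
  podcast-automation.json          # the production pipeline
  podcast-automation.staging.json  # staging variant
scripts/
  create-automation.ts             # registers a config with your account
```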
Portable: Need a staging environment? Copy the automation config. Need to process old episodes differently? Create a second automation.
And if something breaks? Look at the automation config. The workflow is right there, not scattered across job classes and webhook handlers.
The DIY Comparison: Orchestration is Hard
Let me tell you what building this yourself looks like:
Task orchestration:
- Build a DAG (directed acyclic graph) system
- Implement dependency resolution
- Handle parallel execution
- Track task states across multiple workers
- Retry failed tasks without re-running successful ones
- Detect circular dependencies
Time investment: 3-4 weeks for basic orchestration. More for robustness.
Chapter detection:
- Scene detection algorithms
- Audio analysis
- NLP for topic segmentation
- Heuristics for good breaks
- Testing on diverse content
Time investment: 2-3 weeks.
Thumbnail extraction:
- Frame selection algorithms
- Quality scoring
- Batch processing
- Format conversion
Time investment: 1 week.
Total DIY investment: 6-8 weeks, plus ongoing maintenance.
With Ittybit automations: The code we've written today. Maybe 1-2 days, including testing. And it just works.
What We've Built
✅ Intelligent chapter detection - AI-powered topic segmentation
✅ Automatic thumbnails - Visual previews for every chapter
✅ Hero thumbnails - Engaging preview images
✅ Declarative workflows - Configuration over code
✅ Conditional logic - Smart routing based on content
✅ Parallel execution - Maximum speed
✅ Simple integration - One webhook, all results
Our videos are now navigable. Viewers can scan chapters, see thumbnails, jump to topics. And we did it without writing orchestration code.
What's Next
In Part 5, we'll tackle the final piece: automated distribution. We have all this processed media - audio, video, transcripts, chapters. Now we need to publish it:
- Uploading audio to Transistor (podcast RSS feed)
- Uploading video to YouTube with chapters and subtitles
- OAuth authentication flows
- Handling rate limits
- Retry strategies
- Success/failure notifications
We're in the home stretch. Our content is processed. Our videos are navigable. Now we just need to get them out into the world.
Until then, think about the workflows in your own applications. How many imperative job chains could be replaced with declarative configs? How much complexity could you eliminate?