Building a Podcast Automation Platform with Ittybit
Part 3 - Video Transcoding & AI Transcription
In the last article, we built the processing pipeline - creating tasks, handling webhooks, tracking progress. But I glossed over something important: we told Ittybit to "transcode this video to 1080p MP4," and I never really explained why I chose those settings, or what other options you might want.
See, here's the thing about video: it's complicated. I mean really complicated. There are codecs (H.264, H.265, VP9), containers (MP4, MOV, MKV), resolutions (720p, 1080p, 4K), frame rates (24fps, 30fps, 60fps), bitrates (constant, variable, adaptive), color spaces... it goes on forever. And every platform has different requirements. YouTube wants one thing. Twitter wants another. Your website wants something else entirely.
When I first started working with video, I spent weeks reading FFmpeg documentation, trying to figure out the right incantation of flags to get a video that looked good but didn't bloat to 10GB. I'd transcode something, upload it, and YouTube would just... re-encode it anyway because I got the profile wrong. Or the audio would be out of sync. Or it would look pixelated on mobile. Video encoding is one of those things where you think you understand it, and then you realize you don't, and then you spend three days debugging why 23.976fps is different from 24fps.
This is exactly the kind of complexity that Ittybit abstracts away. And in this article, we're going to dive deep into video transcoding options and AI-powered transcription. Let's get into it.
Understanding Video Transcoding: Why It Matters
Before we write any code, I want you to understand what transcoding actually does and why you need it.
The problem: Your user records a podcast video in StreamYard. StreamYard gives them a 2.5GB MP4 file at 1920x1080, H.264 codec, 8Mbps bitrate. Sounds great, right?
Wrong. That file is:
- Too big for web streaming - 2.5GB for 60 minutes means long load times
- Not the right bitrate everywhere - 8Mbps matches YouTube's 1080p recommendation, but web embeds want 3Mbps or less
- Possibly wrong container - Some players prefer fragmented MP4
- Unoptimized audio - Might have 256kbps AAC when 192kbps would sound identical
You need to transcode: take the input video and re-encode it with optimal settings for its destination.
The YouTube Optimization Strategy
YouTube is our primary video destination, so let's start there. YouTube has official encoding recommendations, but here's what actually matters in practice:
For 1080p uploads:
- Container: MP4
- Video codec: H.264 (not H.265, despite what you might think)
- Resolution: 1920x1080
- Frame rate: Match source (typically 30fps or 24fps)
- Bitrate: 8Mbps for 30fps, 12Mbps for 60fps
- Audio codec: AAC-LC
- Audio bitrate: 192kbps
- Color space: BT.709
Why H.264 and not the newer H.265 (HEVC)? Because H.265 requires more CPU to decode, and not all devices support it. H.264 is universally supported and YouTube will happily accept it.
Let's update our IttybitService with more sophisticated video transcoding:
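Here's a sketch of what those methods might look like. The payload keys (`video`, `codec`, `maxrate`, and friends) are my assumptions about how the task options are spelled, not the verbatim Ittybit schema, so treat this as a shape to adapt against the API docs:

```php
<?php

namespace App\Services;

use Illuminate\Support\Facades\Http;

class IttybitService
{
    // NOTE: the payload keys below are illustrative assumptions about
    // Ittybit's task schema - verify the exact field names against the
    // current API reference before shipping.
    public function createYouTubeVideoTask(string $fileUrl, array $source): array
    {
        // Match the source frame rate - never convert 24fps footage to 30fps.
        $fps = $source['fps'] ?? 30;

        // Never upscale: a 720p source stays 720p.
        $height = min($source['height'] ?? 1080, 1080);
        $width  = intdiv($height * 16, 9);

        // YouTube's 1080p guidance: ~8Mbps at 30fps, ~12Mbps at 60fps.
        $bitrate = $fps > 30 ? '12M' : '8M';

        return $this->createTask([
            'url'    => $fileUrl,
            'kind'   => 'video',
            'format' => 'mp4',
            'video'  => [
                'codec'   => 'h264',
                'profile' => 'high',   // maximum-quality H.264 profile
                'width'   => $width,
                'height'  => $height,
                'fps'     => $fps,
                'bitrate' => $bitrate,
                'maxrate' => $bitrate, // cap bitrate peaks...
                'bufsize' => '16M',    // ...for steadier quality control
            ],
            'audio'  => ['codec' => 'aac', 'bitrate' => '192k'],
        ]);
    }

    public function createWebVideoTask(string $fileUrl): array
    {
        return $this->createTask([
            'url'    => $fileUrl,
            'kind'   => 'video',
            'format' => 'mp4',
            'video'  => [
                'codec'   => 'h264',
                'width'   => 1280,
                'height'  => 720,          // 720p is plenty for embeds
                'bitrate' => '3M',         // web viewers feel every extra megabyte
                'preset'  => 'medium',     // faster encode than 'slow'
                'flags'   => '+faststart', // metadata first = progressive playback
            ],
            'audio'  => ['codec' => 'aac', 'bitrate' => '128k'],
        ]);
    }

    public function createMultiQualityVideoTasks(string $fileUrl): array
    {
        // One task per rendition, for adaptive-streaming setups.
        $renditions = [
            ['width' => 1920, 'height' => 1080, 'bitrate' => '5M'],
            ['width' => 1280, 'height' => 720,  'bitrate' => '3M'],
            ['width' => 640,  'height' => 360,  'bitrate' => '1M'],
        ];

        return array_map(fn (array $r) => $this->createTask([
            'url'    => $fileUrl,
            'kind'   => 'video',
            'format' => 'mp4',
            'video'  => ['codec' => 'h264'] + $r,
        ]), $renditions);
    }

    protected function createTask(array $payload): array
    {
        return Http::withToken(config('services.ittybit.key'))
            ->post('https://api.ittybit.com/tasks', $payload)
            ->throw()
            ->json();
    }
}
```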
Let me break down what's happening in these methods:
The createYouTubeVideoTask() method - This is YouTube-specific. Notice how we:
- Check the source frame rate and match it (you don't want to convert 24fps to 30fps)
- Adjust bitrate based on frame rate (60fps needs more bits)
- Use the 'high' H.264 profile for maximum quality
- Set maxrate and bufsize for better quality control
The createWebVideoTask() method - For website embedding, we optimize differently:
- Lower resolution (720p is fine for most websites)
- The `movflags: '+faststart'` is crucial - it moves metadata to the beginning of the file so videos can start playing before fully downloading
- Lower bitrate (web viewers are more sensitive to load times)
- Faster encoding preset (medium vs slow)
The createMultiQualityVideoTasks() method - This is for advanced use cases. If you're building a video platform with adaptive streaming, you'd generate multiple quality versions. A user on mobile gets 360p. A user on desktop WiFi gets 1080p. But for our podcast automation platform, we probably don't need this. I'm showing it to illustrate the flexibility.
Updating ProcessEpisodeJob: Smarter Video Handling
Now let's update our processing job to use these new methods and handle different video scenarios:
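A sketch of the updated job. `getFileInfo()` is a hypothetical helper wrapping whatever file-metadata endpoint you use, and the `Episode` fields and task-tracking relation follow the models assumed in the last article:

```php
<?php

namespace App\Jobs;

use App\Models\Episode;
use App\Services\IttybitService;
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;

class ProcessEpisodeJob implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    public function __construct(public Episode $episode)
    {
    }

    public function handle(IttybitService $ittybit): void
    {
        // Fetch source metadata first so we can make smarter decisions:
        // match frame rate, never upscale. getFileInfo() is a hypothetical
        // helper around Ittybit's file-metadata endpoint.
        $source = $ittybit->getFileInfo($this->episode->source_file_id);

        $youtubeTask = $ittybit->createYouTubeVideoTask($this->episode->source_url, $source);
        $webTask     = $ittybit->createWebVideoTask($this->episode->source_url);

        $this->episode->tasks()->createMany([
            ['ittybit_task_id' => $youtubeTask['id'], 'type' => 'youtube_video'],
            ['ittybit_task_id' => $webTask['id'],     'type' => 'web_video'],
        ]);
    }
}
```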
The key change here is that we're now fetching the source file information before creating tasks. This lets us make smarter decisions. For example, if the source is 720p, we don't upscale to 1080p. If it's 60fps, we match that frame rate.
AI-Powered Transcription: Speech to Text
Now let's talk about transcription. This is where things get really cool.
Ittybit uses advanced speech recognition models (think Whisper-level quality) to generate transcripts. But transcripts aren't just a wall of text - they come with timestamps, speaker detection, and word-level confidence scores.
Here's what a transcript task can give you:
WebVTT format (subtitles):
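For example, a couple of cues in standard WebVTT (the cue text here is invented):

```
WEBVTT

00:00:01.000 --> 00:00:04.500
Welcome back to the show. Today we're
talking about video encoding.

00:00:04.500 --> 00:00:08.200
And why it's so much harder than it looks.
```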
JSON format (structured data):
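The exact schema is Ittybit's to define - treat the field names below as an educated guess at the shape (segments with timestamps, speakers, and per-word confidence):

```json
{
  "language": "en",
  "segments": [
    {
      "start": 1.0,
      "end": 4.5,
      "speaker": "speaker_1",
      "text": "Welcome back to the show. Today we're talking about video encoding.",
      "words": [
        { "word": "Welcome", "start": 1.0, "end": 1.3, "confidence": 0.99 }
      ]
    }
  ]
}
```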
Let's enhance our transcript task creation:
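Something along these lines - again, the `kind` and option names are assumptions about the task schema, not verbatim API fields:

```php
// In App\Services\IttybitService. 'speech', 'speakers', and the format
// values are assumptions about the task schema - check the API docs.
public function createTranscriptTasks(string $fileUrl, array $formats = ['vtt', 'json']): array
{
    // One task per requested output format.
    return array_map(fn (string $format) => $this->createTask([
        'url'      => $fileUrl,
        'kind'     => 'speech',  // assumed name for the transcription task
        'format'   => $format,   // 'vtt', 'srt', 'json', or 'txt'
        'speakers' => true,      // turn on speaker detection
    ]), $formats);
}
```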
Why multiple formats?
- VTT/SRT: For video subtitles (YouTube, your website)
- JSON: For your application (searchable, parseable)
- TXT: For blog posts, show notes, AI processing
In practice, I usually request VTT and JSON. VTT for YouTube (which accepts WebVTT subtitles), JSON for everything else.
Parsing and Using Transcripts
When the transcript task completes, we receive a file. Let's build a service to parse and use it:
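Here's a minimal sketch. It only handles plain WebVTT with `HH:MM:SS.mmm` cue timings (no styling or voice tags), which is enough for our pipeline:

```php
<?php

namespace App\Services;

class TranscriptService
{
    /**
     * Parse a WebVTT string into [['start' => float, 'end' => float, 'text' => string], ...].
     */
    public function parseVtt(string $vtt): array
    {
        $cues = [];

        // Cue blocks are separated by blank lines; the first block
        // is the WEBVTT header and won't match the timing regex.
        foreach (preg_split('/\R\R+/', trim($vtt)) as $block) {
            if (preg_match(
                '/(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})\R(.+)/s',
                $block,
                $m
            )) {
                $cues[] = [
                    'start' => $this->toSeconds($m[1]),
                    'end'   => $this->toSeconds($m[2]),
                    'text'  => trim(preg_replace('/\R/', ' ', $m[3])),
                ];
            }
        }

        return $cues;
    }

    /** Build simple show notes: one timestamped line per cue. */
    public function toShowNotes(array $cues): string
    {
        $lines = array_map(
            fn (array $cue) => sprintf('[%s] %s', gmdate('H:i:s', (int) $cue['start']), $cue['text']),
            $cues
        );

        return implode("\n", $lines);
    }

    protected function toSeconds(string $timestamp): float
    {
        [$h, $m, $s] = explode(':', $timestamp);

        return ($h * 3600) + ($m * 60) + (float) $s;
    }
}
```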
Now let's update our CompleteEpisodeProcessingJob to use this service:
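A sketch of the relevant part of the job. `transcriptFileUrl()` is a hypothetical accessor for the output URL our webhook handler stored on the task record:

```php
// Inside CompleteEpisodeProcessingJob. transcriptFileUrl() is a hypothetical
// accessor for the file URL the webhook handler stored on the task record.
public function handle(TranscriptService $transcripts): void
{
    $vtt = \Illuminate\Support\Facades\Http::get(
        $this->episode->transcriptFileUrl('vtt')
    )->body();

    $cues = $transcripts->parseVtt($vtt);

    $this->episode->update([
        'transcript' => $cues,                            // JSON column
        'show_notes' => $transcripts->toShowNotes($cues),
        'status'     => 'processed',
    ]);
}
```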
Advanced Feature: Chapter Detection
Here's something really cool: Ittybit can automatically detect chapters in your video using scene detection and content analysis. This is perfect for podcasts where topics change throughout the episode.
Let's add chapter detection to our pipeline:
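A sketch of the task creation. Whether chapter detection is its own task kind or an option on another task is an assumption on my part - check the docs for how it's actually expressed:

```php
// In App\Services\IttybitService. The 'chapters' kind is an assumption
// about how the chapter-detection task is expressed in the API.
public function createChapterDetectionTask(string $fileUrl): array
{
    return $this->createTask([
        'url'    => $fileUrl,
        'kind'   => 'chapters',
        'format' => 'json',
    ]);
}
```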
Update the ProcessEpisodeJob to include chapter detection:
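Just a few more lines alongside the video and transcript tasks:

```php
// Added inside ProcessEpisodeJob::handle(), next to the other tasks.
$chapterTask = $ittybit->createChapterDetectionTask($this->episode->source_url);

$this->episode->tasks()->create([
    'ittybit_task_id' => $chapterTask['id'],
    'type'            => 'chapters',
]);
```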
Don't forget to update your migration to include the new task type:
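If your tasks table stores `type` as a plain string, no change is needed. If you used an enum (as I'm assuming here), something like this widens it - note the raw statement is MySQL-specific:

```php
<?php

use Illuminate\Database\Migrations\Migration;
use Illuminate\Support\Facades\DB;

return new class extends Migration
{
    public function up(): void
    {
        // MySQL-specific: widen the assumed enum to accept the new task type.
        DB::statement("ALTER TABLE episode_tasks MODIFY type ENUM('youtube_video', 'web_video', 'transcript', 'chapters') NOT NULL");
    }

    public function down(): void
    {
        DB::statement("ALTER TABLE episode_tasks MODIFY type ENUM('youtube_video', 'web_video', 'transcript') NOT NULL");
    }
};
```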
When the chapter detection task completes, you'll get a JSON file with chapters:
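Shaped something like this - timestamps in seconds, titles generated from the content (all values here invented for illustration):

```json
{
  "chapters": [
    { "start": 0,    "end": 312,  "title": "Intro and catching up" },
    { "start": 312,  "end": 1480, "title": "Why video encoding is hard" },
    { "start": 1480, "end": 2955, "title": "Transcription workflows" },
    { "start": 2955, "end": 3600, "title": "Wrap-up and next steps" }
  ]
}
```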
These chapters are perfect for:
- YouTube chapter markers
- Podcast apps that support chapters (Overcast, Pocket Casts)
- Your website's episode page
- Allowing users to jump to specific topics
Enhanced Episode Status Endpoint
Let's update our episode show endpoint to include all this rich data:
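A sketch of the controller. `outputUrl()` is a hypothetical model helper that looks up the finished file URL for a given task type; the other columns follow the models we've been assuming:

```php
<?php

namespace App\Http\Controllers;

use App\Models\Episode;
use Illuminate\Http\JsonResponse;

class EpisodeController extends Controller
{
    public function show(Episode $episode): JsonResponse
    {
        // outputUrl() is a hypothetical helper that resolves the
        // finished file URL for a given task type.
        return response()->json([
            'id'         => $episode->id,
            'title'      => $episode->title,
            'status'     => $episode->status,
            'videos'     => [
                'youtube' => $episode->outputUrl('youtube_video'),
                'web'     => $episode->outputUrl('web_video'),
            ],
            'transcript' => $episode->transcript,
            'show_notes' => $episode->show_notes,
            'chapters'   => $episode->chapters,
        ]);
    }
}
```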
Now when you hit GET /episodes/{id}, you get everything:
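With the sketch above, the response would look something like this (all values invented for illustration):

```json
{
  "id": 42,
  "title": "Episode 17: Video Encoding Deep Dive",
  "status": "processed",
  "videos": {
    "youtube": "https://cdn.example.com/episodes/17-youtube.mp4",
    "web": "https://cdn.example.com/episodes/17-web.mp4"
  },
  "transcript": [
    { "start": 1.0, "end": 4.5, "text": "Welcome back to the show..." }
  ],
  "show_notes": "[00:00:01] Welcome back to the show...",
  "chapters": [
    { "start": 0, "end": 312, "title": "Intro and catching up" }
  ]
}
```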
Beautiful.
The DIY Reality Check
Let me tell you what this would look like if you were doing it yourself:
Video transcoding:
- Set up FFmpeg servers (install, configure, secure)
- Write wrapper scripts for different use cases
- Handle edge cases (corrupt files, unsupported codecs, etc.)
- Implement progress tracking
- Manage compute resources (don't let one job starve others)
- Test on dozens of input formats
- Update when YouTube changes recommendations
Time investment: 2-3 weeks, plus ongoing maintenance.
Speech-to-text:
- Choose a provider (AWS Transcribe, Google Speech-to-Text, Azure, OpenAI Whisper)
- Integrate their API
- Handle audio pre-processing (some services require specific formats)
- Parse their output formats (they're all different)
- Implement retry logic for failures
- Manage costs (some charge per minute)
- Handle language detection
Time investment: 1-2 weeks, plus API cost management.
Chapter detection:
- Scene detection algorithms (comparing frames)
- Content analysis (what's changing?)
- NLP on transcript (topic modeling)
- Heuristics for good chapter breaks
- Testing and tuning
Time investment: 2-4 weeks if you're good. More if you're not.
Total DIY investment: 5-9 weeks of development, ongoing maintenance, infrastructure costs.
With Ittybit: The code we've written today. Maybe 2-3 days of work. And it's production-ready.
The time savings are obvious. But here's the part people often miss: the cognitive load savings. I don't have to think about FFmpeg flags. I don't have to debug why scene detection is failing on certain videos. I don't have to worry about keeping up with YouTube's evolving recommendations. I just tell Ittybit what I want, and it handles the complexity.
What We've Built
Let's take stock:
✅ YouTube-optimized video transcoding - Proper bitrates, codecs, frame rates
✅ Web-optimized video - Smaller, faster, with progressive download
✅ Multi-quality support - Ready for adaptive streaming if needed
✅ AI-powered transcription - Speech-to-text with timestamps
✅ Transcript parsing - VTT to structured data
✅ Show notes generation - Automatic from transcript
✅ Chapter detection - AI-powered topic segmentation
✅ Rich API responses - Everything a client needs
We now have processed videos, pristine audio, accurate transcripts, and even chapters. In the next article, we'll take all this goodness and distribute it to the world: uploading to Transistor and YouTube, handling OAuth, managing metadata, and making sure everything publishes correctly.
What's Next
In Part 4, we'll tackle automated distribution:
- Uploading audio to Transistor (podcast hosting)
- Uploading video to YouTube with chapters and subtitles
- OAuth flows for user authentication
- Handling rate limits and quotas
- Retry strategies for failed uploads
- Notification systems
We're almost there. Our episodes are processed. Now we just need to publish them. See you in the next one.