# Speech

## Overview
The Speech Task transcribes spoken audio from a video or audio file into structured text.
Speech files are part of Intelligence and provide detailed transcripts, speaker information, confidence levels, and time-aligned segments.
When a Speech task runs, it creates an Intelligence file with `kind: "speech"` and a `.json` output containing the transcription data.
### Example Output
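The exact payload depends on the source media, but an Intelligence file produced by a Speech task looks broadly like the following illustrative example. The field names come from the File Structure table below; all values (and the per-segment `text` field) are invented for demonstration.

```json
{
  "id": "file_abc123",
  "object": "intelligence",
  "kind": "speech",
  "detected": true,
  "speakers": 2,
  "language": "en",
  "confidence": 0.94,
  "text": [
    "Welcome to the show.",
    "Thanks for having me."
  ],
  "timeline": [
    { "start": 0.0, "end": 1.8, "speaker": "speaker_1", "text": "Welcome to the show.", "confidence": 0.96 },
    { "start": 1.9, "end": 3.4, "speaker": "speaker_2", "text": "Thanks for having me.", "confidence": 0.92 }
  ],
  "created": "2024-01-01T12:00:00Z",
  "updated": "2024-01-01T12:01:30Z"
}
```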
## Creating a Speech Task
Speech tasks can be created using either a file already stored in your project or a public (or signed) URL.
When processing completes, ittybit will create an Intelligence file in your project and (if a `webhook_url` was provided) send the results to that endpoint.
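As a minimal sketch, a Speech task can be created with a single HTTP request. The endpoint path, request body fields, and `Bearer` auth header shown here are assumptions based on this page; check the Tasks API reference for the exact parameters.

```typescript
// Create a Speech task from a public or signed media URL.
// Endpoint and body shape are assumed; verify against the Tasks API reference.
const ITTYBIT_API_KEY = "YOUR_API_KEY"; // replace with your project API key

const response = await fetch("https://api.ittybit.com/tasks", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${ITTYBIT_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    kind: "speech",                                       // run the Speech task
    url: "https://example.com/podcast-episode.mp3",       // or reference a file already in your project
    webhook_url: "https://example.com/webhooks/ittybit",  // optional: receive results at this endpoint
  }),
});

const task = await response.json();
console.log(task);
```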
### Webhook Example
You can handle Speech task results in your own server or Supabase Edge Function:
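For example, a Supabase Edge Function (Deno) could receive the webhook and assemble a transcript. This is a sketch only: the payload shape mirrors the File Structure table below, the per-segment `text` field is an assumption, and signature verification and storage are left as placeholders.

```typescript
// Supabase Edge Function: handle ittybit Speech webhook deliveries.
// Payload fields follow the File Structure table; adjust to the actual delivery format.
Deno.serve(async (req) => {
  const intelligence = await req.json();

  if (intelligence.kind === "speech" && intelligence.detected) {
    // Join time-coded segments into a plain transcript string.
    const transcript = intelligence.timeline
      .map((segment: { speaker: string; text?: string }) =>
        `${segment.speaker}: ${segment.text ?? ""}`)
      .join("\n");

    console.log(`Language: ${intelligence.language}, speakers: ${intelligence.speakers}`);
    console.log(transcript);
    // TODO: persist the transcript, e.g. insert a row into a Supabase table.
  }

  return new Response("ok", { status: 200 });
});
```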
## File Structure
Speech task results follow a consistent structure, with top-level and timeline-level properties:
| Property | Type | Description |
|---|---|---|
| `id` | string | Unique file ID for the Intelligence file. |
| `object` | string | Always `"intelligence"`. |
| `kind` | string | Always `"speech"`. |
| `detected` | boolean | Whether speech was detected in the file. |
| `speakers` | integer | Number of distinct speakers detected. |
| `language` | string | Detected language code (ISO 639-1). |
| `text` | array | Transcript text segments (top-level, simplified). |
| `confidence` | number | Average confidence score for the transcript. |
| `timeline` | array | List of time-coded transcript segments with start, end, speaker, and confidence. |
| `created` / `updated` | string (ISO 8601) | Timestamps for creation and last update. |
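As a rough TypeScript sketch of that shape (field names taken from the table above; the per-segment `text` field is an assumption):

```typescript
// Approximate shape of a Speech Intelligence file, based on the table above.
interface SpeechSegment {
  start: number;       // segment start time in seconds
  end: number;         // segment end time in seconds
  speaker: string;     // speaker label for the segment
  text?: string;       // segment transcript text (assumed field)
  confidence: number;  // per-segment confidence score
}

interface SpeechIntelligenceFile {
  id: string;
  object: "intelligence";
  kind: "speech";
  detected: boolean;
  speakers: number;
  language: string;          // ISO 639-1 code
  text: string[];            // simplified top-level transcript segments
  confidence: number;        // average confidence for the transcript
  timeline: SpeechSegment[]; // time-coded segments
  created: string;           // ISO 8601 timestamp
  updated: string;           // ISO 8601 timestamp
}
```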
## Supported Inputs
Speech tasks work with:
- Audio files (`.mp3`, `.m4a`, `.wav`, `.ogg`)
- Video files with embedded audio (`.mp4`, `.mov`, `.webm`)
## Common Use Cases
- Video and podcast transcription
- Generating subtitles or captions
- Searchable transcripts and AI summaries
- Creating text-based chapter markers
## Example Integration
Speech data can be combined with other Ittybit features such as Chapters or Clips:
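One sketch of such a combination: deriving rough chapter markers from the speech timeline (for example at speaker changes), which could then feed a Chapters task or be used to request Clips. The Chapters/Clips API calls themselves are not shown, and the per-segment `text` field is an assumption.

```typescript
// Derive rough chapter markers from speech timeline segments by detecting
// speaker changes. The resulting markers could then be passed to a Chapters
// task or used to request Clips (those API calls are not shown here).
interface Segment {
  start: number;
  end: number;
  speaker: string;
  text?: string; // assumed per-segment field
}

function chapterMarkersFromSpeech(timeline: Segment[]): { start: number; title: string }[] {
  const markers: { start: number; title: string }[] = [];
  let lastSpeaker: string | null = null;

  for (const segment of timeline) {
    if (segment.speaker !== lastSpeaker) {
      markers.push({
        start: segment.start,
        title: segment.text?.slice(0, 60) ?? segment.speaker, // short label for the chapter
      });
      lastSpeaker = segment.speaker;
    }
  }

  return markers;
}
```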