Speech

Overview

The Speech task transcribes spoken audio from a video or audio file into structured text.

Speech files are part of Intelligence and provide detailed transcripts, speaker information, confidence levels, and time-aligned segments.

When a Speech task runs, it creates an Intelligence file with kind: "speech" and a .json output containing the transcription data.


Example Output

{
  "id": "file_qrstuvwx1234",
  "object": "intelligence",
  "kind": "speech",
  "detected": true,
  "speakers": 2,
  "language": "en",
  "text": [
    "Hello, and welcome to UkeTube. I'm Jesse Doe.",
    "And I'm John Doe. Today we're going to be learning Sandstorm by Darude."
  ],
  "confidence": 0.72,
  "timeline": [
    {
      "index": 0,
      "start": 12.00,
      "end": 14.50,
      "detected": true,
      "speaker": 0,
      "text": "Hello, and welcome to UkeTube. I'm Jesse Doe.",
      "confidence": 0.89
    },
    {
      "index": 1,
      "start": 14.80,
      "end": 18.28,
      "detected": true,
      "speaker": 1,
      "text": "And I'm John Doe. Today we're going to be learning Sandstorm by Darude.",
      "confidence": 0.91
    }
  ],
  "created": "2025-01-01T01:23:45Z",
  "updated": "2025-01-01T01:23:45Z"
}
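
Because the transcription is delivered as a plain .json output, it can be fetched and parsed like any other JSON document once you have its URL (for example from a webhook payload). A minimal sketch, where mediaUrl is a hypothetical URL for the .json output shown above:

// `mediaUrl` is a hypothetical URL pointing at the Intelligence file's .json output
const response = await fetch(mediaUrl);
const speech = await response.json();

if (speech.detected) {
  console.log(`${speech.speakers} speaker(s), language: ${speech.language}`);
  // The top-level `text` array holds the simplified transcript segments
  console.log(speech.text.join("\n"));
}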

Creating a Speech Task

Speech tasks can be created using either a file already stored in your project or a public (or signed) URL.

import { IttybitClient } from "@ittybit/sdk";

const ittybit = new IttybitClient({
  apiKey: process.env.ITTYBIT_API_KEY!
});

const task = await ittybit.tasks.create({
  kind: "speech",
  url: "https://example.com/video.mp4",
  description: "Transcribe spoken audio to text",
  webhook_url: "https://your-app.com/speech-webhook"
});

console.log("Task created:", task.id);
console.log("Status:", task.status);

When processing completes, Ittybit creates an Intelligence file in your project and, if a webhook_url was provided, sends the results to that endpoint.


Webhook Example

You can handle Speech task results in your own server or Supabase Edge Function:

app.post("/speech-webhook", async (req, res) => {
  const { kind, status, results } = req.body || {};

  if (kind !== "speech" || status !== "completed") {
    return res.status(200).send("Not a completed Speech task");
  }

  console.log("Transcript:", results.text);
  console.log("Detected speakers:", results.speakers);
  console.log("Language:", results.language);

  res.status(200).send("Speech results received");
});
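
How you persist the payload is up to you. As one illustrative option, the handler above could call a helper like the one below (assuming, as in the example output, that results includes the Intelligence file's id):

import { mkdir, writeFile } from "node:fs/promises";

// Persist a completed Speech payload to disk; a database would work equally well.
async function saveTranscript(results: { id: string }) {
  await mkdir("transcripts", { recursive: true });
  await writeFile(
    `transcripts/${results.id}.json`,
    JSON.stringify(results, null, 2)
  );
}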

File Structure

Speech task results follow a consistent structure, with top-level and timeline-level properties:

| Property | Type | Description |
| --- | --- | --- |
| id | string | Unique file ID for the Intelligence file. |
| object | string | Always "intelligence". |
| kind | string | Always "speech". |
| detected | boolean | Whether speech was detected in the file. |
| speakers | integer | Number of distinct speakers detected. |
| language | string | Detected language code (ISO 639-1). |
| text | array | Transcript text segments (top-level, simplified). |
| confidence | number | Average confidence score for the transcript. |
| timeline | array | List of time-coded transcript segments with start, end, speaker, and confidence. |
| created / updated | string (ISO 8601) | Timestamps for creation and last update. |
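
In TypeScript projects it can help to model this structure with types. The interfaces below are illustrative, derived from the table above rather than exported by the SDK:

// Illustrative types based on the property table (not part of @ittybit/sdk)
interface SpeechSegment {
  index: number;
  start: number;       // seconds
  end: number;         // seconds
  detected: boolean;
  speaker: number;     // zero-based speaker index
  text: string;
  confidence: number;  // 0–1
}

interface SpeechIntelligence {
  id: string;
  object: "intelligence";
  kind: "speech";
  detected: boolean;
  speakers: number;
  language: string;    // ISO 639-1 code
  text: string[];
  confidence: number;  // average across the transcript
  timeline: SpeechSegment[];
  created: string;     // ISO 8601
  updated: string;     // ISO 8601
}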

Supported Inputs

Speech tasks work with:

  • Audio files (.mp3, .m4a, .wav, .ogg)
  • Video files with embedded audio (.mp4, .mov, .webm)
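
For media already stored in your project, pass a file_id instead of a url (the same parameter used in the clip example further down). A sketch, where audioFileId is a hypothetical ID of an uploaded audio file:

const task = await ittybit.tasks.create({
  kind: "speech",
  file_id: audioFileId, // hypothetical ID of a file already in your project
  description: "Transcribe an uploaded podcast episode"
});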

Common Use Cases

  • Video and podcast transcription
  • Generating subtitles or captions (see the WebVTT sketch after this list)
  • Searchable transcripts and AI summaries
  • Creating text-based chapter markers
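
The timeline segments map naturally onto caption formats. A minimal sketch converting them to WebVTT, assuming results is a parsed Speech file like the example output above:

// Format seconds as "HH:MM:SS.mmm" (valid for media under 24 hours)
const toTimestamp = (seconds: number) =>
  new Date(seconds * 1000).toISOString().slice(11, 23);

const cues = results.timeline.map((segment) =>
  `${toTimestamp(segment.start)} --> ${toTimestamp(segment.end)}\n${segment.text}`
);

// "WEBVTT" header, then one blank-line-separated cue per segment
const vtt = ["WEBVTT", ...cues].join("\n\n") + "\n";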

Example Integration

Speech data can be combined with other Ittybit features such as Chapters or Clips:

const chapters = results.timeline.map((segment) => ({
  index: segment.index,
  start: segment.start,
  end: segment.end,
  text: segment.text
}));

// Example: create a short clip per speech segment
// (awaiting Promise.all so `clips` holds the created tasks, not pending promises)
const clips = await Promise.all(
  chapters.map((chapter) =>
    ittybit.tasks.create({
      kind: "video",
      file_id: sourceFileId, // ID of the source video file
      start: chapter.start,
      end: chapter.end,
      filename: `clip-${chapter.index}.mp4`,
      ref: `speech-segment-${chapter.index}`
    })
  )
);
