Speech

Overview

The Speech task transcribes spoken audio from a video or audio file into structured text.

Speech files are part of Intelligence and provide detailed transcripts, speaker information, confidence levels, and time-aligned segments.

When a Speech task runs, it creates an Intelligence file with kind: "speech" and a .json output containing the transcription data.


Example Output

{
  "id": "file_qrstuvwx1234",
  "object": "intelligence",
  "kind": "speech",
  "detected": true,
  "speakers": 2,
  "language": "en",
  "text": [
    "Hello, and welcome to UkeTube. I'm Jesse Doe.",
    "And I'm John Doe. Today we're going to be learning Sandstorm by Darude."
  ],
  "confidence": 0.72,
  "timeline": [
    {
      "index": 0,
      "start": 12.00,
      "end": 14.50,
      "detected": true,
      "speaker": 0,
      "text": "Hello, and welcome to UkeTube. I'm Jesse Doe.",
      "confidence": 0.89
    },
    {
      "index": 1,
      "start": 14.80,
      "end": 18.28,
      "detected": true,
      "speaker": 1,
      "text": "And I'm John Doe. Today we're going to be learning Sandstorm by Darude.",
      "confidence": 0.91
    }
  ],
  "created": "2025-01-01T01:23:45Z",
  "updated": "2025-01-01T01:23:45Z"
}
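
Because the transcription is delivered as a plain .json output, it can be fetched and parsed like any other JSON document once you have its URL (for example from a webhook payload). A minimal sketch, where mediaUrl is a hypothetical URL for the .json output shown above:

// `mediaUrl` is a hypothetical URL pointing at the Intelligence file's .json output
const response = await fetch(mediaUrl);
const speech = await response.json();

if (speech.detected) {
  console.log(`${speech.speakers} speaker(s), language: ${speech.language}`);
  // The top-level `text` array holds the simplified transcript segments
  console.log(speech.text.join("\n"));
}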

Creating a Speech Task

Speech tasks can be created using either a file already stored in your project or a public (or signed) URL.

import { IttybitClient } from "@ittybit/sdk";

const ittybit = new IttybitClient({
  apiKey: process.env.ITTYBIT_API_KEY!
});

const task = await ittybit.tasks.create({
  kind: "speech",
  url: "https://example.com/video.mp4",
  description: "Transcribe spoken audio to text",
  webhook_url: "https://your-app.com/speech-webhook"
});

console.log("Task created:", task.id);
console.log("Status:", task.status);

When processing completes, Ittybit creates an Intelligence file in your project and, if a webhook_url was provided, sends the results to that endpoint.


Webhook Example

You can handle Speech task results in your own server or Supabase Edge Function:

app.post("/speech-webhook", async (req, res) => {
  const { kind, status, results } = req.body || {};

  if (kind !== "speech" || status !== "completed") {
    return res.status(200).send("Not a completed Speech task");
  }

  console.log("Transcript:", results.text);
  console.log("Detected speakers:", results.speakers);
  console.log("Language:", results.language);

  res.status(200).send("Speech results received");
});
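
How you persist the payload is up to you. As one illustrative option, the handler above could call a helper like the one below (assuming, as in the example output, that results includes the Intelligence file's id):

import { mkdir, writeFile } from "node:fs/promises";

// Persist a completed Speech payload to disk; a database would work equally well.
async function saveTranscript(results: { id: string }) {
  await mkdir("transcripts", { recursive: true });
  await writeFile(
    `transcripts/${results.id}.json`,
    JSON.stringify(results, null, 2)
  );
}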

File Structure

Speech task results follow a consistent structure, with top-level and timeline-level properties:

| Property | Type | Description |
| --- | --- | --- |
| id | string | Unique file ID for the Intelligence file. |
| object | string | Always "intelligence". |
| kind | string | Always "speech". |
| detected | boolean | Whether speech was detected in the file. |
| speakers | integer | Number of distinct speakers detected. |
| language | string | Detected language code (ISO 639-1). |
| text | array | Transcript text segments (top-level, simplified). |
| confidence | number | Average confidence score for the transcript. |
| timeline | array | List of time-coded transcript segments with start, end, speaker, and confidence. |
| created / updated | string (ISO 8601) | Timestamps for creation and last update. |
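
In TypeScript projects it can help to model this structure with types. The interfaces below are illustrative, derived from the table above rather than exported by the SDK:

// Illustrative types based on the property table (not part of @ittybit/sdk)
interface SpeechSegment {
  index: number;
  start: number;       // seconds
  end: number;         // seconds
  detected: boolean;
  speaker: number;     // zero-based speaker index
  text: string;
  confidence: number;  // 0–1
}

interface SpeechIntelligence {
  id: string;
  object: "intelligence";
  kind: "speech";
  detected: boolean;
  speakers: number;
  language: string;    // ISO 639-1 code
  text: string[];
  confidence: number;  // average across the transcript
  timeline: SpeechSegment[];
  created: string;     // ISO 8601
  updated: string;     // ISO 8601
}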

Supported Inputs

Speech tasks work with:

  • Audio files (.mp3, .m4a, .wav, .ogg)
  • Video files with embedded audio (.mp4, .mov, .webm)
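
For media already stored in your project, pass a file_id instead of a url (the same parameter used in the clip example further down). A sketch, where audioFileId is a hypothetical ID of an uploaded audio file:

const task = await ittybit.tasks.create({
  kind: "speech",
  file_id: audioFileId, // hypothetical ID of a file already in your project
  description: "Transcribe an uploaded podcast episode"
});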

Common Use Cases

  • Video and podcast transcription
  • Generating subtitles or captions (see the WebVTT sketch after this list)
  • Searchable transcripts and AI summaries
  • Creating text-based chapter markers
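
The timeline segments map naturally onto caption formats. A minimal sketch converting them to WebVTT, assuming results is a parsed Speech file like the example output above:

// Format seconds as "HH:MM:SS.mmm" (valid for media under 24 hours)
const toTimestamp = (seconds: number) =>
  new Date(seconds * 1000).toISOString().slice(11, 23);

const cues = results.timeline.map((segment) =>
  `${toTimestamp(segment.start)} --> ${toTimestamp(segment.end)}\n${segment.text}`
);

// "WEBVTT" header, then one blank-line-separated cue per segment
const vtt = ["WEBVTT", ...cues].join("\n\n") + "\n";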

Example Integration

Speech data can be combined with other Ittybit features such as Chapters or Clips:

const chapters = results.timeline.map((segment) => ({
  index: segment.index,
  start: segment.start,
  end: segment.end,
  text: segment.text
}));

// Example: create a short clip per speech segment
// (awaiting Promise.all so `clips` holds the created tasks, not pending promises)
const clips = await Promise.all(
  chapters.map((chapter) =>
    ittybit.tasks.create({
      kind: "video",
      file_id: sourceFileId, // ID of the source video file
      start: chapter.start,
      end: chapter.end,
      filename: `clip-${chapter.index}.mp4`,
      ref: `speech-segment-${chapter.index}`
    })
  )
);
