Transcription (Speech-to-Text)

Overview

On-device speech recognition powered by Whisper. WhisperPipeline handles the full mel-spectrogram → encoder → decoder chain, with automatic VAD chunking (Silero-VAD pre-pass) and long-audio stitching so callers can pass arbitrarily long buffers in a single infer call.

Flutter consumers go through the singleton start_model + infer JSON path — there is no direct WhisperPipeline constructor on Dart. Both surfaces share the same on-disk cache and response shape.

Note

The transcription pipeline currently uses WhisperPipeline. More model families are planned for future releases.

Supported Models

| Model | HF Repo | Notes |
| --- | --- | --- |
| Whisper Large V3 Turbo | TheStageAI/thewhisper-large-v3-turbo | Auto VAD chunking, 10 s windows |

API Reference

Full Constructor

let stt = try await WhisperPipeline(
    engines_path: "TheStageAI/thewhisper-large-v3-turbo",
    device: "npu",
    devices: nil,
    overlap_seconds: 0,
    use_internal_vad: true,
    revision: "main",
    on_load_progress: nil
)

| Parameter | Type | Description |
| --- | --- | --- |
| engines_path | String | HuggingFace repo ID or local path to the compiled engine bundle. |
| device | String | Compute backend: "npu", "gpu", or "cpu". |
| devices | [String]? | Optional multi-device list (overrides device). |
| overlap_seconds | Int | Overlap between consecutive audio windows, in seconds (default 0). |
| use_internal_vad | Bool | Enable/disable the Silero-VAD pre-pass (default true). |
| revision | String | HuggingFace revision / branch (default "main"). |
| on_load_progress | LoadProgressHandler? | Optional callback fired during download, extraction and loading phases. |

Inputs / Outputs

| Direction | Name | Type | Description |
| --- | --- | --- | --- |
| input | audio | [Float] | 16 kHz mono PCM, samples in [-1.0, 1.0], any length. |
| input | language | String (default "en") | Whisper language code: en, fr, de, es, pt, ja, ko, zh, ar, hi, ru, … |
| input | config.max_new_tokens | Int? | Cap per-window decode. |
| input | config.return_tokens | Bool (default false) | Include token IDs in ASRResult. |
| output | ASRResult.text | String | Transcribed text. |
| output | ASRResult.token_count | Int | Total decoded tokens (sum across windows). |
| output | ASRResult.decode_seconds | Double | Decoder wall time, in seconds. |
| output | ASRResult.tokens | [Int]? | Token IDs (only if return_tokens == true). |

Audio I/O

  • 16 kHz mono [Float], samples normalized to [-1.0, 1.0].

  • Long buffers are split internally into the bundle’s chunk_seconds windows. The shipping TheStageAI/thewhisper-large-v3-turbo uses 10 s windows.

  • Overlap between windows is configurable via the overlap_seconds constructor argument (default 0).

  • Mismatched-rate input is not auto-resampled — convert your mic capture to 16 kHz mono Float32 before calling infer.
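The format requirements above can be checked before calling infer. The following is a minimal sketch, not part of TheStageSDK: the helper name audio_input_issues and its messages are hypothetical, and it only catches obvious problems (wrong declared sample rate, empty buffer, non-finite samples, values outside [-1.0, 1.0]).

```swift
import Foundation

/// Hypothetical pre-flight check (not an SDK API): returns a list of
/// human-readable problems found in a buffer intended for infer.
func audio_input_issues(samples: [Float], sample_rate: Double) -> [String] {
    var issues: [String] = []
    if sample_rate != 16_000 {
        issues.append("sample rate is \(Int(sample_rate)) Hz, expected 16000 Hz")
    }
    if samples.isEmpty {
        issues.append("buffer is empty")
        return issues
    }
    if samples.contains(where: { !$0.isFinite }) {
        issues.append("buffer contains NaN or infinite samples")
    }
    // Peak magnitude over the finite samples only.
    let peak = samples.lazy.filter { $0.isFinite }.map { abs($0) }.max() ?? 0
    if peak > 1.0 {
        issues.append("peak \(peak) is outside [-1.0, 1.0]; input may be unnormalized Int16")
    }
    return issues
}
```

An empty result means none of these checks failed; it does not prove the audio is mono or that it contains speech.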

Singleton API

import TheStageSDK

let ai = TheStageAI.shared
try await ai.initialize(apiToken: "th_…")

try await ai.start_model(
    model_name: "stt",
    engines_path: "TheStageAI/thewhisper-large-v3-turbo"
)

let json = try ai.infer(
    model_name: "stt",
    input_json: [
        "audio": audio_samples,
        "language": "en",
        "return_tokens": true
    ]
)
let text = json[0]["transcription"] as! String

Response Keys

| Key | Type | Description |
| --- | --- | --- |
| transcription | String | The transcribed text. |
| token_count | Int | Total decoded tokens. |
| decode_seconds | Double | Decoder wall time, in seconds. |
| tokens | [Int] | Token IDs (only present when return_tokens == true). |
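Reading the response with optional casts avoids a crash if a key is missing. The sketch below assumes each response element is a [String: Any] dictionary carrying the keys listed above; the TranscriptionResponse type is hypothetical and not part of the SDK.

```swift
import Foundation

/// Hypothetical typed view over one element of the infer response.
/// Assumes the element is a [String: Any] dictionary with the documented keys.
struct TranscriptionResponse {
    let transcription: String
    let token_count: Int
    let decode_seconds: Double
    let tokens: [Int]?   // nil unless return_tokens was true

    /// Returns nil when a required key is missing or has an unexpected type.
    init?(_ entry: [String: Any]) {
        guard let transcription = entry["transcription"] as? String,
              let token_count = entry["token_count"] as? Int,
              let decode_seconds = entry["decode_seconds"] as? Double else {
            return nil
        }
        self.transcription = transcription
        self.token_count = token_count
        self.decode_seconds = decode_seconds
        self.tokens = entry["tokens"] as? [Int]
    }
}
```

With a type like this, a caller could write `TranscriptionResponse(json[0])?.transcription` instead of a forced cast, provided the response element really has that dictionary shape.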

Usage Guides

Transcribing a recorded audio file

The most common use case: load a pre-recorded audio file and get a text transcript. The key requirement is that your audio must be 16 kHz mono Float32 with samples in [-1.0, 1.0]. The pipeline does not auto-resample, so convert beforehand.

Swift:

import TheStageSDK
import AVFoundation

let ai = TheStageAI.shared
try await ai.initialize(apiToken: "th_…")

let stt = try await WhisperPipeline(
    engines_path: "TheStageAI/thewhisper-large-v3-turbo"
)

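// Load the file into a buffer in its native format
let url = Bundle.main.url(forResource: "recording", withExtension: "wav")!
let file = try AVAudioFile(forReading: url)
let source = AVAudioPCMBuffer(pcmFormat: file.processingFormat,
                              frameCapacity: AVAudioFrameCount(file.length))!
try file.read(into: source)

// Convert to 16 kHz mono Float32. convert(to:from:) cannot change the sample
// rate, so feed the converter through an input block instead.
let format = AVAudioFormat(commonFormat: .pcmFormatFloat32,
                           sampleRate: 16000, channels: 1, interleaved: false)!
let converter = AVAudioConverter(from: file.processingFormat, to: format)!
let capacity = AVAudioFrameCount(
    (Double(file.length) * 16000.0 / file.processingFormat.sampleRate).rounded(.up)
)
let buffer = AVAudioPCMBuffer(pcmFormat: format, frameCapacity: capacity)!
var source_consumed = false
var conversion_error: NSError?
let status = converter.convert(to: buffer, error: &conversion_error) { _, input_status in
    if source_consumed {
        input_status.pointee = .endOfStream   // nothing left to hand over
        return nil
    }
    source_consumed = true
    input_status.pointee = .haveData
    return source
}
if status == .error, let error = conversion_error { throw error }
let samples = Array(UnsafeBufferPointer(
    start: buffer.floatChannelData![0],
    count: Int(buffer.frameLength)
))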

let result = stt.infer(audio: samples, language: "en")
print(result.text)

Flutter:

import 'package:thestage_apple_sdk/thestage_apple_sdk.dart';
import 'dart:typed_data';

await TheStageFlutterSDK.initialize(api_token: 'th_…');

await TheStageFlutterSDK.start_model(
  model_name: 'stt',
  engines_path: 'TheStageAI/thewhisper-large-v3-turbo',
);

// audio_samples must be Float32List at 16 kHz mono, values in [-1.0, 1.0]
final result = await TheStageFlutterSDK.infer(
  model_name: 'stt',
  input_json: {
    'audio': audio_samples,
    'language': 'en',
  },
);
print(result[0]['transcription']);

Attention

Always call initialize(apiToken:) before constructing any pipeline. Forgetting this is the most common source of “model loading fails” errors.

Live microphone transcription

For real-time transcription, capture audio from the microphone, accumulate samples, and periodically send them to the pipeline. The critical requirement is configuring your audio session to capture at 16 kHz mono.

Swift:

import TheStageSDK
import AVFoundation

let audioEngine = AVAudioEngine()
let inputNode = audioEngine.inputNode
let recordingFormat = AVAudioFormat(commonFormat: .pcmFormatFloat32,
                                    sampleRate: 16000,
                                    channels: 1,
                                    interleaved: false)!

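// The tap callback runs on an audio thread; protect `accumulated` with a lock
// or a serial queue in production code.
var accumulated: [Float] = []

// This assumes the input hardware already delivers 16 kHz mono. If it does not,
// install the tap with inputNode.outputFormat(forBus: 0) and resample afterward.
inputNode.installTap(onBus: 0, bufferSize: 4096, format: recordingFormat) {
    buffer, _ in
    let ptr = buffer.floatChannelData![0]
    let samples = Array(UnsafeBufferPointer(start: ptr,
                                             count: Int(buffer.frameLength)))
    accumulated.append(contentsOf: samples)
}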

audioEngine.prepare()
try audioEngine.start()

// After the user stops speaking, transcribe the accumulated audio.
// `stt` is a WhisperPipeline constructed as in the previous example.
audioEngine.stop()
inputNode.removeTap(onBus: 0)

let result = stt.infer(audio: accumulated, language: "en")
print(result.text)

Note

If your audio session runs at 44.1 kHz or 48 kHz, you must resample to 16 kHz before passing to infer. The pipeline does not resample internally.
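To illustrate what resampling to 16 kHz involves, here is a minimal linear-interpolation sketch. The helper resample_linear is hypothetical and not part of the SDK. It applies no anti-aliasing low-pass filter, so it is only a rough stand-in; a converter such as AVAudioConverter is likely the better choice for production audio.

```swift
import Foundation

/// Hypothetical helper: resamples mono samples by linear interpolation.
/// No low-pass filtering is applied, so downsampling can alias.
func resample_linear(_ input: [Float],
                     from source_rate: Double,
                     to target_rate: Double = 16_000) -> [Float] {
    guard input.count > 1, source_rate > 0, target_rate > 0 else { return input }
    if source_rate == target_rate { return input }

    let step = source_rate / target_rate   // source samples per output sample
    let output_count = Int((Double(input.count) / step).rounded(.down))
    var output = [Float](repeating: 0, count: output_count)

    for i in 0..<output_count {
        let position = Double(i) * step
        let left = Int(position)
        let right = min(left + 1, input.count - 1)
        let t = Float(position - Double(left))   // fractional distance past `left`
        output[i] = input[left] * (1 - t) + input[right] * t
    }
    return output
}
```

For example, `resample_linear(captured, from: 48_000)` would reduce a 48 kHz mono capture to one third of its sample count before it is passed to infer.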

Transcribing long audio (> 10 seconds)

You do not need to split long recordings manually. The pipeline automatically divides audio into 10-second windows (the bundle’s chunk_seconds) and stitches the transcripts together.

For better accuracy at window boundaries, set overlap_seconds so consecutive windows share a few seconds of audio. This prevents words that straddle a boundary from being cut off.

Swift:

let stt = try await WhisperPipeline(
    engines_path: "TheStageAI/thewhisper-large-v3-turbo",
    overlap_seconds: 2   // 2 seconds of overlap between windows
)

// Pass the entire recording — the pipeline handles chunking
let result = stt.infer(audio: full_recording_samples, language: "en")
print(result.text)

Flutter:

await TheStageFlutterSDK.start_model(
  model_name: 'stt',
  engines_path: 'TheStageAI/thewhisper-large-v3-turbo',
  config: {'overlap_seconds': 2},
);

final result = await TheStageFlutterSDK.infer(
  model_name: 'stt',
  input_json: {'audio': full_recording_samples, 'language': 'en'},
);

Note

Each 10-second window is processed sequentially, so a 60-second recording takes roughly 6× the time of a single window. If latency matters, consider pre-segmenting with VAD and transcribing only the speech portions.

Disabling internal VAD for pre-segmented audio

WhisperPipeline includes a Silero-VAD pre-pass that detects speech segments before transcribing. When your audio is already segmented — for example, from TheStageVoiceAgent or your own VAD — disable the internal VAD to skip redundant processing.

Swift — constructor parameter:

let stt = try await WhisperPipeline(
    engines_path: "TheStageAI/thewhisper-large-v3-turbo",
    use_internal_vad: false
)

Swift — singleton API:

try await ai.start_model(
    model_name: "stt",
    engines_path: "TheStageAI/thewhisper-large-v3-turbo",
    config: ["use_internal_vad": false]
)

Flutter:

await TheStageFlutterSDK.start_model(
  model_name: 'stt',
  engines_path: 'TheStageAI/thewhisper-large-v3-turbo',
  config: {'use_internal_vad': false},
);

Attention

With internal VAD disabled, the pipeline transcribes everything — including silence. If your audio contains long silent stretches, you may get hallucinated text. Only disable VAD when you are certain the input contains speech.

Streaming live transcription (push-based ASR)

For real-time use cases like voice assistants and live captions, batch infer is the wrong tool — you want live partial transcripts that grow as the user speaks, then a final authoritative result when the turn ends.

WhisperPipeline.open_streamer() returns an ASRStreamer that mirrors the TTS streamer’s push-based shape: send audio with send(_:), read stable partial transcripts from partials, and call finish() for the authoritative end-of-turn transcript. A single serial worker re-decodes the growing buffer and commits stable text via LocalAgreement, so partials never flicker or retract.

Swift:

let stt = try await WhisperPipeline(
    engines_path: "TheStageAI/thewhisper-large-v3-turbo",
    revision: "main"
)

let streamer = stt.open_streamer(language: "en", partial_interval_ms: 600)

let captions = Task {
    for await text in streamer.partials {
        print("partial: \(text)")   // committed-so-far, grows monotonically
    }
}

for await frame in microphone_frames {          // [Float] @ 16 kHz mono
    streamer.send(frame)
    if vad_detected_pause { streamer.flush() }   // finalize segment + trim
}

let final_text = await streamer.finish()         // closes `partials`
await captions.value
print("final: \(final_text)")

Key behaviors:

  • partials is the cosmetic/live caption. finish() is the trusted result and always covers the complete audio (including the last word).

  • flush() at VAD pauses keeps per-pass latency flat on long turns: it commits settled text and re-decodes only the uncommitted tail afterward.

  • cancel() aborts the turn without a final decode — use it for barge-in.

  • partial_interval_ms (default 600) bounds how often partial passes run.

  • Convert mic input to 16 kHz mono Float32 first — input is not resampled.
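The reason committed partials never retract can be illustrated with a small LocalAgreement sketch: text is committed only once two consecutive decoding passes agree on it. This is an illustrative word-level version under that assumption, not the SDK's implementation; the LocalAgreement type and its update method are hypothetical.

```swift
import Foundation

/// Illustrative LocalAgreement-2 policy over whitespace-separated words.
/// Not the SDK's implementation.
struct LocalAgreement {
    private(set) var committed: [String] = []
    private var previous: [String] = []

    /// Feed the latest full hypothesis; returns the words newly committed by this pass.
    mutating func update(hypothesis: String) -> [String] {
        let words = hypothesis.split(separator: " ").map(String.init)

        // Longest common prefix of the previous and current hypotheses.
        var agreed = 0
        while agreed < min(previous.count, words.count),
              previous[agreed] == words[agreed] {
            agreed += 1
        }
        previous = words

        // Commit only the agreed words beyond what is already committed,
        // so the committed text grows and is never rewritten.
        guard agreed > committed.count else { return [] }
        let fresh = Array(words[committed.count..<agreed])
        committed.append(contentsOf: fresh)
        return fresh
    }
}
```

Because a word must survive two passes before it is committed, the committed text lags the newest audio by about one pass, which is consistent with finish() being the result to trust for the last words of a turn.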

Attention

Streaming ASR is a Swift-direct API on WhisperPipeline. There is no singleton/JSON or Flutter streaming-ASR entry point. For live speech-to-text on Flutter, use the Voice Agent, which runs the same streaming ASR internally.

Multi-language transcription

Whisper supports many languages. Set the language parameter to the appropriate ISO 639-1 code. The pipeline does not auto-detect the language; if you don’t specify one, it defaults to English ("en").

Common language codes:

| Language | Code | Language | Code |
| --- | --- | --- | --- |
| English | en | Japanese | ja |
| French | fr | Korean | ko |
| German | de | Chinese | zh |
| Spanish | es | Arabic | ar |
| Portuguese | pt | Hindi | hi |
| Russian | ru | Italian | it |

Swift:

let result = stt.infer(audio: audio_samples, language: "fr")
print(result.text)  // French transcription

Flutter:

final result = await TheStageFlutterSDK.infer(
  model_name: 'stt',
  input_json: {
    'audio': audio_samples,
    'language': 'ja',
  },
);
print(result[0]['transcription']);  // Japanese transcription

Troubleshooting

Transcription is empty or inaccurate

The most common cause is wrong audio format. The pipeline requires 16 kHz mono Float32 with samples normalized to [-1.0, 1.0]. It does not auto-resample.

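  1. Verify your sample rate is exactly 16000 Hz. Audio captured at 44.1 kHz or 48 kHz and passed without resampling is interpreted as 16 kHz, so speech sounds roughly 3× slower and lower-pitched to the model and the transcript is unreliable.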

  2. Verify the audio is mono (single channel). Stereo input will be misinterpreted.

  3. Verify sample values are in [-1.0, 1.0]. Int16 samples (range -32768 to 32767) must be divided by 32768.0 first.

  4. Check that the audio actually contains speech — silent or near-silent input will produce empty results.

// Correct: convert Int16 samples to Float32
let float_samples = int16_samples.map { Float($0) / 32768.0 }
let result = stt.infer(audio: float_samples, language: "en")
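Checks 2 and 3 can be combined when the source is interleaved stereo Int16. The sketch below assumes samples are laid out left, right, left, right; the helper mono_float_samples is hypothetical and not part of the SDK.

```swift
import Foundation

/// Hypothetical helper: averages interleaved stereo Int16 frames (L, R, L, R)
/// into mono Float samples scaled to [-1.0, 1.0].
func mono_float_samples(from interleaved_stereo: [Int16]) -> [Float] {
    let frame_count = interleaved_stereo.count / 2   // a trailing odd sample is ignored
    return (0..<frame_count).map { frame in
        let left = Float(interleaved_stereo[2 * frame]) / 32768.0
        let right = Float(interleaved_stereo[2 * frame + 1]) / 32768.0
        return (left + right) * 0.5
    }
}
```

The output still has the source sample rate, so it needs resampling afterward if that rate is not 16 kHz.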

Transcription drops words at chunk boundaries

When transcribing long audio, the pipeline splits it into 10-second windows. With the default overlap_seconds: 0, words that straddle the boundary between two windows may be cut off or duplicated.

Increase overlap_seconds to give the decoder shared context at each boundary:

let stt = try await WhisperPipeline(
    engines_path: "TheStageAI/thewhisper-large-v3-turbo",
    overlap_seconds: 2
)

A value of 1–2 seconds is usually sufficient. Higher overlap improves boundary accuracy but increases total processing time.
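The cost of overlap can be estimated with a little arithmetic. The sketch below assumes each window advances by chunk_seconds minus overlap_seconds; the document does not specify the pipeline's exact windowing or stitching scheme, so treat the result as an approximation. The function estimated_window_count is hypothetical, and it ignores any windows the VAD pre-pass would skip.

```swift
import Foundation

/// Hypothetical estimate of how many windows a recording needs, assuming
/// each window starts (chunk_seconds - overlap_seconds) after the previous one.
func estimated_window_count(duration_seconds: Double,
                            chunk_seconds: Double = 10,
                            overlap_seconds: Double = 0) -> Int {
    precondition(overlap_seconds >= 0 && chunk_seconds > overlap_seconds,
                 "overlap must be non-negative and shorter than the window")
    guard duration_seconds > 0 else { return 0 }
    if duration_seconds <= chunk_seconds { return 1 }

    let stride = chunk_seconds - overlap_seconds
    return Int(((duration_seconds - overlap_seconds) / stride).rounded(.up))
}
```

Under this assumption a 2-minute recording needs 12 windows with no overlap and 15 windows with a 2-second overlap, which is about 25% more decoder work.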

Slow transcription on long audio

Each 10-second window is processed sequentially. A 2-minute recording requires ~12 sequential decoder passes, which can take noticeable time on less powerful devices.

To speed things up:

  • Use VAD (use_internal_vad: true, the default) to skip silent segments. If only 30 seconds of a 2-minute file contain speech, only those segments are transcribed.

  • If you have pre-segmented audio, pass only the speech portions rather than the entire recording.

  • For real-time use cases, transcribe in shorter increments instead of accumulating minutes of audio before a single infer call.

Wrong language in output

The pipeline does not auto-detect the spoken language, and language defaults to "en". If you pass French audio without setting language: "fr", the model tries to decode it as English and the output is unreliable.

Always set the language parameter explicitly:

Swift:

let result = stt.infer(audio: french_audio, language: "fr")

Flutter:

final result = await TheStageFlutterSDK.infer(
  model_name: 'stt',
  input_json: {
    'audio': french_audio,
    'language': 'fr',
  },
);

See the language code table in Multi-language transcription above for supported codes.

Model loading fails

Follow the same pattern as other pipelines:

  1. Verify that TheStageAI.shared.initialize(apiToken:) completed successfully before constructing WhisperPipeline.

  2. On first launch the model must be downloaded. Check that the device has a working network connection.

  3. Attach an on_load_progress callback to see which phase is stuck:

let stt = try await WhisperPipeline(
    engines_path: "TheStageAI/thewhisper-large-v3-turbo",
    on_load_progress: { p in
        print("[\(p.model)] \(p.phase) \(Int(p.fraction * 100))%")
    }
)

If progress stalls at .downloading, the network is the bottleneck. If it stalls at .loading, the device may be out of memory. Use prefetch_engines to pre-download the bundle over Wi-Fi before the user needs it.

Load Progress

Swift:

let stt = try await WhisperPipeline(
    engines_path: "TheStageAI/thewhisper-large-v3-turbo",
    on_load_progress: { p in
        print("[\(p.model)] \(p.phase) \(Int(p.fraction * 100))%")
    }
)

Flutter:

TheStageFlutterSDK.on_progress.listen((event) {
  if (event['model_name'] != 'stt') return;
  final phase    = event['phase']    as String?;
  final fraction = event['progress'] as double?;
  // Round to a whole percent, matching the Swift example
  print('[stt] $phase ${((fraction ?? 0) * 100).round()}%');
});

await TheStageFlutterSDK.start_model(
  model_name: 'stt',
  engines_path: 'TheStageAI/thewhisper-large-v3-turbo',
);

Prefetch Engines

let engines_dir = try await ai.prefetch_engines(
    repo_id: "TheStageAI/thewhisper-large-v3-turbo"
)

let stt = try await WhisperPipeline(engines_path: engines_dir)

Cleanup

Swift:

_ = try ai.stop_model(model_name: "stt")

Flutter:

await TheStageFlutterSDK.stop_model(model_name: 'stt');