Transcription (Speech-to-Text)

Overview

On-device speech recognition powered by Whisper. WhisperPipeline handles the full mel-spectrogram → encoder → decoder chain, with automatic VAD chunking (Silero-VAD pre-pass) and long-audio stitching so callers can pass arbitrarily long buffers in a single infer call.

Flutter consumers go through the singleton start_model + infer JSON path; there is no direct WhisperPipeline constructor in Dart. Both surfaces share the same on-disk cache and response shape.

Note

The transcription pipeline currently uses WhisperPipeline. More model families are planned for future releases.

Supported Models

Model	HF Repo	Notes
Whisper Large V3 Turbo	`TheStageAI/thewhisper-large-v3-turbo`	Auto VAD chunking, 10 s windows

API Reference

Full Constructor

let stt = try await WhisperPipeline(
    engines_path: "TheStageAI/thewhisper-large-v3-turbo",
    device: "npu",
    devices: nil,
    overlap_seconds: 0,
    use_internal_vad: true,
    revision: "main",
    on_load_progress: nil
)

Parameter	Type	Description
`engines_path`	`String`	HuggingFace repo ID or local path to the compiled engine bundle.
`device`	`String`	Compute backend: `"npu"`, `"gpu"`, or `"cpu"`.
`devices`	`[String: String]?`	Optional per-component device map (e.g. `["default": "npu", "melspec": "cpu"]`).
`overlap_seconds`	`Double`	Overlap between consecutive audio windows in seconds (default `0`).
`use_internal_vad`	`Bool`	Enable/disable the Silero-VAD pre-pass (default `true`).
`revision`	`String`	HuggingFace revision / branch (default `"main"`).
`on_load_progress`	`LoadProgressHandler?`	Optional callback fired during download, extraction and loading phases.

Inputs / Outputs

Direction	Type	Description
input `audio`	`[Float]`	16 kHz mono PCM, samples in `[-1.0, 1.0]`, any length.
input `language`	`String` (default `"en"`)	Whisper language code: `en`, `fr`, `de`, `es`, `pt`, `ja`, `ko`, `zh`, `ar`, `hi`, `ru`, …
input `config.max_new_tokens`	`Int?`	Maximum number of tokens decoded per window.
input `config.return_tokens`	`Bool` (default `false`)	Include token IDs in `ASRResult`.
output `ASRResult.text`	`String`	Transcribed text.
output `ASRResult.token_count`	`Int`	Total decoded tokens (sum across windows).
output `ASRResult.decode_seconds`	`Double`	Decoder wall-clock time, in seconds.
output `ASRResult.tokens`	`[Int]?`	Token IDs (only if `return_tokens == true`).
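
The token_count and decode_seconds outputs can be combined into rough speed figures. The sketch below is a hypothetical helper, not part of the SDK; DecodeStats and its property names are assumptions, and the real-time factor covers only the decoder time that decode_seconds reports.

```swift
import Foundation

/// Hypothetical helper, not part of the SDK: summarizes decode speed.
struct DecodeStats {
    let tokenCount: Int        // from ASRResult.token_count
    let decodeSeconds: Double  // from ASRResult.decode_seconds
    let audioSeconds: Double   // samples.count / 16000

    /// Decoded tokens per second of decoder wall-clock time.
    var tokensPerSecond: Double {
        decodeSeconds > 0 ? Double(tokenCount) / decodeSeconds : 0
    }

    /// Decoder time divided by audio duration; below 1.0 is faster than real time.
    var realTimeFactor: Double {
        audioSeconds > 0 ? decodeSeconds / audioSeconds : 0
    }
}

let stats = DecodeStats(tokenCount: 120, decodeSeconds: 1.5, audioSeconds: 30)
print(stats.tokensPerSecond, stats.realTimeFactor)
```

Tracking these two numbers across devices is a simple way to compare the "npu", "gpu", and "cpu" backends on the same recording.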

Audio I/O

16 kHz mono [Float], samples normalized to [-1.0, 1.0].
Long buffers are split internally into the bundle’s chunk_seconds windows. The shipping TheStageAI/thewhisper-large-v3-turbo uses 10 s windows.
Overlap between windows is configurable via the overlap_seconds constructor argument (default 0).
Mismatched-rate input is not auto-resampled — convert your mic capture to 16 kHz mono Float32 before calling infer.
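
Because the pipeline accepts a bare [Float] and cannot see the capture format, it can help to check the requirements above before calling infer. This is a minimal sketch; audioProblems is a hypothetical helper, not an SDK function, and the caller has to supply the sample rate and channel count from its own capture code.

```swift
import Foundation

/// Hypothetical pre-flight check, not part of the SDK.
/// Returns human-readable problems; an empty array means the buffer looks usable.
func audioProblems(samples: [Float], sampleRate: Double, channels: Int) -> [String] {
    var problems: [String] = []
    if sampleRate != 16_000 {
        problems.append("sample rate is \(sampleRate) Hz, expected 16000 Hz")
    }
    if channels != 1 {
        problems.append("\(channels) channels, expected mono")
    }
    if samples.isEmpty {
        problems.append("buffer is empty")
    }
    if samples.contains(where: { !$0.isFinite }) {
        problems.append("buffer contains NaN or infinite samples")
    }
    // A peak above 1.0 usually means integer PCM was not normalized.
    if let peak = samples.lazy.filter({ $0.isFinite }).map({ abs($0) }).max(), peak > 1.0 {
        problems.append("peak amplitude \(peak) exceeds 1.0")
    }
    return problems
}

print(audioProblems(samples: [1200, -3], sampleRate: 44_100, channels: 2))
```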

Singleton API

import TheStageSDK

let ai = TheStageAI.shared
try await ai.initialize(apiToken: "th_…")

try await ai.start_model(
    model_name: "stt",
    engines_path: "TheStageAI/thewhisper-large-v3-turbo"
)

let json = try ai.infer(
    model_name: "stt",
    input_json: [
        "audio": audio_samples,
        "language": "en",
        "return_tokens": true
    ]
)
let text = json[0]["transcription"] as! String

Response Keys

Key	Type	Description
`transcription`	`String`	The transcribed text.
`token_count`	`Int`	Total decoded tokens.
`decode_seconds`	`Double`	Decoder wall-clock time, in seconds.
`tokens`	`[Int]`	Token IDs (only present when `return_tokens == true`).
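
Force-casting individual keys crashes if a key is missing, so a small typed wrapper can be safer. The sketch below assumes each element of the response is a [String: Any] dictionary holding the keys listed above; Transcription is a hypothetical type, not something the SDK provides.

```swift
import Foundation

/// Hypothetical typed view over one element of the singleton response.
struct Transcription {
    let text: String
    let tokenCount: Int
    let decodeSeconds: Double
    let tokens: [Int]?   // nil unless return_tokens was true

    /// Returns nil when a required key is missing or has an unexpected type.
    init?(_ entry: [String: Any]) {
        guard let text = entry["transcription"] as? String,
              let tokenCount = entry["token_count"] as? Int,
              let decodeSeconds = entry["decode_seconds"] as? Double else {
            return nil
        }
        self.text = text
        self.tokenCount = tokenCount
        self.decodeSeconds = decodeSeconds
        self.tokens = entry["tokens"] as? [Int]
    }
}

// Stand-in for one response element; the values are made up for illustration.
let sample: [String: Any] = [
    "transcription": "hello world",
    "token_count": 2,
    "decode_seconds": 0.08,
]
if let parsed = Transcription(sample) {
    print(parsed.text, parsed.tokenCount)
}
```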

Usage Guides

Transcribing a recorded audio file

The most common use case: load a pre-recorded audio file and get a text transcript. The key requirement is that your audio must be 16 kHz mono Float32 with samples in [-1.0, 1.0]. The pipeline does not auto-resample, so convert beforehand.

Swift:

import TheStageSDK
import AVFoundation

let ai = TheStageAI.shared
try await ai.initialize(apiToken: "th_…")

let stt = try await WhisperPipeline(
    engines_path: "TheStageAI/thewhisper-large-v3-turbo"
)

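// Read the file into a buffer in its native processing format
let url = Bundle.main.url(forResource: "recording", withExtension: "wav")!
let file = try AVAudioFile(forReading: url)
let source = AVAudioPCMBuffer(pcmFormat: file.processingFormat,
                              frameCapacity: AVAudioFrameCount(file.length))!
try file.read(into: source)

// Convert to 16 kHz mono Float32. The block-based convert call is used
// because convert(to:from:) cannot change the sample rate.
let format = AVAudioFormat(commonFormat: .pcmFormatFloat32,
                           sampleRate: 16000, channels: 1, interleaved: false)!
let converter = AVAudioConverter(from: file.processingFormat, to: format)!
let capacity = AVAudioFrameCount(
    Double(file.length) * 16000.0 / file.processingFormat.sampleRate
) + 1
let buffer = AVAudioPCMBuffer(pcmFormat: format, frameCapacity: capacity)!
var delivered = false
var conversionError: NSError?
_ = converter.convert(to: buffer, error: &conversionError) { _, status in
    // Hand the whole source buffer over once, then signal end of stream
    if delivered { status.pointee = .endOfStream; return nil }
    delivered = true
    status.pointee = .haveData
    return source
}
if let error = conversionError { throw error }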
let samples = Array(UnsafeBufferPointer(
    start: buffer.floatChannelData![0],
    count: Int(buffer.frameLength)
))

let result = stt.infer(audio: samples, language: "en")
print(result.text)

Flutter:

import 'package:thestage_apple_sdk/thestage_apple_sdk.dart';
import 'dart:typed_data';

await TheStageFlutterSDK.initialize(api_token: 'th_…');

await TheStageFlutterSDK.start_model(
  model_name: 'stt',
  engines_path: 'TheStageAI/thewhisper-large-v3-turbo',
);

// audio_samples must be Float32List at 16 kHz mono, values in [-1.0, 1.0]
final result = await TheStageFlutterSDK.infer(
  model_name: 'stt',
  input_json: {
    'audio': audio_samples,
    'language': 'en',
  },
);
print(result[0]['transcription']);

Attention

Always call initialize(apiToken:) before constructing any pipeline. Forgetting this is the most common source of “model loading fails” errors.

Live microphone transcription

For real-time transcription, capture audio from the microphone, accumulate samples, and periodically send them to the pipeline. The critical requirement is that the samples reaching infer are 16 kHz mono Float32, whatever rate the audio hardware captures at.

Swift:

import TheStageSDK
import AVFoundation

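let audioEngine = AVAudioEngine()
let inputNode = audioEngine.inputNode
// Tap in the hardware format; a tap cannot change the input sample rate
let hardwareFormat = inputNode.outputFormat(forBus: 0)
let recordingFormat = AVAudioFormat(commonFormat: .pcmFormatFloat32,
                                    sampleRate: 16000,
                                    channels: 1,
                                    interleaved: false)!
let converter = AVAudioConverter(from: hardwareFormat, to: recordingFormat)!
let ratio = recordingFormat.sampleRate / hardwareFormat.sampleRate

var accumulated: [Float] = []

inputNode.installTap(onBus: 0, bufferSize: 4096, format: hardwareFormat) {
    buffer, _ in
    // Convert each hardware buffer to 16 kHz mono Float32
    let capacity = AVAudioFrameCount(Double(buffer.frameLength) * ratio) + 16
    guard let converted = AVAudioPCMBuffer(pcmFormat: recordingFormat,
                                           frameCapacity: capacity) else { return }
    var consumed = false
    let status = converter.convert(to: converted, error: nil) { _, inputStatus in
        // Supply this buffer once, then report that no more data is ready yet
        if consumed { inputStatus.pointee = .noDataNow; return nil }
        consumed = true
        inputStatus.pointee = .haveData
        return buffer
    }
    guard status != .error, let ptr = converted.floatChannelData?[0] else { return }
    let samples = UnsafeBufferPointer(start: ptr,
                                      count: Int(converted.frameLength))
    accumulated.append(contentsOf: samples)
}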

audioEngine.prepare()
try audioEngine.start()

// After the user stops speaking, stop capture and transcribe the accumulated audio.
// `stt` is a WhisperPipeline constructed as in the previous guide.
audioEngine.stop()
inputNode.removeTap(onBus: 0)

let result = stt.infer(audio: accumulated, language: "en")
print(result.text)

Note

If your audio session runs at 44.1 kHz or 48 kHz, you must resample to 16 kHz before passing to infer. The pipeline does not resample internally.
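
The tap callback and the code that calls infer usually run on different threads, so the shared sample buffer needs protection if you transcribe periodically while capture is still running. The sketch below shows one way to do that; SampleAccumulator is a hypothetical helper, not part of the SDK, and it assumes the samples are already 16 kHz mono.

```swift
import Foundation

/// Hypothetical helper, not part of the SDK: collects samples from the
/// audio tap thread and hands them out in one piece on another thread.
final class SampleAccumulator {
    private let lock = NSLock()
    private var samples: [Float] = []

    /// Called from the tap callback with already-converted samples.
    func append(_ chunk: [Float]) {
        lock.lock(); defer { lock.unlock() }
        samples.append(contentsOf: chunk)
    }

    /// Returns everything captured so far and resets the buffer.
    func drain() -> [Float] {
        lock.lock(); defer { lock.unlock() }
        let captured = samples
        samples.removeAll(keepingCapacity: true)
        return captured
    }

    /// Seconds of audio currently buffered, assuming 16 kHz mono.
    var bufferedSeconds: Double {
        lock.lock(); defer { lock.unlock() }
        return Double(samples.count) / 16_000
    }
}

let accumulator = SampleAccumulator()
accumulator.append([Float](repeating: 0, count: 16_000))
print(accumulator.bufferedSeconds)
```

A caller could poll bufferedSeconds and pass drain() to infer once enough audio has arrived.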

Transcribing long audio (> 10 seconds)

You do not need to split long recordings manually. The pipeline automatically divides audio into 10-second windows (the bundle’s chunk_seconds) and stitches the transcripts together.

For better accuracy at window boundaries, set overlap_seconds so consecutive windows share a few seconds of audio. This reduces the chance that a word straddling a boundary is cut off.

Swift:

let stt = try await WhisperPipeline(
    engines_path: "TheStageAI/thewhisper-large-v3-turbo",
    overlap_seconds: 2   // 2 seconds of overlap between windows
)

// Pass the entire recording — the pipeline handles chunking
let result = stt.infer(audio: full_recording_samples, language: "en")
print(result.text)

Flutter:

await TheStageFlutterSDK.start_model(
  model_name: 'stt',
  engines_path: 'TheStageAI/thewhisper-large-v3-turbo',
  config: {'overlap_seconds': 2},
);

final result = await TheStageFlutterSDK.infer(
  model_name: 'stt',
  input_json: {'audio': full_recording_samples, 'language': 'en'},
);

Note

Each 10-second window is processed sequentially, so a 60-second recording takes roughly 6× the time of a single window. If latency matters, consider pre-segmenting with VAD and transcribing only the speech portions.
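
To estimate how the number of windows grows with recording length and overlap, the arithmetic can be sketched as follows. This assumes fixed-length windows whose start times advance by chunk minus overlap seconds and ignores any windows the VAD pre-pass skips; the SDK's exact windowing is not documented here, and windowCount is a hypothetical helper.

```swift
import Foundation

/// Assumed windowing model (not confirmed by the SDK): fixed-length windows
/// whose start times advance by `chunk - overlap` seconds.
func windowCount(duration: Double, chunk: Double = 10, overlap: Double = 0) -> Int {
    precondition(chunk > 0 && overlap >= 0 && overlap < chunk,
                 "overlap must be non-negative and smaller than the window")
    // Anything that fits in one window needs a single pass.
    guard duration > chunk else { return duration > 0 ? 1 : 0 }
    let stride = chunk - overlap
    return 1 + Int(((duration - chunk) / stride).rounded(.up))
}

print(windowCount(duration: 60))               // 6
print(windowCount(duration: 60, overlap: 2))   // 8
print(windowCount(duration: 120))              // 12
```

Under this model, a 2-second overlap turns a 60-second recording from 6 windows into 8, which is the processing cost paid for the shared boundary context.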

Disabling internal VAD for pre-segmented audio

WhisperPipeline includes a Silero-VAD pre-pass that detects speech segments before transcribing. When your audio is already segmented — for example, from TheStageVoiceAgent or your own VAD — disable the internal VAD to skip redundant processing.

Swift — constructor parameter:

let stt = try await WhisperPipeline(
    engines_path: "TheStageAI/thewhisper-large-v3-turbo",
    use_internal_vad: false
)

Swift — singleton API:

try await ai.start_model(
    model_name: "stt",
    engines_path: "TheStageAI/thewhisper-large-v3-turbo",
    config: ["use_internal_vad": false]
)

Flutter:

await TheStageFlutterSDK.start_model(
  model_name: 'stt',
  engines_path: 'TheStageAI/thewhisper-large-v3-turbo',
  config: {'use_internal_vad': false},
);

Attention

With internal VAD disabled, the pipeline transcribes everything — including silence. If your audio contains long silent stretches, you may get hallucinated text. Only disable VAD when you are certain the input contains speech.
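
If you disable the internal VAD, a cheap energy check can catch obviously silent segments before they reach the decoder. The sketch below is a crude gate, not a replacement for a real VAD; looksSilent and its 0.01 threshold (about -40 dBFS) are assumptions to tune on your own recordings, not SDK constants.

```swift
import Foundation

/// Root-mean-square level of a buffer; 0 for an empty buffer.
func rms(_ samples: [Float]) -> Float {
    guard !samples.isEmpty else { return 0 }
    let sumOfSquares = samples.reduce(Float(0)) { $0 + $1 * $1 }
    return (sumOfSquares / Float(samples.count)).squareRoot()
}

/// Hypothetical gate: the default threshold is an assumption, not an SDK value.
func looksSilent(_ samples: [Float], threshold: Float = 0.01) -> Bool {
    rms(samples) < threshold
}

let quiet = [Float](repeating: 0.001, count: 1_600)
let loud: [Float] = [0.5, -0.5, 0.5, -0.5]
print(looksSilent(quiet), looksSilent(loud))
```

Segments for which looksSilent returns true can be dropped instead of being passed to infer.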

Streaming live transcription (push-based ASR)

For real-time use cases like voice assistants and live captions, batch infer is the wrong tool — you want live partial transcripts that grow as the user speaks, then a final authoritative result when the turn ends.

WhisperPipeline.open_streamer() returns an ASRStreamer that mirrors the TTS streamer’s push-based shape: send audio with send(_:), read stable partial transcripts from partials, and call finish() for the authoritative end-of-turn transcript. A single serial worker re-decodes the growing buffer and commits stable text via LocalAgreement, so partials never flicker or retract.

Swift:

let stt = try await WhisperPipeline(
    engines_path: "TheStageAI/thewhisper-large-v3-turbo",
    revision: "main"
)

let streamer = stt.open_streamer(language: "en", partial_interval_ms: 600)

let captions = Task {
    for await text in streamer.partials {
        print("partial: \(text)")   // committed-so-far, grows monotonically
    }
}

// `microphone_frames` and `vad_detected_pause` are placeholders for your own capture and VAD code
for await frame in microphone_frames {          // [Float] @ 16 kHz mono
    streamer.send(frame)
    if vad_detected_pause { streamer.flush() }   // finalize segment + trim
}

let final_text = await streamer.finish()         // closes `partials`
await captions.value
print("final: \(final_text)")

Key behaviors:

partials is the cosmetic/live caption. finish() is the trusted result and always covers the complete audio (including the last word).
flush() at VAD pauses keeps per-pass latency flat on long turns: it commits settled text and re-decodes only the uncommitted tail afterward.
cancel() aborts the turn without a final decode — use it for barge-in.
partial_interval_ms (default 600) bounds how often partial passes run.
Convert mic input to 16 kHz mono Float32 first — input is not resampled.
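
To illustrate why committed partials grow without retracting, here is a minimal sketch of a LocalAgreement-style policy: a word is committed once two consecutive hypotheses agree on it. The SDK's internal implementation is not shown on this page and may differ; the LocalAgreement type below, its word-level granularity, and the two-hypothesis rule are all assumptions.

```swift
import Foundation

/// Sketch of a LocalAgreement-style policy over whitespace-separated words.
/// Illustrative only; the SDK's internals may work differently.
struct LocalAgreement {
    private(set) var committed: [String] = []
    private var previous: [String] = []

    /// Feed the latest full hypothesis; returns the committed text so far.
    mutating func update(with hypothesis: String) -> String {
        let words = hypothesis.split(separator: " ").map { String($0) }
        // Count the leading words on which this and the previous hypothesis agree.
        var agreed = 0
        while agreed < min(words.count, previous.count), words[agreed] == previous[agreed] {
            agreed += 1
        }
        // Never retract: only extend the committed prefix.
        if agreed > committed.count {
            committed.append(contentsOf: words[committed.count..<agreed])
        }
        previous = words
        return committed.joined(separator: " ")
    }
}

var policy = LocalAgreement()
print(policy.update(with: "the quick"))            // nothing committed yet
print(policy.update(with: "the quick brown"))      // the quick
print(policy.update(with: "the quick brown fox"))  // the quick brown
```

The newest word always stays uncommitted until the next pass confirms it, which is why finish() is needed for the last word.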

Attention

Streaming ASR is a Swift-direct API on WhisperPipeline. There is no singleton/JSON or Flutter streaming-ASR entry point. For live speech-to-text on Flutter, use the Voice Agent, which runs the same streaming ASR internally.

Multi-language transcription

Whisper supports many languages. Set the language parameter to the appropriate ISO 639-1 code. The pipeline does not auto-detect the language; if you don’t specify one, it defaults to English (en).

Common language codes:

Language	Code	Language	Code
English	`en`	Japanese	`ja`
French	`fr`	Korean	`ko`
German	`de`	Chinese	`zh`
Spanish	`es`	Arabic	`ar`
Portuguese	`pt`	Hindi	`hi`
Russian	`ru`	Italian	`it`

Swift:

let result = stt.infer(audio: audio_samples, language: "fr")
print(result.text)  // French transcription

Flutter:

final result = await TheStageFlutterSDK.infer(
  model_name: 'stt',
  input_json: {
    'audio': audio_samples,
    'language': 'ja',
  },
);
print(result[0]['transcription']);  // Japanese transcription
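
Apps often hold locale identifiers such as "pt-BR" rather than bare language codes. The sketch below reduces such a tag to its primary subtag and checks it against the table above; whisperLanguage is a hypothetical helper, not an SDK function, and the model may accept codes beyond the twelve listed here.

```swift
import Foundation

/// Codes listed in the table above; the model may accept more.
let documentedLanguages: Set<String> = [
    "en", "fr", "de", "es", "pt", "ru", "ja", "ko", "zh", "ar", "hi", "it",
]

/// Hypothetical helper: maps a tag such as "pt-BR" or "fr_FR" to its primary
/// subtag, or returns nil when that subtag is not in the table.
func whisperLanguage(from tag: String) -> String? {
    let primary = tag.lowercased()
        .split(whereSeparator: { $0 == "-" || $0 == "_" })
        .first
        .map { String($0) }
    guard let code = primary, documentedLanguages.contains(code) else { return nil }
    return code
}

print(whisperLanguage(from: "pt-BR") ?? "unsupported")
```

Returning nil instead of silently falling back to "en" makes an unsupported locale visible to the caller.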

Troubleshooting

Transcription is empty or inaccurate

The most common cause is wrong audio format. The pipeline requires 16 kHz mono Float32 with samples normalized to [-1.0, 1.0]. It does not auto-resample.

Verify your sample rate is exactly 16000 Hz. Audio captured at 44.1 kHz or 48 kHz without resampling will produce garbage.
Verify the audio is mono (single channel). Stereo input will be misinterpreted.
Verify sample values are in [-1.0, 1.0]. Int16 samples (range -32768 to 32767) must be divided by 32768.0 first.
Check that the audio actually contains speech — silent or near-silent input will produce empty results.

// Correct: convert Int16 samples to Float32
let float_samples = int16_samples.map { Float($0) / 32768.0 }
let result = stt.infer(audio: float_samples, language: "en")
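
Stereo input needs a similar fix before inference. The sketch below averages interleaved frames into one channel; downmixToMono is a hypothetical helper, and it assumes the samples are interleaved (L R L R ...) with a length that is a multiple of the channel count.

```swift
import Foundation

/// Averages interleaved frames into a single mono channel.
/// Hypothetical helper; trailing samples that do not fill a frame are dropped.
func downmixToMono(_ interleaved: [Float], channels: Int) -> [Float] {
    precondition(channels > 0, "channels must be positive")
    guard channels > 1 else { return interleaved }
    let frames = interleaved.count / channels
    return (0..<frames).map { (frame: Int) -> Float in
        let start = frame * channels
        let sum = interleaved[start..<(start + channels)].reduce(0, +)
        return sum / Float(channels)
    }
}

let stereo: [Float] = [0.2, 0.4, -1.0, 1.0]
print(downmixToMono(stereo, channels: 2))
```

Averaging keeps the result inside [-1.0, 1.0] as long as the input channels were.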

Transcription drops words at chunk boundaries

When transcribing long audio, the pipeline splits it into 10-second windows. With the default overlap_seconds: 0, words that straddle the boundary between two windows may be cut off or duplicated.

Increase overlap_seconds to give the decoder shared context at each boundary:

let stt = try await WhisperPipeline(
    engines_path: "TheStageAI/thewhisper-large-v3-turbo",
    overlap_seconds: 2
)

A value of 1–2 seconds is usually sufficient. Higher overlap improves boundary accuracy but increases total processing time.

Slow transcription on long audio

Each 10-second window is processed sequentially. With overlap_seconds: 0, a 2-minute recording requires about 12 sequential window passes, which can take noticeable time on less powerful devices.

To speed things up:

Use VAD (use_internal_vad: true, the default) to skip silent segments. If only 30 seconds of a 2-minute file contain speech, only those segments are transcribed.
If you have pre-segmented audio, pass only the speech portions rather than the entire recording.
For real-time use cases, transcribe in shorter increments instead of accumulating minutes of audio before a single infer call.
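
If your own VAD already reports speech regions as timestamps, passing only the speech portions can look like the sketch below. SpeechSegment and speechSlices are hypothetical names, not SDK types, and the buffer is assumed to be 16 kHz mono.

```swift
import Foundation

/// A speech region in seconds, e.g. from your own VAD (hypothetical type).
struct SpeechSegment {
    let start: Double
    let end: Double
}

/// Cuts one sample array per speech region out of a mono buffer.
/// Bounds are clamped to the buffer; empty or inverted segments are skipped.
func speechSlices(_ samples: [Float],
                  segments: [SpeechSegment],
                  sampleRate: Double = 16_000) -> [[Float]] {
    var slices: [[Float]] = []
    for segment in segments {
        let lower = max(0, min(samples.count, Int(segment.start * sampleRate)))
        let upper = max(0, min(samples.count, Int(segment.end * sampleRate)))
        guard lower < upper else { continue }
        slices.append(Array(samples[lower..<upper]))
    }
    return slices
}

let recording = [Float](repeating: 0.1, count: 32_000)   // 2 s of audio
let regions = [SpeechSegment(start: 0.5, end: 1.0), SpeechSegment(start: 1.5, end: 5.0)]
print(speechSlices(recording, segments: regions).map { $0.count })
```

Each slice can then be passed to infer separately, with use_internal_vad set to false since the audio is already segmented.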

Wrong language in output

WhisperPipeline does not auto-detect the spoken language. If you pass French audio without setting language: "fr", the pipeline falls back to the default "en", decodes the audio as English, and produces nonsense.

Always set the language parameter explicitly:

Swift:

let result = stt.infer(audio: french_audio, language: "fr")

Flutter:

final result = await TheStageFlutterSDK.infer(
  model_name: 'stt',
  input_json: {
    'audio': french_audio,
    'language': 'fr',
  },
);

See the language code table in Multi-language transcription above for supported codes.

Model loading fails

Follow the same pattern as other pipelines:

Verify that TheStageAI.shared.initialize(apiToken:) completed successfully before constructing WhisperPipeline.
On first launch the model must be downloaded. Check that the device has a working network connection.
Attach an on_load_progress callback to see which phase is stuck:

let stt = try await WhisperPipeline(
    engines_path: "TheStageAI/thewhisper-large-v3-turbo",
    on_load_progress: { p in
        print("[\(p.model)] \(p.phase) \(Int(p.fraction * 100))%")
    }
)

If progress stalls at .downloading, the network is the bottleneck. If it stalls at .loading, the device may be out of memory. Use prefetch_engines to pre-download the bundle over Wi-Fi before the user needs it.

Load Progress

Swift:

let stt = try await WhisperPipeline(
    engines_path: "TheStageAI/thewhisper-large-v3-turbo",
    on_load_progress: { p in
        print("[\(p.model)] \(p.phase) \(Int(p.fraction * 100))%")
    }
)

Flutter:

TheStageFlutterSDK.on_progress.listen((event) {
  if (event['model_name'] != 'stt') return;
  final phase    = event['phase']    as String?;
  final fraction = event['progress'] as double?;
  print('[stt] $phase ${((fraction ?? 0) * 100).round()}%');
});

await TheStageFlutterSDK.start_model(
  model_name: 'stt',
  engines_path: 'TheStageAI/thewhisper-large-v3-turbo',
);
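
Progress callbacks can fire far more often than a log or progress bar needs. The sketch below emits a line only when the phase or the whole-number percentage changes; ProgressThrottle is a hypothetical helper, and it takes the phase as a plain String because the SDK's phase type is not specified on this page.

```swift
import Foundation

/// Hypothetical helper: reports a phase/percent pair only when it changes,
/// so a fast callback does not flood the log or the UI.
struct ProgressThrottle {
    private var lastPhase: String?
    private var lastPercent = -1

    /// Returns a log line when the phase or whole-number percent changed, else nil.
    mutating func line(phase: String, fraction: Double) -> String? {
        // Clamp to 0...1 and treat NaN or infinity as 0.
        let clamped = fraction.isFinite ? min(max(fraction, 0), 1) : 0
        let percent = Int((clamped * 100).rounded())
        guard phase != lastPhase || percent != lastPercent else { return nil }
        lastPhase = phase
        lastPercent = percent
        return "[stt] \(phase) \(percent)%"
    }
}

var throttle = ProgressThrottle()
for fraction in [0.101, 0.104, 0.2] {
    if let line = throttle.line(phase: "downloading", fraction: fraction) {
        print(line)
    }
}
```

Inside on_load_progress, the same call could wrap the existing print statement.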

Prefetch Engines

let engines_dir = try await ai.prefetch_engines(
    repo_id: "TheStageAI/thewhisper-large-v3-turbo"
)

let stt = try await WhisperPipeline(engines_path: engines_dir)

Cleanup

Swift:

_ = try ai.stop_model(model_name: "stt")

Flutter:

await TheStageFlutterSDK.stop_model(model_name: 'stt');