Transcription (Speech-to-Text)
Overview
On-device speech recognition powered by Whisper. WhisperPipeline handles the full
mel-spectrogram → encoder → decoder chain, with automatic VAD chunking (Silero-VAD
pre-pass) and long-audio stitching so callers can pass arbitrarily long buffers in a
single infer call.
Flutter consumers go through the singleton start_model + infer JSON path; Dart
has no direct WhisperPipeline constructor. Both surfaces share the same on-disk
cache and response shape.
Note
The transcription pipeline currently uses WhisperPipeline. More model families are
planned for future releases.
Supported Models
| Model | HF Repo | Notes |
|---|---|---|
| Whisper Large V3 Turbo | TheStageAI/thewhisper-large-v3-turbo | Auto VAD chunking, 10 s windows |
API Reference
Full Constructor
let stt = try await WhisperPipeline(
    engines_path: "TheStageAI/thewhisper-large-v3-turbo",
    device: "npu",
    devices: nil,
    overlap_seconds: 0,
    use_internal_vad: true,
    revision: "main",
    on_load_progress: nil
)
| Parameter | Type | Description |
|---|---|---|
| engines_path | | HuggingFace repo ID or local path to the compiled engine bundle. |
| device | | Compute backend ("npu" in the constructor above). |
| devices | | Optional multi-device list (overrides device). |
| overlap_seconds | | Overlap between consecutive audio windows, in seconds (default 0). |
| use_internal_vad | | Enable/disable the Silero-VAD pre-pass (default true). |
| revision | | HuggingFace revision / branch ("main" in the constructor above). |
| on_load_progress | | Optional callback fired during download, extraction and loading phases. |
Inputs / Outputs
| Direction | Type | Description |
|---|---|---|
| input | [Float] | 16 kHz mono PCM, samples in [-1.0, 1.0]. |
| input | | Whisper language code, for example "en". |
| input | | Cap per-window decode. |
| input | | Include token IDs in the output. |
| output | | Transcribed text. |
| output | | Total decoded tokens (sum across windows). |
| output | | Decoder wall time. |
| output | | Token IDs (only if token IDs were requested). |
Audio I/O
- 16 kHz mono [Float], samples normalized to [-1.0, 1.0].
- Long buffers are split internally into the bundle’s chunk_seconds windows. The shipping TheStageAI/thewhisper-large-v3-turbo uses 10 s windows.
- Overlap between windows is configurable via the overlap_seconds constructor argument (default 0).
- Mismatched-rate input is not auto-resampled; convert your mic capture to 16 kHz mono Float32 before calling infer.
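The format rules above can be checked before a buffer reaches the pipeline. The following is a minimal sketch, not part of the SDK: AudioCheck and checkAudio are hypothetical helper names, and the check only looks at duration and peak level. It cannot detect a wrong sample rate, because a bare [Float] carries no rate information.

```swift
import Foundation

/// Summary of a buffer that is about to be passed to infer.
/// (Hypothetical helper type, not an SDK type.)
struct AudioCheck {
    let durationSeconds: Double
    let peak: Float
    let isNormalized: Bool
}

/// Inspects a mono Float32 buffer. `sampleRate` defaults to the 16 kHz
/// the pipeline expects.
func checkAudio(_ samples: [Float], sampleRate: Double = 16_000) -> AudioCheck {
    // Largest absolute sample value; 0 for an empty buffer.
    let peak = samples.reduce(Float(0)) { max($0, abs($1)) }
    return AudioCheck(
        durationSeconds: Double(samples.count) / sampleRate,
        peak: peak,
        isNormalized: peak <= 1.0
    )
}
```

A peak far above 1.0 usually means the buffer still holds Int16-scaled values and needs to be normalized first.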
Singleton API
import TheStageSDK
let ai = TheStageAI.shared
try await ai.initialize(apiToken: "th_…")
try await ai.start_model(
    model_name: "stt",
    engines_path: "TheStageAI/thewhisper-large-v3-turbo"
)
let json = try ai.infer(
    model_name: "stt",
    input_json: [
        "audio": audio_samples,
        "language": "en",
        "return_tokens": true
    ]
)
let text = json[0]["transcription"] as! String
Response Keys
| Key | Type | Description |
|---|---|---|
| transcription | | The transcribed text. |
| | | Total decoded tokens. |
| | | Decoder wall time. |
| | | Token IDs (only present when return_tokens is true). |
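The singleton example reads the transcript with a forced cast, which crashes if the response is empty or the key is missing. A more defensive read might look like the sketch below. It assumes the response is an array of dictionaries, as the json[0]["transcription"] subscript implies; the actual return type of infer is not stated here, and transcription(from:) is a hypothetical helper name.

```swift
import Foundation

/// Returns the transcript from a singleton infer response, or nil when the
/// response is empty or the "transcription" key is missing or not a string.
/// Assumes an array-of-dictionaries response shape.
func transcription(from response: [[String: Any]]) -> String? {
    guard let first = response.first else { return nil }
    return first["transcription"] as? String
}
```

Callers can then handle nil explicitly instead of trapping on as!.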
Usage Guides
Transcribing a recorded audio file
The most common use case: load a pre-recorded audio file and get a text
transcript. The key requirement is that your audio must be 16 kHz mono
Float32 with samples in [-1.0, 1.0]. The pipeline does not
auto-resample, so convert beforehand.
Swift:
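import TheStageSDK
import AVFoundation
let ai = TheStageAI.shared
try await ai.initialize(apiToken: "th_…")
let stt = try await WhisperPipeline(
    engines_path: "TheStageAI/thewhisper-large-v3-turbo"
)
// Read the whole file into a buffer in its native processing format
let url = Bundle.main.url(forResource: "recording", withExtension: "wav")!
let file = try AVAudioFile(forReading: url)
let source = AVAudioPCMBuffer(pcmFormat: file.processingFormat,
                              frameCapacity: AVAudioFrameCount(file.length))!
try file.read(into: source)
// Convert to 16 kHz mono Float32. Sample-rate conversion needs the
// block-based convert(to:error:withInputFrom:) API.
let format = AVAudioFormat(commonFormat: .pcmFormatFloat32,
                           sampleRate: 16000, channels: 1, interleaved: false)!
let converter = AVAudioConverter(from: file.processingFormat, to: format)!
let capacity = AVAudioFrameCount(
    Double(file.length) * 16000.0 / file.processingFormat.sampleRate
) + 1
let buffer = AVAudioPCMBuffer(pcmFormat: format, frameCapacity: capacity)!
var supplied = false
var conversion_error: NSError?
let status = converter.convert(to: buffer, error: &conversion_error) { _, input_status in
    // Hand the source buffer over once, then signal end of stream
    if supplied { input_status.pointee = .endOfStream; return nil }
    supplied = true
    input_status.pointee = .haveData
    return source
}
if status == .error, let conversion_error { throw conversion_error }
let samples = Array(UnsafeBufferPointer(
    start: buffer.floatChannelData![0],
    count: Int(buffer.frameLength)
))
let result = stt.infer(audio: samples, language: "en")
print(result.text)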
Flutter:
import 'package:thestage_apple_sdk/thestage_apple_sdk.dart';
import 'dart:typed_data';
await TheStageFlutterSDK.initialize(api_token: 'th_…');
await TheStageFlutterSDK.start_model(
  model_name: 'stt',
  engines_path: 'TheStageAI/thewhisper-large-v3-turbo',
);
// audio_samples must be Float32List at 16 kHz mono, values in [-1.0, 1.0]
final result = await TheStageFlutterSDK.infer(
  model_name: 'stt',
  input_json: {
    'audio': audio_samples,
    'language': 'en',
  },
);
print(result[0]['transcription']);
Attention
Always call initialize(apiToken:) before constructing any pipeline.
Forgetting this is the most common source of “model loading fails” errors.
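Live microphone transcription
For real-time transcription, capture audio from the microphone, accumulate samples, and periodically send them to the pipeline. The critical requirement is that the samples reaching the pipeline are 16 kHz mono Float32. The example taps the microphone in its hardware format and converts each buffer, because a tap whose sample rate differs from the input node's hardware rate raises an exception at runtime.
Swift:
import TheStageSDK
import AVFoundation
let audioEngine = AVAudioEngine()
let inputNode = audioEngine.inputNode
// Tap in the hardware format and convert each buffer to 16 kHz mono Float32
let hardwareFormat = inputNode.outputFormat(forBus: 0)
let recordingFormat = AVAudioFormat(commonFormat: .pcmFormatFloat32,
                                    sampleRate: 16000,
                                    channels: 1,
                                    interleaved: false)!
let converter = AVAudioConverter(from: hardwareFormat, to: recordingFormat)!
let ratio = recordingFormat.sampleRate / hardwareFormat.sampleRate
var accumulated: [Float] = []
let lock = NSLock() // the tap block runs on an audio thread
inputNode.installTap(onBus: 0, bufferSize: 4096, format: hardwareFormat) {
    buffer, _ in
    let capacity = AVAudioFrameCount(Double(buffer.frameLength) * ratio) + 1
    guard let converted = AVAudioPCMBuffer(pcmFormat: recordingFormat,
                                           frameCapacity: capacity) else { return }
    var supplied = false
    var error: NSError?
    let status = converter.convert(to: converted, error: &error) { _, input_status in
        // Hand over this tap buffer once; more audio arrives with the next tap
        if supplied { input_status.pointee = .noDataNow; return nil }
        supplied = true
        input_status.pointee = .haveData
        return buffer
    }
    guard status != .error, let ptr = converted.floatChannelData?[0] else { return }
    let samples = Array(UnsafeBufferPointer(start: ptr,
                                            count: Int(converted.frameLength)))
    lock.lock(); accumulated.append(contentsOf: samples); lock.unlock()
}
audioEngine.prepare()
try audioEngine.start()
// After the user stops speaking, transcribe the accumulated audio
audioEngine.stop()
inputNode.removeTap(onBus: 0)
let result = stt.infer(audio: accumulated, language: "en")
print(result.text)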
Note
If your audio session runs at 44.1 kHz or 48 kHz, you must resample to
16 kHz before passing to infer. The pipeline does not resample
internally.
Transcribing long audio (> 10 seconds)
You do not need to split long recordings manually. The pipeline automatically
divides audio into 10-second windows (the bundle’s chunk_seconds) and
stitches the transcripts together.
For better accuracy at window boundaries, set overlap_seconds so
consecutive windows share a few seconds of audio. This prevents words that
straddle a boundary from being cut off.
Swift:
let stt = try await WhisperPipeline(
    engines_path: "TheStageAI/thewhisper-large-v3-turbo",
    overlap_seconds: 2 // 2 seconds of overlap between windows
)
// Pass the entire recording; the pipeline handles chunking
let result = stt.infer(audio: full_recording_samples, language: "en")
print(result.text)
Flutter:
await TheStageFlutterSDK.start_model(
  model_name: 'stt',
  engines_path: 'TheStageAI/thewhisper-large-v3-turbo',
  config: {'overlap_seconds': 2},
);
final result = await TheStageFlutterSDK.infer(
  model_name: 'stt',
  input_json: {'audio': full_recording_samples, 'language': 'en'},
);
Note
Each 10-second window is processed sequentially, so a 60-second recording takes roughly 6× the time of a single window. If latency matters, consider pre-segmenting with VAD and transcribing only the speech portions.
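To estimate how many sequential passes a recording needs, the window arithmetic can be sketched as follows. This is an illustration under stated assumptions, not the SDK's internal chunker: it assumes each window starts chunk_seconds minus overlap_seconds after the previous one, and it ignores the VAD pre-pass, which may skip silent regions. windowRanges is a hypothetical helper name.

```swift
/// Sample ranges of fixed-length windows over a buffer, where consecutive
/// windows share `overlapSeconds` of audio. The last window may be shorter.
/// Illustrative only; the pipeline's real chunking may differ.
func windowRanges(sampleCount: Int,
                  sampleRate: Int = 16_000,
                  chunkSeconds: Int = 10,
                  overlapSeconds: Int = 0) -> [Range<Int>] {
    precondition(overlapSeconds >= 0 && overlapSeconds < chunkSeconds,
                 "overlap must be smaller than the window")
    let window = chunkSeconds * sampleRate
    let hop = (chunkSeconds - overlapSeconds) * sampleRate
    var ranges: [Range<Int>] = []
    var start = 0
    while start < sampleCount {
        let end = min(start + window, sampleCount)
        ranges.append(start..<end)
        if end == sampleCount { break } // buffer fully covered
        start += hop
    }
    return ranges
}
```

Under these assumptions a 60-second buffer (960,000 samples) yields 6 windows with no overlap and 8 windows with 2 seconds of overlap, which is the extra processing cost that overlap trades for boundary accuracy.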
Disabling internal VAD for pre-segmented audio
WhisperPipeline includes a Silero-VAD pre-pass that detects speech segments
before transcribing. When your audio is already segmented — for example, from
TheStageVoiceAgent or your own VAD — disable the internal VAD to skip
redundant processing.
Swift — constructor parameter:
let stt = try await WhisperPipeline(
    engines_path: "TheStageAI/thewhisper-large-v3-turbo",
    use_internal_vad: false
)
Swift — singleton API:
try await ai.start_model(
    model_name: "stt",
    engines_path: "TheStageAI/thewhisper-large-v3-turbo",
    config: ["use_internal_vad": false]
)
Flutter:
await TheStageFlutterSDK.start_model(
  model_name: 'stt',
  engines_path: 'TheStageAI/thewhisper-large-v3-turbo',
  config: {'use_internal_vad': false},
);
Attention
With internal VAD disabled, the pipeline transcribes everything — including silence. If your audio contains long silent stretches, you may get hallucinated text. Only disable VAD when you are certain the input contains speech.
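When VAD is disabled, a cheap guard against transcribing pure silence is to check the signal level first. The sketch below uses a plain RMS energy gate, which is much cruder than Silero-VAD and is not something the SDK provides. The rms and looksSilent names are hypothetical, and the 0.01 threshold is an illustrative guess that would need tuning per microphone.

```swift
import Foundation

/// Root-mean-square level of a buffer; 0 for an empty buffer.
func rms(_ samples: [Float]) -> Float {
    guard !samples.isEmpty else { return 0 }
    let sumOfSquares = samples.reduce(Float(0)) { $0 + $1 * $1 }
    return (sumOfSquares / Float(samples.count)).squareRoot()
}

/// Crude speech-presence check based on energy alone.
/// The default threshold is an assumption, not a recommended value.
func looksSilent(_ samples: [Float], threshold: Float = 0.01) -> Bool {
    return rms(samples) < threshold
}
```

A segment for which looksSilent returns true can be skipped instead of being sent to infer. Loud non-speech noise will still pass this gate.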
Streaming live transcription (push-based ASR)
For real-time use cases like voice assistants and live captions, batch infer
is the wrong tool — you want live partial transcripts that grow as the user
speaks, then a final authoritative result when the turn ends.
WhisperPipeline.open_streamer() returns an ASRStreamer that mirrors the
TTS streamer’s push-based shape: send audio with send(_:), read stable
partial transcripts from partials, and call finish() for the
authoritative end-of-turn transcript. A single serial worker re-decodes the
growing buffer and commits stable text via LocalAgreement, so partials never
flicker or retract.
Swift:
let stt = try await WhisperPipeline(
    engines_path: "TheStageAI/thewhisper-large-v3-turbo",
    revision: "main"
)
let streamer = stt.open_streamer(language: "en", partial_interval_ms: 600)
let captions = Task {
    for await text in streamer.partials {
        print("partial: \(text)") // committed-so-far, grows monotonically
    }
}
for await frame in microphone_frames { // [Float] @ 16 kHz mono
    streamer.send(frame)
    if vad_detected_pause { streamer.flush() } // finalize segment + trim
}
let final_text = await streamer.finish() // closes `partials`
await captions.value
print("final: \(final_text)")
Key behaviors:
- partials is the cosmetic/live caption. finish() is the trusted result and always covers the complete audio (including the last word).
- flush() at VAD pauses keeps per-pass latency flat on long turns: it commits settled text and re-decodes only the uncommitted tail afterward.
- cancel() aborts the turn without a final decode; use it for barge-in.
- partial_interval_ms (default 600) bounds how often partial passes run.
- Convert mic input to 16 kHz mono Float32 first; input is not resampled.
Attention
Streaming ASR is a Swift-direct API on WhisperPipeline. There is no
singleton/JSON or Flutter streaming-ASR entry point. For live speech-to-text
on Flutter, use the Voice Agent, which runs the same streaming ASR internally.
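The idea behind committing stable text can be illustrated with a small word-level sketch: a word is committed once two consecutive hypotheses for the growing buffer agree on it, and committed text is never withdrawn. This is a simplified illustration of the agreement policy, not the SDK's implementation; the LocalAgreement struct and its update method are hypothetical names, and a real streamer works on decoder output rather than whitespace-split strings.

```swift
/// Tracks successive hypotheses for a growing audio buffer and commits the
/// words on which the two most recent hypotheses agree.
/// Simplified sketch; not the SDK's implementation.
struct LocalAgreement {
    private(set) var committed: [String] = []
    private var previous: [String] = []

    /// Feeds a new full hypothesis; returns the committed text so far.
    mutating func update(with hypothesis: String) -> String {
        let words = hypothesis.split(separator: " ").map(String.init)
        // Length of the common prefix of the previous and current hypotheses.
        var agreed = 0
        while agreed < min(previous.count, words.count),
              previous[agreed] == words[agreed] {
            agreed += 1
        }
        // Only ever extend the committed prefix, so output never retracts.
        if agreed > committed.count {
            committed.append(contentsOf: words[committed.count..<agreed])
        }
        previous = words
        return committed.joined(separator: " ")
    }
}
```

Feeding "the quick", then "the quick brown", then "the quick brown fox" returns "", "the quick", and "the quick brown" in turn. The newest word always stays uncommitted until a later pass confirms it, which is why the final result should come from finish() rather than from the last partial.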
Multi-language transcription
Whisper supports many languages. Set the language parameter to the
appropriate ISO 639-1 code. The model does not auto-detect language — if
you don’t specify one, it defaults to English.
Common language codes:
| Language | Code | Language | Code |
|---|---|---|---|
| English | en | Japanese | ja |
| French | fr | Korean | ko |
| German | de | Chinese | zh |
| Spanish | es | Arabic | ar |
| Portuguese | pt | Hindi | hi |
| Russian | ru | Italian | it |
Swift:
let result = stt.infer(audio: audio_samples, language: "fr")
print(result.text) // French transcription
Flutter:
final result = await TheStageFlutterSDK.infer(
  model_name: 'stt',
  input_json: {
    'audio': audio_samples,
    'language': 'ja',
  },
);
print(result[0]['transcription']); // Japanese transcription
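Because the language must always be passed explicitly, apps often derive it from a locale identifier. The sketch below reduces an identifier such as "fr_FR" to its two-letter code and falls back to English when the code is not one of the twelve languages listed in the table above. The languageCode(for:fallback:) helper and the commonLanguageCodes set are hypothetical, and the set is deliberately limited to that table rather than to everything the model may accept.

```swift
/// ISO 639-1 codes for the twelve languages listed in the table above.
let commonLanguageCodes: Set<String> = [
    "en", "fr", "de", "es", "pt", "ru", "ja", "ko", "zh", "ar", "hi", "it"
]

/// Reduces an identifier such as "fr_FR" or "pt-BR" to its language code,
/// returning `fallback` when the code is not in `commonLanguageCodes`.
func languageCode(for localeIdentifier: String, fallback: String = "en") -> String {
    let code = localeIdentifier
        .split(whereSeparator: { $0 == "_" || $0 == "-" })
        .first
        .map { $0.lowercased() } ?? ""
    return commonLanguageCodes.contains(code) ? code : fallback
}
```

The result can be passed directly as the language argument of infer.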
Troubleshooting
Transcription is empty or inaccurate
The most common cause is wrong audio format. The pipeline requires 16 kHz
mono Float32 with samples normalized to [-1.0, 1.0]. It does not
auto-resample.
- Verify your sample rate is exactly 16000 Hz. Audio captured at 44.1 kHz or 48 kHz without resampling will produce garbage.
- Verify the audio is mono (single channel). Stereo input will be misinterpreted.
- Verify sample values are in [-1.0, 1.0]. Int16 samples (range -32768 to 32767) must be divided by 32768.0 first.
- Check that the audio actually contains speech; silent or near-silent input will produce empty results.
// Correct: convert Int16 samples to Float32
let float_samples = int16_samples.map { Float($0) / 32768.0 }
let result = stt.infer(audio: float_samples, language: "en")
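Stereo input needs a downmix as well as normalization. A minimal sketch for interleaved Int16 stereo, assuming the usual left/right sample order, averages each frame into one mono Float32 sample. monoFloat(fromInterleavedStereo:) is a hypothetical helper name, and the sketch does not resample, so the source must already be at 16 kHz.

```swift
/// Averages interleaved stereo Int16 frames (L, R, L, R, ...) into mono
/// Float32 samples in [-1.0, 1.0). A trailing unpaired sample is ignored.
func monoFloat(fromInterleavedStereo samples: [Int16]) -> [Float] {
    var mono: [Float] = []
    mono.reserveCapacity(samples.count / 2)
    var index = 0
    while index + 1 < samples.count {
        let left = Float(samples[index]) / 32768.0
        let right = Float(samples[index + 1]) / 32768.0
        mono.append((left + right) / 2)
        index += 2
    }
    return mono
}
```

The output has half as many samples as the input and can be passed to infer as the audio argument.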
Transcription drops words at chunk boundaries
When transcribing long audio, the pipeline splits it into 10-second windows.
With the default overlap_seconds: 0, words that straddle the boundary
between two windows may be cut off or duplicated.
Increase overlap_seconds to give the decoder shared context at each
boundary:
let stt = try await WhisperPipeline(
    engines_path: "TheStageAI/thewhisper-large-v3-turbo",
    overlap_seconds: 2
)
A value of 1–2 seconds is usually sufficient. Higher overlap improves boundary accuracy but increases total processing time.
Slow transcription on long audio
Each 10-second window is processed sequentially. A 2-minute recording requires ~12 sequential decoder passes, which can take noticeable time on less powerful devices.
To speed things up:
- Use VAD (use_internal_vad: true, the default) to skip silent segments. If only 30 seconds of a 2-minute file contain speech, only those segments are transcribed.
- If you have pre-segmented audio, pass only the speech portions rather than the entire recording.
- For real-time use cases, transcribe in shorter increments instead of accumulating minutes of audio before a single infer call.
Wrong language in output
Whisper does not auto-detect the spoken language. If you pass French audio
without setting language: "fr", the model will try to decode it as
English and produce nonsense.
Always set the language parameter explicitly:
Swift:
let result = stt.infer(audio: french_audio, language: "fr")
Flutter:
final result = await TheStageFlutterSDK.infer(
  model_name: 'stt',
  input_json: {
    'audio': french_audio,
    'language': 'fr',
  },
);
See the language code table in Multi-language transcription above for supported codes.
Model loading fails
Follow the same pattern as other pipelines:
- Verify that TheStageAI.shared.initialize(apiToken:) completed successfully before constructing WhisperPipeline.
- On first launch the model must be downloaded. Check that the device has a working network connection.
- Attach an on_load_progress callback to see which phase is stuck:
let stt = try await WhisperPipeline(
    engines_path: "TheStageAI/thewhisper-large-v3-turbo",
    on_load_progress: { p in
        print("[\(p.model)] \(p.phase) \(Int(p.fraction * 100))%")
    }
)
If progress stalls at .downloading, the network is the bottleneck.
If it stalls at .loading, the device may be out of memory. Use
prefetch_engines to pre-download the bundle over Wi-Fi before the user
needs it.
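Spotting a stall by watching log output is error-prone, so a small watchdog fed from the progress callback can help. The sketch below is not part of the SDK: StallDetector is a hypothetical type, the phase is passed as a plain string (the real phase type is not described here), and the timeout is whatever the app considers acceptable.

```swift
import Foundation

/// Flags a load as stalled when neither the phase nor the fraction has
/// advanced for `timeout` seconds. Hypothetical helper, not an SDK type.
struct StallDetector {
    let timeout: TimeInterval
    private var lastPhase = ""
    private var lastFraction = -1.0
    private var lastAdvance: Date

    init(timeout: TimeInterval, now: Date = Date()) {
        self.timeout = timeout
        self.lastAdvance = now
    }

    /// Records a progress report; returns true when progress has stalled.
    mutating func report(phase: String, fraction: Double, at now: Date = Date()) -> Bool {
        // A new phase or a larger fraction both count as progress.
        if phase != lastPhase || fraction > lastFraction {
            lastPhase = phase
            lastFraction = fraction
            lastAdvance = now
            return false
        }
        return now.timeIntervalSince(lastAdvance) >= timeout
    }
}
```

Calling report from inside on_load_progress only detects a stall when another callback arrives. If callbacks stop entirely, a timer that compares the current time with the last reported advance is needed as well.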
Load Progress
Swift:
let stt = try await WhisperPipeline(
    engines_path: "TheStageAI/thewhisper-large-v3-turbo",
    on_load_progress: { p in
        print("[\(p.model)] \(p.phase) \(Int(p.fraction * 100))%")
    }
)
Flutter:
TheStageFlutterSDK.on_progress.listen((event) {
  if (event['model_name'] != 'stt') return;
  final phase = event['phase'] as String?;
  // Accept int or double so a whole-number progress value does not throw
  final fraction = (event['progress'] as num?)?.toDouble();
  print('[stt] $phase ${((fraction ?? 0) * 100).toStringAsFixed(0)}%');
});
await TheStageFlutterSDK.start_model(
  model_name: 'stt',
  engines_path: 'TheStageAI/thewhisper-large-v3-turbo',
);
Prefetch Engines
let engines_dir = try await ai.prefetch_engines(
    repo_id: "TheStageAI/thewhisper-large-v3-turbo"
)
let stt = try await WhisperPipeline(engines_path: engines_dir)
Cleanup
Swift:
_ = try ai.stop_model(model_name: "stt")
Flutter:
await TheStageFlutterSDK.stop_model(model_name: 'stt');