Streaming¶
Overview¶
In batch mode, the user waits for the entire audio clip or the complete LLM response to generate before anything is delivered. For short prompts that is fine — for anything longer than a sentence it creates an awkward silence.
Streaming solves the latency problem by delivering results incrementally. TTS audio begins playing while later sentences are still being synthesised, and LLM tokens appear in the UI as they are generated. The perceived wait drops from seconds to milliseconds.
Streaming and batch inference use the same model bundles — no extra setup is
required. Any model you load with start_model already supports both modes.
Attention
The consumer task (the code that reads chunks) must be running before you start producing data. If you kick off generation first and attach a listener later, initial chunks will be lost and you may see no audio at all.
Usage Guides¶
Playing TTS audio as it generates¶
Without streaming, synthesising a paragraph of text means the user stares at a spinner until the entire audio clip is ready — often several seconds. With streaming, the first audio reaches the speaker in tens of milliseconds and playback continues seamlessly while the rest of the text is still being processed.
Swift — minimal consumer
import TheStageSDK
let ai = TheStageAI.shared
try await ai.initialize(apiToken: "th_…")
try await ai.start_model(
    model_name: "tts",
    engines_path: "TheStageAI/neutts-multilingual",
    config: ["voice_id": "dave"],
    revision: "develop"
)
let stream = try ai.infer_stream(
    model_name: "tts",
    input_json: ["text": "A long paragraph of text to speak aloud."]
)
for await chunk in stream {
    guard let audio = chunk.audio else { continue }
    play_audio(audio, sample_rate: chunk.sample_rate ?? 24000)
    if chunk.is_final {
        print("Decode tokens: \(chunk.generated_tokens ?? 0)")
        print("Tok/s: \(chunk.tokens_per_second ?? 0)")
        print("First chunk: \(chunk.time_to_first_token ?? 0)s")
        print("Total wall-clock: \(chunk.total_seconds ?? 0)s")
    }
}
Swift — AudioStreamPlayer (recommended)
For production use, AudioStreamPlayer handles sample buffering, scheduling,
and drain/stop lifecycle so you don’t have to manage raw PCM playback yourself.
import TheStageSDK
let ai = TheStageAI.shared
try await ai.start_model(
    model_name: "tts",
    engines_path: "TheStageAI/neutts-multilingual",
    config: ["voice_id": "paul"]
)
let player = AudioStreamPlayer(
    config: AudioStreamConfig(
        sample_rate: 24000,
        channels: 1,
        buffer_size: 512
    )
)
player.start()
let stream = try ai.infer_stream(
    model_name: "tts",
    input_json: ["text": "Hello! This is a streaming demo with real-time audio."]
)
for await chunk in stream {
    guard let audio = chunk.audio, !audio.isEmpty else { continue }
    player.enqueue(audio)
}
await player.drain()
player.stop()
Flutter
final stream = TheStageFlutterSDK.infer_stream(
  model_name: 'tts',
  input_json: {'text': 'A long paragraph of text to speak aloud.'},
);
final player = TheStageAudioPlayer(sampleRate: 24000);
await player.start();
await for (final chunk in stream) {
  final audio = chunk['audio'] as Float32List?;
  if (audio != null && audio.isNotEmpty) {
    player.enqueue(audio);
  }
  if (chunk['is_final'] == true) break;
}
await player.drain();
await player.stop();
Note
Always call drain() before stop(). Draining waits for the audio
buffer to finish playing; stopping immediately silences the output.
Streaming LLM tokens to a chat UI¶
Chat applications feel sluggish when the assistant’s response appears all at once after a multi-second wait. Streaming LLM tokens lets you display text as it is generated, token by token, giving the user immediate feedback.
Each chunk carries a delta string containing one or more decoded tokens.
Append them to your UI in real time.
import TheStageSDK
let ai = TheStageAI.shared
try await ai.initialize(apiToken: "th_…")
try await ai.start_model(
    model_name: "llm",
    engines_path: "TheStageAI/Qwen3-0.6B"
)
let stream = try ai.infer_stream(
    model_name: "llm",
    input_json: [
        "prompt": "Explain quantum computing in two sentences.",
        "max_new_tokens": 128
    ]
)
var fullResponse = ""
for await chunk in stream {
    if let delta = chunk.delta {
        fullResponse += delta
        updateChatBubble(fullResponse)
    }
    if chunk.is_final {
        print("Generated \(chunk.generated_tokens ?? 0) tokens")
        print("Speed: \(chunk.tokens_per_second ?? 0) tok/s")
    }
}
Streaming ASR (live speech-to-text)¶
The inverse direction: push microphone audio in, read transcripts out in
real time. WhisperPipeline.open_streamer() returns an ASRStreamer that
mirrors the TTS streamer’s shape — send(_:) audio frames, drain stable
partials from partials, and finish() for the authoritative transcript.
A single serial worker re-decodes the growing buffer and commits stable text via LocalAgreement, so live captions never flicker or retract — they only grow monotonically as the user speaks.
Swift:
import TheStageSDK
let ai = TheStageAI.shared
try await ai.initialize(apiToken: "th_…")
let stt = try await WhisperPipeline(
    engines_path: "TheStageAI/thewhisper-large-v3-turbo",
    revision: "main"
)
let streamer = stt.open_streamer(language: "en", partial_interval_ms: 600)
let captions = Task {
    for await text in streamer.partials {
        print("partial: \(text)")
    }
}
for await frame in microphone_frames { // [Float] @ 16 kHz mono
    streamer.send(frame)
    if vad_detected_pause { streamer.flush() } // trim buffer, keep latency flat
}
let final_text = await streamer.finish() // closes `partials`
await captions.value
print("final: \(final_text)")
- partials is the cosmetic/live caption; use it for UI captions.
- finish() is the trusted result that always covers the complete audio.
- flush() at VAD pauses commits settled text and re-decodes only the uncommitted tail, keeping per-pass latency flat on long turns.
- cancel() aborts the turn without a final decode (use for barge-in).
- partial_interval_ms (default 600) bounds how often partial passes run.
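The commit rule described above (text becomes stable once successive decode passes agree on it) can be illustrated with a small standalone sketch. This is not the SDK's implementation: `AgreementCommitter` and its `update(hypothesis:)` method are hypothetical names, and the sketch assumes a simple two-pass agreement on whitespace-separated words.

```swift
// Sketch of a LocalAgreement-style commit policy (illustrative only, not SDK code).
// A word is committed once two consecutive hypotheses agree on it as a prefix.
struct AgreementCommitter {
    private(set) var committed: [String] = []   // words already shown as stable
    private var previous: [String] = []         // hypothesis from the previous pass

    /// Feed the newest full hypothesis; returns the words committed by this pass.
    mutating func update(hypothesis: String) -> [String] {
        let words = hypothesis.split(separator: " ").map(String.init)
        // Length of the longest common prefix of the last two hypotheses.
        var agreed = 0
        while agreed < min(words.count, previous.count), words[agreed] == previous[agreed] {
            agreed += 1
        }
        previous = words
        // Only ever extend the committed text; never retract it.
        guard agreed > committed.count else { return [] }
        let fresh = Array(words[committed.count..<agreed])
        committed += fresh
        return fresh
    }
}

var committer = AgreementCommitter()
_ = committer.update(hypothesis: "the quick brown")            // nothing to compare yet
_ = committer.update(hypothesis: "the quick brow fox")         // commits "the quick"
_ = committer.update(hypothesis: "the quick brown fox jumps")  // no new agreement
_ = committer.update(hypothesis: "the quick brown fox jumps over")
print(committer.committed.joined(separator: " "))  // the quick brown fox jumps
```

Because committed words are only appended, the caption grows monotonically even when the unstable tail of the hypothesis changes between passes.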
Attention
Streaming ASR is a Swift-direct API on WhisperPipeline. There is no
singleton/JSON or Flutter entry point. For live speech-to-text on Flutter,
use the Voice Agent, which runs the same streaming ASR internally.
Piping LLM output directly into TTS¶
This is the real-time voice pattern: the LLM generates a response and TTS speaks it aloud as the text arrives — no intermediate buffering, no waiting for the full response. The user hears the answer within milliseconds of the first token.
The key idea is to open two concurrent tasks: a consumer that reads TTS audio chunks and plays them, and a producer that feeds LLM deltas into the TTS streamer. The consumer must start first.
Swift
import TheStageSDK
let ai = TheStageAI.shared
try await ai.start_model(
    model_name: "llm",
    engines_path: "TheStageAI/Qwen3-0.6B"
)
try await ai.start_model(
    model_name: "tts",
    engines_path: "TheStageAI/neutts-multilingual",
    config: ["voice_id": "dave"]
)
let player = AudioStreamPlayer(sample_rate: 24000)
player.start()
let streamer = try ai.open_tts_streamer(model_name: "tts")
Task {
    for await chunk in streamer.output {
        if let pcm = chunk.audio, !pcm.isEmpty {
            player.enqueue(pcm)
        }
    }
    await player.drain()
    player.stop()
}
let llm_stream = try ai.infer_stream(
    model_name: "llm",
    input_json: [
        "prompt": "Tell me a joke",
        "max_new_tokens": 256
    ]
)
for await chunk in llm_stream {
    if let delta = chunk.delta, !delta.isEmpty {
        streamer.send(delta)
    }
    if chunk.is_final {
        streamer.stop_stream()
    }
}
Flutter
final ttsStream = TheStageFlutterSDK.infer_stream(
  model_name: 'tts',
  input_json: {'text': ''},
  stream_id: 'voice_agent_tts',
);
final player = TheStageAudioPlayer(sampleRate: 24000);
await player.start();
ttsStream.listen((chunk) {
  final audio = chunk['audio'] as Float32List?;
  if (audio != null) player.enqueue(audio);
});
final llmStream = TheStageFlutterSDK.infer_stream(
  model_name: 'llm',
  input_json: {
    'prompt': 'Tell me a joke',
    'max_new_tokens': 256,
  },
);
await for (final chunk in llmStream) {
  if (chunk['kind'] == 'text' && chunk['delta'] != null) {
    await TheStageFlutterSDK.send(
      stream_id: 'voice_agent_tts',
      text: chunk['delta'],
    );
  }
  if (chunk['is_final'] == true) {
    await TheStageFlutterSDK.finish_stream(stream_id: 'voice_agent_tts');
  }
}
await player.drain();
await player.stop();
Attention
In the Flutter variant, you must call finish_stream (not
stop_stream) to signal that all text has been sent. stop_stream
cancels the stream and discards pending audio. finish_stream lets the
TTS engine flush its remaining output.
Tuning time-to-first-audio¶
TTSStreamConfig controls how the TTS codec chunks its output. The defaults
work well for general use, but you can trade off initial chunk length for
faster perceived responsiveness.
The most impactful knob is first_frames_per_chunk. It controls how many
codec frames the streamer collects before emitting the very first audio chunk.
A lower value means the user hears audio sooner, but the initial segment is
shorter. A higher value produces a longer first segment at the cost of more
silence up front.
overlap_frames sets how many frames of audio overlap are used for
crossfade between chunks. Increasing this from the default of 1 smooths
transitions between sentences but adds a small amount of latency per chunk.
Swift:
let streamer = tts.open_streamer(
    config: TTSStreamConfig(
        frames_per_chunk: 25,
        first_frames_per_chunk: 12,
        lookforward: 5,
        lookback: 50,
        overlap_frames: 1
    )
)
let s2 = try ai.open_tts_streamer(
    model_name: "tts",
    config: TTSStreamConfig(first_frames_per_chunk: 12)
)
Flutter:
final stream = TheStageFlutterSDK.infer_stream(
  model_name: 'tts',
  input_json: {
    'text': 'Hello, world.',
    'stream_config': {
      'first_frames_per_chunk': 12,
      'frames_per_chunk': 25,
      'lookforward': 5,
      'lookback': 50,
      'overlap_frames': 1,
    },
  },
);
Note
Lower first_frames_per_chunk reduces time-to-first-audio at the cost of
a slightly shorter initial audio segment. The default values provide a good
balance for most use cases.
| Goal | Change | Trade-off |
|---|---|---|
| Faster first audio | Lower first_frames_per_chunk | Shorter initial audio segment |
| Smoother sentence joins | Increase overlap_frames | Slightly higher per-chunk latency |
| Larger audio chunks | Increase frames_per_chunk | Higher latency but fewer callbacks |
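To make the effect of first_frames_per_chunk and frames_per_chunk concrete, here is a minimal standalone sketch of a chunker that emits a smaller first chunk and larger later ones. `FrameChunker` and its methods are hypothetical and not part of the SDK; the sketch models only the two thresholds and ignores lookforward, lookback and overlap.

```swift
// Illustrative frame chunker (not SDK code): the first chunk uses a smaller
// threshold than later chunks, so audio can start sooner.
struct FrameChunker {
    let firstFramesPerChunk: Int
    let framesPerChunk: Int
    private var pending: [Int] = []
    private var emittedFirst = false

    init(firstFramesPerChunk: Int, framesPerChunk: Int) {
        self.firstFramesPerChunk = firstFramesPerChunk
        self.framesPerChunk = framesPerChunk
    }

    /// Adds one codec frame; returns a chunk once the current threshold is reached.
    mutating func push(_ frame: Int) -> [Int]? {
        pending.append(frame)
        let threshold = emittedFirst ? framesPerChunk : firstFramesPerChunk
        guard pending.count >= threshold else { return nil }
        emittedFirst = true
        let chunk = pending
        pending.removeAll()
        return chunk
    }

    /// Flushes whatever is left when the input ends.
    mutating func finish() -> [Int]? {
        guard !pending.isEmpty else { return nil }
        let chunk = pending
        pending.removeAll()
        return chunk
    }
}

var chunker = FrameChunker(firstFramesPerChunk: 12, framesPerChunk: 25)
var sizes: [Int] = []
for frame in 0..<40 {
    if let chunk = chunker.push(frame) { sizes.append(chunk.count) }
}
if let rest = chunker.finish() { sizes.append(rest.count) }
print(sizes)  // [12, 25, 3]
```

In this model the trailing 3 frames are only released by finish(), which is the same reason a stream needs an explicit end-of-input signal before its last chunk can be emitted.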
Cancelling a stream mid-generation¶
A user might navigate away, press stop, or barge in while the model is still generating. You should cancel the stream promptly to free compute resources.
Swift — break out of the for await loop or call cancel() explicitly:
// Break from the for-await loop, or call streamer.cancel()
Flutter — call stop_stream with the stream ID:
await TheStageFlutterSDK.stop_stream(stream_id: 'my_stream');
The stream will emit a final event with kind: 'cancelled' and close.
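On the Swift side, one way to stop consuming is to cancel the task that runs the for await loop. The sketch below shows that pattern with a plain `AsyncStream` standing in for an SDK stream; `makeChunkStream` is a hypothetical helper, and how the SDK's own streams react to cancellation is an assumption to verify against the SDK itself.

```swift
import Foundation

// Stand-in for an SDK stream (hypothetical): yields a counter every 10 ms
// until the consumer goes away.
func makeChunkStream() -> AsyncStream<Int> {
    AsyncStream { continuation in
        let producer = Task {
            var index = 0
            while !Task.isCancelled {
                continuation.yield(index)
                index += 1
                try? await Task.sleep(nanoseconds: 10_000_000)
            }
            continuation.finish()
        }
        // Called when iteration is cancelled or the stream finishes:
        // stop producing so no further work is done.
        continuation.onTermination = { _ in producer.cancel() }
    }
}

// Consumer task: counts chunks until the stream ends.
let consumer = Task { () -> Int in
    var received = 0
    for await _ in makeChunkStream() {
        received += 1
    }
    return received
}

// Simulate the user pressing stop after roughly 100 ms.
try? await Task.sleep(nanoseconds: 100_000_000)
consumer.cancel()
let received = await consumer.value
print("received \(received) chunks before cancellation")
```

Cancelling the consumer task ends the loop and triggers `onTermination`, which in turn stops the producer, so generation does not keep running after the user has left.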
API Reference¶
TTSStreamConfig¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| frames_per_chunk | | 25 | Number of codec frames per audio chunk |
| first_frames_per_chunk | | 12 | Frames in the first chunk (lower = faster first audio) |
| lookforward | | 5 | Lookahead frames for overlap-add |
| lookback | | 50 | Lookbehind frames for context |
| overlap_frames | | 1 | Overlap frames for crossfade |
AudioStreamConfig¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| sample_rate | | 24000 | Audio sample rate in Hz |
| channels | | 1 | Number of audio channels |
| buffer_size | | 512 | I/O buffer size (iOS only) |
| category | | | Audio session category (iOS only) |
| | | | Audio session mode (iOS only) |
Flutter API Reference¶
Lifecycle
await TheStageFlutterSDK.initialize(api_token: 'th_…');
await TheStageFlutterSDK.start_model(
  model_name: 'tts',
  engines_path: 'TheStageAI/neutts-multilingual',
  model_type: 'neutts-multilingual',
  revision: 'develop',
  config: {'voice_id': 'dave'},
);
await TheStageFlutterSDK.stop_model(model_name: 'tts');
Streaming
final stream = TheStageFlutterSDK.infer_stream(
  model_name: 'tts',
  input_json: {'text': 'Hello world.'},
);
await TheStageFlutterSDK.send(stream_id: id, text: 'more text');
await TheStageFlutterSDK.finish_stream(stream_id: id);
await TheStageFlutterSDK.stop_stream(stream_id: id);
Audio Player
final player = TheStageAudioPlayer(sampleRate: 24000);
await player.start();
player.enqueue(audioData);
await player.pause();
await player.resume();
await player.drain();
await player.stop();
TTS Chunk Format¶
Each TTS chunk delivered by the stream contains the following fields.
| Field | Type | Description |
|---|---|---|
| audio | | PCM audio samples, 24 kHz mono |
| sample_rate | | Always 24000 |
| | | Sequential chunk number |
| is_final | | |
| time_to_first_token | | Seconds to first audio chunk |
| generated_tokens | | Decode step count |
| tokens_per_second | | Decode speed |
| total_seconds | | Wall-clock time (final only) |
LLM Chunk Format¶
Each LLM chunk carries one or more decoded tokens and timing metadata on the final chunk.
| Field | Type | Description |
|---|---|---|
| delta | | Decoded token text |
| | | Position in sequence |
| is_final | | |
| time_to_first_token | | Seconds to first token (final only) |
| | | Input token count (final only) |
| generated_tokens | | Output token count (final only) |
| tokens_per_second | | Generation speed (final only) |
| total_seconds | | Wall-clock time (final only) |
Architecture¶
TTSStreamer — Single Token Stream with Sentinels:
Producer Task               Consumer Task
─────────────               ─────────────
sentence_stream             token_stream
     │                           │
     ▼                           ▼
┌──────────┐               ┌────────────┐
│preprocess│               │is sentinel?│
└────┬─────┘               └──┬──────┬──┘
     │                        no    yes
     ▼                        │      │
┌──────────────┐              ▼      ▼
│decoder       │         accumulate  flush
│ .prefill()   │         codes       + fade-out
│ .decode_step │           │         + reset
│ (loop)       │           ▼
└────┬─────────┘        ┌─────────────┐
     │                  │codec.infer  │
     ▼                  │(autorelease)│
yield tokens            └──────┬──────┘
     │                         │
     ▼                         ▼
yield sentinel            OLA + emit
The producer runs ahead — while the consumer decodes audio for the current sentence, the producer is already preprocessing and generating tokens for the next one.
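The consumer side of the diagram (accumulate codes until a sentinel arrives, then flush) can be sketched in a few lines of standalone Swift. `StreamToken` and `groupBySentence` are hypothetical names used only for illustration; the real streamer also runs the codec, applies fade-out and overlap-add, and works asynchronously.

```swift
// Illustrative single-stream protocol (not SDK code): codec codes interleaved
// with sentinels that mark sentence boundaries.
enum StreamToken: Equatable {
    case code(Int)   // one codec code produced by the decoder
    case sentinel    // end of the current sentence
}

/// Groups codes into per-sentence batches, flushing whenever a sentinel arrives.
/// Codes after the last sentinel are not returned, because no flush was requested.
func groupBySentence(_ tokens: [StreamToken]) -> [[Int]] {
    var sentences: [[Int]] = []
    var current: [Int] = []
    for token in tokens {
        switch token {
        case .code(let value):
            current.append(value)        // accumulate codes
        case .sentinel:
            sentences.append(current)    // flush the finished sentence
            current.removeAll()          // reset for the next one
        }
    }
    return sentences
}

let batches = groupBySentence([.code(1), .code(2), .sentinel, .code(3), .sentinel])
print(batches)  // [[1, 2], [3]]
```

Carrying the sentinels in the same stream as the codes keeps ordering trivial: the consumer never needs a second channel to learn where a sentence ends.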
Troubleshooting¶
No audio plays during streaming¶
The most common cause is attaching the consumer after generation has already
started. Because the stream delivers chunks in real time, any chunks emitted
before your for await loop (Swift) or listen callback (Flutter) begins
are lost.
Always start the consumer task first, then trigger generation:
let player = AudioStreamPlayer(sample_rate: 24000)
player.start()
let streamer = try ai.open_tts_streamer(model_name: "tts")
// Consumer runs FIRST
Task {
    for await chunk in streamer.output {
        if let pcm = chunk.audio, !pcm.isEmpty {
            player.enqueue(pcm)
        }
    }
}
// Producer starts AFTER
streamer.send("Hello, world!")
streamer.stop_stream()
If you are on iOS, also verify that Info.plist includes the audio
background mode and that the AudioStreamConfig category is set to
.playback.
Choppy or glitchy audio between sentences¶
Audio glitches at sentence boundaries are usually caused by missing crossfade.
The TTS streamer uses overlap-add (OLA) to blend the tail of one chunk with the
head of the next. If overlap_frames is set to 0 you will hear clicks or
pops at every join.
Increase overlap_frames to smooth the transitions:
let streamer = try ai.open_tts_streamer(
    model_name: "tts",
    config: TTSStreamConfig(overlap_frames: 2)
)
The default value of 1 is sufficient for most voices. Only increase it if you hear artefacts with a specific voice or speaking style.
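To show why an overlap of zero produces clicks, here is a minimal sketch of a linear crossfade between two PCM chunks. It is an illustration of the general overlap-add idea, not the SDK's implementation: `crossfade` is a hypothetical helper, and it measures the overlap in samples rather than codec frames.

```swift
/// Joins two PCM chunks with a linear crossfade over `overlap` samples.
/// Illustrative only: operates on raw samples, not codec frames.
func crossfade(_ head: [Float], _ tail: [Float], overlap: Int) -> [Float] {
    let n = min(overlap, head.count, tail.count)
    // With no overlap the chunks are simply butted together, so any level
    // difference at the boundary becomes an audible step.
    guard n > 0 else { return head + tail }
    var out = Array(head.dropLast(n))
    for i in 0..<n {
        let w = Float(i + 1) / Float(n + 1)   // fade-in weight of the new chunk
        out.append(head[head.count - n + i] * (1 - w) + tail[i] * w)
    }
    out.append(contentsOf: tail.dropFirst(n))
    return out
}

let joined = crossfade([1, 1, 1, 1], [0, 0, 0, 0], overlap: 2)
print(joined.count)  // 6
```

With an overlap of 2 the boundary passes through two intermediate values instead of jumping straight from 1 to 0. A larger overlap gives a gentler ramp, at the price of holding back more samples before each chunk can be emitted.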
High latency before first audio chunk¶
Three things to check:
- first_frames_per_chunk is too high. The streamer waits until it has collected this many codec frames before emitting the first chunk. Lowering it (e.g. from 12 to 6) cuts the wait at the cost of a shorter initial audio segment.
- First-run model download. The first call to start_model downloads model weights from Hugging Face. Subsequent runs use the local cache. If you see high latency only on the first launch, this is expected; preload the model on a splash screen.
- Network round-trip (remote LLM). If you are piping a cloud LLM into TTS, the time-to-first-audio includes the LLM provider’s network latency. Switching to an on-device LLM eliminates this component entirely.
Stream never emits is_final¶
The TTS stream waits for an explicit end-of-input signal. Without it, the streamer assumes more text is coming and holds the last chunk in its buffer indefinitely.
In Swift, call stop_stream() on the streamer once all text has been
sent:
for await chunk in llm_stream {
    if let delta = chunk.delta, !delta.isEmpty {
        streamer.send(delta)
    }
    if chunk.is_final {
        streamer.stop_stream() // ← signals end of input
    }
}
In Flutter, call finish_stream() (not stop_stream(), which
cancels):
if (chunk['is_final'] == true) {
  await TheStageFlutterSDK.finish_stream(stream_id: 'my_stream');
}