Streaming

Overview

In batch mode, the user waits for the entire audio clip or the complete LLM response to generate before anything is delivered. For short prompts that is fine — for anything longer than a sentence it creates an awkward silence.

Streaming solves the latency problem by delivering results incrementally. TTS audio begins playing while later sentences are still being synthesised, and LLM tokens appear in the UI as they are generated. The perceived wait drops from seconds to milliseconds.

Streaming and batch inference use the same model bundles — no extra setup is required. Any model you load with start_model already supports both modes.

Attention

The consumer task (the code that reads chunks) must be running before you start producing data. If you kick off generation first and attach a listener later, initial chunks will be lost and you may see no audio at all.

Usage Guides

Playing TTS audio as it generates

Without streaming, synthesising a paragraph of text means the user stares at a spinner until the entire audio clip is ready — often several seconds. With streaming, the first audio reaches the speaker in tens of milliseconds and playback continues seamlessly while the rest of the text is still being processed.

Swift — minimal consumer

import TheStageSDK

let ai = TheStageAI.shared
try await ai.initialize(apiToken: "th_…")

try await ai.start_model(
    model_name: "tts",
    engines_path: "TheStageAI/neutts-multilingual",
    config: ["voice_id": "dave"],
    revision: "develop"
)

let stream = try ai.infer_stream(
    model_name: "tts",
    input_json: ["text": "A long paragraph of text to speak aloud."]
)

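for await chunk in stream {
    // Play audio when the chunk carries any. The final sentinel chunk may have
    // no audio, so avoid `continue` here or the is_final check below is skipped.
    if let audio = chunk.audio, !audio.isEmpty {
        play_audio(audio, sample_rate: chunk.sample_rate ?? 24000)
    }

    // Timing metadata arrives on the final chunk.
    if chunk.is_final {
        print("Decode tokens: \(chunk.generated_tokens ?? 0)")
        print("Tok/s: \(chunk.tokens_per_second ?? 0)")
        print("First chunk: \(chunk.time_to_first_token ?? 0)s")
        print("Total wall-clock: \(chunk.total_seconds ?? 0)s")
    }
}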

Swift — AudioStreamPlayer (recommended)

For production use, AudioStreamPlayer handles sample buffering, scheduling, and drain/stop lifecycle so you don’t have to manage raw PCM playback yourself.

import TheStageSDK

let ai = TheStageAI.shared
try await ai.start_model(
    model_name: "tts",
    engines_path: "TheStageAI/neutts-multilingual",
    config: ["voice_id": "paul"]
)

let player = AudioStreamPlayer(
    config: AudioStreamConfig(
        sample_rate: 24000,
        channels: 1,
        buffer_size: 512
    )
)
player.start()

let stream = try ai.infer_stream(
    model_name: "tts",
    input_json: ["text": "Hello! This is a streaming demo with real-time audio."]
)

for await chunk in stream {
    guard let audio = chunk.audio, !audio.isEmpty else { continue }
    player.enqueue(audio)
}

await player.drain()
player.stop()

Flutter

final stream = TheStageFlutterSDK.infer_stream(
  model_name: 'tts',
  input_json: {'text': 'A long paragraph of text to speak aloud.'},
);

final player = TheStageAudioPlayer(sampleRate: 24000);
await player.start();

await for (final chunk in stream) {
  final audio = chunk['audio'] as Float32List?;
  if (audio != null && audio.isNotEmpty) {
    player.enqueue(audio);
  }
  if (chunk['is_final'] == true) break;
}

await player.drain();
await player.stop();

Note

Always call drain() before stop(). drain() waits until the audio buffer has finished playing, whereas stop() silences the output immediately, so calling stop() first cuts off the end of the audio.

Streaming LLM tokens to a chat UI

Chat applications feel sluggish when the assistant’s response appears all at once after a multi-second wait. Streaming LLM tokens lets you display text as it is generated, token by token, giving the user immediate feedback.

Each chunk carries a delta string containing one or more decoded tokens. Append them to your UI in real time.

import TheStageSDK

let ai = TheStageAI.shared
try await ai.initialize(apiToken: "th_…")

try await ai.start_model(
    model_name: "llm",
    engines_path: "TheStageAI/Qwen3-0.6B"
)

let stream = try ai.infer_stream(
    model_name: "llm",
    input_json: [
        "prompt": "Explain quantum computing in two sentences.",
        "max_new_tokens": 128
    ]
)

var fullResponse = ""
for await chunk in stream {
    if let delta = chunk.delta {
        fullResponse += delta
        updateChatBubble(fullResponse)
    }
    if chunk.is_final {
        print("Generated \(chunk.generated_tokens ?? 0) tokens")
        print("Speed: \(chunk.tokens_per_second ?? 0) tok/s")
    }
}

Streaming ASR (live speech-to-text)

The inverse direction: push microphone audio in, read transcripts out in real time. WhisperPipeline.open_streamer() returns an ASRStreamer that mirrors the TTS streamer’s shape — send(_:) audio frames, drain stable partials from partials, and finish() for the authoritative transcript.

A single serial worker re-decodes the growing buffer and commits stable text via LocalAgreement, so live captions never flicker or retract — they only grow monotonically as the user speaks.
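The commit rule can be illustrated with a minimal LocalAgreement-2 sketch: a word becomes stable once two consecutive decode passes agree on it, and stable words are never revised. This is an illustration of the general technique only; the LocalAgreement type below is hypothetical, and the SDK's actual implementation (including buffer trimming on flush()) is not shown in this document.

```swift
/// Minimal LocalAgreement-2 sketch (hypothetical type, not the SDK's code):
/// a word is committed once two consecutive decode passes agree on it.
struct LocalAgreement {
    private(set) var committed: [String] = []   // stable words shown to the user
    private var previous: [String] = []         // hypothesis from the last pass

    init() {}

    /// Feed the full hypothesis from the latest decode pass.
    /// Returns the words newly committed by this pass.
    mutating func update(with hypothesis: String) -> [String] {
        let words = hypothesis.split(separator: " ").map(String.init)

        // Length of the common prefix between this pass and the previous one.
        var agreed = 0
        while agreed < min(words.count, previous.count),
              words[agreed] == previous[agreed] {
            agreed += 1
        }
        previous = words

        // Only the agreed words beyond the committed prefix are new.
        guard agreed > committed.count else { return [] }
        let fresh = Array(words[committed.count..<agreed])
        committed.append(contentsOf: fresh)
        return fresh
    }
}

var agreement = LocalAgreement()
let passes = ["hello", "hello world", "hello world how", "hello world how are you"]
for pass in passes {
    _ = agreement.update(with: pass)
    print("partial: \(agreement.committed.joined(separator: " "))")
}
```

Because text is only ever appended to committed, the caption can lag the newest hypothesis by one pass but never has to retract a word.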

Swift:

import TheStageSDK

let ai = TheStageAI.shared
try await ai.initialize(apiToken: "th_…")

let stt = try await WhisperPipeline(
    engines_path: "TheStageAI/thewhisper-large-v3-turbo",
    revision: "main"
)

let streamer = stt.open_streamer(language: "en", partial_interval_ms: 600)

let captions = Task {
    for await text in streamer.partials {
        print("partial: \(text)")
    }
}

for await frame in microphone_frames {          // [Float] @ 16 kHz mono
    streamer.send(frame)
    if vad_detected_pause { streamer.flush() }   // trim buffer, keep latency flat
}

let final_text = await streamer.finish()         // closes `partials`
await captions.value
print("final: \(final_text)")

partials is the live caption feed, intended for on-screen display. finish() returns the trusted result, which always covers the complete audio.
flush() at VAD pauses commits settled text and re-decodes only the uncommitted tail, keeping per-pass latency flat on long turns.
cancel() aborts the turn without a final decode (use for barge-in).
partial_interval_ms (default 600) bounds how often partial passes run.

Attention

Streaming ASR is a Swift-direct API on WhisperPipeline. There is no singleton/JSON or Flutter entry point. For live speech-to-text on Flutter, use the Voice Agent, which runs the same streaming ASR internally.

Piping LLM output directly into TTS

This is the real-time voice pattern: the LLM generates a response and TTS speaks it aloud as the text arrives — no intermediate buffering, no waiting for the full response. The user hears the answer within milliseconds of the first token.

The key idea is to open two concurrent tasks: a consumer that reads TTS audio chunks and plays them, and a producer that feeds LLM deltas into the TTS streamer. The consumer must start first.

Swift

import TheStageSDK

let ai = TheStageAI.shared

try await ai.start_model(
    model_name: "llm",
    engines_path: "TheStageAI/Qwen3-0.6B"
)
try await ai.start_model(
    model_name: "tts",
    engines_path: "TheStageAI/neutts-multilingual",
    config: ["voice_id": "dave"]
)

let player = AudioStreamPlayer(sample_rate: 24000)
player.start()

let streamer = try ai.open_tts_streamer(model_name: "tts")

Task {
    for await chunk in streamer.output {
        if let pcm = chunk.audio, !pcm.isEmpty {
            player.enqueue(pcm)
        }
    }
    await player.drain()
    player.stop()
}

let llm_stream = try ai.infer_stream(
    model_name: "llm",
    input_json: [
        "prompt": "Tell me a joke",
        "max_new_tokens": 256
    ]
)

for await chunk in llm_stream {
    if let delta = chunk.delta, !delta.isEmpty {
        streamer.send(delta)
    }
    if chunk.is_final {
        streamer.stop_stream()
    }
}

Flutter

final ttsStream = TheStageFlutterSDK.infer_stream(
  model_name: 'tts',
  input_json: {'text': ''},
  stream_id: 'voice_agent_tts',
);

final player = TheStageAudioPlayer(sampleRate: 24000);
await player.start();

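// Consumer first: attach the TTS listener before any text is sent.
// Completer requires: import 'dart:async';
final ttsDone = Completer<void>();
ttsStream.listen(
  (chunk) {
    final audio = chunk['audio'] as Float32List?;
    if (audio != null && audio.isNotEmpty) player.enqueue(audio);
    if (chunk['is_final'] == true && !ttsDone.isCompleted) ttsDone.complete();
  },
  onError: (Object e) {
    if (!ttsDone.isCompleted) ttsDone.completeError(e);
  },
  onDone: () {
    if (!ttsDone.isCompleted) ttsDone.complete();
  },
);

final llmStream = TheStageFlutterSDK.infer_stream(
  model_name: 'llm',
  input_json: {
    'prompt': 'Tell me a joke',
    'max_new_tokens': 256,
  },
);

// Producer: forward each LLM delta to the open TTS stream.
await for (final chunk in llmStream) {
  if (chunk['kind'] == 'text' && chunk['delta'] != null) {
    await TheStageFlutterSDK.send(
      stream_id: 'voice_agent_tts',
      text: chunk['delta'],
    );
  }
  if (chunk['is_final'] == true) {
    await TheStageFlutterSDK.finish_stream(stream_id: 'voice_agent_tts');
  }
}

// The LLM usually finishes before TTS does. Wait for the last TTS chunk
// so drain() does not return while audio is still being produced.
await ttsDone.future;
await player.drain();
await player.stop();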

Attention

In the Flutter variant, you must call finish_stream (not stop_stream) to signal that all text has been sent. stop_stream cancels the stream and discards pending audio. finish_stream lets the TTS engine flush its remaining output.

Tuning time-to-first-audio

TTSStreamConfig controls how the TTS codec chunks its output. The defaults work well for general use, but you can trade off initial chunk length for faster perceived responsiveness.

The most impactful knob is first_frames_per_chunk. It controls how many codec frames the streamer collects before emitting the very first audio chunk. A lower value means the user hears audio sooner, but the initial segment is shorter. A higher value produces a longer first segment at the cost of more silence up front.

overlap_frames sets how many frames are crossfaded where one chunk meets the next. Raising it above the default of 1 smooths the transitions but adds a small amount of latency per chunk.

Swift:

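// Pipeline-direct form. `tts` is presumably an already-created TTS pipeline
// instance; its construction is not shown in this guide.
let streamer = tts.open_streamer(
    config: TTSStreamConfig(
        frames_per_chunk: 25,
        first_frames_per_chunk: 12,
        lookforward: 5,
        lookback: 50,
        overlap_frames: 1
    )
)

// Singleton form, overriding a single field of the config.
let s2 = try ai.open_tts_streamer(
    model_name: "tts",
    config: TTSStreamConfig(first_frames_per_chunk: 12)
)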

Flutter:

final stream = TheStageFlutterSDK.infer_stream(
  model_name: 'tts',
  input_json: {
    'text': 'Hello, world.',
    'stream_config': {
      'first_frames_per_chunk': 12,
      'frames_per_chunk': 25,
      'lookforward': 5,
      'lookback': 50,
      'overlap_frames': 1,
    },
  },
);

Note

Lower first_frames_per_chunk reduces time-to-first-audio at the cost of a slightly shorter initial audio segment. The default values provide a good balance for most use cases.

Quick tuning cheat-sheet
Goal	Change	Trade-off
Faster first audio	Lower `first_frames_per_chunk`	Shorter initial audio segment
Smoother sentence joins	Increase `overlap_frames`	Slightly higher per-chunk latency
Larger audio chunks	Increase `frames_per_chunk`	Higher latency but fewer callbacks
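To reason about these trade-offs it can help to convert frame counts into audio duration. The sketch below does that arithmetic. The frame rate of 50 frames per second is an assumption chosen purely for illustration (this guide does not state the codec's frame rate), and chunkDurationMs is a hypothetical helper, not an SDK function.

```swift
/// Milliseconds of audio covered by `frames` codec frames.
/// `framesPerSecond` must come from your model's codec; it is not given in this guide.
func chunkDurationMs(frames: Int, framesPerSecond: Double) -> Double {
    Double(frames) * 1000 / framesPerSecond
}

let assumedFrameRate = 50.0   // hypothetical value, for illustration only

for frames in [6, 12, 25] {
    let ms = chunkDurationMs(frames: frames, framesPerSecond: assumedFrameRate)
    print("\(frames) frames -> \(ms) ms of audio")
}
```

Under that assumed rate, halving first_frames_per_chunk from 12 to 6 halves the length of the first segment, which is the "shorter initial audio segment" trade-off listed in the table.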

Cancelling a stream mid-generation

A user might navigate away, press stop, or barge in while the model is still generating. You should cancel the stream promptly to free compute resources.

Swift — break out of the for await loop or call cancel() explicitly:

// Break from the for-await loop, or call streamer.cancel()

Flutter — call stop_stream with the stream ID:

await TheStageFlutterSDK.stop_stream(stream_id: 'my_stream');

The stream will emit a final event with kind: 'cancelled' and close.
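On the Swift side, cancellation can also be driven by cancelling the task that consumes the stream. The sketch below shows the mechanism with a plain AsyncStream from the Swift standard library rather than the SDK: makeFakeStream is a hypothetical stand-in for a model stream, and whether the SDK's streams react to task cancellation in the same way is an assumption you should verify.

```swift
// Stand-in for a model stream: emits an integer "chunk" roughly every 10 ms.
// `makeFakeStream` is a hypothetical helper, not part of the SDK.
func makeFakeStream() -> AsyncStream<Int> {
    AsyncStream { continuation in
        let producer = Task {
            var index = 0
            while !Task.isCancelled {
                continuation.yield(index)
                index += 1
                try? await Task.sleep(nanoseconds: 10_000_000)
            }
            continuation.finish()
        }
        // Called when iteration is cancelled, so the producer stops
        // generating chunks that nobody will read.
        continuation.onTermination = { _ in producer.cancel() }
    }
}

// Consumer task: counts chunks until the stream ends.
let consumer = Task { () -> Int in
    var received = 0
    for await _ in makeFakeStream() {
        received += 1
    }
    return received
}

try? await Task.sleep(nanoseconds: 100_000_000)   // let a few chunks arrive
consumer.cancel()                                 // e.g. the user pressed stop
let received = await consumer.value
print("consumer stopped after \(received) chunks")
```

Keeping a handle to the consumer task gives UI code (a stop button, a view disappearing) a single place to cancel from, without threading a flag through the loop.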

API Reference

TTSStreamConfig

Parameter	Type	Default	Description
`frames_per_chunk`	`Int`	25	Number of codec frames per audio chunk
`first_frames_per_chunk`	`Int`	12	Frames in the first chunk (lower = faster first audio)
`lookforward`	`Int`	5	Lookahead frames for overlap-add
`lookback`	`Int`	50	Lookbehind frames for context
`overlap_frames`	`Int`	1	Overlap frames for crossfade

AudioStreamConfig

Parameter	Type	Default	Description
`sample_rate`	`Double`	24000	Audio sample rate in Hz
`channels`	`UInt32`	1	Number of audio channels
`buffer_size`	`UInt32`	512	I/O buffer size (iOS only)
`category`	`AVAudioSession.Category`	`.playback`	Audio session category (iOS only)
`mode`	`AVAudioSession.Mode`	`.default`	Audio session mode (iOS only)

Flutter API Reference

Lifecycle

await TheStageFlutterSDK.initialize(api_token: 'th_…');

await TheStageFlutterSDK.start_model(
  model_name: 'tts',
  engines_path: 'TheStageAI/neutts-multilingual',
  model_type: 'neutts-multilingual',
  revision: 'develop',
  config: {'voice_id': 'dave'},
);

await TheStageFlutterSDK.stop_model(model_name: 'tts');

Streaming

final stream = TheStageFlutterSDK.infer_stream(
  model_name: 'tts',
  input_json: {'text': 'Hello world.'},
);

await TheStageFlutterSDK.send(stream_id: id, text: 'more text');
await TheStageFlutterSDK.finish_stream(stream_id: id);
await TheStageFlutterSDK.stop_stream(stream_id: id);

Audio Player

final player = TheStageAudioPlayer(sampleRate: 24000);
await player.start();
player.enqueue(audioData);
await player.pause();
await player.resume();
await player.drain();
await player.stop();

TTS Chunk Format

Each TTS chunk delivered by the stream contains the following fields.

Field	Type	Description
`audio`	`[Float]` / `Float32List`	PCM audio samples, 24 kHz mono
`sample_rate`	`Int`	Always 24000
`index`	`Int`	Sequential chunk number
`is_final`	`Bool`	`true` on the sentinel last chunk
`time_to_first_token`	`Double?`	Seconds to first audio chunk
`generated_tokens`	`Int?`	Decode step count
`tokens_per_second`	`Double?`	Decode speed
`total_seconds`	`Double?`	Wall-clock time (final only)
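One use of these fields is to estimate how much faster than real time synthesis ran: sum the audio duration of every chunk (samples divided by sample rate) and divide by total_seconds from the final chunk. The sketch below uses a hypothetical TTSChunkInfo struct that mirrors only the fields it needs. It is not the SDK's chunk type, whose name this document does not give.

```swift
/// Hypothetical stand-in mirroring a few fields of the TTS chunk table.
struct TTSChunkInfo {
    let sampleCount: Int        // number of PCM samples in `audio`
    let sampleRate: Int         // `sample_rate`
    let isFinal: Bool           // `is_final`
    let totalSeconds: Double?   // `total_seconds`, present on the final chunk only
}

/// Seconds of audio produced per second of wall-clock time,
/// or nil if no final chunk with timing information was seen.
func realTimeFactor(of chunks: [TTSChunkInfo]) -> Double? {
    var audioSeconds = 0.0
    for chunk in chunks {
        audioSeconds += Double(chunk.sampleCount) / Double(chunk.sampleRate)
        if chunk.isFinal, let wall = chunk.totalSeconds, wall > 0 {
            return audioSeconds / wall
        }
    }
    return nil
}

let chunks = [
    TTSChunkInfo(sampleCount: 12_000, sampleRate: 24_000, isFinal: false, totalSeconds: nil),
    TTSChunkInfo(sampleCount: 36_000, sampleRate: 24_000, isFinal: false, totalSeconds: nil),
    TTSChunkInfo(sampleCount: 0, sampleRate: 24_000, isFinal: true, totalSeconds: 0.5),
]
if let factor = realTimeFactor(of: chunks) {
    print("real-time factor: \(factor)x")
}
```

A factor above 1 means audio is produced faster than it plays back, which is the condition for gap-free streaming playback.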

LLM Chunk Format

Each LLM chunk carries one or more decoded tokens and timing metadata on the final chunk.

Field	Type	Description
`delta`	`String?`	Decoded token text
`index`	`Int`	Position in sequence
`is_final`	`Bool`	`true` for the sentinel chunk
`time_to_first_token`	`Double?`	Seconds to first token (final only)
`prompt_tokens`	`Int?`	Input token count (final only)
`generated_tokens`	`Int?`	Output token count (final only)
`tokens_per_second`	`Double?`	Generation speed (final only)
`total_seconds`	`Double?`	Wall-clock time (final only)

Architecture

TTSStreamer — Single Token Stream with Sentinels:

Producer Task                          Consumer Task
─────────────                          ─────────────
sentence_stream                        token_stream
    │                                      │
    ▼                                      ▼
┌──────────┐                         ┌────────────┐
│preprocess│                         │is sentinel?│
└────┬─────┘                         └─┬────────┬─┘
     │                                 no      yes
     ▼                                 │        │
┌──────────────┐                       ▼        ▼
│decoder       │                  accumulate   flush
│  .prefill()  │                  codes        + fade-out
│  .decode_step│                       │       + reset
│  (loop)      │                       ▼
└────┬─────────┘                ┌─────────────┐
     │                          │ codec.infer │
     ▼                          │(autorelease)│
yield tokens                    └──────┬──────┘
     │                                 │
     ▼                                 ▼
yield sentinel                    OLA + emit

The producer runs ahead — while the consumer decodes audio for the current sentence, the producer is already preprocessing and generating tokens for the next one.
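The single-stream-with-sentinels idea can be sketched with a plain AsyncStream. The producer yields tokens for each sentence followed by one sentinel; the consumer accumulates tokens into fixed-size chunks and flushes whatever is pending when it sees a sentinel. All names here (StreamItem, framesPerChunk) are illustrative rather than SDK types, and the codec inference, overlap-add, and fade-out steps from the diagram are deliberately left out.

```swift
/// One element of the shared stream: a codec token or an end-of-sentence marker.
/// Illustrative type, not part of the SDK.
enum StreamItem {
    case token(Int)
    case sentinel
}

let framesPerChunk = 3   // tiny value so the example stays short

let (items, continuation) = AsyncStream.makeStream(of: StreamItem.self)

// Consumer (started first): build fixed-size chunks, flush early on a sentinel.
let consumer = Task { () -> [[Int]] in
    var emitted: [[Int]] = []
    var pending: [Int] = []
    for await item in items {
        switch item {
        case .token(let code):
            pending.append(code)
            if pending.count == framesPerChunk {
                emitted.append(pending)     // a real pipeline would decode audio here
                pending.removeAll()
            }
        case .sentinel:
            if !pending.isEmpty {
                emitted.append(pending)     // flush the partial chunk at sentence end
                pending.removeAll()
            }
        }
    }
    return emitted
}

// Producer: tokens for each sentence, then one sentinel per sentence.
let sentences = [[1, 2, 3, 4], [5, 6]]
for sentenceTokens in sentences {
    for code in sentenceTokens { continuation.yield(.token(code)) }
    continuation.yield(.sentinel)
}
continuation.finish()

let chunks = await consumer.value
print(chunks)
```

Carrying the sentence boundary in-band keeps ordering trivial: the consumer never needs a second channel to learn where a sentence ended.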

Troubleshooting

No audio plays during streaming

The most common cause is attaching the consumer after generation has already started. Because the stream delivers chunks in real time, any chunks emitted before your for await loop (Swift) or listen callback (Flutter) begins are lost.

Always start the consumer task first, then trigger generation:

let player = AudioStreamPlayer(sample_rate: 24000)
player.start()

let streamer = try ai.open_tts_streamer(model_name: "tts")

// Consumer runs FIRST
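Task {
    for await chunk in streamer.output {
        if let pcm = chunk.audio, !pcm.isEmpty {
            player.enqueue(pcm)
        }
    }
    // Output has ended: let buffered audio finish, then release the player.
    await player.drain()
    player.stop()
}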

// Producer starts AFTER
streamer.send("Hello, world!")
streamer.stop_stream()

If you are on iOS, also verify that Info.plist includes the audio background mode and that the AudioStreamConfig category is set to .playback.

Choppy or glitchy audio between sentences

Audio glitches at sentence boundaries are usually caused by missing crossfade. The TTS streamer uses overlap-add (OLA) to blend the tail of one chunk with the head of the next. If overlap_frames is set to 0 you will hear clicks or pops at every join.

Increase overlap_frames to smooth the transitions:

let streamer = try ai.open_tts_streamer(
    model_name: "tts",
    config: TTSStreamConfig(overlap_frames: 2)
)

The default value of 1 is sufficient for most voices. Only increase it if you hear artefacts with a specific voice or speaking style.
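For intuition about why a crossfade removes clicks, here is a minimal sample-level sketch of joining two PCM buffers with a linear crossfade. It is not the SDK's overlap-add, which works on codec frames and is not shown in this document; crossfadeJoin and its overlap parameter (counted in samples) are hypothetical.

```swift
/// Joins two PCM buffers with a linear crossfade over `overlap` samples.
/// The last `overlap` samples of `head` are blended with the first `overlap`
/// samples of `tail`; all other samples are copied through unchanged.
func crossfadeJoin(_ head: [Float], _ tail: [Float], overlap: Int) -> [Float] {
    let n = min(overlap, head.count, tail.count)
    guard n > 0 else { return head + tail }   // no overlap: hard join, may click

    var output = Array(head.dropLast(n))
    for i in 0..<n {
        let fadeIn = Float(i + 1) / Float(n + 1)   // weight of the incoming buffer
        let fadeOut = 1 - fadeIn                   // weight of the outgoing buffer
        output.append(head[head.count - n + i] * fadeOut + tail[i] * fadeIn)
    }
    output.append(contentsOf: tail.dropFirst(n))
    return output
}

// A step from 1.0 down to 0.0 becomes a ramp instead of an abrupt jump.
let head: [Float] = [1, 1, 1, 1]
let tail: [Float] = [0, 0, 0, 0]
let joined = crossfadeJoin(head, tail, overlap: 3)
print(joined)
```

With overlap set to 0 the same inputs produce an instantaneous jump from 1.0 to 0.0, which is the kind of discontinuity heard as a click.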

High latency before first audio chunk

Three things to check:

`first_frames_per_chunk` is too high. The streamer waits until it has collected this many codec frames before emitting the first chunk. Lowering it (e.g. from 12 to 6) cuts the wait at the cost of a shorter initial audio segment.
First-run model download. The first call to start_model downloads model weights from Hugging Face. Subsequent runs use the local cache. If you see high latency only on the first launch, this is expected — preload the model on a splash screen.
Network round-trip (remote LLM). If you are piping a cloud LLM into TTS, the time-to-first-audio includes the LLM provider’s network latency. Switching to an on-device LLM eliminates this component entirely.

Stream never emits `is_final`

The TTS stream waits for an explicit end-of-input signal. Without it, the streamer assumes more text is coming and holds the last chunk in its buffer indefinitely.

In Swift, call stop_stream() on the streamer once all text has been sent:

for await chunk in llm_stream {
    if let delta = chunk.delta, !delta.isEmpty {
        streamer.send(delta)
    }
    if chunk.is_final {
        streamer.stop_stream()  // ← signals end of input
    }
}

In Flutter, call finish_stream() (not stop_stream(), which cancels):

if (chunk['is_final'] == true) {
  await TheStageFlutterSDK.finish_stream(stream_id: 'my_stream');
}