TTS (Text-to-Speech)¶
Overview¶
On-device neural text-to-speech with batch and push-based streaming. Two public pipelines:
- NeuTTSMultilingualPipeline: Qwen3-based, 9 languages.
- NeuTTSNanoPipeline: phoneme-based, English only, faster.
Flutter consumers go through the singleton start_model + infer /
infer_stream (JSON) path; there is no direct TTS pipeline
constructor in Dart. Both surfaces share the same on-disk cache and
response shape.
Supported Models¶
| Model | HF Repo | Languages | Architecture |
|---|---|---|---|
| NeuTTS Multilingual | | English, French, German, Spanish, Portuguese, Japanese, Korean, Chinese, Urdu | Qwen3-based |
| NeuTTS Nano | | English only | Phoneme-based |
API Reference¶
Full Constructor¶
NeuTTSMultilingualPipeline:
let tts = try await NeuTTSMultilingualPipeline(
engines_path: "TheStageAI/neutts-multilingual",
voice_id: "paul",
language: "english",
device: "npu",
devices: nil,
revision: "main",
on_load_progress: nil
)
NeuTTSNanoPipeline:
let nano = try await NeuTTSNanoPipeline(
engines_path: "TheStageAI/neutts-nano",
voice_id: "dave"
)
Inputs / Outputs¶
| Direction | Type | Description |
|---|---|---|
| input | | Text to synthesize. |
| input | | Sampling temperature (voice default if nil). |
| input | | Top-k sampling (voice default if nil). |
| input | | Deterministic sampling. |
| input | | Attach decoder traces. |
| output | | 24 kHz mono PCM, samples in [-1.0, 1.0]. |
| output | | Always 24000. |
| output | | Seconds of audio. |
| output | | Real-time factor (duration / wall time). |
| output | | Decode speed. |
| output | | Only set if return_debug_info is true. |
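As a worked example of the duration and real-time-factor outputs, the sketch below derives both from a sample count and a measured wall time. The helper names audioDuration and realTimeFactor are illustrative and not part of the SDK; only the formula (duration / wall time) comes from the table above.

```swift
import Foundation

/// Seconds of audio represented by `sampleCount` mono samples.
func audioDuration(sampleCount: Int, sampleRate: Int = 24_000) -> Double {
    precondition(sampleRate > 0, "sample rate must be positive")
    return Double(sampleCount) / Double(sampleRate)
}

/// Real-time factor as defined in the table: audio duration / wall time.
/// Values above 1.0 mean synthesis ran faster than playback.
func realTimeFactor(audioSeconds: Double, wallSeconds: Double) -> Double {
    precondition(wallSeconds > 0, "wall time must be positive")
    return audioSeconds / wallSeconds
}

let seconds = audioDuration(sampleCount: 48_000)   // 2.0 s at 24 kHz
let rtf = realTimeFactor(audioSeconds: seconds, wallSeconds: 0.5)
print("duration: \(seconds) s, rtf: \(rtf)")
```

With this definition, an rtf of 4.0 means two seconds of audio were produced in half a second of wall time.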
Streaming Hyperparameters¶
| Field | Default | Description |
|---|---|---|
| frames_per_chunk | | Codec frames decoded per emitted audio chunk after the first. |
| first_frames_per_chunk | | Frames in the first chunk; a smaller value lowers time-to-first-audio. |
| lookforward | | Future frames decoded together with each chunk. |
| lookback | | Past frames re-decoded for context. |
| overlap_frames | | Frames of crossfade between consecutive chunks. |
Swift:
let streamer = tts.open_streamer(
config: TTSStreamConfig(
frames_per_chunk: 25,
first_frames_per_chunk: 12,
lookforward: 5,
lookback: 50,
overlap_frames: 1
)
)
Flutter:
final stream = TheStageFlutterSDK.infer_stream(
model_name: 'tts',
input_json: {
'text': 'Hello, world.',
'stream_config': {
'frames_per_chunk': 25,
'first_frames_per_chunk': 12,
'lookforward': 5,
'lookback': 50,
'overlap_frames': 1,
},
},
);
Audio Output¶
- 24 kHz mono [Float], samples in [-1.0, 1.0].
- Batch: TTSResult.samples is the full utterance.
- Streaming: each chunk is one sentence-sized PCM slice with overlap-add crossfading.
If your playback path runs at 16 kHz to match VAD/ASR, resample TTS output down.
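A minimal sketch of that downsampling step, assuming plain linear interpolation is acceptable for your use case. resampleLinear is a hypothetical helper, not an SDK function, and it applies no low-pass filter, so a dedicated resampler is likely the better choice in production.

```swift
import Foundation

/// Resamples mono PCM by linear interpolation.
/// No anti-aliasing filter is applied, so content above the target
/// Nyquist frequency can alias when downsampling.
func resampleLinear(_ input: [Float], from sourceRate: Double, to targetRate: Double) -> [Float] {
    precondition(sourceRate > 0 && targetRate > 0, "rates must be positive")
    guard input.count > 1 else { return input }
    let ratio = sourceRate / targetRate            // 1.5 for 24 kHz -> 16 kHz
    let outputCount = Int((Double(input.count) / ratio).rounded(.down))
    var output = [Float]()
    output.reserveCapacity(outputCount)
    for i in 0..<outputCount {
        let position = Double(i) * ratio           // fractional index into the input
        let left = Int(position)
        let right = min(left + 1, input.count - 1)
        let t = Float(position - Double(left))
        output.append(input[left] * (1 - t) + input[right] * t)
    }
    return output
}
```

Calling resampleLinear(samples, from: 24_000, to: 16_000) on one second of TTS output yields 16000 samples.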
Singleton API¶
try await ai.start_model(
model_name: "tts",
engines_path: "TheStageAI/neutts-multilingual",
config: ["voice_id": "paul", "language": "english"]
)
let json = try ai.infer(
model_name: "tts",
input_json: [
"text": "Hello, world!",
"temperature": 1.0,
"top_k": 50
]
)
let audio = json[0]["audio"] as! [Float]
JSON response keys: audio, sample_rate, duration, tokens_per_second, rtf, debug_info.
Usage Guides¶
Synthesizing speech from text (batch)¶
Use batch synthesis when you already have the complete text and do not need real-time playback. The pipeline synthesizes the entire utterance in one pass and returns the full audio buffer — ideal for notifications, pre-recorded prompts, or any scenario where latency to first audio is not critical.
Swift — direct constructor (recommended):
import TheStageSDK
let ai = TheStageAI.shared
try await ai.initialize(apiToken: "th_…")
let tts = try await NeuTTSMultilingualPipeline(
engines_path: "TheStageAI/neutts-multilingual",
voice_id: "paul",
language: "english"
)
let result = tts.infer(text: "Hello, world!")
let audio = result.samples
let sample_rate = result.sample_rate
The English-only Nano variant follows the same shape:
let tts = try await NeuTTSNanoPipeline(
engines_path: "TheStageAI/neutts-nano",
voice_id: "dave"
)
Note
The returned samples array is raw 24 kHz mono PCM. To hear it you
must feed it into an audio player configured for 24000 Hz — for example
AVAudioPlayerNode on iOS or any PCM-capable playback sink. Do not
assume the system default sample rate matches.
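For debugging, one way to check the output independently of any player configuration is to wrap the samples in a WAV container whose header declares 24000 Hz. The sketch below is a hypothetical helper (makeWAV is not an SDK function) that encodes Float samples as 16-bit PCM using only Foundation.

```swift
import Foundation

/// Encodes mono Float samples in [-1.0, 1.0] as a 16-bit PCM WAV file in memory.
func makeWAV(samples: [Float], sampleRate: UInt32 = 24_000) -> Data {
    let bytesPerSample: UInt32 = 2
    let dataSize = UInt32(samples.count) * bytesPerSample
    var wav = Data()

    // Appends an integer in little-endian byte order, as WAV requires.
    func append<T: FixedWidthInteger>(_ value: T) {
        withUnsafeBytes(of: value.littleEndian) { wav.append(contentsOf: $0) }
    }

    wav.append(contentsOf: Array("RIFF".utf8))
    append(UInt32(36) + dataSize)          // bytes remaining after this field
    wav.append(contentsOf: Array("WAVE".utf8))
    wav.append(contentsOf: Array("fmt ".utf8))
    append(UInt32(16))                     // fmt chunk size for PCM
    append(UInt16(1))                      // audio format 1 = integer PCM
    append(UInt16(1))                      // channel count: mono
    append(sampleRate)
    append(sampleRate * bytesPerSample)    // byte rate
    append(UInt16(bytesPerSample))         // block align
    append(UInt16(16))                     // bits per sample
    wav.append(contentsOf: Array("data".utf8))
    append(dataSize)
    for sample in samples {
        let clamped = max(-1.0, min(1.0, sample))
        append(Int16((clamped * 32767).rounded()))
    }
    return wav
}
```

Writing makeWAV(samples: result.samples) to a .wav file and opening it in any audio tool shows whether the synthesized audio itself is correct before you debug the playback path.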
Flutter — JSON path:
import 'package:thestage_apple_sdk/thestage_apple_sdk.dart';
import 'dart:typed_data';
await TheStageFlutterSDK.initialize(api_token: 'th_…');
await TheStageFlutterSDK.start_model(
model_name: 'tts',
engines_path: 'TheStageAI/neutts-multilingual',
config: {'voice_id': 'paul', 'language': 'english'},
);
final result = await TheStageFlutterSDK.infer(
model_name: 'tts',
input_json: {'text': 'Hello, world!'},
);
final audio = result[0]['audio'] as Float32List;
final sampleRate = result[0]['sample_rate'] as int;
Real-time streaming TTS¶
When you want the user to hear audio as soon as possible — without waiting for the full utterance to finish synthesizing — use the streaming interface. This is essential for voice assistants, read-aloud features, and any interactive use case where perceived latency matters.
The streaming API uses a concurrent producer/consumer pattern. You push text into the streamer with send() while simultaneously draining audio chunks from streamer.output. These two operations must run concurrently: if you wait until all text is sent before reading output, internal audio buffers will stall and you will experience unnecessary delays or deadlocks.
Swift:
let streamer = tts.open_streamer()
let consumer = Task {
for await chunk in streamer.output {
if let pcm = chunk.audio { player.enqueue(pcm) }
}
}
streamer.send("Hello, world. ")
streamer.send("This sentence streams as it synthesizes.")
streamer.stop_stream()
await consumer.value
Attention
Always start the consumer task before calling send(). If you
call send() first without a concurrent reader, audio buffers back
up and synthesis stalls.
Note
If you already have the full text up-front, infer_stream(text:) does
the same thing in a single call.
Flutter:
const streamId = 'tts-utterance-1';
final player = TheStageAudioPlayer(sampleRate: 24000)..start();
final consumer = () async {
final stream = TheStageFlutterSDK.infer_stream(
model_name: 'tts',
input_json: {'text': ''},
stream_id: streamId,
);
await for (final chunk in stream) {
final audio = chunk['audio'] as Float32List?;
if (audio != null && audio.isNotEmpty) player.enqueue(audio);
if (chunk['is_final'] == true) break;
}
}();
await TheStageFlutterSDK.send(stream_id: streamId, text: 'Hello, world. ');
await TheStageFlutterSDK.send(
stream_id: streamId,
text: 'This sentence streams as it synthesizes.',
);
await TheStageFlutterSDK.finish_stream(stream_id: streamId);
await consumer;
Choosing between Multilingual and Nano models¶
The two pipelines target different trade-offs:
| | NeuTTS Nano | NeuTTS Multilingual |
|---|---|---|
| Architecture | Phoneme-based encoder/decoder | Qwen3-based language model |
| Languages | English only | 9 languages |
| Latency | Lower (lighter model, faster decode) | Higher (heavier model) |
| Quality | Good for English | Higher naturalness, better prosody control |
Use Nano for English-only apps where speed and memory footprint matter — for example a real-time voice assistant on older devices.
Use Multilingual when you need multi-language support, higher voice quality, or more expressive prosody. It handles code-switching (mixed-language text) better because the Qwen3 backbone understands linguistic context.
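The decision above can be sketched as a small selection helper. Everything here is illustrative: TTSModelChoice and chooseModel are hypothetical names rather than SDK types, and the rule (Nano only for latency-sensitive English) is one reasonable reading of the trade-offs, not an official policy.

```swift
import Foundation

/// Languages listed for NeuTTS Multilingual in this document.
let multilingualLanguages: Set<String> = [
    "english", "french", "german", "spanish", "portuguese",
    "japanese", "korean", "chinese", "urdu",
]

enum TTSModelChoice: Equatable {
    case nano          // TheStageAI/neutts-nano
    case multilingual  // TheStageAI/neutts-multilingual
    case unsupported   // language not covered by either model
}

/// Hypothetical helper (not part of the SDK): picks a model from the trade-offs above.
func chooseModel(language: String, preferLowLatency: Bool) -> TTSModelChoice {
    let normalized = language.lowercased()
    guard multilingualLanguages.contains(normalized) else { return .unsupported }
    if normalized == "english" && preferLowLatency { return .nano }
    return .multilingual
}
```

For example, chooseModel(language: "english", preferLowLatency: true) selects Nano, while any other supported language falls through to Multilingual.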
Switching voices¶
Both pipelines accept a voice_id at construction time. Each voice is a directory under voices/{voice_id}/ in the model bundle containing the speaker embedding and configuration.
let tts = try await NeuTTSMultilingualPipeline(
engines_path: "TheStageAI/neutts-multilingual",
voice_id: "dave",
language: "english"
)
Available voice presets:
- paul: neutral male (default for Multilingual)
- dave: neutral male (default for Nano)
Note
You can add custom voices by placing a compatible speaker-embedding
directory under voices/ in your local engine cache. Refer to the
voice-cloning guide for details.
Piping LLM output directly into TTS (voice assistant pattern)¶
The most common pattern for building a voice assistant is: user speaks → ASR transcribes → LLM generates a reply → TTS speaks the reply. Because LLM output arrives token-by-token, you want to pipe it into the TTS streamer in real time so the user hears audio before the full response is generated.
The TTS streamer handles sentence segmentation internally — you can push partial text (even individual tokens) and the streamer will buffer until it has a complete sentence, then begin synthesis.
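To make the buffering behaviour concrete, here is a deliberately naive sketch of sentence buffering over a token stream. It is an illustration only: SentenceBuffer is a hypothetical type, and the SDK's internal segmentation rules are not documented here and may differ (for instance in how abbreviations, decimals, or ellipses are handled).

```swift
import Foundation

/// Illustrative sentence buffer; the SDK's internal segmentation may differ.
struct SentenceBuffer {
    private var pending = ""
    private let terminators: Set<Character> = [".", "!", "?"]

    /// Appends a text fragment and returns any sentences it completed.
    mutating func push(_ fragment: String) -> [String] {
        var completed: [String] = []
        for character in fragment {
            pending.append(character)
            if terminators.contains(character) {
                let sentence = pending.trimmingCharacters(in: .whitespacesAndNewlines)
                if !sentence.isEmpty { completed.append(sentence) }
                pending = ""
            }
        }
        return completed
    }

    /// Returns whatever text is left when the stream ends, if any.
    mutating func flush() -> String? {
        let rest = pending.trimmingCharacters(in: .whitespacesAndNewlines)
        pending = ""
        return rest.isEmpty ? nil : rest
    }
}
```

Pushing "Hel", "lo, wor", and "ld. How" in turn returns nothing for the first two fragments and ["Hello, world."] for the third, which is why partial tokens can be sent without waiting for sentence boundaries yourself.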
Swift:
let streamer = tts.open_streamer()
let consumer = Task {
for await chunk in streamer.output {
if let pcm = chunk.audio { player.enqueue(pcm) }
}
}
for await token in llm.stream(prompt: userQuery) {
streamer.send(token)
}
streamer.stop_stream()
await consumer.value
Flutter:
const streamId = 'voice-reply';
final player = TheStageAudioPlayer(sampleRate: 24000)..start();
final consumer = () async {
final stream = TheStageFlutterSDK.infer_stream(
model_name: 'tts',
input_json: {'text': ''},
stream_id: streamId,
);
await for (final chunk in stream) {
final audio = chunk['audio'] as Float32List?;
if (audio != null && audio.isNotEmpty) player.enqueue(audio);
if (chunk['is_final'] == true) break;
}
}();
await for (final token in llmStream) {
await TheStageFlutterSDK.send(stream_id: streamId, text: token);
}
await TheStageFlutterSDK.finish_stream(stream_id: streamId);
await consumer;
Adjusting audio quality vs latency¶
Two axes control the quality-latency trade-off:
Voice quality — controlled by sampling parameters at inference time:
- temperature: higher values (e.g. 1.2) add expressiveness but may introduce artifacts. Lower values (e.g. 0.7) produce more stable but flatter speech.
- top_k: restricts the token pool at each decode step. Lower values (e.g. 20) are more conservative; higher values (e.g. 80) give more variation.
let result = tts.infer(
text: "Hello!",
config: TTSConfig(temperature: 0.8, top_k: 30)
)
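For intuition about how these two parameters interact, the sketch below computes the sampling distribution that conventional temperature plus top-k sampling would produce from a set of logits. This is the textbook formulation, assumed here for illustration; the SDK's actual sampler is not documented in this page, and topKDistribution is a hypothetical helper.

```swift
import Foundation

/// Probabilities after keeping the `topK` largest logits and applying temperature.
/// Tokens outside the top-k pool get probability 0.
func topKDistribution(logits: [Double], temperature: Double, topK: Int) -> [Double] {
    precondition(temperature > 0, "temperature must be positive")
    precondition(topK > 0, "topK must be positive")
    guard !logits.isEmpty else { return [] }
    // Indices of the k highest logits.
    let kept = Set(logits.indices.sorted { logits[$0] > logits[$1] }.prefix(topK))
    let maxLogit = logits.max()!   // subtracted for numerical stability
    let weights = logits.indices.map { index -> Double in
        kept.contains(index) ? exp((logits[index] - maxLogit) / temperature) : 0
    }
    let total = weights.reduce(0, +)
    return weights.map { $0 / total }
}
```

Under this formulation a lower temperature concentrates probability on the best candidate (more stable output), while a larger top_k leaves more candidates with non-zero probability (more variation).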
Streaming latency — controlled by TTSStreamConfig:
first_frames_per_chunk is the most impactful knob. It controls how many codec frames must be decoded before the first audio chunk is emitted. Lower values = faster first audio, but each chunk is shorter so the decoder runs more often (slightly more total compute).
let streamer = tts.open_streamer(
config: TTSStreamConfig(
first_frames_per_chunk: 8,
frames_per_chunk: 25
)
)
Concrete guidance:
- For the lowest perceived latency (voice assistant), set first_frames_per_chunk to 6–10.
- For smoother playback with less overhead, use the default of 25.
- overlap_frames controls crossfade between consecutive chunks. Increase from 1 to 2–3 if you hear clicks at chunk boundaries.
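To get a feel for what a given frame count means in playback time, the arithmetic can be sketched as follows. The codec frame rate is not stated in this document, so the value of 50 frames per second used below is purely an assumption for illustration, and chunkMilliseconds is a hypothetical helper; substitute the real frame rate of your model.

```swift
import Foundation

/// Audio duration in milliseconds covered by `frames` codec frames.
/// `framesPerSecond` is an ASSUMED codec frame rate, not an SDK constant.
func chunkMilliseconds(frames: Int, framesPerSecond: Double) -> Double {
    precondition(framesPerSecond > 0, "frame rate must be positive")
    return Double(frames) * 1000 / framesPerSecond
}

let assumedFrameRate = 50.0   // hypothetical value, not taken from the SDK
for frames in [6, 10, 12, 25] {
    let ms = chunkMilliseconds(frames: frames, framesPerSecond: assumedFrameRate)
    print("\(frames) frames is about \(ms) ms of audio per chunk")
}
```

Smaller first chunks reach the player sooner but cover less playback time, so the next chunk has to arrive correspondingly faster to avoid a gap.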
Voices and Languages¶
Voices live under voices/{voice_id}/ inside the bundle. The
multilingual model supports:
english, french, german, spanish, portuguese, japanese,
korean, chinese, urdu
The Nano variant is English-only and ignores the language parameter.
Troubleshooting¶
No audio output / empty samples array¶
- Verify the input text is not empty or whitespace-only.
- Confirm the voice_id you passed matches a directory that exists under voices/ in the engine bundle.
- Pass return_debug_info: true in the config and inspect the returned debug_info; it contains decoder traces showing whether tokens were generated.
- If using the singleton API, make sure start_model completed successfully before calling infer.
Audio sounds robotic or choppy during streaming¶
- Check that overlap_frames is at least 1. Setting it to 0 disables crossfading between chunks, causing audible clicks at boundaries.
- Ensure you are draining streamer.output concurrently with send(), not sequentially after all text is sent. Sequential reads cause buffers to fill, which stalls the decoder and produces irregular chunk timing.
- If individual chunks sound distorted, try increasing frames_per_chunk to give the decoder more context per chunk.
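To illustrate why a zero overlap produces clicks, here is a minimal sketch of a linear crossfade between two PCM chunks. It works in samples rather than codec frames and is not the SDK's implementation, which performs its own overlap-add internally; crossfade is a hypothetical helper.

```swift
import Foundation

/// Joins two PCM chunks with a linear crossfade over `overlap` samples.
/// The last `overlap` samples of `head` are blended with the first `overlap` of `tail`.
func crossfade(_ head: [Float], _ tail: [Float], overlap: Int) -> [Float] {
    let n = max(0, min(overlap, head.count, tail.count))
    guard n > 0 else { return head + tail }   // overlap 0: hard cut, may click
    var output = Array(head.dropLast(n))
    for i in 0..<n {
        let fadeIn = Float(i + 1) / Float(n + 1)   // weight of `tail`, ramps towards 1
        let a = head[head.count - n + i]
        let b = tail[i]
        output.append(a * (1 - fadeIn) + b * fadeIn)
    }
    output.append(contentsOf: tail.dropFirst(n))
    return output
}
```

With overlap 0 the two chunks are simply concatenated, so any level mismatch at the boundary becomes an abrupt step; a non-zero overlap turns that step into a ramp.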
Wrong language pronunciation¶
- On NeuTTSMultilingualPipeline, set the language parameter to match your input text (e.g. "french"). If omitted it defaults to English, which produces incorrect phonemization for other languages.
- NeuTTSNanoPipeline only supports English. If you need other languages, switch to the Multilingual pipeline.
High latency before first audio¶
- Switch from batch (infer) to streaming (open_streamer). Batch mode waits for the full utterance to finish before returning any audio.
- Lower first_frames_per_chunk (e.g. to 6–10). This is the primary control for time-to-first-audio.
- Use prefetch_engines at app startup to pre-download the model so that the first start_model call does not include a network download.
Audio plays at wrong speed¶
TTS outputs 24 kHz mono PCM. If your audio player is configured for a different sample rate (e.g. 44100 Hz or 16000 Hz), playback will be too fast or too slow.
- Set your audio player’s sample rate to exactly 24000 Hz before enqueuing TTS samples.
- If your playback pipeline is fixed at 16 kHz (to match VAD/ASR), resample the TTS output from 24 kHz to 16 kHz before playing.
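The size of the speed error follows directly from the ratio of the two sample rates. A small worked sketch, using a hypothetical helper name (playbackSpeedFactor is not part of the SDK):

```swift
import Foundation

/// Speed factor heard when PCM produced at `sourceRate` is played at `playerRate`.
/// Above 1.0 the audio sounds fast and high-pitched; below 1.0, slow and low-pitched.
func playbackSpeedFactor(sourceRate: Double = 24_000, playerRate: Double) -> Double {
    precondition(sourceRate > 0 && playerRate > 0, "rates must be positive")
    return playerRate / sourceRate
}

for rate in [16_000.0, 24_000.0, 44_100.0] {
    print("player at \(rate) Hz -> speed factor \(playbackSpeedFactor(playerRate: rate))")
}
```

A player configured for 44100 Hz therefore plays TTS output at nearly twice the intended speed, and a 16 kHz player plays it at about two thirds of the intended speed.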
Load Progress¶
Swift:
let tts = try await NeuTTSMultilingualPipeline(
engines_path: "TheStageAI/neutts-multilingual",
voice_id: "paul",
on_load_progress: { p in
print("[\(p.model)] \(p.phase) \(Int(p.fraction * 100))%")
}
)
Flutter:
TheStageFlutterSDK.on_progress.listen((event) {
if (event['model_name'] != 'tts') return;
final phase = event['phase'] as String?;
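// Cast via num so an integer progress value (0 or 1) does not fail the cast.
final fraction = (event['progress'] as num?)?.toDouble() ?? 0.0;
print('[tts] $phase ${(fraction * 100).toStringAsFixed(0)}%');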
});
await TheStageFlutterSDK.start_model(
model_name: 'tts',
engines_path: 'TheStageAI/neutts-multilingual',
config: {'voice_id': 'paul', 'language': 'english'},
);
Prefetch Engines¶
let engines_dir = try await ai.prefetch_engines(
repo_id: "TheStageAI/neutts-multilingual"
)
let tts = try await NeuTTSMultilingualPipeline(
engines_path: engines_dir,
voice_id: "paul"
)
Cleanup¶
Swift:
_ = try ai.stop_model(model_name: "tts")
Flutter:
await TheStageFlutterSDK.stop_model(model_name: 'tts');