TTS (Text-to-Speech)

Overview

On-device neural text-to-speech with batch and push-based streaming. Public pipelines:

NeuTTSMultilingualPipeline — NeuTTS multilingual / nano-multilingual.
Qwen3TTSPipeline — Qwen3-TTS (12 Hz talker + MTP + codec).

Note

The English espeak Nano model (TheStageAI/neutts / NeuTTSNanoPipeline) is not in the current HF fleet because it needs an app-side espeak phonemizer. Prefer neutts-multilingual or neutts-nano-multilingual.

In Flutter, start_model auto-routes to Qwen3-TTS or NeuTTS based on the bundle layout; Dart has no direct TTS constructor. All surfaces share the same on-disk cache and the same 24 kHz PCM output contract.

Supported Models

Model	HF Repo	Languages	Architecture
NeuTTS Multilingual	`TheStageAI/neutts-multilingual`	English, French, German, Spanish, Portuguese, Japanese, Korean, Chinese, Urdu	NeuTTS
NeuTTS Nano multilingual	`TheStageAI/neutts-nano-multilingual`	Same multilingual set (smaller)	NeuTTS
Qwen3-TTS 12 Hz 0.6B	`TheStageAI/Qwen3-TTS-12Hz-0.6B-Base`	Bundle voices (default `b_ref`)	Qwen3-TTS

Qwen3-TTS

let tts = try await Qwen3TTSPipeline(
    engines_path: "TheStageAI/Qwen3-TTS-12Hz-0.6B-Base",
    voice_id: "b_ref"
)
let result = tts.infer(text: "Hello, world!")

start_model with a Qwen3-TTS bundle selects this family automatically (default voice_id: b_ref). Streaming uses the same push infer_stream / streamer patterns as NeuTTS.

API Reference

Full Constructor

NeuTTSMultilingualPipeline:

let tts = try await NeuTTSMultilingualPipeline(
    engines_path: "TheStageAI/neutts-multilingual",
    voice_id: "paul",
    language: "english",
    device: "npu",
    devices: nil,
    on_load_progress: nil
)

Inputs / Outputs

Direction and field	Type	Description
input `text`	`String`	Text to synthesize.
input `config.temperature`	`Double?`	Sampling temperature (voice default if nil).
input `config.top_k`	`Int?`	Top-k sampling (voice default if nil).
input `config.seed`	`UInt64?`	Fixed seed for reproducible sampling.
input `config.return_debug_info`	`Bool` (default `false`)	Attach decoder traces.
output `TTSResult.samples`	`[Float]`	24 kHz mono PCM, samples in `[-1.0, 1.0]`.
output `TTSResult.sample_rate`	`Int`	Always `24000`.
output `TTSResult.duration`	`Double`	Seconds of audio.
output `TTSResult.rtf`	`Double`	Real-time factor, computed as audio duration / wall time (values above 1.0 mean faster than real time).
output `TTSResult.tokens_per_second`	`Double`	Decode speed.
output `TTSResult.debug_info`	`TTSDebugInfo?`	Only set if `return_debug_info`.
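
The derived output fields are related by simple arithmetic. The sketch below assumes that `duration` equals the sample count divided by the sample rate and that `rtf` follows the definition in the table (audio duration / wall time). The helper names are hypothetical and not part of the SDK.

```swift
/// Seconds of audio represented by `sampleCount` samples at `sampleRate` Hz.
func audioDuration(sampleCount: Int, sampleRate: Int) -> Double {
    precondition(sampleRate > 0, "sample rate must be positive")
    return Double(sampleCount) / Double(sampleRate)
}

/// Real-time factor as defined in the table: audio seconds per wall-clock second.
func realTimeFactor(audioSeconds: Double, wallSeconds: Double) -> Double {
    precondition(wallSeconds > 0, "wall time must be positive")
    return audioSeconds / wallSeconds
}

// 48,000 samples at 24 kHz are two seconds of audio.
let audioSeconds = audioDuration(sampleCount: 48_000, sampleRate: 24_000)  // 2.0

// Two seconds of audio produced in half a second of wall time.
let factor = realTimeFactor(audioSeconds: audioSeconds, wallSeconds: 0.5)  // 4.0
print(audioSeconds, factor)
```

A check like this is a quick way to confirm that a result is internally consistent before debugging playback.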

Sampling (voice quality)

Omit temperature / top_k / seed to use the voice / bundle defaults (the Qwen3-TTS talker typically uses around temperature=0.9 and top_k=50). These knobs are independent of streaming chunking (TTSStreamConfig).

Goal	`temperature`	`top_k`	`seed`
Stable / clear (default-ish)	omit or `0.8–1.0`	omit or `40–50`	omit
More expressive / varied	`1.1–1.2`	`60–80`	omit
Safer / less artifact-prone	`0.6–0.8`	`20–30`	omit
Reproducible QA fixture	`0.8`	`50`	fixed (`42`)

Swift:

let result = tts.infer(
    text: "Welcome to the product demo.",
    config: TTSGenerationConfig(temperature: 0.8, top_k: 30)
)

// Deterministic fixture
let fixed = tts.infer(
    text: "Hello.",
    config: TTSGenerationConfig(temperature: 0.8, top_k: 50, seed: 42)
)

Flutter / JSON (sampling keys live next to text):

final result = await TheStageFlutterSDK.infer(
  model_name: 'tts',
  input_json: {
    'text': 'Welcome to the product demo.',
    'temperature': 0.8,
    'top_k': 30,
  },
);

Streaming Hyperparameters

Field	Default	Description
`frames_per_chunk`	`25`	Codec frames decoded per emitted audio chunk after the first.
`first_frames_per_chunk`	`25`	Frames in the first chunk; smaller value lowers time-to-first-audio.
`lookforward`	`5`	Future frames decoded together with each chunk.
`lookback`	`50`	Past frames re-decoded for context.
`overlap_frames`	`1`	Frames of crossfade between consecutive chunks.
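
To reason about these frame counts in seconds, divide them by the codec frame rate. The sketch below is rough arithmetic under two assumptions. First, the frame rate is a value you must supply for your model; 12 Hz is used only because the Qwen3-TTS bundle name mentions it. Second, the idea that the first chunk needs `first_frames_per_chunk + lookforward` frames is an inference from the table, not a documented guarantee. The helper name is hypothetical.

```swift
/// Seconds of audio covered by `frames` codec frames at `frameRateHz`.
func audioSeconds(forFrames frames: Int, frameRateHz: Double) -> Double {
    precondition(frameRateHz > 0, "frame rate must be positive")
    return Double(frames) / frameRateHz
}

// Assumed codec frame rate; check your model bundle for the real value.
let assumedFrameRateHz = 12.0

// Audio carried by a default steady-state chunk (frames_per_chunk = 25).
let chunkSeconds = audioSeconds(forFrames: 25, frameRateHz: assumedFrameRateHz)

// Frames that must exist before the first chunk can be decoded, assuming the
// lookforward frames are needed in addition to the chunk itself.
let firstFrames = 12, lookforward = 5
let framesBeforeFirstAudio = firstFrames + lookforward

// This is audio time, not wall-clock time: the real delay also depends on
// how fast the model generates frames.
let leadAudioSeconds = audioSeconds(forFrames: framesBeforeFirstAudio,
                                    frameRateHz: assumedFrameRateHz)
print(chunkSeconds, leadAudioSeconds)
```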

Swift:

let streamer = tts.open_streamer(
    config: TTSStreamConfig(
        frames_per_chunk: 25,
        first_frames_per_chunk: 12,
        lookforward: 5,
        lookback: 50,
        overlap_frames: 1
    )
)

Flutter:

final stream = TheStageFlutterSDK.infer_stream(
  model_name: 'tts',
  input_json: {
    'text': 'Hello, world.',
    'stream_config': {
      'frames_per_chunk': 25,
      'first_frames_per_chunk': 12,
      'lookforward': 5,
      'lookback': 50,
      'overlap_frames': 1,
    },
  },
);

Audio Output

24 kHz mono [Float], samples in [-1.0, 1.0].
Batch: TTSResult.samples is the full utterance.
Streaming: each chunk is one sentence-sized PCM slice with overlap-add crossfading.
If your playback path runs at 16 kHz to match VAD/ASR, resample TTS output down.
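
If you need to downsample yourself, a minimal sketch using linear interpolation is shown below. `resampleLinear` is a hypothetical helper, not an SDK function. It applies no low-pass filter, so content above the new Nyquist frequency can alias; a production path would more likely use a platform converter or a filtered resampler.

```swift
/// Resamples mono PCM by linear interpolation between neighboring samples.
/// No anti-aliasing filter is applied.
func resampleLinear(_ input: [Float], from sourceRate: Int, to targetRate: Int) -> [Float] {
    precondition(sourceRate > 0 && targetRate > 0, "rates must be positive")
    guard !input.isEmpty, sourceRate != targetRate else { return input }

    let outputCount = input.count * targetRate / sourceRate
    let step = Double(sourceRate) / Double(targetRate)  // input samples per output sample
    var output = [Float]()
    output.reserveCapacity(outputCount)

    for i in 0..<outputCount {
        let position = Double(i) * step
        let index = Int(position)
        let fraction = Float(position - Double(index))
        let next = min(index + 1, input.count - 1)      // clamp at the last sample
        output.append(input[index] * (1 - fraction) + input[next] * fraction)
    }
    return output
}

// 24 kHz -> 16 kHz keeps two output samples for every three input samples.
let ramp: [Float] = [0, 1, 2, 3, 4, 5]
let downsampled = resampleLinear(ramp, from: 24_000, to: 16_000)
print(downsampled)  // [0.0, 1.5, 3.0, 4.5]
```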

Singleton API

try await ai.start_model(
    model_name: "tts",
    engines_path: "TheStageAI/neutts-multilingual",
    config: ["voice_id": "paul", "language": "english"]
)

let json = try ai.infer(
    model_name: "tts",
    input_json: [
        "text": "Hello, world!",
        "temperature": 1.0,
        "top_k": 50
    ]
)
let audio = json[0]["audio"] as! [Float]

JSON response keys: audio, sample_rate, duration, tokens_per_second, rtf, debug_info.

Usage Guides

Synthesizing speech from text (batch)

Use batch synthesis when you already have the complete text and do not need real-time playback. The pipeline synthesizes the entire utterance in one pass and returns the full audio buffer — ideal for notifications, pre-recorded prompts, or any scenario where latency to first audio is not critical.

Swift — direct constructor (recommended):

import TheStageSDK

let ai = TheStageAI.shared
try await ai.initialize(apiToken: "th_…")

let tts = try await NeuTTSMultilingualPipeline(
    engines_path: "TheStageAI/neutts-multilingual",
    voice_id: "paul",
    language: "english"
)

let result = tts.infer(text: "Hello, world!")
let audio = result.samples
let sample_rate = result.sample_rate

Note

The returned samples array is raw 24 kHz mono PCM. To hear it you must feed it into an audio player configured for 24000 Hz — for example AVAudioPlayerNode on iOS or any PCM-capable playback sink. Do not assume the system default sample rate matches.
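
To inspect batch output without wiring up a player, one option is to wrap the samples in a WAV container and open the file in any audio tool. The sketch below is a hypothetical helper, not part of the SDK. It assumes mono input in `[-1.0, 1.0]` and writes 16-bit PCM with the sample rate stored in the header.

```swift
import Foundation

/// Encodes mono float PCM as a 16-bit little-endian WAV file in memory.
func wavData(samples: [Float], sampleRate: Int = 24_000) -> Data {
    let bytesPerSample = 2
    let dataSize = UInt32(samples.count * bytesPerSample)
    var data = Data()

    // Appends an integer in little-endian byte order.
    func append<T: FixedWidthInteger>(_ value: T) {
        withUnsafeBytes(of: value.littleEndian) { data.append(contentsOf: $0) }
    }

    data.append(contentsOf: Array("RIFF".utf8))
    append(UInt32(36) + dataSize)                 // RIFF chunk size
    data.append(contentsOf: Array("WAVE".utf8))
    data.append(contentsOf: Array("fmt ".utf8))
    append(UInt32(16))                            // fmt chunk size
    append(UInt16(1))                             // format tag: PCM
    append(UInt16(1))                             // channels: mono
    append(UInt32(sampleRate))                    // sample rate
    append(UInt32(sampleRate * bytesPerSample))   // byte rate
    append(UInt16(bytesPerSample))                // block align
    append(UInt16(16))                            // bits per sample
    data.append(contentsOf: Array("data".utf8))
    append(dataSize)

    for sample in samples {
        // Treat NaN as silence and clamp overshoot before converting.
        let safe: Float = sample.isNaN ? 0 : max(-1, min(1, sample))
        append(Int16((safe * 32767).rounded()))
    }
    return data
}

let wav = wavData(samples: [0, 0.5, -0.5, 1])
print(wav.count)  // 52: a 44-byte header plus four 2-byte samples
```

Pass `result.samples` in place of the literal array and write the returned `Data` to a `.wav` file URL to listen to it.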

Flutter — JSON path:

import 'package:thestage_apple_sdk/thestage_apple_sdk.dart';
import 'dart:typed_data';

await TheStageFlutterSDK.initialize(api_token: 'th_…');

await TheStageFlutterSDK.start_model(
  model_name: 'tts',
  engines_path: 'TheStageAI/neutts-multilingual',
  config: {'voice_id': 'paul', 'language': 'english'},
);

final result = await TheStageFlutterSDK.infer(
  model_name: 'tts',
  input_json: {'text': 'Hello, world!'},
);
final audio       = result[0]['audio']       as Float32List;
final sampleRate  = result[0]['sample_rate'] as int;

Real-time streaming TTS

When you want the user to hear audio as soon as possible — without waiting for the full utterance to finish synthesizing — use the streaming interface. This is essential for voice assistants, read-aloud features, and any interactive use case where perceived latency matters.

The streaming API uses a concurrent producer/consumer pattern. You push text into the streamer with send() while simultaneously draining audio chunks from streamer.output. These two operations must run concurrently: if you wait until all text is sent before reading output, internal audio buffers will stall and you will experience unnecessary delays or deadlocks.

Swift:

let streamer = tts.open_streamer()

let consumer = Task {
    for await chunk in streamer.output {
        if let pcm = chunk.audio { player.enqueue(pcm) }
    }
}

streamer.send("Hello, world. ")
streamer.send("This sentence streams as it synthesizes.")
streamer.stop_stream()
await consumer.value

Attention

Always start the consumer task before calling send(). If you call send() first without a concurrent reader, audio buffers back up and synthesis stalls.

Note

If you already have the full text up-front, infer_stream(text:) does the same thing in a single call.

Flutter:

const streamId = 'tts-utterance-1';
final player = TheStageAudioPlayer(sampleRate: 24000)..start();

final consumer = () async {
  final stream = TheStageFlutterSDK.infer_stream(
    model_name: 'tts',
    input_json: {'text': ''},
    stream_id: streamId,
  );
  await for (final chunk in stream) {
    final audio = chunk['audio'] as Float32List?;
    if (audio != null && audio.isNotEmpty) player.enqueue(audio);
    if (chunk['is_final'] == true) break;
  }
}();

await TheStageFlutterSDK.send(stream_id: streamId, text: 'Hello, world. ');
await TheStageFlutterSDK.send(
  stream_id: streamId,
  text: 'This sentence streams as it synthesizes.',
);
await TheStageFlutterSDK.finish_stream(stream_id: streamId);
await consumer;

Choosing between Multilingual and Nano models

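The two pipelines target different trade-offs. "Nano" in this comparison refers to the English espeak model (NeuTTSNanoPipeline), which the Note in the Overview lists as not in the current HF fleet; neutts-nano-multilingual is the smaller model that covers the multilingual language set.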

	NeuTTS Nano	NeuTTS Multilingual
Architecture	Phoneme-based encoder/decoder	Qwen3-based language model
Languages	English only	9 languages
Latency	Lower (lighter model, faster decode)	Higher (heavier model)
Quality	Good for English	Higher naturalness, better prosody control

Use Nano for English-only apps where speed and memory footprint matter — for example a real-time voice assistant on older devices.

Use Multilingual when you need multi-language support, higher voice quality, or more expressive prosody. It handles code-switching (mixed-language text) better because the Qwen3 backbone understands linguistic context.

Switching voices

Both pipelines accept a voice_id at construction time. Each voice is a directory under voices/{voice_id}/ in the model bundle containing the speaker embedding and configuration.

let tts = try await NeuTTSMultilingualPipeline(
    engines_path: "TheStageAI/neutts-multilingual",
    voice_id: "dave",
    language: "english"
)

Available voice presets:

paul — neutral male (default for Multilingual)
dave — neutral male (default for Nano)

Note

You can add custom voices by placing a compatible speaker-embedding directory under voices/ in your local engine cache. Refer to the voice-cloning guide for details.

Piping LLM output directly into TTS (voice assistant pattern)

The most common pattern for building a voice assistant is: user speaks → ASR transcribes → LLM generates a reply → TTS speaks the reply. Because LLM output arrives token-by-token, you want to pipe it into the TTS streamer in real time so the user hears audio before the full response is generated.

The TTS streamer handles sentence segmentation internally — you can push partial text (even individual tokens) and the streamer will buffer until it has a complete sentence, then begin synthesis.
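
To illustrate what buffering until a complete sentence can look like, here is a deliberately naive sketch. It is not the SDK's segmenter, whose rules are not documented here; `SentenceBuffer` is a hypothetical type that splits only on `.`, `!`, and `?`, so abbreviations and decimals would be split incorrectly.

```swift
import Foundation

/// Accumulates streamed text fragments and releases complete sentences.
struct SentenceBuffer {
    private var pending = ""
    private let terminators: Set<Character> = [".", "!", "?"]

    /// Appends a fragment and returns any sentences it completed.
    mutating func push(_ fragment: String) -> [String] {
        var completed: [String] = []
        for character in fragment {
            pending.append(character)
            if terminators.contains(character) {
                let sentence = pending.trimmingCharacters(in: .whitespacesAndNewlines)
                if !sentence.isEmpty { completed.append(sentence) }
                pending = ""
            }
        }
        return completed
    }

    /// Returns leftover text that never reached a terminator, if any.
    mutating func flush() -> String? {
        let rest = pending.trimmingCharacters(in: .whitespacesAndNewlines)
        pending = ""
        return rest.isEmpty ? nil : rest
    }
}

var buffer = SentenceBuffer()
var sentences: [String] = []
for token in ["Hel", "lo, wor", "ld. This", " is next"] {
    sentences += buffer.push(token)
}
if let rest = buffer.flush() { sentences.append(rest) }
print(sentences)  // ["Hello, world.", "This is next"]
```

Because the streamer already does this internally, you would not normally add such a buffer in front of `send()`; the sketch only shows why partial tokens are safe to push.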

Swift:

let streamer = tts.open_streamer()

let consumer = Task {
    for await chunk in streamer.output {
        if let pcm = chunk.audio { player.enqueue(pcm) }
    }
}

for await token in llm.stream(prompt: userQuery) {
    streamer.send(token)
}
streamer.stop_stream()
await consumer.value

Flutter:

const streamId = 'voice-reply';
final player = TheStageAudioPlayer(sampleRate: 24000)..start();

final consumer = () async {
  final stream = TheStageFlutterSDK.infer_stream(
    model_name: 'tts',
    input_json: {'text': ''},
    stream_id: streamId,
  );
  await for (final chunk in stream) {
    final audio = chunk['audio'] as Float32List?;
    if (audio != null && audio.isNotEmpty) player.enqueue(audio);
    if (chunk['is_final'] == true) break;
  }
}();

await for (final token in llmStream) {
  await TheStageFlutterSDK.send(stream_id: streamId, text: token);
}
await TheStageFlutterSDK.finish_stream(stream_id: streamId);
await consumer;

Adjusting audio quality vs latency

Two axes control the quality-latency trade-off:

Voice quality — controlled by sampling parameters at inference time (see Sampling (voice quality) above for a recipe table):

temperature — higher values (e.g. 1.2) add expressiveness but may introduce artifacts. Lower values (e.g. 0.7) produce more stable but flatter speech.
top_k — restricts the token pool at each decode step. Lower values (e.g. 20) are more conservative; higher values (e.g. 80) give more variation.

let result = tts.infer(
    text: "Hello!",
    config: TTSGenerationConfig(temperature: 0.8, top_k: 30)
)
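
To make the two sampling knobs concrete, here is a generic sketch of temperature plus top-k sampling over a toy logit vector. It illustrates the general technique only and says nothing about the SDK's decoder internals; `sampleTopK` and its parameters are hypothetical names.

```swift
import Foundation

/// Samples an index from `logits` using temperature scaling and top-k filtering.
func sampleTopK<G: RandomNumberGenerator>(
    logits: [Double], temperature: Double, topK: Int, using generator: inout G
) -> Int {
    precondition(!logits.isEmpty && temperature > 0 && topK > 0)

    // Keep only the k highest-scoring candidates.
    let candidates = logits.enumerated()
        .sorted { $0.element > $1.element }
        .prefix(topK)

    // Softmax over the survivors. Dividing by the temperature flattens the
    // distribution when T > 1 and sharpens it when T < 1.
    let maxLogit = candidates.first!.element
    let weights = candidates.map { exp(($0.element - maxLogit) / temperature) }
    let total = weights.reduce(0, +)

    // Draw from the resulting categorical distribution.
    var threshold = Double.random(in: 0..<total, using: &generator)
    for (candidate, weight) in zip(candidates, weights) {
        threshold -= weight
        if threshold < 0 { return candidate.offset }
    }
    return candidates.last!.offset
}

var rng = SystemRandomNumberGenerator()
let logits = [0.1, 2.5, -1.0, 1.7]

// With topK = 1 only the best candidate survives, so the result is deterministic.
let greedy = sampleTopK(logits: logits, temperature: 0.8, topK: 1, using: &rng)
print(greedy)  // 1

// With topK = 2 the draw is limited to the two best candidates (indices 1 and 3).
let varied = sampleTopK(logits: logits, temperature: 1.2, topK: 2, using: &rng)
print(varied)
```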

Streaming latency — controlled by TTSStreamConfig:

first_frames_per_chunk is the most impactful knob. It controls how many codec frames must be decoded before the first audio chunk is emitted. Lower values give faster first audio, but the first chunk is shorter, so the player has less audio buffered while the next chunk decodes, and the additional decode call can add slightly more total compute.

let streamer = tts.open_streamer(
    config: TTSStreamConfig(
        frames_per_chunk: 25,
        first_frames_per_chunk: 8
    )
)

Concrete guidance:

For the lowest perceived latency (voice assistant), set first_frames_per_chunk to 6–10.
For smoother playback with less overhead, use the default of 25.
overlap_frames controls crossfade between consecutive chunks. Increase from 1 to 2–3 if you hear clicks at chunk boundaries.
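
To show why a longer overlap hides boundary clicks, here is a sketch of a linear crossfade between two PCM chunks. It is an illustration under assumptions: the SDK measures overlap in codec frames and its window shape is not documented here, while this hypothetical `crossfade` helper works on raw samples with a linear ramp.

```swift
/// Joins two PCM chunks, blending the last `overlap` samples of `previous`
/// with the first `overlap` samples of `next`.
func crossfade(_ previous: [Float], _ next: [Float], overlap: Int) -> [Float] {
    let n = min(overlap, previous.count, next.count)
    guard n > 0 else { return previous + next }  // no overlap: hard cut

    var output = Array(previous.dropLast(n))
    let tail = previous.suffix(n)
    for (i, (outgoing, incoming)) in zip(tail, next.prefix(n)).enumerated() {
        // Fade-in weight rises linearly across the overlap region.
        let fadeIn = Float(i + 1) / Float(n + 1)
        output.append(outgoing * (1 - fadeIn) + incoming * fadeIn)
    }
    output.append(contentsOf: next.dropFirst(n))
    return output
}

// A hard step from 1.0 to 0.0 becomes a gradual ramp.
let joined = crossfade([1, 1, 1, 1], [0, 0, 0, 0], overlap: 3)
print(joined)  // [1.0, 0.75, 0.5, 0.25, 0.0]
```

With `overlap: 0` the same inputs jump straight from 1.0 to 0.0, which is the kind of discontinuity that is heard as a click.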

Voices and Languages

Voices live under voices/{voice_id}/ inside the bundle. The multilingual model supports:

english, french, german, spanish, portuguese, japanese, korean, chinese, urdu

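The English Nano variant (NeuTTSNanoPipeline) is English-only and ignores the language parameter. neutts-nano-multilingual supports the same language set as the multilingual model.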

Troubleshooting

No audio output / empty samples array

Verify the input text is not empty or whitespace-only.
Confirm the voice_id you passed matches a directory that exists under voices/ in the engine bundle.
Pass return_debug_info: true in the config and inspect the returned debug_info — it contains decoder traces showing whether tokens were generated.
If using the singleton API, make sure start_model completed successfully before calling infer.
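
When it is unclear whether the buffer is empty, silent, or simply not reaching the speaker, basic statistics on the samples help narrow it down. The sketch below uses hypothetical names (`AudioStats`, `audioStats`), and the silence threshold is an arbitrary assumption rather than an SDK constant.

```swift
/// Summary statistics for a PCM buffer.
struct AudioStats {
    let sampleCount: Int
    let peak: Float  // largest absolute sample value
    let rms: Float   // root-mean-square level
}

func audioStats(_ samples: [Float]) -> AudioStats {
    guard !samples.isEmpty else { return AudioStats(sampleCount: 0, peak: 0, rms: 0) }
    let peak = samples.reduce(Float(0)) { max($0, abs($1)) }
    let meanSquare = samples.reduce(Float(0)) { $0 + $1 * $1 } / Float(samples.count)
    return AudioStats(sampleCount: samples.count, peak: peak, rms: meanSquare.squareRoot())
}

/// Classifies a buffer; the 1e-4 silence threshold is an assumed value.
func diagnose(_ stats: AudioStats) -> String {
    if stats.sampleCount == 0 { return "empty" }
    if stats.peak < 1e-4 { return "silent" }
    return "audible"
}

let stats = audioStats([0.5, -0.5, 0.5, -0.5])
print(stats.peak, stats.rms, diagnose(stats))  // 0.5 0.5 audible
```

An "audible" result with nothing coming out of the speaker points at the playback path rather than at synthesis.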

Audio sounds robotic or choppy during streaming

Check that overlap_frames is at least 1. Setting it to 0 disables crossfading between chunks, causing audible clicks at boundaries.
Ensure you are draining streamer.output concurrently with send() — not sequentially after all text is sent. Sequential reads cause buffers to fill, which stalls the decoder and produces irregular chunk timing.
If individual chunks sound distorted, try increasing frames_per_chunk to give the decoder more context per chunk.

Wrong language pronunciation

On NeuTTSMultilingualPipeline, set the language parameter to match your input text (e.g. "french"). If omitted it defaults to English, which produces incorrect phonemization for other languages.

High latency before first audio

Switch from batch (infer) to streaming (open_streamer). Batch mode waits for the full utterance to finish before returning any audio.
Lower first_frames_per_chunk (e.g. to 6–10). This is the primary control for time-to-first-audio.
Use prefetch_engines at app startup to pre-download the model so that the first start_model call does not include a network download.

Audio plays at wrong speed

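TTS outputs 24 kHz mono PCM. If your audio player is configured for a different sample rate, playback speed and pitch scale by player rate / 24000: a 44100 Hz player runs about 1.84x too fast and high-pitched, and a 16000 Hz player runs at about two-thirds speed and low-pitched.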
Set your audio player’s sample rate to exactly 24000 Hz before enqueuing TTS samples.
If your playback pipeline is fixed at 16 kHz (to match VAD/ASR), resample the TTS output from 24 kHz to 16 kHz before playing.

Load Progress

Swift:

let tts = try await NeuTTSMultilingualPipeline(
    engines_path: "TheStageAI/neutts-multilingual",
    voice_id: "paul",
    on_load_progress: { p in
        print("[\(p.model)] \(p.phase) \(Int(p.fraction * 100))%")
    }
)

Flutter:

TheStageFlutterSDK.on_progress.listen((event) {
  if (event['model_name'] != 'tts') return;
  final phase    = event['phase']    as String?;
  // Accept int or double payloads, then print a whole-number percentage.
  final fraction = (event['progress'] as num?)?.toDouble();
  print('[tts] $phase ${((fraction ?? 0) * 100).round()}%');
});

await TheStageFlutterSDK.start_model(
  model_name: 'tts',
  engines_path: 'TheStageAI/neutts-multilingual',
  config: {'voice_id': 'paul', 'language': 'english'},
);

Prefetch Engines

let engines_dir = try await ai.prefetch_engines(
    repo_id: "TheStageAI/neutts-multilingual"
)

let tts = try await NeuTTSMultilingualPipeline(
    engines_path: engines_dir,
    voice_id: "paul"
)

Cleanup

Swift:

_ = try ai.stop_model(model_name: "tts")

Flutter:

await TheStageFlutterSDK.stop_model(model_name: 'tts');