TTS (Text-to-Speech)

Overview

On-device neural text-to-speech with batch and push-based streaming. Two public pipelines:

  • NeuTTSMultilingualPipeline — Qwen3-based, 9 languages.

  • NeuTTSNanoPipeline — phoneme-based, English only, faster.

Flutter consumers go through the singleton start_model + infer / infer_stream (JSON) path; there is no direct TTS pipeline constructor in Dart. Both surfaces share the same on-disk cache and response shape.

Supported Models

| Model | HF Repo | Languages | Architecture |
|---|---|---|---|
| NeuTTS Multilingual | TheStageAI/neutts-multilingual | English, French, German, Spanish, Portuguese, Japanese, Korean, Chinese, Urdu | Qwen3-based |
| NeuTTS Nano | TheStageAI/neutts-nano | English only | Phoneme-based |

API Reference

Full Constructor

NeuTTSMultilingualPipeline:

let tts = try await NeuTTSMultilingualPipeline(
    engines_path: "TheStageAI/neutts-multilingual",
    voice_id: "paul",
    language: "english",
    device: "npu",
    devices: nil,
    revision: "main",
    on_load_progress: nil
)

NeuTTSNanoPipeline:

let nano = try await NeuTTSNanoPipeline(
    engines_path: "TheStageAI/neutts-nano",
    voice_id: "dave"
)

Inputs / Outputs

| Direction | Field | Type | Description |
|---|---|---|---|
| input | text | String | Text to synthesize. |
| input | config.temperature | Double? | Sampling temperature (voice default if nil). |
| input | config.top_k | Int? | Top-k sampling (voice default if nil). |
| input | config.seed | UInt64? | Seed for deterministic sampling. |
| input | config.return_debug_info | Bool (default false) | Attach decoder traces. |
| output | TTSResult.samples | [Float] | 24 kHz mono PCM, samples in [-1.0, 1.0]. |
| output | TTSResult.sample_rate | Int | Always 24000. |
| output | TTSResult.duration | Double | Seconds of audio. |
| output | TTSResult.rtf | Double | Real-time factor (duration / wall time). |
| output | TTSResult.tokens_per_second | Double | Decode speed. |
| output | TTSResult.debug_info | TTSDebugInfo? | Only set if return_debug_info is true. |
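
The relationship between samples, sample_rate, duration, and rtf can be sketched with plain arithmetic. This is a minimal sketch that assumes rtf is computed exactly as the table states (audio duration divided by wall time); audioDuration and realTimeFactor are hypothetical helpers, not SDK functions.

```swift
/// Hypothetical helper (not part of the SDK): seconds of audio in a mono PCM buffer.
func audioDuration(sampleCount: Int, sampleRate: Int) -> Double {
    precondition(sampleRate > 0, "sample rate must be positive")
    return Double(sampleCount) / Double(sampleRate)
}

/// Real-time factor as defined in the table: audio duration / wall time.
/// Under this definition, values above 1.0 mean synthesis ran faster than playback.
func realTimeFactor(duration: Double, wallTime: Double) -> Double {
    precondition(wallTime > 0, "wall time must be positive")
    return duration / wallTime
}

// 48,000 samples at 24 kHz are 2.0 s of audio; synthesizing them in 0.5 s gives 4.0.
let duration = audioDuration(sampleCount: 48_000, sampleRate: 24_000)
let rtf = realTimeFactor(duration: duration, wallTime: 0.5)
print(duration, rtf)
```

A quick sanity check on any result is that samples.count divided by sample_rate should match the reported duration.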

Streaming Hyperparameters

| Field | Default | Description |
|---|---|---|
| frames_per_chunk | 25 | Codec frames decoded per emitted audio chunk after the first. |
| first_frames_per_chunk | 25 | Frames in the first chunk; smaller value lowers time-to-first-audio. |
| lookforward | 5 | Future frames decoded together with each chunk. |
| lookback | 50 | Past frames re-decoded for context. |
| overlap_frames | 1 | Frames of crossfade between consecutive chunks. |

Swift:

let streamer = tts.open_streamer(
    config: TTSStreamConfig(
        frames_per_chunk: 25,
        first_frames_per_chunk: 12,
        lookforward: 5,
        lookback: 50,
        overlap_frames: 1
    )
)

Flutter:

final stream = TheStageFlutterSDK.infer_stream(
  model_name: 'tts',
  input_json: {
    'text': 'Hello, world.',
    'stream_config': {
      'frames_per_chunk': 25,
      'first_frames_per_chunk': 12,
      'lookforward': 5,
      'lookback': 50,
      'overlap_frames': 1,
    },
  },
);

Audio Output

  • 24 kHz mono [Float], samples in [-1.0, 1.0].

  • Batch: TTSResult.samples is the full utterance.

  • Streaming: each chunk is a short PCM slice covering frames_per_chunk codec frames (first_frames_per_chunk for the first chunk), joined to its neighbors with overlap-add crossfading.

  • If your playback path runs at 16 kHz to match VAD/ASR, resample the TTS output from 24 kHz down to 16 kHz before playing it.
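
The streamer applies overlap-add crossfading for you, so the following is only an illustration of the technique and not the SDK's implementation. It is a minimal sketch of a linear crossfade that works on raw samples, whereas overlap_frames is expressed in codec frames; crossfade is a hypothetical helper name.

```swift
/// Linear overlap-add: fades out the tail of `previous` while fading in the head of `next`.
/// The result is shorter than the two inputs combined by the number of overlapped samples.
func crossfade(_ previous: [Float], _ next: [Float], overlap: Int) -> [Float] {
    let n = min(overlap, previous.count, next.count)
    guard n > 0 else { return previous + next }      // no overlap: plain concatenation

    var out = Array(previous.dropLast(n))
    let tail = previous.suffix(n)
    for (i, (a, b)) in zip(tail, next.prefix(n)).enumerated() {
        let w = Float(i + 1) / Float(n + 1)           // fade-in weight, strictly between 0 and 1
        out.append(a * (1 - w) + b * w)
    }
    out.append(contentsOf: next.dropFirst(n))
    return out
}
```

With an overlap of zero the two chunks are simply butted together, which is where audible clicks at chunk boundaries tend to come from.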

Singleton API

try await ai.start_model(
    model_name: "tts",
    engines_path: "TheStageAI/neutts-multilingual",
    config: ["voice_id": "paul", "language": "english"]
)

let json = try ai.infer(
    model_name: "tts",
    input_json: [
        "text": "Hello, world!",
        "temperature": 1.0,
        "top_k": 50
    ]
)
let audio = json[0]["audio"] as! [Float]

JSON response keys: audio, sample_rate, duration, tokens_per_second, rtf, debug_info.

Usage Guides

Synthesizing speech from text (batch)

Use batch synthesis when you already have the complete text and do not need real-time playback. The pipeline synthesizes the entire utterance in one pass and returns the full audio buffer — ideal for notifications, pre-recorded prompts, or any scenario where latency to first audio is not critical.

Swift — direct constructor (recommended):

import TheStageSDK

let ai = TheStageAI.shared
try await ai.initialize(apiToken: "th_…")

let tts = try await NeuTTSMultilingualPipeline(
    engines_path: "TheStageAI/neutts-multilingual",
    voice_id: "paul",
    language: "english"
)

let result = tts.infer(text: "Hello, world!")
let audio = result.samples
let sample_rate = result.sample_rate

The English-only Nano variant follows the same shape:

let tts = try await NeuTTSNanoPipeline(
    engines_path: "TheStageAI/neutts-nano",
    voice_id: "dave"
)

Note

The returned samples array is raw 24 kHz mono PCM. To hear it you must feed it into an audio player configured for 24000 Hz — for example AVAudioPlayerNode on iOS or any PCM-capable playback sink. Do not assume the system default sample rate matches.
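
For offline inspection it can be handy to dump the samples to a WAV file and open it in any audio editor. The sketch below is one way to do that with Foundation only; wavData is a hypothetical helper, not an SDK function, and it assumes mono Float samples in [-1.0, 1.0] as described above.

```swift
import Foundation

/// Hypothetical helper (not part of the SDK): encodes mono Float PCM in [-1.0, 1.0]
/// as a 16-bit little-endian WAV file image.
func wavData(samples: [Float], sampleRate: Int = 24_000) -> Data {
    let bytesPerSample = 2
    let dataSize = UInt32(samples.count * bytesPerSample)
    var data = Data()

    // Appends an integer in little-endian byte order.
    func append<T: FixedWidthInteger>(_ value: T) {
        withUnsafeBytes(of: value.littleEndian) { data.append(contentsOf: $0) }
    }

    data.append(contentsOf: Array("RIFF".utf8))
    append(UInt32(36) + dataSize)                   // RIFF chunk size
    data.append(contentsOf: Array("WAVE".utf8))
    data.append(contentsOf: Array("fmt ".utf8))
    append(UInt32(16))                              // fmt chunk size
    append(UInt16(1))                               // format tag: integer PCM
    append(UInt16(1))                               // channels: mono
    append(UInt32(sampleRate))
    append(UInt32(sampleRate * bytesPerSample))     // byte rate
    append(UInt16(bytesPerSample))                  // block align
    append(UInt16(16))                              // bits per sample
    data.append(contentsOf: Array("data".utf8))
    append(dataSize)

    for sample in samples {
        let clamped = max(-1.0, min(1.0, sample))   // guard against overshoot
        append(Int16((clamped * 32767).rounded()))
    }
    return data
}
```

Typical usage would be `try wavData(samples: result.samples).write(to: fileURL)`, where fileURL is a location of your choosing.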

Flutter — JSON path:

import 'package:thestage_apple_sdk/thestage_apple_sdk.dart';
import 'dart:typed_data';

await TheStageFlutterSDK.initialize(api_token: 'th_…');

await TheStageFlutterSDK.start_model(
  model_name: 'tts',
  engines_path: 'TheStageAI/neutts-multilingual',
  config: {'voice_id': 'paul', 'language': 'english'},
);

final result = await TheStageFlutterSDK.infer(
  model_name: 'tts',
  input_json: {'text': 'Hello, world!'},
);
final audio       = result[0]['audio']       as Float32List;
final sampleRate  = result[0]['sample_rate'] as int;

Real-time streaming TTS

When you want the user to hear audio as soon as possible — without waiting for the full utterance to finish synthesizing — use the streaming interface. This is essential for voice assistants, read-aloud features, and any interactive use case where perceived latency matters.

The streaming API uses a concurrent producer/consumer pattern. You push text into the streamer with send() while simultaneously draining audio chunks from streamer.output. These two operations must run concurrently: if you wait until all text is sent before reading output, the internal audio buffers fill up, synthesis stalls, and you get unnecessary delays or even a deadlock.

Swift:

let streamer = tts.open_streamer()

let consumer = Task {
    for await chunk in streamer.output {
        if let pcm = chunk.audio { player.enqueue(pcm) }
    }
}

streamer.send("Hello, world. ")
streamer.send("This sentence streams as it synthesizes.")
streamer.stop_stream()
await consumer.value

Attention

Always start the consumer task before calling send(). If you call send() first without a concurrent reader, audio buffers back up and synthesis stalls.

Note

If you already have the full text up-front, infer_stream(text:) does the same thing in a single call.

Flutter:

const streamId = 'tts-utterance-1';
final player = TheStageAudioPlayer(sampleRate: 24000)..start();

final consumer = () async {
  final stream = TheStageFlutterSDK.infer_stream(
    model_name: 'tts',
    input_json: {'text': ''},
    stream_id: streamId,
  );
  await for (final chunk in stream) {
    final audio = chunk['audio'] as Float32List?;
    if (audio != null && audio.isNotEmpty) player.enqueue(audio);
    if (chunk['is_final'] == true) break;
  }
}();

await TheStageFlutterSDK.send(stream_id: streamId, text: 'Hello, world. ');
await TheStageFlutterSDK.send(
  stream_id: streamId,
  text: 'This sentence streams as it synthesizes.',
);
await TheStageFlutterSDK.finish_stream(stream_id: streamId);
await consumer;

Choosing between Multilingual and Nano models

The two pipelines target different trade-offs:

| | NeuTTS Nano | NeuTTS Multilingual |
|---|---|---|
| Architecture | Phoneme-based encoder/decoder | Qwen3-based language model |
| Languages | English only | 9 languages |
| Latency | Lower (lighter model, faster decode) | Higher (heavier model) |
| Quality | Good for English | Higher naturalness, better prosody control |

Use Nano for English-only apps where speed and memory footprint matter — for example a real-time voice assistant on older devices.

Use Multilingual when you need multi-language support, higher voice quality, or more expressive prosody. It handles code-switching (mixed-language text) better because the Qwen3 backbone understands linguistic context.

Switching voices

Both pipelines accept a voice_id at construction time. Each voice is a directory under voices/{voice_id}/ in the model bundle containing the speaker embedding and configuration.

let tts = try await NeuTTSMultilingualPipeline(
    engines_path: "TheStageAI/neutts-multilingual",
    voice_id: "dave",
    language: "english"
)

Available voice presets:

  • paul — neutral male (default for Multilingual)

  • dave — neutral male (default for Nano)

Note

You can add custom voices by placing a compatible speaker-embedding directory under voices/ in your local engine cache. Refer to the voice-cloning guide for details.

Piping LLM output directly into TTS (voice assistant pattern)

The most common pattern for building a voice assistant is: user speaks → ASR transcribes → LLM generates a reply → TTS speaks the reply. Because LLM output arrives token-by-token, you want to pipe it into the TTS streamer in real time so the user hears audio before the full response is generated.

The TTS streamer handles sentence segmentation internally — you can push partial text (even individual tokens) and the streamer will buffer until it has a complete sentence, then begin synthesis.
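
To build intuition for what "buffer until a complete sentence" means, here is a minimal sketch of such a buffer. It is not the SDK's segmentation logic: SentenceBuffer is a hypothetical type, and the terminator set (".", "!", "?") is an assumption that would, for example, split on decimal points.

```swift
import Foundation

/// Illustrative sentence buffer; the terminator set is an assumption, not SDK behavior.
struct SentenceBuffer {
    private var pending = ""
    private let terminators: Set<Character> = [".", "!", "?"]

    /// Appends a fragment and returns every sentence completed by it.
    mutating func push(_ fragment: String) -> [String] {
        var completed: [String] = []
        for character in fragment {
            pending.append(character)
            if terminators.contains(character) {
                let sentence = pending.trimmingCharacters(in: .whitespaces)
                if !sentence.isEmpty { completed.append(sentence) }
                pending = ""
            }
        }
        return completed
    }

    /// Returns whatever is left when the stream ends, even without a terminator.
    mutating func flush() -> String? {
        let rest = pending.trimmingCharacters(in: .whitespaces)
        pending = ""
        return rest.isEmpty ? nil : rest
    }
}
```

Because the streamer does this for you, application code only needs to forward tokens as they arrive and signal the end of the stream.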

Swift:

let streamer = tts.open_streamer()

let consumer = Task {
    for await chunk in streamer.output {
        if let pcm = chunk.audio { player.enqueue(pcm) }
    }
}

for await token in llm.stream(prompt: userQuery) {
    streamer.send(token)
}
streamer.stop_stream()
await consumer.value

Flutter:

const streamId = 'voice-reply';
final player = TheStageAudioPlayer(sampleRate: 24000)..start();

final consumer = () async {
  final stream = TheStageFlutterSDK.infer_stream(
    model_name: 'tts',
    input_json: {'text': ''},
    stream_id: streamId,
  );
  await for (final chunk in stream) {
    final audio = chunk['audio'] as Float32List?;
    if (audio != null && audio.isNotEmpty) player.enqueue(audio);
    if (chunk['is_final'] == true) break;
  }
}();

await for (final token in llmStream) {
  await TheStageFlutterSDK.send(stream_id: streamId, text: token);
}
await TheStageFlutterSDK.finish_stream(stream_id: streamId);
await consumer;

Adjusting audio quality vs latency

Two axes control the quality-latency trade-off:

Voice quality — controlled by sampling parameters at inference time:

  • temperature — higher values (e.g. 1.2) add expressiveness but may introduce artifacts. Lower values (e.g. 0.7) produce more stable but flatter speech.

  • top_k — restricts the token pool at each decode step. Lower values (e.g. 20) are more conservative; higher values (e.g. 80) give more variation.

let result = tts.infer(
    text: "Hello!",
    config: TTSConfig(temperature: 0.8, top_k: 30)
)
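
To make the effect of temperature, top_k, and seed concrete, the following sketch shows generic top-k sampling with temperature over a list of logits. It illustrates the general technique only and is not the SDK's decoder; SeededGenerator and sampleTopK are hypothetical names introduced for this example.

```swift
import Foundation

/// Small deterministic generator (SplitMix64) so a fixed seed reproduces the same draws.
struct SeededGenerator: RandomNumberGenerator {
    private var state: UInt64
    init(seed: UInt64) { state = seed }
    mutating func next() -> UInt64 {
        state &+= 0x9E37_79B9_7F4A_7C15
        var z = state
        z = (z ^ (z >> 30)) &* 0xBF58_476D_1CE4_E5B9
        z = (z ^ (z >> 27)) &* 0x94D0_49BB_1331_11EB
        return z ^ (z >> 31)
    }
}

/// Generic top-k sampling with temperature over raw logits; returns a token index.
func sampleTopK<G: RandomNumberGenerator>(
    logits: [Double], temperature: Double, topK: Int, using rng: inout G
) -> Int {
    precondition(!logits.isEmpty && temperature > 0 && topK > 0)
    // Keep only the k highest-scoring candidates.
    let candidates = logits.enumerated()
        .sorted { $0.element > $1.element }
        .prefix(topK)
    // Softmax over the survivors, with logits divided by the temperature.
    let maxLogit = candidates[0].element
    let weights = candidates.map { exp(($0.element - maxLogit) / temperature) }
    // Draw from the resulting categorical distribution.
    var threshold = Double.random(in: 0..<weights.reduce(0, +), using: &rng)
    for (candidate, weight) in zip(candidates, weights) {
        threshold -= weight
        if threshold < 0 { return candidate.offset }
    }
    return candidates[candidates.count - 1].offset
}
```

A lower temperature sharpens the weights toward the top candidate, a smaller topK removes unlikely candidates outright, and reusing the same seed repeats the same sequence of draws.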

Streaming latency — controlled by TTSStreamConfig:

  • first_frames_per_chunk is the most impactful knob. It sets how many codec frames must be decoded before the first audio chunk is emitted. Lower values give faster first audio, but the first chunk is shorter, so the utterance is split into more decoder calls (slightly more total compute).

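let streamer = tts.open_streamer(
    config: TTSStreamConfig(
        // Swift requires labeled arguments in declaration order; this matches
        // the order used in the full TTSStreamConfig example.
        frames_per_chunk: 25,
        first_frames_per_chunk: 8
    )
)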

Concrete guidance:

  • For the lowest perceived latency (voice assistant), set first_frames_per_chunk to 6–10.

  • For smoother playback with less overhead, use the default of 25.

  • overlap_frames controls crossfade between consecutive chunks. Increase from 1 to 2–3 if you hear clicks at chunk boundaries.
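
The guidance above comes down to simple arithmetic: a chunk cannot be emitted until its frames exist, so a smaller first chunk means less audio has to be generated before playback starts. The sketch below makes that concrete. The codec frame rate is not stated in this document, so framesPerSecond is a placeholder assumption (50 is used purely for illustration), and chunkSeconds is a hypothetical helper rather than an SDK call.

```swift
/// Hypothetical helper: seconds of audio covered by `frames` codec frames.
/// `framesPerSecond` is an assumed codec frame rate; substitute the real value for your model.
func chunkSeconds(frames: Int, framesPerSecond: Double) -> Double {
    precondition(framesPerSecond > 0, "frame rate must be positive")
    return Double(frames) / framesPerSecond
}

let assumedFrameRate = 50.0   // placeholder, not a documented SDK constant
for frames in [6, 10, 25] {
    let seconds = chunkSeconds(frames: frames, framesPerSecond: assumedFrameRate)
    print("first_frames_per_chunk \(frames): \(seconds) s of audio in the first chunk")
}
```

Whatever the real frame rate is, the ratio holds: a first chunk of 8 frames carries roughly a third of the audio of the default 25, so it is ready correspondingly sooner.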

Voices and Languages

Voices live under voices/{voice_id}/ inside the bundle. The multilingual model supports:

english, french, german, spanish, portuguese, japanese, korean, chinese, urdu

The Nano variant is English-only and ignores the language parameter.

Troubleshooting

No audio output / empty samples array

  • Verify the input text is not empty or whitespace-only.

  • Confirm the voice_id you passed matches a directory that exists under voices/ in the engine bundle.

  • Pass return_debug_info: true in the config and inspect the returned debug_info — it contains decoder traces showing whether tokens were generated.

  • If using the singleton API, make sure start_model completed successfully before calling infer.
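
The first two checks in this list can be automated before calling infer. This is a minimal sketch using Foundation only; hasSpeakableText and voiceExists are hypothetical helpers, and the bundle URL is assumed to be the local engine directory (for example the path returned by prefetch_engines).

```swift
import Foundation

/// Hypothetical helper: true if the text contains anything other than whitespace.
func hasSpeakableText(_ text: String) -> Bool {
    !text.trimmingCharacters(in: .whitespacesAndNewlines).isEmpty
}

/// Hypothetical helper: true if voices/{voiceID}/ exists as a directory in the bundle.
func voiceExists(_ voiceID: String, inBundleAt bundleURL: URL) -> Bool {
    var isDirectory: ObjCBool = false
    let path = bundleURL
        .appendingPathComponent("voices")
        .appendingPathComponent(voiceID)
        .path
    return FileManager.default.fileExists(atPath: path, isDirectory: &isDirectory)
        && isDirectory.boolValue
}
```

Running both checks up front turns a silent empty result into an explicit, actionable error in your own code.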

Audio sounds robotic or choppy during streaming

  • Check that overlap_frames is at least 1. Setting it to 0 disables crossfading between chunks, causing audible clicks at boundaries.

  • Ensure you are draining streamer.output concurrently with send() — not sequentially after all text is sent. Sequential reads cause buffers to fill, which stalls the decoder and produces irregular chunk timing.

  • If individual chunks sound distorted, try increasing frames_per_chunk to give the decoder more context per chunk.

Wrong language pronunciation

  • On NeuTTSMultilingualPipeline, set the language parameter to match your input text (e.g. "french"). If omitted it defaults to English, which produces incorrect phonemization for other languages.

  • NeuTTSNanoPipeline only supports English. If you need other languages, switch to the Multilingual pipeline.

High latency before first audio

  • Switch from batch (infer) to streaming (open_streamer). Batch mode waits for the full utterance to finish before returning any audio.

  • Lower first_frames_per_chunk (e.g. to 6–10). This is the primary control for time-to-first-audio.

  • Use prefetch_engines at app startup to pre-download the model so that the first start_model call does not include a network download.

Audio plays at wrong speed

  • TTS outputs 24 kHz mono PCM. If your audio player is configured for a different sample rate, playback speed and pitch shift: at 44100 Hz the audio plays too fast and too high, at 16000 Hz too slow and too low.

  • Set your audio player’s sample rate to exactly 24000 Hz before enqueuing TTS samples.

  • If your playback pipeline is fixed at 16 kHz (to match VAD/ASR), resample the TTS output from 24 kHz to 16 kHz before playing.
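
For the 24 kHz to 16 kHz case, a rough sketch of a resampler is shown below. It uses plain linear interpolation with no low-pass filter, so it is adequate for quick experiments but can alias; resampleLinear is a hypothetical helper, not an SDK function.

```swift
/// Hypothetical helper: linear-interpolation resampler for mono PCM.
/// No low-pass filter is applied, so content above the new Nyquist limit can alias.
func resampleLinear(_ input: [Float], from sourceRate: Int, to targetRate: Int) -> [Float] {
    precondition(sourceRate > 0 && targetRate > 0, "sample rates must be positive")
    guard input.count > 1, sourceRate != targetRate else { return input }

    let ratio = Double(targetRate) / Double(sourceRate)
    let outputCount = Int((Double(input.count) * ratio).rounded(.down))
    let step = Double(sourceRate) / Double(targetRate)   // input samples per output sample

    return (0..<outputCount).map { index -> Float in
        let position = Double(index) * step
        let left = min(Int(position), input.count - 1)
        let right = min(left + 1, input.count - 1)
        let fraction = Float(position - Double(left))
        // Blend the two nearest input samples.
        return input[left] * (1 - fraction) + input[right] * fraction
    }
}
```

On Apple platforms, AVAudioConverter is the usual choice when production-quality resampling is needed.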

Load Progress

Swift:

let tts = try await NeuTTSMultilingualPipeline(
    engines_path: "TheStageAI/neutts-multilingual",
    voice_id: "paul",
    on_load_progress: { p in
        print("[\(p.model)] \(p.phase) \(Int(p.fraction * 100))%")
    }
)

Flutter:

TheStageFlutterSDK.on_progress.listen((event) {
  if (event['model_name'] != 'tts') return;
  final phase    = event['phase']    as String?;
  final fraction = event['progress'] as double?;
  // Round to a whole percentage, matching the Swift example's Int(...) formatting.
  print('[tts] $phase ${((fraction ?? 0) * 100).round()}%');
});

await TheStageFlutterSDK.start_model(
  model_name: 'tts',
  engines_path: 'TheStageAI/neutts-multilingual',
  config: {'voice_id': 'paul', 'language': 'english'},
);

Prefetch Engines

let engines_dir = try await ai.prefetch_engines(
    repo_id: "TheStageAI/neutts-multilingual"
)

let tts = try await NeuTTSMultilingualPipeline(
    engines_path: engines_dir,
    voice_id: "paul"
)

Cleanup

Swift:

_ = try ai.stop_model(model_name: "tts")

Flutter:

await TheStageFlutterSDK.stop_model(model_name: 'tts');