Voice Agent

Overview

End-to-end on-device voice assistant: VAD → STT → LLM → TTS, with neural end-of-turn detection (the smart-turn-v3 model), interruption handling (barge-in), streaming transcription (live partial captions), and sentence-level TTS streaming for sub-second time-to-first-audio.
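
Sentence-level TTS streaming can be pictured as a small buffer between the LLM and the synthesizer. The sketch below is only an illustration under assumptions: `SentenceChunker` is a hypothetical helper, not an SDK type, and it treats ".", "!" or "?" followed by whitespace as a sentence boundary.

```swift
// Hypothetical helper for illustration; not part of TheStageSDK.
struct SentenceChunker {
    private var buffer = ""
    private var endsWithTerminator = false
    private let terminators: Set<Character> = [".", "!", "?"]

    /// Feed one streamed delta; returns the sentences it completed, in order.
    mutating func push(_ delta: String) -> [String] {
        var sentences: [String] = []
        for ch in delta {
            if ch.isWhitespace && endsWithTerminator {
                // Whitespace right after ".", "!" or "?" closes the sentence.
                sentences.append(buffer)
                buffer = ""
                endsWithTerminator = false
                continue
            }
            if buffer.isEmpty && ch.isWhitespace { continue } // skip leading whitespace
            buffer.append(ch)
            endsWithTerminator = terminators.contains(ch)
        }
        return sentences
    }

    /// Call once the response stream ends to release any trailing text.
    mutating func flush() -> String? {
        while let last = buffer.last, last.isWhitespace { buffer.removeLast() }
        let rest = buffer
        buffer = ""
        endsWithTerminator = false
        return rest.isEmpty ? nil : rest
    }
}
```

Each sentence returned by `push` could be handed to TTS immediately, which is why the first audio can start while the LLM is still generating the rest of the reply.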

State Machine

                     ┌────────────────────────────────────────┐
                     │  if config.wake_word == nil            │
idle → loading ─────►│  listening ⇄ thinking → speaking       │──► listening
                     │                                        │
                     │  else (wake-word configured)           │
                     │  sleeping ─WW─► listening ⇄ thinking   │──► speaking ──► sleeping
                     └────────────────────────────────────────┘

States
State	Meaning
`idle`	Models not loaded
`loading`	Models being downloaded / loaded
`sleeping`	Wake-word standby
`listening`	Mic open, VAD scanning for speech
`thinking`	Speech committed, LLM is generating
`speaking`	TTS streaming audio to the speaker
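
The diagram and table can be read as a transition function. The following is a rough sketch, not the SDK's implementation: `AgentSignal` and `nextState` are hypothetical names, and the trigger for the thinking → listening edge (the user resuming speech) is an assumption.

```swift
// Hypothetical model of the diagram above; names are illustrative, not SDK types.
enum AgentState { case idle, loading, sleeping, listening, thinking, speaking }

enum AgentSignal {
    case startRequested      // agent.start() was called
    case modelsLoaded        // all bundles are ready
    case wakeWordDetected    // "WW" edge in the diagram
    case turnCommitted       // end of user speech, request sent to the LLM
    case userResumed         // assumed trigger for thinking → listening
    case playbackStarted     // first TTS audio is playing
    case playbackEnded       // TTS finished or was interrupted
}

func nextState(from state: AgentState, on signal: AgentSignal, wakeWordConfigured: Bool) -> AgentState {
    // Where the agent waits between turns depends on the wake-word setting.
    let resting: AgentState = wakeWordConfigured ? .sleeping : .listening
    switch (state, signal) {
    case (.idle, .startRequested):       return .loading
    case (.loading, .modelsLoaded):      return resting
    case (.sleeping, .wakeWordDetected): return .listening
    case (.listening, .turnCommitted):   return .thinking
    case (.thinking, .userResumed):      return .listening
    case (.thinking, .playbackStarted):  return .speaking
    case (.speaking, .playbackEnded):    return resting
    default:                             return state   // ignore signals that do not apply
    }
}
```

The only difference between the two branches of the diagram is the resting state: `listening` without a wake word, `sleeping` with one.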

Usage Guides

Building a basic voice assistant

The quickest way to get a working voice loop is to wire up a cloud LLM provider, point the agent at the pre-trained VAD / STT / TTS bundles, and subscribe to the event stream. The agent handles the full listen → think → speak cycle automatically.

Swift

import TheStageSDK

let ai = TheStageAI.shared
try await ai.initialize(apiToken: "th_…")

let llm = TheStageOpenAICompatibleProvider(
    endpoint: "https://api.openai.com/v1/chat/completions",
    api_key: "sk-...",
    model: "gpt-4o-mini"
)

var config = TheStageAgentConfig(
    vad: "TheStageAI/silero-vad",
    stt: "TheStageAI/thewhisper-large-v3-turbo",
    tts: "TheStageAI/neutts-multilingual",
    llm: llm
)
config.system_prompt = "You are a helpful voice assistant. Keep replies short."

let agent = TheStageVoiceAgent(config: config)

Task {
    for await event in agent.events {
        switch event.kind {
        case .state_changed:    print("[STATE] \(event.data["state"] ?? "?")")
        case .user_request:     print("[YOU] \(event.data["text"] ?? "")")
        case .response_delta:   print(event.data["delta"] ?? "", terminator: "")
        case .response_done:    print("\n[ASSISTANT DONE]")
        case .error:            print("[ERROR] \(event.data["message"] ?? "")")
        default: break
        }
    }
}

Task {
    for await delta in agent.llm_deltas.recv() {
        // Append delta to a chat bubble
    }
}

try await agent.start()

The events stream delivers every state transition and content event. You only need to handle the kinds you care about — the agent keeps running regardless.

Flutter

import 'dart:io'; // provides stdout.write used below
import 'package:thestage_apple_sdk/thestage_apple_sdk.dart';

await TheStageFlutterSDK.initialize(api_token: 'th_…');

final agent = TheStageVoiceAgentFlutter();

agent.events.listen((event) {
  switch (event['kind']) {
    case 'state_changed': print('STATE: ${event['state']}');
    case 'user_request':  print('YOU: ${event['text']}');
    case 'response_delta': stdout.write(event['delta']);
    case 'response_done': print('\nASSISTANT DONE');
  }
});

agent.llmDeltas.listen((delta) { /* update assistant bubble */ });
agent.transcripts.listen((text) { /* show user turn */ });
agent.vadProbabilities.listen((p) { /* drive a level meter */ });

await agent.start(config: {
  'vad': 'TheStageAI/silero-vad',
  'stt': 'TheStageAI/thewhisper-large-v3-turbo',
  'tts': 'TheStageAI/neutts-multilingual',
  'llm_provider': 'openai_compatible',
  'llm_endpoint': 'https://api.openai.com/v1/chat/completions',
  'llm_api_key': 'sk-...',
  'llm_model': 'gpt-4o-mini',
  'system_prompt': 'You are a helpful voice assistant.',
});

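// Runtime controls; call these later (for example from UI handlers), not right after start().
await agent.interrupt();            // cut off the current reply
await agent.say('Welcome back!');   // speak a fixed phrase through TTS
await agent.stop();                 // shut the agent down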

Using an on-device LLM (fully offline)

TheStageLocalLLMProvider does not load the model itself. It streams through TheStageAI.shared.infer_stream(model_name:), so you must call start_model first and pass the same handle as model_path (Swift) or llm_model (Flutter).

Local replies use the LLM bundle’s generation defaults. The agent’s max_tokens and temperature settings are ignored for the local provider, so tune sampling on the LLM itself (see Generation parameters and Real-world recipes in LLM (Language Model)), not on TheStageAgentConfig.

try await TheStageAI.shared.start_model(
    model_name: "llm",
    engines_path: "TheStageAI/Qwen3-0.6B"   // or LFM2.5 / Gemma3 / local dir
)

let llm = TheStageLocalLLMProvider(model_path: "llm")

var config = TheStageAgentConfig(
    vad: "TheStageAI/silero-vad",
    stt: "TheStageAI/thewhisper-large-v3-turbo",
    tts: "TheStageAI/neutts-nano-multilingual",
    llm: llm
)
config.system_prompt = "You are a concise offline voice assistant."
config.auto_listen = false

let agent = TheStageVoiceAgent(config: config)
try await agent.start()
try await agent.begin_listening()

In Flutter, set llm_provider to "local" and set llm_model to the start_model handle (not a bare HF id, unless you also registered the model under that id):

await TheStageFlutterSDK.start_model(
  model_name: 'llm',
  engines_path: 'TheStageAI/Qwen3-0.6B',
  model_type: 'thestage_llm',
);

await agent.start(config: {
  'vad': 'TheStageAI/silero-vad',
  'stt': 'TheStageAI/thewhisper-large-v3-turbo',
  'tts': 'TheStageAI/neutts-multilingual',
  'llm_provider': 'local',
  'llm_model': 'llm',
  'auto_listen': false,
  'system_prompt': 'You are a concise offline voice assistant.',
});
await agent.beginListening();

Note

You may use the HF repo id as the handle (model_name / model_path / llm_model all "TheStageAI/Qwen3-0.6B") — keep them identical.

Handling barge-in (user interrupts the assistant)

Barge-in is when the user starts talking while the assistant is still speaking — for example, saying “stop” or asking a follow-up question before the current answer finishes. The agent detects this through VAD and immediately stops TTS playback so the user feels heard.

The behaviour is controlled by interrupt_mode (InterruptMode):

.vad — barge-in when sustained speech exceeds the VAD threshold for at least interrupt_min_speech_ms milliseconds (typical default).
.none — barge-in disabled; the user waits for the assistant to finish.
.vad_wake_word / .vad_speaker_id / .vad_speaker_id_wake_word — require wake-word and/or speaker verification during sustained speech.

You can tune sensitivity at runtime (legacy InterruptTrigger / .speech_only is accepted by update_interrupt_config for compatibility):

await agent.update_interrupt_config(
    min_speech_ms: 200,
    mode: .speech_only   // maps onto VAD speech barge-in
)

Lower interrupt_min_speech_ms makes barge-in more responsive but increases the risk of false triggers from background noise. Higher values require a more deliberate interruption.

Attention

On macOS, enabling .speech_only without hardware AEC will cause the agent to interrupt itself — its own TTS output is picked up by the microphone and interpreted as user speech. Either keep .none or use external headphones with a directional microphone.

Adding a wake word

By default the agent transitions straight from loading to listening and is always active. If you want the assistant to wait for a trigger phrase before it starts processing speech, configure a wake-word model.

With a wake word enabled, the state machine gains a sleeping state. The agent sits in low-power standby and only transitions to listening when it detects the wake phrase.

Swift

var config = TheStageAgentConfig(
    vad: "TheStageAI/silero-vad",
    stt: "TheStageAI/thewhisper-large-v3-turbo",
    tts: "TheStageAI/neutts-multilingual",
    llm: llm
)
config.wake_word = "TheStageAI/wake-word-hey-assistant"
config.ww_threshold_score = 0.5

Flutter

await agent.start(config: {
  'vad': 'TheStageAI/silero-vad',
  'stt': 'TheStageAI/thewhisper-large-v3-turbo',
  'tts': 'TheStageAI/neutts-multilingual',
  'llm_provider': 'openai_compatible',
  'llm_endpoint': 'https://api.openai.com/v1/chat/completions',
  'llm_api_key': 'sk-...',
  'llm_model': 'gpt-4o-mini',
  'wake_word': 'TheStageAI/wake-word-hey-assistant',
  'ww_threshold_score': 0.5,
});

After the assistant finishes speaking it returns to sleeping rather than listening, so the user must say the wake phrase again for each new conversation turn.

Building a chat UI on top of the agent

The agent exposes several real-time output channels you can bind to UI elements. These are independent of the events stream and deliver typed values instead of generic event dictionaries.

Channel	UI use case
`llm_deltas`	Append each token to the assistant’s chat bubble for a typing effect
`partial_transcripts`	Show a live “listening…” caption while the user is speaking
`transcripts`	Display the finalized user turn in the chat history
`vad_probabilities`	Drive a microphone level meter or voice-activity indicator

Swift

Task {
    for await delta in agent.llm_deltas.recv() {
        assistantBubble.text += delta
    }
}

Task {
    for await partial in agent.partial_transcripts.recv() {
        listeningLabel.text = partial
    }
}

Task {
    for await transcript in agent.transcripts.recv() {
        chatHistory.append(UserMessage(text: transcript))
    }
}

Task {
    for await probability in agent.vad_probabilities.recv() {
        micMeter.level = probability
    }
}

Flutter

agent.llmDeltas.listen((delta) {
  setState(() => assistantText += delta);
});

agent.transcripts.listen((text) {
  setState(() => chatHistory.add(UserMessage(text)));
});

agent.vadProbabilities.listen((p) {
  setState(() => micLevel = p);
});

Note

Flutter equivalents: agent.llmDeltas, agent.transcripts, agent.vadProbabilities (Stream types).

Running in background on iOS

The voice agent needs the microphone and audio output to stay active when the app moves to the background. Without the audio background mode, iOS suspends the app and the agent stops listening.

Add audio to UIBackgroundModes in your Info.plist:

<key>UIBackgroundModes</key>
<array>
    <string>audio</string>
</array>

Note

With NPU defaults, all pipelines keep running while the app is backgrounded.

The UIBackgroundModes entry is sufficient for the agent to continue operating when the user switches apps or locks the screen. No code changes are needed: the audio session category (set to .playback by default in AudioStreamConfig) keeps the session alive.

Events

The agent communicates state changes and content through an event stream. Subscribe to agent.events (Swift) or agent.events.listen (Flutter) to drive your UI or trigger application logic.

Kind	Data Keys	When
`state_changed`	`state`	State transition
`user_request_partial`	`text`	Stable partial caption mid-turn
`user_request`	`text`, `source`	User request finalized
`response_delta`	`delta`	LLM token arrived
`response_done`	`text`, `reason`, `interrupted`	Response stream finished
`playback_started`	—	First TTS sample reached speaker
`playback_ended`	`reason`	Speaker stopped
`metrics`	`loading_model`, …	Heartbeat metrics
`error`	`message`	Recoverable error
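
As a rough illustration of consuming these events, the sketch below folds them into a plain chat log. It is self-contained and makes assumptions: `AgentEvent` is a hypothetical stand-in for the SDK's event type, the payload is modelled as `[String: String]`, and `interrupted` is assumed to arrive as the string "true".

```swift
// Hypothetical stand-in for the SDK event; the real payload types may differ.
struct AgentEvent {
    let kind: String
    let data: [String: String]
}

struct ChatLog {
    private(set) var lines: [String] = []
    private var pending = ""   // assistant text accumulated from response_delta

    mutating func apply(_ event: AgentEvent) {
        switch event.kind {
        case "user_request":
            lines.append("You: " + (event.data["text"] ?? ""))
        case "response_delta":
            pending += event.data["delta"] ?? ""
        case "response_done":
            // Prefer the final text if present, otherwise use the accumulated deltas.
            let full = event.data["text"] ?? pending
            let suffix = event.data["interrupted"] == "true" ? " [interrupted]" : ""
            lines.append("Assistant: " + full + suffix)
            pending = ""
        default:
            break   // kinds the UI does not care about are ignored
        }
    }
}
```

Unhandled kinds such as `metrics` or `playback_started` simply fall through, which mirrors the `default: break` pattern in the usage guide.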

API Reference

All configuration is passed through TheStageAgentConfig (Swift) or the config dictionary (Flutter). The tables below group related settings. Most defaults work well out of the box — you typically only need to set the model paths and an LLM provider to get started.

Models

Which model bundles to load. vad and stt are required for the agent to function. tts is optional if you only need speech-to-text. wake_word enables standby mode.

Field	Type	Default	Description
`vad`	String	required	HF id or local path of Silero VAD bundle
`stt`	String	required	HF id or local path of Whisper bundle
`tts`	String?	nil	HF id or local path of NeuTTS bundle
`tts_voice`	String	"paul"	Voice preset id
`wake_word`	String?	nil	Optional wake-word bundle
`stt_language`	String	"en"	Whisper decode language

Compute Device Routing

Controls which hardware accelerator each model runs on. The default "npu" uses the Apple Neural Engine, which is the fastest option on supported devices. Change to "gpu" or "cpu" only if you need to debug or the model does not support the Neural Engine.

Field	Type	Default	Description
`vad_device`	String	"npu"	Silero VAD compute device
`stt_device`	String	"npu"	Default compute device for all Whisper modules
`stt_devices`	[String:String]?	nil	Per-module device overrides for Whisper
`tts_device`	String	"npu"	Default compute device for all NeuTTS modules
`tts_devices`	[String:String]?	nil	Per-module device overrides for NeuTTS
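
One plausible reading of the coarse default plus per-module override pair is sketched below. `resolveDevice` is a hypothetical function, and module names such as "decoder" are placeholders, since the actual module keys are not listed here.

```swift
// Hypothetical resolution rule; module names such as "decoder" are placeholders.
func resolveDevice(for module: String, coarseDefault: String, overrides: [String: String]?) -> String {
    // A per-module entry wins; anything not listed falls back to the coarse default.
    return overrides?[module] ?? coarseDefault
}
```

Under this reading, setting `stt_device` to "npu" and overriding a single module to "gpu" would move only that module off the Neural Engine.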

LLM

The language model that generates the assistant’s responses. Choose between a cloud provider (TheStageOpenAICompatibleProvider) for maximum quality or a local model (TheStageLocalLLMProvider) for offline / privacy use cases.

Field	Type	Default	Description
`llm`	TheStageLLMProvider	required	Local or remote LLM (Swift)
`llm_provider`	String	required	`"local"` or `"openai_compatible"` (Flutter)
`llm_model`	String	required (Flutter)	For `"local"`: `start_model` handle. For cloud: remote model id
`llm_endpoint` / `llm_api_key`	String	—	Required for `"openai_compatible"`
`system_prompt`	String	helpful default	System message
`max_tokens`	Int	256	Soft cap for remote providers; ignored by local LLM
`temperature`	Double	0.7	Remote sampling; ignored by local LLM
`chat_memory`	TheStageChatMemory	SlidingWindowMemory(max_turns: 10)	History strategy
`auto_listen`	Bool	`true`	If `false`, call `begin_listening` / `beginListening` after `start`
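
The default `chat_memory` is described as a sliding window of 10 turns. A minimal sketch of that idea follows, assuming a turn is one user message plus one assistant reply; `ChatTurn` and `SlidingWindow` are hypothetical types, not the SDK's `SlidingWindowMemory`.

```swift
// Hypothetical types; what the SDK counts as a "turn" is an assumption here.
struct ChatTurn: Equatable {
    let user: String
    let assistant: String
}

struct SlidingWindow {
    let maxTurns: Int
    private(set) var turns: [ChatTurn] = []

    init(maxTurns: Int) {
        self.maxTurns = maxTurns
    }

    /// Appends a turn and drops the oldest ones beyond the window.
    mutating func append(_ turn: ChatTurn) {
        turns.append(turn)
        if turns.count > maxTurns {
            turns.removeFirst(turns.count - maxTurns)
        }
    }
}
```

A bounded window keeps the prompt sent to the LLM from growing without limit, at the cost of forgetting older context.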

VAD / Endpointing

Voice Activity Detection determines when the user starts and stops speaking. vad_threshold and vad_onset_ms control how much evidence the system needs before it decides speech has begun. silence_timeout_ms controls how long the user must be silent before the turn is committed to the LLM.

Field	Type	Default	Description
`vad_threshold`	Double	0.8	Speech probability threshold
`vad_onset_ms`	Int	96	Sustained voiced duration to trigger onset
`silence_timeout_ms`	Int	608	Trailing silence to commit turn
`max_accumulation_ms`	Int	30000	Hard cap on single turn
`pre_roll_ms`	Int	200	Pre-roll captured before onset
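
How these fields interact can be sketched as a small frame-by-frame endpointer. This is an illustration under assumptions, not the SDK's code: `Endpointer` is a hypothetical type, a 32 ms VAD frame is assumed, and `pre_roll_ms` buffering is left out.

```swift
// Hypothetical endpointer; frame size and exact reset rules are assumptions.
struct Endpointer {
    enum Event { case noChange, onset, commit }

    var threshold = 0.8
    var onsetMs = 96
    var silenceTimeoutMs = 608
    var maxAccumulationMs = 30_000
    var frameMs = 32               // assumed VAD frame duration

    private var inSpeech = false
    private var voicedMs = 0       // consecutive voiced time before onset
    private var silenceMs = 0      // trailing silence inside a turn
    private var turnMs = 0         // total length of the current turn

    /// Feed one VAD probability per frame.
    mutating func push(probability: Double) -> Event {
        let voiced = probability >= threshold
        if !inSpeech {
            // Onset needs sustained voiced frames; a single unvoiced frame resets it.
            voicedMs = voiced ? voicedMs + frameMs : 0
            guard voicedMs >= onsetMs else { return .noChange }
            inSpeech = true
            turnMs = voicedMs
            silenceMs = 0
            return .onset
        }
        turnMs += frameMs
        silenceMs = voiced ? 0 : silenceMs + frameMs
        // Commit on enough trailing silence, or when the hard cap is reached.
        guard silenceMs >= silenceTimeoutMs || turnMs >= maxAccumulationMs else { return .noChange }
        inSpeech = false
        voicedMs = 0
        return .commit
    }
}
```

With 32 ms frames, the defaults work out to 3 voiced frames for onset (96 ms) and 19 silent frames to commit (608 ms).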

Turn Detection

By default the agent commits the user’s turn after a fixed silence timeout (.vad mode). Switching to .dnn mode uses a neural model (smart-turn-v3) that listens to the audio and predicts whether the user has actually finished speaking, even during short pauses. This reduces premature cut-offs for users who pause mid-sentence.

Field	Type	Default	Description
`turn_detection_mode`	enum	.vad	`.vad` or `.dnn`
`turn_detector`	String?	nil	smart-turn engines repo/path
`turn_eot_threshold`	Double	0.85	Completion prob threshold
`turn_eot_confirm_count`	Int	2	Consecutive done verdicts needed
`turn_pause_trigger_ms`	Int	256	Trailing silence before first model call
`turn_reeval_interval_ms`	Int	120	Re-run cadence on sustained pause
`turn_max_silence_ms`	Int	2000	Hard fallback
`turn_window_ms`	Int	8000	Trailing audio window fed to model
`turn_min_speech_ms`	Int	250	Minimum speech before consulting model
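
The interplay of `turn_eot_threshold`, `turn_eot_confirm_count` and `turn_max_silence_ms` can be sketched as follows. `TurnDecider` is a hypothetical type, and the behaviour (consecutive confirmations that reset on a "not done" verdict, plus a hard silence fallback) is an assumption drawn from the field descriptions.

```swift
// Hypothetical decision rule built from the field descriptions above.
struct TurnDecider {
    var eotThreshold = 0.85
    var confirmCount = 2
    var maxSilenceMs = 2000
    private var consecutiveDone = 0

    /// Call each time the end-of-turn model is re-run during a pause.
    mutating func shouldCommit(eotProbability: Double, silenceMs: Int) -> Bool {
        // Hard fallback: commit after long silence whatever the model says.
        if silenceMs >= maxSilenceMs {
            consecutiveDone = 0
            return true
        }
        // Count consecutive "done" verdicts; any "not done" verdict resets the count.
        if eotProbability >= eotThreshold {
            consecutiveDone += 1
        } else {
            consecutiveDone = 0
        }
        if consecutiveDone >= confirmCount {
            consecutiveDone = 0
            return true
        }
        return false
    }

    /// Call when the user starts speaking again before the turn was committed.
    mutating func speechResumed() {
        consecutiveDone = 0
    }
}
```

Requiring two consecutive verdicts trades a little latency (one extra `turn_reeval_interval_ms`) for fewer premature commits.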

Streaming Transcription (ASR)

Live partial captions let you show the user what the agent is hearing while they are still talking. Disable asr_streaming if you only need the final transcript.

Field	Type	Default	Description
`asr_streaming`	Bool	true	Emit live partial captions
`asr_partial_interval_ms`	Int	600	Minimum new audio between caption passes
`speculative_whisper`	Bool	true	Speculative full-utterance pass at first VAD pause

Interruption / AEC

Controls how the agent handles barge-in. On iOS, the hardware Acoustic Echo Cancellation (AEC) unit filters out the device’s own speaker output, so the agent can safely listen for user speech while it is talking. On macOS, AEC is unreliable — the default is .none to prevent self-interruption.

Field	Type	Default	Description
`interrupt_mode`	InterruptMode	.vad (legacy .speech_only) on iOS / .none on macOS	How the user can barge in
`interrupt_min_speech_ms`	Int	600	Sustained speech needed to interrupt
`interrupt_threshold`	Double	0.9	VAD prob threshold for barge-in
`interrupt_min_playback_ms`	Int	250	Grace at TTS turn start
`aec_enabled`	Bool	true	Voice Processing IO (iOS only)
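
A rough sketch of how the barge-in fields might combine is shown below. `BargeInGate` is a hypothetical type, a 32 ms VAD frame is assumed, and wake-word or speaker-ID checks are left out.

```swift
// Hypothetical barge-in gate; frame size and reset behaviour are assumptions.
struct BargeInGate {
    var threshold = 0.9        // interrupt_threshold
    var minSpeechMs = 600      // interrupt_min_speech_ms
    var minPlaybackMs = 250    // interrupt_min_playback_ms
    var frameMs = 32           // assumed VAD frame duration
    private var sustainedMs = 0

    /// Feed one VAD probability per frame while TTS is playing.
    /// Returns true once the user has spoken long enough to interrupt.
    mutating func push(probability: Double, playbackElapsedMs: Int) -> Bool {
        // Ignore everything during the grace period at the start of playback.
        guard playbackElapsedMs >= minPlaybackMs else {
            sustainedMs = 0
            return false
        }
        // Only uninterrupted speech above the threshold counts.
        sustainedMs = probability >= threshold ? sustainedMs + frameMs : 0
        return sustainedMs >= minSpeechMs
    }
}
```

Because the counter resets on any frame below the threshold, short noises are less likely to trigger an interruption than a lower `interrupt_min_speech_ms` alone would suggest.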

Wake-Word Standby

When a wake-word model is configured, the agent enters sleeping mode after each interaction and only wakes when the trigger phrase is detected. Leave wake_word as nil (the default) to skip standby entirely.

Field	Type	Default	Description
`wake_word`	String?	nil	HF id or local path of wake-word bundle
`ww_threshold_score`	Double	0.5	Wake-word classifier threshold
`ww_device`	String	"npu"	Wake-word compute device

Programmatic Controls

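agent.interrupt()                               // cut off the current reply
agent.say("Hi there!")                          // speak a fixed phrase through TTS
agent.send_request("What time is it?")          // submit a text request without speech
await agent.set_voice("dave")                   // switch the TTS voice preset
let history = await agent.history()             // read the stored conversation
await agent.clear_history()                     // forget the stored conversation
await agent.update_interrupt_config(min_speech_ms: 200, mode: .speech_only)   // retune barge-in
await agent.update_turn_config(eot_threshold: 0.6, pause_trigger_ms: 256)     // retune turn detection
await agent.stop()                              // shut the agent down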

Note

TheStageAI.infer and TheStageAI.infer_stream are nonisolated, so each node runs on independent tasks. Don’t wrap inference calls in Task { @MainActor in ... }.

Latency

Measured on an M-class Mac with OpenAI gpt-4o-mini:

Turn	LLM 1st Token	First Audio	Full Speak
Short reply	487 ms	521 ms	3.3 s
Long monologue	575 ms	601 ms	53.1 s
Mid-length	1226 ms	1497 ms	5.8 s
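
To see how much of the time-to-first-audio comes after the LLM starts responding, subtract the first-token time from the first-audio time. The snippet below does that arithmetic on the figures from the table; the variable names are illustrative.

```swift
// Figures copied from the latency table above.
let measurements: [(turn: String, firstTokenMs: Int, firstAudioMs: Int)] = [
    ("Short reply", 487, 521),
    ("Long monologue", 575, 601),
    ("Mid-length", 1226, 1497),
]

// Time between the first LLM token and the first audible sample.
let gapsMs = measurements.map { $0.firstAudioMs - $0.firstTokenMs }

for (row, gap) in zip(measurements, gapsMs) {
    print("\(row.turn): \(gap) ms from first token to first audio")
}
```

The gaps come out to 34 ms, 26 ms and 271 ms, so in these runs most of the wait before audio is the LLM's first-token latency.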

Troubleshooting

Agent keeps interrupting itself

This happens when the agent’s own TTS output is picked up by the microphone and interpreted as user speech, triggering a barge-in. The root cause is missing or ineffective Acoustic Echo Cancellation (AEC).

On macOS: set interrupt_mode to .none. macOS does not have hardware-level AEC, so the default is already .none — if you overrode it to .speech_only, revert or use headphones.

On iOS: verify that aec_enabled is true (the default). If you disabled it, re-enable it. If self-interruption still occurs with AEC enabled, increase interrupt_min_speech_ms (e.g. to 800) so the agent requires a longer burst of detected speech before it triggers a barge-in.

Agent responds too slowly or cuts off the user

These are two sides of the same coin — turn detection timing.

Cuts off the user mid-sentence: the agent commits the turn too eagerly. Switch from .vad to .dnn turn detection mode so the neural model can distinguish between a pause and the end of the utterance:

config.turn_detection_mode = .dnn
config.turn_detector = "TheStageAI/smart-turn-v3"

If you stay in .vad mode, increase silence_timeout_ms (default 608 ms) to give the user more breathing room.

Responds too slowly: the agent waits too long after the user finishes. In .dnn mode, lower turn_eot_threshold (default 0.85) so the model commits sooner. In .vad mode, decrease silence_timeout_ms.

Model loading takes too long

The first call to agent.start() downloads all model bundles from Hugging Face and compiles them for the target device. This can take tens of seconds on a cold start.

Use prefetch_engines on a loading or splash screen to move the download out of the critical path:

try await ai.prefetch_engines(
    engines: [
        "TheStageAI/silero-vad",
        "TheStageAI/thewhisper-large-v3-turbo",
        "TheStageAI/neutts-multilingual"
    ]
)

Once cached, subsequent launches load models from disk in under a second.

Transcription is inaccurate

Check these common causes:

Audio format mismatch. The STT pipeline expects 16 kHz mono Float32 input. If your audio session is configured for a different sample rate or channel count, the Whisper model receives garbled audio. The default AudioStreamConfig values are correct — only check this if you overrode them.
Wrong language. stt_language defaults to "en". If the user is speaking another language, set this to the correct BCP-47 code (e.g. "es", "de", "ja").
Noisy environment. Whisper is robust but not immune to loud background noise. On iOS, ensure aec_enabled is true so the device’s own speaker output is cancelled from the microphone signal.
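
If you feed audio from your own capture path, part of matching the expected format is downmixing and converting the sample type. The sketch below converts interleaved 16-bit PCM to mono Float32 in the range [-1, 1]; `monoFloat32` is a hypothetical helper, and resampling to 16 kHz is not covered.

```swift
// Hypothetical helper: interleaved Int16 PCM → mono Float32 in [-1, 1].
// Resampling to 16 kHz is a separate step and is not shown here.
func monoFloat32(fromInterleaved samples: [Int16], channels: Int) -> [Float] {
    precondition(channels > 0 && samples.count % channels == 0, "incomplete frame")
    let frames = samples.count / channels
    var out = [Float](repeating: 0, count: frames)
    for frame in 0..<frames {
        // Average all channels of this frame to get one mono sample.
        var sum: Float = 0
        for channel in 0..<channels {
            sum += Float(samples[frame * channels + channel])
        }
        // Scale from the Int16 range down to [-1, 1].
        out[frame] = (sum / Float(channels)) / 32768.0
    }
    return out
}
```

Passing stereo or integer samples straight through without a step like this is one way the model can end up receiving garbled audio.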

No audio output on device

If the agent appears to work (events fire, transcripts appear) but no sound is heard:

Missing background mode. On iOS, add audio to UIBackgroundModes in Info.plist. Without it, the audio session may be deactivated by the system.
```
<key>UIBackgroundModes</key>
<array>
    <string>audio</string>
</array>
```
TTS not configured. If tts is nil in the config, the agent operates in speech-to-text-only mode. Set it to a valid TTS bundle path (NeuTTS or Qwen3-TTS).
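Silent mode / volume. On iOS, check that the media volume is up. The AudioStreamConfig category defaults to .playback, which follows the media volume slider and, per Apple’s AVAudioSession documentation, keeps playing when the Ring/Silent switch is set to silent. If you overrode the category, also check that the device is not in silent mode.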
Speaker ID. Modes that use vad_speaker_id* need speaker_id set to a speaker-embedding bundle and an enrolled vector via enroll_speaker. See Speaker embedding.

Agent checklist

Wire vad / stt / tts / llm (local or OpenAI-compatible).
Local LLM: start_model first; model_path / llm_model must match the handle.
Mic/ASR 16 kHz; TTS playback 24 kHz.
Speaker ID modes: set speaker_id and enroll a vector via enroll_speaker (see Speaker embedding).
Do not wrap infer / infer_stream in @MainActor tasks.