Voice Agent¶
Overview¶
End-to-end on-device voice assistant: VAD → STT → LLM → TTS, with neural end-of-turn detection (the smart-turn-v3 model), interruption handling (barge-in), streaming transcription (live partial captions), and sentence-level TTS streaming for sub-second time-to-first-audio.
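Sentence-level TTS streaming means the synthesizer can start on the first complete sentence while the LLM is still generating the rest. The following is a minimal sketch of that idea, not the SDK's implementation: SentenceChunker is a hypothetical helper, and the boundary rule (terminal punctuation followed by whitespace) is an assumption chosen for brevity.

```swift
import Foundation

/// Buffers streamed LLM text and releases each sentence as soon as it is
/// closed by ".", "!" or "?" followed by whitespace.
/// Hypothetical helper for illustration; not part of TheStageSDK.
struct SentenceChunker {
    private var buffer = ""
    private let terminators: Set<Character> = [".", "!", "?"]

    init() {}

    /// Appends a delta and returns the sentences it completed, if any.
    mutating func push(_ delta: String) -> [String] {
        buffer += delta
        var sentences: [String] = []
        var current = ""
        var previous: Character? = nil
        for ch in buffer {
            if let p = previous, terminators.contains(p), ch.isWhitespace {
                // Boundary found: emit the finished sentence, drop the space.
                let trimmed = current.trimmingCharacters(in: .whitespacesAndNewlines)
                if !trimmed.isEmpty { sentences.append(trimmed) }
                current = ""
            } else {
                current.append(ch)
            }
            previous = ch
        }
        buffer = current  // keep the unfinished tail for the next delta
        return sentences
    }

    /// Returns whatever is left once the response stream has finished.
    mutating func flush() -> String? {
        let rest = buffer.trimmingCharacters(in: .whitespacesAndNewlines)
        buffer = ""
        return rest.isEmpty ? nil : rest
    }
}

var chunker = SentenceChunker()
for delta in ["Hello the", "re. How are", " you?"] {
    for sentence in chunker.push(delta) {
        print("speak:", sentence)   // prints "speak: Hello there."
    }
}
if let tail = chunker.flush() {
    print("speak:", tail)           // prints "speak: How are you?"
}
```

A rule this simple also splits after abbreviations such as "e.g."; a production splitter would need a more careful boundary test.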
State Machine¶
┌────────────────────────────────────────┐
│ if config.wake_word == nil │
idle → loading ─────►│ listening ⇄ thinking → speaking │──► listening
│ │
│ else (wake-word configured) │
│ sleeping ─WW─► listening ⇄ thinking │──► speaking ──► sleeping
└────────────────────────────────────────┘
| State | Meaning |
|---|---|
| idle | Models not loaded |
| loading | Models being downloaded / loaded |
| sleeping | Wake-word standby |
| listening | Mic open, VAD scanning for speech |
| thinking | Speech committed, LLM is generating |
| speaking | TTS streaming audio to the speaker |
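The diagram and table above can be read as a transition function. The sketch below encodes them under stated assumptions: the AgentSignal cases are hypothetical names for whatever triggers each arrow (the source only names the states), and the thinking → listening edge is modelled as an abandoned turn because the diagram does not say what causes it.

```swift
/// States as listed in the table above.
enum AgentState {
    case idle, loading, sleeping, listening, thinking, speaking
}

/// Hypothetical triggers for the arrows in the diagram.
enum AgentSignal {
    case startRequested, modelsLoaded, wakeWordDetected
    case turnCommitted, turnAbandoned, responseStarted, playbackFinished
}

/// Returns the next state, or nil if the signal is not valid in `state`.
func nextState(from state: AgentState,
               on signal: AgentSignal,
               wakeWordConfigured: Bool) -> AgentState? {
    switch (state, signal) {
    case (.idle, .startRequested):
        return .loading
    case (.loading, .modelsLoaded):
        // With a wake word the agent starts in standby.
        return wakeWordConfigured ? .sleeping : .listening
    case (.sleeping, .wakeWordDetected):
        return .listening
    case (.listening, .turnCommitted):
        return .thinking
    case (.thinking, .turnAbandoned):
        return .listening
    case (.thinking, .responseStarted):
        return .speaking
    case (.speaking, .playbackFinished):
        // Wake-word mode returns to standby after every reply.
        return wakeWordConfigured ? .sleeping : .listening
    default:
        return nil
    }
}
```

Returning nil for unlisted pairs makes an unexpected signal visible to the caller instead of silently keeping the current state.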
Usage Guides¶
Building a basic voice assistant¶
The quickest way to get a working voice loop is to wire up a cloud LLM provider, point the agent at the pre-trained VAD / STT / TTS bundles, and subscribe to the event stream. The agent handles the full listen → think → speak cycle automatically.
Swift
import TheStageSDK

// Initialize the SDK once before creating an agent.
let ai = TheStageAI.shared
try await ai.initialize(apiToken: "th_…")

// Cloud LLM behind an OpenAI-compatible chat completions endpoint.
let llm = TheStageOpenAICompatibleProvider(
    endpoint: "https://api.openai.com/v1/chat/completions",
    api_key: "sk-...",
    model: "gpt-4o-mini"
)

// Pre-trained VAD / STT / TTS bundles plus the LLM provider.
var config = TheStageAgentConfig(
    vad: "TheStageAI/silero-vad",
    stt: "TheStageAI/thewhisper-large-v3-turbo",
    tts: "TheStageAI/neutts-multilingual",
    llm: llm
)
config.system_prompt = "You are a helpful voice assistant. Keep replies short."

let agent = TheStageVoiceAgent(config: config)

// Log state transitions and content events as they arrive.
Task {
    for await event in agent.events {
        switch event.kind {
        case .state_changed:  print("[STATE] \(event.data["state"] ?? "?")")
        case .user_request:   print("[YOU] \(event.data["text"] ?? "")")
        case .response_delta: print(event.data["delta"] ?? "", terminator: "")
        case .response_done:  print("\n[ASSISTANT DONE]")
        case .error:          print("[ERROR] \(event.data["message"] ?? "")")
        default:              break
        }
    }
}

// Typed channel that carries only the LLM text deltas.
Task {
    for await delta in agent.llm_deltas.recv() {
        // Append delta to a chat bubble
    }
}

// Start the listen → think → speak loop.
try await agent.start()
The events stream delivers every state transition and content event. You
only need to handle the kinds you care about — the agent keeps running
regardless.
Flutter
import 'dart:io'; // for stdout
import 'package:thestage_apple_sdk/thestage_apple_sdk.dart';

await TheStageFlutterSDK.initialize(api_token: 'th_…');
final agent = TheStageVoiceAgentFlutter();

// Generic event stream: each event is a map with a 'kind' key.
agent.events.listen((event) {
  switch (event['kind']) {
    case 'state_changed': print('STATE: ${event['state']}');
    case 'user_request': print('YOU: ${event['text']}');
    case 'response_delta': stdout.write(event['delta']);
    case 'response_done': print('\nASSISTANT DONE');
  }
});

// Typed streams for binding to UI elements.
agent.llmDeltas.listen((delta) { /* update assistant bubble */ });
agent.transcripts.listen((text) { /* show user turn */ });
agent.vadProbabilities.listen((p) { /* drive a level meter */ });

await agent.start(config: {
  'vad': 'TheStageAI/silero-vad',
  'stt': 'TheStageAI/thewhisper-large-v3-turbo',
  'tts': 'TheStageAI/neutts-multilingual',
  'llm_provider': 'openai_compatible',
  'llm_endpoint': 'https://api.openai.com/v1/chat/completions',
  'llm_api_key': 'sk-...',
  'llm_model': 'gpt-4o-mini',
  'system_prompt': 'You are a helpful voice assistant.',
});

// Runtime controls, listed together for reference.
await agent.interrupt();
await agent.say('Welcome back!');
await agent.stop();
Using an on-device LLM (fully offline)¶
When privacy matters or the device has no network connection, you can swap the cloud LLM for a local model. The rest of the pipeline (VAD, STT, TTS) already runs on-device, so this makes the entire voice loop fully offline.
let llm = TheStageLocalLLMProvider(
    model: "TheStageAI/neullm-small"
)
var config = TheStageAgentConfig(
    vad: "TheStageAI/silero-vad",
    stt: "TheStageAI/thewhisper-large-v3-turbo",
    tts: "TheStageAI/neutts-multilingual",
    llm: llm
)
In Flutter, set llm_provider to "local" and provide an llm_model
path instead of an endpoint:
await agent.start(config: {
  'vad': 'TheStageAI/silero-vad',
  'stt': 'TheStageAI/thewhisper-large-v3-turbo',
  'tts': 'TheStageAI/neutts-multilingual',
  'llm_provider': 'local',
  'llm_model': 'TheStageAI/neullm-small',
  'system_prompt': 'You are a helpful offline assistant.',
});
Note
Local LLMs are smaller and faster to respond but less capable than large cloud models. Test with your actual prompts to make sure the quality meets your needs.
Handling barge-in (user interrupts the assistant)¶
Barge-in is when the user starts talking while the assistant is still speaking — for example, saying “stop” or asking a follow-up question before the current answer finishes. The agent detects this through VAD and immediately stops TTS playback so the user feels heard.
The behaviour is controlled by interrupt_mode:
- .speech_only: barge-in triggers when the user’s voice exceeds the VAD threshold for at least interrupt_min_speech_ms milliseconds. This is the default on iOS, where hardware AEC filters out the speaker’s own audio.
- .none: barge-in is disabled. The user must wait for the assistant to finish. This is the default on macOS, where AEC is less reliable and the agent’s own TTS output can be mistaken for user speech.
You can tune sensitivity at runtime:
await agent.update_interrupt_config(
    min_speech_ms: 200,
    mode: .speech_only
)
Lower interrupt_min_speech_ms makes barge-in more responsive but increases
the risk of false triggers from background noise. Higher values require a more
deliberate interruption.
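The trade-off is easier to see as a frame counter. The sketch below is an illustration rather than the SDK's detector: BargeInDetector and its member names are hypothetical, the defaults of 0.9 and 600 ms are taken from the Interruption / AEC table, and the 32 ms frame duration is an assumption.

```swift
/// Fires once VAD probability has stayed at or above the threshold for
/// at least `minSpeechMs` of consecutive frames.
/// Hypothetical helper for illustration; not part of TheStageSDK.
struct BargeInDetector {
    let probThreshold: Double
    let minSpeechMs: Int
    let frameMs: Int          // assumed duration of one VAD frame
    private var voicedMs = 0

    init(probThreshold: Double = 0.9, minSpeechMs: Int = 600, frameMs: Int = 32) {
        self.probThreshold = probThreshold
        self.minSpeechMs = minSpeechMs
        self.frameMs = frameMs
    }

    /// Feeds one VAD frame; returns true when a barge-in should trigger.
    mutating func process(probability: Double) -> Bool {
        if probability >= probThreshold {
            voicedMs += frameMs
        } else {
            voicedMs = 0      // any sub-threshold frame restarts the count
        }
        return voicedMs >= minSpeechMs
    }
}

// With min_speech_ms = 200 and 32 ms frames, the 7th consecutive voiced
// frame (224 ms) is the first to reach the threshold.
var detector = BargeInDetector(minSpeechMs: 200)
let triggerFrame = (1...10).first { _ in detector.process(probability: 0.95) }
print(triggerFrame ?? -1)     // prints 7
```

Because a single quiet frame resets the counter, short noise bursts need to be continuous for the full window before they can interrupt playback.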
Attention
On macOS, enabling .speech_only without hardware AEC will cause the
agent to interrupt itself — its own TTS output is picked up by the
microphone and interpreted as user speech. Either keep .none or use
external headphones with a directional microphone.
Adding a wake word¶
By default the agent transitions straight from loading to listening
and is always active. If you want the assistant to wait for a trigger phrase
before it starts processing speech, configure a wake-word model.
With a wake word enabled, the state machine gains a sleeping state. The
agent sits in low-power standby and only transitions to listening when it
detects the wake phrase.
Swift
var config = TheStageAgentConfig(
    vad: "TheStageAI/silero-vad",
    stt: "TheStageAI/thewhisper-large-v3-turbo",
    tts: "TheStageAI/neutts-multilingual",
    llm: llm
)
config.wake_word = "TheStageAI/wake-word-hey-assistant"
config.ww_threshold_score = 0.5
Flutter
await agent.start(config: {
  'vad': 'TheStageAI/silero-vad',
  'stt': 'TheStageAI/thewhisper-large-v3-turbo',
  'tts': 'TheStageAI/neutts-multilingual',
  'llm_provider': 'openai_compatible',
  'llm_endpoint': 'https://api.openai.com/v1/chat/completions',
  'llm_api_key': 'sk-...',
  'llm_model': 'gpt-4o-mini',
  'wake_word': 'TheStageAI/wake-word-hey-assistant',
  'ww_threshold_score': 0.5,
});
After the assistant finishes speaking it returns to sleeping rather than
listening, so the user must say the wake phrase again for each new
conversation turn.
Building a chat UI on top of the agent¶
The agent exposes several real-time output channels you can bind to UI elements.
These are independent of the events stream and deliver typed values instead
of generic event dictionaries.
| Channel | UI use case |
|---|---|
| llm_deltas | Append each token to the assistant’s chat bubble for a typing effect |
| partial_transcripts | Show a live “listening…” caption while the user is speaking |
| transcripts | Display the finalized user turn in the chat history |
| vad_probabilities | Drive a microphone level meter or voice-activity indicator |
Swift
// Assistant bubble: append each LLM delta.
Task {
    for await delta in agent.llm_deltas.recv() {
        assistantBubble.text += delta
    }
}
// Live caption while the user is speaking.
Task {
    for await partial in agent.partial_transcripts.recv() {
        listeningLabel.text = partial
    }
}
// Finalized user turn.
Task {
    for await transcript in agent.transcripts.recv() {
        chatHistory.append(UserMessage(text: transcript))
    }
}
// Microphone level meter.
Task {
    for await probability in agent.vad_probabilities.recv() {
        micMeter.level = probability
    }
}
Flutter
agent.llmDeltas.listen((delta) {
  setState(() => assistantText += delta);
});
agent.transcripts.listen((text) {
  setState(() => chatHistory.add(UserMessage(text)));
});
agent.vadProbabilities.listen((p) {
  setState(() => micLevel = p);
});
Note
Flutter equivalents: agent.llmDeltas, agent.transcripts,
agent.vadProbabilities (Stream types).
Running in background on iOS¶
The voice agent needs the microphone and audio output to stay active when the app moves to the background. Without the audio background mode, iOS suspends the app and the agent stops listening.
Add audio to UIBackgroundModes in your Info.plist:
<key>UIBackgroundModes</key>
<array>
    <string>audio</string>
</array>
Note
With NPU defaults, all pipelines keep running while the app is backgrounded.
This is sufficient for the agent to continue operating when the user switches
apps or locks the screen. No code changes are needed — the audio session
category (set to .playback by default in AudioStreamConfig) keeps the
session alive.
Events¶
The agent communicates state changes and content through an event stream.
Subscribe to agent.events (Swift) or agent.events.listen (Flutter)
to drive your UI or trigger application logic.
| Kind | Data Keys | When |
|---|---|---|
| state_changed | state | State transition |
|  |  | Stable partial caption mid-turn |
| user_request | text | User request finalized |
| response_delta | delta | LLM token arrived |
| response_done |  | Response stream finished |
|  | none | First TTS sample reached speaker |
|  |  | Speaker stopped |
|  |  | Heartbeat metrics |
| error | message | Recoverable error |
API Reference¶
All configuration is passed through TheStageAgentConfig (Swift) or the
config dictionary (Flutter). The tables below group related settings. Most
defaults work well out of the box — you typically only need to set the model
paths and an LLM provider to get started.
Models¶
Which model bundles to load. vad and stt are required for the agent to
function. tts is optional if you only need speech-to-text. wake_word
enables standby mode.
| Field | Type | Default | Description |
|---|---|---|---|
| vad | String | required | HF id or local path of Silero VAD bundle |
| stt | String | required | HF id or local path of Whisper bundle |
| tts | String? | nil | HF id or local path of NeuTTS bundle |
|  | String | "paul" | Voice preset id |
| wake_word | String? | nil | Optional wake-word bundle |
| stt_language | String | "en" | Whisper decode language |
Compute Device Routing¶
Controls which hardware accelerator each model runs on. The default "npu"
uses the Apple Neural Engine, which is the fastest option on supported devices.
Change to "gpu" or "cpu" only if you need to debug or the model does
not support the Neural Engine.
| Field | Type | Default | Description |
|---|---|---|---|
|  | String | "npu" | Silero VAD compute device |
|  | String | "npu" | Whisper coarse default |
|  | [String:String]? | nil | Per-module override |
|  | String | "npu" | NeuTTS coarse default |
|  | [String:String]? | nil | Per-module override |
LLM¶
The language model that generates the assistant’s responses. Choose between a
cloud provider (TheStageOpenAICompatibleProvider) for maximum quality or a
local model (TheStageLocalLLMProvider) for offline / privacy use cases.
| Field | Type | Default | Description |
|---|---|---|---|
| llm | TheStageLLMProvider | required | Local or remote LLM (Swift) |
| llm_provider | String | required | "local" or "openai_compatible" (Flutter) |
| system_prompt | String | helpful default | System message |
|  | Int | 256 | Generation cap |
|  | Double | 0.7 | Sampling temperature |
|  | TheStageChatMemory | SlidingWindowMemory(max_turns: 10) | History strategy |
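The default history strategy, SlidingWindowMemory(max_turns: 10), suggests that only the most recent turns are sent to the LLM. A minimal sketch of that general technique follows. It is not the SDK's implementation: ChatTurn and SlidingWindow are hypothetical types, and treating one turn as a user/assistant pair is an assumption.

```swift
/// One user request with the assistant's reply (assumed unit of a "turn").
struct ChatTurn: Equatable {
    let user: String
    let assistant: String
}

/// Keeps at most `maxTurns` of the most recent turns.
/// Hypothetical type for illustration; not part of TheStageSDK.
struct SlidingWindow {
    let maxTurns: Int
    private(set) var turns: [ChatTurn] = []

    init(maxTurns: Int) {
        precondition(maxTurns > 0, "maxTurns must be positive")
        self.maxTurns = maxTurns
    }

    /// Appends a turn and evicts the oldest ones beyond the window.
    mutating func append(_ turn: ChatTurn) {
        turns.append(turn)
        if turns.count > maxTurns {
            turns.removeFirst(turns.count - maxTurns)
        }
    }
}

var memory = SlidingWindow(maxTurns: 2)
memory.append(ChatTurn(user: "a", assistant: "1"))
memory.append(ChatTurn(user: "b", assistant: "2"))
memory.append(ChatTurn(user: "c", assistant: "3"))
print(memory.turns.map(\.user))   // prints ["b", "c"]
```

A fixed window bounds the prompt size, at the cost of forgetting anything said more than max_turns turns ago.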
VAD / Endpointing¶
Voice Activity Detection determines when the user starts and stops speaking.
vad_threshold and vad_onset_ms control how much evidence the system
needs before it decides speech has begun. silence_timeout_ms controls how
long the user must be silent before the turn is committed to the LLM.
| Field | Type | Default | Description |
|---|---|---|---|
| vad_threshold | Double | 0.8 | Speech probability threshold |
| vad_onset_ms | Int | 96 | Sustained voiced duration to trigger onset |
| silence_timeout_ms | Int | 608 | Trailing silence to commit turn |
|  | Int | 30000 | Hard cap on single turn |
|  | Int | 200 | Pre-roll captured before onset |
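The interaction of vad_threshold, vad_onset_ms, and silence_timeout_ms can be sketched as two counters over VAD frames. This is an illustration, not the SDK's endpointer: Endpointer and EndpointEvent are hypothetical names, the 32 ms frame duration is an assumption, and the hard turn cap and pre-roll are left out.

```swift
/// Events produced by the sketch below.
enum EndpointEvent: Equatable {
    case speechStarted
    case turnCommitted
}

/// Fixed-timeout endpointing over per-frame speech probabilities.
/// Hypothetical helper for illustration; not part of TheStageSDK.
struct Endpointer {
    let threshold: Double
    let onsetMs: Int
    let silenceTimeoutMs: Int
    let frameMs: Int          // assumed duration of one VAD frame
    private var inSpeech = false
    private var voicedMs = 0
    private var silentMs = 0

    init(threshold: Double = 0.8, onsetMs: Int = 96,
         silenceTimeoutMs: Int = 608, frameMs: Int = 32) {
        self.threshold = threshold
        self.onsetMs = onsetMs
        self.silenceTimeoutMs = silenceTimeoutMs
        self.frameMs = frameMs
    }

    mutating func process(probability: Double) -> EndpointEvent? {
        let voiced = probability >= threshold
        if !inSpeech {
            // Onset needs an unbroken voiced run of at least onsetMs.
            voicedMs = voiced ? voicedMs + frameMs : 0
            if voicedMs >= onsetMs {
                inSpeech = true
                silentMs = 0
                return .speechStarted
            }
            return nil
        }
        // During speech, count trailing silence; any voiced frame resets it.
        silentMs = voiced ? 0 : silentMs + frameMs
        if silentMs >= silenceTimeoutMs {
            inSpeech = false
            voicedMs = 0
            return .turnCommitted
        }
        return nil
    }
}
```

With the defaults and 32 ms frames, onset fires on the third consecutive voiced frame (96 ms) and the turn is committed on the nineteenth consecutive silent frame (608 ms).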
Turn Detection¶
By default the agent commits the user’s turn after a fixed silence timeout
(.vad mode). Switching to .dnn mode uses a neural model
(smart-turn-v3) that listens to the audio and predicts whether the user is
actually done speaking, even during short pauses. This reduces premature
cut-offs for users who pause mid-sentence.
| Field | Type | Default | Description |
|---|---|---|---|
| turn_detection_mode | enum | .vad |  |
| turn_detector | String? | nil | smart-turn engines repo/path |
| turn_eot_threshold | Double | 0.85 | Completion prob threshold |
|  | Int | 2 | Consecutive done verdicts needed |
|  | Int | 256 | Trailing silence before first model call |
|  | Int | 120 | Re-run cadence on sustained pause |
|  | Int | 5000 | Hard fallback |
|  | Int | 8000 | Trailing audio window fed to model |
|  | Int | 250 | Minimum speech before consulting model |
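Two of the settings above combine into a simple gate: the model's completion probability must reach the threshold on several consecutive calls before the turn is committed. The sketch below shows only that gate, under stated assumptions: TurnGate is a hypothetical type, the defaults 0.85 and 2 are taken from the table, and the completion probabilities are supplied by the caller rather than by a real model.

```swift
/// Commits a turn only after `requiredConsecutive` model calls in a row
/// report a completion probability at or above `eotThreshold`.
/// Hypothetical helper for illustration; not part of TheStageSDK.
struct TurnGate {
    let eotThreshold: Double
    let requiredConsecutive: Int
    private var streak = 0

    init(eotThreshold: Double = 0.85, requiredConsecutive: Int = 2) {
        self.eotThreshold = eotThreshold
        self.requiredConsecutive = requiredConsecutive
    }

    /// Feeds one model verdict; returns true when the turn should commit.
    mutating func observe(completionProbability p: Double) -> Bool {
        streak = p >= eotThreshold ? streak + 1 : 0
        return streak >= requiredConsecutive
    }

    /// Call when the user resumes speaking.
    mutating func reset() {
        streak = 0
    }
}

var gate = TurnGate()
let verdicts = [0.90, 0.50, 0.90, 0.95].map { gate.observe(completionProbability: $0) }
print(verdicts)   // prints [false, false, false, true]
```

Lowering the threshold makes each verdict easier to pass, while the consecutive-verdict requirement filters out a single overconfident prediction during a short pause.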
Streaming Transcription (ASR)¶
Live partial captions let you show the user what the agent is hearing while
they are still talking. Disable asr_streaming if you only need the final
transcript.
| Field | Type | Default | Description |
|---|---|---|---|
| asr_streaming | Bool | true | Emit live partial captions |
|  | Int | 600 | Minimum new audio between caption passes |
|  | Bool | true | Speculative full-utterance pass at first VAD pause |
Interruption / AEC¶
Controls how the agent handles barge-in. On iOS, the hardware Acoustic Echo
Cancellation (AEC) unit filters out the device’s own speaker output, so the
agent can safely listen for user speech while it is talking. On macOS, AEC is
unreliable — the default is .none to prevent self-interruption.
| Field | Type | Default | Description |
|---|---|---|---|
| interrupt_mode | InterruptTrigger | .speech_only (iOS) / .none (macOS) | How user can barge in |
| interrupt_min_speech_ms | Int | 600 | Sustained speech needed to interrupt |
|  | Double | 0.9 | VAD prob threshold for barge-in |
|  | Int | 250 | Grace at TTS turn start |
| aec_enabled | Bool | true | Voice Processing IO (iOS only) |
Wake-Word Standby¶
When a wake-word model is configured, the agent enters sleeping mode after
each interaction and only wakes when the trigger phrase is detected. Leave
wake_word as nil (the default) to skip standby entirely.
| Field | Type | Default | Description |
|---|---|---|---|
| wake_word | String? | nil | HF id or local path of wake-word bundle |
| ww_threshold_score | Double | 0.5 | Wake-word classifier threshold |
|  | String | "npu" | Wake-word compute device |
Programmatic Controls¶
// Playback and requests
agent.interrupt()
agent.say("Hi there!")
agent.send_request("What time is it?")

// Voice and history
await agent.set_voice("dave")
let history = await agent.history()
await agent.clear_history()

// Runtime tuning
await agent.update_interrupt_config(min_speech_ms: 200, mode: .speech_only)
await agent.update_turn_config(eot_threshold: 0.6, pause_trigger_ms: 256)

// Shutdown
await agent.stop()
Note
TheStageAI.infer and TheStageAI.infer_stream are nonisolated,
so each node runs on independent tasks. Don’t wrap inference calls in
Task { @MainActor in ... }.
Latency¶
Measured on an M-class Mac with OpenAI gpt-4o-mini as the LLM:
| Turn | LLM 1st Token | First Audio | Full Speak |
|---|---|---|---|
| Short reply | 487 ms | 521 ms | 3.3 s |
| Long monologue | 575 ms | 601 ms | 53.1 s |
| Mid-length | 1226 ms | 1497 ms | 5.8 s |
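Subtracting the LLM 1st Token column from the First Audio column shows how much of the time-to-first-audio is spent after the LLM has started responding. The snippet below only does that arithmetic on the figures reported above; TurnLatency is a hypothetical type introduced for the example.

```swift
/// One row of the latency measurements (values in milliseconds).
struct TurnLatency {
    let label: String
    let llmFirstTokenMs: Int
    let firstAudioMs: Int

    /// Time between the first LLM token and the first audible sample.
    var tokenToAudioMs: Int { firstAudioMs - llmFirstTokenMs }
}

let measuredTurns = [
    TurnLatency(label: "Short reply", llmFirstTokenMs: 487, firstAudioMs: 521),
    TurnLatency(label: "Long monologue", llmFirstTokenMs: 575, firstAudioMs: 601),
    TurnLatency(label: "Mid-length", llmFirstTokenMs: 1226, firstAudioMs: 1497),
]

for turn in measuredTurns {
    print("\(turn.label): \(turn.tokenToAudioMs) ms from first token to first audio")
}
// Short reply: 34 ms from first token to first audio
// Long monologue: 26 ms from first token to first audio
// Mid-length: 271 ms from first token to first audio
```

In all three measurements the larger share of the delay before first audio is the wait for the first LLM token.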
Troubleshooting¶
Agent keeps interrupting itself¶
This happens when the agent’s own TTS output is picked up by the microphone and interpreted as user speech, triggering a barge-in. The root cause is missing or ineffective Acoustic Echo Cancellation (AEC).
On macOS: set interrupt_mode to .none. macOS does not have
hardware-level AEC, so the default is already .none — if you overrode it to
.speech_only, revert or use headphones.
On iOS: verify that aec_enabled is true (the default). If you
disabled it, re-enable it. If self-interruption still occurs with AEC enabled,
increase interrupt_min_speech_ms (e.g. to 800) so the agent requires a
longer burst of detected speech before it triggers a barge-in.
Agent responds too slowly or cuts off the user¶
These are two sides of the same coin — turn detection timing.
Cuts off the user mid-sentence: the agent commits the turn too eagerly.
Switch from .vad to .dnn turn detection mode so the neural model can
distinguish between a pause and the end of the utterance:
config.turn_detection_mode = .dnn
config.turn_detector = "TheStageAI/smart-turn-v3"
If you stay in .vad mode, increase silence_timeout_ms (default 608 ms)
to give the user more breathing room.
Responds too slowly: the agent waits too long after the user finishes. In
.dnn mode, lower turn_eot_threshold (default 0.85) so the model
commits sooner. In .vad mode, decrease silence_timeout_ms.
Model loading takes too long¶
The first call to agent.start() downloads all model bundles from Hugging
Face and compiles them for the target device. This can take tens of seconds on
a cold start.
Use prefetch_engines on a loading or splash screen to move the download out
of the critical path:
try await ai.prefetch_engines(
    engines: [
        "TheStageAI/silero-vad",
        "TheStageAI/thewhisper-large-v3-turbo",
        "TheStageAI/neutts-multilingual"
    ]
)
Once cached, subsequent launches load models from disk in under a second.
Transcription is inaccurate¶
Check these common causes:
- Audio format mismatch. The STT pipeline expects 16 kHz mono Float32 input. If your audio session is configured for a different sample rate or channel count, the Whisper model receives garbled audio. The default AudioStreamConfig values are correct; only check this if you overrode them.
- Wrong language. stt_language defaults to "en". If the user is speaking another language, set this to the correct BCP-47 code (e.g. "es", "de", "ja").
- Noisy environment. Whisper is robust but not immune to loud background noise. On iOS, ensure aec_enabled is true so the device’s own speaker output is cancelled from the microphone signal.
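If you feed the pipeline from your own capture code, the channel count and sample type are the easiest parts of the format to get wrong. The sketch below converts interleaved Int16 PCM to mono Float32 in the range [-1, 1). It is a standalone helper, not an SDK function, and it does not resample: the input is assumed to be at 16 kHz already.

```swift
/// Downmixes interleaved Int16 PCM to mono Float32 in [-1, 1).
/// Hypothetical helper for illustration; sample-rate conversion is out of scope.
func monoFloat32(fromInterleavedInt16 samples: [Int16], channels: Int) -> [Float] {
    precondition(channels > 0, "channels must be positive")
    precondition(samples.count % channels == 0, "incomplete final frame")

    let frameCount = samples.count / channels
    var mono = [Float](repeating: 0, count: frameCount)
    for frame in 0..<frameCount {
        // Average all channels of this frame.
        var sum: Float = 0
        for channel in 0..<channels {
            sum += Float(samples[frame * channels + channel])
        }
        // Scale from the Int16 range to [-1, 1).
        mono[frame] = sum / Float(channels) / 32768.0
    }
    return mono
}

let stereo: [Int16] = [16384, 16384, -32768, -32768]
print(monoFloat32(fromInterleavedInt16: stereo, channels: 2))   // prints [0.5, -1.0]
```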
No audio output on device¶
If the agent appears to work (events fire, transcripts appear) but no sound is heard:
- Missing background mode. On iOS, add audio to UIBackgroundModes in Info.plist. Without it, the audio session may be deactivated by the system.
  <key>UIBackgroundModes</key>
  <array>
      <string>audio</string>
  </array>
- TTS not configured. If tts is nil in the config, the agent operates in speech-to-text-only mode. Set it to a valid NeuTTS bundle path.
- Silent mode / volume. On iOS, check that the device is not in silent mode and the media volume is up. The AudioStreamConfig category defaults to .playback, which respects the media volume slider.