Voice Agent

Overview

End-to-end on-device voice assistant: VAD → STT → LLM → TTS, with neural end-of-turn detection (the smart-turn-v3 model), interruption handling (barge-in), streaming transcription (live partial captions), and sentence-level TTS streaming for sub-second time-to-first-audio.
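To make the sentence-level TTS streaming idea concrete, here is a minimal Swift sketch that cuts streamed LLM deltas into sentences, so the first sentence can be handed to TTS before the response has finished generating. `SentenceChunker` and its punctuation rule are hypothetical and not part of the SDK; how the agent actually segments text is not described in this document.

```swift
import Foundation

/// Hypothetical helper (not an SDK type): buffers streamed text and
/// emits a sentence each time terminal punctuation arrives.
struct SentenceChunker {
    private var buffer = ""
    private let terminators: Set<Character> = [".", "!", "?"]

    /// Feed one delta; returns the sentences completed by it (possibly none).
    mutating func push(_ delta: String) -> [String] {
        var sentences: [String] = []
        for ch in delta {
            buffer.append(ch)
            if terminators.contains(ch) {
                let sentence = buffer.trimmingCharacters(in: .whitespaces)
                if !sentence.isEmpty { sentences.append(sentence) }
                buffer = ""
            }
        }
        return sentences
    }

    /// Flush any trailing text when the response stream ends.
    mutating func finish() -> String? {
        let rest = buffer.trimmingCharacters(in: .whitespaces)
        buffer = ""
        return rest.isEmpty ? nil : rest
    }
}

var chunker = SentenceChunker()
var spoken: [String] = []
for delta in ["Hel", "lo there. How", " are you", "? Fine"] {
    spoken += chunker.push(delta)   // each element could go to TTS immediately
}
if let tail = chunker.finish() { spoken.append(tail) }
// spoken == ["Hello there.", "How are you?", "Fine"]
```

The rule is deliberately naive: it would also split on the period in "3.14" or "e.g.", so a production segmenter needs more care.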

State Machine

                     ┌────────────────────────────────────────┐
                     │  if config.wake_word == nil            │
idle → loading ─────►│  listening ⇄ thinking → speaking       │──► listening
                     │                                        │
                     │  else (wake-word configured)           │
                     │  sleeping ─WW─► listening ⇄ thinking   │──► speaking ──► sleeping
                     └────────────────────────────────────────┘
States

| State | Meaning |
|---|---|
| idle | Models not loaded |
| loading | Models being downloaded / loaded |
| sleeping | Wake-word standby |
| listening | Mic open, VAD scanning for speech |
| thinking | Speech committed, LLM is generating |
| speaking | TTS streaming audio to the speaker |
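The diagram and state list above can be read as a small transition function. The following Swift sketch encodes only the arrows drawn in the diagram; `AgentState` and `nextStates` are illustrative names, not SDK types, and transitions the diagram does not draw (for example, leaving `speaking` on barge-in or returning to `idle` after `stop()`) are left out.

```swift
/// Illustrative model of the documented states (not an SDK type).
enum AgentState: String {
    case idle, loading, sleeping, listening, thinking, speaking
}

/// States reachable from `state`, following only the arrows in the diagram.
func nextStates(from state: AgentState, wakeWordConfigured: Bool) -> Set<AgentState> {
    // With a wake word the loop rests in `sleeping`; otherwise in `listening`.
    let resting: AgentState = wakeWordConfigured ? .sleeping : .listening
    switch state {
    case .idle:      return [.loading]
    case .loading:   return [resting]
    case .sleeping:  return [.listening]            // wake word detected
    case .listening: return [.thinking]             // turn committed
    case .thinking:  return [.listening, .speaking] // the ⇄ arrow, or start of TTS
    case .speaking:  return [resting]
    }
}

let afterSpeaking = nextStates(from: .speaking, wakeWordConfigured: true)
// afterSpeaking == [.sleeping]
```

A table like this is handy for asserting in UI tests that a received `state_changed` event is one you expect from the previous state.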

Usage Guides

Building a basic voice assistant

The quickest way to get a working voice loop is to wire up a cloud LLM provider, point the agent at the pre-trained VAD / STT / TTS bundles, and subscribe to the event stream. The agent handles the full listen → think → speak cycle automatically.

Swift

import TheStageSDK

let ai = TheStageAI.shared
try await ai.initialize(apiToken: "th_…")

let llm = TheStageOpenAICompatibleProvider(
    endpoint: "https://api.openai.com/v1/chat/completions",
    api_key: "sk-...",
    model: "gpt-4o-mini"
)

var config = TheStageAgentConfig(
    vad: "TheStageAI/silero-vad",
    stt: "TheStageAI/thewhisper-large-v3-turbo",
    tts: "TheStageAI/neutts-multilingual",
    llm: llm
)
config.system_prompt = "You are a helpful voice assistant. Keep replies short."

let agent = TheStageVoiceAgent(config: config)

Task {
    for await event in agent.events {
        switch event.kind {
        case .state_changed:    print("[STATE] \(event.data["state"] ?? "?")")
        case .user_request:     print("[YOU] \(event.data["text"] ?? "")")
        case .response_delta:   print(event.data["delta"] ?? "", terminator: "")
        case .response_done:    print("\n[ASSISTANT DONE]")
        case .error:            print("[ERROR] \(event.data["message"] ?? "")")
        default: break
        }
    }
}

Task {
    for await delta in agent.llm_deltas.recv() {
        // Append delta to a chat bubble
    }
}

try await agent.start()

The events stream delivers every state transition and content event. You only need to handle the kinds you care about — the agent keeps running regardless.

Flutter

import 'dart:io'; // for stdout
import 'package:thestage_apple_sdk/thestage_apple_sdk.dart';

await TheStageFlutterSDK.initialize(api_token: 'th_…');

final agent = TheStageVoiceAgentFlutter();

agent.events.listen((event) {
  switch (event['kind']) {
    case 'state_changed': print('STATE: ${event['state']}');
    case 'user_request':  print('YOU: ${event['text']}');
    case 'response_delta': stdout.write(event['delta']);
    case 'response_done': print('\nASSISTANT DONE');
  }
});

agent.llmDeltas.listen((delta) { /* update assistant bubble */ });
agent.transcripts.listen((text) { /* show user turn */ });
agent.vadProbabilities.listen((p) { /* drive a level meter */ });

await agent.start(config: {
  'vad': 'TheStageAI/silero-vad',
  'stt': 'TheStageAI/thewhisper-large-v3-turbo',
  'tts': 'TheStageAI/neutts-multilingual',
  'llm_provider': 'openai_compatible',
  'llm_endpoint': 'https://api.openai.com/v1/chat/completions',
  'llm_api_key': 'sk-...',
  'llm_model': 'gpt-4o-mini',
  'system_prompt': 'You are a helpful voice assistant.',
});

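// Runtime controls, callable once the agent is running:
await agent.interrupt();           // cut off the current response
await agent.say('Welcome back!');  // speak a fixed phrase
await agent.stop();                // shut the agent down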

Using an on-device LLM (fully offline)

When privacy matters or the device has no network connection, you can swap the cloud LLM for a local model. The rest of the pipeline (VAD, STT, TTS) already runs on-device, so this makes the entire voice loop fully offline.

let llm = TheStageLocalLLMProvider(
    model: "TheStageAI/neullm-small"
)

var config = TheStageAgentConfig(
    vad: "TheStageAI/silero-vad",
    stt: "TheStageAI/thewhisper-large-v3-turbo",
    tts: "TheStageAI/neutts-multilingual",
    llm: llm
)

In Flutter, set llm_provider to "local" and provide an llm_model id or path instead of an endpoint:

await agent.start(config: {
  'vad': 'TheStageAI/silero-vad',
  'stt': 'TheStageAI/thewhisper-large-v3-turbo',
  'tts': 'TheStageAI/neutts-multilingual',
  'llm_provider': 'local',
  'llm_model': 'TheStageAI/neullm-small',
  'system_prompt': 'You are a helpful offline assistant.',
});

Note

Local LLMs are smaller and faster to respond but less capable than large cloud models. Test with your actual prompts to make sure the quality meets your needs.

Handling barge-in (user interrupts the assistant)

Barge-in is when the user starts talking while the assistant is still speaking — for example, saying “stop” or asking a follow-up question before the current answer finishes. The agent detects this through VAD and immediately stops TTS playback so the user feels heard.

The behaviour is controlled by interrupt_mode:

  • .speech_only — barge-in triggers when the user’s voice exceeds the VAD threshold for at least interrupt_min_speech_ms milliseconds. This is the default on iOS, where hardware AEC filters out the speaker’s own audio.

  • .none — barge-in is disabled. The user must wait for the assistant to finish. This is the default on macOS, where AEC is less reliable and the agent’s own TTS output can be mistaken for user speech.

You can tune sensitivity at runtime:

await agent.update_interrupt_config(
    min_speech_ms: 200,
    mode: .speech_only
)

Lower interrupt_min_speech_ms makes barge-in more responsive but increases the risk of false triggers from background noise. Higher values require a more deliberate interruption.
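As a rough illustration of that trade-off, the sketch below gates barge-in on a run of consecutive above-threshold VAD frames. `BargeInGate` is hypothetical, the 32 ms frame duration is an assumption, and the way the real agent combines interrupt_threshold, interrupt_min_speech_ms, and interrupt_min_playback_ms may differ.

```swift
/// Hypothetical barge-in gate (not an SDK type). Defaults mirror the
/// documented config defaults; `frameMs` is an assumed VAD frame length.
struct BargeInGate {
    var threshold: Double = 0.9   // interrupt_threshold
    var minSpeechMs: Int = 600    // interrupt_min_speech_ms
    var minPlaybackMs: Int = 250  // interrupt_min_playback_ms
    var frameMs: Int = 32         // assumption, not documented

    private var voicedMs = 0
    private var playbackMs = 0

    /// Feed one VAD probability observed while TTS is playing.
    /// Returns true once the sustained-speech requirement is met.
    mutating func observe(probability: Double) -> Bool {
        playbackMs += frameMs
        // Grace period at the start of playback: ignore everything.
        guard playbackMs > minPlaybackMs else { voicedMs = 0; return false }
        // A single sub-threshold frame resets the run of sustained speech.
        voicedMs = probability >= threshold ? voicedMs + frameMs : 0
        return voicedMs >= minSpeechMs
    }
}

var gate = BargeInGate()
gate.minSpeechMs = 200
gate.minPlaybackMs = 0   // disable the grace period to keep the example short
let fired = (0..<7).map { _ in gate.observe(probability: 0.95) }
// Six frames (192 ms) are not enough; the seventh (224 ms) fires.
```

With these assumptions, the default of 600 ms needs 19 consecutive frames (608 ms), which is why a short cough or door slam rarely interrupts the assistant.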

Attention

On macOS, enabling .speech_only without hardware AEC will cause the agent to interrupt itself — its own TTS output is picked up by the microphone and interpreted as user speech. Either keep .none or use external headphones with a directional microphone.

Adding a wake word

By default the agent transitions straight from loading to listening and is always active. If you want the assistant to wait for a trigger phrase before it starts processing speech, configure a wake-word model.

With a wake word enabled, the state machine gains a sleeping state. The agent sits in low-power standby and only transitions to listening when it detects the wake phrase.

Swift

var config = TheStageAgentConfig(
    vad: "TheStageAI/silero-vad",
    stt: "TheStageAI/thewhisper-large-v3-turbo",
    tts: "TheStageAI/neutts-multilingual",
    llm: llm
)
config.wake_word = "TheStageAI/wake-word-hey-assistant"
config.ww_threshold_score = 0.5

Flutter

await agent.start(config: {
  'vad': 'TheStageAI/silero-vad',
  'stt': 'TheStageAI/thewhisper-large-v3-turbo',
  'tts': 'TheStageAI/neutts-multilingual',
  'llm_provider': 'openai_compatible',
  'llm_endpoint': 'https://api.openai.com/v1/chat/completions',
  'llm_api_key': 'sk-...',
  'llm_model': 'gpt-4o-mini',
  'wake_word': 'TheStageAI/wake-word-hey-assistant',
  'ww_threshold_score': 0.5,
});

After the assistant finishes speaking it returns to sleeping rather than listening, so the user must say the wake phrase again for each new conversation turn.

Building a chat UI on top of the agent

The agent exposes several real-time output channels you can bind to UI elements. These are independent of the events stream and deliver typed values instead of generic event dictionaries.

| Channel | UI use case |
|---|---|
| llm_deltas | Append each token to the assistant’s chat bubble for a typing effect |
| partial_transcripts | Show a live “listening…” caption while the user is speaking |
| transcripts | Display the finalized user turn in the chat history |
| vad_probabilities | Drive a microphone level meter or voice-activity indicator |

Swift

Task {
    for await delta in agent.llm_deltas.recv() {
        assistantBubble.text += delta
    }
}

Task {
    for await partial in agent.partial_transcripts.recv() {
        listeningLabel.text = partial
    }
}

Task {
    for await transcript in agent.transcripts.recv() {
        chatHistory.append(UserMessage(text: transcript))
    }
}

Task {
    for await probability in agent.vad_probabilities.recv() {
        micMeter.level = probability
    }
}

Flutter

agent.llmDeltas.listen((delta) {
  setState(() => assistantText += delta);
});

agent.transcripts.listen((text) {
  setState(() => chatHistory.add(UserMessage(text)));
});

agent.vadProbabilities.listen((p) {
  setState(() => micLevel = p);
});

Note

Flutter equivalents: agent.llmDeltas, agent.transcripts, agent.vadProbabilities (Stream types).

Running in background on iOS

The voice agent needs the microphone and audio output to stay active when the app moves to the background. Without the audio background mode, iOS suspends the app and the agent stops listening.

Add audio to UIBackgroundModes in your Info.plist:

<key>UIBackgroundModes</key>
<array>
    <string>audio</string>
</array>

This is sufficient for the agent to continue operating when the user switches apps or locks the screen. No code changes are needed: the audio session category (set to .playback by default in AudioStreamConfig) keeps the session alive.

Note

With NPU defaults, all pipelines keep running while the app is backgrounded.

Events

The agent communicates state changes and content through an event stream. Subscribe to agent.events (Swift) or agent.events.listen (Flutter) to drive your UI or trigger application logic.

| Kind | Data Keys | When |
|---|---|---|
| state_changed | state | State transition |
| user_request_partial | text | Stable partial caption mid-turn |
| user_request | text, source | User request finalized |
| response_delta | delta | LLM token arrived |
| response_done | text, reason, interrupted | Response stream finished |
| playback_started | | First TTS sample reached speaker |
| playback_ended | reason | Speaker stopped |
| metrics | loading_model, … | Heartbeat metrics |
| error | message | Recoverable error |

API Reference

All configuration is passed through TheStageAgentConfig (Swift) or the config dictionary (Flutter). The tables below group related settings. Most defaults work well out of the box — you typically only need to set the model paths and an LLM provider to get started.

Models

Which model bundles to load. vad and stt are required for the agent to function. tts is optional if you only need speech-to-text. wake_word enables standby mode.

| Field | Type | Default | Description |
|---|---|---|---|
| vad | String | required | HF id or local path of Silero VAD bundle |
| stt | String | required | HF id or local path of Whisper bundle |
| tts | String? | nil | HF id or local path of NeuTTS bundle |
| tts_voice | String | "paul" | Voice preset id |
| wake_word | String? | nil | Optional wake-word bundle |
| stt_language | String | "en" | Whisper decode language |

Compute Device Routing

Controls which hardware accelerator each model runs on. The default "npu" uses the Apple Neural Engine, which is the fastest option on supported devices. Change to "gpu" or "cpu" only if you need to debug or the model does not support the Neural Engine.

| Field | Type | Default | Description |
|---|---|---|---|
| vad_device | String | "npu" | Silero VAD compute device |
| stt_device | String | "npu" | Whisper coarse default |
| stt_devices | [String:String]? | nil | Per-module override |
| tts_device | String | "npu" | NeuTTS coarse default |
| tts_devices | [String:String]? | nil | Per-module override |

LLM

The language model that generates the assistant’s responses. Choose between a cloud provider (TheStageOpenAICompatibleProvider) for maximum quality or a local model (TheStageLocalLLMProvider) for offline / privacy use cases.

| Field | Type | Default | Description |
|---|---|---|---|
| llm | TheStageLLMProvider | required | Local or remote LLM (Swift) |
| llm_provider | String | required | "local" or "openai_compatible" (Flutter) |
| system_prompt | String | helpful default | System message |
| max_tokens | Int | 256 | Generation cap |
| temperature | Double | 0.7 | Sampling temperature |
| chat_memory | TheStageChatMemory | SlidingWindowMemory(max_turns: 10) | History strategy |
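The default history strategy, SlidingWindowMemory(max_turns: 10), suggests that only the most recent turns are sent to the LLM. The sketch below shows that general technique under the assumption that one "turn" is a user/assistant pair; `ChatTurn` and `SlidingWindow` are illustrative names, and the SDK's own type may count or trim differently.

```swift
/// Illustrative types (not the SDK's TheStageChatMemory implementation).
struct ChatTurn: Equatable {
    let user: String
    let assistant: String
}

/// Keeps only the most recent `maxTurns` turns.
struct SlidingWindow {
    let maxTurns: Int
    private(set) var turns: [ChatTurn] = []

    init(maxTurns: Int) { self.maxTurns = maxTurns }

    mutating func append(_ turn: ChatTurn) {
        turns.append(turn)
        // Drop the oldest turns once the window overflows.
        if turns.count > maxTurns {
            turns.removeFirst(turns.count - maxTurns)
        }
    }
}

var memory = SlidingWindow(maxTurns: 2)
for i in 1...3 {
    memory.append(ChatTurn(user: "q\(i)", assistant: "a\(i)"))
}
// memory.turns now holds only turns 2 and 3
```

A bounded window keeps prompt size, and therefore time to first token, roughly constant over a long conversation, at the cost of forgetting older context.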

VAD / Endpointing

Voice Activity Detection determines when the user starts and stops speaking. vad_threshold and vad_onset_ms control how much evidence the system needs before it decides speech has begun. silence_timeout_ms controls how long the user must be silent before the turn is committed to the LLM.

| Field | Type | Default | Description |
|---|---|---|---|
| vad_threshold | Double | 0.8 | Speech probability threshold |
| vad_onset_ms | Int | 96 | Sustained voiced duration to trigger onset |
| silence_timeout_ms | Int | 608 | Trailing silence to commit turn |
| max_accumulation_ms | Int | 30000 | Hard cap on single turn |
| pre_roll_ms | Int | 200 | Pre-roll captured before onset |
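To show how these settings could interact, here is a minimal Swift endpointer that turns a stream of VAD probabilities into "speech started" and "turn committed" events. It is a sketch of the general technique, not the agent's implementation: `Endpointer` and `EndpointEvent` are hypothetical, the 32 ms frame length is an assumption, and pre-roll buffering is not modeled.

```swift
enum EndpointEvent: Equatable {
    case speechStarted
    case turnCommitted
}

/// Hypothetical endpointer (not an SDK type); defaults mirror the table above.
struct Endpointer {
    var threshold: Double = 0.8         // vad_threshold
    var onsetMs: Int = 96               // vad_onset_ms
    var silenceTimeoutMs: Int = 608     // silence_timeout_ms
    var maxAccumulationMs: Int = 30_000 // max_accumulation_ms
    var frameMs: Int = 32               // assumption, not documented

    private var inSpeech = false
    private var voicedRunMs = 0
    private var silenceRunMs = 0
    private var turnMs = 0

    mutating func observe(probability: Double) -> EndpointEvent? {
        let voiced = probability >= threshold
        if !inSpeech {
            // Onset: require a sustained run of voiced frames.
            voicedRunMs = voiced ? voicedRunMs + frameMs : 0
            guard voicedRunMs >= onsetMs else { return nil }
            inSpeech = true
            turnMs = voicedRunMs
            silenceRunMs = 0
            return .speechStarted
        }
        // In speech: track trailing silence and total turn length.
        turnMs += frameMs
        silenceRunMs = voiced ? 0 : silenceRunMs + frameMs
        if silenceRunMs >= silenceTimeoutMs || turnMs >= maxAccumulationMs {
            inSpeech = false
            voicedRunMs = 0
            return .turnCommitted
        }
        return nil
    }
}

var endpointer = Endpointer()
let frames = Array(repeating: 0.95, count: 3) + Array(repeating: 0.05, count: 19)
let events = frames.compactMap { endpointer.observe(probability: $0) }
// events == [.speechStarted, .turnCommitted]
```

Under the 32 ms assumption, the defaults correspond to 3 voiced frames for onset and 19 silent frames to commit the turn.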

Turn Detection

By default the agent commits the user’s turn after a fixed silence timeout (.vad mode). Switching to .dnn mode uses a neural model (smart-turn-v3) that listens to the audio and predicts whether the user is actually done speaking — even during short pauses. This dramatically reduces premature cut-offs for users who pause mid-sentence.

| Field | Type | Default | Description |
|---|---|---|---|
| turn_detection_mode | enum | .vad | .vad or .dnn |
| turn_detector | String? | nil | smart-turn engines repo/path |
| turn_eot_threshold | Double | 0.85 | Completion prob threshold |
| turn_eot_confirm_count | Int | 2 | Consecutive done verdicts needed |
| turn_pause_trigger_ms | Int | 256 | Trailing silence before first model call |
| turn_reeval_interval_ms | Int | 120 | Re-run cadence on sustained pause |
| turn_max_silence_ms | Int | 5000 | Hard fallback |
| turn_window_ms | Int | 8000 | Trailing audio window fed to model |
| turn_min_speech_ms | Int | 250 | Minimum speech before consulting model |
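One plausible reading of how the .dnn settings fit together is sketched below. `TurnDecider` is hypothetical, the `predict` closure stands in for the smart-turn model, and the exact ordering of checks inside the agent is an assumption made for illustration.

```swift
/// Hypothetical end-of-turn decision loop (not an SDK type).
/// Defaults mirror the table above.
struct TurnDecider {
    var eotThreshold: Double = 0.85  // turn_eot_threshold
    var confirmCount: Int = 2        // turn_eot_confirm_count
    var pauseTriggerMs: Int = 256    // turn_pause_trigger_ms
    var reevalIntervalMs: Int = 120  // turn_reeval_interval_ms
    var maxSilenceMs: Int = 5000     // turn_max_silence_ms
    var minSpeechMs: Int = 250       // turn_min_speech_ms

    private var confirmations = 0
    private var lastEvalAtMs: Int? = nil

    /// - silenceMs: trailing silence so far in the current pause
    /// - speechMs: speech accumulated in this turn
    /// - predict: stand-in for the neural model; returns P(turn complete)
    mutating func shouldCommit(silenceMs: Int, speechMs: Int, predict: () -> Double) -> Bool {
        if silenceMs == 0 {                          // user resumed speaking
            confirmations = 0
            lastEvalAtMs = nil
            return false
        }
        if silenceMs >= maxSilenceMs { return true } // hard fallback
        guard speechMs >= minSpeechMs, silenceMs >= pauseTriggerMs else { return false }
        // Re-run the model only at the configured cadence.
        if let last = lastEvalAtMs, silenceMs - last < reevalIntervalMs { return false }
        lastEvalAtMs = silenceMs
        confirmations = predict() >= eotThreshold ? confirmations + 1 : 0
        return confirmations >= confirmCount
    }
}

var decider = TurnDecider()
let model = { 0.9 }  // stand-in model: always "probably finished"
let verdicts = [256, 288, 384].map {
    decider.shouldCommit(silenceMs: $0, speechMs: 1000, predict: model)
}
// verdicts == [false, false, true]
```

In this reading, lowering turn_eot_threshold or turn_eot_confirm_count makes the agent respond sooner, while raising them protects users who pause mid-sentence.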

Streaming Transcription (ASR)

Live partial captions let you show the user what the agent is hearing while they are still talking. Disable asr_streaming if you only need the final transcript.

| Field | Type | Default | Description |
|---|---|---|---|
| asr_streaming | Bool | true | Emit live partial captions |
| asr_partial_interval_ms | Int | 600 | Minimum new audio between caption passes |
| speculative_whisper | Bool | true | Speculative full-utterance pass at first VAD pause |

Interruption / AEC

Controls how the agent handles barge-in. On iOS, the hardware Acoustic Echo Cancellation (AEC) unit filters out the device’s own speaker output, so the agent can safely listen for user speech while it is talking. On macOS, AEC is unreliable — the default is .none to prevent self-interruption.

| Field | Type | Default | Description |
|---|---|---|---|
| interrupt_mode | InterruptTrigger | .speech_only (iOS) / .none (macOS) | How user can barge in |
| interrupt_min_speech_ms | Int | 600 | Sustained speech needed to interrupt |
| interrupt_threshold | Double | 0.9 | VAD prob threshold for barge-in |
| interrupt_min_playback_ms | Int | 250 | Grace at TTS turn start |
| aec_enabled | Bool | true | Voice Processing IO (iOS only) |

Wake-Word Standby

When a wake-word model is configured, the agent enters sleeping mode after each interaction and only wakes when the trigger phrase is detected. Leave wake_word as nil (the default) to skip standby entirely.

| Field | Type | Default | Description |
|---|---|---|---|
| wake_word | String? | nil | HF id or local path of wake-word bundle |
| ww_threshold_score | Double | 0.5 | Wake-word classifier threshold |
| ww_device | String | "npu" | Wake-word compute device |

Programmatic Controls

await agent.interrupt()                        // cut off the current response
await agent.say("Hi there!")                   // speak a fixed phrase
await agent.send_request("What time is it?")   // submit a text request
await agent.set_voice("dave")                  // switch the TTS voice preset
let history = await agent.history()            // read the conversation so far
await agent.clear_history()
await agent.update_interrupt_config(min_speech_ms: 200, mode: .speech_only)
await agent.update_turn_config(eot_threshold: 0.6, pause_trigger_ms: 256)
await agent.stop()

Note

TheStageAI.infer and TheStageAI.infer_stream are nonisolated, so each node runs on independent tasks. Don’t wrap inference calls in Task { @MainActor in ... }.

Latency

Measured on an M-class Mac with OpenAI gpt-4o-mini:

| Turn | LLM 1st Token | First Audio | Full Speak |
|---|---|---|---|
| Short reply | 487 ms | 521 ms | 3.3 s |
| Long monologue | 575 ms | 601 ms | 53.1 s |
| Mid-length | 1226 ms | 1497 ms | 5.8 s |
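A quick way to read these numbers is to subtract the first-token time from the first-audio time, which approximates what the TTS stage adds on top of the LLM. The snippet below does that arithmetic; it assumes both columns are measured from the same starting point, which the measurements above do not state explicitly. `LatencyRow` is just a local helper type.

```swift
/// Local helper type for the arithmetic; values are copied from the table above.
struct LatencyRow {
    let name: String
    let firstTokenMs: Int
    let firstAudioMs: Int
}

let rows = [
    LatencyRow(name: "Short reply", firstTokenMs: 487, firstAudioMs: 521),
    LatencyRow(name: "Long monologue", firstTokenMs: 575, firstAudioMs: 601),
    LatencyRow(name: "Mid-length", firstTokenMs: 1226, firstAudioMs: 1497),
]

// Time between the first LLM token and the first audible sample.
let gapsMs = rows.map { $0.firstAudioMs - $0.firstTokenMs }
for (row, gap) in zip(rows, gapsMs) {
    print("\(row.name): \(gap) ms from first token to first audio")
}
// gapsMs == [34, 26, 271]
```

In all three rows most of the time to first audio is spent waiting for the LLM's first token, so the choice of LLM provider and network path dominates perceived latency.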

Troubleshooting

Agent keeps interrupting itself

This happens when the agent’s own TTS output is picked up by the microphone and interpreted as user speech, triggering a barge-in. The root cause is missing or ineffective Acoustic Echo Cancellation (AEC).

On macOS: set interrupt_mode to .none. macOS does not have hardware-level AEC, so the default is already .none — if you overrode it to .speech_only, revert or use headphones.

On iOS: verify that aec_enabled is true (the default). If you disabled it, re-enable it. If self-interruption still occurs with AEC enabled, increase interrupt_min_speech_ms (e.g. to 800) so the agent requires a longer burst of detected speech before it triggers a barge-in.

Agent responds too slowly or cuts off the user

These are two sides of the same coin — turn detection timing.

Cuts off the user mid-sentence: the agent commits the turn too eagerly. Switch from .vad to .dnn turn detection mode so the neural model can distinguish between a pause and the end of the utterance:

config.turn_detection_mode = .dnn
config.turn_detector = "TheStageAI/smart-turn-v3"

If you stay in .vad mode, increase silence_timeout_ms (default 608 ms) to give the user more breathing room.

Responds too slowly: the agent waits too long after the user finishes. In .dnn mode, lower turn_eot_threshold (default 0.85) so the model commits sooner. In .vad mode, decrease silence_timeout_ms.

Model loading takes too long

The first call to agent.start() downloads all model bundles from Hugging Face and compiles them for the target device. This can take tens of seconds on a cold start.

Use prefetch_engines on a loading or splash screen to move the download out of the critical path:

try await ai.prefetch_engines(
    engines: [
        "TheStageAI/silero-vad",
        "TheStageAI/thewhisper-large-v3-turbo",
        "TheStageAI/neutts-multilingual"
    ]
)

Once cached, subsequent launches load models from disk in under a second.

Transcription is inaccurate

Check these common causes:

  1. Audio format mismatch. The STT pipeline expects 16 kHz mono Float32 input. If your audio session is configured for a different sample rate or channel count, the Whisper model receives garbled audio. The default AudioStreamConfig values are correct — only check this if you overrode them.

  2. Wrong language. stt_language defaults to "en". If the user is speaking another language, set this to the correct BCP-47 code (e.g. "es", "de", "ja").

  3. Noisy environment. Whisper is robust but not immune to loud background noise. On iOS, ensure aec_enabled is true so the device’s own speaker output is cancelled from the microphone signal.
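For the first cause, if you feed the pipeline from your own capture code, the samples have to be converted before they reach STT. The sketch below covers two of the three requirements (mono and Float32) for interleaved 16-bit stereo input; it does not resample, so the source must already be at 16 kHz. The function name is hypothetical and not part of the SDK.

```swift
/// Hypothetical helper: converts interleaved 16-bit stereo PCM to mono Float32
/// in the range [-1, 1). Does NOT change the sample rate.
func monoFloat32(fromInterleavedStereo samples: [Int16]) -> [Float] {
    var out: [Float] = []
    out.reserveCapacity(samples.count / 2)
    var i = 0
    while i + 1 < samples.count {
        let left = Float(samples[i]) / 32768.0
        let right = Float(samples[i + 1]) / 32768.0
        out.append((left + right) / 2)  // average the two channels
        i += 2
    }
    return out
}

let mono = monoFloat32(fromInterleavedStereo: [16384, 16384, -32768, 0])
// mono == [0.5, -0.5]
```

If the source runs at 44.1 or 48 kHz, resample it with a proper converter before this step rather than dropping samples, which would alias.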

No audio output on device

If the agent appears to work (events fire, transcripts appear) but no sound is heard:

  1. Missing background mode. On iOS, add audio to UIBackgroundModes in Info.plist. Without it, the audio session may be deactivated by the system.

    <key>UIBackgroundModes</key>
    <array>
        <string>audio</string>
    </array>
    
  2. TTS not configured. If tts is nil in the config, the agent operates in speech-to-text-only mode. Set it to a valid NeuTTS bundle path.

  3. Silent mode / volume. On iOS, check that the device is not in silent mode and the media volume is up. The AudioStreamConfig category defaults to .playback, which respects the media volume slider.