VAD (Voice Activity Detection)

Overview

Stateful, per-chunk speech detection. Use it to gate microphone capture for Whisper / TTS, or run it as a batch segmenter that slices a longer recording into speech regions.

VAD is reached through the singleton (JSON) path on both Swift and Flutter — the response shape is identical.

Supported Models

| Model | HF Repo | Notes |
|---|---|---|
| Silero VAD | TheStageAI/silero-vad | Stateful LSTM, 512-sample chunks at 16 kHz |

API Reference

Inputs / Outputs (single-chunk mode)

| Direction | Name | Type | Description |
|---|---|---|---|
| input | audio | [Float] | 16 kHz mono PCM, exactly 512 samples (32 ms). |
| input | reset_state | Bool (default false) | Reset the LSTM state between independent utterances. |
| output | probability | Double | Speech probability in [0.0, 1.0]. |

Audio I/O

  • 16 kHz mono [Float], samples in [-1.0, 1.0].

  • Chunk size: exactly 512 samples per infer call. Smaller chunks are zero-padded; larger chunks are rejected.

  • Stateful. The model keeps an LSTM hidden state across calls. Pass "reset_state": true between independent utterances.

  • Internal context. A 64-sample carry-over from the previous chunk is prepended automatically.
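For audio that is already in memory, the constraints above can be met with two small helpers. This is a minimal sketch, not part of the SDK: normalize and splitIntoChunks are hypothetical names, and the input is assumed to be 16-bit PCM that is already 16 kHz mono.

```swift
import Foundation

/// Hypothetical helper: converts 16-bit PCM samples to Float in [-1.0, 1.0].
func normalize(_ pcm: [Int16]) -> [Float] {
    // Dividing by 32768 maps Int16.min to exactly -1.0 and Int16.max to just under 1.0.
    return pcm.map { Float($0) / 32768.0 }
}

/// Hypothetical helper: splits audio into fixed-size chunks, zero-padding the last one.
func splitIntoChunks(_ audio: [Float], size: Int = 512) -> [[Float]] {
    precondition(size > 0, "chunk size must be positive")
    var chunks: [[Float]] = []
    var start = 0
    while start < audio.count {
        let end = min(start + size, audio.count)
        var chunk = Array(audio[start..<end])
        if chunk.count < size {
            // Pad explicitly so every chunk has exactly `size` samples.
            chunk.append(contentsOf: [Float](repeating: 0, count: size - chunk.count))
        }
        chunks.append(chunk)
        start = end
    }
    return chunks
}
```

Padding the final chunk explicitly keeps every infer call at 512 samples instead of relying on the internal zero-padding.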

Segment Extraction (batch mode)

| Parameter | Type | Description |
|---|---|---|
| audio | [Float] | 16 kHz mono PCM, any length. |
| extract_segments | Bool | Must be true to enable batch mode. |
| threshold | Double | Speech probability threshold (e.g. 0.5). |
| neg_threshold | Double | Negative threshold for speech end (-1.0 to disable). |
| min_speech_duration_ms | Int | Minimum speech segment duration in milliseconds. |
| min_silence_duration_ms | Int | Minimum silence gap to split segments. |
| speech_pad_ms | Int | Padding added around each detected segment. |
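To make the interaction between these parameters concrete, the sketch below applies them to a list of per-chunk probabilities. It illustrates how threshold-based segmenters commonly work and is not the SDK's implementation. extractSegments and Segment are hypothetical names, and treating a negative neg_threshold as "reuse threshold for the end of speech" is an assumption.

```swift
import Foundation

/// Hypothetical result type: sample indices, end exclusive.
struct Segment: Equatable { var start: Int; var end: Int }

/// Illustrative only: turns per-chunk speech probabilities into padded segments.
func extractSegments(probabilities: [Double],
                     chunkSize: Int = 512,
                     sampleRate: Int = 16_000,
                     threshold: Double = 0.5,
                     negThreshold: Double = -1.0,
                     minSpeechMs: Int = 250,
                     minSilenceMs: Int = 100,
                     speechPadMs: Int = 30) -> [Segment] {
    // Assumption: a negative neg_threshold means "no separate exit threshold".
    let exitThreshold = negThreshold < 0 ? threshold : negThreshold
    let minSpeech = minSpeechMs * sampleRate / 1000
    let minSilence = minSilenceMs * sampleRate / 1000
    let pad = speechPadMs * sampleRate / 1000
    let total = probabilities.count * chunkSize

    var segments: [Segment] = []
    var start: Int? = nil          // first sample of the open segment
    var silenceStart: Int? = nil   // where the current run of silence began

    func closeSegment(at end: Int) {
        // Keep the segment only if it is long enough to count as speech.
        if let s = start, end - s >= minSpeech {
            segments.append(Segment(start: s, end: end))
        }
        start = nil
        silenceStart = nil
    }

    for (i, p) in probabilities.enumerated() {
        let pos = i * chunkSize
        if start == nil {
            if p >= threshold { start = pos }
        } else if p < exitThreshold {
            let s = silenceStart ?? pos
            silenceStart = s
            // End the segment once the silence has lasted long enough.
            if pos + chunkSize - s >= minSilence { closeSegment(at: s) }
        } else {
            silenceStart = nil   // speech resumed before the gap was long enough
        }
    }
    if start != nil { closeSegment(at: silenceStart ?? total) }

    // Pad both ends, clamped to the recording; padded neighbours may overlap.
    return segments.map {
        Segment(start: max(0, $0.start - pad), end: min(total, $0.end + pad))
    }
}
```

In this sketch a short burst is dropped by the minimum speech duration, and a brief dip in probability does not split a segment unless it lasts at least the minimum silence duration.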

Singleton API

import TheStageSDK

let ai = TheStageAI.shared
try await ai.initialize(apiToken: "th_…")

try await ai.start_model(
    model_name: "vad",
    engines_path: "TheStageAI/silero-vad"
)

let json = try ai.infer(
    model_name: "vad",
    input_json: ["audio": audio_chunk]
)
let probability = json[0]["probability"] as! Double

Usage Guides

Detecting if someone is speaking

The simplest use case: determine whether a given audio chunk contains speech. Use this to gate microphone input before sending to a transcription pipeline — it avoids wasting compute on silence and prevents spurious transcription results from background noise.

Swift:

import TheStageSDK

let ai = TheStageAI.shared
try await ai.initialize(apiToken: "th_…")

try await ai.start_model(
    model_name: "vad",
    engines_path: "TheStageAI/silero-vad"
)

let result = try ai.infer(
    model_name: "vad",
    input_json: ["audio": audio_chunk]
)
let probability = result[0]["probability"] as! Double
if probability > 0.5 {
    print("Speech detected!")
}

Flutter:

import 'package:thestage_apple_sdk/thestage_apple_sdk.dart';
import 'dart:typed_data';

await TheStageFlutterSDK.initialize(api_token: 'th_…');

await TheStageFlutterSDK.start_model(
  model_name: 'vad',
  engines_path: 'TheStageAI/silero-vad',
);

final result = await TheStageFlutterSDK.infer(
  model_name: 'vad',
  input_json: {'audio': audio_chunk},
);
final probability = result[0]['probability'] as double;
if (probability > 0.5) {
  print('Speech detected!');
}

Note

The audio_chunk must be exactly 512 samples of 16 kHz mono PCM (32 ms of audio). Configure your microphone capture buffer size to match this requirement.

Building a real-time speech gate (mic → VAD → transcription)

This is the basic pattern behind most real-time speech applications. The flow is:

  1. Capture 512-sample chunks from the microphone at 16 kHz.

  2. Feed each chunk to VAD and check the speech probability.

  3. While speech is detected, accumulate chunks into a buffer.

  4. When speech ends (probability drops below threshold), send the accumulated buffer to Whisper for transcription.

This avoids running expensive ASR on silence and gives you clean utterance boundaries for free.

Swift:

let threshold = 0.5
var speechBuffer: [Float] = []  // samples of the utterance in progress

for chunk in microphoneStream {
    let result = try ai.infer(
        model_name: "vad",
        input_json: ["audio": chunk]
    )
    let probability = result[0]["probability"] as! Double

    if probability > threshold {
        // Speech: keep accumulating.
        speechBuffer.append(contentsOf: chunk)
    } else if !speechBuffer.isEmpty {
        // Speech just ended: transcribe the utterance, then start over.
        let transcript = try ai.infer(
            model_name: "stt",
            input_json: ["audio": speechBuffer]
        )
        speechBuffer.removeAll()
    }
}

Flutter:

const threshold = 0.5;
final speechBuffer = <double>[];

await for (final Float32List chunk in microphoneStream) {
  final result = await TheStageFlutterSDK.infer(
    model_name: 'vad',
    input_json: {'audio': chunk},
  );
  final probability = result[0]['probability'] as double;

  if (probability > threshold) {
    speechBuffer.addAll(chunk);
  } else if (speechBuffer.isNotEmpty) {
    final pcm = Float32List.fromList(speechBuffer);
    await TheStageFlutterSDK.infer(
      model_name: 'stt',
      input_json: {'audio': pcm},
    );
    speechBuffer.clear();
  }
}

Attention

In production, add a minimum speech duration check (e.g. 200 ms) before sending to Whisper. Very short bursts are often false positives from transient noises.
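One way to fold that minimum-duration check into the gate is a small value type that owns the buffer. This is a sketch under stated assumptions: SpeechGate is a hypothetical name, not an SDK type, and the 200 ms default simply mirrors the example figure above.

```swift
import Foundation

/// Hypothetical helper: accumulates chunks while speech is active and
/// releases an utterance only if it is long enough.
struct SpeechGate {
    let threshold: Double
    let minSpeechSamples: Int
    private var buffer: [Float] = []

    init(threshold: Double = 0.5, minSpeechMs: Int = 200, sampleRate: Int = 16_000) {
        self.threshold = threshold
        self.minSpeechSamples = minSpeechMs * sampleRate / 1000
    }

    /// Feed one chunk together with its VAD probability.
    /// Returns a finished utterance when speech ends, or nil otherwise.
    mutating func process(chunk: [Float], probability: Double) -> [Float]? {
        if probability > threshold {
            buffer.append(contentsOf: chunk)
            return nil
        }
        guard !buffer.isEmpty else { return nil }
        let utterance = buffer
        buffer.removeAll(keepingCapacity: true)
        // Drop bursts shorter than the minimum; they are likely transient noise.
        return utterance.count >= minSpeechSamples ? utterance : nil
    }
}
```

In the loop, pass each chunk and its probability to process and send any non-nil result to the transcription model. At 16 kHz, 200 ms is 3,200 samples, so an utterance needs at least seven 512-sample chunks to pass.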

Slicing a recording into speech segments (batch mode)

When you have a complete audio recording and want to find all speech regions — for example, to pre-process a long file before transcription — use the segment extraction mode. Instead of calling infer chunk-by-chunk, you pass the entire recording and get back a list of start/end sample indices for each detected speech segment.

This is ideal for:

  • Transcription pre-processing (only transcribe the speech parts).

  • Audio editing tools that need to highlight spoken regions.

  • Splitting multi-speaker recordings at silence boundaries.

let result = try ai.infer(
    model_name: "vad",
    input_json: [
        "audio": long_audio,
        "extract_segments": true,
        "threshold": 0.5,
        "neg_threshold": -1.0,
        "min_speech_duration_ms": 250,
        "min_silence_duration_ms": 100,
        "speech_pad_ms": 30
    ]
)
for seg in result {
    let start = seg["start"] as! Int
    let end = seg["end"] as! Int
    let slice = Array(long_audio[start..<end])
}

Note

speech_pad_ms adds padding around each detected segment so you don’t clip the beginning or end of words. A value of 30–50 ms works well for most use cases.
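Because segments come back as sample indices while the tuning parameters are in milliseconds, it helps to convert between the two. The helpers below are a hypothetical sketch that assumes the 16 kHz sample rate used throughout this page.

```swift
import Foundation

/// Milliseconds to a sample count at the given rate.
func samples(fromMs ms: Int, sampleRate: Int = 16_000) -> Int {
    return ms * sampleRate / 1000
}

/// Sample index to a timestamp in seconds.
func seconds(fromSample index: Int, sampleRate: Int = 16_000) -> Double {
    return Double(index) / Double(sampleRate)
}
```

For example, samples(fromMs: 30) is 480, so a 30 ms pad extends each segment by a little under one 512-sample chunk on either side.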

Tuning VAD sensitivity

The threshold controls how aggressively audio is classified as speech. In batch mode you pass it as the threshold parameter. In single-chunk mode you compare the returned probability against it yourself.

  • Higher threshold (0.7–0.8): fewer false positives — the model only triggers on clear, confident speech. Good for noisy environments where you want to ignore background sounds, but may miss quiet or distant speakers.

  • Lower threshold (0.3–0.4): catches more speech including soft or distant voices, but also triggers more on ambient noise, keyboard clicks, or music.

  • Default (0.5): a balanced starting point for most environments.

// threshold is a batch-mode parameter, so extract_segments must be true.
let result = try ai.infer(
    model_name: "vad",
    input_json: [
        "audio": long_audio,
        "extract_segments": true,
        "threshold": 0.7
    ]
)

Concrete guidance:

  • Quiet office → 0.5

  • Noisy café / car → 0.7–0.8

  • Capturing distant speakers (conference room) → 0.3–0.4

  • Push-to-talk (you know the user intends to speak) → 0.3
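The guidance above can be kept in one place as a set of presets. ListeningEnvironment is a hypothetical type for illustration, and picking the midpoint where a range is given is an assumption, so treat these values as starting points to tune.

```swift
import Foundation

/// Hypothetical presets mirroring the guidance above.
enum ListeningEnvironment {
    case quietOffice, noisy, distantSpeakers, pushToTalk

    /// Starting threshold; where the guidance gives a range, its midpoint is used.
    var threshold: Double {
        switch self {
        case .quietOffice: return 0.5
        case .noisy: return 0.75            // from 0.7–0.8
        case .distantSpeakers: return 0.35  // from 0.3–0.4
        case .pushToTalk: return 0.3
        }
    }
}
```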

Resetting state between independent utterances

VAD uses a stateful LSTM that carries hidden state across consecutive infer calls. This is intentional: it lets the model use temporal context to improve accuracy across a continuous audio stream.

However, if you are processing multiple independent audio clips (e.g. different files, or restarting a recording session), you must pass reset_state: true on the first chunk of each new clip. Otherwise the LSTM carries over state from the previous clip, which causes inaccurate probability readings — typically inflated values that make the model think silence is speech.

// Processing clip A
for chunk in clipA {
    let _ = try ai.infer(model_name: "vad", input_json: ["audio": chunk])
}

// Processing clip B — reset state first
for (i, chunk) in clipB.enumerated() {
    let _ = try ai.infer(
        model_name: "vad",
        input_json: [
            "audio": chunk,
            "reset_state": i == 0
        ]
    )
}

Attention

A common bug: forgetting to reset between clips causes VAD to always return high probability on the new clip’s first few chunks, even if they are silence.

Troubleshooting

VAD always returns high probability (even on silence)

You likely forgot to pass reset_state: true between independent audio clips. The LSTM's internal state bleeds over from the previous audio, and if the previous clip ended with speech, the model continues predicting speech even on silence.

Fix: Pass "reset_state": true on the first chunk after switching to a new audio source or after any gap in the audio stream.
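Gaps in a live stream can be detected from chunk arrival times. The tracker below is a sketch, not an SDK feature: StreamResetTracker is a hypothetical name, and the 0.5 s gap limit is an arbitrary assumption to adjust for your capture pipeline.

```swift
import Foundation

/// Hypothetical helper: decides when "reset_state" should be true.
struct StreamResetTracker {
    /// Largest pause between chunks (in seconds) still treated as one continuous stream.
    let maxGap: TimeInterval
    private var lastChunkTime: TimeInterval?

    init(maxGap: TimeInterval = 0.5) {
        self.maxGap = maxGap
    }

    /// Returns the value to pass as "reset_state" for a chunk arriving at `now`.
    mutating func shouldReset(at now: TimeInterval) -> Bool {
        defer { lastChunkTime = now }
        // The very first chunk always starts from a clean state.
        guard let last = lastChunkTime else { return true }
        return now - last > maxGap
    }
}
```

Call shouldReset once per chunk with a monotonic timestamp and pass the result as reset_state in input_json.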

VAD misses speech at the start of an utterance

The first few chunks after a state reset may have lower sensitivity as the LSTM “warms up” — it needs a few frames of context to build confidence.

Workarounds:

  • Lower the threshold for the first ~100 ms (3 chunks at 32 ms each) after a reset. For example, use 0.3 for the first 3 chunks and then switch to your normal threshold.

  • If using the voice agent framework, configure pre_roll_ms to capture a small buffer of audio before the VAD trigger so you don’t miss the onset of speech.
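The first workaround amounts to a threshold that depends on how many chunks have been processed since the last reset. A minimal sketch, with warmupThreshold as a hypothetical helper and the defaults taken from the example values above:

```swift
import Foundation

/// Hypothetical helper: threshold to apply to the n-th chunk after a state reset (0-based).
func warmupThreshold(chunksSinceReset: Int,
                     normal: Double = 0.5,
                     warmup: Double = 0.3,
                     warmupChunks: Int = 3) -> Double {
    // Use the lower threshold only while the model is still building context.
    return chunksSinceReset < warmupChunks ? warmup : normal
}
```

Keep a counter that is set to zero whenever reset_state is sent and incremented after each infer call, then compare the probability against warmupThreshold(chunksSinceReset:).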

Wrong chunk size error

VAD requires exactly 512 samples per infer call at 16 kHz (32 ms of audio).

  • Chunks smaller than 512 samples are zero-padded internally, which may reduce accuracy.

  • Chunks larger than 512 samples are rejected with an error.

Fix: Configure your microphone capture buffer to produce exactly 512 samples per callback. At 16 kHz, this means a 32 ms buffer. If your audio framework uses a different buffer size, accumulate samples in an intermediate ring buffer and dispatch exactly 512 at a time.
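The intermediate buffer could look like the following. This is a sketch that uses a simple array-backed queue rather than a true ring buffer, and ChunkAccumulator is a hypothetical name, not an SDK type.

```swift
import Foundation

/// Hypothetical helper: re-slices capture buffers of any size into fixed-size chunks.
struct ChunkAccumulator {
    let chunkSize: Int
    private var pending: [Float] = []

    init(chunkSize: Int = 512) {
        precondition(chunkSize > 0, "chunk size must be positive")
        self.chunkSize = chunkSize
    }

    /// Appends a capture buffer and returns every complete chunk now available.
    /// Leftover samples stay queued for the next call.
    mutating func push(_ samples: [Float]) -> [[Float]] {
        pending.append(contentsOf: samples)
        var ready: [[Float]] = []
        var offset = 0
        while pending.count - offset >= chunkSize {
            ready.append(Array(pending[offset..<offset + chunkSize]))
            offset += chunkSize
        }
        pending.removeFirst(offset)
        return ready
    }
}
```

Call push from the audio callback and run infer once for each returned chunk, so the model never sees a short or oversized buffer.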

VAD not accurate in noisy environments

The Silero model works best on clean or moderate-noise audio. In high-noise environments (construction, loud music, wind), it may produce frequent false positives.

Mitigations:

  • Raise the threshold to 0.7–0.8.

  • Apply a noise gate or simple spectral subtraction before feeding audio to VAD.

  • Use min_speech_duration_ms (in batch mode) to filter out very short false triggers.

  • If the noise is stationary (e.g. fan, AC), consider a simple energy-based pre-filter that only feeds chunks above a minimum RMS to VAD.
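The energy-based pre-filter from the last item can be sketched as a root-mean-square check. The function names are hypothetical and the 0.01 floor is an assumed placeholder; measure the RMS of your own background noise to choose a real value.

```swift
import Foundation

/// Root-mean-square level of a chunk of samples in [-1.0, 1.0].
func rms(_ samples: [Float]) -> Float {
    guard !samples.isEmpty else { return 0 }
    let sumOfSquares = samples.reduce(Float(0)) { $0 + $1 * $1 }
    return (sumOfSquares / Float(samples.count)).squareRoot()
}

/// Hypothetical gate: true if the chunk is loud enough to be worth sending to VAD.
func passesEnergyGate(_ chunk: [Float], minRMS: Float = 0.01) -> Bool {
    return rms(chunk) >= minRMS
}
```

Skipped chunks leave a gap in what the model sees, so consider passing "reset_state": true on the first chunk fed after a skipped stretch.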

How VAD relates to WhisperPipeline’s internal VAD

WhisperPipeline has its own built-in Silero VAD pre-pass that runs automatically before transcription (use_internal_vad: true by default). This internal VAD skips silence regions so Whisper only processes speech.

If you are already using standalone VAD to gate audio before sending to Whisper, you are running VAD twice — once in your code and once inside WhisperPipeline. This is wasteful and can cause subtle issues (different thresholds, double-filtered boundaries).

Fix: When using standalone VAD as a speech gate, set use_internal_vad: false on WhisperPipeline:

try await ai.start_model(
    model_name: "stt",
    engines_path: "TheStageAI/whisper-v3",
    config: ["use_internal_vad": false]
)

Cleanup

Swift:

_ = try ai.stop_model(model_name: "vad")

Flutter:

await TheStageFlutterSDK.stop_model(model_name: 'vad');