LLM (Language Model)¶

Overview ¶

On-device language model inference with batch and token-by-token streaming. TheStageLLM wraps Qwen3 / Gemma3 / LFM2.5 chat models with KV cache, chat-template rendering, and stop-token policy.

Chat templates are auto-detected from the model bundle, so you normally never pick them manually; the chat_template constructor parameter exists only as an override. The engine resolves EOS tokens, stop sequences, and KV-cache horizon from the same bundle metadata.

Flutter consumers go through the singleton start_model + infer / infer_stream (JSON) path; there is no direct LLM constructor in Dart. The Swift and Flutter surfaces share the same on-disk cache and the same response shape.

Supported Models ¶

Model	HF repo	Parameters	Chat template
Qwen3-0.6B	`TheStageAI/Qwen3-0.6B`	0.6B	Qwen3
Gemma3-1B	`TheStageAI/Gemma3-1B`	1B	Gemma3
LFM2.5-230M	`TheStageAI/LFM2.5-230M`	230M	LFM2
LFM2.5-350M	`TheStageAI/LFM2.5-350M`	350M	LFM2

The engines_path parameter accepts either a HuggingFace repo id or a local directory containing the bundle. The chat template, EOS / stop tokens, and KV-cache horizon all come from that bundle, so there is nothing to configure per model.

API Reference ¶

Full Constructor ¶

let llm = try await TheStageLLM(
    engines_path: "TheStageAI/Qwen3-0.6B",
    device: "npu",
    chat_template: nil,
    // revision omitted → ModelRevisionMap (vA.B for this SDK)
    on_load_progress: nil
)

TheStageAI.shared.initialize(apiToken:) must have completed successfully before you call this constructor.

Parameter	Type	Description
`engines_path`	`String`	HuggingFace repo id or a local directory containing the compiled engine bundle.
`device`	`String`	Compute device (`"npu"` / `"gpu"` / `"cpu"`). Prefer `"npu"` on Apple silicon.
`max_context_size`	`Int`	Deprecated / ignored. Context horizon comes from the bundle metadata.
`chat_template`	`String?`	Override the auto-detected chat template. `nil` uses the bundle default.
`revision`	`String?`	HuggingFace revision. Omit to use this SDK’s `ModelRevisionMap` pin (`vA.B`). Ignored for a local `engines_path`.
`on_load_progress`	`Closure?`	Optional progress callback (see Load Progress).

Inputs / Outputs ¶

Sampling defaults are per-model: each bundle ships a tuned generation_defaults preset. When you omit a sampling field it keeps the bundle value — do not assume a global default like 0.7 / top_k=20.

Swift: start from llm.generation_defaults, override fields, pass config:. Flutter / JSON: put the same keys in input_json (except min_new_tokens, which is Swift-only).

Direction	Key	Type	Description
input	`prompt`	`String`	The user message.
input	`system_prompt`	`String?`	Optional system message; defaults to the bundle’s `default_system_prompt`.
input	`max_new_tokens`	`Int` (default 512)	Hard cap on generated tokens. Hitting it → `stop_reason == "max_new_tokens"`.
input	`min_new_tokens`	`Int` (default 0)	Swift `LLMGenerationConfig` only — not applied via JSON overlay.
input	`temperature`	`Float`	Sampling randomness. `0` = greedy. Higher = more variety.
input	`top_k`	`Int`	Keep only the top-k logits. `0` = disabled.
input	`top_p`	`Float`	Nucleus sampling cumulative-probability cap. `1.0` = disabled.
input	`min_p`	`Float`	Drop tokens below `min_p × p(max)`. `0.0` = disabled.
input	`repetition_penalty`	`Float`	Penalize already-seen tokens. `1.0` = disabled.
input	`enable_thinking`	`Bool`	Toggle the model’s thinking/reasoning prelude (Qwen3 etc.).
input	`seed`	`UInt64?`	Deterministic sampling seed.
output	`LLMResult.text`	`String`	Decoded response.
output	`LLMResult.prompt_tokens` / `generated_tokens`	`Int`	Token counts.
output	`LLMResult.tokens_per_second`	`Double`	Decode speed.
output	`LLMResult.time_to_first_token` / `total_seconds`	`Double`	Latency breakdown.
output	`LLMResult.stop_reason`	`String`	`"eos"` / `"max_new_tokens"` / `"stop_sequence"` / `"unknown"`.

Generation parameters (what to set)¶

Omit sampling fields unless you have a reason — the bundle preset is usually right. Override only the knobs you care about.

Knob	Plain meaning	Typical range	Notes
`max_new_tokens`	Hard cap on reply length	64–1024	Truncation → `stop_reason == "max_new_tokens"`.
`temperature`	How random next-token picks are	0–1.2	`0` = most deterministic. High values risk nonsense.
`top_k`	Keep only the k most likely tokens	0 / 10–50	`0` = off. Lower = safer, more repetitive.
`top_p`	Keep the smallest set whose probs sum to p	0.8–1.0	`1.0` = off. Often paired with moderate temperature.
`min_p`	Drop weak long-tail tokens	0–0.1	`0` = off.
`repetition_penalty`	Discourage already-seen tokens	1.0–1.2	`1.0` = off. Helps if the model loops.
`enable_thinking`	Qwen3 reasoning prelude on/off	true/false	Off for short chat UX; on for harder reasoning.
`seed`	Fix the RNG	any `UInt64`	Same device + same bundle + same inputs → same text.
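
To make the filtering knobs concrete, the following self-contained Swift sketch applies top_k, min_p, and top_p to a toy next-token distribution. It is not the SDK's sampler: the function name surviving_tokens, the toy probabilities, and the order in which the filters run are assumptions chosen for illustration only.

```swift
/// Illustrative only: returns the indices of the tokens that survive the filters.
/// `probs` is a toy next-token distribution that sums to 1.
/// Probabilities are not renormalized between steps, to keep the sketch short.
func surviving_tokens(probs: [Double], top_k: Int, top_p: Double, min_p: Double) -> [Int] {
    // Rank token indices from most to least likely.
    var ranked = probs.indices.sorted { probs[$0] > probs[$1] }

    // top_k: keep only the k most likely tokens (0 = disabled).
    if top_k > 0 {
        ranked = Array(ranked.prefix(top_k))
    }

    // min_p: drop tokens below min_p × p(max) (0.0 = disabled).
    if min_p > 0, let best = ranked.first {
        let threshold = min_p * probs[best]
        ranked = ranked.filter { probs[$0] >= threshold }
    }

    // top_p: keep the smallest prefix whose cumulative probability reaches top_p (1.0 = disabled).
    if top_p < 1.0 {
        var cumulative = 0.0
        var kept: [Int] = []
        for index in ranked {
            kept.append(index)
            cumulative += probs[index]
            if cumulative >= top_p { break }
        }
        ranked = kept
    }
    return ranked
}

let toy_probs = [0.5, 0.3, 0.15, 0.04, 0.01]
print(surviving_tokens(probs: toy_probs, top_k: 0, top_p: 1.0, min_p: 0.0))  // [0, 1, 2, 3, 4]
print(surviving_tokens(probs: toy_probs, top_k: 3, top_p: 1.0, min_p: 0.0))  // [0, 1, 2]
print(surviving_tokens(probs: toy_probs, top_k: 0, top_p: 0.75, min_p: 0.0)) // [0, 1]
print(surviving_tokens(probs: toy_probs, top_k: 0, top_p: 1.0, min_p: 0.1))  // [0, 1, 2]
```

With every knob at its disabled value all five tokens remain candidates; each filter on its own trims the long tail in a slightly different way, which is why the knobs are usually combined rather than pushed to extremes.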

Real-world recipes¶

Copy a block that matches the product job. Values are starting points — tune on device.

1. Short factual Q&A (support bot, FAQ):

var config = llm.generation_defaults
config.max_new_tokens = 128
config.temperature = 0.3
config.top_k = 20
config.top_p = 0.9
config.repetition_penalty = 1.05
config.enable_thinking = false

let result = llm.infer(
    prompt: "What is the capital of France?",
    system_prompt: "Answer in one short sentence. No fluff.",
    config: config
)

final result = await TheStageFlutterSDK.infer(
  model_name: 'llm',
  input_json: {
    'prompt': 'What is the capital of France?',
    'system_prompt': 'Answer in one short sentence. No fluff.',
    'max_new_tokens': 128,
    'temperature': 0.3,
    'top_k': 20,
    'top_p': 0.9,
    'repetition_penalty': 1.05,
    'enable_thinking': false,
  },
);

2. Creative / chatty reply (story, brainstorm, casual chat):

var config = llm.generation_defaults
config.max_new_tokens = 512
config.temperature = 0.9
config.top_k = 40
config.top_p = 0.95
config.enable_thinking = false

3. Longer structured answer (summarize, explain, bullets):

var config = llm.generation_defaults
config.max_new_tokens = 768
config.temperature = 0.5
config.top_p = 0.9
config.repetition_penalty = 1.1
config.enable_thinking = false

Check result.stop_reason. If it is "max_new_tokens", raise the cap.

4. Hard reasoning (Qwen3 thinking):

var config = llm.generation_defaults
config.max_new_tokens = 1024
config.temperature = 0.6
config.enable_thinking = true

Thinking tokens still count toward max_new_tokens — budget headroom.

5. Deterministic tests / golden fixtures:

var config = llm.generation_defaults
config.temperature = 0
config.seed = 42
config.max_new_tokens = 64

The same seed is only guaranteed to reproduce the same text on the same device with the same bundle revision.

Singleton API ¶

Use this when you want lifecycle control (stop_model), JSON dispatch (infer(model_name:input_json:)), or when you are driving the SDK from Flutter. The singleton and direct-constructor flows share the same on-disk cache.

let ai = TheStageAI.shared   // initialize(apiToken:) must already have succeeded

try await ai.start_model(
    model_name: "llm",
    engines_path: "TheStageAI/Qwen3-0.6B"
)

let json = try ai.infer(
    model_name: "llm",
    input_json: [
        "prompt": "What is 2+2?",
        "system_prompt": "You are a helpful assistant.",
        "max_new_tokens": 256,
        "temperature": 0.7,
        "top_k": 20,
        "top_p": 0.8,
        "min_p": 0.0,
        "repetition_penalty": 1.1,
        "enable_thinking": false,
        "seed": 42
    ]
)
let text = json[0]["text"] as? String ?? ""   // avoid a crash if the key is missing

JSON streaming yields typed InferenceStreamChunk values — delta carries each token’s text:

let stream = try ai.infer_stream(
    model_name: "llm",
    input_json: ["prompt": "Tell me a story.", "max_new_tokens": 512]
)

for await chunk in stream {
    if !chunk.is_final, let delta = chunk.delta {
        print(delta, terminator: "")
    }
    if chunk.is_final, let tps = chunk.tokens_per_second {
        print("\n--- \(tps) tok/s ---")
    }
}

Note

The JSON path is single-turn. For multi-turn chat history use the direct TheStageLLM API; chat templates are rendered for you.

The Flutter TheStageFlutterSDK.infer / infer_stream calls hit this exact JSON path, so the response keys below apply unchanged on Dart.

Response Keys ¶

The JSON response dictionary contains the following keys:

Key	Description
`text`	Decoded model response.
`prompt_tokens`	Number of tokens in the prompt.
`generated_tokens`	Number of tokens generated.
`prefill_seconds`	Time spent on prompt prefill.
`decode_seconds`	Time spent on token decoding.
`tokens_per_second`	Decode throughput.
`time_to_first_token`	Latency until the first generated token.
`total_seconds`	Wall-clock time for the full call.
`stop_reason`	`"eos"` / `"max_new_tokens"` / `"stop_sequence"` / `"unknown"`.
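
If you prefer typed access over dictionary lookups, a dictionary with the keys above can be mapped onto a small Decodable struct. This is a sketch, not part of the SDK: LLMResponseSummary, parse_response, and the sample values are hypothetical, and the field types are assumed from the descriptions in the table.

```swift
import Foundation

// Hypothetical helper type; property names mirror a subset of the response keys.
struct LLMResponseSummary: Decodable {
    let text: String
    let prompt_tokens: Int
    let generated_tokens: Int
    let tokens_per_second: Double
    let total_seconds: Double
    let stop_reason: String

    // True when the reply was cut off by the token cap.
    var was_truncated: Bool { stop_reason == "max_new_tokens" }
}

// Round-trips the dictionary through JSON so JSONDecoder can do the type checking.
func parse_response(_ response: [String: Any]) throws -> LLMResponseSummary {
    let data = try JSONSerialization.data(withJSONObject: response)
    return try JSONDecoder().decode(LLMResponseSummary.self, from: data)
}

// Made-up values in the documented shape, for illustration only.
let sample_response: [String: Any] = [
    "text": "4",
    "prompt_tokens": 21,
    "generated_tokens": 2,
    "prefill_seconds": 0.05,
    "decode_seconds": 0.02,
    "tokens_per_second": 100.0,
    "time_to_first_token": 0.06,
    "total_seconds": 0.08,
    "stop_reason": "eos",
]

do {
    let summary = try parse_response(sample_response)
    print(summary.text, summary.was_truncated)
} catch {
    print("Unexpected response shape: \(error)")
}
```

Keys the struct does not declare are ignored by the decoder, so the sketch keeps working if you only care about a few fields.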

Usage Guides ¶

Basic chat completion ¶

Use this when you need a single question-and-answer exchange; it is the simplest way to get a response from the model. In Swift, the direct TheStageLLM constructor is recommended because it gives you typed results and streaming without extra setup. Flutter uses the JSON singleton path instead.

Swift — direct constructor (recommended):

import TheStageSDK

let ai = TheStageAI.shared
try await ai.initialize(apiToken: "th_…")

let llm = try await TheStageLLM(
    engines_path: "TheStageAI/Qwen3-0.6B"   // HF repo id, or a local dir
)

let result = llm.infer(
    prompt: "What is 2+2?",
    system_prompt: "You are a helpful assistant.",
    max_new_tokens: 64
)
print(result.text)

Flutter — JSON path:

import 'package:thestage_apple_sdk/thestage_apple_sdk.dart';

await TheStageFlutterSDK.initialize(api_token: 'th_…');

await TheStageFlutterSDK.start_model(
  model_name: 'llm',
  engines_path: 'TheStageAI/Qwen3-0.6B',
);

final result = await TheStageFlutterSDK.infer(
  model_name: 'llm',
  input_json: {
    'prompt': 'What is 2+2?',
    'system_prompt': 'You are a helpful assistant.',
    'max_new_tokens': 64,
  },
);
print(result[0]['text']);

Attention

Always call initialize(apiToken:) before constructing any pipeline. Forgetting this is the most common source of “model loading fails” errors.

Streaming responses to a chat UI ¶

In a chat interface, users expect to see text appear word-by-word rather than waiting several seconds for a complete response. Streaming delivers each token as it is generated, keeping the UI responsive and giving users immediate feedback.

Each chunk before the final sentinel carries one delta of text. The final chunk has is_final == true and includes per-call performance metrics like tokens_per_second.

Swift:

for await chunk in llm.infer_stream(
    prompt: "Tell me a story.",
    max_new_tokens: 512
) {
    if chunk.is_final {
        let tps = chunk.tokens_per_second ?? 0
        print("\n--- \(tps) tok/s ---")
    } else {
        print(chunk.text, terminator: "")
    }
}

Flutter:

import 'dart:io'; // provides stdout

final stream = TheStageFlutterSDK.infer_stream(
  model_name: 'llm',
  input_json: {
    'prompt': 'Tell me a story.',
    'max_new_tokens': 512,
  },
);

await for (final chunk in stream) {
  if (chunk['is_final'] == true) {
    // Cast via num so a whole-number value does not fail the double cast.
    final tps = (chunk['tokens_per_second'] as num?)?.toDouble() ?? 0.0;
    print('\n--- $tps tok/s ---');
  } else {
    final delta = chunk['delta'] as String?;
    if (delta != null) stdout.write(delta);
  }
}

Note

Check is_final to capture tokens_per_second, total_seconds, and stop_reason. These metrics are only available on the terminal chunk and are useful for performance monitoring and detecting truncated responses.
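
In a real chat UI you usually accumulate deltas into a message buffer rather than printing them. The sketch below shows one way to do that with plain value types. DemoChunk and TranscriptAccumulator are hypothetical stand-ins that only mirror the delta / is_final / tokens_per_second fields described above; they are not SDK types.

```swift
// Hypothetical stand-in for a streamed chunk; mirrors the fields described above.
struct DemoChunk {
    let delta: String?
    let is_final: Bool
    let tokens_per_second: Double?
}

// Collects deltas into one string and captures metrics from the terminal chunk.
struct TranscriptAccumulator {
    private(set) var text = ""
    private(set) var tokens_per_second: Double? = nil
    private(set) var finished = false

    mutating func consume(_ chunk: DemoChunk) {
        if chunk.is_final {
            // Metrics are only present on the terminal chunk.
            finished = true
            tokens_per_second = chunk.tokens_per_second
        } else if let delta = chunk.delta {
            text += delta
        }
    }
}

// Simulated stream; in an app these chunks would arrive inside `for await`.
let simulated_chunks = [
    DemoChunk(delta: "Once ", is_final: false, tokens_per_second: nil),
    DemoChunk(delta: "upon a time", is_final: false, tokens_per_second: nil),
    DemoChunk(delta: nil, is_final: true, tokens_per_second: 42.0),
]

var transcript = TranscriptAccumulator()
for chunk in simulated_chunks {
    transcript.consume(chunk)
}
print(transcript.text)     // Once upon a time
print(transcript.finished) // true
```

Keeping the accumulation logic in a small value type makes it easy to unit-test without loading a model, and the same consume call can be driven from either streaming API.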

Setting a system prompt for persona control ¶

The system_prompt parameter shapes how the model responds. Use it to constrain the topic, set a tone, or give the model a specific role. Without one, the bundle’s default system prompt is used.

Swift:

let result = llm.infer(
    prompt: "How do I make risotto?",
    system_prompt: "You are a cooking assistant. Only answer questions about recipes.",
    max_new_tokens: 256
)

Flutter:

final result = await TheStageFlutterSDK.infer(
  model_name: 'llm',
  input_json: {
    'prompt': 'How do I make risotto?',
    'system_prompt': 'You are a cooking assistant. Only answer questions about recipes.',
    'max_new_tokens': 256,
  },
);

With this prompt the model should decline off-topic questions like “What’s the weather?” and stay focused on recipes, although small models do not follow such instructions perfectly. System prompts work with both one-shot infer and streaming infer_stream.

Attention

System prompts consume context tokens. Very long system prompts leave fewer tokens for the conversation. Keep them concise — a few sentences is usually enough.

Choosing the right model for your device ¶

Each model trades off speed and memory for output quality. Pick the one that matches your target hardware:

Model	Size	Speed	Best for
LFM2.5-230M	230M	Fastest	Tight latency budgets, low footprint
LFM2.5-350M	350M	Fast	Strong speed/quality on Apple silicon
Qwen3-0.6B	0.6B	Fast	iPhone, low-RAM iPads, quick replies
Gemma3-1B	1B	Moderate	Balanced quality/speed on recent devices

LFM2.5-230M is the lightest and fastest option for short replies.

Qwen3-0.6B fits comfortably on devices with 4 GB+ RAM and is a solid default when you want a small general chat model.

Gemma3-1B is a good middle ground — better reasoning with moderate memory overhead.

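// Load only one of these in a real app; the names differ so both lines compile in one scope.

// Fast / small footprint
let small_llm = try await TheStageLLM(engines_path: "TheStageAI/LFM2.5-230M")

// General chat default
let chat_llm = try await TheStageLLM(engines_path: "TheStageAI/Qwen3-0.6B")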

Note

If you’re unsure, start with Qwen3-0.6B. You can swap models later by changing only the engines_path — the rest of your code stays the same.

Deterministic output with seed ¶

For testing and reproducibility, set the seed parameter so the model produces the same output every time for the same input. Without a seed, the sampler uses a random seed on each call.

Swift:

let result = llm.infer(
    prompt: "Write a haiku about the ocean.",
    seed: 42,
    temperature: 0.7
)
// Run it again with the same seed — same output
let result2 = llm.infer(
    prompt: "Write a haiku about the ocean.",
    seed: 42,
    temperature: 0.7
)
assert(result.text == result2.text)

Flutter:

final result = await TheStageFlutterSDK.infer(
  model_name: 'llm',
  input_json: {
    'prompt': 'Write a haiku about the ocean.',
    'seed': 42,
    'temperature': 0.7,
  },
);

Attention

Determinism is only guaranteed on the same device with the same model bundle. Different hardware or model versions may decode differently even with identical seeds.
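
The reason a seed gives repeatable text is that the sampler's random draws come from a generator whose whole sequence is fixed by its starting state. The sketch below illustrates that idea with a small SplitMix64 generator and a toy distribution. It makes no claim about which generator the SDK uses; SplitMix64 and sample_tokens are illustrative names.

```swift
// A small deterministic generator (SplitMix64). Illustrative; not the SDK's RNG.
struct SplitMix64: RandomNumberGenerator {
    var state: UInt64

    mutating func next() -> UInt64 {
        state &+= 0x9E3779B97F4A7C15
        var z = state
        z = (z ^ (z >> 30)) &* 0xBF58476D1CE4E5B9
        z = (z ^ (z >> 27)) &* 0x94D049BB133111EB
        return z ^ (z >> 31)
    }
}

/// Draws `count` token indices from a toy distribution using a seeded generator.
func sample_tokens(probs: [Double], count: Int, seed: UInt64) -> [Int] {
    var rng = SplitMix64(state: seed)
    return (0..<count).map { _ -> Int in
        let draw = Double.random(in: 0..<1, using: &rng)
        // Walk the cumulative distribution until it passes the draw.
        var cumulative = 0.0
        for (index, p) in probs.enumerated() {
            cumulative += p
            if draw < cumulative { return index }
        }
        return probs.count - 1 // guard against rounding at the upper edge
    }
}

let toy_distribution = [0.5, 0.3, 0.2]
let first_run = sample_tokens(probs: toy_distribution, count: 8, seed: 42)
let second_run = sample_tokens(probs: toy_distribution, count: 8, seed: 42)
print(first_run == second_run) // true
```

The equality holds because both runs replay the same sequence of draws. On a real device the logits themselves can differ across hardware or bundle revisions, which is why the guarantee above is scoped to one device and one bundle.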

Troubleshooting ¶

Model loading fails or hangs ¶

The most common cause is a missing initialize(apiToken:) call. The SDK must authenticate before it can download or load any model bundle.

Verify that TheStageAI.shared.initialize(apiToken:) completed successfully before constructing TheStageLLM.
On first launch the model must be downloaded. Check that the device has a working network connection.
Attach an on_load_progress callback to see which phase is stuck:

let llm = try await TheStageLLM(
    engines_path: "TheStageAI/Qwen3-0.6B",
    on_load_progress: { p in
        print("[\(p.model)] \(p.phase) \(Int(p.fraction * 100))%")
    }
)

If progress stalls at .downloading, the network is the bottleneck. If it stalls at .loading, the device may be out of memory.

Response is cut off mid-sentence ¶

When the output ends abruptly, the model likely hit the max_new_tokens limit before finishing its thought.

Increase max_new_tokens (default is 512). Try 1024 or higher for longer answers.
Check stop_reason in the result — if it says "max_new_tokens" instead of "eos", the model was still generating when the cap was reached.

let result = llm.infer(
    prompt: "Explain quantum computing in detail.",
    max_new_tokens: 1024
)
print(result.stop_reason) // "eos" means it finished naturally
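
If you want to react to truncation automatically, one option is to retry with a larger cap. The helper below is a sketch of such a policy; the name retry_budget, the doubling step, and the 2048 ceiling are assumptions to tune for your app, not SDK behavior.

```swift
/// Hypothetical helper: suggests a larger max_new_tokens after a truncated reply.
/// Returns nil when the reply ended for another reason or the cap is already at the ceiling.
func retry_budget(stop_reason: String, max_new_tokens: Int, ceiling: Int = 2048) -> Int? {
    guard stop_reason == "max_new_tokens", max_new_tokens < ceiling else { return nil }
    return min(max_new_tokens * 2, ceiling)
}

print(retry_budget(stop_reason: "max_new_tokens", max_new_tokens: 512) as Any) // Optional(1024)
print(retry_budget(stop_reason: "eos", max_new_tokens: 512) as Any)            // nil
```

A ceiling keeps a runaway generation from retrying forever; when the helper returns nil after a truncated reply, it is usually better to shorten the prompt or ask for a more concise answer.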

Slow first response (cold start)¶

The first inference after loading a model is noticeably slower than subsequent ones. This is expected — the engine performs JIT compilation and warms internal caches on the first run.

To minimize cold-start impact:

Use prefetch_engines to download the model bundle ahead of time so loading doesn’t add network latency on top of warm-up.
Issue a short throwaway prompt (e.g. "hi") during app startup to warm the engine before the user’s first real query.
After that warm-up, subsequent inferences skip the one-time cost and respond faster.

Out of memory on older devices ¶

Larger models need more RAM. If the app crashes or the system kills it during model loading, the model is too large for the device.

Switch to a smaller bundle (LFM2.5-230M or Qwen3-0.6B).
Stop other pipelines (transcription, TTS) before loading the LLM when RAM is tight. Lowering max_context_size will not save memory: the parameter is deprecated and ignored, and the context horizon comes from the bundle.

let llm = try await TheStageLLM(
    engines_path: "TheStageAI/LFM2.5-230M",
    device: "npu"
)

Wrong or garbled output ¶

If the model produces nonsensical, repetitive, or off-topic text:

Temperature too high: Values well above 1.0 flatten the token distribution and tend to produce nonsense. Prefer a factual recipe (temperature: 0.3), or start from llm.generation_defaults and only lower temperature; do not assume a global 0.7 default.
Model mismatch: Ensure the engines_path points to a valid TheStageAI model bundle. Custom or corrupted bundles may use the wrong chat template, producing garbled output.
Prompt formatting: The SDK applies chat templates automatically. Do not add <|im_start|> or other template tokens manually — they will be double-applied and confuse the model.

var config = llm.generation_defaults
config.temperature = 0.3
config.top_k = 10
config.enable_thinking = false
let result = llm.infer(prompt: "Summarize this article.", config: config)
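
To see why a high temperature hurts, it helps to look at what temperature does to the token probabilities. The sketch below computes a temperature-scaled softmax over three toy logits. The function name and the logits are illustrative, and the SDK's internal sampler may differ in detail.

```swift
import Foundation

/// Temperature-scaled softmax over toy logits. Temperature 0 is treated as greedy.
func softmax(logits: [Double], temperature: Double) -> [Double] {
    guard let max_logit = logits.max(), let best = logits.firstIndex(of: max_logit) else {
        return []
    }
    // Greedy decoding: all probability mass goes to the most likely token.
    if temperature <= 0 {
        return logits.indices.map { $0 == best ? 1.0 : 0.0 }
    }
    // Subtract the max before exponentiating for numerical stability.
    let scaled = logits.map { exp(($0 - max_logit) / temperature) }
    let total = scaled.reduce(0, +)
    return scaled.map { $0 / total }
}

let toy_logits = [2.0, 1.0, 0.0]
for temperature in [0.0, 0.3, 1.0, 1.5] {
    let top = softmax(logits: toy_logits, temperature: temperature)[0]
    let shown = String(format: "%.2f", top)
    print("temperature \(temperature): top-token probability \(shown)")
}
```

For these logits the most likely token holds roughly 0.96 of the probability mass at temperature 0.3, about 0.67 at 1.0, and about 0.56 at 1.5, so weaker candidates are picked far more often as temperature rises.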

Load Progress ¶

on_load_progress is optional. When set, the handler fires through four phases (.downloading, .extracting, .loading, .ready) with a monotonic fraction in 0...1:

Swift:

let llm = try await TheStageLLM(
    engines_path: "TheStageAI/Qwen3-0.6B",
    on_load_progress: { p in
        print("[\(p.model)] \(p.phase) \(Int(p.fraction * 100))%")
    }
)

Cache hits skip .downloading / .extracting and emit only .loading followed by .ready. Failed loads do not emit .ready.

Flutter:

TheStageFlutterSDK.on_progress.listen((event) {
  if (event['model_name'] != 'llm') return;
  final phase    = event['phase']    as String?;
  final fraction = (event['progress'] as num?)?.toDouble();
  print('[llm] $phase ${((fraction ?? 0) * 100).round()}%');
});

await TheStageFlutterSDK.start_model(
  model_name: 'llm',
  engines_path: 'TheStageAI/Qwen3-0.6B',
);
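
For a loading screen you typically turn each progress event into a short label. The sketch below shows one way to do that. DemoLoadPhase is a hypothetical local enum that copies the four phase names mentioned above, since the SDK's own phase type is not shown here, and progress_label is an illustrative helper.

```swift
import Foundation

// Hypothetical local copy of the four phase names; not the SDK's own type.
enum DemoLoadPhase: String {
    case downloading, extracting, loading, ready
}

/// Builds a short UI label such as "Downloading 25%".
func progress_label(phase: DemoLoadPhase, fraction: Double) -> String {
    // Clamp defensively so a stray value never renders as "120%".
    let clamped = min(max(fraction, 0), 1)
    let percent = Int((clamped * 100).rounded())
    switch phase {
    case .ready:
        return "Ready"
    default:
        return "\(phase.rawValue.capitalized) \(percent)%"
    }
}

print(progress_label(phase: .downloading, fraction: 0.25)) // Downloading 25%
print(progress_label(phase: .loading, fraction: 1.7))      // Loading 100%
print(progress_label(phase: .ready, fraction: 1.0))        // Ready
```

Because cache hits skip the download and extraction phases, the label logic should not assume that every phase is seen on every launch.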

Prefetch Engines ¶

let engines_dir = try await ai.prefetch_engines(
    repo_id: "TheStageAI/Qwen3-0.6B"
)

// Later — instant load, no network:
let llm = try await TheStageLLM(engines_path: engines_dir)

Cleanup ¶

TheStageLLM is a normal Swift object; drop the reference to release it. If you loaded the model through the singleton API, stop it explicitly:

Swift:

_ = try ai.stop_model(model_name: "llm")

Flutter:

await TheStageFlutterSDK.stop_model(model_name: 'llm');