LLM (Language Model)
Overview
On-device language model inference with batch and token-by-token
streaming. TheStageLLM wraps Qwen2 / Qwen3 / Gemma3 chat models with
KV cache, chat-template rendering, and stop-token policy.
Chat templates are auto-detected from the model bundle, so you normally never pick them manually (the chat_template constructor parameter exists only as an override). The engine resolves EOS tokens, stop sequences, and KV-cache horizon from the same bundle metadata.
Flutter consumers go through the singleton start_model +
infer / infer_stream (JSON) path; there is no direct LLM
constructor in Dart. Both surfaces share the same on-disk cache and
the same response shape.
Supported Models
| Model | HF repo | Parameters | Chat template |
|---|---|---|---|
| Qwen2.5-1.5B | TheStageAI/Qwen2.5-1.5B | 1.5B | Qwen2 |
| Qwen3-0.6B | TheStageAI/Qwen3-0.6B | 0.6B | Qwen3 |
| Gemma3-1B | | 1B | Gemma3 |
The bundle’s engines_path accepts either a HuggingFace repo id or a
local directory. The chat template, EOS / stop tokens and KV-cache
horizon all come from the bundle — you don’t pick them.
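To make "chat-template rendering" concrete, here is a minimal sketch of the ChatML-style layout that Qwen chat models are commonly trained with. It is not the SDK's implementation: `ChatMessage` and `renderChatML` are hypothetical names, and Gemma models use different turn markers, which is one reason the template is read from the bundle rather than hard-coded.

```swift
// Hypothetical stand-in type; not part of the SDK.
struct ChatMessage {
    let role: String      // "system", "user", or "assistant"
    let content: String
}

/// Renders messages in ChatML style and appends the assistant header
/// so that the model continues with its own reply.
func renderChatML(_ messages: [ChatMessage]) -> String {
    var out = ""
    for m in messages {
        out += "<|im_start|>\(m.role)\n\(m.content)<|im_end|>\n"
    }
    out += "<|im_start|>assistant\n"
    return out
}

let rendered = renderChatML([
    ChatMessage(role: "system", content: "You are a helpful assistant."),
    ChatMessage(role: "user", content: "What is 2+2?"),
])
print(rendered)
```

Because the SDK performs this step for you, the prompt you pass should be plain text without any of these markers.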
API Reference
Full Constructor
let llm = try await TheStageLLM(
    engines_path: "TheStageAI/Qwen3-0.6B",
    device: "gpu",
    max_context_size: 2048,
    chat_template: nil,
    revision: "main",
    on_load_progress: nil
)
TheStageAI.shared.initialize(apiToken:) must have succeeded before
you call this constructor.
| Parameter | Type | Description |
|---|---|---|
| engines_path | | HuggingFace repo id or a local directory containing the compiled engine bundle. |
| device | | Compute device, e.g. "gpu". Optional; has a default. |
| max_context_size | | Maximum KV-cache context window. Defaults to 2048. |
| chat_template | | Override the auto-detected chat template. |
| revision | | HuggingFace revision / branch, e.g. "main". Optional; has a default. |
| on_load_progress | | Optional progress callback (see Load Progress). |
Inputs / Outputs
| Direction | Key | Type | Description |
|---|---|---|---|
| input | prompt | | The user message. |
| input | system_prompt | | Optional system message; defaults to the bundle’s default system prompt. |
| input | max_new_tokens | | Maximum tokens to generate. |
| input | temperature | | Sampling temperature. |
| input | top_k | | Top-k sampling. |
| input | seed | | Deterministic sampling seed. |
| output | text | | Decoded response. |
| output | | | Token counts. |
| output | tokens_per_second | | Decode speed. |
| output | | | Latency breakdown. |
| output | | | |
Singleton API
Use this when you want lifecycle (stop_model), JSON dispatch
(infer(model_name:input_json:)), or are driving the SDK from
Flutter. Both flows share the same on-disk cache.
// Assumes initialize(apiToken:) has already succeeded on the shared instance.
let ai = TheStageAI.shared
try await ai.start_model(
    model_name: "llm",
    engines_path: "TheStageAI/Qwen3-0.6B"
)
let json = try ai.infer(
    model_name: "llm",
    input_json: [
        "prompt": "What is 2+2?",
        "system_prompt": "You are a helpful assistant.",
        "max_new_tokens": 256,
        "temperature": 0.7,
        "top_k": 20,
        "seed": 42
    ]
)
let text = json[0]["text"] as! String
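The force-cast above crashes if the response array is empty or the "text" value is missing. A defensive variant might look like the sketch below. The `[[String: Any]]` shape is an assumption inferred from the `json[0]["text"]` access, and `firstText` is a hypothetical helper, not an SDK function.

```swift
// Stand-in for the value returned by infer(model_name:input_json:).
// The [[String: Any]] shape is an assumption, not a documented type.
let json: [[String: Any]] = [["text": "4", "tokens_per_second": 41.5]]

/// Returns the decoded text of the first response entry, or nil when the
/// array is empty or the "text" value is missing or not a String.
func firstText(in response: [[String: Any]]) -> String? {
    return response.first?["text"] as? String
}

if let text = firstText(in: json) {
    print(text)
} else {
    print("response did not contain text")
}
```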
JSON streaming yields typed InferenceStreamChunk values — delta
carries each token’s text:
let stream = try ai.infer_stream(
    model_name: "llm",
    input_json: ["prompt": "Tell me a story.", "max_new_tokens": 512]
)
for await chunk in stream {
    if !chunk.is_final, let delta = chunk.delta {
        print(delta, terminator: "")
    }
    if chunk.is_final, let tps = chunk.tokens_per_second {
        print("\n--- \(tps) tok/s ---")
    }
}
Note
The JSON path is single-turn. For multi-turn chat history use the
direct TheStageLLM API; chat templates are rendered for you.
The Flutter TheStageFlutterSDK.infer / infer_stream calls hit
this exact JSON path, so the response keys below apply unchanged on Dart.
Response Keys
The JSON response dictionary contains the following keys:
| Key | Description |
|---|---|
| text | Decoded model response. |
| | Number of tokens in the prompt. |
| | Number of tokens generated. |
| | Time spent on prompt prefill. |
| | Time spent on token decoding. |
| tokens_per_second | Decode throughput. |
| | Latency until the first generated token. |
| total_seconds | Wall-clock time for the full call. |
| | |
Usage Guides
Basic chat completion
Use this when you need a single question-and-answer exchange; it is the simplest way to get a
response from the model. In Swift, the direct TheStageLLM constructor is
recommended because it gives you typed results and streaming. Flutter
uses the JSON singleton path instead.
Swift — direct constructor (recommended):
import TheStageSDK
let ai = TheStageAI.shared
try await ai.initialize(apiToken: "th_…")
let llm = try await TheStageLLM(
    engines_path: "TheStageAI/Qwen3-0.6B"  // HF repo id, or a local dir
)
let result = llm.infer(
    prompt: "What is 2+2?",
    system_prompt: "You are a helpful assistant.",
    max_new_tokens: 64
)
print(result.text)
Flutter — JSON path:
import 'package:thestage_apple_sdk/thestage_apple_sdk.dart';
await TheStageFlutterSDK.initialize(api_token: 'th_…');
await TheStageFlutterSDK.start_model(
  model_name: 'llm',
  engines_path: 'TheStageAI/Qwen3-0.6B',
);
final result = await TheStageFlutterSDK.infer(
  model_name: 'llm',
  input_json: {
    'prompt': 'What is 2+2?',
    'system_prompt': 'You are a helpful assistant.',
    'max_new_tokens': 64,
  },
);
print(result[0]['text']);
Attention
Always call initialize(apiToken:) before constructing any pipeline.
Forgetting this is the most common source of “model loading fails” errors.
Streaming responses to a chat UI
In a chat interface, users expect to see text appear word-by-word rather than waiting several seconds for a complete response. Streaming delivers each token as it is generated, keeping the UI responsive and giving users immediate feedback.
Each chunk before the final sentinel carries one delta of text. The final chunk
has is_final == true and includes per-call performance metrics like
tokens_per_second.
Swift:
for await chunk in llm.infer_stream(
    prompt: "Tell me a story.",
    max_new_tokens: 512
) {
    if chunk.is_final {
        let tps = chunk.tokens_per_second ?? 0
        print("\n--- \(tps) tok/s ---")
    } else {
        print(chunk.text, terminator: "")
    }
}
Flutter:
import 'dart:io'; // provides stdout, used below
final stream = TheStageFlutterSDK.infer_stream(
  model_name: 'llm',
  input_json: {
    'prompt': 'Tell me a story.',
    'max_new_tokens': 512,
  },
);
await for (final chunk in stream) {
  if (chunk['is_final'] == true) {
    final tps = chunk['tokens_per_second'] as double? ?? 0;
    print('\n--- $tps tok/s ---');
  } else {
    final delta = chunk['delta'] as String?;
    if (delta != null) stdout.write(delta);
  }
}
Note
Check is_final to capture tokens_per_second, total_seconds, and
stop_reason. These metrics are only available on the terminal chunk and
are useful for performance monitoring and detecting truncated responses.
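A chat UI usually needs both the accumulated text and the terminal metrics. The sketch below shows one way to collect them. `DemoChunk` is a hypothetical stand-in that mirrors only the field names mentioned in this section, and the `AsyncStream` is a fake source used in place of infer_stream so the example runs without the SDK.

```swift
// Hypothetical stand-in mirroring the chunk fields named in this section.
struct DemoChunk {
    var delta: String? = nil
    var is_final: Bool = false
    var tokens_per_second: Double? = nil
    var stop_reason: String? = nil
}

/// Concatenates deltas into one string and keeps the terminal chunk,
/// which is the only one that carries metrics.
func collect(_ stream: AsyncStream<DemoChunk>) async -> (text: String, summary: DemoChunk?) {
    var text = ""
    var summary: DemoChunk? = nil
    for await chunk in stream {
        if chunk.is_final {
            summary = chunk
        } else if let delta = chunk.delta {
            text += delta
        }
    }
    return (text, summary)
}

// Fake stream standing in for infer_stream.
let demo = AsyncStream<DemoChunk> { continuation in
    for piece in ["Once", " upon", " a", " time"] {
        continuation.yield(DemoChunk(delta: piece))
    }
    continuation.yield(DemoChunk(is_final: true, tokens_per_second: 40, stop_reason: "eos"))
    continuation.finish()
}

let (story, summary) = await collect(demo)
print(story)
print(summary?.stop_reason ?? "no final chunk received")
```

In a real UI you would append each delta to the visible message as it arrives instead of waiting for the loop to finish.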
Setting a system prompt for persona control
The system_prompt parameter shapes how the model responds. Use it to
constrain the topic, set a tone, or give the model a specific role. Without one,
the bundle’s default system prompt is used.
Swift:
let result = llm.infer(
    prompt: "How do I make risotto?",
    system_prompt: "You are a cooking assistant. Only answer questions about recipes.",
    max_new_tokens: 256
)
Flutter:
final result = await TheStageFlutterSDK.infer(
  model_name: 'llm',
  input_json: {
    'prompt': 'How do I make risotto?',
    'system_prompt': 'You are a cooking assistant. Only answer questions about recipes.',
    'max_new_tokens': 256,
  },
);
With this prompt the model should decline off-topic questions like “What’s the weather?” and stay
focused on recipes, although small models do not follow such instructions perfectly. System prompts work with both one-shot infer and
streaming infer_stream.
Attention
System prompts consume context tokens. Very long system prompts leave fewer tokens for the conversation. Keep them concise — a few sentences is usually enough.
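To see how a system prompt eats into the context window, a rough budget check can help. The sketch below uses a characters-divided-by-four heuristic, which is only an assumption for English text; a real tokenizer gives exact counts, and the chat template adds a few extra tokens that are ignored here. Both function names are hypothetical.

```swift
/// Very rough token estimate: about four characters per token.
/// This heuristic is an assumption, not the SDK's tokenizer.
func estimateTokens(_ text: String) -> Int {
    return (text.count + 3) / 4
}

/// Tokens left for generation once the system prompt and the user prompt
/// are placed in a context window of `maxContextSize` tokens.
func remainingBudget(maxContextSize: Int, systemPrompt: String, prompt: String) -> Int {
    let used = estimateTokens(systemPrompt) + estimateTokens(prompt)
    return max(0, maxContextSize - used)
}

let budget = remainingBudget(
    maxContextSize: 2048,
    systemPrompt: "You are a cooking assistant. Only answer questions about recipes.",
    prompt: "How do I make risotto?"
)
// Clamp the requested generation length to what still fits.
let maxNewTokens = min(256, budget)
print(budget, maxNewTokens)
```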
Choosing the right model for your device
Each model trades off speed and memory for output quality. Pick the one that matches your target hardware:
| Model | Size | Speed | Best for |
|---|---|---|---|
| Qwen3-0.6B | 0.6B (smallest) | Fastest | iPhone, low-RAM iPads, quick replies |
| Gemma3-1B | 1B (mid) | Moderate | Balanced quality/speed on recent iPhones |
| Qwen2.5-1.5B | 1.5B (largest) | Slowest | Highest quality, iPad Pro / M-series Macs |
Qwen3-0.6B fits comfortably on any device with 4 GB+ RAM and produces responses quickly. Start here if latency matters more than nuance.
Gemma3-1B is a good middle ground — noticeably better reasoning than 0.6B with moderate memory overhead.
Qwen2.5-1.5B produces the most coherent and detailed answers but requires more RAM and takes longer per token. Use it on devices with 6 GB+ RAM (iPhone 15 Pro and newer, any iPad Pro, any Mac).
// On older iPhones: fast and lightweight
let llm = try await TheStageLLM(engines_path: "TheStageAI/Qwen3-0.6B")
// On iPad Pro, swap in the larger bundle instead for best quality:
// let llm = try await TheStageLLM(engines_path: "TheStageAI/Qwen2.5-1.5B")
Note
If you’re unsure, start with Qwen3-0.6B. You can swap models later by
changing only the engines_path — the rest of your code stays the same.
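If you want the app to choose at runtime, one option is to branch on installed RAM. The sketch below applies the 6 GB guideline from this section using Foundation's `ProcessInfo`. The helper name is hypothetical, and the cut-off is set slightly under 6 GiB on the assumption that the reported figure may be a little below the nominal one; tune it on your own devices.

```swift
import Foundation

/// Picks a repo id from installed RAM. The function name and the exact
/// cut-off are illustrative assumptions, not SDK behavior.
func suggestedEnginesPath(physicalMemoryBytes: UInt64) -> String {
    let gib = Double(physicalMemoryBytes) / 1_073_741_824.0
    return gib >= 5.5 ? "TheStageAI/Qwen2.5-1.5B" : "TheStageAI/Qwen3-0.6B"
}

let path = suggestedEnginesPath(
    physicalMemoryBytes: ProcessInfo.processInfo.physicalMemory
)
print(path)
```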
Deterministic output with seed
For testing and reproducibility, set the seed parameter so the model
produces the same output every time for the same input. Without a seed, the
sampler uses a random seed on each call.
Swift:
// Arguments follow the order used elsewhere on this page: temperature, then seed.
let result = llm.infer(
    prompt: "Write a haiku about the ocean.",
    temperature: 0.7,
    seed: 42
)
// Run it again with the same seed: same output
let result2 = llm.infer(
    prompt: "Write a haiku about the ocean.",
    temperature: 0.7,
    seed: 42
)
assert(result.text == result2.text)
Flutter:
final result = await TheStageFlutterSDK.infer(
  model_name: 'llm',
  input_json: {
    'prompt': 'Write a haiku about the ocean.',
    'seed': 42,
    'temperature': 0.7,
  },
);
Attention
Determinism is only guaranteed on the same device with the same model bundle. Different hardware or model versions may decode differently even with identical seeds.
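The interplay of temperature, top_k, and seed can be illustrated with a small stand-alone sampler. This is a generic sketch of temperature-scaled top-k sampling with a seeded generator, not the SDK's sampler; `SeededGenerator` and `sampleToken` are hypothetical names, and the logits are made-up numbers.

```swift
import Foundation

/// Small deterministic generator (SplitMix64): the same seed always
/// produces the same sequence of draws.
struct SeededGenerator: RandomNumberGenerator {
    private var state: UInt64
    init(seed: UInt64) { state = seed }
    mutating func next() -> UInt64 {
        state &+= 0x9E3779B97F4A7C15
        var z = state
        z = (z ^ (z >> 30)) &* 0xBF58476D1CE4E5B9
        z = (z ^ (z >> 27)) &* 0x94D049BB133111EB
        return z ^ (z >> 31)
    }
}

/// Samples a token index from `logits` using temperature and top-k.
func sampleToken<G: RandomNumberGenerator>(
    logits: [Double], temperature: Double, topK: Int, using rng: inout G
) -> Int {
    precondition(!logits.isEmpty && temperature > 0)
    // Keep only the k highest-scoring candidates.
    let candidates = logits.enumerated()
        .sorted { $0.element > $1.element }
        .prefix(max(1, topK))
    // Softmax weights over the kept logits, scaled by temperature.
    let maxLogit = candidates.first!.element
    let weights = candidates.map { exp(($0.element - maxLogit) / temperature) }
    let total = weights.reduce(0, +)
    // Draw a point in [0, total) and walk the cumulative weights.
    var threshold = Double.random(in: 0..<total, using: &rng)
    for (candidate, weight) in zip(candidates, weights) {
        threshold -= weight
        if threshold < 0 { return candidate.offset }
    }
    return candidates.last!.offset
}

let logits = [2.0, 1.0, 0.5, -1.0]
var a = SeededGenerator(seed: 42)
var b = SeededGenerator(seed: 42)
let first = (0..<5).map { _ in sampleToken(logits: logits, temperature: 0.7, topK: 2, using: &a) }
let second = (0..<5).map { _ in sampleToken(logits: logits, temperature: 0.7, topK: 2, using: &b) }
print(first == second)
```

Lower temperatures concentrate the weights on the highest logit, and a smaller top_k removes unlikely tokens entirely, which is why both are reduced for factual tasks.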
Troubleshooting
Model loading fails or hangs
The most common cause is a missing initialize(apiToken:) call. The SDK
must authenticate before it can download or load any model bundle.
- Verify that TheStageAI.shared.initialize(apiToken:) completed successfully before constructing TheStageLLM.
- On first launch the model must be downloaded. Check that the device has a working network connection.
- Attach an on_load_progress callback to see which phase is stuck:
let llm = try await TheStageLLM(
    engines_path: "TheStageAI/Qwen3-0.6B",
    on_load_progress: { p in
        print("[\(p.model)] \(p.phase) \(Int(p.fraction * 100))%")
    }
)
If progress stalls at .downloading, the network is the bottleneck.
If it stalls at .loading, the device may be out of memory.
Response is cut off mid-sentence
When the output ends abruptly, the model likely hit the max_new_tokens
limit before finishing its thought.
- Increase max_new_tokens (default is 512). Try 1024 or higher for longer answers.
- Check stop_reason in the result: if it says "max_new_tokens" instead of "eos", the model was still generating when the cap was reached.
let result = llm.infer(
    prompt: "Explain quantum computing in detail.",
    max_new_tokens: 1024
)
print(result.stop_reason)  // "eos" means it finished naturally
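The stop_reason check can also drive an automatic retry. The sketch below doubles the cap until the model stops on its own or a limit is reached. `DemoResult` and `generateUntilComplete` are hypothetical, and the closure is a fake generator standing in for a call to infer; each retry regenerates the answer from scratch, so this trades latency for completeness.

```swift
// Hypothetical stand-in for the result fields used here.
struct DemoResult {
    let text: String
    let stop_reason: String
}

/// Calls `generate` with a growing token cap until the model stops on its
/// own ("eos") or the cap reaches `limit`.
func generateUntilComplete(
    startingAt initial: Int,
    limit: Int,
    generate: (Int) -> DemoResult
) -> DemoResult {
    var cap = max(1, initial)
    var result = generate(cap)
    while result.stop_reason == "max_new_tokens" && cap < limit {
        cap = min(cap * 2, limit)
        result = generate(cap)
    }
    return result
}

// Fake generator: pretends the full answer needs 700 tokens.
var triedCaps: [Int] = []
let answer = generateUntilComplete(startingAt: 256, limit: 2048) { cap in
    triedCaps.append(cap)
    return DemoResult(text: "demo", stop_reason: cap >= 700 ? "eos" : "max_new_tokens")
}
print(answer.stop_reason, triedCaps)
```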
Slow first response (cold start)
The first inference after loading a model is noticeably slower than subsequent ones. This is expected: the engine performs JIT compilation and warms internal caches on the first run.
To minimize cold-start impact:
- Use prefetch_engines to download the model bundle ahead of time so loading doesn’t add network latency on top of warm-up.
- Issue a short throwaway prompt (e.g. "hi") during app startup to warm the engine before the user’s first real query.
Subsequent inferences will be significantly faster.
Out of memory on older devices
Larger models need more RAM. If the app crashes or the system kills it during model loading, the model is too large for the device.
- Switch to Qwen3-0.6B, which has the smallest memory footprint.
- Reduce max_context_size from the default 2048 to 1024 or 512. A smaller context window uses less memory.
- Stop other pipelines (transcription, TTS) before loading the LLM. Only one large model should be resident at a time on constrained devices.
let llm = try await TheStageLLM(
    engines_path: "TheStageAI/Qwen3-0.6B",
    max_context_size: 1024
)
Wrong or garbled output
If the model produces nonsensical, repetitive, or off-topic text:
- Temperature too high: Values above 1.0 flatten the token distribution and make sampling increasingly random. Try temperature: 0.7 (the default) or lower for factual tasks.
- Model mismatch: Ensure that engines_path points to a valid TheStageAI model bundle. Custom or corrupted bundles may use the wrong chat template, producing garbled output.
- Prompt formatting: The SDK applies chat templates automatically. Do not add <|im_start|> or other template tokens manually; they will be double-applied and confuse the model.
let result = llm.infer(
    prompt: "Summarize this article.",
    temperature: 0.3,  // lower temperature for factual tasks
    top_k: 10
)
Load Progress
on_load_progress is optional. When set, the handler fires through
four phases (.downloading, .extracting, .loading, .ready) with a monotonic fraction in 0...1:
Swift:
let llm = try await TheStageLLM(
    engines_path: "TheStageAI/Qwen3-0.6B",
    on_load_progress: { p in
        print("[\(p.model)] \(p.phase) \(Int(p.fraction * 100))%")
    }
)
Cache hits skip .downloading / .extracting and emit only
.loading followed by .ready. Failed loads do not emit .ready.
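The phase rules above can be captured in a small model, which is handy when testing a progress UI without loading a real bundle. `LoadPhase`, `expectedPhases`, and `progressLine` are hypothetical names for illustration; the SDK's own phase type may differ, and the case order simply follows the description in this section.

```swift
/// The four phases named in this section, in the order they occur.
enum LoadPhase: String, CaseIterable {
    case downloading, extracting, loading, ready
}

/// Phases a successful load is expected to report.
/// On a cache hit, downloading and extracting are skipped.
func expectedPhases(cacheHit: Bool) -> [LoadPhase] {
    return cacheHit ? [.loading, .ready] : LoadPhase.allCases
}

/// Formats one progress line, clamping the fraction to 0...1.
func progressLine(model: String, phase: LoadPhase, fraction: Double) -> String {
    let percent = Int((min(max(fraction, 0), 1) * 100).rounded())
    return "[\(model)] \(phase.rawValue) \(percent)%"
}

print(progressLine(model: "llm", phase: .loading, fraction: 0.5))
```

Since failed loads never emit .ready, a UI can treat the absence of that phase after the constructor throws as the signal to show an error state.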
Flutter:
TheStageFlutterSDK.on_progress.listen((event) {
  if (event['model_name'] != 'llm') return;
  final phase = event['phase'] as String?;
  final fraction = event['progress'] as double?;
  // Round to a whole percentage, matching the Swift example.
  print('[llm] $phase ${((fraction ?? 0) * 100).round()}%');
});
await TheStageFlutterSDK.start_model(
  model_name: 'llm',
  engines_path: 'TheStageAI/Qwen3-0.6B',
);
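Prefetch Engines
prefetch_engines downloads the model bundle ahead of time and returns the local directory, which you can then pass as engines_path:
let engines_dir = try await ai.prefetch_engines(
    repo_id: "TheStageAI/Qwen3-0.6B"
)
// Later: instant load, no network
let llm = try await TheStageLLM(engines_path: engines_dir)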
Cleanup
TheStageLLM is a normal Swift object; drop the reference to release
it. If you used the singleton API, stop the model explicitly:
Swift:
_ = try ai.stop_model(model_name: "llm")
Flutter:
await TheStageFlutterSDK.stop_model(model_name: 'llm');