LLM (Language Model)

Overview

On-device language model inference in two modes: one-shot (batch) and token-by-token streaming. TheStageLLM wraps Qwen2 / Qwen3 / Gemma3 chat models with KV cache, chat-template rendering, and stop-token policy.

Chat templates are auto-detected from the model bundle, so you normally never pick one manually; the optional chat_template constructor parameter overrides the detected template. The engine resolves EOS tokens, stop sequences, and the KV-cache horizon from the same bundle metadata.

Flutter consumers go through the singleton start_model + infer / infer_stream (JSON) path; there is no direct LLM constructor in Dart. Both surfaces share the same on-disk cache and the same response shape.

Supported Models

| Model | HF repo | Parameters | Chat template |
| --- | --- | --- | --- |
| Qwen2.5-1.5B | TheStageAI/Qwen2.5-1.5B | 1.5B | Qwen2 |
| Qwen3-0.6B | TheStageAI/Qwen3-0.6B | 0.6B | Qwen3 |
| Gemma3-1B | TheStageAI/Gemma3-1B | 1B | Gemma3 |

The engines_path argument accepts either a HuggingFace repo id or a local directory. Because the chat template, EOS / stop tokens, and KV-cache horizon all come from the bundle, switching models requires no other configuration.

API Reference

Full Constructor

let llm = try await TheStageLLM(
    engines_path: "TheStageAI/Qwen3-0.6B",
    device: "gpu",
    max_context_size: 2048,
    chat_template: nil,
    revision: "main",
    on_load_progress: nil
)

TheStageAI.shared.initialize(apiToken:) must have completed successfully before you call this constructor.

| Parameter | Type | Description |
| --- | --- | --- |
| engines_path | String | HuggingFace repo id or a local directory containing the compiled engine bundle. |
| device | String | Compute device. Defaults to "gpu". |
| max_context_size | Int | Maximum KV-cache context window. Defaults to 2048. |
| chat_template | String? | Override the auto-detected chat template. nil uses the bundle default. |
| revision | String | HuggingFace revision / branch. Defaults to "main". |
| on_load_progress | Closure? | Optional progress callback (see Load Progress). |

Inputs / Outputs

| Direction | Key | Type | Description |
| --- | --- | --- | --- |
| input | prompt | String | The user message. |
| input | system_prompt | String? | Optional system message; defaults to the bundle’s default_system_prompt. |
| input | max_new_tokens | Int (default 512) | Maximum tokens to generate. |
| input | temperature | Float (default 0.7) | Sampling temperature. |
| input | top_k | Int (default 20) | Top-k sampling. |
| input | seed | UInt64? | Deterministic sampling seed. |
| output | LLMResult.text | String | Decoded response. |
| output | LLMResult.prompt_tokens / generated_tokens | Int | Token counts. |
| output | LLMResult.tokens_per_second | Double | Decode speed. |
| output | LLMResult.time_to_first_token / total_seconds | Double | Latency breakdown. |
| output | LLMResult.stop_reason | String | "eos" / "max_new_tokens" / "stop_sequence" / "unknown". |

Singleton API

Use this when you want lifecycle control (stop_model), JSON dispatch (infer(model_name:input_json:)), or are driving the SDK from Flutter. Both flows share the same on-disk cache.

try await ai.start_model(
    model_name: "llm",
    engines_path: "TheStageAI/Qwen3-0.6B"
)

let json = try ai.infer(
    model_name: "llm",
    input_json: [
        "prompt": "What is 2+2?",
        "system_prompt": "You are a helpful assistant.",
        "max_new_tokens": 256,
        "temperature": 0.7,
        "top_k": 20,
        "seed": 42
    ]
)
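// Read the text defensively instead of force-casting, which would crash on a missing key.
let text = json.first?["text"] as? String ?? ""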

JSON streaming yields typed InferenceStreamChunk values — delta carries each token’s text:

let stream = try ai.infer_stream(
    model_name: "llm",
    input_json: ["prompt": "Tell me a story.", "max_new_tokens": 512]
)

for await chunk in stream {
    if !chunk.is_final, let delta = chunk.delta {
        print(delta, terminator: "")
    }
    if chunk.is_final, let tps = chunk.tokens_per_second {
        print("\n--- \(tps) tok/s ---")
    }
}

Note

The JSON path is single-turn. For multi-turn chat history use the direct TheStageLLM API; chat templates are rendered for you.

The Flutter TheStageFlutterSDK.infer / infer_stream calls hit this exact JSON path, so the response keys below apply unchanged on Dart.

Response Keys

Each response dictionary (for example json[0] in the call above) contains the following keys:

| Key | Description |
| --- | --- |
| text | Decoded model response. |
| prompt_tokens | Number of tokens in the prompt. |
| generated_tokens | Number of tokens generated. |
| prefill_seconds | Time spent on prompt prefill. |
| decode_seconds | Time spent on token decoding. |
| tokens_per_second | Decode throughput. |
| time_to_first_token | Latency until the first generated token. |
| total_seconds | Wall-clock time for the full call. |
| stop_reason | "eos" / "max_new_tokens" / "stop_sequence" / "unknown". |

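Reading these keys one cast at a time gets repetitive, so a small typed wrapper can help. The following is a minimal sketch that assumes each response is a [String: Any] dictionary; LLMResponse and wasTruncated are hypothetical names introduced here, not SDK types, and the sample values are made up.

```swift
// Hypothetical typed view over one response dictionary; not an SDK type.
struct LLMResponse {
    let text: String
    let generatedTokens: Int
    let tokensPerSecond: Double
    let stopReason: String

    // Returns nil when a required key is missing or has an unexpected type.
    init?(_ dict: [String: Any]) {
        guard let text = dict["text"] as? String,
              let generatedTokens = dict["generated_tokens"] as? Int,
              let tokensPerSecond = dict["tokens_per_second"] as? Double,
              let stopReason = dict["stop_reason"] as? String
        else { return nil }
        self.text = text
        self.generatedTokens = generatedTokens
        self.tokensPerSecond = tokensPerSecond
        self.stopReason = stopReason
    }

    // True when generation ended because the token cap was reached.
    var wasTruncated: Bool { stopReason == "max_new_tokens" }
}

// Sample dictionary with made-up values, shaped like the table above.
let sample: [String: Any] = [
    "text": "4",
    "prompt_tokens": 24,
    "generated_tokens": 2,
    "tokens_per_second": 50.0,
    "stop_reason": "eos",
]

if let response = LLMResponse(sample) {
    print(response.text, response.wasTruncated)
}
```

The failable initializer keeps a malformed response from crashing the app, which a forced cast would do.
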
Usage Guides

Basic chat completion

Use this for a single question-and-answer exchange, the simplest way to get a response from the model. In Swift, the direct TheStageLLM constructor is recommended because it gives you typed results and streaming for free. Flutter uses the JSON singleton path instead.

Swift — direct constructor (recommended):

import TheStageSDK

let ai = TheStageAI.shared
try await ai.initialize(apiToken: "th_…")

let llm = try await TheStageLLM(
    engines_path: "TheStageAI/Qwen3-0.6B"   // HF repo id, or a local dir
)

let result = llm.infer(
    prompt: "What is 2+2?",
    system_prompt: "You are a helpful assistant.",
    max_new_tokens: 64
)
print(result.text)

Flutter — JSON path:

import 'package:thestage_apple_sdk/thestage_apple_sdk.dart';

await TheStageFlutterSDK.initialize(api_token: 'th_…');

await TheStageFlutterSDK.start_model(
  model_name: 'llm',
  engines_path: 'TheStageAI/Qwen3-0.6B',
);

final result = await TheStageFlutterSDK.infer(
  model_name: 'llm',
  input_json: {
    'prompt': 'What is 2+2?',
    'system_prompt': 'You are a helpful assistant.',
    'max_new_tokens': 64,
  },
);
print(result[0]['text']);

Attention

Always call initialize(apiToken:) before constructing any pipeline. Forgetting this is the most common source of “model loading fails” errors.

Streaming responses to a chat UI

In a chat interface, users expect to see text appear word-by-word rather than waiting several seconds for a complete response. Streaming delivers each token as it is generated, keeping the UI responsive and giving users immediate feedback.

Each chunk before the final sentinel carries one delta of text. The final chunk has is_final == true and includes per-call performance metrics like tokens_per_second.

Swift:

for await chunk in llm.infer_stream(
    prompt: "Tell me a story.",
    max_new_tokens: 512
) {
    if chunk.is_final {
        let tps = chunk.tokens_per_second ?? 0
        print("\n--- \(tps) tok/s ---")
    } else {
        print(chunk.text, terminator: "")
    }
}

Flutter:

import 'dart:io'; // for stdout

final stream = TheStageFlutterSDK.infer_stream(
  model_name: 'llm',
  input_json: {
    'prompt': 'Tell me a story.',
    'max_new_tokens': 512,
  },
);

await for (final chunk in stream) {
  if (chunk['is_final'] == true) {
    // Cast through num so an integer-valued JSON number does not throw.
    final tps = (chunk['tokens_per_second'] as num?)?.toDouble() ?? 0.0;
    print('\n--- $tps tok/s ---');
  } else {
    final delta = chunk['delta'] as String?;
    if (delta != null) stdout.write(delta);
  }
}

Note

Check is_final to capture tokens_per_second, total_seconds, and stop_reason. These metrics are only available on the terminal chunk and are useful for performance monitoring and detecting truncated responses.
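
A chat UI usually needs both the accumulated text and the terminal metrics. The sketch below folds a sequence of chunks into one summary; StreamChunk and StreamSummary are hypothetical stand-ins for the chunk shape described above, and the SDK's own chunk type may differ in naming and optionality.

```swift
// Hypothetical stand-in for a streamed chunk; not the SDK's type.
struct StreamChunk {
    var delta: String? = nil
    var isFinal = false
    var tokensPerSecond: Double? = nil
    var stopReason: String? = nil
}

// What a UI typically keeps once the stream has ended.
struct StreamSummary {
    var text = ""
    var tokensPerSecond: Double? = nil
    var stopReason: String? = nil
}

// Appends every delta and reads metrics only from the terminal chunk.
func summarize(_ chunks: [StreamChunk]) -> StreamSummary {
    var summary = StreamSummary()
    for chunk in chunks {
        if chunk.isFinal {
            summary.tokensPerSecond = chunk.tokensPerSecond
            summary.stopReason = chunk.stopReason
        } else if let delta = chunk.delta {
            summary.text += delta
        }
    }
    return summary
}

// Made-up chunks for demonstration.
let chunks = [
    StreamChunk(delta: "Once"),
    StreamChunk(delta: " upon"),
    StreamChunk(delta: " a time"),
    StreamChunk(isFinal: true, tokensPerSecond: 42.0, stopReason: "max_new_tokens"),
]
let summary = summarize(chunks)
print(summary.text)
```

In a real app the same fold would run inside the for await loop, updating the UI after each delta instead of collecting an array first.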

Setting a system prompt for persona control

The system_prompt parameter shapes how the model responds. Use it to constrain the topic, set a tone, or give the model a specific role. Without one, the bundle’s default system prompt is used.

Swift:

let result = llm.infer(
    prompt: "How do I make risotto?",
    system_prompt: "You are a cooking assistant. Only answer questions about recipes.",
    max_new_tokens: 256
)

Flutter:

final result = await TheStageFlutterSDK.infer(
  model_name: 'llm',
  input_json: {
    'prompt': 'How do I make risotto?',
    'system_prompt': 'You are a cooking assistant. Only answer questions about recipes.',
    'max_new_tokens': 256,
  },
);

With this prompt the model should decline off-topic questions like “What’s the weather?” and stay focused on recipes, although small models do not always follow such constraints. System prompts work with both one-shot infer and streaming infer_stream.

Attention

System prompts consume context tokens. Very long system prompts leave fewer tokens for the conversation. Keep them concise — a few sentences is usually enough.
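
The cost of a long system prompt is simple arithmetic. This sketch assumes that the system prompt, the user prompt, and the generated tokens all share the max_context_size window; generationBudget and the token counts are hypothetical, and real counts depend on the bundle's tokenizer and chat template.

```swift
// Tokens left for generation once the prompt occupies part of the window.
// templateOverheadTokens is a caller-supplied estimate for template markup.
func generationBudget(maxContextSize: Int,
                      systemPromptTokens: Int,
                      userPromptTokens: Int,
                      templateOverheadTokens: Int = 0) -> Int {
    let used = systemPromptTokens + userPromptTokens + templateOverheadTokens
    return max(0, maxContextSize - used)
}

// Made-up token counts: a concise and a very long system prompt.
let concise = generationBudget(maxContextSize: 2048,
                               systemPromptTokens: 40,
                               userPromptTokens: 60)
let verbose = generationBudget(maxContextSize: 2048,
                               systemPromptTokens: 1500,
                               userPromptTokens: 60)
print(concise, verbose) // 1948 488
```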

Choosing the right model for your device

Each model trades off speed and memory for output quality. Pick the one that matches your target hardware:

| Model | Size | Speed | Best for |
| --- | --- | --- | --- |
| Qwen3-0.6B | 0.6B (smallest) | Fastest | iPhone, low-RAM iPads, quick replies |
| Gemma3-1B | 1B (mid) | Moderate | Balanced quality/speed on recent iPhones |
| Qwen2.5-1.5B | 1.5B (largest) | Slowest | Highest quality, iPad Pro / M-series Macs |

Qwen3-0.6B fits comfortably on any device with 4 GB+ RAM and produces responses quickly. Start here if latency matters more than nuance.

Gemma3-1B is a good middle ground — noticeably better reasoning than 0.6B with moderate memory overhead.

Qwen2.5-1.5B produces the most coherent and detailed answers but requires more RAM and takes longer per token. Use it on devices with 6 GB+ RAM (for example iPhone 15 Pro and newer, M-series iPad Pro, M-series Macs).

// Pick one engines_path per device class; do not load both models at once.
// Older iPhones: fast and lightweight
let engines_path = "TheStageAI/Qwen3-0.6B"
// iPad Pro: best quality
// let engines_path = "TheStageAI/Qwen2.5-1.5B"
let llm = try await TheStageLLM(engines_path: engines_path)

Note

If you’re unsure, start with Qwen3-0.6B. You can swap models later by changing only the engines_path — the rest of your code stays the same.
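
The choice can also be made at runtime from the device's memory. The following is a minimal sketch, not an SDK feature: recommendedEnginesPath is a hypothetical helper, and the 5.5 GB default threshold is an assumption chosen because the operating system can report slightly less memory than the marketed figure.

```swift
import Foundation

// Picks the larger model only when the device reports roughly 6 GB or more.
// The threshold is an assumption for illustration; tune it on real devices.
func recommendedEnginesPath(physicalMemoryBytes: UInt64,
                            largeModelThresholdBytes: UInt64 = 5_500_000_000) -> String {
    physicalMemoryBytes >= largeModelThresholdBytes
        ? "TheStageAI/Qwen2.5-1.5B"
        : "TheStageAI/Qwen3-0.6B"
}

// ProcessInfo reports the installed physical memory in bytes.
let enginesPath = recommendedEnginesPath(
    physicalMemoryBytes: ProcessInfo.processInfo.physicalMemory
)
print(enginesPath)
```

The returned string can be passed directly as engines_path, so the rest of the code stays the same for both models.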

Deterministic output with seed

For testing and reproducibility, set the seed parameter so the model produces the same output every time for the same input. Without a seed, the sampler uses a random seed on each call.

Swift:

let result = llm.infer(
    prompt: "Write a haiku about the ocean.",
    temperature: 0.7,
    seed: 42
)
// Run it again with the same seed: the output is identical
let result2 = llm.infer(
    prompt: "Write a haiku about the ocean.",
    temperature: 0.7,
    seed: 42
)
assert(result.text == result2.text)

Flutter:

final result = await TheStageFlutterSDK.infer(
  model_name: 'llm',
  input_json: {
    'prompt': 'Write a haiku about the ocean.',
    'seed': 42,
    'temperature': 0.7,
  },
);

Attention

Determinism is only guaranteed on the same device with the same model bundle. Different hardware or model versions may decode differently even with identical seeds.
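
To see why a seed makes sampling repeatable, it helps to look at a toy sampler. The sketch below is not the SDK's sampler; it is a generic illustration of temperature scaling, top-k filtering, and a seeded random generator, with SeededGenerator, sampleToken, and the logits all hypothetical.

```swift
import Foundation

// Small deterministic generator (SplitMix64): the seed fully fixes every draw.
struct SeededGenerator: RandomNumberGenerator {
    private var state: UInt64
    init(seed: UInt64) { state = seed }
    mutating func next() -> UInt64 {
        state &+= 0x9E37_79B9_7F4A_7C15
        var z = state
        z = (z ^ (z >> 30)) &* 0xBF58_476D_1CE4_E5B9
        z = (z ^ (z >> 27)) &* 0x94D0_49BB_1331_11EB
        return z ^ (z >> 31)
    }
}

// Picks a token index from raw logits using temperature and top-k.
func sampleToken<G: RandomNumberGenerator>(logits: [Double], temperature: Double,
                                           topK: Int, using rng: inout G) -> Int {
    precondition(!logits.isEmpty && temperature > 0 && topK > 0)
    // Keep only the k highest-scoring candidates.
    let candidates = logits.enumerated()
        .sorted { $0.element > $1.element }
        .prefix(topK)
    // Softmax with temperature, shifted by the max for numerical stability.
    let maxLogit = candidates.first!.element
    let weights = candidates.map { exp(($0.element - maxLogit) / temperature) }
    let total = weights.reduce(0, +)
    // Inverse-CDF draw over the kept candidates.
    var threshold = Double.random(in: 0..<total, using: &rng)
    for (candidate, weight) in zip(candidates, weights) {
        threshold -= weight
        if threshold < 0 { return candidate.offset }
    }
    return candidates.last!.offset
}

// Made-up logits for a four-token vocabulary.
let logits = [2.0, 1.0, 0.5, -1.0]

func draw(seed: UInt64, count: Int) -> [Int] {
    var rng = SeededGenerator(seed: seed)
    return (0..<count).map { _ in
        sampleToken(logits: logits, temperature: 0.7, topK: 3, using: &rng)
    }
}

print(draw(seed: 42, count: 8) == draw(seed: 42, count: 8)) // true
```

With topK set to 3 the lowest-scoring token (index 3) can never be drawn, and with topK set to 1 the sampler always returns the highest-scoring token regardless of seed.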

Troubleshooting

Model loading fails or hangs

The most common cause is a missing initialize(apiToken:) call. The SDK must authenticate before it can download or load any model bundle.

  1. Verify that TheStageAI.shared.initialize(apiToken:) completed successfully before constructing TheStageLLM.

  2. On first launch the model must be downloaded. Check that the device has a working network connection.

  3. Attach an on_load_progress callback to see which phase is stuck:

let llm = try await TheStageLLM(
    engines_path: "TheStageAI/Qwen3-0.6B",
    on_load_progress: { p in
        print("[\(p.model)] \(p.phase) \(Int(p.fraction * 100))%")
    }
)

If progress stalls at .downloading, the network is the bottleneck. If it stalls at .loading, the device may be out of memory.
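
Spotting a stall by eye works during development; in production a small watchdog can do the same job. This is a minimal sketch under stated assumptions: LoadPhase and StallDetector are hypothetical types that mirror the phases named in this guide, timestamps are supplied by the caller in seconds, and the 30-second timeout is arbitrary.

```swift
// Hypothetical mirror of the load phases named in this guide.
enum LoadPhase: String {
    case downloading, extracting, loading, ready
}

// Flags a stall when the reported fraction has not advanced for `timeout` seconds.
struct StallDetector {
    let timeout: Double
    private var lastFraction = -1.0
    private var lastAdvance = 0.0
    private(set) var phase = LoadPhase.downloading

    init(timeout: Double) { self.timeout = timeout }

    // Call this from the progress handler with the current time in seconds.
    mutating func record(phase: LoadPhase, fraction: Double, at time: Double) {
        self.phase = phase
        if fraction > lastFraction {
            lastFraction = fraction
            lastAdvance = time
        }
    }

    // Returns a hint when stalled, nil while progress is still advancing.
    func hint(at time: Double) -> String? {
        guard phase != .ready, time - lastAdvance > timeout else { return nil }
        switch phase {
        case .downloading: return "Stalled while downloading: check the network."
        case .loading:     return "Stalled while loading: the device may be low on memory."
        default:           return "Stalled during \(phase.rawValue)."
        }
    }
}

var detector = StallDetector(timeout: 30)
detector.record(phase: .downloading, fraction: 0.10, at: 0)
detector.record(phase: .downloading, fraction: 0.10, at: 45)
print(detector.hint(at: 45) ?? "progressing")
```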

Response is cut off mid-sentence

When the output ends abruptly, the model likely hit the max_new_tokens limit before finishing its thought.

  1. Increase max_new_tokens (default is 512). Try 1024 or higher for longer answers.

  2. Check stop_reason in the result — if it says "max_new_tokens" instead of "eos", the model was still generating when the cap was reached.

let result = llm.infer(
    prompt: "Explain quantum computing in detail.",
    max_new_tokens: 1024
)
print(result.stop_reason) // "eos" means it finished naturally
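
The check on stop_reason can be automated with a retry loop. The sketch below is one possible policy, not SDK behavior: Generation and generateUntilComplete are hypothetical, and the generate closure stands in for a call such as llm.infer with a given max_new_tokens.

```swift
// Hypothetical result shape: only the two fields this sketch needs.
struct Generation {
    let text: String
    let stopReason: String
}

// Doubles the token cap until the model stops on its own
// or the next cap would exceed maxCap.
func generateUntilComplete(startingCap: Int, maxCap: Int,
                           generate: (Int) -> Generation) -> Generation {
    var cap = startingCap
    var result = generate(cap)
    while result.stopReason == "max_new_tokens" && cap * 2 <= maxCap {
        cap *= 2
        result = generate(cap)
    }
    return result
}

// Fake generator for demonstration: it needs 700 tokens to finish.
let fakeGenerate: (Int) -> Generation = { cap in
    cap >= 700
        ? Generation(text: "complete answer", stopReason: "eos")
        : Generation(text: "partial ans", stopReason: "max_new_tokens")
}

let outcome = generateUntilComplete(startingCap: 512, maxCap: 2048,
                                    generate: fakeGenerate)
print(outcome.stopReason) // eos
```

Each retry regenerates the answer from scratch, so this trades latency for completeness; choosing a larger cap up front is cheaper when long answers are expected.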

Slow first response (cold start)

The first inference after loading a model is noticeably slower than subsequent ones. This is expected — the engine performs JIT compilation and warms internal caches on the first run.

To minimize cold-start impact:

  • Use prefetch_engines to download the model bundle ahead of time so loading doesn’t add network latency on top of warm-up.

  • Issue a short throwaway prompt (e.g. "hi") during app startup to warm the engine before the user’s first real query. Inferences after that warm-up run significantly faster.

Out of memory on older devices

Larger models need more RAM. If the app crashes or the system kills it during model loading, the model is too large for the device.

  • Switch to Qwen3-0.6B — it has the smallest memory footprint.

  • Reduce max_context_size from the default 2048 to 1024 or 512. A smaller context window uses less memory.

  • Stop other pipelines (transcription, TTS) before loading the LLM. Only one large model should be resident at a time on constrained devices.

let llm = try await TheStageLLM(
    engines_path: "TheStageAI/Qwen3-0.6B",
    max_context_size: 1024
)
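
The link between context size and memory follows from how a KV cache is laid out. The sketch below uses the standard estimate for a transformer KV cache; kvCacheBytes is a hypothetical helper, and the layer, head, and precision values are placeholder assumptions, not numbers read from any TheStage bundle. The engine may also allocate or quantize its cache differently.

```swift
// Approximate KV-cache size: one key and one value vector
// for every layer, KV head, and position in the context.
func kvCacheBytes(layers: Int, kvHeads: Int, headDim: Int,
                  contextSize: Int, bytesPerElement: Int = 2) -> Int {
    2 * layers * kvHeads * headDim * contextSize * bytesPerElement
}

func mebibytes(_ bytes: Int) -> Double { Double(bytes) / 1_048_576 }

// Placeholder dimensions for a small model; assumptions, not bundle values.
let fullContext = kvCacheBytes(layers: 28, kvHeads: 8, headDim: 128, contextSize: 2048)
let halfContext = kvCacheBytes(layers: 28, kvHeads: 8, headDim: 128, contextSize: 1024)
print(mebibytes(fullContext), mebibytes(halfContext)) // 224.0 112.0
```

Under these assumptions the cache scales linearly with the context size, so halving max_context_size halves this part of the memory footprint. Model weights are unaffected.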

Wrong or garbled output

If the model produces nonsensical, repetitive, or off-topic text:

  • Temperature too high: Values well above 1.0 flatten the token distribution, so sampling becomes increasingly random. Try temperature: 0.7 (the default) or lower for factual tasks.

  • Model mismatch: Ensure the engines_path points to a valid TheStageAI model bundle. Custom or corrupted bundles may use the wrong chat template, producing garbled output.

  • Prompt formatting: The SDK applies chat templates automatically. Do not add <|im_start|> or other template tokens manually — they will be double-applied and confuse the model.

let result = llm.infer(
    prompt: "Summarize this article.",
    temperature: 0.3,    // lower temperature for factual tasks
    top_k: 10
)
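
If prompts come from user input or stored transcripts, a quick check can catch template tokens before they reach the model. This is a minimal sketch: templateMarkersFound is a hypothetical helper, and the marker list covers common ChatML-style and Gemma control tokens for illustration only; it is not exhaustive and is not read from any bundle.

```swift
import Foundation

// Control tokens used by ChatML-style (Qwen) and Gemma chat templates.
// Illustrative list only.
let templateMarkers = ["<|im_start|>", "<|im_end|>", "<start_of_turn>", "<end_of_turn>"]

// Returns the markers found in `prompt`, in the order listed above.
func templateMarkersFound(in prompt: String) -> [String] {
    templateMarkers.filter { prompt.contains($0) }
}

let found = templateMarkersFound(in: "<|im_start|>user\nHello<|im_end|>")
print(found)
```

A non-empty result suggests the prompt was already templated elsewhere and should be passed as plain text instead.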

Load Progress

on_load_progress is optional. When set, the handler fires through four phases (.downloading, .extracting, .loading, .ready) with a monotonically non-decreasing fraction in 0...1:

Swift:

let llm = try await TheStageLLM(
    engines_path: "TheStageAI/Qwen3-0.6B",
    on_load_progress: { p in
        print("[\(p.model)] \(p.phase) \(Int(p.fraction * 100))%")
    }
)

Cache hits skip .downloading / .extracting and emit only .loading followed by .ready. Failed loads do not emit .ready.

Flutter:

TheStageFlutterSDK.on_progress.listen((event) {
  if (event['model_name'] != 'llm') return;
  final phase    = event['phase']    as String?;
  // Cast through num so an integer-valued progress value does not throw.
  final fraction = (event['progress'] as num?)?.toDouble() ?? 0.0;
  print('[llm] $phase ${(fraction * 100).round()}%');
});

await TheStageFlutterSDK.start_model(
  model_name: 'llm',
  engines_path: 'TheStageAI/Qwen3-0.6B',
);

Prefetch Engines

prefetch_engines downloads a model bundle ahead of time and returns its local directory, which you can pass as engines_path later.

let engines_dir = try await ai.prefetch_engines(
    repo_id: "TheStageAI/Qwen3-0.6B"
)

// Later: load from the local copy, no network needed
let llm = try await TheStageLLM(engines_path: engines_dir)

Cleanup

TheStageLLM is a normal Swift object, so dropping the last reference releases it. If you used the singleton API, stop the model explicitly:

Swift:

_ = try ai.stop_model(model_name: "llm")

Flutter:

await TheStageFlutterSDK.stop_model(model_name: 'llm');