Apple Compiler and Inference API¶
Attention
Access to Qlip requires an API token from the TheStage AI Platform and additional access, which can be requested by contacting frameworks@thestage.ai.
Overview¶
Main pipeline for compiling and running inference of PyTorch models with Qlip on Apple devices:
Compilation workflow

Step 1: Importing Required Modules and Model Preparation
- Obtain a PyTorch model.
- Initialize the compile manager AppleCompileManager.

Step 2: Setting up the model for compilation
- Specify the builder configuration AppleBuilderConfig.
- Set up the whole model or submodules for compilation with setup_model() or setup_modules().

Step 3: Capturing shape profiles
- Obtain the ShapeProfileManager context manager with shape_profile().
- Run examples with different input shapes to capture shape profiles.

Step 4: Tracing the Model and Compilation
- Compile the model or submodules with compile().

Step 5: Running Inference
- Specify the inference session configuration AppleSessionConfig.
- Set up the inference manager AppleInferenceManager, either from the workspace or from the compile manager with from_compilemanager().
- Set up the submodules or the whole model for inference with setup_model() or setup_modules().
- Run inference of the compiled model.
Why use Qlip compiler for Apple?
- Acceleration of inference on Apple Silicon (M1, M2, M3, and M4 chips).
- Ahead-of-time rather than JIT compilation: compiled models can be saved to disk and reused with minimal cold-start time.
- Support for dynamic shapes and optimization for specific performance-critical inputs.
- Allows mixing PyTorch code and multiple compiled models in a single pipeline.
- Natively supports compilation of models with quantized weights and activations produced by the Quantization API and Quantization Algorithms.
- Supports compilation of models by blocks, allowing optimization of specific parts of the model separately.
- Supports compilation of models to the float16 data type.
- Supports compilation of quantized models with w8a8 (M4+ chips) and w4a16 configurations.
API Reference¶
The Apple backend is based on CoreML Tools (coremltools).
Base Classes¶
- Compiled module class for compilation.
- ShapeProfileManager: shape profile manager that controls shape collection.
- Compiled module class for inference.
Apple-specific Classes¶
- AppleCompileManager: Apple manager for compilation.
- AppleBuilderConfig: builder configuration for Apple.
- AppleInferenceManager: Apple inference manager.
- AppleSessionConfig: configuration for an Apple inference session.
PyTorch Model Compilation and Inference¶
This section covers how to compile and run inference of PyTorch models using Qlip on Apple devices.
Basic Model Compilation¶
The basic compilation workflow for Apple:
1. Initialize the compile manager with AppleCompileManager.
2. Set up the model for compilation with AppleBuilderConfig.
3. Capture shape profiles by running example inputs.
4. Compile the model.
5. Run inference with the compiled model.
In this example, we have a ResNet-18 model that we want to use for image classification, supporting inference with batch sizes from 1 to 4.
First of all, we obtain the model in float16 data type and set up the model for compilation
with AppleCompileManager.
import torch
import torchvision.models as models
from qlip.compiler.apple import AppleCompileManager, AppleBuilderConfig
from qlip.inference.apple import AppleInferenceManager
import coremltools as ct
# Device and dtype - Apple backend compiles on CPU
device = "cpu"
dtype = torch.float16
# creating models for compilation and comparison
model_orig = models.resnet18(pretrained=True).to(device).to(dtype)
model_qlip = models.resnet18(pretrained=True).to(device).to(dtype)
model_orig.eval()
model_qlip.eval()
# different input for tracing
input_1 = torch.randn(1, 3, 224, 224).to(device).to(dtype)
input_2 = torch.randn(4, 3, 224, 224).to(device).to(dtype)
# Setup builder configuration
config = AppleBuilderConfig(
    compute_units=ct.ComputeUnit.CPU_AND_NE,
    compute_precision=torch.float16,
    minimum_deployment_target=ct.target.macOS15,
)
# setup model for compilation
cm = AppleCompileManager(model_qlip, workspace="model_qlip", format="mlprogram")
model_qlip = cm.setup_model(builder_config=config)
To compile a model with different input shapes, you can use a dynamic shapes strategy. For that, you need to trace the model with different input shapes. Using two inputs with batch sizes 1 and 4 and then compiling the model with a dynamic shape profile, you will be able to run inference on any input shape with batch size between 1 and 4.
# trace different input shapes which we wish to cover
with cm.shape_profile(type="dynamic"):
    model_qlip(input_1)
    model_qlip(input_2)
# compile model
cm.compile()
Now we can run inference on any input shape with batch size between 1 and 4 and compare results with the original model.
import time
input_3 = torch.randn(2, 3, 224, 224).to(device).to(dtype)
def benchmark(model, input, n_repeat=10):
    t = time.time()
    with torch.no_grad():
        for i in range(n_repeat):
            output = model(input)
    return time.time() - t, output
t_qlip, output_qlip = benchmark(model_qlip, input_3)
t_orig, output_orig = benchmark(model_orig, input_3)
print("DIFF: ", (output_qlip - output_orig).abs().median())
print(f"Qlip model inference time: {t_qlip:.4f} seconds")
print(f"Original model inference time: {t_orig:.4f} seconds")
To delete a compiled model and free memory, you have to delete the compiled module and the compile manager:
del model_qlip
del cm
Serialization of Compiled Models¶
The compile manager saves the compiled model to the workspace directory.
You can use the format parameter to specify the serialization format.
Apple backend supports two formats:
- mlpackage: serialized MLPackage file (default, set via format="mlprogram").
- mlmodelc: pre-compiled CoreML model (set via format="mlmodelc").
Compiled models are automatically encrypted into .qlip files. An encryption key
(qlip_key.bin) is saved in the workspace directory and is required to load the
models at inference time.
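Since the workspace must travel with both the encrypted artifacts and the key, a quick pre-flight check before deployment can save a failed load at inference time. A minimal sketch, assuming the `.qlip` artifacts sit at the top level of the workspace directory (the layout beyond qlip_key.bin is an assumption, not documented behavior):

```python
from pathlib import Path

def check_workspace(workspace):
    """Verify the workspace holds the encryption key and at least one .qlip artifact.

    Assumes .qlip files are stored at the top level of the workspace
    (hypothetical layout); qlip_key.bin is the documented key file name.
    """
    ws = Path(workspace)
    key = ws / "qlip_key.bin"
    artifacts = list(ws.glob("*.qlip"))
    if not key.exists():
        raise FileNotFoundError(f"missing encryption key: {key}")
    if not artifacts:
        raise FileNotFoundError(f"no .qlip artifacts in {ws}")
    return artifacts
```

Run it against the workspace directory before copying it to the target machine.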
# Initialize compile manager with MLPackage format (default)
cm = AppleCompileManager(model_qlip, workspace="model_qlip", format="mlprogram")
# Or use pre-compiled CoreML model format
cm = AppleCompileManager(model_qlip, workspace="model_qlip", format="mlmodelc")
Inference of Compiled Models¶
You can run the compiled model directly with the default session configuration.
Advanced inference options are available through the inference manager AppleInferenceManager. It can be initialized from the compile manager with from_compilemanager().
To load the compiled model from the workspace, you can use the same class AppleInferenceManager.
To set up the model from the workspace, you can use the following methods:
- setup_model(): set up the whole model from the workspace.
- setup_modules(): set up the modules from the workspace.
- auto_setup(): automatically set up the model and/or submodules from the workspace.
from qlip.inference.apple import AppleInferenceManager, AppleSessionConfig
import coremltools as ct
# Initialize inference manager with session configuration
config = AppleSessionConfig(compute_units=ct.ComputeUnit.CPU_AND_NE)
imanager = AppleInferenceManager.from_compilemanager(cm, config=config)
# Or load the compiled model from the workspace
imanager = AppleInferenceManager(model, workspace="model_qlip")
model_qlip = imanager.auto_setup()
Working with Multiple Shapes¶
We have already seen how to compile a model with dynamic shapes. In general, there are two ways to specify input shapes:
Dynamic shape profile: Use minimum and maximum shapes observed during tracing.
Static shape profiles: Use a list of shapes that were observed during tracing. This is useful when you have a fixed set of shapes that you want to support with maximum performance.
The compile manager provides a context manager ShapeProfileManager to keep track of input shapes.
It can also skip the first n shapes during tracking.
Note
Apple backend supports only one dynamic or multiple static shape profiles.
Dynamic shape profile
A single call of the ShapeProfileManager context manager with dynamic type will capture one dynamic shape profile.
with compile_manager.shape_profile(type="dynamic"):
    model(torch.randn(1, 3, 224, 224).to(device).to(dtype))
    model(torch.randn(4, 3, 224, 224).to(device).to(dtype))
    # ... add more shapes as desired
In this case, we captured a minimum batch size of 1 and a maximum batch size of 4. The compiled model will work with any input whose batch size is between 1 and 4.
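Conceptually, a dynamic profile reduces the shapes observed during tracing to per-axis lower and upper bounds. A minimal pure-Python sketch of that reduction (illustrative only, not Qlip's implementation):

```python
def dynamic_profile(observed_shapes):
    """Reduce a list of observed shapes to (min, max) bounds per axis."""
    lower = list(observed_shapes[0])
    upper = list(observed_shapes[0])
    for shape in observed_shapes[1:]:
        for axis, size in enumerate(shape):
            lower[axis] = min(lower[axis], size)
            upper[axis] = max(upper[axis], size)
    return list(zip(lower, upper))

# Shapes traced above: batch sizes 1 and 4, fixed 3x224x224 images
profile = dynamic_profile([(1, 3, 224, 224), (4, 3, 224, 224)])
print(profile)  # [(1, 4), (3, 3), (224, 224), (224, 224)]
```

Only the batch axis ends up with a range; the remaining axes collapse to fixed sizes.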
Static shape profiles
A single call of the ShapeProfileManager context manager with static type will capture as many static shape profiles as forward calls are made in the context.
# Trace with different fixed shapes in static mode
with cm.shape_profile(type="static"):
    model_qlip(torch.randn(1, 3, 224, 224).to(device).to(dtype))
    model_qlip(torch.randn(4, 3, 224, 224).to(device).to(dtype))
    model_qlip(torch.randn(8, 3, 224, 224).to(device).to(dtype))
    # ... add more shapes as needed
In the case above, we traced the model with three different shapes. Note that the compiled model will only work with these shapes. You can do the same for a model with multiple inputs. Take into account that compilation time increases with the number of supported shapes.
Skip n shapes
Skipping first n shapes during tracking is useful when you have multiple calls of the compiled module with different shapes.
For example, it is common for an LLM to have prefill and decode stages with different shapes.
To track shapes only for the decode stage, you can skip the prefill stage, which takes one forward pass.
Here is an example of how to obtain a dynamic shape profile with batch sizes from 1 to 4 only for the decode stage:
inputs_bs1 = {"input_ids": torch.randint(0, 1000, (1, 1))}
inputs_bs4 = {"input_ids": torch.randint(0, 1000, (4, 1))}

with cm.shape_profile(type="dynamic") as sp:
    with sp.skip_n(n=1):
        model.generate(**inputs_bs1)
    with sp.skip_n(n=1):
        model.generate(**inputs_bs4)
    # ... add more shapes as needed
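The skip_n behavior can be pictured as a counter that discards the first n shapes recorded inside its scope. A hypothetical pure-Python sketch (class and method names are illustrative, not Qlip's internals):

```python
from contextlib import contextmanager

class ShapeTracker:
    """Illustrative shape tracker that can skip the first n recorded shapes."""
    def __init__(self):
        self.shapes = []
        self._skip = 0

    @contextmanager
    def skip_n(self, n):
        self._skip = n
        try:
            yield self
        finally:
            self._skip = 0

    def record(self, shape):
        if self._skip > 0:
            self._skip -= 1  # discard, e.g. the prefill-stage shape
            return
        self.shapes.append(shape)

tracker = ShapeTracker()
with tracker.skip_n(n=1):
    tracker.record((1, 37))  # prefill: skipped
    tracker.record((1, 1))   # decode: recorded
print(tracker.shapes)  # [(1, 1)]
```

This mirrors the LLM example above: the prefill forward pass is discarded, so only the decode-stage shapes enter the profile.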
Fake mode
You can enable fake mode to skip actual computation during shape profiling. This is useful when you want to profile the model without actually running it, e.g. to avoid out of memory errors for large models.
When using fake mode with JIT trace export (default for Apple backend), you need to collect real inputs before shape profiling with fake mode. This is because JIT trace export requires real tensors, not fake tensors.
# First, collect real inputs for export (without fake mode)
with cm.collect_inputs():
    model(torch.randn(1, 3, 224, 224).to(device).to(dtype))

# Then, do shape profiling with fake mode
with cm.shape_profile(type="static", fake_mode=True):
    model(torch.randn(1, 3, 224, 224).to(device).to(dtype))
    model(torch.randn(4, 3, 224, 224).to(device).to(dtype))
    # ... add more shapes as needed
Compilation of Quantized Models¶
The following quantization configurations are supported by the Apple backend:
Supported quantization types:

- w8a8: both weights and activations are quantized with the int8 data type for linear and convolution layers. Supported starting from Apple M4 chips (M4, M4 Pro, M4 Max) and later. This configuration leverages the Neural Engine's int8-int8 compute path for significant latency improvements.
- w4a16: block quantization of weights with the int4 data type and no quantization of activations. This type of quantization helps reduce memory usage for large models and can also accelerate inference for small batches, when weight memory transfer takes a significant part of the inference time.
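As a numeric refresher on what symmetric int8 weight quantization (the "w8" part of w8a8) means, here is a minimal pure-Python sketch for a single weight channel. This is illustrative only; QuantizationManager performs the real quantization:

```python
def quantize_int8_symmetric(channel):
    """Symmetric int8 quantization of one weight channel: q = round(w / scale)."""
    scale = max(abs(w) for w in channel) / 127.0  # map the largest magnitude to 127
    q = [max(-128, min(127, round(w / scale))) for w in channel]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values."""
    return [v * scale for v in q]

weights = [0.5, -1.0, 0.25]
q, scale = quantize_int8_symmetric(weights)
print(q)  # [64, -127, 32]
print(dequantize(q, scale))  # approximately the original weights
```

Per-channel granularity, as configured in the example below, simply computes one such scale per output channel instead of one for the whole tensor.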
We do not provide pre-defined quantization configurations for the Apple backend,
but you can manually specify quantization parameters for QuantizationManager.
Example of setting up quantization for Apple:
from qlip.quantization import QuantizationManager
from qlip.quantization import QuantScheme
device = "cpu"
dtype = torch.float16
# initialize model
model = create_my_model().to(device).to(dtype)
# Prepare quantization wrapper
quantizer = QuantizationManager()
# Select modules to quantize (Linear in this case)
modules = [
    mod for mod in model.modules()
    if isinstance(mod, torch.nn.Linear)
]
# Setup w8a8 quantization manually
quantizer.setup_modules(
    modules,
    weights_scheme=QuantScheme("int", 8, symmetric=True),
    weights_granularity="per-channel",
    activations_scheme=QuantScheme("int", 8, symmetric=False),
)
# run calibration to estimate activation ranges
model(input)
model.eval()
Now we can compile this quantized model with Qlip compiler:
from qlip.compiler.apple import AppleCompileManager, AppleBuilderConfig
import coremltools as ct
config = AppleBuilderConfig(
    compute_units=ct.ComputeUnit.CPU_AND_NE,
    compute_precision=torch.float16,
    minimum_deployment_target=ct.target.macOS15,
)
# Setup compile manager for quantized model
cm = AppleCompileManager(model, workspace="model_qlip", format="mlprogram")
model_qlip = cm.setup_model(builder_config=config)
# Trace with example inputs
with cm.shape_profile("static"):
    model_qlip(input)
# Compile the model
cm.compile()
# Run inference
output = model_qlip(input)
Compilation of Components and Submodules¶
You can compile specific parts of the model separately
using the setup_modules() method.
To compile one component, use the component parameter. You can also provide a custom name for the component with the component_name parameter.
cm.setup_modules(component="encoder.embeddings", component_name="embeddings")
You can also compile specific modules by their types with the module_types parameter.
cm.setup_modules(module_types=[torch.nn.Linear])
If you want to compile specific modules by their names, you can use the modules parameter.
cm.setup_modules(modules=["encoder", "decoder"])
You can also address submodules within a component with the component parameter.
cm.setup_modules(component="encoder.embeddings", module_types=[torch.nn.Linear])
cm.setup_modules(component="encoder", modules=["embeddings", "transformer"])
You can compile specific parts of the model separately to run on different compute units.
from qlip.compiler.apple import AppleCompileManager, AppleBuilderConfig
from qlip.inference.apple import AppleInferenceManager, AppleSessionConfig
import coremltools as ct
config = AppleBuilderConfig(
    compute_units=ct.ComputeUnit.CPU_AND_NE,
    compute_precision=torch.float16,
)
cm = AppleCompileManager(model, workspace=tmp_path)
cm.setup_modules(modules=["encoder", "decoder"], builder_config=config)
with cm.shape_profile("static"):
    model(input)
cm.compile()
ne_config = AppleSessionConfig(compute_units=ct.ComputeUnit.CPU_AND_NE)
gpu_config = AppleSessionConfig(compute_units=ct.ComputeUnit.CPU_AND_GPU)
imanager = AppleInferenceManager.from_compilemanager(cm)
imanager.setup_modules(modules=["encoder"], inference_config=ne_config)
imanager.setup_modules(modules=["decoder"], inference_config=gpu_config)
output = model(input)
Custom Axes Names¶
When using dynamic shape profiles, Qlip automatically generates unique names for each dynamic axis. By default, axes with the same dynamic dimension across different inputs are treated as independent. This can cause compilation errors when the compiler encounters incompatible axis constraints, or lead to suboptimal profiles because the compiler doesn’t know the dimensions are always equal.
By assigning the same custom name to multiple axes, you tell the compiler that these axes are linked and always have the same value.
Consider a model with two linear layers that each receive a separate input. Both inputs share the same batch dimension, but the compiler has no way to know that — it assigns independent axis names to each input by default. When using dynamic shapes, this mismatch can cause compilation errors because the compiler sees two unrelated dynamic dimensions where there is actually one.
class TwoInputModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear_x = torch.nn.Linear(128, 128)
        self.linear_y = torch.nn.Linear(128, 128)

    def forward(self, x, y):
        return self.linear_x(x) + self.linear_y(y)
To fix this, assign the same axis name to the batch dimension of both inputs with
set_axes_names(). The compiler will then
treat them as a single linked axis:
cm.set_axes_names({
    "linear_x": {
        "input_0": "batch_size",
    },
    "linear_y": {
        "input_0": "batch_size",  # linked to the same axis
    },
})
Enabling and Disabling Compiled Modules¶
After compilation, you can toggle compiled modules on and off without removing them. This is useful for debugging or comparing compiled vs. original model outputs.
# Disable compiled computation (use original PyTorch modules)
cm.enable(False)
output_original = model(input)
# Re-enable compiled computation
cm.enable(True)
output_compiled = model(input)
To permanently restore the original modules and remove compiled computation:
cm.remove()
# model now uses original PyTorch modules
Troubleshooting and Limitations¶
This section covers common issues and limitations when using Qlip for model compilation and inference on Apple devices.
Handling input tuples and dictionaries¶
Models can accept multiple inputs as arguments and keyword arguments. Non-tensor inputs are interpreted as constant values.
Note
Qlip supports tuples and dictionaries as inputs for compilation and inference but only with a single level of nesting.
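The one-level restriction follows from the need to flatten every input down to a fixed list of named tensors for the compiled model. A hypothetical sketch of such flattening (the naming scheme is an assumption for illustration, not Qlip's actual behavior):

```python
def flatten_inputs(args, kwargs):
    """Flatten args/kwargs with at most one level of nesting into named leaves."""
    leaves = {}

    def add(name, value):
        if isinstance(value, (tuple, list)):
            for i, item in enumerate(value):
                if isinstance(item, (tuple, list, dict)):
                    raise ValueError(f"{name}[{i}]: more than one level of nesting")
                leaves[f"{name}_{i}"] = item
        elif isinstance(value, dict):
            for key, item in value.items():
                if isinstance(item, (tuple, list, dict)):
                    raise ValueError(f"{name}[{key}]: more than one level of nesting")
                leaves[f"{name}_{key}"] = item
        else:
            leaves[name] = value

    for i, value in enumerate(args):
        add(f"input_{i}", value)
    for key, value in kwargs.items():
        add(key, value)
    return leaves

print(flatten_inputs(("x", ("a", "b")), {"mask": "m"}))
# {'input_0': 'x', 'input_1_0': 'a', 'input_1_1': 'b', 'mask': 'm'}
```

A tuple inside a tuple (two levels) has no slot in this naming scheme, which is why deeper nesting is rejected.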
Export Options¶
Apple backend supports two exporters:

- JIT Trace Export (default): uses torch.jit.trace to export the model.
- Torch Export: uses torch.export to export the model.

To use Torch Export instead of the default JIT Trace exporter, set the exporter before compilation:
from qlip.compiler.apple.backend import AppleBackend
from qlip.compiler.apple.exporter import AppleExporterTrace, AppleExporterExport
# Use JIT Trace export (default)
AppleBackend.exporter = AppleExporterTrace
# Or use Torch Export
AppleBackend.exporter = AppleExporterExport
# Then proceed with compilation
compile_manager = AppleCompileManager(model, workspace="model_qlip")
# ... rest of compilation code
If an exporter does not work for your model, try switching to the other one.
Neural Engine (NPU) Usage¶
Qlip does not guarantee that compiled models will utilize the Apple Neural Engine (NPU). The decision of which compute unit (CPU, GPU, or Neural Engine) is used for execution is controlled by CoreML based on various factors including model architecture, operation types, and system resources.
While you can specify compute units in AppleBuilderConfig and AppleSessionConfig
(e.g., compute_units=ct.ComputeUnit.CPU_AND_NE), CoreML may still choose to execute parts of the model
on different compute units for optimal performance or compatibility reasons.
To verify which compute units are actually being used, you can use CoreML’s profiling tools or check the model’s metadata.
Compute Unit Options¶
Apple backend provides several compute unit options through CoreML:
- ct.ComputeUnit.ALL: use all available compute units (CPU, GPU, Neural Engine).
- ct.ComputeUnit.CPU_ONLY: use only the CPU.
- ct.ComputeUnit.CPU_AND_GPU: use CPU and GPU.
- ct.ComputeUnit.CPU_AND_NE: use CPU and Neural Engine (recommended for most models).
import coremltools as ct
# For builder config during compilation
config = AppleBuilderConfig(
    compute_units=ct.ComputeUnit.CPU_AND_NE,
    compute_precision=torch.float16,
)

# For session config during inference
session_config = AppleSessionConfig(
    compute_units=ct.ComputeUnit.CPU_AND_NE,
)
Optimization Hints¶
Apple session config supports optimization_hints for fine-tuning CoreML inference behavior.
For example, you can set the specialization strategy to optimize for fast prediction:
config = AppleSessionConfig(
    compute_units=ct.ComputeUnit.CPU_AND_NE,
    optimization_hints={"specializationStrategy": "fastPrediction"},
)
Deployment Targets¶
When compiling models for Apple devices, you should specify the minimum deployment target to ensure compatibility:
import coremltools as ct
config = AppleBuilderConfig(
    compute_units=ct.ComputeUnit.CPU_AND_NE,
    compute_precision=torch.float16,
    minimum_deployment_target=ct.target.macOS15,  # or ct.target.iOS18, etc.
)
Available targets include:
- ct.target.macOS13, ct.target.macOS14, ct.target.macOS15
- ct.target.iOS16, ct.target.iOS17, ct.target.iOS18
Wrong shapes¶
When encountering shape-related errors during inference, it usually means that the model was not compiled with the correct dynamic shapes or the input shapes do not match the expected shapes.
For Apple backend with static shape profiles, ensure that the input shapes at inference time exactly match one of the shapes used during shape profiling.
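The static-profile check is effectively a set membership test over the profiled shapes. An illustrative sketch of the validation you can run yourself before invoking the compiled model (not Qlip's actual error handling):

```python
def check_static_shape(input_shape, profiled_shapes):
    """Static profiles accept only shapes that were seen during profiling."""
    if tuple(input_shape) not in {tuple(s) for s in profiled_shapes}:
        raise ValueError(
            f"shape {tuple(input_shape)} not in profiled shapes {profiled_shapes}"
        )

profiled = [(1, 3, 224, 224), (4, 3, 224, 224), (8, 3, 224, 224)]
check_static_shape((4, 3, 224, 224), profiled)  # ok
# check_static_shape((2, 3, 224, 224), profiled)  # would raise ValueError
```

With a dynamic profile, the analogous check would compare each axis against its min/max bounds instead of requiring an exact match.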
Out of memory errors¶
OOM errors during compilation can have several causes:
Common causes of OOM errors
Large model size: The model is too large to fit into the available memory.
In this case, you can try to compile the model by blocks using setup_modules().
Fake mode: Use fake mode during shape profiling to avoid allocating real tensors and running computation. This can significantly reduce peak memory usage during profiling. See the Fake mode subsection for more details.
# First, collect real inputs for export
with cm.collect_inputs():
    model(torch.randn(1, 3, 224, 224).to(device).to(dtype))

# Then, do shape profiling with fake mode
with cm.shape_profile(type="static", fake_mode=True):
    model(torch.randn(1, 3, 224, 224).to(device).to(dtype))
    model(torch.randn(4, 3, 224, 224).to(device).to(dtype))
Unload models after compilation (keep_compiled=False): When compiling multiple modules serially,
accumulated compiled models can consume significant memory. Passing keep_compiled=False to
compile() unloads each compiled model from memory immediately
after compilation. Models are still saved to the workspace and can be loaded later via
AppleInferenceManager.
cm.compile(keep_compiled=False)
Model Compatibility¶
Not all PyTorch operations are supported by CoreML. If you encounter errors during compilation:
- Try a different exporter: switch between JIT Trace and Torch Export.
- Compile by blocks: use setup_modules() to compile only the supported parts of the model.
- Check operation support: refer to the CoreML documentation for supported operations.
# If full model compilation fails, try compiling specific modules
cm = AppleCompileManager(model, workspace="model_qlip", format="mlprogram")
cm.setup_modules(modules=["encoder"], builder_config=config)
with cm.shape_profile("static"):
    model(input)
cm.compile()