Qlip: Full-Stack AI Framework for Neural Network Optimization¶
Attention
Access to Qlip requires an API token from the TheStage AI Platform and additional access, which can be requested by contacting frameworks@thestage.ai.
Here we give a brief overview of the library.
Qlip is a comprehensive framework for optimizing neural networks through fine-tuning, quantization, sparsification, compilation, and deployment. The framework consists of three primary packages:
Qlip.Core
Provides foundational APIs for building quantization, pruning, and sparsification algorithms
Includes compilation tools for optimized inference on NVIDIA GPUs and Apple Silicon devices
Qlip.Algorithms
Contains state-of-the-art quantization and pruning algorithms built on Qlip.Core
Enables single-line setup for acceleration procedures
ANNA: Automated Neural Networks Accelerator for optimal quality-performance trade-offs
Qlip.Serve
A meta-framework built on the NVIDIA Triton Inference Server for deploying models
Provides intuitive interfaces for creating inference endpoints and asynchronous pipelines
Key Benefits
Performance Optimization: Accelerate PyTorch models and reduce inference costs
Simplicity: Easy-to-use quantization, pruning, and compilation workflows
ANNA: Automated Neural Networks Accelerator for optimal quality-performance trade-offs
Advanced Features: Improved flash attention for LLMs
Developer-Focused: Designed by AI engineers with customization in mind
Pre-Built Solutions: Ready-to-use compression and acceleration algorithms
Cross-Platform: Compile for NVIDIA GPUs and Apple Silicon
Deployment Ready: Streamlined inference endpoint creation
In this documentation, we will cover the following topics:
Installation¶
Install Qlip using the provided wheel files:
# For standard NVIDIA support
pip install qlip.core[nvidia]
pip install qlip.algorithms
# For NVIDIA Blackwell architecture support
pip install torch==2.7.0+cu128 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
Verify your installation:
import qlip
import torch
print("My master device:", qlip.node.device)
# Move tensor to master device
tensor = torch.randn(10, 10).to(qlip.node.device)
Example output (on a 4xH100 node):
2024-12-10 09:54:20: Qlip: INFO: Node on local rank 0 was initialized:
========================================================================
Compute unit: cuda
Available devices:
Device type: cuda; Device index: 0; Name: NVIDIA H100 80GB HBM3
Device type: cuda; Device index: 1; Name: NVIDIA H100 80GB HBM3
Device type: cuda; Device index: 2; Name: NVIDIA H100 80GB HBM3
Device type: cuda; Device index: 3; Name: NVIDIA H100 80GB HBM3
========================================================================
My master device: torch.device(type="cuda", index=0)
System Requirements¶
Operating System: Linux
Python: Version 3.10 - 3.12
Architecture: x86 64-bit CPU
GPU: NVIDIA with CUDA 11.8 or higher
Framework: PyTorch 2.4 or higher
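A quick way to check whether your environment meets these requirements, using only standard Python and PyTorch calls (a sketch; the version bounds mirror the list above):
import sys
import torch
# Python 3.10 - 3.12
print("Python:", sys.version_info[:3])
# PyTorch 2.4 or higher, built with CUDA support
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)
# NVIDIA GPUs visible to PyTorch
for i in range(torch.cuda.device_count()):
    print("GPU", i, ":", torch.cuda.get_device_name(i))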
Quantization, Sparsification & ANNA (Automated NN Accelerator)¶
Qlip offers a unified interface for neural network optimization through quantization, pruning, and sparsification. Apply various algorithms to your models with minimal code.
Features
ANNA: Automated Neural Networks Accelerator finds optimal quality configurations with full control over quality/performance trade-offs
Flexible Quantization: Support for integer (2-16 bits), float8, and various granularity options (per-tensor, per-channel); a scheme-construction sketch follows this list
Advanced Training: Quantization/Sparsification Aware Training with pre-defined algorithms like SmoothQuant and LSQ
Proven Performance: FLUX.1-Schnell achieves a 2.1x speedup and Llama-3.1-8B-Instruct a 4.2x speedup on NVIDIA H100 GPUs compared to the original bfloat16 models
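As an illustration of this flexibility, the sketch below constructs a few QuantScheme variants using the constructor shown in the Quick Start example further down; the bit widths are examples, and any parameters beyond those shown there (such as granularity selection) are omitted here rather than guessed:
from qlip.quantization import QuantScheme
# Symmetric 8-bit integer scheme (used for weights in the Quick Start example)
w8_sym = QuantScheme('int', 8, symmetric=True)
# Asymmetric 8-bit integer scheme (used for activations in the Quick Start example)
a8_asym = QuantScheme('int', 8, symmetric=False)
# Lower-precision 4-bit integer weight scheme; integer widths from 2 to 16 bits are supported
w4_sym = QuantScheme('int', 4, symmetric=True)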
Note
While this API provides tools to optimize compressed models in PyTorch, not all configurations can be compiled for inference on every target device. For NVIDIA GPU deployment, use the pre-defined configurations from the qlip.deploy.nvidia package.
Quick Start Example
The API is designed for simplicity and intuitiveness. Quantize your model for production deployment with just a few lines of code:
import torch
import qlip
from qlip.quantization import QuantScheme
from qlip_algorithms.quantization import PostTrainingQuantization

# Placeholder model and calibration batch (not part of the original snippet);
# replace them with your own model and data
model = torch.nn.Sequential(torch.nn.Linear(128, 256), torch.nn.ReLU(), torch.nn.Linear(256, 10))
input = torch.randn(32, 128)

# Wrap your model for quantization
handle, model, _ = PostTrainingQuantization.setup_model(
    model,
    weights_scheme=QuantScheme('int', 8, symmetric=True),
    activations_scheme=QuantScheme('int', 8, symmetric=False),
    modules_types=(torch.nn.Linear,)
)

# Evaluate model to estimate activation ranges
model(input)

# Initialize quantization parameters
model.eval()
# ...
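After calibration, a common next step is to check the quantized model's accuracy on held-out data before compiling it for deployment. Below is a minimal sketch using only standard PyTorch; the data loader and top-1 metric are generic placeholders, not part of the Qlip API:
import torch

@torch.no_grad()
def top1_accuracy(model, loader, device):
    # Count correct top-1 predictions over a validation loader
    correct, total = 0, 0
    for inputs, labels in loader:
        logits = model(inputs.to(device))
        correct += (logits.argmax(dim=1) == labels.to(device)).sum().item()
        total += labels.numel()
    return correct / total

# accuracy = top1_accuracy(model, val_loader, qlip.node.device)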
Compilation and Inference API¶
Model compilation is essential for efficient inference on target hardware. This process converts models into optimized formats with improvements like layer fusion, operator replacement, memory layout transformations, and pre-allocation.
Qlip’s high-level API supports compilation and inference on NVIDIA GPUs and Apple Silicon devices (experimental). The NVIDIA compiler leverages TensorRT and CUDA libraries for high-performance inference.
Key Features
Precision Options: Compile models to FP32, FP16, and BF16 data types
Quantized Model Support: w8a8 fp8/int8, w4a16 with int4 weights
Dynamic Shapes: Inference on varying input sizes without recompilation
Block-Based Compilation: Optimize specific model parts with memory reuse
LLM Optimization: Enhanced flash attention with paged attention support
Seamless Integration: Combine PyTorch code with compiled models in a single pipeline
Serialization: Save compiled models to disk for quick reloading
Kernel Generation: Efficient operations fusion and optimization
Performance: Over 2x speedup compared to original PyTorch models on NVIDIA GPUs
ResNet-18 Example
Compile and run inference on NVIDIA GPUs:
import qlip
import torch
import torchvision.models as models
from qlip.deploy.nvidia import NvidiaBuilderConfig
# Select fastest device
device = qlip.node.device
dtype = torch.float16
# Create model for compilation
model_qlip = models.resnet18(pretrained=True).to(device).to(dtype)
model_qlip.eval()
# Prepare test input
input = torch.randn(1, 3, 224, 224).to(device).to(dtype)
# Configure compilation
nvidia_config = NvidiaBuilderConfig(
    builder_flags={"FP16"},
    io_dtype="base"
)
# Compile model
model_qlip = qlip.compile(
    model_qlip,
    # "range" will use minimum and maximum shapes during tracing
    dynamic_shapes="range",
    builder_config=nvidia_config
)
# Trace input shapes
model_qlip(input)
# Finalize compilation
model_qlip.compile()
# Run inference with compiled model
output = model_qlip(input)
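To confirm that compilation preserved the model's behavior, you can compare the compiled output against the original eager-mode model on the same input; this check uses only standard PyTorch and torchvision calls:
# Reference eager-mode model for comparison
model_ref = models.resnet18(pretrained=True).to(device).to(dtype)
model_ref.eval()

with torch.no_grad():
    output_ref = model_ref(input)

# FP16 compilation introduces small numerical differences, so compare with a tolerance
print(torch.allclose(output, output_ref, atol=1e-2, rtol=1e-2))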