Qlip: Full-Stack AI Framework for Neural Network Optimization¶
Attention
Access to Qlip requires an API token from the TheStage AI Platform and additional access, which can be requested by contacting frameworks@thestage.ai.
Here we give a brief overview of the library.
Qlip is a comprehensive framework for optimizing neural networks through fine-tuning, quantization, sparsification, compilation, and deployment. The framework consists of three primary packages:
Qlip.Core
Provides foundational APIs for building quantization, pruning, and sparsification algorithms
Includes compilation tools for optimized inference on NVIDIA GPUs and Apple Silicon devices
Qlip.Algorithms
Contains state-of-the-art quantization and pruning algorithms built on Qlip.Core
Enables single-line setup for acceleration procedures
ANNA: Automated Neural Networks Accelerator for optimal quality-performance trade-offs
Qlip.Serve
A meta-framework built on the NVIDIA Triton Inference Server for deploying models
Provides intuitive interfaces for creating inference endpoints and asynchronous pipelines
Key Benefits
Performance Optimization: Accelerate PyTorch models and reduce inference costs
Simplicity: Easy-to-use quantization, pruning, and compilation workflows
ANNA: Automated Neural Networks Accelerator for optimal quality-performance trade-offs
Advanced Features: Improved flash attention for LLMs
Developer-Focused: Designed by AI engineers with customization in mind
Pre-Built Solutions: Ready-to-use compression and acceleration algorithms
Cross-Platform: Compile for NVIDIA GPUs and Apple Silicon
Deployment Ready: Streamlined inference endpoint creation
In this documentation, we will cover the following topics:
Installation¶
Install Qlip using the provided wheel files:
# For standard NVIDIA support
pip install qlip.core[nvidia]
pip install qlip.algorithms
# For NVIDIA Blackwell architecture support
pip install torch==2.7.0+cu128 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
Verify your installation:
import qlip
import torch
print("My master device:", qlip.node.device)
# Move tensor to master device
tensor = torch.randn(10, 10).to(qlip.node.device)
Example output (on a 4xH100 node):
2024-12-10 09:54:20: Qlip: INFO: Node on local rank 0 was initialized:
========================================================================
Compute unit: cuda
Available devices:
Device type: cuda; Device index: 0; Name: NVIDIA H100 80GB HBM3
Device type: cuda; Device index: 1; Name: NVIDIA H100 80GB HBM3
Device type: cuda; Device index: 2; Name: NVIDIA H100 80GB HBM3
Device type: cuda; Device index: 3; Name: NVIDIA H100 80GB HBM3
========================================================================
My master device: torch.device(type="cuda", index=0)
System Requirements¶
Operating System: Linux
Python: Version 3.10 - 3.12
Architecture: x86 64-bit CPU
GPU: NVIDIA with CUDA 11.8 or higher
Framework: PyTorch 2.4 or higher
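A quick way to check whether your environment meets these requirements, using only standard Python and PyTorch calls (a sketch; the version bounds mirror the list above):
import sys
import torch
# Python 3.10 - 3.12
print("Python:", sys.version_info[:3])
# PyTorch 2.4 or higher, built with CUDA support
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)
# NVIDIA GPUs visible to PyTorch
for i in range(torch.cuda.device_count()):
    print("GPU", i, ":", torch.cuda.get_device_name(i))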
Quantization, Sparsification & ANNA (Automated NN Accelerator)¶
Qlip offers a unified interface for neural network optimization through quantization, pruning, and sparsification. Apply various algorithms to your models with minimal code.
Features
ANNA: Automated Neural Networks Accelerator finds optimal quality configurations with full control over quality/performance trade-offs
Flexible Quantization: Support for integer (2-16 bits), float8, and various granularity options (per-tensor, per-channel); a scheme-construction sketch follows this list
Advanced Training: Quantization/Sparsification Aware Training with pre-defined algorithms like SmoothQuant and LSQ
Proven Performance: FLUX.1-Schnell achieves a 2.1x speedup and Llama-3.1-8B-Instruct a 4.2x speedup on NVIDIA H100 GPUs compared to the original bfloat16 models
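As an illustration of this flexibility, the sketch below constructs a few QuantScheme variants using the constructor shown in the Quick Start example further down; the bit widths are examples, and any parameters beyond those shown there (such as granularity selection) are omitted here rather than guessed:
from qlip.quantization import QuantScheme
# Symmetric 8-bit integer scheme (used for weights in the Quick Start example)
w8_sym = QuantScheme('int', 8, symmetric=True)
# Asymmetric 8-bit integer scheme (used for activations in the Quick Start example)
a8_asym = QuantScheme('int', 8, symmetric=False)
# Lower-precision 4-bit integer weight scheme; integer widths from 2 to 16 bits are supported
w4_sym = QuantScheme('int', 4, symmetric=True)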
Note
While this API provides tools to optimize compressed models in PyTorch, not all configurations can be compiled for inference on every target device. For NVIDIA GPU deployment, use the pre-defined configurations from the qlip.deploy.nvidia package.
Quick Start Example
The API is designed for simplicity and intuitiveness. Quantize your model for production deployment with just a few lines of code:
import torch
import qlip
from qlip.quantization import QuantScheme
from qlip_algorithms.quantization import PostTrainingQuantization

# Placeholder model and calibration batch (not part of the original snippet);
# replace them with your own model and data
model = torch.nn.Sequential(torch.nn.Linear(128, 256), torch.nn.ReLU(), torch.nn.Linear(256, 10))
input = torch.randn(32, 128)

# Wrap your model for quantization
handle, model, _ = PostTrainingQuantization.setup_model(
    model,
    weights_scheme=QuantScheme('int', 8, symmetric=True),
    activations_scheme=QuantScheme('int', 8, symmetric=False),
    modules_types=(torch.nn.Linear,)
)

# Evaluate model to estimate activation ranges
model(input)

# Initialize quantization parameters
model.eval()
# ...
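After calibration, a common next step is to check the quantized model's accuracy on held-out data before compiling it for deployment. Below is a minimal sketch using only standard PyTorch; the data loader and top-1 metric are generic placeholders, not part of the Qlip API:
import torch

@torch.no_grad()
def top1_accuracy(model, loader, device):
    # Count correct top-1 predictions over a validation loader
    correct, total = 0, 0
    for inputs, labels in loader:
        logits = model(inputs.to(device))
        correct += (logits.argmax(dim=1) == labels.to(device)).sum().item()
        total += labels.numel()
    return correct / total

# accuracy = top1_accuracy(model, val_loader, qlip.node.device)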
Compilation and Inference API¶
Model compilation is essential for efficient inference on target hardware. This process converts models into optimized formats with improvements like layer fusion, operator replacement, memory layout transformations, and pre-allocation.
Qlip’s high-level API supports compilation and inference on NVIDIA GPUs and Apple Silicon devices (experimental). The NVIDIA compiler leverages TensorRT and CUDA libraries for high-performance inference.
Key Features
Precision Options: Compile models to FP32, FP16, and BF16 data types
Quantized Model Support: w8a8 fp8/int8, w4a16 with int4 weights
Dynamic Shapes: Inference on varying input sizes without recompilation
Block-Based Compilation: Optimize specific model parts with memory reuse
LLM Optimization: Enhanced flash attention with paged attention support
Seamless Integration: Combine PyTorch code with compiled models in a single pipeline
Serialization: Save compiled models to disk for quick reloading
Kernel Generation: Efficient operations fusion and optimization
Performance: Over 2x speedup compared to original PyTorch models on NVIDIA GPUs
ResNet-18 Example
Compile and run inference on NVIDIA GPUs:
import qlip
import torch
import torchvision.models as models
from qlip.deploy.nvidia import NvidiaBuilderConfig
# Select fastest device
device = qlip.node.device
dtype = torch.float16
# Create model for compilation
model_qlip = models.resnet18(pretrained=True).to(device).to(dtype)
model_qlip.eval()
# Prepare test input
input = torch.randn(1, 3, 224, 224).to(device).to(dtype)
# Configure compilation
nvidia_config = NvidiaBuilderConfig(
    builder_flags={"FP16"},
    io_dtype="base"
)
# Compile model
model_qlip = qlip.compile(
    model_qlip,
    # "range" will use minimum and maximum shapes during tracing
    dynamic_shapes="range",
    builder_config=nvidia_config
)
# Trace input shapes
model_qlip(input)
# Finalize compilation
model_qlip.compile()
# Run inference with compiled model
output = model_qlip(input)
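To confirm that compilation preserved the model's behavior, you can compare the compiled output against the original eager-mode model on the same input; this check uses only standard PyTorch and torchvision calls:
# Reference eager-mode model for comparison
model_ref = models.resnet18(pretrained=True).to(device).to(dtype)
model_ref.eval()

with torch.no_grad():
    output_ref = model_ref(input)

# FP16 compilation introduces small numerical differences, so compare with a tolerance
print(torch.allclose(output, output_ref, atol=1e-2, rtol=1e-2))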