Qlip: Full-Stack AI Framework for Neural Network Optimization


Attention

Access to Qlip requires an API token from the TheStage AI Platform and additional access, which can be requested by contacting frameworks@thestage.ai.

Here is an overview of the library:

Qlip is a comprehensive framework for optimizing neural networks through fine-tuning, quantization, sparsification, compilation, and deployment. The framework consists of three primary packages:

Qlip.Core

  • Provides foundational APIs for building quantization, pruning, and sparsification algorithms

  • Includes compilation tools for optimized inference on NVIDIA GPUs and Apple Silicon devices

Qlip.Algorithms

  • Contains state-of-the-art quantization and pruning algorithms built on Qlip.Core

  • Enables single-line setup for acceleration procedures

  • ANNA: Automated Neural Networks Accelerator for optimal quality-performance trade-offs

Qlip.Serve

  • A meta-framework built on NVIDIA Triton Inference Server for deploying models

  • Provides intuitive interfaces for creating inference endpoints and asynchronous pipelines

Key Benefits

  • Performance Optimization: Accelerate PyTorch models and reduce inference costs

  • Simplicity: Easy-to-use quantization, pruning, and compilation workflows

  • ANNA: Automated Neural Networks Accelerator for optimal quality-performance trade-offs

  • Advanced Features: Improved flash attention for LLMs

  • Developer-Focused: Designed by AI engineers with customization in mind

  • Pre-Built Solutions: Ready-to-use compression and acceleration algorithms

  • Cross-Platform: Compile for NVIDIA GPUs and Apple Silicon

  • Deployment Ready: Streamlined inference endpoint creation

In this documentation, we will cover the following topics:

Installation


Install Qlip using the provided wheel files:

# For standard NVIDIA support
pip install qlip.core[nvidia]
pip install qlip.algorithms

# For NVIDIA Blackwell architecture support
pip install torch==2.7.0+cu128 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

Verify your installation:

import qlip
import torch

print("My master device:", qlip.node.device)
# Move tensor to master device
tensor = torch.randn(10, 10).to(qlip.node.device)

Example output (on a 4xH100 node):

2024-12-10 09:54:20: Qlip: INFO: Node on local rank 0 was initialized:

========================================================================
Compute unit: cuda
Available devices:
    Device type: cuda; Device index: 0; Name: NVIDIA H100 80GB HBM3
    Device type: cuda; Device index: 1; Name: NVIDIA H100 80GB HBM3
    Device type: cuda; Device index: 2; Name: NVIDIA H100 80GB HBM3
    Device type: cuda; Device index: 3; Name: NVIDIA H100 80GB HBM3
========================================================================

My master device: torch.device(type="cuda", index=0)

System Requirements


  • Operating System: Linux

  • Python: Version 3.10 - 3.12

  • Architecture: x86 64-bit CPU

  • GPU: NVIDIA with CUDA 11.8 or higher

  • Framework: PyTorch 2.4 or higher
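
Before installing, you can sanity-check an environment against these requirements with plain Python and PyTorch; the snippet below is not part of Qlip, just a convenience check:

import sys
import torch

# Python 3.10 - 3.12
assert (3, 10) <= sys.version_info[:2] <= (3, 12), f"Unsupported Python: {sys.version}"

# PyTorch 2.4 or higher
torch_major, torch_minor = (int(v) for v in torch.__version__.split(".")[:2])
assert (torch_major, torch_minor) >= (2, 4), f"PyTorch >= 2.4 required, found {torch.__version__}"

# NVIDIA GPU visible to PyTorch (build with CUDA 11.8 or higher)
assert torch.cuda.is_available(), "No CUDA-capable GPU detected"
print("CUDA version used by PyTorch:", torch.version.cuda)
print("GPU:", torch.cuda.get_device_name(0))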

Quantization, Sparsification & ANNA (Automated NN Accelerator)


Qlip offers a unified interface for neural network optimization through quantization, pruning, and sparsification. Apply various algorithms to your models with minimal code.

Features

  • ANNA: Automated Neural Networks Accelerator finds quality-optimal configurations while giving you full control over the quality/performance trade-off

  • Flexible Quantization: Support for integer (2-16 bits), float8, and various granularity options (per-tensor, per-channel)

  • Advanced Training: Quantization/Sparsification Aware Training with pre-defined algorithms like SmoothQuant and LSQ

  • Proven Performance: FLUX.1-Schnell achieves a 2.1x speedup and Llama-3.1-8B-Instruct a 4.2x speedup on NVIDIA H100 GPUs compared to the original bfloat16 models

Note

While this API provides tools to optimize compressed models in PyTorch, not all configurations can be compiled for inference on every target device. For NVIDIA GPU deployment, use the pre-defined configurations from the qlip.deploy.nvidia package.

Quick Start Example

The API is designed for simplicity and intuitiveness. Quantize your model for production deployment with just a few lines of code:

import torch
import qlip
from qlip.quantization import QuantScheme
from qlip_algorithms.quantization import PostTrainingQuantization

# `model` is an existing torch.nn.Module; `input` (used below) is a representative calibration batch
# Wrap your model for quantization
handle, model, _ = PostTrainingQuantization.setup_model(
    model,
    weights_scheme=QuantScheme('int', 8, symmetric=True),
    activations_scheme=QuantScheme('int', 8, symmetric=False),
    modules_types=(torch.nn.Linear,)
)

# Evaluate model to estimate activation ranges
model(input)

# Initialize quantization parameters
model.eval()
# ...
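
The same setup_model call accepts other schemes. As a sketch of the flexibility listed above (the values here are illustrative, not a recommended configuration), lower-bit integer weights from the 2-16 bit range can be requested simply by changing the QuantScheme arguments:

# Illustrative only: 4-bit symmetric weights with 8-bit asymmetric activations
handle, model, _ = PostTrainingQuantization.setup_model(
    model,
    weights_scheme=QuantScheme('int', 4, symmetric=True),
    activations_scheme=QuantScheme('int', 8, symmetric=False),
    modules_types=(torch.nn.Linear,)
)

Float8 schemes and per-tensor/per-channel granularity are also configured through QuantScheme; their exact arguments are not shown here, so refer to the API reference.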

Compilation and Inference API


Model compilation is essential for efficient inference on target hardware. This process converts models into optimized formats with improvements like layer fusion, operator replacement, memory layout transformations, and pre-allocation.

Qlip’s high-level API supports compilation and inference on NVIDIA GPUs and Apple Silicon devices (experimental). The NVIDIA compiler leverages TensorRT and CUDA libraries for high-performance inference.

Key Features

  • Precision Options: Compile models to FP32, FP16, and BF16 data types

  • Quantized Model Support: w8a8 fp8/int8, w4a16 with int4 weights

  • Dynamic Shapes: Inference on varying input sizes without recompilation

  • Block-Based Compilation: Optimize specific model parts with memory reuse (see the sketch after this list)

  • LLM Optimization: Enhanced flash attention with paged attention support

  • Seamless Integration: Combine PyTorch code with compiled models in a single pipeline

  • Serialization: Save compiled models to disk for quick reloading

  • Kernel Generation: Efficient operations fusion and optimization

  • Performance: Over 2x speedup compared to original PyTorch models on NVIDIA GPUs
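
The block-based compilation and seamless-integration points above boil down to compiling one submodule and keeping the rest of the forward pass in eager PyTorch. The following is a minimal sketch, not a definitive recipe: ToyModel, its backbone/head split, and the input size are made up for illustration, and only the qlip.compile call, NvidiaBuilderConfig, and the trace-then-compile() flow from the ResNet-18 example below are assumed.

import torch
import qlip
from qlip.deploy.nvidia import NvidiaBuilderConfig

class ToyModel(torch.nn.Module):
    """Illustrative model: a convolutional 'backbone' plus a small eager head."""
    def __init__(self):
        super().__init__()
        self.backbone = torch.nn.Sequential(
            torch.nn.Conv2d(3, 16, 3, padding=1),
            torch.nn.ReLU(),
        )
        self.head = torch.nn.Linear(16, 10)

device, dtype = qlip.node.device, torch.float16
model = ToyModel().to(device).to(dtype).eval()

# Compile only the backbone block
backbone_compiled = qlip.compile(
    model.backbone,
    dynamic_shapes="range",
    builder_config=NvidiaBuilderConfig(builder_flags={"FP16"}, io_dtype="base"),
)

x = torch.randn(1, 3, 224, 224).to(device).to(dtype)
backbone_compiled(x)          # trace input shapes
backbone_compiled.compile()   # finalize compilation of the block

# Mixed pipeline: compiled backbone feeding the eager PyTorch head
features = backbone_compiled(x)
logits = model.head(features.mean(dim=(2, 3)))  # global average pooling + linear head

The key point is that the compiled block behaves like a regular callable inside a PyTorch pipeline; the full end-to-end flow is shown next.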

ResNet-18 Example

Compile and run inference on NVIDIA GPUs:

import qlip
import torch
import torchvision.models as models
from qlip.deploy.nvidia import NvidiaBuilderConfig

# Select fastest device
device = qlip.node.device
dtype = torch.float16

# Create model for compilation (ImageNet weights; `pretrained=True` is deprecated in recent torchvision)
model_qlip = models.resnet18(weights="IMAGENET1K_V1").to(device).to(dtype)
model_qlip.eval()

# Prepare test input
input = torch.randn(1, 3, 224, 224).to(device).to(dtype)

# Configure compilation
nvidia_config = NvidiaBuilderConfig(
    builder_flags={"FP16"},
    io_dtype="base"
)
# Compile model
model_qlip = qlip.compile(
    model_qlip,
    # "range" will use minimum and maximum shapes during tracing
    dynamic_shapes="range",
    builder_config=nvidia_config
)

# Trace input shapes
model_qlip(input)
# Finalize compilation
model_qlip.compile()
# Run inference with compiled model
output = model_qlip(input)
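
Because the model was compiled with dynamic_shapes="range", the shapes observed during the tracing phase define the supported input range. The snippet below is a sketch of how the tracing step above could be extended, under the assumption (suggested by the "range" comment in the example) that every traced shape contributes to the minimum/maximum range; the specific resolutions are illustrative.

# Trace the smallest and largest resolutions expected in production
model_qlip(torch.randn(1, 3, 224, 224).to(device).to(dtype))
model_qlip(torch.randn(1, 3, 512, 512).to(device).to(dtype))

# Finalize compilation for the traced range
model_qlip.compile()

# Inputs inside the traced range should now run without recompilation
output = model_qlip(torch.randn(1, 3, 384, 384).to(device).to(dtype))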