Quantization Algorithms

Attention

Access to QLIP requires an API token from the TheStage AI Platform and additional access, which can be requested by contacting frameworks@thestage.ai.

QLIP provides advanced quantization algorithms to efficiently compress neural networks while maintaining performance. This section covers three main algorithms: Post-Training Quantization (PTQ), SmoothQuant, and Learned Step Size Quantization (LSQ).

Overview

The quantization algorithms in QLIP are designed with the following features:

State-of-the-Art Algorithms

QLIP implements cutting-edge quantization methods for post-training quantization and quantization-aware training.

Unified Interface Design

All QLIP algorithms follow a consistent API pattern that integrates seamlessly into existing workflows. Whether you’re doing post-training quantization or quantization-aware training, the interface remains intuitive and requires minimal code changes.

Customizable Quantization Schemes
  • Multiple data types: INT4, INT8, FP8, FP16 with both symmetric and asymmetric quantization

  • Flexible granularities: Per-tensor, per-channel, per-token quantization for optimal quality-performance trade-offs

  • Advanced observers: Percentile-based, min-max, and custom observers for accurate scale estimation

  • Dynamic quantization: Runtime quantization for deployment scenarios requiring maximum flexibility

Model Architecture Agnostic

QLIP works with any PyTorch model architecture including Transformers (LLaMA, BERT, GPT), Convolutional Neural Networks (ResNet, EfficientNet), and Diffusion Models (Stable Diffusion, DALL-E). The algorithms automatically adapt to different layer types and activation patterns.

API Reference


PostTrainingQuantization API


class qlip_algorithms.quantization.PostTrainingQuantization

Implementation of post-training quantization that converts pre-trained models to use lower precision without retraining.

PostTrainingQuantization provides static and dynamic quantization with advanced observers for accurate scale/zero-point estimation. It supports flexible granularities (per-tensor, per-channel, per-token) and custom quantization schemes including INT4/8 and FP8.

PostTrainingQuantization.setup_model(model, weights_scheme=QuantScheme('int', 8, True), weights_granularity='per-channel', activations_scheme=QuantScheme('int', 8, True), activations_granularity='per-batch', modules_types=(torch.nn.Conv2d, torch.nn.Linear), except_layers=[], placement='input', attention_cls=None, quantization_mode='static', observer=None, observer_kwargs={}, calibration_iterations=np.inf, train_weights_offset=False, train_activations_offset=False, inplace=True, weights_scale_offset_dtype=torch.float32, activations_scale_offset_dtype=torch.float32)

Sets up post-training quantization for an entire model by automatically identifying and quantizing specified module types.

Parameters

  • model: torch.nn.Module - PyTorch model to quantize

  • weights_scheme: Union[QuantScheme, Dict[str, Any]] = QuantScheme('int', 8, True) - Quantization scheme for weights (data type, bit-width, symmetric/asymmetric)

  • weights_granularity: Union[str, QuantGranularity, Dict[str, Any]] = 'per-channel' - Quantization granularity for weights ('per-tensor', 'per-channel', 'per-token')

  • activations_scheme: Union[QuantScheme, Dict[str, Any]] = QuantScheme('int', 8, True) - Quantization scheme for activations

  • activations_granularity: Union[str, QuantGranularity, Dict[str, Any]] = 'per-batch' - Quantization granularity for activations

  • modules_types: tuple = (torch.nn.Conv2d, torch.nn.Linear) - Types of modules to quantize automatically

  • except_layers: list = [] - List of layer names to exclude from quantization

  • placement: Union[str, QuantPlacement] = 'input' - Determines where quantizers are placed: 'input' quantizes input activations of modules, 'output' quantizes output activations, and 'attention' sets quantizers for quantized attention kernels

  • attention_cls: str = None - Name of attention module class for specialized quantization (e.g., 'MultiheadAttention')

  • quantization_mode: str = 'static' - Quantization mode ('static' for calibration-based, 'dynamic' for runtime)

  • observer: Observer = None - Observer class for collecting activation statistics (defaults to BoundsObserver)

  • observer_kwargs: dict = {} - Additional parameters for the observer (e.g., percentile settings)

  • calibration_iterations: int = np.inf - Maximum number of calibration batches for collecting statistics

  • train_weights_offset: bool = False - Whether weight zero-points are trainable

  • train_activations_offset: bool = False - Whether activation zero-points are trainable

  • inplace: bool = True - Whether to modify the original model or create a copy

  • weights_scale_offset_dtype: torch.dtype = torch.float32 - Data type for weight quantization parameters

  • activations_scale_offset_dtype: torch.dtype = torch.float32 - Data type for activation quantization parameters

Returns

  • Tuple[torch.nn.Module, QuantizationManager, Dict[str, List]]: Tuple containing the quantized model, quantization manager, and parameter groups for optimizers

PostTrainingQuantization.setup_modules(modules, weights_scheme=QuantScheme('int', 8, True), weights_granularity='per-channel', activations_scheme=QuantScheme('int', 8, False), activations_granularity='per-batch', placement='input', quantization_mode='static', observer=None, observer_kwargs={}, calibration_iterations=np.inf, train_weights_offset=False, train_activations_offset=False, weights_scale_offset_dtype=torch.float32, activations_scale_offset_dtype=torch.float32)

Sets up post-training quantization for a specific list of modules with fine-grained control.

Parameters

  • modules: list - List of torch.nn.Module instances to quantize

  • Other parameters: Same as setup_model method

Returns

  • QuantizationManager: Quantization manager used to quantize the modules

PostTrainingQuantization.configure_model(model)

Configures model for calibration by setting quantizers to training mode and other modules to evaluation mode.

Parameters

  • model: torch.nn.Module - Model with quantization already set up

Returns

  • torch.nn.Module: Configured model ready for calibration


SmoothQuant API


class qlip_algorithms.quantization.SmoothQuant

Implementation of SmoothQuant algorithm that addresses activation outliers in transformer models through mathematically equivalent transformations.

SmoothQuant migrates quantization difficulty from activations to weights by applying channel-wise scaling transformations. It identifies outlier channels, applies scaling to reduce activation magnitudes, and compensates in weights to maintain mathematical equivalence. The algorithm enables aggressive quantization of both weights and activations, particularly effective for large language models with severe activation outliers.

SmoothQuant.setup_model(model, transformer_block_type=None, fusion_groups=None, ignore_modules=None, modules_types=(nn.Linear, nn.Conv2d), alpha=0.8, weights_scheme=QuantScheme('int', 8, True), weights_granularity='per-tensor', activations_scheme=QuantScheme('int', 8, True), activations_granularity='per-batch', quantization_mode='static', calibration_iterations=128, weights_scale_offset_dtype=torch.float32, activations_scale_offset_dtype=torch.float32, need_sync=False, inplace=True)

Sets up SmoothQuant for a transformer model by identifying transformer blocks and applying channel-wise smoothing.

Parameters

  • model: nn.Module - PyTorch transformer model to apply SmoothQuant

  • transformer_block_type: Type or str = None - Class type or class name of transformer blocks (e.g., 'LlamaDecoderLayer', 'TransformerBlock')

  • fusion_groups: List[Dict[str, Any]] = None - Defines which linear layers should share smoothing scales (defaults to common transformer patterns)

  • ignore_modules: List[Type] = None - List of module types to exclude from quantization

  • modules_types: Tuple[Type, ...] = (nn.Linear, nn.Conv2d) - Types of modules to quantize

  • alpha: float = 0.8 - Smoothness migration factor (0.0=no smoothing, 1.0=maximum smoothing). Higher values transfer more difficulty to weights

  • weights_scheme: Union[QuantScheme, Dict[str, Any]] = QuantScheme('int', 8, True) - Quantization scheme for weights

  • weights_granularity: Union[str, QuantGranularity, Dict[str, Any]] = 'per-tensor' - Granularity for weight quantization

  • activations_scheme: Union[QuantScheme, Dict[str, Any]] = QuantScheme('int', 8, True) - Quantization scheme for activations

  • activations_granularity: Union[str, QuantGranularity, Dict[str, Any]] = 'per-batch' - Granularity for activation quantization

  • quantization_mode: str = 'static' - Quantization mode ('static' or 'dynamic')

  • calibration_iterations: int = 128 - Number of batches for both equalization and quantization calibration phases

  • weights_scale_offset_dtype: torch.dtype = torch.float32 - Data type for weight quantization parameters

  • activations_scale_offset_dtype: torch.dtype = torch.float32 - Data type for activation quantization parameters

  • need_sync: bool = False - Whether to synchronize statistics across distributed processes

  • inplace: bool = True - Whether to modify the original model or create a copy

Returns

  • Tuple[nn.Module, Dict[str, Manager]]: Tuple containing the model with SmoothQuant applied and dictionary of managers

SmoothQuant.setup_modules(modules, transformer_blocks=None, fusion_groups=None, alpha=0.8, weights_scheme=QuantScheme('int', 8, True), weights_granularity='per-channel', activations_scheme=QuantScheme('int', 8, True), activations_granularity='per-batch', quantization_mode='static', calibration_iterations=np.inf, weights_scale_offset_dtype=torch.float32, activations_scale_offset_dtype=torch.float32, need_sync=False)

Sets up SmoothQuant for specific transformer blocks and modules with detailed control over fusion groups.

Parameters

  • modules: List[nn.Module] - List of specific modules to quantize

  • transformer_blocks: List[nn.Module] = None - List of transformer block instances

  • fusion_groups: List[Dict[str, Any]] = None - Fusion group configurations defining layer relationships

  • Other parameters: Same as setup_model method

Returns

  • Dict[str, Manager]: Dictionary containing quantization and adapter managers
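
A minimal sketch of how setup_modules might be called. The decoder-block layout (model.model.layers) and the block selection below are illustrative assumptions for a LLaMA-style model; fusion_groups is omitted so the default fusion patterns apply.

from qlip_algorithms.quantization import SmoothQuant
from qlip.deploy.nvidia import NVIDIA_INT_W8A8
import torch.nn as nn

# Assumed LLaMA-style layout: smooth and quantize only the first four decoder blocks
blocks = list(model.model.layers[:4])
modules = [m for block in blocks for m in block.modules() if isinstance(m, nn.Linear)]

managers = SmoothQuant.setup_modules(
    modules=modules,
    transformer_blocks=blocks,
    alpha=0.8,
    **NVIDIA_INT_W8A8,
)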

SmoothQuant.configure_equalization(model)

Configures model for the first calibration phase to collect activation statistics for equalization.

Parameters

  • model: nn.Module - Model with SmoothQuant setup

Returns

  • nn.Module: Model configured for equalization calibration

SmoothQuant.configure_quantization(model)

Configures model for the second calibration phase to calibrate quantizers with smoothed distributions.

Parameters

  • model: nn.Module - Model with equalization completed

Returns

  • nn.Module: Model configured for quantization calibration

SmoothQuant.fuse_scales(model, transformer_block_type, fusion_groups=None)

Fuses smoothing scales into previous normalization layers to reduce computational overhead.

Parameters

  • model: nn.Module - Model with SmoothQuant applied

  • transformer_block_type: Type - Transformer block class type

  • fusion_groups: List[Dict[str, Any]] = None - Fusion group configurations


Learned Step Size Quantization (LSQ) API


class qlip_algorithms.quantization.LSQ

Implementation of Learned Step Size Quantization (LSQ) for quantization-aware training.

LSQ learns optimal quantizer step sizes during training by using mathematically principled gradient scaling to balance step size updates with weight updates. The algorithm achieves state-of-the-art accuracy for low-precision neural networks by optimizing network weights and quantization parameters through backpropagation. See the LSQ paper for more details.

LSQ.setup_model(model, weights_scheme=QuantScheme('int', 8, True), weights_granularity='per-tensor', activations_scheme=QuantScheme('uint', 8, False), modules_types=(torch.nn.Conv2d,), except_layers=[], placement='input', observer=None, observer_kwargs={}, trainable_offsets=False, calibration_iterations=1, inplace=False)

Sets up LSQ for an entire model by configuring learnable quantization parameters for specified module types.

Parameters

  • model: torch.nn.Module - PyTorch model to set up with LSQ

  • weights_scheme: Union[QuantScheme, Dict[str, Any]] = QuantScheme('int', 8, True) - Quantization scheme for weights with learnable step sizes

  • weights_granularity: Union[str, QuantGranularity, Dict[str, Any]] = 'per-tensor' - Granularity for weight quantization ('per-tensor' or 'per-channel')

  • activations_scheme: Union[QuantScheme, Dict[str, Any]] = QuantScheme('uint', 8, False) - Quantization scheme for activations with learnable step sizes

  • modules_types: tuple = (torch.nn.Conv2d,) - Types of modules to apply LSQ quantization

  • except_layers: list = [] - List of layer names to exclude from LSQ quantization

  • placement: Union[str, QuantPlacement] = 'input' - Determines where quantizers are placed: 'input' quantizes input activations of modules, 'output' quantizes output activations, and 'attention' sets quantizers for quantized attention kernels

  • observer: Observer = None - Observer for initial scale/offset estimation (defaults to MinMaxObserver)

  • observer_kwargs: dict = {} - Additional parameters for the observer

  • trainable_offsets: bool = False - Whether zero-point offsets are trainable (enables LSQ+)

  • calibration_iterations: int = 1 - Number of batches for initial quantization parameter estimation

  • inplace: bool = False - Whether to modify the original model or create a copy

Returns

  • Tuple[QuantizationManager, torch.nn.Module, Dict[str, List]]: Tuple containing quantization manager, model with LSQ applied, and parameter groups for optimizer configuration

LSQ.setup_modules(modules, weights_scheme=QuantScheme('int', 8, True), weights_granularity='per-tensor', activations_scheme=QuantScheme('uint', 8, False), placement='input', observer=None, observer_kwargs={}, trainable_offsets=False, calibration_iterations=1)

Sets up LSQ for a specific list of modules with fine-grained control over quantization parameters.

Parameters

  • modules: list - List of torch.nn.Module instances to apply LSQ quantization

  • Other parameters: Same as setup_model method (excluding model, modules_types, except_layers, inplace)

Returns

  • QuantizationManager: Manager with LSQ quantization configured for the specified modules

LSQ.wrap_modules_groups(modules_groups)

Applies different LSQ configurations to groups of modules, enabling heterogeneous quantization strategies.

Parameters

  • modules_groups: list - List of dictionaries, each containing 'modules' (list of modules) and 'config' (LSQ configuration parameters)

Returns

  • QuantizationManager: Extended manager with LSQ applied to all module groups with their respective configurations
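
A minimal sketch of a heterogeneous setup with wrap_modules_groups, assuming each 'config' dictionary accepts the same keyword arguments as LSQ.setup_modules; the grouping and bit-widths below are purely illustrative.

from qlip_algorithms.quantization import LSQ
import torch

# Illustrative grouping: 4-bit weights for convolutions, 8-bit weights for linear layers
conv_modules = [m for m in model.modules() if isinstance(m, torch.nn.Conv2d)]
linear_modules = [m for m in model.modules() if isinstance(m, torch.nn.Linear)]

modules_groups = [
    {
        'modules': conv_modules,
        'config': {'weights_scheme': {'scheme': 'int', 'bits': 4, 'symmetric': True}},
    },
    {
        'modules': linear_modules,
        'config': {'weights_scheme': {'scheme': 'int', 'bits': 8, 'symmetric': True}},
    },
]
handle = LSQ.wrap_modules_groups(modules_groups)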

Note

LSQ Training Requirements: LSQ requires quantization-aware training and cannot be applied post-training. Use higher learning rates (10-100x) for quantization parameters compared to model weights. The algorithm includes built-in gradient scaling to ensure stable training by balancing quantization parameter updates with weight updates.

Post-Training Quantization (PTQ)


Post-Training Quantization is the most straightforward quantization method that converts a pre-trained model to use lower precision without requiring retraining.

Basic usage example


This simple example shows how to quantize a model to INT8 weights and activations for NVIDIA hardware. The steps are:

  1. Import the PostTrainingQuantization algorithm and NVIDIA static quantization configuration

  2. Load the model

  3. Wrap model with PTQ algorithm

  4. Call PostTrainingQuantization.configure_model to set quantizers to training mode, which will use observers to collect activation statistics for scale and offset estimation

  5. Run calibration data through the model to collect statistics

  6. Run inference on test data to check the quality of the quantized model

from qlip_algorithms.quantization import PostTrainingQuantization
from qlip.deploy.nvidia import NVIDIA_INT_W8A8
import torch

model = torch.load('model.pth')
# Setup PTQ with INT8 weights and activations
model, handle = PostTrainingQuantization.setup_model(
    model=model,
    **NVIDIA_INT_W8A8,
    calibration_iterations=256
)
# Configure model for calibration
PostTrainingQuantization.configure_model(model)
# Run calibration data through the model
with torch.no_grad():
    for i, batch in enumerate(calibration_dataloader):
        model(batch)
        if i + 1 >= 256: # run only 256 batches for calibration
            break

Setting custom modules for quantization

By default, PostTrainingQuantization.setup_model quantizes all linear and convolutional layers in the model. Below we use the PostTrainingQuantization.setup_modules method to quantize only the linear layers, excluding layers with 'fc' in their name.

modules = [
    module for name, module in model.named_modules()
    if isinstance(module, torch.nn.Linear) and 'fc' not in name
]
model, handle = PostTrainingQuantization.setup_modules(
    modules=modules, **NVIDIA_INT_W8A8, calibration_iterations=256
)

Quantization schemes


To align with NVIDIA hardware, quantize the model with the static INT8 weights-and-activations quantization scheme:

from qlip.deploy.nvidia import NVIDIA_INT_W8A8

model, handle = PostTrainingQuantization.setup_model(
    model=model, **NVIDIA_INT_W8A8, calibration_iterations=256
)

For higher accuracy of the quantized model, use the dynamic quantization scheme:

from qlip.deploy.nvidia import NVIDIA_INT_W8A8_PER_TOKEN_DYNAMIC

model, handle = PostTrainingQuantization.setup_model(
    model=model, **NVIDIA_INT_W8A8_PER_TOKEN_DYNAMIC
)

Note

When using dynamic quantization there is no need to calibrate the model on calibration data; scales are calculated on the fly during inference.

To quantize the model's attention layers to use NVIDIA FP8 attention kernels:

from qlip.deploy.nvidia import NVIDIA_FLOAT_W8A8

# extract attention modules from the model (Attention stands for the model's attention class)
modules = [module for module in model.modules() if isinstance(module, Attention)]
# setup PTQ with the FP8 quantization scheme and placement='attention'
model, handle = PostTrainingQuantization.setup_modules(
    modules=modules, **NVIDIA_FLOAT_W8A8,
    calibration_iterations=256, placement='attention'
)

Custom weights and activations quantization schemes

We have already seen how to use quantization schemes configured for compatibility with NVIDIA hardware. Below we show how to set up custom quantization schemes to quantize the model to 4-bit weights and 8-bit activations.

model, handle = PostTrainingQuantization.setup_model(
    model=model,
    weights_scheme={'scheme': 'int', 'bits': 4, 'symmetric': True},
    activations_scheme={'scheme': 'uint', 'bits': 8, 'symmetric': False},
    calibration_iterations=256
)

Quantization scale and zero-point estimation


Activation quantizers use an observer mechanism to collect statistics for quantization scale and offset estimation. By default, PercentileObserver is used: it calculates percentile_min and percentile_max of the observed distribution to obtain the left and right boundaries of the quantized interval. Here is an example of how to specify the percentile values for the left and right boundaries:

from qlip.deploy.nvidia import NVIDIA_INT_W8A8

model, handle = PostTrainingQuantization.setup_model(
    model=model, **NVIDIA_INT_W8A8,
    observer_kwargs={'percentile_min': 0.05, 'percentile_max': 0.95}
)

You can also implement a custom observer that uses the absolute minimum and maximum values of the activations as the boundaries of the distribution.
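
A rough sketch of what such an observer could look like. The class below, its method names, and the way it is passed through the observer argument are assumptions about the QLIP observer contract rather than the documented interface; consult the Observer base class before using it.

from qlip_algorithms.quantization import PostTrainingQuantization
from qlip.deploy.nvidia import NVIDIA_INT_W8A8
import torch

class AbsMinMaxObserver:
    # Hypothetical observer tracking the absolute min/max of observed activations.
    # The method names below are assumptions; check the QLIP Observer base class
    # for the actual contract.
    def __init__(self):
        self.min_val = None
        self.max_val = None

    def update(self, x: torch.Tensor) -> None:
        x = x.detach()
        self.min_val = x.min() if self.min_val is None else torch.minimum(self.min_val, x.min())
        self.max_val = x.max() if self.max_val is None else torch.maximum(self.max_val, x.max())

    def get_bounds(self):
        return self.min_val, self.max_val

# Assumed usage: pass the observer class via the observer argument
model, handle = PostTrainingQuantization.setup_model(
    model=model, **NVIDIA_INT_W8A8, observer=AbsMinMaxObserver
)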

Running with distributed data parallel


To run with distributed data parallel, pass need_sync=True in the observer kwargs. This synchronizes the statistics of quantized modules across all processes, so after calibration each process has the same quantization parameters.

model, handle = PostTrainingQuantization.setup_model(
    model=model, **NVIDIA_INT_W8A8,
    observer_kwargs={'need_sync': True}
)

SmoothQuant


SmoothQuant addresses the challenge of quantizing models with activation outliers by migrating quantization difficulty from activations to weights through mathematically equivalent transformations. It applies channel-wise equalization to balance this difficulty. See the SmoothQuant paper for more details.

[Figure: SmoothQuant channel-wise smoothing of activation outliers]

Basic usage example


Wrap the model with the SmoothQuant algorithm

Below we wrap the model with the INT8 weights-and-activations quantization scheme for NVIDIA hardware using the SmoothQuant algorithm.

from qlip_algorithms.quantization import SmoothQuant
from qlip.deploy.nvidia import NVIDIA_INT_W8A8
from transformers import AutoModelForCausalLM

# Load transformer model
model = AutoModelForCausalLM.from_pretrained('meta-llama/Meta-Llama-3.1-8B')

# Setup SmoothQuant
model, handle = SmoothQuant.setup_model(
    model=model,
    calibration_iterations=512,
    **NVIDIA_INT_W8A8
)

Configure model for calibration

Calibration of the SmoothQuant algorithm requires two phases:

Phase 1: Collect activation statistics to identify outlier channels that are difficult to quantize.

Phase 2: Apply channel-wise smoothing and calibrate quantization parameters (scales and zero-points).

# Calibration process (two phases)
# Phase 1: Equalization calibration
SmoothQuant.configure_equalization(model)
for batch in calibration_dataloader:
    model(**batch)

# Phase 2: Quantization calibration
SmoothQuant.configure_quantization(model)
for batch in calibration_dataloader:
    model(**batch)

Quantization schemes


To obtain higher accuracy of the quantized model, use the per-token dynamic quantization scheme:

from qlip.deploy.nvidia import NVIDIA_INT_W8A8_PER_TOKEN_DYNAMIC

model, handle = SmoothQuant.setup_model(
    model=model, **NVIDIA_INT_W8A8_PER_TOKEN_DYNAMIC
)

Note

SmoothQuant supports all the same quantization schemes as PTQ.

Smoothing factor (α)


To balance the quantization difficulty, SmoothQuant applies the following transformation: Y = (X * diag(s)^(-1)) * (diag(s) * W), where s is a per-channel scaling factor controlled by the parameter α, which determines the migration strength between activations and weights.

  • α = 0: Full migration from weights to activations

  • α = 0.5: Balanced migration

  • α = 0.8: Optimal in most cases

  • α = 1.0: Full migration from activations to weights

Typically, for larger models, larger α values tend to yield better results. For Llama-3.1-8B, α = 0.8 is a good choice.
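
For reference, the per-channel smoothing scale defined in the SmoothQuant paper is s_j = max(|X_j|)^α / max(|W_j|)^(1-α), where X_j and W_j are the activation and weight values of channel j. The factor is passed directly to setup_model:

model, handle = SmoothQuant.setup_model(
    model=model, alpha=0.8, **NVIDIA_INT_W8A8
)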

Fusion of equalization operations


SmoothQuant equalizes activations before they are quantized. These equalization operations may be fused into preceding layers such as LayerNorm or RMSNorm to avoid redundant computations while preserving mathematical equivalence.

Suppose we are working with the Llama-3.1-8B model. In LlamaDecoderLayer there is a normalization module (input_layernorm) before the query, key, and value projections, so we can fuse the equalization operations of these projections into it; to do that, all three projections must share the same equalization factor.

[Figure: Fusing SmoothQuant equalization scales into the preceding normalization layer]

Below is an example of how to configure the algorithm to share the equalization factor across specific modules. To do this, we specify the transformer block type and pass a fusion_groups dict whose keys are the names of the modules to fuse into and whose values are lists of the linear modules whose equalization factors will be fused into the key module.

from qlip_algorithms.quantization import SmoothQuant

fusion_groups = {
    'input_layernorm': ['self_attn.q_proj', 'self_attn.k_proj', 'self_attn.v_proj']
}
model, handle = SmoothQuant.setup_model(
    model=model,
    fusion_groups=fusion_groups,
    transformer_block_type='LlamaDecoderLayer',
    **NVIDIA_INT_W8A8,
)
# ... Calibrate

Once the model is calibrated, use SmoothQuant.fuse_scales to fuse the equalization operations into the specified modules.

SmoothQuant.fuse_scales(model, transformer_block_type='LlamaDecoderLayer', fusion_groups=fusion_groups)

Save checkpoint with QlipConfig:

from qlip import QlipConfig

config = QlipConfig.from_model(model)
config.save('config.pth')

Note

In theory, it is better to perform such offline fusion of the equalization operations when possible, to avoid the overhead of on-the-fly equalization; in practice, however, the compiler usually fuses these operations itself, so the overhead is negligible.

Learned Step Size Quantization (LSQ)


Learned Step Size Quantization (LSQ) is a quantization-aware training method that learns optimal quantizer step sizes during training, achieving state-of-the-art accuracy for low-precision neural networks. Unlike post-training quantization methods, LSQ optimizes both network weights and quantization parameters. See the LSQ paper for more details.

Note

LSQ requires quantization-aware training and cannot be applied post-training like PTQ. The method learns quantization parameters jointly with the model weights and therefore requires access to training data.

When to Use LSQ:

  • When highest accuracy is required for low-bit quantization

  • When training data and computational resources are available

  • For models where post-training quantization shows significant accuracy degradation

Basic usage example


Here is an example of how to use LSQ to quantize a model to 4-bit weights and 8-bit activations. The algorithm workflow is:

  1. Setup LSQ with 4-bit weights and 8-bit activations

  2. Run calibration on representative data to estimate initial quantization parameters

  3. Setup optimizer

  4. Train the model as usual

from qlip_algorithms.quantization import LSQ
import torch
import torch.optim as optim

# Setup LSQ with 4-bit weights and 8-bit activations
handle, quantized_model, param_groups = LSQ.setup_model(
    model=model,
    weights_scheme={'scheme': 'int', 'bits': 4, 'symmetric': True},
    activations_scheme={'scheme': 'uint', 'bits': 8, 'symmetric': False},
    calibration_iterations=32
)
# Run calibration on representative data to estimate initial quantization parameters
LSQ.configure_model(quantized_model)

for batch in calibration_dataloader:
    quantized_model(batch)

# Setup optimizer (see the next section for using the quantization parameter groups)
optimizer = optim.Adam(quantized_model.parameters(), lr=1e-3)

# Training loop with LSQ
for epoch in range(num_epochs):
    for batch in train_dataloader:
        optimizer.zero_grad()
        output = quantized_model(batch['input'])
        loss = criterion(output, batch['target'])
        loss.backward()
        optimizer.step()

Setting a different learning rate for quantization parameters


handle, quantized_model, param_groups = LSQ.setup_model(
    model=model,
    weights_scheme={'scheme': 'int', 'bits': 4, 'symmetric': True},
    activations_scheme={'scheme': 'uint', 'bits': 8, 'symmetric': False},
    calibration_iterations=32
)
optimizer = optim.SGD([
    {'params': param_groups['model'], 'lr': 1e-4},
    # Higher LR for quantization scale parameters
    {'params': param_groups['scales'], 'lr': 1e-2}
])

Serialize and deserialize quantized model


Save quantized model checkpoint to file:

from qlip import QlipConfig

config = QlipConfig.from_model(quantized_model)
config.save('config.pth')

Apply quantization configuration to model from file:

from qlip import QlipConfig

config = QlipConfig.from_file('config.pth')
model = config.apply(model)