
UIntxWeightOnlyConfig

class torchao.prototype.quantization.UIntxWeightOnlyConfig(group_size: int | None = 128, bit_width: int = 4, packing_bitwidth: int | None = None, set_inductor_config: bool = True)[source]
Weight-only uintx quantization using a bit-packed format with gemlite (dropbox/gemlite) Triton kernels.

Supports 4-bit (asymmetric, grouped) and 8-bit (symmetric, per-channel) quantization. Uses the gemlite library for efficient Triton-based GEMM.
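As a rough illustration of the 4-bit asymmetric groupwise scheme described above, the sketch below maps each group of weights onto the uint4 range 0..15 with a per-group scale and zero point. This is a simplified stand-in, not torchao's or gemlite's actual implementation; the function names are illustrative.

```python
import torch

# Minimal sketch of asymmetric groupwise uint4 quantization (illustrative only,
# not the torchao/gemlite implementation). Each group of `group_size` weights
# gets its own scale and zero point so its [min, max] range maps onto 0..15.

def quantize_uint4_groupwise(w: torch.Tensor, group_size: int = 128):
    g = w.reshape(-1, group_size)
    w_min = g.min(dim=1, keepdim=True).values
    w_max = g.max(dim=1, keepdim=True).values
    # 15 = 2**4 - 1 quantization steps; clamp guards zero-range groups
    scale = (w_max - w_min).clamp(min=1e-8) / 15.0
    zero_point = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(g / scale) + zero_point, 0, 15).to(torch.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.to(torch.float32) - zero_point) * scale

w = torch.randn(2, 256)
q, scale, zp = quantize_uint4_groupwise(w, group_size=128)
w_hat = dequantize(q, scale, zp).reshape(w.shape)
# Per-element reconstruction error is bounded by half a quantization step
assert (w - w_hat).abs().max() <= float(scale.max())
```

The 8-bit path described above is the symmetric, per-channel analogue: one scale per output channel, no zero point, and 256 levels instead of 16.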

Parameters:
  • group_size – quantization group size. Use None for per-channel (required for 8-bit). Valid values: 32, 64, 128, 256, 512, 1024, None. Default: 128.

  • bit_width – quantization bit width, 4 or 8. Default: 4.

  • packing_bitwidth – bit width for packing, 8/16/32/None (auto). Default: None.

  • set_inductor_config – if True, set recommended torchinductor config. Default: True.

Example:


import torch
import torch.nn as nn

from torchao.prototype.quantization import UIntxWeightOnlyConfig
from torchao.quantization import quantize_

model = nn.Sequential(nn.Linear(512, 256, device="cuda", dtype=torch.float16))

# 4-bit asymmetric groupwise quantization (default)
config = UIntxWeightOnlyConfig(
    group_size=128,
    bit_width=4,
    packing_bitwidth=32,
)
quantize_(model, config)

# 8-bit symmetric per-channel quantization
model_8bit = nn.Sequential(nn.Linear(512, 256, device="cuda", dtype=torch.float16))
config_8bit = UIntxWeightOnlyConfig(
    group_size=None,  # per-channel (required for 8-bit)
    bit_width=8,
)
quantize_(model_8bit, config_8bit)
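To make the packing_bitwidth parameter concrete, here is a hedged sketch of what bit packing means for an 8-bit container (the packing_bitwidth=8 case): two 4-bit codes share each uint8 byte, halving weight storage. This is a conceptual illustration, not gemlite's actual memory layout, and the function names are made up for this example.

```python
import torch

# Conceptual sketch of 4-bit packing into uint8 containers (packing_bitwidth=8).
# Not gemlite's real layout; larger containers (16/32) pack 4 or 8 codes per word.

def pack_uint4(x: torch.Tensor) -> torch.Tensor:
    """Pack pairs of 4-bit values (0..15) along the last dim into uint8 bytes."""
    assert x.dtype == torch.uint8 and x.shape[-1] % 2 == 0
    lo = x[..., 0::2]        # even positions -> low nibble
    hi = x[..., 1::2]        # odd positions  -> high nibble
    return lo | (hi << 4)

def unpack_uint4(packed: torch.Tensor) -> torch.Tensor:
    """Inverse of pack_uint4: recover the original 4-bit values."""
    lo = packed & 0xF
    hi = packed >> 4
    return torch.stack([lo, hi], dim=-1).reshape(*packed.shape[:-1], -1)

w = torch.randint(0, 16, (4, 8), dtype=torch.uint8)
packed = pack_uint4(w)
assert packed.shape == (4, 4)                # half the bytes
assert torch.equal(unpack_uint4(packed), w)  # lossless round trip
```

In the real kernels, packing_bitwidth selects the container word size the Triton GEMM reads; None lets the library pick one automatically.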