UIntxWeightOnlyConfig
- class torchao.prototype.quantization.UIntxWeightOnlyConfig(group_size: int | None = 128, bit_width: int = 4, packing_bitwidth: int | None = None, set_inductor_config: bool = True)[source]
- Weight-only uintx quantization using a bit-packed format with gemlite (dropbox/gemlite) Triton kernels.
Supports 4-bit (asymmetric, grouped) and 8-bit (symmetric, per-channel) quantization. Uses the gemlite library for efficient Triton-based GEMM.
- Parameters:
group_size – quantization group size. Use None for per-channel (required for 8-bit). Valid values: 32, 64, 128, 256, 512, 1024, None. Default: 128.
bit_width – quantization bit width, 4 or 8. Default: 4.
packing_bitwidth – bit width used for packing: 8, 16, or 32; None selects automatically. Default: None.
set_inductor_config – if True, set recommended torchinductor config. Default: True.
Example:
import torch
import torch.nn as nn

from torchao.prototype.quantization import UIntxWeightOnlyConfig
from torchao.quantization import quantize_

model = nn.Sequential(nn.Linear(512, 256, device="cuda", dtype=torch.float16))

# 4-bit asymmetric groupwise quantization (default)
config = UIntxWeightOnlyConfig(
    group_size=128,
    bit_width=4,
    packing_bitwidth=32,
)
quantize_(model, config)

# 8-bit symmetric per-channel quantization
model_8bit = nn.Sequential(nn.Linear(512, 256, device="cuda", dtype=torch.float16))
config_8bit = UIntxWeightOnlyConfig(
    group_size=None,  # per-channel (required for 8-bit)
    bit_width=8,
)
quantize_(model_8bit, config_8bit)
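After quantize_ runs, the model is used like any other PyTorch module. A minimal sketch of a forward pass follows, assuming a CUDA device is available; the input shape is chosen to match the Linear layer defined above, and this is ordinary PyTorch inference rather than a torchao-specific API:

# Standard inference with the quantized model; the matmul dispatches
# to the gemlite Triton kernel under the hood
x = torch.randn(8, 512, device="cuda", dtype=torch.float16)
with torch.no_grad():
    y = model(x)
print(y.shape)  # torch.Size([8, 256])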