Int8DynamicActivationUIntxWeightConfig#
- class torchao.prototype.quantization.Int8DynamicActivationUIntxWeightConfig(group_size: int | None = 128, bit_width: int = 4, packing_bitwidth: int | None = None, set_inductor_config: bool = True)[source][source]#
- Dynamic activation + uintx weight quantization using gemlite (dropbox/gemlite)
Triton kernels.
Activations are quantized dynamically at runtime (int8). Weights use bit-packed uintx format. Supports 4-bit and 8-bit weight quantization.
- Parameters:
group_size – quantization group size. Use None for per-channel (required for 8-bit). Valid values: 32, 64, 128, 256, 512, 1024, None. Default: 128.
bit_width – weight quantization bit width, 4 or 8. Default: 4.
packing_bitwidth – bit width for packing, 8/16/32/None (auto). Default: None.
set_inductor_config – if True, set recommended torchinductor config. Default: True.
Example:
# Copyright (c) Meta Platforms, Inc. and affiliates. # All rights reserved. # # This source code is licensed under the BSD 3-Clause license found in the # LICENSE file in the root directory of this source tree. import torch import torch.nn as nn from torchao.prototype.quantization import Int8DynamicActivationUIntxWeightConfig from torchao.quantization import quantize_ model = nn.Sequential(nn.Linear(512, 256, device="cuda", dtype=torch.float16)) # int8 dynamic activation + 4-bit grouped weight quantization config = Int8DynamicActivationUIntxWeightConfig( group_size=128, bit_width=4, packing_bitwidth=32, ) quantize_(model, config) # int8 dynamic activation + 8-bit per-channel weight quantization model_8bit = nn.Sequential(nn.Linear(512, 256, device="cuda", dtype=torch.float16)) config_8bit = Int8DynamicActivationUIntxWeightConfig( group_size=None, # per-channel (required for 8-bit) bit_width=8, ) quantize_(model_8bit, config_8bit)