Int4WeightOnlyConfig

class torchao.quantization.Int4WeightOnlyConfig(group_size: int = 128, set_inductor_config: bool = True, int4_packing_format: Int4PackingFormat = Int4PackingFormat.PLAIN, int4_choose_qparams_algorithm: Int4ChooseQParamsAlgorithm = Int4ChooseQParamsAlgorithm.TINYGEMM, version: int = 2)[source]

Configuration for int4 weight-only quantization. Only groupwise quantization is supported right now. Both version 1 and version 2 are supported; they are implemented differently but cover the same functionality. In version 2, different targets are distinguished mainly by the int4_packing_format argument; in version 1, mainly by the layout.
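
For instance, choosing between the two versions only requires the version argument; a minimal sketch (all other arguments keep the defaults shown in the signature above):

from torchao.quantization import Int4WeightOnlyConfig

# Version 2 (default): the quantization target is selected mainly via
# int4_packing_format (and int4_choose_qparams_algorithm)
config_v2 = Int4WeightOnlyConfig(group_size=128)

# Version 1 (legacy): the target is selected mainly via the layout instead
config_v1 = Int4WeightOnlyConfig(group_size=128, version=1)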

Parameters
  • group_size – controls the granularity of quantization; a smaller group size is more fine-grained. Choices are [256, 128, 64, 32] (see the sketch after this list). Used in both version 1 and version 2.

  • int4_packing_format – the packing format for the int4 tensor. Used in version 2 only.

  • int4_choose_qparams_algorithm – which algorithm to use for choosing the quantization parameters; currently TINYGEMM (“tinygemm”) and HQQ (“hqq”) are supported. Used in version 2 only.

  • set_inductor_config – if True, adjusts torchinductor settings to recommended values. Used in both version 1 and version 2.

  • version – version of the config to use, default is 2
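
To make the group_size granularity concrete, here is a back-of-the-envelope sketch (the shapes are illustrative, not from the library):

out_features, in_features = 1024, 4096   # a Linear weight of shape [1024, 4096]
group_size = 128
# Groupwise quantization computes one (scale, zero_point) pair per group of
# `group_size` consecutive elements along the input dimension.
groups_per_row = in_features // group_size        # 32 groups per output row
num_qparam_pairs = out_features * groups_per_row  # 32768 pairs in total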

Example:

import torch
from torchao.quantization import Int4WeightOnlyConfig, quantize_

# A toy model to quantize; replace with your own
model = torch.nn.Sequential(torch.nn.Linear(128, 128)).to(torch.bfloat16)

# Note: int4_packing_format varies by backend
if torch.cuda.is_available():
    model = model.to("cuda")
    # CUDA: tile-packed layout, with HQQ for choosing the quantization parameters
    config = Int4WeightOnlyConfig(
        group_size=32,
        int4_packing_format="tile_packed_to_4d",
        int4_choose_qparams_algorithm="hqq",
    )
elif torch.xpu.is_available():
    model = model.to("xpu")
    # XPU: use plain_int32 packing
    config = Int4WeightOnlyConfig(group_size=32, int4_packing_format="plain_int32")
else:
    # Other backends: fall back to the default (plain) packing format
    config = Int4WeightOnlyConfig(group_size=32)

quantize_(model, config)
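
As a quick sanity check (a sketch; the exact tensor subclass name depends on the torchao version and the packing format chosen above), you can confirm that quantize_ swapped the Linear weight for a quantized tensor subclass and run a forward pass:

# The Linear weight is no longer a plain torch.Tensor
print(type(model[0].weight))

# Forward pass with the quantized weights
x = torch.randn(2, 128, dtype=torch.bfloat16, device=model[0].weight.device)
with torch.no_grad():
    print(model(x).shape)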