int4_weight_only

torchao.quantization.int4_weight_only(group_size=128, layout=TensorCoreTiledLayout(inner_k_tiles=8), use_hqq=False, zero_point_domain=None)[source]

Applies uint4 weight-only asymmetric per-group quantization to linear layers, using the “tensor_core_tiled” layout for a speedup with the tinygemm kernel.

Note

This targets the tinygemm int4mm kernels (torch.ops.aten._weight_int4pack_mm and torch.ops.aten._weight_int4pack_mm_for_cpu). The main differences of the quantization algorithm compared to more traditional integer quantization are: (1) the zero_point is in the floating-point domain instead of the integer domain (zero_point_domain=ZeroPointDomain.FLOAT), and (2) floating-point zero does not have to be exactly representable (preserve_zero=False in choose_qparams_affine). Please follow the relevant code in choose_qparams_affine, quantize_affine and dequantize_affine to learn how the quantization parameters are chosen and how the tensor is quantized/dequantized for tinygemm.

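The snippet below is a conceptual sketch of the convention described in the note: per-group uint4 quantization where the zero point stays a floating-point value and float 0.0 need not map exactly onto an integer code. It is illustrative only and is not the torchao implementation; the actual parameter selection and packing live in choose_qparams_affine, quantize_affine, dequantize_affine and the tinygemm kernels.

import torch

# Illustrative sketch only (not the torchao implementation): group-wise uint4
# asymmetric fake-quantization with a floating-point zero_point, in the spirit
# of ZeroPointDomain.FLOAT / preserve_zero=False described above.
def fake_quantize_uint4_per_group(w: torch.Tensor, group_size: int = 128) -> torch.Tensor:
    quant_min, quant_max = 0, 15                  # uint4 range
    mid_point = (quant_max + quant_min + 1) / 2   # 8, the implicit integer "center"
    orig_shape = w.shape
    wg = w.reshape(-1, group_size)                # one scale / zero_point per group

    max_val = wg.amax(dim=1, keepdim=True)
    min_val = wg.amin(dim=1, keepdim=True)
    scale = ((max_val - min_val) / (quant_max - quant_min)).clamp(min=1e-9)
    # zero_point stays in the floating-point domain; float 0.0 is not
    # guaranteed to be exactly representable (preserve_zero=False).
    zero_point = min_val + scale * mid_point

    q = torch.clamp(torch.round((wg - zero_point) / scale + mid_point), quant_min, quant_max)
    # Dequantize the tinygemm way: (q - mid_point) * scale + zero_point
    return ((q - mid_point) * scale + zero_point).reshape(orig_shape)

# Round-trip a weight matrix whose last dimension is divisible by group_size.
w = torch.randn(32, 256)
w_dq = fake_quantize_uint4_per_group(w, group_size=128)
print((w - w_dq).abs().max())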
Parameters:
  • group_size – controls the granularity of quantization; a smaller group size is more fine-grained. Choices are [256, 128, 64, 32].

  • layout – layout type for the quantized tensor, default is TensorCoreTiledLayout(inner_k_tiles=8)

  • use_hqq – whether to use hqq or the default quantization mode, default is False

  • zero_point_domain – domain of the zero points, choices are [None (the value is then determined by the layout), ZeroPointDomain.FLOAT, ZeroPointDomain.INT, ZeroPointDomain.NONE]

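A minimal usage sketch, assuming a recent torchao with the quantize_ API; the tinygemm CUDA path typically expects a bfloat16 model on a GPU with int4mm support, and in/out features divisible by the chosen group size:

import torch
from torchao.quantization import quantize_, int4_weight_only

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
).to(device="cuda", dtype=torch.bfloat16)

# Replace the weights of every eligible nn.Linear with uint4 per-group
# quantized weights using the tensor-core-tiled layout.
quantize_(model, int4_weight_only(group_size=128))

x = torch.randn(16, 1024, device="cuda", dtype=torch.bfloat16)
with torch.no_grad():
    y = model(x)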