
quantize_affine

torchao.quantization.quantize_affine(input: Tensor, block_size: Tuple[int, ...], scale: Tensor, zero_point: Optional[Tensor], output_dtype: dtype, quant_min: Optional[Union[int, float]] = None, quant_max: Optional[Union[int, float]] = None) → Tensor[source]
Parameters
  • input (torch.Tensor) – original float32, float16, or bfloat16 Tensor

  • block_size (Tuple[int, ...]) – granularity of quantization; the shape of the block of tensor elements that share the same quantization parameters, e.g. when block_size is the same as the input tensor shape, we are using per-tensor quantization

  • scale (torch.Tensor) – quantization parameter for affine quantization

  • zero_point (Optional[torch.Tensor]) – quantization parameter for affine quantization

  • output_dtype (torch.dtype) – requested dtype (e.g. torch.uint8) for output Tensor

  • quant_min (Optional[Union[int, float]]) – minimum quantized value for output Tensor; if not specified, it will be derived from output_dtype

  • quant_max (Optional[Union[int, float]]) – maximum quantized value for output Tensor; if not specified, it will be derived from output_dtype
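A minimal usage sketch based on the signature above, quantizing a float tensor to uint8 with per-tensor granularity. The scale and zero_point values are placeholders rather than calibrated quantization parameters; in practice they would come from a calibration helper such as torchao.quantization.choose_qparams_affine:

```python
import torch
from torchao.quantization import quantize_affine

x = torch.randn(3, 3, 10, 10)

# Placeholder qparams for illustration only (not derived from the data).
scale = torch.tensor(0.05)
zero_point = torch.tensor(128)

# block_size equal to the full input shape -> a single (scale, zero_point)
# pair is shared by the whole tensor, i.e. per-tensor quantization.
xq = quantize_affine(
    x,
    block_size=(3, 3, 10, 10),
    scale=scale,
    zero_point=zero_point,
    output_dtype=torch.uint8,
)
print(xq.dtype, xq.shape)  # torch.uint8 torch.Size([3, 3, 10, 10])
```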

Note

How can block_size represent different granularities? Suppose we have a Tensor of size (3, 3, 10, 10); the table below shows how block_size represents different granularities:

granularity type                     | block_size
per_tensor                           | (3, 3, 10, 10)
per_axis (axis=0)                    | (1, 3, 10, 10)
per_axis (axis=1)                    | (3, 1, 10, 10)
per_group (groupsize=2)              | (3, 3, 10, 2)
per_group (groupsize=2) for axis=2   | (3, 3, 2, 10)
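For example, the per_axis (axis=0) row above gives each slice along dim 0 its own quantization parameters, so scale and zero_point carry one entry per slice. A minimal sketch, with placeholder values rather than calibrated ones:

```python
import torch
from torchao.quantization import quantize_affine

x = torch.randn(3, 3, 10, 10)

# One (scale, zero_point) entry per slice along axis 0; the values are
# made up for illustration.
scale = torch.tensor([0.02, 0.05, 0.10])
zero_point = torch.tensor([120, 128, 131])

# block_size (1, 3, 10, 10): the per_axis (axis=0) row from the table.
xq = quantize_affine(
    x,
    block_size=(1, 3, 10, 10),
    scale=scale,
    zero_point=zero_point,
    output_dtype=torch.uint8,
)
```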

Returns

Quantized Tensor with the requested output_dtype