quantize_affine¶
- torchao.quantization.quantize_affine(input: Tensor, block_size: Tuple[int, ...], scale: Tensor, zero_point: Optional[Tensor], output_dtype: dtype, quant_min: Optional[Union[int, float]] = None, quant_max: Optional[Union[int, float]] = None, zero_point_domain: ZeroPointDomain = ZeroPointDomain.INT) → Tensor [source]¶
- Parameters:
input (torch.Tensor) – original float32, float16 or bfloat16 Tensor
block_size (Tuple[int, ...]) – granularity of quantization; the size of the block of tensor elements that share the same quantization parameters. For example, when block_size equals the input tensor's shape, we are doing per-tensor quantization
scale (torch.Tensor) – quantization parameter for affine quantization
zero_point (Optional[torch.Tensor]) – quantization parameter for affine quantization
output_dtype (torch.dtype) – requested dtype (e.g. torch.uint8) for output Tensor
quant_min (Optional[int]) – minimum quantized value for the output Tensor; if not specified, it is derived from output_dtype
quant_max (Optional[int]) – maximum quantized value for the output Tensor; if not specified, it is derived from output_dtype
zero_point_domain (ZeroPointDomain) – the domain that zero_point is in, either integer or float. If zero_point is in the integer domain, it is added to the quantized integer value during quantization; if zero_point is in the floating point domain, it is subtracted from the floating point (unquantized) value before quantization. Default is ZeroPointDomain.INT
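The two zero-point domains can be sketched with plain torch ops. This is illustrative only and not torchao's implementation; the function names and the int8 output dtype are assumptions for the sketch:

```python
import torch

def quantize_int_domain(x, scale, zero_point, quant_min, quant_max):
    # Integer-domain zero point: added to the rounded integer value.
    q = torch.round(x / scale) + zero_point
    return torch.clamp(q, quant_min, quant_max).to(torch.int8)

def quantize_float_domain(x, scale, zero_point, quant_min, quant_max):
    # Float-domain zero point: subtracted from x before scaling.
    q = torch.round((x - zero_point) / scale)
    return torch.clamp(q, quant_min, quant_max).to(torch.int8)

x = torch.tensor([-1.0, 0.0, 0.5, 1.0])
q_int = quantize_int_domain(x, scale=0.01, zero_point=10,
                            quant_min=-128, quant_max=127)
q_float = quantize_float_domain(x, scale=0.01, zero_point=0.05,
                                quant_min=-128, quant_max=127)
```

Note that the same float zero point shifts the result differently in each domain, which is why the domain must be tracked alongside the qparams.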
Note
How can block_size represent different granularities? Let's say we have a Tensor of size (3, 3, 10, 10). The table below shows how block_size represents different granularities:
granularity type | block_size
per_tensor | (3, 3, 10, 10)
per_axis (axis=0) | (1, 3, 10, 10)
per_axis (axis=1) | (3, 1, 10, 10)
per_group (groupsize=2) | (3, 3, 10, 2)
per_group (groupsize=2) for axis=3 | (3, 3, 2, 10)
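To make the table concrete, here is a small helper (hypothetical, not part of torchao) that computes how many (scale, zero_point) blocks a given block_size yields for a (3, 3, 10, 10) tensor:

```python
import math

def num_qparam_blocks(tensor_shape, block_size):
    # Each block of block_size elements shares one (scale, zero_point)
    # pair, so the block count is the product of shape[i] // block_size[i].
    assert len(tensor_shape) == len(block_size)
    return math.prod(s // b for s, b in zip(tensor_shape, block_size))

shape = (3, 3, 10, 10)
num_qparam_blocks(shape, (3, 3, 10, 10))  # per_tensor: 1 block
num_qparam_blocks(shape, (1, 3, 10, 10))  # per_axis (axis=0): 3 blocks
num_qparam_blocks(shape, (3, 3, 10, 2))   # per_group (groupsize=2): 5 blocks
```

So per-tensor quantization carries a single scale/zero_point pair, while finer block sizes carry proportionally more.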
- Output:
quantized tensor with requested dtype
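A round-trip sketch of the affine scheme in the integer zero-point domain, in plain torch. This is an illustration under simplified assumptions (scalar qparams, int8 output); torchao's quantize_affine additionally handles block-wise qparams per the block_size table above:

```python
import torch

def quantize(x, scale, zero_point, quant_min=-128, quant_max=127):
    # q = clamp(round(x / scale) + zero_point, quant_min, quant_max)
    q = torch.round(x / scale) + zero_point
    return torch.clamp(q, quant_min, quant_max).to(torch.int8)

def dequantize(q, scale, zero_point):
    # x_hat = (q - zero_point) * scale
    return (q.to(torch.float32) - zero_point) * scale

x = torch.tensor([-0.4, 0.0, 0.27, 0.9])
scale, zp = 0.01, 0
x_hat = dequantize(quantize(x, scale, zp), scale, zp)
# For values inside the representable range, the round-trip
# error is bounded by scale / 2.
```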