AffineQuantizedTensor¶

class torchao.dtypes.AffineQuantizedTensor(tensor_impl: AQTTensorImpl, block_size: Tuple[int, ...], shape: Size, quant_min: Optional[Union[int, float]] = None, quant_max: Optional[Union[int, float]] = None, zero_point_domain: ZeroPointDomain = ZeroPointDomain.INT, dtype=None, strides=None)[source]¶

Affine quantized tensor subclass. Affine quantization means we quantize the floating point tensor with an affine transformation:: quantized_tensor = float_tensor / scale + zero_point

To see what happens during choose_qparams, quantization and dequantization for affine quantization, please checkout https://github.com/pytorch/ao/blob/main/torchao/quantization/quant_primitives.py and check the three quant primitive ops: choose_qparams_affine, quantize_affine qand dequantize_affine

The shape and dtype of the tensor subclass represent how the tensor subclass looks externally, regardless of the internal representation’s type or orientation.

fields:

tensor_impl (AQTTensorImpl): tensor that serves as a general tensor impl storage for the quantized data,: e.g. storing plain tensors (int_data, scale, zero_point) or packed formats depending on device and operator/kernel
block_size (Tuple[int, …]): granularity of quantization, this means the size of the tensor elements that’s sharing the same qparam: e.g. when size is the same as the input tensor dimension, we are using per tensor quantization

shape (torch.Size): the shape for the original high precision Tensor quant_min (Optional[int]): minimum quantized value for the Tensor, if not specified, it will be derived from dtype of int_data quant_max (Optional[int]): maximum quantized value for the Tensor, if not specified, it will be derived from dtype of int_data zero_point_domain (ZeroPointDomain): the domain that zero_point is in, should be either integer or float

if zero_point is in integer domain, zero point is added to the quantized integer value during quantization if zero_point is in floating point domain, zero point is subtracted from the floating point (unquantized) value during quantization default is ZeroPointDomain.INT

dtype: dtype for original high precision tensor, e.g. torch.float32

dequantize() → Tensor[source]¶: Given a quantized Tensor, dequantize it and return the dequantized float Tensor.

to(*args, **kwargs) → Tensor[source]¶

Performs Tensor dtype and/or device conversion. A torch.dtype and torch.device are inferred from the arguments of self.to(*args, **kwargs).

Note

If the self Tensor already has the correct torch.dtype and torch.device, then self is returned. Otherwise, the returned tensor is a copy of self with the desired torch.dtype and torch.device.

Here are the ways to call to:

to(dtype, non_blocking=False, copy=False, memory_format=torch.preserve_format) → Tensor[source]

Returns a Tensor with the specified dtype

Args:
memory_format (torch.memory_format, optional): the desired memory format of returned Tensor. Default: torch.preserve_format.

to(device=None, dtype=None, non_blocking=False, copy=False, memory_format=torch.preserve_format) → Tensor[source]

Returns a Tensor with the specified device and (optional) dtype. If dtype is None it is inferred to be self.dtype. When non_blocking, tries to convert asynchronously with respect to the host if possible, e.g., converting a CPU Tensor with pinned memory to a CUDA Tensor. When copy is set, a new Tensor is created even when the Tensor already matches the desired conversion.

Args:
memory_format (torch.memory_format, optional): the desired memory format of returned Tensor. Default: torch.preserve_format.

to(other, non_blocking=False, copy=False) → Tensor[source]: Returns a Tensor with same torch.dtype and torch.device as the Tensor other. When non_blocking, tries to convert asynchronously with respect to the host if possible, e.g., converting a CPU Tensor with pinned memory to a CUDA Tensor. When copy is set, a new Tensor is created even when the Tensor already matches the desired conversion.

Example:

>>> tensor = torch.randn(2, 2)  # Initially dtype=float32, device=cpu
>>> tensor.to(torch.float64)
tensor([[-0.5044,  0.0005],
        [ 0.3310, -0.0584]], dtype=torch.float64)

>>> cuda0 = torch.device('cuda:0')
>>> tensor.to(cuda0)
tensor([[-0.5044,  0.0005],
        [ 0.3310, -0.0584]], device='cuda:0')

>>> tensor.to(cuda0, dtype=torch.float64)
tensor([[-0.5044,  0.0005],
        [ 0.3310, -0.0584]], dtype=torch.float64, device='cuda:0')

>>> other = torch.randn((), dtype=torch.float64, device=cuda0)
>>> tensor.to(other, non_blocking=True)
tensor([[-0.5044,  0.0005],
        [ 0.3310, -0.0584]], dtype=torch.float64, device='cuda:0')

AffineQuantizedTensor¶

Docs

Tutorials

Resources