Int8DynamicActivationIntxWeightConfig#

class torchao.quantization.Int8DynamicActivationIntxWeightConfig(weight_dtype: dtype = torch.int8, weight_granularity: Granularity = PerGroup(group_size=32), weight_mapping_type: MappingType = MappingType.SYMMETRIC, weight_scale_dtype: Optional[dtype] = None, act_mapping_type: MappingType = MappingType.ASYMMETRIC, intx_packing_format: IntxPackingFormat = IntxPackingFormat.UNPACKED_TO_INT8, intx_choose_qparams_algorithm: IntxChooseQParamsAlgorithm = IntxChooseQParamsAlgorithm.AFFINE, version: int = 2)[source]#

Configuration for dynamically quantizing activations to torch.int8 and weights to torch.intx, with 1 <= x <= 8. More specifically, activations are dynamically quantized to 8-bits at a per-token granularity with scales/zeros. Weights are quantized with scales/zeros in a groupwise or channelwise manner using the number of bits specified by weight_dtype.

This layout is identical to Int8DynamicActivationInt4WeightConfig when weight_dtype is torch.int4 and other args are the same. However, this layout is more general and supports other weight dtypes.

Parameters

weight_dtype: The dtype to use for weight quantization. Must be torch.intx, where 1 <= x <= 8.

weight_granularity: The granularity to use for weight quantization. Must be PerGroup or PerAxis(axis=0).

weight_mapping_type: The type of mapping to use for weight quantization. Must be MappingType.ASYMMETRIC or MappingType.SYMMETRIC. MappingType.SYMMETRIC requires ZeroPointDomain.NONE.

weight_scale_dtype: The dtype to use for the weight scale.

act_mapping_type: The type of mapping to use for activation quantization. Must be MappingType.ASYMMETRIC or MappingType.SYMMETRIC.

intx_packing_format: The format to use for the packed weight tensor (version 2 only).
  • unpacked_to_int8: the default format, intended for export applications like ExecuTorch.

  • opaque_torchao_auto: a format optimized for CPU performance.

intx_choose_qparams_algorithm: The algorithm to use for choosing the quantization parameters.

version: The version of the config to use. Only a subset of the above arguments is valid for each version; see the note for more details.
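A minimal usage sketch, assuming torchao and PyTorch are installed and that this config is available under `torchao.quantization` (the `quantize_` entry point and `PerGroup` granularity are taken from torchao's quantization API; the 4-bit `torch.int4` weight dtype is one choice within the 1 <= x <= 8 range described above). The example skips gracefully when the libraries are not present:

```python
# Hedged sketch: applying Int8DynamicActivationIntxWeightConfig to a small model.
# Names assumed from torchao's quantization API: quantize_, PerGroup.
def quantize_example():
    try:
        import torch
        from torchao.quantization import quantize_, Int8DynamicActivationIntxWeightConfig
        from torchao.quantization.granularity import PerGroup
    except ImportError:
        # torch/torchao (or this config) not available in this environment
        return "skipped"

    model = torch.nn.Sequential(torch.nn.Linear(128, 64))

    # 4-bit symmetric groupwise weights, dynamically quantized int8 activations
    config = Int8DynamicActivationIntxWeightConfig(
        weight_dtype=torch.int4,
        weight_granularity=PerGroup(32),
    )
    quantize_(model, config)  # quantizes the nn.Linear modules in place
    return "quantized"

result = quantize_example()
```

Passing `PerAxis(axis=0)` instead of `PerGroup(32)` would give channelwise rather than groupwise weight scales, per the `weight_granularity` description above.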