Int8DynamicActivationInt8WeightConfig
- class torchao.quantization.Int8DynamicActivationInt8WeightConfig(layout: Layout | None = None, act_mapping_type: MappingType | None = MappingType.SYMMETRIC, weight_only_decode: bool = False, granularity: Granularity | Tuple[Granularity, Granularity] | list[Granularity] | None = PerRow(dim=-1), set_inductor_config: bool = True, version: int = 1)[source]
Configuration for applying int8 dynamic per-token activation and int8 per-channel weight quantization to linear layers.
- Parameters:
layout – Optional[Layout] = PlainLayout() - Tensor layout for the quantized weights. Controls how the quantized data is stored and accessed.
granularity – Optional[Union[Granularity, Tuple[Granularity, Granularity], List[Granularity]]] = PerRow() - The granularity for quantization. Can be either a single granularity (applied to both activations and weights) or a tuple/list of two granularities (first for activations, second for weights). If None, defaults to PerRow for both. Only PerTensor and PerRow are supported.
act_mapping_type – Optional[MappingType] = MappingType.SYMMETRIC - Mapping type for activation quantization. SYMMETRIC and ASYMMETRIC are supported for version 2.
weight_only_decode – bool = False - If True, only quantizes weights during forward pass and keeps activations in original precision during decode operations.
set_inductor_config – bool = True - If True, adjusts torchinductor settings to recommended values for better performance with this quantization scheme.
version – int = 1 - The version of the config. Version 1 uses AffineQuantizedTensor, which we plan to deprecate/split; version 2 uses Int8Tensor.
Example:
```python
import torch.nn as nn
from torchao.quantization import Int8DynamicActivationInt8WeightConfig, quantize_

model = nn.Sequential(nn.Linear(2048, 2048, device="cuda"))
quantize_(model, Int8DynamicActivationInt8WeightConfig())
```
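To make the scale layout concrete, here is an illustrative sketch in plain Python (no torchao, hypothetical helper names) of the symmetric int8 math this config applies: one scale per weight output row ("per-channel") and one scale per activation row ("per-token"), the latter computed dynamically at runtime:

```python
# Illustrative sketch only -- not torchao's implementation.

def symmetric_int8_quantize(row):
    """Symmetric int8: scale = max(|x|) / 127; q = round(x / scale),
    clamped to [-127, 127]."""
    amax = max(abs(v) for v in row)
    scale = amax / 127 if amax else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in row]
    return q, scale

# Per-channel weight quantization: one scale per output row.
weight = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.5]]
w_q = [symmetric_int8_quantize(row) for row in weight]

# Per-token dynamic activation quantization: one scale per input row,
# derived at runtime from the activations themselves.
activations = [[0.3, -0.6, 0.9]]
a_q = [symmetric_int8_quantize(row) for row in activations]

# Dequantized values land within one quantization step of the originals.
for (q, scale), row in zip(w_q + a_q, weight + activations):
    assert all(abs(qi * scale - v) <= scale for qi, v in zip(q, row))
```

This is why the scheme is "dynamic": weight scales are fixed after quantization, while each incoming activation row gets a fresh scale during the forward pass.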