Int8DynamicActivationInt8WeightConfig

class torchao.quantization.Int8DynamicActivationInt8WeightConfig(act_mapping_type: MappingType | None = MappingType.SYMMETRIC, weight_only_decode: bool = False, granularity: Granularity | list[Granularity] | None = PerRow(dim=-1), set_inductor_config: bool = True, version: int = 2, reduce_range: bool | None = False)

Configuration for applying int8 dynamic per-token activation and int8 per-channel weight quantization to linear layers.
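To make "dynamic per-token activation" and "per-channel weight" quantization concrete, the following is a minimal plain-Python sketch of the arithmetic, assuming symmetric quantization to the range [-127, 127]. It is not torchao's implementation; the helper names quantize_row_symmetric and int8_dynamic_linear are hypothetical and exist only for this illustration.

```python
# Illustrative sketch only: plain-Python symmetric int8 quantization,
# not torchao's kernels. All helper names here are hypothetical.

def quantize_row_symmetric(row, qmax=127):
    """Quantize one row with a single scale so that max|x| maps to qmax."""
    max_abs = max(abs(x) for x in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(x / scale))) for x in row]
    return q, scale


def int8_dynamic_linear(activations, weights):
    """Approximate activations @ weights.T with int8 values and float scales.

    activations: list of tokens, each a list of in_features floats
    weights: list of output channels, each a list of in_features floats
    """
    # Weights: one scale per output channel (row), computed once up front.
    quantized_weights = [quantize_row_symmetric(w) for w in weights]
    out = []
    for token in activations:
        # Activations: one scale per token, computed at run time ("dynamic").
        qa, a_scale = quantize_row_symmetric(token)
        out_row = []
        for qw, w_scale in quantized_weights:
            acc = sum(a * w for a, w in zip(qa, qw))  # integer accumulation
            out_row.append(acc * a_scale * w_scale)   # rescale to float
        out.append(out_row)
    return out


x = [[0.5, -1.0, 0.25], [8.0, 2.0, -4.0]]   # 2 tokens, 3 features
w = [[0.1, 0.2, -0.3], [1.5, -0.5, 0.75]]   # 2 output channels
approx = int8_dynamic_linear(x, w)
exact = [[sum(a * b for a, b in zip(t, c)) for c in w] for t in x]
print("int8 result: ", approx)
print("float result:", exact)
```

Because each token and each output channel gets its own scale, a row with large values (the second token above) does not force a coarse step size onto rows with small values.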

Parameters:

granularity (Optional[Union[Granularity, List[Granularity]]] = PerRow()) – The granularity for quantization. Either a single granularity (applied to both activations and weights) or a list of two granularities (activations first, weights second). If None, defaults to PerRow for both. Only PerTensor and PerRow are supported.
act_mapping_type (Optional[MappingType] = MappingType.SYMMETRIC) – Mapping type for activation quantization. SYMMETRIC and ASYMMETRIC are supported.
set_inductor_config (bool = True) – If True, adjusts torchinductor settings to the values recommended for this quantization scheme.
version (int = 2) – The version of the config.
reduce_range (Optional[bool] = False) – If True, use reduced activation and weight quantization ranges to avoid overflow on CPUs without VNNI. Call should_reduce_range() to help decide whether this is needed.
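The difference between the SYMMETRIC and ASYMMETRIC options of act_mapping_type can be illustrated with the textbook affine quantization formulas. The sketch below is plain Python under stated assumptions (an int8 range of [-128, 127], and hypothetical helper names); it does not reproduce torchao's internal code and only shows the general idea of scale-only versus scale-plus-zero-point mapping.

```python
# Illustrative sketch only: textbook affine quantization formulas in plain
# Python. Helper names and the int8 range used here are assumptions.

def symmetric_params(values, qmax=127):
    """Scale only; the zero point is fixed at 0, so the range is centered."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return scale, 0


def asymmetric_params(values, qmin=-128, qmax=127):
    """Scale plus zero point, so [min, max] maps onto all of [qmin, qmax]."""
    lo = min(min(values), 0.0)  # keep 0.0 exactly representable
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
    zero_point = round(qmin - lo / scale)
    return scale, zero_point


def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(q, scale, zero_point):
    return [(x - zero_point) * scale for x in q]


# Non-negative activations (for example after a ReLU): a symmetric range
# leaves the negative half of the int8 codes unused.
acts = [0.0, 0.1, 0.7, 1.3, 2.0]
for name, params in [("symmetric", symmetric_params(acts)),
                     ("asymmetric", asymmetric_params(acts))]:
    scale, zp = params
    q = quantize(acts, scale, zp)
    err = max(abs(a - b) for a, b in zip(acts, dequantize(q, scale, zp)))
    print(f"{name}: scale={scale:.5f} zero_point={zp} max_error={err:.5f}")
```

For one-sided data like this, the asymmetric mapping yields a smaller step size, at the cost of carrying a zero point through the computation. For data roughly centered on zero, the two mappings behave similarly.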

Example:

import torch.nn as nn

from torchao.quantization import Int8DynamicActivationInt8WeightConfig, quantize_

model = nn.Sequential(nn.Linear(2048, 2048, device="cuda"))
quantize_(model, Int8DynamicActivationInt8WeightConfig())
