
torchao.quantization#


Main Quantization APIs#

quantize_

Converts the weights of linear modules in the model according to config; the model is modified in place.
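
For example, a minimal sketch of the flow, assuming a model whose linear layers support the chosen config on the current device (Int8WeightOnlyConfig, listed below, is used here for illustration):

    import torch
    from torchao.quantization import quantize_, Int8WeightOnlyConfig

    # A toy model; quantize_ walks the module tree and swaps each
    # nn.Linear weight for a quantized tensor subclass, in place.
    model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).to(torch.bfloat16)

    quantize_(model, Int8WeightOnlyConfig())

    # The model runs as usual; dequantization happens inside the kernels.
    out = model(torch.randn(8, 1024, dtype=torch.bfloat16))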

FqnToConfig

Configuration class for applying different quantization configs to modules or parameters based on their fully qualified names (FQNs).
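
A minimal sketch of per-FQN configuration, assuming FqnToConfig takes a dict mapping FQNs to configs and that "_default" serves as a fallback key (both assumptions; check the entry above for the exact signature). A CUDA device is assumed for the int4 kernel:

    import torch
    from torchao.quantization import (
        quantize_,
        FqnToConfig,
        Int4WeightOnlyConfig,
        Int8WeightOnlyConfig,
    )

    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 1024),  # FQN "0"
        torch.nn.Linear(1024, 1024),  # FQN "1"
    ).to(torch.bfloat16).cuda()

    # int4 for module "0", int8 for everything else.
    # NOTE: the "_default" fallback key is an assumption.
    quantize_(model, FqnToConfig({
        "0": Int4WeightOnlyConfig(group_size=128),
        "_default": Int8WeightOnlyConfig(),
    }))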

Workflow Configs#

float8 weight configs#

Float8DynamicActivationFloat8WeightConfig

Configuration for applying float8 dynamic symmetric quantization to both activations and weights of linear layers.

Float8DynamicActivationFloat8SemiSparseWeightConfig

Configuration that applies float8 dynamic quantization to activations and, to the weights of linear layers, float8 quantization followed by compression to a sparse semi-structured tensor.

Float8WeightOnlyConfig

Configuration for applying float8 weight-only symmetric per-channel quantization to linear layers.
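
A minimal sketch for the float8 configs, assuming a CUDA GPU with float8 support (hardware requirements are not spelled out on this page):

    import torch
    from torchao.quantization import (
        quantize_,
        Float8DynamicActivationFloat8WeightConfig,
        PerRow,
    )

    model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).to(torch.bfloat16).cuda()

    # Dynamic float8 activations + float8 weights, one scale per row.
    quantize_(model, Float8DynamicActivationFloat8WeightConfig(granularity=PerRow()))

    # Weight-only alternative: quantize_(model, Float8WeightOnlyConfig())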

int8 weight configs#

Int8DynamicActivationInt8WeightConfig

Configuration for applying int8 dynamic symmetric per-token activation and int8 per-channel weight quantization to linear layers.

Int8WeightOnlyConfig

Configuration for applying int8 weight-only symmetric per-channel quantization to linear layers.
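
A minimal sketch contrasting the two int8 configs (as a rule of thumb, dynamic activation quantization helps compute-bound workloads, while weight-only quantization helps memory-bound ones):

    import torch
    from torchao.quantization import (
        quantize_,
        Int8DynamicActivationInt8WeightConfig,
        Int8WeightOnlyConfig,
    )

    model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).to(torch.bfloat16)

    # int8 dynamic per-token activations + int8 per-channel weights:
    quantize_(model, Int8DynamicActivationInt8WeightConfig())

    # ...or quantize weights only, leaving activations in bfloat16:
    # quantize_(model, Int8WeightOnlyConfig())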

int4 weight configs#

Int4WeightOnlyConfig

Configuration for int4 weight-only quantization. Only groupwise quantization is supported at the moment; both version 1 and version 2 are supported, which are implemented differently but provide the same functionality.

Float8DynamicActivationInt4WeightConfig

Configuration for applying float8 dynamic per-row activation quantization and int4 per-group weight quantization to linear layers. Only group_size 128 is supported at the moment, since the underlying kernel supports only 128 and above and larger groups bring no benefit.

Int8DynamicActivationInt4WeightConfig

Configuration for applying int8 dynamic per-token asymmetric activation quantization and int4 per-group symmetric weight quantization to linear layers. This is used to produce models for the ExecuTorch backend, but ExecuTorch does not yet support lowering models quantized with this flow.
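
A minimal sketch for the int4 configs, assuming a CUDA device (the int4 kernels are GPU-oriented), shown with the weight-only variant:

    import torch
    from torchao.quantization import quantize_, Int4WeightOnlyConfig

    model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).to(torch.bfloat16).cuda()

    # Groupwise int4 weights: group_size trades memory for accuracy
    # (smaller groups mean more scales and usually better accuracy).
    quantize_(model, Int4WeightOnlyConfig(group_size=128))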

intx weight configs#

IntxWeightOnlyConfig

Configuration for quantizing weights to torch.intx, with 1 <= x <= 8. Weights are quantized with scales/zeros in a groupwise or channelwise manner using the number of bits specified by weight_dtype.

Parameters:

- weight_dtype: the dtype to use for weight quantization. Must be torch.intx, where 1 <= x <= 8.
- granularity: the granularity to use for weight quantization. Must be PerGroup or PerAxis(0).
- mapping_type: the type of mapping to use for the weight quantization. Must be one of MappingType.ASYMMETRIC or MappingType.SYMMETRIC.
- scale_dtype: the dtype to use for the weight scale.
- intx_packing_format: the format to use for the packed weight tensor (version 2 only).
- intx_choose_qparams_algorithm: the algorithm to use for choosing the quantization parameters.
- version: the version of the config to use; only a subset of the above args is valid for each version, see the note for more details.

Int8DynamicActivationIntxWeightConfig

Configuration for dynamically quantizing activations to torch.int8 and weights to torch.intx, with 1 <= x <= 8.
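
A minimal sketch for IntxWeightOnlyConfig using the parameters documented above; PerGroup is assumed to be importable from torchao.quantization.granularity:

    import torch
    from torchao.quantization import quantize_, IntxWeightOnlyConfig
    from torchao.quantization.granularity import PerGroup

    model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).to(torch.bfloat16)

    # 4-bit weights, one scale/zero point per group of 32 input features.
    quantize_(model, IntxWeightOnlyConfig(
        weight_dtype=torch.int4,
        granularity=PerGroup(32),
    ))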

mx weight configs (prototype)#

MXDynamicActivationMXWeightConfig

MX Format Inference Quantization
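
A minimal sketch for the prototype MX config; the import path and constructor defaults are assumptions (prototype APIs may move or change without notice), and MX kernels generally require recent NVIDIA hardware:

    import torch
    from torchao.quantization import quantize_
    # Assumed import path for the prototype config:
    from torchao.prototype.mx_formats import MXDynamicActivationMXWeightConfig

    model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).to(torch.bfloat16).cuda()
    quantize_(model, MXDynamicActivationMXWeightConfig())  # defaults assumed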

nvfp4 weight configs (prototype)#

NVFP4DynamicActivationNVFP4WeightConfig

NVIDIA FP4 (NVFP4) Inference Quantization Configuration

NVFP4WeightOnlyConfig
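
A minimal sketch for the prototype NVFP4 configs; the import path and defaults are assumptions (prototype APIs may move or change), and NVFP4 targets NVIDIA hardware with native FP4 support:

    import torch
    from torchao.quantization import quantize_
    # Assumed import path for the prototype configs:
    from torchao.prototype.mx_formats import (
        NVFP4DynamicActivationNVFP4WeightConfig,
        NVFP4WeightOnlyConfig,
    )

    model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).to(torch.bfloat16).cuda()
    quantize_(model, NVFP4DynamicActivationNVFP4WeightConfig())
    # Weight-only alternative: quantize_(model, NVFP4WeightOnlyConfig())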