torchao.quantization
Created On: Jan 29, 2026 | Last Updated On: Jan 29, 2026
Main Quantization APIs
quantize_ | Convert the weights of linear modules in the model according to config; the model is modified in place.
ModuleFqnToConfig | Configuration class for applying different quantization configs to modules or parameters based on their fully qualified names (FQNs).
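A minimal usage sketch for these two APIs, assuming the current torchao.quantization import paths. The toy models, the FQN key "0", and the "_default" fallback entry in ModuleFqnToConfig are illustrative assumptions; check your torchao version for the exact dict contract.

```python
import torch.nn as nn
from torchao.quantization import (
    quantize_,
    Int8WeightOnlyConfig,
    Int8DynamicActivationInt8WeightConfig,
    ModuleFqnToConfig,
)

# quantize_ walks the model and swaps linear weights in place.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64))
quantize_(model, Int8WeightOnlyConfig())

# ModuleFqnToConfig picks a config per fully qualified module name;
# "_default" (an assumed fallback key) covers every other module.
model2 = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64))
quantize_(
    model2,
    ModuleFqnToConfig({
        "0": Int8DynamicActivationInt8WeightConfig(),  # the first nn.Linear
        "_default": Int8WeightOnlyConfig(),            # all remaining linears
    }),
)
```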
Workflow Configs
float8 weight configs
Float8DynamicActivationFloat8WeightConfig | Configuration for applying float8 dynamic symmetric quantization to both activations and weights of linear layers.
Float8DynamicActivationFloat8SemiSparseWeightConfig | Applies float8 dynamic quantization to the activations of linear layers, and float8 quantization followed by compression to a sparse semi-structured tensor to their weights.
Float8WeightOnlyConfig | Configuration for applying float8 weight-only symmetric per-channel quantization to linear layers.
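A minimal sketch of the float8 dynamic flow, assuming the current torchao.quantization import paths. The PerRow granularity and the bfloat16/CUDA setup are illustrative; float8 kernels generally require recent NVIDIA hardware.

```python
import torch
import torch.nn as nn
from torchao.quantization import (
    quantize_,
    Float8DynamicActivationFloat8WeightConfig,
    PerRow,
)

# float8 dynamic quantization of activations and weights, per-row weight scales.
model = nn.Sequential(nn.Linear(1024, 1024)).to(torch.bfloat16).cuda()
quantize_(model, Float8DynamicActivationFloat8WeightConfig(granularity=PerRow()))
```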
int8 weight configs
Int8DynamicActivationInt8WeightConfig | Configuration for applying int8 dynamic symmetric per-token activation quantization and int8 per-channel weight quantization to linear layers.
Int8WeightOnlyConfig | Configuration for applying int8 weight-only symmetric per-channel quantization to linear layers.
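A minimal sketch contrasting the two int8 configs, assuming the current torchao.quantization import paths; default constructor arguments are used.

```python
import torch.nn as nn
from torchao.quantization import (
    quantize_,
    Int8DynamicActivationInt8WeightConfig,
    Int8WeightOnlyConfig,
)

# Per-token dynamic int8 activation quantization with per-channel int8 weights.
model = nn.Sequential(nn.Linear(512, 512))
quantize_(model, Int8DynamicActivationInt8WeightConfig())

# Weight-only: activations stay in floating point, weights become int8.
model2 = nn.Sequential(nn.Linear(512, 512))
quantize_(model2, Int8WeightOnlyConfig())
```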
int4 weight configs
Int4WeightOnlyConfig | Configuration for int4 weight-only quantization. Only groupwise quantization is supported at the moment. Versions 1 and 2 are both supported; they are implemented differently but provide the same functionality.
Float8DynamicActivationInt4WeightConfig | Configuration for applying float8 dynamic per-row activation quantization and int4 per-group weight quantization to linear layers. Only group_size 128 is supported at the moment, since the underlying kernel supports only group sizes of 128 and above, and larger groups bring no benefit.
Int8DynamicActivationInt4WeightConfig | Configuration for applying int8 dynamic per-token asymmetric activation quantization and int4 per-group symmetric weight quantization to linear layers. This is used to produce models for the ExecuTorch backend, but ExecuTorch does not yet support lowering models quantized through this flow.
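A minimal sketch of int4 weight-only quantization, assuming the current torchao.quantization import path; the bfloat16/CUDA setup and group_size value are illustrative.

```python
import torch
import torch.nn as nn
from torchao.quantization import quantize_, Int4WeightOnlyConfig

# Groupwise int4 weight-only quantization; smaller groups keep more
# per-group scale resolution at the cost of extra metadata.
model = nn.Sequential(nn.Linear(1024, 1024)).to(torch.bfloat16).cuda()
quantize_(model, Int4WeightOnlyConfig(group_size=128))
```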
intx weight configs
IntxWeightOnlyConfig | Configuration for quantizing weights to torch.intx, with 1 <= x <= 8. Weights are quantized with scales/zeros in a groupwise or channelwise manner using the number of bits specified by weight_dtype.
  Parameters:
  - weight_dtype: The dtype to use for weight quantization. Must be torch.intx, where 1 <= x <= 8.
  - granularity: The granularity to use for weight quantization. Must be PerGroup or PerAxis(0).
  - mapping_type: The type of mapping to use for the weight quantization. Must be MappingType.ASYMMETRIC or MappingType.SYMMETRIC.
  - scale_dtype: The dtype to use for the weight scale.
  - intx_packing_format: The format to use for the packed weight tensor (version 2 only).
  - intx_choose_qparams_algorithm: The algorithm to use for choosing the quantization parameters.
  - version: The version of the config to use. Only a subset of the above arguments is valid for each version; see the note for more details.
Int8DynamicActivationIntxWeightConfig | Configuration for dynamically quantizing activations to torch.int8 and weights to torch.intx, with 1 <= x <= 8.
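A minimal sketch of sub-byte weight-only quantization, assuming torch.int4 is available as a dtype in your PyTorch build and that PerGroup is importable from torchao.quantization.granularity.

```python
import torch
import torch.nn as nn
from torchao.quantization import quantize_, IntxWeightOnlyConfig
from torchao.quantization.granularity import PerGroup

# 4-bit weight-only quantization with groupwise scales/zeros.
model = nn.Sequential(nn.Linear(256, 256))
quantize_(
    model,
    IntxWeightOnlyConfig(
        weight_dtype=torch.int4,   # torch.intx with 1 <= x <= 8
        granularity=PerGroup(32),  # groupwise; PerAxis(0) would be channelwise
    ),
)
```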
mx weight configs (prototype)
MXFPInferenceConfig | MX format inference quantization configuration.
nvfp4 weight configs (prototype)
NVFP4InferenceConfig | NVIDIA FP4 (NVFP4) inference quantization configuration.
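A sketch of the prototype flows. These APIs are explicitly prototype, so the torchao.prototype.mx_formats import path below is an assumption and names may move between releases; MX and NVFP4 kernels also require recent NVIDIA hardware.

```python
# Prototype APIs: import paths are assumptions and may change between releases.
import torch
import torch.nn as nn
from torchao.quantization import quantize_
from torchao.prototype.mx_formats import MXFPInferenceConfig, NVFP4InferenceConfig

model = nn.Sequential(nn.Linear(1024, 1024)).to(torch.bfloat16).cuda()
quantize_(model, MXFPInferenceConfig())  # MX format inference quantization

model2 = nn.Sequential(nn.Linear(1024, 1024)).to(torch.bfloat16).cuda()
quantize_(model2, NVFP4InferenceConfig())  # NVFP4 inference quantization
```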