torchao.quantization
Created On: Jan 29, 2026 | Last Updated On: Jan 29, 2026
Main Quantization APIs
quantize_ | Convert the weights of linear modules in the model according to config; the model is modified in place.
ModuleFqnToConfig | Configuration class for applying different quantization configs to modules or parameters based on their fully qualified names (FQNs).
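A minimal usage sketch for these two APIs, assuming the current torchao.quantization import paths. The toy models, the FQN key "0", and the "_default" fallback entry in ModuleFqnToConfig are illustrative assumptions; check your torchao version for the exact dict contract.

```python
import torch.nn as nn
from torchao.quantization import (
    quantize_,
    Int8WeightOnlyConfig,
    Int8DynamicActivationInt8WeightConfig,
    ModuleFqnToConfig,
)

# quantize_ walks the model and swaps linear weights in place.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64))
quantize_(model, Int8WeightOnlyConfig())

# ModuleFqnToConfig picks a config per fully qualified module name;
# "_default" (an assumed fallback key) covers every other module.
model2 = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64))
quantize_(
    model2,
    ModuleFqnToConfig({
        "0": Int8DynamicActivationInt8WeightConfig(),  # the first nn.Linear
        "_default": Int8WeightOnlyConfig(),            # all remaining linears
    }),
)
```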
Workflow Configs
float8 weight configs
Float8DynamicActivationFloat8WeightConfig | Configuration for applying float8 dynamic symmetric quantization to both activations and weights of linear layers.
Float8DynamicActivationFloat8SemiSparseWeightConfig | Applies float8 dynamic quantization to the activations of linear layers, and float8 quantization followed by compression to a sparse semi-structured tensor to their weights.
Float8WeightOnlyConfig | Configuration for applying float8 weight-only symmetric per-channel quantization to linear layers.
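A minimal sketch of the float8 dynamic flow, assuming the current torchao.quantization import paths. The PerRow granularity and the bfloat16/CUDA setup are illustrative; float8 kernels generally require recent NVIDIA hardware.

```python
import torch
import torch.nn as nn
from torchao.quantization import (
    quantize_,
    Float8DynamicActivationFloat8WeightConfig,
    PerRow,
)

# float8 dynamic quantization of activations and weights, per-row weight scales.
model = nn.Sequential(nn.Linear(1024, 1024)).to(torch.bfloat16).cuda()
quantize_(model, Float8DynamicActivationFloat8WeightConfig(granularity=PerRow()))
```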
int8 weight configs
Int8DynamicActivationInt8WeightConfig | Configuration for applying int8 dynamic symmetric per-token activation quantization and int8 per-channel weight quantization to linear layers.
Int8WeightOnlyConfig | Configuration for applying int8 weight-only symmetric per-channel quantization to linear layers.
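A minimal sketch contrasting the two int8 configs, assuming the current torchao.quantization import paths; default constructor arguments are used.

```python
import torch.nn as nn
from torchao.quantization import (
    quantize_,
    Int8DynamicActivationInt8WeightConfig,
    Int8WeightOnlyConfig,
)

# Per-token dynamic int8 activation quantization with per-channel int8 weights.
model = nn.Sequential(nn.Linear(512, 512))
quantize_(model, Int8DynamicActivationInt8WeightConfig())

# Weight-only: activations stay in floating point, weights become int8.
model2 = nn.Sequential(nn.Linear(512, 512))
quantize_(model2, Int8WeightOnlyConfig())
```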
int4 weight configs
Int4WeightOnlyConfig | Configuration for int4 weight-only quantization. Only groupwise quantization is supported at the moment. Versions 1 and 2 are both supported; they are implemented differently but provide the same functionality.
Float8DynamicActivationInt4WeightConfig | Configuration for applying float8 dynamic per-row activation quantization and int4 per-group weight quantization to linear layers. Only group_size 128 is supported at the moment, since the underlying kernel supports only group sizes of 128 and above, and larger groups bring no benefit.
Int8DynamicActivationInt4WeightConfig | Configuration for applying int8 dynamic per-token asymmetric activation quantization and int4 per-group symmetric weight quantization to linear layers. This is used to produce models for the ExecuTorch backend, but ExecuTorch does not yet support lowering models quantized through this flow.
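A minimal sketch of int4 weight-only quantization, assuming the current torchao.quantization import path; the bfloat16/CUDA setup and group_size value are illustrative.

```python
import torch
import torch.nn as nn
from torchao.quantization import quantize_, Int4WeightOnlyConfig

# Groupwise int4 weight-only quantization; smaller groups keep more
# per-group scale resolution at the cost of extra metadata.
model = nn.Sequential(nn.Linear(1024, 1024)).to(torch.bfloat16).cuda()
quantize_(model, Int4WeightOnlyConfig(group_size=128))
```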
intx weight configs
IntxWeightOnlyConfig | Configuration for quantizing weights to torch.intx, with 1 <= x <= 8. Weights are quantized with scales/zeros in a groupwise or channelwise manner using the number of bits specified by weight_dtype.
  Parameters:
  - weight_dtype: The dtype to use for weight quantization. Must be torch.intx, where 1 <= x <= 8.
  - granularity: The granularity to use for weight quantization. Must be PerGroup or PerAxis(0).
  - mapping_type: The type of mapping to use for the weight quantization. Must be MappingType.ASYMMETRIC or MappingType.SYMMETRIC.
  - scale_dtype: The dtype to use for the weight scale.
  - intx_packing_format: The format to use for the packed weight tensor (version 2 only).
  - intx_choose_qparams_algorithm: The algorithm to use for choosing the quantization parameters.
  - version: The version of the config to use. Only a subset of the above arguments is valid for each version; see the note for more details.
Int8DynamicActivationIntxWeightConfig | Configuration for dynamically quantizing activations to torch.int8 and weights to torch.intx, with 1 <= x <= 8.
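A minimal sketch of sub-byte weight-only quantization, assuming torch.int4 is available as a dtype in your PyTorch build and that PerGroup is importable from torchao.quantization.granularity.

```python
import torch
import torch.nn as nn
from torchao.quantization import quantize_, IntxWeightOnlyConfig
from torchao.quantization.granularity import PerGroup

# 4-bit weight-only quantization with groupwise scales/zeros.
model = nn.Sequential(nn.Linear(256, 256))
quantize_(
    model,
    IntxWeightOnlyConfig(
        weight_dtype=torch.int4,   # torch.intx with 1 <= x <= 8
        granularity=PerGroup(32),  # groupwise; PerAxis(0) would be channelwise
    ),
)
```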
mx weight configs (prototype)
MXFPInferenceConfig | MX format inference quantization configuration.
nvfp4 weight configs (prototype)
NVFP4InferenceConfig | NVIDIA FP4 (NVFP4) inference quantization configuration.
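A sketch of the prototype flows. These APIs are explicitly prototype, so the torchao.prototype.mx_formats import path below is an assumption and names may move between releases; MX and NVFP4 kernels also require recent NVIDIA hardware.

```python
# Prototype APIs: import paths are assumptions and may change between releases.
import torch
import torch.nn as nn
from torchao.quantization import quantize_
from torchao.prototype.mx_formats import MXFPInferenceConfig, NVFP4InferenceConfig

model = nn.Sequential(nn.Linear(1024, 1024)).to(torch.bfloat16).cuda()
quantize_(model, MXFPInferenceConfig())  # MX format inference quantization

model2 = nn.Sequential(nn.Linear(1024, 1024)).to(torch.bfloat16).cuda()
quantize_(model2, NVFP4InferenceConfig())  # NVFP4 inference quantization
```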