torchao.quantization

Main Quantization APIs

quantize_

Converts the weights of linear modules in the model according to the given config; the model is modified in place.
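
A minimal sketch of the basic flow, assuming a torchao build where quantize_ accepts the config objects listed below (e.g. Int8WeightOnlyConfig); the toy model is purely illustrative:

```python
import torch
import torch.nn as nn
from torchao.quantization import quantize_, Int8WeightOnlyConfig

# Illustrative toy model; any nn.Module containing nn.Linear children works.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10))
model = model.to(torch.bfloat16).eval()

# Swap the weights of every linear module in place according to the config.
quantize_(model, Int8WeightOnlyConfig())
```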

autoquant

Autoquantization is a process which identifies the fastest way to quantize each layer of a model over some set of potential qtensor subclasses.
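
A hedged sketch of typical usage, assuming a CUDA device and a bf16 model; the wrapping pattern around torch.compile follows the torchao README:

```python
import torch
import torch.nn as nn
from torchao.quantization import autoquant

# Illustrative model; per-layer benchmarking generally assumes a CUDA device.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))
model = model.to(torch.bfloat16).cuda()

# Wrap the (optionally compiled) model; the first real inputs trigger
# per-layer benchmarking and pick the fastest qtensor subclass per layer.
model = autoquant(torch.compile(model, mode="max-autotune"))
model(torch.randn(16, 1024, dtype=torch.bfloat16, device="cuda"))
```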

Inference APIs for quantize_

Int4WeightOnlyConfig

Configuration for applying uint4 weight-only asymmetric per-group quantization to linear layers, using the "tensor_core_tiled" layout for speedup with the tinygemm kernel.
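
A short sketch, assuming a CUDA device and a bfloat16 model (the tinygemm kernel targets bf16); group_size is the main accuracy/speed knob:

```python
import torch
import torch.nn as nn
from torchao.quantization import quantize_, Int4WeightOnlyConfig

model = nn.Sequential(nn.Linear(4096, 4096)).to(torch.bfloat16).cuda().eval()

# Smaller group_size generally improves accuracy; larger improves speed/memory.
quantize_(model, Int4WeightOnlyConfig(group_size=128))
```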

Float8DynamicActivationFloat8WeightConfig

Configuration for applying float8 dynamic symmetric quantization to both activations and weights of linear layers.
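
A sketch of applying it with per-row scaling granularity; PerRow is assumed to be importable from torchao.quantization, and the float8 matmul kernels generally require a GPU with hardware float8 support:

```python
import torch
import torch.nn as nn
from torchao.quantization import (
    quantize_,
    Float8DynamicActivationFloat8WeightConfig,
    PerRow,
)

model = nn.Sequential(nn.Linear(4096, 4096)).to(torch.bfloat16).cuda().eval()

# Dynamic float8 quantization of both activations and weights,
# with one scale per row instead of one scale per tensor.
quantize_(model, Float8DynamicActivationFloat8WeightConfig(granularity=PerRow()))
```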

Float8WeightOnlyConfig

Configuration for applying float8 weight-only symmetric per-channel quantization to linear layers.

Float8StaticActivationFloat8WeightConfig

Configuration for applying float8 static symmetric quantization to both activations and weights of linear layers.

Int8DynamicActivationInt4WeightConfig

Configuration for applying int8 dynamic per-token asymmetric activation quantization and int4 per-group symmetric weight quantization to linear layers. This is intended to produce models for the ExecuTorch backend, but ExecuTorch does not yet support lowering models quantized with this flow.

GemliteUIntXWeightOnlyConfig

Configuration that applies weight-only 4- or 8-bit integer quantization and utilizes the gemlite Triton kernel and its associated weight packing format.

Int8WeightOnlyConfig

Configuration for applying int8 weight-only symmetric per-channel quantization to linear layers.

Int8DynamicActivationInt8WeightConfig

Configuration for applying int8 dynamic symmetric per-token activation quantization and int8 per-channel weight quantization to linear layers.

UIntXWeightOnlyConfig

Configuration for applying uintx weight-only asymmetric per-group quantization to linear layers, using uintx quantization where x is the number of bits specified by dtype.

FPXWeightOnlyConfig

Configuration for sub-byte floating point dtypes defined by ebits (exponent bits) and mbits (mantissa bits), e.g. fp6_e3m2 with ebits=3 and mbits=2.

QAT APIs

IntXQuantizationAwareTrainingConfig

FromIntXQuantizationAwareTrainingConfig

Object that knows how to convert a model with fake quantized modules, such as FakeQuantizedLinear() and FakeQuantizedEmbedding(), back to a model with the original, corresponding modules without fake quantization.

FakeQuantizeConfig

Config for how to fake quantize weights or activations.
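
A hedged end-to-end sketch of the config-based QAT flow; import paths assume torchao.quantization.qat, and the int8-per-token-activation / int4-grouped-weight settings mirror the quantizers described below:

```python
import torch
import torch.nn as nn
from torchao.quantization import quantize_
from torchao.quantization.qat import (
    FakeQuantizeConfig,
    FromIntXQuantizationAwareTrainingConfig,
    IntXQuantizationAwareTrainingConfig,
)

model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))

# 1. Prepare: swap in fake-quantized modules (FakeQuantizedLinear, ...).
activation_config = FakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False)
weight_config = FakeQuantizeConfig(torch.int4, group_size=32)
quantize_(model, IntXQuantizationAwareTrainingConfig(activation_config, weight_config))

# 2. Train or fine-tune the model as usual (training loop omitted here).

# 3. Convert: strip fake quantization and restore the original module types,
#    after which a real post-training quantization config can be applied.
quantize_(model, FromIntXQuantizationAwareTrainingConfig())
```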

Int4WeightOnlyQATQuantizer

Quantizer for performing QAT on a model, where linear layers have int4 fake quantized grouped per channel weights.

Int8DynActInt4WeightQATQuantizer

Quantizer for performing QAT on a model, where linear layers have int8 dynamic per token fake quantized activations and int4 fake quantized grouped per channel weights.
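
A sketch of the older quantizer-based prepare/convert flow, assuming the torchao.quantization.qat import path and default settings:

```python
import torch.nn as nn
from torchao.quantization.qat import Int8DynActInt4WeightQATQuantizer

model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))

qat_quantizer = Int8DynActInt4WeightQATQuantizer()
model = qat_quantizer.prepare(model)   # insert fake quantization
# ... fine-tune the model ...
model = qat_quantizer.convert(model)   # swap in actually quantized modules
```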

Int4WeightOnlyEmbeddingQATQuantizer

Quantizer for performing QAT on a model, where embedding layers have int4 fake quantized grouped per channel weights.

ComposableQATQuantizer

Composable quantizer that users can use to apply multiple QAT quantizers easily.
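
For example (a sketch, assuming both component quantizers are importable from torchao.quantization.qat), a composable quantizer can apply the linear and embedding QAT quantizers above in a single prepare/convert pass:

```python
import torch.nn as nn
from torchao.quantization.qat import (
    ComposableQATQuantizer,
    Int4WeightOnlyEmbeddingQATQuantizer,
    Int8DynActInt4WeightQATQuantizer,
)

model = nn.Sequential(nn.Embedding(1000, 256), nn.Linear(256, 256))

quantizer = ComposableQATQuantizer([
    Int8DynActInt4WeightQATQuantizer(),      # handles linear layers
    Int4WeightOnlyEmbeddingQATQuantizer(),   # handles embedding layers
])
model = quantizer.prepare(model)
# ... fine-tune ...
model = quantizer.convert(model)
```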

initialize_fake_quantizers

(Prototype) Initialize the scales and zero points on all FakeQuantizer modules in the model based on the provided example inputs.

Quantization Primitives

choose_qparams_affine

Chooses the quantization parameters (scale and zero_point) for affine quantization of a floating point (fp32, bf16, or fp16) input Tensor, given the mapping type, block size, and target dtype.

choose_qparams_affine_with_min_max

A variant of the choose_qparams_affine() operator that passes in min_val and max_val directly instead of deriving them from a single input.

choose_qparams_affine_floatx

quantize_affine

Quantizes an original float32, float16, or bfloat16 Tensor into the target dtype using affine quantization with the given block_size, scale, and zero_point.

quantize_affine_floatx

Quantizes a float32 high precision floating point tensor to a low precision floating point format and converts the result to an unpacked floating point representation, e.g. 00SEEEMM for fp6_e3m2, where S is the sign bit, E an exponent bit, and M a mantissa bit.

dequantize_affine

Dequantizes a quantized Tensor, whose dtype should match the input_dtype argument, back to a high precision floating point Tensor using the given block_size, scale, and zero_point.
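
A round-trip sketch tying these primitives together; the argument order follows my reading of the current signatures and may differ across versions, and block_size here means one scale per row:

```python
import torch
from torchao.quantization import (
    MappingType,
    choose_qparams_affine,
    quantize_affine,
    dequantize_affine,
)

x = torch.randn(4, 16, dtype=torch.float32)
block_size = (1, 16)  # one (scale, zero_point) pair per row

# Derive scale/zero_point for symmetric int8 quantization.
scale, zero_point = choose_qparams_affine(
    x, MappingType.SYMMETRIC, block_size, torch.int8
)

# Quantize to int8, then dequantize back to float32.
xq = quantize_affine(x, block_size, scale, zero_point, torch.int8)
xdq = dequantize_affine(xq, block_size, scale, zero_point, torch.int8)

print((x - xdq).abs().max())  # small quantization error
```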

dequantize_affine_floatx

choose_qparams_and_quantize_affine_hqq

fake_quantize_affine

General fake quantize op for quantization-aware training (QAT).

fake_quantize_affine_cachemask

General fake quantize op for quantization-aware training (QAT).

safe_int_mm

Performs a safe integer matrix multiplication, considering different paths for torch.compile, cublas, and fallback cases.
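
A small sketch, assuming safe_int_mm is importable from torchao.quantization and takes two int8 tensors, returning an int32 result:

```python
import torch
from torchao.quantization import safe_int_mm

a = torch.randint(-128, 127, (64, 32), dtype=torch.int8)
b = torch.randint(-128, 127, (32, 16), dtype=torch.int8)

# int8 x int8 -> int32 matmul; dispatches to cuBLAS / torch.compile-friendly
# paths when available and falls back to a reference path otherwise.
c = safe_int_mm(a, b)
print(c.shape, c.dtype)  # torch.Size([64, 16]) torch.int32
```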

int_scaled_matmul

Performs scaled integer matrix multiplication.

MappingType

Enum that specifies how a floating point number is mapped to an integer number (e.g. symmetric or asymmetric mapping).

ZeroPointDomain

Enum that indicates whether zero_point is in the integer domain or the floating point domain.

TorchAODType

Placeholder for dtypes that do not exist in PyTorch core yet.

Other

to_linear_activation_quantized

swap_linear_with_smooth_fq_linear

Replaces linear layers in the model with their SmoothFakeDynamicallyQuantizedLinear equivalents.

smooth_fq_linear_to_inference

Prepares the model for inference by calculating the smoothquant scale for each SmoothFakeDynamicallyQuantizedLinear layer.
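
A sketch of the two-step SmoothQuant flow; the model and calibration loop are illustrative, and the helpers are assumed importable from torchao.quantization:

```python
import torch
import torch.nn as nn
from torchao.quantization import (
    swap_linear_with_smooth_fq_linear,
    smooth_fq_linear_to_inference,
)

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512)).eval()

# 1. Swap nn.Linear modules for SmoothFakeDynamicallyQuantizedLinear.
swap_linear_with_smooth_fq_linear(model)

# 2. Calibrate: run representative inputs so each layer records activation stats.
with torch.no_grad():
    for _ in range(8):
        model(torch.randn(4, 512))

# 3. Compute the smoothquant scales and freeze the layers for inference.
smooth_fq_linear_to_inference(model)
```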
