torchao.quantization

Main Quantization APIs

quantize_

Converts the weights of linear modules in the model according to the given config; the model is modified in place.
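
A minimal usage sketch (the toy model here is illustrative; any config from the list below is passed the same way):

```python
import torch
from torchao.quantization import quantize_, Int8WeightOnlyConfig

# A toy model; quantize_ converts the weights of its Linear modules in place.
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).to(torch.bfloat16)

quantize_(model, Int8WeightOnlyConfig())  # returns None; model is mutated

out = model(torch.randn(8, 1024, dtype=torch.bfloat16))
```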

Inference APIs for quantize_

Int4WeightOnlyConfig

Configuration for int4 weight-only quantization. Only groupwise quantization is supported right now, and both version 1 and version 2 are supported; the two versions are implemented differently but offer the same functionality.
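
A sketch of how the knobs mentioned above are passed (the values are illustrative, and the version keyword is assumed to correspond to the two versions described):

```python
from torchao.quantization import quantize_, Int4WeightOnlyConfig

# Groupwise int4 weight-only quantization: group_size controls how many
# weight elements share one scale/zero_point. int4 kernels typically
# expect a bfloat16 model on CUDA.
quantize_(model, Int4WeightOnlyConfig(group_size=128, version=2))
```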

Float8DynamicActivationInt4WeightConfig

Configuration for applying float8 dynamic per-row activation quantization and int4 per-group weight quantization to linear layers. Only group_size 128 is supported right now, since the underlying kernel only supports group sizes of 128 and above and there is no benefit to making the group size larger.

Float8DynamicActivationFloat8WeightConfig

Configuration for applying float8 dynamic symmetric quantization to both activations and weights of linear layers.
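
A sketch, assuming per-row granularity is selected with the PerRow granularity object from torchao.quantization:

```python
from torchao.quantization import (
    Float8DynamicActivationFloat8WeightConfig,
    PerRow,
    quantize_,
)

# Float8 dynamic quantization of both activations and weights, with one
# scale per row; float8 kernels typically require recent CUDA hardware.
quantize_(model, Float8DynamicActivationFloat8WeightConfig(granularity=PerRow()))
```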

Float8WeightOnlyConfig

Configuration for applying float8 weight-only symmetric per-channel quantization to linear layers.

Int8DynamicActivationInt4WeightConfig

Configuration for applying int8 dynamic per-token asymmetric activation quantization and int4 per-group symmetric weight quantization to linear layers. This is used to produce a model for the ExecuTorch backend, but ExecuTorch does not yet support lowering the quantized model produced by this flow.

Int8WeightOnlyConfig

Configuration for applying int8 weight-only symmetric per-channel quantization to linear layers.

Int8DynamicActivationInt8WeightConfig

Configuration for applying int8 dynamic symmetric per-token activation and int8 per-channel weight quantization to linear layers.

Quantization Primitives

choose_qparams_affine

Chooses the affine quantization parameters (scale and zero_point) for an input Tensor (fp32, bf16, or fp16).

choose_qparams_affine_with_min_max

A variant of the choose_qparams_affine() operator that passes in min_val and max_val directly instead of deriving them from a single input Tensor.

quantize_affine

Quantizes an original float32, float16, or bfloat16 Tensor to the target dtype using affine quantization parameters (scale and zero_point).

dequantize_affine

Dequantizes an affine-quantized Tensor back to floating point; the input Tensor's dtype should match the dtype argument.
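
The three primitives above compose into a quantize/dequantize round trip. A minimal per-tensor symmetric int8 sketch (the argument order follows the torchao.quantization.quant_primitives signatures; treat the exact signatures as assumptions and check the API reference for your version):

```python
import torch
from torchao.quantization.quant_primitives import (
    MappingType,
    choose_qparams_affine,
    dequantize_affine,
    quantize_affine,
)

x = torch.randn(4, 8, dtype=torch.float32)
block_size = x.shape  # one scale/zero_point for the whole tensor (per-tensor)

# Derive scale and zero_point from the input's value range.
scale, zero_point = choose_qparams_affine(
    x, MappingType.SYMMETRIC, block_size, torch.int8
)

# Quantize to int8, then map back to float32.
xq = quantize_affine(x, block_size, scale, zero_point, torch.int8)
xdq = dequantize_affine(xq, block_size, scale, zero_point, torch.int8)

print((x - xdq).abs().max())  # small round-trip error
```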

safe_int_mm

Performs a safe integer matrix multiplication, considering different paths for torch.compile, cublas, and fallback cases.
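
A sketch (the torchao.kernel import path is an assumption; check where safe_int_mm is exposed in your version):

```python
import torch
from torchao.kernel import safe_int_mm  # import path is an assumption

a = torch.randint(-128, 128, (16, 32), dtype=torch.int8)
b = torch.randint(-128, 128, (32, 64), dtype=torch.int8)

# int8 x int8 -> int32 matmul; dispatches to cuBLAS / torch.compile paths
# where available and falls back to a reference path otherwise.
c = safe_int_mm(a, b)
print(c.shape, c.dtype)  # torch.Size([16, 64]) torch.int32
```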

int_scaled_matmul

Performs scaled integer matrix multiplication.

MappingType

How a floating-point number is mapped to an integer number (e.g. MappingType.SYMMETRIC or MappingType.ASYMMETRIC).

TorchAODType

Placeholder for dtypes that do not exist in PyTorch core yet.
