torchao.quantization

Main Quantization APIs

quantize_

Converts the weights of linear modules in the model according to the given config; the model is modified in place.
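
A minimal sketch of the basic flow, assuming a torchao build where quantize_ accepts the config objects listed below (e.g. Int8WeightOnlyConfig); the toy model is purely illustrative:

```python
import torch
import torch.nn as nn
from torchao.quantization import quantize_, Int8WeightOnlyConfig

# Illustrative toy model; any nn.Module containing nn.Linear children works.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10))
model = model.to(torch.bfloat16).eval()

# Swap the weights of every linear module in place according to the config.
quantize_(model, Int8WeightOnlyConfig())
```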

autoquant

Autoquantization is a process which identifies the fastest way to quantize each layer of a model over some set of potential qtensor subclasses.
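
A hedged sketch of typical usage, assuming a CUDA device and a bf16 model; the wrapping pattern around torch.compile follows the torchao README:

```python
import torch
import torch.nn as nn
from torchao.quantization import autoquant

# Illustrative model; per-layer benchmarking generally assumes a CUDA device.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))
model = model.to(torch.bfloat16).cuda()

# Wrap the (optionally compiled) model; the first real inputs trigger
# per-layer benchmarking and pick the fastest qtensor subclass per layer.
model = autoquant(torch.compile(model, mode="max-autotune"))
model(torch.randn(16, 1024, dtype=torch.bfloat16, device="cuda"))
```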

Inference APIs for quantize_

Int4WeightOnlyConfig

Configuration for applying uint4 weight-only asymmetric per-group quantization to linear layers, using the "tensor_core_tiled" layout for speedup with the tinygemm kernel.
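
A short sketch, assuming a CUDA device and a bfloat16 model (the tinygemm kernel targets bf16); group_size is the main accuracy/speed knob:

```python
import torch
import torch.nn as nn
from torchao.quantization import quantize_, Int4WeightOnlyConfig

model = nn.Sequential(nn.Linear(4096, 4096)).to(torch.bfloat16).cuda().eval()

# Smaller group_size generally improves accuracy; larger improves speed/memory.
quantize_(model, Int4WeightOnlyConfig(group_size=128))
```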

Float8DynamicActivationFloat8WeightConfig

Configuration for applying float8 dynamic symmetric quantization to both activations and weights of linear layers.
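
A sketch of applying it with per-row scaling granularity; PerRow is assumed to be importable from torchao.quantization, and the float8 matmul kernels generally require a GPU with hardware float8 support:

```python
import torch
import torch.nn as nn
from torchao.quantization import (
    quantize_,
    Float8DynamicActivationFloat8WeightConfig,
    PerRow,
)

model = nn.Sequential(nn.Linear(4096, 4096)).to(torch.bfloat16).cuda().eval()

# Dynamic float8 quantization of both activations and weights,
# with one scale per row instead of one scale per tensor.
quantize_(model, Float8DynamicActivationFloat8WeightConfig(granularity=PerRow()))
```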

Float8WeightOnlyConfig

Configuration for applying float8 weight-only symmetric per-channel quantization to linear layers.

Float8StaticActivationFloat8WeightConfig

Configuration for applying float8 static symmetric quantization to both activations and weights of linear layers.

Int8DynamicActivationInt4WeightConfig

Configuration for applying int8 dynamic per-token asymmetric activation quantization and int4 per-group symmetric weight quantization to linear layers. This is intended to produce models for the ExecuTorch backend, but ExecuTorch does not yet support lowering models quantized with this flow.

GemliteUIntXWeightOnlyConfig

Configuration that applies weight-only 4- or 8-bit integer quantization and utilizes the gemlite Triton kernel and its associated weight packing format.

Int8WeightOnlyConfig

Configuration for applying int8 weight-only symmetric per-channel quantization to linear layers.

Int8DynamicActivationInt8WeightConfig

Configuration for applying int8 dynamic symmetric per-token activation quantization and int8 per-channel weight quantization to linear layers.

UIntXWeightOnlyConfig

Configuration for applying uintx weight-only asymmetric per-group quantization to linear layers, using uintx quantization where x is the number of bits specified by dtype.

FPXWeightOnlyConfig

Configuration for sub-byte floating point dtypes defined by ebits (exponent bits) and mbits (mantissa bits), e.g. fp6_e3m2 with ebits=3 and mbits=2.

QAT APIs

IntXQuantizationAwareTrainingConfig

FromIntXQuantizationAwareTrainingConfig

Object that knows how to convert a model with fake quantized modules, such as FakeQuantizedLinear() and FakeQuantizedEmbedding(), back to a model with the original, corresponding modules without fake quantization.

FakeQuantizeConfig

Config for how to fake quantize weights or activations.
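
A hedged end-to-end sketch of the config-based QAT flow; import paths assume torchao.quantization.qat, and the int8-per-token-activation / int4-grouped-weight settings mirror the quantizers described below:

```python
import torch
import torch.nn as nn
from torchao.quantization import quantize_
from torchao.quantization.qat import (
    FakeQuantizeConfig,
    FromIntXQuantizationAwareTrainingConfig,
    IntXQuantizationAwareTrainingConfig,
)

model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))

# 1. Prepare: swap in fake-quantized modules (FakeQuantizedLinear, ...).
activation_config = FakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False)
weight_config = FakeQuantizeConfig(torch.int4, group_size=32)
quantize_(model, IntXQuantizationAwareTrainingConfig(activation_config, weight_config))

# 2. Train or fine-tune the model as usual (training loop omitted here).

# 3. Convert: strip fake quantization and restore the original module types,
#    after which a real post-training quantization config can be applied.
quantize_(model, FromIntXQuantizationAwareTrainingConfig())
```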

Int4WeightOnlyQATQuantizer

Quantizer for performing QAT on a model, where linear layers have int4 fake quantized grouped per channel weights.

Int8DynActInt4WeightQATQuantizer

Quantizer for performing QAT on a model, where linear layers have int8 dynamic per token fake quantized activations and int4 fake quantized grouped per channel weights.
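
A sketch of the older quantizer-based prepare/convert flow, assuming the torchao.quantization.qat import path and default settings:

```python
import torch.nn as nn
from torchao.quantization.qat import Int8DynActInt4WeightQATQuantizer

model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))

qat_quantizer = Int8DynActInt4WeightQATQuantizer()
model = qat_quantizer.prepare(model)   # insert fake quantization
# ... fine-tune the model ...
model = qat_quantizer.convert(model)   # swap in actually quantized modules
```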

Int4WeightOnlyEmbeddingQATQuantizer

Quantizer for performing QAT on a model, where embedding layers have int4 fake quantized grouped per channel weights.

ComposableQATQuantizer

Composable quantizer that users can use to apply multiple QAT quantizers easily.
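
For example (a sketch, assuming both component quantizers are importable from torchao.quantization.qat), a composable quantizer can apply the linear and embedding QAT quantizers above in a single prepare/convert pass:

```python
import torch.nn as nn
from torchao.quantization.qat import (
    ComposableQATQuantizer,
    Int4WeightOnlyEmbeddingQATQuantizer,
    Int8DynActInt4WeightQATQuantizer,
)

model = nn.Sequential(nn.Embedding(1000, 256), nn.Linear(256, 256))

quantizer = ComposableQATQuantizer([
    Int8DynActInt4WeightQATQuantizer(),      # handles linear layers
    Int4WeightOnlyEmbeddingQATQuantizer(),   # handles embedding layers
])
model = quantizer.prepare(model)
# ... fine-tune ...
model = quantizer.convert(model)
```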

initialize_fake_quantizers

(Prototype) Initialize the scales and zero points on all FakeQuantizer modules in the model based on the provided example inputs.

Quantization Primitives

choose_qparams_affine

Chooses the quantization parameters (scale and zero_point) for affine quantization of a floating point (fp32, bf16, or fp16) input Tensor, given the mapping type, block size, and target dtype.

choose_qparams_affine_with_min_max

A variant of the choose_qparams_affine() operator that passes in min_val and max_val directly instead of deriving them from a single input.

choose_qparams_affine_floatx

quantize_affine

Quantizes an original float32, float16, or bfloat16 Tensor into the target dtype using affine quantization with the given block_size, scale, and zero_point.

quantize_affine_floatx

Quantizes a float32 high precision floating point tensor to a low precision floating point format and converts the result to an unpacked floating point representation, e.g. 00SEEEMM for fp6_e3m2, where S is the sign bit, E an exponent bit, and M a mantissa bit.

dequantize_affine

Dequantizes a quantized Tensor, whose dtype should match the input_dtype argument, back to a high precision floating point Tensor using the given block_size, scale, and zero_point.
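
A round-trip sketch tying these primitives together; the argument order follows my reading of the current signatures and may differ across versions, and block_size here means one scale per row:

```python
import torch
from torchao.quantization import (
    MappingType,
    choose_qparams_affine,
    quantize_affine,
    dequantize_affine,
)

x = torch.randn(4, 16, dtype=torch.float32)
block_size = (1, 16)  # one (scale, zero_point) pair per row

# Derive scale/zero_point for symmetric int8 quantization.
scale, zero_point = choose_qparams_affine(
    x, MappingType.SYMMETRIC, block_size, torch.int8
)

# Quantize to int8, then dequantize back to float32.
xq = quantize_affine(x, block_size, scale, zero_point, torch.int8)
xdq = dequantize_affine(xq, block_size, scale, zero_point, torch.int8)

print((x - xdq).abs().max())  # small quantization error
```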

dequantize_affine_floatx

choose_qparams_and_quantize_affine_hqq

fake_quantize_affine

General fake quantize op for quantization-aware training (QAT).

fake_quantize_affine_cachemask

General fake quantize op for quantization-aware training (QAT).

safe_int_mm

Performs a safe integer matrix multiplication, considering different paths for torch.compile, cublas, and fallback cases.
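
A small sketch, assuming safe_int_mm is importable from torchao.quantization and takes two int8 tensors, returning an int32 result:

```python
import torch
from torchao.quantization import safe_int_mm

a = torch.randint(-128, 127, (64, 32), dtype=torch.int8)
b = torch.randint(-128, 127, (32, 16), dtype=torch.int8)

# int8 x int8 -> int32 matmul; dispatches to cuBLAS / torch.compile-friendly
# paths when available and falls back to a reference path otherwise.
c = safe_int_mm(a, b)
print(c.shape, c.dtype)  # torch.Size([64, 16]) torch.int32
```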

int_scaled_matmul

Performs scaled integer matrix multiplication.

MappingType

Enum that specifies how a floating point number is mapped to an integer number (e.g. symmetric or asymmetric mapping).

ZeroPointDomain

Enum that indicates whether zero_point is in the integer domain or the floating point domain.

TorchAODType

Placeholder for dtypes that do not exist in PyTorch core yet.

Other

to_linear_activation_quantized

swap_linear_with_smooth_fq_linear

Replaces linear layers in the model with their SmoothFakeDynamicallyQuantizedLinear equivalents.

smooth_fq_linear_to_inference

Prepares the model for inference by calculating the smoothquant scale for each SmoothFakeDynamicallyQuantizedLinear layer.
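
A sketch of the two-step SmoothQuant flow; the model and calibration loop are illustrative, and the helpers are assumed importable from torchao.quantization:

```python
import torch
import torch.nn as nn
from torchao.quantization import (
    swap_linear_with_smooth_fq_linear,
    smooth_fq_linear_to_inference,
)

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512)).eval()

# 1. Swap nn.Linear modules for SmoothFakeDynamicallyQuantizedLinear.
swap_linear_with_smooth_fq_linear(model)

# 2. Calibrate: run representative inputs so each layer records activation stats.
with torch.no_grad():
    for _ in range(8):
        model(torch.randn(4, 512))

# 3. Compute the smoothquant scales and freeze the layers for inference.
smooth_fq_linear_to_inference(model)
```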
