
torchao.quantization

autoquant

Autoquantization is a process that identifies the fastest way to quantize each layer of a model from a set of candidate qtensor subclasses.
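
A minimal usage sketch, assuming a CUDA device and a bfloat16 model; autoquant wraps the (optionally compiled) model, and the first forward pass triggers the per-layer benchmarking:

```python
import torch
import torchao

# toy model; any model containing nn.Linear layers works the same way
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).cuda().to(torch.bfloat16)

# wrap with autoquant; the first inference benchmarks each linear layer
# against the candidate quantized tensor subclasses and keeps the fastest
model = torchao.autoquant(torch.compile(model, mode="max-autotune"))

x = torch.randn(16, 1024, device="cuda", dtype=torch.bfloat16)
model(x)  # triggers benchmarking and finalizes the per-layer quantization choices
```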

quantize_

Converts the weights of linear modules in the model using apply_tensor_subclass; the model is modified in place.
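
A minimal sketch of the call pattern; the other weight-only and dynamic-activation configs listed below are passed in the same way:

```python
import torch
from torchao.quantization import quantize_, int8_weight_only

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).to(torch.bfloat16)

# swaps the weight of every nn.Linear for a quantized tensor subclass,
# modifying the model in place
quantize_(model, int8_weight_only())
```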

int8_dynamic_activation_int4_weight

Applies int8 dynamic per-token asymmetric activation quantization and int4 per-group symmetric weight quantization to linear layers. This flow is intended to produce models for the ExecuTorch backend, but ExecuTorch does not yet support lowering the quantized model produced by it.
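
A sketch of the call, assuming the config exposes a group_size argument for the int4 weight grouping:

```python
import torch
from torchao.quantization import quantize_, int8_dynamic_activation_int4_weight

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024))

# group_size sets the per-group granularity of the int4 weight quantization
quantize_(model, int8_dynamic_activation_int4_weight(group_size=32))
```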

int8_dynamic_activation_int8_weight

Applies int8 dynamic symmetric per-token activation and int8 per-channel weight quantization to linear layers

int4_weight_only

Applies uint4 weight-only asymmetric per-group quantization to linear layers, using the "tensor_core_tiled" layout for speedup with the tinygemm kernel.
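
A sketch assuming a CUDA device and a bfloat16 model, which is what the tinygemm kernel targets; the group_size argument (an assumption here) trades accuracy against speed and memory:

```python
import torch
from torchao.quantization import quantize_, int4_weight_only

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).cuda().to(torch.bfloat16)

# smaller groups -> better accuracy, larger groups -> less overhead
quantize_(model, int4_weight_only(group_size=128))
```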

int8_weight_only

Applies int8 weight-only symmetric per-channel quantization to linear layers.

float8_weight_only

Applies float8 weight-only symmetric per-channel quantization to linear layers.

float8_dynamic_activation_float8_weight

Applies float8 dynamic symmetric quantization to both activations and weights of linear layers.
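
A sketch with default dtypes, assuming hardware with float8 support (e.g. H100-class GPUs):

```python
import torch
from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).cuda().to(torch.bfloat16)

# activation scales are computed on the fly at inference time ("dynamic")
quantize_(model, float8_dynamic_activation_float8_weight())
```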

float8_static_activation_float8_weight

Applies float8 static symmetric quantization to activations and float8 quantization to the weights of linear layers.
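
A rough sketch; the assumption here is that the config takes a precomputed activation scale (e.g. obtained from a calibration pass), which is what distinguishes the static variant from the dynamic one:

```python
import torch
from torchao.quantization import quantize_, float8_static_activation_float8_weight

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).cuda().to(torch.bfloat16)

# placeholder scale; in practice this would come from calibration data
act_scale = torch.tensor(1.0, device="cuda")
quantize_(model, float8_static_activation_float8_weight(scale=act_scale))
```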

uintx_weight_only

Applies uintx weight-only asymmetric per-group quantization to linear layers, where x, the number of bits, is specified by the dtype argument (see the sketch after the next entry).

fpx_weight_only

Applies sub-byte floating-point weight-only quantization to linear layers, with the dtype defined by ebits (exponent bits) and mbits (mantissa bits), e.g. fp6 with 3 exponent and 2 mantissa bits.
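
A combined sketch for the two sub-byte configs above; the specific arguments (torch.uint4, ebits/mbits values) are illustrative assumptions:

```python
import torch
from torchao.quantization import quantize_, uintx_weight_only, fpx_weight_only

model_a = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).cuda().to(torch.bfloat16)
quantize_(model_a, uintx_weight_only(torch.uint4))     # 4-bit unsigned integer weights

model_b = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).cuda().to(torch.float16)
quantize_(model_b, fpx_weight_only(ebits=3, mbits=2))  # fp6: 1 sign + 3 exponent + 2 mantissa bits
```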

to_linear_activation_quantized

swap_linear_with_smooth_fq_linear

Replaces linear layers in the model with their SmoothFakeDynamicallyQuantizedLinear equivalents.

smooth_fq_linear_to_inference

Prepares the model for inference by calculating the smoothquant scale for each SmoothFakeDynamicallyQuantizedLinear layer.
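
A sketch of the two-step SmoothQuant flow covering both helpers above, assuming each modifies the model in place:

```python
import torch
from torchao.quantization import (
    swap_linear_with_smooth_fq_linear,
    smooth_fq_linear_to_inference,
)

model = torch.nn.Sequential(torch.nn.Linear(512, 512)).eval()

# 1) swap nn.Linear layers for SmoothFakeDynamicallyQuantizedLinear
swap_linear_with_smooth_fq_linear(model)

# 2) run a few calibration batches so activation statistics are observed
with torch.no_grad():
    for _ in range(8):
        model(torch.randn(4, 512))

# 3) freeze the SmoothQuant scales for inference
smooth_fq_linear_to_inference(model)
```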

choose_qparams_affine

Chooses the affine quantization parameters (scale and zero_point). Parameters: input – fp32, bf16, or fp16 input Tensor.

choose_qparams_affine_with_min_max

A variant of the choose_qparams_affine() operator that takes min_val and max_val directly instead of deriving them from the input Tensor.

choose_qparams_affine_floatx

quantize_affine

Quantizes a high-precision floating-point Tensor using affine quantization. Parameters: input – original float32, float16, or bfloat16 Tensor.

quantize_affine_floatx

Quantizes a float32 high-precision floating-point tensor to a low-precision floating-point format and converts the result to an unpacked representation, e.g. 00SEEEMM for fp6_e3m2, where S is the sign bit, E the exponent bits, and M the mantissa bits.

dequantize_affine

Dequantizes an affine-quantized Tensor back to high precision. Parameters: input – quantized Tensor whose dtype should match the dtype argument.
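
A roundtrip sketch for the affine primitives; the argument order (input, mapping type, block_size, dtype for choosing qparams; input, block_size, scale, zero_point, dtype for quantize/dequantize) is an assumption based on the parameter descriptions above, and block_size=(1, 16) means one scale/zero_point per row:

```python
import torch
from torchao.quantization import (
    MappingType,
    choose_qparams_affine,
    quantize_affine,
    dequantize_affine,
)

x = torch.randn(4, 16, dtype=torch.float32)
block_size = (1, 16)  # one scale / zero_point per row

scale, zero_point = choose_qparams_affine(x, MappingType.ASYMMETRIC, block_size, torch.int8)
xq = quantize_affine(x, block_size, scale, zero_point, torch.int8)       # int8 values
xdq = dequantize_affine(xq, block_size, scale, zero_point, torch.int8)   # back to float
```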

dequantize_affine_floatx

choose_qparams_and_quantize_affine_hqq

fake_quantize_affine

General fake quantize op for quantization-aware training (QAT).

fake_quantize_affine_cachemask

General fake quantize op for quantization-aware training (QAT).
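
A QAT-style sketch, reusing the same assumed argument layout as the primitives above; fake quantization simulates the quantization error while keeping the tensor in its original floating-point dtype:

```python
import torch
from torchao.quantization import MappingType, choose_qparams_affine, fake_quantize_affine

w = torch.randn(16, 16)
block_size = (1, 16)

scale, zero_point = choose_qparams_affine(w, MappingType.SYMMETRIC, block_size, torch.int8)

# quantize + dequantize in one op; the result stays in w's original dtype
w_fq = fake_quantize_affine(w, block_size, scale, zero_point, torch.int8)
```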

safe_int_mm

Performs a safe integer matrix multiplication, considering different paths for torch.compile, cublas, and fallback cases.
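
A small sketch; the import path from torchao.quantization is assumed from this listing:

```python
import torch
from torchao.quantization import safe_int_mm  # import path assumed from this API listing

a = torch.randint(-128, 128, (16, 32), dtype=torch.int8)
b = torch.randint(-128, 128, (32, 8), dtype=torch.int8)

# int8 x int8 matmul accumulated in int32, dispatching to the best available path
c = safe_int_mm(a, b)
print(c.dtype)  # torch.int32
```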

int_scaled_matmul

Performs scaled integer matrix multiplication.

MappingType

How floating-point values are mapped to integer values (e.g., symmetric or asymmetric mapping).

ZeroPointDomain

Enum that indicates whether the zero_point is in the integer domain or the floating-point domain.
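
Illustrative enum usage (member names assumed: SYMMETRIC/ASYMMETRIC for MappingType, INT/FLOAT for ZeroPointDomain):

```python
from torchao.quantization import MappingType, ZeroPointDomain

weight_mapping = MappingType.SYMMETRIC    # scale only; zero_point fixed at zero
act_mapping = MappingType.ASYMMETRIC      # scale and zero_point
zp_domain = ZeroPointDomain.INT           # zero_point stored as an integer value
```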

TorchAODType

Placeholder for dtypes that do not exist in PyTorch core yet.
