
torchao.quantization

autoquant

Autoquantization is a process that identifies the fastest way to quantize each layer of a model from a set of candidate qtensor subclasses.
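
A minimal usage sketch, assuming a CUDA device and a bfloat16 model; autoquant wraps the (optionally compiled) model, and the first forward pass triggers the per-layer benchmarking:

```python
import torch
import torchao

# toy model; any model containing nn.Linear layers works the same way
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).cuda().to(torch.bfloat16)

# wrap with autoquant; the first inference benchmarks each linear layer
# against the candidate quantized tensor subclasses and keeps the fastest
model = torchao.autoquant(torch.compile(model, mode="max-autotune"))

x = torch.randn(16, 1024, device="cuda", dtype=torch.bfloat16)
model(x)  # triggers benchmarking and finalizes the per-layer quantization choices
```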

quantize_

Converts the weights of linear modules in the model using apply_tensor_subclass; the model is modified in place.
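
A minimal sketch of the call pattern; the other weight-only and dynamic-activation configs listed below are passed in the same way:

```python
import torch
from torchao.quantization import quantize_, int8_weight_only

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).to(torch.bfloat16)

# swaps the weight of every nn.Linear for a quantized tensor subclass,
# modifying the model in place
quantize_(model, int8_weight_only())
```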

int8_dynamic_activation_int4_weight

Applies int8 dynamic per-token asymmetric activation quantization and int4 per-group symmetric weight quantization to linear layers. This flow is intended to produce models for the ExecuTorch backend, but ExecuTorch does not yet support lowering the quantized model produced by it.
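
A sketch of the call, assuming the config exposes a group_size argument for the int4 weight grouping:

```python
import torch
from torchao.quantization import quantize_, int8_dynamic_activation_int4_weight

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024))

# group_size sets the per-group granularity of the int4 weight quantization
quantize_(model, int8_dynamic_activation_int4_weight(group_size=32))
```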

int8_dynamic_activation_int8_weight

Applies int8 dynamic symmetric per-token activation and int8 per-channel weight quantization to linear layers

int4_weight_only

Applies uint4 weight-only asymmetric per-group quantization to linear layers, using the "tensor_core_tiled" layout for speedup with the tinygemm kernel.
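
A sketch assuming a CUDA device and a bfloat16 model, which is what the tinygemm kernel targets; the group_size argument (an assumption here) trades accuracy against speed and memory:

```python
import torch
from torchao.quantization import quantize_, int4_weight_only

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).cuda().to(torch.bfloat16)

# smaller groups -> better accuracy, larger groups -> less overhead
quantize_(model, int4_weight_only(group_size=128))
```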

int8_weight_only

Applies int8 weight-only symmetric per-channel quantization to linear layers.

float8_weight_only

Applies float8 weight-only symmetric per-channel quantization to linear layers.

float8_dynamic_activation_float8_weight

Applies float8 dynamic symmetric quantization to both activations and weights of linear layers.
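
A sketch with default dtypes, assuming hardware with float8 support (e.g. H100-class GPUs):

```python
import torch
from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).cuda().to(torch.bfloat16)

# activation scales are computed on the fly at inference time ("dynamic")
quantize_(model, float8_dynamic_activation_float8_weight())
```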

float8_static_activation_float8_weight

Applies float8 static symmetric quantization to activations and float8 quantization to the weights of linear layers.
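
A rough sketch; the assumption here is that the config takes a precomputed activation scale (e.g. obtained from a calibration pass), which is what distinguishes the static variant from the dynamic one:

```python
import torch
from torchao.quantization import quantize_, float8_static_activation_float8_weight

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).cuda().to(torch.bfloat16)

# placeholder scale; in practice this would come from calibration data
act_scale = torch.tensor(1.0, device="cuda")
quantize_(model, float8_static_activation_float8_weight(scale=act_scale))
```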

uintx_weight_only

Applies uintx weight-only asymmetric per-group quantization to linear layers, where x, the number of bits, is specified by the dtype argument (see the sketch after the next entry).

fpx_weight_only

Applies sub-byte floating-point weight-only quantization to linear layers, with the dtype defined by ebits (exponent bits) and mbits (mantissa bits), e.g. fp6 with 3 exponent and 2 mantissa bits.
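
A combined sketch for the two sub-byte configs above; the specific arguments (torch.uint4, ebits/mbits values) are illustrative assumptions:

```python
import torch
from torchao.quantization import quantize_, uintx_weight_only, fpx_weight_only

model_a = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).cuda().to(torch.bfloat16)
quantize_(model_a, uintx_weight_only(torch.uint4))     # 4-bit unsigned integer weights

model_b = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).cuda().to(torch.float16)
quantize_(model_b, fpx_weight_only(ebits=3, mbits=2))  # fp6: 1 sign + 3 exponent + 2 mantissa bits
```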

to_linear_activation_quantized

swap_linear_with_smooth_fq_linear

Replaces linear layers in the model with their SmoothFakeDynamicallyQuantizedLinear equivalents.

smooth_fq_linear_to_inference

Prepares the model for inference by calculating the smoothquant scale for each SmoothFakeDynamicallyQuantizedLinear layer.
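
A sketch of the two-step SmoothQuant flow covering both helpers above, assuming each modifies the model in place:

```python
import torch
from torchao.quantization import (
    swap_linear_with_smooth_fq_linear,
    smooth_fq_linear_to_inference,
)

model = torch.nn.Sequential(torch.nn.Linear(512, 512)).eval()

# 1) swap nn.Linear layers for SmoothFakeDynamicallyQuantizedLinear
swap_linear_with_smooth_fq_linear(model)

# 2) run a few calibration batches so activation statistics are observed
with torch.no_grad():
    for _ in range(8):
        model(torch.randn(4, 512))

# 3) freeze the SmoothQuant scales for inference
smooth_fq_linear_to_inference(model)
```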

choose_qparams_affine

Chooses the affine quantization parameters (scale and zero_point). Parameters: input – fp32, bf16, or fp16 input Tensor.

choose_qparams_affine_with_min_max

A variant of the choose_qparams_affine() operator that takes min_val and max_val directly instead of deriving them from the input Tensor.

choose_qparams_affine_floatx

quantize_affine

Quantizes a high-precision floating-point Tensor using affine quantization. Parameters: input – original float32, float16, or bfloat16 Tensor.

quantize_affine_floatx

Quantizes a float32 high-precision floating-point tensor to a low-precision floating-point format and converts the result to an unpacked representation, e.g. 00SEEEMM for fp6_e3m2, where S is the sign bit, E the exponent bits, and M the mantissa bits.

dequantize_affine

Dequantizes an affine-quantized Tensor back to high precision. Parameters: input – quantized Tensor whose dtype should match the dtype argument.
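
A roundtrip sketch for the affine primitives; the argument order (input, mapping type, block_size, dtype for choosing qparams; input, block_size, scale, zero_point, dtype for quantize/dequantize) is an assumption based on the parameter descriptions above, and block_size=(1, 16) means one scale/zero_point per row:

```python
import torch
from torchao.quantization import (
    MappingType,
    choose_qparams_affine,
    quantize_affine,
    dequantize_affine,
)

x = torch.randn(4, 16, dtype=torch.float32)
block_size = (1, 16)  # one scale / zero_point per row

scale, zero_point = choose_qparams_affine(x, MappingType.ASYMMETRIC, block_size, torch.int8)
xq = quantize_affine(x, block_size, scale, zero_point, torch.int8)       # int8 values
xdq = dequantize_affine(xq, block_size, scale, zero_point, torch.int8)   # back to float
```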

dequantize_affine_floatx

choose_qparams_and_quantize_affine_hqq

fake_quantize_affine

General fake quantize op for quantization-aware training (QAT).

fake_quantize_affine_cachemask

General fake quantize op for quantization-aware training (QAT).
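
A QAT-style sketch, reusing the same assumed argument layout as the primitives above; fake quantization simulates the quantization error while keeping the tensor in its original floating-point dtype:

```python
import torch
from torchao.quantization import MappingType, choose_qparams_affine, fake_quantize_affine

w = torch.randn(16, 16)
block_size = (1, 16)

scale, zero_point = choose_qparams_affine(w, MappingType.SYMMETRIC, block_size, torch.int8)

# quantize + dequantize in one op; the result stays in w's original dtype
w_fq = fake_quantize_affine(w, block_size, scale, zero_point, torch.int8)
```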

safe_int_mm

Performs a safe integer matrix multiplication, considering different paths for torch.compile, cublas, and fallback cases.
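
A small sketch; the import path from torchao.quantization is assumed from this listing:

```python
import torch
from torchao.quantization import safe_int_mm  # import path assumed from this API listing

a = torch.randint(-128, 128, (16, 32), dtype=torch.int8)
b = torch.randint(-128, 128, (32, 8), dtype=torch.int8)

# int8 x int8 matmul accumulated in int32, dispatching to the best available path
c = safe_int_mm(a, b)
print(c.dtype)  # torch.int32
```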

int_scaled_matmul

Performs scaled integer matrix multiplication.

MappingType

How floating-point values are mapped to integer values (e.g., symmetric or asymmetric mapping).

ZeroPointDomain

Enum that indicates whether the zero_point is in the integer domain or the floating-point domain.
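
Illustrative enum usage (member names assumed: SYMMETRIC/ASYMMETRIC for MappingType, INT/FLOAT for ZeroPointDomain):

```python
from torchao.quantization import MappingType, ZeroPointDomain

weight_mapping = MappingType.SYMMETRIC    # scale only; zero_point fixed at zero
act_mapping = MappingType.ASYMMETRIC      # scale and zero_point
zp_domain = ZeroPointDomain.INT           # zero_point stored as an integer value
```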

TorchAODType

Placeholder for dtypes that do not exist in PyTorch core yet.
