torchao.quantization
Main Quantization APIs
Converts the weights of linear modules in the model according to the given config; the model is modified in place.
Autoquantization is a process that identifies the fastest way to quantize each layer of a model over a set of potential quantized tensor subclasses.
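For orientation, here is a minimal sketch of both entry points. The import paths and config class names follow the tables below, but details of the autoquant flow (for example, exactly when benchmarking runs) may vary across torchao releases.

```python
import torch
import torch.nn as nn
from torchao.quantization import quantize_, Int8WeightOnlyConfig, autoquant

# Explicit flow: pick a config and rewrite linear weights in place.
model = nn.Sequential(nn.Linear(1024, 1024)).cuda().to(torch.bfloat16)
quantize_(model, Int8WeightOnlyConfig())

# Autoquant flow: benchmark candidate quantized tensor subclasses per
# layer and keep the fastest; the first forward pass triggers selection.
model2 = nn.Sequential(nn.Linear(1024, 1024)).cuda().to(torch.bfloat16)
model2 = autoquant(torch.compile(model2, mode="max-autotune"))
model2(torch.randn(16, 1024, device="cuda", dtype=torch.bfloat16))
```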
Inference APIs for quantize_
Configuration for applying uint4 weight-only asymmetric per-group quantization to linear layers, using the "tensor_core_tiled" layout for speedup with the tinygemm kernel.
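For example, a sketch of the int4 weight-only flow; group_size=128 is an illustrative choice, and the tinygemm path expects bfloat16 weights on CUDA:

```python
import torch
import torch.nn as nn
from torchao.quantization import quantize_, Int4WeightOnlyConfig

model = nn.Sequential(nn.Linear(1024, 1024)).cuda().to(torch.bfloat16)
quantize_(model, Int4WeightOnlyConfig(group_size=128))
```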
Configuration for applying float8 dynamic symmetric quantization to both activations and weights of linear layers. |
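A sketch of the float8 dynamic flow. Per-row granularity is shown on the assumption that PerRow is exported next to the config (true in recent torchao releases); float8 matmuls also require recent CUDA hardware (compute capability 8.9+):

```python
import torch
import torch.nn as nn
from torchao.quantization import (
    quantize_,
    Float8DynamicActivationFloat8WeightConfig,
    PerRow,
)

model = nn.Sequential(nn.Linear(2048, 2048)).cuda().to(torch.bfloat16)
quantize_(model, Float8DynamicActivationFloat8WeightConfig(granularity=PerRow()))
```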
Configuration for applying float8 weight-only symmetric per-channel quantization to linear layers. |
Configuration for applying float8 static symmetric quantization to both activations and weights of linear layers.
Configuration for applying int8 dynamic per-token asymmetric activation quantization and int4 per-group symmetric weight quantization to linear layers. This flow is intended to produce models for the ExecuTorch backend, but ExecuTorch does not yet support lowering the quantized models it produces.
Applies weight-only 4- or 8-bit integer quantization, using the gemlite Triton kernel and its associated weight-packing format.
Configuration for applying int8 weight-only symmetric per-channel quantization to linear layers. |
Configuration for applying int8 dynamic symmetric per-token activation and int8 per-channel weight quantization to linear layers.
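This config takes no required arguments, so the sketch is short; unlike the int4 tinygemm path, it should also work on CPU:

```python
import torch
import torch.nn as nn
from torchao.quantization import quantize_, Int8DynamicActivationInt8WeightConfig

model = nn.Sequential(nn.Linear(1024, 1024)).to(torch.bfloat16)
quantize_(model, Int8DynamicActivationInt8WeightConfig())
```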
Configuration for applying uintx weight-only asymmetric per-group quantization to linear layers, using uintx quantization, where x is the number of bits specified by dtype.
Sub-byte floating-point dtypes defined by ebits (exponent bits) and mbits (mantissa bits), e.g. fp6_e3m2 has 3 exponent bits and 2 mantissa bits.
QAT APIs
Object that knows how to convert a model with fake quantized modules, such as FakeQuantizedLinear and FakeQuantizedEmbedding, back to the original corresponding modules without fake quantization.
Config for how to fake quantize weights or activations. |
Quantizer for performing QAT on a model, where linear layers have int4 fake quantized grouped per channel weights. |
Quantizer for performing QAT on a model, where linear layers have int8 dynamic per token fake quantized activations and int4 fake quantized grouped per channel weights. |
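The QAT quantizers follow a prepare/train/convert flow; a minimal sketch, assuming the torchao.quantization.qat import path used by recent releases (older releases ship these under a prototype namespace):

```python
import torch.nn as nn
from torchao.quantization.qat import Int8DynActInt4WeightQATQuantizer

model = nn.Sequential(nn.Linear(512, 512))
qat_quantizer = Int8DynActInt4WeightQATQuantizer(groupsize=32)

# Swap in fake quantized linear modules, train as usual, then
# convert to the actual quantized model.
model = qat_quantizer.prepare(model)
# ... training loop ...
model = qat_quantizer.convert(model)
```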
Quantizer for performing QAT on a model, where embedding layers have int4 fake quantized grouped per channel weights. |
Composable quantizer that users can use to apply multiple QAT quantizers easily. |
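For example, combining weight and embedding QAT in one pass (a sketch, assuming both quantizers accept their default constructor arguments):

```python
import torch.nn as nn
from torchao.quantization.qat import (
    ComposableQATQuantizer,
    Int4WeightOnlyEmbeddingQATQuantizer,
    Int8DynActInt4WeightQATQuantizer,
)

model = nn.Sequential(nn.Embedding(1000, 512), nn.Linear(512, 512))
quantizer = ComposableQATQuantizer([
    Int8DynActInt4WeightQATQuantizer(),
    Int4WeightOnlyEmbeddingQATQuantizer(),
])
model = quantizer.prepare(model)
# ... training loop ...
model = quantizer.convert(model)
```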
(Prototype) Initialize the scales and zero points on all fake quantizers in the model based on the provided example inputs.
Quantization Primitives
A variant of choose_qparams_affine that uses pre-computed min and max values instead of deriving them from the input tensor.
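The affine primitives compose into a quantize/dequantize round trip; a minimal sketch of per-row int8 quantization, with signatures as in recent torchao releases:

```python
import torch
from torchao.quantization.quant_primitives import (
    MappingType,
    choose_qparams_affine,
    dequantize_affine,
    quantize_affine,
)

x = torch.randn(4, 16)
block_size = (1, 16)  # one scale / zero_point per row

scale, zero_point = choose_qparams_affine(
    x, MappingType.ASYMMETRIC, block_size, torch.int8
)
q = quantize_affine(x, block_size, scale, zero_point, torch.int8)
dq = dequantize_affine(q, block_size, scale, zero_point, torch.int8)
```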
Quantizes a high-precision float32 tensor to a low-precision floating-point format and returns the result in an unpacked format, 00SEEEMM for fp6_e3m2, where S is the sign bit, E an exponent bit, and M a mantissa bit.
General fake quantize op for quantization-aware training (QAT). |
General fake quantize op for quantization-aware training (QAT); this variant additionally returns the intermediate clamping mask, which is useful for implementing the backward pass.
Performs a safe integer matrix multiplication, considering different paths for torch.compile, cublas, and fallback cases. |
Performs scaled integer matrix multiplication. |
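A sketch of safe_int_mm, the first of the two matmul primitives above. In the torchao source tree it lives in torchao.kernel, which is the import path assumed here:

```python
import torch
from torchao.kernel import safe_int_mm

a = torch.randint(-128, 127, (32, 64), dtype=torch.int8)
b = torch.randint(-128, 127, (64, 16), dtype=torch.int8)

# int8 x int8 -> int32 accumulation; internally picks a cublas,
# torch.compile, or fallback path depending on the setting.
c = safe_int_mm(a, b)
assert c.dtype == torch.int32
```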
How floating-point values are mapped to integer values (e.g. symmetric vs. asymmetric mapping).
Enum that indicates whether zero_point is in the integer domain or the floating-point domain.
Placeholder for dtypes that do not exist in PyTorch core yet. |
Other
Replaces linear layers in the model with their SmoothFakeDynamicallyQuantizedLinear equivalents. |
Prepares the model for inference by calculating the smoothquant scale for each SmoothFakeDynamicallyQuantizedLinear layer. |
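The two functions form a calibrate-then-convert flow; a minimal sketch, assuming the torchao.quantization.smoothquant module path from the torchao source tree:

```python
import torch
import torch.nn as nn
from torchao.quantization.smoothquant import (
    smooth_fq_linear_to_inference,
    swap_linear_with_smooth_fq_linear,
)

model = nn.Sequential(nn.Linear(256, 256))

# 1. Swap nn.Linear layers for SmoothFakeDynamicallyQuantizedLinear.
swap_linear_with_smooth_fq_linear(model)

# 2. Run calibration data through the model to observe activations.
for _ in range(8):
    model(torch.randn(4, 256))

# 3. Compute smoothquant scales and freeze for inference.
smooth_fq_linear_to_inference(model)
```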