torchao.quantization
Autoquantization is a process that identifies the fastest way to quantize each layer of a model over a set of candidate qtensor subclasses.
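A minimal sketch of this flow, assuming the torchao.autoquant entry point from the torchao README and a CUDA device (neither is specified on this page):

    import torch
    import torchao

    # Toy model; autoquant profiles each linear layer and picks the fastest
    # candidate qtensor subclass for it.
    model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).cuda().half()
    example_input = torch.randn(8, 1024, device="cuda", dtype=torch.half)

    # Wrap the model, then run it once: the first call benchmarks the
    # candidate quantization methods per layer and finalizes the choice.
    model = torchao.autoquant(torch.compile(model, mode="max-autotune"))
    model(example_input)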
Converts the weights of linear modules in the model using apply_tensor_subclass; the model is modified in place.
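For example, a sketch assuming the quantize_ API together with the int8_weight_only configuration factory described further down this page:

    import torch
    from torchao.quantization import quantize_, int8_weight_only

    model = torch.nn.Sequential(torch.nn.Linear(64, 64))

    # Replaces each linear weight with a quantized tensor subclass, in place;
    # the same call pattern applies to the other factories listed below.
    quantize_(model, int8_weight_only())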
Applies int8 dynamic per-token asymmetric activation quantization and int4 per-group weight symmetric quantization to linear layers. This flow is intended to produce models for the ExecuTorch backend, but ExecuTorch does not yet support lowering models quantized this way.
Applies int8 dynamic symmetric per-token activation and int8 per-channel weight quantization to linear layers.
Applies uint4 weight-only asymmetric per-group quantization to linear layers, using the "tensor_core_tiled" layout for speedup with the tinygemm kernel.
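A sketch of this configuration, assuming the int4_weight_only factory with its group_size argument and a bfloat16 CUDA model (which the tinygemm kernel expects):

    import torch
    from torchao.quantization import quantize_, int4_weight_only

    model = torch.nn.Sequential(torch.nn.Linear(256, 256))
    model = model.cuda().to(torch.bfloat16)

    # Smaller groups mean more scales and better accuracy, at some speed cost.
    quantize_(model, int4_weight_only(group_size=128))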
Applies int8 weight-only symmetric per-channel quantization to linear layers.
Applies float8 weight-only symmetric per-channel quantization to linear layers.
Applies float8 dynamic symmetric quantization to both activations and weights of linear layers.
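A sketch, assuming the float8_dynamic_activation_float8_weight factory; float8 matmul also generally requires a recent GPU (compute capability 8.9 or later):

    import torch
    from torchao.quantization import (
        quantize_,
        float8_dynamic_activation_float8_weight,
    )

    model = torch.nn.Sequential(torch.nn.Linear(512, 512))
    model = model.cuda().to(torch.bfloat16)

    # Weights are quantized ahead of time; activation scales are computed
    # dynamically at runtime, per forward pass.
    quantize_(model, float8_dynamic_activation_float8_weight())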
Applies float8 static symmetric quantization to activations and weights of linear layers; static quantization requires the activation scale to be computed ahead of time.
Applies uintx weight-only asymmetric per-group quantization to linear layers, using uintx quantization where x is the number of bits specified by dtype.
Sub-byte floating point dtypes defined by ebits (exponent bits) and mbits (mantissa bits), e.g. fp6_e3m2 with 3 exponent bits and 2 mantissa bits.
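For instance, assuming the fpx_weight_only factory takes ebits and mbits positionally:

    import torch
    from torchao.quantization import quantize_, fpx_weight_only

    model = torch.nn.Sequential(torch.nn.Linear(128, 128)).cuda().half()

    # fp6_e3m2: 1 sign + 3 exponent + 2 mantissa bits per weight element.
    quantize_(model, fpx_weight_only(3, 2))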
Replaces linear layers in the model with their SmoothFakeDynamicallyQuantizedLinear equivalents.
Prepares the model for inference by calculating the smoothquant scale for each SmoothFakeDynamicallyQuantizedLinear layer.
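The two entries above form a calibrate-then-freeze flow; a sketch, assuming both helpers live in torchao.quantization.smoothquant:

    import torch
    from torchao.quantization.smoothquant import (
        swap_linear_with_smooth_fq_linear,
        smooth_fq_linear_to_inference,
    )

    model = torch.nn.Sequential(torch.nn.Linear(32, 32))
    swap_linear_with_smooth_fq_linear(model)  # insert observed linear layers
    model(torch.randn(4, 32))                 # calibration pass records activation stats
    smooth_fq_linear_to_inference(model)      # freeze the smoothquant scales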
|
A variant of choose_qparams_affine that takes min_val and max_val directly instead of deriving them from the input tensor.
|
Quantizes a high-precision float32 tensor to low-precision floating point values and returns the result in an unpacked format, 00SEEEMM for fp6_e3m2, where S is the sign bit, E an exponent bit, and M a mantissa bit.
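To make the 00SEEEMM layout concrete, here is a from-scratch decoder for fp6_e3m2; the helper name and the exponent bias of 3 are illustrative assumptions, not torchao code:

    def decode_fp6_e3m2(byte: int, exp_bias: int = 3) -> float:
        # One unpacked 00SEEEMM byte: 1 sign, 3 exponent, 2 mantissa bits.
        sign = -1.0 if (byte >> 5) & 0x1 else 1.0
        exp = (byte >> 2) & 0x7
        man = byte & 0x3
        if exp == 0:
            # Subnormal: no implicit leading one, minimum exponent.
            return sign * (man / 4) * 2.0 ** (1 - exp_bias)
        return sign * (1 + man / 4) * 2.0 ** (exp - exp_bias)

    print(decode_fp6_e3m2(0b00_0_011_01))  # sign +, exp 3, mantissa 1 -> 1.25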
|
General fake quantize op for quantization-aware training (QAT). |
|
General fake quantize op for quantization-aware training (QAT); this variant additionally returns a mask for the intermediate quantized values.
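Both ops implement the same quantize-then-dequantize round trip at their core; a from-scratch sketch of that math, not the torchao kernels themselves:

    import torch

    def fake_quantize(x, scale, zero_point, qmin=-128, qmax=127):
        # Quantize, clamp to the integer range, then dequantize immediately:
        # the result stays in float, so gradients can flow during QAT.
        q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
        return (q - zero_point) * scale

    x = torch.randn(4, 4)
    x_fq = fake_quantize(x, scale=torch.tensor(0.05), zero_point=torch.tensor(0.0))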
|
Performs a safe integer matrix multiplication, considering different paths for torch.compile, cuBLAS, and fallback cases.
|
Performs scaled integer matrix multiplication. |
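An illustrative fallback for these two ops, written from scratch rather than taken from torchao: int8 inputs are upcast to int32 so accumulation cannot overflow, and the scaled variant folds per-row scales back into floating point:

    import torch

    def int_mm_fallback(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # torch.matmul does not accept int8 on every backend; accumulate in int32.
        return torch.matmul(a.to(torch.int32), b.to(torch.int32))

    def int_scaled_mm_fallback(a, b, scales_a):
        # Rescale the int32 product back to floating point, row by row.
        return int_mm_fallback(a, b).to(torch.float32) * scales_a

    a = torch.randint(-128, 127, (8, 16), dtype=torch.int8)
    b = torch.randint(-128, 127, (16, 4), dtype=torch.int8)
    out = int_scaled_mm_fallback(a, b, torch.rand(8, 1))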
|
How a floating point number is mapped to an integer number.
|
Enum that indicates whether zero_point is in the integer domain or the floating point domain.
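A worked sketch of both ideas, using plain tensor arithmetic rather than torchao's implementation: symmetric versus asymmetric mapping to int8, with the zero_point kept in the integer domain:

    import torch

    x = torch.tensor([-1.0, 0.0, 2.0])

    # Asymmetric mapping: the scale spans [min, max], and an integer-domain
    # zero_point shifts that range onto [-128, 127].
    scale = (x.max() - x.min()) / 255
    zero_point = (-128 - x.min() / scale).round()
    q = torch.clamp(torch.round(x / scale) + zero_point, -128, 127)
    dq = (q - zero_point) * scale

    # Symmetric mapping: the scale spans [-max|x|, max|x|] and zero_point is
    # fixed at 0, so exact zero is always representable.
    scale_sym = x.abs().max() / 127
    q_sym = torch.clamp(torch.round(x / scale_sym), -128, 127)
    dq_sym = q_sym * scale_sym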
|
Placeholder for dtypes that do not exist in PyTorch core yet. |