torchao.quantization¶
Main Quantization APIs¶
quantize_ | Convert the weights of linear modules in the model according to config; the model is modified in place.
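A minimal usage sketch (the toy model and choice of config here are illustrative, not from this page):

    import torch
    from torchao.quantization import quantize_, Int8WeightOnlyConfig

    model = torch.nn.Sequential(torch.nn.Linear(64, 64)).eval()

    # quantize_ walks the module tree and converts the weights of the
    # linear layers it finds according to the config, in place.
    quantize_(model, Int8WeightOnlyConfig())

    out = model(torch.randn(1, 64))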
Inference APIs for quantize_¶
Int4WeightOnlyConfig | Configuration for int4 weight-only quantization. Only groupwise quantization is supported right now; version 1 and version 2 are both supported, implemented differently but with the same functionality.
Float8DynamicActivationInt4WeightConfig | Configuration for applying float8 dynamic per-row activation quantization and int4 per-group weight quantization to linear layers. Only group_size 128 is supported right now: the underlying kernel supports only 128 and above, and there is no benefit to making it bigger.
Float8DynamicActivationFloat8WeightConfig | Configuration for applying float8 dynamic symmetric quantization to both activations and weights of linear layers.
Float8WeightOnlyConfig | Configuration for applying float8 weight-only symmetric per-channel quantization to linear layers.
Int8DynamicActivationInt4WeightConfig | Configuration for applying int8 dynamic per-token asymmetric activation quantization and int4 per-group symmetric weight quantization to linear layers. This is used to produce a model for the ExecuTorch backend, but ExecuTorch does not yet support lowering the quantized model produced by this flow.
Int8WeightOnlyConfig | Configuration for applying int8 weight-only symmetric per-channel quantization to linear layers.
Int8DynamicActivationInt8WeightConfig | Configuration for applying int8 dynamic symmetric per-token activation quantization and int8 per-channel weight quantization to linear layers.
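A hedged sketch of how these configs plug into quantize_; the hardware guard is an assumption, since the float8 configs generally require a recent CUDA GPU:

    import torch
    from torchao.quantization import (
        quantize_,
        Float8DynamicActivationFloat8WeightConfig,
        Int8DynamicActivationInt8WeightConfig,
    )

    model = torch.nn.Sequential(torch.nn.Linear(128, 128)).eval()

    if torch.cuda.is_available():
        # float8 dynamic quantization of activations and weights; the
        # underlying kernels typically need recent GPUs (e.g. H100-class).
        model = model.cuda()
        config = Float8DynamicActivationFloat8WeightConfig()
    else:
        # int8 dynamic per-token activations + int8 per-channel weights
        config = Int8DynamicActivationInt8WeightConfig()

    quantize_(model, config)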
Quantization Primitives¶
safe_int_mm | Performs a safe integer matrix multiplication, considering different paths for torch.compile, cuBLAS, and fallback cases.
int_scaled_matmul | Performs scaled integer matrix multiplication.
MappingType | How a floating point number is mapped to an integer number, e.g. symmetric vs. asymmetric mapping.
TorchAODType | Placeholder for dtypes that do not exist in PyTorch core yet.
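A small sketch of the integer-matmul primitive and MappingType above; importing them from torchao.quantization is an assumption based on where this page documents them:

    import torch
    from torchao.quantization import safe_int_mm, MappingType

    a = torch.randint(-128, 128, (16, 32), dtype=torch.int8)
    b = torch.randint(-128, 128, (32, 8), dtype=torch.int8)

    # int8 x int8 with int32 accumulation; safe_int_mm dispatches between
    # torch.compile, cuBLAS, and fallback paths depending on the environment.
    acc = safe_int_mm(a, b)
    print(acc.shape, acc.dtype)  # expected: (16, 8), torch.int32

    # MappingType enumerates how floats are mapped to integers when choosing
    # quantization parameters.
    print(MappingType.SYMMETRIC, MappingType.ASYMMETRIC)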