FPXWeightOnlyConfig¶
- class torchao.quantization.FPXWeightOnlyConfig(ebits: int, mbits: int, set_inductor_config: bool = True)[source]¶
Sub-byte floating point dtypes defined by ebits (exponent bits) and mbits (mantissa bits), e.g. fp6_e3_m2, fp6_e2_m3, etc. The packing format and kernels come from the fp6-llm paper (https://arxiv.org/abs/2401.14112) and its GitHub repo (https://github.com/usyd-fsalab/fp6_llm, since renamed to quant-llm). For more details on the packing, see:
FpxTensorCoreAQTTensorImpl
This config is experimental and will be merged with to_affine_quantized_floatx in the future.
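To make the ebits/mbits parameters concrete, here is a small illustrative sketch (not part of torchao) that computes the largest representable magnitude of an fpx format with one sign bit, ebits exponent bits, and mbits mantissa bits, assuming the standard IEEE-style bias of 2^(ebits-1) - 1 and no reserved inf/nan encodings:

```python
def max_fpx_value(ebits: int, mbits: int) -> float:
    """Largest finite value of a 1 + ebits + mbits floating-point format.

    Assumes bias = 2**(ebits - 1) - 1 and that the top exponent code is a
    normal number (no inf/nan), as in the fp6-llm sub-byte formats.
    """
    bias = 2 ** (ebits - 1) - 1
    max_exponent = (2 ** ebits - 1) - bias          # largest unbiased exponent
    max_mantissa = 2.0 - 2.0 ** (-mbits)            # 1.11...1 in binary
    return max_mantissa * 2.0 ** max_exponent


# fp6_e3_m2 spans up to 28.0; fp6_e2_m3 spans up to 7.5
print(max_fpx_value(3, 2))  # 28.0
print(max_fpx_value(2, 3))  # 7.5
```

In typical usage, the config itself is simply passed to torchao.quantization.quantize_, e.g. quantize_(model, FPXWeightOnlyConfig(ebits=3, mbits=2)) for fp6_e3_m2 weight-only quantization.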