Quantized Inference#
For inference, we support dynamic and weight-only quantization of torch.nn.functional.linear across various dtype configurations.
The pseudocode is as follows:
# high precision (baseline)
output_bf16 = input_bf16 @ weight_bf16.t()
# dynamic quantization (shown for fp8 rowwise): the weight is quantized ahead of time,
# the input is quantized to fp8 at runtime
output_bf16 = to_fp8(input_bf16) @ weight_fp8.t()
# weight-only quantization (shown for int4): only the weight is quantized,
# and it is dequantized (or consumed by a mixed-dtype kernel) inside the matmul
output_bf16 = input_bf16 @ weight_int4.t()
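To make the dynamic-quantization path concrete, here is a minimal numerical sketch of fp8 rowwise quantization in plain PyTorch. It only emulates the quantize/dequantize numerics (the matmul itself runs in bf16 after dequantization); the real kernels keep the operands in fp8 and use fused scaled matmuls. The helper names below are illustrative, not torchao APIs.

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3

def quantize_rowwise_fp8(t: torch.Tensor):
    # one scale per row, chosen so each row's absmax maps to the fp8 max value
    scale = t.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / FP8_MAX
    return (t / scale).to(torch.float8_e4m3fn), scale

def dequantize(t_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return t_fp8.to(torch.bfloat16) * scale.to(torch.bfloat16)

input_bf16 = torch.randn(16, 64, dtype=torch.bfloat16)
weight_bf16 = torch.randn(32, 64, dtype=torch.bfloat16)

# high precision (baseline)
output_ref = input_bf16 @ weight_bf16.t()

# dynamic quantization: the weight is quantized ahead of time,
# the input is quantized at runtime from its observed range
weight_fp8, w_scale = quantize_rowwise_fp8(weight_bf16)
input_fp8, x_scale = quantize_rowwise_fp8(input_bf16)
output_fp8 = dequantize(input_fp8, x_scale) @ dequantize(weight_fp8, w_scale).t()

print((output_ref - output_fp8).abs().mean())  # small quantization error
```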
Quantization Techniques#
See the API Reference documentation for code examples and detailed documentation for each quantization config:
- float8 weight configs: Float8DynamicActivationFloat8WeightConfig, Float8WeightOnlyConfig
- int8 weight configs: Int8DynamicActivationInt8WeightConfig, Int8WeightOnlyConfig
- int4 weight configs: Int4WeightOnlyConfig, Float8DynamicActivationInt4WeightConfig, Int8DynamicActivationInt4WeightConfig
- intx weight configs: IntxWeightOnlyConfig, Int8DynamicActivationIntxWeightConfig
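The configs above are applied to a model's linear layers with torchao's quantize_ API. A minimal sketch, assuming a recent torchao build where these configs are exported from torchao.quantization (the API Reference has the authoritative examples):

```python
import torch
from torch import nn
from torchao.quantization import quantize_, Float8DynamicActivationFloat8WeightConfig

# toy model standing in for a real checkpoint
model = nn.Sequential(nn.Linear(4096, 4096)).to(torch.bfloat16).to("cuda")

# replace the weights of every nn.Linear with quantized versions, in place;
# the other configs listed above are used the same way
quantize_(model, Float8DynamicActivationFloat8WeightConfig())

with torch.no_grad():
    out = model(torch.randn(8, 4096, dtype=torch.bfloat16, device="cuda"))
```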
Notes:

- The quantization error incurred by applying int4 quantization to your model can be fairly significant, so using external techniques like GPTQ may be necessary to obtain a usable model.
- Float8 quantization requires hardware with CUDA compute capability 8.9 or greater (e.g., H100).
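A quick runtime check for the compute-capability requirement, using standard PyTorch CUDA APIs (a small sketch; H100 reports (9, 0)):

```python
import torch

major, minor = torch.cuda.get_device_capability()
# float8 kernels need compute capability 8.9 or greater (Ada / Hopper and newer)
print(f"compute capability {major}.{minor}, float8 supported: {(major, minor) >= (8, 9)}")
```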
Accuracy benchmarks#
All the following benchmarks are for meta-llama/Llama-3.1-8B using lm-eval.
| weight | activation | wikitext-perplexity | winogrande | checkpoint size (GB) |
|---|---|---|---|---|
| bfloat16 | bfloat16 | 7.3315 | 0.7380 | 16.1 |
| float8_rowwise | float8_rowwise | 7.4197 | 0.7388 | 9.1 |
| int8_rowwise | bfloat16 | 7.3451 | 0.7340 | 9.1 |
| int8_rowwise | int8_rowwise | 7.4535 | 0.7285 | 9.1 |
| mxfp8 | mxfp8 | 7.6034 | 0.7316 | 9.32 |
| nvfp4 | nvfp4 | 8.4459 | 0.7135 | 6.05 |
To reproduce, run the following commands:
# on an H100
SKIP_VLLM=1 ./benchmarks/quantization/measure_accuracy_and_performance.sh h100
# on a B200
SKIP_VLLM=1 ./benchmarks/quantization/measure_accuracy_and_performance.sh b200
Performance benchmarks#
All the following benchmarks are for meta-llama/Llama-3.1-8B using torch==2.9.0 and vllm==0.13.0.
NVIDIA B200#
| weight | activation | prefill toks/s | decode toks/s | prefill_speedup | decode_speedup |
|---|---|---|---|---|---|
| bfloat16 | bfloat16 | 59099.9 | 14380 | 1 | 1 |
| mxfp8 | mxfp8 | TODO(https://github.com/pytorch/ao/issues/3549) | - | - | - |
| nvfp4 | nvfp4 | 102786 | 15218.9 | 1.739 | 1.058 |
| float8_rowwise | float8_rowwise | 69313.7 | 15984 | 1.173 | 1.112 |
NVIDIA H100#
| weight | activation | prefill toks/s | decode toks/s | prefill_speedup | decode_speedup |
|---|---|---|---|---|---|
| bfloat16 | bfloat16 | 30946.5 | 6612 | 1 | 1 |
| float8_rowwise | float8_rowwise | 45312.5 | 8025.95 | 1.464 | 1.214 |
| int8_rowwise | bfloat16 | 28231.9 | 4309.8 | 0.912 | 0.652 |
| int4 | float8_rowwise | TODO(https://github.com/pytorch/ao/issues/3550) | - | - | - |
To reproduce these benchmarks, run:
# on an H100
SKIP_LM_EVAL=1 ./benchmarks/quantization/measure_accuracy_and_performance.sh h100
# on a B200
SKIP_LM_EVAL=1 ./benchmarks/quantization/measure_accuracy_and_performance.sh b200
Under the hood, the script runs the following vllm throughput benchmarks:
# 1. prefill
vllm bench throughput --num_prompts 32 --input_len 4096 --output_len 32 --max_model_len 4128
# 2. decode
vllm bench throughput --num_prompts 128 --input_len 32 --output_len 2048 --max_model_len 2080
Other Available Quantization Techniques#
Int8DynamicActivationIntxWeightConfig Quantization#
We have kernels that do 8-bit dynamic quantization of activations and uintx groupwise quantization of weights. These kernels are experimental and can only be run on devices with ARM CPUs (e.g., Macs with Apple silicon). The benchmarks below were run on an M1 MacBook Pro with 8 performance cores, 2 efficiency cores, and 32 GB of RAM. In all cases, torch.compile was used.
| Model | Technique | Tokens/Second | Memory Bandwidth (GB/s) | Peak Memory (GB) | Model Size (GB) |
|---|---|---|---|---|---|
| Llama-3.1-8B | Base (bfloat16) | 1.24 | 18.62 | NA | 15.01 |
| | int8_dynamic_activation_intx_weight-4-256-false | 16.03 | 65.81 | NA | 4.11 |
| | int8_dynamic_activation_intx_weight-3-256-false | 18.94 | 59.97 | NA | 3.17 |
You can try out these APIs with the quantize_ API as above, alongside the Int8DynamicActivationIntxWeightConfig config. An example can be found in torchao/_models/llama/generate.py.
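For example, a minimal sketch with default config arguments (the weight dtype and group-size options, and the exact import location, may differ across torchao versions; see the config's API Reference page):

```python
import torch
from torch import nn
# import location is an assumption; some releases exposed this config under an
# experimental namespace instead of torchao.quantization
from torchao.quantization import quantize_, Int8DynamicActivationIntxWeightConfig

# CPU model; running the quantized forward requires the ARM CPU kernels described above
model = nn.Sequential(nn.Linear(4096, 4096))

# 8-bit dynamic activation quantization + low-bit groupwise weight quantization,
# with the constructor arguments (weight dtype, group size, ...) left at their defaults
quantize_(model, Int8DynamicActivationIntxWeightConfig())
```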
Codebook Quantization#
The benchmarks below were run on a single NVIDIA-A6000 GPU.
| Model | Technique | wikitext-perplexity | Tokens/Second | Memory Bandwidth (GB/s) | Peak Memory (GB) | Model Size (GB) |
|---|---|---|---|---|---|---|
| Llama-3-8B | Base (bfloat16) | 7.590 | 32.36 | 485.71 | 16.19 | 15.01 |
| | codebook-4-64 | 9.533 | 1.73 | 8.62 | 23.11 | 4.98 |
| Llama-3.1-8B | Base (bfloat16) | 7.713 | 32.16 | 482.70 | 16.35 | 15.01 |
| | codebook-4-64 | 10.095 | 1.73 | 8.63 | 23.11 | 4.98 |
You can try out these APIs with the quantize_ API as above, alongside the CodebookWeightOnlyConfig config. An example can be found in torchao/_models/llama/generate.py.
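For example, a minimal sketch (the import path and default constructor arguments are assumptions; in some torchao releases this config lives in a prototype namespace, so check the API Reference for your installed version):

```python
import torch
from torch import nn
from torchao.quantization import quantize_
# assumed import path; adjust to wherever CodebookWeightOnlyConfig lives in your torchao version
from torchao.prototype.quantization.codebook import CodebookWeightOnlyConfig

model = nn.Sequential(nn.Linear(4096, 4096)).to(torch.bfloat16).to("cuda")

# weight-only codebook (lookup-table) quantization; constructor arguments such as the
# code dtype and block size are left at their defaults here
quantize_(model, CodebookWeightOnlyConfig())
```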