Quantized Inference#

Created On: Mar 24, 2026 | Last Updated On: Mar 24, 2026

For inference, we support dynamic and weight-only quantization of torch.nn.functional.linear across various dtype configurations. The pseudocode is as follows:

# high precision (baseline)
output_bf16 = input_bf16 @ weight_bf16.t()

# dynamic quantization (shown for fp8 rowwise)
output_bf16 = to_fp8(input_bf16) @ weight_fp8.t()

# weight-only quantization (shown for int4)
output_bf16 = input_bf16 @ weight_int4.t()
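To make the difference between the two flavors concrete, here is a plain-Python sketch (not torchao code) of symmetric per-tensor int8 quantization applied weight-only vs dynamically; all helper names are illustrative:

```python
def quantize_int8(x):
    """Symmetric per-tensor int8 quantization: returns int values plus one scale."""
    amax = max(abs(v) for row in x for v in row) or 1.0
    scale = amax / 127.0
    q = [[max(-128, min(127, round(v / scale))) for v in row] for row in x]
    return q, scale

def matmul(a, b):
    """Naive matmul: a is (M, K), b is (K, N)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

x = [[0.5, -1.0], [2.0, 0.25]]   # bf16 activation stand-in
w = [[1.0, 0.0], [0.0, 1.0]]     # weight, already laid out as (K, N)

# weight-only: the weight is quantized ahead of time; the matmul itself runs
# in high precision against the dequantized weight
wq, w_scale = quantize_int8(w)
w_dq = [[v * w_scale for v in row] for row in wq]
out_weight_only = matmul(x, w_dq)

# dynamic: the activation is also quantized at runtime; the matmul runs on
# integer values, and the output is rescaled back to high precision
xq, x_scale = quantize_int8(x)
out_dynamic = [[v * x_scale * w_scale for v in row] for row in matmul(xq, wq)]
```

Both outputs approximate the high-precision result; dynamic quantization trades a little extra error on the activation for an integer matmul.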

Inference Workflows#

Below are the stable and near-stable inference workflows in torchao:

| weight dtype | act dtype | summary |
| --- | --- | --- |
| float8 | float8 | Float8DynamicActivationFloat8WeightConfig: Applies float8 dynamic symmetric quantization to both activations and weights. Requires CUDA compute capability ≥ 8.9, AMD MI350+, or Intel XPU. Supports PerTensor and PerRow granularity. |
| float8 | bf16 | Float8WeightOnlyConfig: Applies float8 weight-only symmetric per-channel quantization. Matmul is computed in the original precision. |
| int8 | int8 | Int8DynamicActivationInt8WeightConfig: Applies int8 dynamic symmetric per-token activation and int8 per-channel weight quantization. |
| int8 | bf16 | Int8WeightOnlyConfig: Applies int8 weight-only symmetric per-channel quantization. |
| mxfp8 | mxfp8 | MXDynamicActivationMXWeightConfig (prototype): Applies mxfp8 or mxfp4 dynamic quantization to activations and weights. Requires NVIDIA SM100+ (Blackwell) or AMD MI350+. |
| int4 | bf16 | Int4WeightOnlyConfig: Applies int4 weight-only groupwise quantization. Supports group sizes 256, 128, 64, and 32. |
| int4 | float8 | Float8DynamicActivationInt4WeightConfig: Applies float8 dynamic per-row activation and int4 per-group weight quantization. Group size 128 only. |
| nvfp4 | bf16 | NVFP4WeightOnlyConfig (prototype): Applies NVFP4 weight-only quantization. |
| nvfp4 | nvfp4 | NVFP4DynamicActivationNVFP4WeightConfig (prototype): Applies NVFP4 dynamic quantization to activations and weights with double quantization (per-tensor + per-block scales). Requires NVIDIA SM100+ (Blackwell). |
| mxfp4 | mxfp4 | MXDynamicActivationMXWeightConfig (prototype): Applies mxfp8 or mxfp4 dynamic quantization to activations and weights. Requires NVIDIA SM100+ (Blackwell) or AMD MI350+. |
| intx | bf16 | IntxWeightOnlyConfig: Applies intx (1-8 bit) weight-only quantization. Supports groupwise and per-channel granularity. Works with Linear and Conv2D. |
| intx | int8 | Int8DynamicActivationIntxWeightConfig: Applies int8 dynamic per-token activation and intx (1-8 bit) weight quantization. CPU-optimized. |
| uintx (4/8-bit) | bf16 | UIntxWeightOnlyConfig (prototype): Applies 4-bit (asymmetric, grouped) or 8-bit (symmetric, per-channel) weight-only quantization using gemlite (https://github.com/dropbox/gemlite) Triton kernels. Supports packing bit widths 8, 16, 32. Requires CUDA and gemlite. Optimized for A100 and H100 GPUs. |
| uintx (4/8-bit) | int8 | Int8DynamicActivationUIntxWeightConfig (prototype): Applies int8 dynamic activation with 4-bit or 8-bit weight quantization using gemlite (https://github.com/dropbox/gemlite) Triton kernels. Requires CUDA and gemlite. Optimized for A100 and H100 GPUs. |
| int4 | int8 | Int8DynamicActivationInt4WeightConfig (prototype): Applies int8 dynamic per-group activation and int4 per-group weight quantization on x86 CPU. Group size must be 128, 64, or 32. |
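To make the PerTensor vs PerRow granularity distinction in the float8 rows above concrete, here is a plain-Python sketch of how the two scale computations differ (FP8_E4M3_MAX and the helper names are illustrative, not torchao APIs):

```python
FP8_E4M3_MAX = 448.0   # largest finite float8 e4m3 value

def per_tensor_scale(x):
    """One scale for the whole tensor: an outlier anywhere affects every row."""
    amax = max(abs(v) for row in x for v in row)
    return amax / FP8_E4M3_MAX

def per_row_scales(x):
    """One scale per row: each row keeps its own dynamic range."""
    return [max(abs(v) for v in row) / FP8_E4M3_MAX for row in x]

x = [[0.1, -0.2, 0.3],
     [100.0, -50.0, 25.0]]   # row 1 has a far larger dynamic range than row 0

print(per_tensor_scale(x))   # one scale, dominated by the 100.0 outlier
print(per_row_scales(x))     # row 0 keeps a much finer quantization step
```

PerRow costs one scale per row instead of one per tensor, but preserves resolution for rows with small magnitudes.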

Accuracy benchmarks#

All the following benchmarks are for meta-llama/Llama-3.1-8B using lm-eval.

| weight | activation | wikitext-perplexity | winogrande | checkpoint size (GB) |
| --- | --- | --- | --- | --- |
| bfloat16 | bfloat16 | 7.3315 | 0.7380 | 16.1 |
| float8_rowwise | float8_rowwise | 7.4197 | 0.7388 | 9.1 |
| int8_rowwise | bfloat16 | 7.3451 | 0.7340 | 9.1 |
| int8_rowwise | int8_rowwise | 7.4535 | 0.7285 | 9.1 |
| mxfp8 | mxfp8 | 7.6034 | 0.7316 | 9.32 |
| nvfp4 | nvfp4 | 8.4459 | 0.7135 | 6.05 |

To reproduce, run one of the following commands:

# on an H100
SKIP_VLLM=1 ./benchmarks/quantization/measure_accuracy_and_performance.sh h100
# on a B200
SKIP_VLLM=1 ./benchmarks/quantization/measure_accuracy_and_performance.sh b200

Performance benchmarks#

e2e model level benchmarks#

All the following benchmarks are for meta-llama/Llama-3.1-8B using torch==2.9.0 and vllm==0.13.0.

NVIDIA B200#

| weight | activation | prefill toks/s | decode toks/s | prefill_speedup | decode_speedup |
| --- | --- | --- | --- | --- | --- |
| bfloat16 | bfloat16 | 59099.9 | 14380 | 1 | 1 |
| mxfp8 | mxfp8 | TODO(https://github.com/pytorch/ao/issues/3549) | - | - | - |
| nvfp4 | nvfp4 | 102786 | 15218.9 | 1.739 | 1.058 |
| float8_rowwise | float8_rowwise | 69313.7 | 15984 | 1.173 | 1.112 |

NVIDIA H100#

| weight | activation | prefill toks/s | decode toks/s | prefill_speedup | decode_speedup |
| --- | --- | --- | --- | --- | --- |
| bfloat16 | bfloat16 | 30946.5 | 6612 | 1 | 1 |
| float8_rowwise | float8_rowwise | 45312.5 | 8025.95 | 1.464 | 1.214 |
| int8_rowwise | bfloat16 | 28231.9 | 4309.8 | 0.912 | 0.652 |
| int4 | float8_rowwise | TODO(https://github.com/pytorch/ao/issues/3550) | - | - | - |

To reproduce these benchmarks, run one of the following commands:

# on an H100
SKIP_LM_EVAL=1 ./benchmarks/quantization/measure_accuracy_and_performance.sh h100
# on a B200
SKIP_LM_EVAL=1 ./benchmarks/quantization/measure_accuracy_and_performance.sh b200

# under the hood, the actual vllm benchmark runs the following:
# 1. prefill
vllm bench throughput --num_prompts 32 --input_len 4096 --output_len 32 --max_model_len 4128
# 2. decode
vllm bench throughput --num_prompts 128 --input_len 32 --output_len 2048 --max_model_len 2080

Microbenchmarks and roofline model#

The following microbenchmarks show the roofline-expected and observed execution times of a ReLU -> Linear toy model across a sweep of (M, K, N) shapes, where the activation has shape (M, K) and the weight has shape (K, N). They can be used to estimate the expected speedup from quantizing torch.nn.Linear layers with various recipes during inference, based on the shapes in your model.

Explanation: to see speedup from quantization of activation -> gemm during inference, we want

(bf16_activation_time + bf16_gemm_time) > (bf16_activation_and_quantize_tensor_time + fp8_gemm_time)

In a perfect world (and in our roofline model),

  1. bf16_activation_time > bf16_activation_and_quantize_tensor_time always holds, because the bf16 activation kernel moves M*K*4 bytes (a 2-byte bf16 read plus a 2-byte bf16 write per element), while bf16_activation_and_quantize_tensor is a single fused kernel that moves only M*K*3 bytes (a 2-byte bf16 read plus a 1-byte fp8 write per element).

  2. bf16_gemm_time > fp8_gemm_time always holds, because the fp8 gemm has ~2x the peak throughput of the bf16 gemm.

In the real world, neither (1) nor (2) always holds, due to kernel launch overhead, kernel efficiency, lack of fusion for some recipes, etc. Therefore, the observed speedups are often significantly below the roofline peak. In general, you should expect the observed speedup from inference quantization to increase as M, K, and N increase.
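The memory-vs-compute reasoning above can be sketched as a tiny roofline model in plain Python. The peak bandwidth and FLOPS numbers below are illustrative placeholders, not the specs of any particular GPU:

```python
# Illustrative peak numbers; not the specs of any particular GPU.
MEM_BW = 3.3e12      # bytes/s
BF16_FLOPS = 1.0e15  # peak bf16 matmul flops/s
FP8_FLOPS = 2.0e15   # ~2x bf16, matching the explanation above

def gemm_time(M, K, N, bytes_per_el, peak_flops):
    # a gemm is bounded by the slower of its memory traffic and its math
    mem = (M * K + K * N + M * N) * bytes_per_el / MEM_BW
    compute = 2 * M * K * N / peak_flops
    return max(mem, compute)

def activation_time(M, K, read_bytes, write_bytes):
    # an elementwise op like ReLU is purely memory bound
    return M * K * (read_bytes + write_bytes) / MEM_BW

def expected_speedup(M, K, N):
    # baseline: bf16 ReLU (2B read, 2B write) + bf16 gemm
    bf16 = activation_time(M, K, 2, 2) + gemm_time(M, K, N, 2, BF16_FLOPS)
    # quantized: fused ReLU+quantize (2B read, 1B fp8 write) + fp8 gemm
    fp8 = activation_time(M, K, 2, 1) + gemm_time(M, K, N, 1, FP8_FLOPS)
    return bf16 / fp8

for s in (1024, 4096, 16384):
    print(s, round(expected_speedup(s, s, s), 2))
```

Even this crude model reproduces the trend in the tables below: the expected speedup grows with shape, approaching the ~2x gemm ceiling for large M, K, N.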

NVIDIA B200#

# `r_fp8_gemm_and_ovhd_spdp` is the roofline expected speedup of the
#    quantized ReLU -> Linear layer vs high precision version
# `b_fp8_e2e_spdp` is the observed speedup of the quantized
#    ReLU -> Linear layer vs high precision version

#
# mxfp8
#
> python benchmarks/float8/float8_inference_roofline.py --recipe_name mxfp8_cublas --enable_fusion_modeling True --skip_printing_detailed_metrics True
...
GPU                     NVIDIA B200
torch version           2.12.0.dev20260218+cu130
torchao version         0.17.0+git3075bb624
...
   fwd_M  fwd_K  fwd_N  r_fp8_gemm_and_ovhd_spdp  b_fp8_e2e_spdp
0   1024   1024   1024                      1.00            0.93
1   2048   2048   2048                      1.75            1.20
2   4096   4096   4096                      1.90            1.46
3   8192   8192   8192                      1.94            1.76
4  16384  16384  16384                      1.97            1.77

#
# nvfp4 with dynamic global scaling
#
> python benchmarks/float8/float8_inference_roofline.py --recipe_name nvfp4 --enable_fusion_modeling True --skip_printing_detailed_metrics True
...
GPU                     NVIDIA B200
torch version           2.12.0.dev20260312+cu130
torchao version         0.17.0+gitbd7717d20
...
   fwd_M  fwd_K  fwd_N  r_fp8_gemm_and_ovhd_spdp  b_fp8_e2e_spdp
0   1024   1024   1024                      1.00            0.46
1   2048   2048   2048                      2.36            0.76
2   4096   4096   4096                      2.89            1.37
3   8192   8192   8192                      3.32            1.97
4  16384  16384  16384                      3.62            2.77

#
# nvfp4 with static global scaling (user API in progress)
#
> python benchmarks/float8/float8_inference_roofline.py --recipe_name nvfp4_static --enable_fusion_modeling True --skip_printing_detailed_metrics True
...
GPU                     NVIDIA B200
torch version           2.12.0.dev20260312+cu130
torchao version         0.17.0+gitbd7717d20
...
   fwd_M  fwd_K  fwd_N  r_fp8_gemm_and_ovhd_spdp  b_fp8_e2e_spdp
0   1024   1024   1024                      1.00            0.55
1   2048   2048   2048                      2.74            0.95
2   4096   4096   4096                      3.42            1.69
3   8192   8192   8192                      3.67            2.29
4  16384  16384  16384                      3.82            2.98

e2e flux-1.schnell benchmarks#

These benchmarks compare accuracy and performance of torchao inference quantization on the flux-1.schnell model.

For accuracy, we measure the LPIPS score between images generated by the quantized model and the high precision (bfloat16) baseline, averaged over the prompts from the sayakpaul/drawbench dataset — lower is better, with 0 meaning identical.

Note that this benchmark optimizes for speed of iteration and does not represent the best possible metrics someone could achieve on this model. Instead, this is an apples-to-apples comparison intended to compare different quantization recipes at a high level, and measure performance improvements.

| experiment | lpips_avg | time_s_bsz_1 | speedup_bsz_1 | time_s_bsz_4 | speedup_bsz_4 |
| --- | --- | --- | --- | --- | --- |
| bfloat16 | 0 | 0.4178 | 1.00 | 1.4914 | 1.00 |
| float8_rowwise | 0.1236 | 0.3455 | 1.21 | 1.1986 | 1.24 |
| mxfp8 | 0.1260 | 0.3673 | 1.14 | 1.2820 | 1.16 |
| nvfp4 | 0.2694 | 0.3203 | 1.30 | 1.0913 | 1.37 |

To reproduce, run:

./benchmarks/quantization/eval_accuracy_and_perf_of_flux.sh

Other Available Quantization Techniques#

Int8DynamicActivationIntxWeightConfig Quantization#

We have kernels that perform 8-bit dynamic quantization of activations and uintx groupwise quantization of weights. These kernels are experimental and can only be run on a device with an ARM CPU (e.g., a Mac with Apple silicon). The benchmarks below were run on an M1 Pro Mac (8 performance cores, 2 efficiency cores, 32 GB of RAM). In all cases, torch.compile was used.

| Model | Technique | Tokens/Second | Memory Bandwidth (GB/s) | Peak Memory (GB) | Model Size (GB) |
| --- | --- | --- | --- | --- | --- |
| Llama-3.1-8B | Base (bfloat16) | 1.24 | 18.62 | NA | 15.01 |
| | int8_dynamic_activation_intx_weight-4-256-false | 16.03 | 65.81 | NA | 4.11 |
| | int8_dynamic_activation_intx_weight-3-256-false | 18.94 | 59.97 | NA | 3.17 |

You can try out these APIs with the quantize_ API as above, alongside the config Int8DynamicActivationIntxWeightConfig. An example can be found in torchao/_models/llama/generate.py.
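A plain-Python sketch of the groupwise weight quantization scheme behind these intx configs (the group size, bit width, and helper names are illustrative; the real kernels pack bits and run on ARM CPUs):

```python
def quantize_groupwise(row, group_size=4, bits=4):
    """Symmetric groupwise quantization: one scale per `group_size` weights."""
    qmax = 2 ** (bits - 1) - 1   # 7 for 4-bit symmetric
    qrow, scales = [], []
    for g in range(0, len(row), group_size):
        group = row[g:g + group_size]
        scale = (max(abs(v) for v in group) or 1.0) / qmax
        scales.append(scale)
        qrow.extend(max(-qmax - 1, min(qmax, round(v / scale))) for v in group)
    return qrow, scales

def dequantize_groupwise(qrow, scales, group_size=4):
    return [q * scales[i // group_size] for i, q in enumerate(qrow)]

# the first group is small-magnitude, the second large: separate scales
# keep the small group's quantization error small
row = [0.1, -0.05, 0.02, 0.08, 10.0, -4.0, 2.0, 6.0]
q, scales = quantize_groupwise(row)
restored = dequantize_groupwise(q, scales)
```

Smaller group sizes reduce quantization error at the cost of storing more scales, which is the trade-off behind the group-size choices in the configs above.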

Codebook Quantization#

The benchmarks below were run on a single NVIDIA-A6000 GPU.

| Model | Technique | wikitext-perplexity | Tokens/Second | Memory Bandwidth (GB/s) | Peak Memory (GB) | Model Size (GB) |
| --- | --- | --- | --- | --- | --- | --- |
| Llama-3-8B | Base (bfloat16) | 7.590 | 32.36 | 485.71 | 16.19 | 15.01 |
| | codebook-4-64 | 9.533 | 1.73 | 8.62 | 23.11 | 4.98 |
| Llama-3.1-8B | Base (bfloat16) | 7.713 | 32.16 | 482.70 | 16.35 | 15.01 |
| | codebook-4-64 | 10.095 | 1.73 | 8.63 | 23.11 | 4.98 |

You can try out these APIs with the quantize_ API as above, alongside the config CodebookWeightOnlyConfig. An example can be found in torchao/_models/llama/generate.py.
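A toy sketch of the codebook idea: each weight is replaced by a small index into a shared table of centroids, and inference dequantizes by table lookup. A real codebook is typically learned (e.g., via k-means); this example uses a uniform one, and all names are illustrative:

```python
def build_uniform_codebook(weights, num_entries=16):
    """Evenly spaced centroids over the weight range (a learned codebook,
    e.g. from k-means, would fit the weight distribution better)."""
    lo, hi = min(weights), max(weights)
    step = (hi - lo) / (num_entries - 1)
    return [lo + i * step for i in range(num_entries)]

def encode(weights, codebook):
    """Replace each weight by the index of its nearest centroid."""
    return [min(range(len(codebook)), key=lambda i: abs(codebook[i] - w))
            for w in weights]

def decode(indices, codebook):
    """Inference-time lookup: indices -> centroid values."""
    return [codebook[i] for i in indices]

weights = [-0.9, -0.31, 0.02, 0.48, 0.9]
codebook = build_uniform_codebook(weights)   # 16 entries -> 4-bit indices
indices = encode(weights, codebook)
restored = decode(indices, codebook)
```

With 16 entries, each weight is stored as a 4-bit index plus a shared table, which is where the "codebook-4-64" naming's bit width comes from.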

Low-Precision FP8 Attention (Prototype)#

FP8 low-precision attention for inference, built on Flash Attention backends. Currently supports FA3 on Hopper (SM90) and FA4 on Blackwell (SM100).

Requirements: PyTorch >= 2.11, Hopper or Blackwell GPU, Flash Attention 3 (pip install flash-attn-3 --index-url=https://download.pytorch.org/whl/{cuda_version}).

import torch
import torch.nn as nn
import torch.nn.functional as F

from torchao.prototype.attention import apply_low_precision_attention


# Simple model with attention
class MyModel(nn.Module):
    def __init__(self, embed_dim=512, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.q_proj = nn.Linear(embed_dim, embed_dim, bias=False)
        self.k_proj = nn.Linear(embed_dim, embed_dim, bias=False)
        self.v_proj = nn.Linear(embed_dim, embed_dim, bias=False)
        self.out_proj = nn.Linear(embed_dim, embed_dim, bias=False)

    def forward(self, x):
        B, S, _ = x.shape
        q = self.q_proj(x).view(B, S, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(B, S, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, S, self.num_heads, self.head_dim).transpose(1, 2)
        attn_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out_proj(attn_out.transpose(1, 2).contiguous().view(B, S, -1))


model = MyModel().to(device="cuda", dtype=torch.bfloat16).eval()

# Auto-detect best backend
model = apply_low_precision_attention(model)

# Or specify a backend explicitly
# model = apply_low_precision_attention(model, backend=AttentionBackend.FP8_FA3)

# Optional: torch.compile for RoPE fusion
model = torch.compile(model)

apply_low_precision_attention replaces all F.scaled_dot_product_attention calls with FP8 attention for eager execution. When combined with torch.compile, RoPE patterns are automatically detected and fused into a single kernel. KV caching should be disabled before calling for best results with torch.compile. See the API reference for details.