NVFP4DynamicActivationNVFP4WeightConfig
- class torchao.prototype.mx_formats.NVFP4DynamicActivationNVFP4WeightConfig(use_triton_kernel: bool = True, use_dynamic_per_tensor_scale: bool = True, step: QuantizationStep | None = None)
NVIDIA FP4 (NVFP4) Inference Quantization Configuration
This is a specialized configuration for NVIDIA’s FP4 format. NVFP4 uses “double quantization” with two scale levels:
- A global per_tensor_scale (float32)
- Per-block scales (float8_e4m3fn, block_size=16), always dynamically calculated
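The two scale levels can be sketched in plain Python. This is an illustrative sketch, not torchao's implementation: it assumes the per-tensor scale is amax / (448 * 6), where 448 and 6 are the largest magnitudes representable in float8_e4m3fn and float4 e2m1 respectively, and it ignores the rounding that occurs when block scales are cast to float8_e4m3fn. The function names are hypothetical.

```python
F4_E2M1_MAX = 6.0    # largest magnitude representable in float4 e2m1
F8_E4M3_MAX = 448.0  # largest magnitude representable in float8_e4m3fn
BLOCK_SIZE = 16      # elements per block, as stated in the text


def compute_per_tensor_scale(values):
    """Global (second-level) scale derived from the tensor-wide amax."""
    amax = max(abs(v) for v in values)
    return amax / (F8_E4M3_MAX * F4_E2M1_MAX)


def compute_per_block_scales(values, tensor_scale):
    """One scale per block of BLOCK_SIZE values, expressed relative to tensor_scale."""
    scales = []
    for start in range(0, len(values), BLOCK_SIZE):
        block = values[start:start + BLOCK_SIZE]
        block_amax = max(abs(v) for v in block)
        if tensor_scale == 0.0:
            # All-zero tensor: no meaningful scale, so store zero.
            scales.append(0.0)
            continue
        # Scale that maps the block amax onto the float4 range, then divide by
        # the global scale so the result fits the float8_e4m3fn range.
        scale = (block_amax / F4_E2M1_MAX) / tensor_scale
        scales.append(min(scale, F8_E4M3_MAX))
    return scales


# Two blocks of 16 values: the first contains the global amax (6.0),
# the second peaks at 3.0.
values = [6.0] + [0.5] * 15 + [3.0] + [0.25] * 15
tensor_scale = compute_per_tensor_scale(values)
block_scales = compute_per_block_scales(values, tensor_scale)
```

Under these assumptions, the block that contains the global amax receives the largest block scale, and the other blocks scale proportionally to their own amax.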
The activation per_tensor_scale can be determined in two ways:
- Dynamic per_tensor_scale (default, step=None, use_dynamic_per_tensor_scale=True):
  Both weight and activation per_tensor_scale are computed at runtime from the tensor amax.
- Static per_tensor_scale via observer flow (step="prepare"/"convert"):
  - Weight per_tensor_scale is computed from the weight amax at convert time.
  - Activation per_tensor_scale is determined statically during calibration: step="prepare" inserts observers; then, after calibration data has been run, step="convert" extracts the observed amax and bakes the activation per_tensor_scale into the quantized weight tensor.
  - At inference, the static activation per_tensor_scale is read from the weight tensor instead of being computed dynamically.
  - Note: activation per-block scales are still computed dynamically at inference time.
Note: When step is specified, use_dynamic_per_tensor_scale is automatically set to False.
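The observer flow could be driven roughly as follows. This is a sketch based only on the description on this page: it assumes that step accepts the strings "prepare" and "convert" as written above, that quantize_ is called once per step, and that calibration is simply a forward pass over representative inputs. The helper name calibrate_and_convert and its arguments are hypothetical.

```python
def calibrate_and_convert(model, calibration_batches):
    """Hypothetical helper: static activation per_tensor_scale via observers."""
    # Imports are local so that defining this helper does not require torchao.
    import torch
    from torchao.prototype.mx_formats.inference_workflow import (
        NVFP4DynamicActivationNVFP4WeightConfig,
    )
    from torchao.quantization import quantize_

    # Step 1: insert observers (assumes the string "prepare" is accepted).
    quantize_(model, config=NVFP4DynamicActivationNVFP4WeightConfig(step="prepare"))

    # Step 2: run calibration data so the observers can record activation amax.
    with torch.no_grad():
        for batch in calibration_batches:
            model(batch)

    # Step 3: convert, which bakes the observed activation per_tensor_scale
    # into the quantized weight tensor (assumes the string "convert" is accepted).
    quantize_(model, config=NVFP4DynamicActivationNVFP4WeightConfig(step="convert"))
    return model
```

The calibration batches should resemble the inputs expected at inference, since the static activation scale is derived from what the observers see.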
Configuration parameters:
- use_triton_kernel: bool, whether to use the fused Triton kernel for activation scaling (default: True). Requires MSLK to be installed.
- use_dynamic_per_tensor_scale: bool, whether to dynamically compute the per-tensor scale (default: True)
- step: Optional[QuantizationStep], the quantization step for the observer-based flow (default: None)
Data: float4_e2m1fn_x2
Scales: float8_e4m3fn
Block size: 16 along the reduction dim
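As a rough illustration of what these formats imply for storage, the arithmetic below counts bytes for a 2-D tensor. It assumes that float4_e2m1fn_x2 packs two 4-bit values per byte, that each 16-element block carries one 1-byte float8_e4m3fn scale, and that the float32 per_tensor_scale takes 4 bytes. Any padding or layout overhead is ignored, and nvfp4_storage_bytes is a hypothetical helper.

```python
def nvfp4_storage_bytes(rows, cols, block_size=16):
    """Approximate byte count for an NVFP4 tensor of shape (rows, cols)."""
    if cols % block_size != 0:
        raise ValueError("cols must be a multiple of the block size")
    num_elements = rows * cols
    data_bytes = num_elements // 2            # two 4-bit values per byte
    scale_bytes = num_elements // block_size  # one float8 scale per block
    per_tensor_scale_bytes = 4                # one float32 scalar
    return data_bytes + scale_bytes + per_tensor_scale_bytes


nvfp4_bytes = nvfp4_storage_bytes(128, 64)
bf16_bytes = 128 * 64 * 2  # bfloat16 uses 2 bytes per element
```

Under these assumptions the per-element cost is about 4.5 bits: 4 bits of data plus 8 bits of scale shared across 16 elements.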
Note: The Triton kernel only works in DYNAMIC mode and requires input dimensions that satisfy M % 128 == 0 and K % 64 == 0. When these constraints are not met, it falls back automatically.
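The shape constraints in the note can be checked ahead of time with a small predicate. The helper below is hypothetical and not part of torchao; it only restates the two modulo conditions, with M and K taken to be the two input dimensions named in the note.

```python
def meets_triton_shape_constraints(m, k):
    """Return True when M % 128 == 0 and K % 64 == 0, as the note requires."""
    return m % 128 == 0 and k % 64 == 0


aligned = meets_triton_shape_constraints(256, 128)
misaligned = meets_triton_shape_constraints(100, 64)
```

A check like this can help explain why a given input shape does not use the fused kernel, since the fallback itself happens automatically.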
Example:
import torch
import torch.nn as nn
from torchao.prototype.mx_formats.inference_workflow import (
    NVFP4DynamicActivationNVFP4WeightConfig,
)
from torchao.quantization import quantize_

model = nn.Linear(32, 128, bias=False, dtype=torch.bfloat16, device="cuda")
config = NVFP4DynamicActivationNVFP4WeightConfig(
    use_dynamic_per_tensor_scale=True,
    use_triton_kernel=True,
)
quantize_(model, config=config)
model = torch.compile(model, fullgraph=True)