.. _quantization:

Quantization (INT8 / FP8 / FP4)
=================================

Torch-TensorRT supports post-training quantization (PTQ) with **INT8**, **FP8**, and
**FP4** precisions via NVIDIA's ModelOpt library. ModelOpt inserts
quantize/dequantize (QDQ) nodes into the model graph; Torch-TensorRT then converts
those nodes into TRT quantization layers and sets the appropriate builder flags.

----

Prerequisites
-------------

Install ModelOpt (the ``nvidia-modelopt`` package):

.. code-block:: bash

    pip install nvidia-modelopt

Hardware requirements:

* **INT8**: Any NVIDIA GPU with TensorRT support.
* **FP8**: NVIDIA Hopper (H100) or newer.
* **FP4 (NVFP4)**: NVIDIA Blackwell (B100/B200) or newer; requires TensorRT ≥ 10.8.

----

INT8 / FP8 PTQ Workflow
------------------------

**Step 1: Calibrate the model with ModelOpt**

ModelOpt's ``mtq.quantize`` replaces eligible layers with QDQ wrappers and calibrates
the quantization scales using a small calibration dataset:

.. code-block:: python

    import torch
    import modelopt.torch.quantization as mtq

    # MyModel and calibration_dataloader are placeholders for your own
    # model class and calibration data.
    model = MyModel().eval().cuda()

    # Define a calibration loop (no gradient needed)
    def calibration_loop(model):
        with torch.no_grad():
            for batch in calibration_dataloader:
                model(batch.cuda())

    # INT8 configuration (per-tensor activations, per-channel weights)
    quant_cfg = mtq.INT8_DEFAULT_CFG
    # or FP8: quant_cfg = mtq.FP8_DEFAULT_CFG

    mtq.quantize(model, quant_cfg, forward_loop=calibration_loop)

**Step 2: Compile with Torch-TensorRT**

Pass the quantized model (with QDQ nodes) to the Torch-TensorRT compiler together
with the matching precision in ``enabled_precisions``:

.. code-block:: python

    import torch_tensorrt

    inputs = [torch.randn(1, 3, 224, 224).cuda()]

    # Use the call that matches the ModelOpt config chosen in Step 1.

    # INT8
    trt_model = torch_tensorrt.compile(
        model,
        ir="dynamo",
        arg_inputs=inputs,
        enabled_precisions={torch.int8},
        min_block_size=1,
    )

    # FP8
    trt_model = torch_tensorrt.compile(
        model,
        ir="dynamo",
        arg_inputs=inputs,
        enabled_precisions={torch.float8_e4m3fn},
        min_block_size=1,
    )

    output = trt_model(*inputs)

----

FP4 (NVFP4) Workflow
---------------------

FP4 uses **dynamic block quantization**: weights are quantized offline to a
block-scaled 4-bit format, and activations are dynamically quantized at runtime.
This path requires TensorRT ≥ 10.8 and a Blackwell GPU.

.. code-block:: python

    import modelopt.torch.quantization as mtq

    # model, inputs, and calibration_loop are defined as in the INT8 / FP8 workflow.

    # FP4 config (uses block quantization for weights)
    quant_cfg = mtq.NVFP4_DEFAULT_CFG
    mtq.quantize(model, quant_cfg, forward_loop=calibration_loop)

    trt_model = torch_tensorrt.compile(
        model,
        ir="dynamo",
        arg_inputs=inputs,
        enabled_precisions={torch.float4_e2m1fn_x2},
        min_block_size=1,
    )

----

Using ``ExportedProgram`` (``dynamo.compile``)
-----------------------------------------------

When using the ``torch.export`` → ``dynamo.compile`` path, wrap the export step in
``export_torch_mode`` from ModelOpt so that the QDQ custom ops are traced properly:

.. code-block:: python

    from modelopt.torch.quantization.utils import export_torch_mode

    # model is already quantized with ModelOpt; inputs is the list of example tensors.
    with export_torch_mode():
        exp_program = torch.export.export(model, tuple(inputs))

    trt_gm = torch_tensorrt.dynamo.compile(
        exp_program,
        arg_inputs=inputs,
        enabled_precisions={torch.int8},
    )

``MutableTorchTensorRTModule`` handles the ``export_torch_mode`` context
automatically when quantization precisions are detected, so no manual wrapping is
required. See :ref:`mutable_module`.

----

``torch.compile`` Path
-----------------------

Quantization also works with ``torch.compile``:

.. code-block:: python

    trt_model = torch.compile(
        model,  # already quantized with ModelOpt
        backend="torch_tensorrt",
        options={
            "enabled_precisions": {torch.int8},
            "min_block_size": 1,
        },
    )
    output = trt_model(*inputs)

----

How QDQ Nodes Are Converted
-----------------------------

When Torch-TensorRT encounters ``torch.ops.tensorrt.quantize_op.default`` nodes in
the graph (inserted by ModelOpt), the ``aten_ops_quantize_op`` converter maps them to
TRT ``IQuantizeLayer`` / ``IDequantizeLayer`` pairs. The TRT builder then fuses these
with adjacent compute layers (e.g. Conv, Linear) to produce INT8 or FP8 kernel
variants.

For FP4, ``torch.ops.tensorrt.dynamic_block_quantize_op.default`` nodes are converted
via the dynamic block quantize converter, which uses TRT's ``add_dynamic_quantize``
API (TRT ≥ 10.8).

The ``constant_folding`` lowering pass explicitly marks quantization ops as *impure*
to prevent their scales from being folded away before the TRT conversion step.

----

Verifying Quantized Layers
---------------------------

Use :ref:`dryrun` to check how many ops were partitioned into TRT blocks and whether
the quantized layers were included:

.. code-block:: python

    trt_gm = torch_tensorrt.dynamo.compile(
        exp_program,
        arg_inputs=inputs,
        enabled_precisions={torch.int8},
        dryrun=True,
    )

----

Supported Precision / Hardware Matrix
---------------------------------------

.. list-table::
   :widths: 15 30 30 25
   :header-rows: 1

   * - Precision
     - ``enabled_precisions`` value
     - Minimum GPU
     - TRT requirement
   * - INT8
     - ``torch.int8`` or ``torch_tensorrt.dtype.int8``
     - Any TRT-capable GPU
     - Any supported TRT version
   * - FP8
     - ``torch.float8_e4m3fn`` or ``torch_tensorrt.dtype.fp8``
     - NVIDIA Hopper (H100+)
     - TRT ≥ 8.6
   * - FP4 (NVFP4)
     - ``torch.float4_e2m1fn_x2`` or ``torch_tensorrt.dtype.fp4``
     - NVIDIA Blackwell (B100+)
     - TRT ≥ 10.8

----

Troubleshooting
---------------

**"Unable to import quantization op"**
    ModelOpt is not installed or ``torch.ops.tensorrt.quantize_op`` was not
    registered. Run ``pip install nvidia-modelopt`` and ensure the Torch-TensorRT
    package is imported before calling ``mtq.quantize``.

**QDQ nodes fall back to PyTorch (not TRT)**
    Check that ``enabled_precisions`` includes the matching dtype. Also verify that
    ``min_block_size`` is not too large; use ``dryrun=True`` to inspect coverage.

**"TensorRT-RTX does not support int8 activation quantization"**
    INT8 activation quantization (``input_quantizer`` nodes) is not supported by
    **TensorRT-RTX**, the Windows-native RTX inference library. INT8 weight
    quantization still works. Use a weight-only INT8 ModelOpt config, or compile on
    Linux with standard TensorRT instead of TensorRT-RTX.

**FP4 "requires TRT ≥ 10.8" error**
    Upgrade TensorRT. FP4 uses ``add_dynamic_quantize``, which is only available in
    TRT 10.8 and newer.
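
The GPU and TensorRT minimums listed in this document can also be checked before
compilation, which may surface the FP4 version error earlier. The following is a
minimal sketch and not part of Torch-TensorRT: ``REQUIREMENTS``, ``parse_version``,
and ``check_precision`` are hypothetical names, and mapping Hopper to compute
capability 9.0 and Blackwell (B100/B200) to 10.0 is an assumption made for
illustration.

.. code-block:: python

    from typing import Dict, List, NamedTuple, Optional, Tuple


    class Requirement(NamedTuple):
        # None means "no minimum beyond what TensorRT itself supports".
        min_compute_capability: Optional[Tuple[int, int]]
        min_trt_version: Optional[Tuple[int, int]]


    # Hypothetical table mirroring the precision / hardware matrix in this document.
    REQUIREMENTS: Dict[str, Requirement] = {
        "int8": Requirement(None, None),
        "fp8": Requirement((9, 0), (8, 6)),    # assumption: Hopper is SM 9.0
        "fp4": Requirement((10, 0), (10, 8)),  # assumption: B100/B200 is SM 10.0
    }


    def parse_version(version: str) -> Tuple[int, int]:
        """Reduce a version string such as '10.8.0.43' to (major, minor)."""
        parts = version.split(".")
        major = int(parts[0])
        minor = int(parts[1]) if len(parts) > 1 else 0
        return major, minor


    def check_precision(
        precision: str,
        compute_capability: Tuple[int, int],
        trt_version: str,
    ) -> List[str]:
        """Return a list of unmet requirements (empty if none are violated)."""
        req = REQUIREMENTS[precision]
        problems: List[str] = []
        if (
            req.min_compute_capability is not None
            and compute_capability < req.min_compute_capability
        ):
            problems.append(
                f"{precision} needs compute capability >= {req.min_compute_capability}, "
                f"got {compute_capability}"
            )
        if (
            req.min_trt_version is not None
            and parse_version(trt_version) < req.min_trt_version
        ):
            problems.append(
                f"{precision} needs TensorRT >= {req.min_trt_version}, got {trt_version}"
            )
        return problems

In a real environment the two inputs could plausibly be taken from
``torch.cuda.get_device_capability()`` and ``tensorrt.__version__``; they are passed
as plain arguments here so the check itself needs neither a GPU nor TensorRT.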