Troubleshooting

This guide lists the most common errors encountered when compiling or running Torch-TensorRT models, with root-cause explanations and recommended fixes.

For systematic compilation debugging, use the Debugger context manager. For pre-build coverage analysis, use Dryrun Mode.


Compilation Errors

“Lowering failed for node … No converter found”

A TensorRT converter does not exist for the ATen op in the graph. Options:

  • Check Operators Supported to confirm the op is listed. If it is, the issue may be that the specific overload or dtype combination is not covered.

  • Add torch_executed_ops={<op>} to run that op in PyTorch instead of TRT.

  • If require_full_compilation=True, remove it or add the op to torch_executed_ops.

  • Lower min_block_size so the surrounding TRT-convertible ops form a block even without the failing op.
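The fallback options above can be collected into one set of keyword arguments. The following is a minimal sketch: the helper name `fallback_settings` is hypothetical, and the op name is only an example written in the `torch.ops.aten.<op>.<overload>` string form that `torch_executed_ops` is assumed to accept here.

```python
def fallback_settings(unsupported_ops, min_block_size=1):
    """Keyword arguments that keep the listed ops in PyTorch (hypothetical helper)."""
    return {
        "ir": "dynamo",
        # Ops named here run in PyTorch instead of TRT.
        "torch_executed_ops": set(unsupported_ops),
        # A small value lets short runs of convertible ops still form TRT blocks.
        "min_block_size": min_block_size,
        # Allow a hybrid TRT / PyTorch graph.
        "require_full_compilation": False,
    }


settings = fallback_settings({"torch.ops.aten.nonzero.default"})

# With torch_tensorrt installed and `model` / `inputs` defined elsewhere:
#   trt_model = torch_tensorrt.compile(model, inputs=inputs, **settings)
```

Keeping the settings in a plain dict makes it easy to log exactly which ops were pushed to PyTorch for a given build.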

“Block size … is smaller than min_block_size”

The candidate TRT partition contains fewer convertible ops than min_block_size, so it falls back to PyTorch. This is expected behavior.

  • Use dryrun=True to see which ops are unsupported and where the partitions form; see Dryrun Mode.

  • Lower min_block_size to allow smaller TRT blocks.

  • Investigate whether the model can be restructured to group more ops together.
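To build intuition for how min_block_size interacts with unsupported ops, here is a deliberately simplified model. It treats the graph as a flat sequence of ops, which the real partitioner does not, and the function name `trt_blocks` is hypothetical.

```python
from itertools import groupby


def trt_blocks(convertible_flags, min_block_size):
    """Return (start_index, length) for each run of convertible ops kept in TRT."""
    blocks = []
    index = 0
    for convertible, run in groupby(convertible_flags):
        length = len(list(run))
        # A run only becomes a TRT block if it is long enough.
        if convertible and length >= min_block_size:
            blocks.append((index, length))
        index += length
    return blocks


# True = a converter exists for the op, False = the op must run in PyTorch.
flags = [True, True, False, True, True, True, True, True, False, True]

large_threshold = trt_blocks(flags, 5)
small_threshold = trt_blocks(flags, 1)
```

With a threshold of 5 only the middle run survives; with a threshold of 1 all three runs become TRT blocks, at the cost of more transitions between TRT and PyTorch.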

“assert_size_stride … Expected stride … but got …”

An input tensor’s stride does not match the stride recorded during tracing. This is most often caused by non-contiguous tensors (e.g. after a permute with no explicit contiguous() call). Call contiguous() on the affected input before passing it to the compiled module.
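The stride mismatch can be reproduced without any tensors. The sketch below computes row-major strides in plain Python and shows how a permute leaves strides that differ from those of a contiguous tensor of the same shape. The helper names are hypothetical and the arithmetic is only meant to mirror how PyTorch reports strides.

```python
def contiguous_strides(shape):
    """Row-major (C-contiguous) strides, in elements, for a shape."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def permute(shape, strides, dims):
    """Reorder shape and strides the way a permute view does (no data copy)."""
    return tuple(shape[d] for d in dims), tuple(strides[d] for d in dims)


shape = (2, 3, 4)
strides = contiguous_strides(shape)

# Swap the last two dimensions, as x.permute(0, 2, 1) would.
p_shape, p_strides = permute(shape, strides, (0, 2, 1))

# Strides a freshly allocated tensor of the permuted shape would have.
expected = contiguous_strides(p_shape)
needs_contiguous = p_strides != expected
```

In PyTorch terms, `x.permute(0, 2, 1).contiguous()` copies the data so that the resulting strides match `expected`.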

Compilation hangs or takes very long (>30 minutes)

TRT engine compilation is CPU/GPU intensive and can legitimately take a long time for large models — especially with optimization_level=3 (default) or higher.

  • Check that compilation is actually progressing by enabling debug logging:

    torch_tensorrt.compile(model, ..., debug=True)
    

    You should see per-layer TRT build messages. If the last message is stuck at the same layer for more than 10 minutes, compilation may be hung.

  • Reduce optimization_level to 0 or 1 during development. Use higher levels only for final production builds.

  • Large models (>1B parameters) may need offload_module_to_cpu=True to avoid OOM during compilation.

  • If using torch.compile(backend="torch_tensorrt") (JIT), compilation is triggered on the first forward pass — the call will block until done. For large models, prefer the AOT path (ir="dynamo") to make the compilation step explicit.

  • Check whether the model contains ops that trigger very expensive TRT searches (e.g. very large convolutions). Use Dryrun Mode to identify the problematic ops and consider using torch_executed_ops to push them to PyTorch.
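One way to apply the advice above is to keep separate settings for development and production builds. This is a sketch only: `build_settings` and the stage names are hypothetical, and the 1B-parameter cutoff simply mirrors the rule of thumb in the list.

```python
def build_settings(stage, num_parameters):
    """Pick compile settings for a build stage (hypothetical helper)."""
    settings = {
        # Low levels build quickly; 3 is the default mentioned above.
        "optimization_level": 1 if stage == "dev" else 3,
    }
    # Rule of thumb from the text: offload very large models during the build.
    if num_parameters > 1_000_000_000:
        settings["offload_module_to_cpu"] = True
    return settings


dev = build_settings("dev", num_parameters=7_000_000_000)
prod = build_settings("prod", num_parameters=25_000_000)

# With torch_tensorrt installed and `model` / `inputs` defined elsewhere:
#   trt_model = torch_tensorrt.compile(model, ir="dynamo", inputs=inputs, **dev)
```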

Export fails with “Cannot export …” / data-dependent control flow error

The model uses Python-level branching on tensor values which torch.export.export (strict mode) cannot trace.

  • Use torch_tensorrt.dynamo.trace(model, inputs, strict=False) to enable non-strict tracing (allows data-dependent control flow).

  • Alternatively, rewrite the dynamic branch as a TRT-compatible conditional.

“ModuleNotFoundError: No module named ‘modelopt’”

The model has INT8/FP8 quantization nodes but ModelOpt is not installed.

pip install nvidia-modelopt

See Quantization (INT8 / FP8 / FP4) for the full quantization workflow.
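To catch the missing dependency before compilation starts, a script can check for the package up front. This sketch uses only the standard library; the helper name `is_installed` is hypothetical.

```python
import importlib.util


def is_installed(module_name):
    """True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


if not is_installed("modelopt"):
    print("ModelOpt not found; install it with: pip install nvidia-modelopt")
```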


Memory Errors During Compilation

CUDA out-of-memory during engine build

  • Set offload_module_to_cpu=True in compilation settings to free the original model from GPU during compilation.

  • Reduce workspace_size (default is large to allow TRT to find optimal kernels).

  • Use lazy_engine_init=True to defer engine initialization until all subgraph compilations are complete.

  • See Resource Management for a systematic memory reduction strategy.

Process killed (OOM) on CPU

TRT compilation can use up to 5× the model size in CPU memory.

  • Set TORCHTRT_ENABLE_BUILDER_MALLOC_TRIM=1 to reduce to ~3× model size.

  • Set offload_module_to_cpu=False to drop another 1× copy; the original model then stays on the GPU during compilation.
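As a rough worked example of the multipliers above, the sketch below estimates peak CPU memory for a given model size. The figures are the approximate ones quoted in this section, the helper `peak_cpu_memory_gb` is hypothetical, and it assumes the 5× baseline already includes the offloaded copy. Setting the variable before importing torch_tensorrt is also an assumption about when it is read.

```python
import os


def peak_cpu_memory_gb(model_gb, malloc_trim=False, offload_module_to_cpu=True):
    """Approximate peak CPU memory during a build, using the rough multipliers above."""
    multiplier = 3 if malloc_trim else 5
    if not offload_module_to_cpu:
        # No CPU copy of the original model is kept.
        multiplier -= 1
    return multiplier * model_gb


baseline = peak_cpu_memory_gb(10)
trimmed = peak_cpu_memory_gb(10, malloc_trim=True)
minimal = peak_cpu_memory_gb(10, malloc_trim=True, offload_module_to_cpu=False)

# Enable malloc trimming for this process before compilation starts.
os.environ["TORCHTRT_ENABLE_BUILDER_MALLOC_TRIM"] = "1"
```

For a 10 GB model this gives roughly 50 GB by default, 30 GB with malloc trimming, and 20 GB with trimming and no CPU offload.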


Runtime Errors

“Engine failed to deserialize” / engine load fails

  • The TRT version on the loading machine is older than the one used to build the engine. Upgrade TRT or rebuild with version_compatible=True.

  • The GPU compute capability is lower than on the build machine. Rebuild with hardware_compatible=True (requires Ampere or newer).

  • The .ep file was generated with use_python_runtime=True, which produces a module that cannot be serialized. Rebuild with the default C++ runtime.

Shape mismatch at runtime / “Invalid input shape”

  • The model was compiled for static shapes but receives a different shape at runtime. Recompile using torch_tensorrt.Input with min/opt/max shapes to enable dynamic shapes.

  • If using MutableTorchTensorRTModule, call set_expected_dynamic_shape_range before the first forward pass.
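A quick pre-flight check can tell whether a runtime shape falls inside the range an engine was built for. The sketch below is plain Python; `shape_in_range` and the example shapes are hypothetical, and the commented call shows how the same tuples would be passed to torch_tensorrt.Input.

```python
def shape_in_range(shape, min_shape, max_shape):
    """True if every dimension of `shape` lies within [min, max]."""
    if not (len(shape) == len(min_shape) == len(max_shape)):
        return False
    return all(lo <= dim <= hi for dim, lo, hi in zip(shape, min_shape, max_shape))


# Example range: batch size may vary from 1 to 32, other dimensions are fixed.
MIN_SHAPE = (1, 3, 224, 224)
OPT_SHAPE = (8, 3, 224, 224)
MAX_SHAPE = (32, 3, 224, 224)

ok = shape_in_range((16, 3, 224, 224), MIN_SHAPE, MAX_SHAPE)
too_big = shape_in_range((64, 3, 224, 224), MIN_SHAPE, MAX_SHAPE)

# With torch and torch_tensorrt installed:
#   dynamic_input = torch_tensorrt.Input(
#       min_shape=MIN_SHAPE, opt_shape=OPT_SHAPE, max_shape=MAX_SHAPE,
#       dtype=torch.float32,
#   )
```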

Wrong numerical results (large error vs PyTorch)

  • Enable higher precision: cast the model and inputs to float32, or remove model.half() / model.bfloat16() calls. Note that enabled_precisions is deprecated; use use_explicit_typing=True and set dtypes in the model directly.

  • TF32 is enabled by default on Ampere and newer GPUs. Disable it with disable_tf32=True for bit-exact FP32 comparison.

  • If using cross-compiled Windows engines, floating-point results may differ slightly due to driver differences. Use optimization_level=0 to minimize kernel specialization.
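Before changing precision settings, it helps to quantify how far the TRT output is from the PyTorch reference. The sketch below computes maximum absolute and relative error over flat Python sequences; the helper names and sample values are hypothetical.

```python
def max_abs_error(reference, candidate):
    """Largest absolute element-wise difference."""
    return max(abs(r - c) for r, c in zip(reference, candidate))


def max_rel_error(reference, candidate, eps=1e-12):
    """Largest element-wise difference relative to the reference magnitude."""
    return max(abs(r - c) / max(abs(r), eps) for r, c in zip(reference, candidate))


reference = [1.0, -2.0, 4.0]   # e.g. flattened PyTorch output
candidate = [1.0, -2.5, 4.0]   # e.g. flattened TRT output

abs_err = max_abs_error(reference, candidate)
rel_err = max_rel_error(reference, candidate)
```

When working with tensors directly, `torch.testing.assert_close(trt_out, ref_out, rtol=..., atol=...)` performs a comparable tolerance check.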

“DynamicOutputAllocator required” / nonzero / unique ops fail

The model contains data-dependent-shape ops (nonzero, unique, masked_select, etc.) which require TRT’s output allocator.

  • Use PythonTorchTensorRTModule (use_python_runtime=True) — it activates the dynamic output allocator automatically via requires_output_allocator=True.

  • See CUDAGraphs and the Output Allocator for DynamicOutputAllocator details.


Accuracy / Performance Issues

Model is slower than expected after compilation

  • Warm up the model with at least 5 forward passes before measuring — TRT engines load kernels lazily on first use.

  • Use CUDA events for timing, not time.time() (wall-clock includes Python and CPU/GPU sync overhead).

  • Run with dryrun=True to check what fraction of ops are running in TRT vs PyTorch (high PyTorch fallback = low coverage = slow).

  • Increase optimization_level (0–5, default 3) to allow TRT more time to find faster kernels.

  • Use FP16 for throughput-critical workloads: model.half() + use_explicit_typing=True, or enable_autocast=True + autocast_low_precision_type=torch.float16 (enabled_precisions is deprecated).

  • Try use_fast_partitioner=False for global partitioning — it is slower to compile but may produce better-performing partitions.

  • See Performance Tuning Guide for a complete benchmarking guide.
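The warm-up and timing advice above can be combined into a small benchmark helper. This is a sketch under stated assumptions: `benchmark_ms` is a hypothetical name, `run_once` is any zero-argument callable that performs one forward pass, and calling the helper requires a CUDA-enabled PyTorch build.

```python
def benchmark_ms(run_once, warmup=5, iters=50):
    """Average latency in milliseconds of `run_once`, timed with CUDA events."""
    import torch  # imported lazily; needs a CUDA-enabled PyTorch build

    # Warm-up passes are excluded from the measurement.
    for _ in range(warmup):
        run_once()
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()
    for _ in range(iters):
        run_once()
    end.record()

    # Wait for the recorded work to finish before reading the timer.
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters


# Usage, with `trt_model` and `example_input` defined elsewhere:
#   latency = benchmark_ms(lambda: trt_model(example_input))
```

Running the same helper on the original PyTorch model gives a like-for-like baseline.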

High latency with variable input sizes

  • Set min/opt/max shapes on torch_tensorrt.Input — TRT optimizes for the opt shape. Make opt match the most frequent production input.

  • Use CUDAGraphs (enable_cudagraphs=True) for fixed-shape low-latency inference; see CUDAGraphs and the Output Allocator.

“How do I do weight-only quantization (A16W8 / W8A16)?”

Weight-only quantization compresses model weights to INT8 while keeping activations in FP16/BF16. This is common for large language models where memory bandwidth is the bottleneck. To do this with ModelOpt, create a custom quantization config that enables weight_quantizer but disables input_quantizer:

import modelopt.torch.quantization as mtq
import torch_tensorrt

# `model`, `inputs`, and `calibration_loop` are assumed to be defined elsewhere.

# Custom weight-only INT8 config
quant_cfg = {
    "quant_cfg": {
        "*weight_quantizer": {"num_bits": 8, "axis": 0},  # 8-bit weights, quantized along axis 0
        "*input_quantizer": {"enable": False},  # activations stay FP16
        "default": {"enable": False},
    },
    "algorithm": "max",
}

# Insert the quantizers and calibrate them using the forward loop.
mtq.quantize(model, quant_cfg, forward_loop=calibration_loop)

trt_model = torch_tensorrt.compile(
    model, ir="dynamo", arg_inputs=inputs,
    use_explicit_typing=True,  # enabled_precisions deprecated; use model dtypes
)

See NVIDIA ModelOpt documentation for the full list of built-in quantization configs and customization options.


Distributed / Tensor-Parallel Issues

“DTensor inputs detected but use_distributed_mode_trace=False”

The model uses DTensor parameters (from parallelize_module) but the default export path does not handle them.

  • Set use_distributed_mode_trace=True in compilation options.

  • See Distributed Inference for the full tensor-parallel workflow.


Getting More Information

  1. Enable debug logging — wrap the compilation call in the torch_tensorrt.dynamo.Debugger context with log_level="debug" and inspect <logging_dir>/torch_tensorrt_logging.log.

  2. Capture FX graphs — use capture_fx_graph_before / capture_fx_graph_after in the torch_tensorrt.dynamo.Debugger to see what the graph looks like at each lowering-pass boundary.

  3. Dryrun — compile with dryrun=True to see the partition layout and coverage percentage without actually building TRT engines.

  4. Layer info — compile with save_layer_info=True in the Debugger and inspect engine_layer_info.json to see what TRT kernels were selected.

  5. File a bug — include the output of the above and the model’s exported_program.print_readable() when reporting issues at pytorch/TensorRT#issues.
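Steps 1 and 4 can be combined in a single compilation run. The sketch below uses only the Debugger options named in the list; the wrapper function `compile_with_debug_logs` is hypothetical, and the exact Debugger signature should be checked against the installed version.

```python
def compile_with_debug_logs(model, inputs, logging_dir):
    """Compile inside the Debugger context so logs and layer info are written."""
    import torch_tensorrt  # imported lazily; requires a working TRT install

    with torch_tensorrt.dynamo.Debugger(
        log_level="debug",
        logging_dir=logging_dir,   # torch_tensorrt_logging.log is written here
        save_layer_info=True,      # also emit engine_layer_info.json
    ):
        return torch_tensorrt.compile(model, ir="dynamo", arg_inputs=inputs)


# Usage, with `model` and `inputs` defined elsewhere:
#   trt_model = compile_with_debug_logs(model, inputs, "/tmp/trt_debug")
```

Attaching the resulting log directory to a bug report covers most of what step 5 asks for.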