Troubleshooting#

This guide lists the most common errors encountered when compiling or running Torch-TensorRT models, with root-cause explanations and recommended fixes.

For systematic compilation debugging use the Debugger context manager. For pre-build coverage analysis use Dryrun Mode.

Compilation Errors#

“Lowering failed for node … No converter found”

A TensorRT converter does not exist for the ATen op in the graph. Options:

Check Operators Supported to confirm the op is listed. If it is, the issue may be that the specific overload or dtype combination is not covered.

Add torch_executed_ops={<op>} to run that op in PyTorch instead of TRT.

If require_full_compilation=True, remove it or add the op to torch_executed_ops.

Lower min_block_size so the surrounding TRT-convertible ops form a block even without the failing op.

“Block size … is smaller than min_block_size”

The TRT partition contains only a few convertible ops — fewer than min_block_size — and falls back to PyTorch. This is expected behavior.

Use dryrun=True to see which ops are unsupported and where the partitions form; see Dryrun Mode.

Lower min_block_size to allow smaller TRT blocks, or

Investigate whether the model can be restructured to group more ops together.

“assert_size_stride … Expected stride … but got …”

An input tensor’s stride does not match what was used during tracing. Most often caused by non-contiguous tensors (e.g. after a permute with no explicit contiguous() call).

Add .contiguous() before passing the tensor to the model.

If you use complex inputs, see complex tensor handling note.

Compilation hangs or takes very long (>30 minutes)

TRT engine compilation is CPU/GPU intensive and can legitimately take a long time for large models — especially with optimization_level=3 (default) or higher.
Check that compilation is actually progressing by enabling debug logging:
torch_tensorrt.compile(model, ..., debug=True)
You should see per-layer TRT build messages. If the last message is stuck at the same layer for more than 10 minutes, compilation may be hung.
Reduce optimization_level to 0 or 1 during development. Use higher levels only for final production builds.

Large models (>1B parameters) may need offload_module_to_cpu=True to avoid OOM during compilation.

If using torch.compile(backend="torch_tensorrt") (JIT), compilation is triggered on the first forward pass — the call will block until done. For large models, prefer the AOT path (ir="dynamo") to make the compilation step explicit.

Check whether the model contains ops that trigger very expensive TRT searches (e.g. very large convolutions). Use Dryrun Mode to identify the problematic ops and consider using torch_executed_ops to push them to PyTorch.

Export fails with “Cannot export …” / data-dependent control flow error

The model uses Python-level branching on tensor values which torch.export.export (strict mode) cannot trace.

Use torch_tensorrt.dynamo.trace(model, inputs, strict=False) to enable non-strict tracing (allows data-dependent control flow).

Alternatively, rewrite the dynamic branch as a TRT-compatible conditional.

“ModuleNotFoundError: No module named ‘modelopt’”

The model has INT8/FP8 quantization nodes but ModelOpt is not installed.
pip install nvidia-modelopt
See Quantization (INT8 / FP8 / FP4) for the full quantization workflow.

Memory Errors During Compilation#

CUDA out-of-memory during engine build

Set offload_module_to_cpu=True in compilation settings to free the original model from GPU during compilation.

Reduce workspace_size (default is large to allow TRT to find optimal kernels).

Use lazy_engine_init=True to defer engine initialization until all subgraph compilations are complete.

See Resource Management for a systematic memory reduction strategy.

Process killed (OOM) on CPU

TRT compilation can use up to 5× the model size in CPU memory.

Set TORCHTRT_ENABLE_BUILDER_MALLOC_TRIM=1 to reduce to ~3× model size.

Disable offload_module_to_cpu (False) to drop another 1× copy.

Runtime Errors#

“Engine failed to deserialize” / engine load fails

The TRT version on the loading machine is older than the one used to build the engine. Upgrade TRT or rebuild with version_compatible=True.

The GPU compute capability is lower than on the build machine. Rebuild with hardware_compatible=True (requires Ampere or newer).

Shape mismatch at runtime / “Invalid input shape”

The model was compiled for static shapes but receives a different shape at runtime. Recompile using torch_tensorrt.Input with min/opt/max shapes to enable dynamic shapes.

If using MutableTorchTensorRTModule, call set_expected_dynamic_shape_range before the first forward pass.

Wrong numerical results (large error vs PyTorch)

Enable higher precision: cast model and inputs to float32, or remove model.half() / model.bfloat16() calls. Strong typing is always enabled — set dtypes directly in the model and inputs.

TF32 is enabled by default on Ampere and newer GPUs. Disable it with disable_tf32=True for bit-exact FP32 comparison.

If using cross-compiled Windows engines, floating-point results may differ slightly due to driver differences. Use optimization_level=0 to minimize kernel specialization.

“DynamicOutputAllocator required” / nonzero / unique ops fail

The model contains data-dependent-shape ops (nonzero, unique, masked_select, etc.) which require TRT’s output allocator.

Use torch_tensorrt.runtime.TorchTensorRTModule (or a compiled graph that wraps it) with requires_output_allocator=True so the runtime can use TRT’s output allocator when the engine needs dynamic output allocation.

See CUDAGraphs and the Output Allocator for DynamicOutputAllocator details.

Accuracy / Performance Issues#

Model is slower than expected after compilation

Warm up the model with at least 5 forward passes before measuring — TRT engines load kernels lazily on first use.

Use CUDA events for timing, not time.time() (wall-clock includes Python and CPU/GPU sync overhead).

Run with dryrun=True to check what fraction of ops are running in TRT vs PyTorch (high PyTorch fallback = low coverage = slow).

Increase optimization_level (0–5, default 3) to allow TRT more time to find faster kernels.

Use FP16 for throughput-critical workloads: model.half() with FP16 inputs, or enable_autocast=True + autocast_low_precision_type=torch.float16.

Try use_fast_partitioner=False for global partitioning — it is slower to compile but may produce better-performing partitions.

See Performance Tuning Guide for a complete benchmarking guide.

High latency with variable input sizes

Set min/opt/max shapes on torch_tensorrt.Input — TRT optimizes for the opt shape. Make opt match the most frequent production input.

Use CUDAGraphs (enable_cudagraphs=True) for fixed-shape low-latency inference; see CUDAGraphs and the Output Allocator.

“How do I do weight-only quantization (A16W8 / W8A16)?”

Weight-only quantization compresses model weights to INT8 while keeping activations in FP16/BF16. This is common for large language models where memory bandwidth is the bottleneck. To do this with ModelOpt, create a custom quantization config that enables weight_quantizer but disables input_quantizer:
import modelopt.torch.quantization as mtq

# Custom weight-only INT8 config
quant_cfg = {
    "quant_cfg": {
        "*weight_quantizer": {"num_bits": 8, "axis": 0},
        "*input_quantizer": {"enable": False},  # activations stay FP16
        "default": {"enable": False},
    },
    "algorithm": "max",
}

mtq.quantize(model, quant_cfg, forward_loop=calibration_loop)

trt_model = torch_tensorrt.compile(
    model, ir="dynamo", arg_inputs=inputs,
)
See NVIDIA ModelOpt documentation for the full list of built-in quantization configs and customization options.

Distributed / Tensor-Parallel Issues#

“DTensor inputs detected but use_distributed_mode_trace=False”

The model uses DTensor parameters (from parallelize_module) but the default export path does not handle them.

Set use_distributed_mode_trace=True in compilation options.

See Distributed Inference for the full tensor-parallel workflow.

Getting More Information#

Enable debug logging — wrap the compilation call in the torch_tensorrt.dynamo.Debugger context with log_level="debug" and inspect <logging_dir>/torch_tensorrt_logging.log.
Capture FX graphs — use capture_fx_graph_before / capture_fx_graph_after in the torch_tensorrt.dynamo.Debugger to see what the graph looks like at each lowering-pass boundary.
Dryrun — compile with dryrun=True to see the partition layout and coverage percentage without actually building TRT engines.
Layer info — compile with save_layer_info=True in the Debugger and inspect engine_layer_info.json to see what TRT kernels were selected.
File a bug — include the output of the above and the model’s exported_program.print_readable() when reporting issues at pytorch/TensorRT#issues.