Troubleshooting#
This guide lists the most common errors encountered when compiling or running Torch-TensorRT models, with root-cause explanations and recommended fixes.
For systematic compilation debugging, use the Debugger context manager. For pre-build coverage analysis, use Dryrun Mode.
Compilation Errors#
“Lowering failed for node … No converter found”
A TensorRT converter does not exist for the ATen op in the graph. Options:
- Check Operators Supported to confirm the op is listed. If it is, the issue may be that the specific overload or dtype combination is not covered.
- Add `torch_executed_ops={<op>}` to run that op in PyTorch instead of TRT.
- If `require_full_compilation=True`, remove it or add the op to `torch_executed_ops`.
- Lower `min_block_size` so the surrounding TRT-convertible ops form a block even without the failing op.
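The fallback options above can be combined in a single compile call; a configuration sketch, where the model, inputs, and excluded op name are placeholders:

```python
import torch
import torch_tensorrt

# Placeholder model/inputs; the excluded op name is illustrative.
model = MyModel().eval().cuda()
inputs = [torch.randn(1, 3, 224, 224, device="cuda")]

trt_model = torch_tensorrt.compile(
    model,
    ir="dynamo",
    arg_inputs=inputs,
    torch_executed_ops={"torch.ops.aten.nonzero.default"},  # run this op in PyTorch
    min_block_size=1,  # allow small TRT blocks around the excluded op
)
```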
“Block size … is smaller than min_block_size”
The TRT partition contains only a few convertible ops (fewer than `min_block_size`), so it falls back to PyTorch. This is expected behavior.
- Use `dryrun=True` to see which ops are unsupported and where the partitions form; see Dryrun Mode.
- Lower `min_block_size` to allow smaller TRT blocks, or
- Investigate whether the model can be restructured to group more ops together.
“assert_size_stride … Expected stride … but got …”
An input tensor’s stride does not match what was used during tracing. Most often caused by non-contiguous tensors (e.g. after a `permute` with no explicit `contiguous()` call).
- Add `.contiguous()` before passing the tensor to the model.
- If you use complex inputs, see the complex tensor handling note.
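The stride problem and its fix can be reproduced in plain PyTorch, independent of Torch-TensorRT:

```python
import torch

x = torch.randn(2, 3, 4).permute(0, 2, 1)  # view with non-standard strides
print(x.is_contiguous())                   # False: strides no longer match shape
y = x.contiguous()                         # materialize a densely-strided copy
print(y.is_contiguous())                   # True: safe to pass to the compiled model
```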
Compilation hangs or takes very long (>30 minutes)
TRT engine compilation is CPU/GPU intensive and can legitimately take a long time for large models, especially with `optimization_level=3` (the default) or higher.
Check that compilation is actually progressing by enabling debug logging:
`torch_tensorrt.compile(model, ..., debug=True)`

You should see per-layer TRT build messages. If the last message is stuck at the same layer for more than 10 minutes, compilation may be hung.
- Reduce `optimization_level` to 0 or 1 during development. Use higher levels only for final production builds.
- Large models (>1B parameters) may need `offload_module_to_cpu=True` to avoid OOM during compilation.
- If using `torch.compile(backend="torch_tensorrt")` (JIT), compilation is triggered on the first forward pass and the call will block until done. For large models, prefer the AOT path (`ir="dynamo"`) to make the compilation step explicit.
- Check whether the model contains ops that trigger very expensive TRT kernel searches (e.g. very large convolutions). Use Dryrun Mode to identify the problematic ops and consider using `torch_executed_ops` to push them to PyTorch.
Export fails with “Cannot export …” / data-dependent control flow error
The model uses Python-level branching on tensor values, which `torch.export.export` (strict mode) cannot trace.
- Use `torch_tensorrt.dynamo.trace(model, inputs, strict=False)` to enable non-strict tracing, which allows data-dependent control flow.
- Alternatively, rewrite the dynamic branch as a TRT-compatible conditional.
“ModuleNotFoundError: No module named ‘modelopt’”
The model has INT8/FP8 quantization nodes but ModelOpt is not installed.
`pip install nvidia-modelopt`

See Quantization (INT8 / FP8 / FP4) for the full quantization workflow.
Memory Errors During Compilation#
CUDA out-of-memory during engine build
- Set `offload_module_to_cpu=True` in the compilation settings to free the original model from GPU memory during compilation.
- Reduce `workspace_size` (the default is large to allow TRT to find optimal kernels).
- Use `lazy_engine_init=True` to defer engine initialization until all subgraph compilations are complete.
- See Resource Management for a systematic memory-reduction strategy.
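The three settings above can be combined in one compile call; a configuration sketch for the AOT dynamo path, where the model, inputs, and workspace cap are placeholders:

```python
import torch_tensorrt

# Memory-constrained compilation sketch; model/inputs are placeholders.
trt_model = torch_tensorrt.compile(
    model,
    ir="dynamo",
    arg_inputs=inputs,
    offload_module_to_cpu=True,  # free the GPU copy of the source model
    workspace_size=1 << 30,      # cap TRT workspace at 1 GiB (illustrative)
    lazy_engine_init=True,       # defer engine init until all subgraphs compile
)
```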
Process killed (OOM) on CPU
TRT compilation can use up to 5× the model size in CPU memory.
- Set `TORCHTRT_ENABLE_BUILDER_MALLOC_TRIM=1` to reduce this to roughly 3× the model size.
- Disable `offload_module_to_cpu` (set it to `False`) to drop another 1× copy.
Runtime Errors#
“Engine failed to deserialize” / engine load fails
- The TRT version on the loading machine is older than the one used to build the engine. Upgrade TRT or rebuild with `version_compatible=True`.
- The GPU compute capability is lower than on the build machine. Rebuild with `hardware_compatible=True` (requires Ampere or newer).
- The `.ep` file was generated with `use_python_runtime=True`, which is not serializable. Rebuild with the default C++ runtime.
Shape mismatch at runtime / “Invalid input shape”
- The model was compiled for static shapes but receives a different shape at runtime. Recompile using `torch_tensorrt.Input` with `min`/`opt`/`max` shapes to enable dynamic shapes.
- If using `MutableTorchTensorRTModule`, call `set_expected_dynamic_shape_range` before the first forward pass.
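A dynamic-shape compile might look like the following sketch (the shapes are illustrative and `model` is a placeholder):

```python
import torch
import torch_tensorrt

# Shape range for a hypothetical image model; min/opt/max are illustrative.
inp = torch_tensorrt.Input(
    min_shape=(1, 3, 224, 224),
    opt_shape=(8, 3, 224, 224),  # TRT tunes kernels for the opt shape
    max_shape=(32, 3, 224, 224),
    dtype=torch.float32,
)
trt_model = torch_tensorrt.compile(model, ir="dynamo", arg_inputs=[inp])
```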
Wrong numerical results (large error vs PyTorch)
- Enable higher precision: cast the model and inputs to `float32`, or remove `model.half()`/`model.bfloat16()` calls (`enabled_precisions` is deprecated; use `use_explicit_typing=True` and set dtypes in the model directly).
- TF32 is enabled by default on Ampere and newer GPUs. Disable it with `disable_tf32=True` for bit-exact FP32 comparison.
- If using cross-compiled Windows engines, floating-point results may differ slightly due to driver differences. Use `optimization_level=0` to minimize kernel specialization.
“DynamicOutputAllocator required” / nonzero / unique ops fail
The model contains data-dependent-shape ops (`nonzero`, `unique`, `masked_select`, etc.), which require TRT’s output allocator.
- Use `PythonTorchTensorRTModule` (`use_python_runtime=True`); it activates the dynamic output allocator automatically via `requires_output_allocator=True`.
- See CUDAGraphs and the Output Allocator for `DynamicOutputAllocator` details.
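Opting into the Python runtime is a single compile-time switch; a configuration sketch, where the model and inputs are placeholders:

```python
import torch_tensorrt

# Python runtime enables the dynamic output allocator needed by
# data-dependent-shape ops; model/inputs are placeholders.
trt_model = torch_tensorrt.compile(
    model,
    ir="dynamo",
    arg_inputs=inputs,
    use_python_runtime=True,
)
```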
Accuracy / Performance Issues#
Model is slower than expected after compilation
- Warm up the model with at least 5 forward passes before measuring; TRT engines load kernels lazily on first use.
- Use CUDA events for timing, not `time.time()` (wall-clock time includes Python and CPU/GPU sync overhead).
- Run with `dryrun=True` to check what fraction of ops run in TRT vs. PyTorch (high PyTorch fallback means low coverage and slow execution).
- Increase `optimization_level` (0–5, default 3) to allow TRT more time to find faster kernels.
- Use FP16 for throughput-critical workloads: `model.half()` + `use_explicit_typing=True`, or `enable_autocast=True` + `autocast_low_precision_type=torch.float16` (`enabled_precisions` is deprecated).
- Try `use_fast_partitioner=False` for global partitioning; it is slower to compile but may produce better-performing partitions.
- See the Performance Tuning Guide for a complete benchmarking guide.
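Putting the warm-up and CUDA-event advice together, a small benchmarking helper might look like this sketch (it falls back to wall-clock timing when no GPU is present, so it runs anywhere):

```python
import time
import torch

def benchmark(model, inputs, warmup=5, iters=50):
    """Return average latency in ms, using CUDA events on GPU."""
    for _ in range(warmup):
        model(*inputs)  # lazy kernel loading happens during warm-up
    if torch.cuda.is_available():
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        torch.cuda.synchronize()
        start.record()
        for _ in range(iters):
            model(*inputs)
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / iters
    # CPU fallback so the sketch is runnable without a GPU
    t0 = time.perf_counter()
    for _ in range(iters):
        model(*inputs)
    return (time.perf_counter() - t0) * 1000.0 / iters
```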
High latency with variable input sizes
- Set `min`/`opt`/`max` shapes on `torch_tensorrt.Input`; TRT optimizes for the `opt` shape, so make `opt` match the most frequent production input.
- Use CUDAGraphs (`enable_cudagraphs=True`) for fixed-shape low-latency inference; see CUDAGraphs and the Output Allocator.
“How do I do weight-only quantization (A16W8 / W8A16)?”
Weight-only quantization compresses model weights to INT8 while keeping activations in FP16/BF16. This is common for large language models, where memory bandwidth is the bottleneck. To do this with ModelOpt, create a custom quantization config that enables `weight_quantizer` but disables `input_quantizer`:

```python
import modelopt.torch.quantization as mtq

# Custom weight-only INT8 config
quant_cfg = {
    "quant_cfg": {
        "*weight_quantizer": {"num_bits": 8, "axis": 0},
        "*input_quantizer": {"enable": False},  # activations stay FP16
        "default": {"enable": False},
    },
    "algorithm": "max",
}
mtq.quantize(model, quant_cfg, forward_loop=calibration_loop)

trt_model = torch_tensorrt.compile(
    model,
    ir="dynamo",
    arg_inputs=inputs,
    use_explicit_typing=True,  # enabled_precisions deprecated; use model dtypes
)
```

See the NVIDIA ModelOpt documentation for the full list of built-in quantization configs and customization options.
Distributed / Tensor-Parallel Issues#
“DTensor inputs detected but use_distributed_mode_trace=False”
The model uses `DTensor` parameters (from `parallelize_module`), but the default export path does not handle them.

- Set `use_distributed_mode_trace=True` in the compilation options.
- See Distributed Inference for the full tensor-parallel workflow.
Getting More Information#
- Enable debug logging: wrap the compilation call in the `torch_tensorrt.dynamo.Debugger` context with `log_level="debug"` and inspect `<logging_dir>/torch_tensorrt_logging.log`.
- Capture FX graphs: use `capture_fx_graph_before`/`capture_fx_graph_after` in the `torch_tensorrt.dynamo.Debugger` to see what the graph looks like at each lowering-pass boundary.
- Dryrun: compile with `dryrun=True` to see the partition layout and coverage percentage without actually building TRT engines.
- Layer info: compile with `save_layer_info=True` in the Debugger and inspect `engine_layer_info.json` to see which TRT kernels were selected.
- File a bug: include the output of the above and the model’s `exported_program.print_readable()` when reporting issues at pytorch/TensorRT#issues.
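Several of the options above can be combined in a single `Debugger` context; a configuration sketch, where the logging directory is a placeholder and `model`/`inputs` are assumed to exist:

```python
import torch_tensorrt

# Enable verbose logs and per-layer info in one debugging context.
with torch_tensorrt.dynamo.Debugger(
    log_level="debug",
    logging_dir="/tmp/torchtrt_debug",  # placeholder path
    save_layer_info=True,               # writes engine_layer_info.json
):
    trt_model = torch_tensorrt.compile(model, ir="dynamo", arg_inputs=inputs)
```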