# CompilationSettings Reference

`CompilationSettings` is the single dataclass that controls every aspect of Torch-TensorRT Dynamo compilation. It is passed (directly or via keyword arguments) to `torch_tensorrt.dynamo.compile()`, `torch_tensorrt.compile()`, and `torch.compile()` with `backend="tensorrt"`.

```python
import torch
import torch_tensorrt

trt_gm = torch_tensorrt.dynamo.compile(
    exported_program,
    arg_inputs=inputs,
    use_explicit_typing=True,  # respects dtypes set in model/inputs
    min_block_size=3,
    optimization_level=4,
)
```

All parameters have sensible defaults. Change only what you need.


## Core Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| `enabled_precisions` (DEPRECATED) | `{dtype.f32}` | Set of precisions the TensorRT builder may use: any combination of `torch.float32`, `torch.float16`, `torch.bfloat16`, `torch.int8`, `torch.float8_e4m3fn`. Adding a lower precision does not force its use; TRT selects the best kernel per layer. For INT8 calibration or FP8, use ModelOpt quantization first. |
| `min_block_size` | `5` | Minimum number of consecutive TRT-capable operators required to form a TRT engine block. Subgraphs smaller than this are merged back into PyTorch. Lower values increase TRT coverage but may add engine-launch overhead for tiny blocks. Use dryrun mode to find the sweet spot. |
| `torch_executed_ops` | `set()` | Force specific `torch.ops` targets to run in PyTorch regardless of converter support. Example: `{torch.ops.aten.add.Tensor}` keeps all adds in PyTorch. |
| `require_full_compilation` | `False` | Raise an error if any node cannot be placed in TensorRT. Useful for CI correctness gates on models that are known to be fully TRT-compatible. |
| `device` | current CUDA device | `torch_tensorrt.Device` specifying the GPU to compile for. |
| `use_python_runtime` | `False` (auto) | `False` uses the C++ runtime (recommended: serializable, CUDAGraphs, multi-device safe). `True` forces the Python runtime (simpler to instrument for debugging but not serializable to `ExportedProgram`). `None` selects C++ if available. |
| `pass_through_build_failures` | `False` | When `True`, TRT engine build errors raise exceptions rather than falling back to PyTorch. Useful during development to surface build failures in subgraphs that are not yet fully convertible. |
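As a sketch of how the core partitioning knobs combine (assuming `exported_program` and `inputs` as in the example above; a CUDA GPU with TensorRT is required to actually run this):

```python
import torch
import torch_tensorrt

# Keep every aten.add in PyTorch and require at least 3 consecutive
# TRT-capable ops before a subgraph is compiled into a TRT engine.
trt_gm = torch_tensorrt.dynamo.compile(
    exported_program,
    arg_inputs=inputs,
    min_block_size=3,
    torch_executed_ops={torch.ops.aten.add.Tensor},
    device=torch_tensorrt.Device("cuda:0"),
)
```

Lowering `min_block_size` below the default of 5 trades more TRT coverage for more engine-launch overhead.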


## Optimization Tuning

| Parameter | Default | Description |
| --- | --- | --- |
| `optimization_level` | `None` (TRT default = 3) | Integer 0–5. Higher levels let TRT spend more time searching for faster kernels at the cost of longer compile time: 0 = fastest build, 5 = best runtime performance. TRT’s built-in default (3) is a good balance for most workloads. |
| `workspace_size` | `0` (TRT default) | Maximum GPU memory (bytes) TRT may allocate as scratch space during engine build. `0` means TRT picks a size automatically. Increase if you see workspace-related build warnings on large models. |
| `num_avg_timing_iters` | `1` | Number of iterations used to time and select kernels during the build phase. Higher values reduce timing noise and can improve kernel selection on NUMA or shared-GPU environments, at the cost of longer compile time. |
| `max_aux_streams` | `None` (TRT default) | Maximum number of auxiliary CUDA streams TRT may use per engine for concurrent layer execution. `None` lets TRT decide. Set to `0` to disable auxiliary streams and run all layers on the main stream. |
| `tiling_optimization_level` | `"none"` | Controls how aggressively TRT searches for tiling strategies. Options: `"none"` (no tiling search), `"fast"` (quick scan), `"moderate"`, `"full"` (exhaustive search: best performance but slowest compile). Tiling can substantially improve throughput for convolution-heavy models. |
| `l2_limit_for_tiling` | `-1` (no limit) | Target L2 cache usage limit in bytes for tiling optimization. Use when you want tiling kernels to fit within a specific L2 budget (e.g., on multi-tenant GPUs). `-1` disables the constraint. |
| `sparse_weights` | `False` | Allow TRT to use sparse-weight kernels for qualified layers. Requires 2:4 structured sparsity in the model weights. Can provide significant throughput improvements on Ampere+ GPUs with sparse weights. |
| `disable_tf32` | `False` | Disable TensorFloat-32 (TF32) accumulation. TF32 is enabled by default on Ampere and newer GPUs and provides FP32 range at near-FP16 speed for matmul/conv. Disable only when you need strict IEEE FP32 semantics. |
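For example, a build tuned for maximum runtime performance might combine these settings as follows (illustrative values; same assumed `exported_program` and `inputs` as above):

```python
import torch_tensorrt

# Spend much longer at build time in exchange for faster kernels.
trt_gm = torch_tensorrt.dynamo.compile(
    exported_program,
    arg_inputs=inputs,
    optimization_level=5,               # exhaustive kernel search
    tiling_optimization_level="full",   # exhaustive tiling search
    num_avg_timing_iters=4,             # average 4 timing runs per kernel
    workspace_size=8 * 2**30,           # allow up to 8 GiB of scratch space
)
```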


## Precision and Typing

| Parameter | Default | Description |
| --- | --- | --- |
| `truncate_double` | `False` | Automatically truncate float64 (double) inputs and weights to float32 before passing to TRT. TRT does not natively support float64; enable this to compile models that use double without changing the model code. |
| `use_fp32_acc` | `False` | Insert FP32 cast nodes around matmul layers so that accumulation happens in FP32 even when the network runs in FP16. Improves numerical accuracy for transformer models at a small throughput cost. Requires `use_explicit_typing=True`. |
| `use_explicit_typing` | `True` | Respect the dtypes set in the PyTorch model (strong typing). When `True`, TRT will not silently up/downcast layers. This is the recommended approach for controlling precision: set dtypes in your model/inputs directly. Required by autocast and `use_fp32_acc`. |
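With strong typing, precision is controlled by the model itself rather than by `enabled_precisions`. A minimal sketch (assuming `model` and `inputs` are defined and a CUDA GPU is available):

```python
import torch
import torch_tensorrt

# Cast the model and inputs to FP16; strong typing makes TRT honor
# these dtypes instead of choosing precisions per layer.
model = model.half().cuda()
fp16_inputs = [x.half().cuda() for x in inputs]
exported = torch.export.export(model, tuple(fp16_inputs))

trt_gm = torch_tensorrt.dynamo.compile(
    exported,
    arg_inputs=fp16_inputs,
    use_explicit_typing=True,  # no silent up/downcasting
    use_fp32_acc=True,         # accumulate matmuls in FP32 for accuracy
)
```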


## Autocast (Automatic Mixed Precision)

Autocast is an alternative to manually setting layer precisions in a model. It analyses the graph and selectively lowers eligible operations to a reduced precision, skipping nodes that are numerically sensitive. Enable it with `enable_autocast=True`.

> **Note:** When `enable_autocast=True`, `use_explicit_typing` is automatically set to `True` as well.

```python
trt_gm = torch_tensorrt.dynamo.compile(
    exported_program,
    arg_inputs=inputs,
    enable_autocast=True,
    autocast_low_precision_type=torch.float16,
    autocast_excluded_ops={torch.ops.aten.softmax.int},
    autocast_max_output_threshold=1024.0,
)
```

| Parameter | Default | Description |
| --- | --- | --- |
| `enable_autocast` | `False` | Enable graph-aware automatic mixed precision. Analyses op outputs and reduction depths to assign each node to FP32 or low precision. |
| `autocast_low_precision_type` | `None` | The reduced precision to cast down to. Supported: `torch.float16`, `torch.bfloat16`. `None` disables actual casting (a no-op even if `enable_autocast=True`). |
| `autocast_excluded_nodes` | `set()` | Set of regex patterns matched against node names. Nodes whose names match any pattern are kept in FP32. Example: `{".*layer_norm.*"}` keeps all LayerNorm ops in FP32. |
| `autocast_excluded_ops` | `set()` | Set of `torch.ops` targets always kept in FP32 regardless of other autocast decisions. Example: `{torch.ops.aten.softmax.int}`. |
| `autocast_max_output_threshold` | `512.0` | Nodes whose outputs exceed this absolute value are kept in FP32. Guards against overflow in activations with large dynamic range (e.g., unnormalized logits). |
| `autocast_max_depth_of_reduction` | `None` (∞) | Maximum reduction depth allowed in low precision. Reduction depth measures how many reduction operations (sum, mean, etc.) feed into a node. Nodes with higher depth are kept in FP32 to prevent error accumulation. `None` means no limit. |
| `autocast_calibration_dataloader` | `None` | A `torch.utils.data.DataLoader` used to calibrate output statistics (e.g., absolute max values) for each node. When provided, the autocast pass runs the model on calibration data to make per-node precision decisions based on observed ranges rather than static thresholds. |
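A calibration loader is an ordinary `torch.utils.data.DataLoader`. The sketch below builds one from random data (illustrative shapes); it would be passed as `autocast_calibration_dataloader` in the compile call above:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# The autocast pass runs the model over these batches and records
# per-node output statistics (e.g. absolute max) to decide which
# nodes are safe to lower to FP16/BF16.
calib_inputs = torch.randn(64, 3, 224, 224)  # illustrative calibration set
calib_loader = DataLoader(TensorDataset(calib_inputs), batch_size=8)
```

In practice, a small representative sample of real inference inputs gives more faithful ranges than random data.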


## Weight Management

These settings control how TRT engines store and manage weights. They primarily affect serialized engine size and whether the engine can be refitted without recompilation.

| Parameter | Default | Description |
| --- | --- | --- |
| `immutable_weights` | `True` | Build non-refittable engines. When `True`, TRT can apply more aggressive weight-fusing optimizations, but the engine cannot be updated via `torch_tensorrt.dynamo.refit_module_weights()`. Set to `False` if you plan to do weight updates (e.g., LoRA, fine-tuning). |
| `refit_identical_engine_weights` | `False` | When multiple subgraphs share identical weight tensors (e.g., weight-tied language models), refit all engines with the same weights in a single pass. Requires `immutable_weights=False`. |
| `strip_engine_weights` | `False` | Serialize engines without weight data. Produces smaller engine files; weights must be refitted before the engine can run. Useful for distributing architecture-only engine blueprints. On TRT ≥ 10.14 this is handled automatically via TRT’s `INCLUDE_REFIT` serialization flag. |
| `enable_weight_streaming` | `False` | Enable TRT weight streaming for engines whose weights exceed GPU memory. Weights are streamed from host memory during inference. Requires TRT support and is typically used for very large models. |
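A refit workflow, sketched under the assumption that `model`, `exported_program`, and `inputs` exist and the weights change after compilation (e.g. a LoRA update):

```python
import torch
import torch_tensorrt
from torch_tensorrt.dynamo import refit_module_weights

# Build a refittable engine (gives up some weight-fusing optimizations).
trt_gm = torch_tensorrt.dynamo.compile(
    exported_program,
    arg_inputs=inputs,
    immutable_weights=False,  # required for any later refit
)

# ... update the weights of the original model in place ...

# Swap the new weights into the compiled module without rebuilding.
new_exported = torch.export.export(model, tuple(inputs))
trt_gm = refit_module_weights(trt_gm, new_exported)
```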


## Hardware Compatibility

| Parameter | Default | Description |
| --- | --- | --- |
| `version_compatible` | `False` | Build engines that can run on a newer TRT version than the one used to compile them (forward ABI compatibility). Disables some runtime optimizations; use when you need to ship engine files that may be loaded by a future TRT version. |
| `hardware_compatible` | `False` | Build engines compatible with GPU architectures other than the compilation GPU. Currently supports NVIDIA Ampere and newer. Useful for compiling on one GPU SKU and deploying on another. |
| `engine_capability` | `STANDARD` | Restrict kernel selection to safety-certified GPU kernels (`SAFETY`) or DLA-compatible kernels (`DLA_STANDALONE`). Use `STANDARD` for normal inference workloads. |
| `enable_cross_compile_for_windows` | `False` | Build engines on Linux that are deployable on Windows (x86-64). Disables the Python runtime, lazy engine init, and engine caching. See `torch_tensorrt.dynamo.cross_compile_for_windows()`. |
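For instance, to build an engine on one machine that should survive both a GPU swap and a TRT upgrade (a sketch; both flags give up some optimizations):

```python
import torch_tensorrt

trt_gm = torch_tensorrt.dynamo.compile(
    exported_program,
    arg_inputs=inputs,
    hardware_compatible=True,   # portable across Ampere-or-newer GPUs
    version_compatible=True,    # loadable by newer TRT versions
)
```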


## Memory and Resource Management

| Parameter | Default | Description |
| --- | --- | --- |
| `enable_resource_partitioning` | `False` | Split oversized TRT subgraphs so each piece fits within `cpu_memory_budget` during engine build. Prevents OOM errors on large models at the cost of slightly more engines. See Partitioning Phase for details. |
| `cpu_memory_budget` | `None` | Byte budget per TRT engine during build, used by resource partitioning. `None` uses available system memory (reported by `psutil`). |
| `offload_module_to_cpu` | `False` | Move the model weights to CPU RAM before compilation to reduce GPU memory pressure during the build phase. Weights are loaded back to the GPU at runtime. |
| `dynamically_allocate_resources` | `False` | Let TRT dynamically allocate memory for intermediate tensors at runtime rather than pre-allocating. Can reduce peak memory footprint at a small runtime cost. |
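A memory-constrained build might combine these settings as follows (the 16 GiB budget is illustrative):

```python
import torch_tensorrt

trt_gm = torch_tensorrt.dynamo.compile(
    exported_program,
    arg_inputs=inputs,
    offload_module_to_cpu=True,          # free GPU memory during the build
    enable_resource_partitioning=True,   # split subgraphs to fit the budget
    cpu_memory_budget=16 * 2**30,        # 16 GiB per-engine build budget
)
```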


## Graph Partitioning

| Parameter | Default | Description |
| --- | --- | --- |
| `use_fast_partitioner` | `True` | Use the adjacency-based `fast_partition` partitioner. Set to `False` to use the globally optimal `global_partition`, which minimizes engine count but is slower to compile. See Partitioning Phase. |
| `assume_dynamic_shape_support` | `False` | Skip per-converter dynamic-shape checks and treat all converters as dynamic-capable. Use when you have already validated all your ops with dynamic shapes and want to skip the safety gate. |
| `enable_experimental_decompositions` | `False` | Enable the full set of core ATen decompositions instead of the curated subset. May expose more ops to TRT conversion at the cost of potential numerical differences. |
| `use_distributed_mode_trace` | `False` | Use `aot_autograd` for tracing instead of the default path. Required when the model contains `DTensor` or other distributed tensors. |
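As a sketch, switching to the global partitioner while asserting dynamic-shape support everywhere:

```python
import torch_tensorrt

trt_gm = torch_tensorrt.dynamo.compile(
    exported_program,
    arg_inputs=inputs,
    use_fast_partitioner=False,         # globally optimal, slower to compile
    assume_dynamic_shape_support=True,  # skip per-converter shape checks
)
```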


## Compilation Workflow

| Parameter | Default | Description |
| --- | --- | --- |
| `dryrun` | `False` | Run the full partitioning pipeline without building any TRT engines. `True` prints a detailed partition report; a string path (e.g. `"/tmp/report.txt"`) also saves the report to a file. See Dryrun Mode. |
| `lazy_engine_init` | `False` | Defer TRT engine deserialization until all engines have been built. Works around resource constraints and builder overhead during compilation, but engines may be less well tuned to the resources actually available at deployment. |
| `debug` | `False` | Enable verbose TRT builder logs at DEBUG level. |
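A typical iteration loop uses dryrun to inspect partitioning before paying for a full build (sketch; the report path is illustrative):

```python
import torch_tensorrt

# No engines are built; the partition report is printed and saved.
torch_tensorrt.dynamo.compile(
    exported_program,
    arg_inputs=inputs,
    dryrun="/tmp/partition_report.txt",
    min_block_size=3,  # preview the effect of a lower block size
)
```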


## Engine Caching

| Parameter | Default | Description |
| --- | --- | --- |
| `cache_built_engines` | `False` | Persist compiled TRT engines to disk after building. Combine with `reuse_cached_engines` for faster subsequent compilations of the same graph. |
| `reuse_cached_engines` | `False` | Load TRT engines from the disk cache on a cache hit, skipping the build step. The cache key includes graph structure, input specs, and all engine-invariant settings (see Engine Caching). |
| `timing_cache_path` | `/tmp/torch_tensorrt_engine_cache/timing_cache.bin` | Path for TRT’s timing cache file. The timing cache records kernel timing data across sessions, speeding up subsequent engine builds for similar subgraphs even when the engine cache itself is cold. |
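A caching setup might look like this (a sketch; cached engines are refitted with the current weights on a hit, so the engines must be built refittable):

```python
import torch_tensorrt

trt_gm = torch_tensorrt.dynamo.compile(
    exported_program,
    arg_inputs=inputs,
    cache_built_engines=True,    # save built engines to the disk cache
    reuse_cached_engines=True,   # skip the build on a cache hit
    immutable_weights=False,     # cached engines are refitted on reuse
)
```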


## DLA Parameters

Deep Learning Accelerator (DLA) settings are only relevant when compiling for Jetson or DRIVE platforms. Set `engine_capability=EngineCapability.DLA_STANDALONE` to target DLA.

| Parameter | Default | Description |
| --- | --- | --- |
| `dla_sram_size` | 1 MB | Fast software-managed SRAM used by DLA for intra-layer communication (bytes). |
| `dla_local_dram_size` | 1 GB | Host DRAM used by DLA for intermediate tensor storage across layers (bytes). |
| `dla_global_dram_size` | 512 MB | Host DRAM used by DLA for weights and metadata (bytes). |
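A DLA build for a Jetson-class device might be configured as follows (a sketch; the sizes simply restate the defaults above):

```python
import torch_tensorrt
from torch_tensorrt import EngineCapability

trt_gm = torch_tensorrt.dynamo.compile(
    exported_program,
    arg_inputs=inputs,
    engine_capability=EngineCapability.DLA_STANDALONE,
    dla_sram_size=1 * 2**20,           # 1 MB managed SRAM
    dla_local_dram_size=1 * 2**30,     # 1 GB local DRAM
    dla_global_dram_size=512 * 2**20,  # 512 MB global DRAM
)
```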


## Engine-Invariant Settings

Changing any of the following settings invalidates cached engines; the engine must be rebuilt from scratch:

`enabled_precisions`, `max_aux_streams`, `version_compatible`, `optimization_level`, `disable_tf32`, `sparse_weights`, `engine_capability`, `hardware_compatible`, `refit_identical_engine_weights`, `immutable_weights`, `enable_weight_streaming`, `tiling_optimization_level`, `l2_limit_for_tiling`, `enable_autocast`, `autocast_low_precision_type`, `autocast_excluded_nodes`, `autocast_excluded_ops`, `autocast_max_output_threshold`, `autocast_max_depth_of_reduction`, `autocast_calibration_dataloader`.

Settings not in this list (e.g., `debug`, `dryrun`, `pass_through_build_failures`) can be changed without invalidating the cache.
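As an illustrative helper (hypothetical, not part of the library), you can check whether a settings change forces a rebuild by diffing only the keys listed above:

```python
# Engine-invariant setting names, copied from the list above.
ENGINE_INVARIANT = {
    "enabled_precisions", "max_aux_streams", "version_compatible",
    "optimization_level", "disable_tf32", "sparse_weights",
    "engine_capability", "hardware_compatible",
    "refit_identical_engine_weights", "immutable_weights",
    "enable_weight_streaming", "tiling_optimization_level",
    "l2_limit_for_tiling", "enable_autocast",
    "autocast_low_precision_type", "autocast_excluded_nodes",
    "autocast_excluded_ops", "autocast_max_output_threshold",
    "autocast_max_depth_of_reduction", "autocast_calibration_dataloader",
}

def invalidates_cache(old: dict, new: dict) -> bool:
    """True if moving from `old` to `new` settings forces an engine rebuild."""
    touched = ENGINE_INVARIANT & (old.keys() | new.keys())
    return any(old.get(k) != new.get(k) for k in touched)
```

For example, changing `optimization_level` invalidates the cache, while toggling `debug` does not.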