# CompilationSettings Reference
`CompilationSettings` is the single dataclass that controls every aspect of Torch-TensorRT
Dynamo compilation. It is passed (directly or via keyword arguments) to
`torch_tensorrt.dynamo.compile()`, `torch_tensorrt.compile()`, and
`torch.compile()` with `backend="tensorrt"`.
```python
import torch
import torch_tensorrt

trt_gm = torch_tensorrt.dynamo.compile(
    exported_program,
    arg_inputs=inputs,
    use_explicit_typing=True,  # respects dtypes set in model/inputs
    min_block_size=3,
    optimization_level=4,
)
```
All parameters have sensible defaults. Change only what you need.
## Core Parameters
| Parameter | Default | Description |
|---|---|---|
| `enabled_precisions` | `{torch.float32}` | Set of precisions the TensorRT builder may use. Any combination of `torch.float32`, `torch.float16`, `torch.bfloat16`, `torch.int8`, and `torch.float8_e4m3fn`. |
| `min_block_size` | `5` | Minimum number of consecutive TRT-capable operators required to form a TRT engine block. Subgraphs smaller than this are merged back into PyTorch. Lower values increase TRT coverage but may add engine-launch overhead for tiny blocks. Use `dryrun` mode to find the sweet spot. |
| `torch_executed_ops` | `set()` | Force specific operators to run in PyTorch rather than TensorRT, regardless of converter support. |
| `require_full_compilation` | `False` | Raise an error if any node cannot be placed in TensorRT. Useful for CI correctness gates on models that are known to be fully TRT-compatible. |
| `device` | current CUDA device | Target device for engine compilation and execution. |
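As a sketch of how the fallback controls above combine (the settings dict, the choice of layer norm as an excluded op, and the helper name are illustrative, not part of the API):

```python
import torch

# Hypothetical settings dict; these keys mirror the core parameters above and
# would be splatted into torch_tensorrt.dynamo.compile(...).
core_settings = dict(
    min_block_size=3,  # accept smaller TRT blocks
    # Force layer norm (an illustrative choice) to stay in PyTorch:
    torch_executed_ops={torch.ops.aten.native_layer_norm.default},
    require_full_compilation=False,  # allow PyTorch fallback for unsupported ops
)

def compile_with(exported_program, inputs, settings=core_settings):
    import torch_tensorrt  # requires a CUDA-enabled torch_tensorrt install
    return torch_tensorrt.dynamo.compile(exported_program, arg_inputs=inputs, **settings)
```

Sweeping `min_block_size` with `dryrun=True` (see Compilation Workflow) is the usual way to pick a value before a full build.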
## Optimization Tuning
| Parameter | Default | Description |
|---|---|---|
| `optimization_level` | `None` | Integer 0–5. Higher levels let TRT spend more time searching for faster kernels at the cost of longer compile time. 0 = fastest build, 5 = best runtime performance. TRT's built-in default (3) is a good balance for most workloads. |
| `workspace_size` | `0` | Maximum GPU memory (bytes) TRT may allocate as scratch space during engine build. `0` lets TRT use as much device memory as it needs. |
| `num_avg_timing_iters` | `1` | Number of iterations used to time and select kernels during the build phase. Higher values reduce timing noise and can improve kernel selection on NUMA or shared-GPU environments, at the cost of longer compile time. |
| `max_aux_streams` | `None` | Maximum number of auxiliary CUDA streams TRT may use per engine for concurrent layer execution. |
| `tiling_optimization_level` | `"none"` | Controls how aggressively TRT searches for tiling strategies. Options: `"none"`, `"fast"`, `"moderate"`, `"full"`. |
| `l2_limit_for_tiling` | `-1` | Target L2 cache usage limit in bytes for tiling optimization. Use when you want tiling kernels to fit within a specific L2 budget (e.g., on multi-tenant GPUs). `-1` means no limit. |
| `sparse_weights` | `False` | Allow TRT to use sparse-weight kernels for qualified layers. Requires 2:4 structured sparsity in the model weights. Can provide significant throughput improvements on Ampere+ GPUs with sparse weights. |
| `disable_tf32` | `False` | Disable TensorFloat-32 (TF32) accumulation. TF32 is enabled by default on Ampere and newer GPUs and provides FP32 dynamic range at near-FP16 speed for matmul/conv. Disable only when you need strict IEEE FP32 semantics. |
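The build-time/runtime trade-off described above can be captured as two presets (the preset names and concrete values are illustrative; the keys are the tuning parameters from the table, passed as keyword arguments to `torch_tensorrt.dynamo.compile()`):

```python
# Fast iteration during development: minimal kernel search.
fast_build = dict(optimization_level=0, num_avg_timing_iters=1)

# Production build: spend compile time to squeeze out runtime performance.
best_runtime = dict(
    optimization_level=5,     # exhaustive kernel search
    num_avg_timing_iters=4,   # average more timing runs for stable selection
    workspace_size=2 << 30,   # allow up to 2 GiB of builder scratch space
)
```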
## Precision and Typing
| Parameter | Default | Description |
|---|---|---|
| `truncate_double` | `False` | Automatically truncate `float64` weights to `float32` before conversion, since TRT does not support double precision. |
| `use_fp32_acc` | `False` | Insert FP32 cast nodes around matmul layers so that accumulation happens in FP32 even when the network runs in FP16. Improves numerical accuracy for transformer models at a small throughput cost. Requires `use_explicit_typing=True`. |
| `use_explicit_typing` | `False` | Respect the dtypes set in the PyTorch model (strong typing). When `True`, layer precisions follow the model's own dtypes and `enabled_precisions` is ignored. |
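A minimal sketch of the strong-typing workflow: cast the model to FP16 yourself, then let `use_explicit_typing` drive layer precisions with FP32 accumulation around matmuls. The helper name and settings dict are illustrative; only the two keyword arguments come from the table above.

```python
import torch

precision_settings = dict(
    use_explicit_typing=True,  # honor the dtypes already present in the model
    use_fp32_acc=True,         # requires use_explicit_typing=True
)

def compile_fp16(model, example_inputs):
    import torch_tensorrt  # requires a CUDA-enabled torch_tensorrt install
    model = model.half().eval()  # dtypes set here are what TRT will respect
    exported = torch.export.export(model, tuple(example_inputs))
    return torch_tensorrt.dynamo.compile(
        exported, arg_inputs=example_inputs, **precision_settings
    )
```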
## Autocast (Automatic Mixed Precision)
Autocast is an alternative to manually setting layer precisions in a model. It analyses the graph and
selectively lowers eligible operations to a reduced precision, skipping nodes that are
numerically sensitive. Enable it with `enable_autocast=True`.
> **Note:** When `enable_autocast=True`, `use_explicit_typing` is automatically set to
> `True` as well.
```python
trt_gm = torch_tensorrt.dynamo.compile(
    exported_program,
    arg_inputs=inputs,
    enable_autocast=True,
    autocast_low_precision_type=torch.float16,
    autocast_excluded_ops={torch.ops.aten.softmax.int},
    autocast_max_output_threshold=1024.0,
)
```
| Parameter | Default | Description |
|---|---|---|
| `enable_autocast` | `False` | Enable graph-aware automatic mixed precision. Analyses op outputs and reduction depths to assign each node to FP32 or low precision. |
| `autocast_low_precision_type` | `None` | The reduced precision to cast down to. Supported: `torch.float16` and `torch.bfloat16`. |
| `autocast_excluded_nodes` | `set()` | Set of regex patterns matched against node names. Nodes whose names match any pattern are kept in FP32. |
| `autocast_excluded_ops` | `set()` | Set of ATen operator targets (e.g., `torch.ops.aten.softmax.int`) that are always kept in FP32. |
| `autocast_max_output_threshold` | `512` | Nodes whose outputs exceed this absolute value are kept in FP32. Guards against overflow in activations with large dynamic range (e.g., unnormalized logits). |
| `autocast_max_depth_of_reduction` | `None` | Maximum reduction depth allowed in low precision. Reduction depth measures how many reduction operations (sum, mean, etc.) feed into a node. Nodes with higher depth are kept in FP32 to prevent error accumulation. |
| `autocast_calibration_dataloader` | `None` | A `torch.utils.data.DataLoader` of representative inputs used to calibrate autocast decisions. |
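The calibration dataloader can be any standard `DataLoader` over representative inputs. A sketch, assuming an image-classification model (the tensor shape, batch size, and choice of `bfloat16` are illustrative):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical calibration set: 32 representative inputs, batched.
calib_inputs = torch.randn(32, 3, 224, 224)
calib_loader = DataLoader(TensorDataset(calib_inputs), batch_size=8)

# Keyword arguments mirroring the autocast parameters above; passed to
# torch_tensorrt.dynamo.compile(exported_program, arg_inputs=inputs, **autocast_settings).
autocast_settings = dict(
    enable_autocast=True,
    autocast_low_precision_type=torch.bfloat16,  # bf16 instead of fp16
    autocast_calibration_dataloader=calib_loader,
)
```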
## Weight Management
These settings control how TRT engines store and manage weights. They primarily affect serialized engine size and whether the engine can be refitted without recompilation.
| Parameter | Default | Description |
|---|---|---|
| `immutable_weights` | `True` | Build non-refittable engines. When `False`, engines are built refittable, so their weights can be updated in place without recompilation. |
| `refit_identical_engine_weights` | `False` | When multiple subgraphs share identical weight tensors (e.g., weight-tied language models), refit all engines with the same weights in a single pass. Requires `immutable_weights=False`. |
| `strip_engine_weights` | `False` | Serialize engines without weight data. Produces smaller engine files; weights must be refitted before the engine can run. Useful for distributing architecture-only engine blueprints. On TRT ≥ 10.14 this is handled automatically by the builder. |
| `enable_weight_streaming` | `False` | Enable TRT weight streaming for engines whose weights exceed GPU memory. Weights are streamed from host memory during inference. Requires TRT support and is typically used for very large models. |
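A sketch of a refit round trip, assuming `torch_tensorrt.dynamo.refit_module_weights` is available in your version (the helper name is illustrative and the signature may differ between releases):

```python
# Engines must be built refittable for any weight update to work.
refittable_settings = dict(
    immutable_weights=False,
)

def refit(compiled_module, new_exported_program, inputs):
    import torch_tensorrt  # requires a CUDA-enabled torch_tensorrt install
    # Swap updated weights into the already-built engines without recompiling.
    return torch_tensorrt.dynamo.refit_module_weights(
        compiled_module, new_exported_program, arg_inputs=inputs
    )
```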
## Hardware Compatibility
| Parameter | Default | Description |
|---|---|---|
| `version_compatible` | `False` | Build engines that can run on a newer TRT version than the one used to compile them (forward ABI compatibility). Disables some runtime optimizations; use when you need to ship engine files that may be loaded by a future TRT version. |
| `hardware_compatible` | `False` | Build engines compatible with GPU architectures other than the compilation GPU. Currently supports NVIDIA Ampere and newer. Useful for compiling on one GPU SKU and deploying on another Ampere+ SKU. |
| `engine_capability` | `EngineCapability.STANDARD` | Restrict kernel selection to safety-certified GPU kernels (`EngineCapability.SAFETY`) or DLA-compatible kernels (`EngineCapability.DLA_STANDALONE`). |
| `enable_cross_compile_for_windows` | `False` | Build engines on Linux that are deployable on Windows (x86-64). Disables Python runtime, lazy engine init, and engine caching. |
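The two compatibility flags compose; a portable-deployment preset might look like the following sketch (the dict name is illustrative, the keys are the parameters above):

```python
# Trade some runtime optimization for the ability to move engine files
# across TRT versions and across Ampere+ GPU SKUs.
portable_settings = dict(
    version_compatible=True,   # loadable by a newer TRT runtime
    hardware_compatible=True,  # runnable on other Ampere+ GPUs
)
```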
## Memory and Resource Management
| Parameter | Default | Description |
|---|---|---|
|  |  | Split oversized TRT subgraphs so each piece fits within the configured per-engine memory budget. |
|  |  | Byte budget per TRT engine during build, used by resource partitioning. |
| `offload_module_to_cpu` | `False` | Move the model weights to CPU RAM before compilation to reduce GPU memory pressure during the build phase. Weights are loaded back to GPU at runtime. |
|  |  | Let TRT dynamically allocate memory for intermediate tensors at runtime rather than pre-allocating. Can reduce peak memory footprint at a small runtime cost. |
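A minimal low-GPU-memory build sketch combining the offload flag above with a capped builder workspace (the preset name and the 1 GiB cap are illustrative choices, not defaults):

```python
low_memory_build = dict(
    offload_module_to_cpu=True,  # keep weights in host RAM during the build
    workspace_size=1 << 30,      # cap builder scratch space at 1 GiB
)
```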
## Graph Partitioning
| Parameter | Default | Description |
|---|---|---|
| `use_fast_partitioner` | `True` | Use the adjacency-based (fast) partitioning scheme instead of the global partitioner. The global partitioner can produce better partitions at the cost of longer compile time. |
| `assume_dynamic_shape_support` | `False` | Skip per-converter dynamic shape checks; treat all converters as dynamic-capable. Use when you have already validated all your ops with dynamic shapes and want to skip the safety gate. |
| `enable_experimental_decompositions` | `False` | Enable the full set of core ATen decompositions instead of the curated subset. May expose more ops to TRT conversion at the cost of potential numerical differences. |
## Compilation Workflow
| Parameter | Default | Description |
|---|---|---|
| `dryrun` | `False` | Run the full partitioning pipeline without building any TRT engines, and log a summary of which subgraphs would be converted to TRT and which would stay in PyTorch. |
| `lazy_engine_init` | `False` | Defer TRT engine deserialization until all engines have been built. Works around resource constraints and builder overhead, but engines may be less well tuned to the resources available at deployment. |
| `debug` | `False` | Enable verbose TRT builder logs at the `DEBUG` log level. |
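`dryrun` pairs naturally with a `min_block_size` sweep: probe partitioning cheaply before committing to a long build. A sketch (the helper name and the sweep values are illustrative):

```python
def partition_report(exported_program, inputs, block_sizes=(1, 3, 5)):
    import torch_tensorrt  # requires a CUDA-enabled torch_tensorrt install
    for size in block_sizes:
        # dryrun=True runs partitioning only and logs the would-be TRT blocks,
        # so each iteration is cheap compared to a real engine build.
        torch_tensorrt.dynamo.compile(
            exported_program, arg_inputs=inputs, dryrun=True, min_block_size=size
        )
```

Pick the smallest `min_block_size` whose report shows no tiny, overhead-dominated TRT blocks.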
## Engine Caching
| Parameter | Default | Description |
|---|---|---|
| `cache_built_engines` | `False` | Persist compiled TRT engines to disk after building. Combine with `reuse_cached_engines=True` to skip rebuilds across sessions. |
| `reuse_cached_engines` | `False` | Load TRT engines from the disk cache on cache hit, skipping the build step. The cache key includes graph structure, input specs, and all engine-invariant settings (see Engine Caching). |
| `timing_cache_path` |  | Path for TRT's timing cache file. The timing cache records kernel timing data across sessions, speeding up subsequent engine builds for similar subgraphs even when the engine cache itself is cold. |
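A warm-cache configuration sketch. The two flags come from the table above; `engine_cache_dir` is assumed to be supported alongside them in your version (check your release notes), and the path is illustrative:

```python
cache_settings = dict(
    cache_built_engines=True,   # write engines to disk after each build
    reuse_cached_engines=True,  # read them back on the next compile
    engine_cache_dir="/tmp/trt_engine_cache",  # assumed parameter; illustrative path
)
```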
## DLA Parameters

Deep Learning Accelerator (DLA) settings are only relevant when compiling for Jetson or
DRIVE platforms. Set `engine_capability=EngineCapability.DLA_STANDALONE` to target DLA.
| Parameter | Default | Description |
|---|---|---|
| `dla_sram_size` | `1048576` | Fast software-managed SRAM used by DLA for intra-layer communication (bytes). |
| `dla_local_dram_size` | `1073741824` | Host DRAM used by DLA for intermediate tensor storage across layers (bytes). |
| `dla_global_dram_size` | `536870912` | Host DRAM used by DLA for weights and metadata (bytes). |
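Since all three sizes are raw byte counts, power-of-two shifts keep them readable. A sketch (the dict name is illustrative; the values shown are plausible defaults, so verify against your installed version):

```python
dla_settings = dict(
    dla_sram_size=1 << 20,        # 1 MiB of managed SRAM
    dla_local_dram_size=1 << 30,  # 1 GiB of local DRAM
    dla_global_dram_size=1 << 29, # 512 MiB of global DRAM
)
```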
## Engine-Invariant Settings
Changing any of the following settings invalidates cached engines; the engine must be rebuilt from scratch:
`enabled_precisions`, `max_aux_streams`, `version_compatible`,
`optimization_level`, `disable_tf32`, `sparse_weights`,
`engine_capability`, `hardware_compatible`, `refit_identical_engine_weights`,
`immutable_weights`, `enable_weight_streaming`, `tiling_optimization_level`,
`l2_limit_for_tiling`, `enable_autocast`, `autocast_low_precision_type`,
`autocast_excluded_nodes`, `autocast_excluded_ops`,
`autocast_max_output_threshold`, `autocast_max_depth_of_reduction`,
`autocast_calibration_dataloader`.

Settings not in this list (e.g., `debug`, `dryrun`, `pass_through_build_failures`)
can be changed without invalidating the cache.