# Engine Caching
TRT engine compilation is the most expensive step in the Torch-TensorRT workflow. For repeated compilations of the same model (e.g., after process restart, during hyperparameter search, or in CI), the engine cache eliminates redundant builds by persisting compiled engines to disk and reloading them on a cache hit.
## Enabling the Cache

Pass `cache_built_engines=True` and `reuse_cached_engines=True` to
`torch_tensorrt.dynamo.compile()`:

```python
import torch
import torch_tensorrt

trt_gm = torch_tensorrt.dynamo.compile(
    exported_program,
    arg_inputs=inputs,
    cache_built_engines=True,
    reuse_cached_engines=True,
)
```
By default the cache lives at `/tmp/torch_tensorrt_engine_cache/` with a 5 GB size limit.
Customize the cache location and size:

```python
from torch_tensorrt.dynamo._engine_cache import DiskEngineCache

my_cache = DiskEngineCache(
    engine_cache_dir="/data/trt_cache",
    engine_cache_size=20 * 1024**3,  # 20 GB
)

trt_gm = torch_tensorrt.dynamo.compile(
    exported_program,
    arg_inputs=inputs,
    cache_built_engines=True,
    reuse_cached_engines=True,
    engine_cache=my_cache,
)
```
## What Gets Cached

Each TRT subgraph is cached independently under a SHA-256 hash derived from three components:

- **Graph structure**: a canonicalized string of node ops and targets (placeholder names are normalized, so renaming inputs does not bust the cache).
- **Input specs**: the min/opt/max shapes and dtypes of each input tensor.
- **Engine-invariant settings**: the subset of `torch_tensorrt.dynamo.CompilationSettings` that affects the compiled engine (see engine-invariant-settings). Settings like `debug` or `dryrun` do not affect the cache key.
Each cache entry stores:

- The serialized TRT engine bytes.
- Input and output tensor names.
- The original input specs (for verification on reload).
- The weight name map (for refit support).
- Whether the engine requires an output allocator (data-dependent shape ops).

Cache entries are stored as `{cache_dir}/{hash}/blob.bin`.
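Because the on-disk layout is just one directory per hash, the cache is easy to inspect with plain Python. The sketch below uses a hypothetical helper, `list_cache_entries`, which is not part of torch_tensorrt; it only assumes the `{cache_dir}/{hash}/blob.bin` layout described above:

```python
import os


def list_cache_entries(cache_dir: str) -> list[tuple[str, int]]:
    """Return (hash, blob size in bytes) for each entry under cache_dir."""
    entries = []
    for name in sorted(os.listdir(cache_dir)):
        blob = os.path.join(cache_dir, name, "blob.bin")
        # Skip stray files (e.g., a timing cache) that are not entry directories.
        if os.path.isfile(blob):
            entries.append((name, os.path.getsize(blob)))
    return entries
```

Running it against the default directory (`/tmp/torch_tensorrt_engine_cache/`) shows which subgraph hashes are cached and how much space each consumes.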
## Cache Invalidation

The cache is automatically invalidated when any engine-invariant setting changes. Changing any of the following always causes a cache miss (engine rebuild):

- `enabled_precisions`
- `max_aux_streams`
- `version_compatible` / `hardware_compatible`
- `optimization_level`
- `disable_tf32` / `sparse_weights`
- `engine_capability`
- `immutable_weights` / `refit_identical_engine_weights` / `enable_weight_streaming`
- `tiling_optimization_level` / `l2_limit_for_tiling`
- All `autocast_*` settings

Changes to `min_block_size`, `torch_executed_ops`, `debug`, `dryrun`,
`pass_through_build_failures`, etc. do not invalidate cached engines.
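Conceptually, the key is a hash over the graph string, the input specs, and only the engine-invariant subset of the settings. The following minimal sketch illustrates that idea; the setting names in `ENGINE_INVARIANT` and the `cache_key` function are illustrative, and the real key derivation in torch_tensorrt differs in detail:

```python
import hashlib
import json

# Illustrative subset of engine-invariant setting names (not the full real set).
ENGINE_INVARIANT = {"enabled_precisions", "optimization_level", "disable_tf32"}


def cache_key(graph_str: str, input_specs: list, settings: dict) -> str:
    # Only engine-invariant settings participate in the key.
    invariant = {k: v for k, v in settings.items() if k in ENGINE_INVARIANT}
    payload = json.dumps([graph_str, input_specs, invariant], sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


specs = [["fp16", [1, 3, 224, 224]]]
base = {"optimization_level": 3, "debug": False}
k1 = cache_key("conv->relu", specs, base)
k2 = cache_key("conv->relu", specs, {**base, "debug": True})
k3 = cache_key("conv->relu", specs, {**base, "optimization_level": 5})
# Toggling debug leaves the key unchanged; changing optimization_level does not.
```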
## LRU Eviction

When a new engine would exceed the configured `engine_cache_size`,
`DiskEngineCache` evicts the least-recently-used entries (based on file modification
time) until enough space is available. An engine larger than the total cache size is
not cached at all, and a warning is logged.
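The eviction policy amounts to sorting blobs by modification time and deleting from the oldest end until the new engine fits. A simplified sketch of that logic (the `evict_lru` function is hypothetical, not the torch_tensorrt implementation):

```python
import os


def evict_lru(cache_dir: str, new_blob_size: int, size_limit: int) -> None:
    """Delete the oldest blobs (by mtime) until new_blob_size fits under size_limit."""
    blobs = []
    for name in os.listdir(cache_dir):
        path = os.path.join(cache_dir, name, "blob.bin")
        if os.path.isfile(path):
            blobs.append((os.path.getmtime(path), path))
    used = sum(os.path.getsize(p) for _, p in blobs)
    for _, path in sorted(blobs):  # oldest first
        if used + new_blob_size <= size_limit:
            break
        used -= os.path.getsize(path)
        os.remove(path)
```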
## Timing Cache

Separate from the engine cache, TRT maintains a timing cache that records kernel benchmark results. This speeds up subsequent engine builds for similar subgraphs even on a cold engine cache, because TRT can skip re-benchmarking known-fast kernels.

The timing cache is always active and persisted at `timing_cache_path`:

```python
trt_gm = torch_tensorrt.dynamo.compile(
    exported_program,
    arg_inputs=inputs,
    timing_cache_path="/data/trt_cache/timing_cache.bin",
)
```

The default path is `/tmp/torch_tensorrt_engine_cache/timing_cache.bin`.
## Custom Cache Backends

To store engines in a location other than the local disk (e.g., a shared object store
or a database), implement the `BaseEngineCache` interface:

```python
from typing import Optional

from torch_tensorrt.dynamo._engine_cache import BaseEngineCache


class S3EngineCache(BaseEngineCache):
    def __init__(self, bucket: str, prefix: str = "trt_engines/"):
        import boto3

        self.s3 = boto3.client("s3")
        self.bucket = bucket
        self.prefix = prefix

    def save(self, hash: str, blob: bytes) -> None:
        key = f"{self.prefix}{hash}/blob.bin"
        self.s3.put_object(Bucket=self.bucket, Key=key, Body=blob)

    def load(self, hash: str) -> Optional[bytes]:
        key = f"{self.prefix}{hash}/blob.bin"
        try:
            resp = self.s3.get_object(Bucket=self.bucket, Key=key)
            return resp["Body"].read()
        except self.s3.exceptions.NoSuchKey:
            return None


trt_gm = torch_tensorrt.dynamo.compile(
    exported_program,
    arg_inputs=inputs,
    cache_built_engines=True,
    reuse_cached_engines=True,
    engine_cache=S3EngineCache("my-model-cache-bucket"),
)
```
The two methods you must implement:

- `save(hash: str, blob: bytes) -> None`: persist the packed blob (already serialized by `BaseEngineCache.pack()`) under the given hash key.
- `load(hash: str) -> Optional[bytes]`: return the packed blob for the given hash, or `None` on a cache miss. Returning `None` causes a normal engine build and a subsequent `save` call.

The base class provides `get_hash()`, `pack()`/`unpack()`, `insert()`, and
`check()`; do not override these unless you understand the serialization format.
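The simplest possible backend makes the save/load contract concrete. The sketch below is a dict-backed cache; in real use it would inherit from `BaseEngineCache` (so `pack()`/`unpack()` and hashing come from the base class), but it is written as a standalone class here so the example has no torch_tensorrt dependency:

```python
from typing import Dict, Optional


class InMemoryEngineCache:
    """Dict-backed cache with the same save/load contract as BaseEngineCache.

    Illustrative only: subclass BaseEngineCache in real use.
    """

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}

    def save(self, hash: str, blob: bytes) -> None:
        self._store[hash] = blob

    def load(self, hash: str) -> Optional[bytes]:
        # None signals a cache miss, which triggers a normal engine
        # build followed by a save() call for this hash.
        return self._store.get(hash)
```

An in-memory backend like this only helps within a single process; a shared store (such as the S3 example above) is what makes the cache useful across machines.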
## Weightless Engines (TRT ≥ 10.14)

On TRT 10.14 and later, engines can be serialized without weights using TRT's
`INCLUDE_REFIT` flag. This significantly reduces cache storage for models where the
architecture is shared across many weight variants (e.g., different fine-tuned
checkpoints of the same base model):

```python
trt_gm = torch_tensorrt.dynamo.compile(
    exported_program,
    arg_inputs=inputs,
    strip_engine_weights=True,
    cache_built_engines=True,
    reuse_cached_engines=True,
    immutable_weights=False,
)
```

On a cache hit the weightless engine is loaded and refitted with the current weights
before inference. On TRT < 10.14, `strip_engine_weights` is part of the
engine-invariant set, so changing it produces a different cache key; on 10.14+,
weight stripping is handled automatically by TRT itself.