Engine Caching#

TRT engine compilation is the most expensive step in the Torch-TensorRT workflow. For repeated compilations of the same model (e.g., after process restart, during hyperparameter search, or in CI), the engine cache eliminates redundant builds by persisting compiled engines to disk and reloading them on a cache hit.


Enabling the Cache#

Pass cache_built_engines=True and reuse_cached_engines=True to torch_tensorrt.dynamo.compile():

import torch
import torch_tensorrt

trt_gm = torch_tensorrt.dynamo.compile(
    exported_program,
    arg_inputs=inputs,
    cache_built_engines=True,
    reuse_cached_engines=True,
)

By default the cache lives at /tmp/torch_tensorrt_engine_cache/ with a 5 GB size limit.

Customize the cache location and size:

from torch_tensorrt.dynamo._engine_cache import DiskEngineCache

my_cache = DiskEngineCache(
    engine_cache_dir="/data/trt_cache",
    engine_cache_size=20 * 1024**3,  # 20 GB
)

trt_gm = torch_tensorrt.dynamo.compile(
    exported_program,
    arg_inputs=inputs,
    cache_built_engines=True,
    reuse_cached_engines=True,
    engine_cache=my_cache,
)

What Gets Cached#

Each TRT subgraph is cached independently under a SHA-256 hash derived from three components:

  1. Graph structure — a canonicalized string of node ops and targets (placeholder names are normalized so renaming inputs does not bust the cache).

  2. Input specs — the min/opt/max shapes and dtypes of each input tensor.

  3. Engine-invariant settings — the subset of torch_tensorrt.dynamo.CompilationSettings that affect the compiled engine (see engine-invariant-settings). Settings like debug or dryrun do not affect the cache key.

Each cache entry stores:

  • The serialized TRT engine bytes.

  • Input and output tensor names.

  • The original input specs (for verification on reload).

  • The weight name map (for refit support).

  • Whether the engine requires an output allocator (data-dependent shape ops).

Cache entries are stored as {cache_dir}/{hash}/blob.bin.


Cache Invalidation#

The cache is automatically invalidated when any engine-invariant setting changes. The following changes always require a cache miss (engine rebuild):

  • enabled_precisions

  • max_aux_streams

  • version_compatible / hardware_compatible

  • optimization_level

  • disable_tf32 / sparse_weights

  • engine_capability

  • immutable_weights / refit_identical_engine_weights / enable_weight_streaming

  • tiling_optimization_level / l2_limit_for_tiling

  • All autocast_* settings

Changes to min_block_size, torch_executed_ops, debug, dryrun, pass_through_build_failures, etc. do not invalidate cached engines.


LRU Eviction#

When a new engine would exceed the configured engine_cache_size, DiskEngineCache evicts the least-recently-used entries (based on file modification time) until enough space is available. An engine larger than the total cache size is silently not cached (a warning is logged).


Timing Cache#

Separate from the engine cache, TRT maintains a timing cache that records kernel benchmark results. This speeds up subsequent engine builds for similar subgraphs even on a cold engine cache, because TRT can skip re-benchmarking known-fast kernels.

The timing cache is always active and persisted at timing_cache_path:

trt_gm = torch_tensorrt.dynamo.compile(
    exported_program,
    arg_inputs=inputs,
    timing_cache_path="/data/trt_cache/timing_cache.bin",
)

The default path is /tmp/torch_tensorrt_engine_cache/timing_cache.bin.


Custom Cache Backends#

To store engines in a location other than the local disk (e.g., a shared object store, a database), implement the BaseEngineCache interface:

from torch_tensorrt.dynamo._engine_cache import BaseEngineCache
from typing import Optional

class S3EngineCache(BaseEngineCache):
    def __init__(self, bucket: str, prefix: str = "trt_engines/"):
        import boto3
        self.s3 = boto3.client("s3")
        self.bucket = bucket
        self.prefix = prefix

    def save(self, hash: str, blob: bytes) -> None:
        key = f"{self.prefix}{hash}/blob.bin"
        self.s3.put_object(Bucket=self.bucket, Key=key, Body=blob)

    def load(self, hash: str) -> Optional[bytes]:
        key = f"{self.prefix}{hash}/blob.bin"
        try:
            resp = self.s3.get_object(Bucket=self.bucket, Key=key)
            return resp["Body"].read()
        except self.s3.exceptions.NoSuchKey:
            return None

trt_gm = torch_tensorrt.dynamo.compile(
    exported_program,
    arg_inputs=inputs,
    cache_built_engines=True,
    reuse_cached_engines=True,
    engine_cache=S3EngineCache("my-model-cache-bucket"),
)

The two methods you must implement:

save(hash: str, blob: bytes) -> None

Persist the packed blob (already serialized by BaseEngineCache.pack()) under the given hash key.

load(hash: str) -> Optional[bytes]

Return the packed blob for the given hash, or None on a cache miss. Returning None causes a normal engine build and subsequent save call.

The base class provides get_hash(), pack()/unpack(), insert(), and check() — do not override these unless you understand the serialization format.


Weightless Engines (TRT ≥ 10.14)#

On TRT 10.14 and later, engines can be serialized without weights using TRT’s INCLUDE_REFIT flag. This significantly reduces cache storage for models where the architecture is shared across many weight variants (e.g., different fine-tuned checkpoints of the same base model):

trt_gm = torch_tensorrt.dynamo.compile(
    exported_program,
    arg_inputs=inputs,
    strip_engine_weights=True,
    cache_built_engines=True,
    reuse_cached_engines=True,
    immutable_weights=False,
)

On a cache hit the weightless engine is loaded and refitted with the current weights before inference. The strip_engine_weights setting is part of the engine-invariant set on TRT < 10.14 (different cache key), but handled automatically by TRT itself on 10.14+.