Runtime Phase#
The runtime phase wraps the compiled TensorRT engines together with any remaining PyTorch subgraphs into a single callable module and provides the execution infrastructure for inference.
Dynamo Runtime (Primary Path)#
Two runtime backends are available. The backend is selected via the
use_python_runtime compilation setting.
C++ Runtime (default)#
The C++ runtime is more performant, fully serializable, and supports advanced features like CUDAGraphs and multi-device safety.
TensorRT engines are stored as torch.classes.tensorrt.Engine — a C++ TorchBind
class that holds the serialized engine bytes plus metadata:
Engine name
Refit map (PyTorch parameter name → TensorRT layer index)
Function signature (input/output names, dtypes, shapes)
Runtime requirements (e.g. whether an output allocator is needed for data-dependent shape (DDS) ops)
Target TensorRT version and hardware compatibility flags
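The metadata above can be pictured as a simple record. The sketch below is illustrative only: the field names are hypothetical, and the real object is a C++ TorchBind class, not a Python dataclass.

```python
from dataclasses import dataclass, field

# Illustrative stand-in for the metadata carried by torch.classes.tensorrt.Engine.
# Field names are hypothetical; the real class is implemented in C++.
@dataclass
class EngineRecord:
    name: str
    serialized_engine: bytes
    refit_map: dict = field(default_factory=dict)   # PyTorch param name -> TRT layer index
    input_names: list = field(default_factory=list)
    output_names: list = field(default_factory=list)
    requires_output_allocator: bool = False          # needed for data-dependent-shape ops
    target_trt_version: str = ""
    hardware_compatible: bool = False

rec = EngineRecord(name="resnet_block_0", serialized_engine=b"\x00engine-bytes")
```

Keeping the refit map and function signature alongside the engine bytes is what makes later weight refits and shape checks possible without reparsing the original model.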
Inference is triggered via the torch.ops.tensorrt.execute_engine custom op:
tensorrt::execute_engine(
    Tensor[] input_tensors,
    __torch__.torch.classes.tensorrt.Engine engine
) -> Tensor[]
This op pops inputs and the engine off the PyTorch dispatcher stack, runs the tensors
through TensorRT, and pushes output tensors back. The compiled torch.fx.Graph
stores engine objects as attributes, making the whole module portable.
Python Runtime#
The Python runtime uses TensorRT’s Python API directly for inference. It is useful when
a C++ build is not available (e.g. in some CI environments) and is simpler to instrument
for debugging. It does not support serialization to ExportedProgram; the compiled
graph is Python-only.
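Choosing between the two backends happens at compile time via the use_python_runtime setting. A usage sketch (requires an installed torch_tensorrt and a CUDA-capable GPU; MyModel is a placeholder for any nn.Module):

```python
import torch
import torch_tensorrt

model = MyModel().eval().cuda()             # MyModel is hypothetical
inputs = [torch.randn(1, 3, 224, 224).cuda()]

# use_python_runtime=False (default): C++ runtime, serializable, CUDAGraphs support
# use_python_runtime=True:            Python runtime, easier to debug, not serializable
trt_gm = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=inputs,
    use_python_runtime=True,
)
```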
Serialization Options#
ExportedProgram (torch.export)#
The default serialization path for the Dynamo AOT workflow. The compiled
torch.fx.GraphModule is wrapped in a
torch.export.ExportedProgram
container. TensorRT engines are stored as tensor attributes in the package; PyTrees
capture input/output structure. This format requires the C++ runtime; the saved program can still be loaded and executed from Python.
torch_tensorrt.save(trt_gm, "model.ep", arg_inputs=inputs)
# later:
trt_gm = torch_tensorrt.load("model.ep")
AOTInductor (torch._export.aot_compile)#
TensorRT engines can be baked into AOTInductor-generated shared objects alongside
TorchInductor-compiled kernels. Any PyTorch-backed subgraphs become Inductor-generated
Triton kernels. The result is deployable without Python — only
libtorchtrt_runtime.so is needed at runtime.
See examples/torchtrt_aoti_example for a full example.
Stand-alone TensorRT Engines#
Individual TensorRT engines can also be extracted and run standalone with trtexec
or any other TensorRT-compatible runtime, entirely outside of PyTorch.
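For example, an engine serialized to a file can be benchmarked directly with trtexec, TensorRT's bundled command-line tool (model.engine is a hypothetical path to an extracted engine):

```shell
# Load and run a pre-built TensorRT engine outside of PyTorch.
trtexec --loadEngine=model.engine
```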
MutableTorchTensorRTModule#
MutableTorchTensorRTModule is a higher-level wrapper for use cases that require
weight mutability (e.g. LoRA adapters on diffusion models).
It maintains two graphs in parallel: the original PyTorch
nn.Module
and the compiled TRT graph. User interactions — including weight assignments via
standard PyTorch APIs or HuggingFace diffusers — hit the PyTorch graph as normal. The
module intercepts these and:
Weight mutations (same graph structure, different weights) → triggers a fast refit using the refit map constructed during conversion. No recompilation needed.
Structural mutations (e.g. a new LoRA adapter changes the graph topology) → triggers a full recompilation, using the engine cache to skip unchanged subgraphs.
This gives the ergonomics of a regular nn.Module with TensorRT performance, and is
compatible with HuggingFace diffusers LoRA workflows without any code changes.
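The refit-vs-recompile dispatch can be sketched in plain Python. Everything below is illustrative, not the actual implementation: the real module compares traced graphs, while this sketch reduces the decision to hashes and a flag.

```python
from enum import Enum, auto
from typing import Optional

class Action(Enum):
    REFIT = auto()       # fast path: reuse the engine, swap weights via the refit map
    RECOMPILE = auto()   # slow path: topology changed, rebuild (engine cache helps)

def classify_mutation(old_graph_hash: str, new_graph_hash: str,
                      weights_changed: bool) -> Optional[Action]:
    """Illustrative decision rule applied after a user mutates the wrapped module."""
    if old_graph_hash != new_graph_hash:
        return Action.RECOMPILE   # structural mutation, e.g. a new LoRA adapter
    if weights_changed:
        return Action.REFIT       # same topology, new weights
    return None                   # nothing to do
```

The key point is that the common LoRA case (same topology, new weights) never pays the cost of a full recompilation.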
TorchScript Runtime (Legacy ts Path)#
Note
The following describes the legacy TorchScript runtime. For new development use the Dynamo path above.
The TorchScript runtime is based around a PyTorch JIT stack machine. All operators pop
arguments off the stack, execute, and push results back. Stack elements are
torch::jit::IValue objects.
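The stack discipline can be modeled with a tiny pure-Python sketch (stack elements here are plain Python objects standing in for torch::jit::IValue):

```python
# Minimal model of the JIT stack machine convention: an operator pops its
# arguments off a shared stack, computes, and pushes its results back.
def add_op(stack: list) -> None:
    b = stack.pop()
    a = stack.pop()
    stack.append(a + b)

stack = [2, 3]   # arguments pushed by the caller
add_op(stack)    # operator consumes them and pushes the result
```

trt::execute_engine follows the same convention, popping the input tensor list and the engine object and pushing the output tensor list.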
When Torch-TensorRT is loaded it registers the
trt::execute_engine(Tensor[] inputs, Engine engine) -> Tensor[] operator in the
JIT operator library. Compiled TorchScript graphs store the engine as an attribute so
it is portable and serializable. A typical compiled graph looks like:
graph(%self_1 : ..., %input_0 : Tensor):
  %1 : Engine = prim::GetAttr[name="...engine"](%self_1)
  %3 : Tensor[] = prim::ListConstruct(%input_0)
  %4 : Tensor[] = trt::execute_engine(%3, %1)
  %5 : Tensor = prim::ListUnpack(%4)
  return (%5)
Serialization uses TorchBind. When a TorchScript module is saved the pickler serializes the engine bytes into the zip archive; the unpickler reconstructs the engine holder at load time.
ABI Versioning#
Torch-TensorRT TorchScript programs carry an ABI version number that lets the runtime check compatibility at load time. The serialized format is a vector of strings encoding:
ABI version
Engine name
Device information (SM capability, device type)
Serialized TensorRT engine bytes
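The string-vector layout can be sketched as a pure-Python pack/unpack pair. This is illustrative only (the actual on-disk encoding is internal to Torch-TensorRT, and the version string below is hypothetical); the field order follows the list above.

```python
# Illustrative (de)serialization of the record described above: a vector of
# strings holding [ABI version, engine name, device info, engine bytes].
ABI_TARGET = "6"  # hypothetical ABI version string

def pack(engine_name: str, device_info: str, engine_bytes: bytes) -> list:
    # Engine bytes are hex-encoded here purely so every field is a string.
    return [ABI_TARGET, engine_name, device_info, engine_bytes.hex()]

def unpack(record: list) -> tuple:
    abi, name, device, payload = record
    if abi != ABI_TARGET:
        # An ABI mismatch means the program was built for a different runtime.
        raise RuntimeError(f"ABI mismatch: got {abi}, expected {ABI_TARGET}")
    return name, device, bytes.fromhex(payload)

rec = pack("my_engine", "sm_86,GPU", b"\x01\x02")
```

Checking the ABI version first means an incompatible program fails fast at load time instead of misbehaving during inference.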