Runtime Phase#

The runtime phase wraps the compiled TensorRT engines together with any remaining PyTorch subgraphs into a single callable module and provides the execution infrastructure for inference.

Dynamo Runtime (Primary Path)#

Two runtime backends are available. The backend is selected via the use_python_runtime compilation setting.

C++ Runtime (default)#

The C++ runtime is more performant, fully serializable, and supports advanced features like CUDAGraphs and multi-device safety.

TensorRT engines are stored as torch.classes.tensorrt.Engine — a C++ TorchBind class that holds the serialized engine bytes plus metadata:

  • Engine name

  • Refit map (PyTorch parameter name → TensorRT layer index)

  • Function signature (input/output names, dtypes, shapes)

  • Runtime requirements (e.g. whether an output allocator is needed for DDS ops)

  • Target TensorRT version and hardware compatibility flags

Inference is triggered via the torch.ops.tensorrt.execute_engine custom op:

tensorrt::execute_engine(
    Tensor[] input_tensors,
    __torch__.torch.classes.tensorrt.Engine engine
) -> Tensor[]

This op pops inputs and the engine off the PyTorch dispatcher stack, runs the tensors through TensorRT, and pushes output tensors back. The compiled torch.fx.Graph stores engine objects as attributes, making the whole module portable.

Python Runtime#

The Python runtime uses TensorRT’s Python API directly for inference. It is useful when a C++ build is not available (e.g. in some CI environments) and is simpler to instrument for debugging. It does not support serialization to ExportedProgram; the compiled graph is Python-only.

Serialization Options#

ExportedProgram (torch.export)#

The default serialization path for the Dynamo AOT workflow. The compiled torch.fx.GraphModule is wrapped in a torch.export.ExportedProgram container. TensorRT engines are stored as tensor attributes in the package; PyTrees capture input/output structure. Requires the C++ runtime and supports Python execution.

torch_tensorrt.save(trt_gm, "model.ep", arg_inputs=inputs)
# later:
trt_gm = torch_tensorrt.load("model.ep")

AOTInductor (torch._export.aot_compile)#

TensorRT engines can be baked into AOTInductor-generated shared objects alongside TorchInductor-compiled kernels. Any PyTorch-backed subgraphs become Inductor-generated Triton kernels. The result is deployable without Python — only libtorchtrt_runtime.so is needed at runtime.

See examples/torchtrt_aoti_example for a full example.

Stand-alone TensorRT Engines#

Individual TensorRT engines can also be extracted and run standalone with trtexec or any other TensorRT-compatible runtime, entirely outside of PyTorch.

MutableTorchTensorRTModule#

MutableTorchTensorRTModule is a higher-level wrapper for use cases that require weight mutability (e.g. LoRA adapters on diffusion models).

It maintains two graphs in parallel: the original PyTorch nn.Module and the compiled TRT graph. User interactions — including weight assignments via standard PyTorch APIs or HuggingFace diffusers — hit the PyTorch graph as normal. The module intercepts these and:

  • Weight mutations (same graph structure, different weights) → triggers a fast refit using the refit map constructed during conversion. No recompilation needed.

  • Structural mutations (e.g. a new LoRA adapter changes the graph topology) → triggers a full recompilation, using the engine cache to skip unchanged subgraphs.

This gives the ergonomics of a regular nn.Module with TensorRT performance, and is compatible with HuggingFace diffusers LoRA workflows without any code changes.


TorchScript Runtime (Legacy ts Path)#

Note

The following describes the legacy TorchScript runtime. For new development use the Dynamo path above.

The TorchScript runtime is based around a PyTorch JIT stack machine. All operators pop arguments off the stack, execute, and push results back. Stack elements are torch::jit::IValue objects.

When Torch-TensorRT is loaded it registers the trt::execute_engine(Tensor[] inputs, Engine engine) -> Tensor[] operator in the JIT operator library. Compiled TorchScript graphs store the engine as an attribute so it is portable and serializable. A typical compiled graph looks like:

graph(%self_1 : ..., %input_0 : Tensor):
    %1 : Engine = prim::GetAttr[name="...engine"](%self_1)
    %3 : Tensor[] = prim::ListConstruct(%input_0)
    %4 : Tensor[] = trt::execute_engine(%3, %1)
    %5 : Tensor = prim::ListUnpack(%4)
    return (%5)

Serialization uses TorchBind. When a TorchScript module is saved the pickler serializes the engine bytes into the zip archive; the unpickler reconstructs the engine holder at load time.

ABI Versioning#

Torch-TensorRT TorchScript programs are versioned with an ABI version number that tells the runtime about compatibility. The serialized format is a vector of strings encoding:

  • ABI version

  • Engine name

  • Device information (SM capability, device type)

  • Serialized TensorRT engine bytes