Export and Serialization#
Note
This page documents the design for serialization of Torch-TensorRT compiled programs. Original design discussion: RFC #2176.
Goal#
Allow a compiled Torch-TensorRT program to be saved to disk and loaded back without recompilation. The loaded program must be executable on any compatible device without loading model weights separately.
Serialization Formats#
Three formats are supported:
torch.export / ExportedProgram (.ep)#
The default format for the torch.export (AOT) workflow. The compiled
torch.fx.GraphModule is wrapped in a
torch.export.ExportedProgram container.
How TRT engines are stored:
Each compiled TRT subgraph uses the torch.ops.tensorrt.execute_engine custom op as a call_function node in the FX graph. This is serializable by the standard torch.export serialization stack.
TRT engine bytes are serialized as tensor attributes in the ExportedProgram package. Input/output structures are captured as PyTrees.
# Save
torch_tensorrt.save(trt_gm, "model.ep", arg_inputs=inputs)
# Load (no recompilation)
trt_gm = torch_tensorrt.load("model.ep")
output = trt_gm(*inputs)
Note
The C++ runtime is required for ExportedProgram serialization. The Python
runtime does not support this format.
AOTInductor (.so)#
TRT engines can be embedded into an AOTInductor-generated shared library alongside
TorchInductor-compiled Triton kernels. The result is deployable without Python —
only libtorchtrt_runtime.so is needed at runtime:
torch._export.aot_compile(
    trt_gm,
    args=inputs,
    options={"aot_inductor.output_path": "model.so"},
)
# Runtime (no Python, no PyTorch)
# load with libtorchtrt_runtime.so only
Stand-Alone TRT Engine (.trt / .engine)#
Individual TRT engines can be extracted and serialized without any PyTorch wrapper:
trt_engine_bytes = torch_tensorrt.convert_exported_program_to_trt_engine(
    exp_program, inputs=inputs
)
with open("engine.trt", "wb") as f:
    f.write(trt_engine_bytes)
These can be run with trtexec or any other TensorRT-compatible runtime.
Internal Design#
The key constraint is that call_module nodes (submodule calls) are not
serializable by the standard torch.export serializer. Torch-TensorRT solves
this by embedding TRT engines as call_function nodes using the custom
torch.ops.tensorrt.execute_engine operator, which is serializable:
# call_function node — serializable
%execute_engine = call_function[
    target=torch.ops.tensorrt.execute_engine
](args=([%arg7_1], <TRT engine bytes>), kwargs={})
Engine bytes are stored as opaque constant attributes attached to the graph and
packed into the ExportedProgram zip archive alongside the model weights.
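Because the saved program is a zip archive, its contents can be listed with Python's standard zipfile module. The sketch below is only an inspection aid: the helper name list_archive_entries is hypothetical, and the entry names inside a real model.ep depend on the PyTorch version and are not specified on this page.

```python
import zipfile


def list_archive_entries(path):
    """Return (name, uncompressed_size) pairs for every entry in a zip archive."""
    with zipfile.ZipFile(path) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]


# Example (assumes a file saved earlier with torch_tensorrt.save):
#   for name, size in list_archive_entries("model.ep"):
#       print(f"{size:>12}  {name}")
```

Large entries in the listing are a reasonable first place to look for weights and engine payloads, though which entry holds which is an implementation detail.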
Custom serializers (TorchTRTExportedProgramSerializer,
TorchTRTSerializer) handle the execute_engine node type during
torch.export serialization. Corresponding deserializers reconstruct the engine
from bytes and restore the call_function node.
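The round trip described above can be illustrated with a toy model. This is not the code of TorchTRTExportedProgramSerializer or TorchTRTSerializer; the EngineNode class, the JSON layout, and the base64 encoding are all assumptions chosen to show the idea that a node carrying raw engine bytes is turned into a serializable record and then rebuilt from it.

```python
import base64
import json
from dataclasses import dataclass


@dataclass
class EngineNode:
    """Toy stand-in for a call_function node that carries engine bytes."""
    target: str
    engine_bytes: bytes


def serialize_node(node):
    # Raw bytes are not JSON-safe, so encode them as base64 text.
    payload = {
        "op": "call_function",
        "target": node.target,
        "engine": base64.b64encode(node.engine_bytes).decode("ascii"),
    }
    return json.dumps(payload)


def deserialize_node(text):
    payload = json.loads(text)
    if payload["op"] != "call_function":
        raise ValueError(f"unexpected op: {payload['op']}")
    # Decode the base64 text back into the original engine bytes.
    return EngineNode(
        target=payload["target"],
        engine_bytes=base64.b64decode(payload["engine"]),
    )


# Placeholder bytes; a real TRT engine would be far larger.
original = EngineNode("tensorrt.execute_engine", b"\x00fake-engine\xff")
restored = deserialize_node(serialize_node(original))
assert restored == original  # the round trip preserves target and bytes
```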
Versioning#
Serialized programs include the Torch-TensorRT version, TensorRT version, and target device SM capability. A compatibility check at load time warns if the serialized engine was built for a different device or library version.
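A load-time check of this kind might look like the following sketch. The BuildInfo record, its field names, and check_compatibility are hypothetical and only mirror the three pieces of metadata listed above; how Torch-TensorRT actually stores and compares them is not shown on this page.

```python
import warnings
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildInfo:
    torch_tensorrt_version: str
    tensorrt_version: str
    sm_capability: tuple  # (major, minor), e.g. (8, 6)


def check_compatibility(saved, current):
    """Warn for each differing field and return the mismatched field names."""
    fields = ("torch_tensorrt_version", "tensorrt_version", "sm_capability")
    mismatches = [
        name for name in fields
        if getattr(saved, name) != getattr(current, name)
    ]
    for name in mismatches:
        warnings.warn(
            f"{name}: saved {getattr(saved, name)!r}, "
            f"current {getattr(current, name)!r}"
        )
    return mismatches
```

Returning the mismatched names in addition to warning lets a caller decide whether a given difference should be treated as fatal.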