Python Runtime#
Torch-TensorRT provides two runtime backends for executing compiled TRT engines inside a PyTorch graph:
C++ runtime (default) —
TorchTensorRTModulebacked by a C++ TorchBind class. Fully serializable, supports CUDAGraphs, multi-device safe.Python runtime —
PythonTorchTensorRTModulebacked entirely by the TRT Python API. Simpler to instrument for debugging but not serializable toExportedProgram.
When to Use the Python Runtime#
Use use_python_runtime=True when:
You need to run on a machine where the C++ Torch-TensorRT library is not installed (e.g., a minimal CI container with only the Python wheel).
You want to attach Python-level callbacks to the engine execution (via Observer / Callback System) for debugging or profiling without building the C++ extension.
You are debugging a conversion issue and want to step through TRT execution in Python.
Use the default C++ runtime in all other cases, especially:
When saving a compiled module to disk (
torch_tensorrt.save()).When using CUDAGraphs for low-latency inference.
In production deployments.
Enabling the Python Runtime#
import torch_tensorrt
trt_gm = torch_tensorrt.dynamo.compile(
exported_program,
arg_inputs=inputs,
use_python_runtime=True,
)
Or via torch.compile:
trt_model = torch.compile(
model,
backend="tensorrt",
options={"use_python_runtime": True},
)
Limitations#
Not serializable:
PythonTorchTensorRTModulecannot be saved viatorch_tensorrt.save()as anExportedProgramor loaded back. The module is Python-only in-process.# This will raise an error with use_python_runtime=True: torch_tensorrt.save(trt_gm, "model.ep", arg_inputs=inputs)
No C++ deployment: The compiled module cannot be exported to AOTInductor or used in a C++ application without re-compiling with the C++ runtime.
CUDAGraphs: Whole-graph CUDAGraphs work with the Python runtime, but the per-submodule CUDAGraph recording in
CudaGraphsTorchTensorRTModuleis only available with the C++ runtime.
PythonTorchTensorRTModule Direct Instantiation#
You can instantiate PythonTorchTensorRTModule directly from raw engine bytes,
for example when integrating a TRT engine built outside of Torch-TensorRT:
from torch_tensorrt.dynamo.runtime import PythonTorchTensorRTModule
from torch_tensorrt.dynamo._settings import CompilationSettings
# Load raw engine bytes (e.g., from trtexec output or torch_tensorrt.dynamo.convert_*)
with open("model.engine", "rb") as f:
engine_bytes = f.read()
module = PythonTorchTensorRTModule(
serialized_engine=engine_bytes,
input_binding_names=["x"],
output_binding_names=["output"],
name="my_engine",
settings=CompilationSettings(),
)
output = module(torch.randn(1, 3, 224, 224).cuda())
Constructor arguments:
serialized_engine(bytes)The raw serialized TRT engine bytes.
input_binding_names(List[str])TRT input binding names in the order they are passed to
forward().output_binding_names(List[str])TRT output binding names in the order they should be returned.
name(str, optional)Human-readable name for the module (used in logging).
settings(CompilationSettings, optional)The compilation settings used to build the engine. Used to determine device placement and other runtime behaviors.
weight_name_map(dict, optional)Mapping of TRT weight names to PyTorch state dict names. Required for refit support via
torch_tensorrt.dynamo.refit_module_weights().requires_output_allocator(bool, defaultFalse)Set to
Trueif the engine contains data-dependent-shape ops (nonzero,unique, etc.) that require TRT’s output allocator.
Runtime Selection Logic#
When use_python_runtime is None (auto-select), Torch-TensorRT tries to import
the C++ TorchBind class. If the C++ extension is not available it silently falls back to
the Python runtime. Pass True or False to force a specific runtime.