.. _serialized_engine:

Extracting a Raw TensorRT Engine
================================

:func:`~torch_tensorrt.dynamo.convert_exported_program_to_serialized_trt_engine`
compiles an ``ExportedProgram`` directly to raw TensorRT engine bytes, bypassing
the PyTorch wrapper. The output is a ``bytes`` object that can be:

* Saved to a ``.engine`` file and loaded by ``trtexec`` or any TRT-native runtime.
* Embedded in a C++ application via ``nvinfer1::IRuntime::deserializeCudaEngine``.
* Deployed without any Python or PyTorch dependency at inference time.

Use this API when you need a **self-contained TRT engine** rather than a compiled
``torch.fx.GraphModule``. For normal PyTorch-integrated inference, prefer
:func:`~torch_tensorrt.dynamo.compile`.

.. note::

   This API compiles the **entire** exported program as a single TRT engine. It
   does not perform graph partitioning: if any operator in the graph is
   unsupported, the conversion will fail. Use ``require_full_compilation=True``
   with :func:`~torch_tensorrt.dynamo.compile` first to verify full coverage.

----

Basic Usage
-----------

.. code-block:: python

    import torch
    import torch_tensorrt

    model = MyModel().eval().cuda()
    inputs = [torch.randn(1, 3, 224, 224).cuda()]

    exported = torch.export.export(model, tuple(inputs))

    engine_bytes: bytes = torch_tensorrt.dynamo.convert_exported_program_to_serialized_trt_engine(
        exported,
        arg_inputs=inputs,
        enabled_precisions={torch.float16},
    )

    # Save to disk
    with open("model.engine", "wb") as f:
        f.write(engine_bytes)

----

Dynamic Shapes
--------------

Pass ``torch_tensorrt.Input`` objects to specify min/opt/max shape ranges:

.. code-block:: python

    from torch_tensorrt import Input

    engine_bytes = torch_tensorrt.dynamo.convert_exported_program_to_serialized_trt_engine(
        exported,
        arg_inputs=[
            Input(
                min_shape=(1, 3, 224, 224),
                opt_shape=(4, 3, 224, 224),
                max_shape=(8, 3, 224, 224),
                dtype=torch.float32,
            )
        ],
        enabled_precisions={torch.float16},
    )

----

Loading the Engine
------------------

The returned bytes can be loaded back by any TRT-compatible runtime:

.. code-block:: python

    import tensorrt as trt

    runtime = trt.Runtime(trt.Logger(trt.Logger.WARNING))
    engine = runtime.deserialize_cuda_engine(engine_bytes)

Or via Torch-TensorRT's own deserializer:

.. code-block:: python

    from torch_tensorrt.dynamo._refit import get_engine_from_encoded_engine
    import base64

    # The bytes can be base64-encoded for storage:
    encoded = base64.b64encode(engine_bytes).decode()
    engine = get_engine_from_encoded_engine(encoded)

----

Compared to ``compile()``
-------------------------

.. list-table::
   :widths: 30 35 35
   :header-rows: 1

   * -
     - ``compile()``
     - ``convert_exported_program_to_serialized_trt_engine()``
   * - Output
     - ``torch.fx.GraphModule`` (PyTorch-callable)
     - ``bytes`` (raw TRT engine)
   * - Partial TRT coverage
     - Yes; unsupported ops fall back to PyTorch
     - No; full TRT required
   * - Serialization
     - ``torch_tensorrt.save()`` → ``.ep`` file
     - ``open(..., "wb").write(bytes)`` → ``.engine`` file
   * - PyTorch at runtime
     - Required
     - Not required
   * - Multiple inputs/outputs
     - Full support
     - Full support
   * - Graph partitioning
     - Yes
     - No (single engine)

----

Key Parameters
--------------

All :ref:`CompilationSettings ` parameters are accepted. The most relevant for
this API:

``inputs`` / ``arg_inputs``
    Input shape specifications. Accepts ``torch.Tensor`` (static shape inferred),
    ``torch_tensorrt.Input`` (explicit static or dynamic ranges), or a mix.

``enabled_precisions``
    Set of ``torch.dtype`` values TRT may use. Add ``torch.float16`` or
    ``torch.bfloat16`` for reduced-precision inference.

``immutable_weights``
    Default ``True``. Set to ``False`` to produce a refittable engine whose
    weights can be updated without recompilation.

``optimization_level``
    Integer 0–5 controlling compile time vs. runtime performance trade-off.

``hardware_compatible``
    ``True`` to build an engine deployable on different Ampere+ GPU SKUs.

``version_compatible``
    ``True`` to build an engine forward-compatible with newer TRT releases.

**Returns**

``bytes``: the serialized TRT engine. Save or pass directly to
``IRuntime::deserializeCudaEngine``.
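
Because the return value is plain ``bytes``, persisting it needs nothing beyond the
Python standard library. The following is a minimal sketch, not part of the
Torch-TensorRT API: ``save_engine``, ``load_engine_bytes``, ``to_text`` and
``from_text`` are hypothetical helper names introduced here for illustration. It
writes the bytes to disk with a SHA-256 checksum and shows the base64 round trip
mentioned above.

.. code-block:: python

    import base64
    import hashlib
    from pathlib import Path


    def save_engine(engine_bytes: bytes, path: str) -> str:
        """Write engine bytes to ``path`` and return their SHA-256 hex digest."""
        Path(path).write_bytes(engine_bytes)
        return hashlib.sha256(engine_bytes).hexdigest()


    def load_engine_bytes(path: str, expected_sha256: str) -> bytes:
        """Read engine bytes back, raising if the file content has changed."""
        data = Path(path).read_bytes()
        if hashlib.sha256(data).hexdigest() != expected_sha256:
            raise ValueError(f"checksum mismatch for {path}")
        return data


    def to_text(engine_bytes: bytes) -> str:
        """Base64-encode engine bytes for storage in text-only media (JSON, DB)."""
        return base64.b64encode(engine_bytes).decode("ascii")


    def from_text(encoded: str) -> bytes:
        """Invert ``to_text`` and recover the original engine bytes."""
        return base64.b64decode(encoded)

The bytes returned by ``load_engine_bytes`` or ``from_text`` can then be handed to
``runtime.deserialize_cuda_engine`` exactly as in `Loading the Engine`_. The
checksum is an optional safeguard, useful because a truncated or corrupted
``.engine`` file will otherwise only surface as a deserialization failure.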