AOTInductor Deployment#
AOTInductor (Ahead-of-Time Inductor) compiles a PyTorch model into a self-contained
.pt2 package at build time. That package can be loaded and executed in Python or C++
without a Torch-TensorRT dependency at runtime.
Torch-TensorRT integrates with AOTInductor: TRT-convertible subgraphs become TRT engines
embedded in the package; the remaining ops (PyTorch fallback subgraphs) are compiled by
AOTInductor into native CUDA kernels. The result is a single .pt2 file that runs
end-to-end without Python.
When to use AOTInductor#
- Deploying to C++ servers without a Python environment.
- Shipping a single self-contained artifact that bundles both TRT engines and PyTorch ops.
- When you want inference-time independence from Torch-TensorRT.
Note
AOTInductor packaging is currently Linux-only.
Compile and Save#
The workflow is identical to the standard ir="dynamo" path, with two extra arguments
to torch_tensorrt.save:
- `output_format="aot_inductor"` — selects the `.pt2` packager.
- `retrace=True` — re-exports the compiled graph through `torch.export` before passing it to `torch._inductor.aoti_compile_and_package`. Required when the compiled module contains TRT engine subgraphs.
```python
import torch
import torch_tensorrt

model = MyModel().eval().cuda()
example_inputs = (torch.randn(8, 10, device="cuda"),)

# Step 1 — export with optional dynamic shapes
batch_dim = torch.export.Dim("batch", min=1, max=1024)
exported = torch.export.export(
    model, example_inputs, dynamic_shapes={"x": {0: batch_dim}}
)

# Step 2 — compile with Torch-TensorRT
trt_gm = torch_tensorrt.dynamo.compile(
    exported,
    inputs=[
        torch_tensorrt.Input(
            min_shape=(1, 10),
            opt_shape=(8, 10),
            max_shape=(1024, 10),
            dtype=torch.float32,
        )
    ],
    use_explicit_typing=True,
    min_block_size=1,
)

# Step 3 — package into a .pt2 file
torch_tensorrt.save(
    trt_gm,
    "model.pt2",
    output_format="aot_inductor",
    retrace=True,
    arg_inputs=example_inputs,
)
```
The .pt2 file embeds both the TRT engine(s) and AOTInductor-compiled kernels for any
ops that fell back to PyTorch.
Python Inference#
Load the package with torch._inductor.aoti_load_package. No Torch-TensorRT import
is needed at inference time:
```python
import torch

model = torch._inductor.aoti_load_package("model.pt2")

# Works with any batch size within the compiled range
output = model(torch.randn(4, 10, device="cuda"))
output = model(torch.randn(16, 10, device="cuda"))
```
C++ Inference#
The same .pt2 package runs in C++ via AOTIModelPackageLoader, with no Python
or Torch-TensorRT dependency:
```cpp
#include "torch/torch.h"
#include "torch/csrc/inductor/aoti_package/model_package_loader.h"

int main() {
  c10::InferenceMode mode;

  torch::inductor::AOTIModelPackageLoader loader("model.pt2");

  // Batch size 8
  std::vector<torch::Tensor> inputs = {torch::randn({8, 10}, at::kCUDA)};
  auto outputs = loader.run(inputs);

  // Dynamic batch — different size works within the compiled min/max range
  outputs = loader.run({torch::randn({1, 10}, at::kCUDA)});
  return 0;
}
```
Note
At runtime, libtorchtrt_runtime.so is not needed. Ensure your link flags
exclude it (or use --as-needed) to avoid a spurious dependency.
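A minimal CMake sketch for building the loader example against libtorch alone (the project and source-file names here are illustrative, not part of the official example):

```cmake
cmake_minimum_required(VERSION 3.18)
project(aoti_inference)

# Locate a CUDA-enabled libtorch; no Torch-TensorRT package is required.
find_package(Torch REQUIRED)

add_executable(inference inference.cpp)
# TORCH_LIBRARIES pulls in libtorch, its CPU/CUDA backends, and c10;
# libtorchtrt_runtime.so is deliberately not linked.
target_link_libraries(inference "${TORCH_LIBRARIES}")
set_property(TARGET inference PROPERTY CXX_STANDARD 17)
```

Configure with something like `cmake -DCMAKE_PREFIX_PATH=/path/to/libtorch ..` so `find_package(Torch)` can locate the libtorch installation.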
Comparison: PT2 vs ExportedProgram#
| Feature | ExportedProgram | PT2 (AOTInductor) |
|---|---|---|
| Python load | Supported | Supported |
| C++ load | Not supported | Supported |
| Torch-TensorRT at runtime | Required | Not required |
| Non-TRT ops | Run via PyTorch eager | Compiled by AOTInductor (native CUDA) |
| Platform | Linux + Windows | Linux only |
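The runtime difference between the two formats can be sketched in code. This is a hypothetical comparison, assuming both artifacts were saved beforehand (the `.ep` filename is illustrative) and a CUDA build of PyTorch is available:

```python
import torch

# ExportedProgram artifact: Python-only. Importing torch_tensorrt is
# required so the embedded TRT engine ops are registered before loading.
import torch_tensorrt

ep = torch.export.load("model.ep")  # hypothetical filename
out_ep = ep.module()(torch.randn(8, 10, device="cuda"))

# PT2 artifact: loadable from Python or C++; no torch_tensorrt import needed.
pt2 = torch._inductor.aoti_load_package("model.pt2")
out_pt2 = pt2(torch.randn(8, 10, device="cuda"))
```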
See examples/torchtrt_aoti_example/ for a complete, runnable end-to-end example
(model.py for compilation, inference.py for loading).