Cross-Compiling for Windows

torch_tensorrt.dynamo.cross_compile_for_windows() compiles TRT engines on a Linux x86-64 host and produces a torch.fx.GraphModule that can be saved as an ExportedProgram and then loaded and executed on Windows x86-64, with no Linux machine needed at inference time.

This is the standard path for teams that build models on Linux (where TRT tooling is more mature) and deploy on Windows (game engines, desktop applications, enterprise software).


Requirements

  • Build machine: Linux x86-64 with CUDA and TensorRT installed.

  • Target machine: Windows x86-64 with a compatible NVIDIA GPU (same or newer CUDA compute capability).

  • enable_cross_compile_for_windows=True is automatically set by this API; do not set it manually on compile().

The following features are disabled during cross-compilation (they are not available in the Windows TRT runtime or require OS-specific binaries):

  • Python runtime (use_python_runtime is forced to False)

  • Lazy engine initialization (lazy_engine_init is forced to False)

  • Engine caching (cache_built_engines / reuse_cached_engines disabled)
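The forced settings above can be pictured as a normalization step applied to whatever the caller passes in. The sketch below is purely illustrative: apply_cross_compile_overrides is a hypothetical helper, not part of Torch-TensorRT, and it only mirrors the flag names listed in this section using a plain dict.

```python
# Illustrative only: mimics the overrides described above with a plain dict.
# apply_cross_compile_overrides is a hypothetical name, not a Torch-TensorRT API.

FORCED_SETTINGS = {
    "enable_cross_compile_for_windows": True,
    "use_python_runtime": False,
    "lazy_engine_init": False,
    "cache_built_engines": False,
    "reuse_cached_engines": False,
}


def apply_cross_compile_overrides(user_settings):
    """Return (effective settings, names of user values that were overridden)."""
    effective = dict(user_settings)
    overridden = []
    for name, forced in FORCED_SETTINGS.items():
        # Record only the settings the caller explicitly set to a different value.
        if name in user_settings and user_settings[name] != forced:
            overridden.append(name)
        effective[name] = forced
    return effective, sorted(overridden)


settings, changed = apply_cross_compile_overrides(
    {"use_python_runtime": True, "optimization_level": 3}
)
print(changed)
```

Settings that are not in the forced list, such as optimization_level here, pass through unchanged.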


Workflow

Step 1 — Export on the Linux build machine

import torch
import torch_tensorrt

# MyModel is a placeholder for your own torch.nn.Module
model = MyModel().eval().cuda()
inputs = [torch.randn(1, 3, 224, 224).cuda()]

# Export to ExportedProgram
exp_program = torch.export.export(model, tuple(inputs))

Step 2 — Cross-compile for Windows

trt_gm = torch_tensorrt.dynamo.cross_compile_for_windows(
    exp_program,
    arg_inputs=inputs,
    use_explicit_typing=True,  # enabled_precisions deprecated; cast model/inputs to target dtype
)

Step 3 — Save the compiled module

torch_tensorrt.save(trt_gm, "model_windows.ep", arg_inputs=inputs)

Step 4 — Load and run on Windows

Copy model_windows.ep to the Windows machine. Ensure the Torch-TensorRT runtime library (torchtrt_runtime.dll on Windows, the counterpart of libtorchtrt_runtime.so on Linux) is on the library path.

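# On Windows:
import torch
import torch_tensorrt

# Inputs must match the shape and dtype used at compile time
inputs = [torch.randn(1, 3, 224, 224).cuda()]
trt_gm = torch_tensorrt.load("model_windows.ep").module()
output = trt_gm(*inputs)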

Dynamic Shapes

Dynamic shapes work the same as in normal compile(). Describe each dynamic input with an Input that specifies min_shape, opt_shape, and max_shape:

from torch_tensorrt import Input

trt_gm = torch_tensorrt.dynamo.cross_compile_for_windows(
    exp_program,
    arg_inputs=[
        Input(
            min_shape=(1,  3, 224, 224),
            opt_shape=(4,  3, 224, 224),
            max_shape=(16, 3, 224, 224),
        )
    ],
)
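To make the meaning of the min/max bounds concrete, the following sketch checks whether a runtime input shape falls inside the range declared above. shape_in_range is a hypothetical helper written for this illustration; it is not a Torch-TensorRT function and does not reflect how the runtime validates shapes internally.

```python
# Hypothetical helper: checks a concrete shape against a min/max shape profile.

def shape_in_range(shape, min_shape, max_shape):
    """True if shape has the same rank and every dimension lies in [min, max]."""
    if not (len(shape) == len(min_shape) == len(max_shape)):
        return False
    return all(lo <= d <= hi for d, lo, hi in zip(shape, min_shape, max_shape))


# Bounds copied from the Input(...) example above
MIN_SHAPE = (1, 3, 224, 224)
MAX_SHAPE = (16, 3, 224, 224)

inside = shape_in_range((8, 3, 224, 224), MIN_SHAPE, MAX_SHAPE)    # batch 8 is within 1..16
outside = shape_in_range((32, 3, 224, 224), MIN_SHAPE, MAX_SHAPE)  # batch 32 exceeds max_shape
print(inside, outside)
```

A check like this can be run on the Windows side before calling the module, so that out-of-range batches are rejected with a clear message.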

Engine Compatibility

The produced engines run on GPUs whose CUDA compute capability is the same as, or newer than, that of the GPU used during compilation. Set hardware_compatible=True if the Windows deployment GPU may have a different architecture within the Ampere-and-newer generations:

trt_gm = torch_tensorrt.dynamo.cross_compile_for_windows(
    exp_program,
    arg_inputs=inputs,
    hardware_compatible=True,  # engine runs on Ampere and newer
)
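Since hardware_compatible=True covers Ampere and newer GPUs, it can help to confirm that a deployment GPU meets that bar. The sketch below assumes the compute capability is available as a (major, minor) tuple and relies on Ampere starting at compute capability 8.0; is_ampere_or_newer is a hypothetical helper, not a library function.

```python
# Hypothetical helper: decides whether a GPU is Ampere or newer from its
# CUDA compute capability, given as a (major, minor) tuple.

AMPERE_MIN_CAPABILITY = (8, 0)  # Ampere GPUs start at compute capability 8.0


def is_ampere_or_newer(capability):
    """Compare (major, minor) against the Ampere baseline."""
    return tuple(capability) >= AMPERE_MIN_CAPABILITY


turing_ok = is_ampere_or_newer((7, 5))  # Turing-class GPU
ada_ok = is_ampere_or_newer((8, 9))     # Ada-class GPU
print(turing_ok, ada_ok)
```

On a machine with PyTorch and a CUDA device, the tuple can be obtained from torch.cuda.get_device_capability().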

Saving and Loading Cross-Compiled Programs

The output of cross_compile_for_windows is a standard torch.fx.GraphModule containing TorchTensorRTModule submodules with Windows-compatible engine bytes. Save and load via the standard Torch-TensorRT save/load API:

# Save (Linux)
torch_tensorrt.save(trt_gm, "model_windows.ep", arg_inputs=inputs)

# Load (Windows)
trt_gm = torch_tensorrt.load("model_windows.ep").module()
trt_gm(*inputs)

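convert_exported_program_to_serialized_trt_engine() returns raw serialized engine bytes for direct TRT deployment, but it does not support cross-compilation, so the engine it produces targets the build platform rather than Windows: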

engine_bytes = torch_tensorrt.dynamo.convert_exported_program_to_serialized_trt_engine(
    exp_program,
    arg_inputs=inputs,
    enable_cross_compile_for_windows=False,  # use cross_compile_for_windows() instead
)
# Note: use cross_compile_for_windows() for the full workflow;
# convert_exported_program_to_serialized_trt_engine() does not support cross-compilation.

Troubleshooting

AssertionError: cross_compile_for_windows is only supported on Linux x86-64

You must run the compilation step on a Linux x86-64 machine. The @needs_cross_compile decorator gates this function.
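A build script can fail early with a clearer message by checking the host before calling the API. The sketch below uses only the standard platform module; is_supported_build_host is a hypothetical helper, and the library's own check may differ in detail.

```python
import platform

# Hypothetical pre-flight check for a build script; the library's own
# platform gate may use different criteria.


def is_supported_build_host(system=None, machine=None):
    """True when the host looks like Linux on x86-64."""
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine
    return system == "Linux" and machine.lower() in ("x86_64", "amd64")


if not is_supported_build_host():
    print("Cross-compilation for Windows must be run on a Linux x86-64 host.")
```

The optional arguments exist only to make the helper easy to test; in normal use it is called with no arguments.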

Engine fails to load on Windows

Ensure the TRT version on Windows is ≥ the version used on Linux. Use version_compatible=True for forward compatibility within a TRT major version.
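The two conditions above (Windows version not older than the Linux version, and the same major version when relying on version_compatible=True) can be checked from the version strings alone. This is a minimal sketch with hypothetical helper names; it assumes dotted numeric versions such as "10.3.0".

```python
# Hypothetical helpers for comparing TensorRT version strings.
# Assumes the first three dot-separated fields are integers.


def parse_version(text):
    """Turn '10.3.0' into (10, 3, 0); fields after the third are ignored."""
    return tuple(int(part) for part in text.split(".")[:3])


def runtime_is_new_enough(build_version, runtime_version):
    """True if the runtime (Windows) version is >= the build (Linux) version."""
    return parse_version(runtime_version) >= parse_version(build_version)


def same_major(build_version, runtime_version):
    """True if both versions share the same major number."""
    return parse_version(build_version)[0] == parse_version(runtime_version)[0]


newer_ok = runtime_is_new_enough("10.3.0", "10.7.0")
older_ok = runtime_is_new_enough("10.3.0", "10.1.0")
print(newer_ok, older_ok, same_major("10.3.0", "10.7.0"))
```

On each machine the version string can be read from tensorrt.__version__ and recorded next to the saved file.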

Output mismatch between Linux and Windows

Floating-point results may differ slightly due to different driver/hardware implementations. Use optimization_level=0 on Linux to minimize kernel specialization and improve cross-platform reproducibility.
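When comparing results across the two platforms, a tolerance-based comparison is usually more informative than exact equality. The sketch below works on flat lists of floats and uses only the standard library; outputs_match is a hypothetical helper, and the tolerances and sample values are illustrative assumptions, not recommended thresholds or measured results.

```python
import math

# Hypothetical helper: element-wise comparison with relative and absolute
# tolerances. The default tolerances are illustrative, not recommendations.


def outputs_match(a, b, rel_tol=1e-3, abs_tol=1e-5):
    """True if both sequences have equal length and all elements are close."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol) for x, y in zip(a, b)
    )


# Made-up sample values standing in for flattened model outputs
linux_out = [0.1250, -1.5000, 3.2500]
windows_out = [0.1251, -1.5002, 3.2499]

close = outputs_match(linux_out, windows_out)
far = outputs_match(linux_out, [0.1250, -1.4000, 3.2500])
print(close, far)
```

A tensor output can be converted to such a list with output.flatten().cpu().tolist() on each machine.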