# CUDA Backend

The CUDA backend is the ExecuTorch solution for running models on NVIDIA GPUs. It leverages the AOTInductor compiler to generate optimized CUDA kernels that execute without libtorch, and uses Triton for high-performance GPU kernel generation.
## Features

- **Optimized GPU Execution:** Uses AOTInductor to generate highly optimized CUDA kernels for model operators.
- **Triton Kernel Support:** Leverages Triton for GEMM (General Matrix Multiply), convolution, and SDPA (Scaled Dot-Product Attention) kernels.
- **Quantization Support:** INT4 weight quantization with a tile-packed format for improved performance and a reduced memory footprint (see the sketch after this list).
- **Cross-Platform:** Supports both Linux and Windows.
- **Multiple Model Support:** Works with a range of models, including LLMs, vision-language models, and audio models.
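The quantization entry points live in the example export scripts, but as a rough illustration of what INT4 weight-only quantization can look like at the Python level, here is a minimal sketch assuming torchao's `quantize_` API (the actual export scripts may configure this differently, and `YourModel` is a placeholder):

```python
import torch
from torchao.quantization import Int4WeightOnlyConfig, quantize_

# Apply INT4 weight-only quantization in place before export.
# group_size trades accuracy for memory; 128 is a common default.
model = YourModel().eval().cuda()
quantize_(model, Int4WeightOnlyConfig(group_size=128))
```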
## Target Requirements

Below are the requirements for running a CUDA-delegated ExecuTorch model:

- **Hardware:** A CUDA-capable NVIDIA GPU.
- **CUDA Toolkit:** CUDA 11.x or later (CUDA 12.x recommended).
- **Operating System:** Linux or Windows.
- **Drivers:** PyTorch-compatible NVIDIA GPU drivers installed.
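A quick way to confirm that a target machine meets these requirements is to query the GPU through PyTorch:

```python
import torch

# Verify that a CUDA device and a compatible driver are visible to PyTorch.
assert torch.cuda.is_available(), "No CUDA device detected"
print(torch.cuda.get_device_name(0))        # e.g. "NVIDIA A100-SXM4-80GB"
print(torch.cuda.get_device_capability(0))  # compute capability, e.g. (8, 0)
```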
## Development Requirements

To develop and export models using the CUDA backend:

- **Python:** Python 3.8+.
- **PyTorch:** PyTorch with CUDA support.
- **ExecuTorch:** Install ExecuTorch with CUDA backend support.
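Before exporting, it is worth sanity-checking the development environment. The snippet below is a minimal check: a CPU-only PyTorch build reports `torch.version.cuda` as `None`, and the backend imports fail if ExecuTorch was installed without CUDA support:

```python
import torch

# A CUDA-enabled PyTorch build reports its toolkit version, e.g. "12.4".
print(torch.version.cuda)

# These imports fail if the ExecuTorch CUDA backend is not installed.
from executorch.backends.cuda.cuda_backend import CudaBackend
from executorch.backends.cuda.cuda_partitioner import CudaPartitioner
```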
## Using the CUDA Backend

### Exporting Models with Python API

The CUDA backend uses the `CudaBackend` and `CudaPartitioner` classes to export models. Here is a complete example:
```python
import torch

from executorch.backends.cuda.cuda_backend import CudaBackend
from executorch.backends.cuda.cuda_partitioner import CudaPartitioner
from executorch.exir import EdgeCompileConfig, to_edge_transform_and_lower
from executorch.extension.export_util.utils import save_pte_program

# Configure edge compilation
edge_compile_config = EdgeCompileConfig(
    _check_ir_validity=False,
    _skip_dim_order=True,
)

# Define your model
model_name = "my_model"
model = YourModel().eval()
example_inputs = (torch.randn(1, 3, 224, 224),)

# Export the model using torch.export
exported_program = torch.export.export(model, example_inputs)

# Create the CUDA partitioner
partitioner = CudaPartitioner(
    [CudaBackend.generate_method_name_compile_spec(model_name)]
)

# Add decompositions so that Triton can generate kernels for the decomposed
# ops. This step is only needed for ops your model actually contains;
# conv1d_to_conv2d is a decomposition helper defined in the CUDA export script.
exported_program = exported_program.run_decompositions({
    torch.ops.aten.conv1d.default: conv1d_to_conv2d,
})

# Lower to ExecuTorch with the CUDA backend
et_program = to_edge_transform_and_lower(
    exported_program,
    partitioner=[partitioner],
    compile_config=edge_compile_config,
)

# Convert to an executable program and save
exec_program = et_program.to_executorch()
save_pte_program(exec_program, model_name, "./output_dir")
```
This generates a `.pte` program file and a `.ptd` data file (holding the model weights) that can be executed on CUDA devices.
For a complete working example, see the CUDA export script.
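To check how much of the graph was actually delegated to CUDA before saving, you can summarize the lowered program with ExecuTorch's devtools (here `et_program` is the `EdgeProgramManager` produced in the example above):

```python
from executorch.devtools.backend_debug import get_delegation_info

# Print per-operator counts of delegated vs. non-delegated nodes.
delegation_info = get_delegation_info(et_program.exported_program().graph_module)
print(delegation_info.get_summary())
```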
## Runtime Integration
To run the model on device, use the standard ExecuTorch runtime APIs. See Running on Device for more information.
When building from source, pass -DEXECUTORCH_BUILD_CUDA=ON when configuring the CMake build to compile the CUDA backend.
```cmake
# CMakeLists.txt
add_subdirectory("executorch")
...
target_link_libraries(
    my_target
    PRIVATE executorch
            extension_module_static
            extension_tensor
            aoti_cuda_backend
)
```
No additional steps are necessary to use the backend beyond linking the target. CUDA-delegated .pte and .ptd files will automatically run on the registered backend.
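For a quick smoke test from Python, ExecuTorch's Python runtime API can load and run the saved program. This is a minimal sketch: it assumes an ExecuTorch Python installation built with the CUDA backend enabled, and it does not cover loading the separate `.ptd` weight file (the example runners linked below show the complete flow):

```python
import torch
from executorch.runtime import Runtime

# Load the delegated program and run its "forward" method.
runtime = Runtime.get()
program = runtime.load_program("./output_dir/my_model.pte")
method = program.load_method("forward")
outputs = method.execute([torch.randn(1, 3, 224, 224)])
print(outputs)
```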
## Examples

For complete end-to-end examples of exporting and running models with the CUDA backend, see:

- **Whisper** — Audio transcription model with CUDA support
- **Voxtral** — Audio multimodal model with CUDA support
- **Gemma3** — Vision-language model with CUDA support
These examples demonstrate the full workflow including model export, quantization options, building runners, and runtime execution.
ExecuTorch provides Makefile targets for building these example runners:
```bash
make whisper-cuda   # Build Whisper runner with CUDA
make voxtral-cuda   # Build Voxtral runner with CUDA
make gemma3-cuda    # Build Gemma3 runner with CUDA
```