Arm Cortex-M Backend#
Note
This backend is a work-in-progress proof of concept. It is not intended for production use, and APIs may change without notice.
The Arm® Cortex®-M backend accelerates quantized model execution on Arm Cortex-M CPUs using CMSIS-NN optimized kernels. Unlike delegate-based backends, it operates as an operator library: quantized subgraphs are replaced with CMSIS-NN accelerated kernels during the pass-lowering stage, while unsupported operators fall back to portable fp32 kernels.
Target Support#
The backend targets Arm Cortex-M CPUs via CMSIS-NN, which provides optimized kernel implementations for three instruction set variants:
| Variant | Description | Example CPUs | Supported |
|---|---|---|---|
| MVE (Helium) | M-profile Vector Extension | Cortex-M55, M85 | ✅ |
| DSP | DSP extension instructions | Cortex-M4, M7, M33 | ⬜ |
| Pure C | Reference C implementation | Any Cortex-M | ⬜ |
DSP and pure C variants use the same CMSIS-NN API and may work, but have not been tested.
CMSIS-NN Supported Operators#
The backend pass pipeline replaces quantized ATen operators with CMSIS-NN kernel calls. See the CMSIS-NN API documentation for the full list of available kernels.
ATen Op |
CMSIS-NN Kernel |
8w8a |
8w16a |
4w8a |
|---|---|---|---|---|
|
|
✅ |
⬜ |
⬜ |
|
|
✅ |
⬜ |
⬜ |
|
|
✅ |
⬜ |
⬜ |
|
|
✅ |
⬜ |
⬜ |
|
|
✅ |
⬜ |
⬜ |
|
|
✅ |
⬜ |
N/A |
|
|
✅ |
⬜ |
N/A |
|
|
✅ |
⬜ |
N/A |
|
|
✅ |
⬜ |
N/A |
|
|
✅ |
⬜ |
N/A |
|
|
✅ |
⬜ |
N/A |
|
|
✅ |
⬜ |
N/A |
|
|
✅ |
⬜ |
N/A |
|
|
✅ |
⬜ |
N/A |
— |
LSTM |
⬜ |
⬜ |
⬜ |
— |
SVDF |
⬜ |
⬜ |
⬜ |
Quantization Support#
The Cortex-M backend currently implements symmetric INT8 (8w8a) quantization:
Per-channel quantization for convolution operators.
Per-tensor quantization for all other supported operators.
Shared quantization parameters for data-movement operators (e.g. reshape, permute) to avoid unnecessary requantization.
CMSIS-NN also supports INT4 weights with INT8 activations (4w8a), INT8 weights with INT16 activations (8w16a), and per-channel quantization for fully connected layers, but the corresponding quantizer configurations and operator implementations are not yet integrated.
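The difference between per-tensor and per-channel symmetric quantization can be illustrated with a small pure-Python sketch. This is not the backend's quantizer: the helper names, the example weights, and the clamping range of [-127, 127] are assumptions chosen for illustration only.

```python
def symmetric_scale(values, qmax=127):
    """Scale that maps the largest magnitude in `values` onto qmax (zero point is 0)."""
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, qmax=127):
    """Round each value to the nearest integer step and clamp to [-qmax, qmax]."""
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


# Hypothetical weights for two output channels with very different ranges.
weights = [
    [0.4, -1.0, 0.25],
    [0.008, -0.02, 0.005],
]

# Per-tensor: a single scale shared by every channel.
per_tensor_scale = symmetric_scale([w for channel in weights for w in channel])
per_tensor_q = [quantize(channel, per_tensor_scale) for channel in weights]

# Per-channel: one scale per output channel.
per_channel_scales = [symmetric_scale(channel) for channel in weights]
per_channel_q = [
    quantize(channel, scale) for channel, scale in zip(weights, per_channel_scales)
]

print("per-tensor :", per_tensor_q)
print("per-channel:", per_channel_q)
```

With a shared scale, the small-magnitude channel collapses onto only a few integer levels, while a per-channel scale lets each channel use the full integer range. This is the usual motivation for quantizing convolution weights per channel.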
Tutorial#
Prerequisites#
Install the ExecuTorch pip package:
./install_executorch.sh
For cross-compilation and running on simulated hardware:
Arm GNU Toolchain for cross-compilation.
Arm® Corstone™ SSE-300 FVP or SSE-320 FVP for simulation.
Tip
All cross-compilation tools can be downloaded and added to the path:
examples/arm/setup.sh --i-agree-to-the-contained-eula
source examples/arm/arm-scratch/setup_path.sh
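Before building, it can help to confirm that the tools are actually reachable from the current shell. The following is a minimal sketch using only the Python standard library. The executable names are assumptions based on typical Arm GNU Toolchain and Corstone-300 FVP installations and may differ on your system, and `missing_tools` is a hypothetical helper rather than part of ExecuTorch.

```python
import shutil


def missing_tools(names):
    """Return the executables from `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


# Assumed executable names; adjust them to match your installation.
REQUIRED_TOOLS = ["cmake", "arm-none-eabi-gcc", "FVP_Corstone_SSE-300_Ethos-U55"]

if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools found.")
```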
1. Export and quantize#
Export the model, then quantize using CortexMQuantizer with the PT2E quantization flow:
import torch
from torchvision.models import mobilenet_v2, MobileNet_V2_Weights
from executorch.backends.cortex_m.quantizer.quantizer import CortexMQuantizer
from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e
model = mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).eval()
example_input = torch.randn(1, 3, 224, 224).to(memory_format=torch.channels_last)
exported_program = torch.export.export(model, (example_input,))
graph_module = exported_program.module()
quantizer = CortexMQuantizer()
prepared = prepare_pt2e(graph_module, quantizer)
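# Calibrate with representative data. calibration_data is not defined here:
# supply an iterable of input tensors shaped like example_input.
for calibration_input in calibration_data:
    prepared(calibration_input)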
quantized = convert_pt2e(prepared)
quantized_exported_program = torch.export.export(quantized, (example_input,))
2. Lower to edge and apply Cortex-M passes#
Lower to the edge dialect with a custom EdgeCompileConfig, then run the CortexMPassManager to replace quantized subgraphs with CMSIS-NN operator implementations:
from executorch.exir import EdgeCompileConfig, ExecutorchBackendConfig, to_edge
from executorch.backends.cortex_m.passes.cortex_m_pass_manager import CortexMPassManager
config = EdgeCompileConfig(
    preserve_ops=[
        torch.ops.aten.linear.default,
        torch.ops.aten.hardsigmoid.default,
        torch.ops.aten.hardsigmoid_.default,
        torch.ops.aten.hardswish.default,
        torch.ops.aten.hardswish_.default,
    ],
    _check_ir_validity=False,
    _core_aten_ops_exception_list=[torch.ops.aten.max_pool2d.default],
)
edge_program_manager = to_edge(quantized_exported_program, compile_config=config)
pass_manager = CortexMPassManager(edge_program_manager.exported_program())
edge_program_manager._edge_programs["forward"] = pass_manager.transform()
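To see which operators were replaced and which remain for the portable fallback kernels, one option is to tally the call targets in the transformed graph. The helper below is a sketch and not part of the backend: `count_call_targets` is a hypothetical name, and it relies only on the `op` and `target` attributes that `torch.fx` graph nodes expose.

```python
from collections import Counter


def count_call_targets(nodes):
    """Count how often each operator target appears among call_function nodes."""
    return Counter(
        str(node.target) for node in nodes if node.op == "call_function"
    )


# Usage, assuming step 2 above has already run:
#
#   graph = edge_program_manager.exported_program().graph_module.graph
#   for target, count in count_call_targets(graph.nodes).most_common():
#       print(f"{count:4d}  {target}")
```

How the printed target names distinguish Cortex-M operators from portable ones depends on the backend's operator namespace, so treat the output as a starting point for inspection rather than a definitive report.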
3. Serialize to .pte#
Convert the edge program to an ExecuTorch program and write the serialized buffer to a .pte file:
executorch_program = edge_program_manager.to_executorch(
    config=ExecutorchBackendConfig(extract_delegate_segments=False)
)
with open("model.pte", "wb") as f:
    f.write(executorch_program.buffer)
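Because the .pte file is linked into the runner binary in the next step, its size counts toward the memory available on the target. A small sketch for checking it follows. `pte_size_kib` is a hypothetical helper, and any budget you compare the result against depends on your board.

```python
from pathlib import Path


def pte_size_kib(path):
    """Return the size of a serialized program file in KiB."""
    return Path(path).stat().st_size / 1024


if __name__ == "__main__":
    pte = Path("model.pte")
    if pte.exists():
        print(f"{pte}: {pte_size_kib(pte):.1f} KiB")
    else:
        print(f"{pte} not found; run the serialization step first.")
```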
4. Cross-compile and run#
Cross-compile the ExecuTorch runtime, Cortex-M kernels, and the example runner application. The first cmake invocation builds the ExecuTorch libraries for Arm baremetal. The second builds the arm_executor_runner and links it against those libraries with the .pte model baked in.
# Build ExecuTorch libraries for Arm baremetal
cmake --preset arm-baremetal \
    -DCMAKE_BUILD_TYPE=Release \
    -DEXECUTORCH_BUILD_DEVTOOLS=ON \
    -Bcmake-out-arm
cmake --build cmake-out-arm --target install -j$(nproc)
# Build the executor runner, linking the .pte into the binary
cmake -DCMAKE_TOOLCHAIN_FILE=$(pwd)/examples/arm/ethos-u-setup/arm-none-eabi-gcc.cmake \
    -DCMAKE_BUILD_TYPE=Release \
    -DET_PTE_FILE_PATH=$(pwd)/model.pte \
    -DTARGET_CPU=cortex-m55 \
    -Bbuild \
    examples/arm/executor_runner
cmake --build build -j$(nproc) -- arm_executor_runner
Run on a simulated Cortex-M target:
backends/arm/scripts/run_fvp.sh --elf=build/arm_executor_runner --target=ethos-u55-128
For a complete end-to-end walkthrough including dataset setup, calibration, and result validation, see the Cortex-M MobileNetV2 notebook.