# Arm Cortex-M Backend

:::{note}
This backend is a work-in-progress proof of concept. It is not intended for production use, and APIs may change without notice.
:::

The Arm&reg; Cortex&reg;-M backend accelerates quantized model execution on Arm Cortex-M CPUs using [CMSIS-NN](https://arm-software.github.io/CMSIS-NN/latest/) optimized kernels. Unlike delegate-based backends, it operates as an operator library: quantized subgraphs are replaced with CMSIS-NN accelerated kernels during the pass-lowering stage, while unsupported operators fall back to portable fp32 kernels.

## Target Support

The backend targets Arm Cortex-M CPUs via CMSIS-NN, which provides optimized kernel implementations for three instruction set variants:

| Variant      | Description                 | Example CPUs       | Supported |
|--------------|-----------------------------|--------------------|-----------|
| MVE (Helium) | M-profile Vector Extension  | Cortex-M55, M85    | ✅        |
| DSP          | DSP extension instructions  | Cortex-M4, M7, M33 | ⬜        |
| Pure C       | Reference C implementation  | Any Cortex-M       | ⬜        |

The DSP and pure C variants (marked ⬜) use the same CMSIS-NN API and may work, but they have not been tested.

## CMSIS-NN Supported Operators

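The backend pass pipeline replaces quantized ATen operators with [CMSIS-NN](https://arm-software.github.io/CMSIS-NN/latest/) kernel calls. See the [CMSIS-NN API documentation](https://arm-software.github.io/CMSIS-NN/latest/modules.html) for the full list of available kernels. In the table below, the column headers give weight and activation bit widths (8w8a is INT8 weights with INT8 activations); ✅ marks a kernel that is integrated and ⬜ marks one that is not yet integrated.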

| ATen Op                        | CMSIS-NN Kernel        | 8w8a | 8w16a | 4w8a |
|--------------------------------|------------------------|------|-------|------|
| `aten.convolution`             | `arm_convolve`         | ✅   | ⬜    | ⬜   |
| `aten.convolution` (depthwise) | `arm_depthwise_conv`   | ✅   | ⬜    | ⬜   |
| `aten.convolution` (transposed)| `arm_transpose_conv`   | ✅   | ⬜    | ⬜   |
| `aten.linear`                  | `arm_fully_connected`  | ✅   | ⬜    | ⬜   |
| `aten.bmm`                     | `arm_batch_matmul`     | ✅   | ⬜    | ⬜   |
| `aten.add`                     | `arm_elementwise_add`  | ✅   | ⬜    | N/A  |
| `aten.mul`                     | `arm_elementwise_mul`  | ✅   | ⬜    | N/A  |
| `aten.max_pool2d`              | `arm_max_pool`         | ✅   | ⬜    | N/A  |
| `aten.avg_pool2d`              | `arm_avgpool`          | ✅   | ⬜    | N/A  |
| `aten._softmax`                | `arm_softmax`          | ✅   | ⬜    | N/A  |
| `aten.minimum`                 | `arm_minimum`          | ✅   | ⬜    | N/A  |
| `aten.maximum`                 | `arm_maximum`          | ✅   | ⬜    | N/A  |
| `aten.permute_copy`            | `arm_transpose`        | ✅   | ⬜    | N/A  |
| `aten.constant_pad_nd`         | `arm_pad`              | ✅   | ⬜    | N/A  |
| —                              | LSTM                   | ⬜   | ⬜    | ⬜   |
| —                              | SVDF                   | ⬜   | ⬜    | ⬜   |
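
As a rough illustration of the operator-library model, the sketch below transcribes the 8w8a column of the table into a lookup and splits a list of operators into those with a CMSIS-NN kernel and those left to the portable kernels. The dictionary and the `plan_kernels` function are hypothetical and are not part of ExecuTorch; the real pass pipeline works on quantized graph patterns rather than on operator names alone.

```python
# 8w8a column of the table above, transcribed by hand. Illustrative only:
# this dictionary and function are hypothetical, not an ExecuTorch API.
CMSIS_NN_8W8A_KERNELS = {
    "aten.convolution": "arm_convolve",
    "aten.convolution (depthwise)": "arm_depthwise_conv",
    "aten.convolution (transposed)": "arm_transpose_conv",
    "aten.linear": "arm_fully_connected",
    "aten.bmm": "arm_batch_matmul",
    "aten.add": "arm_elementwise_add",
    "aten.mul": "arm_elementwise_mul",
    "aten.max_pool2d": "arm_max_pool",
    "aten.avg_pool2d": "arm_avgpool",
    "aten._softmax": "arm_softmax",
    "aten.minimum": "arm_minimum",
    "aten.maximum": "arm_maximum",
    "aten.permute_copy": "arm_transpose",
    "aten.constant_pad_nd": "arm_pad",
}


def plan_kernels(ops):
    """Split ops into those with a listed CMSIS-NN kernel and the rest."""
    accelerated = {
        op: CMSIS_NN_8W8A_KERNELS[op] for op in ops if op in CMSIS_NN_8W8A_KERNELS
    }
    fallback = [op for op in ops if op not in CMSIS_NN_8W8A_KERNELS]
    return accelerated, fallback


# "aten.gelu" stands in for any operator that the table does not list.
accelerated, fallback = plan_kernels(["aten.linear", "aten.add", "aten.gelu"])
print("CMSIS-NN:", accelerated)
print("portable fallback:", fallback)
```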

## Quantization Support

The Cortex-M backend currently implements **symmetric INT8 (8w8a)** quantization:
- **Per-channel** quantization for convolution operators.
- **Per-tensor** quantization for all other supported operators.
- **Shared quantization parameters** for data-movement operators (e.g. reshape, permute) to avoid unnecessary requantization.

CMSIS-NN also supports INT4 weights with INT8 activations (4w8a), INT8 weights with INT16 activations (8w16a), and per-channel quantization for fully connected layers, but the corresponding quantizer configurations and operator implementations are not yet integrated.
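
The difference between per-tensor and per-channel quantization can be sketched in plain Python. This is a minimal illustration and not the `CortexMQuantizer` implementation: it assumes that "symmetric" means a zero-point fixed at 0, it clamps to [-127, 127], and the weight values are made up for the example.

```python
# Illustrative sketch only; not the CortexMQuantizer implementation.
# Assumptions: zero-point is fixed at 0 and values are clamped to [-127, 127].

def symmetric_scale(values, qmax=127):
    """Scale that maps the largest magnitude in `values` onto qmax."""
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, qmin=-127, qmax=127):
    """Divide by the scale, round to the nearest integer, and clamp."""
    return [max(qmin, min(qmax, round(v / scale))) for v in values]


# Hypothetical weights for two output channels with very different ranges.
weights = [
    [0.02, -0.012, 0.005],  # channel 0: small magnitudes
    [1.5, -2.0, 0.75],      # channel 1: large magnitudes
]

# Per-tensor: a single scale derived from every value in the tensor.
flat = [w for channel in weights for w in channel]
per_tensor_scale = symmetric_scale(flat)
per_tensor_q = [quantize(channel, per_tensor_scale) for channel in weights]

# Per-channel: each output channel gets its own scale.
per_channel_scales = [symmetric_scale(channel) for channel in weights]
per_channel_q = [
    quantize(channel, scale) for channel, scale in zip(weights, per_channel_scales)
]

print("per-tensor :", per_tensor_q)
print("per-channel:", per_channel_q)
```

With a single scale, the small-magnitude channel collapses to a handful of integer levels, while per-channel scales let it use the full range. That is the usual motivation for quantizing convolution weights per channel.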

## Tutorial

### Prerequisites

Install the ExecuTorch pip package by running the install script from the root of the ExecuTorch repository:
```bash
./install_executorch.sh
```

For cross-compilation and running on simulated hardware:
- [Arm GNU Toolchain](https://developer.arm.com/Tools%20and%20Software/GNU%20Toolchain) for cross-compilation.
- [Arm&reg; Corstone&trade; SSE-300 FVP](https://developer.arm.com/documentation/100966/1128/Arm--Corstone-SSE-300-FVP) or [SSE-320 FVP](https://developer.arm.com/documentation/109760/0000/SSE-320-FVP) for simulation.

:::{tip}
The setup script below downloads all cross-compilation tools, and sourcing `setup_path.sh` adds them to your `PATH`:
```bash
examples/arm/setup.sh --i-agree-to-the-contained-eula
source examples/arm/arm-scratch/setup_path.sh
```
:::
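
Before building, it can help to confirm that the tools are actually visible on `PATH`. The following is a minimal sketch using only the Python standard library. The executable names are assumptions about a typical setup (in particular the FVP binary name), so adjust the list to match your installation.

```python
import shutil

# Executable names are assumptions about a typical setup; adjust as needed.
REQUIRED_TOOLS = [
    "cmake",
    "arm-none-eabi-gcc",               # Arm GNU Toolchain compiler
    "FVP_Corstone_SSE-300_Ethos-U55",  # Corstone-300 FVP (assumed binary name)
]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All required tools were found on PATH.")
```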

### 1. Export and quantize

Export the model, then quantize using `CortexMQuantizer` with the PT2E quantization flow:

```python
import torch
from torchvision.models import mobilenet_v2, MobileNet_V2_Weights
from executorch.backends.cortex_m.quantizer.quantizer import CortexMQuantizer
from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e

model = mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).eval()

example_input = torch.randn(1, 3, 224, 224).to(memory_format=torch.channels_last)
exported_program = torch.export.export(model, (example_input,))
graph_module = exported_program.module()

quantizer = CortexMQuantizer()
prepared = prepare_pt2e(graph_module, quantizer)

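# Calibrate with representative data. The random tensors below are placeholders
# so that the snippet runs; substitute real, preprocessed samples in practice.
calibration_data = [
    torch.randn(1, 3, 224, 224).to(memory_format=torch.channels_last)
    for _ in range(8)
]
for calibration_input in calibration_data:
    prepared(calibration_input)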

quantized = convert_pt2e(prepared)
quantized_exported_program = torch.export.export(quantized, (example_input,))
```

### 2. Lower to edge and apply Cortex-M passes

Lower to the edge dialect with a custom `EdgeCompileConfig`, then run the `CortexMPassManager` to replace quantized subgraphs with CMSIS-NN operator implementations:

```python
import torch
from executorch.exir import EdgeCompileConfig, ExecutorchBackendConfig, to_edge
from executorch.backends.cortex_m.passes.cortex_m_pass_manager import CortexMPassManager

config = EdgeCompileConfig(
    preserve_ops=[
        torch.ops.aten.linear.default,
        torch.ops.aten.hardsigmoid.default,
        torch.ops.aten.hardsigmoid_.default,
        torch.ops.aten.hardswish.default,
        torch.ops.aten.hardswish_.default,
    ],
    _check_ir_validity=False,
    _core_aten_ops_exception_list=[torch.ops.aten.max_pool2d.default],
)

edge_program_manager = to_edge(quantized_exported_program, compile_config=config)

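pass_manager = CortexMPassManager(edge_program_manager.exported_program())
# Swap the transformed program back in. This writes to a private attribute
# (`_edge_programs`), so it may break if the internals change.
edge_program_manager._edge_programs["forward"] = pass_manager.transform()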
```

### 3. Serialize to .pte

```python
executorch_program = edge_program_manager.to_executorch(
    config=ExecutorchBackendConfig(extract_delegate_segments=False)
)

with open("model.pte", "wb") as f:
    f.write(executorch_program.buffer)
```

### 4. Cross-compile and run

Cross-compile the ExecuTorch runtime, the Cortex-M kernels, and the example runner application. The first `cmake` invocation builds the ExecuTorch libraries for Arm bare-metal targets. The second builds the [arm_executor_runner](https://github.com/pytorch/executorch/blob/main/examples/arm/executor_runner/) and links it against those libraries, embedding the `.pte` model in the binary.

```bash
# Build ExecuTorch libraries for Arm baremetal
cmake --preset arm-baremetal \
  -DCMAKE_BUILD_TYPE=Release \
  -DEXECUTORCH_BUILD_DEVTOOLS=ON \
  -Bcmake-out-arm
cmake --build cmake-out-arm --target install -j$(nproc)

# Build the executor runner, linking the .pte into the binary
cmake -DCMAKE_TOOLCHAIN_FILE=$(pwd)/examples/arm/ethos-u-setup/arm-none-eabi-gcc.cmake \
      -DCMAKE_BUILD_TYPE=Release \
      -DET_PTE_FILE_PATH=$(pwd)/model.pte \
      -DTARGET_CPU=cortex-m55 \
      -Bbuild \
      examples/arm/executor_runner
cmake --build build -j$(nproc) -- arm_executor_runner
```

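Run on a simulated Cortex-M target. The `ethos-u55-128` target name selects the Corstone-300 FVP, which pairs a Cortex-M55 CPU with an Ethos-U55 NPU; the Cortex-M backend runs on the CPU: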

```bash
backends/arm/scripts/run_fvp.sh --elf=build/arm_executor_runner --target=ethos-u55-128
```

For a complete end-to-end walkthrough including dataset setup, calibration, and result validation, see the [Cortex-M MobileNetV2 notebook](https://github.com/pytorch/executorch/blob/main/examples/arm/cortex_m_mv2_example.ipynb).