Advanced Pathway#

This pathway is for engineers building production-grade deployments, implementing custom backends, optimizing for constrained hardware, or working with large language models on edge devices. It assumes familiarity with the ExecuTorch export pipeline and at least one successful model deployment.


Advanced Topic Areas#

Select the area most relevant to your current work. Each section provides a curated sequence of documentation with dependencies noted.


Quantization and Optimization#

Quantization is often the most impactful optimization available in ExecuTorch: depending on the scheme, it reduces model size by roughly 2–8× relative to FP32 and can significantly improve latency on backends that support the quantized operators.
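The 2–8× range follows directly from the bit width of each weight. The sketch below is plain arithmetic, not an ExecuTorch API; it ignores scales, zero points, and file overhead, and the 1B-parameter model is a hypothetical example.

```python
# Bits used to store one weight under each scheme.
BITS_PER_WEIGHT = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}


def weight_bytes(num_params: int, scheme: str) -> int:
    """Approximate weight storage in bytes (no scales, zero points, or headers)."""
    return num_params * BITS_PER_WEIGHT[scheme] // 8


def compression_ratio(scheme: str, baseline: str = "FP32") -> float:
    """How many times smaller the weights are than under the baseline scheme."""
    return BITS_PER_WEIGHT[baseline] / BITS_PER_WEIGHT[scheme]


if __name__ == "__main__":
    num_params = 1_000_000_000  # hypothetical 1B-parameter model
    for scheme in BITS_PER_WEIGHT:
        size_gb = weight_bytes(num_params, scheme) / 1e9
        ratio = compression_ratio(scheme)
        print(f"{scheme}: {size_gb:.1f} GB ({ratio:.0f}x vs FP32)")
```

Real savings are usually somewhat lower than these ratios, because quantization parameters are stored alongside the weights and some layers may be left in higher precision.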

Quantization Overview

Introduction to ExecuTorch’s quantization framework, including supported schemes (INT8, INT4, FP16) and the relationship between quantization and backend selection.

Difficulty: Intermediate

Quantization
Quantization & Optimization (Advanced)

Advanced quantization techniques including mixed-precision, per-channel quantization, and calibration workflows for production models.

Difficulty: Advanced

Quantization & Optimization
Model Export and Lowering

Full reference for to_edge_transform_and_lower, including quantization integration, dynamic shapes, and multi-backend lowering.

Difficulty: Advanced

Model Export and Lowering
Backend Dialect

Understanding the Backend Dialect IR and how it differs from Edge Dialect — essential for backend developers and advanced export customization.

Difficulty: Advanced

Backend Dialect

Memory Planning and Runtime Optimization#

Memory planning is critical for constrained devices. ExecuTorch provides ahead-of-time memory planning to eliminate runtime allocations.
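To make the idea of ahead-of-time planning concrete, here is a minimal sketch of lifetime-based buffer reuse in a single arena. It is an illustration only and is not ExecuTorch's planner: `TensorSpec`, `plan_memory`, and the greedy largest-first strategy are assumptions chosen for brevity.

```python
from dataclasses import dataclass


@dataclass
class TensorSpec:
    name: str
    size: int       # bytes
    first_use: int  # index of the op that produces the tensor
    last_use: int   # index of the last op that reads it


def lifetimes_overlap(a: TensorSpec, b: TensorSpec) -> bool:
    return a.first_use <= b.last_use and b.first_use <= a.last_use


def plan_memory(tensors):
    """Assign each tensor a fixed arena offset, reusing space between
    tensors whose lifetimes never overlap. Returns (offsets, arena_size)."""
    placed = []   # (tensor, offset) pairs decided so far
    offsets = {}
    for t in sorted(tensors, key=lambda x: x.size, reverse=True):
        # Address ranges held by tensors that are alive at the same time as t.
        busy = sorted(
            (off, off + other.size)
            for other, off in placed
            if lifetimes_overlap(t, other)
        )
        offset = 0
        for start, end in busy:
            if offset + t.size <= start:
                break  # the gap before this range is large enough
            offset = max(offset, end)
        placed.append((t, offset))
        offsets[t.name] = offset
    arena_size = max((off + t.size for t, off in placed), default=0)
    return offsets, arena_size


if __name__ == "__main__":
    # A four-op chain: each intermediate is read only by the next op.
    chain = [
        TensorSpec("a", 100, 0, 1),
        TensorSpec("b", 100, 1, 2),
        TensorSpec("c", 100, 2, 3),
        TensorSpec("d", 100, 3, 4),
    ]
    offsets, arena_size = plan_memory(chain)
    print(offsets, arena_size)
```

Because every offset is fixed before the program runs, the runtime only needs one preallocated arena. In this chain the four 100-byte tensors fit in 200 bytes instead of 400.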

Memory Planning

How ExecuTorch plans tensor memory at compile time, including memory hierarchy, buffer reuse strategies, and how to customize the planner.

Difficulty: Advanced

Memory Planning
Memory Planning Inspection

Tools for inspecting memory plans and diagnosing memory-related issues in exported programs.

Difficulty: Advanced

Memory Planning Inspection in ExecuTorch
Managing Tensor Memory in C++

The TensorPtr and from_blob APIs for zero-copy tensor management in C++ runtime integrations.

Difficulty: Intermediate

Managing Tensor Memory in C++
Portable C++ Programming

Guidelines for writing ExecuTorch C++ code that runs on bare-metal and RTOS environments without dynamic allocation or standard library dependencies.

Difficulty: Advanced

Portable C++ Programming

Custom Compiler Passes and Kernel Registration#

ExecuTorch’s compiler pass interface allows you to transform the exported graph before lowering, enabling model-specific optimizations and operator fusion.
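As a rough picture of what an operator-fusion pass does, the sketch below rewrites `add(mul(a, b), c)` into a single fused node on a toy list-based IR. The `Node` type, the `fused_mul_add` op name, and `fuse_mul_add` are hypothetical stand-ins; real ExecuTorch passes operate on the exported graph, not on this structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str      # name of the value this node produces
    op: str        # operator name
    inputs: tuple  # names of input values


def fuse_mul_add(nodes):
    """Fuse add(mul(a, b), c) into fused_mul_add(a, b, c).

    Only fuses when the mul is the add's first input and has no other users,
    so removing the mul cannot break any other consumer.
    """
    by_name = {n.name: n for n in nodes}
    users = {}
    for n in nodes:
        for i in n.inputs:
            users[i] = users.get(i, 0) + 1

    fused_away = set()
    result = []
    for n in nodes:
        if n.op == "add":
            producer = by_name.get(n.inputs[0])
            if (
                producer is not None
                and producer.op == "mul"
                and users[producer.name] == 1
            ):
                fused_away.add(producer.name)
                # Keep the add's name so downstream nodes stay valid.
                n = Node(n.name, "fused_mul_add", producer.inputs + n.inputs[1:])
        result.append(n)
    return [n for n in result if n.name not in fused_away]


if __name__ == "__main__":
    graph = [
        Node("t0", "mul", ("x", "w")),
        Node("t1", "add", ("t0", "b")),
        Node("t2", "relu", ("t1",)),
    ]
    for node in fuse_mul_add(graph):
        print(node)
```

The single-user check is the part worth carrying over to real passes: a producer can only be folded into its consumer if nothing else reads its output.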

Custom Compiler Passes

Writing and registering custom graph transformation passes that run during the export and lowering pipeline.

Difficulty: Advanced

Custom Compiler Passes and Partitioners
Kernel Registration

Registering custom ATen kernel implementations to replace or supplement the portable operator library with hardware-optimized versions.

Difficulty: Advanced

Kernel Registration
Kernel Library Overview

Architecture of ExecuTorch’s kernel library system, including the portable library, custom kernels, and selective build.

Difficulty: Advanced

Overview of ExecuTorch’s Kernel Libraries
Selective Build

Reduce binary size by including only the operators required by your specific model using the selective build system.

Difficulty: Advanced

Kernel Library Selective Build

Backend Delegate Development#

Implementing a new hardware backend for ExecuTorch requires understanding the delegate interface, partitioner API, and runtime integration.
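The partitioner's job can be sketched without any ExecuTorch code: given the operators in a model and the set a backend supports, decide which runs of operators are handed to the delegate. The function below is a deliberately simplified illustration over a linear op sequence; `partition` and the op names are hypothetical, and a real partitioner works on graph nodes and can apply richer constraints.

```python
def partition(ops, supported):
    """Split a linear op sequence into maximal runs that are either fully
    delegated or fully left to the default kernels.

    Returns a list of (delegated, ops_in_segment) tuples in execution order.
    """
    segments = []
    for op in ops:
        delegated = op in supported
        if segments and segments[-1][0] == delegated:
            segments[-1][1].append(op)  # extend the current run
        else:
            segments.append((delegated, [op]))  # start a new run
    return segments


if __name__ == "__main__":
    model_ops = ["conv2d", "relu", "custom_nms", "linear", "softmax"]
    backend_ops = {"conv2d", "relu", "linear"}  # hypothetical backend support
    for delegated, segment in partition(model_ops, backend_ops):
        target = "delegate" if delegated else "fallback"
        print(f"{target}: {segment}")
```

Each delegated segment corresponds to one call into the backend at runtime, so fewer and larger segments generally mean less overhead from crossing the delegate boundary.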

Backend Development Guide

Complete guide to implementing a new ExecuTorch backend delegate, including the BackendInterface, preprocess, and execute methods.

Difficulty: Advanced

Backend Development
Integrating a Backend Delegate

Step-by-step walkthrough of integrating an existing backend delegate into the ExecuTorch build system and runtime.

Difficulty: Beginner (for integration) / Advanced (for implementation)

Integrating a Backend Delegate into ExecuTorch
Delegate and Partitioner

The Partitioner interface for selecting which subgraphs to delegate, including pattern matching and constraint specification.

Difficulty: Advanced

Understanding Backends and Delegates
Backend Delegate Implementation and Linking

Linking backend delegate implementations into the ExecuTorch runtime, including static and dynamic registration patterns.

Difficulty: Advanced

Backend Delegate Implementation and Linking
Lowering a Model as a Delegate

End-to-end example of using to_backend to lower a model subgraph to a custom delegate.

Difficulty: Advanced

Lowering a Model as a Delegate
Debugging Backend Delegates

Techniques for debugging delegate execution, including intermediate output comparison and delegate-specific logging.

Difficulty: Advanced

Debugging Delegation

C++ Runtime Integration#

For embedded, mobile native, and server deployments, the C++ runtime APIs provide full control over model loading, execution, and memory management.

Module Extension (High-Level API)

The Module class provides a high-level C++ API for loading and running .pte files with minimal boilerplate. Recommended for most C++ integrations.

Difficulty: Intermediate

Running an ExecuTorch Model Using the Module Extension in C++
Detailed C++ Runtime APIs

Low-level C++ runtime APIs for fine-grained control over memory allocation, operator dispatch, and execution planning. Required for bare-metal targets.

Difficulty: Intermediate

Detailed C++ Runtime APIs Tutorial
Using ExecuTorch with C++

CMake integration, target linking, cross-compilation setup, and C++ API reference for production deployments.

Difficulty: Advanced

Using ExecuTorch with C++
Runtime Platform Abstraction Layer

The PAL interface for porting ExecuTorch to new operating systems and bare-metal environments.

Difficulty: Advanced

Runtime Platform Abstraction Layer (PAL)

Large Language Models on Edge#

Deploying LLMs to edge devices involves additional complexity around quantization, tokenization, KV-cache management, and platform-specific optimizations.
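KV-cache size is often the deciding factor for whether a model fits on a device, and it can be estimated before any export work. The sketch below uses the standard sizing arithmetic for a decoder-only transformer; the configuration values are hypothetical and the function is not part of ExecuTorch.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    max_seq_len: int,
    bytes_per_element: int,
    batch_size: int = 1,
) -> int:
    """Bytes needed to cache keys and values for every layer and position."""
    per_token = 2 * num_layers * num_kv_heads * head_dim  # 2 = keys + values
    return per_token * max_seq_len * bytes_per_element * batch_size


if __name__ == "__main__":
    # Hypothetical small model: 16 layers, 8 KV heads of dimension 64,
    # 2048-token context, FP16 cache entries (2 bytes each).
    size = kv_cache_bytes(16, 8, 64, 2048, 2)
    print(f"{size / 2**20:.0f} MiB")
```

The cache grows linearly with context length, so halving `max_seq_len` or storing entries in 8 bits instead of 16 each halve the footprint.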

LLM Overview

Complete overview of the ExecuTorch LLM workflow, supported models, and platform-specific deployment paths.

Difficulty: Intermediate

LLMs
Exporting LLMs

The export_llm module for exporting supported LLMs (Llama, Qwen, Phi, SmolLM) with quantization and optimization.

Difficulty: Intermediate

Exporting LLMs
Exporting Custom LLMs

Adapting the export pipeline for custom LLM architectures beyond the officially supported models, using nanoGPT as a worked example.

Difficulty: Intermediate

Exporting custom LLMs
Running LLMs with C++

C++ runtime integration for LLM inference, including tokenizer setup, KV-cache configuration, and streaming output.

Difficulty: Intermediate

Running LLMs with C++
Llama on Qualcomm Android

Deploying Llama 3 3B Instruct on Android using the Qualcomm AI Engine Direct backend with hardware acceleration.

Difficulty: Advanced

Run Llama 3 3B Instruct on Android (with Qualcomm AI Engine Direct Backend)
ExecuTorch on Raspberry Pi

Deploying Llama models on Raspberry Pi 4/5 edge devices using the ExecuTorch runtime.

Difficulty: Intermediate

ExecuTorch on Raspberry Pi

Developer Tools and Debugging#

ExecuTorch provides a comprehensive suite of profiling and debugging tools for diagnosing performance and correctness issues.

Developer Tools Overview

Overview of the ExecuTorch developer tools suite, including ETRecord, ETDump, and the Inspector API.

Difficulty: Intermediate

Introduction to the ExecuTorch Developer Tools
Profiling a Model

Step-by-step tutorial for profiling model execution using ETRecord and ETDump to identify performance bottlenecks.

Difficulty: Intermediate

Developer Tools Usage Tutorials
Profiling and Debugging

Comprehensive debugging guide covering numerical debugging, operator-level profiling, and common failure modes.

Difficulty: Advanced

Profiling and Debugging
Delegate Debugging

Techniques specific to debugging backend delegate execution, including output comparison and delegate-level tracing.

Difficulty: Advanced

Delegate Debugging

IR and Compiler Internals#

For contributors and advanced backend developers who need to understand ExecuTorch’s compiler internals.

Export Overview

The complete export pipeline from torch.export to .pte, including the role of each compilation stage.

Difficulty: Intermediate

Exporting to ExecuTorch
Compiler Entry Points

The public API surface for the ExecuTorch compiler, including to_edge, to_edge_transform_and_lower, and to_executorch.

Difficulty: Intermediate

Compiler Entry Points
IR Specification

Formal specification of the ExecuTorch IR, including operator semantics, type system, and serialization format.

Difficulty: Advanced

IR Specification
Compiler & IR (Advanced)

Advanced IR topics including graph transformations, custom dialects, and the relationship between Export IR and Edge Dialect.

Difficulty: Advanced

Compiler & IR
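The entry points listed under Compiler Entry Points are typically chained in one short script. The following is a minimal sketch, assuming `torch` and `executorch` are installed and that the XNNPACK backend is the target; the `export_to_pte` helper and its default output path are hypothetical names used for illustration.

```python
def export_to_pte(model, example_inputs, output_path="model.pte"):
    """Export an eager PyTorch module to a .pte file.

    `example_inputs` is a tuple of tensors matching the model's forward().
    """
    # Imported lazily so this file loads even without torch/executorch.
    import torch
    from executorch.backends.xnnpack.partition.xnnpack_partitioner import (
        XnnpackPartitioner,
    )
    from executorch.exir import to_edge_transform_and_lower

    # 1. Capture the model as an ExportedProgram.
    exported = torch.export.export(model.eval(), example_inputs)

    # 2. Convert to Edge Dialect and delegate supported subgraphs.
    edge = to_edge_transform_and_lower(
        exported, partitioner=[XnnpackPartitioner()]
    )

    # 3. Produce the final ExecuTorch program and serialize it.
    program = edge.to_executorch()
    with open(output_path, "wb") as f:
        f.write(program.buffer)
    return output_path
```

A call might look like `export_to_pte(torch.nn.Linear(4, 2), (torch.randn(1, 4),))`. Swapping the partitioner list is the usual way to target a different backend; check the linked pages for the options your installed version supports.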

Contributing to ExecuTorch#

If you are working on ExecuTorch internals or want to contribute upstream, start with the contributor guide.

New Contributor Guide

Development environment setup, code style, testing requirements, and the pull request process for ExecuTorch contributors.

Difficulty: Advanced

New Contributor Guide
API Life Cycle and Deprecation Policy

How ExecuTorch manages API stability, deprecation timelines, and backward compatibility across releases.

Difficulty: Intermediate

API Life Cycle and Deprecation Policy

Advanced Learning Sequence#

If you prefer a structured progression rather than topic-based navigation, follow this sequence for a comprehensive advanced curriculum.

| Order | Topic | Goal |
|-------|-------|------|
| 1 | Exporting to ExecuTorch | Understand the full compilation pipeline |
| 2 | Model Export and Lowering | Master advanced export options |
| 3 | Quantization & Optimization | Apply production-grade quantization |
| 4 | Memory Planning | Optimize memory for constrained devices |
| 5 | Custom Compiler Passes and Partitioners | Write custom graph transformations |
| 6 | Backend Development | Implement a custom backend delegate |
| 7 | Detailed C++ Runtime APIs Tutorial | Master the low-level C++ runtime |
| 8 | Developer Tools Usage Tutorials | Profile and debug production models |