Profiler Integration#
Created On: Dec 26, 2025 | Last Updated On: Jan 19, 2026
Background#
PyTorch ships a device-agnostic profiler that instruments CPU-side operator dispatch, coordinates with accelerator collectors, captures Python stacks, and exports aggregated statistics or Chrome/Perfetto traces. For core architecture, see torch/csrc/profiler/README.md.
There are two primary integration paths for accelerators:
1. Legacy autograd profiler:
   - Can attach backend-specific hooks via ProfilerStubs to record device events and compute elapsed times.
   - Works without Kineto; suitable for PrivateUse1 backends that want a minimal, self-contained path.
2. Kineto-based timeline:
   - Bridges to Kineto, which aggregates device timelines via vendor libraries (e.g., CUPTI for CUDA).
   - Provides rich activity traces and advanced export/visualization, but requires a Kineto-capable backend.
This document focuses on path (1): how a PrivateUse1 accelerator exposes the minimal hooks to plug into the legacy autograd profiler so ATen ops and record_function ranges are correctly attributed to device activity.
Design#
Architecture overview#
| Layer | Responsibility | Source |
|---|---|---|
| Python control plane | Owns profiler lifecycle ( | |
| Profiler stubs | Implements | |
| Device runtime | Provides streams, events, and device guards used by the stubs; implementation is backend-specific. | Backend extension (vendor code) |
This layering keeps PyTorch device-agnostic: Python brokers the session, ProfilerStubs translate profiler requests into backend runtime calls, and the runtime interacts with the accelerator.
Key contracts#
- Record hooks: record() must capture the (optional) device index, allocate a backend event, optionally stash a CPU timestamp, and enqueue the event on the active stream.
- Elapsed time: elapsed() is responsible for synchronizing individual events and returning durations in microseconds.
- Synchronization hooks: synchronize() and onEachDevice() guarantee that phase transitions (e.g., warmup → active) are aligned across devices.
- Annotations: mark, rangePush, and rangePop can be implemented to enrich traces; otherwise they may be left as no-ops.
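The contracts above can be modeled in plain Python to make the expected behavior concrete. This is a sketch only, not the C++ ProfilerStubs interface: `FakeEvent`, `FakeStream`, and `FakeStubs` are hypothetical names, and the device clock is simulated with a nanosecond counter so the example runs without an accelerator.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class FakeEvent:
    """Hypothetical stand-in for a backend event object."""
    device_index: int
    cpu_ns: Optional[int] = None     # optional CPU timestamp stashed by record()
    device_ns: Optional[int] = None  # set when the stream reaches the event
    done: bool = False


@dataclass
class FakeStream:
    """Hypothetical in-order queue; ints are work costs in ns, events are markers."""
    clock_ns: int = 0
    pending: List[object] = field(default_factory=list)

    def enqueue(self, item) -> None:
        self.pending.append(item)

    def drain(self) -> None:
        # Run queued items in order, advancing the simulated device clock.
        for item in self.pending:
            if isinstance(item, FakeEvent):
                item.device_ns = self.clock_ns
                item.done = True
            else:
                self.clock_ns += item
        self.pending.clear()


class FakeStubs:
    """Toy model of the record/elapsed/synchronize/onEachDevice contracts."""

    def __init__(self, num_devices: int = 1) -> None:
        self.streams = [FakeStream() for _ in range(num_devices)]
        self.current_device = 0

    def record(self, cpu_ns: Optional[int] = None) -> FakeEvent:
        # Capture the device index, allocate an event, stash the CPU
        # timestamp if given, and enqueue the event on the active stream.
        event = FakeEvent(device_index=self.current_device, cpu_ns=cpu_ns)
        self.streams[self.current_device].enqueue(event)
        return event

    def elapsed(self, start: FakeEvent, end: FakeEvent) -> float:
        # Synchronize each event, then return the duration in microseconds.
        for event in (start, end):
            if not event.done:
                self.streams[event.device_index].drain()
        return (end.device_ns - start.device_ns) / 1000.0

    def synchronize(self) -> None:
        self.streams[self.current_device].drain()

    def on_each_device(self, fn: Callable[[int], None]) -> None:
        # Visit every device, restoring the previously active one afterwards.
        previous = self.current_device
        try:
            for index in range(len(self.streams)):
                self.current_device = index
                fn(index)
        finally:
            self.current_device = previous

    # Annotation hooks are optional; this model leaves them as no-ops.
    def mark(self, name: str) -> None: pass
    def range_push(self, name: str) -> None: pass
    def range_pop(self) -> None: pass


stubs = FakeStubs(num_devices=2)
start = stubs.record(cpu_ns=time.perf_counter_ns())
stubs.streams[0].enqueue(250_000)  # 250 microseconds of simulated device work
end = stubs.record()
duration_us = stubs.elapsed(start, end)

visited = []

def sync_and_log(index: int) -> None:
    stubs.synchronize()
    visited.append(index)

stubs.on_each_device(sync_and_log)
```

In this model `elapsed()` owns the synchronization of its two events, so a caller never needs to drain the stream before asking for a duration. That mirrors the division of responsibility described in the contracts.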
Implementation (Legacy way)#
Here we use OpenReg (Open Registration) to illustrate the minimal set of hooks a PrivateUse1 accelerator needs to expose so the profiler can attribute ATen ops, record_function ranges, and user code to device activity. OpenReg keeps upstream code untouched by translating profiler requests into its runtime calls, mirroring what a production accelerator would implement inside an out-of-tree extension.
OpenReg currently relies on the legacy profiler (torch.autograd.profiler.profile) interface rather than the modern one (torch.profiler.profile) because the latter enforces use_kineto=True.
Profiler stubs (C++)#
torch::profiler::impl::OpenRegMethods inherits from ProfilerStubs and wires the hooks described above:
| Method | Purpose |
|---|---|
| | Grabs the current |
| | Synchronizes both events, calls |
| | Uses |
| | Calls |
| | Report availability and provide placeholder implementations for mark/push/pop. |
The constructor registers the methods once via registerPrivateUse1Methods(&methods);, making them discoverable whenever the profiler is enabled with use_device="openreg".
Python control plane#
On the Python side, no new entrypoint is required—developers use the standard autograd profiler:
    import torch
    from torch.autograd.profiler import profile as autograd_profile
    from torch.profiler import record_function

    # Profile a matmul on the OpenReg device with the legacy autograd profiler.
    with autograd_profile(use_device="openreg", record_shapes=True) as prof:
        with record_function("matmul"):
            x = torch.randn(512, 512, device="openreg")
            y = torch.randn(512, 512, device="openreg")
            z = x @ y

    # Print the top operators by total CPU time, then export a Chrome trace.
    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
    prof.export_chrome_trace("openreg_trace.json")
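Once a trace has been exported, it can also be inspected without a trace viewer. The sketch below assumes the file follows the Chrome Trace Event format, in which complete events carry `"ph": "X"` and report `dur` in microseconds. Whether a given export contains device-side events depends on the backend's stubs. `load_events` and `summarize_trace` are hypothetical helpers, not part of PyTorch, and the sample events are made up for illustration.

```python
import json
from collections import defaultdict


def load_events(path):
    """Load a Chrome trace; accept both the bare-array and the object form."""
    with open(path) as f:
        data = json.load(f)
    return data["traceEvents"] if isinstance(data, dict) else data


def summarize_trace(events):
    """Return (name, count, total_dur_us) tuples, longest total first."""
    totals = defaultdict(lambda: [0, 0.0])
    for event in events:
        if event.get("ph") != "X":  # only complete events carry a duration
            continue
        entry = totals[event["name"]]
        entry[0] += 1
        entry[1] += float(event.get("dur", 0))
    rows = [(name, count, dur) for name, (count, dur) in totals.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Synthetic events standing in for load_events("openreg_trace.json").
sample = [
    {"name": "matmul", "ph": "X", "ts": 0, "dur": 400},
    {"name": "aten::mm", "ph": "X", "ts": 10, "dur": 120},
    {"name": "aten::mm", "ph": "X", "ts": 200, "dur": 80},
    {"name": "flow", "ph": "s", "ts": 5},  # not a complete event; skipped
]
summary = summarize_trace(sample)
for name, count, total_us in summary:
    print(f"{name:10s} calls={count} total={total_us:.1f}us")
```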
Data capture flow#
1. User code enters autograd_profile(use_device="openreg").
2. The profiler transitions to ProfilerState.KINETO_PRIVATEUSE1_FALLBACK.
3. The profiler asks the active backend to record() an event.
4. The OpenReg stubs allocate orEvent objects, attach them to the current stream, and stash CPU timestamps.
5. When events end, the profiler calls elapsed() to compute durations.
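The order of calls in this flow can be traced with a small, self-contained simulation. Everything in it is hypothetical (`register_stubs`, `CountingStubs`, `MiniProfiler`): it models only the sequence of registration, lookup, record(), and elapsed(), not PyTorch's actual profiler internals. The toy device clock advances by a fixed step on each record() call, and computing all durations in `finish()` is a simplification.

```python
from contextlib import contextmanager

_REGISTRY = {}


def register_stubs(device_name, stubs):
    """Register a stubs object once; later calls for the same name are ignored."""
    _REGISTRY.setdefault(device_name, stubs)


class CountingStubs:
    """Toy backend: each record() advances a fake device clock by step_us."""

    def __init__(self, step_us=5.0):
        self.step_us = step_us
        self.now_us = 0.0
        self.calls = []  # log of hook invocations, in order

    def record(self):
        self.calls.append("record")
        self.now_us += self.step_us
        return self.now_us  # the "event" is just a timestamp here

    def elapsed(self, start, end):
        self.calls.append("elapsed")
        return end - start


class MiniProfiler:
    """Toy profiler session that drives whichever stubs were registered."""

    def __init__(self, use_device):
        if use_device not in _REGISTRY:
            raise RuntimeError(f"no profiler stubs registered for {use_device!r}")
        self.stubs = _REGISTRY[use_device]
        self.state = "KINETO_PRIVATEUSE1_FALLBACK"  # label only, mirrors step 2
        self._pending = []
        self.durations_us = {}

    @contextmanager
    def range(self, name):
        start = self.stubs.record()      # step 3: ask the backend for an event
        try:
            yield
        finally:
            end = self.stubs.record()
            self._pending.append((name, start, end))

    def finish(self):
        # Step 5: turn recorded event pairs into durations via elapsed().
        for name, start, end in self._pending:
            duration = self.stubs.elapsed(start, end)
            self.durations_us[name] = self.durations_us.get(name, 0.0) + duration
        self._pending.clear()
        return self.durations_us


register_stubs("openreg", CountingStubs(step_us=5.0))
prof = MiniProfiler(use_device="openreg")
with prof.range("matmul"):
    with prof.range("aten::mm"):
        pass
durations = prof.finish()
```

Because the nested range closes first, its event pair is resolved before the outer one, and the outer range's duration includes the inner one.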