# Profiler Integration

Created On: Dec 26, 2025 | Last Updated On: Dec 26, 2025

## Background
PyTorch ships a device-agnostic profiler that instruments CPU-side operator dispatch, coordinates with accelerator collectors, captures Python stacks, and exports aggregated statistics or Chrome/Perfetto traces. For the core architecture, see `torch/csrc/profiler/README.md`.
There are two primary integration paths for accelerators:

1. **Legacy autograd profiler**
   - Attaches backend-specific hooks via `ProfilerStubs` to record device events and compute elapsed times.
   - Works without Kineto; suitable for `PrivateUse1` backends that want a minimal, self-contained path.
2. **Kineto-based timeline**
   - Bridges to Kineto, which aggregates device timelines via vendor libraries (e.g., CUPTI for CUDA).
   - Provides rich activity traces and advanced export/visualization, but requires a Kineto-capable backend.

This document focuses on path (1): how a `PrivateUse1` accelerator exposes the minimal hooks to plug into the legacy autograd profiler so that ATen ops and `record_function` ranges are correctly attributed to device activity.
## Design

### Architecture overview
| Layer | Responsibility | Source |
|---|---|---|
| Python control plane | Owns the profiler lifecycle (`torch.autograd.profiler.profile`) and selects the active backend via `use_device`. | PyTorch core (`torch.autograd.profiler`) |
| Profiler stubs | Implements the `ProfilerStubs` hooks (`record`, `elapsed`, `synchronize`, annotations) and translates profiler requests into backend runtime calls. | Backend extension (vendor code) |
| Device runtime | Provides streams, events, and device guards used by the stubs; implementation is backend-specific. | Backend extension (vendor code) |

This layering keeps PyTorch device-agnostic: Python brokers the session, `ProfilerStubs` translates profiler requests into backend runtime calls, and the runtime interacts with the accelerator.
### Key contracts

- **Record hooks**: `record()` must capture the (optional) device index, allocate a backend event, optionally stash a CPU timestamp, and enqueue the event on the active stream.
- **Elapsed time**: `elapsed()` is responsible for synchronizing the individual events and returning durations in microseconds.
- **Synchronization hooks**: `synchronize()` and `onEachDevice()` guarantee that phase transitions (e.g., warmup → active) are aligned across devices.
- **Annotations**: `mark`, `rangePush`, and `rangePop` can be implemented to enrich traces; otherwise they may be left as no-ops.
## Implementation (Legacy way)
Here we use OpenReg (Open Registration) to illustrate the minimal set of hooks a `PrivateUse1` accelerator needs to expose so the profiler can attribute ATen ops, `record_function` ranges, and user code to device activity. OpenReg keeps upstream code untouched by translating profiler requests into its own runtime calls, mirroring what a production accelerator would implement inside an out-of-tree extension.
### Profiler stubs (C++)

`torch::profiler::impl::OpenRegMethods` inherits from `ProfilerStubs` and wires up the hooks described above:
| Method | Purpose |
|---|---|
| `record()` | Grabs the current device index, allocates a backend event, optionally stashes a CPU timestamp, and enqueues the event on the active stream. |
| `elapsed()` | Synchronizes both events, calls the runtime's event-elapsed-time API, and returns the duration in microseconds. |
| `onEachDevice()` | Uses a device guard to run the supplied callback on every device. |
| `synchronize()` | Calls the runtime's device-wide synchronization. |
| `enabled()`, `mark()`, `rangePush()`, `rangePop()` | Report availability and provide placeholder implementations for mark/push/pop. |
The constructor registers the methods once via `registerPrivateUse1Methods(&methods);`, making them discoverable whenever the profiler is enabled with `use_device="openreg"`.
### Python control plane

On the Python side, no new entrypoint is required; developers use the standard autograd profiler:

```python
import torch
from torch.autograd.profiler import profile as autograd_profile
from torch.profiler import record_function

with autograd_profile(use_device="openreg", record_shapes=True) as prof:
    with record_function("matmul"):
        x = torch.randn(512, 512, device="openreg")
        y = torch.randn(512, 512, device="openreg")
        z = x @ y

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
prof.export_chrome_trace("openreg_trace.json")
```
A few noteworthy behaviors:
- **Legacy autograd path**: OpenReg currently relies on the autograd profiler rather than Kineto. Always pass `use_kineto=False` (the default for `PrivateUse1`) if you call lower-level APIs.
- **Shape recording**: `record_shapes=True` works end-to-end because OpenReg registers shape metadata capture in `torch/csrc/profiler/util.h` via the standard `RecordFunction` callbacks.
- **`record_function` compatibility**: Custom scopes (`torch.profiler.record_function`) are propagated automatically, enabling model authors to annotate higher-level stages like data transfer or optimizer steps.
### Data capture flow

1. User code enters `autograd_profile(use_device="openreg")`.
2. The profiler transitions to `ProfilerState.PRIVATEUSE1` and enables `RecordFunction` callbacks.
3. Whenever an operator or `record_function` scope begins, the profiler asks the active backend to `record()` an event.
4. The OpenReg stubs allocate `orEvent` objects, attach them to the current stream, and stash CPU timestamps.
5. When scopes end, the profiler calls `elapsed()` to compute durations, aggregates statistics, and optionally serializes Chrome traces.