# Profiler Integration

## Background

PyTorch ships a device-agnostic profiler that instruments CPU-side operator dispatch, coordinates with accelerator collectors, captures Python stacks, and exports aggregated statistics or Chrome/Perfetto traces. For the core architecture, see [`torch/csrc/profiler/README.md`][PyTorch Profiler README].

There are two primary integration paths for accelerators:

1. Legacy autograd profiler:
   - Attaches backend-specific hooks via `ProfilerStubs` to record device events and compute elapsed times.
   - Works without Kineto; suitable for `PrivateUse1` backends that want a minimal, self-contained path.
2. Kineto-based timeline:
   - Bridges to Kineto, which aggregates device timelines via vendor libraries (e.g., CUPTI for CUDA).
   - Provides rich activity traces and advanced export/visualization, but requires a Kineto-capable backend.

This document focuses on path (1): how a `PrivateUse1` accelerator exposes the minimal hooks to plug into the legacy autograd profiler so that ATen ops and `record_function` ranges are correctly attributed to device activity.

## Design

### Architecture overview

| Layer                | Responsibility                                                                                                                                        | Source                          |
| -------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------- |
| Python control plane | Owns the profiler lifecycle (`prepare → start → stop → step`) and exposes user APIs such as `torch.autograd.profiler.profile`.                       | `torch/autograd/profiler.py`    |
| Profiler stubs       | Implements `torch::profiler::impl::ProfilerStubs` so the profiler can record device events, synchronize, iterate over devices, and compute elapsed times. | `torch/csrc/profiler/stubs/`    |
| Device runtime       | Provides the streams, events, and device guards used by the stubs; the implementation is backend-specific.                                           | Backend extension (vendor code) |

This layering keeps PyTorch device-agnostic: Python brokers the session, `ProfilerStubs` translates profiler requests into backend runtime calls, and the runtime interacts with the accelerator.

### Key contracts

* **Record hooks**: `record()` must capture the (optional) device index, allocate a backend event, optionally stash a CPU timestamp, and enqueue the event on the active stream.
* **Elapsed time**: `elapsed()` is responsible for synchronizing the two events it receives and returning the duration between them in microseconds.
* **Synchronization hooks**: `synchronize()` and `onEachDevice()` guarantee that phase transitions (e.g., warmup → active) are aligned across devices.
* **Annotations**: `mark`, `rangePush`, and `rangePop` can be implemented to enrich traces; otherwise they may be left as no-ops.

## Implementation (Legacy way)

Here we use OpenReg (Open Registration) to illustrate the minimal set of hooks a `PrivateUse1` accelerator needs to expose so that the profiler can attribute ATen ops, `record_function` ranges, and user code to device activity. OpenReg keeps upstream code untouched by translating profiler requests into its runtime calls, mirroring what a production accelerator would implement inside an out-of-tree extension.
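Before walking through OpenReg's concrete code, the sketch below shows the general shape such an extension takes: a `ProfilerStubs` subclass plus a one-time `registerPrivateUse1Methods` call. The `my*` runtime functions are hypothetical placeholders for a vendor runtime, and the virtual-method signatures follow `torch/csrc/profiler/stubs/base.h` in recent PyTorch (exact types and header locations may differ across versions).

```cpp
// Minimal sketch of a PrivateUse1 ProfilerStubs implementation.
// The my*() functions are hypothetical stand-ins for a vendor runtime API.
#include <torch/csrc/profiler/stubs/base.h>

#include <c10/util/time.h>

#include <cstdint>
#include <functional>
#include <memory>

namespace {

using torch::profiler::impl::ProfilerStubs;
using torch::profiler::impl::ProfilerVoidEventStub;

// Hypothetical vendor runtime API (declarations only, for illustration).
struct MyEvent;
MyEvent* myEventCreate();
void myEventDestroy(MyEvent* event);
void myEventRecordOnCurrentStream(MyEvent* event);
void myEventSynchronize(MyEvent* event);
float myEventElapsedMs(MyEvent* start, MyEvent* end);
void myDeviceSynchronize();
int myDeviceCount();
int myCurrentDevice();

struct MyBackendMethods : public ProfilerStubs {
  void record(
      c10::DeviceIndex* device,
      ProfilerVoidEventStub* event,
      int64_t* cpu_ns) const override {
    if (device) {
      *device = static_cast<c10::DeviceIndex>(myCurrentDevice());
    }
    // Hand the profiler an owning handle to a backend event.
    *event = std::shared_ptr<void>(myEventCreate(), [](void* e) {
      myEventDestroy(static_cast<MyEvent*>(e));
    });
    if (cpu_ns) {
      *cpu_ns = c10::getTime(); // CPU timestamp paired with the device event
    }
    myEventRecordOnCurrentStream(static_cast<MyEvent*>(event->get()));
  }

  float elapsed(
      const ProfilerVoidEventStub* start,
      const ProfilerVoidEventStub* end) const override {
    auto* e1 = static_cast<MyEvent*>(start->get());
    auto* e2 = static_cast<MyEvent*>(end->get());
    myEventSynchronize(e1);
    myEventSynchronize(e2);
    // The profiler expects microseconds.
    return myEventElapsedMs(e1, e2) * 1000.0f;
  }

  void synchronize() const override {
    myDeviceSynchronize();
  }

  void onEachDevice(std::function<void(int)> op) const override {
    // A production backend would also switch the active device
    // (e.g., via a device guard) before invoking the callback.
    for (int i = 0; i < myDeviceCount(); ++i) {
      op(i);
    }
  }

  // Annotation hooks may stay as no-ops if the runtime has no marker API.
  void mark(const char* /*name*/) const override {}
  void rangePush(const char* /*name*/) const override {}
  void rangePop() const override {}

  bool enabled() const override {
    return true;
  }
};

// Register once at library load so the profiler can find the stubs
// whenever a PrivateUse1 device is profiled.
struct RegisterMyBackendMethods {
  RegisterMyBackendMethods() {
    static MyBackendMethods methods;
    torch::profiler::impl::registerPrivateUse1Methods(&methods);
  }
} register_my_backend_methods;

} // namespace
```

The essential contract is small: `record()` must hand back an event handle the profiler can later pass to `elapsed()`, and `elapsed()` must report the gap between two such events in microseconds.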
### Profiler stubs (C++)

[`torch::profiler::impl::OpenRegMethods`][openreg-stubs] inherits from `ProfilerStubs` and wires up the hooks described above:

| Method                         | Purpose                                                                                                                                                  |
| ------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `record`                       | Grabs the current `OpenRegStream`, creates an `orEvent`, captures an optional CPU timestamp via `c10::getTime()`, and records the event on the stream.  |
| `elapsed`                      | Synchronizes both events, calls `orEventElapsedTime`, and converts milliseconds to microseconds for the profiler.                                       |
| `onEachDevice`                 | Uses `c10::DeviceGuard(DeviceType::PrivateUse1)` to iterate over `torch.openreg.device_count()` so schedulers can run per-device setup or teardown.     |
| `synchronize`                  | Calls `orDeviceSynchronize()` to align device work with CPU scheduling phases.                                                                           |
| `enabled` and annotation shims | Report availability and provide placeholder implementations for mark/push/pop.                                                                          |

The constructor registers the methods once via `registerPrivateUse1Methods(&methods);`, making them discoverable whenever the profiler is enabled with `use_device="openreg"`.

### Python control plane

On the Python side, no new entry point is required; developers use the standard autograd profiler:

```python
import torch
from torch.autograd.profiler import profile as autograd_profile
from torch.profiler import record_function

with autograd_profile(use_device="openreg", record_shapes=True) as prof:
    with record_function("matmul"):
        x = torch.randn(512, 512, device="openreg")
        y = torch.randn(512, 512, device="openreg")
        z = x @ y

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
prof.export_chrome_trace("openreg_trace.json")
```

A few noteworthy behaviors:

1. **Legacy autograd path**: OpenReg currently relies on the autograd profiler rather than Kineto. Always pass `use_kineto=False` (the default for `PrivateUse1`) if you call lower-level APIs.
2. **Shape recording**: `record_shapes=True` works end-to-end because input shapes are captured on the CPU side by the profiler's standard `RecordFunction` callbacks; no backend-specific work is required.
3. **`record_function` compatibility**: Custom scopes (`torch.profiler.record_function`) are propagated automatically, enabling model authors to annotate higher-level stages such as data transfer or optimizer steps (see the usage sketch after the data capture flow below).

### Data capture flow

1. User code enters `autograd_profile(use_device="openreg")`.
2. The profiler transitions to `ProfilerState.PRIVATEUSE1` and enables `RecordFunction` callbacks.
3. Whenever an operator or `record_function` scope begins, the profiler asks the active backend to `record()` an event.
4. The OpenReg stubs allocate `orEvent` objects, attach them to the current stream, and stash CPU timestamps.
5. When scopes end, the profiler calls `elapsed()` to compute durations, aggregates statistics, and optionally serializes Chrome traces.

[PyTorch Profiler README]: https://github.com/pytorch/pytorch/blob/main/torch/csrc/profiler/README.md "PyTorch Profiler README"
[openreg-stubs]: https://github.com/pytorch/pytorch/blob/main/test/cpp_extensions/open_registration_extension/torch_openreg/csrc/profiler/stubs/openreg.cpp "OpenReg profiler stubs"
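As a usage sketch of the stage-level annotation mentioned above, the snippet below labels transfer and compute phases with `record_function`. It assumes the OpenReg extension is installed and that the ops involved are supported end-to-end by the backend; the stage names are arbitrary labels.

```python
import torch
from torch.autograd.profiler import profile as autograd_profile
from torch.profiler import record_function

with autograd_profile(use_device="openreg", record_shapes=True) as prof:
    with record_function("h2d_copy"):
        # Host-to-device transfer shows up under its own label.
        x = torch.randn(1024, 1024).to("openreg")
    with record_function("compute"):
        y = (x @ x).relu()
    with record_function("d2h_copy"):
        out = y.cpu()

# Each annotated stage appears as its own row alongside the underlying ATen ops.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=15))
```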