Recent Posts
Edward Yang (@ezyang) · June 1, 2026
eager cuda memory
Disclosure. This post was drafted by Claude (Anthropic’s coding assistant) with editing from ezyang.
In an ideal world, users of CUDA memory in PyTorch programs should be able to abstract the allocator behavior as: there is a fixed amount of GPU memory, whenever you allocate this available memory goes down, and when you free the available memory goes back up.
Unfortunately, the internal implementation of the CUDA caching allocator means that certain allocation patterns can give rise to fragmentation, where even though there is “technically” enough free space to store a requested allocation, the CUDA caching allocator is unable to actually serve the request.
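To make the fragmentation failure mode concrete, here is a minimal sketch using a toy first-fit allocator over a fixed arena. It is not the CUDA caching allocator's actual algorithm; the `ToyArena` class, its methods, and the first-fit policy are all assumptions made up for illustration. It only shows how total free space can exceed a request that still cannot be served.

```python
# Toy first-fit allocator over a fixed arena. This is NOT PyTorch's CUDA
# caching allocator; the class name and policy are illustrative assumptions.

class ToyArena:
    def __init__(self, size):
        self.size = size
        self.blocks = {}  # start offset -> length of each live allocation

    def free_ranges(self):
        """Return the (start, length) gaps between live allocations."""
        gaps, cursor = [], 0
        for start in sorted(self.blocks):
            if start > cursor:
                gaps.append((cursor, start - cursor))
            cursor = start + self.blocks[start]
        if cursor < self.size:
            gaps.append((cursor, self.size - cursor))
        return gaps

    def alloc(self, length):
        """First fit: return a start offset, or None if no gap is big enough."""
        for start, gap in self.free_ranges():
            if gap >= length:
                self.blocks[start] = length
                return start
        return None

    def free(self, start):
        del self.blocks[start]

    def total_free(self):
        return sum(gap for _, gap in self.free_ranges())


arena = ToyArena(size=8)
a, b, c, d = (arena.alloc(2) for _ in range(4))  # the arena is now full
arena.free(a)
arena.free(c)  # two separate 2-unit holes remain

total_free = arena.total_free()  # 4 units are free in total
big = arena.alloc(4)             # but no single hole can hold 4 units
```

In this sketch `total_free` is 4 while `big` is `None`: the space exists, but only as two non-adjacent holes.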
There are many modern use cases …
Continue reading →
Edward Yang (@ezyang) · May 30, 2026
ai-agents code-review oss llm
One of the important topics being discussed among the PyTorch team is how the PyTorch codebase should engage with AI coding agents. Today, many PRs to PyTorch are AI-authored, and there have been obvious growing pains as we’ve figured things out. Based on discussions at the most recent PyTorch compiler offsite (May 2026), I’ve assembled this playbook for AI coding in PyTorch. It is half descriptive, half prescriptive: it is trying to codify practices that are being used among some members of the team, and bring everyone else along. Hopefully, this post is just the beginning of our ongoing conversation about how to engage with AI coding agents.
We can think of AI generated code as living in a …
Continue reading →
Arsh Zahed (@azahed98) · May 15, 2026
dynamo torch.compile graph-breaks skills
I’m excited to share debug-graph-breaks, a new skill for debugging Torch Compile graph breaks, now available in the meta-pytorch/skills repository.
Torch Compile graph breaks prevent full graph capture and hurt performance. This skill helps you:
- Identify root causes of graph breaks
- Understand why operations break compilation
- Get actionable fixes with specific code changes
- Learn best practices for Torch Compile-friendly code
The skill is grounded in the Graph Break Website as its knowledge base; improvements to the website directly improve the skill’s quality.
Evaluated on the OSS Model Graph Break Corpus—a collection of real-world graph break scenarios from open-source models.
Evaluation …
Continue reading →
Animesh Jain (@anijain2305) · May 13, 2026
dynamo cpython llm-agents graph-breaks tp-slots
TL;DR – Dynamo’s ad-hoc CPython support creates fragmented graph breaks that are hard to fix — even for LLM agents. By refactoring Dynamo to mirror CPython’s tp_* slot semantics, we make the system systematically auditable and agent-friendly, already lifting CPython test pass rates from 38% to 45% and proactively eliminating classes of graph breaks in frontier models.
Working with frontier training frameworks has surfaced some fundamental issues in Dynamo. The issues broadly fell into four categories:
- CPython language gaps: For example, Dynamo supports calling a functools.partial object but did not support hashing it.
- Insufficient exception messages: One frontier framework had an unusual …
Continue reading →
William Wen (@williamwen42) · May 13, 2026
dynamo torch.compile graph-breaks
torch._dynamo.config.nested_graph_breaks = True has been enabled on all Dynamo and Inductor unit tests (~250 test files). A sweep of the OSS benchmark models with graph breaks shows 81/82 passing with NGB (the single regression is a pre-existing unstable model), with graph break reductions of up to 67% and graph merging in models with complex nested call structures (GNNs, detection models). Dynamo tracing time is neutral or improved for most models, and models with significant graph merging see up to 15% runtime speedup (8% geomean). The remaining goal is to set nested_graph_breaks to True by default.
The nested graph break problem in torch.compile refers to the Dynamo limitation of only …
Continue reading →
Tristan Rice (@d4l3k) · May 14, 2026
distributed torchcomms nccl symmetric-memory triton prototyping
TL;DR – Modifying the C++ comms layer is a big barrier when researchers want to prototype new collective features. We’ve added Python bindings to torchcomms (#2080) and built two pure-Python backend prototypes — one wrapping NVIDIA’s new nccl4py bindings (#2515) and one built on SymmetricMemory + Triton (#2521) — both passing the core torchcomms integration test suite. Since they plug into torch.distributed, researchers can fork, tweak, and mix them with existing projects like TorchTitan without touching C++.
We’ve been thinking about how to improve overall research and prototyping speed for comms and collective libraries. LLMs have hugely improved prototyping speed for new ideas and …
Continue reading →
Lucas Kabela (@lucaskabela), Jiani Wang, Tianyu Liu, Richard Zou (@zou3519), Joe Cummings, Milad Mohammadi · May 6, 2026
torchtitan rl torch.compile distributed performance
TL;DR – We enabled torch.compile across the full RL training loop in TorchTitan, achieving a 6x end-to-end speedup (from 446s to 70s) on Qwen3 0.6B for GSM8K. Thanks to TorchTitan RL using a single unified model definition for both training and inference, we can share compiled artifacts across the trainer and generator, reducing startup time while leveraging performance improvements to make this possible.
Most RL frameworks (Verl, OpenRLHF, etc.) maintain separate model definitions for training vs. inference. This means:
- Duplicated code to keep in sync
- Separate optimization paths for each
- No opportunity to share compilation work
TorchTitan RL uses one model definition across both the …
Continue reading →
Sayak Paul (@sayakpaul), Animesh Jain (@anijain2305), Benjamin Bossan (@BenjaminBossan) · May 11, 2026
torch.compile diffusers regional-compilation dynamic-shapes quantization lora
TL;DR – torch.compile delivers a ~1.5x speedup on Flux-1-Dev with no quality loss. Use compile_repeated_blocks to cut compile latency 7x (67s → 9.6s) while keeping the speedup, enable dynamic=True to avoid recompiles on shape changes, and combine with CPU offloading, NF4 quantization, and LoRA hot-swap without giving up the compiled kernels.
Diffusion pipelines are heavy: Flux-1-Dev in bf16 weighs ~33 GB and a single image takes 6.7s on an H100. torch.compile can fuse kernels and strip Python overhead, but applying it naively to a real pipeline runs into four practical issues:
- Compile latency. First-call JIT cost: 67.4s for the full DiT.
- Graph breaks. Any unsupported op silently slices the …
Continue reading →
Edward Yang (@ezyang) · May 3, 2026
cpp cpython llm
One of the lost arts of PyTorch development is the ability to write idiomatic C++ code that interacts with the CPython API. This was a very important skill in the early days of eager PyTorch, since we spent a lot of time moving large chunks of the framework to C++ for speed reasons, but we don’t touch the C++ code that much these days and many members of the team haven’t written any amount of serious C++.
LLMs seriously lower the barrier for writing C++ and dealing with the minutiae of manual memory management in C. But they’re not perfect. So this devlog is to talk about all of the things that I put into the process. A prompt of sorts. It is based on the experience of driving an LLM to …
Continue reading →
Edward Yang (@ezyang) · May 3, 2026
ci tooling llm mergedog pytorchbot
Disclosure. This post was drafted by Claude (Anthropic’s coding assistant) with editing from ezyang.
mergedog is an entirely vibe-coded small Python harness that takes one approved pytorch/pytorch PR and shepherds it through CI to the point a human can comment @pytorchbot merge. The idea is to use LLMs to deal with some aspects of the drudgery of landing PRs from external contributors:
- Pressing the “Approve CI workflows” button (in a secure way!),
- Waiting for the CI results to come back,
- Checking if the CI failures are spurious or real, and
- Fixing simple CI failures that are just due to brain-os.
While each of these tasks is individually not onerous, they take up time in aggregate; and it …
Continue reading →
Aaron Orenstein (@aorenste) · April 16, 2026
dispatcher dispatch_keys backends autocast functionalization torch_dispatch
I wanted to write about how PT2 does autograd, but that requires understanding eager autograd, which requires understanding the dispatcher. So let’s start there.
Let’s pretend we’re building Torch. Let’s start from first principles with the problems we encounter and how to solve them.
Problem 1: We want to be able to call operators for each backend.
Solution: Polymorphism! We just define a class where we have every operator defined as a virtual method. Backends just implement every operator.
class Torch:
    def mm(self, a: Tensor, b: Tensor) -> Tensor: ...
    def einsum(self, equation: str, *operands: Tensor) -> Tensor: ...
    ...
Now I just need to implement Torch for each “real” backend (CPU, Cuda, …
Continue reading →
Laith Sakka (@laithsakka) · March 25, 2026
dynamic_shapes unbacked performance vllm torchbench inductor
TL;DR – Unbacked dynamic shapes had slowdowns ranging from 20% to 2x on TorchBench and ~30% regressions on vLLM. We fixed the root causes; unbacked now matches backed across all tested models and configurations.
These regressions were blocking adoption in Frontier workloads like vLLM. Demand for unbacked shapes is growing — just in the past week, multiple users needed them to control recompilations — so the gap was not acceptable.
We’ve now solved this: unbacked matches backed across all HuggingFace TorchBench models (up to 2x faster) and 30+ vLLM models across multiple configurations.
The key idea behind this work is simple:
For a given graph G and guard set E, unbacked shapes must match the performance …
Continue reading →
Laith Sakka (@laithsakka), Aditya Venkataraman (@aditvenk) · February 27, 2026
dynamic_shapes unbacked torch.export compile_time symbolic_shapes
TL;DR – A regression report revealed that exporting a model with many unbacked (data-dependent) symbols took 264s. Profiling showed the latency was dominated by repeated symbolic reasoning in the shape system. A series of targeted, generally applicable optimizations reduced tracing time to 87s (~3x faster).
A report indicated a severe slowdown when exporting a model that heavily uses data-dependent operations (i.e., unbacked symbolic shapes). Profiling showed that most of the time was spent inside the symbolic shape system.
At the time of investigation, torch.export did not support profiling out of the box, which made root-cause analysis difficult. After enabling profiling for export, a …
Continue reading →
Laith Sakka (@laithsakka), Aditya Venkataraman (@aditvenk), Bob Ren (@bobrenjc93) · January 20, 2026
dynamic_shapes unbacked backed torch.compile frontier
TL;DR – We expect unbacked dynamic shapes to become the dominant shape mechanism for Frontier-style workloads due to their better predictability and controllability. However, some blockers remain for their ideal usage, most notably the performance gap, which is a primary focus for the first half of 2026.
Recently, unbacked dynamic shapes have become a hot topic. But many people still don’t fully understand (1) what backed vs unbacked dynamic shapes actually are, and (2) why that choice matters for performance, UX, and Frontier.
In this post, I’ll walk through a simplified story of how we got here, why unbacked shapes are becoming more important, and what’s still blocking them. This post is …
Continue reading →