Distributed

FSDP, DTensor, c10d, and distributed training

Recent

Python First Comms for Researchers

Tristan Rice (@d4l3k) · May 14, 2026
distributed · torchcomms · nccl · symmetric-memory · triton · prototyping

TL;DR – Having to modify the C++ comms layer is a major barrier for researchers who want to prototype new collective features. We’ve added Python bindings to torchcomms (#2080) and built two pure-Python backend prototypes: one wrapping NVIDIA’s new nccl4py bindings (#2515) and one built on SymmetricMemory + Triton (#2521). Both pass the core torchcomms integration test suite. Since they plug into torch.distributed, researchers can fork, tweak, and mix them with existing projects like TorchTitan without touching C++.

We’ve been thinking about how to improve overall research and prototyping speed for comms and collective libraries. LLMs have hugely improved prototyping speed for new ideas and …


torch.compile for TorchTitan RL: 6x Faster Unified RL Training

Lucas Kabela (@lucaskabela), Jiani Wang, Tianyu Liu, Richard Zou (@zou3519), Joe Cummings, Milad Mohammadi · May 6, 2026
torchtitan · rl · torch.compile · distributed · performance

TL;DR – We enabled torch.compile across the full RL training loop in TorchTitan, achieving a 6x end-to-end speedup (from 446s to 70s) on Qwen3 0.6B for GSM8K. Because TorchTitan RL uses a single unified model definition for both training and inference, compiled artifacts can be shared across the trainer and generator, reducing startup time while still benefiting from the compiled performance improvements.

Most RL frameworks (Verl, OpenRLHF, etc.) maintain separate model definitions for training vs. inference. This means duplicated code to keep in sync, separate optimization paths for each, and no opportunity to share compilation work. TorchTitan RL uses one model definition across both the …

