Tristan Rice (@d4l3k) · May 14, 2026
distributed · torchcomms · nccl · symmetric-memory · triton · prototyping
TL;DR – Modifying the C++ comms layer is a big barrier when researchers want to prototype new collective features. We’ve added Python bindings to torchcomms (#2080) and built two pure-Python backend prototypes — one wrapping NVIDIA’s new nccl4py bindings (#2515) and one built on SymmetricMemory + Triton (#2521) — both passing the core torchcomms integration test suite. Since they plug into torch.distributed, researchers can fork, tweak, and mix them with existing projects like TorchTitan without touching C++.
We’ve been thinking about how to improve overall research and prototyping speed for comms and collective libraries. LLMs have hugely improved prototyping speed for new ideas and …
Continue reading →
Lucas Kabela (@lucaskabela), Jiani Wang, Tianyu Liu, Richard Zou (@zou3519), Joe Cummings, Milad Mohammadi · May 6, 2026
torchtitan · rl · torch.compile · distributed · performance
TL;DR – We enabled torch.compile across the full RL training loop in TorchTitan, achieving a 6x end-to-end speedup (from 446s to 70s) on Qwen3 0.6B for GSM8K. Because TorchTitan RL uses a single unified model definition for both training and inference, the trainer and generator can share compiled artifacts, which reduces startup time in addition to the runtime gains from compilation.
Most RL frameworks (Verl, OpenRLHF, etc.) maintain separate model definitions for training vs. inference. This means:
- Duplicated code to keep in sync
- Separate optimization paths for each
- No opportunity to share compilation work

TorchTitan RL uses one model definition across both the …
Continue reading →