torch.compile for TorchTitan RL: 6x Faster Unified RL Training

Lucas Kabela (@lucaskabela), Jiani Wang, Tianyu Liu, Richard Zou (@zou3519), Joe Cummings, Milad Mohammadi · May 6, 2026 · 3 min read
Tags: torchtitan, rl, torch.compile, distributed, performance

TL;DR – We enabled torch.compile across the full RL training loop in TorchTitan, achieving a 6x end-to-end speedup (from 446s to 70s) on Qwen3 0.6B for GSM8K. Because TorchTitan RL uses a single unified model definition for both training and inference, the trainer and generator can share compiled artifacts, which cuts startup time compared with compiling each one separately while keeping the runtime gains from compilation.

What makes TorchTitan RL different?

Most RL frameworks (Verl, OpenRLHF, etc.) maintain separate model definitions for training vs. inference. This means:

TorchTitan RL uses one model definition across both the Trainer (TorchTitan) and Generator (vLLM). torch.compile traces the model once and reuses it in both contexts, enabling fullgraph optimizations that span the entire RL loop and reducing compilation time vs. compiling each independently.
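As a rough illustration of "one model definition, two contexts", the sketch below wraps a single toy module with torch.compile and calls it once generator-style (no gradients) and once trainer-style (with a backward pass). The `TinyMLP` class and its sizes are hypothetical stand-ins, and `backend="eager"` is chosen only so the snippet runs on CPU without a compiler toolchain. Plain PyTorch may still compile separate graphs for the two modes; the artifact sharing between TorchTitan and vLLM described above is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyMLP(nn.Module):
    """Hypothetical stand-in for the single shared model definition."""

    def __init__(self, dim: int = 16):
        super().__init__()
        self.up = nn.Linear(dim, 4 * dim)
        self.down = nn.Linear(4 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.up(x)))


model = TinyMLP()

# One compiled wrapper around one definition. fullgraph=True asks
# torch.compile to raise an error instead of silently splitting the graph.
# backend="eager" skips code generation so the sketch runs anywhere.
compiled = torch.compile(model, fullgraph=True, backend="eager")

# Generator-style call: forward only, no gradient tracking.
with torch.no_grad():
    sample = compiled(torch.randn(2, 16))

# Trainer-style call: same wrapper, same weights, gradients enabled.
loss = compiled(torch.randn(2, 16)).pow(2).mean()
loss.backward()
```

Both calls go through the same parameters, so an update applied by the trainer is visible to the next generator-style call without a separate model copy.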

Challenges: Because of the unified definition, we had to handle interoperability with vLLM and with particular DTensor operations. This included defining how to capture weak_ref for cudagraph management of DTensors, as well as fixing codegen paths whose bugs would otherwise have gone undiscovered.

Results

Qwen3 0.6B on GSM8K, TP=4 on 8 H100 GPUs, 10 training steps:

| | No Compile (baseline) | + Separate Compile & Piecewise CUDAGraphs | + Batching | + Fullgraph CUDAGraphs & Shared Compile |
|---|---|---|---|---|
| Total Time | 446.0s | 205.0s | 120.0s | 70.4s |
| Startup Time | 24.3s | 79.1s | 84.3s | 47.9s |
| Generator Time | 262.4s | 22.0s | 17.9s | 5.4s |
| Trainer Time | 157.0s | 103.3s | 17.8s | 17.1s |

Total Time = Startup + Generator + Trainer + weight sync. Startup is the overhead of compilation and cudagraph capture. Weight sync time is negligible.
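The breakdown above can be checked with a few lines of arithmetic. The sketch below copies the reported timings into a dictionary (the key names are our own labels, not TorchTitan identifiers), computes what is left over after startup, generator, and trainer time, and derives speedups relative to the baseline. The startup-excluded ratio is a derived view, not a number reported in the post.

```python
# Timings in seconds, copied from the results reported above.
CONFIGS = {
    "no_compile": {
        "total": 446.0, "startup": 24.3, "generator": 262.4, "trainer": 157.0},
    "separate_compile_piecewise": {
        "total": 205.0, "startup": 79.1, "generator": 22.0, "trainer": 103.3},
    "batching": {
        "total": 120.0, "startup": 84.3, "generator": 17.9, "trainer": 17.8},
    "fullgraph_shared_compile": {
        "total": 70.4, "startup": 47.9, "generator": 5.4, "trainer": 17.1},
}


def residual(t: dict) -> float:
    """Time not covered by startup, generator, or trainer (weight sync, rounding)."""
    parts = t["startup"] + t["generator"] + t["trainer"]
    # Adding 0.0 normalizes a possible -0.0 from floating-point rounding.
    return round(t["total"] - parts, 1) + 0.0


def speedup(baseline: dict, optimized: dict, key: str = "total") -> float:
    """Ratio of baseline time to optimized time for one column of the table."""
    return baseline[key] / optimized[key]


base = CONFIGS["no_compile"]
best = CONFIGS["fullgraph_shared_compile"]

for name, t in CONFIGS.items():
    print(f"{name}: residual {residual(t):.1f}s, "
          f"end-to-end speedup {speedup(base, t):.2f}x")

# Derived view: the same comparison with one-time startup cost removed.
steady_state = (base["total"] - base["startup"]) / (best["total"] - best["startup"])
print(f"speedup excluding startup: {steady_state:.1f}x")
```

With these numbers the residual is at most 2.3s in any configuration, consistent with weight sync being negligible, and the end-to-end ratio is about 6.3x. Because startup is a one-time cost, the ratio over only the 10 measured steps is larger, which suggests the gap would widen on longer runs.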

Key takeaways

What we shipped

Try it

python torchtitan/experiments/rl/grpo.py --module rl --config rl_grpo_qwen3_0_6b

Full setup: torchtitan/experiments/rl

What’s next

There are still a number of unexplored integrations and optimizations to be made, including: