Inductor

torch.compile and Diffusers: A Hands-On Guide to Peak Performance

Sayak Paul (@sayakpaul), Animesh Jain (@anijain2305), Benjamin Bossan (@BenjaminBossan) · May 11, 2026
torch.compile · diffusers · regional-compilation · dynamic-shapes · quantization · lora

TL;DR: torch.compile delivers a ~1.5x speedup on Flux-1-Dev with no quality loss. Use compile_repeated_blocks to cut compile latency 7x (67s → 9.6s) while keeping the speedup, enable dynamic=True to avoid recompiles on shape changes, and combine with CPU offloading, NF4 quantization, and LoRA hot-swap without giving up the compiled kernels.

Diffusion pipelines are heavy: Flux-1-Dev in bf16 weighs ~33 GB and a single image takes 6.7s on an H100. torch.compile can fuse kernels and strip Python overhead, but applying it naively to a real pipeline runs into four practical issues:

Compile latency. First-call JIT cost: 67.4s for the full DiT.

Graph breaks. Any unsupported op silently slices the …

Continue reading →
