torch.compile End-to-End Tutorial

Author: William Wen

torch.compile is the new way to speed up your PyTorch code! torch.compile makes PyTorch code run faster by JIT-compiling PyTorch code into optimized kernels, while requiring minimal code changes.

This tutorial covers an end-to-end example of training and evaluating a real model with torch.compile. For a gentle introduction to torch.compile, please check out the introduction to torch.compile tutorial.

Required pip Dependencies

  • torch >= 2.0

  • torchvision

What you will learn
  • How to apply torch.compile to a real model

  • torch.compile speedups on a real model

  • torch.compile’s first few iterations are expected to be slower due to compilation overhead

# NOTE: a modern NVIDIA GPU (H100, A100, or V100) is recommended for this tutorial in
# order to reproduce the speedup numbers shown below and documented elsewhere.

import torch
import warnings

gpu_ok = False
if torch.cuda.is_available():
    device_cap = torch.cuda.get_device_capability()
    if device_cap in ((7, 0), (8, 0), (9, 0)):
        gpu_ok = True

if not gpu_ok:
    warnings.warn(
        "GPU is not NVIDIA V100, A100, or H100. Speedup numbers may be lower "
        "than expected."
    )
/var/lib/workspace/intermediate_source/torch_compile_full_example.py:51: UserWarning: GPU is not NVIDIA V100, A100, or H100. Speedup numbers may be lower than expected.
  warnings.warn(

Let’s demonstrate how using torch.compile can speed up a real model. We will compare standard eager mode and torch.compile by evaluating and training a torchvision model on random data.

Before we start, we need to define some utility functions.

# Returns the result of running `fn()` and the time it took for `fn()` to run,
# in seconds. We use CUDA events and synchronization for the most accurate
# measurements.
def timed(fn):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    result = fn()
    end.record()
    torch.cuda.synchronize()
    return result, start.elapsed_time(end) / 1000


# Generates random input and target data for the model, where `b` is
# the batch size.
def generate_data(b):
    return (
        torch.randn(b, 3, 128, 128).cuda(),
        torch.randint(1000, (b,)).cuda(),
    )


N_ITERS = 10

from torchvision.models import densenet121


def init_model():
    return densenet121().cuda()
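
The `timed` helper above assumes a CUDA device is present. As a minimal sketch for machines without one, the variant below falls back to `time.perf_counter` on CPU. The name `timed_any_device` is hypothetical and not part of the original tutorial, and CPU timings will not reproduce the GPU numbers shown here.

```python
import time

import torch


# Hypothetical device-agnostic variant of `timed` (an assumption, not the
# tutorial's own helper). Returns (result, elapsed_seconds).
def timed_any_device(fn):
    if torch.cuda.is_available():
        # Same approach as `timed`: CUDA events plus a synchronize.
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        result = fn()
        end.record()
        torch.cuda.synchronize()
        return result, start.elapsed_time(end) / 1000  # ms -> s
    # CPU fallback: wall-clock timing.
    t0 = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - t0
```

For example, `timed_any_device(lambda: model(inp))` could stand in for `timed(lambda: model(inp))`, provided the model and data are also created on the available device.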

First, let’s compare inference.

Note that in the call to torch.compile, we have the additional mode argument, which we will discuss below.

model = init_model()

# Note that we generally recommend directly compiling a torch.nn.Module by calling
# its .compile() method.
model_opt = init_model()
model_opt.compile(mode="reduce-overhead")

inp = generate_data(16)[0]
with torch.no_grad():
    print("eager:", timed(lambda: model(inp))[1])
    print("compile:", timed(lambda: model_opt(inp))[1])
eager: 0.3469137878417969
/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_fx.py:320: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
  warnings.warn(
compile: 53.05196484375

Notice that the first call to the compiled model takes far longer than eager (about 53 seconds versus 0.35 seconds above). This is because torch.compile compiles the model into optimized kernels as it executes. In our example, the structure of the model doesn’t change, so recompilation is not needed. If we run our optimized model several more times, we should see a significant improvement compared to eager.

eager_times = []
for i in range(N_ITERS):
    inp = generate_data(16)[0]
    with torch.no_grad():
        _, eager_time = timed(lambda: model(inp))
    eager_times.append(eager_time)
    print(f"eager eval time {i}: {eager_time}")

print("~" * 10)

compile_times = []
for i in range(N_ITERS):
    inp = generate_data(16)[0]
    with torch.no_grad():
        _, compile_time = timed(lambda: model_opt(inp))
    compile_times.append(compile_time)
    print(f"compile eval time {i}: {compile_time}")
print("~" * 10)

import numpy as np

eager_med = np.median(eager_times)
compile_med = np.median(compile_times)
speedup = eager_med / compile_med
assert speedup > 1
print(
    f"(eval) eager median: {eager_med}, compile median: {compile_med}, speedup: {speedup}x"
)
print("~" * 10)
eager eval time 0: 0.018268224716186523
eager eval time 1: 0.017490943908691405
eager eval time 2: 0.017051584243774415
eager eval time 3: 0.01682022476196289
eager eval time 4: 0.01681920051574707
eager eval time 5: 0.016821247100830078
eager eval time 6: 0.016867328643798828
eager eval time 7: 0.016825344085693358
eager eval time 8: 0.017006591796875
eager eval time 9: 0.016881664276123046
~~~~~~~~~~
compile eval time 0: 0.08573426818847656
compile eval time 1: 0.008682496070861816
compile eval time 2: 0.009084927558898925
compile eval time 3: 0.008154111862182617
compile eval time 4: 0.00810700798034668
compile eval time 5: 0.008144736289978027
compile eval time 6: 0.008087552070617676
compile eval time 7: 0.00809779167175293
compile eval time 8: 0.008034303665161132
compile eval time 9: 0.008056832313537597
~~~~~~~~~~
(eval) eager median: 0.016874496459960937, compile median: 0.008125872135162353, speedup: 2.0766381970178256x
~~~~~~~~~~

And indeed, we can see that running our model with torch.compile results in a significant speedup. The speedup mainly comes from reducing Python overhead and GPU reads/writes, so the observed speedup may vary depending on factors such as model architecture and batch size. For example, if a model’s architecture is simple and the amount of data is large, then the bottleneck would be GPU compute and the observed speedup may be less significant.

You may also see different speedup results depending on the chosen mode argument. The "reduce-overhead" mode uses CUDA graphs to further reduce the overhead of Python. For your own models, you may need to experiment with different modes to maximize speedup. You can read more about modes in the torch.compile documentation.
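
As a rough sketch of such an experiment, the helper below wraps the same function once per mode so that each variant can be timed separately. The helper name `compile_variants` and the `MODES` list are illustrative assumptions. The mode strings are ones accepted by `torch.compile`, and compilation itself only happens when a variant is first called.

```python
import torch

# Modes to compare (an illustrative selection, not an exhaustive list).
MODES = ["default", "reduce-overhead", "max-autotune"]


# Hypothetical helper: returns {mode: compiled callable}. Nothing is compiled
# here; each variant compiles lazily on its first call.
def compile_variants(fn, modes=MODES):
    return {mode: torch.compile(fn, mode=mode) for mode in modes}
```

Each variant could then be run through the same timing loop used in this tutorial, discarding the first few iterations of each before comparing medians.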

You might also notice that the second time we run our model with torch.compile (compile eval time 0 above) is significantly slower than the other runs, although it is much faster than the first run. This is because the "reduce-overhead" mode runs a few warm-up iterations for CUDA graphs.
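
This warm-up behavior is one reason the tutorial summarizes timings with the median rather than the mean. The sketch below reuses the compiled eval timings printed above to show how a single slow warm-up run skews the mean. The variable names are illustrative.

```python
import statistics

# Compiled eval timings in seconds, copied from the output above.
compile_times = [
    0.08573426818847656,
    0.008682496070861816,
    0.009084927558898925,
    0.008154111862182617,
    0.00810700798034668,
    0.008144736289978027,
    0.008087552070617676,
    0.00809779167175293,
    0.008034303665161132,
    0.008056832313537597,
]
eager_median = 0.016874496459960937  # eager median reported above

median_time = statistics.median(compile_times)
mean_time = statistics.mean(compile_times)

# The median ignores the warm-up outlier; the mean is pulled up by it.
print(f"median: {median_time:.6f} s, speedup: {eager_median / median_time:.2f}x")
print(f"mean:   {mean_time:.6f} s, speedup: {eager_median / mean_time:.2f}x")
```

The mean comes out at roughly twice the median here, which would make the compiled model look barely faster than eager even though nine of the ten runs are about twice as fast.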

Now, let’s compare training.

model = init_model()
opt = torch.optim.Adam(model.parameters())


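# Runs one training step. Relies on the module-level `opt`, which must be
# an optimizer over `mod`'s parameters.
def train(mod, data):
    opt.zero_grad(set_to_none=True)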
    pred = mod(data[0])
    loss = torch.nn.CrossEntropyLoss()(pred, data[1])
    loss.backward()
    opt.step()


eager_times = []
for i in range(N_ITERS):
    inp = generate_data(16)
    _, eager_time = timed(lambda: train(model, inp))
    eager_times.append(eager_time)
    print(f"eager train time {i}: {eager_time}")
print("~" * 10)

model = init_model()
opt = torch.optim.Adam(model.parameters())

# Note that because we are compiling a regular Python function, we do not
# call any .compile() method.
train_opt = torch.compile(train, mode="reduce-overhead")

compile_times = []
for i in range(N_ITERS):
    inp = generate_data(16)
    _, compile_time = timed(lambda: train_opt(model, inp))
    compile_times.append(compile_time)
    print(f"compile train time {i}: {compile_time}")
print("~" * 10)

eager_med = np.median(eager_times)
compile_med = np.median(compile_times)
speedup = eager_med / compile_med
assert speedup > 1
print(
    f"(train) eager median: {eager_med}, compile median: {compile_med}, speedup: {speedup}x"
)
print("~" * 10)
eager train time 0: 0.3430697021484375
eager train time 1: 0.05137203216552735
eager train time 2: 0.04936601638793945
eager train time 3: 0.04917862319946289
eager train time 4: 0.05007564926147461
eager train time 5: 0.04922163009643555
eager train time 6: 0.04914278411865235
eager train time 7: 0.0488458251953125
eager train time 8: 0.05013404846191406
eager train time 9: 0.04903535842895508
~~~~~~~~~~
compile train time 0: 160.373359375
compile train time 1: 2.58037255859375
compile train time 2: 0.02270412826538086
compile train time 3: 0.020928512573242186
compile train time 4: 0.020255615234375
compile train time 5: 0.02027008056640625
compile train time 6: 0.02025164794921875
compile train time 7: 0.02028339195251465
compile train time 8: 0.02022604751586914
compile train time 9: 0.020231168746948244
~~~~~~~~~~
(train) eager median: 0.0492938232421875, compile median: 0.02027673625946045, speedup: 2.431053134559002x
~~~~~~~~~~

Again, we can see that torch.compile takes much longer in the first iterations (about 160 seconds and 2.6 seconds here), as it must compile the model, but in subsequent iterations we see significant speedups compared to eager.
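
Whether that overhead pays off depends on how many iterations you run. As a back-of-the-envelope sketch using the training timings printed above, the snippet below estimates the iteration count at which cumulative compiled time drops below cumulative eager time. It assumes steady-state times equal the reported medians and ignores eager's own slower first iteration. `break_even_iterations` is a hypothetical helper.

```python
import math


# Hypothetical helper: smallest n such that the total compiled time over n
# iterations is no greater than the total eager time over n iterations.
def break_even_iterations(warmup_times, compiled_steady, eager_steady):
    saved_per_iter = eager_steady - compiled_steady
    if saved_per_iter <= 0:
        return None  # compiled is never faster, so there is no break-even
    # Extra cost of the warm-up iterations relative to steady-state runs.
    extra_warmup = sum(warmup_times) - len(warmup_times) * compiled_steady
    return max(len(warmup_times), math.ceil(extra_warmup / saved_per_iter))


n = break_even_iterations(
    warmup_times=[160.373359375, 2.58037255859375],  # compile train time 0, 1
    compiled_steady=0.02027673625946045,  # compile median reported above
    eager_steady=0.0492938232421875,  # eager median reported above
)
print(f"break-even after roughly {n} training iterations")
```

With these numbers the estimate lands in the thousands of iterations, which suggests compilation is most worthwhile for long training runs or repeated inference rather than short scripts.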

We remark that the speedup numbers presented in this tutorial are for demonstration purposes only. Official speedup values can be seen at the TorchInductor performance dashboard.

Conclusion

In this tutorial, we applied torch.compile to training and inference on a real model, demonstrating speedups.

Importantly, we note that the first few iterations of a compiled model are slower than eager mode due to compilation overhead, but subsequent iterations are expected to have speedups.

For a gentle introduction to torch.compile, please check out the introduction to torch.compile tutorial.

To troubleshoot issues and to gain a deeper understanding of how to apply torch.compile to your code, check out the torch.compile programming model.

We hope that you will give torch.compile a try!

Total running time of the script: (3 minutes 39.324 seconds)