Resource Management

Overview

Efficient control of CPU and GPU memory is essential for successful model compilation, especially when working with large models such as LLMs or diffusion models. Uncontrolled memory growth can cause compilation failures or process termination. This guide describes the symptoms of excessive memory usage and provides methods to reduce both CPU and GPU memory consumption.

Memory Usage Control

CPU Memory

By default, Torch-TensorRT may consume up to 5x the model size in CPU memory. This can exceed system limits when compiling large models.

Common symptoms of high CPU memory usage:

  • Program freeze

  • Process terminated by the operating system

Ways to lower CPU memory usage:

  1. Enable memory trimming

    Set the following environment variable:

    export TORCHTRT_ENABLE_BUILDER_MALLOC_TRIM=1
    

    This releases roughly 2x the model size of redundant model copies, limiting peak CPU memory usage to about 3x the model size.

  2. Disable CPU offloading

    In compilation settings, set:

    offload_module_to_cpu = False
    

    This eliminates another model copy, reducing peak CPU memory usage to about 2x the model size. Both settings are combined in the sketch after this list.
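
As a minimal sketch, the two CPU-saving options above might be applied together as follows. The placeholder model, input shapes, and the availability of offload_module_to_cpu as a compile-time keyword in your Torch-TensorRT version are assumptions, not guarantees:

    import os

    # Enable builder malloc trimming; set before compilation so the
    # builder sees it. (Assumed to take effect at compile time.)
    os.environ["TORCHTRT_ENABLE_BUILDER_MALLOC_TRIM"] = "1"

    import torch
    import torch_tensorrt

    # Placeholder model standing in for a real LLM or diffusion model.
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096),
        torch.nn.ReLU(),
        torch.nn.Linear(4096, 1024),
    ).eval().cuda()

    inputs = [torch.randn(8, 1024, device="cuda")]

    # offload_module_to_cpu=False keeps the module on the GPU, avoiding an
    # extra ~1x model copy in CPU memory. Assumes your Torch-TensorRT
    # version accepts this keyword.
    trt_model = torch_tensorrt.compile(
        model,
        ir="dynamo",
        inputs=inputs,
        offload_module_to_cpu=False,
    )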

GPU Memory

By default, Torch-TensorRT may consume up to 2x the model size in GPU memory.

Common symptoms of high GPU memory usage:

  • CUDA out-of-memory errors

  • TensorRT compilation errors

Ways to lower GPU memory usage:

  1. Enable offloading to CPU

    In compilation settings, set:

    offload_module_to_cpu = True
    

    This shifts one model copy from GPU to CPU memory: peak GPU memory usage drops to about 1x the model size, while the extra copy now held in CPU memory raises peak CPU memory usage by roughly 1x the model size. The setting is shown in the sketch after this list.
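
A minimal sketch of the GPU-saving configuration, under the same assumptions as the CPU example above (placeholder model and a Torch-TensorRT build that accepts the offload_module_to_cpu keyword):

    import torch
    import torch_tensorrt

    # Placeholder model standing in for a real LLM or diffusion model.
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096),
        torch.nn.ReLU(),
        torch.nn.Linear(4096, 1024),
    ).eval().cuda()

    inputs = [torch.randn(8, 1024, device="cuda")]

    # offload_module_to_cpu=True moves one model copy off the GPU during
    # compilation: ~1x model size less GPU memory at the cost of ~1x more
    # CPU memory. Assumes this keyword exists in your version.
    trt_model = torch_tensorrt.compile(
        model,
        ir="dynamo",
        inputs=inputs,
        offload_module_to_cpu=True,
    )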
