Resource Management#

Overview#

Efficient control of CPU and GPU memory is essential for successful model compilation, especially when working with large models such as LLMs or diffusion models. Uncontrolled memory growth can cause compilation failures or process termination. This guide describes the symptoms of excessive memory usage and provides methods to reduce both CPU and GPU memory consumption.

Memory Usage Control#

CPU Memory#

By default, Torch-TensorRT may consume up to 5x the model size in CPU memory, which can exceed system limits when compiling large models. For example, compiling a 10 GB model can peak at roughly 50 GB of CPU memory.

Common symptoms of high CPU memory usage:

  • Program freeze

  • Process terminated by the operating system

Ways to lower CPU memory usage:

  1. Enable memory trimming

    Set the following environment variable:

    export TORCHTRT_ENABLE_BUILDER_MALLOC_TRIM=1
    

This releases roughly 2x the model size in redundant model copies, lowering peak CPU memory usage to about 3x the model size.

  2. Disable CPU offloading

    In compilation settings, set:

    offload_module_to_cpu = False
    

This removes one more model copy, reducing peak CPU memory usage to about 2x the model size. Both settings are combined in the sketch after this list.
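A minimal sketch combining both mitigations, assuming model and inputs are defined elsewhere; setting the environment variable from Python before compiling has the same effect as exporting it in the shell:

import os

# Read by the builder at compile time; equivalent to the shell export above
os.environ["TORCHTRT_ENABLE_BUILDER_MALLOC_TRIM"] = "1"

import torch_tensorrt

trt_model = torch_tensorrt.compile(
    model,
    ir="dynamo",
    arg_inputs=inputs,
    offload_module_to_cpu=False,  # avoid the extra 1x CPU model copy
)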

GPU Memory#

By default, Torch-TensorRT may consume up to 2x the model size in GPU memory.

Common symptoms of high GPU memory usage:

  • CUDA out-of-memory errors

  • TensorRT compilation errors

Ways to lower GPU memory usage:

  1. Enable offloading to CPU

    In compilation settings, set:

    offload_module_to_cpu = True
    

This shifts one model copy from GPU to CPU memory, reducing peak GPU memory usage to about 1x the model size. The trade-off is that the extra copy now resides in CPU memory, so CPU memory usage increases by roughly 1x the model size. A sketch of this setting follows the list.
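A minimal sketch of the GPU-side setting, again assuming model and inputs are defined elsewhere:

import torch_tensorrt

# Keep the compile-time model copy in CPU memory to cut peak GPU usage to ~1x
trt_model = torch_tensorrt.compile(
    model,
    ir="dynamo",
    arg_inputs=inputs,
    offload_module_to_cpu=True,
)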


Runtime Weight Streaming#

Weight streaming allows a compiled TRT engine to use less GPU VRAM at inference time by streaming model weights from CPU memory to the GPU on demand. This is useful for very large models (LLMs, diffusion models) that exceed available VRAM.

Enable during compilation:

import torch
import torch_tensorrt

trt_model = torch_tensorrt.compile(
    model,
    ir="dynamo",
    arg_inputs=inputs,
    # Weight streaming requires a strongly typed engine (only torch.float32
    # is allowed in enabled_precisions; layer precision follows module dtype)
    use_explicit_typing=True,
    enabled_precisions={torch.float32},
    enable_weight_streaming=True,
)

Adjust the GPU memory budget at runtime:

Use torch_tensorrt.runtime.weight_streaming as a context manager to set how much GPU memory the engine is allowed to use for weights. Setting a smaller budget forces more streaming from CPU:

import torch_tensorrt

# Allocate 2 GiB on GPU for weights; the rest streams from CPU
with torch_tensorrt.runtime.weight_streaming(trt_model) as ctx:
    ctx.device_budget = 2 * 1024**3  # bytes
    output = trt_model(*inputs)
# Budget is reset to the original value on exit

Query available budget information:

with torch_tensorrt.runtime.weight_streaming(trt_model) as ctx:
    # Total streamable bytes across all TRT submodules
    print(f"Total streamable: {ctx.total_device_budget} bytes")
    # Automatically selected optimal budget
    auto_budget = ctx.get_automatic_weight_streaming_budget()
    print(f"Auto budget: {auto_budget} bytes")
    ctx.device_budget = auto_budget
    output = trt_model(*inputs)

Note

Weight streaming requires enable_weight_streaming=True at compile time. If the model was not compiled with this flag, ctx.total_device_budget will be 0 and setting device_budget will raise a RuntimeError.
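Based on this behavior, a small guard can make the failure mode explicit before inference; this sketch reuses trt_model and inputs from the examples above:

import torch_tensorrt

with torch_tensorrt.runtime.weight_streaming(trt_model) as ctx:
    # A zero total budget means the engine was built without weight streaming
    if ctx.total_device_budget == 0:
        raise RuntimeError("Recompile with enable_weight_streaming=True")
    ctx.device_budget = ctx.get_automatic_weight_streaming_budget()
    output = trt_model(*inputs)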


Dynamic Resource Allocation#

By default, TRT submodules allocate GPU memory statically at module initialization. The ResourceAllocationStrategy context manager temporarily switches all TRT submodules in a compiled graph module to dynamic allocation — resources are allocated and freed per forward call rather than held for the module lifetime.

This can reduce peak GPU memory when running multiple compiled models concurrently, at the cost of slightly higher per-call latency:

import torch_tensorrt
from torch_tensorrt.dynamo.runtime import ResourceAllocationStrategy

trt_model = torch_tensorrt.compile(model, ir="dynamo", arg_inputs=inputs)

with ResourceAllocationStrategy(trt_model, dynamically_allocate_resources=True):
    output = trt_model(*inputs)
# Submodules revert to static allocation on exit

Use dynamically_allocate_resources=False to force static allocation inside the context (the opposite direction, which can be useful for profiling or benchmarking), as in the sketch below.
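A brief sketch, reusing trt_model and inputs from above:

# Force static allocation inside the context, e.g. for stable benchmarks
with ResourceAllocationStrategy(trt_model, dynamically_allocate_resources=False):
    output = trt_model(*inputs)
# Submodules return to their previous strategy on exit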