Dynamic Memory Allocation#
Note
This page documents the design for dynamically allocated engine memory in Torch-TensorRT. Original design discussion: RFC #3714.
Goal#
Some TRT engines consume significantly more GPU memory than the equivalent PyTorch module. When multiple TRT-accelerated submodules are loaded simultaneously (e.g. in a diffusers pipeline with UNet, VAE, and text encoder), the total resident GPU memory can exceed the device limit even if each module runs sequentially.
The solution is dynamic memory allocation: instead of allocating device memory for an engine when it is loaded, allocation is deferred to execution time. The memory is released immediately after inference completes, so only one engine holds GPU activation memory at a time.
User API#
A context manager sets the allocation strategy for all TRT engines within scope:
import torch_tensorrt

with torch_tensorrt.runtime.enable_dynamic_engine_context(trt_model):
    output = trt_model(*inputs)
Alternatively, the strategy can be set per-module:
trt_model.set_resource_allocation_strategy("dynamic")
Two strategies are available:
"static"(default) — device memory is allocated when the engine is loaded (viacreateExecutionContext). Memory is held for the lifetime of the engine."dynamic"— device memory is allocated on each forward pass (viacreateExecutionContextWithoutDeviceMemory+ manual device-memory assignment), and released immediately after the call returns.
Internal Implementation#
C++ Runtime#
The TRTEngine C++ class manages an IExecutionContext per engine. When the
strategy is switched to dynamic the existing context is destroyed and a new
context is created without device memory:
void TRTEngine::set_resource_allocation_strategy(
    ResourceAllocationStrategy new_strategy) {
  if (new_strategy != resource_allocation_strategy_) {
    resource_allocation_strategy_ = new_strategy;
    if (new_strategy == ResourceAllocationStrategy::kDynamic) {
      exec_ctx_ = engine_->createExecutionContextWithoutDeviceMemory();
    } else {
      exec_ctx_ = engine_->createExecutionContext();
    }
  }
}
During execute_engine, when dynamic allocation is active, a temporary
torch::Tensor of type uint8 provides the required device memory:
void execute_engine(...) {
  // Holds the engine's activation memory for the duration of this call.
  torch::Tensor dynamic_workspace;
  if (engine.resource_allocation_strategy == kDynamic) {
    dynamic_workspace = torch::empty(
        {static_cast<int64_t>(engine.device_memory_size)},
        torch::TensorOptions().dtype(torch::kUInt8).device(torch::kCUDA));
    engine.exec_ctx_->setDeviceMemory(dynamic_workspace.data_ptr());
  }
  // ... run inference ...
  // dynamic_workspace is freed here (goes out of scope), returning the
  // device memory to PyTorch's caching allocator.
}
Python Exposure#
The _ResourceAllocator Python module wraps the C++ setting and provides the
context manager surface. The TorchTensorRTModule exposes
set_resource_allocation_strategy through its TorchBind interface.
Limitations#
Dynamic allocation does not reduce the peak memory of a single engine during inference — it only reduces the memory that is resident when the engine is idle.
The per-inference allocation/free overhead is small but non-zero; avoid dynamic allocation on latency-critical paths where "static" would fit in memory.