.. _cuda_graphs:

CUDAGraphs and the Output Allocator
=====================================

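Torch-TensorRT provides two runtime features that affect how compiled engines execute:
**CUDA Graphs**, which can significantly reduce per-request inference latency for
steady-state workloads, and the **Dynamic Output Allocator**, which lets engines run
operations whose output shapes are known only at runtime.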

----

CUDA Graphs
-----------

CUDA Graphs capture a sequence of GPU operations into a replayable graph. On replay,
the entire sequence is submitted with a single graph launch rather than one launch
per kernel, which removes most CPU-side dispatch overhead and can improve GPU
utilization.

Enabling CUDAGraphs
^^^^^^^^^^^^^^^^^^^^

The canonical way to enable CUDA graphs is the ``enable_cudagraphs`` context manager:

.. code-block:: python

    import torch
    import torch_tensorrt

    trt_gm = torch_tensorrt.compile(model, ir="dynamo", arg_inputs=inputs)

    with torch.no_grad():
        with torch_tensorrt.runtime.enable_cudagraphs(trt_gm) as cg_module:
            # First call: warms up and records the CUDA graph
            output = cg_module(*inputs)
            # Subsequent calls: replay the captured graph (fast path)
            output = cg_module(*inputs)
    # Graph recording is torn down on context exit; trt_gm is restored

The context manager automatically selects one of two capture modes based on the
compiled model:

* **Per-subgraph mode** (no graph breaks): CUDA graph capture is applied to each
  individual TRT submodule. ``cg_module`` is the same ``GraphModule`` object with
  the per-subgraph flag enabled.

* **Whole-graph mode** (model has PyTorch fallback subgraphs / graph breaks): The
  entire forward pass — TRT subgraphs *and* PyTorch subgraphs between them — is
  captured as a single CUDA graph via ``CudaGraphsTorchTensorRTModule``. This
  eliminates inter-subgraph dispatch overhead even when the model is partially
  executed in PyTorch.

  .. code-block:: python

      # Force a graph break so the model has a PyTorch fallback subgraph
      opt_with_break = torch_tensorrt.compile(
          model, ir="dynamo", arg_inputs=[input],
          torch_executed_ops={"torch.ops.aten.mul.Tensor"},
          min_block_size=1,
      )

      with torch_tensorrt.runtime.enable_cudagraphs(opt_with_break) as cg_module:
          # cg_module is a CudaGraphsTorchTensorRTModule wrapping opt_with_break
          output = cg_module(input)

You can also enable CUDA graphs globally for the session (without a context manager):

.. code-block:: python

    torch_tensorrt.runtime.set_cudagraphs_mode(True)
    output = trt_gm(*inputs)
    torch_tensorrt.runtime.set_cudagraphs_mode(False)

Prefer the context manager over ``set_cudagraphs_mode`` — it guarantees the mode is
restored even if an exception occurs.
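
The restore-on-exception guarantee comes from the ``try``/``finally`` pattern that
context managers wrap. The sketch below illustrates it with a plain Python flag;
``_mode``, ``set_mode``, ``get_mode`` and ``enabled_mode`` are hypothetical stand-ins
and are not part of the Torch-TensorRT API.

```python
from contextlib import contextmanager

_mode = False  # hypothetical stand-in for a global CUDA graphs flag


def set_mode(enabled: bool) -> None:
    global _mode
    _mode = enabled


def get_mode() -> bool:
    return _mode


@contextmanager
def enabled_mode():
    """Enable the flag for the duration of the block, then restore the old value."""
    previous = get_mode()
    set_mode(True)
    try:
        yield
    finally:
        # Runs on normal exit and when the body raises.
        set_mode(previous)


try:
    with enabled_mode():
        assert get_mode() is True
        raise RuntimeError("inference failed")
except RuntimeError:
    pass

print(get_mode())  # False: the flag was restored despite the exception
```

If ``set_cudagraphs_mode`` has to be called directly, wrapping the inference call in
``try``/``finally`` in the same way gives an equivalent guarantee.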

How Recording Works
^^^^^^^^^^^^^^^^^^^^

1. **Warm-up**: Three forward passes on a side CUDA stream. This forces memory
   allocations and kernel initializations to happen *before* recording, so they are
   excluded from the graph.

2. **Input shape tracking**: A shape key is computed from all input shapes. If the
   key changes between calls, the captured graph is reset and re-recorded for the new
   shapes.

3. **Replay**: On a shape-key cache hit, input tensors are copied into pre-allocated
   buffers and the captured graph is replayed in a single submission.
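
The record-or-replay decision in the steps above can be modelled without a GPU. The
following is a minimal sketch under stated assumptions: ``shape_key`` and
``ShapeKeyedRecorder`` are illustrative names, the key format is one plausible
choice, and the class only logs what a real runtime would do.

```python
from typing import List, Optional, Sequence, Tuple

Shape = Tuple[int, ...]


def shape_key(shapes: Sequence[Shape]) -> str:
    """Build a key from every input shape, e.g. '(8,3,224,224)(8,10)'."""
    return "".join(str(tuple(s)).replace(" ", "") for s in shapes)


class ShapeKeyedRecorder:
    """Toy model of the record/replay decision; it does not touch CUDA."""

    WARMUP_ITERS = 3  # mirrors the three warm-up passes described above

    def __init__(self) -> None:
        self.current_key: Optional[str] = None
        self.events: List[str] = []

    def run(self, shapes: Sequence[Shape]) -> str:
        key = shape_key(shapes)
        if key != self.current_key:
            # Shape key changed: drop the old graph, warm up, then record again.
            self.events.extend(["warmup"] * self.WARMUP_ITERS)
            self.events.append("record")
            self.current_key = key
            return "recorded"
        # Same key: inputs would be copied into static buffers and replayed.
        self.events.append("replay")
        return "replayed"


recorder = ShapeKeyedRecorder()
recorder.run([(8, 3, 224, 224)])  # records
recorder.run([(8, 3, 224, 224)])  # replays
recorder.run([(4, 3, 224, 224)])  # new shape key, records again
```

Because only one key is tracked at a time, alternating between two shapes triggers a
re-recording on every call in this model.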

Limitations
^^^^^^^^^^^

* **Dynamic shapes**: CUDAGraphs require fixed tensor addresses and sizes. If your
  input shapes change between requests, the graph is re-recorded for each new shape.
  For variable-batch workloads, consider bucketing inputs into a small set of fixed
  shapes, or run without CUDAGraphs.

* **Data-dependent-shape ops**: Operations like ``nonzero`` and ``unique`` produce
  outputs whose size is unknown at graph-capture time. These require the
  ``DynamicOutputAllocator`` and are incompatible with full CUDA graph capture unless
  they are partitioned into a separate PyTorch subgraph.

* **Weight streaming**: If ``enable_weight_streaming=True``, the graph is re-recorded
  whenever weights are streamed (flagged by ``is_weight_streaming_set``).

* **Not serializable**: ``CudaGraphsTorchTensorRTModule`` is a runtime wrapper and
  cannot be saved via ``torch_tensorrt.save()``. Save the underlying ``trt_gm`` first,
  load it, then wrap.
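
Bucketing, mentioned under the dynamic-shapes limitation, limits the number of
distinct shapes the runtime sees by padding each batch up to a fixed size. A minimal
sketch, assuming batches are lists of equal-length rows; ``BUCKETS``,
``pick_bucket`` and ``pad_batch`` are hypothetical helpers, and the bucket sizes are
arbitrary.

```python
import bisect
from typing import List, Sequence

BUCKETS: Sequence[int] = (1, 2, 4, 8, 16, 32)  # hypothetical batch-size buckets


def pick_bucket(batch_size: int, buckets: Sequence[int] = BUCKETS) -> int:
    """Return the smallest bucket that can hold batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    idx = bisect.bisect_left(buckets, batch_size)
    if idx == len(buckets):
        raise ValueError(
            f"batch_size {batch_size} exceeds largest bucket {buckets[-1]}"
        )
    return buckets[idx]


def pad_batch(rows: List[list], buckets: Sequence[int] = BUCKETS) -> List[list]:
    """Pad a batch with zero rows up to its bucket size."""
    target = pick_bucket(len(rows), buckets)
    width = len(rows[0])
    return rows + [[0.0] * width for _ in range(target - len(rows))]


padded = pad_batch([[1.0, 2.0]] * 3)
print(len(padded))  # 4: three real rows plus one row of padding
```

After inference, slice the output back to the original batch size so the padded rows
are discarded.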

----

DynamicOutputAllocator
-----------------------

Some TRT ops produce outputs whose shape depends on runtime data — TensorRT calls
these **data-dependent shape (DDS)** operations. Examples: ``aten.nonzero``,
``aten.unique``, ``aten.nms``.

For these ops, TRT cannot pre-allocate output buffers of the correct size. The
``DynamicOutputAllocator`` solves this by implementing TRT's ``IOutputAllocator``
interface: TRT calls back into the allocator at runtime to request a buffer of the
required size, and the allocator provides a freshly allocated CUDA tensor.
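
The callback flow can be illustrated with a host-memory toy. This is a sketch only:
``ToyOutputAllocator`` and ``fake_nonzero`` are hypothetical, the method signatures
are simplified relative to TensorRT's interface, and a ``bytearray`` stands in for a
CUDA tensor.

```python
from typing import Dict, Tuple


class ToyOutputAllocator:
    """Host-memory stand-in for an output allocator; not the TensorRT class."""

    def __init__(self) -> None:
        self.buffers: Dict[str, bytearray] = {}
        self.shapes: Dict[str, Tuple[int, ...]] = {}

    def reallocate_output(self, tensor_name: str, size: int) -> bytearray:
        # Called by the engine once it knows how many bytes the output needs.
        self.buffers[tensor_name] = bytearray(size)
        return self.buffers[tensor_name]

    def notify_shape(self, tensor_name: str, shape: Tuple[int, ...]) -> None:
        # Called by the engine with the final output shape.
        self.shapes[tensor_name] = tuple(shape)


def fake_nonzero(values, allocator: ToyOutputAllocator):
    """Simulate an engine running a nonzero-like op with 4-byte indices."""
    indices = [i for i, v in enumerate(values) if v != 0]
    buf = allocator.reallocate_output("output0", 4 * len(indices))
    for slot, idx in enumerate(indices):
        buf[4 * slot : 4 * slot + 4] = idx.to_bytes(4, "little")
    allocator.notify_shape("output0", (len(indices), 1))
    return indices


allocator = ToyOutputAllocator()
fake_nonzero([0, 3, 0, 7, 9], allocator)
print(allocator.shapes["output0"])  # (3, 1)
```

The output size is decided inside the op, which is why the buffer cannot be allocated
before execution starts.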

When Is It Used?
^^^^^^^^^^^^^^^^^

Automatically — you do not need to configure it manually. When any converter in the
graph sets ``requires_output_allocator=True`` in its ``ConverterSupport``, the
``TRTInterpreter`` sets ``ctx.requires_output_allocator = True`` on the
``ConversionContext``. The runtime module then uses the ``DynamicOutputAllocator``
for that engine.

.. code-block:: python

    # Check if a compiled module uses the output allocator
    for name, submodule in trt_gm.named_children():
        if hasattr(submodule, "requires_output_allocator"):
            print(f"{name}: requires_output_allocator = {submodule.requires_output_allocator}")

Writing a Converter That Requires the Output Allocator
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Set ``requires_output_allocator=True`` in the decorator:

.. code-block:: python

    @dynamo_tensorrt_converter(
        torch.ops.aten.nonzero.default,
        requires_output_allocator=True,
        supports_dynamic_shapes=True,
    )
    def aten_ops_nonzero(ctx, target, args, kwargs, name):
        ...

Performance Implications
^^^^^^^^^^^^^^^^^^^^^^^^^

The ``DynamicOutputAllocator`` performs a CUDA memory allocation on every forward
pass for each DDS output. This adds a small latency cost compared to pre-allocated
buffers. If DDS ops are in your hot path:

* Use ``torch_executed_ops`` to force the DDS op to run in PyTorch, where dynamically
  sized outputs are handled by PyTorch's own allocator.
* Cache output tensor handles across calls if the output size is bounded in practice.
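
One way to read the caching suggestion is a grow-only buffer that is reallocated only
when a request exceeds its current capacity. The sketch below uses host memory and a
hypothetical ``GrowOnlyBuffer`` class; a real implementation would hold a CUDA tensor
instead, and this is not how the ``DynamicOutputAllocator`` is implemented.

```python
class GrowOnlyBuffer:
    """Reuse one buffer across calls; reallocate only when a request exceeds capacity."""

    def __init__(self) -> None:
        self._storage = bytearray()
        self.allocations = 0  # counts how often a new buffer was created

    def request(self, size: int) -> memoryview:
        if size > len(self._storage):
            self._storage = bytearray(size)
            self.allocations += 1
        # Hand out a view of exactly the requested size.
        return memoryview(self._storage)[:size]


buf = GrowOnlyBuffer()
buf.request(64)   # allocates
buf.request(16)   # reuses the 64-byte buffer
buf.request(128)  # grows, allocates again
print(buf.allocations)  # 2
```

Views handed out earlier alias the same storage, so each output must be consumed or
copied before the next call reuses the buffer.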

----

Choosing Between Approaches
-----------------------------

.. list-table::
   :widths: 25 25 25 25
   :header-rows: 1

   * - Scenario
     - CUDAGraphs
     - Output Allocator
     - Recommendation
   * - Fixed-shape inference, latency critical
     - Ideal
     - Not needed
     - Enable CUDAGraphs
   * - Variable batch sizes
     - Re-records per shape
     - Not needed
     - Bucket inputs + CUDAGraphs, or no CUDAGraphs
   * - Graph contains ``nonzero`` / ``unique``
     - Incompatible
     - Required automatically
     - Let the allocator run, disable CUDAGraphs for that subgraph
   * - Maximize throughput, not latency
     - Marginal benefit
     - Not needed
     - Skip, focus on ``optimization_level``