.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "tutorials/_rendered_examples/dynamo/dynamic_memory_allocation.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_tutorials__rendered_examples_dynamo_dynamic_memory_allocation.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_tutorials__rendered_examples_dynamo_dynamic_memory_allocation.py:


.. _dynamic_memory_allocation:

Dynamic Memory Allocation
==========================================================

This script demonstrates how to use dynamic memory allocation with Torch-TensorRT
to reduce the GPU memory footprint. When enabled, TensorRT engines allocate the resources
they need at inference time and release them afterwards, which can significantly reduce peak memory usage.

This is particularly useful when:

- Running multiple models on the same GPU
- Working with limited GPU memory
- Minimizing memory usage between inference calls

.. GENERATED FROM PYTHON SOURCE LINES 19-21

Imports and Model Definition
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. GENERATED FROM PYTHON SOURCE LINES 21-34

.. code-block:: python


    import gc
    import time

    import numpy as np
    import torch
    import torch_tensorrt as torch_trt
    import torchvision.models as models

    np.random.seed(5)
    torch.manual_seed(5)
    inputs = [torch.rand((100, 3, 224, 224)).to("cuda")]


.. GENERATED FROM PYTHON SOURCE LINES 35-46

Compilation Settings with Dynamic Memory Allocation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Key settings for dynamic memory allocation:

- ``dynamically_allocate_resources=True``: Enables dynamic resource allocation
- ``lazy_engine_init=True``: Defers engine setup until all engines have been compiled
- ``immutable_weights=False``: Allows weight refitting if needed

With these settings, the engine will allocate GPU memory only when needed
and deallocate it after inference completes.

.. GENERATED FROM PYTHON SOURCE LINES 46-61

.. code-block:: python


    settings = {
        "ir": "dynamo",
        "use_python_runtime": False,
        "enabled_precisions": {torch.float32},
        "immutable_weights": False,
        "lazy_engine_init": True,
        "dynamically_allocate_resources": True,
    }

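    # ``pretrained=True`` is deprecated in torchvision >= 0.13; IMAGENET1K_V1 selects the same weights.
    model = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1).eval().to("cuda")
    compiled_module = torch_trt.compile(model, inputs=inputs, **settings)
    # mem_get_info() returns (free_bytes, total_bytes); total - free is the memory in use.
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    print("Memory used (GiB):", (total_bytes - free_bytes) / 1024**3)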
    compiled_module(*inputs)

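The memory reading above is an expression that recurs throughout this example. As a
minimal sketch, it could be wrapped in a small helper. The names ``used_gib`` and
``gpu_memory_used_gib`` are hypothetical, introduced here for illustration only, and are
not part of Torch-TensorRT. The only library call assumed is ``torch.cuda.mem_get_info()``,
which returns ``(free_bytes, total_bytes)`` for the current device.

.. code-block:: python

    def used_gib(free_bytes, total_bytes):
        """Convert a (free, total) pair in bytes to the amount in use, in GiB."""
        return (total_bytes - free_bytes) / 1024**3


    def gpu_memory_used_gib():
        """Read the current device's memory in use (hypothetical helper)."""
        # Imported lazily so that ``used_gib`` can be used without a GPU.
        import torch

        free_bytes, total_bytes = torch.cuda.mem_get_info()
        return used_gib(free_bytes, total_bytes)

Note that this reading is device-wide: it also counts memory held by other processes and
by PyTorch's caching allocator, so it is best read as a relative indicator.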

.. GENERATED FROM PYTHON SOURCE LINES 62-75

Runtime Resource Allocation Control
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can control resource allocation behavior at runtime using the
``ResourceAllocationStrategy`` context manager. This allows you to:

- Switch between dynamic and static allocation modes
- Control when resources are allocated and deallocated
- Optimize memory usage for specific inference patterns

In this example, we temporarily disable dynamic allocation to keep
resources allocated between inference calls, which can improve performance
when running multiple consecutive inferences.

.. GENERATED FROM PYTHON SOURCE LINES 75-94

.. code-block:: python


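    # Wait before switching allocation modes.
    time.sleep(30)
    # Inside this context the engines keep their resources allocated between calls.
    with torch_trt.dynamo.runtime.ResourceAllocationStrategy(
        compiled_module, dynamically_allocate_resources=False
    ):
        # mem_get_info() returns (free_bytes, total_bytes) for the current device.
        free_bytes, total_bytes = torch.cuda.mem_get_info()
        print("Memory used (GiB):", (total_bytes - free_bytes) / 1024**3)
        compiled_module(*inputs)
        # Drop Python garbage and PyTorch's cached blocks before measuring again.
        gc.collect()
        torch.cuda.empty_cache()
        time.sleep(30)
        free_bytes, total_bytes = torch.cuda.mem_get_info()
        print("Memory used (GiB):", (total_bytes - free_bytes) / 1024**3)
        compiled_module(*inputs)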

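To check whether keeping resources allocated actually improves latency on a given setup,
the two modes can be timed. The following is a rough sketch rather than a rigorous
benchmark. ``mean_latency_ms`` is a hypothetical helper written for this illustration,
and the commented usage assumes the ``compiled_module`` and ``inputs`` defined earlier.

.. code-block:: python

    import time


    def mean_latency_ms(run, iterations=10, sync=None):
        """Average wall-clock time of ``run()`` in milliseconds.

        ``sync`` is an optional callable (for example ``torch.cuda.synchronize``)
        invoked before and after the timed loop so queued GPU work is included.
        """
        run()  # warm-up call, excluded from the measurement
        if sync is not None:
            sync()
        start = time.perf_counter()
        for _ in range(iterations):
            run()
        if sync is not None:
            sync()
        return (time.perf_counter() - start) * 1000 / iterations


    # Possible usage with the objects from this example:
    #
    # run = lambda: compiled_module(*inputs)
    # dynamic_ms = mean_latency_ms(run, sync=torch.cuda.synchronize)
    # with torch_trt.dynamo.runtime.ResourceAllocationStrategy(
    #     compiled_module, dynamically_allocate_resources=False
    # ):
    #     static_ms = mean_latency_ms(run, sync=torch.cuda.synchronize)

The difference between the two numbers gives a rough estimate of the per-call cost of
allocating and deallocating resources on that hardware.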

.. GENERATED FROM PYTHON SOURCE LINES 95-110

Memory Usage Comparison
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Dynamic memory allocation trades some performance for a smaller memory footprint:

**Benefits:**

- Lower peak GPU memory usage
- Reduced memory pressure on shared GPUs

**Considerations:**

- Slight overhead from allocating and deallocating resources around each inference call
- Best suited for scenarios where memory is constrained
- May not be necessary for single-model deployments with ample memory

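The memory side of this trade-off can be measured by reading the memory in use right
after an inference call in each mode. The sketch below is one possible way to do that,
not an established procedure. ``idle_memory_by_mode`` and ``compare_on_gpu`` are
hypothetical names, and the sketch assumes that ``ResourceAllocationStrategy`` also
accepts ``dynamically_allocate_resources=True``.

.. code-block:: python

    def idle_memory_by_mode(run_inference, read_used_gib, make_context):
        """Return the memory reading taken right after one inference in each mode.

        run_inference: zero-argument callable that performs one inference.
        read_used_gib: zero-argument callable returning the memory in use (GiB).
        make_context:  callable mapping a bool (dynamic or not) to a context manager.
        """
        readings = {}
        for name, dynamic in (("static", False), ("dynamic", True)):
            with make_context(dynamic):
                run_inference()
                readings[name] = read_used_gib()
        return readings


    def compare_on_gpu(compiled_module, inputs):
        """Wire the helper above to Torch-TensorRT (requires a CUDA device)."""
        # Imported lazily so the helper above can be exercised without a GPU.
        import torch
        import torch_tensorrt as torch_trt

        def run_inference():
            compiled_module(*inputs)
            torch.cuda.synchronize()

        def read_used_gib():
            free_bytes, total_bytes = torch.cuda.mem_get_info()
            return (total_bytes - free_bytes) / 1024**3

        def make_context(dynamic):
            # Assumption: the context manager accepts True as well as False.
            return torch_trt.dynamo.runtime.ResourceAllocationStrategy(
                compiled_module, dynamically_allocate_resources=dynamic
            )

        return idle_memory_by_mode(run_inference, read_used_gib, make_context)

Calling ``compare_on_gpu(compiled_module, inputs)`` would return a dictionary with one
reading per mode. Because the readings are device-wide and include memory cached by
PyTorch, they are best treated as a relative comparison.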

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes  0.000 seconds)


.. _sphx_glr_download_tutorials__rendered_examples_dynamo_dynamic_memory_allocation.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example


    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: dynamic_memory_allocation.py <dynamic_memory_allocation.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: dynamic_memory_allocation.ipynb <dynamic_memory_allocation.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_