Runtime Optimization
====================

Optimize inference throughput and latency with CUDA Graphs for kernel replay, pre-allocated output buffers, and the Python runtime module.

.. toctree::
   :maxdepth: 1

   cuda_graphs
   Example: Torch Export with Cudagraphs <../_rendered_examples/dynamo/torch_export_cudagraphs>
   Example: Pre-allocated output buffer <../_rendered_examples/dynamo/pre_allocated_output_example>
   python_runtime