Runtime Optimization#

Optimize inference throughput and latency: CUDA Graphs for kernel-replay, pre-allocated output buffers, multiple optimization profiles for distinct shape regimes (e.g. LLM prefill/decode), and choosing the Python vs C++ TRT execution path.