Torch-TensorRT



Resource & Memory Management

Techniques for controlling GPU and CPU memory consumption during compilation and inference: engine caching, weight streaming, dynamic memory allocation, and low-CPU-memory compilation.

  • Resource Management
  • Engine Caching
  • Example: Engine Caching
  • Example: Engine Caching (BERT)
  • Example: Weight Streaming
  • Example: Dynamic Memory Allocation
  • Example: Low CPU Memory Compilation
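The engine-caching idea listed above can be sketched in plain Python. This is a conceptual sketch only, not the Torch-TensorRT API: the `EngineCache` class and `compile_with_cache` function here are hypothetical stand-ins that illustrate the mechanism (a built engine is stored on disk keyed by a hash of the graph, so a later compilation of the same graph reuses it instead of rebuilding).

```python
import hashlib
import tempfile
from pathlib import Path

class EngineCache:
    """Hypothetical sketch: persist 'built engines' on disk,
    keyed by a hash of the model graph."""

    def __init__(self, cache_dir):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _path(self, graph_repr: str) -> Path:
        digest = hashlib.sha256(graph_repr.encode()).hexdigest()
        return self.cache_dir / f"{digest}.engine"

    def load(self, graph_repr: str):
        path = self._path(graph_repr)
        return path.read_bytes() if path.exists() else None

    def save(self, graph_repr: str, engine: bytes) -> None:
        self._path(graph_repr).write_bytes(engine)

def compile_with_cache(graph_repr: str, cache: EngineCache) -> bytes:
    cached = cache.load(graph_repr)
    if cached is not None:
        return cached  # cache hit: skip the expensive build step
    # Stand-in for an expensive TensorRT engine build.
    engine = f"engine-for:{graph_repr}".encode()
    cache.save(graph_repr, engine)
    return engine

cache = EngineCache(tempfile.mkdtemp())
first = compile_with_cache("conv-relu-graph", cache)   # builds and stores
second = compile_with_cache("conv-relu-graph", cache)  # served from cache
assert first == second
```

The real implementation is covered in the Engine Caching guide and examples linked above; this sketch only shows why recompiling an identical graph becomes cheap once its engine is cached.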


© Copyright 2024, NVIDIA Corporation.