
Torch-TensorRT



Plugins

Register custom CUDA and Triton kernels as TensorRT plugins — from auto-generated Python plugins to AOT-compiled C++ plugins for use in serialized engines.

  • Plugin System
  • Example: Auto-generate a Plugin for a Custom Kernel
  • Example: Using Custom Kernels within TensorRT Engines
  • Automatically Generate a TensorRT AOT Plugin
    • Step 1: Define the Triton Kernel
    • Step 2: Register the PyTorch op
    • Step 3: Register the QDP Shape Descriptor
    • Step 4: Register the AOT Implementation
    • Step 5: Generate the Converter
    • Step 6: Compile and Run
  • Example: Custom Kernels with NVRTC in TensorRT AOT Plugins


© Copyright 2024, NVIDIA Corporation.