{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n\n# Compiling GPT2 using the Torch-TensorRT ``torch.compile`` frontend\n\nThis example illustrates the state of the art model [GPT2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) optimized using\n``torch.compile`` frontend of Torch-TensorRT. Install the following dependencies before compilation\n\n```python\npip install -r requirements.txt\n```\nGPT2 is a causal (unidirectional) transformer pretrained using language modeling on a very large corpus of text data. In this example, we use the GPT2 model available at [HuggingFace](https://huggingface.co/docs/transformers/en/model_doc/gpt2) and apply torch.compile on it to\nget the graph module representation of the graph. Torch-TensorRT converts this graph into an optimized TensorRT engine.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Import necessary libraries\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import torch\nimport torch_tensorrt\nfrom transformers import AutoModelForCausalLM, AutoTokenizer"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Define the necessary parameters\nTorch-TensorRT requires a GPU for successful compilation of the model.\n``MAX_LENGTH`` is the maximum length the generated tokens can have. This corresponds to the length of the input prompt +\nnumber of new tokens generated\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "MAX_LENGTH = 32\nDEVICE = torch.device(\"cuda:0\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Model definition\nWe use ``AutoModelForCausalLM`` class to load the pretrained GPT2 model from hugging face. ``kv_cache`` is not supported in Torch-TRT currently so ``use_cache=False``\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "with torch.no_grad():\n    tokenizer = AutoTokenizer.from_pretrained(\"gpt2\")\n    model = (\n        AutoModelForCausalLM.from_pretrained(\n            \"gpt2\",\n            pad_token_id=tokenizer.eos_token_id,\n            use_cache=False,\n            attn_implementation=\"eager\",\n        )\n        .to(DEVICE)\n        .eval()\n    )"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## PyTorch inference\nTokenize a sample input prompt and get pytorch model outputs\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "prompt = \"I enjoy walking with my cute dog\"\nmodel_inputs = tokenizer(prompt, return_tensors=\"pt\")\ninput_ids = model_inputs[\"input_ids\"].to(DEVICE)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The ``generate()`` API of the ``AutoModelForCausalLM`` class is used for auto-regressive generation with greedy decoding.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "with torch.no_grad():\n    pyt_gen_tokens = model.generate(\n        input_ids,\n        max_length=MAX_LENGTH,\n        use_cache=False,\n        pad_token_id=tokenizer.eos_token_id,\n    )"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Torch-TensorRT compilation and inference\nThe input sequence length is dynamic, so we mark it using ``torch._dynamo.mark_dynamic`` API.\nWe provide a (min, max) range of this value so that TensorRT knows in advance what values to optimize for.\nUsually, this would be the context length for the model. We start with ``min=2`` due to the [0/1 specialization](https://docs.google.com/document/d/16VPOa3d-Liikf48teAOmxLc92rgvJdfosIy-yoT38Io/edit?fbclid=IwAR3HNwmmexcitV0pbZm_x1a4ykdXZ9th_eJWK-3hBtVgKnrkmemz6Pm5jRQ&tab=t.0#heading=h.ez923tomjvyk)\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "torch._dynamo.mark_dynamic(input_ids, 1, min=2, max=1023)\nmodel.forward = torch.compile(\n    model.forward,\n    backend=\"tensorrt\",\n    dynamic=None,\n    options={\n        \"enabled_precisions\": {torch.float32},\n        \"disable_tf32\": True,\n        \"min_block_size\": 1,\n    },\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Auto-regressive generation loop for greedy decoding using TensorRT model\nThe first token generation compiles the model using TensorRT and the second token\nencounters recompilation (which is an issue currently that would be resolved in the future)\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "with torch.no_grad():\n    trt_gen_tokens = model.generate(\n        inputs=input_ids,\n        max_length=MAX_LENGTH,\n        use_cache=False,\n        pad_token_id=tokenizer.eos_token_id,\n    )"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Decode the output sentences of PyTorch and TensorRT\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(\n    \"Pytorch model generated text: \",\n    tokenizer.decode(pyt_gen_tokens[0], skip_special_tokens=True),\n)\nprint(\"=============================\")\nprint(\n    \"TensorRT model generated text: \",\n    tokenizer.decode(trt_gen_tokens[0], skip_special_tokens=True),\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The output sentences should look like\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "\"\"\"\nPytorch model generated text:  I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with my dog. I'm not sure if I'll\n=============================\nTensorRT model generated text:  I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with my dog. I'm not sure if I'll\n\"\"\""
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.11.15"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}