(Part 3) Serving on vLLM, SGLang, ExecuTorch
==============================================

TorchAO provides an end-to-end pre-training, fine-tuning, and serving model optimization flow by leveraging our quantization and sparsity techniques integrated into our partner frameworks. This is part 3 of 3 tutorials showcasing this end-to-end flow, focusing on the serving step.

.. image:: ../static/e2e_flow_part3.png

This tutorial demonstrates how to perform post-training quantization and deploy models for inference using torchao as the underlying optimization engine, seamlessly integrated through HuggingFace Transformers, vLLM, and ExecuTorch.

.. contents::
   :local:
   :depth: 2

Post-training Quantization with HuggingFace
---------------------------------------------

HuggingFace Transformers provides seamless integration with torchao quantization. The ``TorchAoConfig`` automatically applies torchao's optimized quantization algorithms during model loading.

.. code-block:: bash

    pip install git+https://github.com/huggingface/transformers@main
    pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
    pip install torch
    pip install accelerate

For this example, we'll use ``Float8DynamicActivationFloat8WeightConfig`` on the Phi-4 mini-instruct model.

.. code-block:: python

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
    from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow

    model_id = "microsoft/Phi-4-mini-instruct"

    quant_config = Float8DynamicActivationFloat8WeightConfig(granularity=PerRow())
    quantization_config = TorchAoConfig(quant_type=quant_config)
    quantized_model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        quantization_config=quantization_config,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # Push the quantized model to Hugging Face hub
    USER_ID = "YOUR_USER_ID"
    MODEL_NAME = model_id.split("/")[-1]
    save_to = f"{USER_ID}/{MODEL_NAME}-float8dq"
    quantized_model.push_to_hub(save_to, safe_serialization=False)
    tokenizer.push_to_hub(save_to)

.. note::

    For more information on supported quantization and sparsity configurations, see the `HF-Torchao Docs <https://huggingface.co/docs/transformers/main/en/quantization/torchao>`_.

Serving and Inference
----------------------

Serving and Inference with vLLM
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vLLM automatically leverages torchao's optimized kernels when serving quantized models, providing significant throughput improvements.

First, install vLLM with torchao support:

.. code-block:: bash

    pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
    pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126

To serve in vLLM, we use the model we quantized and pushed to the Hugging Face hub in the previous step (see :ref:`Post-training Quantization with HuggingFace`).

.. code-block:: bash

    # Server
    vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3

    # Client
    curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
      "model": "pytorch/Phi-4-mini-instruct-float8dq",
      "messages": [
        {"role": "user", "content": "Give me a short introduction to large language models."}
      ],
      "temperature": 0.6,
      "top_p": 0.95,
      "top_k": 20,
      "max_tokens": 32768
    }'

Serving a float8 dynamic quantized model with vLLM shows a 36% VRAM reduction, a 1.15x-1.2x inference speedup, and little to no accuracy impact on H100. See :ref:`Memory Benchmarking` and :ref:`Performance Benchmarking` for more details.

.. note::

    For more information on the vLLM integration, please refer to the detailed guide :ref:`torchao_vllm_integration`.
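
In addition to the OpenAI-compatible server shown above, the same quantized checkpoint can be used through vLLM's offline ``LLM`` Python API. The following is a minimal sketch (not part of the original flow) that assumes the ``pytorch/Phi-4-mini-instruct-float8dq`` checkpoint from the previous step; adjust the sampling parameters to your needs.

.. code-block:: python

    # Minimal offline-inference sketch using vLLM's Python API; assumes the
    # float8 dynamic quantized checkpoint pushed in the previous step.
    from vllm import LLM, SamplingParams

    llm = LLM(model="pytorch/Phi-4-mini-instruct-float8dq")
    sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=256)

    prompts = ["Give me a short introduction to large language models."]
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        print(output.outputs[0].text)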

Serving and Inference with SGLang
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(Coming soon!)

Inference with Transformers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Install the required packages:

.. code-block:: bash

    pip install git+https://github.com/huggingface/transformers@main
    pip install torchao
    pip install torch
    pip install accelerate

.. code-block:: python

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

    torch.random.manual_seed(0)

    model_path = "pytorch/Phi-4-mini-instruct-float8dq"

    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        device_map="auto",
        torch_dtype="auto",
        trust_remote_code=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    messages = [
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
        {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."},
        {"role": "user", "content": "What about solving an 2x + 3 = 7 equation?"},
    ]

    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
    )

    generation_args = {
        "max_new_tokens": 500,
        "return_full_text": False,
        "temperature": 0.0,
        "do_sample": False,
    }

    output = pipe(messages, **generation_args)
    print(output[0]['generated_text'])

Mobile Deployment with ExecuTorch
----------------------------------

ExecuTorch enables on-device inference using torchao's mobile-optimized quantization schemes. The 8da4w (8-bit dynamic activation, 4-bit weight) configuration is specifically designed for mobile deployment.

Optionally, before lowering to ExecuTorch, we can fine-tune the model with quantization-aware training (QAT, see :doc:`finetuning`), which has demonstrated improvements in the quality of quantized models.
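
As a rough illustration of what that optional QAT step can look like, here is a minimal sketch using torchao's ``Int8DynActInt4WeightQATQuantizer``, which matches the 8da4w scheme used for mobile deployment below. The training loop is elided and the exact QAT API may differ between torchao versions, so treat :doc:`finetuning` as the authoritative reference.

.. code-block:: python

    # Hedged QAT sketch: assumes torchao's Int8DynActInt4WeightQATQuantizer API;
    # see the finetuning tutorial for the maintained recipe.
    import torch
    from transformers import AutoModelForCausalLM
    from torchao.quantization.qat import Int8DynActInt4WeightQATQuantizer

    model = AutoModelForCausalLM.from_pretrained(
        "microsoft/Phi-4-mini-instruct", torch_dtype=torch.bfloat16
    )

    # Insert fake quantization for 8-bit dynamic activations / 4-bit grouped weights
    qat_quantizer = Int8DynActInt4WeightQATQuantizer(groupsize=32)
    model = qat_quantizer.prepare(model)

    # ... run the usual fine-tuning loop here so the model adapts to quantization ...

    # Replace fake-quantized modules with actually quantized ones
    model = qat_quantizer.convert(model)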

[Optional] Untie Embedding Weights
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Optionally, we can quantize the embedding and lm_head differently. Since those layers are tied, we first need to untie the model:

.. code-block:: python

    import torch
    from transformers import (
        AutoModelForCausalLM,
        AutoProcessor,
        AutoTokenizer,
    )
    from transformers.modeling_utils import find_tied_parameters

    model_id = "microsoft/Phi-4-mini-instruct"

    untied_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    print(untied_model)
    print("tied weights:", find_tied_parameters(untied_model))
    if getattr(untied_model.config.get_text_config(decoder=True), "tie_word_embeddings"):
        setattr(untied_model.config.get_text_config(decoder=True), "tie_word_embeddings", False)

    untied_model._tied_weights_keys = []
    untied_model.lm_head.weight = torch.nn.Parameter(untied_model.lm_head.weight.clone())

    print("tied weights:", find_tied_parameters(untied_model))

    USER_ID = "YOUR_USER_ID"
    MODEL_NAME = model_id.split("/")[-1]
    save_to = f"{USER_ID}/{MODEL_NAME}-untied-weights"
    untied_model.push_to_hub(save_to)
    tokenizer.push_to_hub(save_to)

    # or save locally
    save_to_local_path = f"{MODEL_NAME}-untied-weights"
    untied_model.save_pretrained(save_to_local_path)
    tokenizer.save_pretrained(save_to_local_path)

Step 1: Create Mobile-Optimized Quantization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We quantize the model for mobile deployment using TorchAO's ``Int8DynamicActivationIntxWeightConfig`` configuration. If we untied the embedding and lm_head in the previous step, we can quantize the embedding using the ``IntxWeightOnlyConfig`` configuration and the lm_head using the ``Int8DynamicActivationIntxWeightConfig`` configuration.

.. code-block:: python

    import torch
    from transformers import (
        AutoModelForCausalLM,
        AutoProcessor,
        AutoTokenizer,
        TorchAoConfig,
    )
    from torchao.quantization.quant_api import (
        IntxWeightOnlyConfig,
        Int8DynamicActivationIntxWeightConfig,
        ModuleFqnToConfig,
        quantize_,
    )
    from torchao.quantization.granularity import PerGroup, PerAxis

    # we start from the model with untied weights
    model_id = "microsoft/Phi-4-mini-instruct"
    USER_ID = "YOUR_USER_ID"
    MODEL_NAME = model_id.split("/")[-1]
    untied_model_id = f"{USER_ID}/{MODEL_NAME}-untied-weights"
    untied_model_local_path = f"{MODEL_NAME}-untied-weights"

    # embedding_config is only required if we untied the embedding and lm_head in the
    # previous step; otherwise we can use only the linear config for quantization
    embedding_config = IntxWeightOnlyConfig(
        weight_dtype=torch.int8,
        granularity=PerAxis(0),
    )
    linear_config = Int8DynamicActivationIntxWeightConfig(
        weight_dtype=torch.int4,
        weight_granularity=PerGroup(32),
        weight_scale_dtype=torch.bfloat16,
    )
    quant_config = ModuleFqnToConfig({"_default": linear_config, "model.embed_tokens": embedding_config})
    quantization_config = TorchAoConfig(
        quant_type=quant_config,
        include_embedding=True,
        untie_embedding_weights=True,
        modules_to_not_convert=[],
    )

    # either use `untied_model_id` or `untied_model_local_path`
    quantized_model = AutoModelForCausalLM.from_pretrained(
        untied_model_id,
        torch_dtype=torch.float32,
        device_map="auto",
        quantization_config=quantization_config,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # Push to hub
    MODEL_NAME = model_id.split("/")[-1]
    save_to = f"{USER_ID}/{MODEL_NAME}-8da4w"
    quantized_model.push_to_hub(save_to, safe_serialization=False)
    tokenizer.push_to_hub(save_to)
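
Before exporting, it can be useful to smoke-test that the 8da4w checkpoint still generates sensible text. The snippet below is a small illustrative check (not part of the original flow); it loads the model pushed above, so replace ``YOUR_USER_ID`` with your own hub user id.

.. code-block:: python

    # Quick sanity check of the 8da4w model before ExecuTorch export (illustrative).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    USER_ID = "YOUR_USER_ID"
    model_id = f"{USER_ID}/Phi-4-mini-instruct-8da4w"

    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.float32)
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    generated_ids = model.generate(inputs, max_new_tokens=64)
    print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))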

Step 2: Export to ExecuTorch
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Convert the quantized model to a ``.pte`` file, which can be run on a mobile device.

.. code-block:: bash

    # Install ExecuTorch
    git clone https://github.com/pytorch/executorch.git
    cd executorch
    ./install_requirements.sh

    # Convert checkpoint format for ExecuTorch
    python -m executorch.examples.models.phi_4_mini.convert_weights pytorch_model.bin pytorch_model_converted.bin

    # Export to PTE format with torchao optimizations preserved
    PARAMS="executorch/examples/models/phi_4_mini/config.json"
    python -m executorch.examples.models.llama.export_llama \
        --model "phi_4_mini" \
        --checkpoint "pytorch_model_converted.bin" \
        --params "$PARAMS" \
        -kv \
        --use_sdpa_with_kv_cache \
        -X \
        --metadata '{"get_bos_id":199999, "get_eos_ids":[200020,199999]}' \
        --max_seq_length 128 \
        --max_context_length 128 \
        --output_name="phi4-mini-8da4w.pte"

The ``.pte`` file can be run with ExecuTorch on a mobile phone. Follow the instructions in the `ExecuTorch repository <https://github.com/pytorch/executorch>`_ for doing this on an iOS device.

Mobile Performance Characteristics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The torchao-optimized 8da4w model provides:

- **Memory**: ~3.2GB on iPhone 15 Pro
- **Speed**: ~17 tokens/sec on iPhone 15 Pro
- **Accuracy**: Maintained within 5-10% of the original model on most benchmarks

.. note::

    For detailed instructions on testing the ExecuTorch model and reproducing benchmarks, please refer to the `HF Phi-4-mini-instruct-8da4w model <https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w>`_.

Evaluation
-----------

Model Quality Assessment
^^^^^^^^^^^^^^^^^^^^^^^^^

Evaluate quantized models using lm-evaluation-harness:

.. code-block:: bash

    # Install evaluation framework
    # Need to install lm-eval from source: https://github.com/EleutherAI/lm-evaluation-harness#install

    # Evaluate baseline model
    lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8

    # Evaluate torchao-quantized model (float8dq)
    lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-float8dq --tasks hellaswag --device cuda:0 --batch_size 8
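
If you prefer to drive the comparison from Python, lm-evaluation-harness also exposes a ``simple_evaluate`` API. The sketch below is illustrative only: it assumes lm-eval is installed from source as noted above, uses the same checkpoints as the commands above, and the argument details may vary between lm-eval versions.

.. code-block:: python

    # Illustrative use of lm-eval's Python API to compare the baseline and float8dq
    # checkpoints on hellaswag; mirrors the CLI commands above.
    import lm_eval

    for pretrained in ("microsoft/Phi-4-mini-instruct", "pytorch/Phi-4-mini-instruct-float8dq"):
        results = lm_eval.simple_evaluate(
            model="hf",
            model_args=f"pretrained={pretrained}",
            tasks=["hellaswag"],
            device="cuda:0",
            batch_size=8,
        )
        print(pretrained, results["results"]["hellaswag"])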

Memory Benchmarking
^^^^^^^^^^^^^^^^^^^^

For Phi-4-mini-instruct, when quantized with float8 dynamic quant, we can reduce the peak memory usage by 36% compared to the baseline model.

.. code-block:: python

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # use "microsoft/Phi-4-mini-instruct" or "pytorch/Phi-4-mini-instruct-float8dq"
    model_id = "pytorch/Phi-4-mini-instruct-float8dq"
    quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    torch.cuda.reset_peak_memory_stats()

    prompt = "Hey, are you conscious? Can you talk to me?"
    messages = [
        {
            "role": "system",
            "content": "",
        },
        {"role": "user", "content": prompt},
    ]
    templated_prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    print("Prompt:", prompt)
    print("Templated prompt:", templated_prompt)
    inputs = tokenizer(
        templated_prompt,
        return_tensors="pt",
    ).to("cuda")
    generated_ids = quantized_model.generate(**inputs, max_new_tokens=128)
    output_text = tokenizer.batch_decode(
        generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    print("Response:", output_text[0][len(prompt):])

    mem = torch.cuda.max_memory_reserved() / 1e9
    print(f"Peak Memory Usage: {mem:.02f} GB")

Output:

.. code:: console

    Prompt: Hey, are you conscious? Can you talk to me?
    Templated prompt: <|system|><|end|><|user|>Hey, are you conscious? Can you talk to me?<|end|><|assistant|>
    Response: Hello! Yes, I am a digital assistant, and I am fully operational and ready to assist you. How can I help you today?
    Peak Memory Usage: 5.70 GB

+-------------------+---------------------+------------------------------+
| Benchmark         | Phi-4 mini-instruct | Phi-4-mini-instruct-float8dq |
+===================+=====================+==============================+
| Peak Memory (GB)  | 8.91                | 5.70 (36% reduction)         |
+-------------------+---------------------+------------------------------+

Performance Benchmarking
^^^^^^^^^^^^^^^^^^^^^^^^^

Latency Benchmarking
"""""""""""""""""""""

.. code-block:: bash

    # baseline
    python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model microsoft/Phi-4-mini-instruct --batch-size 1

    # float8dq
    VLLM_DISABLE_COMPILE_CACHE=1 python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model pytorch/Phi-4-mini-instruct-float8dq --batch-size 1

Serving Benchmarking
"""""""""""""""""""""

We benchmarked the throughput in a serving environment.

.. code-block:: bash

    # Setup: get the vLLM source code
    git clone git@github.com:vllm-project/vllm.git

    # Install vLLM
    cd vllm
    VLLM_USE_PRECOMPILED=1 pip install --editable .

    # Run the benchmarks under the vLLM root folder:

    # Download the ShareGPT dataset:
    wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

    # Other datasets can be found in: https://github.com/vllm-project/vllm/tree/main/benchmarks
    # Note: you can change the number of prompts to be benchmarked with the --num-prompts argument of the benchmark_serving script.

    # For baseline
    # Server:
    vllm serve microsoft/Phi-4-mini-instruct --tokenizer microsoft/Phi-4-mini-instruct -O3
    # Client:
    python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model microsoft/Phi-4-mini-instruct --num-prompts 1

    # For float8dq
    # Server:
    VLLM_DISABLE_COMPILE_CACHE=1 vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3
    # Client:
    python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-float8dq --num-prompts 1

Results (H100 machine)
"""""""""""""""""""""""

+----------------------------+---------------------+------------------------------+
| Benchmark                  | Phi-4-mini-instruct | Phi-4-mini-instruct-float8dq |
+============================+=====================+==============================+
| latency (batch_size=1)     | 1.64s               | 1.41s (1.16x speedup)        |
+----------------------------+---------------------+------------------------------+
| latency (batch_size=128)   | 3.1s                | 2.72s (1.14x speedup)        |
+----------------------------+---------------------+------------------------------+
| serving (num_prompts=1)    | 1.35 req/s          | 1.57 req/s (1.16x speedup)   |
+----------------------------+---------------------+------------------------------+
| serving (num_prompts=1000) | 66.68 req/s         | 80.53 req/s (1.21x speedup)  |
+----------------------------+---------------------+------------------------------+

Conclusion
-----------

This tutorial demonstrated how torchao's quantization and sparsity techniques integrate seamlessly across the entire ML deployment stack:

- **HuggingFace Transformers** provides easy model loading with torchao quantization
- **vLLM** leverages torchao's optimized kernels for high-throughput serving
- **ExecuTorch** enables mobile deployment with torchao's mobile-optimized schemes
- **lm-evaluation-harness** provides model quality assessment

All these frameworks use torchao as the underlying optimization engine, ensuring consistent performance gains and ease of integration. The quantization techniques shown provide significant memory savings (a 36% peak-memory reduction for float8 dynamic quantization, with larger savings for 4-bit weight schemes such as 8da4w) and performance improvements (1.15x-1.2x inference speedup) while maintaining model quality within acceptable bounds for most applications.

For production deployments, always benchmark on your specific use case and hardware to validate the performance and accuracy trade-offs.