Quantization#
The XNNPACK delegate can also be used as a backend to execute symmetrically quantized models. To quantize a PyTorch model for the XNNPACK backend, use the XNNPACKQuantizer. Quantizers are backend specific, which means the XNNPACKQuantizer is configured to quantize models to leverage the quantized operators offered by the XNNPACK library.
Supported Quantization Schemes#
The XNNPACK delegate supports the following quantization schemes:
- 8-bit symmetric weights with 8-bit asymmetric activations (via the PT2E quantization flow).
  - Supports both static and dynamic activations (see the configuration sketch after this list).
  - Supports per-channel and per-tensor schemes.
  - Supports linear, convolution, add, mul, cat, and adaptive avg pool 2d operators.

Weight-only quantization is not currently supported on XNNPACK.
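These choices are selected through the configuration passed to the quantizer. As a rough sketch (the is_per_channel and is_dynamic arguments of get_symmetric_quantization_config are shown here only to illustrate the options above):

from executorch.backends.xnnpack.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

# Static activations with per-channel weight quantization.
static_per_channel = get_symmetric_quantization_config(is_per_channel=True)

# Dynamic activations with per-channel weight quantization.
dynamic_per_channel = get_symmetric_quantization_config(
    is_per_channel=True,
    is_dynamic=True,
)

quantizer = XNNPACKQuantizer()
quantizer.set_global(static_per_channel)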
8-bit Quantization using the PT2E Flow#
To perform 8-bit quantization with the PT2E flow, perform the following steps prior to exporting the model:

1. Create an instance of the XNNPACKQuantizer class and set the quantization parameters.
2. Use torch.export.export to prepare for quantization.
3. Call prepare_pt2e to prepare the model for quantization.
4. For static quantization, run the prepared model with representative samples to calibrate the quantized tensor activation ranges.
5. Call convert_pt2e to quantize the model.
6. Export and lower the model using the standard flow.
The output of convert_pt2e is a PyTorch model which can be exported and lowered using the normal flow. As it is a regular PyTorch model, it can also be used to evaluate the accuracy of the quantized model using standard PyTorch techniques.
import torch
import torchvision.models as models
from torchvision.models.mobilenetv2 import MobileNet_V2_Weights
from executorch.backends.xnnpack.quantizer.xnnpack_quantizer import XNNPACKQuantizer, get_symmetric_quantization_config
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge_transform_and_lower
from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e
model = models.mobilenetv2.mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).eval()
sample_inputs = (torch.randn(1, 3, 224, 224), )
qparams = get_symmetric_quantization_config(is_per_channel=True) # (1)
quantizer = XNNPACKQuantizer()
quantizer.set_global(qparams)
training_ep = torch.export.export(model, sample_inputs).module() # (2)
prepared_model = prepare_pt2e(training_ep, quantizer) # (3)
for cal_sample in [torch.randn(1, 3, 224, 224)]: # Replace with representative model inputs
    prepared_model(cal_sample) # (4) Calibrate
quantized_model = convert_pt2e(prepared_model) # (5)
et_program = to_edge_transform_and_lower( # (6)
    torch.export.export(quantized_model, sample_inputs),
    partitioner=[XnnpackPartitioner()],
).to_executorch()
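Because the output of convert_pt2e is still a regular PyTorch module, its accuracy can be checked before lowering. Below is a minimal, illustrative sketch (not part of the flow above) that compares the float model against the quantized model on a few random inputs; in practice, representative data and a task-specific metric would be used:

# Illustrative only: compare float vs. quantized outputs on placeholder inputs.
eval_samples = [torch.randn(1, 3, 224, 224) for _ in range(4)]  # replace with real data
with torch.no_grad():
    for x in eval_samples:
        float_out = model(x)
        quant_out = quantized_model(x)
        # Signal-to-quantization-noise ratio as a rough closeness measure.
        sqnr = 20 * torch.log10(float_out.norm() / ((float_out - quant_out).norm() + 1e-9))
        print(f"SQNR: {sqnr.item():.1f} dB")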
See PyTorch 2 Export Post Training Quantization for more information.
LLM quantization with quantize_#
The XNNPACK backend also supports quantizing models with the torchao quantize_ API. This is most commonly used for LLMs, which require more advanced quantization. Since quantize_ is not backend aware, it is important to use a config that is compatible with CPU/XNNPACK:

- Quantize embeddings with IntxWeightOnlyConfig (with weight_dtype torch.int2, torch.int4, or torch.int8, using PerGroup or PerAxis granularity).
- Quantize linear layers with Int8DynamicActivationIntxWeightConfig (with weight_dtype=torch.int4, using PerGroup or PerAxis granularity).
Below is a simple example, but a more detailed tutorial including accuracy evaluation on popular LLM benchmarks can be found in the torchao documentation.
import torch

from torchao.quantization.granularity import PerGroup, PerAxis
from torchao.quantization.quant_api import (
    IntxWeightOnlyConfig,
    Int8DynamicActivationIntxWeightConfig,
    quantize_,
)

# eager_model: an eager-mode LLM (a torch.nn.Module with Embedding and Linear layers)

# Quantize embeddings with 8-bits, per channel
embedding_config = IntxWeightOnlyConfig(
    weight_dtype=torch.int8,
    granularity=PerAxis(0),
)
quantize_(
    eager_model,
    embedding_config,
    lambda m, fqn: isinstance(m, torch.nn.Embedding),
)

# Quantize linear layers with 8-bit dynamic activations and 4-bit weights
linear_config = Int8DynamicActivationIntxWeightConfig(
    weight_dtype=torch.int4,
    weight_granularity=PerGroup(32),
)
quantize_(eager_model, linear_config)
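Once quantize_ has rewritten the eager model in place, it can be exported and lowered to XNNPACK using the same flow shown earlier. A minimal sketch, assuming eager_model is the quantized LLM and example_inputs is a placeholder tuple of sample inputs:

from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge_transform_and_lower

# example_inputs is a hypothetical tuple of sample inputs for eager_model.
et_program = to_edge_transform_and_lower(
    torch.export.export(eager_model, example_inputs),
    partitioner=[XnnpackPartitioner()],
).to_executorch()

# Write the program out so the ExecuTorch runtime can load it.
with open("model_xnnpack_quantized.pte", "wb") as f:
    f.write(et_program.buffer)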