autoquant

torchao.quantization.autoquant(model, example_input=None, qtensor_class_list=DEFAULT_AUTOQUANT_CLASS_LIST, filter_fn=None, mode=['interpolate', 0.85], manual=False, set_inductor_config=True, supress_autoquant_errors=True, min_sqnr=None, **aq_kwargs)

Autoquantization is a process which identifies the fastest way to quantize each layer of a model over some set of potential qtensor subclasses.

Autoquantization happens in three steps:

1. Prepare Model: the model is searched for Linear layers whose weights are exchanged for AutoQuantizableLinearWeight.

2. Shape Calibration: the user runs the model on one or more inputs; the activation shapes/dtypes seen by each AutoQuantizableLinearWeight are recorded so we know which shapes/dtypes to use in order to optimize the quantized op in step 3.

3. Finalize Autoquantization: for each AutoQuantizableLinearWeight, benchmarks are run for each recorded shape/dtype on each member of the qtensor_class_list; the fastest option is picked, resulting in a highly performant model.
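The three steps above can be illustrated with a toy, pure-Python sketch. This is NOT torchao's implementation: the candidate "qtensor classes" are stand-in functions, and the benchmark is a simple wall-clock timer, but the calibrate-then-benchmark-then-pick flow mirrors the process described here.

```python
import time

# Toy sketch of autoquant's search, NOT torchao's implementation.
# Each candidate "quantization" is modeled as a function applying a linear op.

def benchmark(fn, weight, activation, iters=10):
    """Time fn(weight, activation) over a few iterations (step 3)."""
    start = time.perf_counter()
    for _ in range(iters):
        fn(weight, activation)
    return time.perf_counter() - start

def matmul(weight, activation):
    # Plain float "linear" op: weight is rows x cols, activation is cols-long.
    return [sum(w * a for w, a in zip(row, activation)) for row in weight]

def int8_matmul(weight, activation):
    # Crude stand-in for an int8 weight-only quantized op.
    qweight = [[round(w * 127) for w in row] for row in weight]
    return [sum(q / 127 * a for q, a in zip(row, activation)) for row in qweight]

candidates = {"default_float": matmul, "int8_weight_only": int8_matmul}

# Step 1 (prepare) + step 2 (shape calibration): the activation shapes seen
# by each "layer" are logged while the model runs on example inputs.
weight = [[0.1 * (i + j) for j in range(8)] for i in range(8)]
seen_shapes = [8]  # activation sizes recorded during calibration runs

# Step 3 (finalize): benchmark every candidate on every recorded shape
# and keep the fastest.
timings = {}
for name, fn in candidates.items():
    timings[name] = sum(benchmark(fn, weight, [1.0] * n) for n in seen_shapes)
fastest = min(timings, key=timings.get)
print(f"selected: {fastest}")
```

In the real API, step 3 additionally applies the min_sqnr accuracy filter before picking the fastest candidate.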

This autoquant function performs step 1. Steps 2 and 3 can be completed by simply running the model. If example_input is provided, this function also runs the model (which completes steps 2 and 3). This autoquant API can handle models that have already had torch.compile applied to them; in that case, once the model is run and quantized, the torch.compile process proceeds as normal.

To optimize over a combination of input shapes/dtypes, the user can set manual=True, run the model with all desired shapes/dtypes, then call model.finalize_autoquant to finalize the quantization once the desired set of inputs has been logged.

Parameters:
  • model (torch.nn.Module) – The model to be autoquantized.

  • example_input (Any, optional) – An example input for the model. If provided, the function performs a forward pass on this input (which fully autoquantizes the model unless manual=True). Defaults to None.

  • qtensor_class_list (list, optional) – A list of tensor classes to be used for quantization. Defaults to DEFAULT_AUTOQUANT_CLASS_LIST.

  • filter_fn (callable, optional) – A filter function to apply to the model parameters. Defaults to None.

  • mode (list, optional) – A list containing mode settings for quantization. The first element is the mode type (e.g., “interpolate”), and the second element is the mode value (e.g., 0.85). Defaults to [“interpolate”, .85].

  • manual (bool, optional) – Whether to stop shape calibration and finalize autoquant after a single run (False, the default) or to wait for the user to call model.finalize_autoquant (True) so that inputs with several shapes/dtypes can be logged.

  • set_inductor_config (bool, optional) – Whether to automatically use recommended inductor config settings (defaults to True)

  • supress_autoquant_errors (bool, optional) – Whether to suppress errors during autoquantization. (defaults to True)

  • min_sqnr (float, optional) – Minimum acceptable signal-to-quantization-noise ratio (https://en.wikipedia.org/wiki/Signal-to-quantization-noise_ratio) for the output of a quantized layer vs. the non-quantized layer. This is used to filter out quantization methods that cause too large a numerical impact; the user can start with a reasonable number like 40 and adjust depending on the result. Defaults to None.

  • **aq_kwargs – Additional keyword arguments for the autoquantization process.
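To make the min_sqnr filter concrete, SQNR in decibels can be computed as 10 * log10(signal_power / noise_power), where the noise is the difference between the full-precision and quantized layer outputs. The helper below is an illustrative sketch, not torchao's internal code; the sample values are made up.

```python
import math

def sqnr_db(reference, quantized):
    """Signal-to-quantization-noise ratio in dB between two equal-length
    sequences: 10 * log10(signal_power / noise_power)."""
    signal_power = sum(r * r for r in reference)
    noise_power = sum((r - q) ** 2 for r, q in zip(reference, quantized))
    if noise_power == 0:
        return math.inf  # perfect reconstruction
    return 10 * math.log10(signal_power / noise_power)

# A layer output and a slightly perturbed "quantized" version of it
# (hypothetical values for illustration).
ref = [1.0, -2.0, 3.0, 0.5]
quant = [1.01, -1.98, 2.97, 0.51]

score = sqnr_db(ref, quant)
# With min_sqnr=40, a candidate scoring below 40 dB would be filtered out.
print(f"{score:.1f} dB, passes min_sqnr=40: {score >= 40}")
```

Higher SQNR means the quantized output is closer to the full-precision output, so raising min_sqnr trades speed options for accuracy.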

Returns:

The autoquantized and wrapped model. If example_input is provided, the function performs a forward pass on the input and returns the result of the forward pass.

Return type:

torch.nn.Module

Example usage:

torchao.autoquant(torch.compile(model))
model(*example_input)

# multiple input shapes
torchao.autoquant(model, manual=True)
model(*example_input1)
model(*example_input2)
model.finalize_autoquant()
