Hugging Face Integration¶
Quick Start: Usage Example¶
First, install the required packages.
pip install git+https://github.com/huggingface/transformers@main
pip install git+https://github.com/huggingface/diffusers@main
pip install torchao
pip install torch
pip install accelerate
1. Quantizing Models with Transformers¶
Below is an example of using Float8DynamicActivationInt4WeightConfig
on the Llama-3.2-1B model.
from transformers import TorchAoConfig, AutoModelForCausalLM
from torchao.quantization import Float8DynamicActivationInt4WeightConfig

# Create the quantization configuration
quantization_config = TorchAoConfig(
    quant_type=Float8DynamicActivationInt4WeightConfig(group_size=128, use_hqq=True)
)

# Load and automatically quantize the model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    torch_dtype="auto",
    device_map="auto",
    quantization_config=quantization_config,
)
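The quantized model can then be used like any other Transformers model. Below is a minimal generation sketch (not part of the original example; the prompt and generation settings are illustrative):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

# Run a quick generation to sanity-check the quantized model
inputs = tokenizer("What are we having for dinner?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))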
See also
For inference examples and recommended quantization methods for different hardware (e.g. A100 GPU, H100 GPU, CPU), see the HF-TorchAO Docs (Quantization Examples).
For inference using vLLM, please see (Part 3) Serving on vLLM, SGLang, ExecuTorch for a full end-to-end tutorial.
2. Quantizing Models with Diffusers¶
Below is an example of quantizing the transformer of a Flux pipeline through the Diffusers integration.
import torch
from diffusers import FluxPipeline, FluxTransformer2DModel, TorchAoConfig

model_id = "black-forest-labs/FLUX.1-dev"
dtype = torch.bfloat16

# Quantize only the transformer with int8 weight-only quantization
quantization_config = TorchAoConfig("int8wo")
transformer = FluxTransformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=dtype,
)
pipe = FluxPipeline.from_pretrained(
    model_id,
    transformer=transformer,
    torch_dtype=dtype,
)
pipe.to("cuda")

prompt = "A cat holding a sign that says hello world"
image = pipe(prompt, num_inference_steps=4, guidance_scale=0.0).images[0]
image.save("output.png")
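As a rough way to observe the effect of quantizing the transformer, you can check peak CUDA memory around the pipeline call. This is a sketch added for illustration, not part of the original example:
import torch

# Reset the peak-memory counter, run the pipeline, and report the peak
torch.cuda.reset_peak_memory_stats()
image = pipe(prompt, num_inference_steps=4, guidance_scale=0.0).images[0]
print(f"peak CUDA memory: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")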
See also
Please refer to the HF-TorchAO-Diffusers Docs for more examples and benchmarking results.
Saving the Model¶
After we quantize the model, we can save it.
import tempfile
from transformers import AutoTokenizer

# Save quantized model (see below for safe_serialization enablement progress)
with tempfile.TemporaryDirectory() as tmp_dir:
    model.save_pretrained(tmp_dir, safe_serialization=False)

# optional: push to hub (uncomment the following lines)
# save_to = "your-username/Llama-3.2-1B-int4"
# model.push_to_hub(save_to, safe_serialization=False)
# tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
# tokenizer.push_to_hub(save_to)
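Once pushed, the checkpoint can be reloaded with from_pretrained; the TorchAO quantization config is saved alongside the weights, so no extra arguments are needed. A minimal sketch, assuming the hypothetical save_to repo above has been pushed:
from transformers import AutoModelForCausalLM

# "your-username/Llama-3.2-1B-int4" is the hypothetical repo pushed above
loaded_model = AutoModelForCausalLM.from_pretrained(
    "your-username/Llama-3.2-1B-int4",
    device_map="auto",
    torch_dtype="auto",
)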
Current status of safetensors support: TorchAO quantized models cannot yet be serialized with safetensors due to tensor subclass limitations. When saving quantized models, you must pass safe_serialization=False.
# don't serialize the model with safetensors
output_dir = "llama3-8b-int4wo-128"
quantized_model.save_pretrained(output_dir, safe_serialization=False)
Workaround: For production use, save models with safe_serialization=False
when pushing to Hugging Face Hub.
Future Work: The TorchAO team is actively working on safetensors support for tensor subclasses. Track progress here and here.
Supported Quantization Types¶
Weight-only quantization stores the model weights in a specific low-bit data type but performs computation in a higher-precision data type, like bfloat16. This lowers the memory requirements from model weights but retains the memory peaks for activation computation.
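As a minimal sketch of the weight-only flavor using TorchAO's quantize_ API directly (the toy module and the Int4WeightOnlyConfig settings are illustrative choices, not prescribed by this page):
import torch
from torchao.quantization import quantize_, Int4WeightOnlyConfig

# Toy module for illustration; weights are stored in int4,
# while matmuls still run against bfloat16 activations
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).to(torch.bfloat16).to("cuda")
quantize_(model, Int4WeightOnlyConfig(group_size=128))

x = torch.randn(1, 1024, dtype=torch.bfloat16, device="cuda")
y = model(x)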
Dynamic activation quantization stores the model weights in a low-bit dtype while also quantizing the activations on the fly to save additional memory. This lowers the memory requirements from model weights and also lowers the memory overhead from activation computations. However, it can come with a quality tradeoff, so it is recommended to evaluate quantized models thoroughly. A sketch of this flavor follows.
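The dynamic-activation flavor follows the same quantize_ pattern; Int8DynamicActivationInt8WeightConfig below is an illustrative choice:
import torch
from torchao.quantization import quantize_, Int8DynamicActivationInt8WeightConfig

# Weights are stored in int8; activations are quantized to int8
# on the fly during each forward pass
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).to(torch.bfloat16).to("cuda")
quantize_(model, Int8DynamicActivationInt8WeightConfig())

x = torch.randn(1, 1024, dtype=torch.bfloat16, device="cuda")
y = model(x)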
Note
Please refer to the torchao docs for supported quantization types.