(torchao_hf_integration)=
# Hugging Face Integration

```{contents}
:local:
:depth: 2
```

(usage-examples)=
## Quick Start: Usage Example

First, install the required packages.

```bash
pip install git+https://github.com/huggingface/transformers@main
pip install git+https://github.com/huggingface/diffusers@main
pip install torchao
pip install torch
pip install accelerate
```

(quantizing-models-transformers)=
### 1. Quantizing Models with Transformers

Below is an example of applying `Float8DynamicActivationInt4WeightConfig` to the Llama-3.2-1B model.

```python
from transformers import TorchAoConfig, AutoModelForCausalLM
from torchao.quantization import Float8DynamicActivationInt4WeightConfig

# Create the quantization configuration
quantization_config = TorchAoConfig(
    quant_type=Float8DynamicActivationInt4WeightConfig(group_size=128, use_hqq=True)
)

# Load and automatically quantize the model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    torch_dtype="auto",
    device_map="auto",
    quantization_config=quantization_config
)
```

```{seealso}
For inference examples and recommended quantization methods for different hardware (e.g. A100 GPU, H100 GPU, CPU), see [HF-TorchAO Docs (Quantization Examples)](https://huggingface.co/docs/transformers/main/en/quantization/torchao#quantization-examples).

For inference using vLLM, see [(Part 3) Serving on vLLM, SGLang, ExecuTorch](https://docs.pytorch.org/ao/main/serving.html) for a full end-to-end tutorial.
```

(quantizing-models-diffusers)=
### 2. Quantizing Models with Diffusers

Below is an example of quantizing a Flux transformer with Diffusers.

```python
import torch
from diffusers import FluxPipeline, FluxTransformer2DModel, TorchAoConfig

model_id = "black-forest-labs/FLUX.1-dev"
dtype = torch.bfloat16

quantization_config = TorchAoConfig("int8wo")
transformer = FluxTransformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=dtype,
)
pipe = FluxPipeline.from_pretrained(
    model_id,
    transformer=transformer,
    torch_dtype=dtype,
)
pipe.to("cuda")

prompt = "A cat holding a sign that says hello world"
image = pipe(prompt, num_inference_steps=4, guidance_scale=0.0).images[0]
image.save("output.png")
```

```{note}
Example output:

![Model output](output.png "Model Output")
```

```{seealso}
Please refer to the [HF-TorchAO-Diffusers Docs](https://huggingface.co/docs/diffusers/en/quantization/torchao) for more examples and benchmarking results.
```

(saving-models)=
## Saving the Model

After quantizing the model, we can save it.

```python
import tempfile

# Save the quantized model (see below for safe_serialization enablement progress)
with tempfile.TemporaryDirectory() as tmp_dir:
    model.save_pretrained(tmp_dir, safe_serialization=False)

# Optional: push to the Hub (uncomment the following lines)
# from transformers import AutoTokenizer
# save_to = "your-username/Llama-3.2-1B-int4"
# model.push_to_hub(save_to, safe_serialization=False)
# tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
# tokenizer.push_to_hub(save_to)
```

**Current status of safetensors support**: TorchAO quantized models cannot yet be serialized with safetensors due to tensor subclass limitations. When saving quantized models, you must pass `safe_serialization=False`.

```python
# Don't serialize the model with safetensors
output_dir = "llama3-8b-int4wo-128"
quantized_model.save_pretrained(output_dir, safe_serialization=False)
```

**Workaround**: For production use, save models with `safe_serialization=False` when pushing to the Hugging Face Hub.
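Loading the checkpoint back goes through the usual `from_pretrained` path; the quantization settings are stored alongside the weights, so no `quantization_config` argument is needed. The snippet below is a minimal sketch, where `quantized_model_dir` is a placeholder for the directory or Hub repo you saved to, and it assumes a recent torchao release whose tensor subclasses can be deserialized by `torch.load`.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Reload the quantized checkpoint; "quantized_model_dir" is a placeholder
# for the local directory or Hub repo id you saved the model to above.
loaded_model = AutoModelForCausalLM.from_pretrained(
    "quantized_model_dir",
    torch_dtype="auto",
    device_map="auto",
)

# Quick smoke test to confirm the reloaded model still generates text
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
inputs = tokenizer("What are we having for dinner?", return_tensors="pt").to(loaded_model.device)
outputs = loaded_model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```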
**Future Work**: The TorchAO team is actively working on safetensors support for tensor subclasses. Track progress [here](https://github.com/pytorch/ao/issues/2338) and [here](https://github.com/pytorch/ao/pull/2881).

(Supported-Quantization-Types)=
## Supported Quantization Types

Weight-only quantization stores the model weights in a low-bit data type but performs computation in a higher-precision data type, such as `bfloat16`. This lowers the memory requirements of the model weights but retains the memory peaks from activation computation.

Dynamic activation quantization stores the model weights in a low-bit dtype and also quantizes the activations on the fly to save additional memory. This lowers the memory requirements of the model weights while also reducing the memory overhead of activation computation. However, it can come with a quality tradeoff, so it is recommended to test models thoroughly. A minimal comparison of the two approaches is sketched after the note below.

```{note}
Please refer to the [torchao docs](https://docs.pytorch.org/ao/main/api_ref_quantization.html) for the full list of supported quantization types.
```
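As a rough illustration of the difference, the sketch below contrasts a weight-only config with a dynamic-activation config, using `Int8WeightOnlyConfig` and `Int8DynamicActivationInt8WeightConfig` as representative choices; the exact set of available configs depends on your torchao version.

```python
from transformers import TorchAoConfig, AutoModelForCausalLM
from torchao.quantization import (
    Int8WeightOnlyConfig,
    Int8DynamicActivationInt8WeightConfig,
)

# Weight-only: weights stored in int8, activations stay in the compute dtype (e.g. bfloat16)
weight_only_config = TorchAoConfig(quant_type=Int8WeightOnlyConfig())

# Dynamic activation + weight: weights stored in int8, activations quantized on the fly
dynamic_config = TorchAoConfig(quant_type=Int8DynamicActivationInt8WeightConfig())

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    torch_dtype="auto",
    device_map="auto",
    quantization_config=weight_only_config,  # or dynamic_config
)
```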