# Exporting Llama 3.2 1B/3B Instruct to ExecuTorch Vulkan and running on device

This tutorial assumes that you have a working local copy of the ExecuTorch repo and have installed the `executorch` pip package, either directly or by building from source. It also assumes that the Android SDK tools are installed and that you can connect to an Android device via `adb`. Finally, the Android NDK should be installed, with an environment variable `ANDROID_NDK` pointing to the root directory of the NDK.

```shell
# Set this to the root directory of your NDK installation
export ANDROID_NDK=
```

## Download the Llama 3.2 1B/3B Instruct model checkpoint and tokenizer

The model checkpoint and tokenizer can be downloaded from the [Meta Llama website](https://www.llama.com/llama-downloads/). The model files should be downloaded to `~/.llama/checkpoints/Llama3.2-1B-Instruct` (or the corresponding directory for the 3B model).

## Export the Llama 3.2 1B/3B model

First, navigate to the root of the ExecuTorch repo.

```shell
# Navigate to executorch root
cd ~/executorch
```

Then, set some environment variables to describe how the model should be exported. Feel free to tune the values to your preferences.

```shell
export LLM_NAME=Llama3.2 && \
export LLM_SIZE=1B && \
export LLM_SUFFIX="-Instruct" && \
export QUANT=8da4w && \
export BACKEND=vulkan && \
export GROUP_SIZE=64 && \
export CONTEXT_LENGTH=2048
```

Then, export the Llama 3.2 1B/3B Instruct model to ExecuTorch Vulkan. Note that the `--vulkan-force-fp16` flag is set, which improves inference latency at the cost of model accuracy. Feel free to remove this flag.

```shell
python -m examples.models.llama.export_llama \
    -c $HOME/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/consolidated.00.pth \
    -p $HOME/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/params.json \
    -d fp32 --${BACKEND} --vulkan-force-fp16 \
    -qmode ${QUANT} -G ${GROUP_SIZE} \
    --max_seq_length ${CONTEXT_LENGTH} \
    --max_context_length ${CONTEXT_LENGTH} \
    -kv --use_sdpa_with_kv_cache \
    --metadata '{"append_eos_to_prompt": 0, "get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
    --model "llama3_2" \
    --output_name $HOME/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}_${BACKEND}_${QUANT}_g${GROUP_SIZE}_c${CONTEXT_LENGTH}.pte
```

After exporting the model, push the exported `.pte` file and the tokenizer to your device.

```shell
adb shell mkdir -p /data/local/tmp/llama && \
adb push ~/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/tokenizer.model \
    /data/local/tmp/llama/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}_tokenizer.model && \
adb push ~/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}_${BACKEND}_${QUANT}_g${GROUP_SIZE}_c${CONTEXT_LENGTH}.pte \
    /data/local/tmp/llama/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}_${BACKEND}_${QUANT}_g${GROUP_SIZE}_c${CONTEXT_LENGTH}.pte
```

## Build core ExecuTorch components

To run the `.pte` file on device, the core libraries, including the Vulkan backend, must first be compiled for Android.
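Before building, it can be worth confirming that the connected device actually exposes a Vulkan driver. One quick sanity check (this queries standard Android system features and is not specific to ExecuTorch):

```shell
# A device with a Vulkan driver typically reports features such as
# android.hardware.vulkan.level and android.hardware.vulkan.version.
adb shell pm list features | grep vulkan
```

With Vulkan support confirmed, configure and build the core libraries.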
```shell
cmake . \
    -DCMAKE_INSTALL_PREFIX=cmake-out-android-so \
    -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
    -DANDROID_SUPPORT_FLEXIBLE_PAGE_SIZES=ON \
    --preset "android-arm64-v8a" \
    -DANDROID_PLATFORM=android-28 \
    -DPYTHON_EXECUTABLE=python \
    -DCMAKE_BUILD_TYPE=Release \
    -DEXECUTORCH_PAL_DEFAULT=posix \
    -DEXECUTORCH_BUILD_LLAMA_JNI=ON \
    -DEXECUTORCH_BUILD_EXTENSION_NAMED_DATA_MAP=ON \
    -DEXECUTORCH_BUILD_VULKAN=ON \
    -DEXECUTORCH_BUILD_TESTS=OFF \
    -Bcmake-out-android-so && \
cmake --build cmake-out-android-so -j16 --target install --config Release
```

## Build and push the llama runner binary to Android

Then, build a binary that can be used to run the `.pte` file.

```shell
cmake examples/models/llama \
    -DCMAKE_INSTALL_PREFIX=cmake-out-android-so \
    -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
    -DANDROID_SUPPORT_FLEXIBLE_PAGE_SIZES=ON \
    -DEXECUTORCH_ENABLE_LOGGING=ON \
    -DANDROID_ABI=arm64-v8a \
    -DANDROID_PLATFORM=android-28 \
    -DCMAKE_BUILD_TYPE=Release \
    -DPYTHON_EXECUTABLE=python \
    -Bcmake-out-android-so/examples/models/llama && \
cmake --build cmake-out-android-so/examples/models/llama -j16 --config Release
```

Once the binary is built, push it to your Android device.

```shell
adb shell mkdir -p /data/local/tmp/etvk/ && \
adb push cmake-out-android-so/examples/models/llama/llama_main /data/local/tmp/etvk/
```

## Execute the llama runner binary

Finally, execute the lowered `.pte` file on the device.

```shell
adb shell /data/local/tmp/etvk/llama_main \
    --model_path=/data/local/tmp/llama/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}_${BACKEND}_${QUANT}_g${GROUP_SIZE}_c${CONTEXT_LENGTH}.pte \
    --tokenizer_path=/data/local/tmp/llama/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}_tokenizer.model \
    --temperature=0 --seq_len=400 --warmup \
    --prompt=\"\<\|begin_of_text\|\>\<\|start_header_id\|\>system\<\|end_header_id\|\>Write me a short poem.\<\|eot_id\|\>\<\|start_header_id\|\>assistant\<\|end_header_id\|\>\"
```

Here is some sample output captured from a Galaxy S24. Note the tokenizer parse error at the top: the runner still loads the provided `tokenizer.model` successfully, and generation proceeds normally.

```shell
E tokenizers:hf_tokenizer.cpp:60] Error parsing json file: [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - invalid literal; last read: 'I'
<|begin_of_text|><|start_header_id|>system<|end_header_id|>Write me a short poem.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Here is a short poem I came up with:

"Moonlight whispers secrets to the night
A gentle breeze that rustles the light
The stars up high, a twinkling show
A peaceful world, where dreams grow slow"

I hope you enjoy it!<|eot_id|>
PyTorchObserver {"prompt_tokens":14,"generated_tokens":54,"model_load_start_ms":1760077800721,"model_load_end_ms":1760077802998,"inference_start_ms":1760077802998,"inference_end_ms":1760077804187,"prompt_eval_end_ms":1760077803162,"first_token_ms":1760077803162,"aggregate_sampling_time_ms":19,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
Prompt Tokens: 14    Generated Tokens: 54
Model Load Time: 2.277000 (seconds)
Total inference time: 1.189000 (seconds)  Rate: 45.416316 (tokens/second)
Prompt evaluation: 0.164000 (seconds)  Rate: 85.365854 (tokens/second)
Generated 54 tokens: 1.025000 (seconds)  Rate: 52.682927 (tokens/second)
Time to first generated token: 0.164000 (seconds)
Sampling time over 68 tokens: 0.019000 (seconds)
```
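The reported rates can be derived directly from the `PyTorchObserver` JSON line: prompt evaluation spans `inference_start_ms` to `prompt_eval_end_ms`, and generation spans `prompt_eval_end_ms` to `inference_end_ms`. As a quick check of the generation rate using the values above (`awk` is used here purely for the floating-point arithmetic):

```shell
# generated_tokens / ((inference_end_ms - prompt_eval_end_ms) / 1000)
awk 'BEGIN { printf "%.6f tokens/second\n", 54 / ((1760077804187 - 1760077803162) / 1000) }'
# Prints 52.682927 tokens/second, matching the "Generated 54 tokens" rate.
```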