
Exporting Llama 3.2 1B/3B Instruct to ExecuTorch Vulkan and running on device

This tutorial assumes that you have a working local copy of the ExecuTorch repo and have installed the executorch pip package, either directly or by building from source.

This tutorial also assumes that you have the Android SDK tools installed and that you are able to connect to an Android device via adb.
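You can confirm that the device is reachable before proceeding:

adb devices

The device should be listed with the state device; if it shows unauthorized, accept the USB debugging prompt on the device.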

Finally, the Android NDK should also be installed, and an ANDROID_NDK environment variable should point to the root directory of the NDK.

export ANDROID_NDK=<path_to_ndk>
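The build steps later in this tutorial use the NDK's CMake toolchain file, so a quick way to verify the path is correct is to check that the toolchain file exists:

ls $ANDROID_NDK/build/cmake/android.toolchain.cmake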

Download the Llama 3.2 1B/3B Instruct model checkpoint and tokenizer

The model checkpoint and tokenizer can be downloaded from the Meta Llama website.

The model files should be downloaded to ~/.llama/checkpoints/Llama3.2-1B-Instruct (or ~/.llama/checkpoints/Llama3.2-3B-Instruct for the 3B variant).
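After downloading, the directory should contain the checkpoint, parameter, and tokenizer files that the export and deployment steps below reference:

ls ~/.llama/checkpoints/Llama3.2-1B-Instruct
# Expect at least: consolidated.00.pth  params.json  tokenizer.model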

Export the Llama 3.2 1B/3B model

First, navigate to the root of the ExecuTorch repo.

# Navigate to executorch root
cd ~/executorch

Then, set some environment variables that describe how the model should be exported. Adjust the values as needed; for example, set LLM_SIZE=3B to export the 3B model.

export LLM_NAME=Llama3.2
export LLM_SIZE=1B            # set to 3B to export the 3B model
export LLM_SUFFIX="-Instruct"
export QUANT=8da4w            # 8-bit dynamic activations, 4-bit grouped weights
export BACKEND=vulkan
export GROUP_SIZE=64          # group size for the 4-bit weight quantization
export CONTEXT_LENGTH=2048    # maximum context length for the exported model

Then, export the Llama 3.2 1B/3B Instruct model to ExecuTorch Vulkan. Note that the --vulkan-force-fp16 flag is set, which improves inference latency at the cost of some model accuracy. Remove the flag if you prefer to preserve accuracy.

python -m examples.models.llama.export_llama \
    -c $HOME/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/consolidated.00.pth \
    -p $HOME/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/params.json \
    -d fp32 --${BACKEND} --vulkan-force-fp16 \
    -qmode ${QUANT} -G ${GROUP_SIZE} \
    --max_seq_length ${CONTEXT_LENGTH} \
    --max_context_length ${CONTEXT_LENGTH} \
    -kv --use_sdpa_with_kv_cache \
    --metadata '{"append_eos_to_prompt": 0, "get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
    --model "llama3_2" \
    --output_name $HOME/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}_${BACKEND}_${QUANT}_g${GROUP_SIZE}_c${CONTEXT_LENGTH}.pte
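Once the export completes, a quick sanity check that the .pte file was written to the expected location:

ls -lh $HOME/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/*.pte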

After exporting the model, push the exported .pte file and the tokenizer to your device.

adb shell mkdir -p /data/local/tmp/llama && \
adb push ~/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/tokenizer.model \
  /data/local/tmp/llama/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}_tokenizer.model && \
adb push ~/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}_${BACKEND}_${QUANT}_g${GROUP_SIZE}_c${CONTEXT_LENGTH}.pte \
  /data/local/tmp/llama/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}_${BACKEND}_${QUANT}_g${GROUP_SIZE}_c${CONTEXT_LENGTH}.pte
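You can confirm that both files landed on the device:

adb shell ls -lh /data/local/tmp/llama/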

Build Core ExecuTorch Components

To run the .pte file on device, the core ExecuTorch libraries, including the Vulkan backend, must first be compiled for Android.

cmake . \
    -DCMAKE_INSTALL_PREFIX=cmake-out-android-so \
    -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
    -DANDROID_SUPPORT_FLEXIBLE_PAGE_SIZES=ON \
    --preset "android-arm64-v8a" \
    -DANDROID_PLATFORM=android-28 \
    -DPYTHON_EXECUTABLE=python \
    -DCMAKE_BUILD_TYPE=Release \
    -DEXECUTORCH_PAL_DEFAULT=posix \
    -DEXECUTORCH_BUILD_LLAMA_JNI=ON \
    -DEXECUTORCH_BUILD_EXTENSION_NAMED_DATA_MAP=ON \
    -DEXECUTORCH_BUILD_VULKAN=ON \
    -DEXECUTORCH_BUILD_TESTS=OFF \
    -Bcmake-out-android-so && \
cmake --build cmake-out-android-so -j16 --target install --config Release
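Assuming the default CMake install layout, the compiled static libraries should now be under the lib directory of the install prefix; listing it is a quick way to confirm the build and install succeeded:

ls cmake-out-android-so/lib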

Build and push the llama runner binary to Android

Then, build a binary that can be used to run the .pte file.

cmake examples/models/llama \
    -DCMAKE_INSTALL_PREFIX=cmake-out-android-so \
    -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake  \
    -DANDROID_SUPPORT_FLEXIBLE_PAGE_SIZES=ON \
    -DEXECUTORCH_ENABLE_LOGGING=ON \
    -DANDROID_ABI=arm64-v8a \
    -DANDROID_PLATFORM=android-28 \
    -DCMAKE_BUILD_TYPE=Release \
    -DPYTHON_EXECUTABLE=python \
    -Bcmake-out-android-so/examples/models/llama && \
cmake --build cmake-out-android-so/examples/models/llama -j16 --config Release
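Optionally, confirm that the binary was cross-compiled for the device's architecture (this assumes the file utility is available on your host):

file cmake-out-android-so/examples/models/llama/llama_main
# Expect something like: ELF 64-bit LSB ... ARM aarch64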

Once the binary is built, it can be pushed to your Android device.

adb shell mkdir -p /data/local/tmp/etvk/ && \
adb push cmake-out-android-so/examples/models/llama/llama_main /data/local/tmp/etvk/

Execute the llama runner binary

Finally, execute the lowered .pte file on the device.

adb shell /data/local/tmp/etvk/llama_main \
  --model_path=/data/local/tmp/llama/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}_${BACKEND}_${QUANT}_g${GROUP_SIZE}_c${CONTEXT_LENGTH}.pte \
  --tokenizer_path=/data/local/tmp/llama/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}_tokenizer.model \
  --temperature=0 --seq_len=400 --warmup \
  --prompt=\"\<\|begin_of_text\|\>\<\|start_header_id\|\>system\<\|end_header_id\|\>Write me a short poem.\<\|eot_id\|\>\<\|start_header_id\|\>assistant\<\|end_header_id\|\>\"

Here is some sample output captured from a Galaxy S24. The hf_tokenizer.cpp error at the top is benign: the runner first attempts to parse the tokenizer as a HuggingFace JSON tokenizer and falls back to the Llama tokenizer.model format when that fails.

E tokenizers:hf_tokenizer.cpp:60] Error parsing json file: [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - invalid literal; last read: 'I'
<|begin_of_text|><|start_header_id|>system<|end_header_id|>Write me a short poem.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Here is a short poem I came up with:

"Moonlight whispers secrets to the night
A gentle breeze that rustles the light
The stars up high, a twinkling show
A peaceful world, where dreams grow slow"

I hope you enjoy it!<|eot_id|>

PyTorchObserver {"prompt_tokens":14,"generated_tokens":54,"model_load_start_ms":1760077800721,"model_load_end_ms":1760077802998,"inference_start_ms":1760077802998,"inference_end_ms":1760077804187,"prompt_eval_end_ms":1760077803162,"first_token_ms":1760077803162,"aggregate_sampling_time_ms":19,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
        Prompt Tokens: 14    Generated Tokens: 54
        Model Load Time:                2.277000 (seconds)
        Total inference time:           1.189000 (seconds)               Rate:  45.416316 (tokens/second)
                Prompt evaluation:      0.164000 (seconds)               Rate:  85.365854 (tokens/second)
                Generated 54 tokens:    1.025000 (seconds)               Rate:  52.682927 (tokens/second)
        Time to first generated token:  0.164000 (seconds)
        Sampling time over 68 tokens:   0.019000 (seconds)
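The summary rates can be recomputed from the PyTorchObserver JSON line. As a minimal sketch, assuming the runner output was saved to run.log (for example by appending | tee run.log to the adb shell command above) and that jq is installed on the host:

# Extract the PyTorchObserver JSON and derive prefill/decode tokens per second.
# Timestamps are in milliseconds; SCALING_FACTOR_UNITS_PER_SECOND converts to seconds.
sed -n 's/^PyTorchObserver //p' run.log | jq '{
  prefill_tok_per_s: (.prompt_tokens
    / ((.prompt_eval_end_ms - .inference_start_ms) / .SCALING_FACTOR_UNITS_PER_SECOND)),
  decode_tok_per_s: (.generated_tokens
    / ((.inference_end_ms - .prompt_eval_end_ms) / .SCALING_FACTOR_UNITS_PER_SECOND))
}'

For the run above this yields roughly 85.4 prefill and 52.7 decode tokens/second, matching the printed summary.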