How to build model assets for Snapdragon NPU devices
These instructions demonstrate generating the Llama 3.2 3B model. You can use the same instructions to generate the Phi-3.5 mini instruct model.
Setup and prerequisites
- Sign up for Qualcomm AI Hub access
Once signed up, configure your Qualcomm AI Hub API token by following the instructions at https://app.aihub.qualcomm.com/docs/hub/getting_started.html#getting-started
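For example, once the qai-hub client is installed (see the getting started guide above), the token can be configured from the command line; replace <API_TOKEN> with your own token:
qai-hub configure --api_token <API_TOKEN>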
- Install the Qualcomm AI Engine Direct SDK
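Later steps assume the SDK location is available in the QNN_SDK_ROOT environment variable. The path below is only a placeholder; point it at your actual SDK install directory:
export QNN_SDK_ROOT=/path/to/qnn-sdk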
- Sign up to get access to the Hugging Face weights for Llama-3.2-3B
This step is only required for models that require signing a license agreement.
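For gated models you also need to authenticate your local environment so the weights can be downloaded, for example with the Hugging Face CLI (this assumes huggingface-cli is installed):
huggingface-cli login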
- Set up a Linux environment
There are steps in this process that can only be run on Linux; a WSL environment suffices.
Install libc++-dev in the Linux environment:
sudo apt-get install libc++-dev
Generate Qualcomm context binaries
- Install the model from Qualcomm AI Hub
python -m pip install -U qai_hub_models[llama-v3-2-3b-chat-quantized]
- Generate the QNN context binaries
This step downloads and uploads the model and binaries to and from Qualcomm AI Hub and, depending on your connection speed, can take several hours.
python -m qai_hub_models.models.llama_v3_2_3b_chat_quantized.export --device "Snapdragon X Elite CRD" --skip-inferencing --skip-profiling --output-dir .
More information on this step can be found at: https://github.com/quic/ai-hub-apps/tree/main/tutorials/llm_on_genie.
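When the export completes, the output directory should contain the QNN context binaries for the model. The exact file names and number of parts vary by model and tooling version, so the listing below is only illustrative:
ls *.bin
# one or more context binaries, e.g. <model_name>_part_1_of_N.bin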
Generate ONNX wrapper models
- Download the following script from the onnxruntime repo:
curl -LO https://raw.githubusercontent.com/microsoft/onnxruntime/refs/heads/main/onnxruntime/python/tools/qnn/gen_qnn_ctx_onnx_model.py
- Extract the QNN graph information from each QNN context binary (.bin) file
Note: this utility only runs on Linux with libc++-dev installed (from the setup section).
for bin_file in *.bin; do $QNN_SDK_ROOT/bin/x86_64-linux-clang/qnn-context-binary-utility --context_binary="$bin_file" --json_file="${bin_file%.bin}.json"; done
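Each context binary should now have a matching JSON file. The bash snippet below is a convenience check, not part of the official tooling:
for bin_file in *.bin; do [ -f "${bin_file%.bin}.json" ] || echo "Missing JSON for $bin_file"; done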
- Generate the ONNX wrapper models
Run the following command to generate an ONNX wrapper model for each context binary.
On Linux with bash:
for bin_file in *.bin; do python gen_qnn_ctx_onnx_model.py -b "$bin_file" -q "${bin_file%.bin}.json" --quantized_IO --disable_embed_mode; done
On Windows with PowerShell:
Get-ChildItem -Filter "*.bin" | ForEach-Object {
    $binFile = $_.Name
    $jsonFile = $binFile -replace '\.bin$', '.json'
    python gen_qnn_ctx_onnx_model.py -b $binFile -q $jsonFile --quantized_IO --disable_embed_mode
}
Add other assets
Download the remaining model assets from https://huggingface.co/onnx-community/Llama-3.2-3B-instruct-hexagon-npu-assets
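One way to fetch these files is with the Hugging Face CLI (this assumes huggingface-cli is installed; you can equally download the files manually from the model page):
huggingface-cli download onnx-community/Llama-3.2-3B-instruct-hexagon-npu-assets --local-dir .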
Check model assets
Once the above instructions are complete, you should have the following model assets:
- genai_config.json
- tokenizer.json
- tokenizer_config.json
- special_tokens_map.json
- quantizer.onnx
- dequantizer.onnx
- position-processor.onnx
- a set of transformer model binaries:
  - Qualcomm context binaries (*.bin)
  - Context binary metadata (*.json)
  - ONNX wrapper models (*.onnx)
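As a final sanity check, the bash snippet below (a convenience sketch, not part of the official tooling) verifies that the fixed-name assets listed above are present in the model directory; it does not cover the *.bin/*.json/*.onnx transformer binaries, whose names vary by model:
for f in genai_config.json tokenizer.json tokenizer_config.json special_tokens_map.json quantizer.onnx dequantizer.onnx position-processor.onnx; do [ -f "$f" ] || echo "Missing: $f"; done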