How to build model assets for Snapdragon NPU devices

These instructions demonstrate generating the Llama 3.2 3B model. You can use the same instructions to generate the Phi-3.5 mini instruct model.

Setup and prerequisites

  1. Sign up for Qualcomm AI Hub access

    Once signed up, configure your Qualcomm AI Hub API token

    Follow instructions shown here: https://app.aihub.qualcomm.com/docs/hub/getting_started.html#getting-started
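
    If the AI Hub Python client is not installed yet, the token can be configured from the command line. The linked Getting Started page is the authoritative reference; as a rough sketch:

    python -m pip install qai-hub
    qai-hub configure --api_token <your-api-token>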

  2. Install the Qualcomm AI Engine Direct SDK
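
    Later steps reference the SDK install location through the QNN_SDK_ROOT environment variable. The path below is only an example; point it at wherever you installed the SDK:

    # Example path only: adjust to your actual SDK install location and version
    export QNN_SDK_ROOT=/opt/qcom/aistack/qairt/<version>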

  3. Sign up to get access to HuggingFace weights for Llama-3.2-3B

    This step is only required for models that require signing a license agreement.
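
    Once access is granted, authenticate locally so the gated weights can be downloaded. One common way, assuming the huggingface_hub CLI is installed (an assumption, not part of the Qualcomm flow):

    python -m pip install -U "huggingface_hub[cli]"
    huggingface-cli login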

  4. Setup a Linux environment

    There are steps in this process that can only be run on Linux. A WSL environment suffices.

    Install libc++-dev in the Linux environment.

    sudo apt install libc++-dev
    

Generate Qualcomm context binaries

  1. Install the model from Qualcomm AI Hub

    python -m pip install -U "qai_hub_models[llama-v3-2-3b-chat-quantized]"
    
  2. Generate QNN context binaries

    Generate the QNN binaries. This step uploads the model to Qualcomm AI Hub and downloads the compiled binaries back, so depending on your connection speed it can take several hours.

    python -m qai_hub_models.models.llama_v3_2_3b_chat_quantized.export --device "Snapdragon X Elite CRD" --skip-inferencing --skip-profiling --output-dir .
    

    More information on this step can be found at: https://github.com/quic/ai-hub-apps/tree/main/tutorials/llm_on_genie.
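
    When the export finishes, the output directory should contain one or more QNN context binaries; a quick sanity check:

    ls -lh *.bin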

Generate ONNX wrapper models

  1. Download the following script from the onnxruntime repo

    curl -LO https://raw.githubusercontent.com/microsoft/onnxruntime/refs/heads/main/onnxruntime/python/tools/qnn/gen_qnn_ctx_onnx_model.py
    
  2. Extract the QNN graph information from the QNN context binary files, once for each model (.bin file)

    Note: the qnn-context-binary-utility tool only runs on Linux with libc++-dev installed (from the setup section)

    for bin_file in *.bin; do $QNN_SDK_ROOT/bin/x86_64-linux-clang/qnn-context-binary-utility --context_binary="$bin_file" --json_file="${bin_file%.bin}.json"; done
    
  3. Generate the ONNX wrapper models

    Run the command for your platform to generate the ONNX wrapper models.

    On Linux with bash:

    for bin_file in *.bin; do python gen_qnn_ctx_onnx_model.py -b "$bin_file" -q "${bin_file%.bin}.json" --quantized_IO --disable_embed_mode; done
    

    On Windows with PowerShell:

    Get-ChildItem -Filter "*.bin" | ForEach-Object {
      $binFile = $_.Name
      $jsonFile = "$($binFile -replace '\.bin$', '.json')"
      python gen_qnn_ctx_onnx_model.py -b $binFile -q $jsonFile --quantized_IO --disable_embed_mode
    }
    

Add other assets

Download assets from https://huggingface.co/onnx-community/Llama-3.2-3B-instruct-hexagon-npu-assets
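
You can fetch the files through the web page, or from the command line. One way, assuming the huggingface_hub CLI from the setup section is installed:

    huggingface-cli download onnx-community/Llama-3.2-3B-instruct-hexagon-npu-assets --local-dir .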

Check model assets

Once the above instructions are complete, you should have the following model assets:

  • genai_config.json
  • tokenizer.json
  • tokenizer_config.json
  • special_tokens_map.json
  • quantizer.onnx
  • dequantizer.onnx
  • position-processor.onnx
  • a set of transformer model binaries
    • Qualcomm context binaries (*.bin)
    • Context binary metadata (*.json)
    • ONNX wrapper models (*.onnx)
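
A quick way to confirm the required files are present (a minimal sketch; adjust the file list if you generated a different model):

    for f in genai_config.json tokenizer.json tokenizer_config.json special_tokens_map.json \
             quantizer.onnx dequantizer.onnx position-processor.onnx; do
      [ -f "$f" ] || echo "missing: $f"
    done
    ls *.bin *.onnx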