Run SLMs on Snapdragon devices with NPUs
Learn how to run small language models (SLMs) on Snapdragon devices with ONNX Runtime.
Models
The following models are currently supported:
- Phi-3.5 mini instruct
- Llama 3.2 3B
Devices with Snapdragon NPUs require models in a specific size and format.
Instructions to generate models in this format can be found in Build models for Snapdragon.
Once you have built or downloaded the model, place the model assets in a known location. These assets consist of:
- genai_config.json
- tokenizer.json
- tokenizer_config.json
- special_tokens_map.json
- quantizer.onnx
- dequantizer.onnx
- position-processor.onnx
- a set of transformer model binaries:
  - Qualcomm context binaries (*.bin)
  - Context binary metadata (*.json)
  - ONNX wrapper models (*.onnx)
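Before moving on, you can sanity-check that the assets are in place. The sketch below is a small standalone helper (check_model_assets is a hypothetical name, not part of ONNX Runtime) that verifies the fixed-name files listed above exist; the exact set of *.bin and wrapper *.onnx files varies by model, so those are only checked loosely.

import sys
from pathlib import Path

def check_model_assets(model_dir):
    model_dir = Path(model_dir)
    # Files that every Snapdragon model folder should contain
    expected = [
        "genai_config.json",
        "tokenizer.json",
        "tokenizer_config.json",
        "special_tokens_map.json",
        "quantizer.onnx",
        "dequantizer.onnx",
        "position-processor.onnx",
    ]
    missing = [name for name in expected if not (model_dir / name).exists()]
    # The transformer binaries vary by model, so only check that some exist
    has_context_binaries = any(model_dir.glob("*.bin"))
    if missing or not has_context_binaries:
        print("Missing assets:", missing)
        print("Context binaries (*.bin) found:", has_context_binaries)
        sys.exit(1)
    print("All expected assets found in", model_dir)

check_model_assets(r"models\Phi-3.5-mini-instruct")  # example path; use your own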
Python application
If your device has Python installed, you can run a simple question-and-answer script to query the model.
Install the runtime
pip install onnxruntime-genai
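You can quickly confirm the installation by importing the package. This is a minimal check, assuming Python 3 is available on your PATH as python:
python -c "import onnxruntime_genai; print('onnxruntime-genai imported successfully')"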
Download the script
curl https://raw.githubusercontent.com/microsoft/onnxruntime-genai/refs/heads/main/examples/python/model-qa.py -o model-qa.py
Run the script
This script assumes that the model assets are in a folder called models\Phi-3.5-mini-instruct. Adjust the -m argument to point to the location of your model.
python .\model-qa.py -e cpu -g -v --system_prompt "You are a helpful assistant. Be brief and concise." --chat_template "<|user|>\n{input} <|end|>\n<|assistant|>" -m ..\..\models\Phi-3.5-mini-instruct
A look inside the Python script
The complete Python script is published here: https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/model-qa.py. The script utilizes the API in the following standard way:
- Load the model
model = og.Model(config)
This loads the model into memory.
- Create pre-processors and tokenize the system prompt
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

# Optional
system_tokens = tokenizer.encode(system_prompt)
This creates a tokenizer and a tokenizer stream which allows tokens to be returned to the user as they are generated.
- Interactive input loop
while True:
    # Read prompt
    # Run the generation, streaming the output tokens
- Generative loop
# 1. Pre-process the prompt into tokens
input_tokens = tokenizer.encode(prompt)

# 2. Create parameters and generator (KV cache etc) and process the prompt
params = og.GeneratorParams(model)
params.set_search_options(**search_options)
generator = og.Generator(model, params)
generator.append_tokens(system_tokens + input_tokens)

# 3. Loop until all output tokens are generated, printing
#    out the decoded token
while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end="", flush=True)
print()

# Delete the generator to free the captured graph before creating another one
del generator
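For reference, the steps above can be collapsed into a minimal, self-contained sketch. It passes the model folder straight to og.Model rather than building an og.Config from command-line arguments as model-qa.py does, and it hard-codes the prompt, system prompt, and search options; the paths and values shown are illustrative only.

import onnxruntime_genai as og

model_path = r"models\Phi-3.5-mini-instruct"  # adjust to your model location

# Load the model (og.Model also accepts a model folder path directly)
model = og.Model(model_path)

# Create the tokenizer and a streaming decoder
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

# Optional system prompt, encoded once and reused for every turn
system_prompt = "You are a helpful assistant. Be brief and concise."
system_tokens = tokenizer.encode(system_prompt)

# A single turn, using the Phi-3.5 chat template from the command above
prompt = "<|user|>\nWhat is an NPU? <|end|>\n<|assistant|>"
input_tokens = tokenizer.encode(prompt)

# Create the generator (KV cache etc.) and process the prompt
params = og.GeneratorParams(model)
params.set_search_options(max_length=1024)
generator = og.Generator(model, params)
generator.append_tokens(system_tokens + input_tokens)

# Stream the output tokens as they are generated
while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end="", flush=True)
print()

# Free the generator (and its captured graph) before creating another one
del generator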
C++ application
To run the models on the Snapdragon NPU from a C++ application, use the code from here: https://github.com/microsoft/onnxruntime-genai/tree/main/examples/c.
Building and running this application requires a Windows PC with a Snapdragon NPU, as well as:
- cmake
- Visual Studio 2022
Clone the repo
git clone https://github.com/microsoft/onnxruntime-genai
cd onnxruntime-genai\examples\c
Install onnxruntime
This currently requires a nightly build of onnxruntime, as it depends on very recent changes to QNN support for language models.
Download the nightly version of the ONNX Runtime QNN binaries from here
mkdir onnxruntime-win-arm64-qnn
move Microsoft.ML.OnnxRuntime.QNN.1.22.0-dev-20250225-0548-e46c0d8.nupkg onnxruntime-win-arm64-qnn
cd onnxruntime-win-arm64-qnn
tar xvzf Microsoft.ML.OnnxRuntime.QNN.1.22.0-dev-20250225-0548-e46c0d8.nupkg
copy runtimes\win-arm64\native\* ..\..\..\lib
cd ..
Install onnxruntime-genai
curl -L https://github.com/microsoft/onnxruntime-genai/releases/download/v0.6.0/onnxruntime-genai-0.6.0-win-arm64.zip -o onnxruntime-genai-win-arm64.zip
tar xvf onnxruntime-genai-win-arm64.zip
cd onnxruntime-genai-0.6.0-win-arm64
copy include\* ..\include
copy lib\* ..\lib
Build the sample
cmake -A arm64 -S . -B build -DPHI3-QA=ON
cd build
cmake --build . --config Release
Run the sample
cd Release
.\phi3_qa.exe <path_to_model>
A look inside the C++ sample
The complete C++ application is published here: https://github.com/microsoft/onnxruntime-genai/blob/main/examples/c/src/phi3_qa.cpp. The application utilizes the API in the following standard way:
- Load the model
auto model = OgaModel::Create(*config);
This loads the model into memory.
- Create pre-processors
auto tokenizer = OgaTokenizer::Create(*model);
auto tokenizer_stream = OgaTokenizerStream::Create(*tokenizer);
This creates a tokenizer and a tokenizer stream which allows tokens to be returned to the user as they are generated.
- Interactive input loop
while (true) {
  // Read prompt
  // Run the generation, streaming the output tokens
}
- Generative loop
// 1. Pre-process the prompt into tokens
auto sequences = OgaSequences::Create();
tokenizer->Encode(prompt.c_str(), *sequences);

// 2. Create parameters and generator (KV cache etc) and process the prompt
auto params = OgaGeneratorParams::Create(*model);
params->SetSearchOption("max_length", 1024);
auto generator = OgaGenerator::Create(*model, *params);
generator->AppendTokenSequences(*sequences);

// 3. Loop until all output tokens are generated, printing
//    out the decoded token
while (!generator->IsDone()) {
  generator->GenerateNextToken();

  // is_first_token and timing are defined earlier in the full sample
  if (is_first_token) {
    timing.RecordFirstTokenTimestamp();
    is_first_token = false;
  }

  const auto num_tokens = generator->GetSequenceCount(0);
  const auto new_token = generator->GetSequenceData(0)[num_tokens - 1];
  std::cout << tokenizer_stream->Decode(new_token) << std::flush;
}