Announcing MultiLoRA with ONNX Runtime: Revolutionizing AI Customization

By:

Dmitri Smirnov, Jambay Kinley, Natalie Kershaw, Parinita Rahi, Pranav Sharma, Devang Patel, Samuel Kemp

20TH NOVEMBER, 2024

A New Era of AI Customization

Today’s AI services must cater to a vast range of users—each with unique requirements and preferences. Customizing large language models for individual customers or speakers has traditionally been a resource-intensive and time-consuming task.

LoRA adapters customize large AI models for specific tasks without the cost of full fine-tuning: the base weights stay frozen and only a small set of low-rank weights is trained. MultiLoRA builds on this by letting a single base model load several lightweight adapters and switch between them at runtime, so the model can adapt dynamically to different contexts and customers. With ONNX Runtime as its foundation, MultiLoRA shares the base model across adapters, which keeps memory usage efficient.

Streamlined Integration with Olive

MultiLoRA relies on the existing Olive toolchain to generate adapted ONNX models.

This ensures:

  • A unified pipeline for creating LoRA-adapted models.
  • Consistent handling of versioning and metadata across models and adapters.

By standardizing around Olive, MultiLoRA simplifies workflows and eliminates compatibility concerns with third-party sources.

ONNX Runtime Adaptations

Simplified Adapter Activation and Loading

  • Dynamic Activation API: A single API, SetActiveAdapters(string[] adapters), allows activating or deactivating adapters at runtime.

    • Empty Input: Resets the model to its base state, running without any adapters.
    • Multi-Adapter Support: Simultaneously activate multiple adapters to meet complex customer requirements.
  • Generative Loop Support:

    • Active adapters remain loaded as long as the GeneratorParams instance persists, ensuring efficient memory use.
    • References are automatically released when the instance is destroyed, avoiding resource leaks.
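
The activation semantics above can be pictured with a toy model. The sketch below is not the ONNX Runtime implementation: `ToyModel`, its method names, and the tiny matrices are hypothetical stand-ins. It assumes each active adapter adds a low-rank update B(Ax) to the base output Wx (the standard LoRA formulation) and that updates from several active adapters are summed, which is an assumption made for illustration only.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Multiply matrix m by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class ToyModel:
    """Hypothetical stand-in: a base weight plus named low-rank adapters."""

    def __init__(self, base: Matrix) -> None:
        self.base = base
        self.adapters: Dict[str, Tuple[Matrix, Matrix]] = {}
        self.active: List[str] = []

    def load_adapter(self, name: str, a: Matrix, b: Matrix) -> None:
        # a has shape (r, k) and b has shape (d, r); the update is b @ (a @ x).
        self.adapters[name] = (a, b)

    def set_active_adapters(self, names: List[str]) -> None:
        # An empty list resets the model to its base behaviour.
        unknown = [n for n in names if n not in self.adapters]
        if unknown:
            raise KeyError(f"adapters not loaded: {unknown}")
        self.active = list(names)

    def run(self, x: List[float]) -> List[float]:
        y = matvec(self.base, x)
        for name in self.active:
            a, b = self.adapters[name]
            delta = matvec(b, matvec(a, x))
            y = [yi + di for yi, di in zip(y, delta)]
        return y


model = ToyModel(base=[[1.0, 0.0], [0.0, 1.0]])
model.load_adapter("customer_a", a=[[1.0, 1.0]], b=[[0.5], [0.0]])
model.load_adapter("customer_b", a=[[1.0, -1.0]], b=[[0.0], [2.0]])

x = [2.0, 1.0]
base_out = model.run(x)                      # no adapters active
model.set_active_adapters(["customer_a"])
one_out = model.run(x)                       # one adapter active
model.set_active_adapters(["customer_a", "customer_b"])
both_out = model.run(x)                      # two adapters active
model.set_active_adapters([])
reset_out = model.run(x)                     # back to the base model
```

Because the base weights are never modified, switching adapters only changes which small updates are applied, and passing an empty list yields the same output as the unadapted model.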

Adapter Management Without Generative Loops

For models not tied to user prompts or generative processes, a new Run() API is introduced:

Results = Run(RunOptions, input_names[], input_data[], output_names[]);

  • RunOptions Class: Facilitates seamless execution of base models or adapter-enhanced variants.
  • Shared Adapter Loading: Adapters are stored within the model instance, allowing efficient reuse across multiple sessions.

Language Bindings Expansion

The current MultiLoRA implementation offers bindings for Python, C, C++, C#, and Java.

Memory Management

Our implementation memory-maps LoRA parameters from disk, so adapter weights do not have to be copied into memory in full before use, which improves memory management.
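
To illustrate why memory mapping helps, here is a minimal sketch using Python's standard `mmap` module. It is not ONNX Runtime's loader: the flat float32 file layout and the helper names are assumptions made for the example. The point is that a mapped file lets a process read a slice of parameters while the operating system pages in only the regions that are touched.

```python
import mmap
import os
import struct
import tempfile


def write_params(path: str, values: list) -> None:
    """Write values as little-endian float32 (hypothetical flat layout)."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{len(values)}f", *values))


def read_slice(path: str, start: int, count: int) -> list:
    """Read `count` float32 values beginning at index `start` via mmap."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            offset = start * 4  # 4 bytes per float32
            return list(struct.unpack_from(f"<{count}f", mapped, offset))


# Create a small stand-in parameter file in a temporary directory.
tmp_dir = tempfile.mkdtemp()
params_path = os.path.join(tmp_dir, "adapter_params.bin")
write_params(params_path, [float(i) for i in range(1024)])

# Only the bytes backing this window need to be read from disk.
window = read_slice(params_path, start=512, count=4)
print(window)  # [512.0, 513.0, 514.0, 515.0]
```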

How MultiLoRA Works

Generate the ONNX Models and Adapters

If you have an existing base model and an adapter in Hugging Face PEFT format, the following command creates optimized ONNX models that run efficiently on ONNX Runtime using the MultiLoRA paradigm:

olive auto-opt -m <path to model> -a <example adapter> -o <output folder> --device cpu|gpu --provider <execution provider>

You can then add additional adapters that exist on Hugging Face (or local disk) for the same base model by converting them into the ONNX adapter format using:

olive convert-adapters -a <adapter> -o <output>
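
When several adapters need converting, the command above can be scripted. The sketch below only assembles argument lists with the `-a` and `-o` flags shown in this post. The adapter identifiers and output paths are hypothetical placeholders, and the helper that would actually invoke Olive is defined but not called, so the snippet runs without Olive installed.

```python
import subprocess
from typing import Dict, List


def convert_adapter_command(adapter: str, output: str) -> List[str]:
    """Build the argv for `olive convert-adapters -a <adapter> -o <output>`."""
    return ["olive", "convert-adapters", "-a", adapter, "-o", output]


def run_all(commands: List[List[str]]) -> None:
    """Run each command in turn; requires the Olive CLI to be installed."""
    for argv in commands:
        subprocess.run(argv, check=True)


# Hypothetical adapters mapped to hypothetical output locations.
adapters_to_convert: Dict[str, str] = {
    "my-org/support-adapter": "adapters/support",
    "local/adapters/legal": "adapters/legal",
}

commands = [
    convert_adapter_command(adapter, output)
    for adapter, output in adapters_to_convert.items()
]

for argv in commands:
    print(" ".join(argv))

# Call run_all(commands) to perform the conversions once Olive is available.
```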

Alternatively, you can fine-tune your own adapter using:

# Step 1: Fine-tune (outputs a PyTorch model and a PEFT adapter)
olive finetune --method qlora -m <model> -d <dataset> -o models/ft
# Step 2: Optimize the base model and adapter into ONNX format
olive auto-opt -m models/ft/model -a models/ft/adapter -o <output folder> --device cpu|gpu --provider <execution provider>

Run the ONNX Models and Switch Adapters

  1. Load Adapters: Dynamically load adapters for the base model:

    import onnxruntime_genai as oga  # `model` is an existing oga.Model
    adapters = oga.Adapters(model)
    adapters.load("file", "name")
  2. Set Active Adapter: Switch adapters on the fly based on customer requests:

    generator.set_active_adapter(adapters, "name") 
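
Per-request switching can be sketched as a small routing layer. In the example below only the `set_active_adapter(adapters, name)` call shape is taken from this post. `RecordingGenerator`, the customer IDs, and the adapter names are hypothetical stand-ins so the snippet runs without a model on disk; in a real service the generator and adapters objects would come from the `oga` package.

```python
from typing import Any, Dict, List, Optional


class RecordingGenerator:
    """Hypothetical stand-in that records which adapter was activated."""

    def __init__(self) -> None:
        self.activations: List[str] = []

    def set_active_adapter(self, adapters: Any, name: str) -> None:
        self.activations.append(name)


# Hypothetical mapping from customer ID to the name of a loaded adapter.
ADAPTER_FOR_CUSTOMER: Dict[str, str] = {
    "acme": "acme_support",
    "globex": "globex_legal",
}


def activate_for_customer(generator: Any, adapters: Any, customer_id: str) -> Optional[str]:
    """Activate the customer's adapter; return None to stay on the base model."""
    name = ADAPTER_FOR_CUSTOMER.get(customer_id)
    if name is not None:
        generator.set_active_adapter(adapters, name)
    return name


adapters = object()  # placeholder for the loaded adapters container
results: Dict[str, List[str]] = {}
for customer_id in ("acme", "unknown", "globex"):
    generator = RecordingGenerator()  # one generator per request
    activate_for_customer(generator, adapters, customer_id)
    results[customer_id] = generator.activations
```

Creating one generator per request keeps an adapter chosen for one customer from carrying over to the next, and customers without a mapped adapter simply run against the base model.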

Looking Ahead

In Development:

  • Batching Support: Enhancing ONNX Runtime kernels for adapter-aware batching.
  • Expanded Bindings: Adding bindings for more languages to support broader adoption.
  • Memory Features: Additional memory management improvements.

Your Feedback Matters

As MultiLoRA evolves, we invite developers to test the feature, provide insights, and shape its roadmap. By working together, we aim to create a flexible, powerful foundation for AI adaptation.

Conclusion

MultiLoRA is more than an enhancement to ONNX Runtime—it’s a step forward in making AI systems modular, adaptable, and accessible. By addressing technical challenges like memory management, batching, and data format inefficiencies, MultiLoRA lays the groundwork for a new era of AI deployment.

Let’s build the future of adaptable AI together. Join us in exploring MultiLoRA with ONNX Runtime!

Resources