RepoMicrosoftMicrosoftpublished Nov 13, 2023seen 2d

microsoft/onnxruntime-genai

C++

Open original ↗

Captured source

source ↗
published Nov 13, 2023seen 2dcaptured 8hhttp 200method plain

microsoft/onnxruntime-genai

Description: Generative AI extensions for onnxruntime

Language: C++

License: MIT

Stars: 1050

Forks: 304

Open issues: 181

Created: 2023-11-13T20:37:15Z

Pushed: 2026-06-11T02:21:52Z

Default branch: main

Fork: no

Archived: no

README:

ONNX Runtime GenAI

Status

![Nightly Build](https://github.com/microsoft/onnxruntime-genai/actions/workflows/linux-cpu-x64-nightly-build.yml)

Description

Run generative AI models with ONNX Runtime. This API gives you an easy, flexible and performant way of running LLMs on device. It implements the generative AI loop for ONNX models, including pre and post processing, inference with ONNX Runtime, logits processing, search and sampling, KV cache management, and grammar specification for tool calling.

ONNX Runtime GenAI powers Foundry Local, Windows ML, and the Visual Studio Code AI Toolkit.

See documentation at the ONNX Runtime website for more details.

| Support matrix | Supported now | Under development | On the roadmap| | -------------- | ------------- | ----------------- | -------------- | | Model architectures | AMD OLMo ChatGLM DeepSeek ERNIE 4.5 Fara Gemma gpt-oss Granite HunYuan Dense V1 InternLM2 Llama Mistral Nemotron Phi (language + vision) Qwen (language + vision) SmolLM3 Whisper | Stable diffusion | Multi-modal models | | API | Python C# C/C++ Java ^ | Objective-C || | O/S | Linux Windows Mac Android || iOS ||| | Architecture | x86 x64 arm64 |||| | Hardware Acceleration | CPU CUDA DirectML NvTensorRtRtx (TRT-RTX) OpenVINO QNN WebGPU | | AMD GPU | | Features | Multi-LoRA Continuous decoding Constrained decoding | | Speculative decoding |

^ Requires build from source

Installation

See installation instructions or build from source

Sample code for Phi-3 in Python

1. Download the model

huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --include cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/* --local-dir .

2. Install the API

pip install numpy
pip install --pre onnxruntime-genai

3. Run the model

import onnxruntime_genai as og

model = og.Model('cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4')
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

# Set the max length to something sensible by default,
# since otherwise it will be set to the entire context length
search_options = {}
search_options['max_length'] = 2048
search_options['batch_size'] = 1

chat_template = '\n{input} \n'

text = input("Input: ")
if not text:
print("Error, input cannot be empty")
exit()

prompt = f'{chat_template.format(input=text)}'

input_tokens = tokenizer.encode(prompt)

params = og.GeneratorParams(model)
params.set_search_options(**search_options)
generator = og.Generator(model, params)

print("Output: ", end='', flush=True)

try:
generator.append_tokens(input_tokens)
while not generator.is_done():
generator.generate_next_token()
new_token = generator.get_next_tokens()[0]
print(stream.decode(new_token), end='', flush=True)
except KeyboardInterrupt:
print(" --control+c pressed, aborting generation--")

print()
del generator

Choose the correct version of the examples

Due to the evolving nature of this project and ongoing feature additions, examples in the main branch may not always align with the latest stable release. This section outlines how to ensure compatibility between the examples and the corresponding version.

Stable version

Install the package according to the installation instructions. For example, install the Python package.

pip install onnxruntime-genai

Get the version of the package

Linux/Mac:

pip list | grep onnxruntime-genai

Windows:

pip list | findstr "onnxruntime-genai"

Then, check out the version of the examples that corresponds to that release.

# Clone the repo
git clone https://github.com/microsoft/onnxruntime-genai.git && cd onnxruntime-genai
# Checkout the branch for the version you are using
git checkout v0.11.5
cd examples

Nightly version (main branch)

Checkout the main branch of the repo

git clone https://github.com/microsoft/onnxruntime-genai.git && cd onnxruntime-genai

Build from source, using these instructions. For example, to build the Python wheel:

python build.py

Navigate to the examples folder in the main branch.

cd examples

To install the nightly Python build:

# Change onnxruntime-genai to the Python package you want to install
pip install --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple/ onnxruntime-genai

Roadmap

See the Discussions to request new features and up-vote existing requests.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.

Linting

This project enables lintrunner for linting. You can install the dependencies and initialize with

pip install -r requirements-lintrunner.txt
lintrunner init

This will install lintrunner on your system and download all the necessary dependencies to run linters locally.

To format local changes:

lintrunner -a…

Excerpt shown — open the source for the full document.