Release: Microsoft, published May 29, 2026

microsoft/onnxruntime-genai v0.14.0

v0.14.0

Repository: microsoft/onnxruntime-genai

Tag: v0.14.0

Published: 2026-05-29T18:06:43Z

Prerelease: no

Release notes:

What's Changed

  • Fix WhisperProcessor divide-by-zero when single prompt is provided by @Copilot in https://github.com/microsoft/onnxruntime-genai/pull/2068
  • Fix lm_head tensor loading order dependency in quantized model builder by @thpereir in https://github.com/microsoft/onnxruntime-genai/pull/2061
  • Fail to build Whisper model by @xiaofeihan1 in https://github.com/microsoft/onnxruntime-genai/pull/2075
  • Rename NemotronCacheConfig to NemotronConfig and add blank penalty to the decoder by @nenad1002 in https://github.com/microsoft/onnxruntime-genai/pull/2042
  • Fix YaRN RoPE bugs in model builder and add parity tests by @titaiwangms in https://github.com/microsoft/onnxruntime-genai/pull/2076
  • Add Transformers v5 Support by @sayanshaw24 in https://github.com/microsoft/onnxruntime-genai/pull/2089
  • macOS ARM64 ADO pipeline by @Copilot in https://github.com/microsoft/onnxruntime-genai/pull/2091
  • Reduce CPU-side per-token overhead in GenerateNextToken and SampleTopP by @hanbitmyths in https://github.com/microsoft/onnxruntime-genai/pull/2085
  • Add onStageComplete by @apsonawane in https://github.com/microsoft/onnxruntime-genai/pull/2074
  • [WebGPU] Support continuous decoding (RewindTo) with graph capture by @qjia7 in https://github.com/microsoft/onnxruntime-genai/pull/2083
  • [Mistral3] Add VLM support with multi-image inference by @titaiwangms in https://github.com/microsoft/onnxruntime-genai/pull/2077
  • Add k_quant_linear mixed-precision quantization for hybrid attention … by @apsonawane in https://github.com/microsoft/onnxruntime-genai/pull/2100
  • Removes QNN packaging from onnxruntime-genai pipelines by @baijumeswani in https://github.com/microsoft/onnxruntime-genai/pull/2109
  • Add Gemma4 multimodal support (vision + audio) by @apsonawane in https://github.com/microsoft/onnxruntime-genai/pull/2103
  • Update GUIDs during az login by @kunal-vaishnavi in https://github.com/microsoft/onnxruntime-genai/pull/2122
  • Add CODEOWNERS file for repository ownership by @kunal-vaishnavi in https://github.com/microsoft/onnxruntime-genai/pull/2119
  • Qwen3.5: drop fp32 cast around RMSNorm in builder by @xiaofeihan1 in https://github.com/microsoft/onnxruntime-genai/pull/2101
  • Add support for LFM2 in ORT GenAI by @xenova in https://github.com/microsoft/onnxruntime-genai/pull/1979
  • Enable CUDA graph capture for CUDA EP to improve decode throughput by @apsonawane in https://github.com/microsoft/onnxruntime-genai/pull/2070
  • [Qwen3.5] dedup position ids by @daijh in https://github.com/microsoft/onnxruntime-genai/pull/2102
  • Address win-cuda pipeline errors by @baijumeswani in https://github.com/microsoft/onnxruntime-genai/pull/2154
  • Update Extensions Commit to Fix Id2Token Bugs by @sayanshaw24 in https://github.com/microsoft/onnxruntime-genai/pull/2159
  • Limit the CUDA cmake architectures to 86 for CI builds by @baijumeswani in https://github.com/microsoft/onnxruntime-genai/pull/2161
  • Gate leaked-object error reporting in Shutdown() to debug builds or when logging is enabled by @baijumeswani in https://github.com/microsoft/onnxruntime-genai/pull/2162
  • Update Copilot instructions for reviewing model builder by @kunal-vaishnavi in https://github.com/microsoft/onnxruntime-genai/pull/2164
  • Fix DecoderState input_ids check regression introduced in #2103 by @titaiwangms in https://github.com/microsoft/onnxruntime-genai/pull/2148
  • Fix memory leaks by @skottmckay in https://github.com/microsoft/onnxruntime-genai/pull/2153
  • [Qwen3.5] Use LpNormalization for L2-norm in linear-attention Q/K by @xiaofeihan1 in https://github.com/microsoft/onnxruntime-genai/pull/2127
  • Fix: Win32 build failure when paths contain spaces by @nsubaru in https://github.com/microsoft/onnxruntime-genai/pull/2053
  • Fix CUDA build with MSVC by enabling /Zc:preprocessor for nvcc host compilation on VS 16.5 or greater by @nsubaru in https://github.com/microsoft/onnxruntime-genai/pull/2054
  • Apply linear rope_scaling in model builder for Neutts/nano by @VishalX in https://github.com/microsoft/onnxruntime-genai/pull/2142
  • Fix Quark/AWQ weight loading for Qwen3-VL-4B text model by @anilmartha in https://github.com/microsoft/onnxruntime-genai/pull/2143
  • Fix WebGPU inference crash in embedding and multi-modal feature allocation by @feich-ms in https://github.com/microsoft/onnxruntime-genai/pull/2163
  • Support Visual Studio 18 2026 build by @Copilot in https://github.com/microsoft/onnxruntime-genai/pull/2017
  • Add QNN EP documentation to OGA including Genie note by @qti-kromero in https://github.com/microsoft/onnxruntime-genai/pull/2158
  • Use windowsml package and make winml usage simpler by @baijumeswani in https://github.com/microsoft/onnxruntime-genai/pull/2155
  • Cleanup TensorObject created by OrtxTensorResultGetAt by @skottmckay in https://github.com/microsoft/onnxruntime-genai/pull/2168
  • Fix nemotron leaks by @skottmckay in https://github.com/microsoft/onnxruntime-genai/pull/2169
  • [RyzenAI] make speech sub-model optional in PhiMultiModalProcessor by @manasablrm in https://github.com/microsoft/onnxruntime-genai/pull/2167
  • Enable graph capture for WebGPU models and DML continuous decoding tests by @qjia7 in https://github.com/microsoft/onnxruntime-genai/pull/2099
  • [Qwen3] Allow packed QKV MatMul under QK-Norm via post-MatMul Split by @xiaofeihan1 in https://github.com/microsoft/onnxruntime-genai/pull/2137
  • Enable Linux ARM64 builds and packaging by @baijumeswani in https://github.com/microsoft/onnxruntime-genai/pull/2107
  • Add gemma4 unit tests by @apsonawane in https://github.com/microsoft/onnxruntime-genai/pull/2151
  • Auto-detect fixed kv-cache shape in DefaultKeyValueCache by @akholodnamdcom in https://github.com/microsoft/onnxruntime-genai/pull/2166
  • Add text-only mode support for Qwen 3.5 model builder by @apsonawane in https://github.com/microsoft/onnxruntime-genai/pull/2157
  • Fix heap overflow issue by @apsonawane in https://github.com/microsoft/onnxruntime-genai/pull/2110
  • [Benchmark] Add --use_random_tokens flag to C benchmark by @VishalX in https://github.com/microsoft/onnxruntime-genai/pull/2170
  • Add HunYuan Dense V1 (hunyuan_v1_dense) model support by @anilmartha in https://github.com/microsoft/onnxruntime-genai/pull/2144
  • Nvidia Parakeet Tdt ASR support by @nenad1002 in https://github.com/microsoft/onnxruntime-genai/pull/2150
  • Multilingual Streaming Nemotron ASR + CUDA support…
