QwenLM/Qwen2.5-Omni
Description: Qwen2.5-Omni is an end-to-end multimodal model by the Qwen team at Alibaba Cloud, capable of understanding text, audio, vision, and video, and of performing real-time speech generation.
Language: Jupyter Notebook
License: Apache-2.0
Stars: 4021
Forks: 324
Open issues: 220
Created: 2025-03-22T01:43:13Z
Pushed: 2025-06-12T11:03:07Z
Default branch: main
Fork: no
Archived: no
README:
Qwen2.5-Omni
💜 Qwen Chat   |   🤗 Hugging Face   |   🤖 ModelScope   |   📑 Blog   |   📚 Cookbooks   |   📑 Paper  
🖥️ Demo   |   💬 WeChat (微信)   |   🫨 Discord   |   📑 API
We release Qwen2.5-Omni, the new flagship end-to-end multimodal model in the Qwen series. Designed for comprehensive multimodal perception, it processes diverse inputs including text, images, audio, and video, and delivers real-time streaming responses as both generated text and natural speech.
News
- 2025.06.12: Qwen2.5-Omni-7B ranked first among open-source models on MMSU, a spoken language understanding and reasoning benchmark.
- 2025.06.09: Our open-source Qwen2.5-Omni-7B ranked first on the MMAU leaderboard and first among open-source models on MMAR, both audio understanding and reasoning evaluations.
- 2025.05.16: We release 4-bit quantized Qwen2.5-Omni-7B (GPTQ-Int4/AWQ) models that maintain performance comparable to the original version on multimodal evaluations while reducing GPU VRAM consumption by over 50%. See [GPTQ-Int4 and AWQ Usage](#gptq-int4-and-awq-usage) for details. The models can be obtained from Hugging Face (GPTQ-Int4|AWQ) and ModelScope (GPTQ-Int4|AWQ).
- 2025.05.13: The MNN Chat App now supports Qwen2.5-Omni, so you can try Qwen2.5-Omni on edge devices. Please refer to [Deployment with MNN](#deployment-with-mnn) for memory consumption and inference speed benchmarks.
- 2025.04.30: Exciting! We have released Qwen2.5-Omni-3B to enable more platforms to run Qwen2.5-Omni. The model can be downloaded from Hugging Face. The [performance](#performance) section has been updated for this model; please refer to [Minimum GPU memory requirements](#minimum-gpu-memory-requirements) for information about resource consumption. For the best experience, the [transformers](#--transformers-usage) and [vllm](#deployment-with-vllm) code has been updated; pull the [official docker](#-docker) image again to get the changes.
- 2025.04.11: We release a new vllm version that supports audio output. Try it from source or from our docker image.
- 2025.04.02: ⭐️⭐️⭐️ Qwen2.5-Omni reaches top-1 on Hugging Face Trending!
- 2025.03.29: ⭐️⭐️⭐️ Qwen2.5-Omni reaches top-2 on Hugging Face Trending!
- 2025.03.26: Real-time interaction with Qwen2.5-Omni is available on Qwen Chat. Let's start this amazing journey now!
- 2025.03.26: We have released Qwen2.5-Omni. For more details, please check our blog!
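To put the 4-bit quantization item above in context, here is a rough back-of-the-envelope sketch of weight memory at 16-bit versus 4-bit precision. It counts weights only and assumes a nominal 7 billion parameters (taken from the "7B" in the model name, not from a measured count). Real VRAM usage also includes activations, caches, and any components left unquantized, so the measured saving will typically be smaller than this weights-only figure.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


# Assumption: nominal parameter count implied by the "7B" name.
NOMINAL_PARAMS = 7e9

bf16_gib = weight_memory_gib(NOMINAL_PARAMS, 16)  # 16-bit weights
int4_gib = weight_memory_gib(NOMINAL_PARAMS, 4)   # 4-bit quantized weights

# Fraction of weight memory saved by going from 16-bit to 4-bit.
weight_saving = 1 - int4_gib / bf16_gib

print(f"16-bit weights: {bf16_gib:.2f} GiB")
print(f"4-bit weights:  {int4_gib:.2f} GiB")
print(f"weights-only saving: {weight_saving:.0%}")
```

See the Minimum GPU memory requirements section of the README for measured numbers; this estimate is only a sanity check.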
Contents
- [Overview](#overview)
- [Introduction](#introduction)
- [Key Features](#key-features)
- [Model Architecture](#model-architecture)
- [Performance](#performance)
- [Quickstart](#quickstart)
- [Transformers Usage](#--transformers-usage)
- [ModelScope Usage](#-modelscope-usage)
- [GPTQ-Int4 and AWQ Usage](#gptq-int4-and-awq-usage)
- [Usage Tips](#usage-tips)
- [Cookbooks for More Usage Cases](#cookbooks-for-more-usage-cases)
- [API inference](#api-inference)
- [Customization Settings](#customization-settings)
- [Chat with Qwen2.5-Omni](#chat-with-qwen25-omni)
- [Online Demo](#online-demo)
- [Launch Local Web UI Demo](#launch-local-web-ui-demo)
- [Real-Time Interaction](#real-time-interaction)
- [Deployment with vLLM](#deployment-with-vllm)
- [Deployment with MNN](#deployment-with-mnn)
- [Docker](#-docker)
Overview
Introduction
Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner.
Key Features
- Omni and Novel Architecture: We propose the Thinker-Talker architecture, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. We also propose a novel position embedding, named TMRoPE (Time-aligned Multimodal RoPE), to synchronize the timestamps of video inputs with audio.
- Real-Time Voice and Video Chat: Architecture designed for fully real-time interactions, supporting chunked input and immediate output.
- Natural and Robust Speech Generation: Surpassing many existing streaming and non-streaming alternatives, demonstrating superior robustness and naturalness in speech generation.
- Strong Performance Across Modalities: Exhibiting exceptional performance across all modalities when benchmarked against similarly sized single-modality models. Qwen2.5-Omni outperforms the similarly sized Qwen2-Audio in audio capabilities and achieves comparable performance to Qwen2.5-VL-7B.
- Excellent End-to-End Speech Instruction Following: Qwen2.5-Omni shows performance in end-to-end speech instruction following that rivals its effectiveness with text inputs, evidenced by benchmarks such as MMLU and GSM8K.
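The time-alignment idea behind TMRoPE can be illustrated with a minimal sketch: give audio and video frames a temporal position index derived from their real timestamps, so that frames captured at the same moment share the same index. This is not the model's implementation. The 40 ms granularity, the function names, and the sample timestamps are all assumptions made for illustration, and the rotary embedding itself is not shown.

```python
from typing import List, Tuple

# Assumption for illustration: one temporal position per 40 ms of real time.
MS_PER_ID = 40


def temporal_ids(timestamps_ms: List[int], ms_per_id: int = MS_PER_ID) -> List[int]:
    """Map frame timestamps (in milliseconds) to integer temporal position IDs."""
    return [t // ms_per_id for t in timestamps_ms]


def merge_by_time(audio_ms: List[int], video_ms: List[int]) -> List[Tuple[int, str]]:
    """Tag frames with their modality and order them on a shared time axis."""
    tagged = [(i, "audio") for i in temporal_ids(audio_ms)]
    tagged += [(i, "video") for i in temporal_ids(video_ms)]
    # sorted() is stable, so audio precedes video when the IDs are equal.
    return sorted(tagged, key=lambda pair: pair[0])


# Hypothetical inputs: audio frames every 40 ms, video frames every 400 ms.
audio_ms = list(range(0, 1000, 40))
video_ms = [0, 400, 800]

timeline = merge_by_time(audio_ms, video_ms)
print(temporal_ids(video_ms))
print(timeline[:4])
```

Because the IDs come from timestamps rather than from each frame's position in its own sequence, a video frame sampled at 400 ms lands on the same index as the audio frame starting at 400 ms, even though the two streams have very different frame rates.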
Model Architecture
Performance
We conducted a comprehensive evaluation of Qwen2.5-Omni, which demonstrates strong performance across all modalities when compared to similarly sized single-modality models and closed-source models like Qwen2.5-VL-7B, Qwen2-Audio, and Gemini-1.5-pro. In tasks requiring the integration of multiple…