Repo: Qwen (Alibaba Cloud), published Mar 22, 2025

QwenLM/Qwen2.5-Omni

Description: Qwen2.5-Omni is an end-to-end multimodal model by Qwen team at Alibaba Cloud, capable of understanding text, audio, vision, video, and performing real-time speech generation.

Language: Jupyter Notebook

License: Apache-2.0

Stars: 4021

Forks: 324

Open issues: 220

Created: 2025-03-22T01:43:13Z

Pushed: 2025-06-12T11:03:07Z

Default branch: main

Fork: no

Archived: no

README:

Qwen2.5-Omni

Chinese | English

💜 Qwen Chat | 🤗 Hugging Face | 🤖 ModelScope | 📑 Blog | 📚 Cookbooks | 📑 Paper

🖥️ Demo | 💬 WeChat (微信) | 🫨 Discord | 📑 API

We release Qwen2.5-Omni, the new flagship end-to-end multimodal model in the Qwen series. Designed for comprehensive multimodal perception, it processes diverse inputs including text, images, audio, and video, and delivers real-time streaming responses as both generated text and natural speech.

News

  • 2025.06.12: Qwen2.5-Omni-7B ranked first among open-source models on MMSU, a spoken language understanding and reasoning benchmark.
  • 2025.06.09: Our open-source Qwen2.5-Omni-7B ranked first on the MMAU leaderboard and first among open-source models on MMAR in the audio understanding and reasoning evaluation!
  • 2025.05.16: We release 4-bit quantized Qwen2.5-Omni-7B (GPTQ-Int4/AWQ) models that maintain performance comparable to the original version on multimodal evaluations while reducing GPU VRAM consumption by over 50%. See [GPTQ-Int4 and AWQ Usage](#gptq-int4-and-awq-usage) for details. The models can be obtained from Hugging Face (GPTQ-Int4|AWQ) and ModelScope (GPTQ-Int4|AWQ).
  • 2025.05.13: The MNN Chat App now supports Qwen2.5-Omni, so you can try Qwen2.5-Omni on edge devices! Please refer to [Deployment with MNN](#deployment-with-mnn) for information about memory consumption and inference speed benchmarks.
  • 2025.04.30: Exciting! We have released Qwen2.5-Omni-3B to enable more platforms to run Qwen2.5-Omni. The model can be downloaded from Hugging Face. The [performance](#performance) section has been updated for this model; please refer to [Minimum GPU memory requirements](#minimum-gpu-memory-requirements) for information about resource consumption. For the best experience, the [transformers](#--transformers-usage) and [vllm](#deployment-with-vllm) code has been updated; pull the [official docker](#-docker) image again to get it.
  • 2025.04.11: We release a new vllm version that now supports audio output! Please try it from source or from our docker image.
  • 2025.04.02: ⭐️⭐️⭐️ Qwen2.5-Omni reaches top-1 on Hugging Face Trending!
  • 2025.03.29: ⭐️⭐️⭐️ Qwen2.5-Omni reaches top-2 on Hugging Face Trending!
  • 2025.03.26: Real-time interaction with Qwen2.5-Omni is available on Qwen Chat. Let's start this amazing journey now!
  • 2025.03.26: We have released Qwen2.5-Omni. For more details, please check our blog!

Contents

  • [Overview](#overview)
      • [Introduction](#introduction)
      • [Key Features](#key-features)
      • [Model Architecture](#model-architecture)
      • [Performance](#performance)
  • [Quickstart](#quickstart)
      • [Transformers Usage](#--transformers-usage)
      • [ModelScope Usage](#-modelscope-usage)
      • [GPTQ-Int4 and AWQ Usage](#gptq-int4-and-awq-usage)
      • [Usage Tips](#usage-tips)
      • [Cookbooks for More Usage Cases](#cookbooks-for-more-usage-cases)
      • [API inference](#api-inference)
      • [Customization Settings](#customization-settings)
  • [Chat with Qwen2.5-Omni](#chat-with-qwen25-omni)
      • [Online Demo](#online-demo)
      • [Launch Local Web UI Demo](#launch-local-web-ui-demo)
      • [Real-Time Interaction](#real-time-interaction)
  • [Deployment with vLLM](#deployment-with-vllm)
  • [Deployment with MNN](#deployment-with-mnn)
  • [Docker](#-docker)

Overview

Introduction

Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner.
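
To make "in a streaming manner" concrete, here is a minimal sketch of chunked processing: input arrives in fixed-duration chunks and a partial result is emitted after each chunk instead of after the whole input. This is not the Qwen2.5-Omni API. The names `iter_chunks` and `stream_process`, the 16 kHz sample rate, and the 1000 ms chunk length are assumptions for illustration, and the per-chunk "model" is a placeholder that only reports how much audio has been consumed.

```python
from typing import Iterator, Sequence


def iter_chunks(samples: Sequence[float], chunk_size: int) -> Iterator[Sequence[float]]:
    """Yield consecutive fixed-size chunks; the last chunk may be shorter."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]


def stream_process(samples: Sequence[float], sample_rate: int, chunk_ms: int) -> Iterator[str]:
    """Hypothetical streaming loop: emit one partial result per input chunk."""
    chunk_size = sample_rate * chunk_ms // 1000
    seen = 0
    for chunk in iter_chunks(samples, chunk_size):
        seen += len(chunk)
        # Placeholder for a real model call on this chunk (assumption).
        yield f"processed {seen * 1000 // sample_rate} ms"


# 2.5 seconds of silent audio at an assumed 16 kHz sample rate.
audio = [0.0] * 40000
results = list(stream_process(audio, sample_rate=16000, chunk_ms=1000))
for line in results:
    print(line)
```

Because `stream_process` is a generator, a caller can act on each partial result as soon as it is produced, which is the property a real-time interface relies on.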

Key Features

  • Omni and Novel Architecture: We propose the Thinker-Talker architecture, an end-to-end multimodal design that perceives diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. We also propose a novel position embedding, TMRoPE (Time-aligned Multimodal RoPE), to synchronize the timestamps of video inputs with audio.
  • Real-Time Voice and Video Chat: The architecture is designed for fully real-time interactions, supporting chunked input and immediate output.
  • Natural and Robust Speech Generation: Qwen2.5-Omni surpasses many existing streaming and non-streaming alternatives, demonstrating superior robustness and naturalness in speech generation.
  • Strong Performance Across Modalities: Qwen2.5-Omni exhibits exceptional performance across all modalities when benchmarked against similarly sized single-modality models. It outperforms the similarly sized Qwen2-Audio in audio capabilities and achieves performance comparable to Qwen2.5-VL-7B.
  • Excellent End-to-End Speech Instruction Following: Qwen2.5-Omni shows performance in end-to-end speech instruction following that rivals its effectiveness with text inputs, evidenced by benchmarks such as MMLU and GSM8K.
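
The time-alignment idea behind TMRoPE can be illustrated with a small sketch. It is not the model's implementation. It only shows how tokens from two streams can be given a shared temporal position index derived from their timestamps, so that audio and video tokens covering the same moment receive the same index. The tick length `TICK_MS`, the function name `temporal_ids`, and the example frame rates are assumptions for illustration.

```python
from typing import List

TICK_MS = 40  # assumed duration of one temporal position step (illustrative)


def temporal_ids(timestamps_ms: List[float], tick_ms: int = TICK_MS) -> List[int]:
    """Map each token's start time in milliseconds to an integer temporal id."""
    return [int(t // tick_ms) for t in timestamps_ms]


# Assumed example streams: one audio token every 40 ms, one video frame every 500 ms.
audio_times = [i * 40.0 for i in range(30)]  # 0 ms to 1160 ms
video_times = [0.0, 500.0, 1000.0]

audio_ids = temporal_ids(audio_times)
video_ids = temporal_ids(video_times)

print(video_ids)  # [0, 12, 25]
# The frame at 500 ms shares its temporal id with the audio token covering 480-520 ms.
print(audio_ids[12] == video_ids[1])  # True
```

Deriving positions from timestamps rather than from token order means the two modalities stay aligned even when they produce tokens at different rates.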

Model Architecture

Performance

We conducted a comprehensive evaluation of Qwen2.5-Omni, which demonstrates strong performance across all modalities when compared to similarly sized single-modality models and closed-source models like Qwen2.5-VL-7B, Qwen2-Audio, and Gemini-1.5-pro. In tasks requiring the integration of multiple…
