RepoQwen (Alibaba Cloud)Qwen (Alibaba Cloud)published Jun 24, 2024seen 6d

QwenLM/Qwen2-Audio

Python

Open original ↗

Captured source

source ↗
published Jun 24, 2024seen 6dcaptured 8hhttp 200method plain

QwenLM/Qwen2-Audio

Description: The official repo of Qwen2-Audio chat & pretrained large audio language model proposed by Alibaba Cloud.

Language: Python

Stars: 2078

Forks: 165

Open issues: 115

Created: 2024-06-24T06:11:27Z

Pushed: 2025-04-21T08:50:49Z

Default branch: main

Fork: no

Archived: no

README:

中文 &nbsp| &nbsp English&nbsp&nbsp

Qwen2-Audio-7B 🤖 | 🤗&nbsp | Qwen-Audio-7B-Instruct 🤖 | 🤗&nbsp | Demo 🤖 | 🤗&nbsp

📑 Paper &nbsp&nbsp | &nbsp&nbsp 📑 Blog &nbsp&nbsp | &nbsp&nbsp 💬 WeChat (微信)&nbsp&nbsp | &nbsp&nbsp Discord&nbsp&nbsp

We introduce the latest progress of Qwen-Audio, a large-scale audio-language model called Qwen2-Audio, which is capable of accepting various audio signal inputs and performing audio analysis or direct textual responses with regard to speech instructions. We introduce two distinct audio interaction modes:

  • voice chat: users can freely engage in voice interactions with Qwen2-Audio without text input;
  • audio analysis: users could provide audio and text instructions for analysis during the interaction;

We've released two models of the Qwen2-Audio series: Qwen2-Audio-7B and Qwen2-Audio-7B-Instruct.

Architecture

The overview of three-stage training process of Qwen2-Audio.

News and Updates

  • 2024.8.9 🎉 We released the checkpoints of both Qwen2-Audio-7B and Qwen2-Audio-7B-Instruct on ModelScope and Hugging Face.
  • 2024.7.15 🎉 We released the paper of Qwen2-Audio, introducing the relevant model structure, training methods, and model performance. Check our report for details!
  • 2023.11.30 🔥 We released the Qwen-Audio series.

Evaluation

We evaluated the Qwen2-Audio's abilities on 13 standard benchmarks as follows: TaskDescriptionDatasetSplitMetricASRAutomatic Speech RecognitionFleursdev | testWERAishell2testLibrispeechdev | testCommon Voicedev | testS2TTSpeech-to-Text TranslationCoVoST2testBLEU SERSpeech Emotion RecognitionMeldtestACCVSCVocal Sound ClassificationVocalSoundtestACCAIR-Bench Chat-Benchmark-SpeechFisher SpokenWOZ IEMOCAP Common voicedev | testGPT-4 EvalChat-Benchmark-SoundClothodev | testGPT-4 Eval Chat-Benchmark-MusicMusicCapsdev | testGPT-4 EvalChat-Benchmark-Mixed-AudioCommon voice AudioCaps MusicCapsdev | testGPT-4 Eval

The below is the overal performance:

The details of evaluation are as follows:

(Note: The evaluation results we present are based on the initial model of the original training framework. However, the scores showed some fluctuations after converting the framework to Huggingface. Here, we present our complete evaluation results, starting with the initial model results from the paper.)

TaskDatasetModelPerformanceMetricsResultsASRLibrispeech dev-clean | dev-other | test-clean | test-otherSpeechT5WER 2.1 | 5.5 | 2.4 | 5.8SpeechNet- | - | 30.7 | -SLM-FT- | - | 2.6 | 5.0SALMONN- | - | 2.1 | 4.9SpeechVerse- | - | 2.1 | 4.4Qwen-Audio1.8 | 4.0 | 2.0 | 4.2Qwen2-Audio1.3 | 3.4 | 1.6 | 3.6Common Voice 15 en | zh | yue | frWhisper-large-v3WER 9.3 | 12.8 | 10.9 | 10.8Qwen2-Audio8.6 | 6.9 | 5.9 | 9.6 Fleurs zhWhisper-large-v3WER 7.7Qwen2-Audio7.5Aishell2 Mic | iOS | AndroidMMSpeech-baseWER 4.5 | 3.9 | 4.0Paraformer-large- | 2.9 | -Qwen-Audio3.3 | 3.1 | 3.3Qwen2-Audio3.0 | 3.0 | 2.9S2TTCoVoST2 en-de | de-en | en-zh | zh-enSALMONNBLEU 18.6 | - | 33.1 | -SpeechLLaMA- | 27.1 | - | 12.3BLSP14.1 | - | - | -Qwen-Audio25.1 | 33.9 | 41.5 | 15.7Qwen2-Audio29.9 | 35.2 | 45.2 | 24.4 CoVoST2 es-en | fr-en | it-en |SpeechLLaMABLEU 27.9 | 25.2 | 25.9Qwen-Audio39.7 | 38.5 | 36.0Qwen2-Audio40.0 | 38.5 | 36.3SERMeldWavLM-largeACC 0.542Qwen-Audio0.557Qwen2-Audio0.553VSCVocalSoundCLAPACC 0.4945Pengi0.6035Qwen-Audio0.9289Qwen2-Audio0.9392 AIR-Bench Chat Benchmark Speech | Sound | Music | Mixed-AudioSALMONN BLSP Pandagpt Macaw-LLM SpeechGPT Next-gpt Qwen-Audio Gemini-1.5-pro Qwen2-AudioGPT-4 6.16 | 6.28 | 5.95 | 6.08 6.17 | 5.55 | 5.08 | 5.33 3.58 | 5.46 | 5.06 | 4.25 0.97 | 1.01 | 0.91 | 1.01 1.57 | 0.95 | 0.95 | 4.13 3.86 | 4.76 | 4.18 | 4.13 6.47 | 6.95 | 5.52 | 6.08 6.97 | 5.49 | 5.06 | 5.27 7.18 | 6.99 | 6.79 | 6.77

(Second is after converting huggingface)

TaskDatasetModelPerformanceMetricsResultsASRLibrispeech dev-clean | dev-other | test-clean | test-otherSpeechT5WER 2.1 | 5.5 | 2.4 | 5.8SpeechNet- | - | 30.7 | -SLM-FT- | - | 2.6 | 5.0SALMONN- | - | 2.1 | 4.9SpeechVerse- | - | 2.1 | 4.4Qwen-Audio1.8 | 4.0 | 2.0 | 4.2Qwen2-Audio1.7 | 3.6 | 1.7 | 4.0Common Voice 15 en | zh | yue | frWhisper-large-v3WER 9.3 | 12.8 | 10.9 | 10.8Qwen2-Audio8.7 | 6.5 | 5.9 | 9.6 Fleurs zhWhisper-large-v3WER 7.7Qwen2-Audio7.0Aishell2 Mic | iOS | AndroidMMSpeech-baseWER 4.5 | 3.9 | 4.0Paraformer-large- | 2.9 | -Qwen-Audio3.3 | 3.1 | 3.3Qwen2-Audio3.2 | 3.1 | 2.9S2TTCoVoST2 en-de | de-en | en-zh | zh-enSALMONNBLEU 18.6 | - | 33.1 | -SpeechLLaMA- | 27.1 | - | 12.3BLSP14.1 | - | - | -Qwen-Audio25.1 | 33.9 | 41.5 | 15.7Qwen2-Audio29.6 | 33.6 | 45.6 | 24.0 CoVoST2 es-en | fr-en | it-en |SpeechLLaMABLEU 27.9 | 25.2 | 25.9Qwen-Audio39.7 | 38.5 | 36.0Qwen2-Audio38.7 | 37.2 | 35.2SERMeldWavLM-largeACC 0.542Qwen-Audio0.557Qwen2-Audio0.535VSCVocalSoundCLAPACC 0.4945Pengi0.6035Qwen-Audio0.9289Qwen2-Audio0.9395 AIR-Bench Chat Benchmark Speech | Sound | Music | Mixed-AudioSALMONN BLSP Pandagpt Macaw-LLM SpeechGPT Next-gpt Qwen-Audio Gemini-1.5-pro Qwen2-AudioGPT-4 6.16 | 6.28 | 5.95 | 6.08 6.17 | 5.55 | 5.08 | 5.33 3.58 | 5.46 | 5.06 | 4.25 0.97 | 1.01 | 0.91 | 1.01 1.57 | 0.95 | 0.95 | 4.13 3.86 | 4.76 | 4.18 | 4.13 6.47 | 6.95 | 5.52 | 6.08 6.97 | 5.49 | 5.06 | 5.27 7.24 | 6.83 | 6.73 | 6.42

We have provided all evaluation scripts to reproduce our results. Please refer to [eval_audio/EVALUATION.md](eval_audio/EVALUATION.md) for details.

Requirements

The code of Qwen2-Audio has been in the latest Hugging face transformers and we advise you to build from source with command pip install git+https://github.com/huggingface/transformers, or you might encounter the following error:

KeyError: 'qwen2-audio'

Quickstart

Below, we provide simple examples to show how to use Qwen2-Audio and Qwen2-Audio-Instruct with 🤗 Transformers. Before running the code, make sure you have setup the environment and installed the required packages. Make sure you meet the above requirements, and then install the dependent libraries. Now you can start with ModelScope or Transformers. Qwen2-Audio models currently perform best with audio clips under 30 seconds.

🤗 Transformers

In the…

Excerpt shown — open the source for the full document.