QwenLM/Qwen-Audio


Description: The official repo of Qwen-Audio (通义千问-Audio) chat & pretrained large audio language model proposed by Alibaba Cloud.

Language: Python

License: NOASSERTION

Stars: 1902

Forks: 145

Open issues: 63

Created: 2023-11-07T06:31:39Z

Pushed: 2024-07-05T09:17:49Z

Default branch: main

Fork: no

Archived: no

README:

Chinese | English

Qwen-Audio 🤖 | 🤗 | Qwen-Audio-Chat 🤖 | 🤗 | Demo 🤖 | 🤗

Homepage | Paper | WeChat | Discord

Qwen-Audio (Qwen Large Audio Language Model) is the multimodal version of the large model series Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-Audio accepts diverse audio (human speech, natural sound, music and song) and text as inputs, and outputs text. The contributions of Qwen-Audio include:

  • Fundamental audio models: Qwen-Audio is a fundamental multi-task audio-language model that supports various tasks, languages, and audio types, serving as a universal audio understanding model. Building upon Qwen-Audio, we develop Qwen-Audio-Chat through instruction fine-tuning, enabling multi-turn dialogues and supporting diverse audio-oriented scenarios.
  • Multi-task learning framework for all types of audio: To scale up audio-language pre-training, we address the variation in textual labels across datasets by proposing a multi-task training framework that enables knowledge sharing and avoids one-to-many interference. Our model incorporates more than 30 tasks, and extensive experiments show that it achieves strong performance.
  • Strong performance: Experimental results show that Qwen-Audio achieves impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning, surpassing its counterparts. Specifically, Qwen-Audio achieves state-of-the-art results on the test sets of Aishell1, Cochlscene, ClothoAQA, and VocalSound.
  • Flexible multi-turn chat from audio and text input: Qwen-Audio supports multiple-audio analysis, sound understanding and reasoning, music appreciation, and tool usage.

We have released two models in the Qwen-Audio series:

  • Qwen-Audio: The pre-trained multi-task audio understanding model uses Qwen-7B as the initialization of the LLM, and Whisper-large-v2 as the initialization of the audio encoder.
  • Qwen-Audio-Chat: A multimodal LLM-based AI assistant, which is trained with alignment techniques. Qwen-Audio-Chat supports more flexible interaction, such as multiple audio inputs, multi-round question answering, and creative capabilities.

News and Updates

  • 2023.11.30 🔥 We have released the checkpoints of both Qwen-Audio and Qwen-Audio-Chat on ModelScope and Hugging Face.
  • 2023.11.15 🎉 We released a paper describing the Qwen-Audio and Qwen-Audio-Chat models, including training details and model performance.

Evaluation

We evaluated Qwen-Audio's abilities on 12 standard benchmarks as follows:

The overall performance is shown below:

The details of the evaluation are as follows:

Automatic Speech Recognition

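| Dataset | Model | dev-clean (WER) | dev-other (WER) | test-clean (WER) | test-other (WER) |
|---|---|---|---|---|---|
| Librispeech | SpeechT5 | 2.1 | 5.5 | 2.4 | 5.8 |
| Librispeech | SpeechNet | - | - | 30.7 | - |
| Librispeech | SLM-FT | - | - | 2.6 | 5.0 |
| Librispeech | SALMONN | - | - | 2.1 | 4.9 |
| Librispeech | Qwen-Audio | 1.8 | 4.0 | 2.0 | 4.2 |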

| Dataset | Model | dev (WER) | test (WER) |
|---|---|---|---|
| Aishell1 | MMSpeech-base | 2.0 | 2.1 |
| Aishell1 | MMSpeech-large | 1.6 | 1.9 |
| Aishell1 | Paraformer-large | - | 2.0 |
| Aishell1 | Qwen-Audio | 1.2 (SOTA) | 1.3 (SOTA) |

| Dataset | Model | Mic (WER) | iOS (WER) | Android (WER) |
|---|---|---|---|---|
| Aishell2 | MMSpeech-base | 4.5 | 3.9 | 4.0 |
| Aishell2 | Paraformer-large | - | 2.9 | - |
| Aishell2 | Qwen-Audio | 3.3 | 3.1 | 3.3 |
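
The ASR results above are word error rates (WER). As a reminder of what that metric measures, here is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The function name `word_error_rate` is illustrative only, and this is not the repository's evaluation script, which may apply its own text normalization (see eval_audio/EVALUATION.md).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution in a four-word reference.
    print(word_error_rate("the cat sat down", "the cat sit down"))
```

The function returns a fraction; the tables appear to report the same quantity as a percentage, so a returned value of 0.02 would correspond to a table entry of 2.0.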

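Speech-to-text Translation

| Dataset | Model | en-de (BLEU) | de-en (BLEU) | en-zh (BLEU) | zh-en (BLEU) | es-en (BLEU) | fr-en (BLEU) | it-en (BLEU) |
|---|---|---|---|---|---|---|---|---|
| CoVoST2 | SALMONN | 18.6 | - | 33.1 | - | - | - | - |
| CoVoST2 | SpeechLLaMA | - | 27.1 | - | 12.3 | 27.9 | 25.2 | 25.9 |
| CoVoST2 | BLSP | 14.1 | - | - | - | - | - | - |
| CoVoST2 | Qwen-Audio | 25.1 | 33.9 | 41.5 | 15.7 | 39.7 | 38.5 | 36.0 |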

Automatic Audio Caption

| Dataset | Model | CIDEr | SPICE | SPIDEr |
|---|---|---|---|---|
| Clotho | Pengi | 0.416 | 0.126 | 0.271 |
| Clotho | Qwen-Audio | 0.441 | 0.136 | 0.288 |
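
The three captioning columns are related: SPIDEr is conventionally defined as the arithmetic mean of the CIDEr and SPICE scores. The short sketch below recomputes it from the Clotho numbers above; the helper name `spider` is illustrative and not part of the repository. For Pengi this gives 0.271, and for Qwen-Audio about 0.2885, which matches the reported 0.288 up to rounding of the inputs.

```python
def spider(cider: float, spice: float) -> float:
    """SPIDEr is the arithmetic mean of the CIDEr and SPICE scores."""
    return (cider + spice) / 2


if __name__ == "__main__":
    # CIDEr and SPICE scores copied from the Clotho results above.
    clotho_scores = [("Pengi", 0.416, 0.126), ("Qwen-Audio", 0.441, 0.136)]
    for model, cider, spice in clotho_scores:
        print(f"{model}: SPIDEr = {spider(cider, spice):.4f}")
```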

Speech Recognition with Word-level Timestamp

| Dataset | Model | AAC (ms) |
|---|---|---|
| Industrial Data | Force-aligner | 60.3 |
| Industrial Data | Paraformer-large-TP | 65.3 |
| Industrial Data | Qwen-Audio | 51.5 (SOTA) |

Automatic Scene Classification

| Dataset | Model | ACC |
|---|---|---|
| Cochlscene | Cochlscene | 0.669 |
| Cochlscene | Qwen-Audio | 0.795 (SOTA) |
| TUT2017 | Pengi | 0.353 |
| TUT2017 | Qwen-Audio | 0.649 |

Speech Emotion Recognition

| Dataset | Model | ACC |
|---|---|---|
| Meld | WavLM-large | 0.542 |
| Meld | Qwen-Audio | 0.557 |

Audio Question & Answer

| Dataset | Model | ACC | ACC (binary) |
|---|---|---|---|
| ClothoAQA | ClothoAQA | 0.542 | 0.627 |
| ClothoAQA | Pengi | - | 0.645 |
| ClothoAQA | Qwen-Audio | 0.579 | 0.749 |

Vocal Sound Classification

| Dataset | Model | ACC |
|---|---|---|
| VocalSound | CLAP | 0.4945 |
| VocalSound | Pengi | 0.6035 |
| VocalSound | Qwen-Audio | 0.9289 (SOTA) |

Music Note Analysis

| Dataset | Model | NS. Qualities (MAP) | NS. Instrument (ACC) |
|---|---|---|---|
| NSynth | Pengi | 0.3860 | 0.5007 |
| NSynth | Qwen-Audio | 0.4742 | 0.7882 |

We have provided all evaluation scripts to reproduce our results. Please refer to [eval_audio/EVALUATION.md](eval_audio/EVALUATION.md) for details.

Evaluation of Chat

To evaluate the chat abilities of Qwen-Audio-Chat, we provide a [TUTORIAL](TUTORIAL.md) and a demo for users.

Requirements

  • Python 3.8 or later
  • PyTorch 1.12 or later (2.0 or later is recommended)
  • CUDA 11.4 or later is recommended (for GPU users)
  • FFmpeg
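
The requirements listed here can be checked before installing anything else. The script below is a minimal sketch using only the standard library plus an optional `torch` import; the function name `check_requirements` and the labels it prints are illustrative and not part of the repository. It only checks that CUDA is usable from PyTorch and does not verify the CUDA toolkit version.

```python
import importlib.util
import shutil
import sys


def check_requirements() -> dict:
    """Return a mapping of requirement label -> whether it looks satisfied."""
    results = {
        "python>=3.8": sys.version_info >= (3, 8),
        "ffmpeg on PATH": shutil.which("ffmpeg") is not None,
        "torch installed": importlib.util.find_spec("torch") is not None,
    }
    if results["torch installed"]:
        import torch

        # Compare only "major.minor", e.g. "2.1.0+cu118" -> (2, 1).
        major, minor = (int(part) for part in torch.__version__.split(".")[:2])
        results["torch>=1.12"] = (major, minor) >= (1, 12)
        results["CUDA available"] = torch.cuda.is_available()
    return results


if __name__ == "__main__":
    for label, ok in check_requirements().items():
        print(f"{'OK     ' if ok else 'MISSING'}  {label}")
```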

Quickstart

Below, we provide simple examples to show how to use Qwen-Audio and Qwen-Audio-Chat with 🤖 ModelScope and 🤗 Transformers.

Before running the code, make sure you have set up the environment, meet the requirements above, and have installed the dependent libraries:

pip install -r requirements.txt

Now you can start with ModelScope or Transformers. For more usage examples, please refer to the [tutorial](TUTORIAL.md). Qwen-Audio models currently perform best with audio clips under 30 seconds.
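
Because of the 30-second guidance, it can help to check clip length before sending audio to the model. The following is a minimal sketch using Python's standard `wave` module, so it handles only uncompressed PCM WAV files; other formats would need a tool such as FFmpeg. The helper names `wav_duration_seconds` and `is_within_limit` and the constant `MAX_SECONDS` are illustrative and not part of the repository.

```python
import wave

MAX_SECONDS = 30.0  # clip length under which the models are said to perform best


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_within_limit(path: str, max_seconds: float = MAX_SECONDS) -> bool:
    """Return True if the clip is shorter than max_seconds."""
    return wav_duration_seconds(path) < max_seconds


if __name__ == "__main__":
    import sys

    # Usage: python check_clips.py clip1.wav clip2.wav ...
    for path in sys.argv[1:]:
        note = "ok" if is_within_limit(path) else "longer than 30 s, consider splitting"
        print(f"{path}: {wav_duration_seconds(path):.1f} s ({note})")
```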

🤗 Transformers

To run inference with Qwen-Audio-Chat, only a few lines of code are needed, as demonstrated below. Please make sure that you are using the latest code.

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
import torch
torch.manual_seed(1234)

# Note: The default behavior now has injection attack prevention off.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-Audio-Chat",…
