QwenLM/Qwen-Audio
Description: The official repo of Qwen-Audio (通义千问-Audio) chat & pretrained large audio language model proposed by Alibaba Cloud.
Language: Python
License: NOASSERTION
Stars: 1902
Forks: 145
Open issues: 63
Created: 2023-11-07T06:31:39Z
Pushed: 2024-07-05T09:17:49Z
Default branch: main
Fork: no
Archived: no
README:
Qwen-Audio (Qwen Large Audio Language Model) is the multimodal version of the large model series Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-Audio accepts diverse audio (human speech, natural sound, music and song) and text as inputs, and outputs text. The contributions of Qwen-Audio include:
- Fundamental audio models: Qwen-Audio is a fundamental multi-task audio-language model that supports various tasks, languages, and audio types, serving as a universal audio understanding model. Building upon Qwen-Audio, we develop Qwen-Audio-Chat through instruction fine-tuning, enabling multi-turn dialogues and supporting diverse audio-oriented scenarios.
- Multi-task learning framework for all types of audio: To scale up audio-language pre-training, we address the challenge of variation in the textual labels associated with different datasets by proposing a multi-task training framework, which enables knowledge sharing and avoids one-to-many interference. Our model incorporates more than 30 tasks, and extensive experiments show that it achieves strong performance.
- Strong performance: Experimental results show that Qwen-Audio achieves impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning, surpassing its counterparts. Specifically, Qwen-Audio achieves state-of-the-art results on the test sets of Aishell1, Cochlscene, ClothoAQA, and VocalSound.
- Flexible multi-turn chat from audio and text input: Qwen-Audio supports multiple-audio analysis, sound understanding and reasoning, music appreciation, and tool usage.
We have released two models in the Qwen-Audio series:
- Qwen-Audio: The pre-trained multi-task audio understanding model uses Qwen-7B as the initialization of the LLM, and Whisper-large-v2 as the initialization of the audio encoder.
- Qwen-Audio-Chat: A multimodal LLM-based AI assistant, which is trained with alignment techniques. Qwen-Audio-Chat supports more flexible interaction, such as multiple audio inputs, multi-round question answering, and creative capabilities.
News and Updates
- 2023.11.30 🔥 We have released the checkpoints of both Qwen-Audio and Qwen-Audio-Chat on ModelScope and Hugging Face.
- 2023.11.15 🎉 We released a paper with details about the Qwen-Audio and Qwen-Audio-Chat models, including training details and model performance.
Evaluation
We evaluated Qwen-Audio's abilities on 12 standard benchmarks as follows:
The overall performance is shown below:
The details of the evaluation are as follows:
Automatic Speech Recognition
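| Dataset | Model | dev-clean (WER) | dev-other (WER) | test-clean (WER) | test-other (WER) |
|---|---|---|---|---|---|
| Librispeech | SpeechT5 | 2.1 | 5.5 | 2.4 | 5.8 |
| Librispeech | SpeechNet | - | - | 30.7 | - |
| Librispeech | SLM-FT | - | - | 2.6 | 5.0 |
| Librispeech | SALMONN | - | - | 2.1 | 4.9 |
| Librispeech | Qwen-Audio | 1.8 | 4.0 | 2.0 | 4.2 |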
| Dataset | Model | dev (WER) | test (WER) |
|---|---|---|---|
| Aishell1 | MMSpeech-base | 2.0 | 2.1 |
| Aishell1 | MMSpeech-large | 1.6 | 1.9 |
| Aishell1 | Paraformer-large | - | 2.0 |
| Aishell1 | Qwen-Audio | 1.2 (SOTA) | 1.3 (SOTA) |
| Dataset | Model | Mic (WER) | iOS (WER) | Android (WER) |
|---|---|---|---|---|
| Aishell2 | MMSpeech-base | 4.5 | 3.9 | 4.0 |
| Aishell2 | Paraformer-large | - | 2.9 | - |
| Aishell2 | Qwen-Audio | 3.3 | 3.1 | 3.3 |
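The ASR results above are reported as word error rate (WER), in percent. As a reminder of what the metric measures, here is a minimal sketch of WER as word-level edit distance divided by the reference length. It is an illustration only, not the repository's evaluation script: the function name `word_error_rate` is hypothetical, tokenization is plain whitespace splitting, and no text normalization is applied (real evaluations usually normalize text first, and Chinese is often scored per character).

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
    rate = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER = {100 * rate:.1f}%")
```

Multiplying the returned fraction by 100 gives a value on the same scale as the tables.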
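Speech-to-text Translation
| Dataset | Model | en-de (BLEU) | de-en (BLEU) | en-zh (BLEU) | zh-en (BLEU) | es-en (BLEU) | fr-en (BLEU) | it-en (BLEU) |
|---|---|---|---|---|---|---|---|---|
| CoVoST2 | SALMONN | 18.6 | - | 33.1 | - | - | - | - |
| CoVoST2 | SpeechLLaMA | - | 27.1 | - | 12.3 | 27.9 | 25.2 | 25.9 |
| CoVoST2 | BLSP | 14.1 | - | - | - | - | - | - |
| CoVoST2 | Qwen-Audio | 25.1 | 33.9 | 41.5 | 15.7 | 39.7 | 38.5 | 36.0 |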
Automatic Audio Caption
| Dataset | Model | CIDEr | SPICE | SPIDEr |
|---|---|---|---|---|
| Clotho | Pengi | 0.416 | 0.126 | 0.271 |
| Clotho | Qwen-Audio | 0.441 | 0.136 | 0.288 |
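SPIDEr is conventionally defined as the arithmetic mean of CIDEr and SPICE, which makes the third column easy to sanity-check from the other two. The helper name `spider` below is hypothetical and the snippet is only a worked check, not part of the repository's evaluation code.

```python
def spider(cider, spice):
    """Return SPIDEr, the arithmetic mean of the CIDEr and SPICE scores."""
    return (cider + spice) / 2.0


if __name__ == "__main__":
    # Scores copied from the Clotho rows above.
    print(f"Pengi:      {spider(0.416, 0.126):.4f}")
    print(f"Qwen-Audio: {spider(0.441, 0.136):.4f}")
```

For Qwen-Audio the rounded inputs give 0.2885, while the table lists 0.288, presumably because the reported value was computed from unrounded scores.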
Speech Recognition with Word-level Timestamp
| Dataset | Model | AAC (ms) |
|---|---|---|
| Industrial Data | Force-aligner | 60.3 |
| Industrial Data | Paraformer-large-TP | 65.3 |
| Industrial Data | Qwen-Audio | 51.5 (SOTA) |
Automatic Scene Classification
| Dataset | Model | ACC |
|---|---|---|
| Cochlscene | Cochlscene | 0.669 |
| Cochlscene | Qwen-Audio | 0.795 (SOTA) |
| TUT2017 | Pengi | 0.353 |
| TUT2017 | Qwen-Audio | 0.649 |
Speech Emotion Recognition
| Dataset | Model | ACC |
|---|---|---|
| Meld | WavLM-large | 0.542 |
| Meld | Qwen-Audio | 0.557 |
Audio Question & Answer
| Dataset | Model | ACC | ACC (binary) |
|---|---|---|---|
| ClothoAQA | ClothoAQA | 0.542 | 0.627 |
| ClothoAQA | Pengi | - | 0.645 |
| ClothoAQA | Qwen-Audio | 0.579 | 0.749 |
Vocal Sound Classification
| Dataset | Model | ACC |
|---|---|---|
| VocalSound | CLAP | 0.4945 |
| VocalSound | Pengi | 0.6035 |
| VocalSound | Qwen-Audio | 0.9289 (SOTA) |
Music Note Analysis
| Dataset | Model | NS. Qualities (MAP) | NS. Instrument (ACC) |
|---|---|---|---|
| NSynth | Pengi | 0.3860 | 0.5007 |
| NSynth | Qwen-Audio | 0.4742 | 0.7882 |
We have provided all evaluation scripts to reproduce our results. Please refer to [eval_audio/EVALUATION.md](eval_audio/EVALUATION.md) for details.
Evaluation of Chat
To evaluate the chat abilities of Qwen-Audio-Chat, we provide a [TUTORIAL](TUTORIAL.md) and a demo for users.
Requirements
- Python 3.8 or above
- PyTorch 1.12 or above (2.0 or above is recommended)
- CUDA 11.4 or above is recommended (for GPU users)
- FFmpeg
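The requirements above can be checked before installing anything else. The following is a minimal sketch, not part of the repository: the function name `check_requirements` and the report layout are assumptions made for illustration. It uses only the standard library plus an optional `torch` import, and it treats FFmpeg as available when an `ffmpeg` executable is found on `PATH`.

```python
import shutil
import sys


def check_requirements():
    """Return {requirement: (ok, detail)} for the requirements listed above."""
    report = {}

    # Python 3.8 or above.
    report["python"] = (sys.version_info >= (3, 8), sys.version.split()[0])

    # PyTorch 1.12 or above; CUDA only matters for GPU users.
    try:
        import torch
    except ImportError:
        report["pytorch"] = (False, "not installed")
        report["cuda"] = (False, "PyTorch not installed")
    else:
        # Compare the numeric major.minor prefix, e.g. "2.1.0+cu118" -> (2, 1).
        major, minor = (int(part) for part in torch.__version__.split(".")[:2])
        report["pytorch"] = ((major, minor) >= (1, 12), str(torch.__version__))
        report["cuda"] = (torch.cuda.is_available(), str(torch.version.cuda))

    # FFmpeg must be discoverable on PATH.
    ffmpeg_path = shutil.which("ffmpeg")
    report["ffmpeg"] = (ffmpeg_path is not None, ffmpeg_path or "not found")
    return report


if __name__ == "__main__":
    for name, (ok, detail) in check_requirements().items():
        status = "OK" if ok else "MISSING"
        print(f"{name:8s} {status:8s} {detail}")
```

A `MISSING` entry for `cuda` is expected on CPU-only machines and does not by itself mean the setup is broken.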
Quickstart
Below, we provide simple examples to show how to use Qwen-Audio and Qwen-Audio-Chat with 🤖 ModelScope and 🤗 Transformers.
Before running the code, make sure you have set up the environment and meet the requirements above, then install the dependencies:
pip install -r requirements.txt
Now you can start with ModelScope or Transformers. For more usage, please refer to the [tutorial](TUTORIAL.md). Qwen-Audio models currently perform best with audio clips under 30 seconds.
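Because the models currently work best on clips under 30 seconds, one possible workaround for longer recordings is to split them into shorter spans and process each span separately. This is an assumption on our part, not a procedure the repository prescribes, and the helper name `chunk_spans` is hypothetical. The sketch only computes the `(start, end)` boundaries in seconds; cutting the audio itself is left to FFmpeg or an audio library of your choice.

```python
def chunk_spans(duration_s, max_len_s=30.0):
    """Split [0, duration_s) into consecutive spans of at most max_len_s seconds."""
    if duration_s < 0 or max_len_s <= 0:
        raise ValueError("duration_s must be >= 0 and max_len_s must be > 0")

    spans = []
    start = 0.0
    while start < duration_s:
        # The last span is shortened so it never runs past the end of the clip.
        end = min(start + max_len_s, duration_s)
        spans.append((start, end))
        start = end
    return spans


if __name__ == "__main__":
    for start, end in chunk_spans(65.0):
        print(f"{start:5.1f}s -> {end:5.1f}s")
```

Passing a smaller `max_len_s` (for example 25.0) keeps every span strictly below the 30-second mark. Note that hard cuts can split a word or sound event across two spans, so results near span boundaries may degrade.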
🤗 Transformers
To run inference with Qwen-Audio-Chat, you only need a few lines of code, as demonstrated below. Please make sure that you are using the latest code.
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
import torch
torch.manual_seed(1234)
# Note: The default behavior now has injection attack prevention off.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-Audio-Chat",…Excerpt shown — open the source for the full document.