MoonshotAI/Kimi-Audio
Description: Kimi-Audio, an open-source audio foundation model excelling in audio understanding, generation, and conversation
Language: Python
Stars: 4649
Forks: 361
Open issues: 112
Created: 2025-04-25T10:00:18Z
Pushed: 2025-06-21T15:30:28Z
Default branch: master
Fork: no
Archived: no
README:
Kimi-Audio-7B 🤗 | Kimi-Audio-7B-Instruct 🤗 | 📑 Paper
We present Kimi-Audio, an open-source audio foundation model excelling in audio understanding, generation, and conversation. This repository contains the official implementation, models, and evaluation toolkit for Kimi-Audio.
🔥🔥🔥 News!!
- May 29, 2025: 👋 We release a finetuning example of Kimi-Audio-7B.
- April 27, 2025: 👋 We release pretrained model weights of Kimi-Audio-7B.
- April 25, 2025: 👋 We release the inference code and model weights of Kimi-Audio-7B-Instruct.
- April 25, 2025: 👋 We release the audio evaluation toolkit Kimi-Audio-Evalkit. You can easily reproduce our results and the baselines with this toolkit!
- April 25, 2025: 👋 We release the technical report of Kimi-Audio.
Table of Contents
- [Introduction](#introduction)
- [Architecture Overview](#architecture-overview)
- [Quick Start](#quick-start)
- [Evaluation](#evaluation)
  - [Speech Recognition](#automatic-speech-recognition-asr)
  - [Audio Understanding](#audio-understanding)
  - [Audio-to-Text Chat](#audio-to-text-chat)
  - [Speech Conversation](#speech-conversation)
- [Finetune](#finetune)
- [Evaluation Toolkit](#evaluation-toolkit)
- [Generation Testset](#generation-testset)
- [License](#license)
- [Acknowledgements](#acknowledgements)
- [Citation](#citation)
- [Contact Us](#contact-us)
Introduction
Kimi-Audio is designed as a universal audio foundation model capable of handling a wide variety of audio processing tasks within a single unified framework. Key features include:
- Universal Capabilities: Handle diverse tasks like automatic speech recognition (ASR), audio question answering (AQA), automatic audio captioning (AAC), speech emotion recognition (SER), sound event/scene classification (SEC/ASC), and end-to-end speech conversation.
- State-of-the-Art Performance: Achieve SOTA results on numerous audio benchmarks (see [Evaluation](#evaluation) and the Technical Report).
- Large-Scale Pre-training: Pre-train on over 13 million hours of diverse audio data (speech, music, sounds) and text data, enabling robust audio reasoning and language understanding.
- Novel Architecture: Employ a hybrid audio input (continuous acoustic vectors + discrete semantic tokens) and an LLM core with parallel heads for text and audio token generation.
- Efficient Inference: Feature a chunk-wise streaming detokenizer based on flow matching for low-latency audio generation.
- Open-Source: Release the code and model checkpoints for both pre-training and instruction fine-tuning, and release a comprehensive evaluation toolkit to foster community research and development.
Architecture Overview
Kimi-Audio consists of three main components:
1. Audio Tokenizer: Converts input audio into:
   - Discrete semantic tokens (12.5 Hz) using vector quantization.
   - Continuous acoustic features derived from a Whisper encoder (downsampled to 12.5 Hz).
2. Audio LLM: A transformer-based model (initialized from a pre-trained text LLM like Qwen 2.5 7B) with shared layers processing multimodal inputs, followed by parallel heads for autoregressively generating text tokens and discrete audio semantic tokens.
3. Audio Detokenizer: Converts the predicted discrete semantic audio tokens back into high-fidelity waveforms using a flow-matching model and a vocoder (BigVGAN), supporting chunk-wise streaming with a look-ahead mechanism for low latency.
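To make the 12.5 Hz token rate and the chunk-wise streaming idea concrete, here is a minimal sketch in plain Python. The token rate comes from the overview above. The chunk size, the look-ahead length, and the way look-ahead tokens are treated as read-only context are assumptions made for illustration, not details confirmed by this README.

```python
import math

TOKEN_RATE_HZ = 12.5  # token rate stated in the architecture overview


def num_tokens(duration_s: float) -> int:
    """Number of 12.5 Hz token positions needed to cover duration_s seconds."""
    return math.ceil(duration_s * TOKEN_RATE_HZ)


def chunks_with_lookahead(tokens, chunk_size, lookahead):
    """Yield (chunk, lookahead_context) pairs over a token sequence.

    chunk_size and lookahead are hypothetical values chosen by the caller.
    The look-ahead tokens are only extra context for the current chunk;
    they are yielded again as part of the next chunk.
    """
    for start in range(0, len(tokens), chunk_size):
        end = start + chunk_size
        yield tokens[start:end], tokens[end:end + lookahead]


if __name__ == "__main__":
    print(num_tokens(10.0))        # 125 token positions for 10 s of audio
    print(1000.0 / TOKEN_RATE_HZ)  # 80.0 ms of audio per token
    for chunk, ahead in chunks_with_lookahead(list(range(10)), 4, 2):
        print(chunk, ahead)
```

With a layout like this, a detokenizer could start producing audio for the first chunk as soon as that chunk and its few look-ahead tokens are available, rather than waiting for the whole sequence.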
Getting Started
Step 1: Get the Code
git clone https://github.com/MoonshotAI/Kimi-Audio.git
cd Kimi-Audio
git submodule update --init --recursive
pip install -r requirements.txt
Kimi-Audio can now be installed directly via pip:
pip install torch
pip install git+https://github.com/MoonshotAI/Kimi-Audio.git
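After either installation route, a quick sanity check can confirm that the packages used in the Quick Start are visible to the interpreter. This is a small sketch using only the standard library. The helper name `is_installed` is hypothetical, and the module names are taken from the install commands and the import lines in the Quick Start.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Top-level module names taken from the install and Quick Start snippets.
    for name in ("torch", "soundfile", "kimia_infer"):
        status = "found" if is_installed(name) else "missing"
        print(f"{name}: {status}")
```

The check only locates the modules. It does not load model weights or verify that a GPU is available.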
Quick Start
This example demonstrates basic usage for generating text from audio (ASR) and generating both text and speech in a conversational turn.
import soundfile as sf
from kimia_infer.api.kimia import KimiAudio
# --- 1. Load Model ---
model_path = "moonshotai/Kimi-Audio-7B-Instruct"
model = KimiAudio(model_path=model_path, load_detokenizer=True)
# --- 2. Define Sampling Parameters ---
sampling_params = {
    "audio_temperature": 0.8,
    "audio_top_k": 10,
    "text_temperature": 0.0,
    "text_top_k": 5,
    "audio_repetition_penalty": 1.0,
    "audio_repetition_window_size": 64,
    "text_repetition_penalty": 1.0,
    "text_repetition_window_size": 16,
}
# --- 3. Example 1: Audio-to-Text (ASR) ---
messages_asr = [
    # You can provide context or instructions as text
    {"role": "user", "message_type": "text", "content": "Please transcribe the following audio:"},
    # Provide the audio file path
    {"role": "user", "message_type": "audio", "content": "test_audios/asr_example.wav"}
]
# Generate only text output
_, text_output = model.generate(messages_asr, **sampling_params, output_type="text")
print(">>> ASR Output Text: ", text_output) # Expected output: "这并不是告别,这是一个篇章的结束,也是新篇章的开始。"
# --- 4. Example 2: Audio-to-Audio/Text Conversation ---
messages_conversation = [
    # Start conversation with an audio query
    {"role": "user", "message_type": "audio", "content": "test_audios/qa_example.wav"}
]
# Generate both audio and text output
wav_output, text_output = model.generate(messages_conversation, **sampling_params, output_type="both")
# Save the generated audio
output_audio_path = "output_audio.wav"
sf.write(output_audio_path, wav_output.detach().cpu().view(-1).numpy(), 24000) # Assuming 24kHz output
print(f">>> Conversational Output Audio saved to: {output_audio_path}")
print(">>> Conversational Output Text: ", text_output) # Expected output: "当然可以,这很简单。一二三四五六七八九十。"
# --- 5. Example 3: Audio-to-Audio/Text Conversation with Multiturn ---
messages = [
{"role": "user", "message_type": "audio", "content": "test_audios/multiturn/case2/multiturn_q1.wav"},
#…Excerpt shown — open the source for the full document.
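The `sampling_params` in the Quick Start set a temperature and a top-k value separately for text and audio tokens. The sketch below illustrates the generic technique of temperature scaling with top-k filtering in plain Python. It is not Kimi-Audio's sampler: the function name `sample_top_k` is hypothetical, and treating a temperature of 0 as greedy decoding is a common convention assumed here rather than something this README states.

```python
import math
import random


def sample_top_k(logits, temperature, top_k, rng=random):
    """Pick an index from logits using temperature and top-k filtering.

    Assumption: temperature <= 0 means greedy decoding (argmax).
    """
    if temperature <= 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    # Clamp top_k to a usable range, then keep the highest-scoring candidates.
    top_k = max(1, min(top_k, len(logits)))
    candidates = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Softmax weights over the temperature-scaled logits of the survivors.
    scaled = [logits[i] / temperature for i in candidates]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(candidates, weights=weights, k=1)[0]


if __name__ == "__main__":
    logits = [0.1, 2.0, 1.5, -1.0]
    # Temperature 0.0, as in "text_temperature": always the highest logit.
    print(sample_top_k(logits, temperature=0.0, top_k=5))  # 1
    # A non-zero temperature with top_k=2 can only return index 1 or 2.
    print(sample_top_k(logits, temperature=0.8, top_k=2, rng=random.Random(0)))
```

Under this reading, the Quick Start settings would make text output deterministic while leaving some variation in the generated audio tokens.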
Notability: 7.0/10 (high-starred audio repo)