What does this model signal mean?

Mistral AI published mistralai/Voxtral-Small-24B-2507. This model signal is evidence of what shipped on model infrastructure and how the release is positioned. High-signal details: license apache-2.0 · 43.7K HF downloads · New Mistral model with strong community traction.. onlylabs links this event to 1 captured evidence page and 6 related model signals.

Mistral AI Model: mistralai/Voxtral-Small-24B-2507

Captured source

source ↗

Hugging Face/huggingface.co/mistralai/Voxtral-Small-24B-2507

mistralai/Voxtral-Small-24B-2507 model card

Source ↗

published Jul 1, 2025seen 5dcaptured 14hhttp 200method plaintask audio-text-to-textlicense apache-2.0library vllmparams 24Bdownloads 44klikes 498

Voxtral Small 1.0 (24B) - 2507

Voxtral Small is an enhancement of Mistral Small 3, incorporating state-of-the-art audio input capabilities while retaining best-in-class text performance. It excels at speech transcription, translation and audio understanding.

Learn more about Voxtral in our blog post here and our research paper.

Key Features

Voxtral builds upon Mistral Small 3 with powerful audio understanding capabilities.

Dedicated transcription mode: Voxtral can operate in a pure speech transcription mode to maximize performance. By default, Voxtral automatically predicts the source audio language and transcribes the text accordingly
Long-form context: With a 32k token context length, Voxtral handles audios up to 30 minutes for transcription, or 40 minutes for understanding
Built-in Q&A and summarization: Supports asking questions directly through audio. Analyze audio and generate structured summaries without the need for separate ASR and language models
Natively multilingual: Automatic language detection and state-of-the-art performance in the world’s most widely used languages (English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian)
Function-calling straight from voice: Enables direct triggering of backend functions, workflows, or API calls based on spoken user intents
Highly capable at text: Retains the text understanding capabilities of its language model backbone, Mistral Small 3.1

Benchmark Results

Audio

Average word error rate (WER) over the FLEURS, Mozilla Common Voice and Multilingual LibriSpeech benchmarks:

!image/png

Text

!image/png

Usage

The model can be used with the following frameworks;

`vllm (recommended)`: See [here](#vllm-recommended)
`Transformers` 🤗: See [here](#transformers-🤗)

Notes:

temperature=0.2 and top_p=0.95 for chat completion (*e.g. Audio Understanding*) and temperature=0.0 for transcription
Multiple audios per message and multiple user turns with audio are supported
Function calling is supported
System prompts are not yet supported

vLLM (recommended)

We recommend using this model with vLLM.

Installation

Make sure to install vllm >= 0.10.0, we recommend using uv

uv pip install -U "vllm[audio]" --system

Doing so should automatically install `mistral_common >= 1.8.1`.

To check:

python -c "import mistral_common; print(mistral_common.__version__)"

Offline

You can test that your vLLM setup works as expected by cloning the vLLM repo:

git clone https://github.com/vllm-project/vllm && cd vllm

and then running:

python examples/offline_inference/audio_language.py --num-audios 2 --model-type voxtral

Serve

We recommend that you use Voxtral-Small-24B-2507 in a server/client setting.

1. Spin up a server:

vllm serve mistralai/Voxtral-Small-24B-2507 --tokenizer_mode mistral --config_format mistral --load_format mistral --tensor-parallel-size 2 --tool-call-parser mistral --enable-auto-tool-choice

Note: Running Voxtral-Small-24B-2507 on GPU requires ~55 GB of GPU RAM in bf16 or fp16.

2. To ping the client you can use a simple Python snippet. See the following examples.

Audio Instruct

Leverage the audio capabilities of Voxtral-Small-24B-2507 to chat.

Make sure that your client has mistral-common with audio installed:

pip install --upgrade mistral_common\[audio\]

Python snippet

from mistral_common.protocol.instruct.messages import TextChunk, AudioChunk, UserMessage, AssistantMessage, RawAudio
from mistral_common.audio import Audio
from huggingface_hub import hf_hub_download

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://:8000/v1"

client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
bcn_file = hf_hub_download("patrickvonplaten/audio_samples", "bcn_weather.mp3", repo_type="dataset")

def file_to_chunk(file: str) -> AudioChunk:
audio = Audio.from_file(file, strict=False)
return AudioChunk.from_audio(audio)

text_chunk = TextChunk(text="Which speaker is more inspiring? Why? How are they different from each other? Answer in French.")
user_msg = UserMessage(content=[file_to_chunk(obama_file), file_to_chunk(bcn_file), text_chunk]).to_openai()

print(30 * "=" + "USER 1" + 30 * "=")
print(text_chunk.text)
print("\n\n")

response = client.chat.completions.create(
model=model,
messages=[user_msg],
temperature=0.2,
top_p=0.95,
)
content = response.choices[0].message.content

print(30 * "=" + "BOT 1" + 30 * "=")
print(content)
print("\n\n")
# The model could give the following answer:
# ```L'orateur le plus inspirant est le président.
# Il est plus inspirant parce qu'il parle de ses expériences personnelles
# et de son optimisme pour l'avenir du pays.
# Il est différent de l'autre orateur car il ne parle pas de la météo,
# mais plutôt de ses interactions avec les gens et de son rôle en tant que président.```

messages = [
user_msg,
AssistantMessage(content=content).to_openai(),
UserMessage(content="Ok, now please summarize the content of the first audio.").to_openai()
]
print(30 * "=" + "USER 2" + 30 * "=")
print(messages[-1]["content"])
print("\n\n")

response = client.chat.completions.create(
model=model,
messages=messages,
temperature=0.2,
top_p=0.95,
)
content = response.choices[0].message.content
print(30 * "=" + "BOT 2" + 30 * "=")
print(content)

Transcription

Voxtral-Small-24B-2507 has powerful transcription capabilities!

Make sure that your client has mistral-common with audio installed:

pip install --upgrade mistral_common\[audio\]

Python snippet

from mistral_common.protocol.transcription.request import…

Excerpt shown — open the source for the full document.

Notability

notability 7.0/10

New Mistral model with strong community traction.