Mistral AI, published Sep 17, 2024

Announcing Pixtral 12B


September 17, 2024, by the Mistral AI team


Heads up: this model is deprecated. Pixtral 12B is no longer maintained and has been replaced by our latest, more powerful vision and multimodal models.

Pixtral 12B in short:

Natively multimodal, trained with interleaved image and text data

Strong performance on multimodal tasks, excels in instruction following

Maintains state-of-the-art performance on text-only benchmarks

Architecture:

New 400M parameter vision encoder trained from scratch

12B parameter multimodal decoder based on Mistral Nemo

Supports variable image sizes and aspect ratios

Supports multiple images in the long context window of 128k tokens

Use:

License: Apache 2.0

Try it on La Plateforme or on Le Chat

Pixtral is trained to understand both natural images and documents, achieving 52.5% on the MMMU reasoning benchmark, surpassing a number of larger models. The model shows strong abilities in tasks such as chart and figure understanding, document question answering, multimodal reasoning and instruction following. Pixtral is able to ingest images at their natural resolution and aspect ratio, giving the user flexibility on the number of tokens used to process an image. Pixtral is also able to process any number of images in its long context window of 128K tokens. Unlike previous open-source models, Pixtral does not compromise on text benchmark performance to excel in multimodal tasks.

Performance

Pixtral was trained to be a drop-in replacement for Mistral Nemo 12B. Its key distinguishing factor from existing open-source models is the delivery of best-in-class multimodal reasoning without compromising on key text capabilities such as instruction following, coding, and math.

Evaluation protocol

We re-evaluate a range of open and closed models through the same evaluation harness. For each dataset, the prompt was chosen such that we could reproduce the results of leading multimodal models (GPT-4o and Claude-3.5-Sonnet). All models were then evaluated with this same prompt. Overall, Pixtral substantially outperforms all open models around its scale and, in many cases, outperforms closed models such as Claude 3 Haiku. Pixtral even outperforms or matches the performance of much larger models like LLaVa OneVision 72B on multimodal benchmarks. All prompts will be open-sourced.

Performance of Pixtral compared to closed and larger multimodal models. [All models were benchmarked through the same evaluation harness and with the same prompt. We verify that prompts reproduce the performance reported for GPT-4o and Claude 3.5 Sonnet (prompts will be provided in technical report)].

Instruction following

Pixtral particularly excels at both multimodal and text-only instruction following as compared to other open multimodal models. It substantially outperforms Qwen2-VL 7B, LLaVa-OneVision 7B and Phi-3.5 Vision in instruction following, with a 20% relative improvement in text IF-Eval and MT-Bench over the nearest OSS model. To further evaluate this ability for multimodal use cases, we create multimodal versions of these benchmarks: MM-IF-Eval and MM-MT-Bench. Pixtral outperforms open-source alternatives on multimodal instruction following benchmarks as well. We will open-source MM-MT-Bench to the community.
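For readers unsure what a "20% relative improvement" means as opposed to a 20-point gain, here is a minimal sketch of the arithmetic. The function name and the scores are purely illustrative and are not taken from the post or from any benchmark table.

```python
def relative_improvement(baseline: float, score: float) -> float:
    """Fractional gain of `score` over `baseline` (0.2 means 20%)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (score - baseline) / baseline


# Made-up scores, for illustration only: a baseline of 50.0 and a new
# score of 60.0 differ by 10 points, which is a 20% relative improvement.
gain = relative_improvement(50.0, 60.0)
print(f"{gain:.0%}")
```

The same absolute gap counts for more in relative terms when the baseline is low, which is worth keeping in mind when comparing relative gains across benchmarks.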

Performance of Pixtral compared to open multimodal models. All models were benchmarked through the same evaluation harness and with the same prompt.

Architecture

Variable image size: Pixtral is designed to optimize for both speed and performance. We trained a new vision encoder that natively supports variable image sizes:

We simply pass images through the vision encoder at their native resolution and aspect ratio, converting them into image tokens for each 16x16 patch in the image.

These tokens are then flattened to create a sequence, with [IMG BREAK] and [IMG END] tokens added between rows and at the end of the image.

[IMG BREAK] tokens let the model distinguish between images of different aspect ratios with the same number of tokens.

In this way, Pixtral can be used to accurately understand complex diagrams, charts and documents in high resolution, while providing fast inference speeds on small images like icons, clipart, and equations.
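The tokenization steps above can be sketched in a few lines of Python. This is an illustration of the layout as described, not Mistral's implementation: the `[IMG]` placeholder string, the function names, and the choice to round partial patches up are all assumptions, and the real encoder may resize or pad images differently.

```python
import math

PATCH = 16                  # patch side in pixels, as stated in the post
IMG = "[IMG]"               # hypothetical placeholder for one patch token
IMG_BREAK = "[IMG BREAK]"   # separates rows of patches
IMG_END = "[IMG END]"       # closes the image


def patch_grid(width: int, height: int, patch: int = PATCH) -> tuple[int, int]:
    """Return (rows, cols) of patches; partial patches round up (assumption)."""
    return math.ceil(height / patch), math.ceil(width / patch)


def image_token_layout(width: int, height: int, patch: int = PATCH) -> list[str]:
    """Flatten the patch grid row by row, with a break between rows and an
    end token after the last row."""
    rows, cols = patch_grid(width, height, patch)
    tokens: list[str] = []
    for r in range(rows):
        tokens.extend([IMG] * cols)
        tokens.append(IMG_END if r == rows - 1 else IMG_BREAK)
    return tokens


# Two images with the same number of patches but different aspect ratios.
wide = image_token_layout(128, 32)    # 2 rows x 8 cols = 16 patches
square = image_token_layout(64, 64)   # 4 rows x 4 cols = 16 patches
print(wide.count(IMG), square.count(IMG))  # 16 16
print(wide == square)                      # False
```

Both images produce 16 patch tokens, so without the row separators their flattened sequences would be indistinguishable. The position of each [IMG BREAK] is what encodes the row length, and therefore the aspect ratio.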

Pixtral uses a new vision encoder trained from scratch that natively supports variable image sizes.

Final architecture: Pixtral has two components: the Vision Encoder, which tokenizes images, and a Multimodal Transformer Decoder, which predicts the next text token given a sequence of text and images. The model is trained to predict the next text token on interleaved image and text data. This architecture allows Pixtral to process any number of images with arbitrary sizes in its large context window of 128K tokens.
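Because every image is paid for in context tokens, "any number of images" is in practice bounded by the window. The following back-of-the-envelope sketch estimates how many same-sized images fit alongside a text prompt. It rests on assumptions not confirmed by the post: "128K" is read as 128,000 tokens, partial patches round up, and each image costs one token per 16x16 patch plus one separator or end token per row. The function names are hypothetical.

```python
import math

CONTEXT_TOKENS = 128_000  # assumption: "128K" read as 128,000 tokens


def image_tokens(width: int, height: int, patch: int = 16) -> int:
    """Estimated tokens for one image: patches plus one break/end per row."""
    rows = math.ceil(height / patch)
    cols = math.ceil(width / patch)
    return rows * (cols + 1)


def max_images(width: int, height: int, text_tokens: int = 0,
               context: int = CONTEXT_TOKENS) -> int:
    """How many images of this size fit after reserving `text_tokens`."""
    budget = context - text_tokens
    return max(0, budget // image_tokens(width, height))


print(image_tokens(1024, 1024))                  # 4160
print(max_images(1024, 1024, text_tokens=2000))  # 30
print(max_images(64, 64))                        # 6400
```

Under these assumptions a 1024x1024 image costs roughly 4K tokens while a 64x64 icon costs 20, which illustrates the trade-off the post describes between high-resolution understanding and fast inference on small images.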

Qualitative Examples

Reasoning over complex figures

This table combines the data from both tables, ensuring that all models and their respective scores from each benchmark are included in a single, cohesive format.

Prompt: List the top 5 countries in Europe with the highest GDP

Response: The top 5 countries in Europe with the highest GDP, based on the image, are:

Germany
GDP: $3.99 trillion
Percentage: 4.65%

United Kingdom
GDP: $2.82 trillion
Percentage: 3.29%

France
GDP: $2.78 trillion
Percentage: 3.24%

Italy
GDP: $2.07 trillion
Percentage: 2.42%

Spain
GDP: $1.43 trillion
Percentage: 1.66%

These countries are highlighted in green on the diagram.

Chart Understanding and Analysis

Image:

Prompt: When did things start to go wrong for dark dragon?

Response: The image is a line graph titled 'train/loss' that shows the training loss over steps for four different models: dark-dragon-50, dainty-sunset-49, different-haze-34, and morning-field-33. The x-axis represents the steps, ranging from 0 to 60k, and the y-axis represents the loss on a logarithmic scale from 1e-5 to 1e+5.

To determine when things started to go wrong for the dark-dragon-50 model, we need to look at the red line representing this model. Initially, the loss decreases, indicating that the model is learning effectively. However, around the 10k step mark, the loss starts to spike significantly. This spike indicates that the model's performance is deteriorating. The loss continues to spike and fluctuate, reaching very high values, which suggests that the...
