mistralai/mistral-evals
Language: Python
Stars: 87
Forks: 14
Open issues: 8
Created: 2024-09-13T18:11:29Z
Pushed: 2025-11-21T10:21:09Z
Default branch: main
Fork: no
Archived: no
README:
Mistral Evals
This repository contains code to run evals released by Mistral AI as well as standardized prompts, parsing and metrics computation for popular academic benchmarks.
Installation
pip install -r requirements.txt
Evals
We support the following evals in this repository:
mm_mt_bench: MM-MT-Bench is a multi-turn LLM-as-a-judge evaluation task released by Mistral AI that uses GPT-4o for judging model answers given reference answers.
vqav2: VQAv2
docvqa: DocVQA
mathvista: MathVista
mmmu: MMMU
chartqa: ChartQA
Example usage:
Step 1: Host a model using vLLM
To install vLLM, follow the directions in the vLLM installation documentation.
>> vllm serve mistralai/Pixtral-12B-2409 --config_format mistral --tokenizer_mode "mistral"
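Before moving on to Step 2, it can help to confirm that the server is reachable. The sketch below is not part of this repository. It assumes the server started above exposes the OpenAI-compatible model listing route at /v1/models on the same host and port used in Step 2; the helper names models_url and served_model_ids are hypothetical.

```python
import json
import urllib.request


def models_url(base_url: str) -> str:
    """Build the model-listing URL for an OpenAI-compatible server."""
    return base_url.rstrip("/") + "/v1/models"


def served_model_ids(base_url: str, timeout: float = 5.0) -> list[str]:
    """Query the server and return the IDs of the models it reports.

    Assumes an OpenAI-style response of the form {"data": [{"id": ...}, ...]}.
    Raises urllib.error.URLError if the server cannot be reached.
    """
    with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]
```

Calling served_model_ids("http://0.0.0.0:8000") once the server has finished loading should include the model name passed to vllm serve, if the assumptions above hold.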
Step 2: Evaluate hosted model.
>> python -m eval.run eval_vllm \
    --model_name mistralai/Pixtral-12B-2409 \
    --url http://0.0.0.0:8000 \
    --output_dir ~/tmp \
    --eval_name "mm_mt_bench"
NOTE: Evaluating MM-MT-Bench requires calls to GPT-4o as a judge, hence you'll need to set the OPENAI_API_KEY environment variable for the eval to work.
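Since a missing key would otherwise only surface once the judge is called, a small pre-flight check can fail early. This is a minimal sketch and not code from this repository; the function name missing_env_vars is hypothetical, and only the variable name OPENAI_API_KEY comes from the note above.

```python
import os
from collections.abc import Mapping
from typing import Optional


def missing_env_vars(
    required: list[str], env: Optional[Mapping[str, str]] = None
) -> list[str]:
    """Return the names in `required` that are unset or empty in `env`.

    `env` defaults to the current process environment.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]
```

A wrapper script could call missing_env_vars(["OPENAI_API_KEY"]) before launching mm_mt_bench and exit with a clear message if the returned list is not empty.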
For evaluating the other supported evals, see the Evals section.
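To run several evals against the same hosted model, the command from Step 2 can be generated per eval name. The sketch below assumes that the other evals accept the same flags as the mm_mt_bench example and differ only in --eval_name; this README does not state that explicitly. The helper build_eval_command is hypothetical and only assembles an argument list, it does not run anything.

```python
# Eval names as listed in the Evals section.
EVAL_NAMES = ["mm_mt_bench", "vqav2", "docvqa", "mathvista", "mmmu", "chartqa"]


def build_eval_command(
    eval_name: str, model_name: str, url: str, output_dir: str
) -> list[str]:
    """Assemble the Step 2 command line as an argument list."""
    if eval_name not in EVAL_NAMES:
        raise ValueError(f"unknown eval name: {eval_name!r}")
    return [
        "python", "-m", "eval.run", "eval_vllm",
        "--model_name", model_name,
        "--url", url,
        "--output_dir", output_dir,
        "--eval_name", eval_name,
    ]
```

Each returned list can be passed to subprocess.run. Note that a path such as ~/tmp is not expanded without a shell, so os.path.expanduser should be applied to output_dir first.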
Evaluating a non-vLLM model
To evaluate your own model, create a subclass of Model that implements a __call__ method. The method takes a chat completion request as input and returns the answer as a string. Requests are provided in vLLM API format.
class CustomModel(Model):

    def __call__(self, request: dict[str, Any]):
        # Your model code
        ...
        return answer
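As a rough illustration of what such a class might look like end to end, the sketch below defines a toy model that returns the text of the latest user message. It rests on two assumptions not confirmed by this README: that requests follow the OpenAI-style chat layout (a "messages" list whose "content" is either a string or a list of typed parts), and that Model can be subclassed with only __call__ overridden. The Model class here is a hypothetical stand-in so the example runs on its own; in practice you would import the repository's class instead.

```python
from typing import Any


class Model:
    """Hypothetical stand-in for the repository's Model base class."""

    def __call__(self, request: dict[str, Any]) -> str:
        raise NotImplementedError


def last_user_text(request: dict[str, Any]) -> str:
    """Collect the text parts of the most recent user message."""
    for message in reversed(request.get("messages", [])):
        if message.get("role") != "user":
            continue
        content = message.get("content", "")
        if isinstance(content, str):
            return content
        # Multimodal content: keep text parts, skip images and other types.
        return " ".join(
            part.get("text", "") for part in content if part.get("type") == "text"
        )
    return ""


class EchoModel(Model):
    """Toy model that answers with the user's own prompt text."""

    def __call__(self, request: dict[str, Any]) -> str:
        return last_user_text(request)
```

A real implementation would replace the body of __call__ with a call into your own inference code, converting the messages (including any image parts) into whatever input format your model expects.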
Usage
You must not use this library or our models in a manner that infringes, misappropriates, or otherwise violates any third party’s rights, including intellectual property rights.
Notability
6.0/10. Notable repo from Mistral, moderate stars.