meituan-longcat/Meeseeks
Python
Captured source
source ↗meituan-longcat/Meeseeks
Description: A iterative feedback driven benchmark on LLM's instruction following ability
Language: Python
Stars: 57
Forks: 6
Open issues: 1
Created: 2025-05-09T08:44:50Z
Pushed: 2026-05-25T09:06:26Z
Default branch: main
Fork: no
Archived: no
README:
---
🚀 Latest News
We officially released the multilingual version of Meeseeks!
📋 Previous Versions
Temporarily removed for arr submitting
📖 Introduction
Meeseeks is an instruction-following benchmark designed to evaluate how well models can adhere to user instructions in a multi-turn scenario. A key feature of Meeseeks is its self-correction loop, where models receive structured feedback and must refine their responses accordingly.
This benchmark provides a realistic evaluation of a model’s adaptability, instruction adherence, and iterative improvement.
---
📊 Leaderboard

---
🍄🟫 A Quick Example
ROUND1-Input Evaluation Content Capability tags
Generate 32 colloquial user comments and 40 formal user comments from a consumer perspective in short video comment sections. Each comment should be exactly 7 characters long and must not contain the following words:["this", "good", "that"] Whether 32 colloquial user comments were generated Element number requirement
Whether 40 formal user comments were generated Element number requirement
Whether all comments are exactly 7 characters Generate in 0∼10 words、Generate at accurate word number
Whether comments are non-repetitive Generate repeat/non-repeat content
Whether comments do not contain forbidden words: ["this", "good", "that"] Generate with certain keywords
💡 Let's activate multi-round mode!
ROUND2 - Input (if ROUND1 model output fails to meet requirement: "Whether all comments are exactly 7 characters")
Your response has the following issues: Whether all comments are exactly 7 characters: ❌ Content character count does not match range[7, 7] [mom prouds of you] character count: 4 Please provide your corrected response based on this information. Note: Only output the answer, do not output additional information.
ROUND3 - Input ...
...
---
🚀 Quick Start
Step 1: Environment Setup
1.1 Install Dependencies
Run the automated installation script:
bash install_deps.sh
This script will:
- Detect your Python version (3.9 or 3.10+)
- Install all required dependencies
- Resolve version conflicts automatically
- Install language-specific NLP libraries (Chinese, Japanese, Korean, Arabic, German, French, etc.)
> Requirements: Python 3.9+ (Python 3.10+ recommended)
1.2 Configure API Keys
Create a .env file in the project root with your API configurations:
# Qwen API Configuration (Extract Model) QWEN_API_KEY=your_api_key_here QWEN_BASE_URL=your_api_base_url_here QWEN_MODEL=your_model_name_here # Qwen Coder API Configuration (Score Model) QWEN_CODER_API_KEY=your_api_key_here QWEN_CODER_BASE_URL=your_api_base_url_here QWEN_CODER_MODEL=your_model_name_here # Tested Model API Configuration (Model Under Evaluation) TESTED_MODEL_API_KEY=your_api_key_here TESTED_MODEL_BASE_URL=your_api_base_url_here TESTED_MODEL_NAME=your_model_name_here
> 💡 Tip: All three models support OpenAI-compatible API format. You can use the same model for all three roles if needed.
---
Step 2: Run Evaluation
2.1 Asia Languages Evaluation (Chinese, Japanese, Korean)
Run evaluation for all Asia languages:
python default_run_asia.py
Or filter specific languages:
# Evaluate only Chinese data python default_run_asia.py --chinese # Evaluate only Japanese data python default_run_asia.py --japanese # Evaluate only Korean data python default_run_asia.py --korean # Combine multiple languages python default_run_asia.py --chinese --japanese
2.2 English & Multi-language Evaluation
Run evaluation for all supported languages:
python default_run_eng.py
Or filter specific languages:
# Evaluate only English data python default_run_eng.py --english # Evaluate only German data python default_run_eng.py --german # Evaluate other languages python default_run_eng.py --french # French python default_run_eng.py --spanish # Spanish python default_run_eng.py --portuguese # Portuguese python default_run_eng.py --russian # Russian python default_run_eng.py --arabic # Arabic python default_run_eng.py --indonesian # Indonesian # Combine multiple languages python default_run_eng.py --english --german --french
---
⚙️ Model Requirements
Before running any evaluation, you need to configure three model APIs:
1. Tested Model (TESTED_MODEL_* in .env)
- The model you want to evaluate
- Must support OpenAI-compatible Chat Completions API
2. Extract Model (QWEN_* in .env)
- *Recommended: Qwen2.5-Coder-32B-Instruct*
- Used to extract structured outputs from model responses
- Requires strong code generation and structure understanding
3. Score Model (QWEN_CODER_* in .env)
- *Recommended: Qwen2.5-32B-Instruct*
- Used to evaluate and score the extracted results
- Requires strong reasoning and judgment capabilities
---
💡 Hardware & API Options
- If you have a GPU:
Deploy open-source Qwen2.5 series models locally using vLLM, TGI, or similar frameworks.
- If you don't have a GPU:
Use commercial APIs instead:
- ✅ *Highly recommended:* Claude 3.7 Sonnet or GPT-4
- Any OpenAI-compatible API endpoint will work
---
📂 Evaluation Results
Results will be automatically saved to:
- Asia languages:
evaluation_results_asia/ - English & others:
evaluation_results_english/
Each directory contains:
round_1.json,round_2.json: Detailed evaluation results per roundround_1_stats.json,round_2_stats.json: Statistical summaries- Structured logs and scoring information for analysis
🙏 Contributors behind the scenes
Temporarily removed for arr submitting
Notability
notability 3.0/10Low star new repo by Meituan