RepoMeituan (LongCat)Meituan (LongCat)published May 9, 2025seen 5d

meituan-longcat/Meeseeks

Python

Open original ↗

Captured source

source ↗
published May 9, 2025seen 5dcaptured 11hhttp 200method plain

meituan-longcat/Meeseeks

Description: A iterative feedback driven benchmark on LLM's instruction following ability

Language: Python

Stars: 57

Forks: 6

Open issues: 1

Created: 2025-05-09T08:44:50Z

Pushed: 2026-05-25T09:06:26Z

Default branch: main

Fork: no

Archived: no

README:

---

🚀 Latest News

We officially released the multilingual version of Meeseeks!

📋 Previous Versions

Temporarily removed for arr submitting

📖 Introduction

Meeseeks is an instruction-following benchmark designed to evaluate how well models can adhere to user instructions in a multi-turn scenario. A key feature of Meeseeks is its self-correction loop, where models receive structured feedback and must refine their responses accordingly.

This benchmark provides a realistic evaluation of a model’s adaptability, instruction adherence, and iterative improvement.

---

📊 Leaderboard

![leaderboard](leaderboard.svg)

---

🍄‍🟫 A Quick Example

ROUND1-Input Evaluation Content Capability tags

Generate 32 colloquial user comments and 40 formal user comments from a consumer perspective in short video comment sections. Each comment should be exactly 7 characters long and must not contain the following words:["this", "good", "that"] Whether 32 colloquial user comments were generated Element number requirement

Whether 40 formal user comments were generated Element number requirement

Whether all comments are exactly 7 characters Generate in 0∼10 words、Generate at accurate word number

Whether comments are non-repetitive Generate repeat/non-repeat content

Whether comments do not contain forbidden words: ["this", "good", "that"] Generate with certain keywords

💡 Let's activate multi-round mode!

ROUND2 - Input (if ROUND1 model output fails to meet requirement: "Whether all comments are exactly 7 characters")

Your response has the following issues: Whether all comments are exactly 7 characters: ❌ Content character count does not match range[7, 7] [mom prouds of you] character count: 4 Please provide your corrected response based on this information. Note: Only output the answer, do not output additional information.

ROUND3 - Input ...

...

---

🚀 Quick Start

Step 1: Environment Setup

1.1 Install Dependencies

Run the automated installation script:

bash install_deps.sh

This script will:

  • Detect your Python version (3.9 or 3.10+)
  • Install all required dependencies
  • Resolve version conflicts automatically
  • Install language-specific NLP libraries (Chinese, Japanese, Korean, Arabic, German, French, etc.)

> Requirements: Python 3.9+ (Python 3.10+ recommended)

1.2 Configure API Keys

Create a .env file in the project root with your API configurations:

# Qwen API Configuration (Extract Model)
QWEN_API_KEY=your_api_key_here
QWEN_BASE_URL=your_api_base_url_here
QWEN_MODEL=your_model_name_here

# Qwen Coder API Configuration (Score Model)
QWEN_CODER_API_KEY=your_api_key_here
QWEN_CODER_BASE_URL=your_api_base_url_here
QWEN_CODER_MODEL=your_model_name_here

# Tested Model API Configuration (Model Under Evaluation)
TESTED_MODEL_API_KEY=your_api_key_here
TESTED_MODEL_BASE_URL=your_api_base_url_here
TESTED_MODEL_NAME=your_model_name_here

> 💡 Tip: All three models support OpenAI-compatible API format. You can use the same model for all three roles if needed.

---

Step 2: Run Evaluation

2.1 Asia Languages Evaluation (Chinese, Japanese, Korean)

Run evaluation for all Asia languages:

python default_run_asia.py

Or filter specific languages:

# Evaluate only Chinese data
python default_run_asia.py --chinese

# Evaluate only Japanese data
python default_run_asia.py --japanese

# Evaluate only Korean data
python default_run_asia.py --korean

# Combine multiple languages
python default_run_asia.py --chinese --japanese

2.2 English & Multi-language Evaluation

Run evaluation for all supported languages:

python default_run_eng.py

Or filter specific languages:

# Evaluate only English data
python default_run_eng.py --english

# Evaluate only German data
python default_run_eng.py --german

# Evaluate other languages
python default_run_eng.py --french # French
python default_run_eng.py --spanish # Spanish
python default_run_eng.py --portuguese # Portuguese
python default_run_eng.py --russian # Russian
python default_run_eng.py --arabic # Arabic
python default_run_eng.py --indonesian # Indonesian

# Combine multiple languages
python default_run_eng.py --english --german --french

---

⚙️ Model Requirements

Before running any evaluation, you need to configure three model APIs:

1. Tested Model (TESTED_MODEL_* in .env)

  • The model you want to evaluate
  • Must support OpenAI-compatible Chat Completions API

2. Extract Model (QWEN_* in .env)

  • *Recommended: Qwen2.5-Coder-32B-Instruct*
  • Used to extract structured outputs from model responses
  • Requires strong code generation and structure understanding

3. Score Model (QWEN_CODER_* in .env)

  • *Recommended: Qwen2.5-32B-Instruct*
  • Used to evaluate and score the extracted results
  • Requires strong reasoning and judgment capabilities

---

💡 Hardware & API Options

  • If you have a GPU:

Deploy open-source Qwen2.5 series models locally using vLLM, TGI, or similar frameworks.

  • If you don't have a GPU:

Use commercial APIs instead:

  • ✅ *Highly recommended:* Claude 3.7 Sonnet or GPT-4
  • Any OpenAI-compatible API endpoint will work

---

📂 Evaluation Results

Results will be automatically saved to:

  • Asia languages: evaluation_results_asia/
  • English & others: evaluation_results_english/

Each directory contains:

  • round_1.json, round_2.json: Detailed evaluation results per round
  • round_1_stats.json, round_2_stats.json: Statistical summaries
  • Structured logs and scoring information for analysis

🙏 Contributors behind the scenes

Temporarily removed for arr submitting

Notability

notability 3.0/10

Low star new repo by Meituan