meituan-longcat/VitaBench-2.0
Language: Python
License: MIT
Stars: 27
Forks: 2
Open issues: 1
Created: 2026-05-26T08:19:36Z
Pushed: 2026-06-04T12:42:16Z
Default branch: main
Fork: no
Archived: no
README:
🌍 An English version of the benchmark is coming soon.
📖 Introduction
Large language models (LLMs) have evolved into interactive agents that collaborate with users in real-world tasks. Effective collaboration in such settings increasingly depends on understanding the user beyond what is explicitly stated, as user intent is often reflected in fragmented daily interactions and requires both personalized modeling and proactive interaction. We introduce VitaBench 2.0, a benchmark for evaluating personalized and proactive agent behavior in long-term user interactions.
VitaBench 2.0 extends VitaBench from one-shot tasks to long-term, multi-session user interactions, where an agent must *infer*, *utilize*, and *update* user preferences across fragmented conversations and behaviors that span days, weeks, or months. While VitaBench 1.0 measures whether an agent can complete a single complex life-serving request, VitaBench 2.0 asks a further question: can an agent understand the user from daily interactions, anticipate their evolving needs, and act on their behalf over time?
Each evaluation in VitaBench 2.0 simulates a continuing relationship between an agent and a user across multiple sessions in daily scenarios. Across these sessions, user preferences drift, prior commitments must be honored, and earlier context must be retrieved or reconstructed to act correctly in the present. To support systematic analysis, we provide an extensible memory interface that enables controlled comparison across three representative memory architectures:
- Full Context: the entire interaction history is appended to the prompt, which gives an upper bound on what the model can possibly leverage.
- Agentic Memory: the agent autonomously decides what to write to and read from a structured memory store.
- RAG Memory: past interactions are chunked, embedded, and retrieved on demand.
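To make the contrast between these architectures concrete, here is a minimal sketch of what a pluggable memory interface could look like. Everything in it is hypothetical: the `Memory` base class, the `write`/`read` method names, and `ToyRagMemory` are illustrative and are not VitaBench's actual API. Bag-of-words cosine similarity stands in for a learned embedding model, and Agentic Memory is omitted because it needs an LLM in the loop.

```python
from abc import ABC, abstractmethod
from collections import Counter
import math


class Memory(ABC):
    """Hypothetical interface, not the benchmark's real one."""

    @abstractmethod
    def write(self, interaction: str) -> None: ...

    @abstractmethod
    def read(self, query: str) -> str: ...


class FullContextMemory(Memory):
    """Stores every interaction and returns all of them on every read."""

    def __init__(self) -> None:
        self.history: list[str] = []

    def write(self, interaction: str) -> None:
        self.history.append(interaction)

    def read(self, query: str) -> str:
        return "\n".join(self.history)


class ToyRagMemory(Memory):
    """Returns the top-k stored interactions most similar to the query.

    Word counts replace real embeddings so the example has no dependencies.
    """

    def __init__(self, top_k: int = 1) -> None:
        self.top_k = top_k
        self.chunks: list[str] = []

    @staticmethod
    def _embed(text: str) -> Counter:
        return Counter(text.lower().split())

    @staticmethod
    def _cosine(a: Counter, b: Counter) -> float:
        dot = sum(a[t] * b[t] for t in a)
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    def write(self, interaction: str) -> None:
        self.chunks.append(interaction)

    def read(self, query: str) -> str:
        q = self._embed(query)
        ranked = sorted(
            self.chunks,
            key=lambda c: self._cosine(q, self._embed(c)),
            reverse=True,
        )
        return "\n".join(ranked[: self.top_k])


full, rag = FullContextMemory(), ToyRagMemory(top_k=1)
for memory in (full, rag):
    memory.write("user prefers spicy noodles for lunch")
    memory.write("user booked a hotel in Shanghai")

print(full.read("what noodles for lunch"))  # both interactions
print(rag.read("what noodles for lunch"))   # only the noodle preference
```

Both backends share one interface, so an agent loop can swap them without code changes, which is the property a controlled comparison needs.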
Our results show that even state-of-the-art models reach only ~50% Avg@4 under Full Context and degrade further under realistic memory settings, indicating that long-horizon personalization and proactivity remain open challenges for current LLM agents.
🛠️ Quick Start
1. Install
    git clone https://github.com/meituan-longcat/VitaBench-2.0.git
    cd VitaBench-2.0
    pip install -e .
This installs the vita CLI.
2. Download the dataset
VitaBench 2.0 tasks are hosted on Hugging Face: meituan-longcat/VitaBench-2.0.
    pip install -U "huggingface_hub[cli]"
    huggingface-cli download meituan-longcat/VitaBench-2.0 \
      --repo-type dataset \
      --local-dir data/vita/domains/personalization
After downloading, you should have data/vita/domains/personalization/tasks.json (56 users, 771 subtasks).
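A quick sanity check after the download is to load the file and look at its top-level shape. The sketch below deliberately assumes nothing about the schema of `tasks.json`, which this README does not document; `summarize_json` is a hypothetical helper, not part of the `vita` package.

```python
import json
from pathlib import Path


def summarize_json(path: str) -> dict:
    """Report the top-level type and size of a JSON file without assuming a schema."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    summary = {"type": type(data).__name__}
    if isinstance(data, (list, dict)):
        summary["length"] = len(data)
    if isinstance(data, dict):
        # Show at most ten keys so large files stay readable.
        summary["keys"] = sorted(data)[:10]
    return summary


if __name__ == "__main__":
    tasks_path = Path("data/vita/domains/personalization/tasks.json")
    if tasks_path.exists():
        print(summarize_json(str(tasks_path)))
    else:
        print(f"{tasks_path} not found; run the download step first")
```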
3. Configure the LLM
    cp src/vita/models.yaml.example src/vita/models.yaml
    export OPENAI_API_KEY=sk-...
src/vita/models.yaml supports any OpenAI-compatible endpoint — change default.base_url to point at Azure, vLLM, Together, llama.cpp, etc. The YAML supports ${VAR} placeholders expanded from your shell.
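For readers unfamiliar with this style of placeholder, the following sketch shows one way `${VAR}` expansion can work. It is an illustration only and not the repository's loader: `expand_placeholders` and the `LLM_BASE_URL` variable are hypothetical, and leaving unset variables untouched is an assumption about the desired behavior.

```python
import os
import re

# Matches ${NAME} where NAME is a conventional shell variable name.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_placeholders(text: str, env=None) -> str:
    """Replace each ${VAR} with its value from env; leave unknown names as-is."""
    env = os.environ if env is None else env
    return _PLACEHOLDER.sub(lambda m: env.get(m.group(1), m.group(0)), text)


# Illustrative YAML text; only default.base_url and default.api_key
# are field names mentioned in this README.
config_text = """\
default:
  base_url: ${LLM_BASE_URL}
  api_key: ${OPENAI_API_KEY}
"""

expanded = expand_placeholders(
    config_text,
    env={"LLM_BASE_URL": "http://localhost:8000/v1", "OPENAI_API_KEY": "sk-test"},
)
print(expanded)
```

Keeping secrets in the environment and out of the YAML file means `models.yaml` can be shared without leaking keys.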
For RAG / embeddings:
| Env var | Default | Purpose |
|---------|---------|---------|
| VITA_EMBEDDING_URL | models.yaml default.base_url | Embedding endpoint |
| VITA_EMBEDDING_KEY | models.yaml default.api_key | Embedding API key |
| VITA_EMBEDDING_MODEL | text-embedding-3-large | Embedding model name |
| VITA_EMBEDDING_MAX_CONCURRENCY | 64 | Per-event-loop semaphore size |
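The fallback logic in the table can be sketched as follows. The environment variable names and defaults come from the table; the `EmbeddingConfig` dataclass and `resolve_embedding_config` function are hypothetical and only illustrate the precedence (environment first, then `models.yaml` defaults).

```python
import os
from dataclasses import dataclass


@dataclass
class EmbeddingConfig:
    url: str
    key: str
    model: str
    max_concurrency: int


def resolve_embedding_config(default_base_url: str, default_api_key: str, env=None) -> EmbeddingConfig:
    """Prefer VITA_EMBEDDING_* variables, falling back to the models.yaml defaults."""
    env = os.environ if env is None else env
    return EmbeddingConfig(
        url=env.get("VITA_EMBEDDING_URL", default_base_url),
        key=env.get("VITA_EMBEDDING_KEY", default_api_key),
        model=env.get("VITA_EMBEDDING_MODEL", "text-embedding-3-large"),
        max_concurrency=int(env.get("VITA_EMBEDDING_MAX_CONCURRENCY", "64")),
    )


# Only the model is overridden; everything else falls back.
cfg = resolve_embedding_config(
    default_base_url="https://api.openai.com/v1",
    default_api_key="sk-test",
    env={"VITA_EMBEDDING_MODEL": "my-local-embedder"},
)
print(cfg)
```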
4. Run an evaluation
    vita run \
      --domain personalization \
      --memory-type rewrite \
      --agent-llm gpt-4.1 \
      --user-llm gpt-4.1 \
      --evaluator-llm gpt-4.1 \
      --num-tasks 1 --max-steps 50
--save-to .json writes results under data/simulations/.
5. Run all memory backends
    bash scripts/run_memory_benchmark.sh
    # or a subset:
    bash scripts/run_memory_benchmark.sh full_context rewrite rag
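If you prefer to drive a sweep from Python, a rough equivalent is to build one `vita run` command per backend. This sketch uses only the flags shown in step 4; it assumes, without confirmation from the source, that a sweep amounts to one invocation per backend, and `build_vita_run` is a hypothetical helper. The commands are printed, not executed.

```python
import shlex


def build_vita_run(memory_type: str, model: str = "gpt-4.1",
                   num_tasks: int = 1, max_steps: int = 50) -> list[str]:
    """Assemble the argv for a single `vita run`, reusing the flags from step 4."""
    return [
        "vita", "run",
        "--domain", "personalization",
        "--memory-type", memory_type,
        "--agent-llm", model,
        "--user-llm", model,
        "--evaluator-llm", model,
        "--num-tasks", str(num_tasks),
        "--max-steps", str(max_steps),
    ]


for backend in ["full_context", "rewrite", "rag"]:
    # Print the shell form; to execute, pass the list to
    # subprocess.run(build_vita_run(backend), check=True) instead.
    print(shlex.join(build_vita_run(backend)))
```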
Memory backends
| --memory-type | Behaviour |
|-----------------|-----------|
| null | No memory (baseline) |
| groundtruth | Injects the canonical preference memory directly (upper bound) |
| full_context | Dumps every prior interaction as context |
| rewrite | LLM rewrites a single consolidated memory string each update |
| rag | Async vector retrieval (text-embedding-3-large by default) |
| rag_cache | RAG with a precomputed embedding cache (see scripts/precompute_rag_cache.py) |
Per-backend defaults live in src/vita/memory.yaml; constructor kwargs override.
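The precedence described here (file defaults first, explicit keyword arguments on top) can be sketched in a few lines. The `defaults` dictionary stands in for a parsed `memory.yaml`; its keys, such as `top_k`, and the `resolve_backend_config` helper are hypothetical and do not reflect the file's real contents.

```python
def resolve_backend_config(defaults: dict, memory_type: str, **overrides) -> dict:
    """Start from the backend's defaults, then let explicit kwargs win."""
    config = dict(defaults.get(memory_type, {}))  # copy so defaults stay untouched
    config.update(overrides)
    return config


# Hypothetical stand-in for src/vita/memory.yaml after parsing.
defaults = {"rag": {"top_k": 5, "embedding_model": "text-embedding-3-large"}}

print(resolve_backend_config(defaults, "rag", top_k=10))
# {'top_k': 10, 'embedding_model': 'text-embedding-3-large'}
```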
🏆 Leaderboard
Performance of non-thinking and thinking models under three memory settings. The leaderboard is sorted by Avg@4 under Full Context. Best results in each column are in bold.
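This excerpt does not define the three metrics, so the sketch below assumes the definitions commonly used for repeated-trial agent benchmarks: Avg@4 is the mean success rate over four trials, Pass@4 is the fraction of tasks solved in at least one of four trials, and Pass^4 is the fraction solved in all four. Under these definitions Pass^4 ≤ Avg@4 ≤ Pass@4, which matches every row shown in the tables. The function names and sample data are illustrative.

```python
def avg_at_k(trials: list[list[bool]]) -> float:
    """Mean per-task success rate across the k trials of each task."""
    return sum(sum(t) / len(t) for t in trials) / len(trials)


def pass_at_k(trials: list[list[bool]]) -> float:
    """Fraction of tasks solved in at least one trial."""
    return sum(any(t) for t in trials) / len(trials)


def pass_hat_k(trials: list[list[bool]]) -> float:
    """Fraction of tasks solved in every trial (written Pass^k)."""
    return sum(all(t) for t in trials) / len(trials)


# Four made-up tasks, four trials each (True = success).
results = [
    [True, True, True, True],
    [True, False, False, False],
    [False, False, False, False],
    [True, True, False, False],
]

print(avg_at_k(results))    # 0.4375
print(pass_at_k(results))   # 0.75
print(pass_hat_k(results))  # 0.25
```

Pass^4 rewards consistency, which is why it sits far below Pass@4 for every model in the leaderboard.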
Non-thinking Models
| Model | Avg@4 | Pass@4 | Pass^4 | Avg@4 | Pass@4 | Pass^4 | Avg@4 | Pass@4 | Pass^4 |
|-------|-------|--------|--------|-------|--------|--------|-------|--------|--------|
| GPT-4o-mini | 0.067 | 0.180 | 0.006 | 0.084 | 0.229 | 0.008 | 0.094 | 0.227 | 0.011 |
| GPT-3.5-Turbo | 0.140 | 0.314 | 0.019 | 0.231 | 0.467 | 0.056 | 0.205 | 0.409 | 0.059 |
| LongCat-Flash-Chat | 0.298 | 0.510 | 0.123 | 0.302 | 0.537 | 0.105 | 0.290 | 0.471 | 0.136 |
| GLM-4.5 | 0.307 | 0.529 | 0.127 | 0.330 | 0.569 | 0.112 | 0.316 | 0.523 | 0.152 |
| Doubao-Seed-1.6 | 0.326 | 0.512 | 0.171 | 0.340 | 0.576 | 0.129 | 0.351 | 0.543 | 0.174 |
| GLM-4.6 | 0.342 | 0.612 | 0.113 | 0.336 | 0.623 | 0.084 | 0.317 | 0.555 | 0.123 |
| Kimi-K2.6 | 0.378 | 0.632 | 0.147 | 0.397 | 0.674 | 0.145 | 0.383 | 0.621 | 0.163 |
| GLM-5.1 | 0.420 | 0.654 | 0.204 | 0.423 | 0.664 | 0.182 | 0.383 | 0.585 | 0.200 |
| Doubao-Seed-2.0-pro | 0.428 | 0.649 | 0.218 | 0.426 | 0.665 | 0.198 | 0.406 | 0.625 | 0.208 |
| DeepSeek-V4-Pro | 0.456 | 0.652 | 0.267 | 0.427 | 0.658 | 0.207 | 0.424 | 0.618 | 0.247 |
Thinking Models
| Model | Avg@4 | Pass@4 | Pass^4 | Avg@4 | Pass@4 | Pass^4 | Avg@4 | Pass@4 | Pass^4 |
|-------|-------|--------|--------|-------|--------|--------|-------|--------|--------|
| o4-mini | 0.210 | 0.433 | 0.047 | 0.270 | 0.533 | 0.073 | 0.261 | 0.452 | 0.091 |
| Gemini-2.5-Flash | 0.282 | 0.556 | 0.063 | 0.312 | 0.567 | 0.098 | 0.309 | 0.544 | 0.107 |
| Qwen3-Max | 0.284 | 0.499 | 0.105 | 0.324 | 0.599 | 0.091 | 0.315 | 0.519 | 0.134 |
| Kimi-K2.6 | 0.293 | 0.533 | 0.099 | 0.280 | 0.508 | 0.088 | 0.303 | 0.511 | 0.118 |
| Gemini-2.5-Pro | 0.331 | 0.605 | 0.109 | 0.378 | 0.638 | 0.138 | 0.320 | 0.579 | 0.109 |
| MiniMax-M2.7 | 0.345 | 0.584 | 0.145 | 0.351 | 0.609 | 0.124 | 0.314 | 0.518 | 0.143 |
| GLM-4.6 | 0.359 | 0.612 | 0.116 | 0.351 | 0.625 | 0.107 | 0.336 | 0.574 | 0.135 |
| GLM-4.5 | 0.364 | 0.623 | 0.156 | 0.311 | 0.596 | 0.106 | 0.336 | 0.555 | 0.147 |
| Doubao-Seed-1.6 | 0.373 | 0.599 | 0.176 | 0.383 | 0.646 | 0.123… |
Excerpt shown — open the source for the full document.