Repo: Meituan (LongCat), published May 26, 2026

meituan-longcat/VitaBench-2.0

Language: Python

License: MIT

Stars: 27

Forks: 2

Open issues: 1

Created: 2026-05-26T08:19:36Z

Pushed: 2026-06-04T12:42:16Z

Default branch: main

Fork: no

Archived: no

README:

📃 Paper • 🌐 Website • 🤗 Dataset

🌍 English version benchmark coming soon.

📖 Introduction

Large language models (LLMs) have evolved into interactive agents that collaborate with users in real-world tasks. Effective collaboration in such settings increasingly depends on understanding the user beyond what is explicitly stated, as user intent is often reflected in fragmented daily interactions and requires both personalized modeling and proactive interaction. We introduce VitaBench 2.0, a benchmark for evaluating personalized and proactive agent behavior in long-term user interactions.

VitaBench 2.0 extends VitaBench from one-shot tasks to long-term, multi-session user interactions, where an agent must *infer*, *utilize*, and *update* user preferences across fragmented conversations and behaviors that span days, weeks, or months. While VitaBench 1.0 measures whether an agent can complete a single complex life-serving request, VitaBench 2.0 asks a further question: can an agent understand the user from daily interactions, anticipate their evolving needs, and act on their behalf over time?

Each evaluation in VitaBench 2.0 simulates a continuing relationship between an agent and a user across multiple sessions in daily scenarios. Across these sessions, user preferences drift, prior commitments must be honored, and earlier context must be retrieved or reconstructed to act correctly in the present. To support systematic analysis, we provide an extensible memory interface that enables controlled comparison across three representative memory architectures:

  • Full Context: the entire interaction history is appended to the prompt, which gives an upper bound on what the model can possibly leverage.
  • Agentic Memory: the agent autonomously decides what to write to and read from a structured memory store.
  • RAG Memory: past interactions are chunked, embedded, and retrieved on demand.
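To make the idea of a pluggable memory interface concrete, here is a minimal sketch. The class and method names (`MemoryBackend`, `update`, `context`) are hypothetical and are not taken from the repository, and the retrieval variant ranks sessions by word overlap as a stand-in for real embeddings.

```python
from abc import ABC, abstractmethod


class MemoryBackend(ABC):
    """Hypothetical interface: not the repository's actual class."""

    @abstractmethod
    def update(self, session_text: str) -> None:
        """Record one finished session."""

    @abstractmethod
    def context(self, query: str) -> str:
        """Return the memory text to place in the agent prompt."""


class FullContextMemory(MemoryBackend):
    """Keeps every session and returns all of them, oldest first."""

    def __init__(self) -> None:
        self.sessions: list[str] = []

    def update(self, session_text: str) -> None:
        self.sessions.append(session_text)

    def context(self, query: str) -> str:
        return "\n".join(self.sessions)


class KeywordRetrievalMemory(MemoryBackend):
    """Toy stand-in for RAG: ranks sessions by word overlap, not embeddings."""

    def __init__(self, top_k: int = 1) -> None:
        self.sessions: list[str] = []
        self.top_k = top_k

    def update(self, session_text: str) -> None:
        self.sessions.append(session_text)

    def context(self, query: str) -> str:
        query_words = set(query.lower().split())
        ranked = sorted(
            self.sessions,
            key=lambda s: len(query_words & set(s.lower().split())),
            reverse=True,
        )
        return "\n".join(ranked[: self.top_k])


def build_histories():
    """Feed the same two made-up sessions into both backends."""
    sessions = [
        "user orders oat milk latte every monday",
        "user books a window seat on trains",
    ]
    full, retrieval = FullContextMemory(), KeywordRetrievalMemory(top_k=1)
    for text in sessions:
        full.update(text)
        retrieval.update(text)
    return full, retrieval
```

Because both classes share one interface, the same evaluation loop can swap backends without other changes, which is what makes a controlled comparison possible.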

Our results show that even state-of-the-art models reach only ~50% Avg@4 under Full Context and degrade further under realistic memory settings, indicating that long-horizon personalization and proactivity remain open challenges for current LLM agents.

🛠️ Quick Start

1. Install

    git clone https://github.com/meituan-longcat/VitaBench-2.0.git
    cd VitaBench-2.0
    pip install -e .

This installs the vita CLI.

2. Download the dataset

VitaBench 2.0 tasks are hosted on Hugging Face: meituan-longcat/VitaBench-2.0.

    pip install -U "huggingface_hub[cli]"
    huggingface-cli download meituan-longcat/VitaBench-2.0 \
      --repo-type dataset \
      --local-dir data/vita/domains/personalization

After downloading, you should have data/vita/domains/personalization/tasks.json (56 users, 771 subtasks).
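A quick way to confirm the download landed in the right place is to parse the file. The sketch below only checks that `tasks.json` is valid JSON and reports the size of its top-level container; it assumes nothing about the file's internal schema, and the helper name `summarize_tasks` is illustrative.

```python
import json
from pathlib import Path

# Path taken from the download step; the file's internal schema is not assumed.
TASKS_PATH = Path("data/vita/domains/personalization/tasks.json")


def summarize_tasks(path: Path) -> str:
    """Return a one-line summary of the top-level JSON container."""
    with path.open(encoding="utf-8") as handle:
        data = json.load(handle)
    if isinstance(data, (list, dict)):
        return f"{type(data).__name__} with {len(data)} top-level entries"
    return f"unexpected top-level type: {type(data).__name__}"


if __name__ == "__main__":
    if TASKS_PATH.exists():
        print(summarize_tasks(TASKS_PATH))
    else:
        print(f"missing: {TASKS_PATH}")
```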

3. Configure the LLM

    cp src/vita/models.yaml.example src/vita/models.yaml
    export OPENAI_API_KEY=sk-...

src/vita/models.yaml supports any OpenAI-compatible endpoint — change default.base_url to point at Azure, vLLM, Together, llama.cpp, etc. The YAML supports ${VAR} placeholders expanded from your shell.
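The `${VAR}` expansion can be pictured with a small standard-library sketch. This is not the repository's loader: the function name `expand_placeholders`, the `LLM_BASE_URL` variable, and the choice to leave unknown placeholders untouched are all assumptions made for illustration.

```python
import os
import re

_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_placeholders(text: str, env=None) -> str:
    """Replace ${NAME} with env[NAME]; unknown names are left untouched."""
    env = os.environ if env is None else env
    return _PLACEHOLDER.sub(lambda m: env.get(m.group(1), m.group(0)), text)


# Illustrative YAML fragment; only default.base_url and default.api_key are
# names that appear in the README, the values are made up.
example = "default:\n  base_url: ${LLM_BASE_URL}\n  api_key: ${OPENAI_API_KEY}\n"
expanded = expand_placeholders(
    example,
    env={"LLM_BASE_URL": "http://localhost:8000/v1", "OPENAI_API_KEY": "sk-test"},
)
```

Keeping secrets in the shell environment rather than in the YAML file means `models.yaml` can be shared or committed without leaking keys.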

For RAG / embeddings:

| Env var | Default | Purpose |
|---------|---------|---------|
| VITA_EMBEDDING_URL | models.yaml default.base_url | Embedding endpoint |
| VITA_EMBEDDING_KEY | models.yaml default.api_key | Embedding API key |
| VITA_EMBEDDING_MODEL | text-embedding-3-large | Embedding model name |
| VITA_EMBEDDING_MAX_CONCURRENCY | 64 | Per-event-loop semaphore size |
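The fallbacks in the table, and the role of the concurrency limit, can be sketched as follows. The function names are hypothetical, the embed call is a placeholder that makes no network request, and the real client may resolve these settings differently.

```python
import asyncio
import os


def embedding_settings(default_base_url: str, default_api_key: str, env=None) -> dict:
    """Resolve the VITA_EMBEDDING_* variables with the fallbacks in the table."""
    env = os.environ if env is None else env
    return {
        "url": env.get("VITA_EMBEDDING_URL", default_base_url),
        "key": env.get("VITA_EMBEDDING_KEY", default_api_key),
        "model": env.get("VITA_EMBEDDING_MODEL", "text-embedding-3-large"),
        "max_concurrency": int(env.get("VITA_EMBEDDING_MAX_CONCURRENCY", "64")),
    }


async def embed_all(texts, max_concurrency: int):
    """Run a placeholder embed call per text, at most max_concurrency at once."""
    semaphore = asyncio.Semaphore(max_concurrency)  # created inside the running loop
    in_flight = 0
    peak = 0

    async def embed_one(text: str):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0)  # stands in for the real network request
            in_flight -= 1
            return [float(len(text))]  # fake one-dimensional "embedding"

    vectors = await asyncio.gather(*(embed_one(t) for t in texts))
    return vectors, peak
```

Creating the semaphore inside the coroutine ties it to the event loop that is actually running, which is one plausible reading of "per-event-loop" in the table.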

4. Run an evaluation

    vita run \
      --domain personalization \
      --memory-type rewrite \
      --agent-llm gpt-4.1 \
      --user-llm gpt-4.1 \
      --evaluator-llm gpt-4.1 \
      --num-tasks 1 --max-steps 50

Passing --save-to with a .json filename writes the results under data/simulations/.

5. Run all memory backends

    bash scripts/run_memory_benchmark.sh
    # or a subset:
    bash scripts/run_memory_benchmark.sh full_context rewrite rag
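For readers who prefer to drive the sweep from Python, the following sketch assembles one `vita run` command per backend using only the flags shown in step 4. It does not reproduce what `scripts/run_memory_benchmark.sh` does internally, and `build_command` is an illustrative helper, not part of the package.

```python
import shlex

# Memory types taken from the README's backend table.
MEMORY_TYPES = ["null", "groundtruth", "full_context", "rewrite", "rag", "rag_cache"]


def build_command(memory_type: str, model: str = "gpt-4.1",
                  num_tasks: int = 1, max_steps: int = 50) -> list[str]:
    """Assemble one `vita run` invocation using only flags shown in the README."""
    if memory_type not in MEMORY_TYPES:
        raise ValueError(f"unknown memory type: {memory_type}")
    return [
        "vita", "run",
        "--domain", "personalization",
        "--memory-type", memory_type,
        "--agent-llm", model,
        "--user-llm", model,
        "--evaluator-llm", model,
        "--num-tasks", str(num_tasks),
        "--max-steps", str(max_steps),
    ]


if __name__ == "__main__":
    for memory_type in ["full_context", "rewrite", "rag"]:
        # Print instead of executing; pass the list to subprocess.run to launch it.
        print(shlex.join(build_command(memory_type)))
```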

Memory backends

| --memory-type | Behaviour |
|-----------------|-----------|
| null | No memory (baseline) |
| groundtruth | Injects the canonical preference memory directly (upper bound) |
| full_context | Dumps every prior interaction as context |
| rewrite | LLM rewrites a single consolidated memory string each update |
| rag | Async vector retrieval (text-embedding-3-large by default) |
| rag_cache | RAG with a precomputed embedding cache (see scripts/precompute_rag_cache.py) |
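The difference between `rag` and `rag_cache` comes down to when embeddings are computed. The sketch below illustrates the general idea of a precomputed cache keyed by a hash of the text; it is an assumption about the approach, not a description of `scripts/precompute_rag_cache.py`, and `fake_embed` is a placeholder rather than a real embedding model.

```python
import hashlib


class EmbeddingCache:
    """Hypothetical cache: maps a hash of the text to its embedding vector."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # the expensive call, e.g. an API request
        self.store: dict[str, list[float]] = {}
        self.misses = 0

    @staticmethod
    def key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def precompute(self, texts) -> None:
        """Embed every text once, ahead of the evaluation run."""
        for text in texts:
            self.get(text)

    def get(self, text: str) -> list[float]:
        k = self.key(text)
        if k not in self.store:
            self.misses += 1
            self.store[k] = self.embed_fn(text)
        return self.store[k]


def fake_embed(text: str) -> list[float]:
    """Placeholder embedding: vowel count and length, not a real model."""
    return [float(sum(c in "aeiou" for c in text)), float(len(text))]


cache = EmbeddingCache(fake_embed)
cache.precompute(["likes spicy food", "allergic to peanuts"])
misses_after_precompute = cache.misses
vector = cache.get("likes spicy food")  # served from the cache, no new embed call
```

With repeated runs over the same interaction histories, a cache like this avoids paying for the same embedding requests each time.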

Per-backend defaults live in src/vita/memory.yaml; constructor kwargs override.
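The override rule can be illustrated with a dictionary merge. The option names and values below (`top_k`, `chunk_size`, `max_chars`) are hypothetical placeholders, not the contents of `src/vita/memory.yaml`.

```python
# Hypothetical defaults, standing in for whatever src/vita/memory.yaml contains.
BACKEND_DEFAULTS = {
    "rag": {"top_k": 5, "chunk_size": 512},
    "rewrite": {"max_chars": 4000},
}


def resolve_options(memory_type: str, **kwargs) -> dict:
    """Start from the per-backend defaults, then let explicit kwargs win."""
    defaults = BACKEND_DEFAULTS.get(memory_type, {})
    return {**defaults, **kwargs}
```

For example, `resolve_options("rag", top_k=10)` keeps the default `chunk_size` and replaces only `top_k`.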

🏆 Leaderboard

Performance of non-thinking and thinking models under three memory settings. The leaderboard is sorted by Avg@4 under Full Context. Best results in each column are in bold.
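The excerpt does not define the three metrics. The sketch below assumes the commonly used definitions: Avg@k is the mean success rate over k independent runs, Pass@k is the share of tasks solved in at least one run, and Pass^k is the share solved in all k runs. The task outcomes are made up for illustration.

```python
def summarize(results: dict[str, list[bool]]) -> dict[str, float]:
    """results maps a task id to the success flags of its k independent runs."""
    n = len(results)
    avg = sum(sum(runs) / len(runs) for runs in results.values()) / n
    pass_any = sum(any(runs) for runs in results.values()) / n
    pass_all = sum(all(runs) for runs in results.values()) / n
    return {"avg@k": avg, "pass@k": pass_any, "pass^k": pass_all}


# Made-up outcomes for three tasks with k = 4 runs each.
demo = {
    "task_a": [True, True, True, True],
    "task_b": [True, False, False, False],
    "task_c": [False, False, False, False],
}
scores = summarize(demo)
```

Under these definitions Pass^k can never exceed Avg@k, which in turn can never exceed Pass@k, so a wide gap between Pass@4 and Pass^4 signals inconsistent behavior across runs.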

Non-thinking Models

| Model | Avg@4 | Pass@4 | Pass^4 | Avg@4 | Pass@4 | Pass^4 | Avg@4 | Pass@4 | Pass^4 |
|-------|-------|--------|--------|-------|--------|--------|-------|--------|--------|
| GPT-4o-mini | 0.067 | 0.180 | 0.006 | 0.084 | 0.229 | 0.008 | 0.094 | 0.227 | 0.011 |
| GPT-3.5-Turbo | 0.140 | 0.314 | 0.019 | 0.231 | 0.467 | 0.056 | 0.205 | 0.409 | 0.059 |
| LongCat-Flash-Chat | 0.298 | 0.510 | 0.123 | 0.302 | 0.537 | 0.105 | 0.290 | 0.471 | 0.136 |
| GLM-4.5 | 0.307 | 0.529 | 0.127 | 0.330 | 0.569 | 0.112 | 0.316 | 0.523 | 0.152 |
| Doubao-Seed-1.6 | 0.326 | 0.512 | 0.171 | 0.340 | 0.576 | 0.129 | 0.351 | 0.543 | 0.174 |
| GLM-4.6 | 0.342 | 0.612 | 0.113 | 0.336 | 0.623 | 0.084 | 0.317 | 0.555 | 0.123 |
| Kimi-K2.6 | 0.378 | 0.632 | 0.147 | 0.397 | 0.674 | 0.145 | 0.383 | 0.621 | 0.163 |
| GLM-5.1 | 0.420 | 0.654 | 0.204 | 0.423 | 0.664 | 0.182 | 0.383 | 0.585 | 0.200 |
| Doubao-Seed-2.0-pro | 0.428 | 0.649 | 0.218 | 0.426 | 0.665 | 0.198 | 0.406 | 0.625 | 0.208 |
| DeepSeek-V4-Pro | 0.456 | 0.652 | 0.267 | 0.427 | 0.658 | 0.207 | 0.424 | 0.618 | 0.247 |

Thinking Models

| Model | Avg@4 | Pass@4 | Pass^4 | Avg@4 | Pass@4 | Pass^4 | Avg@4 | Pass@4 | Pass^4 |
|-------|-------|--------|--------|-------|--------|--------|-------|--------|--------|
| o4-mini | 0.210 | 0.433 | 0.047 | 0.270 | 0.533 | 0.073 | 0.261 | 0.452 | 0.091 |
| Gemini-2.5-Flash | 0.282 | 0.556 | 0.063 | 0.312 | 0.567 | 0.098 | 0.309 | 0.544 | 0.107 |
| Qwen3-Max | 0.284 | 0.499 | 0.105 | 0.324 | 0.599 | 0.091 | 0.315 | 0.519 | 0.134 |
| Kimi-K2.6 | 0.293 | 0.533 | 0.099 | 0.280 | 0.508 | 0.088 | 0.303 | 0.511 | 0.118 |
| Gemini-2.5-Pro | 0.331 | 0.605 | 0.109 | 0.378 | 0.638 | 0.138 | 0.320 | 0.579 | 0.109 |
| MiniMax-M2.7 | 0.345 | 0.584 | 0.145 | 0.351 | 0.609 | 0.124 | 0.314 | 0.518 | 0.143 |
| GLM-4.6 | 0.359 | 0.612 | 0.116 | 0.351 | 0.625 | 0.107 | 0.336 | 0.574 | 0.135 |
| GLM-4.5 | 0.364 | 0.623 | 0.156 | 0.311 | 0.596 | 0.106 | 0.336 | 0.555 | 0.147 |
| Doubao-Seed-1.6 | 0.373 | 0.599 | 0.176 | 0.383 | 0.646 | 0.123… |

Excerpt shown — open the source for the full document.
