Repo · Microsoft · published Oct 10, 2025

microsoft/OdysseyBench

Jupyter Notebook

Description: Repo for the OdysseyBench Benchmark for Evaluating Agent Memory on Long-horizon Productivity Workflows

Language: Jupyter Notebook

License: MIT

Stars: 10

Forks: 0

Open issues: 22

Created: 2025-10-10T11:22:39Z

Pushed: 2026-06-11T00:11:52Z

Default branch: main

Fork: no

Archived: no

README:

OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows

OdysseyBench is a comprehensive benchmark and evaluation suite for task-oriented agent systems, supporting both the OdysseyBench+ and OdysseyBench-Neo tracks. This project provides tools for task generation, execution, validation, and in-depth evaluation of agent performance, with a focus on memory and retrieval-augmented generation (RAG) capabilities.

💼 Preparation

git clone https://github.com/microsoft/OdysseyBench.git

git clone https://github.com/zlwang-cs/OfficeBench.git /tmp/OfficeBench

find /tmp/OfficeBench/tasks/ -type d -name testbed -exec bash -c 'dest="OdysseyBench/tasks/${1#*/tasks/}"; mkdir -p "$dest"; cp -r "$1" "$dest/../"' _ {} \;

rm -rf /tmp/OfficeBench
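
For readers who would rather avoid the `find` one-liner, the same copy step can be sketched in Python. This is an illustrative equivalent, not a script shipped with the repository: the function name `copy_testbeds` is hypothetical, and it assumes each `testbed` directory should land at the same relative path under the OdysseyBench `tasks` directory.

```python
import shutil
from pathlib import Path


def copy_testbeds(officebench_tasks: Path, odyssey_tasks: Path) -> list[Path]:
    """Copy every `testbed` directory, keeping its path relative to the tasks root."""
    copied = []
    # Collect the matches first so the walk is not affected by the copies.
    sources = sorted(p for p in officebench_tasks.rglob("testbed") if p.is_dir())
    for src in sources:
        dest = odyssey_tasks / src.relative_to(officebench_tasks)
        dest.parent.mkdir(parents=True, exist_ok=True)
        # dirs_exist_ok lets the copy be re-run without failing on existing folders.
        shutil.copytree(src, dest, dirs_exist_ok=True)
        copied.append(dest)
    return copied
```

With the two clones above in place, a call such as `copy_testbeds(Path("/tmp/OfficeBench/tasks"), Path("OdysseyBench/tasks"))` would perform the copy; only `testbed` folders are transferred, and other OfficeBench task files are left behind.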

🛠️ Setup

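conda create -n odysseybench python=3.10
conda activate odysseybench
cd OdysseyBench  # requirements.txt is in the repository root
pip install -r requirements.txt
export OPENAI_API_KEY="<your-openai-api-key>"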

---

📁 Tasks Directory Structure

  • /tasks/substasks_plus: Tasks for OdysseyBench+
  • /tasks/chat_histories_plus: Dialogues for OdysseyBench+
  • /tasks/substasks_neo: Tasks for OdysseyBench-Neo
  • /tasks/chat_histories_neo: Dialogues for OdysseyBench-Neo
  • /tasks/outputs/: Results of task execution
  • /tasks/testbed/: Files required for task execution

---

📊 Evaluation on OdysseyBench

Configuration

Edit config/base_config.yaml:

memory:
  mode: "use_rag"             # Options: raw_chat, use_rag, clean
  rag_mode: "summarysession"  # summarysession, dialogsession, dialogutterance, summarychunk
  top_k: 5                    # Used in 'use_rag' mode

  • Long-Context Evaluation:

Set mode: raw_chat to include all dialogues in the prompt (ignores rag_mode and top_k).

  • RAG Evaluation:

Set mode: use_rag (top_k applies in this mode), then choose what is retrieved:

    • For raw context: set rag_mode to dialoguesession or dialogueutterance.
    • For summary: set rag_mode to summarysession or summarychunk.
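
The interaction between `mode`, `rag_mode`, and `top_k` can be summarised with a small helper. This is a sketch of the rules as described above, not code from the repository: the function name `resolve_memory_settings` is hypothetical, and classifying `rag_mode` by its `summary` or `dialog` prefix is an assumption made for illustration.

```python
def resolve_memory_settings(memory: dict) -> dict:
    """Summarise which settings take effect for a `memory` config block."""
    mode = memory.get("mode")
    if mode == "raw_chat":
        # Long-context evaluation: every dialogue goes into the prompt,
        # so rag_mode and top_k are ignored.
        return {"mode": mode, "context": "all dialogues"}
    if mode == "use_rag":
        rag_mode = str(memory.get("rag_mode", ""))
        # Assumption: the prefix of rag_mode tells raw context from summaries.
        if rag_mode.startswith("summary"):
            context = "retrieved summaries"
        elif rag_mode.startswith("dialog"):
            context = "retrieved raw dialogue"
        else:
            raise ValueError(f"unrecognised rag_mode: {rag_mode!r}")
        return {
            "mode": mode,
            "context": context,
            "rag_mode": rag_mode,
            "top_k": memory.get("top_k"),
        }
    if mode == "clean":
        # The README lists this mode but does not describe it further.
        return {"mode": mode}
    raise ValueError(f"unrecognised mode: {mode!r}")
```

Passing the `memory` block shown above (`mode: "use_rag"`, `rag_mode: "summarysession"`, `top_k: 5`) reports summary retrieval with `top_k` of 5, while switching `mode` to `raw_chat` drops both RAG keys from the result.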

---

Run Evaluations

OdysseyBench+

python run_all.py --tag OdysseyBench_plus

OdysseyBench-Neo

python run_all.py --neo --tag OdysseyBench_neo

---

🚀 Run HomerAgents+

python run_homeragents_plus.py --loops 5

🚀 Run HomerAgents-Neo

🪄 Generate Synthesized Tasks

python run_homeragents_neo.py

🧱 Quality Verification

  • Cross Validation

Keep only the tasks that execute successfully in both settings, task-description and task-intent + task-instruction (the intersection of the two runs):

# Set 'mode' to 'clean' in configs/base_config.yaml, then run:
python run_all.py --neo_clean --tag test-neo-ground-truth-memory

# Set 'Memory' to 'False' in configs/base_config.yaml, then run:
python run_all.py --neo_clean --tag test-neo-task_description

  • Evaluate Execution Performance

sh evaluate_all.sh test-neo-ground-truth-memory o3 True
sh evaluate_all.sh test-neo-task_description o3 True
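
The selection rule behind cross validation is a set intersection over the tasks that passed in each run. The sketch below illustrates only that rule: the function name `cross_validate`, the task ids, and the `{task_id: passed}` input shape are all hypothetical, and the on-disk result format written by `evaluate_all.sh` is not assumed here.

```python
def cross_validate(run_a: dict[str, bool], run_b: dict[str, bool]) -> list[str]:
    """Return the ids of tasks that succeeded in both runs, sorted for stable output."""
    passed_a = {task for task, ok in run_a.items() if ok}
    passed_b = {task for task, ok in run_b.items() if ok}
    # A task missing from either run cannot be in the intersection.
    return sorted(passed_a & passed_b)


# Hypothetical pass/fail results for two tagged runs.
ground_truth_memory = {"task_1": True, "task_2": True, "task_3": False}
task_description = {"task_1": True, "task_2": False, "task_4": True}

kept = cross_validate(ground_truth_memory, task_description)
print(kept)  # ['task_1']
```

Only `task_1` passes in both runs, so it is the only task kept; a task that fails in, or is absent from, either run is dropped.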

🧹 Data Cleaning & Formatting

  • Cross-validation selection:
python utils_clean/cross_validation.py
  • Uniform task format:
python utils_clean/clean_tasks.py
  • Uniform dialogue format:
python utils_clean/clean_dialogue.py

---

🏆 Generation Task Evaluation

python llm-as-a-judge.py

---

🤝 Contributing

Contributions are welcome! Please open issues or pull requests for improvements or questions.

---

📬 Reference

If you found this code useful, please cite the following paper:

@article{wang2025odysseybench,
title={OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows},
author={Wang, Weixuan and Han, Dongge and Diaz, Daniel Madrigal and Xu, Jin and R{\"u}hle, Victor and Rajmohan, Saravan},
journal={arXiv preprint arXiv:2508.09124},
year={2025}
}

Acknowledgements

This project builds on and incorporates material from
[OfficeBench](https://github.com/zlwang-cs/OfficeBench). See NOTICE.txt
for attribution details.
