microsoft/AsgardBench
Python
Captured source
source ↗microsoft/AsgardBench
Description: Visually grounded planning benchmark for multimodal agents
Language: Python
License: MIT
Stars: 5
Forks: 0
Open issues: 9
Created: 2026-01-28T22:22:31Z
Pushed: 2026-06-11T01:26:20Z
Default branch: main
Fork: no
Archived: no
README:
AsgardBench
A benchmark for evaluating Vision-Language Models (VLMs) on embodied household tasks.
📄 Paper: AsgardBench - Evaluating Visually Grounded Interactive Planning Under Minimal Feedback
Overview
AsgardBench evaluates how well VLMs can act as embodied agents completing multi-step household tasks. Given a task description (e.g., "Make coffee") and egocentric visual observations, the model must output actions to accomplish the goal.
Key features:
- 🏠 108 task instances across 12 task types and 3 scene types in AI2-THOR
- 👁️ Egocentric visual observations
- 🎯 Automatic success evaluation via goal checking
- 🔧 Works with any OpenAI-compatible API endpoint
For detailed methodology, ablation studies, and results, please see our paper.
Quick Start
Prerequisites
- uv
- An OpenAI-compatible API endpoint (OpenAI, Azure OpenAI, OpenRouter, vLLM, etc.)
- Linux: X11 display or Xvfb (for AI2-THOR rendering)
Installation
# Clone the repository git clone https://github.com/microsoft/AsgardBench.git cd AsgardBench # Install dependencies uv sync
Configuration
Create a .env file with your API credentials:
cp .env.example .env # Edit .env with your API key and endpoint
Required environment variables:
OPENAI_API_KEY=your-api-key OPENAI_BASE_URL=https://api.openai.com/v1 # Or your endpoint
Run the Sanity Check
Verify your setup with a quick 2-task sanity check:
# On Linux without a display, use xvfb-run xvfb-run -a uv run python -m AsgardBench.Model.model_tester \ --test magt_benchmark_sanity \ --model gpt-4o # On systems with a display (or macOS) uv run python -m AsgardBench.Model.model_tester \ --test magt_benchmark_sanity \ --model gpt-4o
Run the Full Benchmark
xvfb-run -a uv run python -m AsgardBench.Model.model_tester \ --test magt_benchmark \ --model gpt-4o
See [Docker](#docker) for containerized execution. After running, see [Results](#results) to generate reports and view scores.
Benchmark Structure
The benchmark consists of 108 task instances across 12 task types:
| Dataset | Tasks | |---------|-------| | magt_benchmark | 108 | | magt_benchmark_sanity | 2 (quick setup verification) |
For a full list of task types and their variations, see our paper.
Data Format
Each task in Generated/magt_benchmark*/ contains a plan.json with:
{
"name": "task_name",
"task_description": "Make coffee in the mug", // task description given to the model
"scene": "FloorPlan1", // AI2-THOR scene identifier
"step_count": 25, // Expected number of steps to complete the task
"initial_pose": { // Agent's starting position and orientation
"position": {"x": 0.5, "y": 0.9, "z": -1.2},
"rotation": 90,
"horizon": 30,
"standing": true
},
"goal": { // Success conditions for the task
"goal_type": "ObjectStateGoal",
"conditions": [...]
},
"setup_actions": [...], // Actions to initialize the scene
"object_setup": {...}, // Object placements and states
"randomization": {...} // Randomization parameters used
}Configuration
The default configuration runs the baseline evaluation used in our paper. Simply specify the test set and model:
uv run python -m AsgardBench.Model.model_tester --test --model
| Argument | Description | Default | |----------|-------------|---------| | --test | Test set name (magt_benchmark or magt_benchmark_sanity) | Required | | --model | Model identifier | Required | | --temperature | Sampling temperature | 0.0 | | --max_completion_tokens | Maximum tokens for model response | 8192 | | --rep | Repetition number (for multiple runs) | 1 |
Run --help for the full list of parameters, including ablation flags. See our paper for details on the baseline configuration and ablation methodology.
Provider-Specific Configuration
For Anthropic and Google APIs, set OPENAI_CACHE_CONTROL=explicit to enable prompt caching. Other providers (OpenAI, Azure, OpenRouter, vLLM, etc.) use automatic caching by default.
Results
Results are saved to Test//----/:
test_results.json- Per-task success/failure dataconfig.json- Run configurationPlans/- Detailed execution logs per task
Generating Reports
To generate aggregated results across all test runs:
uv run python -m AsgardBench.Model.generate_reports
This produces Test/results.xlsx with success rates, failure breakdowns, and per-plan performance statistics.
Viewing Individual Task Executions
To inspect step-by-step execution traces with images:
uv run streamlit run AsgardBench/plan_viewer.py
This launches a web UI for browsing task executions, viewing agent observations, and analyzing failures.
Docker
A Dockerfile is provided for containerized execution.
Building the Image
docker build -t asgardbench .
Running the Benchmark
# Run sanity check docker run --rm \ -e OPENAI_API_KEY=sk-... \ -e OPENAI_BASE_URL=https://api.openai.com/v1 \ asgardbench \ --test magt_benchmark_sanity --model gpt-4o # Run full benchmark with results saved to host docker run --rm \ -v $(pwd)/results:/app/Test \ -e OPENAI_API_KEY=sk-... \ -e OPENAI_BASE_URL=https://api.openai.com/v1 \ asgardbench \ --test magt_benchmark --model gpt-4o
Networking Notes
When connecting to a local API server (e.g., vLLM running on the host on port 7000), use --network host:
docker run --rm --network host \ -e OPENAI_API_KEY=dummy \ -e OPENAI_BASE_URL=http://host.docker.internal:7000/v1 \ asgardbench \ --test magt_benchmark_sanity --model your-model
Responsible AI
For information about intended uses, limitations, and best practices, see [RESPONSIBLE_AI_FAQ.md](RESPONSIBLE_AI_FAQ.md).
Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will…
Excerpt shown — open the source for the full document.
Notability
notability 3.0/10New repo with few stars