meituan-longcat/MineExplorer
Python
Description: Reproduction code for paper "MineExplorer: Evaluating Open-World Exploration of MLLM Agents in Minecraft"
Language: Python
Stars: 11
Forks: 0
Open issues: 1
Created: 2026-05-29T05:07:05Z
Pushed: 2026-06-12T05:11:39Z
Default branch: main
Fork: no
Archived: no
README:
MineExplorer: Evaluating Open-World Exploration of MLLM Agents in Minecraft
MineExplorer is a benchmark for evaluating the open-world exploration capabilities of multimodal large language model (MLLM) agents in Minecraft. We first filter out atomic tasks whose solutions rely heavily on Minecraft-specific knowledge, so that the benchmark better reflects general open-world reasoning. We then organize the benchmark around a ReAct-style capability formulation and compose atomic tasks into implicit multi-hop tasks. To construct reliable instances, MineExplorer uses a multi-agent synthesis workflow that jointly designs task graphs, sandbox scenes, and rule-based milestone evaluators. Experiments show that open-world exploration remains challenging: strong models handle many single-hop tasks but degrade sharply when hidden prerequisites must be coordinated over longer trajectories, and larger models or thinking modes do not consistently translate into better performance.
---
1. Minecraft Sandbox Environment
MineExplorer uses a Minecraft sandbox service built on top of MineStudio, an open-source framework that provides a Minecraft simulator engine controllable through an HTTP API. The sandbox lets you create and control Minecraft game environments programmatically: spawning scenes, issuing commands, resetting episodes, and capturing first-person screenshots.
We release a ready-to-use Docker image: `davidzhth/mineexplorer:0.0.1`
Image Specifications
| Component | Version / Details |
|-----------|-------------------|
| Base OS | Ubuntu 22.04 |
| Python | 3.10 |
| Java | OpenJDK 8 (Minecraft runtime dependency) |
| Framework | MineStudio (bundled Minecraft simulator engine) |
| Rendering | Xvfb virtual framebuffer (headless rendering, no display required) |
Starting the Sandbox
docker run -d --name mineexplorer -p 8000:8000 davidzhth/mineexplorer:0.0.1
The service listens on port 8000. On first launch, Minecraft needs to load the world — this typically takes 60–120 seconds. You can check readiness with:
curl http://localhost:8000/monitor/alive
# Returns: {"status":"alive", ...}
Local Environment Configuration
Once the sandbox is running, set the following environment variable so that both generate_benchmark.py and eval_benchmark.py connect to it automatically:
export MC_SANDBOX_URL=http://localhost:8000
You are then ready to generate and evaluate the benchmark as described in the sections below.
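Because the first launch can take a minute or two, a script that drives the sandbox may want to wait for the readiness endpoint instead of failing on the first refused connection. The following is a minimal sketch, not part of the repository: the function names, the 180-second budget, and the 5-second interval are assumptions chosen for illustration. Only the `/monitor/alive` path and the `"status": "alive"` field come from the text above.

```python
import json
import time
import urllib.request


def fetch_json(url, timeout_s=5.0):
    """GET a URL and decode its JSON body; raises if the server is unreachable."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_until_alive(base_url, max_wait_s=180.0, interval_s=5.0,
                     fetch=fetch_json, sleep=time.sleep):
    """Poll <base_url>/monitor/alive until it reports status "alive".

    Returns True once the sandbox answers, or False after roughly
    max_wait_s seconds of sleeping between attempts.
    """
    url = base_url.rstrip("/") + "/monitor/alive"
    waited = 0.0
    while True:
        try:
            payload = fetch(url)
            if isinstance(payload, dict) and payload.get("status") == "alive":
                return True
        except (OSError, ValueError):
            # OSError covers refused connections and HTTP errors (URLError);
            # ValueError covers a body that is not valid JSON.
            pass
        if waited >= max_wait_s:
            return False
        sleep(interval_s)
        waited += interval_s


# Example (needs a running sandbox, so it is left commented out):
# import os
# url = os.environ.get("MC_SANDBOX_URL", "http://localhost:8000")
# print("sandbox ready:", wait_until_alive(url))
```

The `fetch` and `sleep` parameters exist only so the loop can be exercised without a live sandbox; in normal use the defaults are enough.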
---
2. Environment Setup
Install the required Python packages:
pip install gymnasium numpy requests pillow loguru python-dotenv typer fastapi uvicorn pydantic imageio imageio-ffmpeg
Set the required environment variables:
export AGENT_API_KEY="your_api_key"
export AGENT_API_BASE="https://your-api-endpoint/v1/openai/native"
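A quick pre-flight check can catch a missing variable before a long generation or evaluation run starts. The sketch below is an assumption-laden helper, not something shipped with the repository: it treats the three variables named in this README as required, and whether the scripts validate them on their own is not stated here.

```python
import os

# Variable names taken from this README; treating all three as mandatory
# is an assumption made for this sketch.
REQUIRED_VARS = ("MC_SANDBOX_URL", "AGENT_API_KEY", "AGENT_API_BASE")


def missing_env_vars(environ=None, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or blank."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name, "").strip()]


# Example: report what is missing in the current shell environment.
missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All expected environment variables are set.")
```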
---
3. Generating the Benchmark
Use generate_benchmark.py to generate Minecraft evaluation tasks. The benchmark directory contains the benchmark used in the paper, covering single-hop to 4-hop tasks.
python generate_benchmark.py multi \
    --model aws.claude-opus-4.6 \
    --num-samples 10 \
    --k-min 1 \
    --k-max 1 \
    --candidate-num 1 \
    --output benchmark_new
Key Arguments
| Argument | Description |
|----------|-------------|
| multi / single | Multi-agent or single-agent benchmark generation. The paper uses multi-agent mode, which produces more reliable instances but is slower due to sandbox interaction. |
| --model | Model name to use for generation |
| --num-samples | Number of samples to generate |
| --k-min / --k-max | Range of subtask hops per sample (e.g., set both to 1 for single-hop tasks only) |
| --candidate-num | Number of candidate atomic tasks |
| --output | Output directory |
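When several configurations are generated in a row, it can be convenient to assemble the command line from the arguments above in one place. The helper below is a hypothetical wrapper written for this README, not part of the repository. It only builds the argument list from the documented flags, and the `k_min <= k_max` check is an assumption about what the script expects.

```python
import shlex


def build_generate_cmd(mode, model, num_samples, k_min, k_max, candidate_num, output):
    """Build the argv list for generate_benchmark.py from the documented flags."""
    if mode not in ("multi", "single"):
        raise ValueError("mode must be 'multi' or 'single'")
    if k_min > k_max:
        raise ValueError("k_min must not exceed k_max")
    return [
        "python", "generate_benchmark.py", mode,
        "--model", model,
        "--num-samples", str(num_samples),
        "--k-min", str(k_min),
        "--k-max", str(k_max),
        "--candidate-num", str(candidate_num),
        "--output", output,
    ]


# Example: print the single-hop command without running it.
cmd = build_generate_cmd("multi", "aws.claude-opus-4.6", 10, 1, 1, 1, "benchmark_new")
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` from the repository root once the sandbox is up.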
Output Structure
benchmark_new/
├── 0000/
│   └── multi-agent/
│       ├── metadata.json          # Scene configuration
│       ├── milestones.json        # Milestone definitions
│       ├── reasoning_graph.json   # Dependency graph
│       └── debate_log.json        # Agent dialogue log
├── 0001/
│   └── multi-agent/
│       └── ...
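To inspect generated instances, the four JSON files can be loaded per sample directory. The sketch below relies only on the layout shown above; the function names are made up for illustration, and the internal schema of each file is not documented in this README, so the code loads the JSON without interpreting it.

```python
import json
from pathlib import Path

# File names as listed in the output structure above.
EXPECTED_FILES = (
    "metadata.json",
    "milestones.json",
    "reasoning_graph.json",
    "debate_log.json",
)


def load_instance(instance_dir, mode_dir="multi-agent"):
    """Load whichever expected JSON files exist under <instance_dir>/<mode_dir>."""
    root = Path(instance_dir) / mode_dir
    loaded = {}
    for name in EXPECTED_FILES:
        path = root / name
        if path.is_file():
            with path.open(encoding="utf-8") as f:
                loaded[name] = json.load(f)
    return loaded


def iter_instances(benchmark_dir):
    """Yield (sample_id, loaded_files) for each sample directory, in sorted order."""
    for child in sorted(Path(benchmark_dir).iterdir()):
        if child.is_dir():
            yield child.name, load_instance(child)


# Example (assumes benchmark_new/ exists, so it is left commented out):
# for sample_id, files in iter_instances("benchmark_new"):
#     print(sample_id, sorted(files))
```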
Want to Generate Harder Tasks?
You can generate extremely challenging tasks by increasing --k-min, --k-max, and --candidate-num. For example, the following command generates tasks with 8–12 prerequisite hops and 15 candidate atomic tasks:
python generate_benchmark.py multi \
    --model aws.claude-opus-4.6 \
    --num-samples 10 \
    --k-min 8 \
    --k-max 12 \
    --candidate-num 15 \
    --output benchmark_hard
This produces tasks with deeply nested, multi-branch dependency graphs that are far more complex than standard benchmark instances.
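To make the idea of a nested, multi-branch dependency graph concrete, here is a small illustrative sketch. The graph format (a dict mapping each task to its prerequisites), the task names, and the depth measure are all hypothetical: the actual schema of `reasoning_graph.json` and the way the paper counts hops are not described in this README.

```python
from graphlib import TopologicalSorter

# Hypothetical graph: each key lists the tasks that must be finished first.
example_graph = {
    "goal": ["unlock_door", "light_room"],
    "unlock_door": ["find_key"],
    "light_room": [],
    "find_key": [],
}


def prerequisite_order(graph):
    """Return one valid execution order in which prerequisites come first."""
    return list(TopologicalSorter(graph).static_order())


def chain_depths(graph):
    """Map each task to the length of its longest prerequisite chain.

    A task with no prerequisites has depth 1. Raises graphlib.CycleError
    if the graph contains a cycle.
    """
    depths = {}
    for node in TopologicalSorter(graph).static_order():
        depths[node] = 1 + max((depths[p] for p in graph.get(node, ())), default=0)
    return depths


print(prerequisite_order(example_graph))
print(chain_depths(example_graph)["goal"])
```

In this toy graph the goal sits at the end of a chain of three tasks, while `light_room` forms a separate, shallower branch.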
---
4. Evaluating the Benchmark
Use eval_benchmark.py to run an agent on the generated benchmark and evaluate its performance.
Using an OpenAI-compatible API
python eval_benchmark.py \
    --model aws.claude-opus-4.6 \
    --benchmark-dir benchmark_new \
    --output-dir results \
    --num-workers 10 \
    --resume
Using a Local vLLM Service
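Start the vLLM server first. The example binds port 8000, which is also the host port published by the sandbox container in Section 1. If both run on the same machine, move one of them, for example by publishing the sandbox with `-p 8001:8000` and setting `MC_SANDBOX_URL=http://localhost:8001`: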
python -m vllm.entrypoints.openai.api_server \
    --model Qwen2.5-7B \
    --port 8000
Then run evaluation:
python eval_benchmark.py \
    --model Qwen2.5-7B \
    --benchmark-dir benchmark_new \
    --output-dir results \
    --num-workers 10 \
    --use-vllm
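Before launching evaluation against a local server, it can help to confirm that vLLM is up and serving the model name passed to `--model`. The sketch below queries the `/v1/models` route exposed by vLLM's OpenAI-compatible server. The helper names and the default base URL are assumptions for illustration, and nothing here reflects how `eval_benchmark.py` itself locates the server.

```python
import json
import urllib.request


def served_model_ids(models_payload):
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [m["id"] for m in models_payload.get("data", []) if "id" in m]


def fetch_models(base_url="http://localhost:8000", timeout_s=5.0):
    """GET <base_url>/v1/models and return the decoded JSON payload."""
    url = base_url.rstrip("/") + "/v1/models"
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example (needs a running vLLM server, so it is left commented out):
# ids = served_model_ids(fetch_models())
# print("Qwen2.5-7B" in ids, ids)
```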
Common Arguments
| Argument | Description |
|----------|-------------|
| --model | Model to use for evaluation |
| --benchmark-dir | Path to the benchmark directory |
| --output-dir | Directory to save results |
| --num-workers | Number of parallel sandbox workers |
| --resume | Resume from checkpoint (skip completed tasks) |
| --limit | Limit number of evaluation samples (for testing) |
Output Structure
results/
└── aws.claude-opus-4.6/
    ├── 0000/
    │   ├── result.json      # Evaluation result
    │   ├── episode.mp4      # Episode replay video
    │   └── messages/        # Conversation logs
    ├── 0001/
    │   └── ...
    └── eval_summary.json    # Aggregated statistics
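For a rough progress check on a long run, the results directory can be scanned for tasks that already have a `result.json`. This is a sketch based only on the layout above. The function names are hypothetical, the fields inside `result.json` and `eval_summary.json` are not documented here, and how `--resume` decides that a task is complete may differ from this simple file check.

```python
import json
from pathlib import Path


def completed_tasks(model_results_dir):
    """Return sorted task ids whose directory contains a result.json file."""
    root = Path(model_results_dir)
    return sorted(p.parent.name for p in root.glob("*/result.json"))


def load_summary(model_results_dir):
    """Load eval_summary.json if present, otherwise return None."""
    path = Path(model_results_dir) / "eval_summary.json"
    if not path.is_file():
        return None
    with path.open(encoding="utf-8") as f:
        return json.load(f)


# Example (assumes an evaluation has been run, so it is left commented out):
# done = completed_tasks("results/aws.claude-opus-4.6")
# print(len(done), "tasks have results")
# print(load_summary("results/aws.claude-opus-4.6"))
```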
---
Citation
If you find this work useful, please cite:
@misc{ju2026mineexplorerevaluatingopenworldexploration,...