Repo · Amazon (Nova) · published May 19, 2026

amazon-science/EvoMAS

Python



Description: Evolutionary Generation of Multi-Agent Systems; Yuntong Hu, Yuting Zhang, Matthew Trager, Yi Zhang, Shuo Yang, Wei Xia, Stefano Soatto, ICML 2026

Language: Python

License: NOASSERTION

Stars: 3

Forks: 1

Open issues: 0

Created: 2026-05-19T19:23:29Z

Pushed: 2026-05-29T14:00:18Z

Default branch: main

Fork: no

Archived: no

README:

EvoMAS - Evolutionary Generation of Multi-Agent Systems

This repository implements the following paper:

> EvoMAS: Evolutionary Generation of Multi-Agent Systems
> Yuntong Hu, Yuting Zhang, Matthew Trager, Yi Zhang, Shuo Yang, Wei Xia, Stefano Soatto
> *ICML 2026*
> [[arXiv]](https://arxiv.org/abs/2602.06511)

This project is released under the [CC BY-NC 4.0](LICENSE) license. You are free to share and adapt the material for non-commercial purposes, with appropriate credit to the authors. Commercial use is not permitted.

---

EvoMAS uses an LLM meta-model as an evolutionary operator to select, mutate, and cross over MAS configurations from a pool, evaluating them on target benchmarks. This README covers the short path: install, then run one of the benchmark scripts.
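To make the select / mutate / cross-over loop concrete, here is a minimal toy sketch. It is not the repository's implementation: `MASConfig`, `evolve`, `toy_propose`, and `toy_score` are hypothetical names, selection is done by sorting on a score rather than by an LLM, and the "operator" is a plain function standing in for the meta-model.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class MASConfig:
    """Hypothetical stand-in for one multi-agent-system configuration."""
    agents: tuple[str, ...]


def evolve(pool, score, propose, num_parents=2, max_steps=2, seed=42):
    """Grow the pool by one child per step and return the best config."""
    rng = random.Random(seed)
    pool = list(pool)
    for _ in range(max_steps):
        # Selection (assumption: score-based; EvoMAS lets the meta-model choose).
        parents = sorted(pool, key=score, reverse=True)[:num_parents]
        # Variation: the operator returns one child built from the parents.
        pool.append(propose(parents, rng))
    return max(pool, key=score)


def toy_propose(parents, rng):
    """Toy operator: cross over two parents, then mutate by adding a role."""
    a, b = parents[0], parents[-1]
    merged = tuple(dict.fromkeys(a.agents + b.agents))  # order-preserving union
    extra = rng.choice(["planner", "coder", "critic"])
    return MASConfig(tuple(dict.fromkeys(merged + (extra,))))


def toy_score(cfg):
    """Toy reward: number of distinct agent roles."""
    return len(set(cfg.agents))


best = evolve(
    [MASConfig(("planner",)), MASConfig(("coder",))],
    score=toy_score,
    propose=toy_propose,
)
print(best)
```

In the real pipeline the score would come from running the configuration on benchmark tasks, which is far more expensive than this toy reward.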

Setup

conda create -n mas python=3.11 -y
conda activate mas
pip install -r requirements.txt

# API keys — only fill in the providers you'll use
cp .env.example .env
# then edit .env

AWS Bedrock models use the standard credential chain (aws configure, env vars, or an EC2 instance role). No .env entry is needed when AWS CLI credentials are already set up.

Note on model availability: Some model IDs (e.g., on Bedrock) referenced in the default configuration and MAS pool files may be deprecated or retired over time. If you encounter model-not-found errors, update the model IDs to currently available versions. You can override model IDs without editing source code via META_MODEL, JUDGE_MODEL, and AGENT_MODELS environment variables (see [Editing parameters](#editing-parameters)), or pass --meta-model-id / --model-list flags to main.py directly.

Download datasets

Task data is not shipped with the repo. Run the one-click preparation script to download every benchmark into the paths the pipeline expects:

scripts/prepare_datasets.sh

This performs four idempotent steps:

1. SWE-bench source repos — clones 15 upstream projects into dataset/repos/ at the pinned commits listed in src/dataset/repos.txt.
2. BBEH — downloads all 23 subsets from google-deepmind/bbeh into dataset/bbeh/benchmark_tasks/ and rebuilds bbeh_mini (20 stride-sampled tasks per subset = 460 total).
3. WorkBench — clones olly-styles/WorkBench and stages per-domain data.csv + test.json under dataset/workbench//.
4. SWE-bench task metadata — downloads Lite (300 tasks) and Verified (500 tasks) from HuggingFace into dataset/swe_bench_{lite,verified}/.
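As an illustration of what "stride-sampled" could mean for bbeh_mini, the sketch below picks 20 evenly spaced items from a subset. The exact rule used by scripts/prepare_datasets.sh is not shown in this README, so `stride_sample` and its index formula are assumptions.

```python
def stride_sample(items, k=20):
    """Return k evenly spaced items (assumed rule: index i * n // k)."""
    n = len(items)
    if n <= k:
        # Subset is already small enough; keep everything.
        return list(items)
    return [items[(i * n) // k] for i in range(k)]


subset = list(range(200))          # placeholder for one BBEH subset's tasks
mini = stride_sample(subset)
print(len(mini), mini[:3])         # 20 items, starting at index 0
print(23 * len(mini))              # 23 subsets x 20 tasks = 460
```

Sampling by stride rather than at random keeps the mini set reproducible without depending on a seed.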

Useful flags:

scripts/prepare_datasets.sh --force # re-fetch everything
scripts/prepare_datasets.sh --skip-repos # everything except the ~GB repo clones
scripts/prepare_datasets.sh --bbeh-only # target a single dataset family
scripts/prepare_datasets.sh --workbench-only
scripts/prepare_datasets.sh --swe-only
scripts/prepare_datasets.sh --repos-only

Running a benchmark

Three benchmark scripts live in scripts/. Each accepts a subset/variant either as a positional arg or as an env var, and writes outputs under $OUTPUT_ROOT/_main/ (default output_paper/).

# BBEH: 24 subsets available under dataset/bbeh/benchmark_tasks/ (presumably the 23 task subsets plus bbeh_mini)
scripts/run_bbeh.sh mini
scripts/run_bbeh.sh boolean_expressions
SUBSET=word_sorting scripts/run_bbeh.sh

# WorkBench — 6 subdomains
scripts/run_workbench.sh email # default
scripts/run_workbench.sh calendar
SUBDOMAIN=multi_domain scripts/run_workbench.sh

# SWE-bench — verified (default) or lite
scripts/run_swebench.sh verified
scripts/run_swebench.sh lite
VARIANT=lite scripts/run_swebench.sh

Each script auto-resolves NUM_EVAL_TASKS to the full task count of the chosen subset / domain / variant (e.g. 460 for bbeh_mini, 500 for swe_bench_verified). To run on a smaller sample, override via env var (see below).
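The default-then-override behaviour described above can be sketched in a few lines of Python. This is only an illustration: the actual resolution happens in the shell scripts, `resolve_num_eval_tasks` is a hypothetical helper, and clamping an oversized request to the full count is an assumption.

```python
import os


def resolve_num_eval_tasks(full_count, env=None):
    """Use NUM_EVAL_TASKS if set, otherwise the full task count."""
    env = os.environ if env is None else env
    raw = env.get("NUM_EVAL_TASKS", "").strip()
    if not raw:
        return full_count
    requested = int(raw)
    if requested < 1:
        raise ValueError("NUM_EVAL_TASKS must be a positive integer")
    # Assumption: never ask for more tasks than the dataset contains.
    return min(requested, full_count)


print(resolve_num_eval_tasks(460, {}))                          # full bbeh_mini
print(resolve_num_eval_tasks(460, {"NUM_EVAL_TASKS": "50"}))    # smaller sample
```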

There is also a batch-size ablation script:

DATASET=bbeh_mini scripts/run_ablation_batch_size.sh # sweep BS ∈ {1, 10, 460}
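The sweep varies how many tasks share one evolution batch. As a rough sketch of how BATCH_SIZE and WORKERS interact, the snippet below splits 460 tasks into batches and runs them on a thread pool. `make_batches` and `run_batches` are hypothetical helpers, and `len` stands in for the real per-batch evolution.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(tasks, batch_size):
    """Split tasks into consecutive chunks of at most batch_size items."""
    return [tasks[i:i + batch_size] for i in range(0, len(tasks), batch_size)]


def run_batches(tasks, batch_size, workers, run_one_batch):
    """Run each batch on a thread pool; results keep batch order."""
    batches = make_batches(tasks, batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one_batch, batches))


tasks = list(range(460))  # placeholder task IDs for bbeh_mini
for bs in (1, 10, 460):
    results = run_batches(tasks, bs, workers=16, run_one_batch=len)
    print(f"BATCH_SIZE={bs}: {len(results)} batches")
```

With 460 tasks the three settings yield 460, 46, and 1 batches, so at the largest batch size extra workers have nothing to parallelize.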

Editing parameters

All tunable parameters are env vars with defaults in scripts/common.sh. Override any of them before invoking a script:

| Variable | Default | Description |
|---|---|---|
| NUM_EVAL_TASKS | (full dataset) | Number of tasks to run |
| MAX_STEPS | 2 | Evolutionary iterations per batch |
| NUM_PARENTS | 2 | Parent configs the meta-model selects per batch |
| SEED | 42 | Random seed |
| BATCH_SIZE | 1 | Tasks per evolution batch (1 = per-query; N = one shared trajectory over N tasks) |
| WORKERS | 16 | Parallel batches (ThreadPoolExecutor) |
| META_MODEL | bedrock:global.anthropic.claude-sonnet-4-5-... | Evolutionary-operator LLM |
| JUDGE_MODEL | same as META_MODEL | LLM-as-judge for reward |
| AGENT_MODELS | bedrock:us.anthropic.claude-3-5-sonnet-... bedrock:qwen.qwen3-235b-... bedrock:qwen.qwen3-coder-480b-... | Space-separated worker model palette |
| MEMORY_EVOLUTION | true | Persist meta-model memory updates |
| MEMORY_PATH | (auto) | Explicit memory JSON path; empty = dataset//memory_.json |
| OUTPUT_ROOT | output_paper | Root directory for run outputs |
| CONDA_ENV | mas | conda environment name |
| EVOMAS_REPOS_DIR | dataset/repos | Directory containing cloned SWE-bench source repos |
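For readers who want to mirror these defaults outside the shell scripts, here is a hedged Python sketch of how such variables could be parsed. `RunConfig` and `load_config` are hypothetical, and `DEFAULT_META` is a placeholder string because the real default model IDs are abbreviated in the table.

```python
import os
from dataclasses import dataclass

DEFAULT_META = "example-meta-model"  # placeholder, not a real model ID


@dataclass
class RunConfig:
    meta_model: str
    judge_model: str
    agent_models: list[str]
    max_steps: int
    batch_size: int
    workers: int
    memory_evolution: bool


def load_config(env=None):
    """Read the tunables from env vars, applying the defaults from the table."""
    env = os.environ if env is None else env
    meta = env.get("META_MODEL", DEFAULT_META)
    return RunConfig(
        meta_model=meta,
        # JUDGE_MODEL falls back to META_MODEL when unset or empty.
        judge_model=env.get("JUDGE_MODEL") or meta,
        # AGENT_MODELS is space-separated; falling back to meta is an assumption.
        agent_models=env.get("AGENT_MODELS", meta).split(),
        max_steps=int(env.get("MAX_STEPS", "2")),
        batch_size=int(env.get("BATCH_SIZE", "1")),
        workers=int(env.get("WORKERS", "16")),
        memory_evolution=env.get("MEMORY_EVOLUTION", "true").lower() == "true",
    )


cfg = load_config({"META_MODEL": "m1", "AGENT_MODELS": "a b"})
print(cfg)
```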

Examples:

# Small smoke test
NUM_EVAL_TASKS=1 MAX_STEPS=1 WORKERS=1 scripts/run_bbeh.sh mini

# Larger parallel run on workbench_email
WORKERS=32 scripts/run_workbench.sh email

# Swap the worker palette to a single model
AGENT_MODELS="bedrock:us.anthropic.claude-3-5-sonnet-20241022-v2:0" \
scripts/run_bbeh.sh boolean_expressions

# Continue a prior memory file instead of starting fresh
MEMORY_PATH=dataset/bbeh/benchmark_tasks/bbeh_mini/memory_20260423_101510.json \
scripts/run_bbeh.sh mini

Direct main.py invocation

Under the hood, each script runs python main.py with the resolved arguments. For ad-hoc configurations not covered by the scripts, call main.py directly:

python main.py --dataset bbeh_boolean_expressions \
--num-eval-tasks 50 \
--batch-size 1 \
--workers 8 \
--max-steps 2 \
--meta-model-id…
