microsoft/amplifier-bundle-evaluation
Description: evaluation bundle for the Amplifier project
Language: HTML
License: MIT
Stars: 2
Forks: 0
Open issues: 0
Created: 2026-05-04T14:50:08Z
Pushed: 2026-06-09T20:59:35Z
Default branch: main
Fork: no
Archived: no
README:
Amplifier Bundle Evaluation
A one-stop shop for evaluating AI agents, bundles, and recipes across the Amplifier ecosystem. It provides an evaluation mode and supporting context for designing evaluations, along with a Python harness (amplifier_evaluation) for running pre-defined tasks against agents in a Digital Twin Universe.
Example Uses:
- "/evaluation I have changes to an Amplifier bundle I would like to evaluate the impact of. Can you help me measure it?"
- "/evaluation I have a custom agent that does Y"
- "/evaluation I built a memory system and want to know if it improves my agent"
Self-evaluation
Evaluations of this bundle live under [.amplifier/evaluations/](.amplifier/evaluations/). Each one is a self-contained scenario with a single-script runner. The first is [01-evaluate-amplifier-bundle](.amplifier/evaluations/01-evaluate-amplifier-bundle/).
Installation
Prerequisites
- An existing Amplifier installation
- A bundle that provides a runtime (e.g. amplifier-foundation) composed in the same session
- amplifier-bundle-modes composed in the same session, since the evaluation mode is delivered through that capability
- The industry benchmarking capability recommends using Humanity's Last Exam for sample tasks. This dataset is gated and requires creating a HuggingFace account and creating an access token with the permission "Read access to contents of all public gated repos you can access". Please protect the integrity of this benchmark by not publicly sharing, re-uploading, or distributing the dataset.
Amplifier Bundle
To compose it onto an existing setup:
amplifier bundle add "git+https://github.com/microsoft/amplifier-bundle-evaluation@main#subdirectory=behaviors/evaluation.yaml" --app
--app composes the bundle onto every Amplifier session. Remove it to only register the bundle for later activation with amplifier bundle use.
If you also need the modes capability (required for the evaluation mode to be discoverable):
amplifier bundle add "git+https://github.com/microsoft/amplifier-bundle-modes@main#subdirectory=behaviors/modes.yaml" --app
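Both commands above share one source shape: a git+https repository URL, a ref after @, and a #subdirectory= fragment pointing at the behavior file. The following is a minimal Python sketch of that anatomy. The helper name bundle_source is hypothetical and not part of Amplifier; it only formats the string and prints the two commands shown in this section.

```python
def bundle_source(repo: str, path: str, ref: str = "main") -> str:
    """Format a bundle source string in the shape used by the commands above.

    Hypothetical helper for illustration only; this is not an Amplifier API.
    """
    return f"git+https://github.com/{repo}@{ref}#subdirectory={path}"


evaluation = bundle_source(
    "microsoft/amplifier-bundle-evaluation", "behaviors/evaluation.yaml"
)
modes = bundle_source("microsoft/amplifier-bundle-modes", "behaviors/modes.yaml")

# Print the two `amplifier bundle add` invocations from this section.
for source in (evaluation, modes):
    print(f'amplifier bundle add "{source}" --app')
```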
Harness (Python package)
The bundle also ships a Python package, amplifier_evaluation, for running pre-defined tasks against an agent in a Digital Twin Universe. It is consumed as a library and is independent of the evaluation mode. The package depends on amplifier-core and amplifier-foundation; for local development those resolve to the sibling submodules via [tool.uv.sources].
git clone https://github.com/microsoft/amplifier-bundle-evaluation
cd amplifier-bundle-evaluation
uv sync
The first sync builds amplifier-core from source (Rust extension), which takes a minute or two. After that, the AI User is importable as from amplifier_evaluation.ai_user import AIUser. See [context/harness/overview.md](context/harness/overview.md) for layout, task/agent schemas, and usage.
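A quick way to confirm the sync worked is to attempt the import named above. This is only a smoke-test sketch: the function name ai_user_available is hypothetical, and it checks nothing beyond the module path and class name stated in this README. Run it inside the project environment, for example with uv run python.

```python
import importlib


def ai_user_available() -> bool:
    """Return True if amplifier_evaluation.ai_user imports and exposes AIUser.

    Hypothetical helper; only the module path and class name come from this README.
    """
    try:
        module = importlib.import_module("amplifier_evaluation.ai_user")
    except ImportError:
        # Package not installed in this environment, or the build did not complete.
        return False
    return hasattr(module, "AIUser")


print("AIUser importable:", ai_user_available())
```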
CLI
The bundle ships an amplifier-evaluation CLI for running the benchmark. Install it on PATH:
uv tool install git+https://github.com/microsoft/amplifier-bundle-evaluation@main
It is also runnable as a module without installing (python -m amplifier_evaluation run).
Run the benchmark over every (agent, task) permutation discovered under amplifier-benchmark/:
amplifier-evaluation run --agents-dir amplifier-benchmark/agents --tasks-dir amplifier-benchmark/tasks
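To make "every (agent, task) permutation" concrete: the selection is the cross product of the discovered agents and the discovered tasks. The sketch below illustrates that idea only. It assumes each immediate entry under a directory is one agent or task, which is an assumption for illustration; the harness's real discovery rules are described in context/harness/overview.md and may differ. The names discover, all_pairs, agent-a, and task-1 are hypothetical.

```python
import tempfile
from itertools import product
from pathlib import Path
from typing import List, Tuple


def discover(directory: Path) -> List[str]:
    """Return sorted entry names under `directory`, ignoring hidden entries.

    Assumption: one entry per agent or task; the real discovery may differ.
    """
    if not directory.is_dir():
        return []
    return sorted(p.stem for p in directory.iterdir() if not p.name.startswith("."))


def all_pairs(agents_dir: Path, tasks_dir: Path) -> List[Tuple[str, str]]:
    """Cross every discovered agent with every discovered task."""
    return list(product(discover(agents_dir), discover(tasks_dir)))


# Demo on a throwaway layout with two agents and three tasks.
with tempfile.TemporaryDirectory() as root:
    for name in ("agent-a", "agent-b"):
        Path(root, "agents", name).mkdir(parents=True)
    for name in ("task-1", "task-2", "task-3"):
        Path(root, "tasks", name).mkdir(parents=True)
    pairs = all_pairs(Path(root, "agents"), Path(root, "tasks"))

print(len(pairs))  # 2 agents x 3 tasks = 6 pairs
```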
The default --agents-dir and --tasks-dir resolve relative to the current directory, so running from a repo checkout needs no flags. Use --agent/--task to restrict the selection (or --pair agent:task for explicit pairings) and --dry-run to preview it. See amplifier-evaluation run --help for the full option list.
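When the CLI is driven from a script, it can help to assemble the argument list programmatically. The sketch below uses only the flags named in this section (--agents-dir, --tasks-dir, --pair, --dry-run). The function build_run_command and the names my-agent and my-task are hypothetical, and the exact argument forms should be confirmed with amplifier-evaluation run --help.

```python
import shlex
from typing import List, Optional, Tuple


def build_run_command(
    agents_dir: Optional[str] = None,
    tasks_dir: Optional[str] = None,
    pair: Optional[Tuple[str, str]] = None,
    dry_run: bool = False,
) -> List[str]:
    """Assemble an `amplifier-evaluation run` argument list (hypothetical helper)."""
    command = ["amplifier-evaluation", "run"]
    if agents_dir:
        command += ["--agents-dir", agents_dir]
    if tasks_dir:
        command += ["--tasks-dir", tasks_dir]
    if pair:
        agent, task = pair
        # The README gives the pairing form as agent:task.
        command += ["--pair", f"{agent}:{task}"]
    if dry_run:
        command.append("--dry-run")
    return command


# Preview a single explicit pairing without running it.
preview = build_run_command(pair=("my-agent", "my-task"), dry_run=True)
print(shlex.join(preview))
```

Passing the resulting list to subprocess.run(preview, check=True) would execute it, provided the CLI is installed on PATH.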
Contributing
> [!NOTE]
> This project is not currently accepting external contributions, but we're actively working toward opening this up. We value community input and look forward to collaborating in the future. For now, feel free to fork and experiment!
Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit Contributor License Agreements.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.
Trademarks
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.
Notability
Notability: 3.0/10. Low traction, routine repo.