What does this repo signal mean?

Microsoft published microsoft/amplifier-bundle-evaluation (HTML). Repository signals like this one expose tooling, eval, infrastructure, or model-adjacent work before it appears in a launch post. High-signal details: repo microsoft/amplifier-bundle-evaluation · language HTML · evaluation bundle for the Amplifier project. onlylabs links this event to 1 captured evidence page and 6 related repo signals. It also maps to Evals and quality in the data-business radar.

Microsoft Repo: microsoft/amplifier-bundle-evaluation

Captured source

GitHub: github.com/microsoft/amplifier-bundle-evaluation

microsoft/amplifier-bundle-evaluation repository metadata

published May 4, 2026 · seen Jun 5 · captured Jun 11 · http 200 · method plain

microsoft/amplifier-bundle-evaluation

Description: evaluation bundle for the Amplifier project

Language: HTML

License: MIT

Stars: 2

Forks: 0

Open issues: 0

Created: 2026-05-04T14:50:08Z

Pushed: 2026-06-09T20:59:35Z

Default branch: main

Fork: no

Archived: no

README:

Amplifier Bundle Evaluation

A one-stop shop for evaluating AI agents, bundles, and recipes across the Amplifier ecosystem. It provides an evaluation mode and supporting context for designing evaluations, along with a Python harness (amplifier_evaluation) for running pre-defined tasks against agents in a Digital Twin Universe.

Example Uses:

"/evaluation I have changes to an Amplifier bundle I would like to evaluate the impact of. Can you help me measure it?"
"/evaluation I have a custom agent that does Y"
"/evaluation I built a memory system and want to know if it improves my agent"

Self-evaluation

Evaluations of this bundle live under [.amplifier/evaluations/](.amplifier/evaluations/). Each one is a self-contained scenario with a single-script runner. The first is [01-evaluate-amplifier-bundle](.amplifier/evaluations/01-evaluate-amplifier-bundle/).

Installation

Prerequisites

An existing Amplifier installation
A bundle that provides a runtime (e.g. amplifier-foundation) composed in the same session
amplifier-bundle-modes composed in the same session, since the evaluation mode is delivered through that capability
The industry benchmarking capability recommends using Humanity's Last Exam for sample tasks. This dataset is gated and requires creating a HuggingFace account and creating an access token with the permission "Read access to contents of all public gated repos you can access". Please protect the integrity of this benchmark by not publicly sharing, re-uploading, or distributing the dataset.
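One way to satisfy the gated-dataset prerequisite without putting the token in source code is to read it from the environment. The sketch below is an illustration, not part of the bundle: `hf_token_from_env` and `load_gated_dataset` are hypothetical helper names, the dataset's repository ID is deliberately left as a parameter because the README does not state it, and the code assumes the Hugging Face `datasets` library is installed and that the token is exported as `HF_TOKEN`.

```python
import os


def hf_token_from_env(var="HF_TOKEN"):
    """Return the access token stored in an environment variable (hypothetical helper)."""
    token = os.environ.get(var, "").strip()
    if not token:
        raise RuntimeError(f"Set {var} to a token with read access to gated repos.")
    return token


def load_gated_dataset(repo_id, split):
    """Load a gated dataset with the token; repo_id is supplied by the caller."""
    # Imported lazily so the token helper works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, token=hf_token_from_env())
```

Keeping the token in the environment also keeps it out of shell history and committed files, which fits the request not to redistribute the dataset.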

Amplifier Bundle

To compose it onto an existing setup:

amplifier bundle add "git+https://github.com/microsoft/amplifier-bundle-evaluation@main#subdirectory=behaviors/evaluation.yaml" --app

--app composes the bundle onto every Amplifier session. Remove it to only register the bundle for later activation with amplifier bundle use.

If you also need the modes capability (required for the evaluation mode to be discoverable):

amplifier bundle add "git+https://github.com/microsoft/amplifier-bundle-modes@main#subdirectory=behaviors/modes.yaml" --app

Harness (Python package)

The bundle also ships a Python package, amplifier_evaluation, for running pre-defined tasks against an agent in a Digital Twin Universe. It is consumed as a library and is independent of the evaluation mode. The package depends on amplifier-core and amplifier-foundation; for local development those resolve to the sibling submodules via [tool.uv.sources].

git clone https://github.com/microsoft/amplifier-bundle-evaluation
cd amplifier-bundle-evaluation
uv sync

The first sync builds amplifier-core from source (Rust extension), which takes a minute or two. After that, the AI User is importable as from amplifier_evaluation.ai_user import AIUser. See [context/harness/overview.md](context/harness/overview.md) for layout, task/agent schemas, and usage.

CLI

The bundle ships an amplifier-evaluation CLI for running the benchmark. Install it on PATH:

uv tool install git+https://github.com/microsoft/amplifier-bundle-evaluation@main

It is also runnable as a module without installing (python -m amplifier_evaluation run).

Run the benchmark over every (agent, task) permutation discovered under amplifier-benchmark/:

amplifier-evaluation run --agents-dir amplifier-benchmark/agents --tasks-dir amplifier-benchmark/tasks

The default --agents-dir and --tasks-dir resolve relative to the current directory, so running from a repo checkout needs no flags. Use --agent/--task to restrict the selection (or --pair agent:task for explicit pairings) and --dry-run to preview it. See amplifier-evaluation run --help for the full option list.
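The selection rules above can be sketched in a few lines of Python. This is an illustrative sketch, not the harness's implementation: `select_pairs` and its parameters are hypothetical names, and it assumes that --agent and --task act as filters over the discovered names while --pair entries replace the full permutation.

```python
from itertools import product


def select_pairs(agents, tasks, only_agents=None, only_tasks=None, pairs=None):
    """Return the (agent, task) pairs to run.

    Hypothetical helper: mirrors the documented flags, not the real CLI code.
    """
    if pairs:
        # Assumed --pair behaviour: explicit "agent:task" strings win outright.
        return [tuple(p.split(":", 1)) for p in pairs]
    # Assumed --agent / --task behaviour: keep only the named entries.
    kept_agents = [a for a in agents if not only_agents or a in only_agents]
    kept_tasks = [t for t in tasks if not only_tasks or t in only_tasks]
    # Default: every agent is paired with every task.
    return list(product(kept_agents, kept_tasks))


if __name__ == "__main__":
    # Placeholder names; a real run would discover them from the two directories.
    for agent, task in select_pairs(["agent-a", "agent-b"], ["task-1", "task-2"]):
        print(f"{agent} x {task}")
```

Under these assumptions the number of runs grows as agents times tasks, which is why previewing the selection with --dry-run before a full run is worthwhile.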

Contributing

> [!NOTE]
> This project is not currently accepting external contributions, but we're actively working toward opening this up. We value community input and look forward to collaborating in the future. For now, feel free to fork and experiment!

Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit Contributor License Agreements.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

Notability

notability 3.0/10

Low traction, routine repo