Repo · Microsoft · published May 7, 2026

microsoft/amplifier-eval-harness

Python


Captured source



Description: Eval harness for exploring configuration of amplifier-app-cli for the Amplifier project

Language: Python

License: MIT

Stars: 0

Forks: 1

Open issues: 0

Created: 2026-05-07T14:53:20Z

Pushed: 2026-05-22T14:37:09Z

Default branch: main

Fork: no

Archived: no

README:

amplifier-eval-harness

Test harness for running scenarios through amplifier-app-cli inside Digital Twin Universe (DTU) environments.

Runs bundle × scenario × run matrices in isolated containers, captures per-run artifacts and metrics, and supports swapping in local working trees of any ecosystem repo via Gitea mirroring.

Status

Pre-alpha (v0.2). Sequential and parallel execution paths are in place. The first successful end-to-end smoke run against a live DTU happened on 2026-05-08 (Linux/Incus, foundation bundle, claude-opus-4-7); broader validation across configs and scenarios is still pending.

Quick start

```bash
# Prerequisites:
# - amplifier CLI installed (uv tool install git+https://github.com/microsoft/amplifier)
# - amplifier-bundle-gitea (provides amplifier-gitea CLI)
# - amplifier-bundle-digital-twin-universe v0.2.0+ (provides amplifier-digital-twin CLI).
#   v0.1.x silently ignores `default_match_mode: boundary`; URL prefix collisions
#   with sibling repos can over-match. PR #7 (merged 2026-05-05) fixes it.
# - Docker running (for Gitea container) + Incus (for DTU containers)
# - At least one provider env var set (ANTHROPIC_API_KEY, OPENAI_API_KEY, GOOGLE_API_KEY, GITHUB_TOKEN…)

# Install
uv tool install --from . amplifier-eval-harness

# Sanity check (no DTU launches)
amplifier-eval-harness validate --config configs/smoke.yaml

# Smoke run (1 bundle × 1 scenario × 1 run, sequential)
amplifier-eval-harness run --config configs/smoke.yaml

# Baseline (foundation + amplifier-dev × 3 runs each, up to 2 in parallel)
amplifier-eval-harness run --config configs/baseline.yaml

# Override parallelism at the CLI without editing the config
amplifier-eval-harness run --config configs/baseline.yaml --parallelism 4

# Dry run (just expand and print the matrix)
amplifier-eval-harness run --config configs/smoke.yaml --dry-run
```

Output lands in eval-results/-/. Read summary.md first.

Configs

Configs live in configs/. Add new ones with descriptive names; pick which to run via --config.

| Config | Purpose |
|---|---|
| smoke.yaml | Inner-dev-loop. 1 bundle × 1 scenario × 1 run, sequential. |
| baseline.yaml | foundation + amplifier-dev × explain-repo × 3 runs each, parallelism=2. |

See [docs/designs/architecture.md](docs/designs/architecture.md) for the full schema and run flow.
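The sketch below shows roughly what a config's matrix expands to. It is a minimal illustration, not the harness's code. The `RunSpec` fields (`bundle`, `scenario`, `run_index`) and the `expand_matrix` helper are hypothetical names chosen for this example. The real `RunSpec` and config schema are defined in docs/designs/architecture.md and may differ.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class RunSpec:
    # Hypothetical fields for illustration; the harness's real RunSpec may differ.
    bundle: str
    scenario: str
    run_index: int  # 1-based repetition number within a bundle/scenario combo


def expand_matrix(bundles: list[str], scenarios: list[str], runs_per_combo: int) -> list[RunSpec]:
    """Expand bundles x scenarios x runs_per_combo into a flat list of RunSpec."""
    if runs_per_combo < 1:
        raise ValueError("runs_per_combo must be >= 1")
    return [
        RunSpec(bundle=b, scenario=s, run_index=i)
        for b, s, i in product(bundles, scenarios, range(1, runs_per_combo + 1))
    ]


# Roughly what the baseline.yaml row describes: 2 bundles x 1 scenario x 3 runs = 6 runs.
specs = expand_matrix(["foundation", "amplifier-dev"], ["explain-repo"], runs_per_combo=3)
for spec in specs:
    print(spec.bundle, spec.scenario, spec.run_index)
```

`itertools.product` iterates the rightmost factor fastest. All runs of one bundle/scenario pair are therefore adjacent in the flat list.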

Scenarios

Scenarios live in scenarios//. Each scenario has a prompt.md and an optional workspace/ directory of fixture files seeded into /workspace inside the DTU before the prompt runs.

| Scenario | What it exercises |
|---|---|
| explain-repo | File reading, code summarization. Stable across runs. |
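The scenario layout described above can be read with a small loader. Each scenario is a directory under scenarios/ containing a prompt.md and an optional workspace/ fixture directory. This is a hypothetical sketch: `Scenario` and `load_scenarios` are illustrative names, not part of the harness's API.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class Scenario:
    name: str
    prompt: str
    workspace: Optional[Path]  # fixture files to seed into /workspace, if present


def load_scenarios(root: Path) -> dict[str, Scenario]:
    """Load every subdirectory of root that contains a prompt.md."""
    scenarios: dict[str, Scenario] = {}
    for scenario_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        prompt_file = scenario_dir / "prompt.md"
        if not prompt_file.is_file():
            continue  # not a scenario: skip directories without a prompt
        workspace = scenario_dir / "workspace"
        scenarios[scenario_dir.name] = Scenario(
            name=scenario_dir.name,
            prompt=prompt_file.read_text(encoding="utf-8"),
            workspace=workspace if workspace.is_dir() else None,
        )
    return scenarios
```

For example, `load_scenarios(Path("scenarios"))["explain-repo"].prompt` would return the text of that scenario's prompt.md. Its `workspace` field would be `None` if the scenario ships no fixtures.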

Settings overlays

Per-config provider/model selection happens via a YAML overlay deep-merged into the container's ~/.amplifier/settings.yaml at provision time. The default overlay (settings/default-providers.yaml) is lifted from the harness owner's ~/.amplifier/settings.yaml minus provider-chat-completions (which is local-only and not relevant inside DTUs).

To use a different model mix, copy the overlay, edit, and point the config's settings_overlay: at the new file.
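The README does not spell out the exact merge rules. The following is a minimal sketch of a typical recursive deep merge, under three assumptions:

- nested mappings merge key by key;
- the overlay wins on conflicts;
- lists and scalars are replaced wholesale.

`deep_merge` is a hypothetical helper, and the example keys are placeholders, not Amplifier's settings schema. Loading the YAML files themselves (e.g. with PyYAML) is omitted so the sketch uses only the standard library.

```python
import copy
from typing import Any


def deep_merge(base: dict[str, Any], overlay: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict with overlay deep-merged into base (overlay wins).

    Assumption: nested dicts merge recursively; lists and scalars from the
    overlay replace the base value outright. Neither input is mutated.
    """
    merged = copy.deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Placeholder keys: "keep" survives untouched, "a.y" is overridden, "a.z" is added.
container_settings = {"a": {"x": 1, "y": 2}, "keep": True}
overlay = {"a": {"y": 20, "z": 3}}
print(deep_merge(container_settings, overlay))
```

Replacing lists rather than concatenating them is a deliberate choice in this sketch. It lets an overlay swap out a whole list (for example, of providers) instead of appending duplicates.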

Gitea instance pinning

By default the harness greedily reuses the first instance returned by amplifier-gitea list, falling back to creating a new one if none exist. That's fine on a solo dev machine but dangerous when multiple workspaces share a host — two harness invocations against the same Gitea race on populate_repo and may stomp on each other's mirrors.

To isolate a workspace, create a dedicated Gitea instance and pin to it:

```bash
amplifier-gitea create --port 10111 --name gitea-myworkspace
# {"id": "gitea-abcd1234", "port": 10111, ...}

# Pin via env var (per-invocation, no config edit required)
EVAL_HARNESS_GITEA_INSTANCE=gitea-abcd1234 amplifier-eval-harness run --config configs/smoke.yaml

# Or pin in the config itself
echo "gitea_instance_id: gitea-abcd1234" >> configs/myconfig.yaml
```

Resolution order (first wins): EVAL_HARNESS_GITEA_INSTANCE env var → YAML gitea_instance_id → greedy reuse of first listed instance → create new on port 10110. Pinned instances must already exist; the harness errors out rather than silently falling back.
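The resolution order above can be expressed as a short function. This is a hypothetical sketch, not the harness's implementation:

- `existing_ids` stands in for the ids reported by `amplifier-gitea list`;
- `create_instance` stands in for whatever calls `amplifier-gitea create`;
- `resolve_gitea_instance` and `GiteaInstanceNotFound` are illustrative names.

```python
from typing import Callable, Mapping, Sequence

DEFAULT_NEW_INSTANCE_PORT = 10110  # "create new on port 10110", per the README


class GiteaInstanceNotFound(RuntimeError):
    """Raised when a pinned instance id is not among the existing instances."""


def resolve_gitea_instance(
    env: Mapping[str, str],
    config: Mapping[str, object],
    existing_ids: Sequence[str],
    create_instance: Callable[[int], str],
) -> str:
    """Pick a Gitea instance id following the documented order (first wins)."""
    # 1. env var, then 2. YAML gitea_instance_id (an empty env value falls through).
    pinned = env.get("EVAL_HARNESS_GITEA_INSTANCE") or config.get("gitea_instance_id")
    if pinned:
        if pinned not in existing_ids:
            # Pinned instances must already exist: error instead of falling back.
            raise GiteaInstanceNotFound(f"pinned Gitea instance {pinned!r} does not exist")
        return str(pinned)
    # 3. Greedy reuse of the first listed instance.
    if existing_ids:
        return existing_ids[0]
    # 4. Nothing exists yet: create a new instance on the default port.
    return create_instance(DEFAULT_NEW_INSTANCE_PORT)
```

In use, `env` would typically be `os.environ` and `config` the parsed YAML mapping. Passing both in as arguments keeps the precedence logic testable without touching a real Gitea.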

Running inside a nested Incus DTU

When you run amplifier-eval-harness from inside an Incus DTU (e.g. a resolve-stack instance), eval-sub-DTUs are spawned as siblings via the forwarded Incus socket. Their localhost is their own loopback — not the harness DTU's — so the default http://localhost: GITEA_URL baked into sub-DTU profiles is unreachable. uv tool install inside the sub-DTU fails on any transitive git+https://github.com/microsoft/... dependency because mitmproxy's url_rewrites redirect those to the unreachable host.

Fix: set AMPLIFIER_EVAL_HARNESS_GITEA_HOST to the harness DTU's eth0 IP. The harness will use this IP (instead of localhost) when passing GITEA_URL to eval-sub-DTU launch vars.

```bash
# Find the harness DTU's eth0 IP (run this inside the DTU):
ip -4 addr show eth0 | awk '/inet / {print $2}' | cut -d/ -f1
# e.g. 10.119.176.124

# Set before running the harness:
export AMPLIFIER_EVAL_HARNESS_GITEA_HOST=10.119.176.124
amplifier-eval-harness run --config configs/smoke.yaml
```

Local harness operations (Gitea API calls, mirroring, token fetches) are not affected — they still reach Gitea via localhost from the harness DTU's own perspective.
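The split between the two URLs can be sketched as follows. The harness keeps talking to Gitea via localhost, and only the GITEA_URL handed to eval-sub-DTUs honours `AMPLIFIER_EVAL_HARNESS_GITEA_HOST`. The function names are hypothetical. The port is a parameter because the README does not state the default.

```python
from typing import Mapping

HOST_OVERRIDE_VAR = "AMPLIFIER_EVAL_HARNESS_GITEA_HOST"


def local_gitea_url(port: int) -> str:
    """URL the harness itself uses (API calls, mirroring, token fetches)."""
    return f"http://localhost:{port}"


def sub_dtu_gitea_url(port: int, env: Mapping[str, str]) -> str:
    """GITEA_URL passed to eval-sub-DTU launch vars.

    When the override is set (e.g. to the harness DTU's eth0 IP), sibling
    sub-DTUs get a routable address; otherwise this falls back to localhost.
    """
    host = env.get(HOST_OVERRIDE_VAR, "").strip() or "localhost"
    return f"http://{host}:{port}"
```

This sketch only handles IPv4 addresses and hostnames. An IPv6 literal would need to be wrapped in brackets before being placed in the URL.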

Architecture in 60 seconds

1. Read config → expand bundles × scenarios × runs_per_combo into a flat list of RunSpec.
2. Ensure a Gitea instance, push every relevant repo into it (upstream mirror or local working-tree snapshot).
3. For each RunSpec (sequential when parallelism: 1, ThreadPoolExecutor-bounded when > 1):

  • Render a parameterized DTU profile.
  • Launch DTU; wait for readiness; push scenario workspace fixture; deep-merge settings overlay.
  • exec amplifier run --bundle --output-format json-trace "" and capture stdout, stderr, exit code.
  • file-pull the session directory; destroy DTU (or…
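Step 3 switches between a sequential path and a `ThreadPoolExecutor` bounded by `parallelism`. A minimal sketch of that switch follows. `run_one` stands in for the per-run steps listed above (render profile, launch DTU, exec `amplifier run`, pull the session, tear down). `RunSpec` here is a hypothetical minimal version, and `execute_runs` is an illustrative name, not the harness's function.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class RunSpec:
    # Hypothetical minimal RunSpec for this sketch.
    bundle: str
    scenario: str
    run_index: int


def execute_runs(specs: list[RunSpec], run_one: Callable[[RunSpec], T], parallelism: int = 1) -> list[T]:
    """Run every spec and return results in the same order as specs."""
    if parallelism < 1:
        raise ValueError("parallelism must be >= 1")
    if parallelism == 1:
        # Sequential path: one DTU at a time.
        return [run_one(spec) for spec in specs]
    # Parallel path: at most `parallelism` runs in flight at once.
    # Executor.map yields results in input order, not completion order.
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return list(pool.map(run_one, specs))
```

Threads are a reasonable fit here because each run mostly waits on containers and subprocesses rather than doing CPU work in Python. Note that `Executor.map` re-raises the first exception when results are collected. A real harness would likely catch failures inside `run_one` and record them per run, so that one bad run does not abort the whole matrix.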


Notability

notability 4.0/10

Routine new repo from Microsoft