microsoft/amplifier-eval-harness
Python
Description: Eval harness for exploring configuration of amplifier-app-cli for the Amplifier project
Language: Python
License: MIT
Stars: 0
Forks: 1
Open issues: 0
Created: 2026-05-07T14:53:20Z
Pushed: 2026-05-22T14:37:09Z
Default branch: main
Fork: no
Archived: no
README:
amplifier-eval-harness
Test harness for running scenarios through amplifier-app-cli inside Digital Twin Universe (DTU) environments.
Runs bundles × scenarios × runs matrices in isolated containers, captures per-run artifacts and metrics, supports swapping in local working trees of any ecosystem repo via Gitea mirroring.
Status
Pre-alpha (v0.2). Sequential and parallel execution paths in place. First successful end-to-end smoke run against a live DTU on 2026-05-08 (Linux/Incus, foundation bundle, claude-opus-4-7); broader validation across configs/scenarios is still pending.
Quick start
# Prerequisites:
# - amplifier CLI installed (uv tool install git+https://github.com/microsoft/amplifier)
# - amplifier-bundle-gitea (provides amplifier-gitea CLI)
# - amplifier-bundle-digital-twin-universe v0.2.0+ (provides amplifier-digital-twin CLI).
#   v0.1.x silently ignores `default_match_mode: boundary`; URL prefix collisions
#   with sibling repos can over-match. PR #7 (merged 2026-05-05) fixes it.
# - Docker running (for Gitea container) + Incus (for DTU containers)
# - At least one provider env var set (ANTHROPIC_API_KEY, OPENAI_API_KEY, GOOGLE_API_KEY, GITHUB_TOKEN…)

# Install
uv tool install --from . amplifier-eval-harness

# Sanity check (no DTU launches)
amplifier-eval-harness validate --config configs/smoke.yaml

# Smoke run (1 bundle × 1 scenario × 1 run, sequential)
amplifier-eval-harness run --config configs/smoke.yaml

# Baseline (foundation + amplifier-dev × 3 runs each, up to 2 in parallel)
amplifier-eval-harness run --config configs/baseline.yaml

# Override parallelism at the CLI without editing the config
amplifier-eval-harness run --config configs/baseline.yaml --parallelism 4

# Dry run (just expand and print the matrix)
amplifier-eval-harness run --config configs/smoke.yaml --dry-run
Output lands in eval-results/-/. Read summary.md first.
Configs
Configs live in configs/. Add new ones with descriptive names; pick which to run via --config.
| Config | Purpose |
|---|---|
| smoke.yaml | Inner-dev-loop. 1 bundle × 1 scenario × 1 run, sequential. |
| baseline.yaml | foundation + amplifier-dev × explain-repo × 3 runs each, parallelism=2. |
See [docs/designs/architecture.md](docs/designs/architecture.md) for the full schema and run flow.
Scenarios
Scenarios live in scenarios//. Each scenario has a prompt.md and an optional workspace/ directory of fixture files seeded into /workspace inside the DTU before the prompt runs.
| Scenario | What it exercises |
|---|---|
| explain-repo | File reading, code summarization. Stable across runs. |
Settings overlays
Per-config provider/model selection happens via a YAML overlay deep-merged into the container's ~/.amplifier/settings.yaml at provision time. The default overlay (settings/default-providers.yaml) is lifted from the harness owner's ~/.amplifier/settings.yaml minus provider-chat-completions (which is local-only and not relevant inside DTUs).
To use a different model mix, copy the overlay, edit, and point the config's settings_overlay: at the new file.
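The deep-merge behaviour described above can be illustrated with a minimal sketch. This is not the harness's own code: the function name `deep_merge`, the example setting keys, and the choice to replace (rather than concatenate) non-mapping values such as lists are all assumptions made for illustration.

```python
from __future__ import annotations

from copy import deepcopy
from typing import Any, Mapping


def deep_merge(base: Mapping[str, Any], overlay: Mapping[str, Any]) -> dict[str, Any]:
    """Return a new dict in which overlay values win.

    Nested mappings are merged key by key; any other value (scalars, lists)
    in the overlay replaces the base value outright. That replacement rule
    is an assumption, not something the README specifies.
    """
    merged = deepcopy(dict(base))
    for key, value in overlay.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), Mapping):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical settings shapes, for illustration only.
base = {
    "providers": {"anthropic": {"model": "model-a", "priority": 1}},
    "telemetry": False,
}
overlay = {
    "providers": {"anthropic": {"model": "model-b"}, "openai": {"model": "model-c"}},
}
result = deep_merge(base, overlay)
```

With these inputs, the overlay changes only the keys it names: `priority` and `telemetry` survive from the base, and the base dict itself is left unmodified.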
Gitea instance pinning
By default the harness greedily reuses the first instance returned by amplifier-gitea list, falling back to creating a new one if none exist. That's fine on a solo dev machine but dangerous when multiple workspaces share a host — two harness invocations against the same Gitea race on populate_repo and may stomp on each other's mirrors.
To isolate a workspace, create a dedicated Gitea instance and pin to it:
amplifier-gitea create --port 10111 --name gitea-myworkspace
# {"id": "gitea-abcd1234", "port": 10111, ...}
# Pin via env var (per-invocation, no config edit required)
EVAL_HARNESS_GITEA_INSTANCE=gitea-abcd1234 amplifier-eval-harness run --config configs/smoke.yaml
# Or pin in the config itself
echo "gitea_instance_id: gitea-abcd1234" >> configs/myconfig.yaml
Resolution order (first wins): EVAL_HARNESS_GITEA_INSTANCE env var → YAML gitea_instance_id → greedy reuse of first listed instance → create new on port 10110. Pinned instances must already exist; the harness errors out rather than silently falling back.
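The resolution order can be sketched as a small pure function. The function name, its return shape, and the `existing_ids` argument (standing in for whatever `amplifier-gitea list` returns) are hypothetical; only the env var name, the YAML key, and the ordering come from the text above.

```python
from __future__ import annotations

from typing import Mapping, Optional, Sequence


def resolve_gitea_instance(
    env: Mapping[str, str],
    config: Mapping[str, object],
    existing_ids: Sequence[str],
) -> tuple[str, Optional[str]]:
    """Return ("use", instance_id) or ("create", None).

    Order: env var pin, then config pin, then first existing instance,
    then create a new one. A pin that names a missing instance is an
    error rather than a silent fallback.
    """
    pinned = env.get("EVAL_HARNESS_GITEA_INSTANCE") or config.get("gitea_instance_id")
    if pinned:
        if pinned not in existing_ids:
            raise LookupError(f"pinned Gitea instance {pinned!r} does not exist")
        return ("use", str(pinned))
    if existing_ids:
        # Greedy reuse: take whichever instance is listed first.
        return ("use", existing_ids[0])
    return ("create", None)
```

Keeping the decision separate from the CLI calls makes the precedence easy to test without a running Gitea.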
Running inside a nested Incus DTU
When you run amplifier-eval-harness from inside an Incus DTU (e.g. a resolve-stack instance), eval-sub-DTUs are spawned as siblings via the forwarded Incus socket. Their localhost is their own loopback — not the harness DTU's — so the default http://localhost: GITEA_URL baked into sub-DTU profiles is unreachable. uv tool install inside the sub-DTU fails on any transitive git+https://github.com/microsoft/... dependency because mitmproxy's url_rewrites redirect those to the unreachable host.
Fix: set AMPLIFIER_EVAL_HARNESS_GITEA_HOST to the harness DTU's eth0 IP. The harness will use this IP (instead of localhost) when passing GITEA_URL to eval-sub-DTU launch vars.
# Find the harness DTU's eth0 IP (run this inside the DTU):
ip -4 addr show eth0 | awk '/inet / {print $2}' | cut -d/ -f1
# e.g. 10.119.176.124
# Set before running the harness:
export AMPLIFIER_EVAL_HARNESS_GITEA_HOST=10.119.176.124
amplifier-eval-harness run --config configs/smoke.yaml
Local harness operations (Gitea API calls, mirroring, token fetches) are not affected; they still reach Gitea via localhost from the harness DTU's own perspective.
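As a rough sketch of the host override, the URL handed to sub-DTUs could be built as follows. The function name and the `http://<host>:<port>` shape are assumptions for illustration; only the `AMPLIFIER_EVAL_HARNESS_GITEA_HOST` variable and the localhost default come from the text above.

```python
import os
from typing import Mapping


def gitea_url_for_sub_dtu(port: int, env: Mapping[str, str] = os.environ) -> str:
    """Build the GITEA_URL value passed to eval-sub-DTU launch vars.

    Uses AMPLIFIER_EVAL_HARNESS_GITEA_HOST when it is set and non-empty,
    otherwise falls back to localhost. The URL shape is assumed.
    """
    host = env.get("AMPLIFIER_EVAL_HARNESS_GITEA_HOST", "").strip() or "localhost"
    return f"http://{host}:{port}"


# Without the override, sub-DTUs would be pointed at their own loopback.
default_url = gitea_url_for_sub_dtu(10110, env={})
# With the override, they are pointed at the harness DTU's eth0 address.
nested_url = gitea_url_for_sub_dtu(
    10110, env={"AMPLIFIER_EVAL_HARNESS_GITEA_HOST": "10.119.176.124"}
)
```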
Architecture in 60 seconds
1. Read config → expand bundles × scenarios × runs_per_combo into a flat list of RunSpec.
2. Ensure a Gitea instance, push every relevant repo into it (upstream mirror or local working-tree snapshot).
3. For each RunSpec (sequential when parallelism: 1, ThreadPoolExecutor-bounded when > 1):
   - Render a parameterized DTU profile.
   - Launch DTU; wait for readiness; push scenario workspace fixture; deep-merge settings overlay.
   - exec amplifier run --bundle --output-format json-trace "" and capture stdout, stderr, exit code.
   - file-pull the session directory; destroy DTU (or…
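Steps 1 and 3 can be sketched in a few lines. The README names `RunSpec` and `ThreadPoolExecutor`, but the fields of `RunSpec`, the helper names `expand_matrix` and `execute_all`, and the `run_one` callback are hypothetical stand-ins rather than the harness's actual API.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from itertools import product
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class RunSpec:
    # Field names are assumed; the source only says RunSpec exists.
    bundle: str
    scenario: str
    run_index: int


def expand_matrix(
    bundles: Sequence[str], scenarios: Sequence[str], runs_per_combo: int
) -> list[RunSpec]:
    """Expand bundles x scenarios x runs_per_combo into a flat list."""
    return [
        RunSpec(bundle, scenario, index)
        for bundle, scenario in product(bundles, scenarios)
        for index in range(runs_per_combo)
    ]


def execute_all(
    specs: Sequence[RunSpec], run_one: Callable[[RunSpec], T], parallelism: int = 1
) -> list[T]:
    """Run sequentially when parallelism is 1, else in a bounded thread pool."""
    if parallelism <= 1:
        return [run_one(spec) for spec in specs]
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # Executor.map yields results in input order, not completion order.
        return list(pool.map(run_one, specs))


specs = expand_matrix(["foundation", "amplifier-dev"], ["explain-repo"], 3)
labels = execute_all(specs, lambda s: f"{s.bundle}/{s.scenario}#{s.run_index}", parallelism=2)
```

Here `run_one` would wrap the per-run work (render profile, launch DTU, exec, pull artifacts); threads suit this because each run mostly waits on external processes.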
Notability
4.0/10. Routine new repo from Microsoft.