Repo: Nous Research, published May 16, 2026

NousResearch/hermes-compression-eval

Python

Description: Offline probe-based evaluation harness for hermes-agent's ContextCompressor. Methodology adapted from Factory's Dec 2025 'Evaluating Compression'.

Language: Python

Stars: 9

Forks: 6

Open issues: 1

Created: 2026-05-16T09:41:37Z

Pushed: 2026-05-16T09:41:39Z

Default branch: main

Fork: no

Archived: no

README:

hermes-compression-eval

Offline evaluation harness for agent/context_compressor.py in hermes-agent. Runs a real conversation fixture through ContextCompressor.compress(), asks the compressor model to answer probe questions from the compressed state, and has a judge model score each answer 0–5 on six dimensions (accuracy, context_awareness, artifact_trail, completeness, continuity, instruction_following).
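As a rough illustration of the rubric described above, the sketch below validates one judge reply against the six dimensions and the 0–5 range. The reply shape (a flat JSON object keyed by dimension name) and the helper name `parse_scores` are assumptions made for this example; the actual parser lives in rubric.py and may differ.

```python
import json

# Dimension names and the 0-5 range come from the README text above.
DIMENSIONS = (
    "accuracy",
    "context_awareness",
    "artifact_trail",
    "completeness",
    "continuity",
    "instruction_following",
)


def parse_scores(raw: str) -> dict[str, float]:
    """Parse a judge reply into per-dimension scores.

    Hypothetical helper: assumes the judge returns a flat JSON object.
    Raises ValueError if a dimension is missing or out of range.
    """
    data = json.loads(raw)
    scores = {}
    for dim in DIMENSIONS:
        if dim not in data:
            raise ValueError(f"missing dimension: {dim}")
        value = data[dim]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"score for {dim} is not a number: {value!r}")
        if not 0 <= value <= 5:
            raise ValueError(f"score for {dim} is outside 0-5: {value!r}")
        scores[dim] = value
    return scores
```

Rejecting a malformed reply outright, rather than defaulting a missing dimension to zero, keeps a judge formatting failure from looking like a bad summary.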

Methodology adapted from Factory's December 2025 write-up *Evaluating Compression*. The scoreboard framing is not adopted.

Why this exists

agent/context_compressor.py decides what survives compression when a session exceeds the context-window threshold. Its prompts and template sections are tuned by hand. Until now there was no signal between *"test suite green"* and *"a user hits a bad summary in production."*

This harness gives that signal: edit the compressor prompt, re-run the eval, compare the per-dimension scores against a saved baseline.
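The comparison step can be pictured as a per-dimension subtraction. This is a minimal sketch, not the logic in report.py; the function name `score_deltas` and the numbers are invented for illustration and are not real results.

```python
def score_deltas(
    baseline: dict[str, float], candidate: dict[str, float]
) -> dict[str, float]:
    """Return candidate minus baseline for each dimension present in both."""
    return {
        dim: round(candidate[dim] - baseline[dim], 2)
        for dim in baseline
        if dim in candidate
    }


# Illustrative values only.
baseline = {"accuracy": 4.0, "continuity": 3.5}
candidate = {"accuracy": 4.5, "continuity": 3.0}

deltas = score_deltas(baseline, candidate)
for dim, delta in deltas.items():
    print(f"{dim}: {delta:+.2f}")
```

Because grading is non-deterministic, a small delta on a single run may be noise rather than a real change.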

Costs

Scoring is LLM-graded and non-deterministic. Each probe costs one continuation call plus one grading call. A full run across the three checked-in fixtures with default settings issues about 30 probe pairs (roughly 60 API calls) against your configured provider. Budget accordingly. Not appropriate for CI.
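A back-of-the-envelope call count follows from the two-calls-per-probe rule. The sketch assumes that `--runs` repeats every probe, which the README does not state explicitly, and it ignores the compression calls themselves.

```python
def estimate_calls(probes: int, runs: int = 1) -> int:
    """Estimate API calls: one continuation plus one grading call per probe.

    Assumption: each additional run repeats every probe once.
    """
    calls_per_probe = 2
    return probes * calls_per_probe * runs


print(estimate_calls(30))          # 60
print(estimate_calls(30, runs=3))  # 180
```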

Install

git clone https://github.com/NousResearch/hermes-compression-eval.git
cd hermes-compression-eval
pip install -r requirements.txt # openai, fire

The harness imports ContextCompressor and agent.redact from hermes-agent. It looks for your hermes-agent checkout in three places, checked in this order:

1. HERMES_AGENT_ROOT=/path/to/hermes-agent: explicit override.
2. ~/.hermes/hermes-agent/: the default location hermes setup writes.
3. Sibling directory: clone hermes-agent next to hermes-compression-eval.
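The lookup order above can be sketched as follows. The function name `find_hermes_agent_root` is hypothetical, and falling through to the next candidate when the override path does not exist is an assumption; the harness may instead fail fast on a bad override.

```python
import os
from pathlib import Path
from typing import Optional


def find_hermes_agent_root(eval_dir: Path) -> Optional[Path]:
    """Return the first existing candidate directory, or None.

    Order: HERMES_AGENT_ROOT, then ~/.hermes/hermes-agent, then a
    sibling directory next to the eval checkout.
    """
    candidates = []
    override = os.environ.get("HERMES_AGENT_ROOT")
    if override:
        candidates.append(Path(override).expanduser())
    candidates.append(Path.home() / ".hermes" / "hermes-agent")
    candidates.append(eval_dir.resolve().parent / "hermes-agent")

    for candidate in candidates:
        if candidate.is_dir():
            return candidate
    return None
```

Calling `find_hermes_agent_root(Path(__file__).parent)` from a script inside the eval checkout would try all three locations in turn.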

Usage

# Baseline run (writes results/baseline/)
python3 run_eval.py \
--compressor-provider=nous --compressor-model=openai/gpt-5.4-mini \
--judge-provider=nous --judge-model=openai/gpt-5.4-mini \
--runs=3 --label=baseline

# After editing context_compressor.py prompts, compare:
python3 run_eval.py \
--compressor-provider=nous --compressor-model=openai/gpt-5.4-mini \
--judge-provider=nous --judge-model=openai/gpt-5.4-mini \
--runs=3 --label=my-tweak \
--compare-to=results/baseline

results//report.md is paste-ready for a PR body. Per-run JSON goes to results//runs/.

What ships

| Path | Purpose |
|---|---|
| run_eval.py | Fire CLI, the entry point |
| compressor_driver.py | Thin wrapper that forces a single-shot compress() over fixture messages |
| grader.py | Two-phase continuation + grading via the OpenAI SDK |
| rubric.py | Six-dimension scoring rubric, judge-prompt builder, JSON parser |
| report.py | Markdown report rendering + --compare-to delta mode |
| scrub_fixtures.py | Pipeline to convert real ~/.hermes/sessions/*.jsonl into public-safe JSON fixtures |
| fixtures/ | Three checked-in scrubbed sessions (feature-impl, debug, config-build) |
| probes/ | Three probe banks, 10–11 probes each, covering recall / artifact / continuation / decision |
| tests/ | 33 hermetic unit tests for non-LLM paths |

Adding a fixture

1. Pick a session under ~/.hermes/sessions/*.jsonl worth measuring.
2. Add a SPECS entry in scrub_fixtures.py (source filename, output name, description, user-message paraphrase, model guess, context length, optional truncate-at).
3. Run python3 scrub_fixtures.py, which writes fixtures/.json.
4. Add a probe bank at probes/.probes.json covering all four types (recall, artifact, continuation, decision).
5. Re-run python3 -m pytest tests/ -q to verify it loads and parses.
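Before running the tests, a new probe bank can be checked for coverage of the four probe types. This sketch assumes each probe is a dict with a `type` key, which is a guess at the format; DESIGN.md holds the actual probe-format spec, and `missing_probe_types` is a hypothetical helper.

```python
# The four probe types named in step 4 above.
REQUIRED_TYPES = {"recall", "artifact", "continuation", "decision"}


def missing_probe_types(probes: list[dict]) -> set[str]:
    """Return the required probe types that no probe in the bank covers.

    Assumption: each probe carries its type under a "type" key.
    """
    present = {probe.get("type") for probe in probes}
    return REQUIRED_TYPES - present


# Illustrative in-memory bank; real banks are JSON files under probes/.
example_bank = [
    {"type": "recall", "question": "Which file was edited first?"},
    {"type": "decision", "question": "Why was the retry approach dropped?"},
]
print(sorted(missing_probe_types(example_bank)))  # ['artifact', 'continuation']
```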

See DESIGN.md for the full scrubber pipeline and probe-format spec.

Tests

python3 -m pytest tests/ -q

33 hermetic tests cover rubric parsing edge cases, judge-prompt building, report rendering, summariser medians, per-run JSON roundtrip, fixture and probe loading, and a PII smoke check on the checked-in fixtures.
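The "summariser medians" mentioned above suggest that repeated runs are collapsed to a per-dimension median. A minimal sketch of that idea, assuming each run yields one score per dimension; the name `median_scores` and the sample values are illustrative, and the real summariser may aggregate differently.

```python
from statistics import median


def median_scores(runs: list[dict[str, float]]) -> dict[str, float]:
    """Collapse several runs into one median score per dimension.

    Assumption: every run dict contains the same dimension keys.
    """
    dims = runs[0].keys()
    return {dim: median(run[dim] for run in runs) for dim in dims}


# Illustrative scores from three runs, not real results.
runs = [
    {"accuracy": 3, "continuity": 4},
    {"accuracy": 5, "continuity": 4},
    {"accuracy": 4, "continuity": 2},
]
summary = median_scores(runs)
print(summary)  # {'accuracy': 4, 'continuity': 4}
```

A median is less sensitive than a mean to a single outlier run, which suits non-deterministic grading.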

The LLM paths (continuation + grading) require credentials and real API calls; they're exercised by running the eval itself, not by these tests.

License

MIT, same as hermes-agent.
