microsoft/sentinel_environments
Description: Environments that evolve over time to test monitoring and proactive scenarios.
Language: TypeScript
License: MIT
Stars: 3
Forks: 0
Open issues: 20
Created: 2025-11-14T20:25:20Z
Pushed: 2026-06-14T02:58:06Z
Default branch: main
Fork: no
Archived: no
README:
---
Sentinel Environments is a benchmark of 10 high-fidelity web-app replicas that tests whether an agent can *wait*, *monitor*, and *act* only when a condition is met. Each environment replays a scripted timeline of events; the agent must notice the right moment and respond, without wasting resources polling in between.
This README covers how to run the benchmark. For the motivation, task design, and baseline results, see the paper and blog post.
Environments
The benchmark ships 10 environments (Micro*), each with 10 monitoring scenarios (100 total).
| Environment | Mimics | Surface | Data Type |
|-------------|--------|---------|-----------|
| MicroMail | Email (Gmail/Outlook) | Inbox, folders, attachments, search | Synthetic images |
| MicroChat | Team messaging (Slack/Teams) | Teams, channels, DMs, calls | Synthetic images |
| MicroDin | Professional network (LinkedIn) | Feed, connections, jobs, messaging | Synthetic images |
| MicroHub | Code hosting (GitHub) | Repo, issues, PRs, commits, releases | Text-only (JSONL) |
| MicroHood | Stock trading (Robinhood) | Portfolio, orders, watchlist, news | Text-only (JSONL) |
| MicroGram | Photo sharing (Instagram) | Feed, stories, DMs, activity | Synthetic images |
| MicroTube | Video platform (YouTube) | Feed, player, comments, subscriptions | Synthetic video/images |
| MicroFy | Music streaming (Spotify) | Tracks, playlists, artists, lyrics | Synthetic audio |
| MicroLendar | Calendar (Google Calendar) | Month/week/day views, events, tasks | Text-only (JSONL) |
| MicroScholar | Academic search (Google Scholar) | Search, papers, authors, alerts | Text-only (JSONL) |
> Environments marked Text-only (JSONL) use structured text data with no AI-generated media. The others use AI-generated images, video, or audio produced by the pipeline in [data_generation/](#regenerating-synthetic-media).
Quick Start
Set up local dependencies, build the SQLite databases, and run all three components (API server, frontend, harness shell) in one tmux session:
```bash
./start.sh
```
On the first run, this creates .venv, installs Python and Node dependencies, and builds the gitignored .db files. It then opens windows for the server (API on :8000), the frontend (Vite on :5173), and a harness shell ready for eval runs. To set things up manually instead, follow the three steps below.
1. API server
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r server/requirements.txt

# Build the SQLite databases from the JSONL catalogs (required on first setup,
# and whenever data/catalogs/ changes). The .db files are gitignored.
python -m server.scripts.build_db

uvicorn server.server:app --host 0.0.0.0 --port 8000
```
2. Frontend
```bash
cd frontend
npm ci
npm run dev
```
Open http://localhost:5173 to use the environments by hand. The frontend proxies API calls to localhost:8000, so the API server must be running.
3. Eval harness
The harness discovers every scenario JSON, runs each one against your agent as a subprocess, and collects results. Configure your agent in an eval_config.yaml at the repo root:
```yaml
server_url: http://localhost:8000
frontend_url: http://localhost:5173
# Wall-clock vs sim-clock exchange rate. 1.0 = real-time (default),
# >1 is faster, ...
```

Files written for each task:

| File | Description |
|------|-------------|
| `results.json` | Evaluation result (`success`, `detail`, `evaluation_time`, `condition_at`, `contact_get_time`) |
| `output.txt` | Agent subprocess stdout/stderr |
| `error.txt` | Harness error traceback (only if the task failed to run) |
```
$ python -m server.eval_harness run my_first_run
Found 100 tasks. Results -> results/my_first_run
Running micromail-attachment-name-absolute-active ...
Running micromail-body-december-relative-active ...
...
Done.
```
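Once a run finishes, the per-task files can be tallied with a few lines of Python. The sketch below is illustrative only: it assumes each task gets its own subdirectory under the run directory (a hypothetical layout; the exact path is not shown above), and `summarize_run` is a made-up helper name, not part of the harness. It relies only on the documented `success` field of `results.json` and on the presence of `error.txt`.

```python
import json
from pathlib import Path


def summarize_run(run_dir: Path) -> dict:
    """Tally passed / failed / errored tasks under a run directory.

    ASSUMPTION: one subdirectory per task, each holding results.json
    and, if the task failed to run, error.txt.
    """
    summary = {"passed": 0, "failed": 0, "errored": 0}
    for task_dir in sorted(p for p in run_dir.iterdir() if p.is_dir()):
        # error.txt is only written when the task failed to run at all.
        if (task_dir / "error.txt").exists():
            summary["errored"] += 1
            continue
        results_file = task_dir / "results.json"
        if not results_file.exists():
            continue  # incomplete task directory; left out of the tally
        result = json.loads(results_file.read_text())
        if result.get("success"):
            summary["passed"] += 1
        else:
            summary["failed"] += 1
    return summary


if __name__ == "__main__":
    # Run name taken from the example invocation above.
    print(summarize_run(Path("results/my_first_run")))
```

Counting errored tasks separately from failed ones keeps harness problems from being mistaken for agent failures.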
### Dev mode

Set `SENTINEL_DEV` to skip the `/init` flow and preload an environment with all of its catalog data, so the UI is populated without running the harness:
```bash
# Preload every environment
SENTINEL_DEV=all uvicorn server.server:app --reload --port 8000

# Preload a single environment
SENTINEL_DEV=microfy uvicorn server.server:app --reload --port 8000
```
Each environment ships a `scenarios//dev.json` that loads all catalog rows via `["*"]` wildcards. The eval harness always skips `dev.json`.

### Runtime tuning

`speed_factor` controls how fast the simulation clock advances relative to wall-clock. It is a top-level key in `eval_config.yaml` (and `--speed` on `run_simulation` for local playback) and is never stored in scenario JSON. The harness sizes each task's subprocess timeout as `kill_at_wall = MAX_CONDITION_AT / speed_factor + REACTION_WINDOW`. `speed_factor > 1` compresses the timeline; `speed_factor 0`. All constants live in `server/timing.py`.

## How a scenario works

Each scenario is a JSON file under `scenarios//`, named `--.json` (no-op scenarios use `--noop.json`):
```json
{
  "id": "micromail-junk-relative-passive",
  "environment": "micromail",
  "prompt": "Keep an eye on the junk folder for me, tell me when 3 more emails end up there.",
  "condition_at": 297.75,
  "kill_at": 630.0,
  "event_timeline_end": 720.0,
  "events": [
    { "time": 0, "type": "preload_emails", "payload": { } }
  ],
  "eval_sql": "SELECT (SELECT COUNT(*) FROM email_states WHERE folder='junk') >= (SELECT CAST(value AS INTEGER) FROM session_meta WHERE key='baseline_junk_count') + 3"
}
```
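To see how the `eval_sql` in this example behaves, it can be run against a throwaway in-memory SQLite database. This is a sketch under stated assumptions, not the harness's own evaluation code: the table layouts contain only the columns the query touches (the real schema is not shown here), and `condition_met`, `make_demo_db`, and `add_junk` are hypothetical helper names.

```python
import sqlite3

# The eval_sql string from the example scenario, copied unchanged.
EVAL_SQL = (
    "SELECT (SELECT COUNT(*) FROM email_states WHERE folder='junk') >= "
    "(SELECT CAST(value AS INTEGER) FROM session_meta "
    "WHERE key='baseline_junk_count') + 3"
)


def condition_met(conn: sqlite3.Connection, eval_sql: str) -> bool:
    """Run eval_sql and read its single result cell as a boolean."""
    row = conn.execute(eval_sql).fetchone()
    return bool(row and row[0])


def add_junk(conn: sqlite3.Connection, count: int) -> None:
    """Insert `count` rows into the junk folder."""
    conn.executemany(
        "INSERT INTO email_states (folder) VALUES (?)", [("junk",)] * count
    )


def make_demo_db(baseline_junk: int) -> sqlite3.Connection:
    """Build an in-memory database with an ASSUMED minimal schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE email_states (id INTEGER PRIMARY KEY, folder TEXT)")
    conn.execute("CREATE TABLE session_meta (key TEXT PRIMARY KEY, value TEXT)")
    conn.execute(
        "INSERT INTO session_meta VALUES ('baseline_junk_count', ?)",
        (str(baseline_junk),),
    )
    add_junk(conn, baseline_junk)
    return conn


if __name__ == "__main__":
    conn = make_demo_db(baseline_junk=2)
    print(condition_met(conn, EVAL_SQL))  # baseline only
    add_junk(conn, 3)
    print(condition_met(conn, EVAL_SQL))  # three more junk emails arrived
```

Because the query compares the current junk count against a stored baseline plus 3, the condition is relative to session start rather than to an absolute count, which matches the "3 more emails" wording of the prompt.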
Timing fields, all in simulation-seconds:

- **`condition_at`**: earliest sim-time the success condition can become true. Randomized per scenario in `[10, 600]`, deterministic from `scenario_id`. For no-op scenarios it is `null` (the condition never fires). The released values are frozen; do not hand-edit them.
- **`kill_at`**: sim-time the harness terminates the run (`630`, constant).
- **`event_timeline_end`**: right edge of the authored timeline (`720`, constant).

Invariant (enforced by `tests/test_scenario_schema.py`): `0 ...`

```
... SQLite)
│   ├── server.py   # Main FastAPI app
│...
```
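The timing fields and the `kill_at_wall` formula described earlier can be explored with a small sketch. The values for `condition_at`, `kill_at`, and `event_timeline_end` come from the text above. `REACTION_WINDOW = 30.0` is an assumed placeholder, since the real constant lives in `server/timing.py`. The function `timing_problems` is a hypothetical helper and does not reproduce `tests/test_scenario_schema.py`.

```python
from typing import List, Optional

# Documented values from the timing-field descriptions.
MIN_CONDITION_AT = 10.0
MAX_CONDITION_AT = 600.0
KILL_AT = 630.0
EVENT_TIMELINE_END = 720.0
# ASSUMPTION: placeholder only; the real value is defined in server/timing.py.
REACTION_WINDOW = 30.0


def kill_at_wall(speed_factor: float) -> float:
    """Wall-clock timeout: MAX_CONDITION_AT / speed_factor + REACTION_WINDOW."""
    if speed_factor <= 0:
        # Guard against division by zero and negative time.
        raise ValueError("speed_factor must be positive")
    return MAX_CONDITION_AT / speed_factor + REACTION_WINDOW


def timing_problems(scenario: dict) -> List[str]:
    """List deviations from the documented timing values (empty if none)."""
    problems: List[str] = []
    condition_at: Optional[float] = scenario.get("condition_at")
    # No-op scenarios carry condition_at = null, which is allowed.
    if condition_at is not None and not (
        MIN_CONDITION_AT <= condition_at <= MAX_CONDITION_AT
    ):
        problems.append("condition_at outside [10, 600]")
    if scenario.get("kill_at") != KILL_AT:
        problems.append("kill_at is not 630")
    if scenario.get("event_timeline_end") != EVENT_TIMELINE_END:
        problems.append("event_timeline_end is not 720")
    return problems


if __name__ == "__main__":
    example = {"condition_at": 297.75, "kill_at": 630.0, "event_timeline_end": 720.0}
    print(timing_problems(example))
    print(kill_at_wall(2.0))
```

In the formula only the `MAX_CONDITION_AT` term is divided by `speed_factor`, so the reaction window stays the same length in wall-clock terms however much the timeline is compressed.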