What does this repo signal mean?

Microsoft published microsoft/sentinel_environments (TypeScript). This repository signal exposes tooling, eval, infrastructure, or model-adjacent work before it may appear in a launch post. High-signal details: repo microsoft/sentinel_environments · language TypeScript · Routine security environment repo; minimal traction. onlylabs links this event to 1 captured evidence page and 6 related repo signals.

Microsoft Repo: microsoft/sentinel_environments

Captured source

GitHub/github.com/microsoft/sentinel_environments

microsoft/sentinel_environments repository metadata

published Nov 14, 2025 · seen 1w · captured 1w · http 200 · method plain

microsoft/sentinel_environments

Description: Environments that evolve over time to test monitoring and proactive scenarios.

Language: TypeScript

License: MIT

Stars: 3

Forks: 0

Open issues: 20

Created: 2025-11-14T20:25:20Z

Pushed: 2026-06-14T02:58:06Z

Default branch: main

Fork: no

Archived: no

README:

---

Sentinel Environments is a benchmark of 10 high-fidelity web-app replicas that tests whether an agent can *wait*, *monitor*, and *act* only when a condition is met. Each environment replays a scripted timeline of events; the agent must notice the right moment and respond, without wasting resources polling in between.

This README covers how to run the benchmark. For the motivation, task design, and baseline results, see the paper and blog post.

Environments

The benchmark ships 10 environments (Micro*), each with 10 monitoring scenarios (100 total).

| Environment | Mimics | Surface | Data Type |
|-------------|--------|---------|-----------|
| MicroMail | Email (Gmail/Outlook) | Inbox, folders, attachments, search | Synthetic images |
| MicroChat | Team messaging (Slack/Teams) | Teams, channels, DMs, calls | Synthetic images |
| MicroDin | Professional network (LinkedIn) | Feed, connections, jobs, messaging | Synthetic images |
| MicroHub | Code hosting (GitHub) | Repo, issues, PRs, commits, releases | Text-only (JSONL) |
| MicroHood | Stock trading (Robinhood) | Portfolio, orders, watchlist, news | Text-only (JSONL) |
| MicroGram | Photo sharing (Instagram) | Feed, stories, DMs, activity | Synthetic images |
| MicroTube | Video platform (YouTube) | Feed, player, comments, subscriptions | Synthetic video/images |
| MicroFy | Music streaming (Spotify) | Tracks, playlists, artists, lyrics | Synthetic audio |
| MicroLendar | Calendar (Google Calendar) | Month/week/day views, events, tasks | Text-only (JSONL) |
| MicroScholar | Academic search (Google Scholar) | Search, papers, authors, alerts | Text-only (JSONL) |

> Environments marked Text-only (JSONL) use structured text data with no AI-generated media. The others use AI-generated images, video, or audio produced by the pipeline in [data_generation/](#regenerating-synthetic-media).

Quick Start

Set up local dependencies, build the SQLite databases, and run all three components (API server, frontend, harness shell) in one tmux session:

./start.sh

On the first run, this creates .venv, installs Python and Node dependencies, and builds the gitignored .db files. It then opens windows for the server (API on :8000), the frontend (Vite on :5173), and a harness shell ready for eval runs. To set things up manually instead, follow the three steps below.

1. API server

python3 -m venv .venv
source .venv/bin/activate
pip install -r server/requirements.txt

# Build the SQLite databases from the JSONL catalogs (required on first setup,
# and whenever data/catalogs/ changes). The .db files are gitignored.
python -m server.scripts.build_db

uvicorn server.server:app --host 0.0.0.0 --port 8000

2. Frontend

cd frontend
npm ci
npm run dev

Open http://localhost:5173 to use the environments by hand. The frontend proxies API calls to localhost:8000, so the API server must be running.

3. Eval harness

The harness discovers every scenario JSON, runs each one against your agent as a subprocess, and collects results. Configure your agent in an eval_config.yaml at the repo root:

server_url: http://localhost:8000
frontend_url: http://localhost:5173

# Wall-clock vs sim-clock exchange rate. 1.0 = real-time (default),
# >1 is faster, ///`:

| File | Description |
|------|-------------|
| `results.json` | Evaluation result (`success`, `detail`, `evaluation_time`, `condition_at`, `contact_get_time`) |
| `output.txt` | Agent subprocess stdout/stderr |
| `error.txt` | Harness error traceback (only if the task failed to run) |

$ python -m server.eval_harness run my_first_run
Found 100 tasks. Results -> results/my_first_run
Running micromail-attachment-name-absolute-active ...
Running micromail-body-december-relative-active ...
...
Done.
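To tally a run afterwards, the per-task `results.json` files can be aggregated with a few lines of Python. This is a minimal sketch, not part of the repo: `summarize_run` is a hypothetical helper, and it assumes only that each task writes a `results.json` somewhere under the run directory with a boolean `success` field, as the table above lists.

```python
import json
from pathlib import Path


def summarize_run(run_dir: str) -> dict:
    """Count passed/failed tasks by reading every results.json under run_dir.

    Assumption: the exact directory layout is not relied on; files are found
    recursively, and only the documented `success` field is read.
    """
    passed = failed = 0
    for path in sorted(Path(run_dir).rglob("results.json")):
        with path.open(encoding="utf-8") as fh:
            result = json.load(fh)
        if result.get("success"):
            passed += 1
        else:
            failed += 1
    total = passed + failed
    return {
        "total": total,
        "passed": passed,
        "failed": failed,
        "pass_rate": passed / total if total else 0.0,
    }
```

For the run shown above, this would be called as `summarize_run("results/my_first_run")`.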

### Dev mode

Set `SENTINEL_DEV` to skip the `/init` flow and preload an environment with all of its catalog data, so the UI is populated without running the harness:

Preload every environment

SENTINEL_DEV=all uvicorn server.server:app --reload --port 8000

Preload a single environment

SENTINEL_DEV=microfy uvicorn server.server:app --reload --port 8000

Each environment ships a `scenarios//dev.json` that loads all catalog rows via `["*"]` wildcards. The eval harness always skips `dev.json`.
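The rule that the harness skips `dev.json` can be sketched as a simple discovery filter. This is an illustration under assumptions, not the harness's actual code: `discover_scenarios` is a hypothetical helper, and it searches recursively so that it does not depend on the exact directory depth under the scenarios root.

```python
from pathlib import Path


def discover_scenarios(root: str) -> list[Path]:
    """Return every scenario JSON under root except dev.json preload files."""
    return sorted(
        path
        for path in Path(root).rglob("*.json")
        if path.name != "dev.json"  # dev-mode preload, never evaluated
    )
```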

### Runtime tuning

`speed_factor` controls how fast the simulation clock advances relative to wall-clock. It is a top-level key in `eval_config.yaml` (and `--speed` on `run_simulation` for local playback) and is never stored in scenario JSON. The harness sizes each task's subprocess timeout as `kill_at_wall = MAX_CONDITION_AT / speed_factor + REACTION_WINDOW`. `speed_factor > 1` compresses the timeline; `speed_factor < 1` stretches it, and the value must be greater than 0 because it is a divisor. All constants live in `server/timing.py`.
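A worked example of the timeout formula may help. The sketch below is not the code in `server/timing.py`: the values `600` for `MAX_CONDITION_AT` and `30` for `REACTION_WINDOW` are assumptions chosen for illustration, and the guard against non-positive speeds is this sketch's own choice.

```python
def kill_at_wall(
    speed_factor: float,
    max_condition_at: float = 600.0,  # assumed value, see server/timing.py
    reaction_window: float = 30.0,    # assumed value, in wall-clock seconds
) -> float:
    """Wall-clock timeout: MAX_CONDITION_AT / speed_factor + REACTION_WINDOW."""
    if speed_factor <= 0:
        # Guard added for this sketch: the formula divides by speed_factor.
        raise ValueError("speed_factor must be positive")
    return max_condition_at / speed_factor + reaction_window


print(kill_at_wall(1.0))  # 630.0 (real-time)
print(kill_at_wall(2.0))  # 330.0 (timeline compressed)
print(kill_at_wall(0.5))  # 1230.0 (timeline stretched)
```

Note that only the simulated portion scales with `speed_factor`; the reaction window is added undivided.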

## How a scenario works

Each scenario is a JSON file under `scenarios//`, named `--.json` (no-op scenarios use `--noop.json`):

{
  "id": "micromail-junk-relative-passive",
  "environment": "micromail",
  "prompt": "Keep an eye on the junk folder for me, tell me when 3 more emails end up there.",
  "condition_at": 297.75,
  "kill_at": 630.0,
  "event_timeline_end": 720.0,
  "events": [
    { "time": 0, "type": "preload_emails", "payload": { } }
  ],
  "eval_sql": "SELECT (SELECT COUNT(*) FROM email_states WHERE folder='junk') >= (SELECT CAST(value AS INTEGER) FROM session_meta WHERE key='baseline_junk_count') + 3"
}
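To see how an `eval_sql` check of this shape evaluates, it can be run against a throwaway SQLite database. The sketch below is an illustration under stated assumptions: the two table schemas are guessed from the column names in the query (the real schema lives in the repo), `condition_met` is a hypothetical helper, and the baseline of 2 junk emails is made up.

```python
import sqlite3

# The eval_sql string from the scenario above, split for readability.
EVAL_SQL = (
    "SELECT (SELECT COUNT(*) FROM email_states WHERE folder='junk') >= "
    "(SELECT CAST(value AS INTEGER) FROM session_meta "
    "WHERE key='baseline_junk_count') + 3"
)


def condition_met(conn: sqlite3.Connection) -> bool:
    """Run the check; SQLite returns the comparison as a single 0/1 cell."""
    (flag,) = conn.execute(EVAL_SQL).fetchone()
    return bool(flag)


conn = sqlite3.connect(":memory:")
# Hypothetical minimal schemas, inferred from the query only.
conn.execute("CREATE TABLE email_states (email_id INTEGER PRIMARY KEY, folder TEXT)")
conn.execute("CREATE TABLE session_meta (key TEXT PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO session_meta VALUES ('baseline_junk_count', '2')")
conn.executemany("INSERT INTO email_states (folder) VALUES (?)", [("junk",)] * 4)

before = condition_met(conn)  # 4 junk emails, needs 2 + 3 = 5
conn.execute("INSERT INTO email_states (folder) VALUES ('junk')")
after = condition_met(conn)   # 5 junk emails, threshold reached
print(before, after)          # False True
```

Storing the baseline in `session_meta` is what makes the prompt's "3 more" relative: the threshold is computed from the count at session start rather than hard-coded.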

Timing fields, all in simulation-seconds:

- **`condition_at`** — earliest sim-time the success condition can become true. Randomized per scenario in `[10, 600]`, deterministic from `scenario_id`. For no-op scenarios it is `null` (the condition never fires). The released values are frozen; do not hand-edit them.
- **`kill_at`** — sim-time the harness terminates the run (`630`, constant).
- **`event_timeline_end`** — right edge of the authored timeline (`720`, constant).
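The documented ranges can be turned into a quick sanity check on a scenario dict. This is a sketch, not the repo's `tests/test_scenario_schema.py`; `check_timing` is a hypothetical helper that encodes only the values stated in the list above.

```python
KILL_AT = 630.0
EVENT_TIMELINE_END = 720.0
CONDITION_RANGE = (10.0, 600.0)


def check_timing(scenario: dict) -> list[str]:
    """Return a list of problems; an empty list means the fields look fine."""
    problems = []
    condition_at = scenario.get("condition_at")
    if condition_at is not None:  # null marks a no-op scenario
        low, high = CONDITION_RANGE
        if not low <= condition_at <= high:
            problems.append(f"condition_at {condition_at} outside [{low}, {high}]")
    if scenario.get("kill_at") != KILL_AT:
        problems.append(f"kill_at should be {KILL_AT}")
    if scenario.get("event_timeline_end") != EVENT_TIMELINE_END:
        problems.append(f"event_timeline_end should be {EVENT_TIMELINE_END}")
    return problems
```

A scenario with `condition_at: 297.75`, `kill_at: 630.0`, and `event_timeline_end: 720.0` passes with no problems reported.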

Invariant (enforced by `tests/test_scenario_schema.py`): `0 SQLite)
│ ├── server.py # Main FastAPI app
│...

Excerpt shown — open the source for the full document.

Notability

notability 3.0/10

Routine security environment repo; minimal traction.