What does this repo signal mean?

Microsoft published microsoft/STATE-Bench (Python). Repository signals like this one can expose tooling, eval, infrastructure, or model-adjacent work before it appears in a launch post. High-signal details: repo microsoft/STATE-Bench · language Python · description "Benchmark AI Agents on Enterprise Workflows". onlylabs links this event to 1 captured evidence page and 6 related repo signals. It also maps to Evals and quality and Product and customer in the data-business radar.

Microsoft Repo: microsoft/STATE-Bench

Captured source

GitHub/github.com/microsoft/STATE-Bench

microsoft/STATE-Bench repository metadata

published May 11, 2026 · seen Jun 5 · captured Jun 11 · http 200 · method plain

microsoft/STATE-Bench

Description: Benchmark AI Agents on Enterprise Workflows

Language: Python

License: MIT

Stars: 45

Forks: 6

Open issues: 7

Created: 2026-05-11T18:48:50Z

Pushed: 2026-06-10T20:31:37Z

Default branch: main

Fork: no

Archived: no

README:

STATE-Bench evaluates AI agents on realistic, multi-step enterprise workflows across three domains: travel, customer support, and shopping assistant.

Each task gives the agent a task-local sandbox database, domain-specific tools, and a simulated user. To pass a task, the agent must reason over multiple steps: gather the right information with domain tools, apply the correct policy, take actions that bring the database to the right final state when needed, and follow the required procedure in conversation.

Overview

STATE-Bench includes 450 challenging enterprise tasks across three domains.

| Domain | Tasks | Description |
| --- | ---: | --- |
| Travel | 150 | Flight, hotel, and car rental bookings; cancellations, updates, fee and policy reasoning, cross-product trip planning |
| Customer Support | 150 | Returns, refunds, exchanges, warranty claims, cancellations, shipping issues, and order changes |
| Shopping Assistant | 150 | Product search, cart updates, applying promos, loyalty redemption, shipping options, and compatibility checks |

Choose Your Benchmark Track

Start with the track that matches what you want to evaluate. Each track guide links to the setup and reference docs only when you need them.

| Goal | Start here |
| --- | --- |
| Evaluate an agent or model directly on the provided enterprise benchmark tasks | [Main Track](docs/RUN_BENCHMARK.md) |
| Evaluate agentic memory, skills, or prompt optimization | [Agent Learning Track](docs/AGENT_LEARNING_TRACK.md) |

The Main Track is the default benchmark path. The Agent Learning Track uses the same simulator, domain tools, judges, and metrics, but adds train trajectories and a retrieval hook for reusable learnings such as memories, skills, or prompt optimizations.

Sample task trajectory from the Travel domain.

Metrics

STATE-Bench reports four headline metrics:

| Metric | What it measures |
| --- | --- |
| Task Completion pass@1 | Average task completion rate across five runs per task. |
| Task Completion pass^5 | Percentage of tasks completed successfully on all five runs. |
| UX Score | LLM-judged conversation quality on a 1-5 scale. |
| Cost Per Task | Average agent cost from user-reported token usage and pricing. |
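The completion and cost metrics above can be illustrated with a minimal sketch. It is not taken from the STATE-Bench codebase: the result layout (`runs`, `usages`), the function names, the task ids, and the per-million-token prices are all hypothetical, chosen only to mirror the table's definitions.

```python
from statistics import mean

# Hypothetical results: task id -> five booleans, one per run (True = completed).
runs = {
    "travel_001": [True, True, True, True, True],
    "support_001": [True, False, True, True, False],
    "shopping_001": [False, False, False, False, False],
}

def pass_at_1(results):
    """Average the per-task completion rate over all tasks."""
    return mean(mean(1.0 if ok else 0.0 for ok in r) for r in results.values())

def pass_hat_k(results):
    """Fraction of tasks that succeeded on every one of their runs."""
    return mean(1.0 if all(r) else 0.0 for r in results.values())

# Hypothetical user-reported token usage, one entry per run.
usages = [
    {"input_tokens": 10_000, "output_tokens": 2_000},
    {"input_tokens": 20_000, "output_tokens": 1_000},
]

def cost_per_task(usages, usd_per_million_input, usd_per_million_output):
    """Average run cost in USD, given assumed per-million-token prices."""
    return mean(
        (u["input_tokens"] * usd_per_million_input
         + u["output_tokens"] * usd_per_million_output) / 1_000_000
        for u in usages
    )

print(f"pass@1: {pass_at_1(runs):.3f}")
print(f"pass^5: {pass_hat_k(runs):.3f}")
print(f"cost per task (USD): {cost_per_task(usages, 1.0, 4.0):.4f}")
```

Because pass^5 only credits tasks that succeed on all five runs, it can never exceed pass@1. The gap between the two gives a rough sense of how consistent an agent is across repeated attempts.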

License

STATE-Bench is released under the MIT License. See [LICENSE](LICENSE).

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

Disclosures

Datasets provided in this benchmark were synthetically generated using large language models. The benchmark is intended for research purposes and users should exercise caution and consider the limitations of synthetic data when interpreting results.

Notability

notability 3.0/10

New benchmark repo with low stars

Microsoft has a repo signal matching evals and quality, product and customer.

Evals and quality · Product and customer