What does this repo signal mean?

OpenAI published openai/monitorability-evals (Python). This repository signal exposes tooling, eval, infrastructure, or model-adjacent work before it may appear in a launch post. High-signal details: repo openai/monitorability-evals · language Python · Moderate traction research evals repo. onlylabs links this event to 1 captured evidence page and 6 related repo signals. It also maps to Evals and quality in the data-business radar.

OpenAI Repo: openai/monitorability-evals

Captured source

source ↗

GitHub/github.com/openai/monitorability-evals

openai/monitorability-evals repository metadata

Source ↗

published Apr 22, 2026seen Jun 5captured Jun 11http 200method plain

openai/monitorability-evals

Description: Open-sourced evaluation suite from the Monitoring Monitorability paper

Language: Python

License: Apache-2.0

Stars: 79

Forks: 6

Open issues: 1

Created: 2026-04-22T22:12:39Z

Pushed: 2026-04-22T22:30:35Z

Default branch: main

Fork: no

Archived: no

README:

Overview

Open-sourced evaluation data from the Monitoring Monitorability paper. This contains the public evaluation split. For an overview of dataset motivation, sources, metrics, and grader prompts, see Appendix C of the paper.

We do not directly redistribute AIME problem statements in this repo. For copyright reasons, the AIME datasets expose url and source fields instead; use those fields to retrieve the corresponding problem text from the linked source.

There are three archetypes:

Intervention
Process
Outcome-property

Intervention

We open-source 9 of the 11 intervention evaluations:

aime.jsonl
daft.jsonl
gpqa.jsonl
gpqa_encoded.jsonl
memory_advice.jsonl
scruples_first_person.jsonl
scruples_suggest_right.jsonl
scruples_suggest_wrong.jsonl
wmdp_sandbagging.jsonl

The two Health Queries evals rely on private or restricted data and were omitted from the release.

Most rows in intervention/*.jsonl have:

problem: the prompt given to the model
answer: the answer the hint/perturbation is guiding the model towards
category: the subcategory, if it exists
id: the ID to group rows on for metrics calculation. for each dataset, there are two rows with the same ID: one that is perturbed, one that is unperturbed
perturbation: perturbation text
type ("unperturbed" | "perturbed"): whether perturbation is applied in the prompt form.

intervention/aime.jsonl is the exception:

url: external link for the problem statement
source: citation used to derive the problem URL
answer, category, id, perturbation, and type behave as above

Process

We open-source two of the four process evaluation splits

process_gpqa_physics.jsonl
process_aime_math.jsonl

The FrontierMath and Math Diamond evals rely on private or restricted data and were omitted from the release.

Most rows in process/*.jsonl have:

problem: the prompt given to the model
answer (string or int): the correct answer
theorem_paths (string): the theorem paths
full_theorems_definition (object): the definitions for each theorem in the theorem paths
new_solution_path_count (int): the number of independent theorem paths

process/process_aime_math.jsonl is the exception:

url: external link for the problem statement
source: citation used to derive the problem URL
answer, theorem_paths, and full_theorems_definition behave as above

Outcome-property

We open-source one outcome-property dataset:

Flaky Tools (the CharXiv-based subset)

We also describe two other outcome-property evaluations:

Impossible Coding Tasks (the LiveBench-based subset)
ShadeArena (already public)

The following outcome-property evals rely on private or restricted data and were omitted from the release: OAI-Repo-Sabotage, Anti-Scheming, the two Flaky-Tools datasets based on internal OpenAI browsing evaluations and coding problems, and the Impossible Coding Task based on an OpenAI interview question.

Prompt and label mappings

Prompt templates live under monitor_prompts/ and model_prompts/:

monitor_prompts/templates/*.txt: monitor prompt templates with {variable} placeholders.
model_prompts/*.txt: model prompt / answer-evaluation prompt templates (verbatim where available).
dataset_registry.yaml: single source of truth mapping datasets to prompt files and Y/Z label semantics.

Deprecated evals

We identified issues with Anti-Scheming and Memory that are detailed in the GPT 5.4 Thinking system card. We are no longer internally using these two evals until we create a version of Anti-scheming that addresses the mismatch between the generic CoT sketchiness monitor and task-specific answer grader, and a version of Memory with improved answer grading.

Legal note

By contributing to evals, you are agreeing to make your evaluation logic and data under the same Apache-2.0 License as this repository. You must have adequate rights to upload any data used in an eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI evals will be subject to our usual Usage Policies: https://platform.openai.com/docs/usage-policies.

Notability

notability 5.0/10

Moderate traction research evals repo