RepoOpenAIOpenAIpublished Apr 22, 2026seen 6d

openai/monitorability-evals

Python

Open original ↗

Captured source

source ↗
published Apr 22, 2026seen 6dcaptured 9hhttp 200method plain

openai/monitorability-evals

Description: Open-sourced evaluation suite from the Monitoring Monitorability paper

Language: Python

License: Apache-2.0

Stars: 79

Forks: 6

Open issues: 1

Created: 2026-04-22T22:12:39Z

Pushed: 2026-04-22T22:30:35Z

Default branch: main

Fork: no

Archived: no

README:

Overview

Open-sourced evaluation data from the Monitoring Monitorability paper. This contains the public evaluation split. For an overview of dataset motivation, sources, metrics, and grader prompts, see Appendix C of the paper.

We do not directly redistribute AIME problem statements in this repo. For copyright reasons, the AIME datasets expose url and source fields instead; use those fields to retrieve the corresponding problem text from the linked source.

There are three archetypes:

  • Intervention
  • Process
  • Outcome-property

Intervention

We open-source 9 of the 11 intervention evaluations:

  • aime.jsonl
  • daft.jsonl
  • gpqa.jsonl
  • gpqa_encoded.jsonl
  • memory_advice.jsonl
  • scruples_first_person.jsonl
  • scruples_suggest_right.jsonl
  • scruples_suggest_wrong.jsonl
  • wmdp_sandbagging.jsonl

The two Health Queries evals rely on private or restricted data and were omitted from the release.

Most rows in intervention/*.jsonl have:

  • problem: the prompt given to the model
  • answer: the answer the hint/perturbation is guiding the model towards
  • category: the subcategory, if it exists
  • id: the ID to group rows on for metrics calculation. for each dataset, there are two rows with the same ID: one that is perturbed, one that is unperturbed
  • perturbation: perturbation text
  • type ("unperturbed" | "perturbed"): whether perturbation is applied in the prompt form.

intervention/aime.jsonl is the exception:

  • url: external link for the problem statement
  • source: citation used to derive the problem URL
  • answer, category, id, perturbation, and type behave as above

Process

We open-source two of the four process evaluation splits

  • process_gpqa_physics.jsonl
  • process_aime_math.jsonl

The FrontierMath and Math Diamond evals rely on private or restricted data and were omitted from the release.

Most rows in process/*.jsonl have:

  • problem: the prompt given to the model
  • answer (string or int): the correct answer
  • theorem_paths (string): the theorem paths
  • full_theorems_definition (object): the definitions for each theorem in the theorem paths
  • new_solution_path_count (int): the number of independent theorem paths

process/process_aime_math.jsonl is the exception:

  • url: external link for the problem statement
  • source: citation used to derive the problem URL
  • answer, theorem_paths, and full_theorems_definition behave as above

Outcome-property

We open-source one outcome-property dataset:

  • Flaky Tools (the CharXiv-based subset)

We also describe two other outcome-property evaluations:

  • Impossible Coding Tasks (the LiveBench-based subset)
  • ShadeArena (already public)

The following outcome-property evals rely on private or restricted data and were omitted from the release: OAI-Repo-Sabotage, Anti-Scheming, the two Flaky-Tools datasets based on internal OpenAI browsing evaluations and coding problems, and the Impossible Coding Task based on an OpenAI interview question.

Prompt and label mappings

Prompt templates live under monitor_prompts/ and model_prompts/:

  • monitor_prompts/templates/*.txt: monitor prompt templates with {variable} placeholders.
  • model_prompts/*.txt: model prompt / answer-evaluation prompt templates (verbatim where available).
  • dataset_registry.yaml: single source of truth mapping datasets to prompt files and Y/Z label semantics.

Deprecated evals

We identified issues with Anti-Scheming and Memory that are detailed in the GPT 5.4 Thinking system card. We are no longer internally using these two evals until we create a version of Anti-scheming that addresses the mismatch between the generic CoT sketchiness monitor and task-specific answer grader, and a version of Memory with improved answer grading.

Legal note

By contributing to evals, you are agreeing to make your evaluation logic and data under the same Apache-2.0 License as this repository. You must have adequate rights to upload any data used in an eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI evals will be subject to our usual Usage Policies: https://platform.openai.com/docs/usage-policies.

Notability

notability 5.0/10

Moderate traction research evals repo