openai/hallucinations-paper-experiments
Description: Experiments for paper
Language: Jupyter Notebook
License: MIT
Stars: 10
Forks: 2
Open issues: 1
Created: 2026-02-20T20:01:58Z
Pushed: 2026-05-15T18:39:31Z
Default branch: main
Fork: no
Archived: no
README:
Reproducing the experiments for the paper "Evaluating large language models for accuracy incentivizes hallucinations"
This folder contains a self-contained notebook that runs the core SimpleQA-based experiments used in our paper "Evaluating large language models for accuracy incentivizes hallucinations".
Reference: Kalai, A. T., Nachum, O., Vempala, S. S., & Zhang, E. (2026). *Evaluating large language models for accuracy incentivizes hallucinations*. Nature. https://doi.org/10.1038/s41586-026-10549-w
@article{kalai2026evaluating,
  author = {Kalai, Adam Tauman and Nachum, Ofir and Vempala, Santosh S. and Zhang, Edwin},
  title = {Evaluating large language models for accuracy incentivizes hallucinations},
  journal = {Nature},
  year = {2026},
  doi = {10.1038/s41586-026-10549-w},
  url = {https://doi.org/10.1038/s41586-026-10549-w}
}

- `experiment.ipynb`: downloads the SimpleQA test set, queries four frontier LMs under different abstention instructions, grades outputs with the SimpleQA grader prompt, and generates the paper-style plot(s).
- `lm.py`: lightweight OpenRouter-backed LM wrapper with a shared on-disk cache for reproducible reruns.
This uses the full SimpleQA test set comprising 4,326 questions and computes bootstrapped p-values.
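The README does not say which statistic is bootstrapped, so the following is only a minimal sketch of one common choice: a paired bootstrap over questions for the difference in mean score between two conditions. The function name `paired_bootstrap_p_value` and its parameters are hypothetical and are not taken from the notebook.

```python
import random


def paired_bootstrap_p_value(scores_a, scores_b, n_resamples=2000, seed=0):
    """One-sided paired-bootstrap p-value for "condition B scores higher than A".

    scores_a[i] and scores_b[i] are the scores of the same question under the
    two conditions. Returns the (smoothed) fraction of resamples in which B
    does NOT beat A.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")

    rng = random.Random(seed)  # fixed seed so reruns give the same value
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)

    not_improved = 0
    for _ in range(n_resamples):
        # Resample questions with replacement, keeping each pair together.
        resampled_sum = sum(diffs[rng.randrange(n)] for _ in range(n))
        if resampled_sum <= 0:
            not_improved += 1

    # Add-one smoothing keeps the estimate strictly above zero.
    return (not_improved + 1) / (n_resamples + 1)
```

Resampling whole question pairs, rather than each condition independently, keeps the per-question correlation between the two conditions intact.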
See some [Examples](EXAMPLES.md)
What the notebook does
- Dataset: SimpleQA test set downloaded from OpenAI public blob storage into `~/data/`.
- Models queried (through OpenRouter API): `google/gemini-3-pro-preview`, `openai/gpt-5.2`, `x-ai/grok-4`, `anthropic/claude-opus-4.5`
- Grading: uses `openai/gpt-4.1` with the SimpleQA grader template (copied into the notebook).
- Open Rubric: we evaluate the effect of stating the scoring system explicitly in the prompt.
- Consistency Mitigation: uses a “two samples + consistency check” procedure; inconsistent pairs abstain as `"I don't know"`. (This is `k=2` in the notebook, with `k=1` being the baseline of just querying the model once.)
Thus each model is evaluated:
- With a penalty $L \in \{0, 1, 3, 9\}$ for errors.
- Open Rubric (where the scoring system is stated explicitly) and Closed Rubric (meaning just the question).
- Baseline and Consistency Mitigation.
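The “two samples + consistency check” procedure can be sketched roughly as follows. This is an illustration under assumptions, not the notebook's code: `sample_fn` stands in for a model call, and the default `consistent_fn` compares normalized strings, whereas the notebook may use a different (for example LM-based) consistency check.

```python
from typing import Callable

IDK = "I don't know"


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(answer.lower().split())


def same_after_normalizing(a: str, b: str) -> bool:
    # Assumed consistency check: exact match after normalization.
    return normalize(a) == normalize(b)


def answer_with_consistency(
    question: str,
    sample_fn: Callable[[str], str],
    consistent_fn: Callable[[str, str], bool] = same_after_normalizing,
    k: int = 2,
) -> str:
    """Draw k samples; return the first if all agree, otherwise abstain.

    With k=1 this reduces to the baseline of querying the model once.
    """
    samples = [sample_fn(question) for _ in range(k)]
    first = samples[0]
    if all(consistent_fn(first, other) for other in samples[1:]):
        return first
    return IDK
```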
The closed rubric only needs to be run once, since its answers do not depend on the penalty; the same answers are then rescored at each penalty.
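As a sketch of what rescoring at each penalty could look like, the snippet below assumes a scoring rule of +1 for a correct answer, 0 for an abstention, and $-L$ for an error. The README only states that errors carry a penalty $L$; the rewards for correct and abstained answers, the grade labels, and the helper names are assumptions made for illustration.

```python
PENALTIES = (0, 1, 3, 9)


def score(grade: str, penalty: float) -> float:
    """Score one graded answer under the assumed rule (+1 / 0 / -penalty)."""
    if grade == "correct":
        return 1.0
    if grade == "abstain":
        return 0.0
    if grade == "incorrect":
        return -float(penalty)
    raise ValueError(f"unknown grade: {grade!r}")


def mean_score(grades, penalty: float) -> float:
    return sum(score(g, penalty) for g in grades) / len(grades)


def rescore_at_each_penalty(grades):
    """Reuse one set of graded answers for every penalty value."""
    return {penalty: mean_score(grades, penalty) for penalty in PENALTIES}


# Two correct, one error, one abstention out of four answers.
example = ["correct", "incorrect", "abstain", "correct"]
print(rescore_at_each_penalty(example))
```

Under this assumed rule, the same four answers average 0.5 at $L = 0$ and $-1.75$ at $L = 9$, which shows how a fixed set of answers looks progressively worse as errors become more costly.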
Setup
- Python: the notebook metadata targets Python 3.12.9.
- Install dependencies (minimal set used by `experiment.ipynb`/`lm.py`):
python -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install -r requirements.txt
API keys (required)
Set the environment variable used by lm.py:
export OPENROUTER_API_KEY="..."
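Before launching a long run, it can help to fail fast if the key is missing. The check below is a small sketch; the helper `get_openrouter_key` is hypothetical and `lm.py` may read the variable differently.

```python
import os


def get_openrouter_key(env=os.environ) -> str:
    """Return the OpenRouter key, or raise a clear error if it is unset."""
    key = env.get("OPENROUTER_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENROUTER_API_KEY is not set; export it before running the notebook."
        )
    return key
```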
Running the experiment
Open and run the notebook top-to-bottom:
jupyter lab hallucinations/nature/experiment.ipynb
The notebook will:
- Create `~/data/` if missing
- Download `simple_qa_test_set.csv` from `https://openaipublic.blob.core.windows.net/simple-evals/simple_qa_test_set.csv` into `~/data/simple_qa_test_set.csv` if missing
- Run a large batch of LM calls (unless you reduce `NUM_SAMPLES`)
- Plot baseline vs mitigation hallucination/abstention/accuracy rates
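The first two steps amount to a download-if-missing pattern. A minimal sketch using only the standard library is shown below; the function name `ensure_dataset` and its `fetch` parameter (which lets the download be swapped out) are illustrative and not necessarily how the notebook does it.

```python
from pathlib import Path
from urllib.request import urlretrieve

SIMPLEQA_URL = (
    "https://openaipublic.blob.core.windows.net/simple-evals/simple_qa_test_set.csv"
)


def ensure_dataset(dest_dir=None, url=SIMPLEQA_URL, fetch=urlretrieve) -> Path:
    """Create the data directory and download the CSV only if it is absent."""
    dest_dir = Path(dest_dir) if dest_dir is not None else Path.home() / "data"
    dest_dir.mkdir(parents=True, exist_ok=True)

    dest = dest_dir / "simple_qa_test_set.csv"
    if not dest.exists():
        fetch(url, str(dest))  # skipped on later runs, so reruns stay offline
    return dest
```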
Caching + reproducibility
All model calls are cached on disk via `diskcache` in `.cache/` under the repo root (the repo root is discovered by walking up to a `.git` directory). This is convenient if you want to rerun any analysis that does not change the prompts.
For a fast smoke test, set `NUM_SAMPLES = 10` (or similar) near the top of the notebook, and optionally reduce `MAX_PARALLEL`.
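The repo-root discovery and a deterministic cache key can be sketched with the standard library alone. Both helpers below are hypothetical illustrations: `lm.py` relies on `diskcache`, and how it actually builds its keys is not described here.

```python
import hashlib
import json
from pathlib import Path


def find_repo_root(start=None) -> Path:
    """Walk upwards from `start` until a directory containing .git is found."""
    here = Path(start or Path.cwd()).resolve()
    for candidate in (here, *here.parents):
        if (candidate / ".git").exists():
            return candidate
    raise FileNotFoundError(f"no .git directory found above {here}")


def cache_key(model: str, prompt: str, **params) -> str:
    """Stable key: identical model, prompt and parameters map to one entry."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "params": params}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Keying on the full prompt is what makes the cache safe for reruns: any change to a prompt produces a new key and therefore a fresh model call.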
Cost
The costs are significant because frontier models are used. In our four-model experiment on SimpleQA, the costs were:
| Model                | Cost      |
| -------------------- | --------- |
| gemini-3-pro-preview | $826.18   |
| GPT-5.2              | $690.45   |
| grok-4               | $949.52   |
| opus-4.5             | $154.40   |
| GPT-4.1 (grading)    | $158.01   |
| Total                | $2,778.56 |
Each model is run twice on each question for each abstention threshold, and consistency checks are performed as well.
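For budgeting a reduced run, a back-of-the-envelope estimate can be derived from the figures above. The sketch assumes that `NUM_SAMPLES` is the number of questions evaluated and that cost scales linearly with it; both are assumptions, and real costs will vary with response lengths and cache hits.

```python
# Reported costs (USD) of the full four-model run on 4,326 questions.
FULL_RUN_COST_USD = {
    "gemini-3-pro-preview": 826.18,
    "GPT-5.2": 690.45,
    "grok-4": 949.52,
    "opus-4.5": 154.40,
    "GPT-4.1 (grading)": 158.01,
}
FULL_RUN_QUESTIONS = 4326


def estimate_cost_usd(num_samples: int) -> float:
    """Linear estimate of total cost when only `num_samples` questions are run."""
    total = sum(FULL_RUN_COST_USD.values())
    return total * num_samples / FULL_RUN_QUESTIONS


print(f"Full run:   ${estimate_cost_usd(FULL_RUN_QUESTIONS):,.2f}")
print(f"Smoke test: ${estimate_cost_usd(10):,.2f}")
```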