OpenAI, published Oct 30, 2024

Introducing SimpleQA


October 30, 2024


SimpleQA is a factuality benchmark that measures the ability of language models to answer short, fact-seeking questions.


An open problem in artificial intelligence is how to train models that produce responses that are factually correct. Current language models sometimes produce false outputs or answers unsubstantiated by evidence, a problem known as “hallucinations”. Language models that generate more accurate responses with fewer hallucinations are more trustworthy and can be used in a broader range of applications. To measure the factuality of language models, we are open-sourcing a new benchmark called SimpleQA.

About the SimpleQA benchmark

Factuality is a complicated topic because it is hard to measure—evaluating the factuality of any given arbitrary claim is challenging, and language models can generate long completions that contain dozens of factual claims. In SimpleQA, we will focus on short, fact-seeking queries, which reduces the scope of the benchmark but makes measuring factuality much more tractable.

With SimpleQA, our goal was to create a dataset with the following properties:

1. High correctness. Reference answers to questions are supported by sources from two independent AI trainers, and questions were written in such a way that the predicted answers are easy to grade.
2. Diversity. SimpleQA covers a wide range of topics, from science and technology to TV shows and video games.
3. Challenging for frontier models. Compared to older benchmarks such as TriviaQA (2017) or NQ (2019), which have become saturated, SimpleQA was created to be a greater challenge for frontier models (e.g., GPT‑4o scores less than 40%).
4. Good researcher UX. SimpleQA is intended to be fast and simple to run due to its concise questions and answers. Grading is also efficient whether through the OpenAI API or another frontier model API. Additionally, with 4,326 questions, SimpleQA should have relatively low variance as an evaluation benchmark.
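The low-variance claim in the last item can be illustrated with a back-of-the-envelope calculation. The sketch below assumes that questions are graded independently, so that the number of correct answers is binomial; that is a simplification, not something the post states. The 40% accuracy is only an illustrative value taken from the "less than 40%" remark about GPT‑4o, and the function name is hypothetical.

```python
import math


def binomial_standard_error(accuracy: float, n_questions: int) -> float:
    """Standard error of an accuracy estimate over n independent questions."""
    if n_questions <= 0:
        raise ValueError("n_questions must be positive")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be in [0, 1]")
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


N_QUESTIONS = 4326       # dataset size stated in the post
ASSUMED_ACCURACY = 0.40  # illustrative assumption, not a reported score

standard_error = binomial_standard_error(ASSUMED_ACCURACY, N_QUESTIONS)
half_width_95 = 1.96 * standard_error  # normal-approximation 95% interval

print(f"standard error: {standard_error:.4f}")
print(f"95% interval half-width: {half_width_95:.4f}")
```

Under these assumptions the standard error is a little under one percentage point, so differences of several points between models are unlikely to be sampling noise alone.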

We hired AI trainers to browse the web and create short, fact-seeking questions and corresponding answers. To be included in the dataset, each question had to meet a strict set of criteria: it must have a single, indisputable answer for easy grading; the answer to the question should not change over time; and most questions had to induce hallucinations from either GPT‑4o or GPT‑3.5. To further improve the quality of the dataset, a second, independent AI trainer answered each question without seeing the original response. Only questions where both AI trainers’ answers agreed were included.

As a final verification of quality, we had a third AI trainer answer a random sample of 1,000 questions from the dataset. We found that the third AI trainer’s answer matched the original agreed answers 94.4% of the time, with a 5.6% disagreement rate. We then manually inspected these examples and found that 2.8 percentage points of the 5.6% disagreement rate were due to grader false negatives or human errors from the third trainer (e.g., incomplete answers or misinterpreting sources), and the remaining 2.8 percentage points were due to real issues with the question (e.g., ambiguous questions, or different websites giving conflicting answers). Hence, we estimate the inherent error rate of this dataset to be approximately 3%.
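The arithmetic behind the roughly 3% estimate can be written out explicitly. This is a minimal sketch using only the rates quoted above; the variable names are illustrative.

```python
# Rates quoted in the post, as fractions of the 1,000 sampled questions.
DISAGREEMENT_RATE = 0.056             # third trainer disagreed with the agreed answer
TRAINER_OR_GRADER_ERROR_RATE = 0.028  # grader false negatives or third-trainer mistakes

# Share of sampled questions where the third trainer matched the agreed answer.
agreement_rate = 1.0 - DISAGREEMENT_RATE

# What remains is attributed to real problems with the question itself,
# which is the basis for the "approximately 3%" inherent error estimate.
question_issue_rate = DISAGREEMENT_RATE - TRAINER_OR_GRADER_ERROR_RATE

print(f"agreement rate: {agreement_rate:.1%}")
print(f"estimated inherent error rate: {question_issue_rate:.1%}")
```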

Question diversity in SimpleQA

The pie chart below shows the diversity of topics in the SimpleQA benchmark. In the interactive version of the chart, hovering over a slice shows example questions for that topic.

Using SimpleQA to compare language models

To grade questions, we use a prompted ChatGPT classifier that sees both the predicted answer from the model and the ground-truth answer, and then grades the predicted answer as either “correct”, “incorrect”, or “not attempted”.

A definition and corresponding examples for each grade are shown in the table below.

| Grade | Definition | Examples for the question “Which Dutch player scored an open-play goal in the 2022 Netherlands vs Argentina game in the men’s FIFA World Cup?” (Answer: Wout Weghorst) |
| --- | --- | --- |
| “Correct” | The predicted answer fully contains the ground-truth answer without contradicting the reference answer. | “Wout Weghorst”; “Wout Weghorst scored at 83’ and 90+11’ in that game” |
| “Incorrect” | The predicted answer contradicts the ground-truth answer in any way, even if the contradiction is hedged. | “Virgil van Dijk”; “Virgil van Dijk and Wout Weghorst”; “Wout Weghorst and I think van Dijk scored, but I am not totally sure” |
| “Not attempted” | The ground truth target is not fully given in the answer, and there are no contradictions with the reference answer. | “I don’t know the answer to that question”; “To find which Dutch player scored in that game, please browse the internet yourself” |

A model will ideally answer as many questions as possible correctly (maximizing the number of correct answers) while minimizing the number of incorrect answers.
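Once every prediction has one of the three grades, summarizing a model's performance is a matter of counting. The sketch below is illustrative and not the benchmark's own scoring code: the `Grade` enum and `summarize` function are hypothetical names, and the "correct given attempted" ratio is an extra statistic added here for illustration rather than one defined in the text above.

```python
from collections import Counter
from enum import Enum


class Grade(Enum):
    CORRECT = "correct"
    INCORRECT = "incorrect"
    NOT_ATTEMPTED = "not attempted"


def summarize(grades: list[Grade]) -> dict[str, float]:
    """Return the fraction of answers per grade, plus accuracy on attempted questions."""
    total = len(grades)
    if total == 0:
        raise ValueError("no grades to summarize")
    counts = Counter(grades)
    attempted = counts[Grade.CORRECT] + counts[Grade.INCORRECT]
    return {
        "correct": counts[Grade.CORRECT] / total,
        "incorrect": counts[Grade.INCORRECT] / total,
        "not_attempted": counts[Grade.NOT_ATTEMPTED] / total,
        # Assumption: defined as 0.0 when the model attempted nothing.
        "correct_given_attempted": counts[Grade.CORRECT] / attempted if attempted else 0.0,
    }


# The seven example answers from the table: 2 correct, 3 incorrect, 2 not attempted.
example_grades = (
    [Grade.CORRECT] * 2 + [Grade.INCORRECT] * 3 + [Grade.NOT_ATTEMPTED] * 2
)
summary = summarize(example_grades)
print(summary)
```

Reporting "not attempted" separately matters because a model can lower its incorrect rate simply by declining to answer, which a single accuracy number would hide.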

Using this classification, we can then measure the performance of several OpenAI models without browsing, including gpt-4o-mini, o1‑mini, gpt-4o, and o1‑preview. As expected, gpt-4o-mini and o1‑mini answer fewer questions correctly compared to gpt-4o and o1‑preview, likely because smaller models typically have less world knowledge. We also see that o1‑mini and o1‑preview, which are designed to spend more time thinking, choose to "not attempt" questions more often than gpt-4o-mini and gpt-4o. This may be because they can use their reasoning capacity to recognize when they don’t know the answer to a question, instead of hallucinating.

Using SimpleQA to measure the calibration of large language models

A factuality benchmark like SimpleQA also allows us to measure the scientific phenomenon known as calibration, or whether language models “know what they know.” One way to measure calibration is to directly ask the language model to state its confidence in its answer using the prompt: “Please give your best guess, along with your confidence as a percentage that that is the correct answer.” Then we can plot the correlation between the stated confidence of the model, and how accurate the model actually was. A perfectly calibrated model would have the same actual accuracy as stated confidence. For instance, on all prompts where the model stated a confidence of 75%, the accuracy would be 75% for a perfectly calibrated model.
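The comparison of stated confidence with actual accuracy can be sketched as a simple binning procedure. This is a minimal illustration, not the post's own analysis code: the function name, the choice of ten equal-width bins, and the toy records are all assumptions.

```python
from collections import defaultdict


def calibration_table(records, n_bins=10):
    """Group (stated_confidence, is_correct) pairs into equal-width confidence bins.

    Confidence is a fraction in [0, 1]. Returns one row per non-empty bin:
    (mean stated confidence, actual accuracy, number of answers).
    """
    bins = defaultdict(list)
    for confidence, is_correct in records:
        if not 0.0 <= confidence <= 1.0:
            raise ValueError(f"confidence out of range: {confidence}")
        # A confidence of exactly 1.0 goes into the top bin.
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, is_correct))

    rows = []
    for index in sorted(bins):
        items = bins[index]
        mean_confidence = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        rows.append((mean_confidence, accuracy, len(items)))
    return rows


# Toy data (made up for illustration): four answers stated at 75% confidence,
# three of them correct, and two answers stated at 90%, one of them correct.
toy_records = [(0.75, True), (0.75, True), (0.75, True), (0.75, False),
               (0.9, True), (0.9, False)]
rows = calibration_table(toy_records)
for mean_confidence, accuracy, count in rows:
    print(f"stated {mean_confidence:.2f}  actual {accuracy:.2f}  n={count}")
```

In this toy data the 75% bin is perfectly calibrated, matching the example in the text, while the 90% bin is overconfident because its actual accuracy is only 50%.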

This result is shown in the figure below. The positive…
