What does this fork signal mean?

FriendliAI forked friendliai/simple-evals-archived (forked from openai/simple-evals). This fork signal points to upstream code the lab may be inspecting, patching, or building on. High-signal details: repo friendliai/simple-evals-archived · parent openai/simple-evals · Routine fork of archived repo.. onlylabs links this event to 1 captured evidence page and 6 related fork signals.

FriendliAI Fork: friendliai/simple-evals-archived

Captured source

source ↗

GitHub/github.com/friendliai/simple-evals-archived

friendliai/simple-evals-archived repository metadata

Source ↗

published Oct 27, 2024seen Jun 5captured Jun 11http 200method plain

friendliai/simple-evals-archived

Language: Python

License: MIT

Stars: 0

Forks: 0

Open issues: 3

Created: 2024-10-27T10:12:37Z

Pushed: 2026-04-02T10:34:15Z

Default branch: package

Fork: yes

Parent repository: openai/simple-evals

Archived: yes

README:

Overview

This repository contains a lightweight library for evaluating language models. We are open sourcing it so we can be transparent about the accuracy numbers we're publishing alongside our latest models.

Benchmark Results

| Model | Prompt | MMLU | GPQA [^8] | MATH [^6]| HumanEval | MGSM[^5] | DROP[^5] (F1, 3-shot) | SimpleQA |:----------------------------:|:-------------:|:------:|:------:|:--------:|:---------:|:------:|:--------------------------:|:---------:| | o3 | | | | | | | | | | | o3-high [^10] | n/a [^7] | 93.3 | 83.4 | 98.1 | 88.4 | 92.0 | 89.8 | 48.6 | | o3 [^9] [^10] | n/a | 92.9 | 82.8 | 97.8 | 87.4 | 92.3 | 80.6 | 49.4 | | o3-low [^10] | n/a | 92.8 | 78.6 | 96.9 | 87.3 | 91.9 | 82.3 | 49.4 | | o4-mini | | | | | | | | | | o4-mini-high [^9] [^10] | n/a | 90.3 | 81.3 | 98.2 | 99.3 | 93.5 | 78.1 | 19.3 | | o4-mini [^9] [^10] | n/a | 90.0 | 77.6 | 97.5 | 97.3 | 93.7 | 77.7 | 20.2 | | o4-mini-low [^10] | n/a | 89.5 | 73.6 | 96.2 | 95.9 | 93.0 | 76.0 | 20.2 | | o3-mini | | | | | | | | | | | o3-mini-high | n/a | 86.9 | 77.2 | 97.9 | 97.6 | 92.0 | 80.6 | 13.8 | | o3-mini | n/a | 85.9 | 74.9 | 97.3 | 96.3 | 90.8 | 79.2 | 13.4 | | o3-mini-low | n/a | 84.9 | 67.6 | 95.8 | 94.5 | 89.4 | 77.6 | 13.0 | | o1 | | | | | | | | | | o1 | n/a | 91.8 | 75.7 | 96.4 | - | 89.3 | 90.2 | 42.6 | | o1-preview | n/a | 90.8 | 73.3 | 85.5 | 92.4 | 90.8 | 74.8 | 42.4 | | o1-mini | n/a | 85.2 | 60.0 | 90.0 | 92.4 | 89.9 | 83.9 | 07.6 | | GPT-4.1 | | | | | | | | | | | gpt-4.1-2025-04-14 | assistant [^2]| 90.2 | 66.3 | 82.1 | 94.5 | 86.9 | 79.4 | 41.6 | | gpt-4.1-mini-2025-04-14 | assistant | 87.5 | 65.0 | 81.4 | 93.8 | 88.2 | 81.0 | 16.8 | | gpt-4.1-nano-2025-04-14 | assistant | 80.1 | 50.3 | 62.3 | 87.0 | 73.0 | 82.2 | 07.6 | | GPT-4o | | | | | | | | | | | gpt-4o-2024-11-20 | assistant | 85.7 | 46.0 | 68.5 | 90.2 | 90.3 | 81.5 | 38.8 | | gpt-4o-2024-08-06 | assistant | 88.7 | 53.1 | 75.9 | 90.2 | 90.0 | 79.8 | 40.1 | | gpt-4o-2024-05-13 | assistant | 87.2 | 49.9 | 76.6 | 91.0 | 89.9 | 83.7 | 39.0 | | gpt-4o-mini-2024-07-18 | assistant | 82.0 | 40.2 | 70.2 | 87.2 | 87.0 | 79.7 | 09.5 | | GPT-4.5-preview | | | | | | | | | | gpt-4.5-preview-2025-02-27 | assistant | 90.8 | 69.5 | 87.1 | 88.6 | 86.9 | 83.4 | 62.5 | | GPT-4 Turbo and GPT-4 | | | | | | | | | | gpt-4-turbo-2024-04-09 | assistant | 86.7 | 49.3 | 73.4 | 88.2 | 89.6 | 86.0 | 24.2 | | gpt-4-0125-preview | assistant | 85.4 | 41.4 | 64.5 | 86.6 | 85.1 | 81.5 | n/a | | gpt-4-1106-preview | assistant | 84.7 | 42.5 | 64.3 | 83.7 | 87.1 | 83.2 | n/a | | Other Models (Reported) | | | | | | | | | Claude 3.5 Sonnet | unknown | 88.3 | 59.4 | 71.1 | 92.0 | 91.6 | 87.1 | 28.9 | | Claude 3 Opus | unknown | 86.8 | 50.4 | 60.1 | 84.9 | 90.7 | 83.1 | 23.5 | | Llama 3.1 405b | unknown | 88.6 | 50.7 | 73.8 | 89.0 | 91.6 | 84.8 | n/a | Llama 3.1 70b | unknown | 82.0 | 41.7 | 68.0 | 80.5 | 86.9 | 79.6 | n/a | Llama 3.1 8b | unknown | 68.4 | 30.4 | 51.9 | 72.6 | 68.9 | 59.5 | n/a | Grok 2 | unknown | 87.5 | 56.0 | 76.1 | 88.4 | n/a | n/a | n/a | Grok 2 mini | unknown | 86.2 | 51.0 | 73.0 | 85.7 | n/a | n/a | n/a | Gemini 1.0 Ultra | unknown | 83.7 | n/a | 53.2 | 74.4 | 79.0 | 82.4 | n/a | Gemini 1.5 Pro | unknown | 81.9 | n/a | 58.5 | 71.9 | 88.7 | 78.9 | n/a | Gemini 1.5 Flash | unknown | 77.9 | 38.6 | 40.9 | 71.5 | 75.5 | 78.4 | n/a

Background

Evals are sensitive to prompting, and there's significant variation in the formulations used in recent publications and libraries. Some use few-shot prompts or role playing prompts ("You are an expert software programmer..."). These approaches are carryovers from evaluating *base models* (rather than instruction/chat-tuned models) and from models that were worse at following instructions.

For this library, we are emphasizing the *zero-shot, chain-of-thought* setting, with simple instructions like "Solve the following multiple choice problem". We believe that this prompting technique is a better reflection of the models' performance in realistic usage.

We will not be actively maintaining this repository and monitoring PRs and Issues. In particular, we're not accepting new evals. Here are the changes we might accept.

Bug fixes (hopefully not needed!)
Adding adapters for new models
Adding new rows to the table below with eval results, given new models and new system prompts.

This repository is NOT intended as a replacement for https://github.com/openai/evals, which is designed to be a comprehensive collection of a large number of evals.

Evals

This repository currently contains the following evals:

MMLU: Measuring Massive Multitask Language Understanding, reference: https://arxiv.org/abs/2009.03300, https://github.com/hendrycks/test, MIT License
MATH: Measuring Mathematical Problem Solving With the MATH Dataset, reference: https://arxiv.org/abs/2103.03874, https://github.com/hendrycks/math, MIT License
GPQA: A Graduate-Level Google-Proof Q&A Benchmark, reference: https://arxiv.org/abs/2311.12022, https://github.com/idavidrein/gpqa/, MIT License
DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs, reference: https://arxiv.org/abs/1903.00161, https://allenai.org/data/drop, Apache License 2.0
MGSM: Multilingual Grade School Math Benchmark (MGSM), Language Models are Multilingual Chain-of-Thought Reasoners, reference: https://arxiv.org/abs/2210.03057, https://github.com/google-research/url-nlp, [Creative Commons Attribution 4.0...

Excerpt shown — open the source for the full document.

Notability

notability 1.0/10

Routine fork of archived repo.