Fork · Nous Research · published Aug 11, 2025

NousResearch/eqbench3

forked from EQ-bench/eqbench3


License: MIT

Stars: 3

Forks: 1

Open issues: 0

Created: 2025-08-11T06:51:26Z

Pushed: 2025-08-12T06:24:59Z

Default branch: main

Fork: yes

Parent repository: EQ-bench/eqbench3

Archived: no

README:

EQ-Bench 3

EQ-Bench 3 is a multi-turn emotional intelligence benchmark. It assesses active EQ skills, interpersonal skills, psychological insight and analytical depth. It challenges language models with role-play or analysis tasks that require empathy, depth of insight, social dexterity, and more. An auxiliary judge model (by default, Claude Sonnet 3.7) scores or pairwise-compares the outputs.

For full details on the benchmark, including methodology, criteria, bias analysis, repeatability experiments and more, click here.

Features

  • Role-Play Scenarios: The tested model is placed in conversation-based scenarios (e.g., parenting, relationship conflict, workplace tension). It must articulate what it (and others) feel/think before delivering its final response.
  • Analysis Scenarios: The model is asked to read or interpret a transcript and perform an in-depth analysis of human dynamics and subtext.
  • Rubric Scoring: A judge LLM assigns a multi-criteria rubric score (0–20, scaled up to 0–100) to each scenario outcome, focusing on empathy, emotional reasoning, social skill, etc.
  • Pairwise ELO Analysis: The judge compares transcripts from two different models for the same scenario, awarding a “win” margin for each of several criteria. A TrueSkill/ELO-like solver aggregates all pairwise comparisons to produce a final ranking.
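The rubric scaling mentioned above (0–20 per scenario, scaled up to 0–100) can be illustrated with a short sketch. The criterion names and the use of a plain mean across criteria are assumptions made for illustration; the repository's actual aggregation may differ.

```python
from statistics import mean


def scale_rubric_score(criterion_scores: dict[str, float]) -> float:
    """Average 0-20 criterion scores and rescale the result to 0-100."""
    if not criterion_scores:
        raise ValueError("at least one criterion score is required")
    for name, score in criterion_scores.items():
        if not 0 <= score <= 20:
            raise ValueError(f"{name} score {score} is outside the 0-20 range")
    # Multiplying by 5 maps the 0-20 scale onto 0-100.
    return mean(criterion_scores.values()) * 5.0


# Hypothetical criterion names, chosen only for this example.
example = {"empathy": 16, "emotional_reasoning": 14, "social_skill": 18}
print(scale_rubric_score(example))
```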

Leaderboard

The full EQ-Bench leaderboard is viewable at: https://eqbench.com/

Table of Contents

1. [Project Overview](#project-overview)
2. [Installation & Requirements](#installation--requirements)
3. [Quickstart](#quickstart)
4. [Running the Benchmark](#running-the-benchmark)
  • [Command-Line Arguments](#command-line-arguments)
  • [Example Commands](#example-commands)
  • [Rubric vs ELO](#rubric-vs-elo)
  • [Local vs Canonical Results Files](#local-vs-canonical-results-files)
5. [Example Use Case: Compare a single model to a baseline with Elo](#5-example-use-case-compare-a-single-model-to-a-baseline-with-elo)
6. [Merging Local Results into the Canonical Leaderboard](#merging-local-results-into-the-canonical-leaderboard)
7. [Folder and File Structure](#folder-and-file-structure)
8. [Limitations and Notes](#limitations-and-notes)
9. [Contact and License](#contact-and-license)
10. [Citation](#citation)

---

1. Project Overview

EQ-Bench 3 aims to measure active emotional intelligence abilities in LLMs. Rather than knowledge-based or short-answer questions, tasks here are multi-turn dialogues or analysis questions that test empathy, social dexterity, and psychological insight. The evaluated model’s responses are then graded by a judge model:

  • Rubric pass: The judge model issues a numerical score for each scenario.
  • ELO pass: The judge model performs pairwise comparisons of transcripts from different models, resulting in an overall ELO ranking (via TrueSkill).
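To give a feel for how pairwise outcomes turn into a ranking, here is a minimal sketch using the classic Elo update. This is deliberately not TrueSkill, which the benchmark uses; the K-factor, starting rating, and the `(model_a, model_b, outcome)` tuple layout are all assumptions for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings, comparisons, k=16.0, start=1200.0):
    """Apply sequential Elo updates.

    comparisons: iterable of (model_a, model_b, outcome), where outcome is
    1.0 if A wins, 0.0 if B wins, and 0.5 for a draw.
    """
    ratings = dict(ratings)  # work on a copy, leave the input untouched
    for a, b, outcome in comparisons:
        ra = ratings.setdefault(a, start)
        rb = ratings.setdefault(b, start)
        ea = expected_score(ra, rb)
        # The two updates are equal and opposite, so total rating is conserved.
        ratings[a] = ra + k * (outcome - ea)
        ratings[b] = rb - k * (outcome - ea)
    return ratings


# Hypothetical model names and outcomes.
results = [("model-x", "baseline", 1.0), ("model-x", "baseline", 0.5)]
final = update_elo({}, results)
for name, rating in sorted(final.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

A sequential update like this depends on the order of comparisons, which is one reason a batch solver is a more natural fit when many matchups are collected up front.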

Key Points

  • Scenarios vary from relationship drama to conflict mediation, pushing the tested model to reason about others’ emotions.
  • Analysis tasks require deeper reflection on a provided transcript or scenario.
  • Judge model: By default, a Claude model (Sonnet 3.7) is used, but any LLM accessible via an OpenAI-compatible endpoint can serve as judge.
  • Truncation: Pairwise judgments truncate outputs to level the playing field. Rubric judgments typically do not truncate (to preserve detail).
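The truncation point above could be sketched as follows: both transcripts are cut to the same length before the judge compares them, so a longer answer gains no advantage from length alone. The character limit and the word-boundary handling are assumptions for illustration, not the repository's actual settings.

```python
def truncate_for_pairwise(text: str, max_chars: int = 4000) -> str:
    """Cut text to at most max_chars, avoiding a split in the middle of a word."""
    if max_chars < 0:
        raise ValueError("max_chars must be non-negative")
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    # If the cut already lands on a word boundary, keep it as-is.
    if text[max_chars].isspace():
        return cut
    # Otherwise back up to the last space inside the cut, if there is one.
    head, sep, _ = cut.rpartition(" ")
    return head if sep else cut


# Hypothetical transcripts; both are limited to the same budget.
transcript_a = "I hear how frustrated you are, and I want to understand."
transcript_b = "That sounds hard."
limit = 30
print(truncate_for_pairwise(transcript_a, limit))
print(truncate_for_pairwise(transcript_b, limit))
```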

---

2. Installation & Requirements

1. Clone this repository:

git clone https://github.com/EQ-bench/eqbench3.git
cd eqbench3

2. (Optional) Create a virtual environment and activate it:

python -m venv venv
source venv/bin/activate

3. Install dependencies:

pip install -r requirements.txt

4. Configure your API keys in .env:

cp .env.example .env
# then edit .env with your API keys, e.g.:
# TEST_API_KEY=sk-...
# JUDGE_API_KEY=sk-...
  • TEST_API_KEY & TEST_API_URL are used when calling the tested model.
  • JUDGE_API_KEY & JUDGE_API_URL are used by the judge model.
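Before starting a long run, it can help to check that the four variables named above are actually set. The sketch below assumes they are already exported into the process environment (it does not parse `.env` itself) and treats all four as required, which may be stricter than the script. The function name and the returned layout are hypothetical.

```python
import os

REQUIRED_VARS = ("TEST_API_KEY", "TEST_API_URL", "JUDGE_API_KEY", "JUDGE_API_URL")


def load_api_settings(env=None):
    """Return test/judge API settings, raising if any variable is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "test": {"api_key": env["TEST_API_KEY"], "api_url": env["TEST_API_URL"]},
        "judge": {"api_key": env["JUDGE_API_KEY"], "api_url": env["JUDGE_API_URL"]},
    }


# Placeholder values for demonstration only; real keys belong in .env.
demo_env = {
    "TEST_API_KEY": "sk-test-placeholder",
    "TEST_API_URL": "https://example.invalid/v1",
    "JUDGE_API_KEY": "sk-judge-placeholder",
    "JUDGE_API_URL": "https://example.invalid/v1",
}
settings = load_api_settings(demo_env)
print(sorted(settings))
```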

---

3. Quickstart

1. Run a single iteration of EQ-Bench (Rubric Scoring only, no Elo):

python eqbench3.py \
--test-model openai/gpt-4.1-mini \
--model-name gpt-4.1-mini-demo-run \
--judge-model anthropic/claude-3.7-sonnet \
--no-elo \
--iterations 1
  • Runs 1 iteration of every scenario, scoring them with the rubric.
  • Data is recorded in eqbench3_runs.json, and results are printed to the console.

2. Full benchmark run with Elo to place the model on the leaderboard:

python eqbench3.py \
--test-model openai/gpt-4.1-mini \
--model-name my-gpt4-run \
--judge-model anthropic/claude-3.7-sonnet
  • After the roleplay scenarios are completed, the script runs a multi-stage pairwise pass (comparing the evaluated model to known models in local+leaderboard data).
  • Matchup results & ELO ratings are stored in elo_results_eqbench3.json and printed to the console.
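When launching several runs from a script, the two commands above can be assembled programmatically. The sketch below only builds the argument list from the flags shown in this section; the helper name is hypothetical, and it assumes the command is executed from the repository root.

```python
import shlex
import sys


def build_eqbench_command(test_model, model_name=None, judge_model=None,
                          no_elo=False, iterations=None):
    """Build an argv list for eqbench3.py using only the flags shown above."""
    cmd = [sys.executable, "eqbench3.py", "--test-model", test_model]
    if model_name is not None:
        cmd += ["--model-name", model_name]
    if judge_model is not None:
        cmd += ["--judge-model", judge_model]
    if no_elo:
        cmd.append("--no-elo")
    if iterations is not None:
        cmd += ["--iterations", str(iterations)]
    return cmd


# Rubric-only run, equivalent to the first Quickstart command.
cmd = build_eqbench_command(
    "openai/gpt-4.1-mini",
    model_name="gpt-4.1-mini-demo-run",
    judge_model="anthropic/claude-3.7-sonnet",
    no_elo=True,
    iterations=1,
)
print(shlex.join(cmd))
# To actually launch it: subprocess.run(cmd, check=True)
```

Passing a list rather than a shell string avoids quoting problems if a model name ever contains spaces or shell metacharacters.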

---

4. Running the Benchmark

You interact with the main script [eqbench3.py](./eqbench3.py). It orchestrates:

  • Roleplay scenarios (multi-turn, plus self-debrief).
  • Rubric scoring (judge LLM reads final transcripts and grades on several criteria).
  • ELO analysis (judge LLM does pairwise comparisons, then solves rating with TrueSkill).

Command-Line Arguments

| Argument | Description |
|---|---|
| --test-model (required) | API model identifier for the tested model (e.g. openai/gpt-4.1-mini). |
| --model-name | Logical name used for storing ELO data (defaults to --test-model if not supplied). It must be unique to avoid collisions in ELO data. |
| --judge-model | Model to be used as the judge for ELO and/or rubric scoring. If --no-elo and --no-rubric are both set, you can skip this. |
| --runs-file | Local runs data file (.json or .json.gz) storing scenario transcripts & statuses. Default: eqbench3_runs.json. |
| --elo-results-file | Local ELO data file for storing pairwise… |

Excerpt shown — open the source for the full document.
