What does this fork signal mean?

Nous Research forked EQ-bench/creative-writing-bench as NousResearch/creative-writing-bench. This fork signal points to upstream code the lab may be inspecting, patching, or building on. High-signal details: repo NousResearch/creative-writing-bench · parent EQ-bench/creative-writing-bench · Routine fork, low traction. onlylabs links this event to 1 captured evidence page and 6 related fork signals.

Nous Research Fork: NousResearch/creative-writing-bench

Captured source

source ↗

GitHub/github.com/NousResearch/creative-writing-bench

NousResearch/creative-writing-bench repository metadata

published Aug 6, 2025 · seen Jun 6 · captured Jun 11 · http 200 · method plain

NousResearch/creative-writing-bench

Language: HTML

Stars: 4

Forks: 3

Open issues: 0

Created: 2025-08-06T18:01:51Z

Pushed: 2025-08-12T15:39:08Z

Default branch: main

Fork: yes

Parent repository: EQ-bench/creative-writing-bench

Archived: no

README:

Creative Writing Benchmark v3

🎨 Welcome to the Creative Writing Benchmark v3 repository! This benchmark evaluates the creative writing capabilities of large language models using a hybrid rubric and Elo scoring system, designed for enhanced discrimination, especially at the top end of model performance. This is the system used for the Creative Writing leaderboard on EQ-Bench.com.

How the Benchmark Works

The evaluation process involves several steps:

1. Generation: The model under test generates responses to 32 distinct writing prompts across 3 iterations (96 items total). Generation uses a temperature of 0.7 and min_p of 0.1 to encourage creativity while maintaining some consistency.
2. Rubric Scoring: Each generated piece is individually assessed by a judge model (Anthropic's Claude 3.7 Sonnet recommended for leaderboard parity) against a comprehensive rubric.
3. Initial Elo Inference: The aggregate rubric score is used to estimate an initial Elo rating for the model being evaluated relative to existing models.
4. Pairwise Matchups (Sparse): The model is compared against neighboring models on the leaderboard using pairwise matchups. The judge determines the better output across several criteria, assigning a score margin (using + symbols).
5. Glicko Calculation: Elo scores are calculated using the Glicko-2 rating system, modified to incorporate the win margin (number of +'s) from pairwise comparisons. This process loops until model positions stabilize.
6. Pairwise Matchups (Comprehensive): More thorough pairwise comparisons are conducted with the model's final neighbors.
7. Final Elo Calculation: The definitive leaderboard Elo score is computed based on all comparisons.
8. Normalization: Raw Elo scores are normalized by anchoring specific models (e.g., deepseek/deepseek-r1 to 1500, mistralai/ministral-3b to 200) to ensure comparability over time.
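The sampling settings in step 1 can be illustrated with a small sketch. It is not taken from this repository, which sends these parameters to an API. It only shows what a temperature of 0.7 combined with a min_p of 0.1 typically does to a token distribution. The function name `min_p_filter`, the example logits, and the choice to apply temperature before the min_p cutoff are all assumptions; inference engines differ on that ordering.

```python
import math


def min_p_filter(logits, temperature=0.7, min_p=0.1):
    """Illustrative only: temperature-scale logits, then drop tokens whose
    probability is below min_p times the probability of the top token."""
    # Temperature scaling (assumed to happen before the min_p cutoff).
    scaled = [x / temperature for x in logits]

    # Numerically stable softmax.
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep tokens at or above the relative threshold, zero out the rest.
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]

    # Renormalize so the surviving probabilities sum to 1.
    kept_total = sum(kept)
    return [p / kept_total for p in kept]


# Hypothetical logits for a three-token vocabulary.
probs = min_p_filter([2.0, 1.0, -3.0])
print(probs)
```

With these example logits the third token falls below the cutoff and is removed, while the two plausible tokens remain available for sampling.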

More info here: https://eqbench.com/about.html#creative-writing-v3
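The normalization in step 8 can be sketched as a two-point rescaling. The README only states that specific models are anchored to fixed values; treating this as a linear map is an assumption, as are the function name `make_anchor_normalizer` and the raw ratings below, which are invented for illustration.

```python
def make_anchor_normalizer(raw_high, raw_low, target_high=1500.0, target_low=200.0):
    """Build a linear map sending raw_high -> target_high and raw_low -> target_low.

    Assumption: anchoring is a two-point linear rescale. The benchmark's
    actual normalization code may differ.
    """
    if raw_high == raw_low:
        raise ValueError("anchor models must have different raw ratings")
    scale = (target_high - target_low) / (raw_high - raw_low)

    def normalize(raw):
        return target_low + (raw - raw_low) * scale

    return normalize


# Hypothetical raw ratings; only the two anchor model names come from the README.
raw_ratings = {
    "deepseek/deepseek-r1": 1620.0,
    "mistralai/ministral-3b": 980.0,
    "example/test-model": 1300.0,
}

normalize = make_anchor_normalizer(
    raw_high=raw_ratings["deepseek/deepseek-r1"],
    raw_low=raw_ratings["mistralai/ministral-3b"],
)
normalized = {name: normalize(raw) for name, raw in raw_ratings.items()}
print(normalized)
```

Because both anchors are pinned, a model's normalized score stays comparable across runs even if the raw rating scale drifts as new models are added.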

Key Features

Hybrid Scoring: Combines isolated rubric scoring with more discriminative pairwise Elo comparisons.
Glicko-2 System: Uses a robust rating system that accounts for rating uncertainty and volatility, adapted to weight win margins.
Discriminative Prompts: Prompts are designed to challenge models in areas like humor, romance, spatial awareness, and unique perspectives.
Bias Mitigation: Incorporates strategies to reduce known LLM judge biases (Length, Position, Verbosity, Poetic Incoherence).
Iteration-Based: Runs multiple iterations per prompt to account for generation variability.
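To show what weighting win margins can look like, here is a minimal sketch that uses plain Elo rather than Glicko-2. This is a deliberate simplification: the benchmark uses a modified Glicko-2, which also tracks rating deviation and volatility. The mapping from the number of + symbols to a fractional outcome, the maximum of 5 pluses, and the K-factor of 32 are all assumptions chosen for illustration.

```python
def margin_to_score(plus_count, max_plus=5):
    """Map a win margin (count of '+') to the winner's outcome in (0.5, 1.0].

    Assumption: a linear mapping with a hypothetical maximum of 5 pluses.
    """
    if not 1 <= plus_count <= max_plus:
        raise ValueError("plus_count must be between 1 and max_plus")
    return 0.5 + 0.5 * plus_count / max_plus


def expected_score(rating_a, rating_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, plus_count, a_won, k=32.0, max_plus=5):
    """Plain Elo update where the outcome is scaled by the win margin."""
    winner_score = margin_to_score(plus_count, max_plus)
    score_a = winner_score if a_won else 1.0 - winner_score
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Two equally rated models: a narrow win moves ratings less than a wide one.
narrow = elo_update(1500.0, 1500.0, plus_count=1, a_won=True)
wide = elo_update(1500.0, 1500.0, plus_count=5, a_won=True)
print(narrow, wide)
```

The design point is that a decisive judgment carries more information than a marginal one, so it shifts the ratings further.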

Usage

Prerequisites

Python 3.x
API keys for the test and judge models (compatible with OpenAI/OpenRouter API format).
Required Python packages (listed in requirements.txt).

Setup

1. Clone the repository:

git clone https://github.com/EQ-bench/creative-writing-bench.git
cd creative-writing-bench

2. Install dependencies:

pip install -r requirements.txt
# Or manually:
# pip install requests python-dotenv numpy scipy tqdm glicko2 nltk joblib

You also need to download NLTK data:

import nltk
nltk.download('punkt')
nltk.download('cmudict')

3. Configure API Keys:

Copy the example environment file: cp .env.example .env
Edit the .env file and add your API keys and desired endpoint URLs for the test and judge models. You can also adjust timeouts and retries here.

Running the Benchmark

Execute the main script with your desired parameters. For a leaderboard-comparable score, use the recommended judge model and the provided runs file:

python3 creative_writing_bench.py \
--test-model "your-model-provider/your-model-name" \
--judge-model "anthropic/claude-3.7-sonnet" \
--runs-file "creative_bench_runs.json" \
--creative-prompts-file "data/creative_writing_prompts_v3.json" \
--run-id "my_model_run_1" \
--threads 500 \
--verbosity "INFO" \
--iterations 3

Important Arguments:

--test-model: Identifier for the model you want to evaluate.
--judge-model: Identifier for the judge model (use anthropic/claude-3.7-sonnet for leaderboard scores).
--runs-file: Path to the JSON file storing run data. **Crucially, to get an Elo score comparable to the EQ-Bench leaderboard, you *must* use the creative_bench_runs.json file provided in this repository**, as it contains the necessary historical data for Elo calculation. Start with the one provided here. Subsequent runs will update this file.
--iterations: Number of generation iterations per prompt (default and recommended: 3).
--run-id: A unique prefix for this specific run attempt. Helps organize results if you run the same model multiple times.
--threads: Number of parallel threads for generation and judging (adjust based on your API rate limits and system). High numbers like 500 assume generous rate limits.
--verbosity: Logging level (e.g., DEBUG, INFO).

Understanding the Output

Progress will be logged to the console.
Detailed run data, including generated text and judge scores, is saved in the specified --runs-file (e.g., creative_bench_runs.json).
Elo analysis results, including pairwise comparisons and final ratings, are stored in elo_results.json.
The final normalized Elo score (elo_norm) for your test model will be printed at the end and saved in elo_results.json. This is the score comparable to the EQ-Bench leaderboard.
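Once a run finishes, the normalized scores can be inspected programmatically. The sketch below assumes that elo_results.json is a JSON object mapping each model name to an entry containing an `elo_norm` field. The README names the field but does not show the file's layout in this excerpt, so the structure, the helper names, and the sample values are hypothetical and should be checked against a real file.

```python
import json
from pathlib import Path


def load_results(path="elo_results.json"):
    """Read the results file written by the benchmark run."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def rank_by_elo_norm(results):
    """Return (model, elo_norm) pairs sorted from highest to lowest.

    Assumption: `results` maps model names to dicts with an "elo_norm" key.
    Entries without that key are skipped.
    """
    rows = [
        (name, entry["elo_norm"])
        for name, entry in results.items()
        if isinstance(entry, dict) and "elo_norm" in entry
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical in-memory sample standing in for load_results().
sample = {
    "example/model-a": {"elo_norm": 1210.5},
    "example/model-b": {"elo_norm": 1480.0},
    "example/metadata": {"note": "no rating here"},
}
ranking = rank_by_elo_norm(sample)
for name, score in ranking:
    print(f"{score:8.1f}  {name}")
```

In real use, replace `sample` with `load_results()` after confirming that the file matches the assumed layout.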

Benchmark Philosophy

Evaluating creative writing is inherently subjective. This benchmark aims to provide a reliable *relative* ranking by:

Using a judge model (Claude 3.7 Sonnet) known for decent literary assessment.
Employing pairwise comparisons for better discrimination than rubric scores alone.
Choosing prompts that deliberately expose model weaknesses, creating a steeper evaluation gradient.
Acknowledging and attempting to mitigate known LLM judge biases.

However, no benchmark is perfect. Always supplement scores by reading sample outputs and forming your own judgment.

Scoring System: Rubric vs. Elo

Rubric Score: An aggregate score based on judging...

Excerpt shown — open the source for the full document.

Notability

notability 2.0/10

Routine fork, low traction