Together AI, published Feb 2, 2026

Fine-tuning open LLM judges to outperform GPT-5.2


Fine-Tuning

Published 2/2/2026


Authors

Zain Hasan, Jasmine Li, Ivan Provilkov


Summary

Open-source LLM judges fine-tuned with DPO can outperform GPT-5.2 at evaluating model outputs. We trained gpt-oss 120B on 5,400 preference pairs to beat GPT-5.2's accuracy, at 15x lower cost and 14x higher speed.

A deep dive into using preference optimization to train open-source models that beat GPT-5.2. We show that fine-tuned open-source models like gpt-oss 120B and Qwen3 235B Instruct more often agree with human preference labels on a held-out evaluation set. We evaluate on RewardBench 2, which measures alignment with human judgment, not absolute correctness or ground-truth quality. The table below is a quick preview of our results. If you'd rather just see the code, feel free to jump into the cookbook!

| Model        | Baseline | + DPO Fine-tune | Cost per 1M tokens*        | Cost vs GPT-5.2 | Speed**       | Speed vs GPT-5.2 |
|--------------|----------|-----------------|----------------------------|-----------------|---------------|------------------|
| GPT-5.2      | 61.62%   | N/A             | $1.75 input / $14 output   | -               | 62.9 tok/sec  | -                |
| gpt-oss 120B | 57.91%   | 62.63%          | $0.15 input / $0.60 output | 15.3× cheaper   | 908.7 tok/sec | 14× faster       |
| Qwen3 235B   | 62.63%   | 61.28%          | $0.20 input / $0.60 output | 12.4× cheaper   | 261.6 tok/sec | 4.2× faster      |
| Llama 4 Mav  | 50.2%    | —               | $0.27 input / $0.85 output | 9.1× cheaper    | 64.7 tok/sec  | 1× faster        |

* Together AI models page
** Speed on Together AI as benchmarked by Artificial Analysis (3rd party)
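The post does not say how the "Cost vs GPT-5.2" column is derived from the per-token prices. As a rough sketch, a blended price weighted 9:1 toward input tokens happens to reproduce the published ratios; that weighting is an assumption inferred from the numbers, not something the authors state. The speed ratios are plain quotients of the tok/sec figures. The names `blended_price`, `cost_ratio` and `speed_ratio` are illustrative.

```python
# Prices (USD per 1M tokens) and speeds (tokens/sec) copied from the table above.
MODELS = {
    "GPT-5.2":      {"input": 1.75, "output": 14.00, "speed": 62.9},
    "gpt-oss 120B": {"input": 0.15, "output": 0.60,  "speed": 908.7},
    "Qwen3 235B":   {"input": 0.20, "output": 0.60,  "speed": 261.6},
    "Llama 4 Mav":  {"input": 0.27, "output": 0.85,  "speed": 64.7},
}

# ASSUMPTION: 90% of billed tokens are input tokens (a 9:1 input:output mix).
# This is inferred from the published ratios, not stated in the post.
INPUT_SHARE = 0.9


def blended_price(model: str, input_share: float = INPUT_SHARE) -> float:
    """Weighted average price per 1M tokens for a given input/output mix."""
    price = MODELS[model]
    return input_share * price["input"] + (1 - input_share) * price["output"]


def cost_ratio(model: str, baseline: str = "GPT-5.2") -> float:
    """How many times cheaper `model` is than `baseline` at the blended price."""
    return blended_price(baseline) / blended_price(model)


def speed_ratio(model: str, baseline: str = "GPT-5.2") -> float:
    """How many times faster `model` generates tokens than `baseline`."""
    return MODELS[model]["speed"] / MODELS[baseline]["speed"]


if __name__ == "__main__":
    for name in MODELS:
        if name == "GPT-5.2":
            continue
        print(f"{name}: {cost_ratio(name):.1f}x cheaper, {speed_ratio(name):.1f}x faster")
```

Under that assumed mix, the cost ratios round to 15.3, 12.4 and 9.1, matching the table. A workload with a different input/output balance would give different ratios, so it is worth re-running this with your own `input_share`.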

The LLM-as-a-judge paradox

Here's a paradox that's bothered me for some time now: we're using LLMs to evaluate LLMs. The same technology that generates hallucinations is now our primary tool for detecting them. It sounds like asking the fox to guard the henhouse 😀.

But it works. And it doesn't just work: it has become the dominant framework for evaluating LLM-powered products at scale.

The reason is simple: for most tasks, judging is easier than generating. When an LLM generates a response, it juggles complex context, follows multi-step instructions, and synthesizes information from its training data. When it evaluates a response, it performs a focused classification task: does this text contain harmful content? Is response A better than response B?

This insight raises an interesting question: if judging is a simpler task, can we fine-tune smaller, open-source models to be *better* judges than massive closed-source alternatives?

We ran the experiment. The answer is yes! In this deep dive, we'll show you how we fine-tuned open-source LLM judges to outperform GPT-5.2 on human preference alignment using Direct Preference Optimization (DPO). We'll cover:

  • The experimental setup and benchmark (RewardBench 2)
  • Baseline evaluation of 4 judge models (3 open, 1 closed)
  • DPO fine-tuning methodology and results
  • Category-level analysis revealing where each model excels and where preference tuning helped or hurt
  • Practical code to implement this yourself!

Let's dive in.

Why LLM-as-a-judge works

Before we get to the experiment, let's build intuition for why this technique is so effective.

The evaluation scaling problem

Evaluating LLM outputs is fundamentally different from evaluating traditional ML models. With a classifier, you compute accuracy against ground truth labels. With a recommender, you measure ranking quality with NDCG. But with generative text? There are many ways to be "right." A summary can be accurate without matching the reference word-for-word. A chatbot response can be helpful in different styles. Metrics like BLEU or ROUGE capture surface-level overlap but miss semantic equivalence.

Human evaluation handles these nuances, but it doesn't scale. You can't have humans review every response in production.

Enter LLM-as-a-judge

The breakthrough insight is that LLMs, trained on vast amounts of human-written text, have internalized patterns of quality, relevance, and appropriateness. By crafting the right evaluation prompt, you can activate these capabilities for focused assessment tasks.

Figure: The LLM-as-a-Judge workflow. An external LLM evaluates outputs from your production system using criteria you define.

The key is that the evaluator (judge) LLM operates independently of the generation process. It examines the output and judges it on its merits. Even if your chatbot was tricked into generating harmful content, an external evaluator can still detect this because it's performing a simpler, focused classification task.

Types of LLM judges

There are three main paradigms:

  • Pairwise Comparison: Given two responses, which is better? Useful for A/B testing models or prompts.
  • Direct Scoring: Rate a single response on a scale (1-10) or classify it (helpful/unhelpful). Useful for production monitoring.
  • Reference-Based Evaluation: Compare a response against source material or a reference answer. Essential for RAG systems and hallucination detection.

For this experiment, we focus on pairwise comparison, depicted in the flowchart below. This is the classic "LLM-as-a-Judge" setup that the technique is named after.
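To make the pairwise setup concrete, here is a minimal sketch of a pairwise judge and of the agreement-with-human-labels metric. The prompt wording, the `Verdict: A/B` output convention, and the helper names (`build_pairwise_prompt`, `parse_verdict`, `agreement_rate`) are all hypothetical and are not taken from the authors' cookbook. The `judge` argument is any callable that sends a prompt to a model and returns its text, so no specific client library is assumed.

```python
import re
from typing import Callable, Dict, Iterable, Optional

# Hypothetical prompt template; the post's actual judge prompt is not shown here.
JUDGE_TEMPLATE = """You are an impartial judge. Decide which response better answers the prompt.

[Prompt]
{prompt}

[Response A]
{response_a}

[Response B]
{response_b}

Answer with exactly one line: "Verdict: A" or "Verdict: B"."""


def build_pairwise_prompt(prompt: str, response_a: str, response_b: str) -> str:
    """Fill the judge template with one prompt and its two candidate responses."""
    return JUDGE_TEMPLATE.format(
        prompt=prompt, response_a=response_a, response_b=response_b
    )


def parse_verdict(judge_output: str) -> Optional[str]:
    """Extract 'A' or 'B' from the judge's reply; None if no verdict is found."""
    match = re.search(r"Verdict:\s*([AB])\b", judge_output, flags=re.IGNORECASE)
    return match.group(1).upper() if match else None


def agreement_rate(examples: Iterable[Dict[str, str]], judge: Callable[[str], str]) -> float:
    """Fraction of examples where the judge's verdict matches the human label.

    Each example needs: prompt, response_a, response_b, human_label ('A' or 'B').
    Unparseable judge outputs count as disagreements.
    """
    total = 0
    correct = 0
    for ex in examples:
        reply = judge(build_pairwise_prompt(ex["prompt"], ex["response_a"], ex["response_b"]))
        correct += int(parse_verdict(reply) == ex["human_label"])
        total += 1
    return correct / total if total else 0.0
```

In practice `judge` would wrap a call to whichever model is being evaluated. Counting unparseable replies as errors is a design choice: it penalizes judges that do not follow the output format instead of silently dropping those examples.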

The experiment: Can open-source judges beat GPT-5.2?

GPT-5.2 represents the current state-of-the-art in closed-source LLM judges. It's powerful, but:

  • Expensive: Per-token costs add up at scale. Open models can be deployed on your own GPUs, which is significantly more cost-effective at scale.
  • Opaque: No visibility into model weights or behavior. With an open model, you can probe the judge to understand why it behaves a certain way.
  • Vendor lock-in: Your evaluation pipeline depends on an external API.

For many of the reasons above, it would be beneficial to use open judges that we can deploy where we wish, probe as we see necessary, and continually improve. But we also don't want to leave performance on the table; we'd like to have our cake and eat it too! Here we'll see that if you have a dataset of preferences and human labels (which…
