Together AI · Published Nov 4, 2025

How to evaluate and benchmark Large Language Models (LLMs)


Published 11/4/2025


Test, compare, and understand LLM performance.

Authors

Zain Hasan


TL;DR

Learn how to evaluate and benchmark large language models using datasets like MMLU, GSM8K, and HumanEval. Going further, we’ll also explore methods and best practices for reliable, real-world LLM performance testing.

Large language models (LLMs) have transformed how we interact with AI, from powering chatbots to generating code and solving complex mathematical problems. But as these models become increasingly sophisticated, a critical question remains: how do we actually measure their capabilities and determine which models are truly better? The answer is benchmarks and evaluation frameworks: the systematic approaches we use to test, compare, and understand LLM performance.

Understanding how to properly evaluate LLMs is essential. In this blog we'll explore everything you need to know about LLM evaluation, from the fundamental principles of good benchmarks to the various evaluation methodologies used in practice. We also include practical code notebooks to help you get hands-on experience running evals for the real-world use cases we see our customers working on every day!

Why LLM benchmarks and evaluations matter

The foundation of progress

Benchmarks serve as the compass for AI development and progress. They allow us to answer fundamental questions: Is Kimi-K2 better than Claude for agentic coding tasks? How much have language models improved at mathematical reasoning over the past year? Which fine-tune of an open-source model performs the best on your internal evaluations?

Consider the release of DeepSeek R1, which made headlines for its competitive performance against frontier models. The company's claims weren't based on subjective impressions; they were backed by systematic evaluation across six different benchmarks, including AIME 2024, Codeforces, GSM8K, and GPQA Diamond. This standardized comparison allowed the AI community to quickly understand where DeepSeek R1 stood relative to previously released, established models like OpenAI's o1 and Anthropic's Claude series.

Figure source: https://huggingface.co/deepseek-ai/DeepSeek-R1

Tracking the AI revolution

Benchmarks also help us track broader trends in AI capabilities. Take the MMLU benchmark, which tests knowledge across 57 college-level subjects. By plotting MMLU performance over time, we can see that last year we passed a fascinating inflection point: open-source models now match, and in some cases outperform, closed-source systems, marking a significant shift in the AI landscape.

Figure source: https://x.com/maximelabonne/status/1972615048511250647

This convergence isn't just academic: it has profound implications for how organizations deploy AI, how researchers access cutting-edge capabilities, and how the entire AI ecosystem evolves.

Identifying capabilities and limitations

Perhaps most importantly, benchmarks help us understand not just what models can do, but also what they cannot do. They reveal blind spots, highlight areas needing improvement, and guide research priorities. This is crucial for building reliable AI systems and setting appropriate expectations for real-world deployment.

A great example is recently released models that show degraded performance on general knowledge benchmarks like OpenAI's SimpleQA. For example, Qwen3 235B outperformed many other models on most agentic and reasoning benchmarks, but when evaluated for general knowledge, it severely underperformed! This has since been rectified in the newer Instruct and Reasoning versions of Qwen3, which perform significantly better on SimpleQA.

Figure source: https://x.com/nathanhabib1011/status/1917230699582751157

What makes a good LLM benchmark? Five key principles.

Not all benchmarks are created equal. The most valuable evaluation frameworks share five key characteristics that make them reliable indicators of model capability.

1. Difficulty: The moving target challenge

A good benchmark must be challenging enough to distinguish between different models. This might seem obvious, but it's harder to achieve than you might think. Consider MATH, a benchmark featuring competition-level mathematical problems (shown in green in the plot below). When it was first introduced, state-of-the-art models achieved single-digit accuracy scores; the problems seemed nearly impossible for AI systems to solve. Fast-forward to today, and models achieve over 90% accuracy on the same benchmark. What was once a discriminating test has become routine for modern systems.

This illustrates the "benchmark saturation" phenomenon: as models improve, previously challenging benchmarks become easy, requiring researchers to constantly develop new, more difficult evaluations.

The trend is accelerating, as the same plot shows. While it took nearly four years for models to achieve high performance on MMLU, newer benchmarks like GPQA (PhD-level questions) saw rapid improvement within just a year. This acceleration reflects the unprecedented pace of AI development we're witnessing and highlights the importance of having increasingly difficult benchmarks that can help us distinguish between the best models.
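One way to make the "challenging enough to distinguish between different models" requirement concrete is to check whether the gap between two accuracy scores is larger than the sampling noise of the benchmark itself. The sketch below is illustrative only and is not taken from the post's notebooks: the function names, the 500-question benchmark size, and the accuracy values are all hypothetical, and it assumes an unpaired two-proportion z-test with a normal approximation, which is only one of several reasonable ways to compare models.

```python
import math


def accuracy_std_error(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy measured on n_questions items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


def is_distinguishable(
    acc_a: float, acc_b: float, n_questions: int, z_threshold: float = 1.96
) -> bool:
    """Return True if the accuracy gap exceeds sampling noise.

    Assumption: both models are scored on the same number of questions and
    treated as independent samples (unpaired test, normal approximation).
    """
    se_diff = math.sqrt(
        accuracy_std_error(acc_a, n_questions) ** 2
        + accuracy_std_error(acc_b, n_questions) ** 2
    )
    if se_diff == 0.0:
        # Both scores sit exactly at 0% or 100%, so there is no sampling noise.
        return acc_a != acc_b
    return abs(acc_a - acc_b) / se_diff > z_threshold


if __name__ == "__main__":
    n = 500  # hypothetical benchmark size

    # Hypothetical scores on a benchmark that is still hard.
    print("hard benchmark:", is_distinguishable(0.40, 0.55, n))

    # Hypothetical scores on a benchmark where models cluster near the ceiling.
    print("saturated benchmark:", is_distinguishable(0.94, 0.95, n))
```

With these made-up numbers, the 15-point gap on the harder benchmark clears the threshold, while the 1-point gap near the ceiling does not. That is the practical cost of saturation: once top models cluster within a point or two of each other, the benchmark can no longer rank them reliably.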

Figure source: https://x.com/_jasonwei/status/1889096555254456397

2. Diversity: Beyond single-domain testing

Large language models are general-purpose systems that people use for everything from entertainment to work, so their evaluation should reflect this breadth. A benchmark that only tests mathematical reasoning might miss crucial limitations in common sense reasoning or creative writing. The MixEval framework illustrates this principle well, showing how different domains, from STEM…
