Repo: Databricks (DBRX), published Sep 17, 2024

databricks/judges

Python


Description: A small library of LLM judges

Language: Python

License: Apache-2.0

Stars: 338

Forks: 35

Open issues: 0

Created: 2024-09-17T04:21:03Z

Pushed: 2025-07-31T15:07:36Z

Default branch: main

Fork: no

Archived: yes

README:

judges ‍⚖️

1. [Overview](#overview)
2. [Installation](#installation)
3. [API](#api)

  • [Types of Judges](#types-of-judges)
  • [Classifiers](#classifiers)
  • [Graders](#graders)
  • [Using Judges](#using-judges)
  • [Classifier Judges](#classifier-judges)
  • [Combining Judges](#combining-judges)
  • [Jury Object](#jury-object)

4. [Usage](#usage)

  • [Pick a model](#pick-a-model)
  • [Send data to an LLM](#send-data-to-an-llm)
  • [Use a judges classifier LLM as an evaluator model](#use-a-judges-classifier-llm-as-an-evaluator-model)
  • [Use a Jury for averaging and diversification](#use-a-jury-for-averaging-and-diversification)
  • [Use AutoJudge to create a custom LLM judge](#use-autojudge-to-create-a-custom-llm-judge)

5. [Creating Custom Judges](#creating-custom-judges)
6. [CLI](#cli)
7. [Appendix of Judges](#appendix)

  • [Classifiers](#classifiers)
  • [Graders](#graders)

Overview

judges is a small library for using and creating LLM-as-a-Judge evaluators. It aims to provide a curated set of research-backed LLM evaluators in a low-friction format across a variety of use cases. You can use them off-the-shelf or as inspiration for building your own LLM evaluators.

Installation

pip install judges

API

Types of Judges

The library provides two types of judges:

1. Classifiers: Return boolean values.

  • True indicates the inputs passed the evaluation.
  • False indicates the inputs did not pass the evaluation.

2. Graders: Return scores on a numerical or Likert scale.

  • Numerical scale: 1 to 5
  • Likert scale: terrible, bad, average, good, excellent

Using Judges

All judges can be used by calling the .judge() method. This method accepts the following parameters:

  • input: The input to be evaluated.
  • output: The output to be evaluated.
  • expected (optional): The expected result for comparison.

The .judge() method returns a Judgment object with the following attributes:

  • reasoning: The reasoning behind the judgment.
  • score: The score assigned by the judge.
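The call shape described above can be sketched without any LLM in the loop. This is a minimal stand-in, not the library's implementation: `Judgment` here is a local dataclass that mirrors the two documented attributes, and `ExactMatchJudge` is a hypothetical judge that compares strings instead of prompting a model.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Judgment:
    """Local stand-in mirroring the two documented attributes."""
    reasoning: str
    score: Union[bool, int, str]


class ExactMatchJudge:
    """Hypothetical judge: string comparison instead of an LLM call."""

    def judge(self, input: str, output: str, expected: Optional[str] = None) -> Judgment:
        # `expected` is optional, so handle the case where it is missing.
        if expected is None:
            return Judgment(reasoning="No expected answer was provided.", score=False)
        matched = output.strip().lower() == expected.strip().lower()
        reasoning = (
            "The output matches the expected answer."
            if matched
            else "The output differs from the expected answer."
        )
        return Judgment(reasoning=reasoning, score=matched)


judgment = ExactMatchJudge().judge(
    input="What is the name of the rabbit?",
    output="I don't know",
    expected="I don't know",
)
print(judgment.reasoning)
print(judgment.score)
```

A real judge from the library would replace the string comparison with a prompt sent to a model, while the calling code keeps the same shape.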

Classifier Judges

If the underlying prompt for a classifier judge produces a judgment that resembles True or False (e.g., good or bad, yes or no, 0 or 1), the judges library automatically resolves the output so that the Judgment carries only a boolean label.
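As an illustration of this kind of resolution, the sketch below maps a few True/False-like labels to booleans. The label sets and the `resolve_label` helper are assumptions made for the example; the library's actual mapping and function names may differ.

```python
# Hypothetical label sets; the library's actual mapping may differ.
TRUE_LABELS = {"true", "good", "yes", "1"}
FALSE_LABELS = {"false", "bad", "no", "0"}


def resolve_label(raw) -> bool:
    """Map a True/False-like label to a bool, raising on anything else."""
    if isinstance(raw, bool):
        return raw
    text = str(raw).strip().lower()
    if text in TRUE_LABELS:
        return True
    if text in FALSE_LABELS:
        return False
    raise ValueError(f"Cannot resolve {raw!r} to a boolean label")


for raw in ["Yes", "bad", 1, 0, True]:
    print(repr(raw), "->", resolve_label(raw))
```

Raising on an unrecognized label, rather than defaulting to False, keeps an ambiguous model response from being silently counted as a failed evaluation.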

Combining Judges

The library also provides an interface to combine multiple judges through the Jury object. The Jury object has a .vote() method that produces a Verdict.

Jury Object

  • .vote(): Combines the judgments of multiple judges and produces a Verdict.
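One way to picture how several judgments become a single Verdict is plain averaging. The sketch below is an assumption-laden illustration rather than the library's code: `Verdict` is a local dataclass with only a `score` field, and `average_vote` is a hypothetical helper that treats boolean scores as 1.0 and 0.0.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class Verdict:
    """Local stand-in for the combined result; only `score` is assumed."""
    score: float


def average_vote(scores: List[float]) -> Verdict:
    """Average individual judge scores; booleans count as 1.0 and 0.0."""
    if not scores:
        raise ValueError("At least one score is required")
    return Verdict(score=mean(float(s) for s in scores))


# Two classifier judges pass the output and one fails it.
verdict = average_vote([True, True, False])
print(round(verdict.score, 3))
```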

Usage

Pick a model

By default, judges uses `instructor` for structured outputs and model access because of its widespread use. To get started, set your OPENAI_API_KEY, or the key for whichever model provider you want to use. Refer to the instructor docs for more providers.
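Before running a judge, it can help to confirm that the provider key is present in the environment. The snippet below is a small sketch using only the standard library; `missing_keys` is a hypothetical helper and not part of judges or instructor.

```python
import os


def missing_keys(required=("OPENAI_API_KEY",), environ=None):
    """Return the names in `required` that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


missing = missing_keys()
if missing:
    print("Set these environment variables before running a judge:", ", ".join(missing))
else:
    print("Provider credentials found.")
```

For another provider, pass that provider's variable name in `required`.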

Send data to an LLM

To use the package, follow the examples in the examples directory or the code below:

from openai import OpenAI

client = OpenAI()

question = "What is the name of the rabbit in the following story. Respond with 'I don't know' if you don't know."

story = """
Fig was a small, scruffy dog with a big personality. He lived in a quiet little town where everyone knew his name. Fig loved adventures, and every day he would roam the neighborhood, wagging his tail and sniffing out new things to explore.

One day, Fig discovered a mysterious trail of footprints leading into the woods. Curiosity got the best of him, and he followed them deep into the trees. As he trotted along, he heard rustling in the bushes and suddenly, out popped a rabbit! The rabbit looked at Fig with wide eyes and darted off.

But instead of chasing it, Fig barked in excitement, as if saying, "Nice to meet you!" The rabbit stopped, surprised, and came back. They sat together for a moment, sharing the calm of the woods.

From that day on, Fig had a new friend. Every afternoon, the two of them would meet in the same spot, enjoying the quiet companionship of an unlikely friendship. Fig's adventurous heart had found a little peace in the simple joy of being with his new friend.
"""

# set up the input prompt
input = f'{story}\n\nQuestion:{question}'

# write down what the model is expected to respond with
# NOTE: not all judges require an expected answer. refer to the implementations
expected = "I don't know"

# get the model output
output = client.chat.completions.create(
    model='gpt-4o-mini',
    messages=[
        {
            'role': 'user',
            'content': input,
        },
    ],
).choices[0].message.content

Use a judges classifier LLM as an evaluator model

from judges.classifiers.correctness import PollMultihopCorrectness

# use the correctness classifier to determine if the first model
# answered correctly
correctness = PollMultihopCorrectness(model='openai/gpt-4o-mini')

judgment = correctness.judge(
    input=input,
    output=output,
    expected=expected,
)
print(judgment.reasoning)
# The 'Answer' provided ('I don't know') matches the 'Reference' text which also states 'I don't know'. Therefore, the 'Answer' correctly corresponds with the information given in the 'Reference'.

print(judgment.score)
# True

Use a Jury for averaging and diversification

A jury of LLMs can produce more diverse results and lets you combine the judgments of multiple LLMs.

from judges import Jury
from judges.classifiers.correctness import PollMultihopCorrectness, RAFTCorrectness

poll = PollMultihopCorrectness(model='openai/gpt-4o')
raft = RAFTCorrectness(model='openai/gpt-4o-mini')

jury = Jury(judges=[poll, raft], voting_method="average")

verdict = jury.vote(
    input=input,
    output=output,
    expected=expected,
)
print(verdict.score)

Use AutoJudge to create a custom LLM judge

autojudge is an extension to the judges library that builds on our previous work aligning judges to human feedback -- given a labeled dataset with feedback and a natural language description of an evaluation task, autojudge

