Fork | Fireworks AI | published Jul 26, 2024

fw-ai/alpaca_eval

forked from tatsu-lab/alpaca_eval


Description: An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.

Language: Jupyter Notebook

License: Apache-2.0

Stars: 0

Forks: 0

Open issues: 0

Created: 2024-07-26T16:11:21Z

Pushed: 2024-07-26T17:34:43Z

Default branch: main

Fork: yes

Parent repository: tatsu-lab/alpaca_eval

Archived: no

README:

AlpacaEval: An Automatic Evaluator for Instruction-following Language Models

AlpacaEval 2.0 with length-controlled win-rates (paper) has a Spearman correlation of 0.98 with ChatBot Arena while costing less than $10 of OpenAI credits to run and running in less than 3 minutes. Our goal is to have a benchmark for chat LLMs that is: fast (

---

Updates:

:tada: Length-controlled Win Rates are out and used by default! This increases the correlation with ChatBot Arena from 0.93 to 0.98, while significantly decreasing length gameability. The raw win rates are still shown on the website and the CLI. More details [here](#length-controlled-win-rates).

:tada: AlpacaEval 2.0 is out and used by default! We improved the auto-annotator (better and cheaper) and use GPT-4 preview as baseline. More details [here](#alpacaeval-20). For the old version, set your environment variable IS_ALPACA_EVAL_2=False.
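As a minimal sketch of the environment-variable switch mentioned above, the following Python helper copies an environment mapping and sets IS_ALPACA_EVAL_2 to False before invoking the CLI. The helper names `legacy_env` and `run_legacy_eval` are hypothetical and not part of the package; the sketch assumes the `alpaca_eval` command is installed and reads the variable at startup.

```python
import os
import subprocess


def legacy_env(base=None):
    """Copy an environment mapping and request the old (pre-2.0) evaluator."""
    env = dict(os.environ if base is None else base)
    env["IS_ALPACA_EVAL_2"] = "False"  # variable name taken from the README
    return env


def run_legacy_eval(outputs_path):
    """Invoke the alpaca_eval CLI (assumed to be installed) with the flag set."""
    cmd = ["alpaca_eval", "--model_outputs", outputs_path]
    return subprocess.run(cmd, env=legacy_env(), check=True)


# Only inspect the environment here; run_legacy_eval needs the CLI and an API key.
print(legacy_env({"PATH": "/usr/bin"})["IS_ALPACA_EVAL_2"])  # prints "False"
```

Passing a modified copy through `env=` keeps the change local to the child process, so other evaluations started from the same script still use the default version.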

---

Table of Contents

1. [Overview](#overview)
2. [Quick Start](#quick-start)
3. [Leaderboards and how to interpret them](#leaderboards-and-how-to-interpret-them)

  • [Models](#models)
  • [Evaluators](#evaluators)

4. [Use-cases](#use-cases)

  • [Evaluating a model](#evaluating-a-model)
  • [Making a new leaderboard](#making-a-new-leaderboard)
  • [Making a new evaluator](#making-a-new-evaluator)

5. [Contributing](#contributing)

  • [Contributing a model](#contributing-a-model)
  • [Contributing an evaluator](#contributing-an-evaluator)
  • [Contributing an eval set](#contributing-an-eval-set)
  • [Contributing a completion function](#contributing-a-completion-function)

6. [Limitations](#limitations)
7. [Analysis](#additional-analysis-and-plots)

  • [Analyzing an evaluator](#analyzing-an-evaluator)
  • [Analyzing an eval set](#analyzing-an-eval-set)

8. [Citation](#citation)
9. [Additional information](#additional-information)

  • [Length-controlled win rates](#length-controlled-win-rates)
  • [AlpacaEval 2.0](#alpacaeval-20)
  • [Data Release](#data-release)
  • [Differences with AlpacaFarm](#differences-with-alpacafarm)
  • [Related work](#related-work)
  • [Interpreting annotations](#interpreting-annotations)
  • [Major updates](#major-updates)

Overview

Evaluation of instruction-following models (e.g., ChatGPT) typically requires human interactions. This is time-consuming, expensive, and hard to replicate. AlpacaEval is an LLM-based automatic evaluation that is fast, cheap, replicable, and validated against 20K human annotations. It is particularly useful for model development. Although we improved over prior automatic evaluation pipelines, there are still fundamental [limitations](#limitations) like the preference for longer outputs. AlpacaEval provides the following:

evaluation set. Caution: Automatic evaluators (e.g. GPT-4) may be biased towards models that generate longer outputs and/or that were fine-tuned on the model underlying the evaluator (e.g. GPT-4).

  • [Automatic evaluator](#evaluators): an automatic evaluator that has high agreement with humans (validated on 20K annotations). We evaluate a model by measuring the fraction of times a powerful LLM (e.g. GPT-4) prefers the outputs from that model over outputs from a reference model. Our evaluators enable caching and output randomization by default.

  • [Toolkit for building automatic evaluators](#analysis): a simple interface for building advanced automatic evaluators (e.g. with caching, batching, or multi-annotators) and analyzing them (quality, price, speed, statistical power, bias, variance etc).

  • [Human evaluation data](#data-release): 20K human preferences between a given and reference model on the AlpacaFarm evaluation set. 2.5K of these are cross-annotations (4 humans annotating the same 650 examples).

of AlpacaFarm's evaluation set, where "instructions" and "inputs" are merged into one field, and reference outputs are longer. [Details here](#data-release).
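The win-rate idea described in the list above (the fraction of times the judge prefers a model's outputs over the reference model's) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the package's implementation: the function names, the "model"/"reference" labels, and the verdicts are hypothetical, and `randomized_pair` shows only one plausible reading of "output randomization", namely shuffling which output the judge sees first.

```python
import random
from typing import Sequence, Tuple


def win_rate(preferences: Sequence[str]) -> float:
    """Fraction of comparisons in which the judge preferred the evaluated model."""
    if not preferences:
        raise ValueError("need at least one comparison")
    wins = sum(1 for p in preferences if p == "model")
    return wins / len(preferences)


def randomized_pair(
    model_output: str, reference_output: str, rng: random.Random
) -> Tuple[str, str, bool]:
    """Return both outputs in random order, plus whether the model's comes first."""
    if rng.random() < 0.5:
        return model_output, reference_output, True
    return reference_output, model_output, False


# Hypothetical judge verdicts for five instructions.
verdicts = ["model", "reference", "model", "model", "reference"]
print(f"win rate: {win_rate(verdicts):.2f}")  # 3 wins out of 5 -> win rate: 0.60
```

Randomizing the presentation order is a common way to keep a judge's position preference from systematically favoring one side; the returned flag lets the caller map the verdict back to the right model.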

When to use and not use AlpacaEval?

When to use AlpacaEval? Our automatic evaluator is a quick and cheap proxy for human evaluation of simple instruction-following tasks. It is useful if you have to run many evaluations quickly, e.g., during model development.

When not to use AlpacaEval? As with any other automatic evaluator, AlpacaEval should not replace human evaluation in high-stakes decision-making, e.g., to decide on model release. In particular, AlpacaEval is limited by the fact that (1) the instructions in the eval set might not be representative of advanced usage of LLMs; (2) automatic evaluators may have biases such as favoring style over factuality of the answer; and (3) AlpacaEval does not measure the risks that a model could cause. Details in [limitations](#limitations).

Quick Start

To install the stable release, run

pip install alpaca-eval

To install the nightly version, run

pip install git+https://github.com/tatsu-lab/alpaca_eval

Then you can use it as follows:

export OPENAI_API_KEY=<your_api_key> # for more complex configs, e.g. using Azure or switching clients see client_configs/README.md
alpaca_eval --model_outputs 'example/outputs.json'

This will print the leaderboard to the console, and save both the leaderboard and the annotations to the same directory as the model_outputs file. Important parameters are the following:

  • model_outputs: A path to a JSON file for the outputs of the model to add to the leaderboard. Each dictionary should contain the keys instruction and output.

  • annotators_config: This is the annotator to use. We recommend using weighted_alpaca_eval_gpt4_turbo (default for AlpacaEval 2.0), which…
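To make the model_outputs format concrete, here is a minimal Python sketch that writes such a file. It assumes the file holds a JSON list of dictionaries (the text above only says that each dictionary needs the keys instruction and output); the helper `write_model_outputs` and the example records are hypothetical.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = {"instruction", "output"}  # keys named in the README


def write_model_outputs(records, path):
    """Check that every record has the required keys, then write them as JSON."""
    for i, rec in enumerate(records):
        missing = REQUIRED_KEYS - rec.keys()
        if missing:
            raise ValueError(f"record {i} is missing keys: {sorted(missing)}")
    Path(path).write_text(json.dumps(records, indent=2), encoding="utf-8")


# Hypothetical model outputs; real files would hold one entry per eval instruction.
records = [
    {"instruction": "Name three primary colors.", "output": "Red, yellow, and blue."},
    {"instruction": "What is 2 + 2?", "output": "4"},
]
out_path = Path(tempfile.mkdtemp()) / "outputs.json"
write_model_outputs(records, out_path)
loaded = json.loads(out_path.read_text(encoding="utf-8"))
print(len(loaded), "records written to", out_path)
```

The resulting path is what would be passed to `--model_outputs`. Validating the keys before writing surfaces a malformed record early, before any paid annotation calls are made.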
