Repo: OpenBMB (MiniCPM), published Aug 18, 2023

OpenBMB/UltraFeedback

Python

Description: A large-scale, fine-grained, diverse preference dataset (and models).

Language: Python

License: MIT

Stars: 368

Forks: 17

Open issues: 12

Created: 2023-08-18T02:17:47Z

Pushed: 2023-12-29T16:39:19Z

Default branch: main

Fork: no

Archived: no

README:

News

  • [2023/12/29]: We have fixed the overall_score as pointed out in this issue and updated the dataset on HuggingFace. Please refer to the "Update" section below for details.
  • [2023/09/26]: UltraRM unleashes the power of UltraLM-13B-v2.0 and UltraLM-13B! Simple best-of-16 sampling achieves 92.30% (UltraLM2, 🥇 in 13B results) and 91.54% (UltraLM, 🥇 in LLaMA-1 results) win rates against text-davinci-003 on the AlpacaEval benchmark!
  • [2023/09/26]: We release the UltraFeedback dataset, along with the UltraFeedback-powered reward model UltraRM and critique model UltraCM! Both set new SOTAs among open-source models!

Update

The initial version of UltraFeedback includes 2628 completions that were assigned an overall score of 10. However, as pointed out in Issue #8, many of these completions should have been assigned a score of 1. Intuitively, a completion with an overall score of 10 should be high-quality, which should be reflected in its averaged fine-grained scores. Hence, to rectify the scores, we processed all the potentially faulty completions based on their fine-grained scores. Specifically,

  • Completions with fine-grained scores 4 have been deemed to accurately represent a score of 10 and thus their overall_score has been left unchanged.
  • For the remaining completions, we have conducted a re-annotation process based on the original critique, with slight modifications to the prompts.

Please refer to ./src/fix_overall_score_issue.py for implementation details.
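
The partitioning step described above can be pictured roughly as follows. This is a minimal sketch, not the code in ./src/fix_overall_score_issue.py: the record layout (`overall_score` and `fine_grained` keys), the function name and the `keep_rule` predicate are all hypothetical, and the exact comparison applied to the averaged fine-grained score is left to the caller because it is not spelled out in the text above.

```python
from statistics import mean


def partition_suspects(completions, keep_rule):
    """Split completions scored 10 into (kept, to_reannotate).

    `completions` is a list of dicts with hypothetical keys
    `overall_score` (number) and `fine_grained` (list of per-aspect ratings).
    `keep_rule` is a caller-supplied predicate on the averaged
    fine-grained score; the real threshold is an assumption left open here.
    """
    kept, to_reannotate = [], []
    for completion in completions:
        if completion["overall_score"] != 10:
            continue  # only completions scored 10 are potentially faulty
        avg = mean(completion["fine_grained"])
        if keep_rule(avg):
            kept.append(completion)           # score of 10 looks justified
        else:
            to_reannotate.append(completion)  # send back for re-annotation
    return kept, to_reannotate
```

Completions in the second list would then be re-scored from their original critique, which is the part this sketch does not attempt to reproduce.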

Introduction

UltraFeedback is a large-scale, fine-grained, diverse preference dataset, used for training powerful reward models and critic models. We collect about 64k prompts from diverse resources (including UltraChat, ShareGPT, Evol-Instruct, TruthfulQA, FalseQA, and FLAN, see [here](#instruction-sampling) for dataset statistics). We then use these prompts to query multiple LLMs (see [here](#model-sampling) for model lists) and generate 4 different responses for each prompt, resulting in a total of 256k samples.

To collect high-quality preference and textual feedback, we design a fine-grained annotation instruction, which contains 4 different aspects, namely instruction-following, truthfulness, honesty and helpfulness. We then ask GPT-4 to annotate the collected samples based on the instruction.

Features

  • Scale: UltraFeedback consists of 64k prompts, 256k responses and high-quality feedback. RLHF researchers could further construct around 340k comparison pairs to train their reward models.
  • Diversity: As a preference dataset, diversity is the core requirement for UltraFeedback. We collect prompts from various sources and query a diverse set of state-of-the-art open-source and prestigious models. To further increase diversity, we deliberately select different base models, i.e., LLaMA, Falcon, StarChat, MPT, GPT and Bard. We also apply various principles to prompt models to complete instructions in different ways.
  • High-density: UltraFeedback provides both numerical and textual feedback. Moreover, we wrote fine-grained annotation documents to help rate responses in all dimensions.
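
To make the comparison-pair figure concrete: with 4 responses per prompt there are at most C(4,2) = 6 pairs per prompt, so roughly 384k pairs for 64k prompts. The "around 340k" quoted above is lower, which would be consistent with some pairs being discarded. The sketch below assumes that tied scores are the pairs dropped; that rule, the function name and the `(text, score)` layout are illustrative assumptions, not something the README states.

```python
from itertools import combinations


def build_pairs(scored_completions):
    """Return (chosen, rejected) text pairs for one prompt.

    `scored_completions` is a list of (text, score) tuples.
    Pairs with equal scores are skipped (an assumption of this sketch).
    """
    pairs = []
    for (text_a, score_a), (text_b, score_b) in combinations(scored_completions, 2):
        if score_a == score_b:
            continue  # no preference signal in a tie
        if score_a > score_b:
            pairs.append((text_a, text_b))
        else:
            pairs.append((text_b, text_a))
    return pairs


# Four hypothetical completions for one prompt, two of them tied.
example = [("r1", 8.0), ("r2", 6.5), ("r3", 8.0), ("r4", 3.0)]
pairs = build_pairs(example)
```

Here `pairs` holds five entries rather than six, because the tie between `r1` and `r3` carries no preference.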

Dataset Construction

Instruction Sampling

We sample 63,967 instructions from 6 publicly available, high-quality datasets. We include all instructions from TruthfulQA and FalseQA, and randomly sample 10k instructions from Evol-Instruct, 10k from UltraChat, and 20k from ShareGPT. For FLAN, we adopt a stratified sampling strategy, randomly sampling 3k instructions from the "CoT" subset and 10 instructions per task from the other three subsets, excluding those with overly long instructions.

{
"evol_instruct": 10000,
"false_qa": 2339,
"flan": 20939,
"sharegpt": 19949,
"truthful_qa": 811,
"ultrachat": 9929
}
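
As a quick consistency check, the per-dataset counts above can be summed to confirm the 63,967 total and the resulting number of completions at 4 responses per instruction. The variable names are illustrative only.

```python
# Per-dataset instruction counts, copied from the statistics above.
counts = {
    "evol_instruct": 10000,
    "false_qa": 2339,
    "flan": 20939,
    "sharegpt": 19949,
    "truthful_qa": 811,
    "ultrachat": 9929,
}

total_instructions = sum(counts.values())
total_completions = total_instructions * 4  # 4 responses per instruction

print(total_instructions)  # 63967
print(total_completions)   # 255868, i.e. the "256k" quoted in the introduction
```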

Model Sampling

To prevent the reward model from overfitting to a certain text style or capturing spurious correlations between text style and rewards, we select different base models of all levels, with varying sizes, architectures and training data, to complete the instructions. We set up a pool of 17 models:

  • Commercial Models: GPT-4, GPT-3.5 Turbo, Bard
  • LLaMA family:
    1. LLaMA-2-7B-chat, LLaMA-2-13B-chat, LLaMA-2-70B-chat
    2. UltraLM-13B, UltraLM-65B
    3. WizardLM-7B-v1.2, WizardLM-13B-v1.2, WizardLM-70B-v1.0
    4. Vicuna-33B-v1.3
    5. Alpaca-7B
  • Non-LLaMA series:
    1. Falcon-40B-instruct
    2. MPT-30B-chat
    3. StarChat-Beta
    4. Pythia-12B
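
The pool and the per-instruction draw of 4 models can be sketched as below. The model names are taken from the list above; sampling without replacement (so that the 4 completions come from 4 distinct models) is an assumption of this sketch, as are the function and variable names.

```python
import random

# The 17 models listed above.
MODEL_POOL = [
    # Commercial models
    "GPT-4", "GPT-3.5 Turbo", "Bard",
    # LLaMA family
    "LLaMA-2-7B-chat", "LLaMA-2-13B-chat", "LLaMA-2-70B-chat",
    "UltraLM-13B", "UltraLM-65B",
    "WizardLM-7B-v1.2", "WizardLM-13B-v1.2", "WizardLM-70B-v1.0",
    "Vicuna-33B-v1.3", "Alpaca-7B",
    # Non-LLaMA series
    "Falcon-40B-instruct", "MPT-30B-chat", "StarChat-Beta", "Pythia-12B",
]


def sample_models(rng, k=4):
    """Draw k distinct models for one instruction (without replacement: an assumption)."""
    return rng.sample(MODEL_POOL, k)


rng = random.Random(0)  # seeded only to make the example reproducible
chosen = sample_models(rng)
```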

Principle Sampling

Following [1] and [2], we define a set of principles to explicitly align model behaviors from different aspects. We set up a pool of 4 principles: Helpfulness, Truthfulness, Honesty and Verbalized Calibration. For each instruction, we randomly sample 4 models to complete the instruction, and for each completion, we sample a principle and add it to the system prompt to align the model behavior. Because different datasets have different characteristics, not every dataset is suitable for every principle. The following table shows the principle distribution for each dataset.

| Dataset | Principle |
| ------------- | ------------------------------------------------------------ |
| Evol-Instruct | 100% Helpful |
| FalseQA | 100% Truthful |
| FLAN | 60% Helpful, 20% Truthful, 20% Verbalized Calibration |
| ShareGPT | 60% Helpful, 20% Truthful, 18% Honesty, 2% Verbalized Calibration |
| TruthfulQA | 100% Truthful |
| UltraChat | 60% Helpful, 20% Truthful, 18% Honesty, 2% Verbalized Calibration |

…
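
The per-completion principle draw can be sketched as a weighted random choice over the distribution in the table. This is an illustrative sketch only: the dictionary keys, function name and seeding are assumptions, and the FalseQA entry is read here as the Truthful principle.

```python
import random

# Percent weights per dataset, transcribed from the table above.
# Assumption: the FalseQA entry is interpreted as 100% Truthful.
PRINCIPLE_WEIGHTS = {
    "evol_instruct": {"Helpful": 100},
    "false_qa": {"Truthful": 100},
    "flan": {"Helpful": 60, "Truthful": 20, "Verbalized Calibration": 20},
    "sharegpt": {"Helpful": 60, "Truthful": 20, "Honesty": 18,
                 "Verbalized Calibration": 2},
    "truthful_qa": {"Truthful": 100},
    "ultrachat": {"Helpful": 60, "Truthful": 20, "Honesty": 18,
                  "Verbalized Calibration": 2},
}


def sample_principle(dataset, rng):
    """Pick one principle for a completion, weighted by the dataset's distribution."""
    weights = PRINCIPLE_WEIGHTS[dataset]
    return rng.choices(list(weights), weights=list(weights.values()), k=1)[0]


rng = random.Random(0)  # seeded only to make the example reproducible
# One principle per completion, 4 completions per instruction.
principles = [sample_principle("sharegpt", rng) for _ in range(4)]
```

The sampled principle would then be turned into a system prompt for the corresponding model; the prompt wording itself is not covered by this sketch.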
