RepoOpenBMB (MiniCPM)OpenBMB (MiniCPM)published Mar 30, 2024seen 5d

OpenBMB/Eurus

Python

Open original ↗

Captured source

source ↗
published Mar 30, 2024seen 5dcaptured 11hhttp 200method plain

OpenBMB/Eurus

Language: Python

License: Apache-2.0

Stars: 323

Forks: 15

Open issues: 8

Created: 2024-03-30T10:10:12Z

Pushed: 2024-09-18T15:55:39Z

Default branch: main

Fork: no

Archived: no

README:

Update News

Links

Introduction

Eurus

We release a suite of LLMs and a reward model. Eurus-70B beats GPT-3.5 Turbo in reasoning through a comprehensive benchmarking across 12 tests covering five tasks, and achieves a 33.3% pass@1 accuracy on LeetCode and 32.6% on TheoremQA, two challenging benchmarks, substantially outperforming existing open-source models by margins more than 13.3%. Besides, Eurux-8x22B's performance further improves and achieves superb reasoning performance as well as excellent chat & instruction-following capabilities. We also train a reward model that demonstrates especially strong preference modeling performance on reasoning tasks.

  • *Eurux-8x22B-NCA* and *Eurux-8x22B-KTO*: It is SFT and NCA(KTO) fine-tuned from Mixtral-8x22B on all multi-turn trajectory pairs in UltraInteract and all pairs in UltraFeedback.
  • *Eurus-7B-SFT* and *Eurus-70B-SFT*: Fine-tuned from Mistral-7B and CodeLLaMA-70B on all correct actions in UltraInteract, mixing a small proportion of UltraChat, ShareGPT, and OpenOrca examples.
  • *Eurus-7B-KTO* and *Eurus-70B-NCA*: Preference fine-tuned on UltraInteract and UltraFeedback on top of SFT models.
  • *Eurus-RM-7B*: Trained on a mixture of UltraInteract, UltraFeedback, and UltraSafety.

UltraInteract

The strong performance of Eurus can be primarily attributed to UltraInteract, a large-scale, high-quality alignment dataset specifically designed for complex reasoning tasks. For each instruction, it includes a preference tree consisting of

  • (1) reasoning chains with diverse planning strategies in a unified format
  • (2) multi-turn interaction trajectories with the environment and the critique
  • (3) pairwise data to facilitate preference learning

Structure

UltraInteract collects a preference tree for each instruction, with the instruction being the root and each action a node. A trajectory is a root-to-leaf path consisting of a sequence of actions. In each preference tree, all nodes of correct actions and all trajectories ending with correct actions can be used for SFT. Paired correct and incorrect nodes or trajectories can be used for preference learning.

Illustrative Example

Here is an illustrative example of an UltraInteract trajectory over two turns. In each turn, the actor model generates step-by-step reasoning chains, and the environment and the critique model provide observations and textual critique respectively.

Stats

Below are some statistics about UltraInteract. It consists of 86k instructions, 286k correct answers, and 219k pairs.

Evaluation

Eurux-8x22b-NCA and Eurux-8x22b-KTO

We conducted overall coding, math, reasoning, knowledge, instruction-following, and chat benchmarking. Results are shown below, with the best scores in open-source models bolded. Eurux-8x22b-NCA and Eurux-8x22b-KTO achieve superb reasoning performance as well as excellent chat & instruction-following capabilities.

| Models/Benchmarks | Coding | | | Math | | | Reasoning | Knowledge | Ins-Following | Chat | |-------------------|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:-------------:|:---------:| | | HumanEval | MBPP | LeetCode | GSMPLUS | MATH | TheoremQA | BBH (CoT) | MMLU | IFEval | MT-Bench | | GPT-3.5-Turbo | 76.8 | 82.5 | 23.3 | 61.2 | 37.8 | 35.6 | 70.1 | 70.0 | 56.6 | 7.94 | | GPT-4 | 85.4 | 83.5 | 41.8 | 85.6 | 69.7 | 52.4 | 86.7 | 86.4 | 79.7 | 8.96 | | Mixtral-8x7B-Ins | 50.6 | 50.1 | 5.6 | 49.6 | 25.9 | 20.4 | 73.5 | 70.3 | 48.8 | 8.30 | | DS-LM-67B-Chat | 70.7 | 65.7 | 20.0 | 65.0 | 41.0 | 17.9 | 78.9 | 72.3 | 52.7 | 8.35 | | QWen-1.5-72B | 71.3 | 56.9 | 15.6 | 65.4 | 43.4 | 18.5 | 78.0 | 72.9 | 53.4 | 8.61 | | Llama-3-70B-Ins | 77.4 | 66.2 | 34.4 | 72.9 | 46.8 | 26.6 | 91.7 | 79.8 | 83.2 | 9.02 | | Eurus-70b-NCA | 79.3 | 71.9 | 33.3 | 62.8 | 41.7 | 32.6 | 80.0 | 59.4 | 49.2 | 7.54 | | Eurux-8x22b-KTO | 71.3 | 68.9 | 29.4 | 68.3 | 48.4 | 35.3 | 83.6 | 75.9 | 67.1 | 8.58 | | Eurux-8x22b-NCA | 75.0 | 69.7 | 35.0 | 68.1 | 49.0 | 35.5 | 83.5 | 75.6 | 67.1 | 8.46 |

Eurus-7B and Eurus-70B

  • Eurus, both the 7B and 70B variants, achieve the best overall performance among open-source models of similar sizes. Eurus even outperforms specialized models in corresponding domains in many cases. Notably, Eurus-7B outperforms baselines that are 5× larger, and Eurus-70B achieves better performance than GPT-3.5 Turbo.
  • Preference learning with UltraInteract can further improve performance, especially in math and the multi-turn ability.

Eurus-RM-7B

  • Eurus-RM-7B stands out as the best 7B RM overall and achieves similar or better performance than much larger baselines. Particularly, it outperforms GPT-4 in certain tasks.
  • Our training objective is beneficial in improving RM performance on hard…

Excerpt shown — open the source for the full document.