What does this repo signal mean?

OpenBMB (MiniCPM) published OpenBMB/InfiniteBench (Python). This repository signal exposes tooling, eval, infrastructure, or model-adjacent work before it may appear in a launch post. High-signal details: repo OpenBMB/InfiniteBench · language Python. onlylabs links this event to 1 captured evidence page and 6 related repo signals.

Captured source

GitHub/github.com/OpenBMB/InfiniteBench

OpenBMB/InfiniteBench repository metadata

published Nov 22, 2023 · seen 5d · captured 9h · http 200 · method plain

OpenBMB/InfiniteBench

Description: Codes for the paper "∞Bench: Extending Long Context Evaluation Beyond 100K Tokens": https://arxiv.org/abs/2402.13718

Language: Python

License: MIT

Stars: 386

Forks: 33

Open issues: 10

Created: 2023-11-22T12:05:56Z

Pushed: 2024-09-25T20:06:30Z

Default branch: main

Fork: no

Archived: no

README:

Introduction

Welcome to InfiniteBench, a cutting-edge benchmark tailored for evaluating the capabilities of language models to process, understand, and reason over super long contexts (100k+ tokens). Long contexts are crucial for enhancing applications with LLMs and achieving high-level interaction. InfiniteBench is designed to push the boundaries of language models by testing them against a context length of 100k+, which is 10 times longer than traditional datasets.

Features

Loooong Context: InfiniteBench is a pioneer in testing language models with a context length of 100k+, offering an unparalleled challenge in the field.
Diverse Domain: The benchmark comprises 12 unique tasks, each crafted to assess different aspects of language processing and comprehension in extended contexts.
Specialized Test: InfiniteBench consists of tasks that state-of-the-art LLMs are known to handle when the context is shorter. This ensures that any performance degradation is caused only by the length of the context.
Real-World and Synthetic Scenarios: The tasks are a mix of real-world scenarios and synthetic constructs, ensuring a comprehensive evaluation of models. Real-world scenarios make the test pragmatic, and synthetic ones leave the space for extending the context length further with ease.

Task Composition

| Task Name | Context | # Examples | Avg Input Tokens | Avg Output Tokens | Description |
| -------------------- | ------------- | ---------- | ---------------- | ----------------- | ------------------------------------------------------------------------------------------- |
| En.Sum | Fake Book | 103 | 171.5k | 1.1k | Summarization of a fake book created with core entity substitution. |
| En.QA | Fake Book | 351 | 192.6k | 4.8 | Free-form question answering based on the fake book. |
| En.MC | Fake Book | 229 | 184.4k | 5.3 | Multiple choice questions derived from the fake book. |
| En.Dia | Script | 200 | 103.6k | 3.4 | Identification of talkers in partially anonymized scripts. |
| Zh.QA | New Book | 175 | 2068.6k | 6.3 | Question answering on a set of newly collected books. |
| Code.Debug | Code Document | 394 | 114.7k | 4.8 | Finding which function in a code repo contains a crashing error (in multiple choice form). |
| Code.Run | Synthetic | 400 | 75.2k | 1.3 | Simulating execution of multiple simple, synthetic functions. |
| Math.Calc | Synthetic | 50 | 43.9k | 43.9k | Calculations involving super-long arithmetic equations. |
| Math.Find | Synthetic | 350 | 87.9k | 1.3 | Finding special integers in a lengthy list. |
| Retrieve.PassKey[^1] | Synthetic | 590 | 122.4k | 2.0 | Retrieving hidden keys in a noisy long context. |
| Retrieve.Number | Synthetic | 590 | 122.4k | 4.0 | Locating repeated hidden numbers in a noisy long context. |
| Retrieve.KV[^2] | Synthetic | 500 | 89.9k | 22.7 | Finding the corresponding value from a dictionary and a key. |
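As a rough way to read the table, the sketch below copies its numbers into Python and checks which tasks have an average input that would not fit a given context window. The 128,000-token window is a hypothetical example, and the table does not say which tokenizer produced the counts, so treat the result as an estimate rather than a property of any particular model.

```python
# Values copied from the Task Composition table above.
# Average input length is in thousands of tokens.
AVG_INPUT_K = {
    "En.Sum": 171.5, "En.QA": 192.6, "En.MC": 184.4, "En.Dia": 103.6,
    "Zh.QA": 2068.6, "Code.Debug": 114.7, "Code.Run": 75.2,
    "Math.Calc": 43.9, "Math.Find": 87.9, "Retrieve.PassKey": 122.4,
    "Retrieve.Number": 122.4, "Retrieve.KV": 89.9,
}
NUM_EXAMPLES = {
    "En.Sum": 103, "En.QA": 351, "En.MC": 229, "En.Dia": 200,
    "Zh.QA": 175, "Code.Debug": 394, "Code.Run": 400,
    "Math.Calc": 50, "Math.Find": 350, "Retrieve.PassKey": 590,
    "Retrieve.Number": 590, "Retrieve.KV": 500,
}


def tasks_over_window(window_tokens: int) -> list[str]:
    """Return tasks whose average input alone exceeds window_tokens."""
    return [task for task, k in AVG_INPUT_K.items() if k * 1000 > window_tokens]


total_examples = sum(NUM_EXAMPLES.values())
over_128k = tasks_over_window(128_000)  # hypothetical window size

print(total_examples)
print(over_128k)
```

Because these are averages, individual examples in the remaining tasks may still exceed the window.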

How to Download Data

Click here to download data from 🤗 Huggingface directly:

Using 🤗 Datasets

Alternatively, you can download using the 🤗 Datasets library as follows.

from datasets import load_dataset, Features, Value, Sequence

# Declare the column types explicitly and pass them to load_dataset.
ft = Features({"id": Value("int64"), "context": Value("string"), "input": Value("string"), "answer": Sequence(Value("string")), "options": Sequence(Value("string"))})
dataset = load_dataset("xinrongzhang2022/InfiniteBench", features=ft)

Using Scripts

cd InfiniteBench
bash scripts/download_dataset.sh

This downloads the data directly into the data folder.

Evaluation Result

We evaluate SOTA proprietary and open-source LLMs. The results are as follows.

| Task Name | GPT-4 | YaRN-Mistral-7B | Kimi-Chat | Claude 2 | Yi-6B-200K | Yi-34B-200K | ChatGLM-3-6B-128K |
| ---------------- | ------ | --------------- | --------- | -------- | ---------- | ----------- | ----------------- |
| Retrieve.PassKey | 100% | 92.71% | 98.14% | 97.80% | 100.00% | 100.00% | 92.20% |
| Retrieve.Number | 100% | 56.61% | 95.42% | 98.14% | 94.92% | 100.00% | 80.68% |
| Retrieve.KV | 89.00% | < 5% | 53.60% | 65.40% | < 5% | < 5% | < 5% |
| En.Sum | 14.73% | 9.09% | 17.96% | 14.50% | < 5% | < 5% | < 5% |
| En.QA | 22.44% | 9.55% | 16.52% | 11.97% | 9.20% | 12.17% | < 5% |
| En.MC | 67.25% | 27.95% | 72.49% | 62.88% | 36.68% | 38.43% | 10.48% |
| En.Dia | 8.50% | 7.50% | 11.50% | 46.50% | < 5% | < 5% | < 5% |
| Zh.QA | 25.96% | 16.98% | 17.93% | 9.64% | 15.07% | 13.61% | < 5% |
| Code.Debug | 37.06% | < 5% | 17.77% | < 5% | 9.14% | 13.96% | 7.36% |
| Code.Run | 23.25% | < 5% | < 5% | < 5% | < 5% | < 5% | < 5% |
| Math.Calc | < 5% | < 5% | < 5% | < 5% | < 5% | < 5% | < 5% |
| Math.Find | 60.00% | 17.14% | 12.57% | 32.29% | < 5% | 25.71% | 7.71% |

Note:

1. The evaluation code for YaRN-Mistral-7B is our own implementation; please contact us or submit an issue if you find any problems.
2. Kimi-Chat, Claude 2, and GPT-4 are evaluated using the official API with the default configuration.
3. For Math.Calc, the values in parentheses are in units of 0.01%, because it is easy to get a very low score on this task.
4. The metric for Math.Find, Math.Calc, Code.Run, Code.Debug, En.Dia, En.MC, Retrieve.KV, Retrieve.Number, and Retrieve.PassKey is accuracy. The metric for Zh.QA and En.QA is the ROUGE F1 score. The metric for En.Sum is the rougeLsum score from the 🤗 Evaluate library.
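To make the two simpler metric families concrete, here is a minimal sketch of exact-match accuracy and a unigram-overlap F1 in the spirit of ROUGE-1. It is not the repository's scoring code: the function names, the whitespace tokenization, and the lowercasing are assumptions made for illustration, and whitespace splitting would not suit Chinese text such as Zh.QA. The rougeLsum metric used for En.Sum is not reimplemented here.

```python
from collections import Counter


def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions equal to their reference after stripping whitespace."""
    if not predictions:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(predictions)


def unigram_f1(prediction: str, reference: str) -> float:
    """F1 over overlapping unigrams, using naive whitespace tokens (an assumption)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match_accuracy(["71432", "12"], ["71432", "13"]))
print(unigram_f1("the cat sat", "the cat sat on the mat"))
```

In the second call the prediction is fully contained in the reference, so precision is 1.0 and recall is 0.5.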

Installation

pip install -r requirements.txt

How to Run

Download the dataset into the data folder (or set the --data_dir argument to the location of the dataset). The data folder structure should be as follows.

InfiniteBench
├── data
│ ├── code_debug.jsonl
│ ├── code_run.jsonl
│ ├── kv_retrieval.jsonl
│ ├── longbook_choice_eng.jsonl
│ ├── longbook_qa_chn.jsonl
│ ├── longbook_qa_eng.jsonl
│ ├── longbook_sum_eng.jsonl
│ ├── longdialogue_qa_eng.jsonl
│ ├── math_calc.jsonl
│ ├── math_find.jsonl
│ ├── number_string.jsonl
│ ├── passkey.jsonl
│ └── construct_synthetic_dataset.py
...
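Once the files are in place, a quick sanity check can catch a bad download before a long evaluation run. The sketch below reads one JSONL file line by line and reports any missing fields. The field names are taken from the Features declaration in the 🤗 Datasets snippet; that the local JSONL files use the same fields is an assumption, and iter_examples and missing_fields are hypothetical helpers, not part of the repository.

```python
import json
from pathlib import Path
from typing import Iterator

# Field names from the Features declaration; assumed to match the JSONL files.
EXPECTED_FIELDS = ("id", "context", "input", "answer", "options")


def iter_examples(path: Path) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def missing_fields(example: dict) -> list[str]:
    """Return the expected field names that are absent from an example."""
    return [name for name in EXPECTED_FIELDS if name not in example]
```

For instance, `next(iter_examples(Path("data/passkey.jsonl")))` would return the first record without loading the whole file, which matters when each context runs to 100k+ tokens.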

Then, in the src folder, execute:

python eval_yarn_mistral.py --task kv_retrieval
python eval_gpt4.py --task longbook_sum_qa
python…
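To run one script over several tasks, the commands can be generated rather than typed by hand. The sketch below only builds and prints the command lines; it does not execute them. It assumes that task names match the JSONL file stems in the data folder (as kv_retrieval does) and that ../data is the dataset location relative to src; both are assumptions, and build_command is a hypothetical helper.

```python
import shlex
from typing import Optional

# Assumed task names: the file stems listed in the data folder above.
TASKS = [
    "code_debug", "code_run", "kv_retrieval", "longbook_choice_eng",
    "longbook_qa_chn", "longbook_qa_eng", "longbook_sum_eng",
    "longdialogue_qa_eng", "math_calc", "math_find", "number_string",
    "passkey",
]


def build_command(script: str, task: str, data_dir: Optional[str] = None) -> list[str]:
    """Build the argv list for one evaluation run without executing it."""
    argv = ["python", script, "--task", task]
    if data_dir is not None:
        argv += ["--data_dir", data_dir]
    return argv


for task in TASKS:
    # "../data" is a hypothetical path; adjust it to where the dataset lives.
    print(shlex.join(build_command("eval_yarn_mistral.py", task, "../data")))
```

Each argv list could be passed to `subprocess.run` if sequential execution is wanted.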
