What does this repo signal mean?

OpenBMB (MiniCPM) published OpenBMB/UltraEval (Python). This repository signal exposes tooling, eval, infrastructure, or model-adjacent work before it may appear in a launch post. High-signal details: repo OpenBMB/UltraEval · language Python. onlylabs links this event to 1 captured evidence page and 6 related repo signals.

OpenBMB (MiniCPM) Repo: OpenBMB/UltraEval

Captured source

GitHub · github.com/OpenBMB/UltraEval

OpenBMB/UltraEval repository metadata

published Nov 15, 2023 · seen 5d · captured 9h · http 200 · method plain

OpenBMB/UltraEval

Description: [ACL 2024 Demo] Official GitHub repo for UltraEval: An open source framework for evaluating foundation models.

Language: Python

License: Apache-2.0

Stars: 258

Forks: 23

Open issues: 4

Created: 2023-11-15T12:43:28Z

Pushed: 2024-10-30T09:16:01Z

Default branch: main

Fork: no

Archived: no

README:

Colab

[Open In Colab](https://colab.research.google.com/drive/1hNXtaR3V-VgmG59QJMW7YPaAYgwtmyuH?usp=sharing)

We provide a Colab notebook to help you get started with UltraEval.

News!

\[2024.6.4\] UltraEval was accepted by ACL 2024 System Demonstration Track (SDT).🔥🔥🔥
\[2024.4.11\] We published the UltraEval paper🔥🔥🔥, and we welcome discussions and exchanges on this topic.
\[2024.2.1\] MiniCPM has been released🔥🔥🔥, using UltraEval as its evaluation framework!
\[2023.11.23\] We open sourced the UltraEval evaluation framework and published the first version of the list.🔥🔥🔥

Overview

UltraEval is an open-source framework for evaluating the capabilities of foundation models, providing a suite of lightweight, easy-to-use evaluation systems that support the performance assessment of mainstream LLMs.

UltraEval's overall workflow is illustrated by a figure in the source README (image not captured in this excerpt).

Its main features are as follows:

1. Lightweight and Easy-to-use Evaluation Framework: Designed with an intuitive interface, minimal dependencies, effortless deployment, and excellent scalability, and adaptable to diverse evaluation scenarios.

2. Flexible and Diverse Evaluation Methods: Supports a unified prompt template with an extensive array of evaluation metrics, allowing for personalized customization to suit specific needs.

3. Efficient and Swift Inference Deployment: Facilitates multiple model deployment strategies such as torch and vLLM, enabling multi-instance deployment for swift evaluation processes.

4. Publicly Transparent Open-Source Leaderboard: Maintains an open, traceable, and reproducible evaluation leaderboard, driven by community updates to ensure transparency and credibility.

5. Official and Authoritative Evaluation Data: Uses widely recognized official evaluation sets for fairness and standardization, so that results are comparable and reproducible.

6. Comprehensive and Extensive Model Support: Supports a wide spectrum of models, including those from the Hugging Face open-source repository and personally trained models.

Quick start

Welcome to UltraEval, your assistant for evaluating the capabilities of large models. Get started in just a few simple steps:

1. Install UltraEval

git clone https://github.com/OpenBMB/UltraEval.git
cd UltraEval
pip install .

2. Model evaluation

Enter the UltraEval root directory; all the following commands are executed in the root directory.

2.1 Generate the evaluation task file

Download datasets:

wget -O RawData.zip "https://cloud.tsinghua.edu.cn/f/11d562a53e40411fb385/?dl=1"

A Google Drive mirror is also available (linked from the source README).

Unzip evaluation datasets:

unzip RawData.zip

Preprocess the data:

python data_process.py

Execute the following command to display the supported datasets and their corresponding tasks:

python configs/show_datasets.py

Specify the tasks to be tested with the following command:

python configs/make_config.py --datasets ALL

The parameters are as follows:

`datasets`: Selects the datasets; the default is All (all datasets). Separate multiple datasets with commas, for example --datasets MMLU,Ceval.
`tasks`: Selects the tasks; the default is empty.
`method`: Selects the generation method; the default is gen.
`save`: Sets the filename of the generated evaluation file; the default is eval_config.json.

Note ⚠️: When `tasks` is set, exactly one dataset must be given, meaning that only the listed tasks under that dataset are executed. `save` is a filename that should end with .json; no path is needed, because the file is written to the configs directory by default. Executing the command above generates an evaluation file named eval_config.json in the configs directory.
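The constraints in the note above can be sketched as a small validation function. This is a minimal illustration only: `validate_config_args` is a hypothetical helper, not part of UltraEval, and the assumption that multiple tasks are comma-separated (like datasets) is not confirmed by the README.

```python
from typing import Dict, List, Union


def validate_config_args(
    datasets: str = "ALL",
    tasks: str = "",
    save: str = "eval_config.json",
) -> Dict[str, Union[List[str], str]]:
    """Hypothetical check mirroring the documented rules; not UltraEval code."""
    # Datasets are documented as comma-separated, e.g. "MMLU,Ceval".
    dataset_list = [d.strip() for d in datasets.split(",") if d.strip()]
    # Assumption: tasks use the same comma-separated form.
    task_list = [t.strip() for t in tasks.split(",") if t.strip()]

    # Rule 1: when tasks are given, exactly one dataset must be selected.
    if task_list and len(dataset_list) != 1:
        raise ValueError("when tasks are set, exactly one dataset must be given")

    # Rule 2: save is a bare filename ending in .json (no directory part).
    if not save.endswith(".json"):
        raise ValueError("save must end with .json")
    if "/" in save or "\\" in save:
        raise ValueError("save must be a filename, not a path")

    return {"datasets": dataset_list, "tasks": task_list, "save": save}


if __name__ == "__main__":
    print(validate_config_args("MMLU,Ceval"))
```

Running a check like this before calling configs/make_config.py would surface an invalid combination early, although the script itself may report such errors differently.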

The "RawData.zip" contains data collected from the official website. To expedite the unzipping process, the 'Math' and 'race' data have been preprocessed (the zip file includes the code, facilitating replication by users).

2.2 Local deployment model

As an example, deploy meta-llama/Llama-2-7b-hf using vLLM:

python URLs/vllm_url.py \
--model_name meta-llama/Llama-2-7b-hf \
--gpuid 0 \
--port 5002

Below is a description of the specific parameters:

`model_name`: The model name. When using vLLM, it must match the model's official Hugging Face name.
`gpuid`: The ID of the GPU to deploy the model on; the default is 0. To use more than one GPU, separate the IDs with commas.
`port`: The port number of the deployment URL; the default is 5002.
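When the deployment step is scripted from Python rather than typed by hand, the command above can be assembled as an argument list. The following is a sketch under stated assumptions: `build_deploy_command` is a hypothetical helper, and it uses only the three flags described above (`--model_name`, `--gpuid`, `--port`).

```python
import shlex
from typing import List, Sequence


def build_deploy_command(
    model_name: str,
    gpu_ids: Sequence[int] = (0,),
    port: int = 5002,
) -> List[str]:
    """Hypothetical helper: build the argv for URLs/vllm_url.py."""
    if not gpu_ids:
        raise ValueError("at least one GPU id is required")
    return [
        "python", "URLs/vllm_url.py",
        "--model_name", model_name,
        # Multiple GPU ids are joined with commas, as the README describes.
        "--gpuid", ",".join(str(g) for g in gpu_ids),
        "--port", str(port),
    ]


if __name__ == "__main__":
    cmd = build_deploy_command("meta-llama/Llama-2-7b-hf", gpu_ids=[0], port=5002)
    # Print the shell-quoted command instead of launching it.
    print(shlex.join(cmd))
```

The list form can be passed to `subprocess.Popen` from the UltraEval root directory; the sketch only prints the command so that it has no side effects.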

Executing the above code will generate a URL. For instance, the URL is http://127.0.0.1:5002/infer, where 5002 is the port number, and /infer is the URL path specified by the @app.route("/infer", methods=["POST"]) decorator in the URLs/vllm_url.py file.
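To make the URL composition explicit, the sketch below rebuilds the endpoint from its parts. The defaults (host 127.0.0.1, port 5002, path /infer) come from the example above; `infer_url` itself is a hypothetical helper, not an UltraEval function.

```python
from urllib.parse import urlunsplit


def infer_url(port: int = 5002, host: str = "127.0.0.1", path: str = "/infer") -> str:
    """Hypothetical helper: compose the deployment URL from host, port, and route."""
    if not 1 <= port <= 65535:
        raise ValueError("port must be between 1 and 65535")
    # urlunsplit takes (scheme, netloc, path, query, fragment).
    return urlunsplit(("http", f"{host}:{port}", path, "", ""))


if __name__ == "__main__":
    print(infer_url())
```

Changing `--port` in the deployment command changes only the port component; the /infer path stays fixed because it is set by the route decorator in URLs/vllm_url.py.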

For evaluating your own trained models and for multi-GPU batch evaluation, see [Tutorial.md](./docs/tutorials/en/ultraeval.md).

2.3 Run the evaluation and get the results

Create a bash script that runs main.py to obtain the evaluation results:

python main.py \
--model general \
--model_args url=$URL,concurrency=1 \
--config_path configs/eval_config.json \
--output_base_path logs \
--batch_size 1 \
--postprocess general_torch \
--params models/model_params/vllm_sample.json \
--write_out

Below is a description of the specific parameters:

`model`: Specifies the model. Currently, general, gpt-3.5-turbo, and gpt-4 are supported.
`model_args`: Specifies the URL generated in 2.2 and the number of concurrent threads used to initialize the model. Separate parameters with commas, and join each parameter name and value with an equals sign, for example url=$URL,concurrency=1.
`config_path`: Specify the…
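The `model_args` format described above (comma-separated pairs, each joined by an equals sign) can be illustrated with a short parser. This is a sketch of the format only: `parse_model_args` is a hypothetical name, and UltraEval's own parsing may differ.

```python
from typing import Dict


def parse_model_args(arg_string: str) -> Dict[str, str]:
    """Hypothetical parser for strings such as 'url=...,concurrency=1'."""
    parsed: Dict[str, str] = {}
    for pair in arg_string.split(","):
        pair = pair.strip()
        if not pair:
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected name=value, got {pair!r}")
        parsed[key] = value
    return parsed


if __name__ == "__main__":
    print(parse_model_args("url=http://127.0.0.1:5002/infer,concurrency=1"))
```

All values come back as strings, so a caller would still need to convert `concurrency` to an integer. A URL that contains a comma would be split incorrectly by this simple approach.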

Excerpt shown — open the source for the full document.