arcee-ai/lm-evaluation-harness
forked from EleutherAI/lm-evaluation-harness
Description: A framework for few-shot evaluation of language models.
Language: Python
License: MIT
Stars: 0
Forks: 0
Open issues: 0
Created: 2024-10-22T20:52:25Z
Pushed: 2024-10-28T21:22:01Z
Default branch: main
Fork: yes
Parent repository: EleutherAI/lm-evaluation-harness
Archived: no
README:
Test Ours
    lm_eval --model vllm \
        --model_args pretrained="Qwen/Qwen2.5-32B-Instruct",tensor_parallel_size=8,dtype=auto,gpu_memory_utilization=0.8,max_gen_toks=4096,use_blueberry=True \
        --tasks bbh \
        --batch_size auto \
        --num_fewshot 3 \
        --apply_chat_template True
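The `--model_args` value in the command above is one long comma-separated string of `key=value` pairs, which is easy to mistype. A minimal sketch of assembling it from a dictionary follows; `build_model_args` is a hypothetical helper written for illustration, not part of the harness, and the values are copied from the command above.

```python
def build_model_args(options: dict) -> str:
    """Join options into the comma-separated key=value form passed to --model_args.

    Hypothetical helper for illustration; it is not part of lm-evaluation-harness.
    """
    return ",".join(f"{key}={value}" for key, value in options.items())


# Values copied from the command above.
model_args = build_model_args({
    "pretrained": "Qwen/Qwen2.5-32B-Instruct",
    "tensor_parallel_size": 8,
    "dtype": "auto",
    "gpu_memory_utilization": 0.8,
    "max_gen_toks": 4096,
    "use_blueberry": True,
})
print(model_args)
```

The resulting string can be pasted after `--model_args`; the shell quotes around the model name in the original command are only needed on the command line.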
Language Model Evaluation Harness

---
*Latest News 📣*
- [2024/09] We are prototyping support for creating and evaluating text+image multimodal input, text output tasks in LM Evaluation Harness, and have just added the `hf-multimodal` and `vllm-vlm` model types and the `mmmu` task as a prototype feature. We welcome users to try out this in-progress feature and stress-test it for themselves, and suggest they check out `lmms-eval`, a wonderful project originally forked from lm-evaluation-harness, for a broader range of multimodal tasks, models, and features.
- [2024/07] [API model](docs/API_guide.md) support has been updated and refactored, introducing support for batched and async requests, and making it significantly easier to customize and use for your own purposes. To run Llama 405B, we recommend hosting the model with vLLM's OpenAI-compliant API and evaluating it with the `local-completions` model type.
- [2024/07] New Open LLM Leaderboard tasks have been added! You can find them under the [leaderboard](lm_eval/tasks/leaderboard/README.md) task group.
---
Announcement
A new v0.4.0 release of lm-evaluation-harness is available!
New updates and features include:
- New Open LLM Leaderboard tasks have been added! You can find them under the [leaderboard](lm_eval/tasks/leaderboard/README.md) task group.
- Internal refactoring
- Config-based task creation and configuration
- Easier import and sharing of externally-defined task config YAMLs
- Support for Jinja2 prompt design, easy modification of prompts + prompt imports from Promptsource
- More advanced configuration options, including output post-processing, answer extraction, multiple LM generations per document, configurable few-shot settings, and more
- Speedups and new modeling libraries supported, including: faster data-parallel HF model usage, vLLM support, MPS support with HuggingFace, and more
- Logging and usability changes
- New tasks including CoT BIG-Bench-Hard, Belebele, user-defined task groupings, and more
Please see our updated documentation pages in docs/ for more details.
Development will continue on the main branch, and we encourage you to give us feedback on what features are desired and how to improve the library further, or to ask questions, either in issues or PRs on GitHub, or in the EleutherAI Discord!
---
Overview
This project provides a unified framework to test generative language models on a large number of different evaluation tasks.
Features:
- Over 60 standard academic benchmarks for LLMs, with hundreds of subtasks and variants implemented.
- Support for models loaded via transformers (including quantization via AutoGPTQ), GPT-NeoX, and Megatron-DeepSpeed, with a flexible tokenization-agnostic interface.
- Support for fast and memory-efficient inference with vLLM.
- Support for commercial APIs including OpenAI and TextSynth.
- Support for evaluation on adapters (e.g. LoRA) supported in HuggingFace's PEFT library.
- Support for local models and benchmarks.
- Evaluation with publicly available prompts ensures reproducibility and comparability between papers.
- Easy support for custom prompts and evaluation metrics.
The Language Model Evaluation Harness is the backend for 🤗 Hugging Face's popular Open LLM Leaderboard, has been used in hundreds of papers, and is used internally by dozens of organizations including NVIDIA, Cohere, BigScience, BigCode, Nous Research, and Mosaic ML.
Install
To install the `lm-eval` package from the GitHub repository, run:
    git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
    cd lm-evaluation-harness
    pip install -e .
We also provide a number of optional dependencies for extended functionality. A detailed table is available at the end of this document.
Basic Usage
User Guide
A user guide detailing the full list of supported arguments is provided [here](./docs/interface.md), and on the terminal by calling `lm_eval -h`. Alternatively, you can use `lm-eval` instead of `lm_eval`.
A list of supported tasks (or groupings of tasks) can be viewed with `lm-eval --tasks list`. Task descriptions and links to corresponding subfolders are provided [here](./lm_eval/tasks/README.md).
Hugging Face transformers
To evaluate a model hosted on the HuggingFace Hub (e.g. GPT-J-6B) on `hellaswag`, you can use the following command (this assumes you are using a CUDA-compatible GPU):
    lm_eval --model hf \
        --model_args pretrained=EleutherAI/gpt-j-6B \
        --tasks hellaswag \
        --device cuda:0 \
        --batch_size 8
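When the same evaluation is launched from a script (for example, to loop over several models), it can help to build the argument list in Python rather than concatenating shell strings. The sketch below uses only the flags that appear in the command above; `build_lm_eval_command` is a hypothetical helper, not part of the harness, and nothing is executed unless you pass the list to `subprocess.run` yourself.

```python
import shlex


def build_lm_eval_command(model, model_args, tasks, device=None, batch_size=None):
    """Return the argv list for an lm_eval invocation.

    Hypothetical helper for illustration; only flags shown in the
    command above are used.
    """
    argv = [
        "lm_eval",
        "--model", model,
        "--model_args", model_args,
        "--tasks", ",".join(tasks),  # multiple tasks are passed comma-separated
    ]
    if device is not None:
        argv += ["--device", device]
    if batch_size is not None:
        argv += ["--batch_size", str(batch_size)]
    return argv


argv = build_lm_eval_command(
    model="hf",
    model_args="pretrained=EleutherAI/gpt-j-6B",
    tasks=["hellaswag"],
    device="cuda:0",
    batch_size=8,
)
# Print a copy-pastable, shell-quoted version of the command.
print(shlex.join(argv))
# To actually run it (requires lm_eval installed and a suitable GPU):
#   import subprocess; subprocess.run(argv, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting issues with values such as model names that contain special characters.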
Additional arguments can be provided to the model constructor using the `--model_args` flag. Most notably, this supports the common practice of using the revisions feature on the Hub to store partially trained checkpoints, or to specify the datatype for running a model:
    lm_eval --model hf \
        --model_args pretrained=EleutherAI/pythia-160m,revision=step100000,dtype="float" \
        --tasks…
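To make the idea of "arguments provided to the model constructor" concrete, here is a rough sketch of how a `key=value,key=value` string could be split into keyword arguments. This is an illustrative assumption about the general shape of the mapping, not the harness's actual parser; `parse_model_args` and its type-coercion rules are hypothetical.

```python
def parse_model_args(arg_string: str) -> dict:
    """Split a comma-separated key=value string into a dict of keyword arguments.

    Illustrative sketch only; the coercion rules below are an assumption,
    not the behavior of lm-evaluation-harness itself.
    """

    def coerce(raw: str):
        # Booleans first, then integers, then floats; otherwise keep the string.
        if raw.lower() in ("true", "false"):
            return raw.lower() == "true"
        for cast in (int, float):
            try:
                return cast(raw)
            except ValueError:
                pass
        return raw

    result = {}
    for pair in arg_string.split(","):
        if not pair:
            continue  # tolerate empty segments such as a trailing comma
        key, _, value = pair.partition("=")
        result[key.strip()] = coerce(value.strip())
    return result


kwargs = parse_model_args(
    "pretrained=EleutherAI/pythia-160m,revision=step100000,dtype=float"
)
print(kwargs)
```

Under this sketch, `revision` and `dtype` simply become extra keyword arguments alongside `pretrained`, which is the behavior the paragraph above describes.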