OpenBMB/UltraLink
Description: An Open-Source Knowledge-Enhanced Multilingual Supervised Fine-tuning Dataset
Language: Python
License: MIT
Stars: 28
Forks: 6
Open issues: 0
Created: 2024-02-07T03:15:24Z
Pushed: 2025-01-19T09:28:05Z
Default branch: main
Fork: no
Archived: no
README:
News
- ❗️❗️ February 6, 2024: Released UltraLink, a multilingual, knowledge-grounded, data-augmented, multi-round dialogue dataset, together with the model weights of UltraLink-LM.
Introduction
UltraLink
UltraLink is a multilingual, knowledge-grounded, data-augmented, multi-round dialogue dataset. It contains language-specific chat data, language-agnostic chat data, code data, and math data in 5 languages: English, Chinese, Spanish, Russian, and French. It can be downloaded from this Hugging Face link. Unlike previous works that simply translate English instructions, we consider both the language-specific and the language-agnostic abilities of LLMs. First, we introduce a knowledge-grounded data augmentation approach to elicit more culture-specific knowledge from LLMs, improving their ability to serve users from different countries. Second, we find that modern LLMs possess strong cross-lingual transfer capabilities, so repeatedly learning identical content in various languages is not necessary. Consequently, we can substantially prune the language-agnostic SFT data without any performance degradation, making multilingual SFT more efficient.
UltraLink-LM
> UltraLink-LM is a massively multilingual generative language model that follows instructions in 5 languages: English, French, Russian, Spanish, and Chinese. The model is capable of generating text in 5 languages with high quality and diversity.
>
> UltraLink-LM outperforms PolyLM-Chat-13b, [Guanaco](JosephusCheung/Guanaco), and Bloomz-7b1-mt in code, math and chat abilities in four languages, and has a high-quality and diverse text generation performance in all languages.
>
> UltraLink-LM is trained using UltraLink, UltraChat, Magicoder-Evol, Magicoder-OSS, MetaMathQA, and ShareGPT.
>
> We release the checkpoints under an MIT license to further our mission of multilingual technologies empowering a multilingual world. It can be downloaded from this Hugging Face link.
- Developed by: [OpenBMB](https://www.openbmb.cn/home)
- Model type: a Transformer-style, autoregressive, massively multilingual language model.
- Paper: UltraLink: An Open-Source Knowledge-Enhanced Multilingual Supervised Fine-tuning Dataset
- Languages: English, French, Russian, Spanish, and Chinese.
- License: MIT
- Model: UltraLink-LM
- Model Size: 13 billion parameters
- Datasets: UltraLink, UltraChat (10k randomly selected samples), Magicoder-Evol, Magicoder-OSS, MetaMathQA, and ShareGPT (the English part of the dataset, restricted to samples whose length is greater than 4k).
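The two subset selections mentioned in the dataset list (a random 10k draw from UltraChat and a length filter on English ShareGPT) can be sketched as follows. This is only an illustration under stated assumptions: the `lang` and `text` field names are hypothetical, the README does not say whether "4k" counts characters or tokens, and the exact threshold (here 4000, measured in characters by default) is a placeholder rather than the authors' setting.

```python
import random
from typing import Callable, Dict, List

# Hypothetical sample schema: {"lang": "en", "text": "..."}
Sample = Dict[str, str]


def sample_subset(samples: List[Sample], k: int, seed: int = 0) -> List[Sample]:
    """Draw up to k samples uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return rng.sample(samples, min(k, len(samples)))


def filter_long_english(
    samples: List[Sample],
    min_length: int = 4000,  # assumption: the README only says "4k"
    length_fn: Callable[[Sample], int] = lambda s: len(s["text"]),
) -> List[Sample]:
    """Keep English samples whose length is strictly greater than min_length."""
    return [s for s in samples if s["lang"] == "en" and length_fn(s) > min_length]
```

Passing a tokenizer-based `length_fn` would switch the filter from characters to tokens without changing the rest of the code.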
Performance
We report 6 evaluations in this section: multilingual HumanEval, MGSM, OMGEval, ARC, Hellaswag, and MMLU. Natural language generation performance is evaluated by HumanEval, MGSM, and OMGEval, while natural language understanding is evaluated by ARC, Hellaswag, and MMLU. Evaluations of modern LLMs may be biased and affected by many factors, so we are also actively working on more comprehensive evaluation methods.
Multilingual HumanEval
HumanEval is a well-known benchmark for evaluating the code ability of LLMs. It executes the code snippets generated by the model and evaluates their correctness. Since there is no existing multilingual test set for code generation, we use GPT-3.5 with carefully designed prompts to translate HumanEval into other languages.
| Model                  | En   | Zh   | Es   | Ru   | Fr   | Avg  |
| ---------------------- | ---- | ---- | ---- | ---- | ---- | ---- |
| Aya-101                | 0.6  | 0    | 0    | 0    | 0    | 0.1  |
| Aya-5-languages*       | 6.1  | 9.8  | 6.1  | 8.5  | 4.3  | 7.0  |
| Bloomz-7b1-mt          | 8.5  | 7.3  | 6.1  | 8.5  | 6.1  | 7.3  |
| Phoenix-inst-chat-7b   | 11.0 | 10.4 | 8.5  | 1.2  | 13.4 | 12.2 |
| PolyLM-Multialpaca-13b | 8.5  | 7.3  | 6.1  | 6.1  | 6.1  | 6.8  |
| PolyLM-Chat-13b        | 10.4 | 7.9  | 6.1  | 7.3  | 8.5  | 8.1  |
| Chimera-inst-chat-13b  | 14.6 | 13.4 | 14.6 | 12.8 | 14.0 | 13.9 |
| Okapi-7b               | 12.2 | 11.0 | 8.5  | 8.5  | 8.5  | 9.8  |
| Guanaco-7b             | 9.2  | 6.7  | 11.0 | 9.8  | 12.8 | 9.9  |
| Guanaco-13b            | 18.3 | 15.9 | 9.8  | 8.5  | 14.6 | 12.2 |
| UltraLink-LM           | 60.4 | 43.9 | 40.9 | 49.4 | 39.6 | 46.8 |
\* Aya-5-languages is obtained by selecting the 5 languages that UltraLink supports, randomly sampling 3M examples from them, and then fine-tuning Llama-13b on that data for 1 epoch.
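The execution-based scoring that HumanEval relies on can be sketched in a few lines. This is a minimal illustration, not the evaluation harness used for the table: it assumes one generated sample per problem (a pass@1-style score), the helper names `passes_tests` and `pass_at_1` are hypothetical, and it runs code in-process, whereas a real harness should sandbox untrusted model output and enforce timeouts.

```python
from typing import List


def passes_tests(candidate_code: str, test_code: str) -> bool:
    """Return True if the candidate runs and every assertion in test_code holds."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)  # define the generated function(s)
        exec(test_code, namespace)       # run the unit-test assertions against them
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as failures.
        return False
    return True


def pass_at_1(results: List[bool]) -> float:
    """Percentage of problems solved, assuming one sample per problem."""
    return 100.0 * sum(results) / len(results) if results else 0.0


# Toy example with one correct and one incorrect completion.
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
tests = "assert add(1, 2) == 3\n"
results = [passes_tests(good, tests), passes_tests(bad, tests)]
print(pass_at_1(results))  # 50.0
```

Because the check only looks at whether the tests pass, the language of the prompt does not affect scoring, which is what makes a translated HumanEval usable across languages.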
MGSM
We employ MGSM, a multilingual benchmark, to evaluate math reasoning abilities. It compares the model's results with the correct answers to measure the model's ability to perform mathematical reasoning.

| Model                  | En   | Zh   | Es   | Ru   | Fr   | Avg  |
| ---------------------- | ---- | ---- | ---- | ---- | ---- | ---- |
| Aya-101                | 8.8  | 4    | 6    | 8    | 9.2  | 7.2  |
| Aya-5-languages        | 28.8 | 5.6  | 18   | 17.2 | 19.2 | 17.8 |
| Bloomz-7b1-mt          | 2.8  | 1.6  | 2.0  | 0.4  | 2.8  | 1.7  |
| Phoenix-inst-chat-7b   | 3.2  | 3.2  | 2.8  | 3.2  | 3.2  | 3.1  |
| PolyLM-Multialpaca-13b | 1.2  | 2.8  | 1.6  | 2.8  | 2.4  | 2.4  |
| PolyLM-Chat-13b        | 10.8 | 6.4  | 4.8  | 4.4  | 5.6  | 5.3  |
| Chimera-inst-chat-13b  | 14.0 | 11.6 | 10.0 | 12.0 |…
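The MGSM scoring idea, comparing a model's result with the correct answer, can be sketched as an exact-match check on the final number in each response. This is an assumption-laden illustration: the README does not describe how answers are extracted, so taking the last number in the output is just one common heuristic, and the names `extract_final_number` and `accuracy` are hypothetical.

```python
import re
from typing import List, Optional

# Matches integers and decimals, with an optional leading minus sign.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_final_number(text: str) -> Optional[float]:
    """Return the last number in the text, ignoring thousands separators."""
    matches = _NUMBER.findall(text.replace(",", ""))
    return float(matches[-1]) if matches else None


def accuracy(outputs: List[str], answers: List[float]) -> float:
    """Percentage of outputs whose final number equals the reference answer."""
    correct = 0
    for out, ans in zip(outputs, answers):
        pred = extract_final_number(out)
        if pred is not None and abs(pred - ans) < 1e-6:
            correct += 1
    return 100.0 * correct / len(answers) if answers else 0.0


outputs = ["She has 3 + 15 = 18 apples. The answer is 18.", "I am not sure."]
print(accuracy(outputs, [18, 5]))  # 50.0
```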