microsoft/byol
Description: Toolkit for bringing low‑resource languages into LLMs.
Language: Python
License: MIT
Stars: 6
Forks: 0
Open issues: 17
Created: 2026-01-14T07:16:43Z
Pushed: 2026-06-11T01:24:23Z
Default branch: main
Fork: no
Archived: no
README:
BYOL: Bring Your Own Language Into LLMs
Syed Waqas Zamir, Wassim Hamidouche, Boulbaba Ben Amor, Luana Marotti, Inbal Becker-Reshef, and Juan Lavista Ferres
BYOL is a scalable framework for extending LLMs to low- and extreme-low-resource languages based on each language's digital footprint. Given a target language, BYOL assesses its digital resources, prepares training and evaluation data, performs continual pre-training and instruction tuning, and evaluates the adapted model on multilingual benchmarks.
News
- Apr 2026: BYOL toolkit released
- Apr 2026: Trained models for Chichewa and Māori released on 🤗 HuggingFace (collection)
- Apr 2026: Human-translated Global MMLU-Lite for Inuktitut, Chichewa, and Māori released on 🤗 HuggingFace
---
Installation
See [SETUP.md](SETUP.md) for the full setup guide (environment, credentials, third-party libs, verification).
---
Quick Inference
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_id = "ai-for-good-lab/byol-nya-4b-merged" # Chichewa 4B
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", dtype=torch.bfloat16)
messages = [{"role": "user", "content": "Tandiuzeni za dziko la Malawi."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True, return_dict=True).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
---
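For decoder-only models, `generate` returns the prompt tokens followed by the newly generated ones, so decoding `outputs[0]` as in the Quick Inference snippet prints the chat prompt together with the reply. Below is a minimal sketch of keeping only the reply. The helper name `new_tokens_only` is hypothetical (not part of BYOL or transformers), and it is shown on plain lists so it runs without downloading a model.

```python
from typing import Sequence


def new_tokens_only(output_ids: Sequence[int], prompt_len: int) -> Sequence[int]:
    """Drop the first `prompt_len` token ids, keeping only the generated ones."""
    if not 0 <= prompt_len <= len(output_ids):
        raise ValueError("prompt_len must be between 0 and len(output_ids)")
    return output_ids[prompt_len:]


# Toy ids: the first three stand in for the prompt, the rest for the reply.
full_sequence = [101, 7, 42, 900, 901, 902]
reply_ids = new_tokens_only(full_sequence, prompt_len=3)
print(reply_ids)
```

With the tensors from the Quick Inference snippet, the same slice would read `tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)`.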
Released Models
We release BYOL-adapted LLMs for Chichewa and Māori. Merged models are recommended for most users (chat/instruction-following). CPT models are base models for text completion. Intermediate IT checkpoints are also available in the collection.
| Language | 1B | 4B | 12B |
|---|---|---|---|
| Chichewa (nya) | CPT | CPT · Merged | CPT · Merged |
| Māori (mri) | CPT | CPT · Merged | CPT · Merged |
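The release table can also be read programmatically. The sketch below encodes it as a dictionary and picks the recommended variant: Merged where one exists, otherwise CPT. `RELEASED` and `recommended_variant` are hypothetical helpers, not part of the toolkit, and the dictionary only mirrors the table as shown here.

```python
from typing import Dict, Tuple

# Variants per (ISO 639-3 code, model size), copied from the release table.
RELEASED: Dict[Tuple[str, str], Tuple[str, ...]] = {
    ("nya", "1B"): ("CPT",),
    ("nya", "4B"): ("CPT", "Merged"),
    ("nya", "12B"): ("CPT", "Merged"),
    ("mri", "1B"): ("CPT",),
    ("mri", "4B"): ("CPT", "Merged"),
    ("mri", "12B"): ("CPT", "Merged"),
}


def recommended_variant(lang: str, size: str) -> str:
    """Return 'Merged' when a merged model is listed, else 'CPT'."""
    try:
        variants = RELEASED[(lang, size)]
    except KeyError:
        raise ValueError(f"no released model for {lang!r} at size {size!r}") from None
    return "Merged" if "Merged" in variants else "CPT"


print(recommended_variant("nya", "4B"))
print(recommended_variant("mri", "1B"))
```

At 1B only a CPT base model is listed, so chat-style use at that size would need one of the intermediate IT checkpoints mentioned above or further tuning.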
Released Evaluation Data
We release human-translated Global MMLU-Lite for Chichewa, Māori, and Inuktitut — extending the original 18-language benchmark with 3 new low-resource languages.
from datasets import load_dataset
ds = load_dataset("ai-for-good-lab/Global-MMLU-Lite", "mri", split="test") # nya, mri, or iku
---
Step-by-Step BYOL Pipeline
The BYOL pipeline takes any language from assessment to evaluation: classify the language's digital footprint, find the best translators, prepare training and evaluation data, train, merge, and evaluate on multilingual benchmarks.
Each step can be run independently. Wherever a command expects the target language (for example, after `--tgt-lang`), supply its ISO 639-3 code (e.g., `nya`, `mri`). Detailed documentation is linked within each section; refer to it for the full set of options and configurations.
Step 1-3: Language Resource Assessment
Classify the language and find the best translation models. [Full docs](byol/language_resource_assessment/README.md)
# Classify language resource level
python -m byol.language_resource_assessment --task language-classification --tgt-lang
# Benchmark translators (API + local models, multi-GPU)
python -m byol.language_resource_assessment --task find-best-translator --tgt-lang --device 0,1
# Benchmark open-weight LLMs for language adaptation
python -m byol.language_resource_assessment --task find-best-llm --tgt-lang --device 0,1
Step 4-6: Data Preparation
Prepare training (CPT, SFT) and evaluation data. [Full docs](byol/data_prep/README.md)
# CPT: download, refine, translate bilingual training corpus
python -m byol.data_prep --stage cpt --tgt-lang
# SFT: instruction-tuning data from SmolTalk2 + AYA
python -m byol.data_prep --stage sft --tgt-lang
# Eval: translate 10 English benchmarks to target language
python -m byol.data_prep --stage eval --tgt-lang
> Config files are auto-generated at `configs/data_prep/{stage}/.yaml` on first run.
> Add `--max-samples 10` for quick testing. Use `--translator` to override the default (see the list of supported machine translators [here](byol/translation_backends/README.md)).
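For a quick smoke test of all three data-preparation stages, the commands can be assembled in a short script. This is a sketch under stated assumptions: `data_prep_commands` is a hypothetical helper, and it assumes `--max-samples` and `--translator` are accepted by every stage and that `--translator` takes a single value. The README excerpt confirms only that the flags exist.

```python
import re
import sys
from typing import List, Optional

STAGES = ("cpt", "sft", "eval")  # the three --stage values shown in this section


def data_prep_commands(
    lang: str,
    max_samples: Optional[int] = None,
    translator: Optional[str] = None,
) -> List[List[str]]:
    """Build one `python -m byol.data_prep` argv list per stage."""
    # ISO 639-3 codes are exactly three lowercase letters.
    if not re.fullmatch(r"[a-z]{3}", lang):
        raise ValueError(f"expected an ISO 639-3 code, got {lang!r}")
    commands = []
    for stage in STAGES:
        cmd = [sys.executable, "-m", "byol.data_prep", "--stage", stage, "--tgt-lang", lang]
        if max_samples is not None:
            # Assumption: every stage accepts --max-samples.
            cmd += ["--max-samples", str(max_samples)]
        if translator is not None:
            # Assumption: --translator takes one backend name as its value.
            cmd += ["--translator", translator]
        commands.append(cmd)
    return commands


for cmd in data_prep_commands("nya", max_samples=10):
    print(" ".join(cmd))
    # To actually run a stage: subprocess.run(cmd, check=True)
```

Building argument lists rather than shell strings avoids quoting problems if a translator name or path ever contains spaces.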
Step 7: Generate Eval Task Configs
Scaffold lm-evaluation-harness task YAMLs for the new language:
python -m byol.eval add-language --lang --name
Step 8-9: Training
Train with LlamaFactory. [Full docs](byol/train/README.md)
# Continual Pre-Training
python -m byol.train cpt --tgt-lang --model google/gemma-3-4b-pt --device 0
# Supervised Fine-Tuning (on top of CPT checkpoint)
python -m byol.train sft --tgt-lang --model --device 0
Step 10: Model Merging
python -m byol.train.merge general \
  --model-pt google/gemma-3-4b-pt \
  --model-it google/gemma-3-4b-it \
  --model-el \
  --beta 0.6 --device 3 --dtype bfloat16 \
  --output results//train/merged/
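This excerpt does not state how `--beta` combines the three checkpoints. For intuition only, here is a sketch of one common family of merging rules, task arithmetic: the shift that the `--model-el` checkpoint (presumably the language-adapted model) learned relative to the pre-trained base is scaled by `beta` and added to the instruction-tuned weights. This is an assumption and not necessarily what `byol.train.merge general` computes. Plain floats stand in for weight tensors, and `merge_task_arithmetic` is a hypothetical name.

```python
from typing import Dict


def merge_task_arithmetic(
    pt: Dict[str, float],
    it: Dict[str, float],
    el: Dict[str, float],
    beta: float,
) -> Dict[str, float]:
    """Return it + beta * (el - pt) for every parameter name."""
    # All three checkpoints must expose the same parameter names.
    if not (pt.keys() == it.keys() == el.keys()):
        raise ValueError("checkpoints must share the same parameter names")
    return {name: it[name] + beta * (el[name] - pt[name]) for name in pt}


# Toy "checkpoints" with two scalar parameters each (illustrative values only).
pt = {"w": 1.0, "b": 0.0}   # pre-trained base
it = {"w": 1.5, "b": 0.5}   # instruction-tuned
el = {"w": 2.0, "b": -1.0}  # language-adapted (assumed meaning of --model-el)

merged = merge_task_arithmetic(pt, it, el, beta=0.5)
print(merged)
```

Under this rule, `beta = 0` returns the instruction-tuned model unchanged, and larger values pull the result further toward the language-adapted weights.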
Step 11-13: Evaluation
Evaluate on multilingual benchmarks using lm-evaluation-harness. [Full docs](byol/eval/README.md)…