Repo · Microsoft · published Jan 14, 2026

microsoft/byol


Description: Toolkit for bringing low‑resource languages into LLMs.

Language: Python

License: MIT

Stars: 6

Forks: 0

Open issues: 17

Created: 2026-01-14T07:16:43Z

Pushed: 2026-06-11T01:24:23Z

Default branch: main

Fork: no

Archived: no

README:

BYOL: Bring Your Own Language Into LLMs

Syed Waqas Zamir, Wassim Hamidouche, Boulbaba Ben Amor, Luana Marotti, Inbal Becker-Reshef, and Juan Lavista Ferres

BYOL is a scalable framework for extending LLMs to low- and extreme-low-resource languages based on each language's digital footprint. Given a target language, BYOL assesses its digital resources, prepares training and evaluation data, performs continual pre-training and instruction tuning, and evaluates the adapted model on multilingual benchmarks.

News

  • Apr 2026: BYOL toolkit released
  • Apr 2026: Trained models for Chichewa and Māori released on 🤗 HuggingFace (collection)
  • Apr 2026: Human-translated Global MMLU-Lite for Inuktitut, Chichewa, and Māori released on 🤗 HuggingFace

---

Installation

See [SETUP.md](SETUP.md) for the full setup guide (environment, credentials, third-party libs, verification).

---

Quick Inference

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "ai-for-good-lab/byol-nya-4b-merged" # Chichewa 4B
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Older transformers releases spell the dtype argument torch_dtype.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", dtype=torch.bfloat16)

# Chichewa prompt: "Tell us about the country of Malawi."
messages = [{"role": "user", "content": "Tandiuzeni za dziko la Malawi."}]
# Build a single-turn chat prompt and move the tensors to the model's device.
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True, return_dict=True).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
# The decoded text contains the prompt followed by the generated reply.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

---

Released Models

We release BYOL-adapted LLMs for Chichewa and Māori. Merged models are recommended for most users (chat/instruction-following). CPT models are base models for text completion. Intermediate IT checkpoints are also available in the collection.

| Language | 1B | 4B | 12B |
|---|---|---|---|
| Chichewa (nya) | CPT | CPT · Merged | CPT · Merged |
| Māori (mri) | CPT | CPT · Merged | CPT · Merged |

Released Evaluation Data

We release human-translated Global MMLU-Lite for Chichewa, Māori, and Inuktitut — extending the original 18-language benchmark with 3 new low-resource languages.

from datasets import load_dataset
ds = load_dataset("ai-for-good-lab/Global-MMLU-Lite", "mri", split="test") # nya, mri, or iku
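
A loaded row still has to be turned into a prompt before a model can answer it. The sketch below formats one multiple-choice row and checks a predicted letter against the gold answer. It assumes the columns mirror the original Global MMLU-Lite layout (`question`, `option_a` through `option_d`, and a letter in `answer`); this README does not confirm those names, so check them against `ds.features` first. The example row is hand-written for illustration.

```python
# Assumed column names (not confirmed by this README): question,
# option_a..option_d, answer. Verify against ds.features before use.
LETTERS = ("A", "B", "C", "D")


def format_question(row: dict) -> str:
    """Render one multiple-choice row as a plain-text prompt."""
    lines = [row["question"].strip()]
    for letter in LETTERS:
        lines.append(f"{letter}. {row['option_' + letter.lower()]}")
    lines.append("Answer:")
    return "\n".join(lines)


def is_correct(row: dict, predicted_letter: str) -> bool:
    """Compare a predicted letter with the gold answer, ignoring case and spaces."""
    return predicted_letter.strip().upper() == str(row["answer"]).strip().upper()


# Hypothetical row, written by hand for illustration only.
example = {
    "question": "2 + 2 = ?",
    "option_a": "3",
    "option_b": "4",
    "option_c": "5",
    "option_d": "22",
    "answer": "B",
}
prompt = format_question(example)
print(prompt)
```

With the real dataset, `format_question(ds[0])` would take the place of the hand-written row, provided the column names match.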

---

Step-by-Step BYOL Pipeline

The BYOL pipeline takes any language from assessment to evaluation: classify the language's digital footprint, find the best translators, prepare training and evaluation data, train, merge, and evaluate on multilingual benchmarks.

Each step can be run independently. Wherever a command takes a language argument (`--tgt-lang` or `--lang`), supply an ISO-639-3 code (e.g., `nya`, `mri`). Detailed documentation is linked within each section; refer to it for the full set of options and configurations.
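
Because every step is keyed on the language code, it can help to reject malformed codes before launching a long job. The following is a minimal sketch, not part of the toolkit: `check_lang_code` is a hypothetical helper that only verifies the three-lowercase-letter shape of an ISO 639-3 identifier. It does not confirm that the code is actually assigned to a language.

```python
import re

# ISO 639-3 identifiers are exactly three lowercase ASCII letters.
_ISO_639_3 = re.compile(r"[a-z]{3}")


def check_lang_code(code: str) -> str:
    """Return the code unchanged if it has ISO 639-3 shape, else raise ValueError."""
    if not _ISO_639_3.fullmatch(code):
        raise ValueError(f"expected a three-letter ISO 639-3 code, got {code!r}")
    return code


for candidate in ("nya", "mri", "iku"):
    print(check_lang_code(candidate))
```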

Step 1-3: Language Resource Assessment

Classify the language and find the best translation models. [Full docs](byol/language_resource_assessment/README.md)

# Classify language resource level
python -m byol.language_resource_assessment --task language-classification --tgt-lang

# Benchmark translators (API + local models, multi-GPU)
python -m byol.language_resource_assessment --task find-best-translator --tgt-lang --device 0,1

# Benchmark open-weight LLMs for language adaptation
python -m byol.language_resource_assessment --task find-best-llm --tgt-lang --device 0,1
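
To run all three assessment tasks for one language from a script, the commands above can be assembled as argument lists. This is a sketch under the assumption that the module, task names, and flags behave exactly as shown in this README; `assessment_commands` is a hypothetical helper, and it only builds and prints the commands rather than executing them.

```python
import shlex

# Task names and flags are copied from the commands above.
ASSESSMENT_TASKS = ("language-classification", "find-best-translator", "find-best-llm")


def assessment_commands(lang: str, devices: tuple = (0, 1)) -> list:
    """Build one argv list per assessment task for the given language code."""
    device_arg = ",".join(str(d) for d in devices)
    commands = []
    for task in ASSESSMENT_TASKS:
        argv = ["python", "-m", "byol.language_resource_assessment",
                "--task", task, "--tgt-lang", lang]
        # The README passes --device only to the two benchmarking tasks.
        if task != "language-classification":
            argv += ["--device", device_arg]
        commands.append(argv)
    return commands


for argv in assessment_commands("nya"):
    print(shlex.join(argv))  # pass argv to subprocess.run(argv, check=True) to execute
```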

Step 4-6: Data Preparation

Prepare training (CPT, SFT) and evaluation data. [Full docs](byol/data_prep/README.md)

# CPT: download, refine, translate bilingual training corpus
python -m byol.data_prep --stage cpt --tgt-lang

# SFT: instruction-tuning data from SmolTalk2 + AYA
python -m byol.data_prep --stage sft --tgt-lang

# Eval: translate 10 English benchmarks to target language
python -m byol.data_prep --stage eval --tgt-lang

> Config files are auto-generated at configs/data_prep/{stage}/.yaml on first run.
> Add --max-samples 10 for quick testing. Use --translator to override the default (see the list of supported machine translators [here](byol/translation_backends/README.md)).
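
A quick way to smoke-test data preparation is to run each stage with `--max-samples 10` before committing to a full run. The sketch below builds those commands and prints them as a dry run. `data_prep_command` is a hypothetical helper; the stage names and flags come from this README, while the assumption that `--translator` takes a single backend name as its value does not.

```python
import shlex

STAGES = ("cpt", "sft", "eval")  # stage names from the commands above


def data_prep_command(stage, lang, max_samples=None, translator=None):
    """Build the argv list for one data-preparation stage."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage {stage!r}; expected one of {STAGES}")
    argv = ["python", "-m", "byol.data_prep", "--stage", stage, "--tgt-lang", lang]
    if max_samples is not None:
        argv += ["--max-samples", str(max_samples)]
    if translator is not None:
        # Assumption: --translator takes a single backend name as its value.
        argv += ["--translator", translator]
    return argv


# Dry run: print a 10-sample smoke test for every stage without executing it.
for stage in STAGES:
    print(shlex.join(data_prep_command(stage, "nya", max_samples=10)))
```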

Step 7: Generate Eval Task Configs

Scaffold lm-evaluation-harness task YAMLs for the new language:

python -m byol.eval add-language --lang --name
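
For readers unfamiliar with lm-evaluation-harness, a task YAML is a small config that names the dataset, the prompt template, and the metric. The sketch below prints what a minimal multiple-choice task config for the released Global MMLU-Lite data could look like. It is an illustration only: this README does not show what `add-language` actually emits, the task name is hypothetical, and the dataset column names are assumed to follow the original Global MMLU-Lite layout.

```python
def task_yaml(lang: str) -> str:
    """Return a minimal multiple-choice task config as YAML text."""
    lines = [
        f"task: global_mmlu_lite_{lang}  # hypothetical task name",
        "dataset_path: ai-for-good-lab/Global-MMLU-Lite",
        f"dataset_name: {lang}",
        "test_split: test",
        "output_type: multiple_choice",
        # Assumed column names: question, option_a..option_d, answer.
        'doc_to_text: "{{question.strip()}}\\nA. {{option_a}}\\nB. {{option_b}}'
        '\\nC. {{option_c}}\\nD. {{option_d}}\\nAnswer:"',
        'doc_to_choice: ["A", "B", "C", "D"]',
        "doc_to_target: answer",
        "metric_list:",
        "  - metric: acc",
        "    aggregation: mean",
        "    higher_is_better: true",
    ]
    return "\n".join(lines) + "\n"


print(task_yaml("mri"))
```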

Step 8-9: Training

Train with LlamaFactory. [Full docs](byol/train/README.md)

# Continual Pre-Training
python -m byol.train cpt --tgt-lang --model google/gemma-3-4b-pt --device 0

# Supervised Fine-Tuning (on top of CPT checkpoint)
python -m byol.train sft --tgt-lang --model --device 0

Step 10: Model Merging

python -m byol.train.merge general \
--model-pt google/gemma-3-4b-pt \
--model-it google/gemma-3-4b-it \
--model-el \
--beta 0.6 --device 3 --dtype bfloat16 \
--output results//train/merged/
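
This README does not spell out the formula behind `--beta`. One common scheme that uses three checkpoints is task-vector arithmetic: take the difference between the instruction-tuned and pre-trained base weights, scale it, and add it to the language-adapted weights. The sketch below implements that scheme on plain Python lists. It is an assumption about the general idea, not a description of what `byol.train.merge` does, and the reading of `--model-el` as the language-adapted checkpoint is likewise assumed.

```python
def merge_weights(pt, it, el, beta):
    """Add a scaled instruction delta (it - pt) onto language-adapted weights el.

    Each argument maps a parameter name to a flat list of floats; real
    checkpoints would hold tensors, but the arithmetic is the same.
    """
    if not (pt.keys() == it.keys() == el.keys()):
        raise ValueError("all three checkpoints must share the same parameter names")
    merged = {}
    for name in el:
        merged[name] = [
            e + beta * (i - p)
            for p, i, e in zip(pt[name], it[name], el[name])
        ]
    return merged


# Toy weights, made up for illustration.
pt = {"w": [1.0, 0.0]}   # pre-trained base
it = {"w": [2.0, 0.5]}   # instruction-tuned base
el = {"w": [1.5, -1.0]}  # language-adapted model (assumed meaning of --model-el)
merged = merge_weights(pt, it, el, beta=0.6)
print(merged)
```

Under this scheme, `beta=0` returns the language-adapted weights unchanged and larger values pull in more of the instruction-tuning delta.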

Step 11-13: Evaluation

Evaluate on multilingual benchmarks using lm-evaluation-harness. [Full docs](byol/eval/README.md)…
