amazon-science/multilingual-faithfulness
Language: Python
License: CC-BY-4.0
Stars: 0
Forks: 0
Open issues: 0
Created: 2026-01-22T15:45:01Z
Pushed: 2026-03-20T20:43:49Z
Default branch: main
Fork: no
Archived: no
README:
Multilingual Faithfulness
A framework for generating synthetic multilingual data to train faithfulness judges for text summarization.
Overview
This repository provides tools to:
- Generate faithful and unfaithful summaries from multilingual datasets (WikiLingua)
- Generate labeled training data for faithfulness judges using LLM-as-a-judge
Installation
Scripts run inside the official vLLM Docker container, which bundles compatible versions of vLLM, PyTorch, and Transformers.
docker pull vllm/vllm-openai:latest
Additional Python dependencies (installed inside the container):
pip install hydra-core omegaconf datasets
Project Structure
multilingual-faithfulness/
├── conf/                  # Hydra configuration files
│   ├── config.yaml        # Main configuration
│   └── task/              # Task-specific configs
│       ├── gen_data.yaml  # Training data generation
│       └── gen_summs.yaml # Summary generation
├── data/                  # Benchmark datasets (CSV)
│   ├── llm_aggrefact.csv
│   ├── mface.csv
│   └── memerag.csv
├── scripts/               # Executable scripts
│   ├── gen_data.py        # Training data generation
│   └── gen_summs.py       # Summary generation
├── src/                   # Library modules
│   ├── data_loader.py     # WikiLingua dataset loader
│   ├── gen_data.py        # Data generation functions
│   ├── gen_summs.py       # Summary generation functions
│   ├── corrupt.py         # Summary corruption strategies
│   ├── llm_inference/     # LLM inference utilities (vLLM)
│   └── utils/             # Helper functions and prompts
├── bash_files/            # Example shell scripts
└── requirements.txt
Usage
All scripts should be run inside the vLLM Docker container:
docker run --gpus all --rm \
  -v /path/to/repo:/workspace \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --ipc=host --entrypoint bash \
  vllm/vllm-openai:latest -c \
  "pip install hydra-core omegaconf datasets && \
   cd /workspace && \
   python3 scripts/.py "
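When the same container invocation has to be repeated for several scripts, it can help to assemble the argument list programmatically. The following is a minimal sketch, not part of the repository: the helper name `build_docker_command` and its `script` and `overrides` parameters are assumptions made for illustration. It only builds the argument list and does not start Docker.

```python
import shlex
from pathlib import Path


def build_docker_command(repo_path, script, overrides=()):
    """Assemble the `docker run` argument list for one script invocation.

    `script` and `overrides` are hypothetical parameters; pass the script
    file name and the Hydra overrides you actually want to run.
    """
    hf_cache = Path.home() / ".cache" / "huggingface"
    # Command string executed by bash inside the container.
    inner = " && ".join([
        "pip install hydra-core omegaconf datasets",
        "cd /workspace",
        shlex.join(["python3", f"scripts/{script}", *overrides]),
    ])
    return [
        "docker", "run", "--gpus", "all", "--rm",
        "-v", f"{repo_path}:/workspace",
        "-v", f"{hf_cache}:/root/.cache/huggingface",
        "--ipc=host", "--entrypoint", "bash",
        "vllm/vllm-openai:latest", "-c", inner,
    ]


if __name__ == "__main__":
    cmd = build_docker_command(
        "/path/to/repo", "gen_summs.py", ["task=gen_summs", "vllm.num_gpus=4"]
    )
    # Print a copy-pastable shell command instead of executing it.
    print(shlex.join(cmd))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` once the paths are correct for your machine.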
1. Generate Summaries
Generate faithful and corrupted summaries from WikiLingua:
python3 scripts/gen_summs.py task=gen_summs \
  model.base_llm=Qwen/Qwen3-4B-Instruct-2507 \
  task.gen_summs.total_datapoints=14000 \
  vllm.num_gpus=4 \
  vllm.max_model_len=8192
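The dotted `key=value` arguments in this command are Hydra overrides: each one addresses a nested configuration entry. Hydra performs the merge itself; the standard-library sketch below only illustrates the idea, and the names `parse_value` and `apply_overrides` are hypothetical. It handles dotted value overrides only. A bare `task=gen_summs` selects a config group (here, a file under `conf/task/`) and is resolved differently by Hydra, so it is left out.

```python
def parse_value(raw):
    """Interpret booleans, ints and floats; keep everything else as a string."""
    if raw.lower() in ("true", "false"):
        return raw.lower() == "true"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw


def apply_overrides(config, overrides):
    """Merge `a.b.c=value` strings into a nested dict, in place."""
    for item in overrides:
        key, _, raw = item.partition("=")
        *parents, leaf = key.split(".")
        node = config
        for part in parents:
            # Create intermediate sections on demand.
            node = node.setdefault(part, {})
        node[leaf] = parse_value(raw)
    return config


if __name__ == "__main__":
    cfg = apply_overrides(
        {"vllm": {"num_gpus": 1}},  # assumed default, for illustration only
        [
            "model.base_llm=Qwen/Qwen3-4B-Instruct-2507",
            "task.gen_summs.total_datapoints=14000",
            "vllm.num_gpus=4",
            "vllm.max_model_len=8192",
        ],
    )
    print(cfg)
```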
2. Generate Training Data
Create labeled training data for the faithfulness judge:
python3 scripts/gen_data.py task=gen_data \
  model.base_llm=Qwen/Qwen3-4B-Instruct-2507 \
  task.data_gen.n_samples=1000 \
  task.data_gen.summaries_path=./output/data/corrupt_v2 \
  vllm.num_gpus=4 \
  vllm.max_model_len=8192
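To make the shape of such training data concrete, here is a rough sketch of how faithful and corrupted summaries might be paired with their source documents and sampled. It is not the repository's implementation: the `JudgeExample` class, the record field names, and the 1/0 label encoding are all assumptions. The repository derives its labeled data with an LLM judge, whereas this sketch assigns labels by construction purely to show the record layout.

```python
import random
from dataclasses import dataclass


@dataclass
class JudgeExample:
    document: str
    summary: str
    language: str
    label: int  # hypothetical encoding: 1 = faithful, 0 = unfaithful


def build_examples(records, n_samples, seed=0):
    """Turn (document, faithful, corrupted) records into a shuffled, labeled sample."""
    examples = []
    for rec in records:
        # Each source record yields one positive and one negative example.
        examples.append(
            JudgeExample(rec["document"], rec["faithful_summary"], rec["language"], 1)
        )
        examples.append(
            JudgeExample(rec["document"], rec["corrupted_summary"], rec["language"], 0)
        )
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    rng.shuffle(examples)
    return examples[:n_samples]


if __name__ == "__main__":
    records = [
        {
            "document": "doc text",
            "faithful_summary": "faithful text",
            "corrupted_summary": "corrupted text",
            "language": "es",
        }
    ]
    for example in build_examples(records, n_samples=2):
        print(example)
```

Here `n_samples` plays the same role as the `task.data_gen.n_samples` override: it caps how many examples are kept after shuffling.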
Citations
If you use this work, please cite:
@inproceedings{alfano2026multilingual,
title = {Multilingual Self-Taught Faithfulness Evaluators},
author = {Carlo Alfano and Aymen Al Marjani and Zeno Jonke and Amin Mantrach and Saab Mansour and Marcello Federico},
year = {2026},
booktitle = {Findings of the Association for Computational Linguistics: EACL 2026}
}
Security
See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.
License
This library is licensed under the CC-BY-4.0 License. See the [LICENSE](LICENSE) file.
Notability
5.0/10. New research repo, substantive but no traction.