amazon-science/span-mt-metaeval
Language: Python
License: NOASSERTION
Stars: 2
Forks: 1
Open issues: 0
Created: 2026-02-23T10:03:38Z
Pushed: 2026-03-23T16:25:08Z
Default branch: main
Fork: no
Archived: no
README:
Span-Level Machine Translation Meta-Evaluation
A toolkit for evaluating the quality of automatic machine translation metrics using WMT Metrics Shared Task benchmarks. This framework enables researchers to run LLM-based evaluation of translations and measure how well these automatic evaluators correlate with human MQM annotations.
Overview
The MT Evaluation Framework supports two main research workflows:
1. Automatic Evaluation: Use LLMs (via AWS Bedrock) to evaluate translation quality by identifying errors (spans, categories, severities) following MQM guidelines
2. Meta-Evaluation: Measure how well your automatic evaluator's error detection aligns with human MQM annotations from WMT benchmarks
Supported Benchmarks
- WMT22: en-de, zh-en language pairs
- WMT23: en-de, zh-en language pairs
- WMT24: en-de, en-es, ja-zh language pairs
- WMT25: en-ko, ja-zh language pairs (MQM data)
⚠️ Security & Usage Disclaimer
See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.
This framework is intended solely for personal research and experimentation purposes using non-sensitive, public data in isolated, trusted environments.
This package does NOT implement the security controls required for:
- Processing confidential, proprietary, or personally identifiable data
- Production or customer-facing environments
- Multi-tenant or shared environments requiring enterprise-grade security controls
Users are responsible for managing their own AWS credentials, infrastructure security, and data handling practices. By using this package, you acknowledge that:
- Cache files (local and S3) are stored without encryption or integrity verification.
- Logs may contain information about your AWS resources and evaluation content.
- No cost-control mechanisms are enforced; monitor your AWS usage accordingly.
- Input validation is minimal; do not use untrusted or adversarial inputs.
Any use beyond personal experimentation with public data would require significant security enhancements. If you have questions about appropriate use, consult your security team.
AI-Generated Content Notice
The outputs/ directory contains data generated by large language models (including Anthropic Claude, Qwen, and GPT models via AWS Bedrock). This data is provided for research and meta-evaluation purposes only.
This data may NOT be used for training or developing AI/ML models. See [outputs/README.md](outputs/README.md) for full provenance details and usage restrictions.
Data
The experiment outputs (LLM-based evaluator error predictions) are stored as compressed JSONL files for storage efficiency.
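The compressed files can also be read in place, without decompressing them first. The following is a minimal sketch using only the Python standard library. The helper names `iter_jsonl_gz` and `iter_outputs` are hypothetical and not part of this repository, and the sketch assumes nothing about the record schema beyond one JSON object per line.

```python
import gzip
import json
from pathlib import Path
from typing import Iterator, Tuple


def iter_jsonl_gz(path) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a .jsonl.gz file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def iter_outputs(root="outputs") -> Iterator[Tuple[Path, dict]]:
    """Walk a directory tree and yield (file path, record) pairs.

    Hypothetical helper: `root` defaults to the `outputs/` directory
    mentioned in the README.
    """
    for path in sorted(Path(root).rglob("*.jsonl.gz")):
        for record in iter_jsonl_gz(path):
            yield path, record
```

Iterating lazily keeps memory use flat even for large result files, since only one record is held at a time.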
To decompress all files:
find outputs/ -name "*.jsonl.gz" -exec gunzip {} \;
Installation
Create a conda environment (optional, but recommended) using Python >= 3.12
cd MTEvaluationFramework
Install in development mode
pip install -e .
### Requirements
- Python ≥ 3.12
- AWS credentials configured for Bedrock access (see [AWS Configuration](#aws-configuration))
- The [`mt-metrics-eval`](https://github.com/google-research/mt-metrics-eval) package (must be installed manually; see [Installation](#installation)) for loading WMT benchmark data
## Quick Start
### 1. Run LLM Evaluation on WMT Data
Evaluate translations from WMT23 using Claude 3.5 Haiku:
python scripts/evaluate.py \
    --model-id us.anthropic.claude-3-5-haiku-20241022-v1:0 \
    --evaluation-schema unified-mqm-boosted-v5 \
    --test-set wmt23 \
    --lps en-de zh-en
This will:
- Load WMT23 translation samples with human MQM annotations
- Send each source/translation pair to the LLM for error annotation
- Cache results in `outputs/` to avoid re-evaluation
- Print a summary of evaluation costs and timing
### 2. Span-Level Meta-Evaluation
Measure precision/recall/F1 of error span detection against human annotations:
python scripts/meta_evaluate.py \
    --test-sets wmt23 \
    --lps en-de zh-en
Output includes:
- Precision: What fraction of predicted error spans match human annotations
- Recall: What fraction of human error spans were detected
- F1: Harmonic mean of precision and recall
- Statistics broken down by language pair and globally
### 3. Score-Level Meta-Evaluation
Compute correlation between automatic scores and human MQM scores (legacy, might not work without adjustments):
python scripts/meta_evaluate_scores.py \
    --lps en-de en-es ja-zh
Output includes:
- PCE (Pairwise Confidence Error)
- KWT (Kendall's Tau with Ties)
- Pearson correlation
## Workflows
### Workflow 1: Evaluating Translations
The `evaluate.py` script runs LLM-based evaluation on WMT benchmark data.
python scripts/evaluate.py \
    --model-id \
    --evaluation-schema \
    --test-set \
    --lps
**Key options:**

| Option | Description |
|--------|-------------|
| `--model-id` | Model identifier (see [Available Models](#available-models)) |
| `--evaluation-schema` | Evaluation prompt schema (see [Available Evaluators](#available-evaluators)) |
| `--test-set` | WMT test set year |
| `--lps` | Space-separated language pairs (e.g., `en-de zh-en`) |
| `--cache-dir` | Directory for caching results (default: `outputs`) |
| `--zero-shot` | Run without few-shot examples (default: `true`) |
| `--toy N` | Only evaluate first N samples (for testing) |
| `--bedrock-max-concurrent` | Maximum number of concurrent async API calls (default: `5`) |

**Batch inference options:**

| Option | Description |
|--------|-------------|
| `--use-batch-inference` | Use AWS Bedrock batch inference (cost-effective for large runs) |
| `--s3-bucket-name` | S3 bucket for batch inference I/O |
| `--use-s3-backup` | Sync cache to S3 for persistence |

**Example with batch inference:**
python scripts/evaluate.py \
    --model-id global.anthropic.claude-sonnet-4-5-20250929-v1:0 \
    --evaluation-schema unified-mqm-boosted-v5 \
    --test-set wmt25 \
    --lps en-ko ja-zh \
    --use-batch-inference \
    --s3-bucket-name my-bucket \
    --run-specific-info wmt25-submission
### Workflow 2: Span-Level Meta-Evaluation
The `meta_evaluate.py` script computes precision/recall/F1 for error span detection.…
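To make the span-level metrics concrete, here is a minimal sketch of precision, recall, and F1 over character spans. It is not the repository's implementation. The matching rule used here (a span counts as matched if it overlaps any span on the other side by at least one character) is an assumption, since the excerpt does not state the criterion `meta_evaluate.py` applies. The names `Span`, `overlaps`, and `span_prf` are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def overlaps(a: Span, b: Span) -> bool:
    """True if two half-open spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def span_prf(predicted: List[Span], gold: List[Span]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 under an any-overlap matching rule (assumed)."""
    matched_pred = sum(any(overlaps(p, g) for g in gold) for p in predicted)
    matched_gold = sum(any(overlaps(g, p) for p in predicted) for g in gold)
    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Two of three predicted spans overlap a human span; both human spans are found.
predicted = [(0, 5), (10, 15), (30, 35)]
gold = [(3, 8), (12, 20)]
precision, recall, f1 = span_prf(predicted, gold)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")  # P=0.667 R=1.000 F1=0.800
```

A stricter rule, such as exact boundary match or a minimum overlap ratio, would lower all three numbers; the choice of matching criterion is part of what a span-level meta-evaluation has to specify.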