Repo · Amazon (Nova) · published Feb 23, 2026

amazon-science/span-mt-metaeval

Python

Language: Python

License: NOASSERTION

Stars: 2

Forks: 1

Open issues: 0

Created: 2026-02-23T10:03:38Z

Pushed: 2026-03-23T16:25:08Z

Default branch: main

Fork: no

Archived: no

README:

Span-Level Machine Translation Meta-Evaluation

A toolkit for evaluating the quality of automatic machine translation metrics using WMT Metrics Shared Task benchmarks. This framework enables researchers to run LLM-based evaluation of translations and measure how well these automatic evaluators correlate with human MQM annotations.

Overview

The MT Evaluation Framework supports two main research workflows:

1. Automatic Evaluation: Use LLMs (via AWS Bedrock) to evaluate translation quality by identifying errors (spans, categories, severities) following MQM guidelines.
2. Meta-Evaluation: Measure how well your automatic evaluator's error detection aligns with human MQM annotations from WMT benchmarks.
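To make the notion of an annotated error concrete, here is a minimal sketch of how one error (span, category, severity) could be represented. The class name `ErrorSpan`, its field names, the end-exclusive offset convention, and the example category label are all assumptions for illustration; they are not taken from this repository's code or output schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ErrorSpan:
    """Hypothetical record for one MQM-style error in a translation."""

    start: int      # character offset where the error begins (inclusive)
    end: int        # character offset where the error ends (exclusive)
    category: str   # illustrative label, e.g. "fluency/grammar"
    severity: str   # illustrative label, e.g. "minor" or "major"

    def __post_init__(self) -> None:
        # Reject empty or inverted spans early.
        if not 0 <= self.start < self.end:
            raise ValueError("expected 0 <= start < end")

    def text(self, translation: str) -> str:
        """Return the substring of the translation covered by this span."""
        return translation[self.start:self.end]

    def to_json(self) -> str:
        """Serialize the span as a JSON object."""
        return json.dumps(asdict(self))


span = ErrorSpan(start=15, end=18, category="fluency/grammar", severity="minor")
print(span.text("The cat sat on mat."))
```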

Supported Benchmarks

  • WMT22: en-de, zh-en language pairs
  • WMT23: en-de, zh-en language pairs
  • WMT24: en-de, en-es, ja-zh language pairs
  • WMT25: en-ko, ja-zh language pairs (MQM data)
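The benchmark list above can be captured as a small lookup table for checking a test set and language-pair combination before launching a run. This is a sketch only: the helper `unsupported_pairs` is hypothetical, and the identifiers `wmt22` and `wmt24` are assumed to follow the same naming pattern as the `wmt23` and `wmt25` values used in the example commands.

```python
# Language pairs per benchmark, copied from the list above.
SUPPORTED_LPS: dict[str, tuple[str, ...]] = {
    "wmt22": ("en-de", "zh-en"),
    "wmt23": ("en-de", "zh-en"),
    "wmt24": ("en-de", "en-es", "ja-zh"),
    "wmt25": ("en-ko", "ja-zh"),
}


def unsupported_pairs(test_set: str, lps: list[str]) -> list[str]:
    """Return the requested language pairs that the test set does not list."""
    supported = SUPPORTED_LPS.get(test_set.lower())
    if supported is None:
        raise ValueError(f"unknown test set: {test_set!r}")
    return [lp for lp in lps if lp not in supported]


print(unsupported_pairs("wmt23", ["en-de", "ja-zh"]))
```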

⚠️ Security & Usage Disclaimer

See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.

This framework is intended solely for personal research and experimentation purposes using non-sensitive, public data in isolated, trusted environments.

This package does NOT implement the security controls required for:

  • Processing confidential, proprietary, or personally identifiable data
  • Production or customer-facing environments
  • Multi-tenant or shared environments requiring enterprise-grade security controls

Users are responsible for managing their own AWS credentials, infrastructure security, and data handling practices. By using this package, you acknowledge that:

  • Cache files (local and S3) are stored without encryption or integrity verification.
  • Logs may contain information about your AWS resources and evaluation content.
  • No cost-control mechanisms are enforced; monitor your AWS usage accordingly.
  • Input validation is minimal; do not use untrusted or adversarial inputs.

Any use beyond personal experimentation with public data would require significant security enhancements. If you have questions about appropriate use, consult your security team.

AI-Generated Content Notice

The outputs/ directory contains data generated by large language models (including Anthropic Claude, Qwen, and GPT models via AWS Bedrock). This data is provided for research and meta-evaluation purposes only.

This data may NOT be used for training or developing AI/ML models. See [outputs/README.md](outputs/README.md) for full provenance details and usage restrictions.

Data

The experiment outputs (LLM-based evaluator error predictions) are stored as compressed JSONL files for storage efficiency.

To decompress all files:

find outputs/ -name "*.jsonl.gz" -exec gunzip {} \;
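If you would rather not decompress the files on disk, they can also be read in place with the Python standard library. The sketch below assumes only what is stated above (one JSON object per line, gzip-compressed); the helper names `iter_jsonl_gz` and `count_records` are hypothetical, and no assumption is made about the fields inside each record.

```python
import gzip
import json
from pathlib import Path
from typing import Iterator


def iter_jsonl_gz(path: Path) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a .jsonl.gz file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_records(root: Path) -> dict[str, int]:
    """Map every .jsonl.gz file under root to its number of records."""
    return {
        str(path): sum(1 for _ in iter_jsonl_gz(path))
        for path in sorted(root.rglob("*.jsonl.gz"))
    }
```

For example, `count_records(Path("outputs"))` gives a quick overview of how many predictions each file holds.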

Installation

Create a conda environment (optional, but recommended) using Python >= 3.12 […]

    cd MTEvaluationFramework

Install in development mode

pip install -e .

### Requirements

- Python ≥ 3.12
- AWS credentials configured for Bedrock access (see [AWS Configuration](#aws-configuration))
- The [`mt-metrics-eval`](https://github.com/google-research/mt-metrics-eval) package (must be installed manually — see [Installation](#installation)) for loading WMT benchmark data

## Quick Start

### 1. Run LLM Evaluation on WMT Data

Evaluate translations from WMT23 using Claude 3.5 Haiku:

    python scripts/evaluate.py \
        --model-id us.anthropic.claude-3-5-haiku-20241022-v1:0 \
        --evaluation-schema unified-mqm-boosted-v5 \
        --test-set wmt23 \
        --lps en-de zh-en

This will:
- Load WMT23 translation samples with human MQM annotations
- Send each source/translation pair to the LLM for error annotation
- Cache results in `outputs/` to avoid re-evaluation
- Print a summary of evaluation costs and timing
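The caching step can be pictured with the following sketch. It is not the repository's implementation: the key derivation, the one-JSON-file-per-sample layout, and the names `cache_key` and `cached_evaluate` are assumptions chosen to illustrate how a second run can skip samples that were already evaluated.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cache_key(model_id: str, schema: str, source: str, translation: str) -> str:
    """Derive a stable key from everything that influences the result."""
    payload = json.dumps([model_id, schema, source, translation], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_evaluate(
    cache_dir: Path,
    model_id: str,
    schema: str,
    source: str,
    translation: str,
    evaluate: Callable[[str, str], dict],
) -> dict:
    """Return the cached result if present; otherwise evaluate and store it."""
    path = cache_dir / f"{cache_key(model_id, schema, source, translation)}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    result = evaluate(source, translation)  # hypothetical call to the LLM
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result, ensure_ascii=False), encoding="utf-8")
    return result
```

Including the model and schema in the key means that switching either one triggers a fresh evaluation instead of reusing stale results.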

### 2. Span-Level Meta-Evaluation

Measure precision/recall/F1 of error span detection against human annotations:

    python scripts/meta_evaluate.py \
        --test-sets wmt23 \
        --lps en-de zh-en

Output includes:
- Precision: What fraction of predicted error spans match human annotations
- Recall: What fraction of human error spans were detected
- F1: Harmonic mean of precision and recall
- Statistics broken down by language pair and globally
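As a rough illustration of these three numbers, the sketch below scores predicted spans against human spans. The matching rule is an assumption: here a predicted span counts as a match if it shares at least one character with any human span, and vice versa. The excerpt does not state which matching criterion `meta_evaluate.py` uses, so treat this as one possible definition rather than the toolkit's.

```python
Span = tuple[int, int]  # (start, end) character offsets, end exclusive


def overlaps(a: Span, b: Span) -> bool:
    """True if the two half-open intervals share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def span_prf(predicted: list[Span], gold: list[Span]) -> tuple[float, float, float]:
    """Precision, recall and F1 under the overlap-based matching assumption."""
    matched_pred = sum(1 for p in predicted if any(overlaps(p, g) for g in gold))
    matched_gold = sum(1 for g in gold if any(overlaps(g, p) for p in predicted))
    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# One of three predictions overlaps a human span; one of two human spans is found.
precision, recall, f1 = span_prf([(0, 5), (10, 15), (20, 25)], [(3, 8), (30, 35)])
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```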

### 3. Score-Level Meta-Evaluation

Compute correlation between automatic scores and human MQM scores (legacy, might not work without adjustments):

    python scripts/meta_evaluate_scores.py \
        --lps en-de en-es ja-zh

Output includes:
- PCE (Pairwise Confidence Error)
- KWT (Kendall's Tau with Ties)
- Pearson correlation

## Workflows

### Workflow 1: Evaluating Translations

The `evaluate.py` script runs LLM-based evaluation on WMT benchmark data.

    python scripts/evaluate.py \
        --model-id MODEL_ID \
        --evaluation-schema EVALUATION_SCHEMA \
        --test-set TEST_SET \
        --lps LP [LP ...]

**Key options:**

| Option | Description |
|--------|------------------------------------------------------------------------------|
| `--model-id` | Model identifier (see [Available Models](#available-models)) |
| `--evaluation-schema` | Evaluation prompt schema (see [Available Evaluators](#available-evaluators)) |
| `--test-set` | WMT test set year |
| `--lps` | Space-separated language pairs (e.g., `en-de zh-en`) |
| `--cache-dir` | Directory for caching results (default: `outputs`) |
| `--zero-shot` | Run without few-shot examples (default: `true`) |
| `--toy N` | Only evaluate first N samples (for testing) |
| `--bedrock-max-concurrent` | Maximum number of concurrent async API calls (default: `5`) |
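A limit such as `--bedrock-max-concurrent` is typically enforced with a semaphore around each asynchronous call. The sketch below shows that general pattern with `asyncio.Semaphore`; the function `run_bounded` and the `worker` coroutine are hypothetical stand-ins, and the repository may implement the limit differently.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def run_bounded(
    items: Iterable[T],
    worker: Callable[[T], Awaitable[R]],
    max_concurrent: int = 5,
) -> list[R]:
    """Run worker(item) for every item with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(item: T) -> R:
        # Each task waits here until one of the max_concurrent slots is free.
        async with semaphore:
            return await worker(item)

    # gather preserves the input order in its result list.
    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_call(sample_id: int) -> int:
    """Stand-in for a model call; sleeps briefly instead of calling an API."""
    await asyncio.sleep(0.01)
    return sample_id * 2


print(asyncio.run(run_bounded(range(6), fake_call, max_concurrent=2)))
```

Raising the limit speeds up large runs but increases the chance of hitting service-side throttling, so a small default is a reasonable starting point.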

**Batch inference options:**

| Option | Description |
|--------|-------------|
| `--use-batch-inference` | Use AWS Bedrock batch inference (cost-effective for large runs) |
| `--s3-bucket-name` | S3 bucket for batch inference I/O |
| `--use-s3-backup` | Sync cache to S3 for persistence |

**Example with batch inference:**

    python scripts/evaluate.py \
        --model-id global.anthropic.claude-sonnet-4-5-20250929-v1:0 \
        --evaluation-schema unified-mqm-boosted-v5 \
        --test-set wmt25 \
        --lps en-ko ja-zh \
        --use-batch-inference \
        --s3-bucket-name my-bucket \
        --run-specific-info wmt25-submission

### Workflow 2: Span-Level Meta-Evaluation

The `meta_evaluate.py` script computes precision/recall/F1 for error span detection.…
