What does this repo signal mean?

Amazon (Nova) published amazon-science/LLM-Accuracy-Stats (Python). Repository signals like this one can expose tooling, evaluation, infrastructure, or model-adjacent work before it appears in a launch post. High-signal details: repo amazon-science/LLM-Accuracy-Stats · language Python · Low-star repo from Amazon Science. onlylabs links this event to 1 captured evidence page and 6 related repo signals.

Amazon (Nova) Repo: amazon-science/LLM-Accuracy-Stats

Captured source

GitHub · github.com/amazon-science/LLM-Accuracy-Stats

amazon-science/LLM-Accuracy-Stats repository metadata

published Feb 6, 2026 · seen Jun 5 · captured Jun 11 · http 200 · method plain

amazon-science/LLM-Accuracy-Stats

Description: Test optimized LLMs for degraded accuracy

Language: Python

License: NOASSERTION

Stars: 4

Forks: 1

Open issues: 0

Created: 2026-02-06T13:09:06Z

Pushed: 2026-02-12T05:47:50Z

Default branch: main

Fork: no

Archived: yes

README:

When LLMs get significantly worse: A statistical approach to detect model degradations

This repository contains the code for reproducing experiments from our ICLR 2026 paper on statistical detection of LLM model degradations using McNemar's test. We provide tools to detect whether accuracy changes in optimized models are due to actual degradation or evaluation noise.
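As a rough illustration of the test named above, the following sketch computes a two-sided exact McNemar p-value from the two discordant counts of a paired comparison. The function name `mcnemar_exact` and its arguments are assumptions made for this example; it is not the repository's implementation.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts.

    b: samples the baseline answers correctly and the optimized model does not
    c: samples the optimized model answers correctly and the baseline does not
    (both names are hypothetical, chosen for this sketch)
    """
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree, so there is no evidence of a change
    k = min(b, c)
    # Under the null hypothesis each discordant sample falls on either side
    # with probability 1/2, so the smaller count follows Binomial(n, 1/2).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Samples on which both models agree do not enter the statistic; only the disagreements carry information about a degradation.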

Installation

We recommend using uv: https://docs.astral.sh/uv/getting-started/installation/

uv venv ~/venv_accuracy_paper --python 3.12 --seed
source ~/venv_accuracy_paper/bin/activate
pip install vllm==0.10.0 "lm-eval[math,ifeval,sentencepiece]==0.4.8"

General Usage

We generally recommend using our permutation-based tests (see Appendix D). For binary data they give results equivalent to the direct McNemar tests, but they also generalize to non-binary data and multiple reruns.
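A minimal sketch of a paired sign-flip permutation test, assuming two aligned lists of per-sample scores. The function name, the two-sided statistic, and the Monte Carlo settings are illustrative assumptions, not the code in continuous_aggregation_script.py.

```python
import random


def paired_permutation_pvalue(baseline, optimized, n_perm=10_000, seed=0):
    """Monte Carlo p-value for the summed per-sample score difference.

    baseline, optimized: per-sample scores in the same sample order
    (binary or continuous). All names here are hypothetical.
    """
    if len(baseline) != len(optimized):
        raise ValueError("score lists must be aligned per sample")
    diffs = [o - b for b, o in zip(baseline, optimized)]
    observed = abs(sum(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        # Swapping the two models' scores on a sample flips the sign of its difference.
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total) >= observed - 1e-12:
            hits += 1
    # The add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)
```

Because the statistic is built from score differences, the same function works for 0/1 correctness and for scores averaged over several reruns.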

> If you run the same samples multiple times through the model (say with non-zero temperature), first average the score for each example!

Then organize the per-sample scores in CSV files, one file per task and model.

Then run the script with the following arguments:

python continuous_aggregation_script.py model_paths.json ./output_dir/

Inputs:

model_paths.json: JSON file mapping model names to their task CSV files
output_dir: Directory for output files

CSV Format: Each CSV should contain prompt_id and score columns.
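The preparation described above (average repeated runs per example, then store prompt_id and score columns) could look roughly like this. The helper names and the `(prompt_id, score)` input layout are assumptions for illustration only.

```python
import csv
import io
from collections import defaultdict


def average_reruns(records):
    """Average the scores of repeated runs per prompt.

    records: iterable of (prompt_id, score) pairs; a prompt_id may repeat
    once per rerun (hypothetical input layout).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for prompt_id, score in records:
        totals[prompt_id] += score
        counts[prompt_id] += 1
    return {pid: totals[pid] / counts[pid] for pid in totals}


def to_csv_text(averages):
    """Render the averaged scores as CSV text with prompt_id and score columns."""
    buf = io.StringIO(newline="")
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["prompt_id", "score"])
    for prompt_id, score in averages.items():
        writer.writerow([prompt_id, score])
    return buf.getvalue()
```

The returned text can then be written to one file per task and model, for example with `pathlib.Path(...).write_text(...)`.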

Outputs:

summary.csv: Pooled and per-task accuracies with p-values from permutation tests

Discrete Score Evaluation (LM-Eval Harness)

For binary scores, we also provide an aggregation script that performs a statistical analysis on top of LM-Evaluation Harness runs:

python aggregation_script.py task_metrics.json checkpoint_list.json ./output_dir/

When running lm-eval, it is crucial to store the per-example results with the --log_samples flag (see the example scripts).

Inputs:

task_metrics.json: Defines which metrics to extract for each task
checkpoint_list.json: Lists models and their result paths
output_dir: Directory for output files

Outputs:

model_comparison.csv: Detailed metrics for all tasks and models
summary_with_stderr.csv: Category-level metrics with standard errors
summary_without_stderr.csv: Category-level metrics without standard errors
model_differences.csv: Contingency table values and disagreement ratios
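To make the contingency table values concrete, here is a small sketch that tabulates paired binary outcomes for two models. The field names, and the definition of the disagreement ratio as the share of samples on which the models differ, are assumptions; the columns actually written to model_differences.csv may be defined differently.

```python
def contingency_counts(baseline, optimized):
    """2x2 contingency table for paired 0/1 correctness lists.

    Field names and the disagreement ratio definition are hypothetical.
    """
    if len(baseline) != len(optimized):
        raise ValueError("outcome lists must be aligned per sample")
    table = {"both_right": 0, "baseline_only": 0, "optimized_only": 0, "both_wrong": 0}
    for b, o in zip(baseline, optimized):
        if b and o:
            table["both_right"] += 1
        elif b:
            table["baseline_only"] += 1
        elif o:
            table["optimized_only"] += 1
        else:
            table["both_wrong"] += 1
    n = len(baseline)
    discordant = table["baseline_only"] + table["optimized_only"]
    table["disagreement_ratio"] = discordant / n if n else 0.0
    return table
```

The two off-diagonal counts, `baseline_only` and `optimized_only`, are the discordant counts that McNemar's test compares.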

Running Paper Experiments

LLM Evaluation

Update paths in:

checkpoint_list.json: Replace /path/to/your/results with your actual results directory
llm_experiments/*.sh: Replace /path/to/your/hf_cache, your_huggingface_token_here, and output paths with your actual paths/token

The scripts are in the llm_experiments/ directory. Example:

cd llm_experiments/
bash llama_3_paper.sh

Statistical Analysis

python aggregation_script.py task_metrics.json checkpoint_list.json ./output_dir/

Generate Figures for Synthetic Experiments and General Insights

Scripts in plots_and_synthetic/ directory:

cd plots_and_synthetic/
python test_power_plot.py # Figure: Test power analysis
python pvalue_heatmap_comparison.py # Figure: P-value heatmaps
python test_power_vs_tasks.py # Figure: Power vs number of tasks
python intro_figure.py # Figure: Introductory example
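For intuition about test power, the sketch below computes the exact power of a two-sided exact McNemar test, conditional on a fixed number of discordant samples. This simplified setup and all names in it are assumptions for illustration; it does not reproduce the paper's figures or the logic of test_power_plot.py.

```python
from math import comb


def exact_pvalue(k: int, n: int) -> float:
    """Two-sided exact p-value when k of n discordant samples are regressions."""
    if n == 0:
        return 1.0
    m = min(k, n - k)
    tail = sum(comb(n, i) for i in range(m + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def power(n: int, q: float, alpha: float = 0.05) -> float:
    """Probability of rejecting H0 given n discordant samples.

    q is the assumed probability that a discordant sample is a regression
    of the optimized model; q = 0.5 corresponds to no degradation.
    """
    return sum(
        comb(n, k) * q ** k * (1 - q) ** (n - k)
        for k in range(n + 1)
        if exact_pvalue(k, n) <= alpha
    )
```

Comparing `power(n, q)` across values of `n` indicates how many disagreements are needed before a given imbalance becomes detectable.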

Dataset Selection Analysis

Scripts in dataset_selection/ directory for seed/temperature analysis:

cd dataset_selection/
# Run evaluations with different seeds
bash mmlu_dataset_ablation_temp0.3.sh
bash mmlu_dataset_ablation_true_models.sh

# Analyze flip patterns across seeds
python seed_flip_analysis.py task_metrics.json results_dir1/ results_dir2/ [results_dir3/ ...]

Output: success_counts.json, success_histogram.pdf - Analysis of which documents consistently succeed/fail across different model runs
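The idea behind such a per-document success analysis can be sketched as follows, assuming each run is available as a mapping from document id to a 0/1 outcome. The function names and data layout are hypothetical and do not describe the actual format of success_counts.json.

```python
from collections import Counter


def success_counts(runs):
    """Count, per document, in how many runs it was answered correctly.

    runs: list of dicts mapping a document id to 1 (correct) or 0 (wrong),
    one dict per seed or model run (hypothetical layout).
    """
    counts = Counter()
    for run in runs:
        for doc_id, correct in run.items():
            counts[doc_id] += int(correct)  # adding 0 still registers the document
    return dict(counts)


def success_histogram(counts):
    """Map each success count to the number of documents that reached it."""
    return dict(Counter(counts.values()))
```

Documents with a count of zero or a count equal to the number of runs are the ones that consistently fail or succeed; the counts in between mark the documents whose outcome flips across runs.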

License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). See the [LICENSE](LICENSE) file for details.

Citation

If you find our work useful or use our tests you can cite our paper:

@inproceedings{
anonymous2026when,
title={When {LLM}s get significantly worse: A statistical approach to detect model degradations},
author={Jonas Kübler and Kailash Budhathoki and Matthäus Kleindessner and Xiong Zhou and Junming Yin and Ashish Khetan and George Karypis},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=cM3gsqEI4K}
}

Notability

notability 3.0/10

Low-star repo from Amazon Science