amazon-science/LLM-Accuracy-Stats
Description: Test optimized LLMs for degraded accuracy
Language: Python
License: NOASSERTION
Stars: 4
Forks: 1
Open issues: 0
Created: 2026-02-06T13:09:06Z
Pushed: 2026-02-12T05:47:50Z
Default branch: main
Fork: no
Archived: yes
README:
When LLMs get significantly worse: A statistical approach to detect model degradations
This repository contains the code for reproducing experiments from our ICLR 2026 paper on statistical detection of LLM model degradations using McNemar's test. We provide tools to detect whether accuracy changes in optimized models are due to actual degradation or evaluation noise.
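As a rough illustration of the idea, the sketch below computes a two-sided exact McNemar p-value from paired per-example correctness labels. It is a generic textbook version written for this note, not the repository's implementation; the function name `mcnemar_exact` and the toy labels are assumptions, and the paper may use a different variant of the test.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value (hypothetical helper, not from the repo).

    b: examples the baseline got right and the optimized model got wrong.
    c: examples the baseline got wrong and the optimized model got right.
    Under the null, each discordant example lands in b or c with probability 0.5.
    """
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, so no evidence of a difference
    k = min(b, c)
    # One-sided probability of a split at least as lopsided as the observed one.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2.0 * tail)


# Toy per-example correctness (1 = correct), paired by example.
baseline = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
optimized = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1]

b = sum(x == 1 and y == 0 for x, y in zip(baseline, optimized))
c = sum(x == 0 and y == 1 for x, y in zip(baseline, optimized))
p_value = mcnemar_exact(b, c)
print(f"b={b}, c={c}, p={p_value:.3f}")
```

Only the examples on which the two models disagree enter the test, which is why paired per-example results are needed rather than two aggregate accuracies.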
Installation
We recommend using uv: https://docs.astral.sh/uv/getting-started/installation/
uv venv ~/venv_accuracy_paper --python 3.12 --seed
source ~/venv_accuracy_paper/bin/activate
pip install vllm==0.10.0 lm-eval[math,ifeval,sentencepiece]==0.4.8
General Usage
We generally recommend using our permutation-based tests (see Appendix D of the paper). For binary data, they give results equivalent to the direct McNemar tests, but they also generalize to non-binary data and multiple reruns.
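To make the idea concrete, here is a minimal sketch of a paired sign-flip permutation test on per-example scores. It is a generic version of the technique and may differ from the procedure in Appendix D; the function name `paired_permutation_pvalue`, its parameters, and the toy scores are all assumptions for illustration.

```python
import random


def paired_permutation_pvalue(scores_a, scores_b, n_permutations=10_000, seed=0):
    """Two-sided sign-flip permutation test (hypothetical helper, not from the repo)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both models must be scored on the same examples")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_permutations):
        # Under the null the two models are exchangeable per example,
        # so each difference keeps or flips its sign with equal probability.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(permuted) >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_permutations + 1)


# Toy per-example scores in [0, 1], paired by example.
scores_a = [0.9, 0.8, 1.0, 0.7, 0.6, 1.0, 0.5, 0.9]
scores_b = [0.7, 0.8, 0.9, 0.4, 0.6, 0.8, 0.5, 0.6]
p_value = paired_permutation_pvalue(scores_a, scores_b)
print(f"p = {p_value:.4f}")
```

Because the test works on score differences, it applies unchanged to binary scores, to continuous scores, and to scores averaged over reruns.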
> If you run the same samples multiple times through the model (say with non-zero temperature), first average the score for each example!
Then organize the per-sample scores in CSV files, one file per task and model.
Then run the script with the following arguments:
python continuous_aggregation_script.py model_paths.json ./output_dir/
Inputs:
model_paths.json: JSON file mapping model names to their task CSV files
output_dir: Directory for output files
CSV Format: Each CSV should contain prompt_id and score columns.
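The preparation steps above (average reruns per example, then write one CSV with prompt_id and score columns) could look roughly like the following. Only the two column names come from this README; the raw data layout, the helper names, and the use of an in-memory buffer are assumptions for illustration.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical raw results: one (prompt_id, score) pair per rerun.
raw_runs = [
    ("q1", 1.0), ("q1", 0.0), ("q1", 1.0),
    ("q2", 0.0), ("q2", 0.0),
    ("q3", 1.0),
]


def average_per_prompt(rows):
    """Collapse multiple reruns into one mean score per prompt_id."""
    grouped = defaultdict(list)
    for prompt_id, score in rows:
        grouped[prompt_id].append(score)
    return {pid: mean(scores) for pid, scores in grouped.items()}


def write_scores_csv(handle, averaged):
    """Write the prompt_id and score columns to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(["prompt_id", "score"])
    for pid, score in sorted(averaged.items()):
        writer.writerow([pid, score])


averaged = average_per_prompt(raw_runs)
buffer = io.StringIO()  # stands in for one task+model CSV file
write_scores_csv(buffer, averaged)
print(buffer.getvalue())
```

To write a real file, pass `open(path, "w", newline="")` instead of the buffer, using one path per task and model.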
Outputs:
summary.csv: Pooled and per-task accuracies with p-values from permutation tests
Discrete Score Evaluation (LM-Eval Harness)
For binary scores, we also provide an aggregation script to perform a statistical analysis on top of LM-Evaluation Harness runs:
python aggregation_script.py task_metrics.json checkpoint_list.json ./output_dir/
When running lm-eval, it is crucial to store the results for each example using the flag --log_samples (see example scripts).
Inputs:
task_metrics.json: Defines which metrics to extract for each task
checkpoint_list.json: Lists models and their result paths
output_dir: Directory for output files
Outputs:
model_comparison.csv: Detailed metrics for all tasks and models
summary_with_stderr.csv: Category-level metrics with standard errors
summary_without_stderr.csv: Category-level metrics without standard errors
model_differences.csv: Contingency table values and disagreement ratios
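For readers unfamiliar with the quantities named for model_differences.csv, the sketch below builds a 2x2 contingency table from paired binary results. The exact columns and the definition of "disagreement ratio" used by the script are not stated here, so the reading below (share of examples on which the two models disagree) is an assumption, as are the function and key names.

```python
from collections import Counter


def contingency_table(baseline, optimized):
    """Count paired outcomes for two lists of 0/1 correctness labels."""
    if len(baseline) != len(optimized):
        raise ValueError("both models must be scored on the same examples")
    counts = Counter(zip(baseline, optimized))
    return {
        "both_correct": counts[(1, 1)],
        "only_baseline_correct": counts[(1, 0)],
        "only_optimized_correct": counts[(0, 1)],
        "both_wrong": counts[(0, 0)],
    }


def disagreement_ratio(table):
    """Assumed definition: discordant examples divided by all examples."""
    total = sum(table.values())
    discordant = table["only_baseline_correct"] + table["only_optimized_correct"]
    return discordant / total if total else 0.0


# Toy per-example correctness (1 = correct), paired by example.
baseline = [1, 1, 0, 1, 0, 1, 1, 0]
optimized = [1, 0, 0, 1, 1, 0, 1, 0]

table = contingency_table(baseline, optimized)
ratio = disagreement_ratio(table)
print(table, ratio)
```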
Running Paper Experiments
LLM Evaluation
Update paths in:
checkpoint_list.json: Replace /path/to/your/results with your actual results directory
llm_experiments/*.sh: Replace /path/to/your/hf_cache, your_huggingface_token_here, and output paths with your actual paths/token
Scripts in llm_experiments/ directory. Example:
cd llm_experiments/
bash llama_3_paper.sh
Statistical Analysis
python aggregation_script.py task_metrics.json checkpoint_list.json ./output_dir/
Generate Figures for Synthetic Experiments and General Insights
Scripts in plots_and_synthetic/ directory:
cd plots_and_synthetic/
python test_power_plot.py # Figure: Test power analysis
python pvalue_heatmap_comparison.py # Figure: P-value heatmaps
python test_power_vs_tasks.py # Figure: Power vs number of tasks
python intro_figure.py # Figure: Introductory example
Dataset Selection Analysis
Scripts in dataset_selection/ directory for seed/temperature analysis:
cd dataset_selection/
# Run evaluations with different seeds
bash mmlu_dataset_ablation_temp0.3.sh
bash mmlu_dataset_ablation_true_models.sh
# Analyze flip patterns across seeds
python seed_flip_analysis.py task_metrics.json results_dir1/ results_dir2/ [results_dir3/ ...]
Output: success_counts.json and success_histogram.pdf, an analysis of which documents consistently succeed or fail across different model runs.
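The kind of analysis described here can be pictured with a small sketch that counts, per document, how many runs answered it correctly, and then tallies those counts into a histogram. This is not the logic of seed_flip_analysis.py; the data layout, run names, and the helper `success_counts` are assumptions for illustration.

```python
from collections import Counter

# Hypothetical per-run correctness: run name -> {doc_id: 1 if correct else 0}.
runs = {
    "seed_0": {"doc_a": 1, "doc_b": 0, "doc_c": 1},
    "seed_1": {"doc_a": 1, "doc_b": 0, "doc_c": 0},
    "seed_2": {"doc_a": 1, "doc_b": 1, "doc_c": 0},
}


def success_counts(runs):
    """Number of runs in which each document was answered correctly."""
    # Only keep documents that appear in every run, so counts are comparable.
    shared = set.intersection(*(set(results) for results in runs.values()))
    return {doc: sum(results[doc] for results in runs.values()) for doc in sorted(shared)}


counts = success_counts(runs)
# Histogram: how many documents succeeded in exactly k runs.
histogram = Counter(counts.values())
print(counts)
print(dict(histogram))
```

Documents at either end of the histogram (always right or always wrong) are stable across seeds, while those in the middle flip between runs.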
License
This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). See the [LICENSE](LICENSE) file for details.
Citation
If you find our work useful or use our tests you can cite our paper:
@inproceedings{
anonymous2026when,
title={When {LLM}s get significantly worse: A statistical approach to detect model degradations},
author={Jonas Kübler and Kailash Budhathoki and Matthäus Kleindessner and Xiong Zhou and Junming Yin and Ashish Khetan and George Karypis},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=cM3gsqEI4K}
}