RepoUpstage (Solar)Upstage (Solar)published Jun 17, 2025seen 5d

UpstageAI/KIEval

Python

Open original ↗

Captured source

source ↗
published Jun 17, 2025seen 5dcaptured 9hhttp 200method plain

UpstageAI/KIEval

Description: Official Implementation of KIEval: Evaluation Metric for Document Key Information Extraction

Language: Python

Stars: 3

Forks: 1

Open issues: 4

Created: 2025-06-17T11:59:48Z

Pushed: 2026-01-13T20:28:24Z

Default branch: main

Fork: no

Archived: no

README:

KIEval: Evaluation Metric for Document Key Information Extraction

KIEval is an application-centric evaluation metric for Document Key Information Extraction (KIE).

This repository contains the official implementation used in the paper ["KIEval: Evaluation Metric for Document Key Information Extraction"](https://arxiv.org/abs/2503.05488), ICDAR 2025.

---

Key characteristics

  • Multi-level evaluation:
  • Non-group level – entity-wise performance for standalone keys (e.g., store_name, date).
  • Group level – performance over sets of related entities (e.g., per-line item on a receipt).
  • Structure-aware matching: Uses strategies such as the Hungarian algorithm to align predicted and gold groups, handling variable numbers of entities and groups.
  • CORD pipeline provided: This repository ships with an end-to-end evaluation script for the CORD dataset via HuggingFace datasets.

---

Evaluation flow

  • Model Output2CSV conversion stage
  • Given the model output file and the target dataset's ontology information, convert the model output into CSV files
  • Evaluation stage
  • KIEval then assesses the model's performance based on the saved pred and gt CSV files

---

Repository structure

  • `run_eval.py`:

High-level script to evaluate a model on the CORD dataset using KIEval.

  • Loads the dataset via datasets.load_dataset.
  • Automatically builds an ontology from CORD ground-truth.
  • Converts ground truth and predictions into CSV format.
  • Runs the KIEval metric and writes a markdown summary.
  • `eval_utils.py`:
  • load_ontology(...): Derives ontology keys and grouping information from the dataset (currently supports CORD).
  • parse_arguments(...): CLI options for run_eval.py.
  • set_up_savefolder(...): Creates the folder structure for evaluation artifacts.
  • `KIEval/kieval.py`:

Core implementation of the KIEval metric:

  • Confusion-matrix construction at non-group and group levels.
  • Public function kieval(...) that operates on CSV files.
  • `KIEval/utils.py`:

Utility functions for:

  • Parsing CSVs into grouped and non-grouped entities.
  • Computing TP / FP / FN.
  • Performing Hungarian matching between groups.
  • Aggregating metrics and rendering markdown tables.

---

Installation

The recommended way to reproduce the environment is:

git clone
cd KIEval

# This repository was tested with python=3.8
pip install -r requirements.txt

---

Quick start: evaluating a model on CORD

The provided script run_eval.py implements a full evaluation pipeline for the CORD dataset, using data loaded via HuggingFace datasets.

1. Prepare CORD ground truth

The script expects a dataset accessible through datasets.load_dataset(...). For the public CORD dataset on HuggingFace, use:

python -c "from datasets import load_dataset; load_dataset('naver-clova-ix/cord-v2')"

run_eval.py will later call:

from datasets import load_dataset
dataset = load_dataset(args.data_path_or_name)

and use the split dataset['test'], where each sample has a JSON-formatted ground_truth field with a gt_parse structure.

2. Prepare model predictions

You are responsible for running your own KIE model on the CORD test split and writing its predictions to disk.

For reference, we provide a sample output from QwenVL_72B on CORD in KIEval/QwenVL_72B_Sample/QwenVL_72B_output.tar.gz.

You can extract and examine this to understand the expected format.

3. Run the evaluator

From the repository root:

python run_eval.py \
--model_output_dir /path/to/model_outputs \
--data_path_or_name naver-clova-ix/cord-v2 \
--save_dir /path/to/eval_results

This will:

  • Build an ontology from the dataset (all splits) and derive grouped ontology information.
  • Convert each document’s ground truth and prediction into a CSV representation suitable for the metric.
  • Run KIEval and write a summary markdown file:
  • `/path/to/eval_results/result.md` – aggregate KIEval scores.
  • Also create intermediate folders:
  • `/path/to/eval_results/gt/` – CSVs of gold annotations.
  • `/path/to/eval_results/pred/` – CSVs of predictions.

You can open result.md to check KIEval Entity F1 and KIEval Aligned scores (as discussed in the paper).

---

Python API

In addition to the CORD pipeline, you can call the KIEval metric directly from Python if you already have:

  • gold CSVs
  • pred CSVs
  • ontology file -- list of ontology keys to include in KIEval evaluation

Core metric function

The main entry point is KIEval.kieval.kieval:

from glob import glob

from KIEval.kieval import kieval

# Paths to CSVs (one gold and one prediction file per document)
gold_csv_files = sorted(glob("/path/to/gt/*.csv"))
pred_csv_files = sorted(glob("/path/to/pred/*.csv"))
with open("/path/to/ontology_file.json") as f:
ontology_keys = json.load(f)

result, _, _ = kieval(
gold_csv_files=gold_csv_files,
pred_csv_files=pred_csv_files,
shortlist=list(ontology_keys), # ontology keys to include in scoring
empty_token="",
grouping_strategy="hungarian", # or "max_em_score"
value_delimiter="||", # delimiter for multi-valued fields
entity_type_delimiter="\t", # delimiter between entity key and value in CSV
split_merged_value=True,
include_empty_token_in_score_calculation=False,
print_score=True,
)

print(result)

Expected CSV format (high level)

The CSV format is shared by gold and prediction files:

  • Files are logically split into blocks separated by blank lines.
  • Group blocks:
  • First line: integer group index (e.g., 0, 1, ...).
  • Second line: ontology keys separated by entity_type_delimiter (default: tab).
  • Following lines: values for each entity key, again separated by entity_type_delimiter.
  • Non-group blocks:
  • Lines that do not start with a digit belong to non-group entities.
  • Each line is keyvalue.
  • Only keys listed in shortlist are scored; others are filtered out.

For CORD, these CSVs are generated automatically by run_eval.py and fill_entity_for_cord(...), so you do not normally need to construct them by hand unless you are integrating a new dataset.…

Excerpt shown — open the source for the full document.

Notability

notability 2.0/10

Low traction, new repo with 3 stars