What does this repo signal mean?

Upstage (Solar) published UpstageAI/KIEval (Python). This repository signal exposes tooling, eval, infrastructure, or model-adjacent work before it may appear in a launch post. High-signal details: repo UpstageAI/KIEval · language Python · Low traction, new repo with 3 stars. onlylabs links this event to 1 captured evidence page and 6 related repo signals.

Upstage (Solar) Repo: UpstageAI/KIEval

Captured source

GitHub/github.com/UpstageAI/KIEval

UpstageAI/KIEval repository metadata

published Jun 17, 2025 · seen Jun 5 · captured Jun 11 · http 200 · method plain

UpstageAI/KIEval

Description: Official Implementation of KIEval: Evaluation Metric for Document Key Information Extraction

Language: Python

Stars: 3

Forks: 1

Open issues: 4

Created: 2025-06-17T11:59:48Z

Pushed: 2026-01-13T20:28:24Z

Default branch: main

Fork: no

Archived: no

README:

KIEval: Evaluation Metric for Document Key Information Extraction

KIEval is an application-centric evaluation metric for Document Key Information Extraction (KIE).

This repository contains the official implementation used in the paper ["KIEval: Evaluation Metric for Document Key Information Extraction"](https://arxiv.org/abs/2503.05488), ICDAR 2025.

---

Key characteristics

Multi-level evaluation:
  Non-group level – entity-wise performance for standalone keys (e.g., store_name, date).
  Group level – performance over sets of related entities (e.g., each line item on a receipt).
Structure-aware matching: Uses strategies such as the Hungarian algorithm to align predicted and gold groups, handling variable numbers of entities and groups.
CORD pipeline provided: This repository ships with an end-to-end evaluation script for the CORD dataset via HuggingFace datasets.
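To make the structure-aware matching idea concrete, here is a minimal sketch of aligning predicted groups to gold groups so that the total number of exactly matching (key, value) pairs is maximized. It is not the repository's code: it uses an exhaustive search over assignments as a stand-in for the Hungarian algorithm (both return an optimal one-to-one assignment, but the exhaustive version is only practical for a handful of groups). The function names `group_score` and `align_groups`, the scoring rule, and the keys `menu.nm` and `menu.price` are illustrative assumptions.

```python
from itertools import permutations


def group_score(pred_group, gold_group):
    """Count exact (key, value) matches between two groups (assumed scoring rule)."""
    return len(set(pred_group.items()) & set(gold_group.items()))


def align_groups(pred_groups, gold_groups):
    """Return the one-to-one (pred_idx, gold_idx) pairs with the highest total score.

    Exhaustive stand-in for the Hungarian algorithm; small inputs only.
    """
    n_pred, n_gold = len(pred_groups), len(gold_groups)
    size = max(n_pred, n_gold)
    # Pad the score matrix to a square so unequal group counts are handled:
    # padded rows and columns score 0 and mean "left unmatched".
    scores = [
        [
            group_score(pred_groups[i], gold_groups[j]) if i < n_pred and j < n_gold else 0
            for j in range(size)
        ]
        for i in range(size)
    ]
    best_total, best_perm = -1, ()
    for perm in permutations(range(size)):
        total = sum(scores[i][perm[i]] for i in range(size))
        if total > best_total:
            best_total, best_perm = total, perm
    # Drop pairs that involve a padded row or column.
    pairs = [(i, j) for i, j in enumerate(best_perm) if i < n_pred and j < n_gold]
    return pairs, best_total


# Illustrative data: two gold line items, three predicted ones in a different order.
gold = [
    {"menu.nm": "Latte", "menu.price": "4.50"},
    {"menu.nm": "Bagel", "menu.price": "3.00"},
]
pred = [
    {"menu.nm": "Bagel", "menu.price": "3.00"},
    {"menu.nm": "Latte", "menu.price": "4.00"},
    {"menu.nm": "Tip", "menu.price": "1.00"},
]
pairs, total = align_groups(pred, gold)
print(pairs, total)
```

In this example the third predicted group has no gold counterpart, so it stays unmatched and would count against the prediction at the group level.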

---

Evaluation flow

Model Output2CSV conversion stage
  Given the model output file and the target dataset's ontology information, convert the model output into CSV files.
Evaluation stage
  KIEval then assesses the model's performance based on the saved pred and gt CSV files.

---

Repository structure

`run_eval.py`:

High-level script to evaluate a model on the CORD dataset using KIEval.

Loads the dataset via datasets.load_dataset.
Automatically builds an ontology from CORD ground-truth.
Converts ground truth and predictions into CSV format.
Runs the KIEval metric and writes a markdown summary.

`eval_utils.py`:
load_ontology(...): Derives ontology keys and grouping information from the dataset (currently supports CORD).
parse_arguments(...): CLI options for run_eval.py.
set_up_savefolder(...): Creates the folder structure for evaluation artifacts.

`KIEval/kieval.py`:

Core implementation of the KIEval metric:

Confusion-matrix construction at non-group and group levels.
Public function kieval(...) that operates on CSV files.

`KIEval/utils.py`:

Utility functions for:

Parsing CSVs into grouped and non-grouped entities.
Computing TP / FP / FN.
Performing Hungarian matching between groups.
Aggregating metrics and rendering markdown tables.
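As a rough illustration of the TP / FP / FN step listed above, the sketch below counts exact matches between predicted and gold (key, value) entities as multisets and derives precision, recall, and F1. This is an assumption about the general technique, not a copy of `KIEval/utils.py`; the function names and the sample entities are hypothetical, and the repository's matching rules may differ.

```python
from collections import Counter


def count_tp_fp_fn(pred_entities, gold_entities):
    """Compare two lists of (key, value) pairs as multisets."""
    pred_counts = Counter(pred_entities)
    gold_counts = Counter(gold_entities)
    tp = sum((pred_counts & gold_counts).values())  # present in both
    fp = sum((pred_counts - gold_counts).values())  # predicted but not in gold
    fn = sum((gold_counts - pred_counts).values())  # in gold but not predicted
    return tp, fp, fn


def precision_recall_f1(tp, fp, fn):
    """Standard definitions, returning 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Illustrative entities: the predicted date is wrong, everything else matches.
gold = [("store_name", "Cafe Uno"), ("date", "2024-01-05"), ("total", "7.50")]
pred = [("store_name", "Cafe Uno"), ("date", "2024-01-06"), ("total", "7.50")]

tp, fp, fn = count_tp_fp_fn(pred, gold)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(tp, fp, fn, round(f1, 3))
```

A wrong value counts twice here: once as a false positive for the predicted value and once as a false negative for the missed gold value.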

---

Installation

The recommended way to reproduce the environment is:

git clone <repository URL>
cd KIEval

# This repository was tested with python=3.8
pip install -r requirements.txt

---

Quick start: evaluating a model on CORD

The provided script run_eval.py implements a full evaluation pipeline for the CORD dataset, using data loaded via HuggingFace datasets.

1. Prepare CORD ground truth

The script expects a dataset accessible through datasets.load_dataset(...). For the public CORD dataset on HuggingFace, use:

python -c "from datasets import load_dataset; load_dataset('naver-clova-ix/cord-v2')"

run_eval.py will later call:

from datasets import load_dataset
dataset = load_dataset(args.data_path_or_name)

and use the split dataset['test'], where each sample has a JSON-formatted ground_truth field with a gt_parse structure.
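The shape of one sample can be sketched without downloading the dataset. The record below is a hypothetical stand-in that only mimics the description above (a JSON string in `ground_truth` containing a `gt_parse` object); real CORD records carry more fields, and the dotted key convention produced by the hypothetical `flatten` helper is an assumption made for illustration, not necessarily how `run_eval.py` names keys.

```python
import json

# Hypothetical sample shaped like the description above.
sample = {
    "ground_truth": json.dumps(
        {
            "gt_parse": {
                "menu": [
                    {"nm": "Latte", "price": "4.50"},
                    {"nm": "Bagel", "price": "3.00"},
                ],
                "total": {"total_price": "7.50"},
            }
        }
    )
}


def flatten(node, prefix=""):
    """Yield (dotted_key, value) pairs from a nested gt_parse structure."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from flatten(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(node, list):
        # List items (e.g. receipt line items) share the same key prefix.
        for item in node:
            yield from flatten(item, prefix)
    else:
        yield prefix, node


# ground_truth is a JSON string, so it has to be decoded before use.
gt_parse = json.loads(sample["ground_truth"])["gt_parse"]
pairs = list(flatten(gt_parse))
print(pairs)
```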

2. Prepare model predictions

You are responsible for running your own KIE model on the CORD test split and writing its predictions to disk.

For reference, we provide a sample output from QwenVL_72B on CORD in KIEval/QwenVL_72B_Sample/QwenVL_72B_output.tar.gz.

You can extract and examine this to understand the expected format.

3. Run the evaluator

From the repository root:

python run_eval.py \
    --model_output_dir /path/to/model_outputs \
    --data_path_or_name naver-clova-ix/cord-v2 \
    --save_dir /path/to/eval_results

This will:

Build an ontology from the dataset (all splits) and derive grouped ontology information.
Convert each document’s ground truth and prediction into a CSV representation suitable for the metric.
Run KIEval and write a summary markdown file:
  `/path/to/eval_results/result.md` – aggregate KIEval scores.
Create intermediate folders:
  `/path/to/eval_results/gt/` – CSVs of gold annotations.
  `/path/to/eval_results/pred/` – CSVs of predictions.

You can open result.md to check KIEval Entity F1 and KIEval Aligned scores (as discussed in the paper).

---

Python API

In addition to the CORD pipeline, you can call the KIEval metric directly from Python if you already have:

gold CSVs
pred CSVs
ontology file -- list of ontology keys to include in KIEval evaluation

Core metric function

The main entry point is KIEval.kieval.kieval:

import json
from glob import glob

from KIEval.kieval import kieval

# Paths to CSVs (one gold and one prediction file per document)
gold_csv_files = sorted(glob("/path/to/gt/*.csv"))
pred_csv_files = sorted(glob("/path/to/pred/*.csv"))

# Ontology file: list of ontology keys to include in the evaluation
with open("/path/to/ontology_file.json") as f:
    ontology_keys = json.load(f)

result, _, _ = kieval(
    gold_csv_files=gold_csv_files,
    pred_csv_files=pred_csv_files,
    shortlist=list(ontology_keys),  # ontology keys to include in scoring
    empty_token="",
    grouping_strategy="hungarian",  # or "max_em_score"
    value_delimiter="||",  # delimiter for multi-valued fields
    entity_type_delimiter="\t",  # delimiter between entity key and value in CSV
    split_merged_value=True,
    include_empty_token_in_score_calculation=False,
    print_score=True,
)

print(result)

Expected CSV format (high level)

The CSV format is shared by gold and prediction files:

Files are logically split into blocks separated by blank lines.
Group blocks:
  First line: integer group index (e.g., 0, 1, ...).
  Second line: ontology keys separated by entity_type_delimiter (default: tab).
  Following lines: values for each entity key, again separated by entity_type_delimiter.
Non-group blocks:
  Lines that do not start with a digit belong to non-group entities.
  Each line is a key and its value, separated by entity_type_delimiter.
Only keys listed in shortlist are scored; others are filtered out.

For CORD, these CSVs are generated automatically by run_eval.py and fill_entity_for_cord(...), so you do not normally need to construct them by hand unless you are integrating a new dataset....
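When integrating a new dataset, it can help to check hand-built files against the layout described above. The following is a minimal reader sketch based only on that high-level description; it is not the parser in `KIEval/utils.py`. The function name `parse_kie_csv`, the return shape, the sample text, and the reading of "following lines" as one row of values per line are all assumptions.

```python
def parse_kie_csv(text, shortlist, delimiter="\t"):
    """Split a KIE CSV text into group rows and non-group (key, value) pairs.

    Assumed layout: blocks separated by blank lines; a block whose first line
    is an integer is a group block (index, header keys, value rows); any other
    block holds one "key<delimiter>value" entity per line.
    """
    groups, non_group = {}, []
    for block in text.strip("\n").split("\n\n"):
        lines = [line for line in block.split("\n") if line.strip()]
        if not lines:
            continue
        if lines[0].strip().isdigit():
            index = int(lines[0].strip())
            keys = lines[1].split(delimiter) if len(lines) > 1 else []
            rows = []
            for line in lines[2:]:
                values = line.split(delimiter)
                # Keep only keys that appear in the shortlist.
                rows.append({k: v for k, v in zip(keys, values) if k in shortlist})
            groups[index] = rows
        else:
            for line in lines:
                key, _, value = line.partition(delimiter)
                if key in shortlist:
                    non_group.append((key, value))
    return groups, non_group


# Hypothetical file content: one non-group block followed by two group blocks.
text = (
    "store_name\tCafe Uno\n"
    "date\t2024-01-05\n"
    "\n"
    "0\n"
    "menu.nm\tmenu.price\n"
    "Latte\t4.50\n"
    "\n"
    "1\n"
    "menu.nm\tmenu.price\n"
    "Bagel\t3.00\n"
)
shortlist = {"store_name", "menu.nm", "menu.price"}
groups, non_group = parse_kie_csv(text, shortlist)
print(groups)
print(non_group)
```

In this sample the `date` line is dropped because `date` is not in the shortlist, which mirrors the filtering rule stated above.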

Excerpt shown — open the source for the full document.

Notability

notability 2.0/10

Low traction, new repo with 3 stars