What does this repo signal mean?

Amazon (Nova) published amazon-science/MDSEval (Jupyter Notebook). Repository signals like this one surface tooling, evaluation, infrastructure, or model-adjacent work before it appears in a launch post. High-signal details: repo amazon-science/MDSEval · language Jupyter Notebook · low-traction new repo from Amazon. onlylabs links this event to 1 captured evidence page and 6 related repo signals. It also maps to the "Evals and quality" category in the data-business radar.

Amazon (Nova) Repo: amazon-science/MDSEval

Captured source

source ↗

GitHub · github.com/amazon-science/MDSEval

amazon-science/MDSEval repository metadata

published Sep 16, 2025 · seen Jun 5 · captured Jun 11 · HTTP 200 · method plain

amazon-science/MDSEval

Language: Jupyter Notebook

License: Apache-2.0

Stars: 7

Forks: 1

Open issues: 1

Created: 2025-09-16T21:14:23Z

Pushed: 2025-10-06T03:41:17Z

Default branch: main

Fork: no

Archived: no

README:

MDSEval: Meta-Evaluation Benchmark for Multimodal Dialogue Summarization

---

Authors: Yinhong Liu, Jianfeng He, Hang Su, Ruixue Lian, Yi Nian, Jake W. Vincent, Srikanth Vishnubhotla, Robinson Piramuthu, Saab Mansour.

Updates: Our work has been accepted by EMNLP 2025 🎉

This is the official repository for the **MDSEval** benchmark. It includes all human annotations, benchmark data, and the implementation of our newly proposed data filtering framework, Mutually Exclusive Key Information (MEKI). MEKI is designed to filter high-quality multimodal data by ensuring that each modality contributes unique information.

⚠️ Note: MDSEval is an evaluation benchmark. The data provided here should not be used for training NLP models.

Introduction to MDSEval

---

Multimodal Dialogue Summarization (MDS) is an important task with wide-ranging applications. To develop effective MDS models, robust automatic evaluation methods are essential to reduce both costs and human effort. However, such methods require a strong meta-evaluation benchmark grounded in human annotations.

MDSEval is the first meta-evaluation benchmark for MDS. It consists of:

Image-sharing dialogues
Multiple corresponding summaries
Human judgments across eight well-defined quality dimensions

To ensure data quality and diversity, we introduce a novel filtering framework, Mutually Exclusive Key Information (MEKI), which leverages complementary information across modalities.

Our contributions include:

The first formalization of key evaluation dimensions specific to MDS
A high-quality benchmark dataset for robust evaluation
A comprehensive assessment of state-of-the-art evaluation methods, showing their limitations in distinguishing between summaries from advanced MLLMs and their vulnerability to various biases
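
To make "meta-evaluation" concrete: an automatic metric is judged by how well its scores agree with human judgments of the same summaries. The sketch below uses Kendall's tau-a as one common agreement measure. The choice of tau-a, the function name `kendall_tau`, and the toy scores are assumptions for illustration; they are not taken from the MDSEval repository or its data.

```python
from itertools import combinations


def kendall_tau(human, metric):
    """Kendall's tau-a between two equal-length score lists.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(human) != len(metric) or len(human) < 2:
        raise ValueError("need two equal-length lists with at least two items")
    concordant = discordant = 0
    for i, j in combinations(range(len(human)), 2):
        product = (human[i] - human[j]) * (metric[i] - metric[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(human) * (len(human) - 1) // 2
    return (concordant - discordant) / n_pairs


# Made-up illustration values, NOT MDSEval annotations.
human_scores = [4.3, 2.0, 3.7, 1.3, 5.0]        # e.g. mean human rating per summary
metric_scores = [0.81, 0.35, 0.52, 0.40, 0.90]  # hypothetical automatic metric
tau = kendall_tau(human_scores, metric_scores)
print(tau)
```

A value near 1 means the metric ranks summaries the way annotators do; a value near 0 means its ranking is essentially unrelated to human judgment.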

Download the Dialogue and Image Data

---

First, download and merge the textual dialogues from their sources (PhotoChat and DialogCC):

bash prepare_dialog_data.sh

Then download the images for MDSEval:

bash prepare_image_data.sh

Note that the original hosting website is not very stable, so you may need to run the script multiple times to ensure all images are successfully downloaded.
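
Because the image host is unreliable, re-running the download by hand can be replaced with a small retry wrapper. This is a minimal sketch that assumes `prepare_image_data.sh` exits with a non-zero status when a download fails, which the README does not confirm; the helper name `run_until_success` and its parameters are hypothetical.

```python
import subprocess
import time


def run_until_success(command, max_attempts=5, delay_seconds=0.0):
    """Re-run `command` until it exits with status 0.

    Returns the number of attempts used. Raises RuntimeError if every
    attempt fails. ASSUMPTION: a non-zero exit status signals failure.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(command)
        if result.returncode == 0:
            return attempt
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # give the remote host time to recover
    raise RuntimeError(f"{command!r} failed after {max_attempts} attempts")


# Example call from the repository root (not executed here):
# run_until_success(["bash", "prepare_image_data.sh"], delay_seconds=10)
```

If the script reports success even when some images are missing, checking the downloaded files against the expected image list would be the more reliable signal.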

MDSEval data

---

You can explore the MDSEval dataset using the provided notebook, `demonstrations.ipynb`, which contains functions to load and visualize the data.

The MDSEval dataset includes the following statistics:

| Statistic                   | Value |
|-----------------------------|------:|
| Total number of dialogues   |   198 |
| Summaries per dialogue      |     5 |
| Avg. turns per dialogue     |  17.1 |
| Avg. tokens per dialogue    | 209.0 |
| Evaluation aspects          |     8 |
| Avg. annotators per summary |   2.9 |
| Avg. sentences per summary  |   4.5 |

Human annotations across eight evaluation dimensions:

| Evaluation Dimensions                  | Scale | Note |
|----------------------------------------|-------|------|
| Multimodal Coherence (COH)             | 1-5   |      |
| Conciseness (CON)                      | 1-5   |      |
| Multimodal Coverage (COV)              | 1-5   |      |
| Multimodal Information Balancing (BAL) | 1-7   | Bipolar |
| Topic Progression (PROG)               | 1-5   |      |
| Multimodal Faithfulness (FAI)          | Faithful, Not faithful to image, Not faithful to text, Not faithful to both | Both sentence level and summary level |

MEKI

---

To ensure the dataset is sufficiently challenging for multimodal summarization, dialogues should contain key information uniquely conveyed by a single modality, meaning it cannot be inferred from the other. To quantify this, we introduce Mutually Exclusive Key Information (MEKI) as a selection metric.

We embed both the image and the textual dialogue into a shared semantic space (e.g., using the CLIP model), yielding vectors $I \in \mathbb{R}^N$ and $T \in \mathbb{R}^N$, where $N$ is the embedding dimension. Since CLIP embeddings are unit-normalized, we maintain this normalization for consistency.

To measure Exclusive Information (EI) in $I$ that is not present in $T$, we compute the orthogonal component of $I$ relative to $T$:

$$EI(I|T) = I - \frac{\langle I, T \rangle}{\langle T, T \rangle} T$$

where $\langle \cdot , \cdot \rangle$ denotes the dot product.
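
The orthogonal component described above can be sketched in plain Python. The toy 2-D vectors stand in for CLIP embeddings, and the helper names (`dot`, `exclusive_information`) are illustrative rather than taken from `meki.py`.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def exclusive_information(x, y):
    """Component of x orthogonal to y: x minus its projection onto y."""
    scale = dot(x, y) / dot(y, y)
    return [a - scale * b for a, b in zip(x, y)]


# Toy unit vectors standing in for image and text embeddings.
image_vec = [1.0, 0.0]
text_vec = [0.6, 0.8]

ei_image = exclusive_information(image_vec, text_vec)
ei_norm = math.sqrt(dot(ei_image, ei_image))
print(ei_image, ei_norm)
```

By construction the result has no component along the text embedding, so whatever remains is information the text does not carry.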

Next, to identify Exclusive Key Information (EKI), the crucial content uniquely conveyed by one modality, we first generate a pseudo-summary $S$ that extracts essential dialogue and image details. This serves as a reference proxy rather than a precise summary, helping to distinguish key information. We embed and normalize $S$ in the CLIP space and compute $EKI(I|T; S)$, which quantifies the extent of exclusive image-based key information. Similarly, we compute $EKI(T|I; S)$ for textual exclusivity.

Finally, the MEKI score aggregates both components using a weight $\lambda = 0.3$, chosen to balance the typically higher magnitude of the text-based exclusivity term so that the average magnitudes of both terms are approximately equal.
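
The exact EKI and MEKI formulas are not recoverable from this capture, so the following is only one plausible reading, not the authors' implementation. It assumes (1) EKI is the length of the projection of the exclusive component onto the pseudo-summary embedding, and (2) MEKI is a weighted sum in which $\lambda$ scales the text term, since the text says that term is typically larger. Both assumptions, and all function names, are hypothetical; `meki.py` in the repository is the authoritative version.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def unit(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def exclusive_information(x, y):
    """Component of x orthogonal to y."""
    scale = dot(x, y) / dot(y, y)
    return [a - scale * b for a, b in zip(x, y)]


def eki(x, y, s):
    """ASSUMPTION: length of the projection of EI(x|y) onto s."""
    return abs(dot(exclusive_information(x, y), s)) / math.sqrt(dot(s, s))


def meki(image, text, summary, lam=0.3):
    """ASSUMPTION: weighted sum, with lam applied to the text term."""
    eki_image = eki(image, text, summary)
    eki_text = eki(text, image, summary)
    return (1 - lam) * eki_image + lam * eki_text


# Toy 3-D unit vectors standing in for CLIP embeddings.
image_vec = [1.0, 0.0, 0.0]
text_vec = [0.0, 1.0, 0.0]
summary_vec = unit([1.0, 1.0, 0.0])  # pseudo-summary draws on both modalities

score = meki(image_vec, text_vec, summary_vec)
print(score)
```

Under these assumptions the score is high when each modality contributes something the other lacks and the pseudo-summary relies on it, and it drops to zero when the image and text embeddings coincide.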

The MEKI implementation is provided in `meki.py`. Please follow the instructions in the file to use it.

License

---

MDSEval is constructed using images and dialogues from the following sources:

DialogCC – released under the MIT License.
PhotoChat – released under the Apache 2.0 License.

Accordingly, we release MDSEval under the Apache 2.0 License.

Citation

---

If you find the benchmark useful, please consider citing our work.

@misc{liu2025mdsevalmetaevaluationbenchmarkmultimodal,
title={MDSEval: A Meta-Evaluation Benchmark for Multimodal Dialogue Summarization},
author={Yinhong Liu and Jianfeng He and Hang Su and Ruixue Lian and Yi Nian and Jake Vincent and Srikanth Vishnubhotla and Robinson Piramuthu and Saab Mansour},
year={2025},
eprint={2510.01659},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2510.01659},
}

Excerpt shown — open the source for the full document.

Notability

notability 3.0/10

Low-traction new repo from Amazon