amazon-science/MDSEval
Jupyter Notebook
Language: Jupyter Notebook
License: Apache-2.0
Stars: 7
Forks: 1
Open issues: 1
Created: 2025-09-16T21:14:23Z
Pushed: 2025-10-06T03:41:17Z
Default branch: main
Fork: no
Archived: no
README:
MDSEval: Meta-Evaluation Benchmark for Multimodal Dialogue Summarization
---
Authors: Yinhong Liu, Jianfeng He, Hang Su, Ruixue Lian, Yi Nian, Jake W. Vincent, Srikanth Vishnubhotla, Robinson Piramuthu, Saab Mansour.
Updates: Our work has been accepted by EMNLP 2025 🎉
This is the official repository for the **MDSEval** benchmark. It includes all human annotations, benchmark data, and the implementation of our newly proposed data filtering framework, Mutually Exclusive Key Information (MEKI). MEKI is designed to filter high-quality multimodal data by ensuring that each modality contributes unique information.
⚠️ Note: MDSEval is an evaluation benchmark. The data provided here should not be used for training NLP models.
Introduction to MDSEval
---
Multimodal Dialogue Summarization (MDS) is an important task with wide-ranging applications. To develop effective MDS models, robust automatic evaluation methods are essential to reduce both costs and human effort. However, such methods require a strong meta-evaluation benchmark grounded in human annotations.
MDSEval is the first meta-evaluation benchmark for MDS. It consists of:
- Image-sharing dialogues
- Multiple corresponding summaries
- Human judgments across eight well-defined quality dimensions
To ensure data quality and diversity, we introduce a novel filtering framework, Mutually Exclusive Key Information (MEKI), which leverages complementary information across modalities.
Our contributions include:
- The first formalization of key evaluation dimensions specific to MDS
- A high-quality benchmark dataset for robust evaluation
- A comprehensive assessment of state-of-the-art evaluation methods, showing their limitations in distinguishing between summaries from advanced MLLMs and their vulnerability to various biases
Download the Dialogue and Image Data
---
First, download and merge the textual dialogues from their sources (PhotoChat and DialogCC):
bash prepare_dialog_data.sh
Then download the images for MDSEval:
bash prepare_image_data.sh
Note that the original hosting website is not very stable, so you may need to run the script multiple times to ensure all images are successfully downloaded.
MDSEval data
---
You can explore the MDSEval dataset using the provided notebook, `demonstrations.ipynb`, which contains functions to load and visualize the data.
The MDSEval dataset includes the following statistics:
| Statistic                   | Value |
|-----------------------------|------:|
| Total number of dialogues   |   198 |
| Summaries per dialogue      |     5 |
| Avg. turns per dialogue     |  17.1 |
| Avg. tokens per dialogue    | 209.0 |
| Evaluation aspects          |     8 |
| Avg. annotators per summary |   2.9 |
| Avg. sentences per summary  |   4.5 |
Human annotations across eight evaluation dimensions:
| Evaluation Dimensions | Scale | Note |
|-----------------------|-------|------|
| Multimodal Coherence (COH) | 1-5 | |
| Conciseness (CON) | 1-5 | |
| Multimodal Coverage (COV) | 1-5 | |
| Multimodal Information Balancing (BAL) | 1-7 | Bipolar |
| Topic Progression (PROG) | 1-5 | |
| Multimodal Faithfulness (FAI) | Faithful, Not faithful to image, Not faithful to text, Not faithful to both | Both sentence level and summary level |
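A common way to use human judgments like these in meta-evaluation is to check how well an automatic metric's scores rank-correlate with the human ratings. The snippet below is a minimal, dependency-free sketch of that idea and is not the evaluation code from this repository. The ratings and metric scores are made-up toy values, and the function names are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (invented): mean human rating per summary on a 1-5 scale,
# and the score an automatic metric assigned to the same summaries.
human_ratings = [4.0, 2.5, 5.0, 3.0, 1.5]
metric_scores = [0.71, 0.40, 0.90, 0.38, 0.45]

rho = spearman(human_ratings, metric_scores)
print(f"Spearman correlation with human ratings: {rho:.2f}")
```

A higher correlation suggests the metric orders summaries more like the annotators do. The categorical faithfulness labels would need a different agreement measure.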
MEKI
---
To ensure the dataset is sufficiently challenging for multimodal summarization, dialogues should contain key information uniquely conveyed by a single modality, meaning it cannot be inferred from the other. To quantify this, we introduce Mutually Exclusive Key Information (MEKI) as a selection metric.
We embed both the image and the textual dialogue into a shared semantic space (e.g., using the CLIP model), yielding vectors $I\in \mathbb{R}^N$ and $T \in \mathbb{R}^N$, where $N$ is the embedding dimension. Since CLIP embeddings are unit-normalized, we maintain this normalization for consistency.
To measure Exclusive Information (EI) in $I$ that is not present in $T$, we compute the orthogonal component of $I$ relative to $T$:

$$
EI(I|T) = I - \frac{\langle I, T \rangle}{\langle T, T \rangle} T
$$

where $\langle \cdot , \cdot \rangle$ denotes the dot product.
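The orthogonal component is what remains of $I$ after subtracting its projection onto $T$. Below is a minimal sketch of that step in plain Python. It is not taken from `meki.py`; the helper names are hypothetical, and the 3-dimensional vectors are toy stand-ins for real CLIP embeddings.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def exclusive_information(source, other):
    """Component of `source` orthogonal to `other` (source minus its projection)."""
    scale = dot(source, other) / dot(other, other)
    return [s - scale * o for s, o in zip(source, other)]


# Toy stand-ins for unit-normalized image and text embeddings.
image = normalize([1.0, 1.0, 0.0])
text = normalize([1.0, 0.0, 0.0])

ei_image = exclusive_information(image, text)
print(ei_image)             # the part of `image` that `text` cannot account for
print(dot(ei_image, text))  # numerically zero: the result is orthogonal to `text`
```

Dividing by `dot(other, other)` keeps the function correct even if an input is not exactly unit length; for unit-normalized embeddings that factor is 1.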
Next, to identify Exclusive Key Information (EKI), that is, crucial content uniquely conveyed by one modality, we first generate a pseudo-summary $S$, which extracts essential dialogue and image details. This serves as a reference proxy rather than a precise summary, helping distinguish key information. We embed and normalize $S$ in the CLIP space and compute:

$$
EKI(I|T; S) = \left\| \frac{\langle EI(I|T), S \rangle}{\langle S, S \rangle} S \right\|
$$

which quantifies the extent of exclusive image-based key information. Similarly, we compute $EKI(T|I; S)$ for textual exclusivity.
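As a worked simplification, assume that EKI is the norm of the projection of the exclusive image information onto $S$ (an assumption about the exact form; `meki.py` is the authority) and that $I$, $T$, and $S$ are all unit-normalized. The score then reduces to a combination of three dot products:

```latex
\begin{aligned}
EKI(I|T; S)
  &= \left\| \langle EI(I|T), S \rangle \, S \right\|
   = \left| \langle EI(I|T), S \rangle \right| \\
  &= \left| \langle I - \langle I, T \rangle T,\; S \rangle \right| \\
  &= \left| \langle I, S \rangle - \langle I, T \rangle \langle T, S \rangle \right|
\end{aligned}
```

Read this way, the score is large when the image is close to the pseudo-summary ($\langle I, S \rangle$ is large) in a way that the text does not already explain ($\langle I, T \rangle \langle T, S \rangle$ is small).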
Finally, the MEKI score aggregates both components, $EKI(I|T; S)$ and $EKI(T|I; S)$, as a weighted combination with weight $\lambda=0.3$. This value is chosen to balance the typically higher magnitude of the exclusivity term in text-based information, ensuring that the average magnitudes of both terms are approximately equal.
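Put together, the scoring could look like the sketch below. It is an illustration under stated assumptions, not the code in `meki.py`: it assumes EKI is the norm of the projection of the exclusive information onto $S$, and that $\lambda$ weights the image term while $1-\lambda$ weights the text term. All function names are hypothetical, and the vectors are toy stand-ins for CLIP embeddings.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def exclusive_information(source, other):
    """Component of `source` orthogonal to `other`."""
    scale = dot(source, other) / dot(other, other)
    return [s - scale * o for s, o in zip(source, other)]


def eki(source, other, summary):
    """ASSUMPTION: norm of the projection of EI(source|other) onto `summary`."""
    ei = exclusive_information(source, other)
    return abs(dot(ei, summary)) / math.sqrt(dot(summary, summary))


def meki(image, text, summary, lam=0.3):
    """ASSUMPTION: lam weights the image term and (1 - lam) the text term."""
    image_term = eki(image, text, summary)
    text_term = eki(text, image, summary)
    return lam * image_term + (1.0 - lam) * text_term


# Toy case 1: image and text carry fully distinct information,
# and the pseudo-summary draws equally on both.
image = [1.0, 0.0, 0.0]
text = [0.0, 1.0, 0.0]
summary = normalize([1.0, 1.0, 0.0])
score_distinct = meki(image, text, summary)

# Toy case 2: the two modalities are identical, so neither adds anything exclusive.
score_redundant = meki(image, image, summary)

print(score_distinct, score_redundant)
```

Under these assumptions, a dialogue whose image and text overlap completely scores zero, while one where each modality contributes its own part of the key content scores higher, which is the behavior the filter is meant to select for.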
The MEKI implementation is provided in `meki.py`. Please follow the instructions in the file to use it.
License
---
MDSEval is constructed using images and dialogues from the following sources:
Accordingly, we release MDSEval under the Apache 2.0 License.
Citation
---
If you find the benchmark useful, please consider citing our work.
@misc{liu2025mdsevalmetaevaluationbenchmarkmultimodal,
title={MDSEval: A Meta-Evaluation Benchmark for Multimodal Dialogue Summarization},
author={Yinhong Liu and Jianfeng He and Hang Su and Ruixue Lian and Yi Nian and Jake Vincent and Srikanth Vishnubhotla and Robinson Piramuthu and Saab Mansour},
year={2025},
eprint={2510.01659},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2510.01659},
}