microsoft/MNW
Stars: 17
Forks: 2
Open issues: 0
Created: 2025-11-20T22:45:49Z
Pushed: 2026-06-22T03:41:00Z
Default branch: main
Fork: no
Archived: no
README: 
📣 Announcements
V 2.0: Update alert! We updated our dataset! - [spring update April 2026]
🆕 NEW! You can now browse and take a peek at our dataset on the navigator: https://microsoft.github.io/MNW/
👋 Welcome to MNW Benchmark Dataset
This repository contains the MNW dataset (*Microsoft-Northwestern-Witness*) and/or associated code for benchmarking AI-detection models across images, video, and audio.
:warning: Notice
This dataset is intended for evaluation purposes and cannot be used for training or for commercial purposes.
Why do we need a new benchmark?
The detection of AI-generated media and deepfakes has been an active area of research for almost a decade. In the past few years, however, a new paradigm has emerged with the diffusion architecture, showing impressive achievements in audio, image, and video generation. Previous approaches to detection are now obsolete, and the detection field must reinvent itself. From only a few methods and models, we have moved to a multitude of them, constantly pushing the state of the art. In such a dynamic environment, detection models need to generalize better and stay up to date to maintain high performance across the board. Historically, the evaluation of deepfake detection models was based on large datasets released during ‘detection challenges’. These datasets (see Meta 2020, gov.uk 2024) typically had a lot of depth but almost no breadth. They were suitable for the previous era (the GAN era) but are not up to the challenge posed by the new generative AI landscape and the evolving types of harm it brings: scams, non-consensual intimate image generation, disinformation, etc.
The evaluation of detection tools must reflect the evolution of the generative AI landscape.
What we need is breadth and regular updates
Evaluation sets must be regularly updated and cover as many generators as possible. We argue that depth is less important than breadth, and we propose the creation of an evaluation set that contains small samples from as many generators and ‘in the wild’ cases as possible, rather than millions of samples from a few generators.
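To illustrate why breadth matters, here is a minimal sketch that compares a pooled (micro-averaged) accuracy with a per-generator (macro-averaged) accuracy. Everything in it is an assumption made for illustration: the record layout `(generator, true_label, predicted_label)`, the function names, and the toy numbers are hypothetical and do not reflect the MNW file format or any code in this repository.

```python
from collections import defaultdict


def per_generator_accuracy(records):
    """Return accuracy per generator.

    `records` is an iterable of (generator, true_label, predicted_label)
    tuples. This layout is a hypothetical one chosen for the example.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for generator, truth, pred in records:
        total[generator] += 1
        correct[generator] += int(truth == pred)
    return {g: correct[g] / total[g] for g in total}


def micro_and_macro(records):
    """Return (pooled accuracy, mean of per-generator accuracies, per-generator dict)."""
    records = list(records)
    per_gen = per_generator_accuracy(records)
    micro = sum(t == p for _, t, p in records) / len(records)
    macro = sum(per_gen.values()) / len(per_gen)
    return micro, macro, per_gen


# Toy data (invented): one heavily sampled generator the detector handles
# well, and two sparsely sampled generators it mostly misses.
records = (
    [("gen_a", 1, 1)] * 90 + [("gen_a", 1, 0)] * 10  # 100 samples, 90 correct
    + [("gen_b", 1, 1)] * 1 + [("gen_b", 1, 0)] * 4  # 5 samples, 1 correct
    + [("gen_c", 1, 0)] * 5                          # 5 samples, 0 correct
)

micro, macro, per_gen = micro_and_macro(records)
print(f"pooled accuracy:        {micro:.3f}")
print(f"per-generator average:  {macro:.3f}")
for name, acc in sorted(per_gen.items()):
    print(f"  {name}: {acc:.2f}")
```

In this toy setup the pooled score is dominated by the one deeply sampled generator, while the per-generator average exposes the failures on the other two. Reporting results per generator is one way an evaluation can reflect breadth rather than depth.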
‘In the lab’ and ‘in the wild’: two different beasts
We must recognize that, too often, the performance of detectors ‘in the wild’ does not match their performance ‘in the lab’. Content compression resulting from sharing on social networks and apps, as well as manipulations and adversarial attacks, makes it harder for models to consistently detect AI-generated media. Far too often, we boast great performance on test sets, only to fail on real-world examples. Such real-world examples must be included in evaluation sets to provide a more realistic picture of the performance of our tools.
🤝 Existing Collaborators and Contributors
Founding collaborators:
- Microsoft AI for Good Lab: https://www.microsoft.com/en-us/research/group/ai-for-good-research-lab/
- Northwestern University McCormick School of Engineering: https://www.mccormick.northwestern.edu/computer-science/
- Witness: https://www.witness.org/
:fountain_pen: Cite us!
This dataset cannot be used for commercial purposes.
We have recently published our summary paper in IEEE Intelligent Systems. Please feel free to cite us!
@ARTICLE{11479406,
author={Roca, Thomas and Postiglione, Marco and Gao, Chongyang and Gortner, Isabel and Wojciak, Zuzanna and Wang, Pengce and Alimardani, Mahsa and Anlen, Shirin and White, Kevin and Ferres, Juan Lavista and Kraus, Sarit and Gregory, Sam and Subrahmanian, V. S. and Murugesan, San},
journal={IEEE Intelligent Systems},
title={{The Microsoft-Northwestern-WITNESS Benchmark for Deepfake Detection}},
year={2026},
volume={41},
number={02},
ISSN={1941-1294},
pages={15-23},
abstract={ We introduce the Microsoft-Northwestern-WITNESS (MNW) deepfake detection benchmark, a dataset designed to evaluate and improve artificial intelligence (AI)-generated content detection algorithms. The dataset contains more than 50,000 artifacts (images, videos, and audio files) generated by us. It also includes real-world examples of AI-manipulated or suspicious media encountered by journalists and human rights defenders globally, annotated by experts to reflect practical, high-stakes detection scenarios. The MNW dataset will be periodically updated to cover emerging generators and includes adversarial examples created with state-of-the-art attacks. This is a collaborative effort, and we encourage generative AI model developers to help maintain the dataset’s currency. This dataset is intended solely for evaluation purposes and cannot be used for training or commercial purposes. We recommend that entities purchasing detection solutions avoid using our dataset to evaluate commercial tools. Our goal is to establish high standards for developers and enhance the reliability of detection systems. },
keywords={Benchmark testing;Deepfakes;Detection algorithms;Performance evaluation;Image analysis;Media;Information integrity;Data integrity;Data models;Videos;Audio systems;Artificial intelligence;Standards;Quality assessment},
doi={10.1109/MIS.2026.3668398},
url = {https://doi.ieeecomputersociety.org/10.1109/MIS.2026.3668398},
publisher={IEEE Computer Society},
address={Los Alamitos, CA, USA},
month=mar}