What does this repo signal mean?

Microsoft published microsoft/MNW. This repository signal exposes tooling, eval, infrastructure, or model-adjacent work before it may appear in a launch post. High-signal details: repo microsoft/MNW · Low traction (17 stars), routine repository. onlylabs links this event to 1 captured evidence page and 6 related repo signals.

Microsoft Repo: microsoft/MNW

Captured source

source ↗

GitHub/github.com/microsoft/MNW

microsoft/MNW repository metadata

published Nov 20, 2025 · seen 4d · captured 4d · http 200 · method plain

microsoft/MNW

Stars: 17

Forks: 2

Open issues: 0

Created: 2025-11-20T22:45:49Z

Pushed: 2026-06-22T03:41:00Z

Default branch: main

Fork: no

Archived: no

README:

📣 Announcements

V 2.0: Update alert! We updated our dataset! - [spring update April 2026]

🆕 NEW! You can now browse and take a peek at our dataset on the navigator: https://microsoft.github.io/MNW/

👋 Welcome to MNW Benchmark Dataset

This repository contains the MNW dataset (*Microsoft-Northwestern-Witness*) and/or associated code for benchmarking AI-detection models across images, video, and audio.

:warning: Notice

This dataset is intended for evaluation purposes and cannot be used for training or for commercial purposes.

Why do we need a new benchmark?

The detection of AI-generated media and deepfakes has been an active area of research for almost a decade. In the past few years, however, a new paradigm has emerged with the diffusion architecture, showing impressive achievements in audio, image, and video generation. Previous approaches to detection are now obsolete, and the detection field must reinvent itself. From only a few methods and models, we have moved to a multitude of them, constantly pushing the state of the art. In such a dynamic environment, detection models need to generalize better and stay up to date to maintain high performance across the board.

Historically, the evaluation of deepfake detection models was based on large datasets released during ‘detection challenges’. These datasets (see Meta 2020, gov.uk 2024) typically had a lot of depth but almost no breadth. They were suitable for the previous era (the GAN era) but are not up to the challenge brought by the new generative AI landscape and the evolving types of harm it brings: scams, non-consensual intimate image generation, disinformation, etc.

The evaluation of detection tools must reflect the evolution of the generative AI landscape.

What we need is breadth and regular updates

Evaluation sets must be regularly updated and cover as many generators as possible. We argue that depth is less important than breadth, and we propose the creation of an evaluation set that contains small samples of as many generators and ‘in the wild’ cases as possible, rather than millions of samples from a few generators.
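To illustrate why breadth matters, here is a minimal sketch (not part of the MNW tooling) that compares a pooled accuracy with a per-generator average. The record layout `(generator, is_ai_label, predicted_is_ai)` and the generator names are hypothetical, chosen only for this example; the actual dataset format may differ.

```python
from collections import defaultdict


def per_generator_accuracy(records):
    """Return {generator: accuracy} for (generator, label, prediction) tuples.

    The tuple layout is an assumption made for this sketch.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for generator, label, prediction in records:
        totals[generator] += 1
        hits[generator] += int(label == prediction)
    return {g: hits[g] / totals[g] for g in totals}


def micro_accuracy(records):
    """Pooled accuracy: every sample counts equally, so large generators dominate."""
    records = list(records)
    correct = sum(int(label == prediction) for _, label, prediction in records)
    return correct / len(records)


def macro_accuracy(records):
    """Per-generator average: every generator counts equally, whatever its size."""
    scores = per_generator_accuracy(records)
    return sum(scores.values()) / len(scores)


# Hypothetical detector output: perfect on a heavily sampled generator,
# wrong on every sample from a lightly sampled one.
example = (
    [("gen_a", True, True)] * 8      # 8 AI samples, all detected
    + [("gen_b", True, False)] * 2   # 2 AI samples, all missed
)

print(per_generator_accuracy(example))
print(micro_accuracy(example))
print(macro_accuracy(example))
```

In this toy case the pooled score looks strong while the per-generator average exposes the complete failure on the second generator, which is the kind of gap a deep but narrow evaluation set can hide.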

‘In the lab’ and ‘in the wild’: two different beasts

We must recognize that, too often, the performance of detectors ‘in the wild’ does not match their performance ‘in the lab’. Content compression as a result of sharing on social networks and apps, as well as manipulations and adversarial attacks, makes it harder for models to consistently detect AI-generated media. Far too often, we boast great performance on testing sets, only to fail on real-world examples. Such real-world examples must co-exist in evaluation sets to provide a more realistic picture of the performance of our tools.

🤝 Existing Collaborators and Contributors

Founding collaborators:

Microsoft AI for Good Lab: https://www.microsoft.com/en-us/research/group/ai-for-good-research-lab/
Northwestern University McCormick School of Engineering: https://www.mccormick.northwestern.edu/computer-science/
Witness: https://www.witness.org/

:fountain_pen: Cite us!

This dataset cannot be used for commercial purposes.

We have recently published our summary paper in IEEE Intelligent Systems. Please feel free to cite us!

@ARTICLE{11479406,
author={Roca, Thomas and Postiglione, Marco and Gao, Chongyang and Gortner, Isabel and Wojciak, Zuzanna and Wang, Pengce and Alimardani, Mahsa and Anlen, Shirin and White, Kevin and Ferres, Juan Lavista and Kraus, Sarit and Gregory, Sam and Subrahmanian, V. S. and Murugesan, San},
journal={IEEE Intelligent Systems},
title={{The Microsoft-Northwestern-WITNESS Benchmark for Deepfake Detection}},
year={2026},
volume={41},
number={02},
ISSN={1941-1294},
pages={15-23},
abstract={ We introduce the Microsoft-Northwestern-WITNESS (MNW) deepfake detection benchmark, a dataset designed to evaluate and improve artificial intelligence (AI)-generated content detection algorithms. The dataset contains more than 50,000 artifacts (images, videos, and audio files) generated by us. It also includes real-world examples of AI-manipulated or suspicious media encountered by journalists and human rights defenders globally, annotated by experts to reflect practical, high-stakes detection scenarios. The MNW dataset will be periodically updated to cover emerging generators and includes adversarial examples created with state-of-the-art attacks. This is a collaborative effort, and we encourage generative AI model developers to help maintain the dataset’s currency. This dataset is intended solely for evaluation purposes and cannot be used for training or commercial purposes. We recommend that entities purchasing detection solutions avoid using our dataset to evaluate commercial tools. Our goal is to establish high standards for developers and enhance the reliability of detection systems. },
keywords={Benchmark testing;Deepfakes;Detection algorithms;Performance evaluation;Image analysis;Media;Information integrity;Data integrity;Data models;Videos;Audio systems;Artificial intelligence;Standards;Quality assessment},
doi={10.1109/MIS.2026.3668398},
url = {https://doi.ieeecomputersociety.org/10.1109/MIS.2026.3668398},
publisher={IEEE Computer Society},
address={Los Alamitos, CA, USA},
month=mar}

Notability

notability 3.0/10

Low traction (17 stars), routine repository.