ByteDance-Seed/m3-agent

Python

ByteDance-Seed/m3-agent

Language: Python

License: Apache-2.0

Stars: 1378

Forks: 113

Open issues: 17

Created: 2025-07-30T13:12:32Z

Pushed: 2026-02-12T06:03:56Z

Default branch: master

Fork: no

Archived: no

README:

Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory (ICLR 2026)

Abstract

We introduce M3-Agent, a novel multimodal agent framework equipped with long-term memory. Like humans, M3-Agent can process real-time visual and auditory inputs to build and update its long-term memory. Beyond episodic memory, it also develops semantic memory, enabling it to accumulate world knowledge over time. Its memory is organized in an entity-centric, multimodal format, allowing deeper and more consistent understanding of the environment. Given an instruction, M3-Agent autonomously performs multi-turn, iterative reasoning and retrieves relevant information from memory to accomplish the task. To evaluate memory effectiveness and memory-based reasoning in multimodal agents, we develop M3-Bench, a new long-video question answering benchmark. M3-Bench comprises 100 newly recorded real-world videos captured from a robot’s perspective (M3-Bench-robot) and 920 web-sourced videos across diverse scenarios (M3-Bench-web). We annotate question-answer pairs designed to test key capabilities essential for agent applications, such as human understanding, general knowledge extraction, and cross-modal reasoning. Experimental results show that M3-Agent, trained via reinforcement learning, outperforms the strongest baseline, a prompting agent using Gemini-1.5-pro and GPT-4o, achieving 8.2%, 7.7%, and 5.3% higher accuracy on M3-Bench-robot, M3-Bench-web, and VideoMME-long, respectively. Our work advances multimodal agents toward more human-like long-term memory and provides insights into their practical design.

![illustration](figs/illustration.png)

A demo of M3-Agent as a personal assistant!

[![Watch the video](figs/demo.png)](https://www.youtube.com/watch?v=XUx31cBanfo)

The video can also be accessed on Bilibili

M3-Bench

We introduce M3-Bench, a long-video question answering dataset designed to evaluate the capability of multimodal agents to perform reasoning over long-term memory. Each instance in M3-Bench comprises a long video simulating the perceptual input of an agent, along with a series of open-ended question-answer pairs. The dataset is organized into two subsets:

1. M3-Bench-robot, which contains 100 real-world videos recorded from a robot's first-person perspective.
2. M3-Bench-web, which includes 920 web-sourced videos covering a wider variety of content and scenarios.

![architecture](figs/m3-bench-example.png)

link1, link2, link3

Examples from M3-Bench. M3-Bench-robot features long videos from realistic robotic work scenarios, while M3-Bench-web expands the video diversity to support broader evaluation. The question-answering tasks are designed to assess a multimodal agent’s ability to construct consistent and reliable long-term memory, as well as to reason effectively over that memory.

![architecture](figs/m3-bench-statistic.png)

Statistical overview of the M3-Bench benchmark. Each question may correspond to multiple question types.

Videos

1. Download M3-Bench-robot from huggingface
2. Download M3-Bench-web from video_url in data/annotations/web.json
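To fetch the M3-Bench-web videos you first need the list of URLs stored in the annotation file. The sketch below collects them without assuming how web.json is nested: it walks the parsed JSON and picks up every string stored under a `video_url` key. Only the key name and the file path come from this README; the function names are hypothetical and the exact layout of web.json is not verified here.

```python
import json
from typing import Any, Iterator


def iter_video_urls(node: Any) -> Iterator[str]:
    """Yield every string stored under a "video_url" key, at any nesting depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "video_url" and isinstance(value, str):
                yield value
            else:
                yield from iter_video_urls(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_video_urls(item)


def load_video_urls(annotation_path: str = "data/annotations/web.json") -> list[str]:
    """Read the annotation file and return its unique video URLs in first-seen order."""
    with open(annotation_path, encoding="utf-8") as f:
        annotations = json.load(f)
    # dict.fromkeys drops duplicates while keeping insertion order
    return list(dict.fromkeys(iter_video_urls(annotations)))
```

The resulting list can be handed to whichever downloader you prefer.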

Intermediate Outputs

[optional] You can either download the intermediate outputs we have processed from huggingface or generate them directly from the video by the following steps.

Memory Graphs

[optional] You can either download and extract the memory graphs we have processed from huggingface or generate them directly from the video by the following steps.

M3-Agent

![architecture](figs/m3-agent.png)

Architecture of M3-Agent. The system consists of two parallel processes: memorization and control. During memorization, M3-Agent processes video and audio streams online to generate episodic and semantic memory. During control, it executes instructions by iteratively thinking and retrieving from long-term memory. The long-term memory is structured as a multimodal graph.
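As a rough illustration of the two processes, here is a toy, text-only sketch: memorization adds episodic and semantic entries indexed by entity, and control alternates between a policy decision and a memory search until the policy answers. This is not the repository's implementation. Every name (`MemoryNode`, `MemoryGraph`, `run_control`, the action format) is hypothetical, word overlap stands in for real retrieval, and the multimodal parts of the graph (faces, voices) are left out.

```python
from dataclasses import dataclass, field
from typing import Callable

# ("search", query) or ("answer", text); a hypothetical action format
Action = tuple[str, str]


@dataclass
class MemoryNode:
    node_id: int
    kind: str  # "episodic" (what happened) or "semantic" (general knowledge)
    text: str
    clip_id: int  # index of the clip that produced this memory
    entities: set[str] = field(default_factory=set)


class MemoryGraph:
    """Toy entity-centric memory: nodes are indexed by the entities they mention."""

    def __init__(self) -> None:
        self.nodes: dict[int, MemoryNode] = {}
        self.by_entity: dict[str, set[int]] = {}

    def add(self, kind: str, text: str, clip_id: int, entities: set[str]) -> int:
        node_id = len(self.nodes)
        self.nodes[node_id] = MemoryNode(node_id, kind, text, clip_id, set(entities))
        for entity in entities:
            self.by_entity.setdefault(entity, set()).add(node_id)
        return node_id

    def search(self, query: str, top_k: int = 3) -> list[MemoryNode]:
        # Stand-in for real retrieval: rank nodes by word overlap with the query
        query_words = set(query.lower().split())
        scored = []
        for node in self.nodes.values():
            words = set(node.text.lower().split()) | {e.lower() for e in node.entities}
            overlap = len(query_words & words)
            if overlap:
                scored.append((-overlap, node.node_id))
        return [self.nodes[node_id] for _, node_id in sorted(scored)[:top_k]]


def run_control(
    graph: MemoryGraph,
    question: str,
    policy: Callable[[str, list[str]], Action],
    max_turns: int = 5,
) -> str:
    """Iterate: ask the policy what to do, retrieve if it searches, stop if it answers."""
    context: list[str] = []
    for _ in range(max_turns):
        action, payload = policy(question, context)
        if action == "answer":
            return payload
        for node in graph.search(payload):
            if node.text not in context:
                context.append(node.text)
    return "unknown"
```

In this sketch `policy` is any callable; in the system described above that role is played by the trained model, which decides at each turn whether to query memory again or to answer.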

Experimental Results

![architecture](figs/exp_result.png)

Results on M3-Bench-robot, M3-Bench-web, and VideoMME-long.

Run Locally

> Before running, add your API configuration to configs/api_config.json.

Memorization

Generate memory graphs for each video. The results are saved in data/memory_graphs.

  • The following steps are required only if you haven't downloaded *intermediate_outputs* and *memory_graphs* from huggingface or want to process other videos not from M3-Bench.

1. Set up environment

bash setup.sh
pip install git+https://github.com/huggingface/transformers@f742a644ca32e65758c3adb36225aef1731bd2a8
pip install qwen-omni-utils==0.0.4

2. Cut Video

Cut the video into 30-second segments.

#!/bin/bash

video="robot/bedroom_01"
input="data/videos/$video.mp4"
mkdir -p "data/clips/$video"
# Read the video duration in seconds and truncate it to an integer
duration=$(ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 "$input")
duration_seconds=$(echo "$duration" | awk '{print int($1)}')

# Number of 30-second clips, including the final partial clip
segments=$((duration_seconds / 30 + 1))
for ((i=0; i<segments; i++)); do
    start=$((i * 30))
    output="data/clips/$video/$i.mp4"
    # Copy streams without re-encoding; -t limits each clip to 30 seconds
    ffmpeg -ss "$start" -i "$input" -t 30 -c copy "$output"
done
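If you would rather drive the cutting step from Python, the sketch below builds the same ffmpeg command lines as the shell loop above, one per clip. It only constructs argument lists and runs nothing. The function name `clip_commands` is hypothetical; the paths, flags, and segment-count formula mirror the script.

```python
def clip_commands(video: str, duration: float, clip_len: int = 30) -> list[list[str]]:
    """Build one ffmpeg command per clip, mirroring the shell loop."""
    source = f"data/videos/{video}.mp4"
    # Same arithmetic as the script: truncate the duration, then add one clip
    segments = int(duration) // clip_len + 1
    commands = []
    for i in range(segments):
        commands.append([
            "ffmpeg",
            "-ss", str(i * clip_len),  # start offset of this clip in seconds
            "-i", source,
            "-t", str(clip_len),       # maximum clip length
            "-c", "copy",              # copy streams without re-encoding
            f"data/clips/{video}/{i}.mp4",
        ])
    return commands
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the output directory exists. Note that when the duration is an exact multiple of 30 seconds, this formula yields one extra trailing clip that starts at the very end of the video and contains no footage.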

3. Prepare data

Prepare a JSONL file with one video per line and save it as data/data.jsonl.

{"id": "bedroom_01", "video_path": "data/videos/robot/bedroom_01.mp4", "clip_path": "data/videos/clips/bedroom_01", "mem_path": "data/videos/memory_graphs/bedroom_01.pkl", "intermediate_path": "data/videos/intermediate_outputs/robot/bedroom_01"}
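For more than a handful of videos it is easier to generate these lines than to write them by hand. The sketch below generalizes the path layout from the single example record above, which is an assumption: the helper names `make_record` and `to_jsonl` and the `subset` parameter are hypothetical. The example's clip_path also differs from the directory the cutting script in step 2 writes to, so adjust the templates to wherever your clips actually live.

```python
import json
from typing import Iterable


def make_record(video_id: str, subset: str = "robot") -> dict[str, str]:
    """Build one data.jsonl record using the path layout of the example above."""
    return {
        "id": video_id,
        "video_path": f"data/videos/{subset}/{video_id}.mp4",
        "clip_path": f"data/videos/clips/{video_id}",
        "mem_path": f"data/videos/memory_graphs/{video_id}.pkl",
        "intermediate_path": f"data/videos/intermediate_outputs/{subset}/{video_id}",
    }


def to_jsonl(video_ids: Iterable[str], subset: str = "robot") -> str:
    """Serialize one JSON object per line, as the JSONL format requires."""
    return "".join(json.dumps(make_record(v, subset)) + "\n" for v in video_ids)
```

Writing the file is then a matter of `open("data/data.jsonl", "w", encoding="utf-8").write(to_jsonl(["bedroom_01"]))`.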

4. Generate Intermediate Outputs

This step uses Face Detection and Speaker Diarization tools to generate intermediate outputs.

  • If you want to use M3-Bench and have downloaded intermediate_outputs from huggingface, you can skip this step.
  • Download audio embedding model and save into models\ from…
