Introducing SAM Audio: The First Unified Multimodal Model for Audio Separation
December 16, 2025 • 11 minute read
Takeaways
We’re introducing SAM Audio, a state-of-the-art unified model that uses intuitive and multimodal prompts for audio separation.
Building on the Perception Encoder model we shared earlier this year, we’re sharing Perception Encoder Audiovisual (PE-AV), the technical engine that helps SAM Audio achieve state-of-the-art performance across a variety of audio separation tasks. SAM Audio and PE-AV are available starting today.
We’re also sharing SAM Audio-Bench, the first in-the-wild audio separation benchmark, and SAM Audio Judge, the first automatic judge model for audio separation.
We invite everyone to try SAM Audio by visiting the Segment Anything Playground, where they can explore the capabilities of our new model, along with our most recent releases, SAM 3 and SAM 3D.
Just as the Meta Segment Anything Model (SAM) revolutionized computer vision by enabling people to segment any object in images and videos, today we’re excited to share a first-of-its-kind model for segmenting sound. We’re introducing SAM Audio, a state-of-the-art unified model that transforms audio processing by making it easy to isolate any sound from complex audio mixtures using natural, multimodal prompts: text, visual cues, or marked time segments. This intuitive approach mirrors how people naturally engage with sound, making audio separation more accessible and useful than ever before.
At the heart of SAM Audio is Perception Encoder Audiovisual (PE-AV), a technical engine that helps drive state-of-the-art performance. Built on the open source Perception Encoder model we shared earlier this year, PE-AV enables the building of more advanced computer vision systems that can assist people in everyday tasks, including sound detection. Think of PE-AV as “the ears” that help SAM Audio function as “the brain” to complete audio segmentation tasks.
Together, these models enable many exciting use cases. Imagine a video recording of a band performance where one click on the guitar is all it takes to isolate its audio. SAM Audio can also separate audio with text prompts, such as filtering out loud traffic noise from a video filmed outside. Additionally, our industry-first span prompts help people fix an audio issue across a whole recording at once, such as filtering out a barking dog throughout an entire podcast recording.
At Meta, we’re using these advancements to help build the next generation of creative media tools. We see many potential use cases, including audio clean-up, background noise removal, and other tools to help people enhance their creativity.
Today, we’re sharing SAM Audio and PE-AV with the community, along with two research papers offering technical depth on each model. We’re also sharing SAM Audio-Bench, the first in-the-wild audio separation benchmark, and SAM Audio Judge, the first automatic judge model for audio separation.
We’re bringing all of this work together in the Segment Anything Playground, our new platform that lets anyone try our latest models. Starting today, people can select from our collection of audio and video assets or upload their own to explore the capabilities of SAM Audio. As always, we look forward to continuing the conversations we’ve been having about SAM and, for the first time ever, hearing what people create with these groundbreaking new models.
A Unified, Multimodal Prompting Model For Segmenting Audio
Until now, audio segmentation and editing have been a fragmented space, with a variety of tools designed for single-purpose use cases. As a unified model, SAM Audio is the first to support multiple interaction modalities that match how people naturally think about audio. It achieves state-of-the-art performance on instrument, speech, and general sound separation for both text-prompted and visual-prompted tasks, and it performs reliably across diverse, real-world scenarios using text, visual, and temporal cues. This approach gives people precise and intuitive control over how audio is separated. We present three methods for segmenting audio that can be used alone or in any combination to achieve a desired outcome.
Text prompting: Type "dog barking" or "singing voice" to extract specific sounds.
Visual prompting: Click on speaking persons or sounding objects in video to isolate their audio.
Span prompting: An industry first, this method lets people mark time segments where target audio occurs.
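To make the three prompt types concrete, here is a minimal sketch of how a caller might bundle them and turn span prompts into a per-frame mask. It is an illustration only: `SeparationPrompt`, `spans_to_frame_mask`, the field names, and the frame rate are all hypothetical and are not taken from SAM Audio's actual interface, which this post does not describe.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class SeparationPrompt:
    """Hypothetical container for the three prompt types (not SAM Audio's API)."""
    text: Optional[str] = None                    # e.g. "dog barking"
    click_xy: Optional[Tuple[int, int]] = None    # pixel clicked in a video frame
    spans: List[Tuple[float, float]] = field(default_factory=list)  # (start_s, end_s)

    def modalities(self) -> List[str]:
        """List which prompt types are present, in a fixed order."""
        present = []
        if self.text:
            present.append("text")
        if self.click_xy is not None:
            present.append("visual")
        if self.spans:
            present.append("span")
        return present


def spans_to_frame_mask(spans, duration_s, frames_per_second):
    """Turn (start, end) spans in seconds into a per-frame 0/1 mask."""
    n_frames = int(round(duration_s * frames_per_second))
    mask = [0] * n_frames
    for start_s, end_s in spans:
        if end_s <= start_s:
            raise ValueError(f"empty or reversed span: {(start_s, end_s)}")
        first = max(0, int(start_s * frames_per_second))
        last = min(n_frames, int(round(end_s * frames_per_second)))
        for i in range(first, last):
            mask[i] = 1
    return mask


# Combine a text prompt with two marked time segments.
prompt = SeparationPrompt(text="dog barking", spans=[(0.5, 1.0), (1.5, 2.0)])
mask = spans_to_frame_mask(prompt.spans, duration_s=2.0, frames_per_second=4)
print(prompt.modalities(), mask)
```

Keeping every prompt optional in one structure reflects the statement that the methods can be used alone or in any combination. A frame mask is one plausible way to hand temporal cues to a model, assumed here for illustration.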
Model Architecture
At its core, SAM Audio leverages a generative modeling framework built on a flow-matching diffusion transformer. This architecture takes an audio mixture and one or more prompts, encodes them into a shared representation, and generates the target and residual audio tracks.
In tandem with the generative modeling framework, we developed a comprehensive data engine for SAM Audio that addresses the challenge of obtaining large-scale, high-quality separation data. This engine combines advanced audio mixing, automated multimodal prompt generation, and a robust pseudo-labeling pipeline to produce realistic training data for real-world scenarios.
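The post does not detail how the data engine mixes audio. As a rough illustration of one common way to build a synthetic separation example, the sketch below scales an interfering signal to a chosen signal-to-noise ratio and adds it to a target. The function names, the SNR-based recipe, and the toy signals are assumptions for illustration, not the pipeline used for SAM Audio.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a list of samples."""
    return sum(s * s for s in signal) / len(signal)


def mix_at_snr(target, interference, snr_db):
    """Scale `interference` so the target-to-interference power ratio is `snr_db`, then add."""
    if len(target) != len(interference):
        raise ValueError("signals must have the same length")
    p_target, p_interf = power(target), power(interference)
    if p_target == 0 or p_interf == 0:
        raise ValueError("both signals need non-zero power")
    gain = math.sqrt(p_target / (p_interf * 10 ** (snr_db / 10)))
    scaled = [gain * s for s in interference]
    mixture = [t + s for t, s in zip(target, scaled)]
    return mixture, scaled


rng = random.Random(0)
sr = 8000                                                            # toy sample rate
target = [math.sin(2 * math.pi * 440 * n / sr) for n in range(sr)]   # 1 s tone
noise = [rng.uniform(-1.0, 1.0) for _ in range(sr)]                  # 1 s noise
mixture, residual = mix_at_snr(target, noise, snr_db=5.0)
achieved = 10 * math.log10(power(target) / power(residual))
print(f"achieved SNR: {achieved:.2f} dB")
```

In this toy setup the mixture is exactly the sum of the target and the scaled interference, so each synthetic example comes with ground-truth target and residual tracks for supervision.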
The model is trained on this diverse dataset, which includes real and synthetic mixtures spanning speech, music, and general sound events. Advanced audio data synthesis strategies further enhance the model’s robustness, ensuring reliable performance in a wide range of environments.
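For readers unfamiliar with flow matching, the toy sketch below shows the sampling side of the idea: start from noise and integrate a velocity field from t = 0 to t = 1 to arrive at a sample. It assumes a straight-line probability path, and it uses an oracle velocity that is allowed to see the target, in place of a trained, prompt-conditioned transformer. All names and settings are hypothetical and say nothing about SAM Audio's actual sampler.

```python
import random


def oracle_velocity(x, t, target):
    """Velocity of the straight-line path that reaches `target` at t = 1.

    Stand-in for a trained network; a real model would predict this from the
    mixture and prompt embeddings instead of seeing the target.
    """
    return [(y - xi) / (1.0 - t) for xi, y in zip(x, target)]


def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt                      # t stays below 1, so 1 - t is never zero
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
target = [0.0, 0.5, -0.25, 1.0]                  # toy "clean stem" samples
noise = [rng.gauss(0.0, 1.0) for _ in target]    # starting point at t = 0
generated = euler_sample(lambda x, t: oracle_velocity(x, t, target), noise, num_steps=16)
error = max(abs(g - y) for g, y in zip(generated, target))
print(f"max abs error after sampling: {error:.2e}")
```

With the oracle velocity the integration lands on the target up to floating-point rounding. With a learned velocity field, the quality of the result would depend on how well the network approximates that field given the mixture and prompts.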
Perception Encoder Audiovisual
Our second model, Perception Encoder Audiovisual, is the engine behind SAM Audio’s results. It powers core components such as the primary captioning model and SAM Audio Judge, our automatic judge model for audio separation. Built on Meta Perception Encoder — an open source model we released in April — PE-AV extends advanced computer vision capabilities to audio. Just as we adapted the model for object detection in SAM 3, we expanded its framework to encode sounds for SAM Audio, enabling the system to separate complex audio mixtures and adapt to real-world scenarios where visual context is important.
By extracting frame-level video features and aligning them with audio representations, the system combines and timestamps audiovisual information. This design allows SAM Audio to accurately separate...