google-deepmind/representations4d

Jupyter Notebook

Language: Jupyter Notebook

License: Apache-2.0

Stars: 164

Forks: 7

Open issues: 2

Created: 2025-06-23T10:30:16Z

Pushed: 2026-06-08T15:58:11Z

Default branch: main

Fork: no

Archived: no

README:

4D Representations

Welcome to the official Google DeepMind repository for 4D Representations.

  • Scaling 4D Representations focuses on evaluating self-supervised learning on non-semantic vision tasks that are more spatial (3D) and temporal (+1D = 4D), such as camera pose estimation, point and object tracking, and depth estimation. We show that, when learning from very large video datasets, masked auto-encoding (MAE) with transformer video models scales: performance on these 4D tasks improves consistently as model size increases from 20M to 22B parameters, by far the largest self-supervised video model reported.

![scaling results](./assets/scaling_20M_20B.png)
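
For readers new to masked auto-encoding on video, the core idea can be sketched in a few lines of NumPy. This is an illustration only, not the repository's code: the patch size, the 90% mask ratio, and the helper names `patchify_video`, `random_mask`, and `masked_mse` are assumptions chosen for the example.

```python
import numpy as np

def patchify_video(video, pt=2, ph=16, pw=16):
    """Split a (T, H, W, C) video into flattened space-time patches.

    The patch size (pt, ph, pw) is an assumption made for this sketch.
    """
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)  # patch-grid axes first, within-patch axes last
    return x.reshape(-1, pt * ph * pw * C)  # (num_patches, patch_dim)

def random_mask(num_patches, mask_ratio, rng):
    """Return sorted indices of visible and masked patches."""
    num_masked = int(round(num_patches * mask_ratio))
    perm = rng.permutation(num_patches)
    return np.sort(perm[num_masked:]), np.sort(perm[:num_masked])

def masked_mse(pred, target):
    """Pixel reconstruction loss, computed on the masked patches only."""
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
video = rng.random((16, 224, 224, 3), dtype=np.float32)  # synthetic clip
patches = patchify_video(video)
visible_idx, masked_idx = random_mask(len(patches), mask_ratio=0.9, rng=rng)

visible_tokens = patches[visible_idx]  # what an encoder would see
targets = patches[masked_idx]          # pixels a decoder would reconstruct
pred = np.zeros_like(targets)          # stand-in for a decoder output
loss = masked_mse(pred, targets)
```

Because the encoder only processes the visible patches, a high mask ratio keeps the cost per clip low, which is one reason this style of objective is attractive for large models.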

  • Moving Off-the-Grid (MooG) introduces a self-supervised video representation that allows latent tokens to move freely across space and time, staying aligned with dynamic scene elements rather than fixed pixel grids. By combining cross-attention with positional embeddings, MooG disentangles representation structure from image structure, enabling tokens to bind to meaningful objects and regions. Trained with a simple next-frame prediction objective, MooG naturally learns object-centric tracking representations and achieves strong performance across downstream tasks with lightweight readouts.

![moog architecture](./assets/moog.png)
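
The point about decoupling representation structure from image structure can be illustrated with a bare cross-attention read. The sketch below is not MooG's architecture: it uses a single head, omits all learned projections, and uses random arrays in place of real features and positional embeddings. The function name `cross_attend` is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(latents, grid_features, pos_emb):
    """Single-head cross-attention: latent tokens query position-tagged grid features.

    Simplification for this sketch: no learned query/key/value projections,
    and the keys double as values.
    """
    keys = grid_features + pos_emb                       # (N, D)
    scores = latents @ keys.T / np.sqrt(latents.shape[-1])  # (K, N)
    weights = softmax(scores, axis=-1)                   # each latent's read over the grid
    return weights @ keys, weights                       # (K, D), (K, N)

rng = np.random.default_rng(0)
D, K = 32, 8
latents = rng.normal(size=(K, D))

shapes = []
for side in (14, 28):  # two different grid resolutions
    n = side * side
    feats = rng.normal(size=(n, D))
    pos = rng.normal(size=(n, D))
    out, w = cross_attend(latents, feats, pos)
    shapes.append(out.shape)
```

The output keeps the shape `(K, D)` regardless of grid resolution: the number of latent tokens is set independently of the pixel grid, and each token's attention weights determine where in the frame it reads from.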

  • Recurrent Video Masked Autoencoders (RVM) proposes a recurrent, transformer-based approach to video representation learning that models temporal structure using an asymmetric masking objective and a simple pixel reconstruction loss. RVM learns an efficient general-purpose encoder that matches or exceeds state-of-the-art video models on action recognition, tracking, and dense geometric tasks, while remaining competitive with strong image models. It is particularly effective in the small-model regime, achieving up to 30× greater parameter efficiency without distillation.

![rvm architecture](./assets/RVM.png)

  • A Mixed Diet Makes DINO An Omnivorous Vision Encoder proposes a lightweight post-training recipe to adapt visual foundation models like DINOv2. The objective is to increase feature alignment between multi-sensory views (e.g., RGB images and depth maps) of the same scene. Omnivorous post-training improves not only a vision model's representation alignment (e.g., facilitating cross-modal retrieval), but also its downstream scene understanding (on 3D and semantic tasks) and its ability to transfer to unseen modalities.

![omnivorous architecture](./assets/omnivorous-method.png)
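
To make "feature alignment" and "cross-modal retrieval" concrete, here is a minimal sketch that scores retrieval between paired embeddings using cosine similarity. It is not the paper's training or evaluation code: the embeddings are synthetic stand-ins (depth features are simulated as noisy copies of the RGB features), and the function names are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def cross_modal_retrieval(rgb_feats, depth_feats):
    """For each RGB embedding, find the most similar depth embedding."""
    sim = l2_normalize(rgb_feats) @ l2_normalize(depth_feats).T  # cosine similarities
    return sim.argmax(axis=1), sim

rng = np.random.default_rng(0)
num_scenes, dim = 5, 64
rgb = rng.normal(size=(num_scenes, dim))                  # stand-in RGB embeddings
depth = rgb + 0.1 * rng.normal(size=(num_scenes, dim))    # simulated well-aligned depth embeddings

nearest, sim = cross_modal_retrieval(rgb, depth)
recall_at_1 = float((nearest == np.arange(num_scenes)).mean())
```

When the two modalities are well aligned, the diagonal of `sim` dominates each row and `recall_at_1` approaches 1. With embeddings from a real encoder, the same measurement would indicate how closely RGB and depth views of the same scene land in feature space.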

Installation

git clone https://github.com/google-deepmind/representations4d.git
cd representations4d

python3 -m venv representations4d_env
source representations4d_env/bin/activate
pip install .

Demo

  • [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/google-deepmind/representations4d/blob/main/colabs/scaling4d_depth_demo.ipynb) Depth estimation with 4DS-B-dist-e backbone

  • [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/google-deepmind/representations4d/blob/main/colabs/moog_inference_demo.ipynb) Box tracking and point tracking with MooG backbone

  • [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/google-deepmind/representations4d/blob/main/colabs/rvm_inference_demo.ipynb) Segmentation tracking, keypoint tracking, and masked video reconstruction with RVM backbone (encoder + decoder)

  • [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/google-deepmind/representations4d/blob/main/colabs/rvm_evaluation_demo.ipynb) Segmentation tracking and keypoint tracking evaluation for video models

  • [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/google-deepmind/representations4d/blob/main/colabs/omnivorous_dino_inference_demo.ipynb) Feature alignment in paired visual modalities with DINOv2 and Omnivorous Vision models

Checkpoints

We release the following checkpoints:

| Name | Model | # Params | File Size | Checkpoint |
| -------- | ------- | :-------: | :-------: | :-------: |
| 4DS-B-dist-e | Backbone (ViT-B) | 88M | 334MB | link |
| 4DS-e | Backbone (ViT-e) | 3.8B | 14GB | link |
| 4DS-B-dist-e ScanNet depth | Backbone (ViT-B) + Readout | 105M | 420MB | link |
| MooG | Backbone (ConvNet + Transformer) | 35M | 140MB | link |
| MooG | Box Track Readout (Cross Attention) | 35M | 140MB | link |
| MooG | Point Track Readout (Cross Attention) | 35M | 140MB | link |
| RVM | Encoder + Decoder (ViT-S) | 34M | 270MB | link |
| RVM | Encoder + Decoder (ViT-B) | 117M | 641MB | link |
| RVM | Encoder + Decoder (ViT-L) | 375M | 1.6GB | link |
| RVM | Encoder + Decoder (ViT-H) | 743M | 3.1GB | link |
| DINOv2 | Frozen Teacher (ViT-B) | 86.5M | 1.6GB | link |
| Omnivorous DINOv2 | Adapted…
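
As a rough sanity check when planning downloads, several of the listed file sizes are close to what 32-bit floats would require (4 bytes per parameter). The table does not state the storage precision, so that figure is an assumption. The sketch below compares the estimate with a few rows copied from the table; the helper name `estimated_size_bytes` is hypothetical.

```python
def estimated_size_bytes(num_params, bytes_per_param=4):
    """Estimate checkpoint size, assuming every parameter is stored as float32."""
    return num_params * bytes_per_param

# (parameter count, file size as listed in the table above)
rows = {
    "4DS-B-dist-e": (88e6, "334MB"),
    "4DS-e": (3.8e9, "14GB"),
    "MooG backbone": (35e6, "140MB"),
}

estimates = {}
for name, (num_params, listed) in rows.items():
    est = estimated_size_bytes(num_params)
    estimates[name] = est
    print(f"{name}: ~{est / 1e6:.0f} MB ({est / 2**20:.0f} MiB), listed as {listed}")
```

Not every row follows this ratio. RVM ViT-S, for example, lists 34M parameters at 270MB, roughly twice the float32 estimate, so treat this as a rule of thumb, not a description of the checkpoint format.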


Notability

notability 5.0/10

New repo from DeepMind, moderate stars