google-deepmind/representations4d
Language: Jupyter Notebook
License: Apache-2.0
Stars: 164
Forks: 7
Open issues: 2
Created: 2025-06-23T10:30:16Z
Pushed: 2026-06-08T15:58:11Z
Default branch: main
Fork: no
Archived: no
README:
4D Representations
Welcome to the official Google DeepMind repository for 4D Representations.
- Scaling 4D Representations focuses on evaluating self-supervised learning on non-semantic vision tasks that are more spatial (3D) and temporal (+1D = 4D), such as camera pose estimation, point and object tracking, and depth estimation. We show that, when trained on very large video datasets, masked auto-encoding (MAE) with transformer video models does scale: performance on these 4D tasks improves consistently as model size grows from 20M to 22B parameters, the latter being by far the largest self-supervised video model reported.

- Moving Off-the-Grid (MooG) introduces a self-supervised video representation that allows latent tokens to move freely across space and time, staying aligned with dynamic scene elements rather than fixed pixel grids. By combining cross-attention with positional embeddings, MooG disentangles representation structure from image structure, enabling tokens to bind to meaningful objects and regions. Trained with a simple next-frame prediction objective, MooG naturally learns object-centric tracking representations and achieves strong performance across downstream tasks with lightweight readouts.

- Recurrent Video Masked Autoencoders (RVM) proposes a recurrent, transformer-based approach to video representation learning that models temporal structure using an asymmetric masking objective and simple pixel reconstruction loss. RVM learns an efficient general-purpose encoder that matches or exceeds state-of-the-art video models on action recognition, tracking, and dense geometric tasks, while remaining competitive with strong image models. It is particularly effective in the small-model regime, achieving up to 30× greater parameter efficiency without distillation.

- A Mixed Diet Makes DINO An Omnivorous Vision Encoder proposes a lightweight post-training recipe to adapt visual foundation models like DINOv2. The objective is to increase feature alignment between multi-sensory views (e.g., RGB images and depth maps) of the same scene. Omnivorous post-training improves not only a vision model's representation alignment (e.g., facilitating cross-modal retrieval), but also its downstream scene understanding (on 3D and semantic tasks) and its ability to transfer to novel, unseen modalities.
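
The masked auto-encoding objective named in the Scaling 4D Representations and RVM entries above can be illustrated with a minimal sketch. It is not taken from this repository: the function names, the 16×14×14 token grid, and the 0.9 mask ratio are assumptions chosen for illustration, and the loss follows the common MAE convention of scoring only the masked tokens.

```python
import random


def sample_mask(num_frames, grid_h, grid_w, mask_ratio, seed=0):
    """Randomly split video patch-token indices into visible and masked sets.

    All arguments are illustrative assumptions, not values from the repository.
    """
    num_tokens = num_frames * grid_h * grid_w
    num_masked = int(round(mask_ratio * num_tokens))
    rng = random.Random(seed)
    order = list(range(num_tokens))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


def masked_mse(pred, target, masked):
    """Mean squared pixel error, computed over the masked tokens only."""
    total, count = 0.0, 0
    for i in masked:
        for p, t in zip(pred[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0


# Hypothetical grid: 16 frames, 14 x 14 patches per frame.
visible, masked = sample_mask(16, 14, 14, mask_ratio=0.9)
print(len(visible), "visible tokens,", len(masked), "masked tokens")
```

In a setup like this, only the visible tokens would be fed to the encoder, and a decoder would be trained to reconstruct the pixels of the masked ones.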

Installation
git clone https://github.com/google-deepmind/representations4d.git
cd representations4d
python3 -m venv representations4d_env
source representations4d_env/bin/activate
pip install .
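After running the steps above, a quick sanity check along the following lines may help confirm that the virtual environment is active and the package was installed. This is only a sketch using the Python standard library; the distribution name `representations4d` is an assumption based on the repository name and is not confirmed by the README.

```python
import sys
from importlib import metadata


def in_virtualenv():
    """Return True when the interpreter is running inside a virtual environment."""
    return sys.prefix != sys.base_prefix


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# "representations4d" is an assumed distribution name, not taken from the README.
print("virtual environment active:", in_virtualenv())
print("installed version:", installed_version("representations4d"))
```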
Demo
- [Depth estimation with 4DS-B-dist-e backbone](https://colab.research.google.com/github/google-deepmind/representations4d/blob/main/colabs/scaling4d_depth_demo.ipynb)
- [Box tracking and point tracking with MooG backbone](https://colab.research.google.com/github/google-deepmind/representations4d/blob/main/colabs/moog_inference_demo.ipynb)
- [Segmentation tracking, keypoint tracking, and masked video reconstruction with RVM backbone (encoder + decoder)](https://colab.research.google.com/github/google-deepmind/representations4d/blob/main/colabs/rvm_inference_demo.ipynb)
- [Segmentation tracking and keypoint tracking evaluation for video models](https://colab.research.google.com/github/google-deepmind/representations4d/blob/main/colabs/rvm_evaluation_demo.ipynb)
- [Feature alignment across paired visual modalities in DINOv2 and Omnivorous Vision models](https://colab.research.google.com/github/google-deepmind/representations4d/blob/main/colabs/omnivorous_dino_inference_demo.ipynb)
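
The feature alignment that the last demo refers to can be pictured as a cosine-similarity comparison between embeddings of paired views. The sketch below is an illustration only and does not use the repository's models: the three-dimensional toy vectors and the function names are hypothetical stand-ins for real encoder features.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(query, gallery):
    """Index of the gallery vector most similar to the query."""
    scores = [cosine(query, g) for g in gallery]
    return max(range(len(gallery)), key=scores.__getitem__)


def mean_paired_alignment(feats_a, feats_b):
    """Average cosine similarity between features of the same scene."""
    sims = [cosine(a, b) for a, b in zip(feats_a, feats_b)]
    return sum(sims) / len(sims)


# Hypothetical embeddings: row i of each list describes the same scene.
rgb_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
depth_feats = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.9]]

# Cross-modal retrieval: look up each depth embedding among the RGB embeddings.
matches = [retrieve(d, rgb_feats) for d in depth_feats]
print("retrieved RGB index per depth query:", matches)
print("mean paired alignment:", mean_paired_alignment(rgb_feats, depth_feats))
```

A higher mean paired similarity and more correct retrievals would indicate better-aligned modalities under this toy measure.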
Checkpoints
We release the following checkpoints:

| Name | Model | # Params | File Size | Checkpoint |
| -------- | ------- | :-------: | :-------: | :-------: |
| 4DS-B-dist-e | Backbone (ViT-B) | 88M | 334MB | link |
| 4DS-e | Backbone (ViT-e) | 3.8B | 14GB | link |
| 4DS-B-dist-e ScanNet depth | Backbone (ViT-B) + Readout | 105M | 420MB | link |
| MooG | Backbone (ConvNet + Transformer) | 35M | 140MB | link |
| MooG | Box Track Readout (Cross Attention) | 35M | 140MB | link |
| MooG | Point Track Readout (Cross Attention) | 35M | 140MB | link |
| RVM | Encoder + Decoder (ViT-S) | 34M | 270MB | link |
| RVM | Encoder + Decoder (ViT-B) | 117M | 641MB | link |
| RVM | Encoder + Decoder (ViT-L) | 375M | 1.6GB | link |
| RVM | Encoder + Decoder (ViT-H) | 743M | 3.1GB | link |
| DINOv2 | Frozen Teacher (ViT-B) | 86.5M | 1.6GB | link |
| Omnivorous DINOv2 | Adapted…
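
As a rough way to read the table, the file size can be divided by the parameter count to get bytes per parameter. The sketch below does that arithmetic for the complete rows listed above; it assumes 1 GB = 1000 MB and 1 MB = 10^6 bytes, which the README does not specify, and the shortened row labels are ours.

```python
# (label, parameters in millions, file size in MB), copied from the complete
# table rows. Assumes 1 GB = 1000 MB; the README does not state the unit.
ROWS = [
    ("4DS-B-dist-e (ViT-B)", 88, 334),
    ("4DS-e (ViT-e)", 3800, 14000),
    ("4DS-B-dist-e ScanNet depth", 105, 420),
    ("MooG backbone", 35, 140),
    ("RVM (ViT-S)", 34, 270),
    ("RVM (ViT-B)", 117, 641),
    ("RVM (ViT-L)", 375, 1600),
    ("RVM (ViT-H)", 743, 3100),
    ("DINOv2 teacher (ViT-B)", 86.5, 1600),
]


def bytes_per_param(params_millions, size_mb):
    """MB divided by millions of parameters equals bytes per parameter."""
    return size_mb / params_millions


for label, params_m, size_mb in ROWS:
    print(f"{label:28s} {bytes_per_param(params_m, size_mb):6.2f} bytes/param")
```

Ratios near 4 are what 32-bit floating-point weights would give. The README excerpt does not state the storage format or what else each file contains, so larger ratios should not be over-interpreted.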
Excerpt shown — open the source for the full document.
Notability: 5.0/10 (new repo from DeepMind, moderate stars)