What does this repo signal mean?

Google (DeepMind / Gemini) published google-deepmind/platonic_rep_video (Python). This repository signal exposes tooling, eval, infrastructure, or model-adjacent work before it may appear in a launch post. High-signal details: repo google-deepmind/platonic_rep_video · language Python · Self-supervised video representation learning using platonic representations. onlylabs links this event to 1 captured evidence page and 6 related repo signals.

Google (DeepMind / Gemini) Repo: google-deepmind/platonic_rep_video

Captured source

source ↗

GitHub/github.com/google-deepmind/platonic_rep_video

google-deepmind/platonic_rep_video repository metadata

published Apr 22, 2026 · seen Jun 5 · captured Jun 11 · http 200 · method plain

google-deepmind/platonic_rep_video

Language: Python

License: Apache-2.0

Stars: 6

Forks: 2

Open issues: 0

Created: 2026-04-22T19:30:55Z

Pushed: 2026-04-22T19:43:13Z

Default branch: main

Fork: no

Archived: no

README:

Dynamic Reflections: Probing Video Representations with Text Alignment

[Overview](#overview) | [Installation](#installation) | [Quick Start](#quick-start) | [Reproduction](#full-reproduction) | [Custom Models](#custom-models)

![Teaser](assets/header.png)

This repository provides the official implementation for the paper "Dynamic Reflections: Probing Video Representations with Text Alignment". We provide the code needed to reproduce our key findings on video-text representation alignment.

Overview

The core idea of our work is to measure and improve the alignment between video and language representations. This codebase allows you to:

Extract features from various video and language models on datasets like VaTeX and PVD.
Measure alignment between these representations using metrics like mutual k-NN.
Experiment with different model architectures, pooling strategies, and temporal dynamics.

The project is structured to be extensible, allowing you to easily add your own models and datasets.
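As a rough illustration of the mutual k-NN metric mentioned above, the sketch below scores how often a sample's nearest neighbors in one feature space are also its nearest neighbors in the other. It is a minimal pure-Python version written for this document, not the repository's implementation: the function names, the use of cosine similarity, and the toy features are all assumptions.

```python
import math


def _normalize(rows):
    """L2-normalize each feature vector so dot products equal cosine similarity."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        out.append([x / norm for x in row])
    return out


def _knn_sets(feats, k):
    """Return, for each sample, the set of indices of its k nearest neighbors."""
    feats = _normalize(feats)
    neighbors = []
    for i, a in enumerate(feats):
        sims = [
            (sum(x * y for x, y in zip(a, b)), j)
            for j, b in enumerate(feats)
            if j != i  # a sample is never its own neighbor
        ]
        # Highest similarity first; ties broken by index for determinism.
        sims.sort(key=lambda t: (-t[0], t[1]))
        neighbors.append({j for _, j in sims[:k]})
    return neighbors


def mutual_knn(feats_a, feats_b, k):
    """Mean fraction of shared k-nearest neighbors between two feature spaces.

    feats_a and feats_b hold features for the same samples, in the same order
    (for example, video features and caption features).
    """
    if len(feats_a) != len(feats_b):
        raise ValueError("both feature sets must describe the same samples")
    if not 0 < k < len(feats_a):
        raise ValueError("k must satisfy 0 < k < number of samples")
    nn_a = _knn_sets(feats_a, k)
    nn_b = _knn_sets(feats_b, k)
    overlaps = [len(a & b) / k for a, b in zip(nn_a, nn_b)]
    return sum(overlaps) / len(overlaps)


# Toy features (hypothetical): four samples in a 2-D space.
video_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
text_feats_aligned = [[2.0, 0.0], [1.8, 0.2], [0.0, 2.0], [0.2, 1.8]]
text_feats_shuffled = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
```

The score ranges from 0 (no shared neighbors) to 1 (identical neighborhoods). For real feature matrices, a vectorized NumPy or PyTorch version would be far faster than these nested loops.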

Installation

We recommend using uv for managing dependencies.

1. Initialize and sync the environment:

git clone git@github.com:google-deepmind/platonic_rep_video.git
cd platonic_rep_video
uv init
uv sync

2. Install `ffmpeg`: ffmpeg is required for video processing with torchcodec. We recommend installing it via conda.

conda install ffmpeg

Our setup is tested with CUDA 12.4. If you are using a different CUDA version, please adjust the torch and torchvision versions in pyproject.toml accordingly.

Data

We provide support for two datasets: VaTeX and PVD.

VaTeX

Instructions for downloading VaTeX can be found here. We do not provide the videos, but for reproducibility we provide a lightly processed subset of 1024 examples from the train annotations, which we use for our analysis.

Our dataloaders may need modifications depending on how the final videos are downloaded. We recommend preprocessing the videos to speed up feature extraction.

Perception Encoder Video Dataset (PVD)

Instructions for downloading PVD can be found here. Similarly to VaTeX, we provide a lightly processed subset of 1024 examples of the test annotations which we used for our analysis. We also include our Gemini-2.5-Pro rephrased captions, which expand each model caption with 10 individual captions in a similar format to VaTeX.

We also provide a script for downloading the videos from Hugging Face, misc/download_pvd.py, which is compatible with our dataloaders.

Quick Start

The easiest way to get started is to run the provided pvd_sample experiment. This will extract features for a small set of models on a sample of the PVD dataset. Our configs are located in configs/experiments.

1. Run feature extraction: We recommend extracting language and vision features separately. LLMs are automatically sharded across all available devices, while most video models should fit on one GPU. We speed up video extraction by running multiple video models in parallel across devices, which means language and video feature extraction cannot run at the same time. Use the appropriate flag for each modality.

# Extract language features
uv run scripts/main_extract.py pvd_sample --llm_only

# Extract vision features
uv run scripts/main_extract.py pvd_sample --video_only

For multi-GPU execution, you can use the launcher.py script to parallelize feature extraction (recommended for vision models):

uv run scripts/launcher.py pvd_sample --num_gpus 4 --video_only
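To make the idea of parallelizing video models across devices concrete, here is a minimal sketch of one possible scheduling scheme: a round-robin assignment of model names to GPU indices. This is an assumption about how such a launcher could work, not a description of launcher.py; the function name and the example model names (other than 'videomaev2_base', which appears in this README) are hypothetical.

```python
def assign_models_to_gpus(model_names, num_gpus):
    """Distribute model names over GPU indices in round-robin order.

    Returns a list with one bucket per GPU; bucket i holds the models
    that a worker pinned to GPU i would process sequentially.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    buckets = [[] for _ in range(num_gpus)]
    for position, name in enumerate(model_names):
        buckets[position % num_gpus].append(name)
    return buckets


# Hypothetical model list; only 'videomaev2_base' is named in the README.
models = ["videomaev2_base", "model_b", "model_c", "model_d", "model_e"]
plan = assign_models_to_gpus(models, num_gpus=2)
```

A launcher built this way could start one worker process per bucket and restrict each worker to its device, for example by setting the CUDA_VISIBLE_DEVICES environment variable before the process starts.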

Extracted features will be saved to the ./results/sample/pvd/ directory by default. You may need to modify if you use python instead of uv.

2. Measure and plot alignment: Once features are extracted, you can measure the alignment between them:

uv run scripts/measure_alignment.py pvd_sample
uv run scripts/plot_alignment.py pvd_sample

This will compute the alignment scores and save the results to ./results/alignment/sample/.

The same is possible for the VaTeX dataset using the vatex_sample config. The only difference is that the data must be obtained according to the original instructions. Our provided dataloader will not work out of the box, as we preprocess our data into pickle files first.

Full Reproduction

To evaluate alignment on the full PVD dataset, you first need to run the pvd experiment to extract features for all models.

# Extract language features (No parallelization as LLMs are sharded by default)
uv run scripts/main_extract.py pvd --llm_only

# Extract vision features (multi-GPU)
uv run scripts/launcher.py --num_gpus 8 pvd --video_only

After extraction, run the measure_alignment.py script with the same config, and optionally plot the data as well.

uv run scripts/measure_alignment.py pvd
uv run scripts/plot_alignment.py pvd

The same is possible for the VaTeX dataset using the vatex config.

Finally, we also provide our retrieval accuracies for the video models on Kinetics-400 and SSv2, which should be enough to fully reproduce Figure 2 of our paper.

Custom Models

This project uses a modular, registry-based system to handle video and language models, making it easy to experiment with different architectures. Here are the core components:

Model Registries (`LLM_REGISTRY`, `VIDEO_MODEL_REGISTRY`): Central dictionaries in registry/llm.py and registry/video.py that hold all available models. Each model is mapped to a unique string name (e.g., 'gemma2-2b-it', 'videomaev2_base'), allowing you to reference them easily in your experiment configurations.

`@register` Decorator (for Video Models): A Python decorator that simplifies adding new video models. By decorating a model's class with @register('your-model-name'), you automatically add it to the VIDEO_MODEL_REGISTRY.

`VideoModelInterface` (Abstract Base Class): This class in registry/video.py defines a standard contract for all video models, with functions like initialize, preprocess, and forward_intermediates.
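The registry pattern described above can be sketched in a few lines. The names VIDEO_MODEL_REGISTRY, register, VideoModelInterface, initialize, preprocess, and forward_intermediates come from this README, but everything else here is an assumption made for illustration: the method signatures, the duplicate-name check, and the toy model are hypothetical and may differ from the code in registry/video.py.

```python
from abc import ABC, abstractmethod

# Maps a unique string name to a model class (assumed structure).
VIDEO_MODEL_REGISTRY = {}


def register(name):
    """Class decorator that adds the decorated class to the registry."""
    def decorator(cls):
        if name in VIDEO_MODEL_REGISTRY:
            raise KeyError(f"model name already registered: {name}")
        VIDEO_MODEL_REGISTRY[name] = cls
        return cls
    return decorator


class VideoModelInterface(ABC):
    """Contract for video models; signatures here are hypothetical."""

    @abstractmethod
    def initialize(self, device="cpu"):
        """Load weights and move the model to the given device."""

    @abstractmethod
    def preprocess(self, frames):
        """Convert raw frames into model inputs."""

    @abstractmethod
    def forward_intermediates(self, inputs):
        """Return a list of per-layer feature vectors."""


@register("toy-mean-model")  # hypothetical model name
class ToyMeanModel(VideoModelInterface):
    """Stand-in model: its single 'layer' is a temporal mean over frames."""

    def initialize(self, device="cpu"):
        self.device = device

    def preprocess(self, frames):
        return [[float(x) for x in frame] for frame in frames]

    def forward_intermediates(self, inputs):
        n_frames = len(inputs)
        pooled = [sum(column) / n_frames for column in zip(*inputs)]
        return [pooled]


# Look the model up by name, as an experiment config might.
model = VIDEO_MODEL_REGISTRY["toy-mean-model"]()
model.initialize()
features = model.forward_intermediates(model.preprocess([[1, 2], [3, 4]]))
```

Keeping the lookup keyed on a plain string lets experiment configs name models without importing their classes.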

Adding a Language Model...

Excerpt shown — open the source for the full document.

Notability

notability 4.0/10

New DeepMind video repo, low stars