google-deepmind/platonic_rep_video
Language: Python
License: Apache-2.0
Stars: 6
Forks: 2
Open issues: 0
Created: 2026-04-22T19:30:55Z
Pushed: 2026-04-22T19:43:13Z
Default branch: main
Fork: no
Archived: no
README:
Dynamic Reflections: Probing Video Representations with Text Alignment
[Overview](#overview) | [Installation](#installation) | [Quick Start](#quick-start) | [Reproduction](#full-reproduction) | [Custom Models](#custom-models)

This repository provides the official implementation for the paper "Dynamic Reflections: Probing Video Representations with Text Alignment". We provide the necessary code to reproduce our key findings on video-text representation alignment.
Overview
The core idea of our work is to measure and improve the alignment between video and language representations. This codebase allows you to:
- Extract features from various video and language models on datasets like VaTeX and PVD.
- Measure alignment between these representations using metrics like mutual k-NN.
- Experiment with different model architectures, pooling strategies, and temporal dynamics.
The project is structured to be extensible, allowing you to easily add your own models and datasets.
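As a rough illustration of the mutual k-NN metric mentioned above, the sketch below scores two feature matrices that describe the same samples: for each sample it finds the k nearest neighbors in each space and reports the average fraction of shared neighbors. This is a minimal sketch and not the repository's implementation. The function name `mutual_knn`, the use of cosine similarity, and the random stand-in features are all assumptions made for illustration.

```python
import numpy as np


def mutual_knn(feats_a: np.ndarray, feats_b: np.ndarray, k: int = 10) -> float:
    """Mean overlap of the k-nearest-neighbor sets found in two feature spaces.

    Hypothetical helper: rows of `feats_a` and `feats_b` must refer to the
    same samples in the same order. Rows are assumed to be non-zero.
    """
    if feats_a.shape[0] != feats_b.shape[0]:
        raise ValueError("both feature sets must describe the same samples")
    n = feats_a.shape[0]
    if not 0 < k < n:
        raise ValueError("k must satisfy 1 <= k <= n - 1")

    def knn_indices(feats: np.ndarray) -> np.ndarray:
        # L2-normalize rows so the dot product equals cosine similarity.
        normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
        sim = normed @ normed.T
        # A sample must not count as its own neighbor.
        np.fill_diagonal(sim, -np.inf)
        # Indices of the k most similar samples for every row.
        return np.argsort(-sim, axis=1)[:, :k]

    nn_a = knn_indices(feats_a)
    nn_b = knn_indices(feats_b)
    overlap = [len(set(a) & set(b)) for a, b in zip(nn_a, nn_b)]
    return float(np.mean(overlap)) / k


# Random stand-ins for pooled video and text features (assumed shapes).
rng = np.random.default_rng(0)
video = rng.normal(size=(64, 32))
text = rng.normal(size=(64, 16))
print(mutual_knn(video, video, k=5))  # identical spaces share all neighbors
print(mutual_knn(video, text, k=5))   # unrelated spaces share few neighbors
```

The two feature spaces may have different dimensionality, since only the neighbor indices are compared. The score lies between 0 and 1.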
Installation
We recommend using uv for managing dependencies.
1. Initialize and sync the environment:
git clone git@github.com:google-deepmind/platonic-rep-video.git
cd platonic-rep-video
uv init
uv sync
2. Install `ffmpeg`: ffmpeg is required for video processing with torchcodec. We recommend installing it via conda.
conda install ffmpeg
Our setup is tested with CUDA 12.4. If you are using a different CUDA version, please adjust the torch and torchvision versions in pyproject.toml accordingly.
Data
We provide support for two datasets: VaTeX and PVD.
VaTeX
Instructions for downloading VaTeX can be found here. We do not provide the videos. For reproducibility, we provide a lightly processed subset of 1024 examples from the train annotations, which we use for our analysis.
Our dataloaders may need modifications depending on how the final videos are downloaded. We recommend processing the videos to speed up feature extraction.
Perception Encoder Video Dataset (PVD)
Instructions for downloading PVD can be found here. Similarly to VaTeX, we provide a lightly processed subset of 1024 examples of the test annotations which we used for our analysis. We also include our Gemini-2.5-Pro rephrased captions, which expand each model caption with 10 individual captions in a similar format to VaTeX.
We also provide a script for downloading the videos from Hugging Face in misc/download_pvd.py which is compatible with our dataloaders.
Quick Start
The easiest way to get started is to run the provided pvd_sample experiment. This will extract features for a small set of models on a sample of the PVD dataset. Our configs are located in configs/experiments.
1. Run feature extraction: We recommend extracting language and vision features separately. LLMs are automatically sharded across all available devices, while most video models fit on a single GPU. We speed up video extraction by running multiple video models in parallel across devices, which means language and video feature extraction cannot run together. Use the appropriate flag for each modality.
# Extract language features
uv run scripts/main_extract.py pvd_sample --llm_only
# Extract vision features
uv run scripts/main_extract.py pvd_sample --video_only
For multi-GPU execution, you can use the launcher.py script to parallelize feature extraction (recommended for vision models):
uv run scripts/launcher.py pvd_sample --num_gpus 4 --video_only
Extracted features will be saved to the ./results/sample/pvd/ directory by default. If you run the scripts with python instead of uv, you may need to modify the commands accordingly.
2. Measure and plot alignment: Once features are extracted, you can measure the alignment between them:
uv run scripts/measure_alignment.py pvd_sample
uv run scripts/plot_alignment.py pvd_sample
This will compute the alignment scores and save the results to ./results/alignment/sample/.
The same is possible for the VaTeX dataset using the vatex_sample config. The only difference is that the data will need to be obtained according to the original instructions. Our provided dataloader will not work out of the box, as we preprocess our data into pickle files first.
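The multi-GPU behavior described in step 1, where several video models are spread across devices, can be pictured with a small round-robin scheduler. This is only a sketch and does not reproduce the logic of scripts/launcher.py. The helpers `assign_models_to_gpus` and `worker_env`, and every model name except videomaev2_base, are hypothetical. Pinning a worker process to one device through the CUDA_VISIBLE_DEVICES environment variable is a common convention and is assumed here rather than taken from the repository.

```python
import os


def assign_models_to_gpus(model_names, num_gpus):
    """Round-robin assignment: the i-th model goes to GPU i % num_gpus.

    Hypothetical helper for illustration only.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    assignment = {gpu: [] for gpu in range(num_gpus)}
    for i, name in enumerate(model_names):
        assignment[i % num_gpus].append(name)
    return assignment


def worker_env(gpu_id):
    """Copy of the current environment in which only one GPU is visible."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return env


# Only 'videomaev2_base' appears in the README; the rest are placeholders.
models = ["videomaev2_base", "model_b", "model_c", "model_d", "model_e"]
plan = assign_models_to_gpus(models, num_gpus=4)
for gpu, names in plan.items():
    print(f"GPU {gpu}: {names}")
```

Each entry of `plan` could then be handed to a separate worker process started with `worker_env(gpu)`, so that every worker sees a single device. A sharded LLM, by contrast, needs all devices at once, which is one way to read the recommendation to extract the two modalities separately.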
Full Reproduction
To evaluate alignment on the full PVD dataset, you first need to run the pvd experiment to extract features for all models.
# Extract language features (no parallelization, as LLMs are sharded by default)
uv run scripts/main_extract.py pvd --llm_only
# Extract vision features (multi-GPU)
uv run scripts/launcher.py --num_gpus 8 pvd --video_only
After extraction, run the measure_alignment.py script with the same config, and optionally plot the data as well.
uv run scripts/measure_alignment.py pvd
uv run scripts/plot_alignment.py pvd
The same is possible for the VaTeX dataset using the vatex config.
Finally, we also provide our retrieval accuracies for the video models on Kinetics-400 and SSv2, which should be enough to fully reproduce Figure 2 of our paper.
Custom Models
This project uses a modular, registry-based system to handle video and language models, making it easy to experiment with different architectures. Here are the core components:
- Model Registries (`LLM_REGISTRY`, `VIDEO_MODEL_REGISTRY`): Central dictionaries in registry/llm.py and registry/video.py that hold all available models. Each model is mapped to a unique string name (e.g., 'gemma2-2b-it', 'videomaev2_base'), allowing you to reference them easily in your experiment configurations.
- `@register` Decorator (for Video Models): A Python decorator that simplifies adding new video models. By decorating a model's class with @register('your-model-name'), you automatically add it to the VIDEO_MODEL_REGISTRY.
- `VideoModelInterface` (Abstract Base Class): This class in registry/video.py defines a standard contract for all video models for functions like initialize, preprocess, and forward_intermediates.
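The three components above fit together roughly as in the following sketch. It mirrors the names given in this README (`VIDEO_MODEL_REGISTRY`, `register`, `VideoModelInterface`, and the three method names), but the method signatures, the duplicate-name check, and the `ToyMeanModel` class are assumptions made for illustration. The actual definitions live in registry/video.py and may differ.

```python
from abc import ABC, abstractmethod

# Maps a unique string name to a model class (assumed structure).
VIDEO_MODEL_REGISTRY = {}


def register(name):
    """Class decorator that stores the decorated class under `name`."""
    def decorator(cls):
        if name in VIDEO_MODEL_REGISTRY:
            raise KeyError(f"model '{name}' is already registered")
        VIDEO_MODEL_REGISTRY[name] = cls
        return cls
    return decorator


class VideoModelInterface(ABC):
    """Contract every video model follows; signatures here are assumed."""

    @abstractmethod
    def initialize(self):
        """Load weights and prepare the model."""

    @abstractmethod
    def preprocess(self, frames):
        """Turn raw frames into model inputs."""

    @abstractmethod
    def forward_intermediates(self, inputs):
        """Return a list of intermediate features, one per layer."""


@register("toy-mean-model")  # hypothetical model name
class ToyMeanModel(VideoModelInterface):
    def initialize(self):
        self.ready = True

    def preprocess(self, frames):
        # Scale 8-bit pixel values to the range [0, 1].
        return [float(x) / 255.0 for x in frames]

    def forward_intermediates(self, inputs):
        # A single "layer" whose feature is the mean input value.
        return [sum(inputs) / len(inputs)]


# Look the model up by name, as an experiment config might.
model = VIDEO_MODEL_REGISTRY["toy-mean-model"]()
model.initialize()
features = model.forward_intermediates(model.preprocess([0, 255]))
print(features)
```

Because the decorator runs at class-definition time, importing the module that defines a model is enough to make it available by name. A class that omits one of the abstract methods cannot be instantiated, which surfaces interface mistakes early.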
Adding a Language Model…