google-deepmind/videoprism

Published: May 29, 2025

Description: Official repository for "VideoPrism: A Foundational Visual Encoder for Video Understanding" (ICML 2024)

Language: Python

License: Apache-2.0

Stars: 374

Forks: 38

Open issues: 9

Created: 2025-05-29T19:25:15Z

Pushed: 2026-05-14T21:55:03Z

Default branch: main

Fork: no

Archived: no

README:

VideoPrism: A Foundational Visual Encoder for Video Understanding

VideoPrism is a general-purpose video encoder designed to handle a wide spectrum of video understanding tasks, including classification, retrieval, localization, captioning, and question answering. It is pre-trained on a massive and diverse dataset: 1 billion image-text pairs from WebLI, 36 million high-quality video-text pairs, and 582 million video clips with noisy or machine-generated parallel text (subject to data wipeout). The pre-training approach is designed for this hybrid data, so that the model learns both from video-text pairs and from the videos themselves. VideoPrism is fairly easy to adapt to new video understanding tasks, and achieves state-of-the-art performance on 31 out of 33 public video understanding benchmarks using a single frozen model.

This repository releases the model weight checkpoints and hosts JAX/Flax utility functions for checkpoint loading and model inference.

Updates

  • [Mar-13-26]: Added video classification fine-tuning with the frozen backbone [`Colab notebook`]. :fire::fire:
  • [Jul-16-25]: Released VideoPrism video-text encoders for cross-modal retrieval [`Colab notebook`]. :fire::fire:
  • [Jun-15-25]: Added models to [`Hugging Face`].
  • [Jun-05-25]: Added video encoder demo [`Colab notebook`].
  • [Jun-03-25]: Released VideoPrism video encoders (Base and Large) [`Blog`] [`Paper`]. :fire::fire:

TODOs

  • [ ] Add PyTorch model support.

Getting started

You will need Python 3.9 or later. Download the code from GitHub and run:

$ git clone https://github.com/google-deepmind/videoprism.git
$ cd videoprism
$ pip install .

The following example code shows model checkpoint loading and inference. Alternatively, use the Colab notebook for video encoders or the Colab notebook for video-text encoders:

import jax
from videoprism import models as vp

# Video encoders.
model_name = 'videoprism_public_v1_base' # configuration name
flax_model = vp.get_model(model_name)
loaded_state = vp.load_pretrained_weights(model_name)

@jax.jit
def forward_fn(inputs):
  return flax_model.apply(loaded_state, inputs, train=False)

video_inputs = ... # Shape = [batch_size, num_frames, height, width, 3].
outputs, _ = forward_fn(video_inputs) # Shape = [batch_size, num_tokens, feature_channels].

# Video-text encoders.
model_name = 'videoprism_lvt_public_v1_base' # configuration name
flax_model = vp.get_model(model_name)
loaded_state = vp.load_pretrained_weights(model_name)
text_tokenizer = vp.load_text_tokenizer('c4_en')

@jax.jit
def forward_fn(inputs, text_token_ids, text_token_paddings, train=False):
  return flax_model.apply(
      loaded_state,
      inputs,
      text_token_ids,
      text_token_paddings,
      train=train,
  )

video_inputs = ... # Shape = [batch_size, num_frames, height, width, 3].
text_queries = ... # A list of input text queries.
text_ids, text_paddings = vp.tokenize_texts(text_tokenizer, text_queries)
video_embeddings, text_embeddings, _ = forward_fn(
    video_inputs, text_ids, text_paddings)  # Shape = [batch_size, feature_channels].

Video classification example

We provide a Colab notebook for video classification to show how to fine-tune VideoPrism for video classification by keeping the pre-trained backbone frozen and training only a lightweight attention-pooler + projection head.
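To make the idea of a lightweight head on a frozen backbone concrete, here is a minimal NumPy sketch of a single-query attention pooler followed by a linear projection. It is not the notebook's implementation: the function name, the parameter names, and the single-head design are all assumptions made for illustration, and only the forward pass is shown (no training loop).

```python
import numpy as np


def softmax(x, axis=-1):
  """Numerically stable softmax along the given axis."""
  x = x - x.max(axis=axis, keepdims=True)
  e = np.exp(x)
  return e / e.sum(axis=axis, keepdims=True)


def attention_pool_classify(tokens, query, w_key, w_value, w_out, b_out):
  """Hypothetical classification head applied to frozen backbone tokens.

  tokens:  [batch, num_tokens, channels] features from the frozen encoder.
  query:   [dim] learnable pooling query (assumed single-head).
  w_key:   [channels, dim] key projection.
  w_value: [channels, dim] value projection.
  w_out:   [dim, num_classes] classifier weights.
  b_out:   [num_classes] classifier bias.
  Returns logits with shape [batch, num_classes].
  """
  keys = tokens @ w_key  # [batch, num_tokens, dim]
  values = tokens @ w_value  # [batch, num_tokens, dim]
  scores = keys @ query / np.sqrt(query.shape[-1])  # [batch, num_tokens]
  weights = softmax(scores, axis=-1)  # attention over tokens
  pooled = np.einsum('bn,bnd->bd', weights, values)  # [batch, dim]
  return pooled @ w_out + b_out
```

In a setup like this, only the head parameters (query, projections, and classifier) would receive gradient updates, while the backbone that produces the tokens stays fixed.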

Released models

We release the following model variants:

| Model Name | Configuration Name | Model Type | Backbone | #Params | File Size | Checkpoint |
| -------- | -------- | ------- | :-------: | :-------: | :-------: | :-------: |
| VideoPrism-B | videoprism_public_v1_base | Video encoder | ViT-B | 114M | 458MB | link |
| VideoPrism-L | videoprism_public_v1_large | Video encoder | ViT-L | 354M | 1.42GB | link |
| VideoPrism-LvT-B | videoprism_lvt_public_v1_base | Video-text encoders | ViT-B | 248M | 991MB | link |
| VideoPrism-LvT-L | videoprism_lvt_public_v1_large | Video-text encoders | ViT-L | 580M | 2.30GB | link |
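As a rough sanity check on the table, the listed file sizes are close to what four bytes per parameter would give. The sketch below assumes float32 storage and decimal units (1 MB = 10^6 bytes); neither is stated in this README, so treat the result as an estimate only.

```python
BYTES_PER_PARAM = 4  # Assumption: weights are stored as float32.

# (parameter count, listed file size in bytes), copied from the table above.
RELEASED_MODELS = {
    'videoprism_public_v1_base': (114e6, 458e6),
    'videoprism_public_v1_large': (354e6, 1.42e9),
    'videoprism_lvt_public_v1_base': (248e6, 991e6),
    'videoprism_lvt_public_v1_large': (580e6, 2.30e9),
}


def estimate_size_bytes(num_params):
  """Estimates checkpoint size from the parameter count alone."""
  return num_params * BYTES_PER_PARAM


for name, (num_params, listed_bytes) in RELEASED_MODELS.items():
  estimated = estimate_size_bytes(num_params)
  print(f'{name}: ~{estimated / 1e6:.0f} MB estimated, '
        f'{listed_bytes / 1e6:.0f} MB listed')
```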

Video encoders take videos with shape (batch_size, num_frames, 288, 288, 3) as inputs and output embeddings with shape (batch_size, num_frames * 16 * 16, feature_channels), which can be reshaped into (batch_size, num_frames, 16, 16, feature_channels) for spatiotemporal representations. During model training, num_frames is set to 16 for VideoPrism-B and 8 for VideoPrism-L. Both models are expected to work with arbitrary num_frames by interpolating the temporal positional embeddings. The RGB values of input videos should be normalized to the range [0.0, 1.0].
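The input and output conventions described above can be sketched with two small helpers. Both function names are hypothetical and not part of the videoprism package. The sketch assumes frames arrive as uint8 RGB already resized to 288 x 288 (resizing is not shown), and it uses NumPy so that it runs without the model.

```python
import numpy as np

SPATIAL_TOKENS = 16  # Tokens per side for each frame, as stated above.


def normalize_frames(frames_uint8):
  """Scales uint8 RGB values in [0, 255] to float32 values in [0.0, 1.0]."""
  return frames_uint8.astype(np.float32) / 255.0


def reshape_tokens(embeddings, num_frames):
  """Reshapes [batch, num_frames * 16 * 16, channels] to a 5-D feature map."""
  batch_size, num_tokens, channels = embeddings.shape
  expected = num_frames * SPATIAL_TOKENS * SPATIAL_TOKENS
  if num_tokens != expected:
    raise ValueError(f'Expected {expected} tokens, got {num_tokens}.')
  return embeddings.reshape(
      batch_size, num_frames, SPATIAL_TOKENS, SPATIAL_TOKENS, channels)


# Dummy data standing in for a decoded video and for encoder outputs.
frames = np.zeros((1, 16, 288, 288, 3), dtype=np.uint8)
video_inputs = normalize_frames(frames)
fake_outputs = np.zeros((1, 16 * 16 * 16, 8), dtype=np.float32)  # 8 is a placeholder channel count.
feature_map = reshape_tokens(fake_outputs, num_frames=16)
```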

In video-text models, both video and text encoders produce global embeddings with shape (batch_size, feature_channels), whose similarities could be measured by cosine…
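Assuming the similarity in question is cosine similarity between the global embeddings, a video-to-text score matrix could be computed as in the sketch below. The helper name is hypothetical. The explicit L2 normalization is a precaution, since this excerpt does not say whether the returned embeddings are already normalized.

```python
import numpy as np


def cosine_similarity_matrix(video_embeddings, text_embeddings, eps=1e-8):
  """Returns a [num_videos, num_texts] matrix of cosine similarities.

  Both inputs have shape [n, feature_channels]. eps guards against
  division by zero for all-zero rows.
  """
  v_norm = np.linalg.norm(video_embeddings, axis=-1, keepdims=True)
  t_norm = np.linalg.norm(text_embeddings, axis=-1, keepdims=True)
  v = video_embeddings / (v_norm + eps)
  t = text_embeddings / (t_norm + eps)
  return v @ t.T
```

For retrieval, taking the argmax along the text axis of this matrix picks the best-matching query for each video, and the argmax along the video axis does the reverse.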
