google-deepmind/tips
Description: TIPSv2 (CVPR'26) and TIPS (ICLR'25)
Language: Jupyter Notebook
License: Apache-2.0
Stars: 543
Forks: 36
Open issues: 1
Created: 2025-03-03T13:18:42Z
Pushed: 2026-06-01T23:42:16Z
Default branch: main
Fork: no
Archived: no
README: 
TIPS / TIPSv2
This repository contains the implementation and models introduced in:
- TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment, CVPR 2026
- TIPS: Text-Image Pretraining with Spatial Awareness, ICLR 2025
The TIPS series of models (Text-Image Pretraining with Spatial Awareness) are foundational image-text encoders built for general-purpose computer vision and multimodal applications. Our models were validated on a comprehensive suite of 9 tasks and 20 datasets, displaying excellent performance that matches or exceeds other recent vision encoders, with particularly strong spatial awareness.
We recommend using the latest version, TIPSv2, but still provide the earlier TIPSv1 for completeness. For a more detailed overview, please visit the Project Webpage and check out the papers listed above.
See also our [demos and notebooks](#demos-and-notebooks) for a quick start.
Demos and notebooks
- Inference Colab in Pytorch
- Inference Colab in Jax

We also provide task-specific notebooks:
- Zero-shot segmentation (Pytorch)
- Train a linear head for foreground segmentation (Pytorch)
- Inference with DPT heads for segmentation, depth and normals (Pytorch)
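
To give a rough idea of what zero-shot segmentation with an image-text encoder involves, here is a minimal sketch, not the notebook's code. It assumes you already have per-patch image embeddings and one text embedding per class prompt in a shared space. The function names `l2_normalize` and `zero_shot_segment`, the shapes, and the random toy data are illustrative assumptions rather than part of this repository.

```python
import numpy as np


def l2_normalize(x: np.ndarray, axis: int = -1, eps: float = 1e-8) -> np.ndarray:
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def zero_shot_segment(patch_emb, text_emb, grid_hw):
    """Assign each patch to the class whose text embedding is most similar.

    patch_emb: (num_patches, dim) patch embeddings from an image encoder.
    text_emb:  (num_classes, dim) one embedding per class prompt.
    grid_hw:   (height, width) of the patch grid; height * width == num_patches.
    Returns an integer label map of shape (height, width).
    """
    h, w = grid_hw
    if patch_emb.shape[0] != h * w:
        raise ValueError("grid_hw does not match the number of patches")
    # Cosine similarity between every patch and every class prompt.
    sim = l2_normalize(patch_emb) @ l2_normalize(text_emb).T
    return sim.argmax(axis=1).reshape(h, w)


# Toy data standing in for real encoder outputs (hypothetical shapes).
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(3, 16))
# Patches are noisy copies of known class embeddings, so the result is checkable.
true_labels = rng.integers(0, 3, size=4 * 4)
patch_emb = text_emb[true_labels] + 0.01 * rng.normal(size=(16, 16))

label_map = zero_shot_segment(patch_emb, text_emb, grid_hw=(4, 4))
print(label_map)
```

With real models, `patch_emb` would come from the vision encoder's patch tokens and `text_emb` from the text encoder; the resulting low-resolution label map is typically upsampled to the image size.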
How to use
We provide both Pytorch and Jax (Scenic) implementations:
- tips/pytorch/: PyTorch inference for the model.
- tips/scenic/: Jax-based inference using the Scenic library.
We provide links to all available checkpoints, for both Pytorch and Jax model definitions, together with representative evals.
You can also find TIPSv2 models on HuggingFace here.
TIPSv2 models
| Model size | #Params vision / text | Pytorch ckp. | Jax ckp. | PASCAL seg.↑ | NYU-depth↓ | ImageNet-KNN↑ | Flickr I→T↑ | Flickr T→I↑ | ADE150-ZS↑ |
| :--------- | :-------------------- | :----------: | :------: | :---------: | :-------: | :----------: | :------: | :--------: | :--------: |
| g/14 | 1.1B / 389.1M | [vision][v2-pth-g14-vision] \| [text][v2-pth-g14-text] | [vision][v2-jax-g14-vision] \| [text][v2-jax-g14-text] | 85.1 | 0.334 | 83.7 | 95.1 | 85.9 | 17.8 |
| SO/14 | 412.4M / 448.3M | [vision][v2-pth-so14-vision] \| [text][v2-pth-so14-text] | [vision][v2-jax-so14-vision] \| [text][v2-jax-so14-text] | 85.2 | 0.339 | 82.8 | 94.8 | 84.0 | 23.3 |
| L/14 | 303.2M / 183.9M | [vision][v2-pth-l14-vision] \| [text][v2-pth-l14-text] | [vision][v2-jax-l14-vision] \| [text][v2-jax-l14-text] | 85.1 | 0.339 | 82.5 | 95.4 | 83.3 | 24.7 |
| B/14 | 85.7M / 109.6M | [vision][v2-pth-b14-vision] \| [text][v2-pth-b14-text] | [vision][v2-jax-b14-vision] \| [text][v2-jax-b14-text] | 84.0 | 0.374 | 79.8 | 92.6 | 80.0 | 17.4 |
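
For readers unfamiliar with the retrieval columns (Flickr I→T and T→I), the sketch below shows one common way to compute recall@k from paired embeddings. It is a simplified illustration under stated assumptions, not the evaluation code behind the numbers above: it assumes exactly one matching text per image at the same index (real Flickr protocols pair several captions with each image), and `recall_at_k` and the toy data are hypothetical.

```python
import numpy as np


def recall_at_k(query_emb: np.ndarray, gallery_emb: np.ndarray, k: int = 1) -> float:
    """Fraction of queries whose match (same row index) ranks in the top k."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sim = q @ g.T  # (num_queries, num_gallery) cosine similarities
    topk = np.argsort(-sim, axis=1)[:, :k]  # indices of the k most similar items
    targets = np.arange(sim.shape[0])[:, None]
    return float((topk == targets).any(axis=1).mean())


# Toy paired embeddings: each "text" is a slightly perturbed copy of its "image".
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(8, 32))
text_emb = image_emb + 0.05 * rng.normal(size=(8, 32))

print("I->T recall@1:", recall_at_k(image_emb, text_emb, k=1))
print("T->I recall@1:", recall_at_k(text_emb, image_emb, k=1))
```

Swapping the argument order switches between image-to-text and text-to-image retrieval, which is why the tables report the two directions separately.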
TIPSv1 models
| Model size | #Params vision / text | Pytorch ckp. | Jax ckp. | PASCAL seg.↑ | NYU-depth↓ | ImageNet-KNN↑ | UNED-KNN↑ | Flickr I→T↑ | Flickr T→I↑ |
| :--------- | :-------------------- | :----------: | :------: | :---------: | :-------: | :----------: | :------: | :--------: | :--------: |
| g/14-HR | 1.1B / 389.1M | [vision][v1-pth-g14-hr-vision] \| [text][v1-pth-g14-hr-text] | [vision][v1-jax-g14-hr-vision] \| [text][v1-jax-g14-hr-text] | 83.1 | 0.363 | 83.2 | 68.4 | 93.8 | 83.8 |
| g/14-LR | 1.1B / 389.1M | [vision][v1-pth-g14-lr-vision] \| [text][v1-pth-g14-lr-text] | [vision][v1-jax-g14-lr-vision] \| [text][v1-jax-g14-lr-text] | 82.0 | 0.390 | 83.6 | 71.5 | 93.4 | 82.1 |
| SO/14-HR | 412.4M / 448.3M | [vision][v1-pth-so14-hr-vision] \| [text][v1-pth-so14-hr-text] | [vision][v1-jax-so14-hr-vision] \| [text][v1-jax-so14-hr-text] | 83.7 | 0.362 | 83.0 | 68.6 | 94.2 | 83.8 |
| L/14-HR | 303.2M / 183.9M | [vision][v1-pth-l14-hr-vision] \| [text][v1-pth-l14-hr-text] | [vision][v1-jax-l14-hr-vision] \| [text][v1-jax-l14-hr-text] | 83.9 | 0.372 | 82.5 | 67.8 | 93.6 | 83.5 |
| B/14-HR | 85.7M / 109.6M | [vision][v1-pth-b14-hr-vision] \| [text][v1-pth-b14-hr-text] | [vision][v1-jax-b14-hr-vision] \| [text][v1-jax-b14-hr-text] | 82.9 | 0.379 | 80.0 | 62.7 | 91.3 | 79.4 |
| S/14-HR | 21.6M / 33.6M | [vision][v1-pth-s14-hr-vision] \| [text][v1-pth-s14-hr-text] | [vision][v1-jax-s14-hr-vision] \| [text][v1-jax-s14-hr-text] | 80.6 | 0.425 | 75.1 | 57.7 | 86.3 | 74.7 |
Local Installation
To install locally instead of using the Colabs/HF, please follow the instructions below.
Installation (Pytorch)
Manage dependencies with a custom environment (e.g., Conda).

    conda create -n tips python=3.11
    # Activate the environment.
    conda activate tips
Install Pytorch dependencies.

    # Install pytorch (change to GPU version if needed)
    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
    # Install other dependencies.
    pip install tensorflow_text mediapy jax jaxlib scikit-learn
    # Optionally, install Jupyter to use the notebook.
    pip install jupyter
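
After installing, it can be useful to confirm that the dependencies are visible in the active environment. The following is an optional sanity check using only the Python standard library; it is not part of the repository. The helper name `installed_versions` is hypothetical, and the package list mirrors the pip commands above (`tensorflow-text` is the distribution name for `tensorflow_text`).

```python
from importlib import metadata
from typing import Dict, List, Optional

# Distribution names matching the pip install commands above.
PACKAGES = [
    "torch",
    "torchvision",
    "tensorflow-text",
    "mediapy",
    "jax",
    "jaxlib",
    "scikit-learn",
]


def installed_versions(packages: List[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if missing."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` can be installed again with the corresponding pip command.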
Clone the code from this repo.

    git clone https://github.com/google-deepmind/tips.git
    # Add the…