What does this repo signal mean?

Google (DeepMind / Gemini) published google-deepmind/seeing_without_pixels (Python). A repository signal like this can surface tooling, evaluation, infrastructure, or model-adjacent work before it appears in a launch post. High-signal details: repo google-deepmind/seeing_without_pixels · language Python · Teaching AI vision through 3D geometry rather than pixel grids. onlylabs links this event to 1 captured evidence page and 6 related repo signals.

Google (DeepMind / Gemini) Repo: google-deepmind/seeing_without_pixels

Captured source

GitHub · github.com/google-deepmind/seeing_without_pixels

google-deepmind/seeing_without_pixels repository metadata

published Jun 3, 2026 · seen Jun 5 · captured Jun 11 · http 200 · method plain

google-deepmind/seeing_without_pixels

Language: Python

License: Apache-2.0

Stars: 6

Forks: 1

Open issues: 2

Created: 2026-06-03T04:04:52Z

Pushed: 2026-06-11T00:24:58Z

Default branch: main

Fork: no

Archived: no

README:

Seeing without Pixels: Perception from Camera Trajectories

Zihui Xue, Kristen Grauman, Dima Damen, Andrew Zisserman, Tengda Han

CVPR 2026

Project page | arXiv | [Data and checkpoints](DATA.md) | [Citation](#citation)

---

Can we understand a video from camera motion alone, without seeing any pixels?

Our paper shows that camera trajectories carry surprisingly rich semantic information. CamFormer takes only the camera's path through 3D space and learns to align it with natural-language descriptions, making it possible to retrieve what is happening in a video from motion alone.

---

This repository contains the released code for CamFormer pretraining and 5-way multiple-choice retrieval evaluation on:

- Ego-Exo4D, using egocentric Aria camera trajectories
- DynPose-100K, using exocentric camera trajectories
- Nymeria, for egocentric zero-shot transfer

The code operates on camera-pose trajectories only; no video frames are used.

Method

CamFormer is a four-layer Transformer over camera poses. It maps a pose sequence to a single motion embedding and trains it against the frozen CLIP text encoder with a contrastive trajectory-text loss. For egocentric data, the model can encode a longer temporal context around each labeled action, then pool only the labeled sub-window. This helps disambiguate short or visually sparse camera motions.
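As a rough illustration of the two ideas in this paragraph, the sketch below pools per-frame embeddings over a labeled sub-window and scores a batch of motion/text pairs with a CLIP-style symmetric contrastive loss. It is a minimal pure-Python sketch, not the repository's implementation: mean pooling, the symmetric form of the loss, the temperature of 0.07, and the names `pool_subwindow` and `contrastive_loss` are all assumptions made for this example.

```python
import math

def l2_normalize(vec):
    # Scale a vector to unit length (zero vectors are left unchanged).
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def pool_subwindow(frame_embeddings, start, end):
    # Assumption: mean-pool only the frames in the labeled window [start, end),
    # even though the encoder saw the longer surrounding context.
    window = frame_embeddings[start:end]
    dim = len(window[0])
    return [sum(frame[d] for frame in window) / len(window) for d in range(dim)]

def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]

def contrastive_loss(motion, text, temperature=0.07):
    # Assumption: symmetric InfoNCE where motion[i] and text[i] are a matching pair.
    motion = [l2_normalize(m) for m in motion]
    text = [l2_normalize(t) for t in text]
    n = len(motion)
    logits = [[sum(a * b for a, b in zip(m, t)) / temperature for t in text]
              for m in motion]
    motion_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    columns = [list(col) for col in zip(*logits)]
    text_to_motion = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (motion_to_text + text_to_motion)
```

In this sketch the text embeddings would come from the frozen text encoder and stay fixed, so only the trajectory side receives gradient in a real training loop.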

Setup

Create the conda environment:

conda env create -f environment.yml
conda activate camformer

Or install into an existing Python 3.9+ environment:

pip install -r requirements.txt

The CLIP text encoder is installed from GitHub, so git must be available. If the pinned PyTorch wheels do not match your CUDA version, install torch and torchvision from pytorch.org first, then install the rest of requirements.txt.
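Before running pip, it can help to confirm the two prerequisites named above: a Python 3.9+ interpreter and git on the PATH. The check below is a small standalone sketch using only the standard library; the function name `preflight` is hypothetical and not part of the repository.

```python
import shutil
import sys

def preflight(min_version=(3, 9)):
    # Return a list of problems; an empty list means the basics are in place.
    problems = []
    if sys.version_info < min_version:
        found = ".".join(str(part) for part in sys.version_info[:3])
        wanted = ".".join(str(part) for part in min_version)
        problems.append(f"Python {wanted}+ required, found {found}")
    if shutil.which("git") is None:
        problems.append("git not found on PATH (needed to install CLIP from GitHub)")
    return problems

# Example: print any problems before installing requirements.txt.
for problem in preflight():
    print("preflight:", problem)
```

This does not check the PyTorch/CUDA pairing, which still has to be resolved by hand as described above.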

Data

The metadata CSVs are included in data_files/. Large derived artifacts are hosted separately:

- precomputed retrieval features, for reproducing results without a GPU
- pretrained CamFormer checkpoints
- camera-pose trajectory archives for training or checkpoint evaluation

See [DATA.md](DATA.md) for the download link, archive list, expected directory layout, and environment variables.

Fastest Reproduction

To reproduce retrieval numbers without downloading trajectories, download camformer_retrieval_features.zip from [DATA.md](DATA.md), unzip it in this repository, and run:

python eval_retrieval.py retrieval_features/egoexo4d
python eval_retrieval.py retrieval_features/dynpose_original
python eval_retrieval.py retrieval_features/dynpose_vipe
python eval_retrieval.py retrieval_features/nymeria_a
python eval_retrieval.py retrieval_features/nymeria_b
python eval_retrieval.py retrieval_features/nymeria_c
python eval_retrieval.py retrieval_features/nymeria_d
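The seven commands above differ only in the feature directory, so they can be driven from a short loop. The sketch below builds the same command lines; the directory names are taken from the commands above, while the helper names `eval_commands` and `run_all` are hypothetical. It assumes the script is run from the repository root after unzipping the features.

```python
import subprocess
import sys

FEATURE_SETS = [
    "egoexo4d",
    "dynpose_original",
    "dynpose_vipe",
    "nymeria_a",
    "nymeria_b",
    "nymeria_c",
    "nymeria_d",
]

def eval_commands(root="retrieval_features", python=sys.executable):
    # One eval_retrieval.py invocation per precomputed feature directory.
    return [[python, "eval_retrieval.py", f"{root}/{name}"] for name in FEATURE_SETS]

def run_all():
    # Run the evaluations in order and stop at the first failure.
    for command in eval_commands():
        subprocess.run(command, check=True)
```

Calling `run_all()` is then equivalent to typing the seven commands one after another.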

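The main metric is Motion->Text MCQ acc: the accuracy of choosing the correct text description for a trajectory in the 5-way multiple-choice retrieval task.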
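To make the metric concrete, here is one way a motion-to-text multiple-choice accuracy could be computed: each question pairs one motion embedding with several candidate text embeddings, and the prediction is the candidate with the highest cosine similarity. This is an illustrative sketch only; the scoring inside eval_retrieval.py is not shown in this excerpt, and the function names and the `(motion, candidates, answer)` tuple layout are assumptions.

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors (0.0 if either has zero length).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def mcq_accuracy(questions):
    # questions: iterable of (motion_vec, candidate_text_vecs, correct_index).
    questions = list(questions)
    correct = 0
    for motion, candidates, answer in questions:
        scores = [cosine(motion, candidate) for candidate in candidates]
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == answer)
    return correct / len(questions)
```

With five candidates per question, random guessing would score about 0.2 under this definition, which gives a baseline for reading the reported numbers.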

Evaluation From Checkpoints

To run the model yourself, download camformer_checkpoints.zip and the pose archive for the dataset you want to evaluate. Set the environment variables described in [DATA.md](DATA.md), then run one of the commands below.

Ego-Exo4D, using Aria ground-truth poses:

python train.py --dataset egoexo4d_pretrain_longseq --test --scenario all \
--pose_encoding rel9d_grav --take_duration 8 --sample_dur \
--num_gpus 1 --batch_size 1000 \
--init_ckpt checkpoints/egoexo4d_dur8.pt

DynPose-100K, using original dataset poses:

python train.py --dataset dynpose_pretrain --test \
--pose_source original --pose_encoding rel9d \
--num_gpus 1 --batch_size 1000 \
--init_ckpt checkpoints/dynpose100k_original.pt

DynPose-100K, using ViPE-estimated poses:

python train.py --dataset dynpose_pretrain --test \
--pose_source vipe --pose_encoding rel9d \
--num_gpus 1 --batch_size 1000 \
--init_ckpt checkpoints/dynpose100k_vipe.pt

Nymeria zero-shot transfer, using the Ego-Exo4D long-context checkpoint:

python train.py --dataset nymeria_pretrain --test \
--text_column a --pose_encoding rel9d_grav \
--num_gpus 1 --batch_size 1000 \
--init_ckpt checkpoints/egoexo4d_dur16.pt

For Nymeria, --text_column selects the narration type:

- a: body posture
- b: hands and arms motion
- c: legs and feet motion
- d: focus of attention

Each --test run prints the directory where it saved frames.pt and text.pt. Pass that directory to eval_retrieval.py.
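A quick sanity check on the printed directory can catch a mistyped path before evaluation. The helper below is a hypothetical sketch, not part of the repository; it only verifies that the two files named above exist.

```python
from pathlib import Path

REQUIRED_FILES = ("frames.pt", "text.pt")

def missing_feature_files(directory):
    # Return the names of required feature files that are absent from `directory`.
    directory = Path(directory)
    return [name for name in REQUIRED_FILES if not (directory / name).is_file()]
```

An empty return value means the directory is ready to pass to eval_retrieval.py.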

Pretraining

Training requires the corresponding pose archives from [DATA.md](DATA.md). The common settings are:

- --pose_encoding rel9d_grav: relative 9D pose plus gravity in camera coordinates, used for egocentric checkpoints
- --pose_encoding rel9d: relative 9D pose without gravity, used for DynPose-100K
- --take_duration: number of seconds of context around each labeled action
- --sample_dur: randomly vary the context duration during training
- --pose_source: DynPose-100K pose source, either original or vipe
- --use_pi3_pose: use Pi3-estimated Ego-Exo4D poses instead of Aria ground truth
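One possible reading of how --take_duration and --sample_dur interact is sketched below: a context window is placed around the labeled action, and with sampling enabled its length is drawn at random between the action length and the full duration. Everything in this sketch is an assumption for illustration (centering the window on the action, uniform sampling, clamping at time zero, and the function name `context_window`); the repository's actual sampling logic is not shown in this excerpt.

```python
import random

def context_window(action_start, action_end, take_duration, sample_dur=False, rng=random):
    # Return (start, end) in seconds of a context window around a labeled action.
    action_len = action_end - action_start
    duration = max(take_duration, action_len)  # never shorter than the action itself
    if sample_dur:
        # Assumption: draw the context length uniformly during training.
        duration = rng.uniform(action_len, duration)
    centre = 0.5 * (action_start + action_end)
    start = max(0.0, centre - duration / 2)  # clamp at the start of the recording
    return start, start + duration

# A 2-second action with 8 seconds of context, without random sampling.
print(context_window(10.0, 12.0, take_duration=8))
```

Passing a seeded `random.Random` as `rng` keeps the sampled windows reproducible in this sketch.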

Ego-Exo4D, Aria ground-truth poses:

python train.py --dataset egoexo4d_pretrain_longseq \
--pose_encoding rel9d_grav --take_duration 8 --sample_dur

Ego-Exo4D, Pi3-estimated poses:

python train.py --dataset egoexo4d_pretrain_longseq \
--use_pi3_pose --pose_encoding rel9d_grav \
--take_duration 8 --sample_dur

DynPose-100K, original poses:

python train.py --dataset dynpose_pretrain \
--pose_source original --pose_encoding rel9d

DynPose-100K, ViPE-estimated poses:

python train.py --dataset dynpose_pretrain \
--pose_source vipe --pose_encoding rel9d

Training writes logs under ~/data/logs///. W&B logging is enabled by default; run wandb login first, or set WANDB_MODE=offline if you want local-only runs.
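When launching training from a script rather than a shell, the offline setting can be passed through the child process environment. The sketch below copies the current environment and sets WANDB_MODE=offline; the helper name `offline_env` is hypothetical, and the training arguments in the usage comment are copied from the DynPose-100K command above.

```python
import os

def offline_env(base=None):
    # Copy an environment mapping and force W&B into local-only mode.
    env = dict(os.environ if base is None else base)
    env["WANDB_MODE"] = "offline"
    return env

# Usage (not executed here):
#   import subprocess, sys
#   subprocess.run(
#       [sys.executable, "train.py", "--dataset", "dynpose_pretrain",
#        "--pose_source", "original", "--pose_encoding", "rel9d"],
#       env=offline_env(), check=True)
```

Copying the environment rather than mutating `os.environ` keeps the setting scoped to the one training run.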

Data Preparation

The released metadata and pose archives are enough for training and evaluation....

Excerpt shown — open the source for the full document.

Notability

notability 3.0/10

New repo from DeepMind, low traction.