google-deepmind/seeing_without_pixels
Language: Python
License: Apache-2.0
Stars: 6
Forks: 1
Open issues: 2
Created: 2026-06-03T04:04:52Z
Pushed: 2026-06-11T00:24:58Z
Default branch: main
Fork: no
Archived: no
README:
Seeing without Pixels: Perception from Camera Trajectories
Zihui Xue, Kristen Grauman, Dima Damen, Andrew Zisserman, Tengda Han
CVPR 2026
Project page | arXiv | [Data and checkpoints](DATA.md) | [Citation](#citation)
---
Can we understand a video from camera motion alone, without seeing any pixels?
Our paper shows that camera trajectories carry surprisingly rich semantic information. CamFormer takes only the camera's path through 3D space and learns to align it with natural-language descriptions, making it possible to retrieve what is happening in a video from motion alone.
---
This repository contains the released code for CamFormer pretraining and 5-way multiple-choice retrieval evaluation on:
- Ego-Exo4D, using egocentric Aria camera trajectories
- DynPose-100K, using exocentric camera trajectories
- Nymeria, for egocentric zero-shot transfer
The code operates on camera-pose trajectories only; no video frames are used.
Method
CamFormer is a four-layer Transformer over camera poses. It maps a pose sequence to a single motion embedding and is trained against the frozen CLIP text encoder with a contrastive trajectory-text loss. For egocentric data, the model can encode a longer temporal context around each labeled action and then pool only the labeled sub-window, which helps disambiguate short or visually sparse camera motions.
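The two ideas in this section, pooling only the labeled sub-window and aligning motion with text contrastively, can be illustrated with a small standard-library sketch. It is not the repository's implementation (which uses PyTorch): mean pooling, the symmetric CLIP-style cross-entropy, the temperature of 0.07, and all function names here are assumptions chosen for illustration.

```python
import math

def mean_pool_window(frame_feats, start, end):
    """Average per-frame features over the labeled sub-window [start, end)."""
    window = frame_feats[start:end]
    dim = len(window[0])
    return [sum(f[d] for f in window) / len(window) for d in range(dim)]

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the target for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def contrastive_loss(motion_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched (motion, text) pairs."""
    motion = [l2_normalize(v) for v in motion_embs]
    text = [l2_normalize(v) for v in text_embs]
    logits = [[sum(a * b for a, b in zip(m, t)) / temperature for t in text]
              for m in motion]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-motion direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

# Toy long context: 6 frames of 2-D features; the labeled action covers frames 2 and 3.
context = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
pooled = mean_pool_window(context, 2, 4)

# Matched pairs should give a lower loss than mismatched pairs.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In the toy example, the pooled vector reflects only the two labeled frames, even though the encoder would have seen all six.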
Setup
Create the conda environment:
conda env create -f environment.yml
conda activate camformer
Or install into an existing Python 3.9+ environment:
pip install -r requirements.txt
The CLIP text encoder is installed from GitHub, so git must be available. If the pinned PyTorch wheels do not match your CUDA version, install torch and torchvision from pytorch.org first, then install the rest of requirements.txt.
Data
The metadata CSVs are included in data_files/. Large derived artifacts are hosted separately:
- precomputed retrieval features, for reproducing results without a GPU
- pretrained CamFormer checkpoints
- camera-pose trajectory archives for training or checkpoint evaluation
See [DATA.md](DATA.md) for the download link, archive list, expected directory layout, and environment variables.
Fastest Reproduction
To reproduce retrieval numbers without downloading trajectories, download camformer_retrieval_features.zip from [DATA.md](DATA.md), unzip it in this repository, and run:
python eval_retrieval.py retrieval_features/egoexo4d
python eval_retrieval.py retrieval_features/dynpose_original
python eval_retrieval.py retrieval_features/dynpose_vipe
python eval_retrieval.py retrieval_features/nymeria_a
python eval_retrieval.py retrieval_features/nymeria_b
python eval_retrieval.py retrieval_features/nymeria_c
python eval_retrieval.py retrieval_features/nymeria_d
The main metric is Motion->Text MCQ acc.
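As a rough illustration of what a 5-way multiple-choice accuracy measures, the sketch below scores each motion embedding against five candidate text embeddings and counts how often the top-scoring candidate is the correct one. This is an assumed formulation, not the logic of eval_retrieval.py: the cosine scoring, the `mcq_accuracy` name, and the question tuple layout are all hypothetical. Chance level for five candidates is 20%.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mcq_accuracy(questions):
    """questions: list of (motion_emb, candidate_text_embs, correct_index)."""
    correct = 0
    for motion, candidates, answer in questions:
        scores = [cosine(motion, c) for c in candidates]
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == answer)
    return correct / len(questions)

# Five candidate texts shared by two toy questions (hypothetical 2-D embeddings).
candidates = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0], [0.7, 0.7]]
questions = [
    ([0.9, 0.1], candidates, 0),  # nearest candidate is index 0: counted correct
    ([0.1, 0.9], candidates, 4),  # nearest candidate is index 1, not 4: counted wrong
]
accuracy = mcq_accuracy(questions)
```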
Evaluation From Checkpoints
To run the model yourself, download camformer_checkpoints.zip and the pose archive for the dataset you want to evaluate. Set the environment variables described in [DATA.md](DATA.md), then run one of the commands below.
Ego-Exo4D, using Aria ground-truth poses:
python train.py --dataset egoexo4d_pretrain_longseq --test --scenario all \
    --pose_encoding rel9d_grav --take_duration 8 --sample_dur \
    --num_gpus 1 --batch_size 1000 \
    --init_ckpt checkpoints/egoexo4d_dur8.pt
DynPose-100K, using original dataset poses:
python train.py --dataset dynpose_pretrain --test \
    --pose_source original --pose_encoding rel9d \
    --num_gpus 1 --batch_size 1000 \
    --init_ckpt checkpoints/dynpose100k_original.pt
DynPose-100K, using ViPE-estimated poses:
python train.py --dataset dynpose_pretrain --test \
    --pose_source vipe --pose_encoding rel9d \
    --num_gpus 1 --batch_size 1000 \
    --init_ckpt checkpoints/dynpose100k_vipe.pt
Nymeria zero-shot transfer, using the Ego-Exo4D long-context checkpoint:
python train.py --dataset nymeria_pretrain --test \
    --text_column a --pose_encoding rel9d_grav \
    --num_gpus 1 --batch_size 1000 \
    --init_ckpt checkpoints/egoexo4d_dur16.pt
For Nymeria, --text_column selects the narration type:
- a: body posture
- b: hands and arms motion
- c: legs and feet motion
- d: focus of attention
Each --test run prints the directory where it saved frames.pt and text.pt. Pass that directory to eval_retrieval.py.
Pretraining
Training requires the corresponding pose archives from [DATA.md](DATA.md). The common settings are:
- --pose_encoding rel9d_grav: relative 9D pose plus gravity in camera coordinates, used for egocentric checkpoints
- --pose_encoding rel9d: relative 9D pose without gravity, used for DynPose-100K
- --take_duration: number of seconds of context around each labeled action
- --sample_dur: randomly vary the context duration during training
- --pose_source: DynPose-100K pose source, either original or vipe
- --use_pi3_pose: use Pi3-estimated Ego-Exo4D poses instead of Aria ground truth
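To make the rel9d_grav setting more concrete, here is one plausible way to build such features. The README does not spell out the layout, so everything below is an assumption: poses are taken relative to the first frame, the nine numbers are a 3D translation plus the 6D rotation representation (the first two columns of the rotation matrix), rotations are camera-to-world, and world gravity points along -z. The function name `rel9d_grav` simply mirrors the flag value and is not the repository's function.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(m, v):
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]

def rel9d_grav(rotations, translations, gravity_world=(0.0, 0.0, -1.0)):
    """Per-frame 12-D features: relative translation (3), rotation (6), gravity (3).

    rotations: camera-to-world 3x3 matrices (assumed convention).
    translations: camera centres in world coordinates.
    """
    r0_inv = transpose(rotations[0])  # the inverse of a rotation is its transpose
    t0 = translations[0]
    features = []
    for rot, trans in zip(rotations, translations):
        rel_rot = matmul(r0_inv, rot)
        rel_trans = matvec(r0_inv, [t - s for t, s in zip(trans, t0)])
        # 6D rotation: first two columns of the relative rotation, column by column.
        rot6d = [rel_rot[i][j] for j in range(2) for i in range(3)]
        # Gravity expressed in the current camera frame.
        grav_cam = matvec(transpose(rot), list(gravity_world))
        features.append(rel_trans + rot6d + grav_cam)
    return features

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
yaw_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # 90 degrees about z
feats = rel9d_grav([identity, yaw_90], [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
```

Under these assumptions, dropping the last three values of each feature vector would give the gravity-free rel9d variant.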
Ego-Exo4D, Aria ground-truth poses:
python train.py --dataset egoexo4d_pretrain_longseq \
    --pose_encoding rel9d_grav --take_duration 8 --sample_dur
Ego-Exo4D, Pi3-estimated poses:
python train.py --dataset egoexo4d_pretrain_longseq \
    --use_pi3_pose --pose_encoding rel9d_grav \
    --take_duration 8 --sample_dur
DynPose-100K, original poses:
python train.py --dataset dynpose_pretrain \
    --pose_source original --pose_encoding rel9d
DynPose-100K, ViPE-estimated poses:
python train.py --dataset dynpose_pretrain \
    --pose_source vipe --pose_encoding rel9d
Training writes logs under ~/data/logs///. W&B logging is enabled by default; run wandb login first, or set WANDB_MODE=offline if you want local-only runs.
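If you launch training from a Python wrapper rather than a shell, the offline setting can be passed through the child process environment. The sketch below is only an illustration: `offline_wandb_env` is a hypothetical helper, and the command is copied from the Ego-Exo4D (Aria poses) pretraining example above.

```python
import os

def offline_wandb_env(base_env=None):
    """Return a copy of the environment with W&B switched to offline mode."""
    env = dict(os.environ if base_env is None else base_env)
    env["WANDB_MODE"] = "offline"
    return env

# Pretraining command from the Ego-Exo4D (Aria ground-truth poses) example.
command = [
    "python", "train.py",
    "--dataset", "egoexo4d_pretrain_longseq",
    "--pose_encoding", "rel9d_grav",
    "--take_duration", "8",
    "--sample_dur",
]

env = offline_wandb_env({"PATH": "/usr/bin"})  # toy base environment for illustration

# To actually launch training, run this from the repository root:
# import subprocess
# subprocess.run(command, env=offline_wandb_env(), check=True)
```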
Data Preparation
The released metadata and pose archives are enough for training and evaluation.…