meituan-longcat/LARYBench
Language: Python
License: MIT
Stars: 150
Forks: 8
Open issues: 3
Created: 2026-04-09T03:09:15Z
Pushed: 2026-06-10T08:18:09Z
Default branch: main
Fork: no
Archived: no
README:
LARY — A Latent Action Representation Yielding Benchmark for Generalizable Vision-to-Action Alignment
LARY is a unified evaluation framework for latent action representations. Given any model that produces latent action representations (LAMs or visual encoders), LARY provides three complementary evaluation pipelines:
| Pipeline | Task |
|---|---|
| `get_latent_action` | Extract latent action representations from videos or image pairs |
| `classification` | Probe how well latent actions capture *action semantics* (action-type recognition) |
| `regression` | Probe how well latent actions can *decode physical robot actions* (action regression) |
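To make the idea of probing concrete, here is a minimal sketch of a linear probe on frozen latent vectors. It is not LARYBench's implementation: the helper names (`fit_ridge_probe`, `apply_probe`), the closed-form ridge solver, and the synthetic data are all assumptions chosen for illustration. The same probe is used once to regress continuous actions and once to classify action types through one-hot targets.

```python
import numpy as np

def fit_ridge_probe(latents, targets, l2=1e-3):
    """Fit a linear probe with closed-form ridge regression (hypothetical helper)."""
    # Append a constant column so the probe can learn an offset.
    x = np.hstack([latents, np.ones((latents.shape[0], 1))])
    gram = x.T @ x + l2 * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ targets)

def apply_probe(weights, latents):
    """Apply a fitted probe to new latent vectors."""
    x = np.hstack([latents, np.ones((latents.shape[0], 1))])
    return x @ weights

rng = np.random.default_rng(0)

# Synthetic stand-ins for real data: 512 samples of 32-dim latents,
# 7-dim continuous actions, and 5 action classes.
latents = rng.normal(size=(512, 32))
actions = latents @ rng.normal(size=(32, 7)) + 0.01 * rng.normal(size=(512, 7))
labels = np.argmax(latents[:, :5], axis=1)

train, test = slice(0, 384), slice(384, 512)

# Regression probe: decode continuous actions and report held-out MSE.
w_reg = fit_ridge_probe(latents[train], actions[train])
mse = float(np.mean((apply_probe(w_reg, latents[test]) - actions[test]) ** 2))

# Classification probe: regress onto one-hot labels, predict by argmax.
one_hot = np.eye(5)[labels]
w_cls = fit_ridge_probe(latents[train], one_hot[train])
pred = np.argmax(apply_probe(w_cls, latents[test]), axis=1)
accuracy = float(np.mean(pred == labels[test]))

print(f"regression MSE: {mse:.5f}, classification accuracy: {accuracy:.3f}")
```

Because the encoder stays frozen and only the probe is fitted, the held-out scores reflect what the latent representation already contains rather than what a downstream policy could learn on top of it.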
---
News
- [2026-06-10] LARYBench now supports V-JEPA 2.1 and simplifies the process of adding new custom models. We welcome evaluations of all kinds of models on LARYBench and contributions to our leaderboards!
- [2026-05-01] LARYBench now supports SigLIP2, relative-action regression evaluation (`target = action_tgt - action_src`), and a fast dataset integrity checker. Happy Labor Day!
- [2026-04-27] We have open-sourced all datasets on HuggingFace.
- [2026-04-21] We release the general LAMs trained in ablation studies, LAPA-DINOv3 and LAPA-DINOv2. Even though these models are still rough experimental prototypes, with clear flaws in both training data and methods, we’re sharing them anyway to help push latent action research forward together. Have fun~
- [2026-04-15] We release only part of the training datasets because of license restrictions.
- [2026-04-13] We release the code, text annotations, and partial validation datasets. Training datasets are coming soon.
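The relative-action target named in the 2026-05-01 entry is the element-wise difference between the target-frame action and the source-frame action. A minimal sketch follows; the function name `relative_action_target` and the example vectors are hypothetical and are not taken from the LARYBench code.

```python
from typing import List, Sequence

def relative_action_target(action_src: Sequence[float],
                           action_tgt: Sequence[float]) -> List[float]:
    """Return action_tgt - action_src, element by element (hypothetical helper)."""
    if len(action_src) != len(action_tgt):
        raise ValueError("source and target actions must have the same dimension")
    return [t - s for s, t in zip(action_src, action_tgt)]

# Hypothetical 3-D end-effector positions at the source and target frames.
action_src = [0.5, 0.25, 1.0]
action_tgt = [0.75, 0.25, 0.5]

target = relative_action_target(action_src, action_tgt)
print(target)  # [0.25, 0.0, -0.5]
```

Regressing the difference rather than the absolute target action removes the shared offset between the two frames, so the probe has to explain the change between them.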
Release Checklist
- [x] Code
- [x] Text annotations link
- [x] Partial Validation datasets
- [x] Partial Training datasets
- [x] Full datasets
---
Table of Contents
1. [Overview](#overview)
2. [Contributions](#contributions)
3. [Environment Setup](#environment-setup)
4. [Data Preparation](#data-preparation)
5. [Quick Start](#quick-start)
6. [Relative-Action Regression](#relative-action-regression)
7. [Supported Models](#supported-models)
8. [Adding a Custom Model](#adding-a-custom-model)
9. [Supported Datasets](#supported-datasets)
10. [Evaluation Outputs](#evaluation-outputs)
---
Overview
While the shortage of explicit action data limits Vision-Language-Action (VLA) models, human action videos offer a scalable yet unlabeled data source. A critical challenge in utilizing large-scale human video datasets lies in transforming visual signals into ontology-independent representations, known as latent actions. However, the capacity of latent action representation to derive robust control from visual observations has yet to be rigorously evaluated.
We introduce the Latent Action Representation Yielding (LARY) Benchmark, a unified framework for evaluating latent action representations on both high-level semantic actions (*what to do*) and low-level robotic control (*how to do it*). The comprehensively curated dataset encompasses over one million videos (1,000 hours) spanning 151 action categories, alongside 620K image pairs and 595K motion trajectories across diverse embodiments and environments. Our experiments reveal two crucial insights: (i) General visual foundation models, trained without any action supervision, consistently outperform specialized embodied LAMs. (ii) Latent-based visual space is fundamentally better aligned to physical action space than pixel-based space. These results suggest that general visual representations inherently encode action-relevant knowledge for physical control, and that semantic-level abstraction serves as a fundamentally more effective pathway from vision to action than pixel-level reconstruction.
Contributions
- LARYBench: We introduce LARYBench, a comprehensive benchmark that is the first to decouple the evaluation of latent action representations from downstream policy performance. LARYBench probes representations along two complementary dimensions, high-level semantic action encoding (*what to do*) and the low-level physical dynamics required for robotic control (*how to do it*), enabling direct, standardized measurement of representation quality itself.
- Large-Scale Data Engine: To support rigorous evaluation, we develop an automated data engine to re-segment and re-annotate a large-scale corpus, yielding 1.2M videos, 620K image pairs, and 595K trajectories across 151 action categories and 11 robotic embodiments, covering both human and robotic agents from egocentric and exocentric perspectives in simulated and real-world environments.
- Key Findings: Through systematic evaluation of 11 models, we reveal two consistent findings: (i) action-relevant features can emerge from large-scale visual pre-training without explicit action supervision, and (ii) latent-based feature spaces tend to align with robotic control better than pixel-based ones. These results suggest that future VLA systems may benefit more from leveraging general visual representations than from learning action spaces solely on scarce robotic data.
---
Environment Setup
Use larybench as the base environment.
    conda create -n larybench python=3.10 -y
    conda activate larybench
    pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
    pip install -r requirements.txt
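After installation, a quick sanity check can confirm that the interpreter and the core packages are visible from the active environment. The snippet below is only a sketch: the helper `check_environment` is hypothetical, and treating Python 3.10 as a minimum version is an assumption based on the `conda create` command above, not a documented requirement.

```python
import importlib.util
import sys

def check_environment(packages=("torch", "torchvision"), min_python=(3, 10)):
    """Report whether the interpreter and packages look as expected (hypothetical helper)."""
    # Assumption: Python 3.10 or newer is acceptable.
    report = {"python_ok": sys.version_info[:2] >= min_python}
    for name in packages:
        # find_spec returns None when a top-level package cannot be located.
        report[name] = importlib.util.find_spec(name) is not None
    return report

report = check_environment()
for key, ok in report.items():
    print(f"{key}: {'ok' if ok else 'missing'}")
```

The check only locates packages without importing them, so it runs quickly and does not initialize CUDA.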
Some model families keep their original dependencies and should be configured from their upstream projects when you evaluate them:
| Model family | Environment guidance |
|---|---|
| dinov2, dinov3, siglip2, dinov2-origin, dinov3-origin, siglip2-origin, lapa, magvit2, univla, flux2, wan2-2 | Use larybench |
| vjepa2, vjepa2.1 | Follow facebookresearch/vjepa2 and activate your vjepa2 env |
| villa-x | Follow microsoft/villa-x and set VILLA_X_DIR |
…