Reka AI, published Jun 9, 2026


PhysicalRealismBench-U: Attributable Physical Realism Evaluation for Video World Models


Jun 9, 2026


Intelligence is not only linguistic but also visual and physical. LLMs are becoming a mature technology, used successfully across digital domains such as email editing, text summarization, and coding, yet their multimodal extensions still lack a visual and physical understanding of the world. They can recite complex physical laws in formal language, but they do not fully grasp object permanence, motion, or how objects collide. Today, we release PhysicalRealismBench-U, a physical realism benchmark built on a synthetic dataset of programmatic physics violations, along with a pipeline for evaluating state-of-the-art VLMs on physics understanding. We show that even the best existing models fail at fundamental physical reasoning tasks that young children solve easily. These findings are especially critical for the fast-emerging space of Physical General Intelligence, or World Models.

The Problem: Intuitive Physics

Neither a cat gracefully catching a bird nor a basketball player skilfully shooting into the basket needs to write equations of motion to perform the task. Instead, they understand the laws of physics intuitively. They “know” how objects interact with each other and how they fall, thanks to a combination of evolution and lifelong learning.

The same abilities are needed for physical general intelligence. An autonomous driving system that doesn't respect object permanence across occlusions will make catastrophic planning errors. A robot that fails to conserve support relations will take dangerous actions. Yet existing evaluation approaches fall short of catching those failures, as they often focus on linguistic skills or on a generic understanding of concepts in images or videos.

Those shortcomings are becoming increasingly important as VLMs are often re-purposed to serve as the “robotics brain”, used as an evaluator in world model benchmarks such as VBench-2.0, WorldModelBench, or PAI-Bench, or used as a reward function (VLM-RMs, RL-VLM-F, ERL-VLM, etc.). However, we show that current VLMs may skip frames, rely on spatial heuristics (see Insight 2: Border Proximity Triggers False Reasoning), and miss fundamental violations. Those findings have important ramifications: a VLM with these blind spots, used to evaluate world models, can produce a false sense of progress.

The question is not whether current models sometimes get physics wrong; they do. The question is how systematically they fail, and whether the field has the tools to measure and diagnose these failures precisely enough to drive improvement. Our findings suggest that it does not: even state-of-the-art VLMs fail to detect basic violations like objects vanishing or moving without cause, and existing benchmarks lack the attribution machinery needed to turn these failures into actionable insights.

A bare violated / not-violated verdict can be correct by chance, so it cannot distinguish a model that perceived the violation from one that guessed. Only when the model must also name the offending object and the frame range does a correct answer become evidence of understanding, and that is exactly what existing benchmarks omit. This motivates both PhysicalRealismBench-U and a broader call for the community to prioritise physical realism as a first-class evaluation citizen.
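The gap between a bare verdict and an attributed one can be made concrete with a small sketch. This is not the benchmark's actual scorer. The `Label` type, the function names, the object names, and the temporal IoU threshold of 0.5 are all assumptions introduced for illustration; the post does not specify how predicted frame ranges are matched against ground truth. Only the four law families come from the pipeline figure.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Optional, Tuple

# The four law families named in the pipeline figure.
LAWS = ("conservation_of_mass", "gravity", "impenetrability", "conservation_of_momentum")


@dataclass(frozen=True)
class Label:
    """Hypothetical container for a ground-truth label or a model answer."""
    violated: bool
    law: Optional[str] = None
    obj: Optional[str] = None
    frames: Optional[Tuple[int, int]] = None  # inclusive (start, end)


def temporal_iou(a: Tuple[int, int], b: Tuple[int, int]) -> float:
    """Intersection over union of two inclusive frame ranges."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union


def verdict_correct(pred: Label, truth: Label) -> bool:
    """Binary check: only the violated / not-violated flag is compared."""
    return pred.violated == truth.violated


def attribution_correct(pred: Label, truth: Label, iou_threshold: float = 0.5) -> bool:
    """Joint check: verdict, law, object and frame range must all match.

    The IoU threshold is an assumed matching rule, not taken from the post.
    """
    if not verdict_correct(pred, truth):
        return False
    if not truth.violated:
        return True  # nothing to attribute in a violation-free video
    if pred.frames is None or truth.frames is None:
        return False
    return (
        pred.law == truth.law
        and pred.obj == truth.obj
        and temporal_iou(pred.frames, truth.frames) >= iou_threshold
    )


def chance_of_lucky_attribution(n_laws: int, n_objects: int) -> Fraction:
    """Probability that a uniform guesser gets verdict, law and object right
    on a violated video. Frame matching would lower this further."""
    return Fraction(1, 2) * Fraction(1, n_laws) * Fraction(1, n_objects)


truth = Label(True, "conservation_of_mass", "red_cube", (40, 60))
lucky_guess = Label(True, "gravity", "blue_ball", (0, 10))
grounded = Label(True, "conservation_of_mass", "red_cube", (42, 60))

print(verdict_correct(lucky_guess, truth), attribution_correct(lucky_guess, truth))
print(verdict_correct(grounded, truth), attribution_correct(grounded, truth))
print(chance_of_lucky_attribution(len(LAWS), n_objects=3))
```

In this toy setup the lucky guess passes the binary check but fails the joint one, while the grounded answer passes both. A uniform guesser gets the verdict right half of the time, yet with four laws and three candidate objects it gets verdict, law, and object right together only 1 time in 24, before frame matching is even applied.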

Video: synthetic ground truth
Physics Laws: Template Library (Conservation of Mass, Gravity, Impenetrability, Conservation of Momentum)
Scene-Specific Q&A: object · time span evidence · law tag
VLM Judge: Violation Yes/No + Reasoning
Attributable Diagnosis: which law, which object, which frames


[Fig. 1: Evaluation pipeline diagram: Video → Physics Law Templates → Scene-Specific Q&A → VLM Judge + CV metrics → Attributable Diagnosis]

How does our work extend existing benchmarks

Our work is inspired by recently proposed benchmarks targeting physical realism (Physion-Eval, PhysBench, PAI-Bench, WorldModelBench, VideoPhy-2). We complement them with synthetic data (from 3D rendered videos) that enables attributable and diagnosable results: ground-truth labels are computed programmatically and precisely, and each video exhibits at most one violation type, avoiding confounding factors.

For the evaluation, we score the model on jointly identifying the violation type, the violating frames, and the violating object. This prevents potential shortcuts by the evaluated model and provides a strong certificate that the model has perceived the physical violation, both spatially and temporally.

Comparison Table (Understanding Benchmarks)

Each cell reflects what the benchmark provides regarding VLM evaluation; dataset annotations that aren't used in a quantitative metric are not counted.

Tab. 1: Comparison of PhysicalRealismBench-U with existing physical-understanding benchmarks, restricted to each benchmark's quantitatively scored VLM evaluation. Annotations not consumed by a metric are not counted, since they fall outside what the benchmark actually evaluates. Physion-Eval, for instance, uses an automatic VLM evaluation that does not contain law/entity/time-span labels; its evaluation of VLMs against expert law/object/time attribution annotations remains qualitative rather than metric-based for object and time attribution, and does not evaluate law attribution.

  • Per the table's scope (quantitatively scored VLM evaluation): Physion-Eval provides law attribution annotations and reasoning (which includes time attribution and object reference in free form), but...
