Previous work has evaluated physics reasoning in foundation models using synthetic or semi-synthetic scenes and visual question-answering tasks. However, these benchmarks emphasize high-level events and lack the visual fidelity required to assess true low-level Newtonian understanding. We introduce NewtPhys, a 4D physically annotated dataset built from multiview images of real-world scenes with physics-grounded simulations. The dataset provides dense, fine-grained annotations across timesteps, including 3D forces and per-pixel physical quantities, bridging the gap between simplistic synthetic setups and realistic visual complexity. Using NewtPhys, we systematically evaluate 54 VLMs and 10 VFMs and reveal limitations in low-level physics reasoning. Beyond benchmarking, our dataset enables future research in physics-grounded vision and the development of next-generation physics-aware evaluations.
NewtPhys combines real-world DL3DV scenes with Google Scanned Objects, both represented as 3D Gaussian Splats, and simulates their interactions with a customized Newtonian solver derived from Simplicits. This makes it possible to script realistic scenes with rigid and deformable objects while recording collisions, inertia, gravity, deformation, and material stress over time.
The dataset uses 53 scenes and 109 objects, with 333 object trainings across physical-property variations. It contains 11K sequences rendered at 25 FPS, totaling 730K frames, and provides six dense per-pixel annotations: collisions, gravity force, material segmentation, instances, scene flow, and deformation gradient.
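As a rough sanity check on these figures, the short sketch below derives the average sequence length implied by the rounded counts quoted above. The inputs are the rounded values from the text (11K sequences, 730K frames, 25 FPS), so the outputs are approximations, not exact dataset statistics.

```python
# Rounded figures quoted in the text; the exact counts may differ slightly.
NUM_SEQUENCES = 11_000
NUM_FRAMES = 730_000
FPS = 25

# Average number of frames in one rendered sequence.
frames_per_sequence = NUM_FRAMES / NUM_SEQUENCES

# Average duration of one sequence, in seconds.
seconds_per_sequence = frames_per_sequence / FPS

# Total rendered footage, in hours.
total_hours = NUM_FRAMES / FPS / 3600

print(f"{frames_per_sequence:.1f} frames per sequence on average")
print(f"{seconds_per_sequence:.2f} s per sequence on average")
print(f"{total_hours:.1f} h of rendered footage in total")
```

Under these rounded inputs, a typical sequence is a few seconds long, so each clip covers a short physical event and not an extended interaction.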
Dataset construction from multiview scenes and scanned objects to rendered sequences, pixel-aligned physical maps, VQA generation, and physics probing.
Additional statistics from the supplementary material, shown in four panels (sequence characteristics, object physical properties, motion and visibility, and dataset distributions), highlighting sequence duration, collisions, visible object counts, physical properties, motion, visibility, and benchmark distributions.
The NewtPhys benchmark evaluates low-level Newtonian understanding through automatically generated visual question answering and pixel-level physics probing. VQA spans six axes of study: material understanding, mechanics, spatial reasoning, viewpoint, temporal reasoning, and permanence.
In total, the benchmark contains 141K VQA pairs, including 84K multi-frame and 57K single-frame samples. The study benchmarks 54 vision-language models and extends to 10 vision foundation models through frozen-encoder physics probing on pixel-aligned physical maps.
Example NewtPhys sequences: 5 objects colliding (crop); 4 soft objects; flexible pencil cases (crop); 5 objects interacting; 10 objects in free fall; multi-way interactions.
Across 54 VLMs, NewtPhys reveals a large gap between broad multimodal competence and low-level physical understanding. Models remain weak on mechanics and material-property estimation, and even strong families improve only modestly on the hardest Newtonian categories.
The paper further shows that performance on common multimodal benchmarks is only weakly aligned with low-level physics accuracy, which suggests that current notions of common sense do not adequately measure Newtonian understanding.
The first block of results follows the main paper's analysis of how VLMs perform on fundamental physics. Apart from the strongest families, most models operate in a low-accuracy regime on NewtPhys, and the gap is widest when reasoning requires force, collision, or material-property estimation rather than broad spatial plausibility.
Overall VQA performance of the largest model per family.
Category accuracy across model-rank groups.
The paper shows that material identification is noticeably easier than estimating density, Young's modulus, or Poisson ratio, which require finer physical interpretation. Softer objects also tend to be easier for models because deformation offers stronger visual cues, while increasing scene complexity does not uniformly hurt every category.
Physics-specific breakdown across density, mass, material identification, Poisson ratio, Young's modulus, collision, and kinematics.
Variation analyses for object softness and scene complexity.
The correlation analysis mirrors the discussion section of the paper: standard multimodal common-sense benchmarks correlate with NewtPhys only partially, and especially weakly for the low-level physics categories. This is one of the central claims of the work, namely that current benchmark ecosystems do not adequately measure Newtonian understanding.
Average common-sense correlation still underestimates low-level physics gaps.
Per-benchmark correlations stay weak for low-level physics reasoning.
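To make the idea of benchmark correlation concrete, here is a minimal sketch of a rank correlation between per-model scores on a general multimodal benchmark and on one NewtPhys category. Spearman correlation is an assumption here, since the text above does not state which coefficient the paper uses, and the score lists are hypothetical placeholders, not results from the paper.

```python
from math import sqrt


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical per-model accuracies (placeholders, NOT values from the paper).
general_benchmark = [41.0, 55.5, 62.0, 70.5, 78.0]
newtphys_mechanics = [30.0, 28.5, 33.0, 31.0, 34.5]

rho = spearman(general_benchmark, newtphys_mechanics)
print(f"Spearman rho = {rho:.2f}")
```

A rank-based coefficient only asks whether models are ordered the same way on both benchmarks, which makes it a natural fit when the two benchmarks have very different accuracy ranges.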
The paper also evaluates 10 vision foundation models through physics probing. Frozen visual encoders are paired with lightweight decoders to predict collision, gravity, Gravity-OOD, and scene flow signals from NewtPhys. Self-supervised models generally perform best, MiDaS is the strongest fully supervised baseline, and DINO stands out on gravity prediction, but overall performance is still far from a robust physics-grounded representation.
| Family | Supervision | Model | Objective | Collision F1 | Gravity mAE | Gravity magE | Gravity-OOD mAE | Gravity-OOD magE | Scene Flow AEE |
|---|---|---|---|---|---|---|---|---|---|
| Vision | Fully-supervised | DeiT III | Classification | 48.47 | 19.34 | 21.44 | 15.20 | 35.43 | 1.29 |
| Vision | Fully-supervised | SAM | Segmentation | 54.80 | 20.73 | 17.04 | 16.28 | 33.98 | 0.94 |
| Vision | Fully-supervised | MiDaS | Depth | 54.95 | 12.12 | 15.06 | 8.23 | 33.80 | 0.95 |
| Vision | Self-supervised | MAE | SSL | 28.61 | 45.69 | 31.79 | 42.91 | 42.50 | 1.29 |
| Vision | Self-supervised | DINO | SSL | 56.54 | 13.79 | 14.92 | 9.30 | 33.33 | 0.94 |
| Vision | Self-supervised | DINOv2 | SSL | 56.52 | 14.95 | 14.76 | 11.02 | 33.81 | 0.95 |
| Vision | Agglomerative | AM-Radio | Distillation | 56.96 | 13.90 | 13.95 | 8.25 | 33.84 | 0.97 |
| Vision-Language | Vision-Language | CLIP | Image-Text Alignment | 53.85 | 11.30 | 14.65 | 7.62 | 34.35 | 0.98 |
| Vision-Language | Vision-Language | SigLIP | Image-Text Alignment | 40.91 | 41.87 | 27.52 | 40.19 | 38.54 | 1.27 |
| Vision-Language | Reconstruction | Stable Diffusion | Generation | 50.39 | 21.50 | 21.22 | 15.64 | 33.81 | 1.30 |
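For readers unfamiliar with two of the column headers, the sketch below shows the standard definitions of F1 on binary masks and of average endpoint error (AEE) on flow vectors. It is an illustration under assumptions, not the paper's evaluation code: the function names, the flattened-list input format, and the toy inputs are hypothetical, and the table appears to report F1 as a percentage while this sketch returns a fraction. The gravity metrics (mAE, magE) are left out because their exact definitions are not given in the text above.

```python
from math import sqrt


def f1_score(pred_mask, target_mask):
    """F1 over flattened binary masks, where 1 marks a collision pixel."""
    pairs = list(zip(pred_mask, target_mask))
    tp = sum(1 for p, t in pairs if p and t)        # true positives
    fp = sum(1 for p, t in pairs if p and not t)    # false positives
    fn = sum(1 for p, t in pairs if not p and t)    # false negatives
    denom = 2 * tp + fp + fn
    # Convention assumed here: return 0.0 when there are no positives at all.
    return 2 * tp / denom if denom else 0.0


def average_endpoint_error(pred_flow, target_flow):
    """Mean Euclidean distance between predicted and reference flow vectors."""
    errors = [
        sqrt(sum((a - b) ** 2 for a, b in zip(p, t)))
        for p, t in zip(pred_flow, target_flow)
    ]
    return sum(errors) / len(errors)


# Toy inputs for illustration only (not data from NewtPhys).
pred_mask = [1, 1, 0, 0, 1]
target_mask = [1, 0, 0, 1, 1]
pred_flow = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
target_flow = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]

print(f"F1  = {f1_score(pred_mask, target_mask):.3f}")
print(f"AEE = {average_endpoint_error(pred_flow, target_flow):.3f}")
```

Higher is better for F1 and lower is better for AEE, which is the reading assumed when comparing rows of the table.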
@misc{cavada2026newtphys,
title = {NewtPhys: Do Foundation Models Understand Newtonian Physics?},
author = {Sebastian Cavada and Soumava Paul and Tuan-Hung Vu and Andrei Bursuc and Raoul de Charette},
year = {2026}
}