Previous work has evaluated physics reasoning in foundation models using synthetic or semi-synthetic scenes and visual question-answering tasks. However, these benchmarks emphasize high-level events and lack the visual fidelity required to assess true low-level Newtonian understanding. We introduce NewtPhys, a 4D physically annotated dataset built from multiview images of real-world scenes with physics-grounded simulations. The dataset provides dense, fine-grained annotations across timesteps, including 3D forces and per-pixel physical quantities, bridging the gap between simplistic synthetic setups and realistic visual complexity. Using NewtPhys, we systematically evaluate 54 vision-language models (VLMs) and 10 vision foundation models (VFMs) and reveal limitations in low-level physics reasoning. Beyond benchmarking, our dataset enables future research in physics-grounded vision and the development of next-generation physics-aware evaluations.
NewtPhys combines real-world DL3DV scenes with Google Scanned Objects, both represented as 3D Gaussian Splats, and simulates their interactions with a customized Newtonian solver derived from Simplicits. This makes it possible to script realistic scenes with rigid and deformable objects while recording collisions, inertia, gravity, deformation, and material stress over time.
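To make the idea of recording physical quantities over time concrete, here is a toy sketch of a per-frame annotation loop. It is not the Simplicits-derived solver used for NewtPhys: it drops a single point mass under gravity with a semi-implicit Euler step and logs the gravity force and a collision flag at each frame. The names `FrameRecord` and `simulate_drop`, the restitution value, and the choice of one solver step per 25 FPS frame are all illustrative assumptions.

```python
from dataclasses import dataclass

GRAVITY = -9.81      # m/s^2 along the vertical axis
DT = 1.0 / 25.0      # assumption: one solver step per rendered frame at 25 FPS


@dataclass
class FrameRecord:
    """Hypothetical per-frame annotation for a single point mass."""
    time: float
    height: float
    velocity: float
    gravity_force: float
    collided: bool


def simulate_drop(mass, height, steps, restitution=0.5):
    """Drop a point mass onto the ground plane and log one record per frame."""
    velocity = 0.0
    records = []
    for i in range(steps):
        # Semi-implicit Euler: update velocity first, then position.
        velocity += GRAVITY * DT
        height += velocity * DT
        collided = height <= 0.0
        if collided:
            # Clamp to the ground and bounce with some energy loss.
            height = 0.0
            velocity = -velocity * restitution
        records.append(
            FrameRecord((i + 1) * DT, height, velocity, mass * GRAVITY, collided)
        )
    return records


records = simulate_drop(mass=2.0, height=1.0, steps=50)
first_hit = next(r for r in records if r.collided)
print(f"first collision at t={first_hit.time:.2f} s")
```

A real pipeline would additionally track deformation and material stress and project these quantities into pixel-aligned maps; the sketch only illustrates the record-per-timestep structure.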
The dataset uses 53 scenes and 109 objects, with 333 object trainings across physical-property variations. It contains 11K sequences rendered at 25 FPS, totaling 730K frames, and provides six dense per-pixel annotations: collisions, gravity force, material segmentation, instances, scene flow, and deformation gradient.
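As a quick sanity check on these totals, the short calculation below derives the average sequence length they imply. It uses only the rounded figures quoted above, so the results are approximate.

```python
# Rounded totals quoted in the text above.
NUM_SEQUENCES = 11_000
NUM_FRAMES = 730_000
FPS = 25

frames_per_sequence = NUM_FRAMES / NUM_SEQUENCES   # average frames per sequence
seconds_per_sequence = frames_per_sequence / FPS   # average duration in seconds
total_hours = NUM_FRAMES / FPS / 3600              # total rendered video in hours

print(f"{frames_per_sequence:.1f} frames, {seconds_per_sequence:.2f} s per sequence")
print(f"{total_hours:.1f} hours in total")
```

Under these rounded figures, a sequence averages roughly 66 frames, or about 2.7 seconds of video.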
Figure: Dataset construction pipeline, from multiview scenes and scanned objects to rendered sequences, pixel-aligned physical maps, VQA generation, and physics probing.
Figure: Additional statistics from the supplementary material highlighting sequence duration, collisions, visible object counts, physical properties, motion, visibility, and benchmark distributions. Panels: sequence characteristics; object physical properties; motion and visibility; dataset distributions.
The NewtPhys benchmark evaluates low-level Newtonian understanding through automatically generated visual question answering and pixel-level physics probing. VQA spans six axes of study: material understanding, mechanics, spatial reasoning, viewpoint, temporal reasoning, and permanence.
In total, the benchmark contains 141K VQA pairs, including 84K multi-frame and 57K single-frame samples. The study benchmarks 54 vision-language models and extends to 10 vision foundation models through frozen-encoder physics probing on pixel-aligned physical maps.
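A per-axis score of the kind reported for this benchmark can be computed as in the sketch below. The record schema (`axis`, `answer`, `prediction`) and the exact-match scoring rule are assumptions made for illustration; the released data format and the paper's scoring protocol may differ.

```python
from collections import defaultdict

# The six axes of study named in the text.
AXES = [
    "material understanding",
    "mechanics",
    "spatial reasoning",
    "viewpoint",
    "temporal reasoning",
    "permanence",
]


def accuracy_by_axis(records):
    """Exact-match accuracy per axis.

    `records` is an iterable of dicts with the hypothetical keys
    'axis', 'answer', and 'prediction'.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["axis"]] += 1
        correct[r["axis"]] += int(r["prediction"] == r["answer"])
    return {axis: correct[axis] / total[axis] for axis in total}


# Toy records for illustration only; these are not benchmark data.
toy = [
    {"axis": "mechanics", "answer": "B", "prediction": "A"},
    {"axis": "mechanics", "answer": "C", "prediction": "C"},
    {"axis": "viewpoint", "answer": "A", "prediction": "A"},
]
scores = accuracy_by_axis(toy)
print(scores)
```

Reporting one score per axis, rather than a single aggregate, is what allows weak categories such as mechanics to remain visible next to stronger ones.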
Example NewtPhys sequences: 5 objects colliding (crop); 4 soft objects; flexible pencil cases (crop); 5 objects interacting; 10 objects in free fall; multi-way interactions.
Across 54 VLMs, NewtPhys reveals a large gap between broad multimodal competence and low-level physical understanding. Models remain weak on mechanics and material-property estimation, and even strong families improve only modestly on the hardest Newtonian categories.
The paper further shows that performance on common multimodal benchmarks is only weakly aligned with low-level physics accuracy, which suggests that current notions of common sense do not adequately measure Newtonian understanding.
This section follows the main paper's analysis of how VLMs perform on fundamental physics. Apart from the strongest families, most models operate in a low-accuracy regime on NewtPhys, and the gap is most visible once a question requires estimating forces, collisions, or material properties rather than judging broad spatial plausibility.
Figure: Overall VQA performance of the largest model per family.
Figure: Category accuracy across model-rank groups.
The paper shows that material identification is noticeably easier than estimating density, Young's modulus, or Poisson's ratio, all of which require finer physical interpretation. Softer objects also tend to be easier for models because deformation offers stronger visual cues, while increasing scene complexity does not uniformly hurt every category.
Figure: Physics-specific breakdown across density, mass, material identification, Poisson's ratio, Young's modulus, collision, and kinematics.
Figure: Variation analyses for object softness and scene complexity.
Overall accuracy on NewtPhys correlates with the average performance across eight common-sense benchmarks, but this aggregate trend hides significant per-category discrepancies. Correlations are stronger for categories tied to visual attributes, such as spatial reasoning and viewpoint, while low-level physics categories show much weaker alignment.
In particular, material understanding and mechanics are only weakly correlated with common-sense benchmark performance. This suggests that current benchmark ecosystems still under-measure the skills required for Newtonian reasoning, such as estimating physical properties, forces, collisions, and motion.
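The kind of comparison described here can be sketched as a per-category correlation between each model's common-sense average and its NewtPhys category accuracy. The example below uses the Pearson coefficient, which is an assumption, since the text does not state which correlation measure the paper uses. All numbers are synthetic placeholders chosen to show one strongly aligned and one unaligned category; they are not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Synthetic placeholder scores for four hypothetical models (NOT paper results).
common_sense_avg = [40.0, 50.0, 60.0, 70.0]
category_accuracy = {
    "spatial reasoning": [30.0, 40.0, 50.0, 60.0],  # rises with common-sense score
    "mechanics": [25.0, 27.0, 24.0, 26.0],          # flat regardless of it
}

correlations = {
    name: pearson(common_sense_avg, acc) for name, acc in category_accuracy.items()
}
for name, r in correlations.items():
    print(f"{name}: r = {r:.2f}")
```

Computing the coefficient per category, rather than once on overall accuracy, is what exposes a case like the synthetic "mechanics" row, where category accuracy does not move with the common-sense average at all.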
Figure: Overall accuracy correlates with the eight-benchmark average, but category-level breakdowns reveal physics-specific gaps.
Figure: Per-benchmark correlations remain weak for low-level physics reasoning.
Figure: Per-category correlation plots (material understanding, mechanics, spatial reasoning, viewpoint, temporal reasoning, permanence). The category plots show why aggregate correlations are insufficient: common-sense scores track spatial and viewpoint questions more closely than they track mechanics and material-property reasoning.
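The paper also evaluates 10 vision foundation models through physics probing. Frozen visual encoders are paired with lightweight decoders to predict collision, gravity, Gravity-OOD, and scene flow signals from NewtPhys. According to the paper, self-supervised models generally perform best, MiDaS is the strongest fully supervised baseline, and DINO stands out on gravity prediction, but overall performance is still far from a robust physics-grounded representation. In the table below, higher is better for Collision F1 and lower is better for every error column. Two details are worth noting when reading it: self-supervised MAE is the weakest model on every column except scene flow, and CLIP attains the lowest gravity mAE in both the standard and OOD settings.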
| Family | Supervision | Model | Objective | Collision F1 | Gravity mAE | Gravity magE | Gravity-OOD mAE | Gravity-OOD magE | Scene Flow AEE |
|---|---|---|---|---|---|---|---|---|---|
| Vision | Fully-supervised | DeiT III | Classification | 48.47 | 19.34 | 21.44 | 15.20 | 35.43 | 1.29 |
| Vision | Fully-supervised | SAM | Segmentation | 54.80 | 20.73 | 17.04 | 16.28 | 33.98 | 0.94 |
| Vision | Fully-supervised | MiDaS | Depth | 54.95 | 12.12 | 15.06 | 8.23 | 33.80 | 0.95 |
| Vision | Self-supervised | MAE | SSL | 28.61 | 45.69 | 31.79 | 42.91 | 42.50 | 1.29 |
| Vision | Self-supervised | DINO | SSL | 56.54 | 13.79 | 14.92 | 9.30 | 33.33 | 0.94 |
| Vision | Self-supervised | DINOv2 | SSL | 56.52 | 14.95 | 14.76 | 11.02 | 33.81 | 0.95 |
| Vision | Agglomerative | AM-Radio | Distillation | 56.96 | 13.90 | 13.95 | 8.25 | 33.84 | 0.97 |
| Vision-Language | Vision-Language | CLIP | Image-Text Alignment | 53.85 | 11.30 | 14.65 | 7.62 | 34.35 | 0.98 |
| Vision-Language | Vision-Language | SigLIP | Image-Text Alignment | 40.91 | 41.87 | 27.52 | 40.19 | 38.54 | 1.27 |
| Vision-Language | Reconstruction | Stable Diffusion | Generation | 50.39 | 21.50 | 21.22 | 15.64 | 33.81 | 1.30 |
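Two of the column metrics in the table above have standard definitions that are easy to state in code: binary F1 for collision maps and average endpoint error (AEE) for scene flow. The sketch below implements them on flat Python lists. It assumes collision maps are flattened binary masks and flow is a list of 3D vectors; the paper's exact protocol (thresholds, masking, units, and the definitions of mAE and magE) is not given in this text and is not reproduced here. The F1 values in the table appear to be reported as percentages, whereas this function returns a fraction in [0, 1].

```python
import math


def binary_f1(pred, target):
    """F1 score between two flattened binary masks (sequences of 0/1)."""
    tp = sum(1 for p, t in zip(pred, target) if p and t)
    fp = sum(1 for p, t in zip(pred, target) if p and not t)
    fn = sum(1 for p, t in zip(pred, target) if not p and t)
    if tp == 0:
        return 0.0  # no true positives: F1 is zero by convention
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def average_endpoint_error(pred_flow, target_flow):
    """Mean Euclidean distance between predicted and target flow vectors."""
    errors = [math.dist(p, t) for p, t in zip(pred_flow, target_flow)]
    return sum(errors) / len(errors)


# Toy inputs for illustration only.
f1 = binary_f1([1, 1, 0, 0], [1, 0, 1, 0])
aee = average_endpoint_error(
    [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)],
    [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
)
print(f"F1 = {f1:.2f}, AEE = {aee:.2f}")
```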
@inproceedings{cavada2026newtphys,
title = {{NewtPhys}: Do Foundation Models Understand Newtonian Physics?},
author = {Sebastian Cavada and Soumava Paul and Tuan-Hung Vu and Andrei Bursuc and Raoul de Charette},
year = 2026,
booktitle = {arXiv},
url = {https://arxiv.org/abs/2606.03986}
}