🍎 NewtPhys: Do Foundation Models Understand Newtonian Physics?

Sebastian Cavada³,*, Soumava Paul¹, Tuan-Hung Vu¹,², Andrei Bursuc¹,², Raoul de Charette¹,*

¹Inria, ²Valeo.ai, ³MBZUAI

*These authors contributed equally


NewtPhys is a 4D dataset with force-wise physical annotations, comprising 11K sequences (730K frames). The benchmark supports the evaluation of both VLMs and VFMs, with pixel-wise annotations and a large corpus of 141K VQA pairs.


Left: exemplar scenes with selected overlays such as real-world velocity, visibility, collisions, and forces. Right: large-scale benchmarking of 54 VLMs showing that despite reasonable performance on commonsense tasks, current models still struggle to capture Newtonian physics.


Abstract

Previous work has evaluated physics reasoning in foundation models using synthetic or semi-synthetic scenes and visual question-answering tasks. However, these benchmarks emphasize high-level events and lack the visual fidelity required to assess true low-level Newtonian understanding. We introduce NewtPhys, a 4D physically annotated dataset built from multiview images of real-world scenes with physics-grounded simulations. The dataset provides dense, fine-grained annotations across timesteps, including 3D forces and per-pixel physical quantities, bridging the gap between simplistic synthetic setups and realistic visual complexity. Using NewtPhys, we systematically evaluate 54 VLMs and 10 VFMs and reveal limitations in low-level physics reasoning. Beyond benchmarking, our dataset enables future research in physics-grounded vision and the development of next-generation physics-aware evaluations.

Dataset

NewtPhys combines real-world DL3DV scenes with Google Scanned Objects, both represented as 3D Gaussian Splats, and simulates their interactions with a customized Newtonian solver derived from Simplicits. This makes it possible to script realistic scenes with rigid and deformable objects while recording collisions, inertia, gravity, deformation, and material stress over time.
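To make the idea of recording physical quantities over time concrete, here is a toy sketch of a single point mass dropped onto the ground, with per-frame logging of gravity force and collision events. It is not the Simplicits-derived solver used for NewtPhys: the integrator (semi-implicit Euler), the substep count, the restitution coefficient, and all names are assumptions chosen for illustration, and it ignores deformation and material stress entirely.

```python
GRAVITY = -9.81    # m/s^2 along the vertical axis
FPS = 25           # matches the rendering rate quoted for the dataset
SUBSTEPS = 10      # assumption: integrate at a finer step than the frame rate
RESTITUTION = 0.5  # assumption: fraction of speed kept after a bounce

def simulate(mass=0.2, height=1.0, num_frames=50):
    """Drop a point mass and log per-frame state, gravity force, and collisions."""
    dt = 1.0 / (FPS * SUBSTEPS)
    y, vy = height, 0.0
    log = []
    for frame in range(num_frames):
        collided = False
        for _ in range(SUBSTEPS):
            vy += GRAVITY * dt   # semi-implicit Euler: update velocity first
            y += vy * dt         # then position, using the new velocity
            if y < 0.0:          # ground contact at y = 0
                y, vy, collided = 0.0, -vy * RESTITUTION, True
        log.append({
            "frame": frame,
            "y": y,
            "vy": vy,
            "gravity_force": mass * GRAVITY,  # constant for a fixed mass
            "collision": collided,
        })
    return log

log = simulate()
first_hit = next(entry["frame"] for entry in log if entry["collision"])
print(f"first ground contact in frame {first_hit}")
```

Each log entry plays the role of a per-frame annotation; in the real dataset these quantities are dense, per-pixel maps rather than one scalar per object.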

The dataset uses 53 scenes and 109 objects, with 333 object trainings across physical-property variations. It contains 11K sequences rendered at 25 FPS, totaling 730K frames, and provides six dense per-pixel annotations: collisions, gravity force, material segmentation, instances, scene flow, and deformation gradient.
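As a quick sanity check on these figures, the arithmetic below derives the average sequence length implied by the quoted totals. The inputs are the rounded numbers from the paragraph above, so the outputs are approximate.

```python
# Rounded figures quoted in the dataset description.
NUM_SEQUENCES = 11_000
NUM_FRAMES = 730_000
FPS = 25

frames_per_sequence = NUM_FRAMES / NUM_SEQUENCES
seconds_per_sequence = frames_per_sequence / FPS
total_hours = NUM_FRAMES / FPS / 3600

print(f"~{frames_per_sequence:.0f} frames per sequence")
print(f"~{seconds_per_sequence:.1f} s per sequence")
print(f"~{total_hours:.1f} h of video in total")
```

In other words, a typical sequence lasts a few seconds, which is consistent with short physical events such as falls and collisions.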

Dataset construction from multiview scenes and scanned objects to rendered sequences, pixel-aligned physical maps, VQA generation, and physics probing.

Supplementary Statistics

Additional statistics from the supplementary material, in four panels (sequence characteristics, object physical properties, motion and visibility, dataset distributions), highlighting sequence duration, collisions, visible object counts, physical properties, motion, visibility, and benchmark distributions.

Benchmark

The NewtPhys benchmark evaluates low-level Newtonian understanding through automatically generated visual question answering and pixel-level physics probing. VQA spans six axes of study: material understanding, mechanics, spatial reasoning, viewpoint, temporal reasoning, and permanence.

In total, the benchmark contains 141K VQA pairs, including 84K multi-frame and 57K single-frame samples. The study benchmarks 54 vision-language models and extends to 10 vision foundation models through frozen-encoder physics probing on pixel-aligned physical maps.
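A per-axis accuracy breakdown of the kind reported in the results could be computed along the following lines. This is a minimal sketch: the record layout, the field names, and the assumption that answers are short option labels compared by exact match are all hypothetical, since the page does not specify the VQA format or the scoring rule.

```python
from collections import defaultdict

# Hypothetical records: field names and answer format are illustrative only.
records = [
    {"category": "mechanics", "answer": "B", "prediction": "B"},
    {"category": "mechanics", "answer": "A", "prediction": "C"},
    {"category": "material understanding", "answer": "D", "prediction": "D"},
    {"category": "permanence", "answer": "A", "prediction": "a "},
]

def is_correct(record):
    """Exact match after trimming whitespace and ignoring case."""
    return record["prediction"].strip().upper() == record["answer"].strip().upper()

def accuracy_by_category(records):
    """Return {category: fraction of correct answers}."""
    hits, totals = defaultdict(int), defaultdict(int)
    for record in records:
        totals[record["category"]] += 1
        hits[record["category"]] += is_correct(record)
    return {category: hits[category] / totals[category] for category in totals}

per_category = accuracy_by_category(records)
overall = sum(is_correct(r) for r in records) / len(records)
print(per_category, overall)
```

Reporting both the overall score and the per-category scores matters here, because the aggregate can hide weak categories such as mechanics.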


Qualitative Samples

Panels from the qualitative figure, NewtPhys (ours): 5 objects colliding (crop); 4 soft objects; flexible pencil cases (crop); 5 objects interacting; 10 objects in free fall; multi-way interactions.


Results

Across 54 VLMs, NewtPhys reveals a large gap between broad multimodal competence and low-level physical understanding. Models remain weak on mechanics and material-property estimation, and even strong families improve only modestly on the hardest Newtonian categories.

The paper further shows that performance on common multimodal benchmarks is only weakly aligned with low-level physics accuracy, which suggests that current notions of common sense do not adequately measure Newtonian understanding.

VLM Benchmarking

The first result block follows the main paper analysis of how VLMs perform on fundamental physics. With the exception of the strongest families, most models operate in a low-accuracy regime on NewtPhys, and the gap is most visible once reasoning requires force, collision, or material-property estimation rather than broad spatial plausibility.

Overall VQA performance across model families.

Overall VQA performance of the largest model per family.

Category accuracy across model-rank groups.

Physics Categories

The paper shows that material identification is noticeably easier than estimating density, Young's modulus, or Poisson ratio, which require finer physical interpretation. Softer objects also tend to be easier for models because deformation offers stronger visual cues, while increasing scene complexity does not uniformly hurt every category.

Physics sub-category performance across models.

Physics-specific breakdown across density, mass, material identification, Poisson ratio, Young's modulus, collision, and kinematics.

Material understanding under Young's modulus variations.

Performance trends with increasing number of objects.

Variation analyses for object softness and scene complexity.

Common-Sense Correlation

Overall accuracy on NewtPhys correlates with the average performance across eight common-sense benchmarks, but this aggregate trend hides significant per-category discrepancies. Correlations are stronger for categories tied to visual attributes, such as spatial reasoning and viewpoint, while low-level physics categories show much weaker alignment.

In particular, material understanding and mechanics are only weakly correlated with common-sense benchmark performance. This suggests that current benchmark ecosystems still under-measure the skills required for Newtonian reasoning, such as estimating physical properties, forces, collisions, and motion.
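The contrast between strongly and weakly aligned categories can be illustrated with a small correlation computation. The sketch below uses the Pearson coefficient as an assumption (the page does not say which correlation measure the paper uses), and the accuracies are made-up numbers for five hypothetical models, not results from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up accuracies (%) for five hypothetical models; NOT results from the paper.
common_sense_avg = [40, 50, 60, 70, 80]
newtphys_by_category = {
    "spatial reasoning": [30, 38, 45, 55, 62],  # rises with common-sense score
    "mechanics": [27, 24, 29, 25, 28],          # stays flat regardless of it
}

correlations = {
    category: pearson(common_sense_avg, scores)
    for category, scores in newtphys_by_category.items()
}
for category, r in correlations.items():
    print(f"{category}: r = {r:.2f}")
```

With data shaped like this, a model's common-sense score predicts its spatial-reasoning accuracy well but says little about its mechanics accuracy, which is the pattern the text describes.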

Overall common-sense correlation with NewtPhys performance.

Overall accuracy correlates with the eight-benchmark average, but category-level breakdowns reveal physics-specific gaps.

Per-benchmark correlations remain weak for low-level physics reasoning.

Category-Level Correlations

Panels: material understanding, mechanics, spatial reasoning, viewpoint, temporal reasoning, permanence.

The category plots show why aggregate correlations are insufficient: common-sense scores track spatial and viewpoint questions more closely than they track mechanics and material-property reasoning.

VFM Physics Probing

The paper also evaluates 10 vision foundation models through physics probing. Frozen visual encoders are paired with lightweight decoders to predict collision, gravity, Gravity-OOD, and scene flow signals from NewtPhys. Self-supervised models generally perform best, MiDaS is the strongest fully supervised baseline, and DINO stands out on gravity prediction, but overall performance is still far from a robust physics-grounded representation.
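The probing setup can be sketched in miniature as follows. This toy example swaps in a fixed random projection for the frozen encoder and a single linear layer for the lightweight decoder, and it regresses one scalar instead of a per-pixel map; none of this reflects the actual encoders, decoders, or training recipe used in the paper. The point it illustrates is only the division of labour: encoder weights are never updated, and only the small head is trained.

```python
import math
import random

random.seed(0)
IN_DIM, FEAT_DIM, N = 4, 8, 200

# Stand-in for a frozen encoder: a fixed random projection followed by tanh.
W_ENC = [[random.gauss(0, 1) for _ in range(IN_DIM)] for _ in range(FEAT_DIM)]

def encode(x):
    """Frozen features; W_ENC is never updated during probing."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W_ENC]

# Toy data: the target stands in for a physical quantity tied to the input.
inputs = [[random.uniform(-1, 1) for _ in range(IN_DIM)] for _ in range(N)]
targets = [x[0] - 0.5 * x[1] for x in inputs]
features = [encode(x) for x in inputs]

def predict(weights, bias, feat):
    return sum(w * f for w, f in zip(weights, feat)) + bias

def mse(weights, bias):
    return sum((predict(weights, bias, f) - t) ** 2
               for f, t in zip(features, targets)) / N

# Lightweight head: one linear layer trained with plain gradient descent.
weights, bias, lr = [0.0] * FEAT_DIM, 0.0, 0.05
initial_loss = mse(weights, bias)
for _ in range(300):
    grad_w, grad_b = [0.0] * FEAT_DIM, 0.0
    for feat, t in zip(features, targets):
        err = predict(weights, bias, feat) - t
        grad_b += 2 * err / N
        for j, f in enumerate(feat):
            grad_w[j] += 2 * err * f / N
    weights = [w - lr * g for w, g in zip(weights, grad_w)]
    bias -= lr * grad_b
final_loss = mse(weights, bias)

print(f"probe MSE: {initial_loss:.3f} -> {final_loss:.3f}")
```

How low the probe loss can go is then read as a measure of how much of the target quantity is already recoverable from the frozen features.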

Family	Supervision	Model	Objective	Collision F1	Gravity mAE	Gravity magE	Gravity-OOD mAE	Gravity-OOD magE	Scene Flow AEE
Vision	Fully-supervised	DeiT III	Classification	48.47	19.34	21.44	15.20	35.43	1.29
Vision	Fully-supervised	SAM	Segmentation	54.80	20.73	17.04	16.28	33.98	0.94
Vision	Fully-supervised	MiDaS	Depth	54.95	12.12	15.06	8.23	33.80	0.95
Vision	Self-supervised	MAE	SSL	28.61	45.69	31.79	42.91	42.50	1.29
Vision	Self-supervised	DINO	SSL	56.54	13.79	14.92	9.30	33.33	0.94
Vision	Self-supervised	DINOv2	SSL	56.52	14.95	14.76	11.02	33.81	0.95
Vision	Agglomerative	AM-Radio	Distillation	56.96	13.90	13.95	8.25	33.84	0.97
Vision-Language	Vision-Language	CLIP	Image-Text Alignment	53.85	11.30	14.65	7.62	34.35	0.98
Vision-Language	Vision-Language	SigLIP	Image-Text Alignment	40.91	41.87	27.52	40.19	38.54	1.27
Vision-Language	Reconstruction	Stable Diffusion	Generation	50.39	21.50	21.22	15.64	33.81	1.30
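The table above can also be read programmatically to find the best model per metric. The values below are copied from the table; the assumption that Collision F1 is higher-is-better while every other column is an error (lower-is-better) follows from the metric names rather than from an explicit statement on this page.

```python
# Rows copied from the probing table:
# (model, collision F1, gravity mAE, gravity magE,
#  gravity-OOD mAE, gravity-OOD magE, scene flow AEE)
ROWS = [
    ("DeiT III", 48.47, 19.34, 21.44, 15.20, 35.43, 1.29),
    ("SAM", 54.80, 20.73, 17.04, 16.28, 33.98, 0.94),
    ("MiDaS", 54.95, 12.12, 15.06, 8.23, 33.80, 0.95),
    ("MAE", 28.61, 45.69, 31.79, 42.91, 42.50, 1.29),
    ("DINO", 56.54, 13.79, 14.92, 9.30, 33.33, 0.94),
    ("DINOv2", 56.52, 14.95, 14.76, 11.02, 33.81, 0.95),
    ("AM-Radio", 56.96, 13.90, 13.95, 8.25, 33.84, 0.97),
    ("CLIP", 53.85, 11.30, 14.65, 7.62, 34.35, 0.98),
    ("SigLIP", 40.91, 41.87, 27.52, 40.19, 38.54, 1.27),
    ("Stable Diffusion", 50.39, 21.50, 21.22, 15.64, 33.81, 1.30),
]
METRICS = [
    "Collision F1", "Gravity mAE", "Gravity magE",
    "Gravity-OOD mAE", "Gravity-OOD magE", "Scene Flow AEE",
]
# Assumption: F1 is a score (higher is better); the rest are errors.
HIGHER_IS_BETTER = {"Collision F1"}

def best_models(metric):
    """Return every model tied for the best value on the given metric."""
    column = METRICS.index(metric) + 1
    values = [row[column] for row in ROWS]
    best = max(values) if metric in HIGHER_IS_BETTER else min(values)
    return [row[0] for row in ROWS if row[column] == best]

best = {metric: best_models(metric) for metric in METRICS}
for metric, models in best.items():
    print(f"{metric}: {', '.join(models)}")
```

Returning a list rather than a single name keeps ties visible, which matters for columns where several models share the same rounded value.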

BibTeX

@inproceedings{cavada2026newtphys,
  title        = {{NewtPhys}: Do Foundation Models Understand Newtonian Physics?},
  author       = {Sebastian Cavada and Soumava Paul and Tuan-Hung Vu and Andrei Bursuc and Raoul de Charette},
  year         = 2026,
  booktitle    = {arXiv},
  url          = {https://arxiv.org/abs/2606.03986}
}