🍎 NewtPhys: Do Foundation Models Understand Newtonian Physics?

*These authors contributed equally

NewtPhys is a 4D, force-wise physically annotated dataset of 11K sequences (730K frames). Our benchmark provides a basis for VLM and VFM evaluation, with pixel-wise annotations and a large corpus of 141K VQA pairs.

NewtPhys teaser collage showing physically annotated real-world scenes.
Legend for the teaser overlays.
Common-sense benchmark comparison from the NewtPhys teaser.
Highlighted NewtPhys model comparison from the teaser.

Left: exemplar scenes with selected overlays such as real-world velocity, visibility, collisions, and forces. Right: large-scale benchmarking of 54 VLMs showing that despite reasonable performance on commonsense tasks, current models still struggle to capture Newtonian physics.

Abstract

Previous work has evaluated physics reasoning in foundation models using synthetic or semi-synthetic scenes and visual question-answering tasks. However, these benchmarks emphasize high-level events and lack the visual fidelity required to assess true low-level Newtonian understanding. We introduce NewtPhys, a 4D physically annotated dataset built from multiview images of real-world scenes with physics-grounded simulations. The dataset provides dense, fine-grained annotations across timesteps, including 3D forces and per-pixel physical quantities, bridging the gap between simplistic synthetic setups and realistic visual complexity. Using NewtPhys, we systematically evaluate 54 VLMs and 10 VFMs and reveal limitations in low-level physics reasoning. Beyond benchmarking, our dataset enables future research in physics-grounded vision and the development of next-generation physics-aware evaluations.

Dataset

NewtPhys combines real-world DL3DV scenes with Google Scanned Objects, both represented as 3D Gaussian Splats, and simulates their interactions with a customized Newtonian solver derived from Simplicits. This makes it possible to script realistic scenes with rigid and deformable objects while recording collisions, inertia, gravity, deformation, and material stress over time.

The dataset uses 53 scenes and 109 objects, with 333 object trainings across physical-property variations. It contains 11K sequences rendered at 25 FPS, totaling 730K frames, and provides six dense per-pixel annotations: collisions, gravity force, material segmentation, instances, scene flow, and deformation gradient.

NewtPhys dataset construction pipeline.

Dataset construction from multiview scenes and scanned objects to rendered sequences, pixel-aligned physical maps, VQA generation, and physics probing.
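
As a quick sanity check on the figures above, the average sequence length can be estimated from the rounded totals. This is only a back-of-the-envelope sketch: the counts are the rounded values quoted in the text (11K sequences, 730K frames, 25 FPS), not exact dataset statistics, and the function name is illustrative.

```python
# Rounded figures quoted in the dataset description (not exact counts).
NUM_SEQUENCES = 11_000
NUM_FRAMES = 730_000
FPS = 25


def average_sequence_stats(num_frames: int, num_sequences: int, fps: int) -> tuple[float, float]:
    """Return (average frames per sequence, average duration in seconds)."""
    frames_per_sequence = num_frames / num_sequences
    return frames_per_sequence, frames_per_sequence / fps


frames_per_seq, seconds_per_seq = average_sequence_stats(NUM_FRAMES, NUM_SEQUENCES, FPS)
print(f"~{frames_per_seq:.0f} frames per sequence, ~{seconds_per_seq:.1f} s at {FPS} FPS")
```

Under these rounded totals, a sequence averages roughly 66 frames, or a little under three seconds of video.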

Supplementary Statistics

Sequence characteristics

Sequence duration statistics.
Collision statistics.
Visible objects statistics.

Object physical properties

Young's modulus statistics.
Object volume statistics.
Object mass statistics.

Motion and visibility

Object motion statistics.
Camera motion statistics.
Field-of-view visibility statistics.

Dataset distributions

Question category distribution.
Material distribution pie chart.

Additional statistics from the supplementary material highlighting sequence duration, collisions, visible object counts, physical properties, motion, visibility, and benchmark distributions.

Benchmark

The NewtPhys benchmark evaluates low-level Newtonian understanding through automatically generated visual question answering and pixel-level physics probing. VQA spans six axes of study: material understanding, mechanics, spatial reasoning, viewpoint, temporal reasoning, and permanence.

In total, the benchmark contains 141K VQA pairs, including 84K multi-frame and 57K single-frame samples. The study benchmarks 54 vision-language models and extends to 10 vision foundation models through frozen-encoder physics probing on pixel-aligned physical maps.
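
A minimal sketch of how per-axis VQA accuracy could be tallied is given below. The record layout (`axis`, `answer`, `prediction` keys) and the exact-match scoring after case normalization are assumptions made for illustration; the source does not specify the answer format or the scoring protocol.

```python
from collections import defaultdict

# The six axes of study named in the benchmark description.
AXES = (
    "material understanding",
    "mechanics",
    "spatial reasoning",
    "viewpoint",
    "temporal reasoning",
    "permanence",
)


def accuracy_by_axis(records):
    """Compute exact-match accuracy per axis.

    Each record is a dict with hypothetical keys "axis", "answer", and
    "prediction"; answers are compared case-insensitively after stripping.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        axis = record["axis"]
        total[axis] += 1
        is_match = record["prediction"].strip().lower() == record["answer"].strip().lower()
        correct[axis] += int(is_match)
    return {axis: correct[axis] / total[axis] for axis in total}


# Toy records for illustration only; these are not benchmark data.
toy_records = [
    {"axis": "mechanics", "answer": "B", "prediction": "b"},
    {"axis": "mechanics", "answer": "A", "prediction": "C"},
    {"axis": "permanence", "answer": "D", "prediction": "D"},
]
scores = accuracy_by_axis(toy_records)
print(scores)
```

Reporting accuracy per axis rather than as a single average keeps weak categories such as mechanics from being masked by easier ones.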

Qualitative Samples

Qualitative samples from NewtPhys (ours): 5 objects colliding (crop), 4 soft objects, flexible pencil cases (crop), 5 objects interacting, 10 objects in free fall, and multi-way interactions.

Results

Across 54 VLMs, NewtPhys reveals a large gap between broad multimodal competence and low-level physical understanding. Models remain weak on mechanics and material-property estimation, and even strong families improve only modestly on the hardest Newtonian categories.

The paper further shows that performance on common multimodal benchmarks is only weakly aligned with low-level physics accuracy, which suggests that current notions of common sense do not adequately measure Newtonian understanding.

VLM Benchmarking

This first block of results follows the main paper's analysis of how VLMs perform on fundamental physics. Apart from the strongest families, most models operate in a low-accuracy regime on NewtPhys, and the gap is widest when reasoning requires force, collision, or material-property estimation rather than broad spatial plausibility.

Overall VQA performance across model families.

Overall VQA performance of the largest model per family.

Category performance by model rank.

Category accuracy across model-rank groups.

Physics Categories

The paper shows that material identification is noticeably easier than estimating density, Young's modulus, or Poisson's ratio, which require finer physical interpretation. Softer objects also tend to be easier for models because deformation offers stronger visual cues, while increasing scene complexity does not uniformly hurt every category.

Physics sub-category performance across models.

Physics-specific breakdown across density, mass, material identification, Poisson's ratio, Young's modulus, collision, and kinematics.

Material understanding under Young's modulus variations.
Performance trends with increasing number of objects.

Variation analyses for object softness and scene complexity.

Common-Sense Correlation

The correlation analysis mirrors the discussion section of the paper: standard multimodal common-sense benchmarks correlate with NewtPhys only partially, and especially weakly for the low-level physics categories. This is one of the central claims of the work, namely that current benchmark ecosystems do not adequately measure Newtonian understanding.

Average common-sense correlation with NewtPhys performance.

Average common-sense correlation still underestimates low-level physics gaps.

Correlation per external benchmark.

Per-benchmark correlations stay weak for low-level physics reasoning.
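
One way such a correlation analysis could be set up is sketched below, using Spearman rank correlation between per-model scores on an external benchmark and on NewtPhys. The choice of Spearman is an assumption (the source does not name the correlation measure), and the score lists are toy values made up for illustration, not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(ranks(xs), ranks(ys))


# Toy per-model scores for illustration only (one entry per hypothetical model).
external_benchmark = [40.0, 55.0, 60.0, 72.0, 80.0]
newtphys_overall = [30.0, 33.0, 35.0, 41.0, 44.0]
newtphys_mechanics = [26.0, 31.0, 25.0, 29.0, 27.0]

rho_overall = spearman(external_benchmark, newtphys_overall)
rho_mechanics = spearman(external_benchmark, newtphys_mechanics)
print(f"overall: {rho_overall:.2f}, mechanics: {rho_mechanics:.2f}")
```

In this toy setup the overall score tracks the external benchmark perfectly while the mechanics score barely does, which is the kind of pattern the per-category analysis is designed to expose.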

VFM Physics Probing

The paper also evaluates 10 vision foundation models through physics probing. Frozen visual encoders are paired with lightweight decoders to predict collision, gravity, Gravity-OOD, and scene flow signals from NewtPhys. Self-supervised models generally perform best, MiDaS is the strongest fully supervised baseline, and DINO stands out on gravity prediction, but overall performance is still far from a robust physics-grounded representation.

| Family | Supervision | Model | Objective | Collision F1 | Gravity mAE | Gravity magE | Gravity-OOD mAE | Gravity-OOD magE | Scene Flow AEE |
|---|---|---|---|---|---|---|---|---|---|
| Vision | Fully-supervised | DeiT III | Classification | 48.47 | 19.34 | 21.44 | 15.20 | 35.43 | 1.29 |
| Vision | Fully-supervised | SAM | Segmentation | 54.80 | 20.73 | 17.04 | 16.28 | 33.98 | 0.94 |
| Vision | Fully-supervised | MiDaS | Depth | 54.95 | 12.12 | 15.06 | 8.23 | 33.80 | 0.95 |
| Vision | Self-supervised | MAE | SSL | 28.61 | 45.69 | 31.79 | 42.91 | 42.50 | 1.29 |
| Vision | Self-supervised | DINO | SSL | 56.54 | 13.79 | 14.92 | 9.30 | 33.33 | 0.94 |
| Vision | Self-supervised | DINOv2 | SSL | 56.52 | 14.95 | 14.76 | 11.02 | 33.81 | 0.95 |
| Vision | Agglomerative | AM-Radio | Distillation | 56.96 | 13.90 | 13.95 | 8.25 | 33.84 | 0.97 |
| Vision-Language | Vision-Language | CLIP | Image-Text Alignment | 53.85 | 11.30 | 14.65 | 7.62 | 34.35 | 0.98 |
| Vision-Language | Vision-Language | SigLIP | Image-Text Alignment | 40.91 | 41.87 | 27.52 | 40.19 | 38.54 | 1.27 |
| Vision-Language | Reconstruction | Stable Diffusion | Generation | 50.39 | 21.50 | 21.22 | 15.64 | 33.81 | 1.30 |
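
The metrics in the table could be computed along the lines of the sketch below. The page does not define the column abbreviations, so the readings used here are assumptions: F1 on binary collision masks, mAE as mean angular error in degrees, magE as mean absolute error on vector magnitude, and AEE as the standard average endpoint error for flow. Units and scaling in the paper may differ, and the inputs are toy values, not benchmark data.

```python
import math


def f1_score(pred_mask, target_mask):
    """F1 for flat binary masks (truthy = collision)."""
    tp = sum(1 for p, t in zip(pred_mask, target_mask) if p and t)
    fp = sum(1 for p, t in zip(pred_mask, target_mask) if p and not t)
    fn = sum(1 for p, t in zip(pred_mask, target_mask) if not p and t)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def _norm(vec):
    return math.sqrt(sum(c * c for c in vec))


def mean_angular_error_deg(pred, target):
    """Mean angle (degrees) between vectors; zero-length vectors are skipped."""
    angles = []
    for p, t in zip(pred, target):
        norm_p, norm_t = _norm(p), _norm(t)
        if norm_p == 0 or norm_t == 0:
            continue
        cos = sum(a * b for a, b in zip(p, t)) / (norm_p * norm_t)
        # Clamp to guard against floating-point drift outside [-1, 1].
        angles.append(math.degrees(math.acos(max(-1.0, min(1.0, cos)))))
    return sum(angles) / len(angles) if angles else 0.0


def mean_magnitude_error(pred, target):
    """Mean absolute difference between predicted and target vector norms."""
    return sum(abs(_norm(p) - _norm(t)) for p, t in zip(pred, target)) / len(pred)


def average_endpoint_error(pred, target):
    """Mean Euclidean distance between predicted and target flow vectors."""
    diffs = ([a - b for a, b in zip(p, t)] for p, t in zip(pred, target))
    return sum(_norm(d) for d in diffs) / len(pred)


# Toy per-pixel values for illustration only.
collision_f1 = f1_score([1, 1, 0, 0], [1, 0, 1, 0])
gravity_pred = [(0.0, -1.0, 0.0), (2.0, 0.0, 0.0)]
gravity_true = [(0.0, -1.0, 0.0), (0.0, -1.0, 0.0)]
gravity_angle = mean_angular_error_deg(gravity_pred, gravity_true)
gravity_mag = mean_magnitude_error(gravity_pred, gravity_true)
flow_aee = average_endpoint_error([(3.0, 4.0, 0.0), (0.0, 0.0, 0.0)],
                                  [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)])
print(collision_f1, gravity_angle, gravity_mag, flow_aee)
```

Separating direction error from magnitude error makes it possible to tell whether a probe recovers where a force points even when it misjudges how strong it is.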

BibTeX

@misc{cavada2026newtphys,
  title        = {NewtPhys: Do Foundation Models Understand Newtonian Physics?},
  author       = {Sebastian Cavada and Soumava Paul and Tuan-Hung Vu and Andrei Bursuc and Raoul de Charette},
  year         = {2026}
}