Researchers have found that state-of-the-art multimodal large language models struggle with a basic task: understanding how the physical world behaves over time — and have proposed a new training method that closes much of that gap.

Multimodal Large Language Models (MLLMs) — AI systems capable of processing both images and text, such as those underpinning GPT-4o — have made rapid progress in recognising objects, describing scenes, and answering questions about visual content. But a new study posted to arXiv reveals that these same models fall short when asked to reason about physical dynamics: predicting how fluids flow, how soft or deformable objects move, or simply which frame in a video comes next.

A Hidden Weakness in Vision AI

The research team designed two benchmark tasks specifically to probe this capability. The first, Next Frame Selection (NFS), asks a model to identify the correct next frame in a physical sequence from a set of candidates. The second, Temporal Coherence Verification (TCV), asks the model to judge whether a sequence of frames is physically plausible. Both tasks are, in the researchers' framing, foundational steps toward genuine physical reasoning — the kind of intuitive understanding humans develop in infancy.
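The NFS setup is, at its core, multiple-choice evaluation: the model sees a few context frames and must pick the true next frame from a candidate set. The sketch below is a minimal, hypothetical version of such a scorer — `random_guess` stands in for an MLLM and is not from the paper — illustrating the chance baseline any real model must beat.

```python
import random

def evaluate_nfs(choose_fn, examples):
    """Score a model on Next Frame Selection: for each example the model
    picks one candidate frame; accuracy is the fraction of correct picks."""
    correct = 0
    for context_frames, candidates, answer_idx in examples:
        if choose_fn(context_frames, candidates) == answer_idx:
            correct += 1
    return correct / len(examples)

# Placeholder "model" that guesses uniformly at random -- the chance
# baseline (25% with four candidates) a real MLLM must beat.
def random_guess(context_frames, candidates):
    return random.randrange(len(candidates))

# Toy examples: four context frames, four candidate next frames,
# and the index of the physically correct continuation.
examples = [
    ([f"frame{i}" for i in range(4)], ["A", "B", "C", "D"], 0)
    for _ in range(1000)
]
accuracy = evaluate_nfs(random_guess, examples)
```

TCV can be framed the same way, except the model returns a binary plausible/implausible judgment per sequence rather than selecting among candidates.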

Even state-of-the-art MLLMs perform poorly on these foundational tasks, pointing to a critical gap in how current systems represent the physical world.

The results were stark. Across both benchmarks, leading models performed well below what the researchers considered acceptable for a system claiming visual understanding. The weakness was most pronounced for continuum objects — materials like fluids, cloth, and soft bodies that deform continuously rather than moving as rigid shapes. These are precisely the kinds of physical phenomena that are difficult to capture in still images, and models trained predominantly on static image-text pairs appear to have internalised very little about how they behave.

What Scene Dynamic Field Does Differently

To address this, the team introduced Scene Dynamic Field (SDF), a training approach that integrates physics simulators into a multi-task fine-tuning framework. Rather than relying solely on real-world video footage — which is expensive to collect and label — SDF generates training data from physics simulation engines, producing paired examples of physical scenarios alongside ground-truth dynamic information.

The key insight is that physics simulators can produce essentially unlimited examples of how fluids, deformable bodies, and other continuum objects behave under different conditions. By exposing models to this simulated data during fine-tuning, SDF teaches them to build richer internal representations of physical dynamics — what the paper calls a "dynamic field" of the scene, encoding motion and physical state rather than just appearance.
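The data-generation recipe behind this idea can be illustrated with a toy stand-in: roll a simulation forward, record ground-truth state at each step, then derive physically plausible and implausible sequences for training. The snippet below uses a trivial free-fall "simulator" for clarity — the paper's actual pipeline relies on fluid and deformable-body engines, and none of these function names come from the paper.

```python
DT, GRAVITY = 0.1, -9.8

def simulate_drop(height, steps):
    """Toy simulator: a ball in free fall. Returns per-step
    (position, velocity) ground truth, the kind of precise labels
    a physics engine provides for free."""
    pos, vel, states = height, 0.0, []
    for _ in range(steps):
        vel += GRAVITY * DT
        pos += vel * DT
        states.append((round(pos, 3), round(vel, 3)))
    return states

def make_tcv_pair(states):
    """Positive = the simulator's own ordering (label 1); negative = the
    reversed sequence (label 0), which shows a ball accelerating upward
    and so breaks temporal coherence."""
    return (list(states), 1), (list(reversed(states)), 0)

states = simulate_drop(height=10.0, steps=5)
pos_example, neg_example = make_tcv_pair(states)
```

Because the simulator is cheap to run, this loop can generate arbitrarily many labelled sequences across varied initial conditions — the property that makes simulated data attractive compared with collecting and annotating real video.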

According to the research team, the results show gains of up to 20.7% on fluid dynamics tasks. Crucially, models trained with SDF also showed improved performance on physical domains they had not been trained on directly, suggesting the method encourages genuine generalisation rather than narrow task memorisation.

Why Simulated Data Is a Practical Advantage

One of the more practically significant aspects of SDF is its cost profile. Collecting and annotating real-world video data depicting physical phenomena — particularly unusual or complex ones — is time-consuming and expensive. Physics simulators, by contrast, can generate diverse, precisely labelled training scenarios at relatively low cost.

This positions SDF as a potentially scalable path toward physically grounded AI, without requiring the kind of massive real-world data pipelines that have characterised other advances in multimodal learning. The authors describe the approach as cost-efficient, though independent validation of that claim has not yet been published.

The research team has released both their code and the benchmark datasets publicly on GitHub, allowing other researchers to replicate the results and test their own models against the NFS and TCV benchmarks.

The Broader Challenge of Physical Reasoning

The gap this paper identifies sits within a larger open problem in AI research: building systems that understand the physical world, not just its appearance. Current MLLMs are trained on enormous quantities of images and text, which gives them strong surface-level visual recognition. But understanding physics requires something different — a model of how objects interact, change state, and move through time.

Several research threads are pursuing this goal from different angles. Some teams focus on video prediction models that learn to anticipate future frames. Others work on world models — internal simulations that an AI system can query to reason about hypothetical physical scenarios. SDF sits closer to the fine-tuning end of this spectrum: it does not build a full world model, but it does inject physically grounded knowledge into an existing MLLM architecture at relatively low computational cost.

The authors explicitly frame their work as addressing the first step of physical reasoning — intuitive understanding — rather than higher-level causal or counterfactual reasoning. Predicting what happens when you drop an object, pour a liquid, or stretch a material is the baseline. More complex reasoning about why physical events occur, or what would happen under different conditions, remains a harder and largely unsolved challenge.

What This Means

For researchers and developers building multimodal AI systems, this work provides both a diagnostic tool — two benchmarks that expose a specific, measurable weakness — and a practical method for addressing it, one that may be adopted or extended as the field works toward AI that understands the physical world.