Author: Denis Avetisyan
A new benchmark reveals that vision-language models struggle with basic physical reasoning, often relying on memorization rather than genuine understanding of visual scenes.

QuantiPhy, a novel dataset and evaluation protocol, quantitatively assesses the ability of vision-language models to perform kinematic inference and demonstrates limitations in their physical understanding.
Despite advances in vision-language models, a rigorous quantitative evaluation of their physical reasoning capabilities remains a challenge. To address this gap, we introduce QuantiPhy: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models, a dataset and evaluation protocol designed to measure a VLM’s ability to infer kinematic properties from video observations. Our results reveal a consistent discrepancy between qualitative plausibility and numerical accuracy in state-of-the-art models, indicating a reliance on pre-trained knowledge rather than faithful visual and textual referencing. Can we move beyond verbally plausible responses and develop VLMs with a genuinely numerically grounded understanding of the physical world?
The Erosion of Precision: Why Vision-Language Models Struggle with Physical Reality
Vision-Language Models (VLMs) demonstrate a remarkable ability to describe the contents of an image or scene, often generating fluent and contextually relevant textual explanations. However, this descriptive prowess frequently masks a critical limitation: a struggle with precise, quantitative predictions about the physical world. While a VLM might accurately identify objects and their relationships – stating, for example, that “the red ball is to the left of the blue box” – it often falters when asked to predict how far a tilted object will roll, or how long a falling object will take to reach the ground. This discrepancy arises because current training methodologies prioritize perceptual accuracy and linguistic coherence over a deep understanding of underlying physical principles like gravity, momentum, or friction – an understanding essential for navigating and interacting with the physical environment. Consequently, VLMs can generate visually plausible but physically unrealistic scenarios, highlighting the need for benchmarks and training data that specifically assess and cultivate quantitative physical reasoning.
Current evaluations of vision-language models frequently center on qualitative visual question answering, where success is determined by simply identifying what is happening in an image. This approach, however, overlooks a critical dimension of true intelligence: the ability to predict how much of something will occur. A model might correctly identify a scenario involving falling objects, but fail to accurately estimate the time it will take for them to reach the ground or the final velocity they will achieve upon impact. This emphasis on descriptive accuracy, rather than numerical precision, creates a significant gap in assessing a model’s genuine understanding of physical dynamics; a model could consistently answer “yes” or “no” questions correctly without possessing any robust grasp of quantitative relationships like $F = ma$ or the principles of conservation of energy. Consequently, advancements in qualitative understanding may not necessarily translate to improvements in real-world applications demanding precise physical predictions.
Truly assessing physical reasoning in Vision-Language Models necessitates a departure from simply identifying what is happening in an image to precisely quantifying how much of a physical phenomenon is occurring. Current benchmarks often focus on qualitative answers – determining if an object will fall, for example – but fail to probe the model’s ability to predict magnitudes like velocity, force, or time. A robust evaluation requires datasets and metrics that demand numerical predictions; instead of asking “will the block fall?”, the system must predict “how long will it take to fall?”, or “with what velocity will it impact the ground?”. This shift towards quantitative evaluation, measuring predictions against ground truth values like $v = gt$ (where $v$ is velocity, $g$ is gravitational acceleration, and $t$ is time), is crucial for developing models that don’t merely describe the physical world, but genuinely understand and can accurately predict its behavior.
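To make the distinction concrete, consider the free-fall case: the benchmark-style question is not whether the block falls, but for how long and how fast. The following minimal Python sketch (illustrative only, not drawn from QuantiPhy) computes such ground-truth answers from the standard kinematic relations $h = \frac{1}{2}gt^2$ and $v = gt$.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2

def fall_time(height_m: float) -> float:
    """Time for an object dropped from rest to fall height_m metres (h = 1/2 g t^2)."""
    return math.sqrt(2.0 * height_m / G)

def impact_velocity(height_m: float) -> float:
    """Speed at impact after free fall from height_m metres (v = g t)."""
    return G * fall_time(height_m)

# Example quantitative question: "How long does a ball dropped from 5 m take to land,
# and how fast is it moving on impact?"
t = fall_time(5.0)        # ~1.01 s
v = impact_velocity(5.0)  # ~9.90 m/s
print(f"fall time: {t:.2f} s, impact velocity: {v:.2f} m/s")
```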
A comprehensive understanding of physical dynamics necessitates more than simply identifying what is happening in an image; it requires predicting how much, and current evaluation methods for Vision-Language Models often fall short in this regard. Without benchmarks that demand numerical precision – assessing quantities like speed, force, or volume – it remains difficult to truly gauge a model’s ability to reason about the physical world. This lack of robust quantitative evaluation hinders progress, as models can achieve high scores on qualitative tasks without possessing a genuine understanding of underlying physical principles. Consequently, improvements to a VLM’s physical reasoning capabilities are difficult to reliably measure, and the development of truly physically-aware artificial intelligence is significantly challenged. The field requires a shift towards metrics that assess not just recognition, but accurate prediction of measurable physical properties, such as calculating the trajectory of a projectile or estimating the stability of a structure, to unlock meaningful advancements.

QuantiPhy: A Framework for Measuring True Physical Insight
QuantiPhy provides a systematic evaluation of Visual Language Models (VLMs) concerning their capacity to infer kinematic properties directly from visual inputs. This framework assesses a VLM’s ability to determine an object’s size, velocity, and acceleration based solely on visual data. Evaluation is performed by presenting VLMs with visual stimuli and quantifying the accuracy of their predicted kinematic values against ground truth data. The framework is designed to move beyond qualitative assessments of physical reasoning and provide a measurable, quantitative analysis of a VLM’s performance in inferring these fundamental physical characteristics. Specifically, QuantiPhy aims to establish benchmarks for assessing how well a VLM can translate visual observations into accurate estimations of size, velocity, and acceleration.
The QuantiPhy framework incorporates both synthetic and real-world captured data to facilitate comprehensive Visual Language Model (VLM) evaluation. Simulated data allows for precise control over ground truth kinematic properties and scene parameters, enabling systematic testing across a wide spectrum of conditions. Complementing this, the inclusion of captured data – images and videos of real-world physical phenomena – introduces inherent visual complexities such as lighting variations, occlusions, and sensor noise. This combined approach ensures that VLMs are evaluated not only on idealized scenarios but also on their ability to generalize to the more challenging and nuanced visual information present in authentic environments, leading to a more robust and reliable assessment of their physical reasoning capabilities.
QuantiPhy moves beyond qualitative evaluations of Visual Language Models (VLMs) by employing quantitative metrics to assess the accuracy of predicted kinematic properties. Specifically, the framework utilizes metrics such as Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) to compare predicted values – including size, velocity, and acceleration – with ground truth data derived from both synthetic and real-world scenes. These metrics provide a numerical assessment of a model’s performance, allowing for precise comparisons between different VLMs and facilitating targeted improvements. The use of these metrics enables a granular analysis of error distributions and identifies specific areas where a model struggles with physical reasoning, something not achievable through subjective human evaluation.
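As an illustration of how these error metrics behave, the short sketch below computes MAE and RMSE over a handful of velocity estimates; the numbers and data layout are invented for demonstration and are not taken from the benchmark.

```python
import math

def mae(predictions, ground_truth):
    """Mean Absolute Error between predicted and true kinematic values."""
    return sum(abs(p - g) for p, g in zip(predictions, ground_truth)) / len(predictions)

def rmse(predictions, ground_truth):
    """Root Mean Squared Error between predicted and true kinematic values."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(predictions, ground_truth)) / len(predictions))

# Illustrative velocity estimates (m/s) from a VLM versus simulated ground truth.
pred = [2.1, 4.8, 9.3]
true = [2.0, 5.0, 10.0]
print(f"MAE:  {mae(pred, true):.3f}")   # ~0.333
print(f"RMSE: {rmse(pred, true):.3f}")  # ~0.424
```

Because RMSE squares the residuals, it penalises the single large velocity error more heavily than MAE does, which is why reporting both gives a fuller picture of where a model’s numerical estimates break down.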
QuantiPhy moves beyond qualitative assessments of Visual Language Models (VLMs) by directly measuring the numerical accuracy of inferred physical properties. Rather than simply determining if a VLM correctly identifies a change in velocity, QuantiPhy quantifies how close the predicted velocity is to the ground truth, utilizing metrics such as mean absolute error (MAE) and root mean squared error (RMSE) for precise evaluation. This approach allows for a granular understanding of VLM performance, identifying specific types of physical inferences where models excel or struggle – for example, distinguishing between accurate estimations of object size and less precise predictions of acceleration. The resulting quantitative data facilitates comparative analysis of different VLM architectures and training strategies, enabling targeted improvements in physical reasoning capabilities.

Synthetic Worlds, Ground Truth Realities: The Architecture of Robust Evaluation
Blender is utilized as the core engine for generating synthetic data through physics-driven simulation. This approach allows for granular control over all scene parameters, including object properties, lighting conditions, camera angles, and environmental factors. By manipulating these parameters, we can systematically create a diverse dataset of simulated environments and events. The physics engine within Blender accurately models object interactions and motion, resulting in highly realistic visual data. This controlled generation process ensures data consistency and allows for the creation of ground truth annotations, which are crucial for training and evaluating computer vision algorithms.
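The snippet below is a minimal sketch of the kind of physics-driven scene scripting Blender’s Python API (bpy) supports: a cube is dropped onto a plane under rigid-body simulation and its per-frame positions are read back as ground-truth kinematics. It illustrates the general approach only; the objects, parameters, and workflow are assumptions, not the paper’s actual generation pipeline.

```python
# Minimal Blender (bpy) sketch: drop a cube onto a ground plane under rigid-body
# physics, then sample its simulated positions as ground-truth kinematics.
# Run inside Blender's scripting environment; all parameters are illustrative.
import bpy

scene = bpy.context.scene
scene.frame_start, scene.frame_end = 1, 120
scene.gravity = (0.0, 0.0, -9.81)  # standard gravity; easy to override for counterfactual scenes

# Ground plane as a passive collider.
bpy.ops.mesh.primitive_plane_add(size=20.0, location=(0.0, 0.0, 0.0))
bpy.ops.rigidbody.object_add()
bpy.context.object.rigid_body.type = 'PASSIVE'

# Falling cube as an active rigid body.
bpy.ops.mesh.primitive_cube_add(size=1.0, location=(0.0, 0.0, 5.0))
cube = bpy.context.object
bpy.ops.rigidbody.object_add()
cube.rigid_body.type = 'ACTIVE'
cube.rigid_body.mass = 1.0

# Bake the simulation and record the cube's height at every frame.
bpy.ops.ptcache.bake_all(bake=True)
trajectory = []
for frame in range(scene.frame_start, scene.frame_end + 1):
    scene.frame_set(frame)
    trajectory.append((frame, cube.matrix_world.translation.z))
```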
The integration of Sketchfab’s online repository of 3D models is a core component of our data generation pipeline, providing a substantial increase in the visual diversity of simulated environments. Currently, the platform offers access to tens of thousands of models spanning a wide range of object categories and levels of geometric complexity. These models are incorporated into Blender scenes to create varied backgrounds and populate simulated environments with realistic objects, moving beyond the limitations of a fixed or small set of assets. This approach allows for the generation of training data that more closely reflects the complexity and variability of real-world scenes, improving the robustness and generalization capability of evaluated models.
Optical flow techniques are implemented to calculate the apparent motion of objects within the synthetic videos generated by the simulation suite. This process involves assigning a vector to each pixel, representing its displacement between consecutive frames. The resulting optical flow fields serve as ground truth data, providing a precise measurement of motion that can be directly compared against the motion estimations produced by the models under evaluation. This comparison enables quantitative assessment of model accuracy, allowing for precise benchmarking and identification of areas for improvement in motion prediction algorithms. The technique facilitates evaluation across various simulated scenarios and object types, offering a controlled and reliable method for performance analysis.
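A minimal sketch of this step is shown below, using OpenCV’s Farneback dense optical flow as a stand-in for whatever estimator the pipeline actually employs; the file paths, frame rate, and calibration constant are placeholders.

```python
import cv2
import numpy as np

# Placeholder paths to two consecutive rendered (or captured) frames.
prev = cv2.imread("frame_0001.png", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame_0002.png", cv2.IMREAD_GRAYSCALE)

# Dense flow: flow[y, x] = (dx, dy), the per-pixel displacement between the frames.
# Positional arguments: pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags.
flow = cv2.calcOpticalFlowFarneback(prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)

# Convert pixel displacement to an approximate speed, assuming a known frame rate
# and a pixels-per-metre scale recovered from the calibrated synthetic camera.
fps = 30.0                # illustrative frame rate
pixels_per_metre = 100.0  # illustrative calibration constant
speed = np.linalg.norm(flow, axis=2) * fps / pixels_per_metre  # m/s at each pixel
print(f"mean apparent speed: {speed.mean():.3f} m/s")
```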
Data consistency, as measured by the discard rate of unusable samples, differs significantly between synthetically generated data and data collected from physical experiments. Blender-generated datasets exhibit a low discard rate of 3%, indicating a high degree of data reliability and minimizing the need for reprocessing or exclusion of flawed samples. Conversely, data acquired from real-world laboratory setups demonstrates a substantially higher discard rate of 30%. This disparity highlights the inherent difficulties in maintaining data quality during physical data acquisition, stemming from factors such as sensor noise, lighting variations, and imperfect experimental control, and underscores the value of synthetic data for training and evaluation where data consistency is paramount.

Unmasking Assumptions: Probing the Limits of Physical Intuition
Vision-Language Models (VLMs) don’t approach physical reasoning from a blank slate; instead, they heavily leverage pre-existing knowledge acquired during training. However, this reliance on prior experience can introduce vulnerabilities, as models may uncritically accept and apply assumptions that don’t universally hold true. Researchers are actively investigating the extent to which VLMs depend on these implicit priors – essentially, deeply ingrained expectations about how the physical world behaves – and whether these assumptions hinder their ability to solve novel problems or generalize to unfamiliar scenarios. The concern is that a model might succeed not because of genuine understanding, but because it correctly guesses based on frequently encountered patterns, potentially leading to errors when faced with situations that deviate from the norm. Understanding these ingrained biases is crucial for developing more robust and reliable artificial intelligence systems capable of true physical understanding.
The study investigates the adaptability of Vision-Language Models (VLMs) by deliberately challenging their ingrained understanding of physics. Researchers introduced ‘counterfactual priors’ – scenarios where fundamental physical rules are altered, such as gravity acting upwards or objects passing through each other – to determine if the models rely solely on learned correlations or possess a more generalized understanding of physical principles. By measuring performance under these unusual conditions, the investigation reveals whether VLMs can extrapolate beyond their training data and reason effectively even when confronted with violations of expected physical laws. This approach distinguishes between models that merely recognize patterns and those capable of true physical reasoning, highlighting the importance of robustness and adaptability in artificial intelligence systems designed to interact with the physical world.
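One way to express such counterfactual priors is as targeted perturbations of an otherwise standard simulation configuration, as in the small sketch below; the configuration fields are hypothetical and are not QuantiPhy’s actual schema.

```python
# Illustrative sketch: start from a standard physics configuration and flip one
# rule at a time to produce counterfactual scenes. Field names are hypothetical.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class PhysicsConfig:
    gravity_z: float = -9.81       # m/s^2; negative means downward
    collisions_enabled: bool = True
    friction: float = 0.5

standard = PhysicsConfig()

counterfactuals = {
    "inverted_gravity": replace(standard, gravity_z=+9.81),        # objects "fall" upward
    "no_collisions": replace(standard, collisions_enabled=False),  # objects pass through each other
    "frictionless": replace(standard, friction=0.0),               # sliding never decelerates
}

for name, cfg in counterfactuals.items():
    print(name, cfg)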
Chain-of-Thought prompting serves as a critical diagnostic tool for understanding the internal logic of large language models when tackling physical reasoning problems. This technique encourages the model to not simply output an answer, but to explicitly detail the sequential steps taken to arrive at that conclusion. By making this reasoning process visible, researchers can pinpoint instances where models rely on potentially flawed heuristics or deeply ingrained, yet incorrect, assumptions about how the physical world operates. The articulated reasoning isn’t merely a post-hoc justification; it reveals the model’s process – highlighting whether solutions stem from genuine understanding or from pattern-matching based on biased training data. This transparency allows for targeted interventions, ultimately pushing these models toward more robust and reliable physical intuition, rather than reinforcing pre-existing, and potentially misleading, priors.
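The sketch below shows the general shape of such a prompt for a kinematic question, asking the model to expose its intermediate estimates before committing to a number; the wording is illustrative and is not the prompt used in the paper.

```python
# Illustrative chain-of-thought prompt for a kinematic inference question.
# The phrasing is an assumption for demonstration, not the benchmark's actual prompt.
def build_cot_prompt(question: str) -> str:
    return (
        "You are given a short video of a moving object.\n"
        f"Question: {question}\n"
        "Reason step by step before answering:\n"
        "1. Identify the object and any reference scale visible in the scene.\n"
        "2. Estimate the distance the object travels between two timestamps.\n"
        "3. Divide that distance by the elapsed time to obtain velocity.\n"
        "Finish with a single line of the form 'Answer: <value> <unit>'.\n"
    )

print(build_cot_prompt("What is the ball's average velocity in metres per second?"))
```

Inspecting the intermediate steps a model writes out for prompts like this is what lets researchers separate a genuinely derived estimate from a number recalled because it is typical for the objects in view.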
A critical evaluation of current physical reasoning models reveals a significant dependence on pre-existing, often unstated, assumptions about how the world operates. When these models are tested under standard conditions, performance can appear robust, masking an underlying fragility. However, deliberately altering fundamental physical rules – introducing counterfactual scenarios – dramatically exposes these limitations. Comparative analysis demonstrates that performance degrades considerably when models encounter situations that violate their implicit priors, suggesting that much of their apparent reasoning isn’t based on true understanding of physical principles, but rather on the successful matching of observed phenomena to pre-programmed expectations. This highlights the necessity of developing models capable of genuinely learning and adapting to novel physical circumstances, rather than simply extrapolating from existing ones.

The Human Benchmark: Charting a Path Towards True Physical Intelligence
A rigorous assessment of any artificial intelligence system requires a clear understanding of the limits of human ability. To that end, a dedicated human study was undertaken to establish an upper bound on performance in quantitative physical reasoning – the capacity to solve problems involving physical quantities and relationships. Participants were presented with a series of challenges designed to test their intuitive understanding of physics and their ability to apply mathematical principles to predict outcomes. The results of this study not only define the benchmark against which visual-language models (VLMs) are evaluated, but also illuminate the cognitive processes involved in human physical reasoning, providing valuable insights into the complexities of intelligence itself. Establishing this human performance ceiling is crucial for gauging the progress of AI and identifying the substantial opportunities that remain in developing truly intelligent systems.
A rigorous assessment of visual language models (VLMs) demands a clear understanding of human capabilities in the same domain. To this end, a dedicated human study was conducted to establish an upper bound on performance for quantitative physical reasoning tasks. This benchmark isn’t simply about achieving high scores; it provides a vital comparative measure, allowing researchers to pinpoint specific areas where VLMs fall short of human cognition. By contrasting VLM performance against human results, the study highlights critical gaps in areas such as understanding physical constraints, spatial relationships, and numerical estimation. This targeted analysis then informs the development of more effective architectures and training methodologies, guiding future efforts to create VLMs with enhanced and more human-like reasoning abilities.
Current vision-language models (VLMs) demonstrate a notable disparity in quantitative physical reasoning when contrasted with human capabilities. Evaluations reveal these models achieve a Mean Relative Accuracy (MRA) ranging from 0.2 to 0.6 on benchmark tasks. This suggests that, while VLMs are progressing in their ability to connect visual information with language, they still fall considerably short of human performance – often exhibiting errors in tasks requiring precise numerical estimation or understanding of physical relationships. The relatively low MRA highlights a substantial area for development, indicating that current architectures and training methodologies need refinement to approach the accuracy and robustness of human physical intuition.
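The article does not spell out how MRA is computed; one common formulation averages a thresholded relative-error indicator over a range of confidence levels, and the sketch below follows that assumption rather than a definition confirmed for this benchmark.

```python
# Sketch of a Mean Relative Accuracy (MRA) computation under the assumption that it
# averages a relative-error indicator over several confidence thresholds.
def mean_relative_accuracy(pred: float, truth: float,
                           thresholds=(0.50, 0.55, 0.60, 0.65, 0.70,
                                       0.75, 0.80, 0.85, 0.90, 0.95)) -> float:
    rel_error = abs(pred - truth) / abs(truth)
    # A prediction counts as correct at threshold theta if its relative error is
    # below 1 - theta; MRA is the fraction of thresholds at which it is correct.
    return sum(1.0 for theta in thresholds if rel_error < (1.0 - theta)) / len(thresholds)

print(mean_relative_accuracy(pred=8.0, truth=10.0))  # relative error 0.2 -> MRA 0.6
```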
Ongoing research prioritizes the creation of innovative neural network designs and training methodologies specifically aimed at enhancing visual language models’ abilities in quantitative physical reasoning. This involves exploring architectures that move beyond simple pattern recognition towards systems capable of constructing and manipulating internal representations of physical quantities and relationships. Simultaneously, new training strategies are being investigated, including curricula learning, self-supervised learning, and the incorporation of symbolic reasoning modules, all geared towards equipping these models with the capacity to not only ‘see’ a physical scenario but to accurately predict and extrapolate its behavior – ultimately striving for performance that mirrors, and potentially surpasses, human-level understanding of the physical world.

The introduction of QuantiPhy exposes a critical fragility within current Vision-Language Models – a reliance on memorized associations rather than genuine physical understanding. This echoes a fundamental truth of all engineered systems: they are not static achievements, but rather evolving entities susceptible to decay. As Brian Kernighan observed, “Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.” QuantiPhy doesn’t merely reveal a lack of kinematic inference; it highlights the inherent difficulty in building systems that adapt to unforeseen circumstances, requiring constant refinement and a willingness to confront the limitations of pre-existing knowledge. The benchmark serves as a diagnostic, illuminating the inevitable ‘bugs’ in even the most sophisticated architectures, and the necessity of embracing a continuous cycle of evaluation and improvement.
What Lies Ahead?
The emergence of QuantiPhy is not a revelation of inadequacy, but a necessary deceleration. It exposes the tendency of current Vision-Language Models to prioritize the patina of knowledge over the labor of genuine inference. Every delay in achieving seemingly seamless performance is, in fact, the price of understanding what these systems don’t understand. The benchmark’s focus on quantitative physical reasoning, specifically kinematic inference, highlights a critical fragility: architecture without a robust grounding in basic physical principles is ephemeral, prone to collapse under novel conditions.
Future work will undoubtedly attempt to patch the observed deficits with larger datasets and more complex architectures. However, the fundamental problem remains: current approaches treat physical understanding as a feature to be learned, rather than a scaffolding upon which intelligence can be built. A more fruitful path lies in explicitly incorporating principles of physics into the model’s inductive biases: a shift from statistical correlation to causal modeling.
The true metric of progress will not be benchmark scores, but the ability to extrapolate beyond the training distribution: to predict the unforeseen consequences of physical interactions. Time, after all, is not a constraint, but the arena in which all systems either decay gracefully, or crumble into irrelevance.
Original article: https://arxiv.org/pdf/2512.19526.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/