Seeing the Formula: AI Decodes Physics from Visualizations

Author: Denis Avetisyan


New research demonstrates that artificial intelligence can now infer the underlying mathematical equations governing physical phenomena simply by analyzing visual representations of those phenomena.

The system infers analytical solutions directly from visual input, establishing a link between perception and symbolic reasoning.

Vision-language models, enhanced with chain-of-thought reasoning, successfully perform symbolic regression on visual field data to derive analytical solutions.

Despite advances in artificial intelligence, recovering analytical solutions from visual scientific data remains a significant challenge. This work, ‘Hidden in Plain Sight: Visual-to-Symbolic Analytical Solution Inference from Field Visualizations’, introduces a novel approach to visually inferring symbolic solutions for two-dimensional linear steady-state fields, effectively bridging the gap between visual observation and executable mathematical expression. We demonstrate that, with a chain-of-thought pipeline and a dedicated benchmark, vision-language models can move beyond numerical approximation to directly output fully-instantiated symbolic equations. Could this capability unlock new avenues for AI-assisted scientific discovery and automated model building across diverse physical domains?


The Limits of Perception: Bridging the Gap to Scientific Reasoning

Vision-Language Models (VLMs) demonstrate a remarkable capacity for identifying and categorizing objects within images, effectively mimicking human visual perception. However, this perceptual prowess often plateaus when confronted with tasks requiring deeper scientific reasoning. While a VLM can see a physics experiment unfolding, it frequently struggles to interpret the underlying principles or predict outcomes based on observed phenomena. The core limitation lies in translating raw visual data into a symbolic representation, a system of interconnected concepts and relationships, necessary for deductive or inductive reasoning. Essentially, these models excel at ‘what’ is visible, but falter when asked ‘why’ or ‘how’ things behave, hindering their ability to extract genuine scientific insight from visual information and necessitating advancements in bridging perception with abstract thought.

Attempts to equip vision-language models with scientific reasoning skills often fall short when limited to processing purely numerical data extracted from visual inputs. As demonstrated by the LLM-only baselines, simply quantifying aspects of an image – such as the number of objects or their positions – neglects the critical relationships and underlying physical principles governing the scene. These models struggle to infer causality or predict future states because they lack an understanding of concepts like gravity, momentum, or material properties – knowledge implicitly conveyed through visual cues but absent in raw numerical representation. Consequently, while a model might accurately report the number of falling objects, it cannot explain why they are falling, nor can it reliably predict their trajectory beyond simple extrapolation, highlighting the need for methods that capture a deeper, physics-informed understanding of visual data.

The capacity to translate visual information into symbolic understanding represents a fundamental hurdle in achieving genuine scientific insight with artificial intelligence. Current vision-language models, while adept at identifying objects and scenes, often lack the ability to deduce underlying principles or relationships – a limitation that hinders progress in fields reliant on visual data analysis. Successfully bridging this gap requires more than simply recognizing what is observed; it demands a system capable of inferring why things behave as they do, effectively converting pixel data into a structured, symbolic representation amenable to reasoning and hypothesis generation. This transition from perceptual recognition to abstract understanding is not merely a technical challenge, but a necessary step toward enabling machines to truly ‘interpret’ scientific observations and contribute to discovery, mirroring the core of human scientific inquiry.

The ViSA dataset is constructed and evaluated through a pipeline encompassing data collection, annotation, and model performance assessment.

From Vision to Symbol: A New Analytical Pathway

Visual-to-Symbolic Analytical Solution Inference (ViSA) is a novel methodology that processes visual input – such as images or video frames – and directly generates corresponding mathematical expressions. This is achieved through the utilization of Vision-Language Models (VLMs) to interpret the visual data and identify quantifiable relationships. The core function of ViSA is to bypass intermediate representations, converting visual features into symbolic notations representing physical quantities and their interactions. The resulting output is not a numerical approximation, but an analytical solution expressed in a standardized symbolic format, allowing for precise mathematical manipulation and verification. For example, a visual depiction of a simple harmonic oscillator could be translated into the differential equation [latex] \frac{d^2x}{dt^2} + \omega^2 x = 0 [/latex].
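For the oscillator example, such a symbolic output can be sketched in a few lines of SymPy. This is an illustrative reconstruction of what the pipeline might emit, not code from the paper:

```python
import sympy as sp

# Symbols for the simple harmonic oscillator example
t = sp.symbols('t')
omega = sp.symbols('omega', positive=True)
x = sp.Function('x')

# The inferred governing equation: x'' + omega^2 * x = 0
ode = sp.Eq(x(t).diff(t, 2) + omega**2 * x(t), 0)

# Because the output is symbolic, it can be solved exactly rather than
# approximated numerically: x(t) = C1*sin(omega*t) + C2*cos(omega*t)
solution = sp.dsolve(ode, x(t))
```

Because the result is a SymPy object rather than a table of numbers, downstream tools can differentiate it, substitute boundary conditions, or verify it against the field data exactly.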

Visual-to-Symbolic Analytical Solution Inference utilizes large, pre-trained Vision-Language Models (VLMs) to process visual input, effectively ‘reading’ diagrams, graphs, and experimental setups. These models are employed not merely for object recognition, but for the higher-level task of identifying relationships and patterns indicative of underlying physical laws. By analyzing visual data, the system infers governing principles – such as those described by [latex]F = ma[/latex] or [latex]V = IR[/latex] – and reconstructs them as explicit, machine-readable equations. This process emulates the human scientific method, where observation of visual phenomena leads to the formulation of abstract, symbolic representations of natural laws.

Inferred solutions are outputted in a standardized symbolic format utilizing the SymPy library, a Python package for symbolic mathematics. This allows for programmatic manipulation and verification of results, circumventing the limitations of numerical approximations. Specifically, SymPy facilitates operations such as algebraic simplification, equation solving, and derivative/integral calculation, enabling rigorous validation of the inferred mathematical relationships. The standardized format ensures compatibility with existing computational tools and facilitates reproducibility of results, as expressions are represented explicitly rather than as opaque numerical values; for example, the result of a calculation might be expressed as [latex]x^2 + 2x + 1[/latex] rather than a decimal approximation. This approach is crucial for verifying the physical consistency and accuracy of the solutions derived from visual data.
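The operations listed above can be made concrete with a minimal sketch (again illustrative, not the paper's code), showing how SymPy keeps results as explicit expressions amenable to simplification, solving, and calculus:

```python
import sympy as sp

x = sp.symbols('x')

# An inferred solution stays an explicit expression, never a decimal approximation
expr = x**2 + 2*x + 1

# Algebraic simplification exposes structure: (x + 1)**2
factored = sp.factor(expr)

# Exact equation solving and calculus follow directly
roots = sp.solve(sp.Eq(expr, 0), x)   # double root at x = -1
deriv = sp.diff(expr, x)              # 2*x + 2
```

Representing [latex]x^2 + 2x + 1[/latex] this way is what allows a verifier to confirm, symbolically rather than by sampling, that an inferred solution satisfies the governing equation.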

This pipeline extracts relevant features, matches them to supporting evidence, infers key parameters, and synthesizes a high-quality reasoning chain.

Validating the Approach: ViSA-Bench and ViSA-R2

The ViSA-Bench dataset was created to provide a standardized evaluation platform for Visual-to-Symbolic Analytical Solution Inference, specifically focusing on the challenging domain of 2D Linear Steady-State Fields. The dataset comprises a total of 15,000 instances, structured across 30 distinct scenarios, each containing 500 individual problem instances. This scale allows for robust statistical evaluation of model performance and facilitates comparison between different approaches to the task. The dataset’s construction ensures a diverse range of problem configurations within the defined physical domain, enabling assessment of generalization capabilities beyond specific training examples.

ViSA-R2 is a model specifically designed for the Visual-to-Symbolic Analytical Solution Inference task, leveraging the Qwen3-VL architecture as its foundation. This architecture was selected for its capabilities in processing visual information and its compatibility with the requirements of symbolic reasoning. The model underwent a dedicated fine-tuning process, utilizing the ViSA-Bench dataset, to optimize its performance in translating visual representations of physics problems into accurate symbolic solutions, including equations and variable assignments. This fine-tuning focused on enhancing the model’s ability to correctly identify relevant physical parameters from images and represent them in a mathematically sound format.

ViSA-R2, a model built upon the Qwen3-VL architecture, establishes a new state-of-the-art performance level on the Visual-to-Symbolic Analytical Solution Inference task when benchmarked against both open-source and closed-source frontier Visual Language Models (VLMs). Evaluations conducted on the ViSA-Bench dataset demonstrate that ViSA-R2 consistently outperforms existing models, indicating a significant advancement in the ability to accurately interpret visual information and translate it into symbolic analytical solutions. This superior performance is evidenced across multiple evaluation metrics, including Numerical Accuracy, Structure Similarity, and Character Accuracy, confirming the model’s effectiveness in deriving both numerically correct and structurally valid solutions.

Evaluation of solution correctness utilizes three primary metrics: Numerical Accuracy, which quantifies the precision of predicted values; Structure Similarity, assessing the functional form of the solution; and Character Accuracy, verifying the correct transcription of symbolic expressions. Quantitative results demonstrate a significant improvement in Structure Similarity when utilizing Visual Language Model (VLM) inputs, increasing from 0.323 to 0.768 compared to solutions generated from Large Language Model (LLM) inputs alone; this indicates that the incorporation of visual information substantially enhances the model’s ability to correctly identify the underlying mathematical relationships within the problem.
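The article does not spell out how these metrics are implemented, but their intent can be sketched with hypothetical stand-ins: `numerical_match` samples two field expressions at test points, and `character_match` compares their transcribed symbolic strings. Both functions are illustrative assumptions, not the benchmark's actual code:

```python
import sympy as sp
from difflib import SequenceMatcher

x, y = sp.symbols('x y')

def numerical_match(pred, truth, points, tol=1e-6):
    """Fraction of sample points where two field expressions agree numerically."""
    hits = sum(
        1 for px, py in points
        if abs(float((pred - truth).subs({x: px, y: py}))) < tol
    )
    return hits / len(points)

def character_match(pred, truth):
    """Similarity ratio between canonical string forms of the two expressions."""
    return SequenceMatcher(None, sp.srepr(pred), sp.srepr(truth)).ratio()

pred  = sp.sin(x) * sp.exp(-y)   # model's inferred solution
truth = sp.exp(-y) * sp.sin(x)   # ground truth, written in a different order

pts = [(0.1 * i, 0.2 * i) for i in range(1, 6)]
```

A structure-similarity metric would more plausibly compare expression trees than raw strings; the `srepr` comparison here is only a crude proxy for that idea.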

Implications and Future Directions for Scientific AI

ViSA-R2 represents a notable advancement in artificial intelligence, shifting the focus from simply identifying patterns within data to discerning the fundamental principles that govern them. Unlike traditional machine learning models that excel at correlation, this system endeavors to establish causal relationships from visual information – a critical step towards genuine scientific discovery. By successfully predicting mathematical equations from visual representations of physical phenomena, it demonstrates an ability to extrapolate beyond observed instances and, crucially, to grasp the ‘why’ behind the data. This capability moves beyond mere observation, mirroring a key aspect of human scientific reasoning and paving the way for AI that doesn’t just process information, but truly understands it, potentially revolutionizing fields reliant on the interpretation of complex visual data.

The ability of ViSA-R2 to discern underlying mathematical principles from visual data holds considerable promise for disciplines reliant on the interpretation of complex visual information. In fields like physics and engineering, researchers often analyze graphical representations of data, from fluid dynamics simulations to stress tests on materials, to identify key relationships and validate theoretical models. Similarly, materials science frequently involves visually inspecting microscopic structures and correlating these observations with material properties. Traditionally, this process demands significant human expertise and time; however, an AI capable of autonomously extracting [latex]F = ma[/latex] or identifying patterns indicative of material failure from visual inputs can dramatically accelerate discovery. This automated analysis not only speeds up research cycles but also allows scientists to explore a broader parameter space and potentially uncover previously hidden connections within their data, fostering innovation across scientific boundaries.

Continued development of ViSA-R2 centers on broadening its capacity to interpret increasingly complex visual data, moving beyond simplified representations to encompass more nuanced and realistic physical phenomena. Researchers intend to integrate advanced physical models, such as those describing fluid dynamics or electromagnetism, directly into the AI’s analytical framework. This integration isn’t simply about recognizing patterns within these complex systems, but about enabling the model to predict behavior and extrapolate findings to entirely new, unseen scenarios. Improving generalization is paramount; the goal is to move beyond memorization of training data and foster a deeper understanding of the underlying scientific principles, allowing the AI to reliably apply its knowledge to novel visual fields and accelerate discovery across diverse scientific disciplines.

The development of AI systems like ViSA-R2 envisions a future where artificial intelligence functions not as a replacement for scientists, but as a powerful collaborative partner. This research strives to build AI assistants capable of augmenting human intellect by autonomously analyzing complex visual data, formulating hypotheses, and identifying previously hidden relationships. By handling the intensive work of data exploration and pattern recognition, these AI tools promise to free scientists to focus on higher-level reasoning, creative problem-solving, and the design of experiments. The anticipated result is a significant acceleration of the scientific process across diverse fields, enabling quicker breakthroughs and fostering innovation at an unprecedented rate – a symbiotic relationship between human ingenuity and artificial intelligence.

The research highlights an emergent property of these vision-language models: the capacity to move beyond pattern recognition and towards genuine analytical inference. This echoes John von Neumann’s assertion, “It is not enough to understand the parts; one must understand the whole.” The model’s ability to derive symbolic solutions from visual field data isn’t simply about identifying features; it’s about grasping the underlying relationships and principles governing the system. Just as a biological organism’s behavior stems from the interplay of its components, the model’s successful inference relies on a holistic understanding of the visual field, demonstrating that structure truly dictates behavior, mirroring the core idea of analytical solution inference.

Beyond the Image: Future Directions

The capacity demonstrated for inferring analytical solutions from visual field data feels less like a solved problem and more like the revealing of a deeper one. The current work establishes a proof-of-concept; the elegance lies in showing how a vision-language model, nudged by chain-of-thought, can navigate from pixel patterns to symbolic representation. Yet, the limitations are structural. The model doesn’t ‘understand’ physics; it correlates visual cues with learned symbolic outputs. A benchmark captures structure, but behavior emerges through interaction – and that interaction is currently confined to a curated dataset.

Future work must address the brittleness inherent in this approach. True scientific reasoning demands generalization: the ability to extrapolate beyond the explicitly demonstrated, to question underlying assumptions. A compelling next step involves integrating the model with systems capable of generating and testing hypotheses, effectively creating a closed loop of visual observation, symbolic inference, and predictive validation.

Ultimately, the challenge isn’t simply to read the picture, but to build a system that asks: what else could this picture mean? The field is poised to move beyond descriptive analytics toward a more generative form of visual science, but that transition will require a fundamental shift in how these models are trained and evaluated: a focus on the process of inquiry, rather than the accuracy of the answer.


Original article: https://arxiv.org/pdf/2604.08863.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/


2026-04-13 13:30