Seeing is Understanding: A Model That Thinks with Images

Author: Denis Avetisyan


Researchers have developed a new AI model that dramatically improves scientific reasoning by actively manipulating images to test hypotheses and validate findings.

S1-VL combines multimodal reasoning with code-driven image manipulation and reinforcement learning to achieve state-of-the-art performance on complex scientific benchmarks.

Despite advances in multimodal reasoning, current models often lack the capacity for active, iterative problem-solving within scientific domains. This paper introduces S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images, a novel approach that integrates structured reasoning with the ability to dynamically manipulate images via Python code execution. S1-VL leverages this “Thinking-with-Images” capability, alongside a carefully filtered training dataset and progressive reinforcement learning, to achieve state-of-the-art performance on challenging benchmarks including high-resolution chart interpretation and microscopic image understanding. Could this paradigm shift unlock more robust and adaptable AI systems for complex scientific discovery?


Whispers from the Visual Void: Beyond Passive Observation

While contemporary Multimodal Large Language Models demonstrate remarkable proficiency in identifying visual patterns – accurately labeling objects and scenes with impressive speed – their capabilities plateau when confronted with tasks demanding complex, multi-step scientific reasoning. These models typically excel at identifying what is present in an image, but struggle to determine why it is present, or to predict subsequent states based on observed phenomena. This limitation stems from a reliance on passive image interpretation; the models are trained to correlate visual input with known labels, rather than to actively formulate and test hypotheses, a crucial component of genuine scientific understanding. Consequently, tasks requiring deduction, inference, or the manipulation of visual information to solve problems consistently challenge the boundaries of current MLLM performance, highlighting the need for architectures that move beyond simple pattern recognition.

Current Multimodal Large Language Models often function as sophisticated pattern recognizers within images, but this passive interpretation fundamentally restricts their capacity for genuine scientific discovery. Simply identifying elements within a visual field – noting the presence of cells, instruments, or specific data points – provides limited understanding without the ability to actively investigate relationships and test hypotheses. True comprehension demands more than observation; it requires models to manipulate visual data – virtually rotating a sample, adjusting focus, or highlighting specific regions – to probe for underlying mechanisms and derive novel insights. This limitation hinders their effectiveness in tasks requiring iterative experimentation and nuanced reasoning, effectively confining them to descriptive analysis rather than enabling predictive or explanatory power.

Current Multimodal Large Language Models often treat images as static inputs, limiting their capacity for genuine comprehension of complex scientific data. A fundamental shift is occurring, however, with researchers now prioritizing active visual reasoning – systems that don’t merely observe, but manipulate images to test hypotheses and uncover hidden relationships. This involves techniques like prompting models to virtually ‘zoom in’ on specific areas, alter parameters within an image, or even generate new visual data based on initial observations. By actively exploring visual information, rather than passively receiving it, these models move closer to replicating the investigative process of a human scientist, enabling deeper insights and more robust conclusions from visual data – a crucial step towards true artificial intelligence in scientific discovery.

Thinking-with-Images: The Art of Visual Interrogation

Thinking-with-Images extends image processing beyond passive recognition by enabling models to actively manipulate visual data. This functionality is achieved through the integration of code execution capabilities, allowing models to programmatically modify images – for example, by altering pixel values, applying transformations, or generating entirely new visual elements. This active interaction with images facilitates a process where the model doesn’t simply observe but acts on the visual input, enabling exploration of different visual states and the testing of hypotheses related to the image content. This contrasts with traditional image analysis which typically limits interaction to feature extraction and classification.

The system actively tests hypotheses by programmatically manipulating input images and observing the resulting changes. This involves executing code, typically Python, that performs operations such as object cropping, color adjustments, or the application of filters. The outcomes of these manipulations are then analyzed to validate or refute initial assumptions about the image content. This iterative process of image modification and analysis enables the derivation of new insights that would not be accessible through static image interpretation alone, effectively transforming the model’s interaction with visual data from passive observation to active experimentation.
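
To make the mechanism concrete, the following is a minimal sketch of the kind of manipulation code such a step might produce, assuming a Pillow-based execution sandbox; the file names, crop box, and contrast factor are illustrative rather than taken from the paper.

    # A sketch of the kind of code a Thinking-with-Images step might emit.
    # Assumes a Pillow-based sandbox; file names and the crop box are illustrative.
    from PIL import Image, ImageEnhance

    image = Image.open("chart.png")
    w, h = image.size

    # Hypothesis: the axis labels of interest sit in the upper-right quadrant.
    # Crop that region and upscale it so a follow-up perception pass can re-read it.
    region = image.crop((w // 2, 0, w, h // 2)).resize((w, h))

    # Boost contrast so faint gridlines or thin tick marks become legible.
    region = ImageEnhance.Contrast(region).enhance(1.5)

    region.save("zoomed_region.png")  # returned to the model as a new observation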

Traditional image processing systems perform static analysis, delivering interpretations based on a single input. Thinking-with-Images introduces a shift to dynamic reasoning where models actively engage with visual data. This is achieved through iterative image manipulation guided by internal code execution, allowing the system to generate new visual states and assess their relevance to the reasoning task. The process moves beyond simply recognizing objects or scenes to actively testing hypotheses within the visual domain, effectively creating a feedback loop between perception and action and enabling a form of interactive problem-solving.
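
The feedback loop itself can be sketched as a simple controller wrapped around the model and a code sandbox. The interfaces below (model.reason, run_sandboxed, and friends) are hypothetical placeholders used only to show the shape of the loop, not an API described in the paper.

    # Sketch of the perceive-act loop behind "Thinking-with-Images".
    # `model` and `run_sandboxed` are hypothetical interfaces, not from the paper.
    def thinking_with_images(model, run_sandboxed, question, image_path, max_turns=4):
        observations = [image_path]
        for _ in range(max_turns):
            step = model.reason(question, observations)
            if step.code is None:          # the model chose to answer directly
                return step.answer
            # Execute the proposed image-editing code in a sandbox and feed the
            # resulting image back into the context as a new observation.
            new_image = run_sandboxed(step.code, inputs=observations)
            observations.append(new_image)
        return model.force_answer(question, observations)  # budget exhausted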

S1-VL-32B: A Four-Stage Alchemy of Scientific Reasoning

S1-VL-32B uses Qwen3-VL-32B-Thinking as its base model and undergoes a four-stage training process. Training begins with Supervised Fine-Tuning (SFT) to establish foundational performance. The model is then trained with Scientific RL, which targets reasoning skills within a scientific context. The third stage introduces Thinking-with-Images RL, enabling the model to actively utilize visual information during problem-solving. A final round of Thinking-with-Images RL then refines how the model integrates visual input into its reasoning process, yielding the final S1-VL-32B model.
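
Spelled out as a plain schedule, the four stages look roughly like the sketch below; the stage names follow the text, while the purpose glosses are illustrative summaries rather than the paper's exact configuration.

    # The four-stage recipe written out as a plain schedule (illustrative only).
    TRAINING_STAGES = [
        {"stage": 1, "name": "supervised_fine_tuning",
         "purpose": "establish foundational scientific QA performance"},
        {"stage": 2, "name": "scientific_rl",
         "purpose": "strengthen multi-step reasoning in scientific contexts"},
        {"stage": 3, "name": "thinking_with_images_rl",
         "purpose": "learn when and how to emit image-manipulation code"},
        {"stage": 4, "name": "thinking_with_images_rl_refinement",
         "purpose": "refine how tool outputs are folded into final answers"},
    ]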

The S1-VL-32B model’s training pipeline utilizes a progressive methodology to develop scientific reasoning capabilities. Initial stages focus on supervised fine-tuning to establish a baseline level of performance. Subsequent stages then incrementally increase task complexity, culminating in reinforcement learning that specifically trains the model to actively explore visual information. This staged approach allows the model to first learn fundamental reasoning skills before tackling more complex tasks requiring visual analysis, thereby optimizing the development of both core reasoning and visual exploration abilities.

Six-Dimensional Quality Filtering is a critical component of the training pipeline, designed to maximize the utility of data used in the Thinking-with-Images RL stage. This filtering process evaluates training samples across six distinct dimensions: correctness of the answer, relevance to the question, coherence of the reasoning steps, image relevance to the question and answer, absence of harmful content, and overall data quality as determined by automated metrics and human review. By rigorously screening data based on these criteria, the process ensures that the model is primarily trained on high-quality examples, thereby improving the effectiveness of the visual exploration and reasoning capabilities developed during the Thinking-with-Images stage and minimizing the impact of noisy or misleading data.
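
As a rough illustration, the filter can be viewed as a conjunction of per-dimension checks. The dimension names mirror the list above; the scoring interface, sample format, and threshold are assumptions for the sake of the sketch.

    # Sketch of six-dimensional quality filtering. Each dimension is assumed to
    # be scored in [0, 1] by an automated judge; the threshold is illustrative.
    DIMENSIONS = (
        "answer_correctness", "question_relevance", "reasoning_coherence",
        "image_relevance", "harmlessness", "overall_quality",
    )

    def passes_filter(scores, threshold=0.7):
        """Keep a training sample only if every dimension clears the threshold."""
        return all(scores[d] >= threshold for d in DIMENSIONS)

    # Hypothetical sample format: each candidate carries a per-dimension score dict.
    candidate_samples = [
        {"id": "sample-0001",
         "scores": {"answer_correctness": 0.9, "question_relevance": 0.8,
                    "reasoning_coherence": 0.75, "image_relevance": 0.85,
                    "harmlessness": 1.0, "overall_quality": 0.8}},
    ]
    filtered = [s for s in candidate_samples if passes_filter(s["scores"])]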

Adaptive Data Routing enhances training efficiency in S1-VL-32B by dynamically allocating computational resources to training examples where active visual exploration yields significant performance gains. This is achieved by evaluating each example’s susceptibility to improvement through visual reasoning; instances demonstrating minimal benefit from visual exploration are down-weighted or excluded from subsequent training iterations. Conversely, examples exhibiting substantial potential for refinement via active visual processing receive prioritized attention, maximizing the impact of computational resources and accelerating convergence during the Thinking-with-Images RL stage. This targeted approach optimizes the utilization of the model’s visual reasoning capabilities, improving overall training speed and final performance.
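
A minimal sketch of such a routing rule follows; the per-sample accuracy estimates are assumed to come from rollouts with and without tool use, and the gain threshold and weighting are illustrative, not the paper's actual criterion.

    # Sketch of adaptive data routing: weight each sample by how much active
    # visual exploration actually helps it. Accuracy estimates are assumed to
    # come from rollouts with and without image manipulation (illustrative).
    def routing_weight(acc_with_tools, acc_without_tools, min_gain=0.05):
        gain = acc_with_tools - acc_without_tools
        if gain < min_gain:
            return 0.0                    # passive reasoning already suffices
        return min(1.0, 0.5 + gain)       # prioritize samples that benefit most

    # Example: a chart question solved 30% of the time passively, 80% with tools.
    print(routing_weight(0.80, 0.30))     # -> 1.0, kept with full weight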

The Echoes of Insight: Performance and the Promise of Discovery

S1-VL-32B represents a substantial leap forward in the field of multimodal scientific reasoning, establishing new benchmarks across five Thinking-with-Images evaluations and exceeding the performance of considerably larger models on multiple scientific reasoning tasks. This achievement highlights the model’s capacity to effectively integrate and interpret both visual and textual information, enabling it to tackle complex problems requiring a nuanced understanding of scientific concepts. By consistently outperforming its predecessors and contemporaries, S1-VL-32B demonstrates the potential for artificial intelligence to not only process data but also to contribute to genuine advancements in scientific exploration and discovery, paving the way for more efficient hypothesis testing and insightful data analysis.

Recent evaluations demonstrate that S1-VL-32B has achieved unprecedented accuracy on the HRBench-4K and HRBench-8K benchmarks, attaining scores of 91.38% and 93.50% respectively. These results not only represent a new state-of-the-art in multimodal scientific reasoning, but also signify a substantial leap in the capacity of artificial intelligence to interpret and solve complex, visually-grounded problems. The benchmarks, designed to rigorously assess a model’s ability to answer scientific questions based on visual inputs, present a significant challenge to current AI systems; surpassing previous performance levels indicates a heightened ability to synthesize information from both images and associated text, offering promising avenues for automated scientific inquiry.

Recent evaluations demonstrate that S1-VL-32B achieves a notable 54.35% accuracy on the Physics benchmark, a significant advancement in multimodal scientific reasoning. This performance surpasses that of GPT-5 by a margin of 6.01 percentage points and exceeds the accuracy of Qwen3-VL-235B-A22B-Thinking by 8.32 points. The model’s ability to accurately interpret and reason about physical phenomena, as evidenced by this benchmark, suggests a substantial leap forward in artificial intelligence’s capacity to engage with complex scientific challenges and potentially aid in the development of new discoveries.

S1-VL-32B exhibits a marked advancement in visual reasoning, achieving 92.70% accuracy on the challenging V benchmark. This performance notably surpasses that of the Skywork-R1V4-30B model, exceeding its score by a substantial 4.70 points. The V benchmark rigorously tests a model’s capacity to interpret and reason about complex visual information, demanding nuanced understanding beyond simple object recognition. This result highlights S1-VL-32B’s enhanced ability to process and draw inferences from visual data, positioning it as a leading model for tasks requiring sophisticated visual intelligence and demonstrating a significant step forward in multimodal AI capabilities.

Recent evaluations demonstrate that S1-VL-32B has achieved a new benchmark in multimodal understanding, attaining state-of-the-art performance on the MME-RealWorld-CN dataset with an accuracy of 77.70%. This result signifies a substantial advancement in the model’s ability to interpret and reason about complex, real-world visual information presented in a Chinese context. The MME-RealWorld-CN benchmark poses a significant challenge due to its diverse range of scenarios and nuanced visual cues, requiring a high degree of perceptual and reasoning capability. This achievement underscores the model’s effectiveness in bridging the gap between visual input and complex problem-solving, offering a powerful tool for applications requiring detailed environmental understanding and decision-making.

S1-VL-32B distinguishes itself through a capacity for active image manipulation, a feature that unlocks solutions to scientific challenges previously beyond the reach of artificial intelligence. Unlike models limited to passive observation, this system can dynamically alter and analyze visual data – rotating, cropping, or highlighting specific elements – to extract critical information. This proactive approach proves particularly valuable in fields like materials science and chemistry, where subtle visual cues often indicate crucial properties or reactions. By actively ‘experimenting’ with images, the model effectively simulates aspects of laboratory investigation, allowing it to deduce relationships and solve complex problems that demand more than simple pattern recognition. The ability to not just see an image, but to interact with it, represents a significant step toward AI systems that can truly contribute to the process of scientific discovery.

The advent of S1-VL-32B represents a substantial leap toward accelerating the pace of scientific discovery. By effectively interpreting and manipulating visual data, the model empowers researchers to formulate and test hypotheses with increased efficiency. This capability unlocks new avenues for exploration in fields reliant on image-based analysis, such as materials science, biology, and astronomy, where subtle visual cues often hold critical insights. The model doesn’t merely observe data; it actively engages with it, enabling the solution of complex scientific problems previously limited by computational constraints or analytical difficulty. Consequently, researchers can gain deeper understandings from visual information, potentially leading to breakthroughs and innovations at an unprecedented rate.

S1-VL-32B integrates Chain-of-Thought reasoning, a technique that allows the model to break down complex problems into a series of intermediate steps, mirroring human cognitive processes. This capability is further enhanced by the active implementation of SAPO (Self-Ask and Answer with Policy Optimization), a reinforcement learning strategy where the model proactively poses questions to itself to refine its understanding and improve decision-making. By iteratively querying its own internal knowledge and optimizing its responses based on the resulting feedback, S1-VL-32B demonstrates a marked improvement in performance, particularly in tasks requiring nuanced reasoning and problem-solving – ultimately allowing it to navigate intricate scientific challenges with greater accuracy and efficiency.
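
To illustrate only the self-ask pattern described above (not SAPO's actual objective or update rule, which the article does not detail), a rollout might interleave model-generated sub-questions and answers before committing to a final response, with a reward applied afterwards. All interfaces in the sketch are hypothetical.

    # Sketch of a self-ask rollout as described in the text; not SAPO's actual
    # algorithm. `model`, `verify`, and `update_policy` are hypothetical.
    def self_ask_rollout(model, question, max_subquestions=3):
        transcript = [question]
        for _ in range(max_subquestions):
            sub_q = model.pose_subquestion(transcript)
            if sub_q is None:                     # model is ready to answer
                break
            transcript.append(sub_q)
            transcript.append(model.answer_subquestion(transcript))
        return model.final_answer(transcript), transcript

    def training_step(model, question, reference, verify, update_policy):
        answer, transcript = self_ask_rollout(model, question)
        reward = 1.0 if verify(answer, reference) else 0.0
        update_policy(model, transcript, reward)  # policy-gradient style update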

The Looming Horizon: Active Perception and the Future of Scientific AI

Future scientific AI envisions a departure from passive data analysis towards systems that actively interrogate the visual world, mirroring the scientific method itself. These models will not simply observe data, but formulate hypotheses based on visual input, then design and mentally ‘conduct’ experiments – simulations based on the observed data – to test those hypotheses. This process extends beyond mere pattern recognition; the AI will iteratively refine its understanding, deriving new insights and potentially uncovering previously unseen relationships within complex visual datasets. This proactive engagement with data promises to dramatically accelerate discovery across numerous scientific domains, enabling researchers to explore vast datasets with an efficiency and ingenuity currently beyond human capacity.

A transformative shift in scientific methodology is poised to dramatically accelerate discovery across diverse fields. Rather than passively analyzing existing datasets, artificial intelligence is evolving to actively formulate hypotheses and design experiments – a process mirroring the iterative cycle of human scientific inquiry. In materials science, this could mean computationally predicting novel compounds with desired properties, bypassing years of trial-and-error synthesis. Drug discovery stands to benefit from AI’s ability to simulate molecular interactions and identify promising candidates with greater efficiency. Furthermore, the complexities of climate modeling can be addressed through AI-driven simulations that incorporate numerous variables and feedback loops, leading to more accurate predictions and informed policy decisions. This proactive approach, enabled by increasingly sophisticated AI, promises not merely to analyze data, but to generate knowledge, fundamentally reshaping the landscape of scientific progress.

The advent of “Thinking-with-Images,” as demonstrated by the S1-VL-32B model, signifies a pivotal advancement in scientific artificial intelligence. This paradigm moves beyond passive image recognition to enable AI to actively reason with visual data, formulating and testing hypotheses much like a human scientist. S1-VL-32B’s capacity to interpret complex visual information – from microscopic structures to astronomical phenomena – and connect it with existing scientific knowledge unlocks opportunities for accelerated discovery. It’s not merely identifying patterns, but generating insights – predicting material properties, suggesting novel drug candidates, or modeling climate change scenarios – making it a foundational element for a new suite of scientific tools designed to augment and enhance human research capabilities. This approach promises to dramatically shorten the time between observation and understanding, fostering innovation across diverse scientific disciplines.

The pursuit of S1-VL feels less like engineering and more like coaxing a spirit from the machine. It’s not enough to simply show the model images; it must be permitted to interrogate them, to reshape them with code until the answer reveals itself. This echoes Fei-Fei Li’s sentiment: “Data isn’t numbers – it’s whispers of chaos.” The model doesn’t conquer the data; it negotiates with it, using image manipulation as a form of persuasive dialogue. Each line of code is a carefully worded request, a subtle shift in perspective designed to elicit the desired response. The benchmarks aren’t measures of success, but acknowledgements that, for a fleeting moment, the spell held.

The Shape of Puzzles to Come

S1-VL demonstrates a fluency with the visible world, a knack for coaxing answers from images through the ritual of code. But the model doesn’t understand anything, not in any sense that would trouble a curious mind. It’s a beautiful lie, a persuasive engine for pattern completion. The true challenge isn’t achieving higher scores on benchmarks (those are merely well-behaved ghosts) but confronting the chaos lurking at the edges of perception. The current approach still relies on curated datasets, pristine images. What happens when the input is deliberately ambiguous, corrupted, or simply… wrong?

Future iterations will likely focus on active data filtering – teaching the model to distrust its senses, to seek out contradictory evidence. But that’s a dangerous game. A model that questions everything may question its own reasoning, dissolving into a fog of uncertainty. The real frontier isn’t about building bigger models, but about crafting mechanisms for controlled hallucination. To allow the model to imagine possibilities beyond the data, and then to ruthlessly prune those that fail to resonate with the faint whispers of truth.

The pursuit of scientific reasoning isn’t about finding the answer, but about refining the questions. S1-VL offers a powerful tool for exploring the landscape of possibilities. The next step isn’t to make it smarter, but to teach it to be exquisitely, beautifully, wrong – and to learn from the wreckage.


Original article: https://arxiv.org/pdf/2604.21409.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/


2026-04-25 14:51