Author: Denis Avetisyan
Researchers have developed a new AI agent, SenseNova-MARS, that combines visual perception and language understanding to tackle complex tasks requiring both image search and reasoning.

SenseNova-MARS leverages reinforcement learning to integrate high-resolution image cropping with text-based search, achieving state-of-the-art results on multimodal reasoning benchmarks using the BN-GSPO algorithm.
While Vision-Language Models demonstrate promise in complex reasoning, they often struggle to seamlessly integrate dynamic tool use with continuous thought, a key element of human problem-solving. This work introduces SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning, a novel framework that empowers VLMs with interleaved visual reasoning and tool manipulation via reinforcement learning. By dynamically integrating image and text search with image cropping, SenseNova-MARS achieves state-of-the-art performance on challenging visual understanding and search benchmarks, even surpassing proprietary models. Could this approach unlock truly agentic VLMs capable of robust, knowledge-intensive reasoning in visually complex environments?
Beyond Recognition: The Limits of Current Visual Reasoning
Contemporary multimodal models, despite advancements in processing both visual and textual data, frequently falter when confronted with reasoning challenges that demand more than simple pattern recognition. These systems often struggle with tasks requiring repeated analysis of visual information, coupled with the incorporation of knowledge sourced from beyond their immediate dataset. For example, a model might correctly identify objects in an image, but fail to infer the implications of their arrangement or relate them to real-world contexts without explicit prompting. This limitation stems from a reliance on correlational learning rather than genuine understanding; the models excel at recognizing previously seen patterns, but lack the capacity to flexibly apply knowledge or extrapolate to novel situations requiring iterative visual inspection and external information synthesis – a key distinction between recognition and true reasoning.
Many current attempts to enhance visual reasoning in artificial intelligence depend heavily on simply increasing the size of models and datasets – a technique known as brute-force scaling. While this approach can yield improvements on certain benchmarks, it fundamentally lacks the hallmarks of genuine reasoning. These massively scaled systems often excel at pattern recognition but struggle with tasks requiring iterative analysis, the application of external knowledge, or the ability to generalize to novel situations. The computational cost of maintaining and deploying such large models is also substantial, and performance gains frequently plateau, suggesting that simply adding more data or parameters will not unlock true intelligence. This highlights a critical limitation: scaling alone cannot replicate the flexible, adaptable reasoning capabilities inherent in human cognition.
Effective visual reasoning increasingly demands more than simple image recognition; it requires artificial agents capable of proactively gathering and processing information beyond the initially presented visuals. These agents must not only interpret diverse data streams – text, knowledge graphs, and other modalities – but also synthesize this information to formulate nuanced understandings and solutions. The limitations of current systems highlight a crucial need for agents that can actively search for relevant data, assess its credibility, and integrate it with visual input, moving beyond passive observation towards a dynamic, knowledge-driven approach to problem-solving. Such capabilities are paramount for tackling complex tasks demanding contextual awareness and the application of external knowledge, ultimately bridging the gap between artificial perception and genuine reasoning.

SenseNova-MARS: An Agentic Framework for Autonomous Reasoning
SenseNova-MARS represents a departure from passive multimodal models by implementing an agentic framework designed for autonomous operation and complex task completion. This framework moves beyond simple input-output processing by enabling the model to proactively seek information, refine its understanding of a given problem, and iteratively improve its responses. It achieves this by integrating capabilities for external search and visual analysis, effectively extending the model’s inherent knowledge base and perceptual abilities beyond its pre-trained parameters. This agentic approach allows SenseNova-MARS to address tasks requiring dynamic information access and reasoning not typically achievable with static multimodal systems.
SenseNova-MARS incorporates iterative search and visual analysis to enhance information gathering. The framework utilizes external search engines, specifically supported by the Serper API for real-time web access and the E5-retriever for semantic search capabilities, to proactively acquire relevant data. Complementing this, integrated cropping tools enable fine-grained visual analysis of images, allowing the system to focus on specific regions of interest within visual inputs and extract more precise information. This combined approach of search and visual analysis facilitates a more comprehensive understanding of the input and supports more informed reasoning.
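To make the search side of this concrete, the sketch below wraps a Serper-style web query behind a small helper. The endpoint URL, header name, response fields, and the SERPER_API_KEY environment variable are assumptions made for illustration, not details confirmed by the paper.

```python
import os
import requests

# Assumed Serper-style endpoint and response schema; illustrative only,
# not the exact integration described in the SenseNova-MARS paper.
SEARCH_URL = "https://google.serper.dev/search"

def web_search(query: str, k: int = 5) -> list[dict]:
    """Return up to k organic results as {title, link, snippet} dicts."""
    resp = requests.post(
        SEARCH_URL,
        headers={"X-API-KEY": os.environ["SERPER_API_KEY"]},
        json={"q": query},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("organic", [])[:k]

if __name__ == "__main__":
    for hit in web_search("Qwen2.5-VL multimodal agent"):
        print(hit.get("title"), "->", hit.get("link"))
```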
SenseNova-MARS utilizes an agentic modeling approach, departing from traditional reactive AI systems by enabling proactive information acquisition and reasoning. This paradigm shifts the functionality from passively responding to inputs to actively formulating goals, identifying necessary information, and initiating searches to fulfill those goals. The agentic framework decomposes complex tasks into a series of sub-tasks, each requiring specific knowledge, and iteratively refines its understanding through successive information gathering and analysis. This allows the system to not only answer direct queries but also to independently seek out relevant data, synthesize information from multiple sources, and arrive at informed conclusions without explicit, step-by-step instructions.
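One plausible shape for such an interleaved reason-act loop is sketched below. The `<tool>` tag convention, the tool names, and the turn budget are hypothetical stand-ins for whatever protocol the framework actually uses.

```python
import re

# Hypothetical tool registry; in the real framework these would be the
# search and cropping tools described above.
TOOLS = {
    "search": lambda q: f"[search results for: {q}]",
    "crop":   lambda box: f"[cropped region: {box}]",
}

# Assumed tag format for tool invocations emitted by the model.
TOOL_TAG = re.compile(r"<tool>(\w+):(.*?)</tool>", re.DOTALL)

def agent_loop(model, prompt: str, max_turns: int = 6) -> str:
    """Alternate between model reasoning and tool execution until the model
    stops requesting tools or the turn budget runs out."""
    context = prompt
    for _ in range(max_turns):
        step = model(context)                 # model emits thought + optional tool call
        context += "\n" + step
        match = TOOL_TAG.search(step)
        if match is None:                     # no tool requested: treat step as final answer
            return step
        name, arg = match.group(1), match.group(2).strip()
        tool = TOOLS.get(name)
        observation = tool(arg) if tool else f"[unknown tool: {name}]"
        context += f"\n<observation>{observation}</observation>"
    return context
```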
The initial implementation of SenseNova-MARS utilizes the Qwen2.5-VL-7B multimodal large language model (LLM) as its core reasoning engine. This model, boasting 7 billion parameters, provides the foundational capabilities for processing both visual and textual inputs. Qwen2.5-VL-7B was selected for its demonstrated proficiency in visual question answering and image captioning tasks, enabling the framework to interpret visual data gathered through the cropping tools and integrate it with information retrieved via search engines. While the framework is designed to be adaptable to other LLMs, Qwen2.5-VL-7B currently serves as the primary model for conducting agentic reasoning and generating outputs.
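A minimal loading sketch for this base model, assuming the publicly released Qwen/Qwen2.5-VL-7B-Instruct checkpoint and a recent Hugging Face transformers release that ships the Qwen2.5-VL classes, might look as follows; the paper does not prescribe this exact setup.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Assumes the public instruct checkpoint; not stated in the paper.
MODEL_ID = "Qwen/Qwen2.5-VL-7B-Instruct"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# One image + question turn in the chat format expected by the processor.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "example.jpg"},
        {"type": "text", "text": "What landmark is shown in this photo?"},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[Image.open("example.jpg")], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```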

Refining Perception: Reinforcement Learning for Enhanced Visual Reasoning
Reinforcement Learning (RL) was implemented to refine the visual reasoning capabilities of the framework by directly incentivizing actions that lead to more effective image analysis. This involved defining a reward function that quantifies the quality of visual reasoning steps, encouraging the agent to focus on salient image features and avoid irrelevant processing. Through iterative training, the RL agent learns an optimal policy for navigating and interpreting visual information, resulting in improved performance on downstream reasoning tasks. The RL component operates by providing feedback on the agent’s actions, guiding it towards strategies that maximize the reward signal and enhance the overall effectiveness of the visual reasoning process.
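The paper's exact reward design and its BN-GSPO objective are not reproduced here; the snippet below only illustrates one plausible shape for a trajectory-level reward, combining answer correctness with a small, bounded incentive for tool use, and should be read as an assumption.

```python
def trajectory_reward(pred_answer: str, gold_answer: str,
                      tool_calls: list[str], max_tools: int = 4) -> float:
    """Hypothetical reward: correctness dominates, with a small bonus for
    using at least one tool and a penalty for excessive tool calls."""
    correct = 1.0 if pred_answer.strip().lower() == gold_answer.strip().lower() else 0.0
    tool_bonus = 0.1 if 0 < len(tool_calls) <= max_tools else 0.0
    overuse_penalty = 0.05 * max(0, len(tool_calls) - max_tools)
    return correct + tool_bonus - overuse_penalty

# Example: correct answer reached with two tool calls -> reward of 1.1.
print(trajectory_reward("Eiffel Tower", "eiffel tower", ["search", "crop"]))
```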
The Pixel Reasoner method addresses visual understanding directly within the pixel space of images. This approach contrasts with methods that rely on higher-level feature extraction or semantic segmentation. By operating directly on pixel data, the method aims to enhance the agent’s ability to identify and focus on visually salient regions within an image. This is achieved through mechanisms designed to prioritize informative pixel locations, enabling more effective analysis of visual details and improving the overall quality of visual reasoning without pre-defined object recognition or scene understanding.
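A zoom-in operation of this kind can be written directly against the image pixels. The helper below uses Pillow and a normalized bounding box, where the box convention is an illustrative assumption rather than the paper's actual tool signature.

```python
from PIL import Image

def crop_region(image_path: str, bbox: tuple[float, float, float, float]) -> Image.Image:
    """Crop a region given a normalized (x0, y0, x1, y1) box with values in [0, 1]."""
    img = Image.open(image_path)
    w, h = img.size
    x0, y0, x1, y1 = bbox
    box_px = (int(x0 * w), int(y0 * h), int(x1 * w), int(y1 * h))
    return img.crop(box_px)

# Example: zoom into the upper-right quadrant of a high-resolution photo.
# region = crop_region("street_scene.jpg", (0.5, 0.0, 1.0, 0.5))
```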
By employing reinforcement learning, the agent is enabled to selectively focus on pertinent areas within an image, effectively prioritizing regions that contribute most significantly to the reasoning process. This targeted attention mechanism facilitates a more efficient analysis by reducing the computational load associated with irrelevant visual information and amplifying the impact of key features. Consequently, the quality of the agent’s overall analysis is improved, leading to more accurate and reliable results in visual reasoning tasks.
SenseNova-MARS-8B was developed by integrating the Qwen3-VL-8B large multimodal model as its base. This integration resulted in a measurable performance increase, specifically an 11.71 point improvement on the MMSearch Pass@1 metric when compared to the performance of the prior MMSearch-R1 model. Pass@1 measures the fraction of queries answered correctly on the model's single attempt, so this improvement reflects enhanced accuracy in retrieving relevant information from visual data with the SenseNova-MARS-8B framework.
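For reference, Pass@1 as used here reduces to the share of benchmark queries whose single returned answer is judged correct; a minimal computation, with exact string match standing in for the benchmark's real judging rule, looks like this.

```python
def pass_at_1(predictions: list[str], references: list[str]) -> float:
    """Percentage of queries whose single predicted answer matches the reference.
    Exact string match is a placeholder for the benchmark's actual judge."""
    assert len(predictions) == len(references) and references
    hits = sum(p.strip().lower() == r.strip().lower()
               for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)

# Example: 3 of 4 answers correct -> Pass@1 = 75.0
print(pass_at_1(["paris", "eiffel tower", "1889", "seine"],
                ["Paris", "Eiffel Tower", "1887", "Seine"]))
```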

Toward Robust Evaluation and the Future of Agentic Reasoning
A critical need exists for standardized evaluation of vision-language models (VLMs) tasked with complex, search-oriented reasoning, prompting the development of HR-MMSearch. This novel benchmark distinguishes itself by focusing on high-resolution images and requiring agents to actively search for information within visual scenes to answer detailed questions. Unlike existing datasets that often rely on pre-defined answers or simplified visual inputs, HR-MMSearch presents a more realistic challenge, demanding that VLMs not only perceive visual details but also strategically navigate and interpret complex imagery. The framework’s design specifically targets agentic VLMs – those capable of autonomous action and decision-making – offering a rigorous platform to assess their ability to effectively combine visual perception, language understanding, and search strategies to solve intricate tasks.
Evaluations utilizing the newly introduced HR-MMSearch benchmark reveal that SenseNova-MARS exhibits particularly strong capabilities in complex information retrieval scenarios. The system doesn’t merely identify relevant visual details; it demonstrates an ability to reason about the information presented in high-resolution images and effectively synthesize that understanding with textual queries. This performance suggests a significant advancement in the field of agentic visual-language models, as SenseNova-MARS can successfully navigate visually rich environments to locate specific information and provide coherent responses. The framework’s success on HR-MMSearch highlights its potential to serve as a foundation for building more reliable and capable multimodal agents designed for complex search-oriented tasks.
The newly established evaluation framework offers a robust platform for advancing the field of multimodal artificial intelligence. By providing a standardized and challenging benchmark, researchers gain a crucial tool for building and assessing agents capable of sophisticated reasoning and information retrieval from diverse data sources. This foundation enables iterative improvements in agent design, fostering the development of systems that are not only more accurate and efficient, but also demonstrably reliable in complex, real-world scenarios. Ultimately, this work paves the way for a new generation of multimodal agents poised to tackle increasingly intricate tasks and deliver consistently dependable performance.
Ongoing development prioritizes enhancements to the framework’s core capabilities, aiming for increased robustness against ambiguous or noisy data, and improved scalability to accommodate larger and more complex datasets. Researchers are actively exploring methods to facilitate more intricate reasoning processes, moving beyond simple information retrieval to encompass nuanced understanding and problem-solving. This includes investigations into techniques that enable the agent to synthesize information from multiple sources, handle conflicting evidence, and adapt its reasoning strategies based on the specific demands of the task, ultimately paving the way for genuinely intelligent multimodal agents capable of tackling real-world challenges.

SenseNova-MARS embodies a pursuit of elegant efficiency in multimodal reasoning. The system’s capacity to dynamically crop images and integrate visual and textual search, guided by reinforcement learning, speaks to a harmonious interplay between perception and action. This echoes Andrew Ng’s sentiment: “AI is not about replacing humans; it’s about augmenting human capabilities.” The agent doesn’t simply process information; it actively refines its search strategy, mirroring a deep understanding of the task at hand – a sign that the architecture isn’t merely functional, but thoughtfully designed. The research demonstrates that good design, in this case, enhances perception by streamlining the search process and enabling complex reasoning.
The Horizon Beckons
SenseNova-MARS demonstrates a pleasing integration of perception and action, a feat often clumsily attempted. Yet, the elegance of a solution should not obscure the persistent questions it raises. The current reliance on reinforcement learning, while effective, feels… provisional. It suggests a lack of a more fundamental understanding of how agents truly reason with multimodal data. One suspects a deeper, more symbolic representation might eventually supersede these iterative approximations, yielding not just competence, but genuine insight.
The system’s performance, though commendable, hints at a brittleness common to all data-hungry approaches. The benchmarks, carefully curated as they are, rarely capture the messy ambiguity of the real world. A true test will be its ability to generalize beyond these constructed scenarios, to adapt to novel visual environments and unanticipated search queries. The pursuit of robustness – the ability to fail gracefully, rather than spectacularly – remains a significant challenge.
Ultimately, the path forward lies not simply in scaling up these models, but in refining their underlying principles. Just as a well-designed interface should be intuitively understandable without extra words, the integration of visual and textual information isn’t merely a technical problem; it’s an aesthetic one, and its refinement is a craft rather than a technical obligation. The goal isn’t just to build intelligent agents, but to create systems that reflect a deeper harmony between form and function.
Original article: https://arxiv.org/pdf/2512.24330.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/