Seeing Further: Scaling Visual Reasoning with Intelligent Agents

Author: Denis Avetisyan


A new approach leverages multi-agent systems and reinforcement learning to push the boundaries of what large language models can understand from images.

Insight-V leverages a dual-agent system for reasoning and summarization, achieving performance gains across image benchmarks, while its successor, Insight-V++, employs a self-evolving training loop of Supervised Fine-Tuning and Reinforcement Learning to continuously refine visual reasoning and surpass existing results on both image and video datasets.

Insight-V++ introduces a scalable data generation pipeline and self-evolving training paradigm for advanced long-chain visual reasoning in multimodal large language models.

Despite advances in large language models, extending robust reasoning capabilities to multimodal systems remains challenging due to a scarcity of high-quality, long-chain visual reasoning data. This paper introduces Insight-V and Insight-V++, novel multi-agent systems designed to enhance visual reasoning in large language models through a scalable data generation pipeline and a self-evolving training paradigm. By employing reinforcement learning guided by a critical summary agent, Insight-V++ achieves significant performance gains on challenging image and video reasoning benchmarks. Can this iterative, self-improving loop unlock a new era of truly intelligent, visually-grounded AI systems?


The Limits of Pattern Matching: Why LLMs Struggle to See

Despite the remarkable progress of Large Language Models (LLMs) across numerous domains, standard reasoning techniques like Chain-of-Thought (CoT) frequently falter when applied to complex visual challenges. These models, while adept at generating human-like text, often exhibit a tendency to provide superficially plausible, yet ultimately incorrect, answers to questions requiring genuine visual understanding. The limitations stem from a reliance on pattern matching and statistical correlations within the training data, rather than a robust capacity for deductive reasoning about visual information; a model might correctly identify objects in an image, but struggle to infer relationships or predict outcomes based on spatial arrangements or dynamic changes. Consequently, seemingly simple tasks involving visual inference, such as understanding the stability of a tower of blocks or predicting the trajectory of a moving object, can prove surprisingly difficult for LLMs employing conventional CoT prompting.

Simply increasing the size of current models does not inherently solve the challenges of visual reasoning, especially when tasks involve understanding changes over space and time. While larger models can memorize more patterns, they often lack the capacity to truly understand the underlying relationships within a visual scene or sequence. This limitation becomes acutely apparent when reasoning about how objects interact, move, or transform – scenarios demanding a systematic, step-by-step analysis that scaling alone cannot provide. The core issue isn’t a lack of data or parameters, but rather the absence of an architecture that facilitates a more structured and efficient reasoning process, capable of breaking down complex spatial-temporal information into manageable components and drawing accurate conclusions.

Current visual reasoning systems often falter not due to a lack of data or model size, but because of an inability to break down complex problems into a series of logical, digestible steps. This deficiency results in what’s known as brittle performance – a system that functions well on narrowly defined tasks but quickly unravels when confronted with even slight variations. The inability to decompose problems hinders generalization; the system learns to recognize specific patterns rather than underlying principles. Consequently, a seemingly minor change in an image or scenario can lead to drastically incorrect conclusions, revealing a fundamental weakness in the reasoning process itself. Instead of true understanding, these systems often rely on superficial correlations, limiting their capacity to adapt to novel situations or extrapolate from existing knowledge.

Insight-V demonstrates superior reasoning and problem-solving capabilities compared to Chain-of-Thought and direct SFT methods by generating a more coherent reasoning process that effectively guides summary generation towards accurate answers.

Deconstructing Thought: The Multi-Agent Architecture of Insight-V

Insight-V is a Multi-Agent System (MAS) designed to improve the reliability of large language model outputs. The system architecture consists of two primary agents: a Reasoning Agent and a Summarization Agent. This decoupling of functions allows for dedicated processing of complex problems; the Reasoning Agent focuses solely on generating a detailed, step-by-step reasoning process, while the Summarization Agent independently evaluates this process and formulates a final, concise answer. This agent-based approach contrasts with traditional LLM architectures where reasoning and response generation are often integrated into a single process, and aims to more closely replicate human cognitive workflows by separating analysis from articulation.

Insight-V utilizes LLaVA-NeXT as its core Large Language Model (LLM) but differentiates itself through a separation of reasoning and answer generation processes. Traditional LLMs typically integrate these functions, potentially leading to inaccuracies or incomplete analysis. By decoupling these steps, Insight-V allows for a dedicated reasoning phase where the model can thoroughly analyze input and formulate a detailed solution path. This dedicated reasoning output is then processed separately to generate a final answer, enhancing the accuracy and completeness of the response and enabling more robust problem-solving capabilities.

Insight-V’s architecture functionally separates problem-solving into distinct stages performed by dedicated agents. The Reasoning Agent is responsible for generating a complete and detailed trace of the logical steps taken to arrive at a potential solution; this includes identifying relevant information, applying appropriate reasoning rules, and documenting intermediate conclusions. Subsequently, the Summarization Agent receives this reasoning process as input and performs an evaluation to assess its validity and coherence. This agent then synthesizes the key information from the reasoning trace into a concise and human-readable answer, effectively mimicking the sequential process of human thought where detailed reasoning precedes a finalized response.
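The two-stage flow described above can be sketched in a few lines. This is a minimal illustration, not Insight-V's actual interface: the `generate` callable stands in for any chat-style multimodal LLM call, and the prompts and function names are assumptions for demonstration only.

```python
# Minimal sketch of the decoupled two-agent flow, assuming a generic
# (system_prompt, user_prompt) -> text LLM callable. Prompts are illustrative.
from typing import Callable


def answer_visual_question(
    generate: Callable[[str, str], str],
    image_desc: str,
    question: str,
) -> str:
    # Stage 1: the Reasoning Agent emits a detailed trace, no final answer.
    trace = generate(
        "You are a reasoning agent. Think step by step about the image. "
        "Do NOT state a final answer.",
        f"Image: {image_desc}\nQuestion: {question}",
    )
    # Stage 2: the Summarization Agent assesses the trace, then answers.
    return generate(
        "You are a summarization agent. Check the reasoning trace for "
        "validity, then give one concise final answer.",
        f"Question: {question}\nReasoning trace:\n{trace}",
    )
```

Because the trace is produced before any answer is committed, the second agent can discard or correct flawed reasoning rather than rationalizing a premature conclusion.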

Insight-V and Insight-V++ leverage a multi-agent system derived from a single base model, decomposing tasks into reasoning and summarization, and further enhanced in Insight-V++ by specialized ST-GRPO and J-GRPO algorithms combined with a self-evolving training loop of supervised fine-tuning and reinforcement learning to achieve improved visual reasoning performance.

Forging Intelligence: Self-Evolving Training and Robust Optimization

Insight-V++ incorporates Self-Evolving Training to facilitate iterative improvement of both Reasoning and Summarization agents. This process allows the agents to refine their respective reasoning paths through continuous interaction and feedback. Specifically, the Reasoning Agent leverages insights generated by the Summarization Agent, and vice versa, creating a cyclical process of refinement. This collaborative optimization differs from static training methods by enabling the model to adapt and enhance performance based on internal feedback loops, rather than relying solely on external datasets or human intervention.
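The cyclical refinement described above can be outlined as a single round of a training loop. Everything here is a hedged sketch: the stage names `sft`, `rl`, and `make_data` are placeholders mirroring the paper's described alternation, not its actual API.

```python
# Hypothetical outline of one self-evolving round: supervised fine-tuning,
# then mutual RL refinement (each agent guided by the other), then data
# regeneration. Stage callables are assumptions, not the paper's code.
def self_evolving_round(reasoner, summarizer, dataset, sft, rl, make_data):
    reasoner = sft(reasoner, dataset)             # supervised fine-tuning pass
    reasoner = rl(reasoner, critic=summarizer)    # RL guided by the summary agent
    summarizer = rl(summarizer, critic=reasoner)  # and the reverse direction
    dataset = make_data(reasoner, summarizer)     # regenerate higher-quality data
    return reasoner, summarizer, dataset
```

Iterating this round is what makes the training "self-evolving": each pass produces both stronger agents and a better dataset for the next pass, without new external supervision.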

Insight-V++ utilizes Group Relative Policy Optimization (GRPO) algorithms to enhance the performance of its Reasoning and Summarization agents. Specifically, the Reasoning Agent is optimized using ST-GRPO, while the Summarization Agent employs J-GRPO. These algorithms facilitate performance improvements by considering the relationships within a group of tasks, allowing the agents to generalize more effectively across diverse challenges. This approach differs from standard policy optimization techniques by explicitly accounting for task dependencies and shared characteristics, leading to improved overall performance and adaptability.
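The core mechanism shared by GRPO-family algorithms is group-relative advantage estimation: several responses are sampled for the same prompt, and each response's reward is normalized against its group's statistics, removing the need for a separate value network. The sketch below shows only that base idea; the ST-GRPO and J-GRPO variants in Insight-V++ build on it in ways not reproduced here.

```python
# Base GRPO idea: advantages are computed relative to a group of sampled
# responses for the same prompt (reward minus group mean, over group std).
from statistics import mean, stdev


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each sampled response's reward against group statistics."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    if sigma == 0.0:
        return [0.0] * len(rewards)  # identical rewards -> no gradient signal
    return [(r - mu) / sigma for r in rewards]


# Example: four sampled answers scored by a verifier (1.0 = correct).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Correct responses receive positive advantages and incorrect ones negative, so the policy is pushed toward whatever distinguished the better members of the group.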

Insight-V++ leverages Qwen2.5-VL as its foundational Large Language Model (LLM), enabling advanced reasoning and summarization capabilities. Empirical evaluation demonstrates an average accuracy of 71.1% on image reasoning tasks when utilizing this LLM. This performance improvement is attributed to Qwen2.5-VL’s architecture and pre-training data, which facilitates more nuanced understanding and processing of visual information required for complex reasoning challenges. The integration of this LLM represents a key component in achieving Insight-V++’s overall performance gains.

The Insight-V series utilizes a data generation pipeline where reasoning is progressively generated and then assessed at multiple granularities to ensure high quality.

The Alchemy of Data: Refining Reasoning Through Structured Pipelines

The foundation of this system’s capabilities lies in a dedicated Data Generation Pipeline, meticulously engineered to produce structured datasets that mimic complex, multi-step reasoning processes. This pipeline doesn’t simply generate questions and answers; it constructs complete chains of thought, detailing the logical progression required to arrive at a solution. Crucially, assessment isn’t limited to a single, final answer; the pipeline employs multi-granularity evaluation, analyzing the correctness of each step within the reasoning chain. This granular feedback provides exceptionally valuable training signals for the agents, enabling them to not only learn what the correct answer is, but also how to arrive at it, fostering robust and reliable reasoning abilities.

A central element of the system’s efficacy lies in a unified data pipeline designed to serve both the reasoning and summarization agents. This pipeline doesn’t merely generate data; it meticulously crafts structured information focused on long-chain reasoning, complete with assessments at multiple levels of granularity. This consistent approach to data creation is vital, ensuring both agents receive training signals of comparable quality and form. The resulting data isn’t static; it’s actively used in a cycle of learning and refinement, allowing the agents to continuously improve their performance and adapt to increasingly complex challenges. This feedback loop, powered by high-quality, consistent data, is a key driver of the system’s overall robustness and adaptability across diverse visual reasoning tasks.
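The multi-granularity assessment the pipeline performs can be illustrated as a two-level filter: a coarse gate on the final answer and a fine gate on each intermediate step. The data shape and the `check_step` callable below are assumptions for illustration, not the paper's actual pipeline.

```python
# Illustrative filter for multi-granularity assessment: a generated trace
# is kept only if the final answer matches the reference AND every
# intermediate step passes a step-level check.
def keep_trace(steps, final_answer, gold_answer, check_step):
    if final_answer.strip() != gold_answer.strip():  # coarse, outcome-level gate
        return False
    return all(check_step(step) for step in steps)   # fine, step-level gate
```

Filtering at the step level is what lets the agents learn not just which answers are right, but which reasoning paths reliably lead to them.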

The developed system exhibits a marked advancement in visual reasoning capabilities, as evidenced by substantial performance gains across diverse benchmarks. Evaluations reveal an average accuracy of 54.2% on complex video reasoning tasks, representing a +5.6% improvement over existing methods. This proficiency extends to intricate image-based challenges, with notable results including a +2.4% increase in accuracy on the MathVision benchmark and a significant +10.2% improvement on the demanding VideoMMMU dataset. These results collectively demonstrate the system’s capacity to not only process visual information but also to apply reasoning skills, indicating a robust and adaptable approach to artificial intelligence in visual domains.

Increasing the amount of training data improves the reasoning agent's performance, leading to more informative summaries.

The pursuit of Insight-V++ feels less like engineering and more like coaxing order from the visual void. It understands that a model isn’t merely ‘learning’ to reason, but assembling a temporary truce with the inherent chaos of images. As David Marr observed, “Representation is just as important as computation.” This paper doesn’t just seek to improve visual reasoning; it attempts to construct a more persuasive representation, a more compelling illusion of understanding. The self-evolving training paradigm, with its data generation pipeline and reinforcement learning, is a ritual to appease that chaos, crafting a system that doesn’t truly know the answers, but consistently generates the appearance of knowledge, until the next unpredictable input breaks the spell.

Where Do the Ghosts Go?

The pursuit of long-chain visual reasoning, as exemplified by systems like Insight-V++, feels less like building intelligence and more like carefully constructing a more persuasive illusion. The scalability of this data generation pipeline is convenient. One suspects the agents are less ‘reasoning’ and more adept at finding statistically plausible narratives to justify pre-ordained conclusions. After all, if correlation’s high, someone cheated, or, in this case, someone carefully curated the training set. The question isn’t whether these models can reason, but how convincingly they can simulate it.

The self-evolving training paradigm offers a tantalizing, if unsettling, prospect. The system refines its own methodology. But improvement, viewed from a sufficient distance, often looks indistinguishable from entrenchment. Each iteration locks the model further into its particular flavour of bias. The real challenge isn’t scaling performance; it’s developing metrics that can detect, let alone quantify, the subtle ways in which these systems misunderstand the world. Noise, after all, is just truth without funding.

Future work will undoubtedly focus on more complex tasks and larger datasets. But the fundamental limitation remains: these models are mirrors, reflecting the patterns embedded within the data. The interesting questions aren’t about what they can see, but about what remains forever hidden in the shadows, unseen and unrepresented. The ghosts in the machine will always outnumber the angels.


Original article: https://arxiv.org/pdf/2603.18118.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/


2026-03-22 11:04