Beyond Sight: AI Agents Fill in the Gaps with Semantic Reasoning

Author: Denis Avetisyan


A new framework combines the power of large language models and collaborative agents to achieve more human-like amodal completion, enabling AI to ‘understand’ obscured objects.

The proposed framework achieves high-fidelity scene completion through a closed-loop reasoning process: it begins with holistic collaborative parsing of spatial geometry, proceeds through iterative verification to correct segmentation inaccuracies and address occlusions, and culminates in the generation of diverse, plausible semantic hypotheses before a single-pass synthesis of the completed scene. The process is fundamentally grounded in the decoupling of semantic planning from visual synthesis.

This review details a multi-agent system leveraging chain-of-thought reasoning and a novel metric (MAC-Score) for robust and semantically consistent amodal completion.

Inferring unseen object parts remains a challenge for generative models, often hindered by inconsistencies and error accumulation. This is addressed in ‘Reasoning-Driven Amodal Completion: Collaborative Agents and Perceptual Evaluation’, which introduces a novel framework employing collaborative multi-agent reasoning, explicitly decoupling semantic planning from visual synthesis. By leveraging large language models and a self-correcting verification agent, the approach generates more coherent and plausible completions, assessed using a new human-aligned metric, the MAC-Score. Could this framework pave the way for generative models that truly “understand” incomplete scenes, mirroring human perceptual abilities?


The Challenge of Incomplete Perception: A Fundamental Discrepancy

The human visual system demonstrates a remarkable ability to perceive objects as whole, even when partially hidden by other objects – a phenomenon known as amodal completion. Unlike current computer vision systems, which typically process only visible pixels, humans intuitively “fill in” the missing parts, constructing a complete representation of the scene. This effortless perception isn’t simply guesswork; it’s a sophisticated process rooted in prior knowledge, contextual understanding, and an inherent grasp of object structure. While algorithms can detect edges and shapes, replicating this holistic understanding remains a significant challenge, as most systems struggle to infer the presence and form of occluded features with the same speed and accuracy as a person. This discrepancy highlights a fundamental gap between how machines “see” and how humans perceive the world, underscoring the complexity of visual cognition.

Despite advancements in computer vision, current amodal completion techniques – those designed to infer occluded parts of objects – frequently falter when tasked with creating realistically complete scenes. Both data-driven, or training-based, methods and those relying on pre-defined rules struggle to consistently generate completions that make sense in a broader context. Often, these systems prioritize simply ‘filling in the gaps’ without regard for whether the completed shape is semantically plausible – a chair with legs extending through a table, for example – or structurally sound, resulting in completions that violate physical expectations. This limitation highlights a critical gap between current algorithmic approaches and human perception, where seamless completion is intrinsically linked to a deep understanding of object properties, spatial relationships, and the surrounding environment.

The ability to perceive a complete scene despite visual obstructions represents a significant hurdle for artificial intelligence. Current computer vision systems frequently falter when tasked with inferring the hidden geometry and semantics of occluded objects, a process humans perform with remarkable ease. This challenge isn’t merely about filling in gaps; it demands a sophisticated form of reasoning – an ability to predict plausible continuations of shapes, understand object relationships, and maintain semantic consistency even when information is incomplete. Successfully addressing this requires moving beyond pixel-level completion and towards systems capable of building a coherent internal representation of the world, effectively ‘imagining’ the missing parts of a scene based on prior knowledge and contextual cues. This necessitates algorithms that can not only detect edges and surfaces but also interpret their meaning and integrate them into a unified, believable perception.

Our framework addresses complex occlusions by combining structural and semantic reasoning to infer hidden details with diverse hypothesis generation, and is evaluated using the MAC-Score, a new metric that robustly assesses amodal completion and resolves biases present in traditional evaluation methods.

Decoupled Reasoning: A Multi-Agent Architecture for Completion

The Collaborative Multi-Agent Reasoning Framework functions by explicitly decoupling semantic planning from visual synthesis. This separation allows for specialized processing of the reasoning and generation stages; the framework does not attempt to perform these tasks simultaneously within a single model. Semantic planning focuses on high-level understanding and decision-making regarding the desired output, while visual synthesis is responsible for the actual generation of visual content based on the semantic plan. This modularity enables the utilization of specialized agents, each optimized for its respective task, and facilitates more robust handling of complex scenarios, such as those involving occluded or incomplete visual information.
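To make this decoupling concrete, the sketch below (a simplified illustration, not the authors’ code) separates a planning stage, which decides what the hidden content should be, from a synthesis stage, which only renders what the plan dictates. All class and function names here are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class SemanticPlan:
    """High-level description of the content to synthesize (hypothetical structure)."""
    object_label: str           # what the occluded object is
    occluded_parts: list[str]   # which parts must be inferred
    layout_hint: str            # where the hidden content plausibly sits

def plan_semantics(scene_description: str, occluded_region: str) -> SemanticPlan:
    """Stage 1: decide WHAT should appear behind the occluder.
    In the framework this role belongs to the reasoning agents; the constant
    return value here is only a stand-in for that reasoning."""
    return SemanticPlan(
        object_label="cat",
        occluded_parts=["hind legs", "tail"],
        layout_hint="extending behind the sofa armrest",
    )

def synthesize_visuals(plan: SemanticPlan) -> str:
    """Stage 2: render pixels that satisfy the plan (e.g., a diffusion inpainter).
    The synthesizer never re-decides semantics; it only executes the plan."""
    prompt = f"a {plan.object_label} with visible {', '.join(plan.occluded_parts)}, {plan.layout_hint}"
    return prompt   # stand-in for a call to an image generator

plan = plan_semantics("a cat partially hidden by a sofa", "right half of the cat")
print(synthesize_visuals(plan))
```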

The Collaborative Multi-Agent Reasoning Framework employs dedicated agents for scene analysis, specifically an Occlusion Analysis Agent and a Segmentation Agent. The Segmentation Agent initially partitions the input scene into distinct objects and regions. Subsequently, the Occlusion Analysis Agent operates on this segmented scene to identify areas where objects are partially or fully hidden from view. This agent utilizes depth information and contextual reasoning to determine the boundaries of occluded regions and estimate the extent of the missing content. The identified occluded regions and segmented objects are then passed to the Semantic Planning component for higher-level reasoning and content generation.
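As a rough illustration of what an occlusion-analysis step can do with segmentation masks and depth, the following sketch flags which object plausibly hides which. The adjacency-plus-depth heuristic is an assumption made for exposition, not the agent described in the paper.

```python
import numpy as np

def _touches(a: np.ndarray, b: np.ndarray) -> bool:
    """True if boolean mask a overlaps or borders mask b (4-neighbour shift test)."""
    if (a & b).any():
        return True
    for axis, shift in ((0, 1), (0, -1), (1, 1), (1, -1)):
        if (np.roll(a, shift, axis=axis) & b).any():
            return True
    return False

def occlusion_relations(masks: dict[str, np.ndarray], depth: np.ndarray, tol: float = 0.05):
    """Given binary instance masks and a depth map (smaller = closer), report which
    object plausibly occludes which. A heuristic sketch, not the paper's agent."""
    relations = []
    names = list(masks)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if not _touches(masks[a], masks[b]):
                continue
            da, db = depth[masks[a]].mean(), depth[masks[b]].mean()
            if abs(da - db) > tol:
                front, back = (a, b) if da < db else (b, a)
                relations.append((front, "occludes", back))
    return relations
```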

Semantic Planning within the framework utilizes a Multimodal Large Language Model (MLLM) to perform high-level reasoning specifically concerning content that is not directly visible in the input scene. The MLLM processes information derived from scene analysis – including identified objects and their relationships – and extrapolates plausible content for occluded regions. This reasoning is not based on pixel-level prediction, but rather on contextual understanding and the MLLM’s pre-trained knowledge of the world, allowing it to infer likely missing elements and their attributes. The output of Semantic Planning is a structured representation of this inferred content, which is then used by the visual synthesis stage to generate a complete and coherent scene.
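One common way to obtain such a structured plan is to prompt a multimodal LLM for JSON. The sketch below assumes a hypothetical `query_mllm` callable and an illustrative schema; it is not the paper’s prompt or interface.

```python
import json

PLANNING_PROMPT = """You are given a scene description and a list of occluded regions.
For each occluded region, return a JSON list of objects with keys:
  "object": the object whose parts are hidden,
  "hidden_parts": the parts that must be completed,
  "rationale": a one-sentence justification.
Scene: {scene}
Occlusions: {occlusions}
"""

def plan_with_mllm(query_mllm, scene: str, occlusions: list[str]) -> list[dict]:
    """query_mllm is any callable that sends a prompt to a multimodal LLM and
    returns its text reply (hypothetical interface)."""
    reply = query_mllm(PLANNING_PROMPT.format(scene=scene, occlusions=occlusions))
    try:
        return json.loads(reply)   # structured plan handed to the synthesis stage
    except json.JSONDecodeError:
        return []                  # empty plan; the caller may retry or re-prompt
```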

Our ambiguity-aware hypothesis generation framework moves beyond simple inpainting by explicitly modeling unseen semantic interpretations, such as distinguishing between ‘Stepping’ and ‘Lying Down’, and quantifying the confidence (probability) associated with each, resulting in controllable and diverse completions.

Refining the Plan: Ensuring Semantic Coherence Through Verification

The Chain-of-Thought (CoT) Verification Agent functions as a critical component in maintaining semantic accuracy during scene completion. This agent systematically evaluates the proposed completion by tracing the logical connections between objects and their relationships, identifying inconsistencies or inaccuracies in the generated scene. The evaluation process involves assessing whether the completed elements align with established knowledge and common-sense reasoning, and flags any deviations. Identified inaccuracies are then corrected through iterative refinement of the scene generation process, ensuring a semantically consistent and plausible final output.
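The verify-and-correct loop can be summarized in a few lines. Here, `verify` and `revise` are hypothetical stand-ins for the chain-of-thought verification agent and the re-planning step, so this is a structural sketch rather than the authors’ implementation.

```python
def verified_completion(initial_plan, verify, revise, max_rounds: int = 3):
    """Iteratively check a proposed completion plan and patch flagged issues.
    `verify` returns (ok, issues); `revise` returns an updated plan.
    Both are hypothetical callables standing in for the verification agent."""
    plan = initial_plan
    for _ in range(max_rounds):
        ok, issues = verify(plan)        # e.g., "chair legs pass through the table"
        if ok:
            return plan
        plan = revise(plan, issues)      # correct only the flagged inconsistencies
    return plan                          # best effort after max_rounds
```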

The planning process incorporates a Hypothesis Generator to introduce semantic diversity and reduce potential biases in scene completion. This component proposes multiple plausible interpretations, expanding beyond the initially identified semantic possibilities. Quantitative evaluation demonstrates a 19.6% improvement in diversity, as measured by pairwise Learned Perceptual Image Patch Similarity (LPIPS), and a 46.3% improvement using pairwise CLIP distance, indicating a substantial increase in the range and originality of generated scene completions.
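Pairwise diversity metrics of this kind are typically computed by averaging a distance over every unordered pair of generated completions. The sketch below shows that aggregation for embedding-based distances (e.g., CLIP features); the exact feature extractor and averaging protocol used in the paper may differ.

```python
import numpy as np

def mean_pairwise_cosine_distance(features: np.ndarray) -> float:
    """Diversity of a set of completions: average cosine distance over all
    unordered pairs of embeddings (rows of `features`). Higher = more diverse."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T                         # cosine similarity matrix
    iu = np.triu_indices(len(f), k=1)     # each unordered pair counted once
    return float(np.mean(1.0 - sim[iu]))

# Usage: embeddings for five hypothetical completions of the same occluded object.
rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 512))
print(mean_pairwise_cosine_distance(emb))
```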

The verification process functions as a refinement stage following initial semantic analysis of a proposed scene completion. This subsequent evaluation assesses the logical consistency and realism of the generated content, identifying potential inaccuracies or implausible elements. By building upon the foundational semantic understanding established earlier, the verification agent ensures that the final scene adheres to established knowledge and expectations, resulting in a demonstrably higher quality and increased plausibility of the completed visualization. This layered approach to semantic validation is critical for generating coherent and believable scenes.

Our amodal completion method surpasses state-of-the-art approaches, including Pix2Gestalt, PD-MC, and OWAAC, by accurately reconstructing complex geometries, textures, and semantics, as demonstrated by its ability to resolve challenging cases involving anatomy, reasoning, and occluded text, and validated by both qualitative visual improvements and quantitative scoring.

Beyond Synthesis: A New Metric for Assessing Completion Quality

A new metric, termed the MAC-Score, has been developed to assess the quality of synthesized data, moving beyond traditional evaluation methods. This innovative approach prioritizes two crucial aspects: structural completeness – ensuring all essential elements are present – and semantic consistency, verifying the logical and meaningful relationships between those elements. Unlike metrics easily fooled by superficial details, the MAC-Score is designed to mirror human perception of quality, focusing on whether the synthesized output is both whole and makes sense. By evaluating both MAC-Completeness and MAC-Consistency, this metric offers a more nuanced and reliable assessment, promising a closer correlation with human judgment and a more accurate reflection of true synthesis quality.

Conventional image quality metrics, such as LPIPS and SSIM, often prioritize pixel-level accuracy, making them susceptible to being fooled by visually insignificant high-frequency details or textures. The newly developed MAC-Score distinguishes itself by shifting the focus towards evaluating the meaningful structural components and semantic coherence of an image. Rather than assessing every pixel, it prioritizes whether the generated image accurately represents the underlying scene and its key elements. This approach allows the MAC-Score to better align with human visual perception, as individuals tend to prioritize the overall composition and semantic correctness of an image over minor, high-frequency discrepancies. By concentrating on these higher-level features, the metric provides a more robust and reliable assessment of image completion quality, effectively mirroring how humans judge the realism and completeness of a visual scene.
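While the precise MAC-Score protocol is defined in the paper, a two-part, judge-based metric of this general shape can be sketched as follows; the `judge` callable and the 0-10 consistency scale are assumptions made purely for illustration.

```python
def mac_style_scores(judge, completions):
    """Sketch of a two-part evaluation in the spirit of the MAC-Score.
    `judge` is a hypothetical callable wrapping an MLLM: it answers whether the
    target object looks structurally complete (bool) and rates semantic
    consistency on a 0-10 scale. The real protocol is specified in the paper."""
    complete_flags, consistency_ratings = [], []
    for image, target_object in completions:
        complete, rating = judge(image, target_object)
        complete_flags.append(complete)
        consistency_ratings.append(rating)
    completeness = 100.0 * sum(complete_flags) / len(complete_flags)    # percent
    consistency = sum(consistency_ratings) / len(consistency_ratings)   # mean 0-10
    return completeness, consistency
```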

Evaluations using the HiFi-Amodal dataset reveal the framework achieves a MAC-Completeness score of 65.45%, indicating a substantial level of structural detail captured in generated content, alongside a MAC-Consistency score of 8.023, reflecting strong semantic coherence. Notably, user preference studies demonstrate this approach is favored by a significant margin – 72.128% of participants selected outputs from this framework over alternatives. This preference is particularly striking when compared to the next best performing method, which received a selection rate of only 15.324%, highlighting a considerable advantage in perceived quality and realism.

A key validation of the proposed MAC-Completeness metric lies in its substantial correlation with human assessment of image completion quality. Statistical analysis reveals a Spearman Rank Correlation coefficient of 0.516, indicating a strong, positive relationship between the metric’s score and how people perceive the structural soundness of a completed image. This suggests that MAC-Completeness doesn’t merely quantify technical aspects of the reconstruction, but effectively captures elements that resonate with human visual perception. The metric’s ability to mirror human judgment is crucial, as it moves beyond simple pixel-wise comparisons and towards a more holistic evaluation of completion success, prioritizing perceptually meaningful features and ultimately offering a more reliable indicator of realistic and compelling image synthesis.
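Agreement between an automatic metric and human ratings is conventionally measured exactly this way; the snippet below computes a Spearman rank correlation with SciPy over placeholder scores, mirroring the comparison reported above.

```python
from scipy.stats import spearmanr

# Placeholder data: one automatic score and one human rating per completed image.
metric_scores = [0.71, 0.42, 0.88, 0.35, 0.60, 0.93, 0.51]
human_ratings = [4, 2, 5, 2, 3, 5, 3]

rho, p_value = spearmanr(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```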

Unlike traditional metrics which are misled by pixel-level deviations and incorrectly favor incomplete predictions, our MAC-Score accurately reflects human perception by consistently identifying plausible results like Prediction C as superior.

Mitigating Instability: Towards Robust Completion in Dynamic Scenes

Progressive completion strategies, particularly those leveraging the power of Diffusion Models, face inherent challenges regarding stability and accuracy. These models generate outputs iteratively, refining an initial state over multiple steps; however, each step introduces potential for minor errors to accumulate. This phenomenon, known as Error Accumulation, can lead to increasingly noticeable distortions or inconsistencies in the final result. Furthermore, Inference Instability arises from the stochastic nature of diffusion processes – slight variations in the random seed or internal calculations can produce drastically different outcomes, even with the same input. Consequently, completions intended to be high-fidelity can become compromised, exhibiting artifacts or deviating significantly from the desired target, necessitating robust techniques to mitigate these risks and ensure reliable generation.
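A toy calculation (not the paper’s analysis) illustrates why error accumulation matters: if each progressive round independently degrades fidelity by a small fraction, the loss compounds geometrically across rounds. The per-step rate used below is arbitrary and purely illustrative.

```python
def compounded_error(per_step_error: float, steps: int) -> float:
    """If each progressive round keeps only (1 - per_step_error) of the signal,
    the total degradation after `steps` rounds grows geometrically."""
    return 1.0 - (1.0 - per_step_error) ** steps

for steps in (1, 3, 5, 10):
    print(steps, round(compounded_error(0.05, steps), 3))
# 1 -> 0.05, 3 -> ~0.143, 5 -> ~0.226, 10 -> ~0.401
```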

Minimizing the inherent risks of progressive completion, such as inference instability and error accumulation, requires a meticulously orchestrated system of agents and a robust verification process. Current approaches focus on carefully coordinating these agents to build upon prior results while simultaneously employing verification steps to identify and correct potential errors before they propagate. However, existing pipelines are not fully immune to these challenges, necessitating continued research into more resilient architectures. Future investigations aim to enhance the framework’s ability to detect and mitigate errors autonomously, ultimately leading to completion systems capable of consistently generating high-quality, stable outputs even with complex or ambiguous starting conditions.

The current framework, while demonstrating success in static environments, is poised for expansion into the complexities of dynamic scenes – those featuring moving objects and changing conditions. Researchers are actively investigating methods to equip the system with the ability to track and incorporate temporal information, allowing for more robust and accurate progressive completion even amidst motion. Beyond this, the principles underlying this work hold considerable promise for broader applications within computer vision; potential avenues include enhancing scene understanding by refining incomplete or ambiguous visual data, and improving the reliability of robotic navigation systems by enabling them to predict and adapt to partially observed environments. This adaptability suggests a versatile toolkit for any application requiring robust perception in uncertain conditions.

Progressive methods commonly fail due to inference instability, leading to incomplete objects, or through error accumulation, which results in structurally inconsistent artifacts as initial mistakes amplify during iterative refinement.

The pursuit of robust amodal completion, as detailed in this work, echoes a fundamental principle of mathematical elegance. The collaborative multi-agent framework, utilizing large language models and chain-of-thought reasoning, strives for solutions grounded in semantic consistency – a provable understanding of the incomplete scene. This resonates with Geoffrey Hinton’s assertion: “The best way to think about intelligence is to think about it as the ability to discover the underlying structure of the world.” The MAC-Score, designed for human-aligned evaluation, attempts to quantify this ‘underlying structure,’ moving beyond mere perceptual fidelity to assess the logical coherence of the completed perception. It’s a pursuit of correctness, not simply ‘working’ solutions.

What’s Next?

The pursuit of amodal completion, as demonstrated by this work, continues to highlight a fundamental tension. Systems can appear to reason, generating outputs aligned with human expectation, yet the underlying mechanisms often remain opaque. If it feels like magic, one hasn’t revealed the invariant. The introduction of a multi-agent framework, while promising, merely shifts the burden of proof – now, not one model must be correct, but the collaboration must be. This raises immediate questions regarding agent specialization, conflict resolution, and the formal guarantees of emergent behavior.

The proposed MAC-Score, an attempt to quantify semantic consistency, is a welcome, if provisional, step. However, a score, however cleverly derived, is ultimately a proxy. True progress demands a move beyond empirical evaluation toward formal verification. Can the constraints of physical plausibility and semantic coherence be expressed as provable theorems? The field needs less benchmarking and more theorem proving.

Future work should not focus solely on scaling existing architectures or refining evaluation metrics. The real challenge lies in constructing systems grounded in logical foundations. Amodal completion isn’t about simulating intelligence; it’s about building systems where correctness isn’t a matter of luck, but a mathematical necessity. Until then, the elegant illusion of reasoning will remain just that: an illusion.


Original article: https://arxiv.org/pdf/2512.20936.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
