Can Teams of AI Build Better?

Author: Denis Avetisyan

New research explores whether letting vision-language models collaborate through dialogue can improve their ability to understand and reason about spatial relationships.

Reconstruction success rates varied with structural complexity in a two-agent, multi-turn image-based setting, with performance notably influenced by the inclusion or exclusion of layer-wise representations within the target images.

Multi-agent dialogue offers modest gains in vision-language model performance on spatial reasoning tasks, with decomposed visual representations proving crucial for success.

Despite advances in vision-language models (VLMs), robust spatial reasoning within collaborative tasks remains a significant challenge. This work, ‘Multi-Turn Multi-Agent Dialogue for Collaborative Reconstruction Improves VLM Performance on Spatial Reasoning, But Only Barely’, investigates a collaborative structure-building environment to assess the impact of multi-agent dialogue on VLM performance. Results demonstrate that while dialogue offers marginal improvements, success hinges on detailed textual representations of target structures and decomposed visual inputs. Can future research overcome these limitations to unlock truly collaborative and spatially aware VLM agents for complex robotic tasks?

Orchestrating Intelligence: Introducing the Structure Building Task

The growing field of embodied artificial intelligence increasingly leverages Vision-Language Models (VLMs) to bridge the gap between perception and action, enabling agents to understand and react to their surroundings based on both visual input and textual commands. However, simply processing images and text isn’t enough; truly intelligent embodied agents demand robust reasoning capabilities to interpret ambiguous instructions and plan complex sequences of actions. Moreover, effective interaction is crucial, requiring agents to not only understand what needs to be done, but also to skillfully manipulate objects and navigate dynamic environments. Current VLMs often struggle with these demands, exhibiting limitations in their ability to extrapolate from learned data and generalize to novel situations, thus necessitating new benchmarks and methodologies to assess and improve their reasoning and interactive skills.

The Structure Building Task presents a novel challenge for evaluating artificial intelligence agents by demanding the interpretation of instructions and subsequent manipulation of a virtual environment. This assessment moves beyond simple object recognition, requiring agents to understand relational language – specifying how blocks should be arranged – and translate that understanding into precise motor actions within a physics-based simulation. The task’s design emphasizes a need for agents to not only ‘see’ and ‘read’ but to actively reason about spatial relationships and apply that reasoning to achieve a defined construction goal, offering a benchmark for progress in embodied AI and robotic control.

The Structure Building Task fundamentally demands collaborative problem-solving, requiring agents to synthesize both visual perception and textual instruction into coordinated physical action. Success isn’t achieved through individual performance, but through effective communication and shared understanding of the building goal; agents must interpret instructions describing the desired structure, visually assess the current state of the environment, and then translate this combined information into precise manipulation of virtual blocks. This necessitates a dynamic interplay where agents infer each other’s intentions, resolve ambiguities in the instructions, and adapt their actions based on observed progress – mirroring the complex coordination found in real-world collaborative construction scenarios. Ultimately, the task serves as a benchmark for evaluating an agent’s ability to not merely understand instructions, but to actively participate in a shared building process, demonstrating a crucial step towards more sophisticated embodied AI systems.

The system employs a two-agent setup for structure building: a Programmer that generates instructions based on a target structure, and a Robot that executes those instructions to modify its environment.

A Dualistic Architecture: Defining Agent Roles

The system is designed around a two-agent architecture, employing two distinct Vision-Language Models (VLMs) to facilitate collaborative construction. One VLM is designated as the ‘Programmer’, responsible for processing target structure information and formulating build instructions. The second VLM functions as the ‘Robot’, interpreting these instructions to manipulate objects within a simulated environment. This separation of concerns allows for modularity and enables focused training of each VLM for its specific role in the building process, mirroring a human-robot collaboration paradigm.

The ‘Programmer’ agent within the framework is provided with target structural information through one of two input modalities: a ‘Top-Down Target Image’ which presents a complete visual representation of the desired outcome, or a ‘Layer-Wise Target Representation’ which conveys the structure’s composition as a sequence of individual layers. These inputs serve as the basis for instruction generation; the ‘Programmer’ processes this information to formulate a series of commands intended to guide the ‘Robot’ in replicating the target structure. The generated instructions are then transmitted to the ‘Robot’ agent for execution within the virtual environment.
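The difference between the two target representations can be illustrated with a minimal Python sketch. The encoding used here (a dictionary from `(x, y, z)` cells to color names) and the function names `to_layers` and `top_down_view` are assumptions made for illustration, not the paper's actual data format. The sketch also assumes that a top-down image shows only the topmost block in each column.

```python
from collections import defaultdict

# Hypothetical encoding: (x, y, z) grid cell -> block color, with z as height.
Structure = dict[tuple[int, int, int], str]


def to_layers(structure: Structure) -> list[dict[tuple[int, int], str]]:
    """Split a structure into per-height layers, lowest layer first."""
    by_height: dict[int, dict[tuple[int, int], str]] = defaultdict(dict)
    for (x, y, z), color in structure.items():
        by_height[z][(x, y)] = color
    return [by_height[z] for z in sorted(by_height)]


def top_down_view(structure: Structure) -> dict[tuple[int, int], str]:
    """Keep only the topmost block of each (x, y) column (assumed camera model)."""
    top: dict[tuple[int, int], tuple[int, str]] = {}
    for (x, y, z), color in structure.items():
        if (x, y) not in top or z > top[(x, y)][0]:
            top[(x, y)] = (z, color)
    return {cell: color for cell, (_, color) in top.items()}


# A two-block tower next to a single block.
tower = {(0, 0, 0): "red", (0, 0, 1): "blue", (1, 0, 0): "green"}
layers = to_layers(tower)
view = top_down_view(tower)
```

In this toy example the red block at the base of the tower is absent from `view` but present in `layers[0]`, which is the kind of information a layer-wise representation preserves under the stated assumptions.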

The ‘Robot’ agent functions within a virtual simulation to execute received instructions. This agent parses the instructions generated by the ‘Programmer’ and translates them into actions performed within the simulated environment. These actions modify the ‘Grid State’, which represents the current configuration of the virtual building space. The ‘Virtual Simulator’ then updates this ‘Grid State’ based on the performed actions, providing feedback to the ‘Robot’ and enabling iterative construction attempts. This closed-loop process allows the ‘Robot’ to incrementally build the target structure based solely on the instructions received, without direct perception of the final goal.
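The closed loop between the Robot's actions and the Grid State can be sketched as follows. This is a simplified stand-in under stated assumptions, not the simulator used in the paper: the action format (`kind`, `cell`, `color`) and the names `GridState` and `apply_action` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class GridState:
    """Hypothetical grid state: maps (x, y, z) cells to block colors."""
    blocks: dict = field(default_factory=dict)


def apply_action(state: GridState, action: dict) -> GridState:
    """Return the new state after one action, leaving the old state untouched."""
    kind = action["kind"]
    cell = tuple(action["cell"])
    new_blocks = dict(state.blocks)
    if kind == "place":
        if cell in new_blocks:
            raise ValueError(f"cell {cell} is already occupied")
        new_blocks[cell] = action["color"]
    elif kind == "remove":
        if cell not in new_blocks:
            raise ValueError(f"cell {cell} is empty")
        del new_blocks[cell]
    else:
        raise ValueError(f"unknown action kind: {kind!r}")
    return GridState(new_blocks)


empty = GridState()
after_place = apply_action(empty, {"kind": "place", "cell": (0, 0, 0), "color": "red"})
after_remove = apply_action(after_place, {"kind": "remove", "cell": (0, 0, 0)})
```

Returning a fresh state on every step keeps earlier states available, which is convenient if each turn's state needs to be shown back to the agents.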

Both the Programmer and Robot utilize a shared state representation, with the Programmer receiving the target structure and Robot’s state, and the Robot receiving the Programmer’s textual instruction and its own state.

Refining Collaboration: Communication Dynamics

The system incorporates a clarification loop whereby the ‘Robot’ proactively requests additional information from the ‘Programmer’ when encountering ambiguous instructions or insufficient data to execute a task. This functionality is implemented to address uncertainties during operation, preventing incorrect actions based on incomplete understanding. The ‘Robot’ identifies ambiguities by analyzing instruction syntax and context, formulating specific questions to resolve them before proceeding. This dialogue-based approach enables the system to operate more reliably in complex scenarios and reduces the need for extensive pre-programming to account for all possible contingencies.
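A clarification loop of this kind might look like the following sketch. The two agents are modeled as plain Python callables standing in for VLM calls; the return convention (`"ask"` or `"act"`), the `max_questions` cap, and the function name `run_turn` are all assumptions for illustration and are not taken from the paper.

```python
from typing import Callable, Optional

# Hypothetical interfaces standing in for the two VLM calls:
#   robot(instruction, answers) -> ("ask", question) or ("act", action)
#   programmer(question)        -> answer text
RobotFn = Callable[[str, list], tuple]
ProgrammerFn = Callable[[str], str]


def run_turn(
    instruction: str,
    robot: RobotFn,
    programmer: ProgrammerFn,
    max_questions: int = 2,
) -> tuple[Optional[object], list]:
    """Let the Robot ask up to max_questions questions before it must act."""
    answers: list = []
    for _ in range(max_questions):
        kind, payload = robot(instruction, answers)
        if kind == "act":
            return payload, answers
        answers.append((payload, programmer(payload)))
    # Question budget exhausted: give the Robot one last chance to act.
    kind, payload = robot(instruction, answers)
    return (payload if kind == "act" else None), answers


def demo_robot(instruction: str, answers: list) -> tuple:
    """Toy Robot: asks for the color once, then acts on the answer."""
    if not answers:
        return ("ask", "Which color?")
    return ("act", {"kind": "place", "color": answers[0][1]})


def demo_programmer(question: str) -> str:
    return "red"


action, transcript = run_turn("put a block", demo_robot, demo_programmer)
```

Capping the number of questions is a design choice of this sketch; it guards against a Robot that keeps asking instead of acting.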

Correction Instructions are a core component of the iterative refinement process, allowing the ‘Programmer’ to address inaccuracies or suboptimal performance exhibited by the ‘Robot’. These instructions, provided in response to observed errors or deviations from desired behavior, serve as direct feedback to modify the ‘Robot’s subsequent actions. The system is designed to accept a range of correction types, including adjustments to parameters, alterations to procedural steps, and complete re-specifications of tasks. This feedback loop enables error recovery and facilitates continuous improvement in the ‘Robot’s ability to execute instructions and achieve desired outcomes, leveraging the standardized structures defined within the ‘SARTCo Dataset’ for consistent evaluation.
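To make the feedback loop concrete, the sketch below computes what a correction would need to address by comparing the target structure with the Robot's current state. In the system described here the Programmer is a VLM that produces corrections in natural language, so this explicit diff is only an illustrative stand-in; the dictionary encoding and the name `diff_structures` are hypothetical.

```python
def diff_structures(target: dict, current: dict) -> dict:
    """Compare two structures encoded as {(x, y, z): color} dictionaries."""
    missing = {c: col for c, col in target.items() if c not in current}
    extra = {c: col for c, col in current.items() if c not in target}
    wrong_color = {
        c: {"have": current[c], "want": target[c]}
        for c in target
        if c in current and current[c] != target[c]
    }
    return {"missing": missing, "extra": extra, "wrong_color": wrong_color}


target = {(0, 0, 0): "red", (0, 0, 1): "blue"}
current = {(0, 0, 0): "green", (1, 0, 0): "red"}
report = diff_structures(target, current)
```

Each of the three categories maps to a different kind of correction: place a missing block, remove an extra one, or replace a block of the wrong color.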

The SARTCo Dataset serves as the foundational resource for defining the expected structures of successful communication and task completion within the system. This dataset comprises a collection of annotated dialogues and corresponding robot actions, establishing a standardized benchmark for quantitatively evaluating the performance of the communication dynamics. Specifically, it allows for the measurement of both the ‘Robot’s’ ability to elicit necessary clarifications and the ‘Programmer’s’ effectiveness in providing corrective instructions, enabling comparative analysis of different approaches to communication loop design and iterative refinement strategies.

Reconstruction success rates decrease with increasing structure complexity in multi-turn two-agent episodes involving clarification questions, but are improved by utilizing layer-wise target representations.

Assessing System Capabilities: Benchmarking VLM Performance

A comparative analysis was conducted utilizing both a proprietary model, GPT-5.2-Chat, and the openly accessible Qwen3-VL-30B, to rigorously assess their capabilities within a defined framework. This benchmarking process involved subjecting each model to identical tasks and conditions, allowing for a direct evaluation of their respective strengths and weaknesses. The selection of both closed and open-weight models was intentional, aiming to provide insights into the performance landscape across different development and accessibility paradigms. By establishing a standardized evaluation protocol, researchers sought to objectively quantify the performance of each model and identify areas ripe for further innovation in the field of vision-language models.

Evaluation of model performance centered on ‘Reconstruction Success Rate’, a metric quantifying how accurately a vision-language model (VLM) can recreate a target structure from textual instruction. Notably, GPT-5.2-Chat achieved a peak accuracy of 91.7% when operating in the Text-Text modality with layer-wise target representations, that is, when the target structure was presented as a sequence of individual layers rather than as a single complete view. This result suggests that a decomposed description of the target helps the model translate instructions into accurate reconstructions, and points to a promising avenue for further VLM development.
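A success rate of this kind can be computed as the fraction of episodes in which the final state matches the target. The sketch below assumes an exact-match criterion over structures encoded as `{(x, y, z): color}` dictionaries; the paper's actual success criterion may differ, and the function name is hypothetical.

```python
def reconstruction_success_rate(episodes: list) -> float:
    """Fraction of (final_state, target) pairs that match exactly."""
    if not episodes:
        raise ValueError("at least one episode is required")
    successes = sum(1 for final, target in episodes if final == target)
    return successes / len(episodes)


goal = {(0, 0, 0): "red", (0, 0, 1): "blue"}
episodes = [
    ({(0, 0, 0): "red", (0, 0, 1): "blue"}, goal),   # exact match
    ({(0, 0, 0): "red"}, goal),                      # missing the top block
    ({(0, 0, 0): "red", (0, 0, 1): "blue"}, goal),   # exact match
]
rate = reconstruction_success_rate(episodes)
```

Under this criterion a single wrong or missing block makes the whole episode count as a failure, which is stricter than a per-block accuracy score.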

While GPT-5.2-Chat demonstrated strong performance in simpler reconstruction tasks, its efficacy diminished considerably when faced with a more complex, collaborative scenario. Specifically, when tasked with reconstructing images broken down into components and requiring interaction over multiple turns – simulating a two-agent system – the model’s reconstruction success rate dropped to just 49.3%. This suggests that GPT-5.2-Chat, despite its overall capabilities, struggles with maintaining coherence and accurately integrating information across extended interactions and decomposed visual inputs, highlighting a limitation in its ability to effectively manage complexity within a multi-step, collaborative problem-solving process.

Evaluations revealed substantial challenges for the Qwen3-VL-30B model in completing the assigned robotic task, culminating in the termination of 91.3% of experimental episodes. This high rate of failure suggests a core limitation in the model’s capacity to effectively process and respond to the iterative feedback loop inherent in the collaborative setting. Even with repeated opportunities to refine its actions based on environmental cues, the model consistently struggled to maintain task coherence, highlighting a need for improved robustness in handling complex, multi-turn interactions and potentially indicating deficiencies in its ability to learn from and correct errors during execution.

The observed performance of both GPT-5.2-Chat and Qwen3-VL-30B suggests that Vision-Language Models still have considerable ground to cover before they can orchestrate complex robotic procedures. While GPT-5.2-Chat demonstrated a high degree of success in reconstructing target structures under specific conditions, its diminished performance in the multi-turn, image-based two-agent setting underscores the need for enhanced robustness when reasoning from visual input. The substantial failure rate of Qwen3-VL-30B highlights critical limitations in its capacity to adapt and persevere through iterative interaction, pointing to areas where improvements in instruction following and error recovery are essential. These findings collectively suggest that future research should prioritize developing VLMs capable of not only understanding high-level commands but also of dynamically adjusting to unforeseen circumstances and correcting mistakes in real time, ultimately paving the way for more reliable and versatile robotic systems.

GPT-5.2-Chat and Qwen3-VL-30B both exhibit failure modes in multi-agent interactions, including inaccurate color and depth perception, excessive clarification requests, and reconstruction errors stemming from misinterpreted objects.

The pursuit of robust spatial reasoning, as demonstrated in this work with multi-agent dialogue, highlights a fundamental principle of system design. The paper’s findings, while showing marginal gains, underscore that improvements in one area – collaborative interaction – do not inherently resolve underlying complexities. As John von Neumann observed, “There is no possibility of obtaining a satisfactory answer to a question if the question itself is not properly formulated.” The challenge lies not simply in building a structure, but in the models’ ability to accurately represent and reason about the spatial relationships inherent within it. The decomposition of visual representations proves crucial, suggesting that clarity of foundational elements dictates the efficacy of the overall system, aligning with the idea that structure fundamentally governs behavior.

Future Foundations

The modest gains observed in collaborative structure building, a barely perceptible nudge forward, highlight a fundamental truth about complex systems. Adding more voices to the conversation does not inherently improve understanding; it merely redistributes the noise. The infrastructure of reasoning, it seems, is not strengthened by simply layering on additional dialogue turns, but by refining the underlying structural representation. This work demonstrates the power of decomposed visual information, suggesting that a more modular approach to perception (breaking down the whole into manageable, interconnected parts) is critical.

The enduring challenge remains spatial reasoning itself. Vision-language models are, at present, akin to cities built on unstable foundations. Each new capability is added as an extension, rather than being integrated into a cohesive, load-bearing framework. True progress will require a move away from incremental additions and towards a reimagining of the foundational principles. The goal should not be to build more elaborate facades, but to reinforce the core structural elements.

Future research must prioritize the development of more robust and inherently spatial representations, moving beyond pixel-level data to embrace a more symbolic understanding of three-dimensional relationships. Only then will these systems truly begin to ‘see’: not just process light, but comprehend the architecture of the world around them.

Original article: https://arxiv.org/pdf/2605.31387.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-06-01 18:33