Author: Denis Avetisyan
New research demonstrates a method for multi-agent systems to coordinate and reason entirely within the latent space of large language models, unlocking more efficient and accurate collaboration.

LatentMAS, a novel framework, enables agents to share information and strategize through the LLM’s internal representations, reducing reliance on explicit text exchange.
While multi-agent systems hold promise for enhancing large language model reasoning, current approaches rely on text-based communication which introduces inefficiencies and potential information loss. This paper introduces Latent Collaboration in Multi-Agent Systems and proposes LatentMAS, a novel framework enabling agents to collaborate directly within the continuous latent space of LLMs. By leveraging auto-regressive latent thought generation and a shared latent working memory, LatentMAS achieves superior performance across diverse benchmarks, demonstrating improved accuracy, reduced token usage, and faster inference – all without requiring additional training. Could this paradigm shift towards latent collaboration unlock a new era of efficient and expressive system-level intelligence?
The Illusion of Reasoning
Despite remarkable progress in large language models (LLMs), the ability to consistently perform complex reasoning – tasks demanding multiple sequential steps and nuanced understanding – continues to present a substantial challenge. While LLMs excel at pattern recognition and information retrieval, they often falter when required to synthesize information across extended contexts or apply abstract principles to novel situations. This isn’t simply a matter of scale; increasing model size doesn’t automatically equate to improved reasoning capabilities. The core difficulty lies in the models’ reliance on statistical correlations within training data, rather than a genuine grasp of causal relationships or logical inference. Consequently, LLMs can be easily misled by superficial patterns or exhibit brittleness when faced with variations in problem framing, highlighting a critical gap between statistical learning and true cognitive reasoning.
While Chain of Thought (CoT) prompting has demonstrated success in enhancing reasoning capabilities in large language models, its practical application faces notable limitations. The method’s reliance on generating a sequence of intermediate reasoning steps proves computationally intensive, demanding significant processing power and time, especially as the complexity of the problem increases. More critically, CoT often struggles with “long-range dependencies” – the ability to effectively connect information presented early in the reasoning chain to conclusions drawn later on. As the number of steps expands, the model’s attention can dissipate, leading to errors or incoherent reasoning, effectively diminishing its ability to solve problems requiring sustained, multi-step analysis. This inherent difficulty highlights the need for more efficient and robust reasoning architectures that can maintain coherence and accuracy across extended cognitive processes.
Current approaches to complex reasoning in artificial intelligence frequently depend on the explicit generation of textual explanations, a process that introduces considerable inefficiencies. This reliance on language as an intermediary step – where the model must verbalize its thought process – creates a bottleneck in information transfer and demands substantial computational resources. Each generated token requires processing, and the cumulative effect across multiple reasoning steps can significantly slow down performance and increase costs. Furthermore, the very act of translating internal representations into natural language introduces opportunities for error and can obscure the core logic of the reasoning process, hindering both speed and accuracy. Researchers are actively exploring methods to bypass this textual intermediary, seeking more direct and efficient ways for models to manipulate and utilize information without the need for verbose, language-based explanations.

Beyond Dialogue: The Emergence of Latent Collaboration
LatentMAS establishes a collaborative framework where multiple agents interact and solve problems exclusively within a continuous latent space, eliminating the need for discrete token-based communication. This end-to-end approach means agents directly share and manipulate hidden representations – numerical vectors capturing semantic meaning – rather than exchanging natural language text. The system encodes inputs into this latent space, performs collaborative reasoning through modifications of these latent vectors, and then decodes the final latent representation back into an output. By operating directly on these continuous representations, LatentMAS avoids the inefficiencies and potential ambiguities inherent in natural language processing and tokenization, potentially leading to faster and more robust collaboration between agents.
LatentMAS utilizes a Multi-Agent System (MAS) architecture where individual agents collaboratively address tasks not through direct textual communication, but by exchanging and refining hidden representations – vector embeddings capturing semantic information. This approach allows agents to share knowledge and coordinate strategies without the bandwidth and computational overhead associated with explicit language processing. Each agent maintains its own latent state, which is updated based on its observations and the received latent states from other agents. This shared latent space facilitates a form of distributed reasoning, enabling the agents to collectively converge on solutions to complex problems by iteratively refining these hidden representations. The system’s performance relies on the ability of agents to effectively interpret and integrate the information encoded within these shared latent vectors, effectively bypassing the need for traditional message passing.
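The iterative-refinement idea can be illustrated with a minimal sketch. Here each agent holds a latent vector, and one "collaboration step" blends its own state with its peers' states; the convex combination is a stand-in for whatever transformer computation a real agent would apply, and all values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8          # latent dimensionality (illustrative)
N_AGENTS = 3

# Each agent starts from its own latent encoding of the task.
states = [rng.normal(size=DIM) for _ in range(N_AGENTS)]

def refine(own, received):
    """One collaboration step: blend an agent's latent state with the
    mean of its peers' states (a toy stand-in for a forward pass)."""
    peer_mean = np.mean(received, axis=0)
    return 0.5 * own + 0.5 * peer_mean

for _ in range(20):
    states = [
        refine(s, [t for j, t in enumerate(states) if j != i])
        for i, s in enumerate(states)
    ]

# Repeated latent exchange drives the agents toward a shared state --
# a crude analogue of "converging on a solution" without any text.
spread = max(np.linalg.norm(a - b) for a in states for b in states)
print(f"max pairwise distance after refinement: {spread:.2e}")
```

The point of the sketch is only the communication pattern: information flows as dense vectors between agents, with no tokenization anywhere in the loop.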
Latent Thoughts Generation operates by enabling agents to process and refine information using continuous hidden states, rather than discrete token sequences. This approach significantly reduces computational demands as hidden states represent compressed, dense representations of information, circumventing the need for repeated tokenization and detokenization processes. By reasoning directly on these hidden states, agents minimize the number of operations required for each reasoning step, leading to improved efficiency in problem-solving tasks. Furthermore, the continuous nature of hidden states allows for nuanced representation and manipulation of information that is not easily captured by discrete tokens, potentially enhancing the quality of the reasoning process and reducing overall computational cost compared to traditional token-based methods.
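Auto-regressive latent thought generation can be sketched in a few lines. In this toy (the weight matrix and dimensions are invented, not the paper's), the last hidden state is fed straight back as the next step's input, skipping the usual hidden-state → token → embedding round trip:

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 16

# A frozen "model step": one toy transformation standing in for a
# transformer forward pass (hypothetical weights, not a real LLM).
W = rng.normal(scale=0.3, size=(DIM, DIM))

def forward(h):
    return np.tanh(W @ h)

def generate_latent_thoughts(prompt_state, n_thoughts):
    """Auto-regressive latent generation: each new hidden state is fed
    directly back as the next input -- no sampling, no detokenization."""
    thoughts = []
    h = prompt_state
    for _ in range(n_thoughts):
        h = forward(h)
        thoughts.append(h)
    return thoughts

prompt = rng.normal(size=DIM)
thoughts = generate_latent_thoughts(prompt, n_thoughts=5)
print(len(thoughts), thoughts[-1].shape)
```

Each "thought" here is a dense vector rather than a token, which is precisely where the claimed savings come from: no vocabulary projection, no sampling step, and no re-embedding between reasoning steps.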

The Persistence of Memory: A Latent Workspace
Latent Working Memory in LatentMAS functions by storing and transferring key-value (KV) caches generated during the forward pass of transformer layers. These KV caches, which represent contextual information learned by the model, are not discarded after each layer’s computation but are instead preserved and made available to subsequent layers or agents within the system. This mechanism enables the sharing of crucial information across the network without requiring recomputation, effectively creating a persistent, accessible working memory. The KV caches act as a readily available repository of past processing, allowing LatentMAS to maintain context and accelerate information transfer between different components.
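A minimal sketch of KV-cache transfer as working memory, assuming single-head attention and invented shapes: agent A computes its key/value pairs once, agent B receives them verbatim and attends over the concatenation of A's cache and its own, so A's context is reused rather than recomputed:

```python
import numpy as np

rng = np.random.default_rng(2)
DIM = 8

def attend(query, keys, values):
    """Single-head scaled dot-product attention over a KV cache."""
    scores = keys @ query / np.sqrt(DIM)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values

# Agent A processed its context once, producing a KV cache (toy values).
kv_a = {"keys": rng.normal(size=(4, DIM)),
        "values": rng.normal(size=(4, DIM))}

# Agent B's own cache from its local context.
kv_b = {"keys": rng.normal(size=(2, DIM)),
        "values": rng.normal(size=(2, DIM))}

# Latent working memory: B attends over the union of both caches.
shared_keys = np.vstack([kv_a["keys"], kv_b["keys"]])
shared_values = np.vstack([kv_a["values"], kv_b["values"]])

query = rng.normal(size=DIM)
out = attend(query, shared_keys, shared_values)
print(out.shape)
```

The design choice worth noting is that the transferred artifact is the cache itself, not a text summary of A's context: B pays only the cost of attending over a few extra rows.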
Input-Output Alignment within LatentMAS functions by re-projecting the hidden states generated by the final transformer layer back into the original input embedding space. This process establishes a direct correspondence between the system’s outputs and valid input tokens, enabling effective communication between different agents or modules. Specifically, the final hidden states are transformed to match the dimensions and characteristics of the input embeddings, allowing downstream components to interpret and utilize the information as if it were a standard input token. This alignment is critical for maintaining data consistency and facilitating the transfer of knowledge between agents without requiring complex decoding or interpretation layers.
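One simple way to realize such an alignment (the paper's exact projection may differ; the embedding table here is hypothetical) is to snap a final-layer hidden state to its nearest row of the input embedding matrix, so downstream agents receive a vector that is a valid input embedding:

```python
import numpy as np

rng = np.random.default_rng(3)
VOCAB, DIM = 50, 12

# Input embedding table of a toy model.
embeddings = rng.normal(size=(VOCAB, DIM))

def align_to_input_space(hidden_state):
    """Re-project a final-layer hidden state into the input embedding
    space by selecting the most similar embedding row."""
    sims = embeddings @ hidden_state
    return embeddings[np.argmax(sims)]

final_hidden = rng.normal(size=DIM)
aligned = align_to_input_space(final_hidden)

# The aligned vector appears verbatim in the embedding table, so
# another agent can consume it exactly as it would a normal input.
is_valid = any(np.allclose(aligned, row) for row in embeddings)
print(is_valid)
```

Whatever the concrete projection, the invariant is the same: outputs land in the space the next model expects as input, so no decoding layer is needed between agents.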
LatentMAS achieves computational efficiency by utilizing the Key-Value (KV) cache, a storage mechanism that retains the keys and values from previous transformer layer computations. This cached data allows the model to avoid recalculating these values for repeated or similar inputs, substantially reducing redundant computation. Specifically, instead of recomputing attention weights and context vectors for each reasoning step, LatentMAS retrieves them directly from the KV cache. This process accelerates the reasoning process by minimizing the number of floating-point operations required, resulting in faster inference times and reduced resource consumption.
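The savings can be made concrete by counting projection operations in a toy decoder loop (sizes and weights invented): without a cache, step t re-projects all t prefix positions, giving quadratic work; with a cache, each position is projected exactly once:

```python
import numpy as np

rng = np.random.default_rng(4)
DIM = 8
Wk = rng.normal(size=(DIM, DIM))
Wv = rng.normal(size=(DIM, DIM))

projections_computed = 0

def project_kv(token_vec):
    """Key/value projection for one position (the expensive part)."""
    global projections_computed
    projections_computed += 1
    return Wk @ token_vec, Wv @ token_vec

tokens = [rng.normal(size=DIM) for _ in range(6)]

# Without a cache: every step reprojects the whole prefix.
projections_computed = 0
for t in range(1, len(tokens) + 1):
    _ = [project_kv(tok) for tok in tokens[:t]]
no_cache_cost = projections_computed

# With a KV cache: each position is projected once and then reused.
projections_computed = 0
cache = []
for tok in tokens:
    cache.append(project_kv(tok))
cache_cost = projections_computed

print(no_cache_cost, cache_cost)  # 21 vs 6 for a 6-token prefix
```

The gap (21 vs 6 projections for a six-position prefix) grows quadratically with sequence length, which is why cache reuse dominates the reported inference speedups.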

Validation and the Limits of Benchmark Fidelity
LatentMAS was subjected to rigorous evaluation across a range of established benchmarks to assess its performance capabilities. Testing included GSM8K, a dataset for grade school math problems; AIME24 and AIME25, benchmarks focused on advanced mathematics and reasoning; and other standardized tests. Results consistently demonstrate LatentMAS’s ability to outperform existing models on these diverse tasks, indicating a broad applicability and robust performance across different problem domains and difficulty levels. Quantitative data from these evaluations will be detailed in subsequent sections, highlighting specific metrics and comparative analysis against baseline systems.
LatentMAS demonstrates state-of-the-art performance across several demanding evaluation tasks. Specifically, the framework achieves leading results on GPQA-Diamond, a question answering benchmark requiring multi-hop reasoning; MedQA, a medical knowledge-based question answering dataset; MBPP-Plus, an enhanced version of the MBPP code generation challenge; and HumanEval-Plus, a more difficult iteration of the HumanEval benchmark focusing on code synthesis. These results indicate LatentMAS’s capacity to effectively address complex reasoning, knowledge retrieval, and code generation problems, surpassing existing models on these challenging datasets.
Evaluations demonstrate that LatentMAS achieves a performance improvement of up to 14.6% in accuracy when compared to baseline models across multiple benchmarks. This improvement is coupled with substantial reductions in computational cost; specifically, token usage is reduced by 70.8% to 83.7%. Furthermore, LatentMAS exhibits a 4x to 4.3x acceleration in end-to-end inference speed, indicating a significant efficiency gain over existing methods.

Beyond Performance: The Architecture of Adaptability
LatentMAS establishes a powerful new foundation for innovation across diverse scientific and technological fields. The framework’s capacity to model and integrate complex reasoning processes promises significant advancements in scientific discovery, where it can assist in hypothesis generation and data analysis. In medical diagnosis, LatentMAS could enable more accurate and personalized treatment plans by synthesizing patient data with medical literature. Furthermore, the system’s ability to support multi-step problem-solving is particularly relevant to advanced robotics, potentially leading to robots capable of more nuanced interactions with complex, real-world environments and greater autonomy in dynamic situations. These applications represent just a glimpse of the broader potential for LatentMAS to drive progress in areas reliant on sophisticated, adaptable intelligence.
LatentMAS establishes a pathway toward artificial intelligence agents capable of more nuanced and flexible problem-solving through collaborative reasoning. By enabling models to articulate and build upon each other’s knowledge, the framework moves beyond isolated decision-making, mirroring the strengths of human collaborative efforts. This approach allows agents to not only arrive at solutions but also to justify their reasoning, identify potential flaws, and adapt strategies based on feedback from other agents – a crucial step toward building truly intelligent systems. The potential impact extends to complex domains requiring iterative refinement and diverse perspectives, such as scientific hypothesis generation, medical diagnosis, and the development of robust, autonomous robots capable of navigating unpredictable environments.
Ongoing development prioritizes expanding the capabilities of LatentMAS by applying it to significantly larger and more complex models. This scaling effort isn’t merely about computational power; it aims to unlock emergent properties within the framework, potentially revealing nuanced reasoning patterns previously obscured by model size limitations. Researchers intend to move beyond controlled experiments and rigorously test LatentMAS in practical applications – from accelerating scientific discovery by identifying hidden relationships in vast datasets, to improving the accuracy of medical diagnoses through more comprehensive data analysis, and ultimately, building more robust and adaptable AI systems for real-world robotics and automation challenges. The anticipated outcome is a demonstrable increase in the framework’s impact, transitioning it from a promising research tool to a tangible asset in solving complex, pressing problems.
The pursuit of seamless multi-agent collaboration, as demonstrated by LatentMAS, echoes a fundamental truth about complex systems. They rarely conform to initial design; instead, they evolve within the constraints of their environment. Robert Tarjan observed, “A system is only as good as its weakest link.” This holds particularly true when agents operate within the latent space of large language models. The framework’s efficiency – minimizing token usage and maximizing accuracy – isn’t merely a technical achievement. It’s an acknowledgement that true robustness arises not from preventing failure, but from anticipating and accommodating it. Long stability, often celebrated in engineering, is merely a delay in the inevitable reshaping of the system; LatentMAS embraces that evolution by shifting collaboration into a more fluid, adaptable realm.
The Currents Shift
LatentMAS offers a glimpse into a future where agency isn’t defined by the articulation of language, but by the navigation of internal states. It is a refinement, not a revolution. The efficiency gained by operating within the LLM’s latent space is noteworthy, yet it merely postpones the inevitable entropy of scale. Dependencies will accrue, and the cost of maintaining coherence in these shared latent spaces will, in time, eclipse the initial gains. One does not build collaboration; one observes its fleeting emergence, and prepares for its decay.
The true challenge lies not in optimizing communication, but in understanding the limits of shared representation. Can genuine novelty arise from a system entirely confined to the gradients of a pre-trained model? Or does this approach merely distill existing knowledge, creating an echo chamber of statistical likelihood? The framework’s reliance on the LLM’s ‘working memory’ – the KV cache – feels particularly provisional. These caches are artifacts of implementation, not intrinsic features of intelligence.
Future work will inevitably focus on scaling these latent interactions, perhaps attempting to weave them into more complex architectures. But such endeavors should proceed with caution. Architecture isn’t structure; it’s a compromise frozen in time. The most interesting questions are not about how to connect these agents, but about what happens when those connections inevitably fray.
Original article: https://arxiv.org/pdf/2511.20639.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/