Author: Denis Avetisyan
Researchers are exploring ways to combine the power of multiple pre-trained language models without retraining, unlocking new levels of performance and efficiency.
![The architecture leverages three distinct, frozen layer-1 models to encode input, projecting their hidden states into a shared latent space and averaging them to create a unified representation [latex]\mathbf{z}_{1}[/latex], which then informs two further frozen layer-2 models at a sparsity level of 0.75; these layer-2 representations are subsequently projected into another shared space, culminating in a cross-attention output node that generates the final prediction, a process that trains only five projection matrices and the output node, totaling 17.6 million parameters.](https://arxiv.org/html/2604.08335v1/x1.png)
This work introduces frozen language model graphs, a parameter-efficient method for cross-architecture communication via a shared latent space, enabling strong performance on diverse benchmarks.
Despite the increasing scale of large language models, efficiently combining their capabilities remains a significant challenge. This is addressed in ‘Dead Weights, Live Signals: Feedforward Graphs of Frozen Language Models’, which introduces a parameter-efficient architecture composing multiple pretrained LLMs via a learned, shared latent space for communication. By treating frozen models as nodes in a feedforward graph and optimizing only a small number of projection matrices, the approach achieves strong performance on reasoning and knowledge benchmarks, surpassing single-model baselines and parameter-matched classifiers. Can this differentiable communication paradigm unlock emergent capabilities beyond those present in individual, monolithic language models?
Beyond Scale: Rethinking Intelligence
Despite the impressive scale of contemporary large language models, consistently achieving robust performance on complex reasoning tasks remains a significant challenge. These models, while proficient at pattern recognition and generating human-like text, often falter when confronted with problems demanding multi-step inference, logical deduction, or the integration of diverse knowledge sources. This limitation isn’t necessarily due to a lack of data or model size, but rather a fundamental constraint within the monolithic architecture itself. Current LLMs typically process information sequentially, hindering their ability to explore multiple reasoning pathways in parallel or to effectively decompose intricate problems into manageable sub-components. Consequently, researchers are increasingly focused on exploring novel architectural innovations – moving beyond simply scaling up existing models – to unlock true reasoning capabilities and create artificial intelligence systems that can reliably tackle complex challenges.
The limitations of scaling monolithic large language models have spurred investigation into alternative architectures, leading to the development of the Frozen LLM Graph. This innovative approach eschews traditional training methods by composing multiple, frozen LLMs – meaning their weights remain fixed – and connecting them in a directed acyclic graph. This architecture allows for specialized LLMs to handle distinct sub-tasks within a complex reasoning problem, effectively distributing the cognitive load. Remarkably, this system achieves state-of-the-art performance on various benchmarks while maintaining exceptional parameter efficiency; the Frozen LLM Graph requires only 17.6 million trainable parameters – a fraction of the billions typically found in contemporary LLMs – demonstrating a pathway towards more accessible and sustainable artificial intelligence.
Bridging the Divide: A Shared Latent Space
The Layer 1 Projection constitutes the foundational element of our methodology, functioning as a mapping process that transforms the hidden state vectors of distinct Large Language Models (LLMs) into a unified, Shared Latent Space. This projection involves a learned transformation – specifically, a set of matrices applied to the hidden states – allowing for the representation of information from models like Qwen2.5-1.5B, Llama-3.2-1B, and Gemma-2-2B within a common vector space. The resulting Shared Latent Space facilitates interoperability, enabling the models to exchange and process information despite architectural differences, and forms the basis for a combined reasoning system.
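The projection-and-average step can be sketched in a few lines. This is a minimal illustration, not the paper’s implementation: the hidden sizes below are stand-ins for the three layer-1 models, the shared dimension `d_shared` and the initialization scale are assumptions, and random vectors substitute for real frozen hidden states.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hidden widths for the three frozen layer-1 models
# (Qwen2.5-1.5B, Llama-3.2-1B, Gemma-2-2B); values are illustrative.
hidden_sizes = {"qwen": 1536, "llama": 2048, "gemma": 2304}
d_shared = 1024  # assumed shared latent dimension

# One trainable projection matrix per frozen model.
projections = {name: rng.normal(0.0, 0.02, size=(d, d_shared))
               for name, d in hidden_sizes.items()}

# Stand-ins for the frozen models' hidden states at one position.
hidden_states = {name: rng.normal(size=(d,))
                 for name, d in hidden_sizes.items()}

# Project each model's state into the shared space, then average
# to form the unified representation z_1.
z_1 = np.mean([hidden_states[n] @ projections[n]
               for n in hidden_sizes], axis=0)
print(z_1.shape)  # (1024,)
```

Only the projection matrices would receive gradient updates in training; the hidden states themselves come from frozen models.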
Layer 1 Projection facilitates interoperability between large language models, specifically Qwen2.5-1.5B, Llama-3.2-1B, and Gemma-2-2B, by enabling the transfer of learned representations. This process allows each model to contribute its specialized knowledge to a collective reasoning process, effectively creating a unified system. The projection doesn’t require retraining of the individual models; instead, it maps their existing hidden states into a shared representational space, allowing for cross-model communication and synergistic problem-solving without significant performance degradation.
The architecture employs a Shared Latent Space to facilitate inter-model communication without inducing catastrophic interference. This is achieved by limiting the gradient signal impacting the output nodes of the Layer-1 Projection Matrices to 13%. This constrained gradient flow allows each Large Language Model – including Qwen2.5-1.5B, Llama-3.2-1B, and Gemma-2-2B – to contribute its specialized knowledge and reasoning capabilities while minimizing disruption to the established representations within the shared space. The resulting system leverages the distinct strengths of each model without causing instability or performance degradation due to conflicting updates.
![During training on MMLU, the projection matrix gradient norms of layers 1-3 converged to similar values, indicating a lack of specialization, while the norm of layer 4 ([latex]W_4[/latex], Phi-3-mini) consistently exceeded that of layer 5 ([latex]W_5[/latex], Mistral-7B), demonstrating emergent selective routing across all active matrices.](https://arxiv.org/html/2604.08335v1/x3.png)
Layer 2 Integration: Augmenting the Reasoning Stream
Layer 2 Injection facilitates knowledge transfer by directly incorporating the shared latent state – a condensed representation of information – into the residual stream of Layer 2 models. This process is specifically implemented within architectures like Phi-3-mini and Mistral-7B, leveraging their existing residual connections. By injecting the latent state at this point, the models can modify subsequent computations based on information derived from a preceding, potentially different, model. This differs from simply concatenating outputs, as the injected state influences the internal calculations of the Layer 2 model rather than being treated as a separate input feature.
Residual Stream Injection facilitates the transfer of information from layer-1 models to layer-2 models by directly incorporating the shared latent state into the residual stream of the layer-2 architecture. This is achieved by adding the output of the layer-1 model to the residual connection within layer-2, effectively augmenting the layer-2 representation with data processed by the layer-1 model. This process allows layer-2 to leverage the features and knowledge extracted by layer-1 during its forward pass, enhancing the overall representational capacity and improving downstream performance without requiring extensive retraining of the layer-2 model.
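A minimal sketch of the injection step, under assumed dimensions: `d_model` stands in for a layer-2 model’s hidden width, and the up-projection `W_up` is the kind of trainable matrix the method would learn. Real residual states are replaced by random vectors here.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed widths: shared latent space and a layer-2 model's stream.
d_shared, d_model = 1024, 3072

# Trainable up-projection from the shared space into the stream.
W_up = rng.normal(0.0, 0.02, size=(d_shared, d_model))

def inject(residual, z_shared, W):
    """Add the projected shared latent into the residual stream,
    so subsequent layer-2 blocks compute on the augmented state."""
    return residual + z_shared @ W

residual = rng.normal(size=(d_model,))   # stand-in residual state
z_shared = rng.normal(size=(d_shared,))  # unified layer-1 latent
augmented = inject(residual, z_shared, W_up)
print(augmented.shape)  # (3072,)
```

Because the latent is added into the residual stream rather than concatenated, every downstream block of the layer-2 model computes on the augmented state.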
The Cross-Attention Output Node functions as a consolidation point for representations refined through Layer 2 Injection and Residual Stream Injection. Specifically, it receives the enriched layer-2 representations – those informed by the latent state of layer-1 models – and performs a weighted aggregation. This aggregation process combines the information from each layer, effectively creating a unified representation used for generating final predictions. The resulting output leverages the collective knowledge embedded within all participating models, leading to improved performance compared to utilizing any single model in isolation.
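The weighted aggregation can be sketched as attention pooling over per-model representations. This is an assumption-laden toy: the learned query, the two model representations, and the sizes are all hypothetical stand-ins, but the softmax weights illustrate how a soft routing distribution over models arises.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(query, keys, values):
    """Weighted aggregation of per-model representations; the
    softmax weights act as soft routing across models."""
    scores = keys @ query / np.sqrt(query.shape[0])
    weights = softmax(scores)
    return weights @ values, weights

rng = np.random.default_rng(2)
d, n_models = 64, 2               # hypothetical sizes (two layer-2 models)
query = rng.normal(size=(d,))     # stand-in learned output-node query
reps = rng.normal(size=(n_models, d))  # enriched layer-2 representations

pooled, weights = attention_pool(query, reps, reps)
print(pooled.shape)  # (64,) -- one fused representation
```

Inspecting `weights` is what reveals routing behavior: an input-dependent skew toward one model’s representation is exactly the selective-routing pattern discussed later.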
Validation and Emergent Reasoning
The Frozen LLM Graph was subjected to validation using three established reasoning benchmarks: ARC-Challenge, OpenBookQA, and MMLU. ARC-Challenge assesses commonsense reasoning in scientific contexts, while OpenBookQA tests understanding of open-book questions requiring multi-step reasoning. MMLU (Massive Multitask Language Understanding) evaluates knowledge across 57 diverse subjects. Performance on these benchmarks provides a quantitative assessment of the graph’s reasoning capabilities and allows for comparison against individual Large Language Models (LLMs) and other compositional architectures.
Evaluation of the Frozen LLM Graph on established reasoning benchmarks yielded accuracy scores of 87.3% on the ARC-Challenge dataset, 82.8% on OpenBookQA, and 67.2% on the MMLU benchmark. These results represent the performance achieved when combining multiple Large Language Models (LLMs) through the described framework, and serve as a quantitative measure of the system’s reasoning capabilities. The observed scores demonstrate a consistent ability to achieve high levels of accuracy across diverse knowledge domains and question types presented in these datasets.
Performance evaluations on ARC-Challenge, OpenBookQA, and MMLU indicate that the Frozen LLM Graph surpasses the accuracy of its strongest individual component model by 11.4 percentage points on ARC-Challenge, 6.2 percentage points on OpenBookQA, and 1.2 percentage points on MMLU. These gains demonstrate the effectiveness of compositional intelligence, where the synergistic interaction of multiple language models yields improved reasoning capabilities beyond those achievable by any single model in isolation. This outcome highlights the potential of combining specialized models to address complex tasks requiring broader knowledge and more nuanced inference.
Analysis of the Cross-Attention Output Node revealed a pattern of Selective Routing, wherein the model consistently assigned higher weighting to the output of Phi-3-mini compared to other constituent language models. This preferential weighting was not explicitly programmed and emerged during the training process, suggesting an internal mechanism for prioritizing information sources. The observed behavior indicates the Frozen LLM Graph is not simply averaging outputs, but dynamically adjusting the contribution of each LLM based on the input query, representing a form of emergent reasoning capability and demonstrating compositional intelligence beyond the performance of individual models.
Ridge Regression analysis was conducted to assess the efficacy of cross-architecture alignment within the Frozen LLM Graph, a crucial element for enabling knowledge transfer between constituent language models. Results demonstrate a consistent performance improvement over a parameter-matched learned head across multiple reasoning benchmarks: 9.1% on ARC-Challenge, 5.2% on OpenBookQA, and 6.7% on MMLU. This indicates that the implemented alignment strategy effectively facilitates the combination of knowledge from different model architectures, leading to enhanced reasoning capabilities beyond what is achievable with a standard learned interface of comparable size.
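A ridge probe of this kind reduces to fitting a linear map with an L2 penalty on top of frozen representations. The sketch below uses the closed-form solution on synthetic data; the sizes, noise level, and the linear target are invented for illustration and do not reproduce the paper’s analysis.

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression: solve (X^T X + lam*I) w = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(3)
n, d = 200, 32                      # hypothetical probe sizes
X = rng.normal(size=(n, d))         # stand-in frozen representations
w_true = rng.normal(size=(d,))
y = X @ w_true + 0.1 * rng.normal(size=(n,))  # noisy linear target

w = ridge_fit(X, y, lam=1.0)
pred = X @ w
r2 = 1.0 - ((y - pred) ** 2).sum() / ((y - y.mean()) ** 2).sum()
print(round(r2, 3))
```

In the alignment setting, `X` would hold one architecture’s representations and `y` a target derived from another’s; a high fit quality indicates the two spaces are linearly alignable.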

Toward a Future of Composable Intelligence
The conventional approach to large language models centers on monolithic designs – single, massive networks trained end-to-end. However, a fundamental shift is underway with the development of the Frozen LLM Graph, a composable system built from pre-trained, ‘frozen’ language model components. This architecture diverges from the traditional paradigm by assembling specialized modules – each possessing unique capabilities – and connecting them to form a dynamic network. Rather than retraining the entire model for each new task, this system optimizes connections between these fixed components, offering significant advantages in efficiency and adaptability. This modularity allows for the creation of highly customized intelligence systems, potentially unlocking reasoning abilities and task performance beyond the reach of current monolithic models, all while dramatically reducing the number of trainable parameters.
The remarkable potential of the Frozen LLM Graph hinges on the strategic allowance of gradient flow across its deliberately frozen boundaries. While the vast majority of the model’s parameters remain static, preserving learned knowledge, select connections are permitted to update during training. This nuanced approach avoids catastrophic forgetting – a common issue with traditional fine-tuning – and allows for targeted adaptation to new tasks. By carefully controlling which connections ‘leak’ gradient signals, the system efficiently refines its reasoning capabilities with a minimal number of trainable parameters. Essentially, the frozen graph provides a robust foundation, and the limited gradient flow acts as a fine-tuning mechanism, enabling continuous learning and improved performance without the computational burden of retraining the entire model.
The Frozen LLM Graph demonstrates a remarkable feat of efficiency, achieving substantial gains in reasoning and adaptability with a surprisingly small number of trainable parameters. Against a foundation of 12 billion frozen parameters – representing a wealth of pre-existing knowledge – the system requires optimization of only 17.6 million parameters to perform effectively across a range of tasks. This disproportionate ratio suggests a pathway toward significantly reducing the computational cost and energy consumption associated with large language models, without sacrificing performance. The ability to fine-tune a small subset of parameters while leveraging a vast, frozen knowledge base unlocks possibilities for more accessible and sustainable artificial intelligence, paving the way for deployment on resource-constrained devices and broader adoption across diverse applications.
Ongoing research endeavors are directed towards substantially expanding the scale of this frozen LLM graph, moving beyond current limitations to encompass a far greater number of specialized modules and interconnections. Investigations are also underway to determine optimal connection topologies – the specific patterns by which these modules communicate – with the aim of maximizing information flow and synergistic reasoning. Crucially, the future development hinges on automating the model composition process itself, transitioning from manual configuration to an intelligent system capable of dynamically assembling and optimizing the graph based on task demands and available resources. This automation promises a truly adaptive and efficient intelligence, capable of rapidly reconfiguring its architecture to tackle novel challenges with minimal computational overhead.
The pursuit of efficient communication between frozen language models, as detailed in the study, echoes a fundamental principle of network design. It isn’t about adding complexity, but streamlining the signal. Vinton Cerf observed, “The Internet is a global nervous system.” This aligns perfectly with the paper’s focus on establishing a shared latent space, a streamlined ‘nervous system’ that allows these models to communicate without the need for extensive retraining. The efficacy of this approach, achieving strong performance with parameter efficiency, demonstrates that ‘what’s left’, the focused communication pathway, is indeed what matters most, not the sheer volume of parameters.
What Remains?
The pursuit of composition, of building complex systems from stable, frozen components, inevitably reveals the fragility of ‘understanding’ itself. This work demonstrates a method – a scaffolding, if one will – for communication between language models, but it does not, and could not, address the fundamental question of what is actually being communicated. The latent space alignment, however elegant, remains a mirror reflecting the models’ existing biases and limitations, not a portal to genuine semantic integration. Future effort must, therefore, shift from simply enabling communication to measuring its fidelity.
The current paradigm favors parameter efficiency, a virtue born of necessity, yet it risks becoming a constraint. The insistence on ‘frozen’ weights, while pragmatic, implicitly assumes the bulk of intelligence resides within the pretrained models. What if the true potential lies not in leveraging existing knowledge, but in learning how to learn, within the communication layer itself? A more radical approach might explore differentiable communication pathways capable of adapting and evolving beyond the constraints of the initial models.
Ultimately, the true test of this line of inquiry will not be benchmark performance, but conceptual clarity. Can this methodology reveal the underlying structure of language, or will it merely generate increasingly convincing simulacra? The elegance of the solution should not distract from the opacity of the problem. The fewer moving parts, the more keenly felt the absence of a guiding principle.
Original article: https://arxiv.org/pdf/2604.08335.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/