Author: Denis Avetisyan
A new theoretical analysis reveals how combining multiple large language models can dramatically improve performance, but only under specific conditions.

This paper identifies three key mechanisms (feasibility expansion, support expansion, and binding set contraction) that govern the benefits of aggregation in compound AI systems.
While deploying multiple instances of a large language model and aggregating their outputs is a common strategy for enhancing performance, it remains unclear whether this aggregation genuinely unlocks capabilities beyond those of a single model. This work, ‘Power and Limitations of Aggregation in Compound AI Systems’, theoretically investigates this question through a principal-agent framework, identifying three distinct mechanisms (feasibility expansion, support expansion, and binding set contraction) by which aggregation can expand the range of elicitable outputs. We prove that any elicitability-expanding aggregation operation must implement one of these mechanisms, offering a complete characterization of when aggregation truly overcomes limitations in both model capabilities and prompt engineering. Under what conditions can we reliably leverage compound AI systems to surpass the inherent constraints of individual models and unlock novel functionalities?
The Illusion of Intelligence: Why Single Models Fail
Large Language Models, despite demonstrating remarkable proficiency in areas like text generation and translation, operate within boundaries dictated by their foundational design and the data used during training. These models aren’t general-purpose intelligences; rather, they are highly sophisticated pattern-matching systems that excel at predicting the most probable continuation of a given sequence. Consequently, their capabilities are inherently limited by the scope and quality of the data they were exposed to – a model trained primarily on historical texts, for example, will struggle with contemporary scientific jargon. Furthermore, the very architecture of these models, typically based on deep neural networks with a fixed number of parameters, imposes constraints on their ability to represent and process complex information, ultimately defining the limits of their individual performance and adaptability.
The practical application of large language models frequently reveals a core limitation: a struggle to produce consistently diverse and feasible outputs. While capable of impressive feats of text generation, these models often fall into predictable patterns or generate responses that, while grammatically correct, lack practical applicability or novelty. This isn’t simply a matter of stylistic repetition; the core issue lies in the model’s inability to thoroughly explore the solution space for complex problems. Consequently, tasks requiring creative problem-solving, nuanced judgment, or the integration of disparate information sources often exceed the model’s capabilities, leading to outputs that are either unimaginative, internally inconsistent, or demonstrably incorrect despite appearing plausible on the surface. The result is a bottleneck in utilizing these powerful tools for genuinely innovative endeavors.
The pursuit of increasingly complex solutions with current large language models frequently encounters a dual constraint: the limitations of prompting techniques and the inherent capabilities of the models themselves. Even with sophisticated prompt engineering – crafting inputs designed to elicit specific responses – models often reach a ceiling in terms of novelty, accuracy, and feasibility. This isn’t simply a matter of refining the input; the models, trained on finite datasets and governed by specific algorithmic architectures, possess an inherent boundary to their problem-solving capacity. Consequently, tasks demanding true innovation, nuanced understanding, or integration of disparate knowledge domains often prove intractable, highlighting the need for advancements beyond incremental improvements in prompting strategies and toward fundamentally more capable artificial intelligence systems.
![Empirical results using GPT-4o-mini demonstrate that aggregation operations (support expansion, binding-set contraction, and feasibility expansion) expand elicitability by combining prompt-specific outputs [latex] \boldsymbol{x}^{(1)} [/latex] and [latex] \boldsymbol{x}^{(2)} [/latex] into a combined output [latex] \boldsymbol{x}^{(A)} [/latex] that is more readily achievable by a single model than either input alone, as evidenced by the proximity of the elicited output [latex] \boldsymbol{x}^{*}_{P}(\boldsymbol{x}^{(A)}) [/latex] to [latex] \boldsymbol{x}^{(A)} [/latex] and detailed in Table 2.](https://arxiv.org/html/2602.21556v1/arxiv_sections/arxiv_figures/feasibility_expansion_plot.png)
Beyond the Single Oracle: Building Compound AI Systems
A Compound AI System overcomes the limitations of single Large Language Models (LLMs) by employing an Aggregation Operation to synthesize outputs from multiple LLMs. This process involves generating responses from several LLMs to a given prompt, then combining these responses using a defined method – such as averaging, voting, or utilizing a learned weighting scheme – to produce a single, consolidated output. The aggregation step aims to reduce variance, improve accuracy, and enhance the robustness of the overall system by leveraging the diverse strengths of individual LLMs and mitigating their individual weaknesses. This contrasts with relying on a single LLM, which is subject to inherent biases and potential errors.
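As a concrete illustration, an aggregation operation can be as simple as majority voting over sampled responses or per-dimension averaging of scores. The sketch below is a minimal, hypothetical example; the function names and data are illustrative and do not come from the paper:

```python
from collections import Counter
from typing import List

def majority_vote(outputs: List[str]) -> str:
    # Pick the response produced most often across the sampled LLMs.
    winner, _ = Counter(outputs).most_common(1)[0]
    return winner

def mean_pool(vectors: List[List[float]]) -> List[float]:
    # Average per-dimension scores (e.g. embeddings or logits) across models.
    n = len(vectors)
    return [sum(dim) / n for dim in zip(*vectors)]

print(majority_vote(["Paris", "Paris", "Lyon"]))  # -> Paris
print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))        # -> [2.0, 3.0]
```

A learned weighting scheme would replace the uniform average in `mean_pool` with weights fitted to each model's reliability on the task.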
The Compound AI System utilizes the Principal Agent Framework to formalize the relationship between the system designer – the principal – and the Large Language Models (LLMs) functioning as agents. This framework models the designer as setting objectives and the LLMs as acting to achieve those objectives, though potentially with differing information or incentives. By explicitly defining this interaction, the system can be structured to optimize for desired outcomes by accounting for potential agency problems, such as LLM divergence or suboptimal responses. The framework facilitates the design of mechanisms – including reward functions and aggregation operations – to align the LLM’s behavior with the principal’s goals and maximize overall system performance.
Reward Function Specification is a critical component of designing a Compound AI System. Within the Principal Agent Framework, the reward function dictates how each Large Language Model (LLM), functioning as an ‘agent’, is evaluated and incentivized. Precise definition of this function is necessary to align agent behavior with desired system outcomes; ambiguous or poorly defined rewards can lead to unintended consequences or incoherent outputs. The function must quantitatively assess the quality, relevance, and consistency of each LLM’s contribution, providing a measurable signal for optimization. This typically involves establishing specific metrics and weighting criteria to prioritize certain characteristics in the generated output, thereby guiding the agents towards producing valuable and coherent results.
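A minimal sketch of such a reward function, with the metric names and weights chosen purely for illustration (the paper does not prescribe a specific functional form):

```python
def reward(quality: float, relevance: float, consistency: float,
           weights: tuple = (0.5, 0.3, 0.2)) -> float:
    """Score one agent's output as a weighted sum of metrics in [0, 1]."""
    scores = (quality, relevance, consistency)
    if not all(0.0 <= s <= 1.0 for s in scores):
        raise ValueError("each metric score must lie in [0, 1]")
    # The weights encode which output characteristics the principal prioritizes.
    return sum(w * s for w, s in zip(weights, scores))
```

In practice, the metric values themselves would come from automatic judges or task-specific checks; the point is that the principal's preferences are made explicit and measurable.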
Expanding the Realm of Possibility: Elicitability Expansion in Action
Elicitability Expansion, achieved through model aggregation, demonstrably increases the diversity and scope of achievable outputs beyond the limitations of individual models. This improvement isn’t simply a matter of averaging predictions; it requires a correctly implemented aggregation strategy that leverages the strengths of multiple models. By combining diverse perspectives, the aggregated system can explore a larger portion of the solution space, generating outputs that no single constituent model could produce independently. The increase in achievable outputs is quantifiable, representing a broadened range of feasible solutions and improved performance on complex tasks where a single model’s inherent biases or limitations would otherwise constrain the results.
Elicitability Expansion relies on two primary mechanisms: Feasibility Expansion and Support Expansion. Feasibility Expansion occurs when the aggregated system, through collective reasoning, identifies and relaxes constraints that may unduly limit the solution space for individual models. This allows for the consideration of a wider range of potential outputs. Complementing this, Support Expansion broadens the characteristics of the generated outputs themselves, moving beyond the typical limitations of any single model’s training data or inherent biases; this is achieved by synthesizing diverse perspectives and capabilities within the aggregated system, resulting in a more comprehensive and varied output distribution.
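Support expansion has a simple probabilistic reading: the support of a mixture of output distributions is the union of the individual supports, so the aggregate can place probability on outputs that no single model would ever emit. A toy sketch, with illustrative distributions and weights:

```python
def support(dist: dict) -> set:
    # The set of outputs a model assigns nonzero probability to.
    return {x for x, p in dist.items() if p > 0}

def mixture(dists: list, weights: list) -> dict:
    # Weighted mixture of discrete output distributions.
    keys = set().union(*(d.keys() for d in dists))
    return {k: sum(w * d.get(k, 0.0) for w, d in zip(weights, dists))
            for k in keys}

m1 = {"a": 0.7, "b": 0.3}   # model 1 never outputs "c"
m2 = {"b": 0.4, "c": 0.6}   # model 2 never outputs "a"
mix = mixture([m1, m2], [0.5, 0.5])
print(support(mix))  # strictly larger than either individual support
```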
Binding Set Contraction is a key process in enhancing solution robustness within aggregated models. Initially, individual models often converge on boundary solutions – points that satisfy constraints but offer limited flexibility. Binding Set Contraction identifies and relaxes the constraints binding these solutions, effectively shifting the aggregated solution away from these boundaries and toward interior points within the feasible solution space. This movement improves stability, as interior points are less sensitive to minor perturbations in input data or model parameters, and enables the generation of more diverse and generally applicable outputs by exploring a wider range of possibilities within the defined constraints.
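In optimization terms, a constraint is "binding" when the solution sits exactly on it, and averaging two feasible solutions whose binding sets differ yields an interior point at which neither constraint binds. A toy one-dimensional sketch, assuming a simple box-constrained feasible set chosen for illustration:

```python
def binding_set(x: float, constraints: list, tol: float = 1e-9) -> set:
    # Indices of constraints g_i(x) <= 0 that hold with equality at x.
    return {i for i, g in enumerate(constraints) if abs(g(x)) <= tol}

# Feasible region 0 <= x <= 1, written as g0(x) = -x <= 0 and g1(x) = x - 1 <= 0.
constraints = [lambda x: -x, lambda x: x - 1.0]

x1, x2 = 0.0, 1.0            # each single-model solution sits on a boundary
x_agg = 0.5 * (x1 + x2)      # the convex combination lands in the interior

print(binding_set(x1, constraints))     # -> {0}
print(binding_set(x2, constraints))     # -> {1}
print(binding_set(x_agg, constraints))  # -> set(): no constraint binds
```

The aggregated point's empty binding set is what makes it robust: small perturbations leave it feasible, whereas the boundary points can be pushed out of the feasible region immediately.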
![Analysis of [latex]\mathbb{R}^{768}[/latex] embeddings from GPT-4o-mini reveals that different prompts yield semantically distinct outputs, and combining these outputs via intersection or style aggregation produces vectors unlike any individual prompt's result, as demonstrated by differences in [latex]\ell_2[/latex] distance and projection onto principal components.](https://arxiv.org/html/2602.21556v1/arxiv_sections/arxiv_figures/2d_task.png)
From Theory to Practice: Demonstrating Impact with Reference Generation
The Reference Generation Task presents a rigorous challenge for evaluating the strengths of aggregated language models, demanding the creation of comprehensive and pertinent reference lists. Unlike tasks focused on single-answer generation, this task necessitates nuanced understanding and the ability to synthesize information from multiple sources, effectively mirroring real-world research needs. A model’s success isn’t simply determined by identifying a relevant source, but by assembling a diverse and complete set of references, requiring it to move beyond surface-level keyword matching and demonstrate genuine comprehension of a given topic. This complexity makes it an ideal proving ground for testing whether aggregation techniques can genuinely enhance a model’s ability to explore a wider solution space and deliver more comprehensive outputs than any single model could achieve independently.
The application of a Compound AI System, coupled with an Aggregation Operation, to the Reference Generation Task yielded demonstrably improved outputs compared to those of individual Large Language Models. This enhancement wasn’t merely qualitative; rigorous measurement using the [latex]\ell_1[/latex] distance revealed a consistent divergence between the aggregated output and the closest result achievable by any single model, ranging from 0.03 to 0.15. This quantifiable difference indicates that the aggregation process unlocks a novel output space, generating references that no single model, when prompted independently, could produce, thus validating the potential of combined AI systems to surpass the limitations of their constituent parts and achieve a greater breadth and quality of response.
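The measurement itself is straightforward: treat each output as a vector and compute the smallest ℓ1 distance from the aggregated output to any single model's output. Below is a minimal sketch with made-up vectors; the 0.03 to 0.15 range quoted above comes from the paper's experiments, not from this toy data:

```python
def l1_distance(u, v):
    # Sum of absolute coordinate-wise differences.
    return sum(abs(a - b) for a, b in zip(u, v))

def min_l1_to_singles(aggregated, single_outputs):
    # Distance from the aggregated output to its nearest single-model output;
    # a strictly positive value means no single model produced it.
    return min(l1_distance(aggregated, s) for s in single_outputs)

x_agg = [0.50, 0.50]
singles = [[0.45, 0.52], [0.60, 0.40]]
print(min_l1_to_singles(x_agg, singles))  # -> ~0.07
```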
The successful application of intelligent aggregation to the Reference Generation Task provides compelling evidence that Elicitability Expansion moves beyond theoretical possibility to become a demonstrable practical benefit. Empirical results reveal that combining the outputs of individual Large Language Models, through a specifically designed aggregation operation, consistently yields more diverse and higher-quality reference lists. This isn’t simply a matter of increased volume; the observed [latex]\ell_1[/latex] distance, ranging from 0.03 to 0.15, between the aggregated output and the closest output achievable by any single model underscores a genuine shift in the solution space. Specifically, three distinct effects (increased recall, refined relevance, and novel combination) were consistently observed to contribute to this improved performance, confirming that the benefits of aggregation are not accidental but stem from a systematic enhancement of the model’s elicitable knowledge.
The pursuit of compound AI systems, as detailed in the study of aggregation mechanisms, feels predictably optimistic. The paper meticulously outlines feasibility expansion, support expansion, and binding set contraction – elegant concepts on paper. Yet, one suspects production environments will gleefully discover edge cases these mechanisms didn’t anticipate. As John von Neumann observed, “There is no telling what the future holds.” This holds particularly true when attempting to wrangle multiple LLMs; the theory suggests unlocking greater performance through aggregation, but the reality will invariably involve debugging emergent behaviors and patching unforeseen vulnerabilities. Tests are, after all, a form of faith, not certainty.
The Road Ahead
This exploration of aggregation’s mechanics offers a neat taxonomy, but anyone who’s deployed more than a handful of Large Language Models knows that ‘feasibility expansion’ quickly becomes a synonym for ‘expanding the surface area for failure’. The principal-agent framework is elegant, of course, until production data reveals that the ‘agents’ are actively adversarial, or simply hallucinating in unison. Any system where ‘binding set contraction’ is a desired outcome should be viewed with extreme skepticism; it suggests the system was fundamentally unstable to begin with.
The paper correctly identifies elicitability as a core constraint. Yet, the pursuit of ‘better prompts’ feels increasingly like an exercise in elaborate superstition. Documentation, naturally, will be the primary vector for propagating these illusions. If a bug is reproducible, it’s a feature, and the prompt is working as intended. The real challenge isn’t scaling aggregation, but accepting that increasingly complex systems are, at their core, brittle.
Future work will undoubtedly focus on automating prompt engineering, and inventing new methods for measuring ‘trustworthiness’. The field will discover, repeatedly, that anything self-healing just hasn’t broken yet. The more interesting question isn’t how to build more powerful compounds, but whether there’s a point where increased capability yields diminishing returns, and a corresponding increase in operational cost and unpredictability.
Original article: https://arxiv.org/pdf/2602.21556.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-02-27 06:00