AI Teams Now Outcreate Humans

Author: Denis Avetisyan

New research reveals that groups of artificial intelligence agents are demonstrably more creative than teams of people when tackling complex problems.

Multi-agent language model teams surpass human teams in generating creative and novel ideas, with significantly higher scores on both dimensions ([latex]d=1.50[/latex] and [latex]d=1.29[/latex], respectively), while maintaining comparable levels of usefulness ([latex]d=0.08[/latex], [latex]p=0.142[/latex]). The advantage also shows in the distribution of ideas, which is shifted toward greater novelty and usefulness, particularly within the top 5% ([latex]M=0.53[/latex] for LLM teams versus [latex]M=0.37[/latex] for human teams).

Multi-agent systems leveraging large language models outperform human teams in creative tasks through efficient exploration of semantic space, offering new insights into computational social science and AI design.

Despite longstanding challenges in replicating human ingenuity, artificial intelligence is rapidly advancing across cognitive domains. This is explored in ‘Multi-agent AI systems outperform human teams in creativity’, which investigates the creative potential of collaborative AI. Our findings demonstrate that teams of large language model agents not only exceed the performance of single agents, but also substantially surpass human teams in generating novel and useful ideas. How can we systematically leverage these insights to design multi-agent systems that consistently unlock augmented creative capabilities and reshape collaborative problem-solving?

Defining the Essence of Creative Potential

The essence of creativity isn’t simply generating new ideas, but producing those that are both original and valuable. This presents a fundamental challenge for any creative system – be it a human brain, an artificial intelligence, or an evolutionary process – because true innovation requires a delicate equilibrium. A concept lacking novelty is merely repetition, while an idea devoid of usefulness, however unique, remains an impractical curiosity. This balance is surprisingly difficult to achieve; systems often lean either towards purposeless random variation or towards predictable outputs lacking originality. Consequently, a comprehensive understanding of creative potential necessitates moving beyond simple measures of idea quantity and focusing instead on the synergistic interplay between novelty and pragmatic application.

Assessing creative potential extends far beyond simply tallying the number of ideas produced; a truly robust methodology must delve into the quality of those ideas and their practical application. Researchers are increasingly employing techniques like consensual assessment, where experts evaluate outputs for both novelty and usefulness, providing a more nuanced scoring system than sheer volume. This approach acknowledges that creativity isn’t merely about generating options, but about producing solutions that are both original and valuable. Furthermore, computational models are being developed to assess the ‘distance’ of an idea from existing knowledge, attempting to mathematically quantify its novelty, while simultaneously evaluating its feasibility based on established principles. These multifaceted evaluations strive to move beyond superficial metrics and capture the complex interplay of imagination and pragmatism inherent in genuinely creative endeavors.

Evaluating creative potential presents a significant challenge because distinguishing genuinely novel ideas from those arising purely by chance is remarkably difficult. Conventional assessments, such as simply counting the number of ideas generated, fail to account for the crucial element of usefulness: a random combination may be technically novel yet lack practical application or meaningful insight. Studies reveal that statistical models can often produce outputs indistinguishable from human brainstorming sessions when judged solely on novelty, highlighting the need for more sophisticated methodologies. These approaches must incorporate criteria beyond simple originality, demanding an evaluation of value, relevance, and the potential for positive impact to truly identify creative outputs and separate them from meaningless variations.

LLMs demonstrate a creativity advantage over humans by more consistently generating ideas that are both novel and useful, as evidenced by their distribution in [latex]\text{novelty-usefulness}[/latex] space compared to human-generated ideas.

Mapping Dialogue as a Semantic Trajectory

SemanticTrajectoryAnalysis, in the context of dialogue modeling, represents conversational flow as a continuous path within a multi-dimensional semantic space. This is achieved by treating each utterance or conversational turn as a point in this space, defined by its semantic meaning. The trajectory formed by connecting these points illustrates how the conversation evolves semantically over time, capturing shifts in topic, changes in perspective, and the overall argumentative or exploratory structure of the dialogue. Analyzing the characteristics of this trajectory – its length, curvature, and direction – provides quantifiable metrics for understanding conversational dynamics and identifying key moments of semantic divergence or convergence.

QwenEmbedding is utilized to transform conversational contributions, or ‘ideas’, into high-dimensional vectors. This vectorization process allows for the quantification of semantic distance between ideas; the cosine similarity between two vectors represents the degree of semantic relatedness, with lower values indicating greater divergence. Specifically, each token within a conversational turn is embedded, and these token embeddings are aggregated – typically through averaging – to produce a single vector representing the overall semantic content of that turn. This facilitates objective measurement of how topics shift and evolve throughout a dialogue, enabling computational analysis of conversational divergence and coherence.
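The pooling and similarity steps described above can be sketched in a few lines. This is a minimal illustration, not the study's pipeline: the three-dimensional "token embeddings" are made-up numbers standing in for real model output, and the helper names `mean_pool` and `cosine_similarity` are hypothetical.

```python
import math

def mean_pool(token_vectors):
    """Average equal-length token vectors into a single turn vector."""
    count = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors (range -1 to 1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy token embeddings, invented for illustration only.
turn_a = mean_pool([[1.0, 0.0, 0.0], [1.0, 2.0, 0.0]])  # averages to [1.0, 1.0, 0.0]
turn_b = mean_pool([[2.0, 2.0, 0.0]])                    # same direction as turn_a
turn_c = mean_pool([[0.0, 0.0, 3.0]])                    # orthogonal to turn_a

print(cosine_similarity(turn_a, turn_b))  # close to 1: semantically aligned
print(cosine_similarity(turn_a, turn_c))  # 0: no shared direction
```

Because cosine similarity ignores vector length, `turn_a` and `turn_b` count as fully related even though one is twice as long as the other; only direction in the embedding space matters.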

Principal Component Analysis (PCA) is implemented to address the high dimensionality inherent in representing conversational semantic trajectories as vectors. The initial vector space, derived from QwenEmbedding representations of dialogue turns, often consists of thousands of dimensions. PCA reduces this dimensionality to a smaller set of principal components – typically two or three – while retaining the maximum variance in the data. This reduction enables visualization of the semantic trajectories in a 2D or 3D space, allowing for qualitative assessment of conversational divergence and exploration patterns. The resulting plots illustrate how ideas evolve and relate to each other throughout the dialogue, facilitating analysis of conversational flow and topic shifts.
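To make the projection step concrete, here is a dependency-free sketch of PCA on a handful of turn vectors. It is an assumption-laden toy: the data are invented, the function names are hypothetical, and the eigenvectors are found by power iteration with deflation purely to keep the example self-contained. A real analysis would normally rely on a numerical library instead.

```python
import math

def center(rows):
    """Subtract the per-dimension mean so the data is centred at the origin."""
    n, dim = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(dim)]
    return [[row[j] - means[j] for j in range(dim)] for row in rows]

def covariance(centered):
    """Sample covariance matrix of already-centred rows."""
    n, dim = len(centered), len(centered[0])
    return [[sum(row[i] * row[j] for row in centered) / (n - 1)
             for j in range(dim)] for i in range(dim)]

def mat_vec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def top_eigenpair(matrix, iterations=200):
    """Dominant eigenvector and eigenvalue via power iteration."""
    dim = len(matrix)
    vec = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        product = mat_vec(matrix, vec)
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            break
        vec = [x / norm for x in product]
    eigenvalue = sum(v * p for v, p in zip(vec, mat_vec(matrix, vec)))
    return vec, eigenvalue

def pca_project(rows, k=2):
    """Project rows onto their first k principal components."""
    centered = center(rows)
    cov = covariance(centered)
    dim = len(cov)
    components = []
    for _ in range(k):
        vec, eigenvalue = top_eigenpair(cov)
        components.append(vec)
        # Deflation: remove the variance already explained by this component.
        cov = [[cov[i][j] - eigenvalue * vec[i] * vec[j] for j in range(dim)]
               for i in range(dim)]
    projected = [[sum(r * c for r, c in zip(row, comp)) for comp in components]
                 for row in centered]
    return projected, components

# Five toy "turn embeddings" in 3 dimensions (invented for illustration).
turns = [
    [0.0, 0.0, 0.0],
    [2.0, 0.5, 0.0],
    [4.0, -0.5, 0.1],
    [6.0, 0.5, 0.0],
    [8.0, 0.0, -0.1],
]
points_2d, components = pca_project(turns, k=2)
for turn_index, (x, y) in enumerate(points_2d):
    print(turn_index, round(x, 3), round(y, 3))
```

Plotting `points_2d` in turn order and colouring points by position in the conversation would give a toy version of the trajectory plots described above.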

Analysis of conversational trajectories in a semantic space reveals that high-creativity discussions ([latex]negative[/latex] distance from the semantic centroid) exhibit broad exploration across multiple concepts, while low-creativity discussions ([latex]positive[/latex] distance) remain narrowly focused, as visualized by the divergence between blue (early) and red (late) turns.

The Influence of Discussion Structure on Creative Outcomes

Four distinct discussion structures – OpenDiscussion, IterativeRefinement, InstructedDiscussion, and ProgressiveImprovement – were implemented to systematically evaluate their impact on creative outcomes. OpenDiscussion allowed for unrestricted idea generation and modification. IterativeRefinement focused on building upon existing ideas through incremental changes. InstructedDiscussion incorporated specific prompts or constraints to guide idea development. Finally, ProgressiveImprovement emphasized selecting and enhancing the most promising ideas from each turn. This controlled implementation allowed for quantitative comparison of how differing rules governing idea evolution affected the resulting semantic space and overall creative performance.

The implemented discussion structures – OpenDiscussion, IterativeRefinement, InstructedDiscussion, and ProgressiveImprovement – each enforce unique constraints on how ideas are altered and chosen during a conversational process. OpenDiscussion allows for unrestricted modification and selection, while IterativeRefinement prioritizes incremental changes based on prior statements. InstructedDiscussion utilizes explicit prompts to guide idea evolution, and ProgressiveImprovement focuses on selecting and building upon the most coherent ideas presented thus far. Consequently, these differing rules result in demonstrably distinct patterns of semantic exploration, as evidenced by variations in PathLength and GlobalCoherence metrics across the structures.

Quantitative analysis using PathLength and GlobalCoherence metrics demonstrated a statistically significant relationship between discussion structure and the characteristics of generated creative outcomes. PathLength, measuring the number of steps taken to reach a final idea, indicated that structures like OpenDiscussion fostered broader exploration, resulting in longer paths. Conversely, structures such as InstructedDiscussion prioritized focused refinement, yielding shorter PathLengths. GlobalCoherence, quantifying the semantic relatedness of ideas within a discussion, showed a corresponding trend. Statistical modeling revealed that discussion structure and the underlying language model jointly explained 26.8% of the variance in these metrics, indicating a substantial, measurable influence on both the breadth and focus of creative idea generation.
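The article does not give formulas for these metrics, so the following sketch rests on assumed definitions: path length is taken as the summed cosine distance between consecutive turn vectors, and global coherence as the mean cosine similarity over all pairs of turns. Both definitions, the function names, and the toy vectors are illustrative assumptions and may differ from what the study actually computed.

```python
import math
from itertools import combinations

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def path_length(turns):
    """ASSUMED definition: sum of cosine distances between consecutive turns."""
    return sum(1.0 - cosine_similarity(a, b) for a, b in zip(turns, turns[1:]))

def global_coherence(turns):
    """ASSUMED definition: mean cosine similarity over all pairs of turns."""
    pairs = list(combinations(turns, 2))
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)

# Toy 2-D turn vectors (invented): one discussion stays on topic, one wanders.
focused = [[1.0, 0.0], [1.0, 0.1], [1.0, 0.2]]
wandering = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

for name, turns in [("focused", focused), ("wandering", wandering)]:
    print(name, round(path_length(turns), 3), round(global_coherence(turns), 3))
```

Under these assumed definitions the wandering discussion has the longer path and the lower coherence, which is the trade-off between breadth and focus that the two metrics are meant to capture.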

Iterative refinement significantly boosts creativity in GPT-4.1, increasing ratings by [latex]\beta=+0.097[/latex] (p<0.001), while the o3-high model maintains consistently high creativity across all discussion structures, with only instructed discussion showing a slight improvement ([latex]\beta=+0.023[/latex], p=0.047).

LLMs: A New Paradigm for Creative Generation

Recent evaluations utilizing the Consensual Assessment Technique reveal a compelling advantage for Large Language Model (LLM) teams in generating creative outputs when contrasted with human teams. This rigorous assessment methodology, which relies on collective human judgment to evaluate idea quality, consistently positioned LLM teams as superior innovators. The data indicates that LLMs not only produce a greater volume of ideas, but also demonstrate a capacity for generating concepts deemed both novel and valuable by human evaluators – a statistically significant finding that challenges conventional understandings of creative intelligence and hints at a transformative potential for these models in fields reliant on innovative thought.

Research indicates that large language models (LLMs) exhibit a heightened capacity for generating ideas that are both original and practical when compared to human teams. Evaluations utilizing the Consensual Assessment Technique reveal a substantial difference in creative output, quantified by a Cohen’s d of 1.50. Cohen’s d measures the size of a difference rather than its statistical significance, and values of 0.8 or more are conventionally read as large, so 1.50 indicates a sizeable disparity. This suggests LLMs aren’t simply mimicking human creativity, but are actively producing a greater volume of ideas judged as both novel and valuable by human evaluators, potentially reshaping understandings of innovation and problem-solving processes.
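For readers unfamiliar with the statistic, Cohen's d is the difference between two group means divided by their pooled standard deviation. The sketch below uses the standard pooled-variance formula; the ratings are invented toy numbers, not data from the study, and the function name `cohens_d` is simply a label chosen here.

```python
import math
import statistics

def cohens_d(group_a, group_b):
    """Standardised mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

# Invented creativity ratings for illustration only.
llm_ratings = [4.0, 5.0, 6.0]
human_ratings = [2.0, 3.0, 4.0]

print(cohens_d(llm_ratings, human_ratings))  # means differ by 2, pooled SD is 1
```

A value of d = 1.50 therefore means the two group means sit one and a half pooled standard deviations apart.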

Analysis of the creative process itself revealed a striking difference between large language model (LLM) teams and human teams. Trajectory analysis, a method for mapping the evolution of ideas, accounted for 32.6% of the variance in LLM team creativity – significantly higher than the 17.0% observed in human teams. This suggests that the path LLMs take towards novel solutions is more predictable and, crucially, more efficient in generating creative outcomes, indicating a fundamentally different approach to idea generation and refinement compared to human collaboration. The greater explanatory power of trajectory analysis for LLMs highlights the potential to not only leverage their creative output but also to understand how they achieve it, potentially unlocking further enhancements to their innovative capabilities.
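"Variance accounted for" is the coefficient of determination, R², of a model that predicts creativity from trajectory features. The sketch below is a deliberate simplification: it fits a single predictor by ordinary least squares, whereas the study presumably combined several trajectory features. The data and the name `r_squared` are invented for illustration.

```python
def r_squared(x, y):
    """R^2 of a one-predictor ordinary least squares fit of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    ss_residual = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_total = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - ss_residual / ss_total

# Invented example: one trajectory feature versus a creativity rating.
feature = [1.0, 2.0, 3.0, 4.0]
creativity = [1.0, 3.0, 2.0, 4.0]

print(round(r_squared(feature, creativity), 2))  # share of variance explained
```

Read this way, 32.6% for LLM teams versus 17.0% for human teams means trajectory features predict LLM creativity roughly twice as well, while most of the variance remains unexplained in both cases.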

The emergence of large language models as potent creative engines compels a re-evaluation of longstanding beliefs about innovation and problem-solving. Recent research demonstrates that LLM teams not only match, but consistently surpass, human teams in generating novel and useful ideas, as assessed by the rigorous Consensual Assessment Technique. This isn’t merely a quantitative difference – trajectory analysis reveals that the creative process within LLM teams is more predictable and explainable than in their human counterparts. The implications extend beyond simple task completion; these models suggest a fundamentally different pathway to innovation, one characterized by rapid ideation, consistent quality, and a level of analytical transparency previously unattainable. Consequently, LLMs are poised to become indispensable tools, reshaping how individuals and organizations approach challenges and unlock new possibilities, effectively redefining the boundaries of creative potential.

Analysis of conversational trajectories reveals that LLM teams prioritize efficient exploration for creativity, benefiting from broad discussions and minimized path lengths, while human teams prioritize smooth conversational flow and global coherence, and show greater creativity gains from iterative discussion scaffolding than more advanced reasoning models like o3-high.

The study’s findings regarding multi-agent systems and creative problem-solving echo a fundamental principle of complex systems: emergent behavior arises from the interaction of simpler components. As Andrey Kolmogorov observed, “The most important things are the ones we don’t know.” This sentiment resonates with the research’s unveiling of unexpectedly effective strategies within the AI teams. The AI’s superior performance isn’t simply a matter of processing power, but stems from a collective exploration of ‘semantic trajectories’ – a dynamic, iterative process where the system’s behavior reveals unforeseen creative avenues. Every optimization within the AI architecture, as demonstrated by the results, inevitably creates new tension points, demanding a holistic understanding of the entire system to unlock true creative potential.

The Road Ahead

The demonstration that multi-agent systems can surpass human performance in creative tasks is not, perhaps, surprising. Rather, it illuminates a fundamental principle: performance is not inherent to intelligence, but emerges from the structure of exploration. These systems, unburdened by the cognitive shortcuts and ingrained assumptions of human teams, navigate semantic space with a relentless, if somewhat naive, efficiency. Every new dependency – each agent added to the collective – is the hidden cost of freedom, requiring careful consideration of the resulting systemic complexity.

The trajectory analysis employed here provides a valuable, if preliminary, map of this exploration. However, it remains to be seen whether these computational ‘creative paths’ genuinely reflect novel thought, or merely efficient recombination of existing ideas. The question isn’t simply whether these systems create, but how their internal representations of novelty differ from human intuition – a difference likely embedded in the very architecture of their semantic landscapes.

Future work must address the limitations of current evaluation metrics. Simple assessments of ‘novelty’ or ‘usefulness’ fail to capture the nuanced, often subjective, nature of creativity. The field requires a more sophisticated understanding of the feedback loops between agent interaction, semantic exploration, and the emergence of genuinely innovative solutions. The challenge isn’t building creative AI, but understanding the systemic principles that allow creativity – in any form – to flourish.

Original article: https://arxiv.org/pdf/2605.17885.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-05-19 11:33