Author: Denis Avetisyan
Researchers are exploring whether large vision-language models can replicate the complex, open-ended creative processes observed in systems like Picbreeder, potentially unlocking new frontiers in artificial intelligence.

This review investigates how factors like memory, exploration, and multi-agent interactions influence the emergence of diverse and high-quality images when using large vision-language models inspired by neuroevolutionary techniques.
Historically, automating creative processes has proven challenging due to the difficulty of replicating the seemingly boundless novelty characteristic of human ingenuity. This paper, ‘In Search of the Ingredients of Open-Endedness: Replicating Picbreeder with Large Vision-Language Models’, investigates this capacity by recreating the Picbreeder system (a collaborative, human-driven engine for generating diverse images through interactive evolution) using frontier Vision-Language Models (VLMs). We find clear qualitative differences in the output of our VLM-driven system compared to the original human baseline, and explore the roles of exploratory noise, agent diversity, and memory in fostering open-endedness. Can these factors unlock truly generative capabilities in artificial agents, and what does this reveal about the fundamental ingredients of creativity itself?
Breaking the Canvas: The Limits of Human Direction
For centuries, the creation of visually complex imagery was fundamentally a human endeavor, demanding substantial artistic skill and meticulous direction. Artists, designers, and illustrators painstakingly crafted each element – line, color, form – relying on personal aesthetic judgment and technical proficiency to realize a desired outcome. This traditional approach, while capable of producing breathtaking works, inherently constrained the exploration of potential designs. The creative process was limited by the artist’s individual bandwidth – their capacity to mentally manipulate and refine countless variables – and by the subjective nature of artistic preference, meaning vast regions of the possible design space remained unexplored simply due to differing tastes or unconsidered alternatives. The resulting images, however beautiful, represented a single point within an immense landscape of potential visual forms, a landscape largely inaccessible without shifting the paradigm of creation itself.
The creation of complex visual designs has historically been constrained not by a lack of potential, but by the limitations of human perception and decision-making. A designer, faced with infinite possibilities in color, shape, and composition, can only evaluate a tiny fraction of the total design space due to cognitive bandwidth. Furthermore, aesthetic judgements are inherently subjective; personal preferences inevitably narrow the scope of exploration, preventing the discovery of truly novel or unexpected designs. This bottleneck restricts creative output, as the vast landscape of potential imagery remains largely unexplored, even by skilled artists – a limitation that automated systems aim to overcome by systematically traversing this otherwise inaccessible territory.
Interactive Evolutionary Computation (IEC) represented an early attempt to leverage computational power for creative tasks, yet it remained fundamentally reliant on substantial human input. The process involved presenting a set of computer-generated image variations to a human evaluator, who would then select the most aesthetically pleasing option. This selection served as a “fitness” signal, guiding the algorithm to generate further iterations, but the entire cycle necessitated continuous human judgment – a bottleneck that limited both the speed and scope of exploration. While IEC demonstrated the potential of evolutionary algorithms in design, the significant cognitive load placed on the human operator prevented fully automated creativity and restricted the search to areas aligned with subjective human preferences, rather than truly novel possibilities.

The Algorithmic Muse: Self-Directed Creation
Vision-Language Models (VLMs) represent a shift in image generation by functioning as self-directed agents. These models autonomously produce images based on textual prompts and, crucially, possess the capacity for self-assessment. This is achieved through internal feedback mechanisms that evaluate generated images against the initial prompt and a defined set of aesthetic or quality criteria. Unlike traditional pipelines requiring human intervention for iterative refinement, VLMs can independently iterate on image creation, modifying parameters and compositions until a satisfactory result is achieved. This capability distinguishes them from systems that simply translate text to image; VLMs exhibit agency in the creative process, allowing for exploration beyond explicitly defined parameters.
The substitution of algorithmic processes for human direction in image generation enables a substantially expanded creative search space. Traditional artistic creation is limited by the time, skill, and inherent biases of the artist. Automated systems, however, can iteratively generate and evaluate a far greater number of image variations, exploring combinations of parameters and aesthetic choices beyond typical human consideration. This capacity for exhaustive variation is not simply a quantitative increase; it allows for the discovery of novel and unexpected visual forms that might not otherwise emerge, effectively bypassing established artistic conventions and opening pathways to previously unexplored aesthetic territories. The system’s capacity is limited primarily by computational resources, rather than subjective constraints.
The process of automated image creation relies on NeuroEvolution of Augmenting Topologies (NEAT) to evolve Compositional Pattern Producing Networks (CPPNs). NEAT is a genetic algorithm that optimizes the structure and weights of CPPNs, effectively searching for network configurations capable of generating desired visual patterns. These CPPNs function as generative blueprints; given an input coordinate, the network computes a corresponding color value, defining the image. By evolving the CPPN’s architecture and parameters, the system can explore a vast space of possible images, leading to the creation of diverse and complex visual content without explicit programming of specific image features. The CPPN’s internal structure dictates the patterns and textures produced, and NEAT’s optimization process tailors this structure to achieve specific aesthetic goals or meet defined criteria.
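To make the coordinate-to-color idea concrete, here is a minimal sketch of a CPPN-style renderer. The network below is hand-wired rather than evolved, and its node functions, weights, and the names `cppn` and `render` are illustrative assumptions, not taken from the paper or from Picbreeder.

```python
import math

# Hypothetical, hand-wired CPPN. In a NEAT-based system the topology and
# weights would be evolved; here they are fixed purely for illustration.
def cppn(x: float, y: float) -> float:
    """Map a coordinate in [-1, 1]^2 to a grayscale intensity in [0, 1]."""
    d = math.sqrt(x * x + y * y)           # distance from the center, a common extra input
    h1 = math.sin(3.0 * x + 1.5 * d)       # periodic node: produces repetition
    h2 = math.exp(-(2.0 * y) ** 2)         # Gaussian node: symmetric about y = 0
    out = math.tanh(1.2 * h1 - 0.8 * h2)   # output node, value in (-1, 1)
    return (out + 1.0) / 2.0               # rescale to [0, 1]

def render(width: int, height: int) -> list[list[float]]:
    """Query the network once per pixel to build the image row by row."""
    rows = []
    for j in range(height):
        y = -1.0 + 2.0 * j / (height - 1)
        rows.append([cppn(-1.0 + 2.0 * i / (width - 1), y) for i in range(width)])
    return rows

image = render(8, 8)
```

Because the image is defined by a function of continuous coordinates, the same network can be rendered at any resolution by changing the arguments to `render`.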

Quantifying the Unseen: Measuring Creative Exploration
Tree Balance is employed as a quantitative metric for evaluating the diversity of images generated by the Vision-Language Model (VLM). This metric functions by constructing a tree structure representing the relationships between generated images based on feature similarity; a balanced tree indicates a wider exploration of the image space, while an imbalanced tree suggests the model is converging on a limited set of designs. Specifically, Tree Balance assesses the distribution of images across the branches of this tree, with higher values indicating greater diversity in the generated image population and reduced redundancy. The calculation considers the number of images present in each branch, penalizing scenarios where a single branch dominates the structure.
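The description above does not give a formula, so the following is only an illustrative stand-in for a balance score, not the paper's definition of Tree Balance. It assumes the branch sizes are already known and uses normalized Shannon entropy, which equals 1 when all branches hold the same number of images and approaches 0 when one branch dominates. The function name `branch_balance` is hypothetical.

```python
import math

def branch_balance(branch_sizes: list[int]) -> float:
    """Normalized entropy of branch sizes: 1.0 = perfectly even, near 0 = one branch dominates.

    Assumption: this is a generic evenness measure chosen for illustration,
    not the Tree Balance formula used in the paper.
    """
    sizes = [s for s in branch_sizes if s > 0]   # empty branches carry no information
    if len(sizes) < 2:
        return 0.0                               # a single branch has no balance to speak of
    total = sum(sizes)
    entropy = -sum((s / total) * math.log(s / total) for s in sizes)
    return entropy / math.log(len(sizes))        # divide by the maximum possible entropy

even = branch_balance([5, 5, 5, 5])
skewed = branch_balance([97, 1, 1, 1])
```

Here `even` comes out at the maximum score while `skewed` is much lower, matching the intuition that a dominant branch signals convergence on a narrow set of designs.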
Semantic Recall is implemented as a quantitative measure of a VLM’s ability to leverage previously generated successful designs during new image creation. This is achieved by embedding both textual prompts and generated images into a shared embedding space, allowing for similarity comparisons. The recall value, calculated as the proportion of successfully recalled designs from a reference set, currently achieves approximately 0.7 when utilizing a context length of [latex]C_L = 1[/latex]. This indicates the system effectively rediscovers and builds upon prior outputs within the specified context window, demonstrating a capacity for iterative refinement and design reuse.
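One way to read “proportion of successfully recalled designs” is as a thresholded nearest-neighbor check in the shared embedding space. The sketch below follows that reading; the cosine threshold of 0.8, the toy two-dimensional vectors, and the names `cosine` and `semantic_recall` are assumptions for illustration and may differ from how the paper computes the metric.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_recall(reference: list[list[float]],
                    archive: list[list[float]],
                    threshold: float = 0.8) -> float:
    """Fraction of reference embeddings matched by at least one archive embedding.

    Assumption: a reference design counts as recalled when some generated
    image reaches the (hypothetical) similarity threshold.
    """
    if not reference:
        return 0.0
    hits = sum(1 for r in reference
               if any(cosine(r, g) >= threshold for g in archive))
    return hits / len(reference)

# Toy embeddings standing in for reference concepts and generated images.
reference = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
archive = [[0.9, 0.1], [0.1, 0.95]]
recall = semantic_recall(reference, archive)
```

In this toy case two of the three reference vectors have a close match in the archive, so `recall` is two thirds.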
The incorporation of exploratory noise into the image selection process addresses the potential for a generative agent to converge on local optima during image creation. This technique introduces stochasticity, prompting the agent to consider variations beyond the immediately apparent best options. Empirical results demonstrate that a moderate noise level, specifically [latex] \epsilon = 0.25 [/latex], correlates with improvements in visual coverage, indicating a broader exploration of the image space. This suggests that controlled randomization can effectively mitigate limitations imposed by purely deterministic selection criteria, leading to a more diverse and potentially innovative output.
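A common way to add this kind of noise is epsilon-greedy selection: with probability [latex]\epsilon[/latex] the agent picks a candidate uniformly at random, otherwise it picks its top-rated one. The sketch below assumes that interpretation of [latex]\epsilon[/latex]; the function name `select_candidate` and the numeric scores are hypothetical, and the paper's exact noise mechanism may differ.

```python
import random

def select_candidate(scores: list[float], epsilon: float = 0.25, rng=None) -> int:
    """Return the index of the chosen candidate.

    Assumption: epsilon is the probability of ignoring the scores and
    choosing uniformly at random (epsilon-greedy selection).
    """
    rng = rng if rng is not None else random
    if rng.random() < epsilon:
        return rng.randrange(len(scores))                      # explore: any candidate
    return max(range(len(scores)), key=scores.__getitem__)     # exploit: best-scoring candidate

# Example: four candidate images scored by the agent; index 2 is the favorite.
scores = [0.1, 0.4, 0.9, 0.3]
rng = random.Random(0)
picks = [select_candidate(scores, epsilon=0.25, rng=rng) for _ in range(10000)]
explored = sum(1 for p in picks if p != 2) / len(picks)
```

With four candidates and [latex]\epsilon = 0.25[/latex], roughly 0.25 × 3/4 of selections land on something other than the favorite, since a random pick can still hit the top-rated image.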
![Varying the number of agents during VLM-Picbreeder sessions, guided by [latex]N_{ANA}[/latex], impacts the diversity and quality of generated images, as demonstrated by the highest Semantic Recall and Visual Coverage archives.](https://arxiv.org/html/2605.23908v1/x6.png)
Expanding the Creative Horizon: Context and Collective Intelligence
The capacity of Vision-Language Models (VLMs) to generate compelling and logically consistent designs is fundamentally linked to the length of context they can process. A VLM’s ability to maintain coherence isn’t simply about understanding a single prompt; it’s about retaining and referencing a history of interactions. Providing agents with a richer, more extensive record of previous design iterations, feedback, and evolving goals allows them to build upon earlier ideas, avoid repetition, and ultimately produce more sophisticated and nuanced outputs. This expanded contextual awareness enables the model to resolve ambiguities, understand implicit requirements, and generate designs that reflect a cohesive and consistent creative vision – effectively moving beyond isolated image creation towards a sustained, iterative design process.
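As a rough illustration of a bounded memory, the sketch below keeps only the most recent selections and formats them for inclusion in a prompt. It assumes the context length counts past selections; the class name `SelectionMemory`, its methods, and the prompt wording are hypothetical and not taken from the paper.

```python
from collections import deque

class SelectionMemory:
    """Keeps the most recent `context_length` selections an agent has made.

    Assumption: context length is measured in number of past selections.
    """

    def __init__(self, context_length: int):
        # A deque with maxlen drops the oldest entry automatically when full.
        self._history = deque(maxlen=context_length)

    def record(self, description: str) -> None:
        self._history.append(description)

    def as_prompt_context(self) -> str:
        if not self._history:
            return "No previous selections."
        lines = [f"{i + 1}. {d}" for i, d in enumerate(self._history)]
        return "Previous selections (oldest first):\n" + "\n".join(lines)

memory = SelectionMemory(context_length=2)
for choice in ["spiral", "butterfly", "skull"]:
    memory.record(choice)
context = memory.as_prompt_context()
```

With a context length of 2, the earliest selection has already been dropped by the time the third is recorded, so only the two most recent choices reach the prompt.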
The creative potential of vision-language models is significantly amplified when employing a multi-agent system, effectively broadening the scope of exploration during image generation. By simulating a collaborative environment with numerous agents – up to 1000 in recent studies – researchers observe a marked increase in both semantic coverage and tree balance. This approach allows for a more diverse range of ideas to be considered, as each agent, potentially possessing unique characteristics or “personalities”, contributes to the design process from a distinct perspective. The resulting increase in semantic coverage ensures a wider variety of concepts are represented, while improved tree balance indicates a more thorough exploration of the design space, preventing premature convergence on limited solutions and ultimately fostering more innovative and nuanced visual outputs.
The advancement of multi-agent systems for creative tasks relied heavily on the capabilities of large multimodal models, notably Gemini-3-Pro-Preview and SigLIP-2-B. These models provided the necessary foundation for nuanced image generation, allowing agents to not only create visuals but also to interpret and respond to complex prompts with greater fidelity. Crucially, their robust embedding capabilities enabled effective semantic communication between agents, facilitating a collaborative design process where ideas could be shared, refined, and built upon. This ability to represent concepts as vectors in a high-dimensional space allowed the system to explore a significantly broader range of creative possibilities, moving beyond simple visual outputs to more abstract and conceptually rich designs. The performance of these models in understanding and translating semantic information proved essential for orchestrating the interactions within the multi-agent system and ultimately driving the diversity and quality of generated content.

Unbound Creation: The Future of Algorithmic Art
Recent advancements showcase a pathway toward automating sophisticated creative tasks by combining Vision-Language Models (VLMs) with neuroevolutionary algorithms like NEAT. This methodology moves beyond simple generative models by enabling a system to not just produce content, but to actively search for novel and effective designs. The process involves defining specific metrics – assessing qualities like aesthetic appeal or functional performance – and then utilizing NEAT to evolve populations of designs generated and evaluated through the VLM. This closed-loop system effectively allows a machine to autonomously refine its creative output, demonstrating the feasibility of complex creative processes previously reliant on human intuition and expertise, and opening doors to entirely new forms of automated design exploration.
The convergence of advanced algorithms and vision-language models heralds a new era of open-ended search in creative design. Previously, automated systems required explicit human direction, limiting exploration to predefined parameters. Now, these systems can autonomously generate, evaluate, and refine designs, effectively bypassing the constraints of human bias or limited imagination. This capability fosters a continuous cycle of innovation, allowing algorithms to independently discover novel solutions and aesthetic possibilities. The implications extend beyond mere efficiency; it suggests the potential for genuinely original creations, born not from human intent, but from the iterative process of algorithmic evolution and the unbiased assessment of aesthetic metrics. This unlocks previously inaccessible design spaces and promises a future where creativity is not solely a human domain.
Ongoing research endeavors are directed towards bolstering the efficacy of automated creative processes through iterative refinement of existing techniques and the investigation of novel architectural designs. This includes exploring alternative neural network structures and optimization algorithms to enhance the system’s capacity for generating genuinely original content. The scope of application is also expanding, with efforts underway to adapt these methodologies beyond their initial focus, potentially impacting fields like music composition, architectural design, and even scientific discovery – effectively broadening the frontier of what can be achieved through algorithmic creativity and unlocking possibilities across diverse creative domains.

The pursuit within this research mirrors a fundamental principle of complex systems: pushing boundaries reveals inherent structures. It’s not simply about replicating Picbreeder’s output with Vision-Language Models, but about understanding how open-endedness emerges through iterative exploration and selection, a process akin to reverse-engineering creativity itself. As Robert Tarjan aptly stated, “Programming is the art of defining a problem so that a computer can solve it.” This holds true here; defining the conditions for artificial creativity (memory, exploration, and multi-agent interaction) allows the models to ‘solve’ the problem of generating novel and diverse images, showcasing the power of structured experimentation in unlocking emergent behaviors.
Beyond the Seed: Future Directions
The attempt to replicate Picbreeder’s emergent aesthetic using large vision-language models isn’t about achieving a perfect imitation. It’s a stress test. The system reveals where current architectures falter when pushed beyond rote memorization and into genuinely novel territory. The observed reliance on initial conditions, the struggle to sustain diversity without explicit incentives – these aren’t bugs, but indicators. They pinpoint the missing components in a larger framework. Reality, after all, is open source – the code exists, but the tools to read it are still under development.
Future work shouldn’t focus solely on scaling up models or refining prompts. The real challenge lies in building systems that forget as intelligently as they learn, and that embrace constraint not as a limitation, but as a catalyst for innovation. Investigating the role of internal “noise” – the seemingly random fluctuations within a network – could prove crucial. Perhaps true creativity isn’t about finding the optimal solution, but about exploiting the beautiful imperfections inherent in any complex system.
The next iteration requires a shift in perspective. It’s not about creating art, but about modeling the process of aesthetic discovery. The goal isn’t to judge the output, but to understand the underlying generative mechanics. Only then can one begin to reverse-engineer the elusive ingredients of open-endedness and, perhaps, glimpse the architecture of imagination itself.
Original article: https://arxiv.org/pdf/2605.23908.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-05-27 06:21