Can AI Truly Play?

Author: Denis Avetisyan


A new benchmark leveraging the complexity of human-designed games reveals the significant gap between current AI and human-level general intelligence.

Researchers introduce the AI GameStore, a scalable platform for evaluating and advancing machine intelligence through open-ended game playing.

Conventional AI benchmarks struggle to assess the breadth of human intelligence, quickly becoming saturated and failing to capture generalizable abilities. To address this, we introduce the ‘AI Gamestore: Scalable, Open-Ended Evaluation of Machine General Intelligence with Human Games’, a platform leveraging the diversity of human-designed games as a challenging and open-ended benchmark for machine general intelligence. Our initial evaluation, using a corpus of 100 games sourced from popular app stores, reveals substantial performance gaps between current vision-language models and human players, particularly in tasks requiring robust world-model learning and strategic planning. Can this approach, built on the ‘Multiverse of Human Games’, provide a more effective pathway towards building truly intelligent machines?


The Fragility of Expertise: Beyond Narrow Artificial Intelligence

Contemporary artificial intelligence frequently demonstrates remarkable proficiency within specifically defined parameters, routinely surpassing human capabilities in areas like chess, image recognition, and data analysis. However, this expertise remains fundamentally brittle; these systems struggle when confronted with situations slightly deviating from their training data. Unlike human intelligence, which exhibits a seamless capacity for transfer learning – applying knowledge gained in one domain to novel, unrelated problems – current AI typically requires extensive retraining for each new task. This limitation underscores a critical distinction: present AI embodies narrow intelligence, excelling at pre-programmed functions, while genuine general intelligence demands a flexible, adaptable system capable of reasoning, learning, and problem-solving across a broad spectrum of challenges – a capacity that remains largely elusive.

The pursuit of artificial general intelligence hinges on developing systems that transcend the limitations of narrow AI, demanding a capacity for flexible learning and complex problem-solving. Unlike current AI, which typically excels at pre-defined tasks, true general intelligence requires an ability to adapt to novel situations, integrate knowledge from disparate sources, and reason effectively in unfamiliar contexts. This necessitates moving beyond algorithms optimized for specific datasets and towards architectures that can continuously learn and refine their understanding of the world, much like human cognition. Such systems would not simply recognize patterns, but rather construct abstract models, formulate hypotheses, and creatively apply learned principles to overcome challenges – a crucial step towards machines that can genuinely think and innovate.

The pursuit of artificial general intelligence demands evaluation methods that transcend traditional pattern recognition tasks. Current benchmarks often reward systems for memorizing statistical correlations within datasets, rather than demonstrating genuine understanding or adaptable reasoning. This limitation hinders accurate progress assessment, as models can achieve high scores through superficial mimicry without possessing the capacity to generalize to novel situations. Truly insightful benchmarks require scenarios demanding abstract thought, common-sense reasoning, and the ability to learn and apply knowledge across diverse contexts – challenges that necessitate more than simply identifying pre-existing patterns within training data. Such assessments are crucial for discerning whether an AI is genuinely intelligent, or merely a sophisticated pattern-matching machine.

Evaluations of artificial intelligence often fall short in accurately gauging progress towards genuine general intelligence. Recent testing on a novel platform of 100 human-designed games reveals a significant disparity between current AI capabilities and human performance; even state-of-the-art vision-language models, typically lauded for their advancements, achieved less than 10% of the average human score across the majority of these games. This result underscores a critical limitation of existing benchmarks, which frequently prioritize pattern recognition over adaptable problem-solving and flexible learning, skills essential for true intelligence but not necessarily reflected in high scores on conventional tests. The low performance highlights the need for more robust and diverse evaluation methods that can better assess an AI’s capacity for genuine cognitive flexibility and complex reasoning.

A Crucible of Complexity: The AI GAMESTORE Approach

AI GAMESTORE utilizes a benchmark composed of games designed by humans to evaluate artificial intelligence agents. This approach prioritizes scalability through the inherent variety present in human-created game designs, allowing for a broad assessment of AI capabilities beyond narrowly defined tasks. The platform’s open-ended nature contrasts with traditional benchmarks by offering challenges requiring adaptable problem-solving skills, rather than optimized performance on fixed scenarios. The use of human-designed games ensures a level of complexity and ingenuity that pushes AI development towards more general intelligence, and provides a readily expandable test suite as new games are contributed.
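To make the evaluation setup concrete, the sketch below shows one plausible way such a harness could be wired together, assuming a hypothetical agent-environment interface in which the agent receives a screenshot-style observation and replies with a text action; the paper does not prescribe this API, so the class and method names are illustrative.

```python
from typing import Any, Protocol


class GameEnv(Protocol):
    """Hypothetical game wrapper; the paper does not specify this interface."""

    def reset(self) -> Any:
        """Start a new episode and return the first observation (e.g. a screenshot)."""
        ...

    def step(self, action: str) -> tuple[Any, float, bool]:
        """Apply an action and return (observation, reward, done)."""
        ...


class Agent(Protocol):
    """Hypothetical wrapper around a vision-language model."""

    def act(self, observation: Any) -> str:
        """Choose a text action such as 'tap 120 340' from the current observation."""
        ...


def play_episode(env: GameEnv, agent: Agent, max_steps: int = 500) -> float:
    """Run one episode and return the raw score the agent accumulates."""
    obs = env.reset()
    total = 0.0
    for _ in range(max_steps):
        action = agent.act(obs)              # the model decides from the screenshot
        obs, reward, done = env.step(action)
        total += reward
        if done:
            break
    return total
```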

AI GAMESTORE utilizes a diverse collection of human-designed games to establish a robust testing environment for artificial intelligence agents. This approach moves beyond narrow, task-specific benchmarks by incorporating a wide range of game types, complexities, and strategic demands. The platform’s reliance on games created by humans ensures exposure to the varied problem-solving scenarios and unpredictable elements common in real-world applications, offering a more comprehensive evaluation of an AI’s adaptability, generalization capabilities, and overall intelligence compared to synthetic or rule-based environments. This breadth of challenges is crucial for identifying limitations and driving improvements in AI performance across a spectrum of cognitive skills.

AI GAMESTORE employs Large Language Models (LLMs) to procedurally generate a diverse range of games, effectively creating a virtually limitless supply of evaluation challenges. These LLMs are prompted to define game rules, objectives, and initial conditions, resulting in novel game environments without requiring manual design. The platform supports varying complexity levels and game types, ensuring a broad spectrum of tests for AI agents. This automated generation process enables continuous benchmarking and assessment of AI performance across a far wider range of scenarios than traditional, static benchmarks permit, circumventing the limitations of hand-crafted evaluation suites.
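As an illustration of that generation step, a minimal sketch might prompt a language model for a structured game specification and validate it before admitting it to the benchmark pool; the prompt, JSON schema, and `llm_complete` callable below are assumptions for the sake of the example, not the platform’s actual interface.

```python
import json

GAME_SPEC_PROMPT = """Design a small single-player puzzle game.
Return strict JSON with keys: "title", "rules" (list of strings),
"objective" (string), "initial_state" (object), "difficulty" (1-5)."""


def generate_game_spec(llm_complete, seed_theme: str) -> dict:
    """Ask an LLM to author a game spec.

    `llm_complete` is any prompt -> text callable; it is a placeholder here,
    since the paper does not name a specific model API.
    """
    raw = llm_complete(f"{GAME_SPEC_PROMPT}\nTheme: {seed_theme}")
    spec = json.loads(raw)  # fail loudly if the model did not return valid JSON
    # Minimal validation before the spec is admitted to the benchmark pool.
    required = {"title", "rules", "objective", "initial_state", "difficulty"}
    missing = required - spec.keys()
    if missing:
        raise ValueError(f"Incomplete game spec, missing: {missing}")
    return spec
```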

AI GAMESTORE facilitates ongoing evaluation of artificial intelligence by presenting agents with a continuously updated suite of challenges. Initial benchmarking using the platform has revealed a significant performance gap between current state-of-the-art AI models and human players; these models achieve a Geometric Mean Score of 8.5% when evaluated across the diverse set of human-designed games, indicating substantial room for improvement in adapting to dynamic and unpredictable environments. This metric provides a quantifiable baseline for tracking progress as new algorithms and training methodologies are developed and implemented.
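For reference, a geometric mean over human-normalized per-game scores can be computed as in the sketch below; the normalization against average human score and the small score floor are assumptions about details the summary above does not specify.

```python
import math


def geometric_mean_score(model_scores, human_scores, floor=1e-3):
    """Geometric mean of per-game scores normalized by average human score.

    A small floor keeps games where the model scores zero from driving the
    whole product to zero; the paper's exact normalization may differ.
    """
    ratios = [max(m / h, floor) for m, h in zip(model_scores, human_scores)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Toy example: three games, model far below human level on each.
print(geometric_mean_score([12, 3, 40], [200, 150, 300]))  # ~0.05, i.e. ~5% of human score
```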

Deconstructing Cognition: Profiling Skills Through Gameplay

Cognitive Capability Profiling involves the systematic assessment of skills directly correlated with performance within game environments. This process moves beyond general intelligence testing to pinpoint specific aptitudes, such as spatial reasoning, problem-solving, reaction time, and decision-making under pressure. Measurements are derived from in-game actions and telemetry, quantifying a player or AI agent’s proficiency in areas critical for success. The resulting profiles allow for targeted skill development, personalized gameplay experiences, and the objective evaluation of AI agent capabilities, enabling improvements in both player training and artificial intelligence algorithms.

The cognitive skill evaluation framework utilizes three core concepts to assess an agent’s intelligence. World Model Learning measures the ability to construct and refine an internal representation of the game environment based on sensory input. Long-Term Memory evaluates the capacity to store and recall previously learned information, including successful strategies and environmental features, for application in novel situations. Finally, Planning assesses the agent’s capability to formulate and execute sequences of actions to achieve defined goals, considering both immediate and future consequences; these three elements are quantified through performance metrics derived from interactive gameplay scenarios.
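A minimal sketch of how such a profile might be represented and aggregated from per-game metrics is shown below; the field names follow the three concepts above, but the schema itself is illustrative rather than taken from the paper.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class CognitiveProfile:
    """Illustrative skill profile; axis names follow the text, the schema does not."""
    world_model: float       # how well the agent predicts the environment's dynamics
    long_term_memory: float  # reuse of previously learned strategies and features
    planning: float          # quality of multi-step action sequences toward a goal


def profile_from_games(per_game_metrics: list[dict]) -> CognitiveProfile:
    """Average per-game skill scores (each assumed to lie in [0, 1]) into one profile."""
    return CognitiveProfile(
        world_model=mean(g["world_model"] for g in per_game_metrics),
        long_term_memory=mean(g["long_term_memory"] for g in per_game_metrics),
        planning=mean(g["planning"] for g in per_game_metrics),
    )
```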

Multi-Agent Interaction scenarios within the platform evaluate cognitive skill performance through dynamic engagements with other AI agents. These scenarios move beyond isolated task completion by introducing elements of cooperation and competition, requiring agents to assess the intentions and actions of others. Performance is measured by an agent’s ability to successfully navigate these interactions, adapt to changing circumstances dictated by other agents, and optimize its own strategy based on observed behaviors. Data points collected include response times to agent actions, success rates in collaborative tasks, and competitive outcomes, providing a granular assessment of adaptability and strategic thinking in complex, populated environments.
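The sketch below shows one plausible way to collapse raw interaction events into those aggregate data points; the record layout is an assumption, not the platform’s logging format.

```python
from dataclasses import dataclass, field


@dataclass
class InteractionLog:
    """Illustrative record of one multi-agent scenario; field names are assumptions."""
    response_times_s: list[float] = field(default_factory=list)
    cooperative_successes: int = 0
    cooperative_attempts: int = 0
    competitive_wins: int = 0
    competitive_matches: int = 0

    def summary(self) -> dict:
        """Collapse raw events into the aggregate data points described above."""
        return {
            "mean_response_s": sum(self.response_times_s) / max(len(self.response_times_s), 1),
            "coop_success_rate": self.cooperative_successes / max(self.cooperative_attempts, 1),
            "win_rate": self.competitive_wins / max(self.competitive_matches, 1),
        }
```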

Automated Level Generation (ALG) is implemented to dynamically create increasingly complex game environments, facilitating continuous skill development in AI agents. This process moves beyond static test scenarios by algorithmically producing novel level layouts, obstacle configurations, and resource distributions. The system adjusts these parameters based on the agent’s performance metrics, increasing difficulty as proficiency improves and introducing new challenges to target specific skill gaps. ALG allows for scalable evaluation; a large volume of diverse levels can be generated without manual design, enabling robust assessment of an agent’s generalization capabilities and adaptability across a wider range of conditions than would be feasible with pre-designed levels.
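A simple performance-driven difficulty controller of this kind might look like the sketch below; the target success rate, thresholds, and level range are illustrative values, not parameters from the paper.

```python
def adjust_difficulty(current_level: int, recent_success_rate: float,
                      target: float = 0.6, step: int = 1,
                      min_level: int = 1, max_level: int = 10) -> int:
    """Raise the generated level's difficulty when the agent is comfortably above the
    target success rate, lower it when the agent is clearly below it."""
    if recent_success_rate > target + 0.1:
        current_level = min(current_level + step, max_level)
    elif recent_success_rate < target - 0.1:
        current_level = max(current_level - step, min_level)
    return current_level


# Example: an agent winning 80% of its recent levels graduates to harder ones.
print(adjust_difficulty(3, 0.8))  # -> 4
```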

The Infinite Game: Scaling Intelligence Through Complexity

Game difficulty serves as a crucial benchmark for assessing artificial intelligence capabilities and fostering adaptive learning processes. Unlike static evaluations, varying levels of challenge expose an AI’s limitations and strengths, revealing where generalization succeeds or fails. A system proficient only at simple tasks demonstrates a narrow intelligence; however, an AI that can dynamically adjust to increasing complexity – mastering easier challenges before tackling more intricate ones – exhibits a hallmark of more robust and human-like cognition. This principle underpins the development of advanced learning algorithms, allowing systems to prioritize challenges that yield the most significant gains in performance and refine strategies through iterative exposure to increasingly difficult scenarios. Consequently, carefully curated difficulty gradients aren’t merely obstacles, but essential components in training AI towards true adaptability and general intelligence.

The sheer diversity of human games presents a practically infinite landscape for artificial intelligence development. Unlike constrained environments with predefined rules, this ‘multiverse of games’ encompasses strategic depth, creative problem-solving, and nuanced social interaction – all hallmarks of general intelligence. Each game, from complex simulations to simple puzzles, offers unique challenges demanding adaptable algorithms and robust learning capabilities. By navigating this expansive space, AI systems aren’t simply mastering individual games, but rather acquiring a transferable skillset – an ability to learn how to learn – essential for tackling unforeseen problems and exhibiting genuine cognitive flexibility. This approach moves beyond narrow AI, designed for specific tasks, toward systems capable of broad, generalized performance, mirroring the adaptability inherent in human intelligence.

AI GAMESTORE represents a novel approach to achieving artificial general intelligence by harnessing the vast and varied landscape of human games. This system doesn’t rely on mastering a single, predefined task, but instead explores a continuously expanding ‘multiverse’ of games, each presenting unique challenges and requiring adaptable problem-solving skills. By iteratively engaging with increasingly complex games, the AI develops a broader, more generalized skillset – effectively learning how to learn. This scalable framework moves beyond narrow AI, designed for specific applications, and instead fosters a flexible intelligence capable of tackling unforeseen problems, mirroring the adaptability characteristic of human cognition. The continuous cycle of gameplay, evaluation, and adaptation within AI GAMESTORE provides a pathway towards systems that aren’t simply proficient at games, but possess the underlying intelligence to excel in a multitude of real-world scenarios.

The architecture of AI GAMESTORE is designed not for peak performance on any single challenge, but for perpetual refinement. This scalable framework allows artificial intelligence to continuously learn and adapt by encountering progressively more complex scenarios within a vast and varied game environment. Initial evaluations reveal a considerable disparity between current AI capabilities and human performance, not as a limitation, but as a crucial diagnostic – highlighting areas ripe for targeted improvement. The observed gap isn’t a ceiling, but a springboard, demonstrating the system’s capacity to identify and address weaknesses, iteratively closing the performance divide and ultimately unlocking the potential for increasingly sophisticated and generally intelligent AI systems.

The pursuit of Artificial General Intelligence, as outlined in this work, resembles a complex garden rather than an engineered structure. The AI GameStore, with its ‘multiverse of human games’, doesn’t offer a final solution, but rather a perpetually expanding ecosystem for observation. It reveals not just what current models can do, but the vast gulf separating them from human adaptability. As Carl Friedrich Gauss observed, “It is not enough to know little, one must also know what is unnecessary.” The GameStore forces a reckoning with the unnecessary complexities AI systems still require, highlighting the distance remaining before achieving truly general intelligence. The benchmark’s value lies not in a single score, but in charting the trajectory of this growth, acknowledging that every step toward complexity introduces new avenues for systemic failure.

What Lies Ahead?

The AI GameStore does not offer a destination, but a cartography of the impossible. It charts not what artificial intelligence can do, but the vast, echoing spaces where human cognition still holds dominion. The observed disparities between model and player performance are not failures of engineering, but symptoms of a deeper truth: intelligence is not a function to be optimized, but a complex, self-modifying system grown from the substrate of experience.

Long stability in benchmark scores is, invariably, the sign of a hidden disaster. The GameStore, with its open-ended nature, exposes the brittleness inherent in current architectures. Future work will not center on achieving higher scores within defined rulesets, but on building systems capable of gracefully degrading under novel conditions – systems that learn to lose, and learn from loss.

The multiverse of human games is not merely a testbed, but a mirror. It reflects the inherent messiness of intelligence, the constant negotiation between strategy and improvisation, and the fundamentally unpredictable nature of play. The true measure of progress will not be imitation, but the emergence of something genuinely new – a form of intelligence that does not strive to be human, but simply is.


Original article: https://arxiv.org/pdf/2602.17594.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
