Robots at the Table: Building Reliable Game Players

Author: Denis Avetisyan


Researchers are tackling the challenge of creating robotic systems that can consistently and reliably play complex tabletop games alongside humans.

The system demonstrates that even minor perceptual, execution, or interaction errors within a prolonged, interactive task can accumulate through maintained internal states, ultimately corrupting subsequent reasoning and actions, a process vividly illustrated by propagation paths leading to immediate instability.

This review examines system design principles for maintaining internal state consistency, error recovery, and effective partitioning of foundation models in long-horizon robotic tabletop gameplay.

Maintaining reliable robotic interaction over extended periods remains a significant challenge, particularly in complex, stateful tasks. This is addressed in ‘System Design for Maintaining Internal State Consistency in Long-Horizon Robotic Tabletop Games’, which investigates a system architecture for enabling robots to play tabletop games with humans while preserving a consistent internal representation of the game state. The paper demonstrates that explicit partitioning of semantic reasoning from time-critical control, coupled with verified action primitives and interaction-level monitoring, is crucial for sustaining executable consistency throughout long-horizon gameplay. How can these system-level design principles be generalized to other long-duration, interactive robotic applications requiring robust state management?


The Inevitable Drift: Challenges in Sustained Robotic Interaction

Robotic systems aiming for sustained interaction in complex environments, such as during a tabletop game, face significant hurdles beyond simple manipulation. These systems require a continuously updated and accurate internal representation of the world – a consistent ‘internal state’ – to inform their actions over extended periods. Unlike pre-programmed routines, prolonged tasks demand a robot not only perceive the current state of objects and players, but also remember previous states and predict future ones. This necessitates robust perception systems capable of handling noisy or ambiguous sensor data, and precise execution of movements to avoid compounding errors. The challenge isn’t merely completing a single action, but maintaining a coherent understanding of the game’s progression, anticipating opponent moves, and adapting to unforeseen circumstances – a feat demanding far more than isolated, successful manipulations.

Robotic systems operating in dynamic environments critically depend on a consistent internal representation of the world – an ‘internal state’ – to inform their actions and react appropriately. However, achieving this consistency is a significant challenge. Imperfect sensors, susceptible to noise and limitations in their ability to accurately perceive the environment, introduce errors into this internal model. Furthermore, even with perfect perception, the execution of robotic actions isn’t flawless; actuators may not move precisely as commanded, or external disturbances can deflect intended trajectories. These seemingly small discrepancies, originating from both sensing and execution, accumulate over time, progressively corrupting the internal state and ultimately jeopardizing the robot’s ability to perform complex, sustained interactions. The fidelity of this internal state, therefore, dictates the robustness and reliability of any robotic system engaged in prolonged tasks.

The fragility of robotic interaction stems from the rapid accumulation of errors across three key areas. Perceptual inconsistency arises when a robot’s understanding of the environment deviates from reality due to sensor noise or limitations. This flawed understanding then fuels execution inconsistency, where the robot’s physical actions don’t precisely match its intended movements. Critically, these discrepancies aren’t isolated; they compound during ongoing interaction. A slight misperception of a game piece’s location, followed by a marginally inaccurate grasp, can create a cascade of errors. This leads to interaction inconsistency – unpredictable or inappropriate responses that disrupt the flow of the task. Without robust error correction, these inconsistencies quickly overwhelm the system, causing it to fail at even seemingly simple, sustained interactions – highlighting a significant hurdle in the development of truly adaptive robotic companions.
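To make the compounding concrete, here is a minimal open-loop sketch (with illustrative noise magnitudes, not values from the paper) showing how unmodeled execution noise drives the internal state away from reality when perception never re-grounds it:

```python
import random

def simulate_drift(steps=200, actuation_sigma=0.02, seed=0):
    """Open-loop drift: the internal model assumes perfect execution.

    Noise magnitudes are illustrative, not measured values from the paper.
    """
    rng = random.Random(seed)
    true_pos = believed_pos = 0.0
    drift = []
    for _ in range(steps):
        command = 0.1                                        # intended move
        true_pos += command + rng.gauss(0, actuation_sigma)  # noisy execution
        believed_pos += command                              # internal state assumes the ideal move
        drift.append(abs(true_pos - believed_pos))
    return drift

d = simulate_drift()
print(f"drift after 1 step: {d[0]:.4f}, after 200 steps: {d[-1]:.4f}")
```

Because each step's error is never observed and corrected, the drift behaves like a random walk; re-grounding the internal state through perception at each turn is what keeps it bounded.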

The system was successfully deployed and tested in diverse environments, ranging from controlled laboratory settings with multi-player turn-based gameplay to public exhibitions, demonstrating its functionality with top-down table views and close-up manipulation of tiles during discarding and claiming.

A Vision-Language Model: Anchoring Reason Within the System

A robotic tabletop game system has been developed utilizing the Qwen2.5-VL-7B Vision-Language Model (VLM) as its core reasoning component. This system integrates visual input processing with language understanding to enable autonomous gameplay. The Qwen2.5-VL-7B model processes images of the game environment and interprets the current state, subsequently leveraging its language capabilities to determine appropriate actions based on pre-defined game rules. This architecture establishes the VLM not merely as a perceptual component, but as the central decision-making unit for the robot within the game context.

The robot’s capacity for complex task execution is facilitated by the Vision-Language Model (VLM), which processes visual input from the environment to establish the current game state. This involves identifying objects, their positions, and their relationships. Simultaneously, the VLM accesses and applies the defined game rules to this visual data. Based on both the perceived state and the rule set, the VLM then generates a plan of action, outlining the necessary steps to achieve the desired game outcome. This planning process is dynamic, allowing the robot to adjust its strategy based on changes in the visual input and the unfolding game situation.
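A rough sketch of this perceive-reason-act loop follows; it assumes a chat-style model client and a hypothetical symbolic state, neither of which reflects the paper's actual interfaces:

```python
from dataclasses import dataclass

@dataclass
class GameState:
    # Symbolic state distilled from vision; fields are illustrative.
    hand: list
    discards: list
    current_player: str

def plan_action(state: GameState, rules: str, query_vlm) -> str:
    """Build a prompt from the rule text plus the perceived state and ask the model.

    `query_vlm` is a stand-in for any text/vision model client; the prompt
    layout here is an assumption, not the paper's actual interface.
    """
    prompt = (
        f"Rules:\n{rules}\n\n"
        f"State: hand={state.hand}, discards={state.discards}, "
        f"to move: {state.current_player}\n"
        "Reply with exactly one legal action."
    )
    return query_vlm(prompt)

# Stubbed model for demonstration: always discards the first tile in hand.
def stub_vlm(prompt: str) -> str:
    return "discard " + prompt.split("hand=['")[1].split("'")[0]

state = GameState(hand=["3-bamboo", "9-dot"], discards=[], current_player="robot")
print(plan_action(state, "Mahjong-like tile game.", stub_vlm))
```

The point of the sketch is the data flow: perception produces a symbolic state, the rules travel in-context, and the model's reply is constrained to a single executable action.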

Traditional robotic control often relies on pre-programmed responses to specific stimuli, constituting reactive control. In contrast, employing a Vision-Language Model (VLM) facilitates a shift towards cognitive reasoning. This allows the robotic system to not only perceive its environment through visual input, but also to interpret that input in the context of defined rules and objectives. Consequently, the system can formulate plans involving multiple sequential actions, anticipate potential outcomes, and adjust its behavior dynamically when faced with unforeseen circumstances or changes in the game state – representing a capacity for strategic decision-making and adaptation beyond simple stimulus-response behaviors.

The training pipeline progressively enhances strategic reasoning by first distilling an RL policy into a VLM using LLM-generated traces, then refining decisions with single-step RL (GRPO), and finally surpassing the original policy through self-play with DPO.

Distilling Intelligence: Refining the VLM Through Iterative Training

The initial stage of the VLM training pipeline utilizes Supervised Fine-Tuning (SFT) to establish a performance baseline. This process involves training the VLM on a dataset of expert gameplay, allowing it to learn an initial policy for strategic decision-making. Evaluation against a teacher policy – a pre-trained, high-performing agent – demonstrates an SFT-achieved win rate of 28%. This metric serves as the foundational performance level against which subsequent training stages, employing reinforcement learning techniques, are compared and evaluated for improvement.

Following Supervised Fine-Tuning, the VLM undergoes refinement via Single-Step Reinforcement Learning using the Group Relative Policy Optimization (GRPO) algorithm. GRPO optimizes the VLM’s action selection by sampling a group of rollouts from the current policy and reinforcing the actions whose rewards exceed their group’s average. This process moves the VLM beyond simply imitating the teacher policy and allows it to explore and exploit the game environment more effectively. The implementation of GRPO resulted in a measurable improvement in performance, increasing the VLM’s win rate from 28% to 44% when competing against the established teacher policy.
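In the standard GRPO formulation, each rollout is scored relative to its own group's mean and spread, which avoids training a separate value baseline. A minimal sketch with illustrative rewards:

```python
def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages for one group of rollouts.

    Each sampled action is scored against the mean and standard deviation
    of its own group; reward values below are illustrative.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts of the same game position with different sampled moves:
# winning moves get reward 1.0, losing moves 0.0.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 2) for a in advs])
```

Moves that beat the group average receive positive advantages and are reinforced; the others are suppressed, which is what pushes the policy past pure imitation.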

Direct Preference Optimization (DPO) and self-play training were implemented as the final stages of the VLM training pipeline to surpass the performance limits of the initial supervised and reinforcement learning phases. Self-play had the VLM compete against itself to generate diverse game scenarios, and DPO used the resulting preference data to directly optimize the VLM’s policy. This combination enabled the VLM to develop strategies not present in the initial training data, resulting in a final win rate of 48% against the teacher policy, a four-percentage-point improvement over the single-step reinforcement learning stage.
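The DPO objective itself is compact enough to sketch for a single preference pair; the log-probabilities below are illustrative stand-ins for sequence scores from the policy and a frozen reference model:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair.

    logp_* are sequence log-probabilities of the preferred (w) and
    dispreferred (l) actions under the policy and the frozen reference.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# If the policy already favors the preferred move more strongly than the
# reference does, the margin is positive and the loss drops below log(2).
print(dpo_loss(-1.0, -3.0, -1.5, -2.5, beta=0.1) < math.log(2))
```

Minimizing this loss widens the policy's preference margin over the reference model, without ever fitting an explicit reward model.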

The VLM’s perceptual capabilities are enabled by a suite of specialized modules: YOLO for real-time object detection; SAM (Segment Anything Model) for precise image segmentation, identifying and isolating game elements; and FoundationPose for accurate pose estimation of both agents and objects within the game environment. These modules work in concert to provide the VLM with a detailed and efficient understanding of the game state, facilitating informed decision-making. The integration of these technologies allows the VLM to reliably identify key game elements, track their positions, and understand their configurations, all critical for strategic gameplay.
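The data flow through these stages can be sketched with stand-in functions for the real YOLO, SAM, and FoundationPose calls; only the chaining, not the models' actual APIs, is shown:

```python
from dataclasses import dataclass

@dataclass
class TilePercept:
    label: str      # class from the detector (YOLO stage)
    mask_area: int  # pixels from the segmenter (SAM stage)
    pose: tuple     # (x, y, z, yaw) from pose estimation (FoundationPose stage)

def perceive_tiles(image, detect, segment, estimate_pose):
    """Chain the three perception stages into one symbolic summary.

    `detect`, `segment`, and `estimate_pose` are stand-ins for the real
    model calls; only the data flow between stages is sketched.
    """
    percepts = []
    for box, label in detect(image):       # 1. find candidate tiles
        mask = segment(image, box)         # 2. isolate each tile's pixels
        pose = estimate_pose(image, mask)  # 3. recover the tile's pose
        percepts.append(TilePercept(label, sum(mask), pose))
    return percepts

# Stub stages for demonstration.
detect = lambda img: [((0, 0, 8, 8), "3-bamboo")]
segment = lambda img, box: [1] * 64
estimate_pose = lambda img, mask: (0.1, 0.2, 0.0, 90.0)
print(perceive_tiles(None, detect, segment, estimate_pose))
```

Keeping each stage behind a narrow interface like this is also what makes individual modules swappable as better detectors or pose estimators appear.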

Anchoring Stability: Robustness Through Action and Perception

The robotic tabletop game system achieves reliable performance through the implementation of state-conditioned action primitives. These primitives are not simply pre-programmed movements; instead, each action is intrinsically linked to a set of verifiable preconditions within the game environment. Before executing any maneuver – such as moving a game piece or drawing a card – the system meticulously checks if these conditions are met. This approach avoids actions being initiated in unsuitable circumstances, drastically reducing the potential for errors and preventing collisions or invalid game states. By ensuring that every action is predicated on a defined and confirmed state, the system operates with a high degree of predictability and robustness, forming a foundational element of its autonomous gameplay capability.
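One way to realize state-conditioned primitives is a decorator that runs verifiable precondition checks before the action executes; this is a sketch of the pattern, not the paper's implementation:

```python
class PreconditionError(RuntimeError):
    """Raised when an action's state checks fail before execution."""

def primitive(preconditions):
    """Gate an action on named, verifiable state checks (illustrative pattern)."""
    def wrap(action):
        def run(state, *args):
            for name, check in preconditions.items():
                if not check(state):
                    raise PreconditionError(f"{action.__name__}: {name} failed")
            return action(state, *args)
        return run
    return wrap

@primitive({
    "gripper empty": lambda s: s["holding"] is None,
    "tile on table": lambda s: len(s["tiles"]) > 0,
})
def pick_tile(state, tile):
    """Move a tile from the table into the gripper."""
    state["tiles"].remove(tile)
    state["holding"] = tile
    return f"picked {tile}"

state = {"holding": None, "tiles": ["3-bamboo"]}
print(pick_tile(state, "3-bamboo"))  # preconditions hold, so the action runs
```

A second `pick_tile` call with the gripper still occupied would raise `PreconditionError` instead of executing, which is exactly the invalid-state class of error the primitives are designed to prevent.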

The robotic system’s performance hinges on the integration of tactile sensing, which provides critical real-time feedback during object manipulation. This sensing capability doesn’t merely confirm a successful grasp; it actively verifies the stability and security of the hold, allowing for immediate adjustments before, during, or after contact. The system achieves a 99.8% grasp success rate, yet crucially it is not simply successful but resilient: when a grasp does falter, the tactile sensors detect the failure and trigger automated recovery procedures, a swift re-attempt or an adjustment of grip, without requiring external intervention. This proactive approach, driven by continuous sensory input, ensures consistent performance and minimizes disruptions during gameplay, making the robot remarkably robust in handling physical interactions.
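The verify-and-retry behavior can be sketched as a small control loop, with stand-ins for the real grasp and tactile hardware calls:

```python
def grasp_with_recovery(try_grasp, verify_tactile, max_attempts=3):
    """Attempt a grasp, verify it by touch, and retry on failure.

    `try_grasp` and `verify_tactile` are stand-ins for the real hardware
    calls; only the recovery logic described above is sketched here.
    """
    for attempt in range(1, max_attempts + 1):
        try_grasp()
        if verify_tactile():  # tactile check: is the hold actually secure?
            return attempt    # success on this attempt
    raise RuntimeError("grasp failed after retries; escalate for assistance")

# Simulated hardware: the first attempt slips, the second holds.
outcomes = iter([False, True])
n = grasp_with_recovery(lambda: None, lambda: next(outcomes))
print(f"succeeded on attempt {n}")
```

The escalation path after exhausted retries matters as much as the retries themselves: it bounds how long a silent failure can corrupt downstream state.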

The robotic system proactively addresses unforeseen circumstances through integrated failure detection and recovery protocols. These mechanisms move beyond simple error flagging; they actively diagnose the cause of disruptions, such as a slipped grasp or an obstructed movement, and initiate corrective actions. Recovery isn’t solely reactive; the system anticipates potential failures by monitoring action execution and sensor data for anomalies. Should a disruption occur, pre-defined recovery strategies – including re-grasping, re-planning, or requesting assistance – are automatically deployed to restore consistent operation. This robust approach not only minimizes downtime but also ensures the system can reliably continue gameplay even when faced with unexpected physical interactions or environmental changes, contributing significantly to the overall autonomous performance.

The robotic system demonstrably achieved a high level of autonomous function, successfully completing tabletop games in 89.3% of trials without any human assistance. This performance signifies a substantial advancement in robotic gameplay, moving beyond pre-programmed sequences to sustained, adaptable behavior. The completion rate wasn’t simply a matter of repeating a successful script; the robot navigated the complexities of the game, including object manipulation, rule adherence, and dynamic responses to its environment, all while maintaining consistent operation over extended periods. This level of autonomous completion validates the integration of perception, action, and recovery mechanisms, highlighting the potential for robots to engage in complex, interactive tasks with minimal external guidance.

The robotic system incorporates a sophisticated monitoring process to maintain game integrity even with human interaction. This ‘interaction-level monitoring’ doesn’t simply register external actions; it actively validates them against the established game rules and expected state. When a human attempts an unauthorized move – such as placing a piece in an invalid location or making an illegal play – the system immediately detects the inconsistency. Rather than halting or erroring, it intelligently addresses the disruption, often by subtly correcting the human’s input or reverting to a consistent game state. This proactive approach ensures the game remains logically sound and playable, effectively blending autonomous operation with limited, yet accommodated, human participation.
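A sketch of such a monitor follows, using a hypothetical move format and rule set; an illegal move leaves the last consistent state untouched and is flagged for correction:

```python
def monitor_human_move(move, legal_moves, last_consistent_state, apply_move):
    """Validate an observed human action before committing it to state.

    Illegal moves leave the maintained state untouched and are flagged so
    the system can prompt a correction or revert the board.
    """
    if move in legal_moves:
        return apply_move(last_consistent_state, move), None
    return last_consistent_state, f"illegal move: {move}"

# Hypothetical move format: "discard <tile>".
apply_move = lambda s, m: {"discards": s["discards"] + [m.split()[-1]]}
legal = {"discard 9-dot", "discard 3-bamboo"}
state = {"discards": []}

print(monitor_human_move("discard 9-dot", legal, state, apply_move))
print(monitor_human_move("place 5-dot offboard", legal, state, apply_move))
```

Validating before committing is the key design choice: the maintained state only ever advances through moves the rules sanction, so human errors cannot silently corrupt it.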

The pursuit of reliable long-horizon gameplay, as detailed in this study, mirrors the inevitable entropy inherent in all complex systems. The architecture proposed, partitioning foundational models from time-critical functions, represents a pragmatic approach to graceful degradation. As Alan Turing observed, “There is no longer any need to ask what a machine can do; we should be asking what it can be made to do.” This sentiment directly applies to the challenges of maintaining state consistency in robotic tabletop games. The work doesn’t attempt to create a perfect, error-free system, but rather one capable of robustly adapting to imperfections over extended periods: a testament to understanding that all architectures live a life, and we are just witnesses to its unfolding.

The Long Game

The pursuit of reliable robotic companions for extended interaction, as demonstrated by this work, inevitably highlights the brittleness inherent in all complex systems. Every failure is a signal from time; not necessarily a flaw in logic, but an acknowledgment of entropy. Partitioning foundational models offers a temporary reprieve, a means of shielding critical functions from the inevitable decay of generalized knowledge, but this is merely a deferral, not a solution. The true challenge lies not in building systems that avoid errors, but in designing those that gracefully accommodate them.

Future work must move beyond error recovery as a reactive measure and embrace it as a fundamental design principle. The system’s capacity to interpret, and even anticipate, its own limitations will prove more valuable than sheer computational power. Refactoring is a dialogue with the past; each iteration should not simply correct mistakes, but illuminate the system’s evolving understanding of its own operational boundaries.

Ultimately, the longevity of these systems hinges on acknowledging the temporal nature of information. The internal state, however meticulously maintained, is a snapshot, a fleeting approximation of a dynamic reality. The long game is not about achieving perfect consistency, but about cultivating a resilient adaptability – a capacity to learn from the past, navigate the present, and prepare, however imperfectly, for the future.


Original article: https://arxiv.org/pdf/2603.25405.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/


2026-03-28 18:40