Author: Denis Avetisyan
Researchers have developed a new framework enabling humanoid robots to reliably follow complex language instructions for tasks involving both movement and object manipulation.

This work introduces a physically grounded agentic framework for robust, long-horizon humanoid locomotion-manipulation using 3D perception, task planning, and geometric verification.
Achieving robust, long-horizon task execution remains a key challenge for humanoid robots operating in complex, real-world environments. This paper introduces ‘Cybo-Waiter: A Physical Agentic Framework for Humanoid Whole-Body Locomotion-Manipulation’, a novel framework designed to bridge the gap between natural language instructions and reliable physical action. By integrating multi-object 3D geometric supervision with a verifiable task program, we demonstrate improved robustness in coordinating locomotion and manipulation. Could this approach pave the way for humanoids capable of truly adaptive and dependable service in dynamic human environments?
Decoding the Physical World: The Challenge of Embodied Intelligence
The pursuit of robots capable of truly seamless interaction within human environments represents a significant frontier in artificial intelligence and robotics. Achieving this requires more than simply programming a sequence of actions; it demands robustness – the ability to consistently perform tasks despite unexpected disturbances or variations – and reliability, ensuring consistent and predictable outcomes. Current robotic systems often falter when faced with the inherent messiness of the real world – cluttered spaces, imprecise objects, or ambiguous instructions. Consequently, considerable research focuses on developing systems that can not only perceive their surroundings with greater accuracy, but also adapt to changing conditions and recover gracefully from errors, ultimately enabling robots to function effectively as collaborators and assistants in everyday life.
Much of this fragility stems from difficulties in interpreting human communication and adapting to unforeseen circumstances. Unlike the controlled environments of factories, the real world presents a constant stream of ambiguous requests – “fetch the blue mug” begs the question of which blue mug, or where to find it – and unexpected obstacles. Systems that rely on precisely defined parameters struggle with the nuance of natural language, which is filled with context, implication, and potential misinterpretation. Furthermore, a dropped object, a shifted chair, or even a change in lighting can disrupt a robot’s pre-programmed path, highlighting a critical gap between theoretical capabilities and practical, reliable performance in dynamic, unpredictable spaces.
Effective robotic navigation hinges on translating abstract, human-provided directives – such as “fetch the blue mug” – into a sequence of precise motor commands. Current systems often falter because this crucial link between cognition and action is weak or nonexistent; robots may comprehend the goal but struggle to determine how to achieve it within a dynamic environment. This disconnect necessitates the development of architectures that allow for hierarchical planning, where high-level objectives are decomposed into manageable, executable steps, and those steps are continuously refined based on real-time sensory feedback. Bridging this gap requires advances in areas like reinforcement learning, motion planning algorithms, and the integration of predictive models that anticipate the consequences of actions, ultimately enabling robots to navigate and interact with the world in a more fluid and reliable manner.

Architecting Reliability: Verifiable Task Programs
Verifiable task programs are constructed by breaking down high-level goals into a sequence of discrete, manageable subtasks. Each subtask is defined not only by the actions required for its execution, but also by explicitly stated preconditions – the conditions that must be met before the subtask can begin – and success conditions, which are measurable criteria used to determine if the subtask has been completed successfully. This structured approach facilitates rigorous verification at each stage, enabling automated assessment of program correctness and improving overall system reliability. The explicit definition of preconditions and success conditions allows for the creation of robust error handling mechanisms and facilitates debugging by isolating failures to specific subtasks within the program structure.
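The structure described above can be sketched in code. The following is a minimal, illustrative Python sketch (the subtask names, predicate sets, and runner are assumptions, not the paper’s actual implementation): each subtask declares its preconditions and the conditions that define success, and a runner verifies both at every stage, isolating failures to a specific subtask.

```python
from dataclasses import dataclass, field
from typing import List, Set

# World state is modeled as a simple set of predicate strings for this
# sketch, e.g. {"gripper_empty", "mug_visible"}. Illustrative only.
State = Set[str]

@dataclass
class Subtask:
    name: str
    preconditions: State                      # must hold before execution
    success_conditions: State                 # must hold after execution
    effects_removed: State = field(default_factory=set)

def execute(program: List[Subtask], state: State) -> State:
    """Run subtasks in order, verifying preconditions and success conditions."""
    for task in program:
        missing = task.preconditions - state
        if missing:
            raise RuntimeError(f"{task.name}: preconditions unmet: {missing}")
        # Simulated execution: apply the declared effects to the state.
        state = (state - task.effects_removed) | task.success_conditions
        if not task.success_conditions <= state:
            raise RuntimeError(f"{task.name}: success conditions unmet")
    return state

program = [
    Subtask("approach_mug", {"mug_visible"}, {"near_mug"}),
    Subtask("grasp_mug", {"near_mug", "gripper_empty"},
            {"holding_mug"}, {"gripper_empty"}),
]
final = execute(program, {"mug_visible", "gripper_empty"})
# "holding_mug" is now in `final`; a missing precondition would have
# raised at the offending subtask rather than failing silently later.
```

Because every subtask carries its own checkable contract, an error surfaces at the exact step where the world diverged from the plan, which is what makes the program “verifiable” in the sense described above.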
Vision-Language Models (VLMs) are utilized to interpret natural language instructions for robotic tasks due to their capacity for semantic understanding of unstructured input. However, direct application of VLM outputs to robot control is insufficient; a critical integration layer is required to translate the VLM’s high-level understanding into actionable low-level motor commands. This integration must address discrepancies in representation – VLMs operate on perceptual data and semantic concepts, while robotic control requires precise kinematic and dynamic parameters. Furthermore, robust error handling and state estimation are essential to account for perceptual inaccuracies and uncertainties inherent in real-world environments, ensuring safe and reliable task execution.
Hierarchical Agent Stacks enhance operational efficiency by structuring task execution through learned intermediate layers. These stacks decompose complex tasks into a series of sub-tasks, allowing the agent to learn reusable skill modules at each layer. This modularity facilitates rapid adaptation to novel situations; instead of relearning from scratch, the agent can combine existing skills in new ways. Furthermore, learned intermediate layers enable faster response times, as the agent doesn’t need to process the entire task from initial input; it can leverage pre-computed representations and solutions stored within the hierarchy. This layered approach reduces computational load and improves the speed of task completion compared to monolithic systems.
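The layered decomposition can be illustrated with a small sketch. All skill names here are hypothetical, assumed for illustration: a registry maps each high-level skill to sub-skills, and a recursive expansion yields the primitive commands, so a new task can reuse existing modules instead of being learned from scratch.

```python
from typing import Dict, List

# Skill registry: each high-level skill expands into lower-level skills;
# names absent from the registry are treated as primitives. Illustrative.
SKILLS: Dict[str, List[str]] = {
    "serve_drink": ["navigate_to_table", "place_cup"],
    "navigate_to_table": ["plan_path", "walk"],
    "place_cup": ["lower_arm", "open_gripper"],
}

def expand(skill: str) -> List[str]:
    """Recursively expand a skill into its primitive commands (leaves)."""
    if skill not in SKILLS:          # primitive: executed directly
        return [skill]
    out: List[str] = []
    for sub in SKILLS[skill]:
        out.extend(expand(sub))
    return out

print(expand("serve_drink"))
# ['plan_path', 'walk', 'lower_arm', 'open_gripper']
```

Adding a new top-level behavior then amounts to composing existing entries in the registry, which is the reuse-and-rapid-adaptation property the paragraph describes.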

Seeing the World Correctly: Geometry-Grounded Supervision
The Geometry-Grounded Supervisor constructs a real-time environmental understanding by leveraging multi-object 3D grounding techniques. This process integrates data from RGB-D sensors, providing both color and depth information, with segmentation masks generated by the SAM3 model. SAM3 identifies and delineates individual objects within the scene, and the subsequent 3D grounding step localizes these segmented objects in a consistent, world-coordinate frame. This allows the supervisor to maintain an up-to-date, geometrically accurate representation of the robot’s surroundings, critical for monitoring task execution and identifying discrepancies between the planned and actual states of the environment.
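The core geometric step, lifting a 2D segmentation mask into 3D using aligned depth, reduces to pinhole back-projection. Below is a minimal sketch under assumed inputs (the mask would come from a model like SAM3, and the intrinsics from the RGB-D camera; the function names are illustrative):

```python
from typing import List, Tuple

def backproject(u: int, v: int, z: float,
                fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float, float]:
    """Pinhole model: pixel (u, v) at depth z -> 3D point in the camera frame."""
    return ((u - cx) * z / fx, (v - cy) * z / fy, z)

def mask_centroid_3d(mask: List[List[int]],
                     depth: List[List[float]],
                     intrinsics: Tuple[float, float, float, float]) -> Tuple[float, ...]:
    """Average the 3D points of all valid mask pixels to localize the object.

    `mask` and `depth` are equally sized 2D arrays; `intrinsics` is
    (fx, fy, cx, cy). Pixels with zero depth are skipped as invalid.
    """
    fx, fy, cx, cy = intrinsics
    pts = [backproject(u, v, depth[v][u], fx, fy, cx, cy)
           for v, row in enumerate(mask)
           for u, inside in enumerate(row)
           if inside and depth[v][u] > 0]
    n = len(pts)
    return tuple(sum(p[i] for p in pts) / n for i in range(3))
```

A further rigid transform from camera to world coordinates (from the robot’s odometry or calibration) would place these centroids in the consistent world frame the supervisor maintains.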
The Geometry-Grounded Supervisor employs logic-based plan checking to identify discrepancies between the robot’s intended actions and the observed environment. This process leverages predicate assertions – statements defining relationships between objects and their properties – and scene-graph representations, which encode objects and their spatial relationships. By continuously evaluating these assertions against incoming RGB-D and SAM3 segmentation data, the system can detect deviations from the planned trajectory. For example, a predicate might assert “object X is on surface Y,” and a deviation would be flagged if the sensor data indicates object X is no longer on surface Y. This allows the supervisor to recognize unexpected states and initiate recovery procedures.
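The plan-checking logic can be sketched compactly: represent observed relations as scene-graph edges and the plan’s expectations as predicate assertions, then report every assertion the current scene fails to satisfy. The specific predicates and objects below are illustrative, not taken from the paper.

```python
from typing import Dict, List, Tuple

# A predicate is an (object, relation, object) triple. The scene graph
# records which relations perception currently observes. Illustrative data.
Predicate = Tuple[str, str, str]

scene_graph: Dict[Predicate, bool] = {
    ("mug", "on", "table"): True,
    ("robot", "near", "table"): True,
}

plan_assertions: List[Predicate] = [
    ("mug", "on", "table"),
    ("mug", "in", "gripper"),   # expected by the plan, not yet observed
]

def check_plan(assertions: List[Predicate],
               graph: Dict[Predicate, bool]) -> List[Predicate]:
    """Return the plan assertions the observed scene does not satisfy."""
    return [a for a in assertions if not graph.get(a, False)]

violations = check_plan(plan_assertions, scene_graph)
# violations == [("mug", "in", "gripper")] -> trigger a recovery procedure
```

Because the check is a pure function of the latest scene graph, it can run on every perception update, flagging a deviation (say, a mug knocked off the table) as soon as the corresponding predicate stops holding.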
Evaluation of the Geometry-Grounded Supervisor demonstrates consistent performance improvements across multiple tasks. On the Tidy-desk task, the success rate increased from 5/10 to 7/10 with the Supervisor implemented. The Tabletop-sorting task likewise reached an 8/10 success rate, up from its baseline without the Supervisor. The Bring-me-a-drink task showed the largest gain, reaching a 9/10 success rate over its unsupervised baseline.
Closing the Loop: Towards Autonomous Resilience
Conventional robotic task planning often outlines a desired sequence of actions without fully accounting for the physical realities of execution. This approach can lead to robots attempting maneuvers beyond their capabilities, resulting in failure or even damage. Feasibility-aware skill selection addresses this limitation by integrating a continuous assessment of the robot’s kinematic and dynamic limitations directly into the planning process. Before committing to an action, the system verifies whether it is physically possible given the robot’s current state, joint limits, and actuator capabilities. This proactive constraint satisfaction not only prevents impossible actions but also allows the robot to intelligently select alternative, feasible skills that achieve the desired outcome, dramatically improving robustness and enabling operation in complex, unstructured environments. By prioritizing physically realizable movements, the system ensures smoother, more reliable task completion and minimizes the risk of unexpected errors.
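The feasibility gate described above can be sketched as a pre-commit check: before a skill is executed, its target joint configuration is tested against the robot’s limits, and infeasible candidates are skipped in favor of the next viable alternative. The joint names, limits, and skill candidates below are assumed for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Per-joint position limits in radians. Illustrative values only; a real
# system would also check velocity, torque, and self-collision constraints.
JOINT_LIMITS: Dict[str, Tuple[float, float]] = {
    "shoulder": (-1.5, 1.5),
    "elbow": (0.0, 2.5),
}

def feasible(config: Dict[str, float]) -> bool:
    """True if every joint target lies within its position limits."""
    return all(JOINT_LIMITS[j][0] <= q <= JOINT_LIMITS[j][1]
               for j, q in config.items())

def select_skill(candidates: List[Tuple[str, Dict[str, float]]]) -> Optional[str]:
    """Pick the first candidate skill whose target configuration is reachable."""
    for name, config in candidates:
        if feasible(config):
            return name
    return None                      # no feasible option: replan or ask for help

skill = select_skill([
    ("overhead_grasp", {"shoulder": 2.0, "elbow": 1.0}),  # exceeds shoulder limit
    ("side_grasp", {"shoulder": 1.0, "elbow": 1.2}),
])
# skill == "side_grasp": the infeasible overhead grasp is never attempted
```

Rejecting the infeasible candidate before execution, rather than discovering the limit violation mid-motion, is exactly the proactive constraint satisfaction the paragraph describes.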
The system facilitates a cycle of continuous improvement through code-generated feedback loops and language-based self-feedback mechanisms. Following an attempted task, the robot doesn’t simply succeed or fail; it analyzes its performance, identifying discrepancies between its intended actions and the observed outcome. This analysis isn’t reliant on external correction; instead, the robot generates code that evaluates its own actions, pinpointing errors and suggesting adjustments to its operational parameters. Crucially, this self-assessment is framed using natural language processing, allowing the robot to articulate – internally – what went wrong and how to improve, creating a readily accessible record of its learning process. This allows for iterative refinement of skills, meaning the robot doesn’t just avoid repeating mistakes, but actively enhances its ability to perform tasks with increasing precision and efficiency over time, fostering a level of autonomous learning previously unattainable.
The architecture achieves a significant leap in robotic resilience through the synergistic integration of three core capabilities. Robust monitoring systems provide a constant assessment of the robot’s state and its environment, immediately identifying deviations from expected behavior. This data then fuels proactive recovery mechanisms, allowing the robot to anticipate and mitigate potential failures before they escalate – for example, by adjusting its approach to a task or requesting assistance. Critically, this isn’t a static response; continuous learning algorithms analyze each interaction – both successful and unsuccessful – to refine the robot’s skills and improve its future performance. The result is a system capable of not only persisting through unexpected challenges, but of becoming increasingly adept at navigating complex and dynamic environments, ultimately demonstrating a level of reliability and adaptability previously unattainable in robotic systems.
The pursuit of robust humanoid locomotion-manipulation, as detailed in this framework, inherently demands a willingness to challenge established boundaries. It’s a process of dissecting complex systems (language instructions, 3D perception, task planning) to reveal their underlying mechanisms and limitations. As Donald Davies observed, “A bug is the system confessing its design sins.” This rings particularly true in robotics; each failure, whether a dropped object or a misstep, exposes a flaw in the program’s logic or the robot’s physical capabilities. By embracing these ‘sins’ as opportunities for refinement, researchers can push the boundaries of what’s possible, achieving reliable long-horizon operation through iterative testing and geometric verification.
Beyond the Tray: Charting Future Courses
The presented framework, while demonstrating a capacity for directed, whole-body manipulation, inherently exposes the brittle core of all such systems: the assumption of a static, knowable world. The robot ‘understands’ objects through perception, but that perception is, fundamentally, a reconstruction. The next iteration must actively challenge that reconstruction, embracing uncertainty not as error, but as a source of information. What happens when the perceived geometry doesn’t align with reality? Or when an unexpected object enters the workspace – a deliberate intrusion, perhaps, designed to test the system’s limits?
The current focus on verifiable task programs, while laudable, risks becoming a new form of over-constraint. True agency isn’t about flawlessly executing pre-defined plans; it’s about intelligently deviating from them. The real test lies in building a supervisor that doesn’t simply prevent failure, but learns from it, adapting its internal models and refining its predictive capabilities. The system needs to be capable of self-modification, shifting from a rule-following automaton to something… less predictable.
Ultimately, the ‘Cybo-Waiter’ represents a stepping stone. The pursuit isn’t simply about creating a robot that can carry a tray; it’s about reverse-engineering the complex interplay between perception, action, and intention that defines intelligent behavior. The unanswered questions aren’t about improving the algorithms, but about fundamentally rethinking the very definition of control.
Original article: https://arxiv.org/pdf/2603.10675.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-03-13 03:31