Author: Denis Avetisyan
Researchers have developed a new framework enabling humanoid robots to reliably follow complex language instructions for tasks involving both movement and object manipulation.

This work introduces a physically grounded agentic framework for robust, long-horizon humanoid locomotion-manipulation using 3D perception, task planning, and geometric verification.
Achieving robust, long-horizon task execution remains a key challenge for humanoid robots operating in complex, real-world environments. This paper introduces ‘Cybo-Waiter: A Physical Agentic Framework for Humanoid Whole-Body Locomotion-Manipulation’, a novel framework designed to bridge the gap between natural language instructions and reliable physical action. By integrating multi-object 3D geometric supervision with a verifiable task program, we demonstrate improved robustness in coordinating locomotion and manipulation. Could this approach pave the way for humanoids capable of truly adaptive and dependable service in dynamic human environments?
Decoding the Physical World: The Challenge of Embodied Intelligence
The pursuit of robots capable of truly seamless interaction within human environments represents a significant frontier in artificial intelligence and robotics. Achieving this requires more than simply programming a sequence of actions; it demands robustness – the ability to consistently perform tasks despite unexpected disturbances or variations – and reliability, ensuring consistent and predictable outcomes. Current robotic systems often falter when faced with the inherent messiness of the real world – cluttered spaces, imprecise objects, or ambiguous instructions. Consequently, considerable research focuses on developing systems that can not only perceive their surroundings with greater accuracy, but also adapt to changing conditions and recover gracefully from errors, ultimately enabling robots to function effectively as collaborators and assistants in everyday life.
Much of this fragility stems from difficulties in interpreting human communication and adapting to unforeseen circumstances. Unlike the controlled environments of factories, the real world presents a constant stream of ambiguous requests – “fetch the blue mug” begs the question of which blue mug, or where to find it – and unexpected obstacles. Systems that rely on precisely defined parameters struggle with the nuance of natural language, which is filled with context, implication, and potential misinterpretation. Furthermore, a dropped object, a shifted chair, or even a change in lighting can disrupt a robot’s pre-programmed path, highlighting a critical gap between theoretical capabilities and practical, reliable performance in dynamic, unpredictable spaces.
Effective robotic navigation hinges on translating abstract, human-provided directives – such as “fetch the blue mug” – into a sequence of precise motor commands. Current systems often falter because this crucial link between cognition and action is weak or nonexistent; robots may comprehend the goal but struggle to determine how to achieve it within a dynamic environment. This disconnect necessitates the development of architectures that allow for hierarchical planning, where high-level objectives are decomposed into manageable, executable steps, and those steps are continuously refined based on real-time sensory feedback. Bridging this gap requires advances in areas like reinforcement learning, motion planning algorithms, and the integration of predictive models that anticipate the consequences of actions, ultimately enabling robots to navigate and interact with the world in a more fluid and reliable manner.

Architecting Reliability: Verifiable Task Programs
Verifiable task programs are constructed by breaking down high-level goals into a sequence of discrete, manageable subtasks. Each subtask is defined not only by the actions required for its execution, but also by explicitly stated preconditions – the conditions that must be met before the subtask can begin – and success conditions, which are measurable criteria used to determine if the subtask has been completed successfully. This structured approach facilitates rigorous verification at each stage, enabling automated assessment of program correctness and improving overall system reliability. The explicit definition of preconditions and success conditions allows for the creation of robust error handling mechanisms and facilitates debugging by isolating failures to specific subtasks within the program structure.
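The structure described above can be sketched in code. The following is a minimal, illustrative Python sketch (the subtask names, predicate sets, and runner are assumptions, not the paper’s actual implementation): each subtask declares its preconditions and the conditions that define success, and a runner verifies both at every stage, isolating failures to a specific subtask.

```python
from dataclasses import dataclass, field
from typing import List, Set

# World state is modeled as a simple set of predicate strings for this
# sketch, e.g. {"gripper_empty", "mug_visible"}. Illustrative only.
State = Set[str]

@dataclass
class Subtask:
    name: str
    preconditions: State                      # must hold before execution
    success_conditions: State                 # must hold after execution
    effects_removed: State = field(default_factory=set)

def execute(program: List[Subtask], state: State) -> State:
    """Run subtasks in order, verifying preconditions and success conditions."""
    for task in program:
        missing = task.preconditions - state
        if missing:
            raise RuntimeError(f"{task.name}: preconditions unmet: {missing}")
        # Simulated execution: apply the declared effects to the state.
        state = (state - task.effects_removed) | task.success_conditions
        if not task.success_conditions <= state:
            raise RuntimeError(f"{task.name}: success conditions unmet")
    return state

program = [
    Subtask("approach_mug", {"mug_visible"}, {"near_mug"}),
    Subtask("grasp_mug", {"near_mug", "gripper_empty"},
            {"holding_mug"}, {"gripper_empty"}),
]
final = execute(program, {"mug_visible", "gripper_empty"})
# "holding_mug" is now in `final`; a missing precondition would have
# raised at the offending subtask rather than failing silently later.
```

Because every subtask carries its own checkable contract, an error surfaces at the exact step where the world diverged from the plan, which is what makes the program “verifiable” in the sense described above.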
Vision-Language Models (VLMs) are utilized to interpret natural language instructions for robotic tasks due to their capacity for semantic understanding of unstructured input. However, direct application of VLM outputs to robot control is insufficient; a critical integration layer is required to translate the VLM’s high-level understanding into actionable low-level motor commands. This integration must address discrepancies in representation – VLMs operate on perceptual data and semantic concepts, while robotic control requires precise kinematic and dynamic parameters. Furthermore, robust error handling and state estimation are essential to account for perceptual inaccuracies and uncertainties inherent in real-world environments, ensuring safe and reliable task execution.
Hierarchical Agent Stacks enhance operational efficiency by structuring task execution through learned intermediate layers. These stacks decompose complex tasks into a series of sub-tasks, allowing the agent to learn reusable skill modules at each layer. This modularity facilitates rapid adaptation to novel situations; instead of relearning from scratch, the agent can combine existing skills in new ways. Furthermore, learned intermediate layers enable faster response times, as the agent doesn’t need to process the entire task from initial input; it can leverage pre-computed representations and solutions stored within the hierarchy. This layered approach reduces computational load and improves the speed of task completion compared to monolithic systems.
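The layered decomposition can be illustrated with a small sketch. All skill names here are hypothetical, assumed for illustration: a registry maps each high-level skill to sub-skills, and a recursive expansion yields the primitive commands, so a new task can reuse existing modules instead of being learned from scratch.

```python
from typing import Dict, List

# Skill registry: each high-level skill expands into lower-level skills;
# names absent from the registry are treated as primitives. Illustrative.
SKILLS: Dict[str, List[str]] = {
    "serve_drink": ["navigate_to_table", "place_cup"],
    "navigate_to_table": ["plan_path", "walk"],
    "place_cup": ["lower_arm", "open_gripper"],
}

def expand(skill: str) -> List[str]:
    """Recursively expand a skill into its primitive commands (leaves)."""
    if skill not in SKILLS:          # primitive: executed directly
        return [skill]
    out: List[str] = []
    for sub in SKILLS[skill]:
        out.extend(expand(sub))
    return out

print(expand("serve_drink"))
# ['plan_path', 'walk', 'lower_arm', 'open_gripper']
```

Adding a new top-level behavior then amounts to composing existing entries in the registry, which is the reuse-and-rapid-adaptation property the paragraph describes.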

Seeing the World Correctly: Geometry-Grounded Supervision
The Geometry-Grounded Supervisor constructs a real-time environmental understanding by leveraging multi-object 3D grounding techniques. This process integrates data from RGB-D sensors, providing both color and depth information, with segmentation masks generated by the SAM3 model. SAM3 identifies and delineates individual objects within the scene, and the subsequent 3D grounding step localizes these segmented objects in a consistent, world-coordinate frame. This allows the supervisor to maintain an up-to-date, geometrically accurate representation of the robot’s surroundings, critical for monitoring task execution and identifying discrepancies between the planned and actual states of the environment.
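The core geometric step, lifting a 2D segmentation mask into 3D using aligned depth, reduces to pinhole back-projection. Below is a minimal sketch under assumed inputs (the mask would come from a model like SAM3, and the intrinsics from the RGB-D camera; the function names are illustrative):

```python
from typing import List, Tuple

def backproject(u: int, v: int, z: float,
                fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float, float]:
    """Pinhole model: pixel (u, v) at depth z -> 3D point in the camera frame."""
    return ((u - cx) * z / fx, (v - cy) * z / fy, z)

def mask_centroid_3d(mask: List[List[int]],
                     depth: List[List[float]],
                     intrinsics: Tuple[float, float, float, float]) -> Tuple[float, ...]:
    """Average the 3D points of all valid mask pixels to localize the object.

    `mask` and `depth` are equally sized 2D arrays; `intrinsics` is
    (fx, fy, cx, cy). Pixels with zero depth are skipped as invalid.
    """
    fx, fy, cx, cy = intrinsics
    pts = [backproject(u, v, depth[v][u], fx, fy, cx, cy)
           for v, row in enumerate(mask)
           for u, inside in enumerate(row)
           if inside and depth[v][u] > 0]
    n = len(pts)
    return tuple(sum(p[i] for p in pts) / n for i in range(3))
```

A further rigid transform from camera to world coordinates (from the robot’s odometry or calibration) would place these centroids in the consistent world frame the supervisor maintains.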
The Geometry-Grounded Supervisor employs logic-based plan checking to identify discrepancies between the robot’s intended actions and the observed environment. This process leverages predicate assertions – statements defining relationships between objects and their properties – and scene-graph representations, which encode objects and their spatial relationships. By continuously evaluating these assertions against incoming RGB-D and SAM3 segmentation data, the system can detect deviations from the planned trajectory. For example, a predicate might assert “object X is on surface Y,” and a deviation would be flagged if the sensor data indicates object X is no longer on surface Y. This allows the supervisor to recognize unexpected states and initiate recovery procedures.
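The plan-checking logic can be sketched compactly: represent observed relations as scene-graph edges and the plan’s expectations as predicate assertions, then report every assertion the current scene fails to satisfy. The specific predicates and objects below are illustrative, not taken from the paper.

```python
from typing import Dict, List, Tuple

# A predicate is an (object, relation, object) triple. The scene graph
# records which relations perception currently observes. Illustrative data.
Predicate = Tuple[str, str, str]

scene_graph: Dict[Predicate, bool] = {
    ("mug", "on", "table"): True,
    ("robot", "near", "table"): True,
}

plan_assertions: List[Predicate] = [
    ("mug", "on", "table"),
    ("mug", "in", "gripper"),   # expected by the plan, not yet observed
]

def check_plan(assertions: List[Predicate],
               graph: Dict[Predicate, bool]) -> List[Predicate]:
    """Return the plan assertions the observed scene does not satisfy."""
    return [a for a in assertions if not graph.get(a, False)]

violations = check_plan(plan_assertions, scene_graph)
# violations == [("mug", "in", "gripper")] -> trigger a recovery procedure
```

Because the check is a pure function of the latest scene graph, it can run on every perception update, flagging a deviation (say, a mug knocked off the table) as soon as the corresponding predicate stops holding.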
Evaluation of the Geometry-Grounded Supervisor demonstrates consistent performance improvements across multiple tasks. On the Tidy-desk task, the success rate increased from 5/10 to 7/10 with the Supervisor implemented. The Tabletop-sorting task likewise reached an 8/10 success rate, up from its baseline without the Supervisor. The Bring-me-a-drink task showed the largest gain, reaching a 9/10 success rate over its unsupervised baseline.
Closing the Loop: Towards Autonomous Resilience
Conventional robotic task planning often outlines a desired sequence of actions without fully accounting for the physical realities of execution. This approach can lead to robots attempting maneuvers beyond their capabilities, resulting in failure or even damage. Feasibility-aware skill selection addresses this limitation by integrating a continuous assessment of the robot’s kinematic and dynamic limitations directly into the planning process. Before committing to an action, the system verifies whether it is physically possible given the robot’s current state, joint limits, and actuator capabilities. This proactive constraint satisfaction not only prevents impossible actions but also allows the robot to intelligently select alternative, feasible skills that achieve the desired outcome, dramatically improving robustness and enabling operation in complex, unstructured environments. By prioritizing physically realizable movements, the system ensures smoother, more reliable task completion and minimizes the risk of unexpected errors.
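The feasibility gate described above can be sketched as a pre-commit check: before a skill is executed, its target joint configuration is tested against the robot’s limits, and infeasible candidates are skipped in favor of the next viable alternative. The joint names, limits, and skill candidates below are assumed for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Per-joint position limits in radians. Illustrative values only; a real
# system would also check velocity, torque, and self-collision constraints.
JOINT_LIMITS: Dict[str, Tuple[float, float]] = {
    "shoulder": (-1.5, 1.5),
    "elbow": (0.0, 2.5),
}

def feasible(config: Dict[str, float]) -> bool:
    """True if every joint target lies within its position limits."""
    return all(JOINT_LIMITS[j][0] <= q <= JOINT_LIMITS[j][1]
               for j, q in config.items())

def select_skill(candidates: List[Tuple[str, Dict[str, float]]]) -> Optional[str]:
    """Pick the first candidate skill whose target configuration is reachable."""
    for name, config in candidates:
        if feasible(config):
            return name
    return None                      # no feasible option: replan or ask for help

skill = select_skill([
    ("overhead_grasp", {"shoulder": 2.0, "elbow": 1.0}),  # exceeds shoulder limit
    ("side_grasp", {"shoulder": 1.0, "elbow": 1.2}),
])
# skill == "side_grasp": the infeasible overhead grasp is never attempted
```

Rejecting the infeasible candidate before execution, rather than discovering the limit violation mid-motion, is exactly the proactive constraint satisfaction the paragraph describes.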
The system facilitates a cycle of continuous improvement through code-generated feedback loops and language-based self-feedback mechanisms. Following an attempted task, the robot doesn’t simply succeed or fail; it analyzes its performance, identifying discrepancies between its intended actions and the observed outcome. This analysis isn’t reliant on external correction; instead, the robot generates code that evaluates its own actions, pinpointing errors and suggesting adjustments to its operational parameters. Crucially, this self-assessment is framed using natural language processing, allowing the robot to articulate – internally – what went wrong and how to improve, creating a readily accessible record of its learning process. This allows for iterative refinement of skills, meaning the robot doesn’t just avoid repeating mistakes, but actively enhances its ability to perform tasks with increasing precision and efficiency over time, fostering a level of autonomous learning previously unattainable.
The architecture achieves a significant leap in robotic resilience through the synergistic integration of three core capabilities. Robust monitoring systems provide a constant assessment of the robot’s state and its environment, immediately identifying deviations from expected behavior. This data then fuels proactive recovery mechanisms, allowing the robot to anticipate and mitigate potential failures before they escalate – for example, by adjusting its approach to a task or requesting assistance. Critically, this isn’t a static response; continuous learning algorithms analyze each interaction – both successful and unsuccessful – to refine the robot’s skills and improve its future performance. The result is a system capable of not only persisting through unexpected challenges, but of becoming increasingly adept at navigating complex and dynamic environments, ultimately demonstrating a level of reliability and adaptability previously unattainable in robotic systems.
The pursuit of robust humanoid locomotion-manipulation, as detailed in this framework, inherently demands a willingness to challenge established boundaries. It’s a process of dissecting complex systems (language instructions, 3D perception, task planning) to reveal their underlying mechanisms and limitations. As Donald Davies observed, “A bug is the system confessing its design sins.” This rings particularly true in robotics; each failure, whether a dropped object or a misstep, exposes a flaw in the program’s logic or the robot’s physical capabilities. By embracing these ‘sins’ as opportunities for refinement, researchers can push the boundaries of what’s possible, achieving reliable long-horizon operation through iterative testing and geometric verification.
Beyond the Tray: Charting Future Courses
The presented framework, while demonstrating a capacity for directed, whole-body manipulation, inherently exposes the brittle core of all such systems: the assumption of a static, knowable world. The robot ‘understands’ objects through perception, but that perception is, fundamentally, a reconstruction. The next iteration must actively challenge that reconstruction, embracing uncertainty not as error, but as a source of information. What happens when the perceived geometry doesn’t align with reality? Or when an unexpected object enters the workspace – a deliberate intrusion, perhaps, designed to test the system’s limits?
The current focus on verifiable task programs, while laudable, risks becoming a new form of over-constraint. True agency isn’t about flawlessly executing pre-defined plans; it’s about intelligently deviating from them. The real test lies in building a supervisor that doesn’t simply prevent failure, but learns from it, adapting its internal models and refining its predictive capabilities. The system needs to be capable of self-modification, shifting from a rule-following automaton to something… less predictable.
Ultimately, the ‘Cybo-Waiter’ represents a stepping stone. The pursuit isn’t simply about creating a robot that can carry a tray; it’s about reverse-engineering the complex interplay between perception, action, and intention that defines intelligent behavior. The unanswered questions aren’t about improving the algorithms, but about fundamentally rethinking the very definition of control.
Original article: https://arxiv.org/pdf/2603.10675.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-03-13 03:31