Author: Denis Avetisyan
A new approach combines abstract action planning with the power of large language models to help robots understand and execute complex tasks in everyday environments.

This paper introduces MaP-AVR, a meta-action planner leveraging vision-language models and retrieval-augmented generation for improved robotic task planning and embodied AI.
While embodied agents increasingly rely on large language and vision models for complex task planning, a critical gap remains in defining truly generalizable skill sets. This paper introduces MaP-AVR: A Meta-Action Planner for Agents Leveraging Vision Language Models and Retrieval-Augmented Generation, a novel approach that abstracts robotic actions into a set of fundamental “meta-actions” – encompassing movement, end-effector states, and environmental relationships – thereby decoupling planning from human-centric concepts. By integrating this meta-action framework with Retrieval-Augmented Generation for improved in-context learning, MaP-AVR enables more robust and adaptable task execution. Could this abstraction unlock a new era of truly versatile robotic agents capable of seamlessly navigating and interacting with complex daily environments?
The Illusion of Robotic Flexibility
Conventional robotic task planning frequently encounters limitations when confronted with the complexities of real-world environments. These systems, often reliant on pre-programmed sequences and precise environmental models, struggle with the ambiguity and unpredictability inherent in tasks requiring nuanced understanding. A robot designed to assemble a product on a factory line might falter when presented with a slightly misaligned component, or when encountering an unexpected obstacle. This difficulty stems from the challenge of translating human intuition – the ability to quickly assess situations, adapt to changes, and employ common sense – into algorithmic instructions. Consequently, robots often require meticulously detailed instructions for even seemingly simple tasks, hindering their ability to operate autonomously in dynamic and unstructured settings. The core issue isn’t a lack of processing power, but rather the difficulty in equipping machines with the capacity for flexible, context-aware reasoning.
Conventional robotic systems frequently encounter limitations when operating beyond carefully controlled settings, revealing a critical deficiency in adaptability. These methods, often reliant on pre-programmed sequences or static environmental models, struggle when confronted with the inherent unpredictability of real-world scenarios. A slight deviation – an unexpected obstacle, a change in lighting, or a novel object – can disrupt execution, leading to failure or the need for human intervention. This inflexibility stems from a reliance on precise, complete information at the planning stage, which is rarely available in dynamic environments. Consequently, research focuses on developing systems capable of robust planning – approaches that can not only account for uncertainty but also rapidly re-plan and adjust strategies in response to unforeseen changes, mirroring the intuitive adaptability observed in biological systems.
Successfully navigating complex tasks hinges on breaking them down into manageable sub-components, yet current robotic approaches to task decomposition often prove inflexible. These systems typically require significant pre-programming and meticulous manual adjustments to account for even slight variations in the environment or task parameters. This reliance on extensive manual engineering limits their adaptability and scalability, as each new scenario or modification demands substantial effort from human experts. The inherent brittleness of these methods means they struggle with unforeseen circumstances or dynamic changes, hindering their deployment in real-world applications where unpredictability is the norm. Consequently, a key challenge lies in developing decomposition strategies that are both robust and capable of autonomously adapting to the inherent complexities of real-world tasks, reducing the burden of manual intervention and enabling more versatile robotic systems.
Addressing the limitations of current robotic systems requires a fundamental shift in how complex tasks are approached. Conventional methods, reliant on pre-programmed sequences and rigid algorithms, falter when confronted with the inherent unpredictability of real-world environments. Researchers are actively exploring novel architectures, including hierarchical reinforcement learning and probabilistic reasoning, to imbue robots with the capacity for adaptive problem-solving. These innovative approaches aim to move beyond simple reaction to stimuli, enabling systems to anticipate potential challenges, dynamically re-plan strategies, and generalize learned behaviors to previously unseen scenarios. Ultimately, scaling reasoning capabilities isn’t merely about increasing computational power; it necessitates developing algorithms that can efficiently represent uncertainty, reason about abstract concepts, and learn from limited data – qualities essential for truly intelligent autonomous operation.

A Patch for the Symptom: Introducing MaP-AVR
The MaP-AVR framework employs a Meta-Action Planner to decompose high-level goals into a series of abstract actions, facilitating task completion in complex environments. This planner operates by first identifying the core sub-tasks required to achieve the overall objective, then representing each sub-task as an abstract action: a generalized instruction independent of specific environmental details. These abstract actions are sequenced to form a plan and subsequently refined into concrete, executable steps based on the current state of the environment. This hierarchical approach allows MaP-AVR to handle tasks with significant complexity and long horizons by reducing the planning problem to a manageable sequence of simpler actions, improving efficiency and robustness compared to direct, monolithic planning strategies.
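As a rough illustration of what such a decomposition might produce, the sketch below represents a plan as a list of abstract meta-action records; the action names, fields, and the hard-coded goal are assumptions for the example, not the paper's schema.
```python
from dataclasses import dataclass, field
from typing import Dict, List

# Illustrative meta-action record: the names and fields below are assumptions
# for this sketch, not the paper's exact schema.
@dataclass
class MetaAction:
    name: str                      # e.g. "move_to", "close_gripper", "open_gripper"
    target: str = ""               # object or location the action refers to
    params: Dict[str, float] = field(default_factory=dict)

def decompose(goal: str) -> List[MetaAction]:
    """Toy decomposition of a high-level goal into abstract meta-actions.
    A real planner would query a VLM; here the mapping is hard-coded."""
    if goal == "put the cup on the shelf":
        return [
            MetaAction("move_to", target="cup"),
            MetaAction("close_gripper", target="cup"),
            MetaAction("move_to", target="shelf"),
            MetaAction("open_gripper"),
        ]
    raise NotImplementedError(f"no demonstration for goal: {goal}")

if __name__ == "__main__":
    for step in decompose("put the cup on the shelf"):
        print(step)
```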
MaP-AVR incorporates Vision-Language Models (VLMs) like GPT-4o to process visual and textual inputs, enabling a comprehensive understanding of the environment and task context. These VLMs facilitate the interpretation of sensor data – including images and object detections – alongside natural language instructions. This combined processing allows the framework to generate a diverse set of potential actions, grounded in both perceptual understanding and linguistic goals. Specifically, the VLM’s ability to perform visual reasoning and natural language generation is leveraged to create action sequences that are contextually relevant and adaptable to varying environmental conditions.
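The snippet below sketches how a planner might query a VLM such as GPT-4o with a scene image and an instruction; it assumes the official OpenAI Python client, an API key in the environment, and a local image file, and the prompt wording and expected output format are illustrative rather than the paper's actual prompts.
```python
import base64
from openai import OpenAI  # assumes the official OpenAI Python client is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def propose_meta_actions(image_path: str, instruction: str) -> str:
    """Ask a VLM to propose a meta-action sequence for the scene in the image.
    Prompt wording and expected output format are illustrative assumptions."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Task: {instruction}\n"
                         "List the meta-actions (movement, gripper state, relation) "
                         "needed to complete the task, one per line."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```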
Retrieval-Augmented Generation (RAG) within MaP-AVR functions by accessing a pre-populated database of previously successful task executions. This database contains examples of states, actions, and resulting outcomes, providing a knowledge base for the planner. When faced with a new task or ambiguous situation, the RAG module retrieves relevant demonstrations from this database based on similarity to the current state. These retrieved examples are then incorporated as context during action planning, effectively guiding the selection of robust and previously validated actions. This process mitigates the risks associated with novel or potentially flawed action sequences, improving the overall reliability and success rate of task completion, especially in dynamic or unpredictable environments.
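A minimal sketch of this retrieval step is shown below, using a toy demonstration database and a stand-in bag-of-words embedding; a real system would use a learned text or image encoder and a far larger store of successful executions.
```python
import numpy as np

# Toy demonstration database: each entry pairs a state/task description with the
# meta-action sequence that previously succeeded.
DEMOS = [
    {"desc": "cup on table, shelf to the right",
     "plan": ["move_to cup", "close_gripper", "move_to shelf", "open_gripper"]},
    {"desc": "drawer closed, spoon inside",
     "plan": ["move_to drawer", "close_gripper", "pull", "open_gripper"]},
]

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Stand-in embedding: hash words into a fixed-size bag-of-words vector."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-8)

def retrieve(query: str, k: int = 1):
    """Return the k demonstrations whose descriptions are most similar to the query."""
    q = embed(query)
    scored = sorted(DEMOS, key=lambda d: -float(q @ embed(d["desc"])))
    return scored[:k]

if __name__ == "__main__":
    for demo in retrieve("a cup sits on the table near a shelf"):
        print(demo["plan"])
```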
Traditional planning methods often struggle with real-world complexity due to their reliance on precisely defined environments and actions, leading to brittle performance when encountering unforeseen circumstances. MaP-AVR overcomes these limitations by integrating a Meta-Action Planner with Vision-Language Models and Retrieval-Augmented Generation. This combination enables the framework to decompose tasks into abstract steps, interpret complex visual inputs, and leverage prior successful demonstrations. Consequently, MaP-AVR exhibits increased robustness and adaptability in dynamic and uncertain environments, allowing it to generalize beyond pre-programmed scenarios and effectively address novel situations.
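Putting the pieces together, a hypothetical planning loop might look like the following; the three callables stand for the retrieval, VLM-planning, and execution helpers sketched in this article, and their names and signatures are placeholders rather than the system's released interface.
```python
def plan_and_execute(image_path, instruction, robot,
                     retrieve, propose_meta_actions, execute_meta_action):
    """Hypothetical end-to-end MaP-AVR-style loop.  The injected callables are
    the retrieval, VLM-planning, and execution helpers sketched elsewhere in
    this article; they are placeholders, not a released API."""
    demos = retrieve(instruction, k=2)                    # RAG: similar past successes
    context = "\n".join(str(d["plan"]) for d in demos)    # in-context examples
    raw_plan = propose_meta_actions(
        image_path,
        f"{instruction}\nSuccessful past plans:\n{context}")
    for line in raw_plan.splitlines():                    # one meta-action per line
        if line.strip():
            execute_meta_action(robot, line.strip())      # ground into motor commands
```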

From Abstract Thought to Mechanical Action
Action Execution Functions constitute the interface between task planning and robotic actuation. These functions receive high-level, abstract “meta-actions” – such as “grasp object” or “navigate to location” – and decompose them into a sequence of low-level commands executable by the robot’s hardware. This translation involves specifying motor velocities, joint angles, end-effector positions, and gripper states. The functions incorporate kinematic and dynamic models of the robot to ensure accurate and feasible motion planning, and also manage the timing and synchronization of these commands. Furthermore, they often include error handling and feedback mechanisms to address unexpected situations during execution and maintain task completion.
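A bare-bones execution function might dispatch on the meta-action verb as below; the robot interface (lookup_pose, move_to_pose, set_gripper) is a hypothetical wrapper over the real controller, and the gripper widths are illustrative defaults.
```python
def execute_meta_action(robot, action: str) -> None:
    """Map an abstract meta-action string onto low-level robot commands.
    The robot methods used here are hypothetical, shown only for illustration."""
    verb, _, target = action.partition(" ")
    if verb == "move_to":
        pose = robot.lookup_pose(target)       # object pose from the scene model
        robot.move_to_pose(pose)               # motion planner produces joint commands
    elif verb == "close_gripper":
        robot.set_gripper(width=0.0)           # fully closed
    elif verb == "open_gripper":
        robot.set_gripper(width=0.08)          # 8 cm opening, an illustrative default
    else:
        raise ValueError(f"unknown meta-action: {action}")
```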
Robot action execution relies on a robust spatial understanding capability, which encompasses the perception and interpretation of the surrounding environment. This is achieved through the integration of sensor data – including lidar, cameras, and depth sensors – processed via algorithms for simultaneous localization and mapping (SLAM) and object detection. The resulting environmental model allows the robot to determine the position and orientation of itself and all relevant objects within its workspace. Accurate spatial understanding is critical for tasks such as path planning, grasping, and manipulation, as it provides the necessary information for the robot to interact safely and effectively with its surroundings. The fidelity of this understanding directly impacts the success rate and efficiency of subsequent actions.
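For instance, placing a detection into the shared world model reduces to a frame transform once the camera pose is known from SLAM or calibration; the sketch below assumes a single 3D point detection and uses NumPy.
```python
import numpy as np

def camera_to_world(p_cam: np.ndarray, R_wc: np.ndarray, t_wc: np.ndarray) -> np.ndarray:
    """Transform a 3D point from the camera frame into the world frame.
    R_wc and t_wc (the camera pose in the world) would come from SLAM or calibration."""
    return R_wc @ p_cam + t_wc

# Example: a depth camera detects a cup 0.6 m in front of it while the camera
# sits 1.2 m above the world origin with identity rotation.
R_wc = np.eye(3)
t_wc = np.array([0.0, 0.0, 1.2])
cup_cam = np.array([0.0, 0.0, 0.6])       # detection in camera coordinates
print(camera_to_world(cup_cam, R_wc, t_wc))  # -> [0.  0.  1.8]
```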
Safe and efficient robot navigation is achieved through the integration of foundation models and obstacle avoidance algorithms. Foundation models, pre-trained on extensive datasets, provide the robot with generalized knowledge of the environment and potential trajectories. Complementing this, obstacle avoidance algorithms utilize sensor data – such as LiDAR and cameras – to detect and dynamically map surrounding obstacles in real-time. These algorithms then generate collision-free paths, adjusting the robot’s trajectory as needed. The combined system allows the robot to not only plan a route to a goal but also to react to unforeseen impediments, ensuring operational safety and maximizing navigational efficiency. Performance is often evaluated using metrics like path length, execution time, and the number of near-collision events.
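The sketch below illustrates the flavour of such a check with a toy local planner that picks the candidate waypoint nearest the goal while keeping a fixed clearance from known obstacle points; the clearance value and the point-based obstacle model are simplifications, not the paper's method.
```python
import numpy as np

def pick_waypoint(candidates, obstacles, goal, clearance=0.3):
    """Choose the candidate waypoint nearest the goal that stays at least
    `clearance` metres from every known obstacle point.  A stand-in for a full
    local planner; the 0.3 m clearance is an illustrative value."""
    best, best_dist = None, float("inf")
    for c in candidates:
        if all(np.linalg.norm(c - o) >= clearance for o in obstacles):
            d = np.linalg.norm(c - goal)
            if d < best_dist:
                best, best_dist = c, d
    return best  # None if every candidate is blocked

goal = np.array([2.0, 0.0])
obstacles = [np.array([1.0, 0.0])]
candidates = [np.array([1.0, 0.05]), np.array([1.0, 0.5]), np.array([0.5, 0.0])]
print(pick_waypoint(candidates, obstacles, goal))  # -> [1.  0.5]
```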
The system architecture facilitates a direct translation of task objectives, formulated at a high level, into actionable motor commands for the robot. This is achieved through a structured pipeline where abstract goals are decomposed into a sequence of concrete actions, which are then mapped to specific robot joint velocities or end-effector positions. This integration eliminates the need for manual intervention in converting plans into executable code, enabling the robot to autonomously perform complex tasks involving multiple sequential steps and dynamic adjustments based on environmental feedback. The resulting efficiency is critical for applications requiring sustained operation and adaptability in unstructured environments.

Validation and the Illusion of Progress
Evaluations of MaP-AVR across established benchmark datasets – including RoboVQA, RT-1, and Droid-100 – consistently demonstrate substantial performance improvements over existing methods, notably the ReKep framework. These datasets, designed to assess robotic vision and reasoning capabilities, provided a rigorous testing ground for MaP-AVR’s ability to understand complex instructions and execute appropriate actions in varied environments. The results indicate a clear advantage for MaP-AVR in tasks requiring visual perception, planning, and execution, highlighting its potential for advancing robotic automation and intelligent systems. This outperformance suggests a more robust and adaptable framework capable of handling the nuances of real-world robotic challenges.
To rigorously evaluate the MaP-AVR framework in a practical setting, experiments were conducted within the OmniGibson simulation environment, a physics-based platform designed to mirror the complexities of real-world human-robot interaction. This simulation allowed for comprehensive testing of the framework’s ability to perform manipulation tasks, and the results demonstrated a notably higher task success rate when compared to the ReKep method. The OmniGibson environment’s realistic physics engine and diverse object library provided a challenging, yet controlled, space to assess MaP-AVR’s robustness and adaptability, confirming its potential for deployment in more complex and unpredictable scenarios. This success within the simulation suggests a strong foundation for translating the framework’s capabilities into tangible real-world robotic applications.
The framework leverages the power of In-Context Learning (ICL) and Chain-of-Thought (CoT) prompting to significantly improve the vision-language model’s (VLM) capacity for complex reasoning. By providing the VLM with a few illustrative examples – the ‘in-context’ learning – it’s able to rapidly adapt to new tasks without requiring extensive retraining. Furthermore, the incorporation of Chain-of-Thought prompting encourages the model to break down complicated problems into a series of intermediate steps, mimicking human-like thought processes. This allows for more transparent and accurate decision-making, enabling the VLM to not only identify the correct action, but also to justify its reasoning – a crucial element for building trust and reliability in robotic applications.
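A plain-text version of such a prompt might be assembled as follows; the field names and wording are illustrative, and in practice the few-shot examples would come from the RAG database rather than being hard-coded.
```python
def build_icl_cot_prompt(task: str, examples: list) -> str:
    """Assemble a few-shot, chain-of-thought prompt for the VLM.
    Each example pairs a task with the reasoning and plan from a past success;
    the exact wording is an illustrative assumption, not the paper's prompt."""
    parts = []
    for ex in examples:
        parts.append(
            f"Task: {ex['task']}\n"
            f"Reasoning: {ex['reasoning']}\n"
            f"Plan: {ex['plan']}\n"
        )
    parts.append(f"Task: {task}\nReasoning: let's think step by step.")
    return "\n".join(parts)

examples = [{
    "task": "put the cup on the shelf",
    "reasoning": "the cup must be grasped before moving; the shelf is the goal surface",
    "plan": "move_to cup; close_gripper; move_to shelf; open_gripper",
}]
print(build_icl_cot_prompt("put the spoon in the drawer", examples))
```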
A detailed analysis of failure modes within the MaP-AVR framework revealed two primary limitations impacting task success. Approximately 26% of failures stemmed from inaccuracies in identifying the target object or determining an appropriate grasp point, indicating a need for improved object localization and grasp planning algorithms. Furthermore, 25% of failures were attributed to difficulties in parsing the required action sequence – essentially, the system misinterpreting or improperly executing the steps needed to complete the task. These findings directly highlight specific areas for future development, suggesting that refining both visual perception for precise object handling and natural language processing for robust action understanding will be crucial for enhancing the framework’s overall reliability and performance.

The Long Road to True Robotic Intelligence
A central challenge for robotic intelligence lies in creating systems that perform reliably not just in controlled settings, but also when faced with novel situations and environments. Future research concerning MaP-AVR will therefore prioritize enhancing its capacity for generalization, moving beyond performance within the training distribution. This involves exploring techniques to improve the system’s ability to abstract key principles from demonstrated tasks and apply them effectively to previously unseen scenarios, potentially through methods like domain randomization or meta-learning. Success in this area promises a significant step towards creating robots capable of autonomous operation in the dynamic and unpredictable real world, reducing the need for task-specific retraining and enabling broader applicability across diverse environments and challenges.
The current framework could benefit from the incorporation of a Policy Network, a component designed to learn and refine the robot’s control strategies through reinforcement learning. This network would operate in parallel with the existing retrieval and planning modules, allowing the robot to not simply execute pre-defined plans, but to adapt its actions based on observed outcomes and dynamically optimize for improved performance. By learning a policy – a mapping from states to actions – the robot can potentially overcome limitations in the RAG database and generalize more effectively to novel situations. This integration promises a more nuanced and responsive control system, enabling the robot to refine its movements, improve task completion rates, and ultimately exhibit a higher degree of intelligence in complex, real-world scenarios.
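As a sketch of what such a component could look like, the snippet below defines a small categorical policy in PyTorch and performs one REINFORCE-style update; the architecture, dimensions, and reward signal are assumptions, since the paper proposes the idea without prescribing a design.
```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Minimal stochastic policy mapping a state vector to action logits.
    Dimensions and architecture are illustrative, not prescribed by the paper."""
    def __init__(self, state_dim: int = 32, num_actions: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, num_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.net(state))

# One REINFORCE-style update: sample an action, observe a reward, nudge the policy.
policy = PolicyNetwork()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
state = torch.randn(1, 32)
dist = policy(state)
action = dist.sample()
reward = torch.tensor(1.0)                  # stand-in for task success feedback
loss = -dist.log_prob(action) * reward      # policy-gradient surrogate loss
optimizer.zero_grad(); loss.backward(); optimizer.step()
```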
The robustness of the MaP-AVR framework is directly linked to the breadth and depth of its Retrieval-Augmented Generation (RAG) database. Expanding this database with a significantly larger and more varied collection of task demonstrations allows the robot to encounter a wider range of scenarios during the retrieval process. This increased exposure isn’t merely about quantity; it’s about diversity – encompassing variations in object types, environmental conditions, and task execution strategies. A more comprehensive RAG database equips the robot with the necessary contextual information to effectively generalize its learned skills to novel situations, improving its ability to adapt and perform reliably even when faced with unexpected challenges. Ultimately, scaling the database fosters a more resilient and adaptable robotic intelligence, capable of handling the inherent complexities of real-world applications.
The development of MaP-AVR represents a significant stride towards robotic systems exhibiting true intelligence and adaptability. By effectively bridging the gap between pre-programmed behaviors and novel situations, this framework enables robots to approach unfamiliar challenges with a degree of autonomy previously unattainable. This isn’t simply about automating tasks; it’s about fostering a capacity for learning and generalization, allowing robots to not just perform instructions, but to understand the underlying principles and apply them creatively. Consequently, MaP-AVR holds considerable promise for deployment in dynamic, unpredictable environments – from assisting in disaster relief and environmental monitoring to enhancing manufacturing processes and providing personalized support in healthcare – ultimately paving the way for robots that are genuinely capable of tackling the complexities of the real world.

The pursuit of increasingly sophisticated task planners, like the MaP-AVR framework detailed in this work, feels predictably iterative. It abstracts skills into meta-actions and employs retrieval-augmented generation; elegant concepts, certainly. However, one anticipates the inevitable accumulation of technical debt as production environments expose edge cases unforeseen in controlled testing. As Alan Turing observed, “There is no substitute for experience.” This holds true; no matter how robust the initial design, real-world deployment will reveal limitations and necessitate continuous refinement. The core idea of action decomposition is sound, yet the true test lies in how gracefully the system degrades when confronted with the messy realities of embodied AI.
What’s Next?
This ‘meta-action’ planning, predictably, simply shifts the burden of failure. The elegance of abstracting tasks into fundamental actions glosses over the inevitable messiness of real-world execution. Production, as always, will discover edge cases the simulations missed: the slightly askew object, the unexpected occlusion, the user who doesn’t quite mean what they say. The system will decompose beautifully, right up until it doesn’t.
Retrieval-Augmented Generation offers a temporary reprieve from the brittleness of purely generative models, but it’s a debt accruing interest. Each retrieved example is a constraint, a potential point of failure when faced with novel situations. The long-term viability hinges not on more data, but on a more robust understanding of when, and why, retrieval fails. Expect a resurgence of interest in symbolic reasoning – everything new is old again, just renamed and still broken.
The true test won’t be demonstrated in controlled environments, but in the slow, agonizing process of deployment. The system will need to adapt to the chaotic symphony of a home or office. One suspects the next iteration will involve a significantly larger error handling budget. And, inevitably, a team dedicated solely to explaining why the robot is stuck in the laundry room.
Original article: https://arxiv.org/pdf/2512.19453.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/