Author: Denis Avetisyan
New research demonstrates how robots can dynamically adjust their movements based on spoken commands, bridging the gap between human intention and robotic action.

This work presents IROSA, a tool-based architecture leveraging large language models and kernelized movement primitives for safe, interpretable, and deployable robot skill adaptation in industrial settings.
While foundation models excel at diverse tasks and imitation learning offers efficient robot skill adaptation, their combined potential for practical robotics, particularly in industrial settings, remains largely unexplored. This paper introduces ‘IROSA: Interactive Robot Skill Adaptation using Natural Language’, a novel framework enabling robots to modify skills via natural language by interfacing large language models with kernelized movement primitives through a tool-based architecture. Our approach allows for safe, interpretable skill adaptation, demonstrated on a 7-DoF robot performing a bearing ring insertion, without requiring model fine-tuning or direct low-level control. Could this paradigm shift unlock truly flexible and intuitive human-robot collaboration in complex industrial environments?
The Challenge of Real-World Adaptability
Conventional robotic systems often falter when confronted with the inherent messiness of real-world scenarios. Unlike the controlled conditions of a factory floor, everyday environments are rarely static or predictable; lighting shifts, objects move unexpectedly, and surfaces vary in texture and grip. Traditional programming methods, which rely on precisely defined sequences of actions, struggle to accommodate these constant changes. A robot programmed to grasp a specific object in a fixed location will fail if that object is slightly moved, or if a new obstacle appears in its path. This inflexibility stems from the limitations of pre-programmed trajectories and the difficulty of anticipating every possible contingency, highlighting the need for robots capable of perceiving and responding to their surroundings in a more robust and adaptable manner.
Robots operating with rigidly defined movement sequences often falter when confronted with even slight deviations from their programmed parameters. This inflexibility stems from a reliance on pre-defined trajectories, where each action is meticulously planned and lacks the capacity for real-time adjustment. Consequently, a robot designed to perform a specific task, like assembling a product, may struggle if the location of parts shifts, an obstruction appears, or the task itself undergoes minor alteration. This limitation highlights a significant hurdle in robotics: the inability to generalize skills to new, unforeseen circumstances, hindering deployment in dynamic and unpredictable real-world environments where adaptability is paramount.
To truly operate in complex environments, robots require more than pre-programmed sequences; they need the capacity to interpret abstract goals and execute them with precision. Current research focuses on developing systems that bridge this gap, enabling robots to receive commands phrased in natural language or high-level task specifications – such as “carefully place the object on the shelf” – and autonomously generate the intricate series of movements needed for successful completion. This involves sophisticated algorithms that decompose complex instructions into manageable sub-tasks, predict the consequences of actions, and dynamically adjust strategies based on real-time sensory input. Ultimately, the ability to translate intention into nuanced physical action represents a critical step towards creating robots capable of genuine adaptability and seamless integration into human environments.

From Instructions to Action: A Tool-Based Approach
Current robotic systems are increasingly employing Large Language Models (LLMs) to directly process Natural Language Instructions, moving beyond traditional, explicitly programmed sequences. These LLMs function as an intermediary layer, translating human language into a series of discrete, actionable steps. This decomposition process involves identifying the intent within the instruction, extracting relevant parameters, and formulating a plan consisting of elementary actions the robot can execute. The core benefit of this paradigm is the ability to accept high-level commands – such as “move the box to the left” – and automatically generate the necessary low-level control signals without requiring task-specific model retraining for each new instruction.
A Tool-Based Architecture decouples the perception of natural language instructions from the execution of robotic actions by introducing an intermediate layer of predefined tools. This separation allows the Large Language Model (LLM) to focus solely on interpreting intent and selecting appropriate tools, rather than directly generating low-level motor commands. Each tool encapsulates a specific robotic capability – for example, adjusting velocity, navigating to a specific coordinate, or activating a gripper – and provides a standardized interface for interaction. This modular design enhances system flexibility, enabling the integration of new tools and capabilities without requiring retraining of the LLM, and facilitates adaptation to different robotic platforms or environments.
The system’s adaptability stems from its use of predefined tools that abstract low-level control. Instead of requiring complete model retraining to modify behavior, existing skills are augmented by composing them with these tools. For example, a robot possessing a ‘move to location’ skill can have its execution speed altered via a ‘speed modulation’ tool, or its trajectory refined using a ‘via-point insertion’ tool. This modular approach allows for rapid adaptation to new tasks or environmental constraints without modifying the core language understanding or movement planning components, significantly reducing computational cost and development time.
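The decoupling described above can be sketched in a few lines of Python. This is a hypothetical illustration, not the paper's implementation: the tool names (`modulate_speed`, `insert_via_point`) and the `Trajectory` container are invented here to show how each tool augments a skill behind a uniform interface, so the language model only needs to emit a tool name and its arguments.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class Trajectory:
    points: List[Tuple[float, float]]  # (x, y) via-points of the skill
    speed_scale: float = 1.0           # 1.0 = demonstrated execution speed

def modulate_speed(traj: Trajectory, factor: float) -> Trajectory:
    """Scale execution speed without changing the path geometry."""
    return Trajectory(points=list(traj.points),
                      speed_scale=traj.speed_scale * factor)

def insert_via_point(traj: Trajectory, index: int,
                     point: Tuple[float, float]) -> Trajectory:
    """Refine the path by inserting an intermediate via-point."""
    pts = list(traj.points)
    pts.insert(index, point)
    return Trajectory(points=pts, speed_scale=traj.speed_scale)

# The registry is all the language model "sees": names mapped to callables.
TOOLS: Dict[str, Callable] = {
    "modulate_speed": modulate_speed,
    "insert_via_point": insert_via_point,
}

def apply_tool_call(traj: Trajectory, name: str, **kwargs) -> Trajectory:
    """Execute one tool call emitted by the language model."""
    return TOOLS[name](traj, **kwargs)

# "Move there more slowly, passing closer to the fixture" might decompose into:
traj = Trajectory(points=[(0.0, 0.0), (1.0, 1.0)])
traj = apply_tool_call(traj, "modulate_speed", factor=0.5)
traj = apply_tool_call(traj, "insert_via_point", index=1, point=(0.5, 0.8))
```

Because each tool returns a new `Trajectory`, tool calls compose freely, which is the property that lets new capabilities be added without retraining the language model.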
![The demonstrated trajectory adaptation successfully modifies a KMP trajectory with via-point insertion to reach a newly positioned object based on natural language commands, while maintaining task completion.](https://arxiv.org/html/2603.03897v1/2603.03897v1/kmp_adaptation_plot.png)
Safe and Efficient Movement: Navigating Dynamic Environments
Collision avoidance is a fundamental requirement for robotic navigation in dynamic environments. Systems employing Repulsion Point Generation utilize Signed Distance Fields (SDFs) to represent the surrounding space, effectively mapping the distance to the nearest obstacle at any given point. These SDFs are then used to calculate repulsive forces that steer the robot away from potential collisions; the magnitude of the force is inversely proportional to the distance, ensuring stronger avoidance behavior as the robot nears an obstacle. This approach provides a robust means of preventing impacts because it doesn’t rely on precise environmental modeling and can adapt to changes in the environment in real-time, unlike methods requiring complete knowledge of obstacle positions and shapes.
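The inverse-distance repulsion described above can be sketched as follows. This is a minimal illustration under simplifying assumptions (a single circular obstacle in 2-D, and an influence band and gain chosen arbitrarily), not the paper's implementation:

```python
import math

def sdf_circle(p, center, radius):
    """Signed distance from point p to a circular obstacle (negative inside)."""
    return math.hypot(p[0] - center[0], p[1] - center[1]) - radius

def repulsion(p, center, radius, influence=1.0, gain=1.0):
    """Repulsive vector pushing p away from the obstacle.

    Zero outside the influence band; the magnitude grows as
    gain * (1/d - 1/influence) as the signed distance d shrinks,
    giving stronger avoidance the closer the robot gets.
    """
    d = sdf_circle(p, center, radius)
    if d >= influence:
        return (0.0, 0.0)
    d = max(d, 1e-6)  # clamp when touching or penetrating the surface
    # The SDF gradient of a circle points radially away from its centre.
    norm = math.hypot(p[0] - center[0], p[1] - center[1]) or 1e-6
    gx, gy = (p[0] - center[0]) / norm, (p[1] - center[1]) / norm
    mag = gain * (1.0 / d - 1.0 / influence)
    return (mag * gx, mag * gy)

# Half a unit from the obstacle surface: a unit push straight away from it.
force_near = repulsion((1.5, 0.0), center=(0.0, 0.0), radius=1.0)
# Well outside the influence band: no force at all.
force_far = repulsion((3.0, 0.0), center=(0.0, 0.0), radius=1.0)
```

Because the force depends only on the local distance value, the same computation works for any obstacle representable as an SDF, which is what makes the approach robust to environmental change.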
Kernelized Movement Primitives (KMPs) provide a method for representing robot motion skills as a weighted sum of kernel functions applied to a set of via-points, enabling generalization to novel situations. The underlying framework is probabilistic; KMPs, when combined with Gaussian Mixture Models (GMMs), allow for the representation of trajectory distributions rather than single, deterministic paths. This probabilistic approach accounts for inherent uncertainties in robot actuation and sensing, and permits adaptation of movements based on observed data. The GMM defines a probability distribution over possible trajectories, with each Gaussian component representing a likely movement variation, allowing the robot to select or blend between these variations based on contextual information and environmental feedback. [latex] \mathbb{P}(x) = \sum_{i=1}^{N} \pi_i \mathcal{N}(x | \mu_i, \Sigma_i) [/latex] represents the probability of a state x as the sum of N Gaussian components, each with weight [latex] \pi_i [/latex], mean [latex] \mu_i [/latex], and covariance [latex] \Sigma_i [/latex].
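The mixture density above translates directly into code. The sketch below is one-dimensional for readability (the paper's KMPs operate on full trajectory distributions, and the weights, means, and variances here are illustrative values, not learned parameters):

```python
import math

def gaussian_pdf(x, mu, var):
    """Univariate Gaussian density N(x | mu, var)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gmm_density(x, weights, means, variances):
    """P(x) = sum_i pi_i * N(x | mu_i, var_i), as in the text."""
    assert abs(sum(weights) - 1.0) < 1e-9, "mixture weights must sum to 1"
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))

# Two movement variations: a tight one around 0 and a broader one around 2.
params = dict(weights=[0.6, 0.4], means=[0.0, 2.0], variances=[0.25, 1.0])
p_at_mode = gmm_density(0.0, **params)  # near the dominant variation
p_far = gmm_density(5.0, **params)      # far from both variations
```

Each Gaussian component corresponds to one likely movement variation; evaluating the density tells the robot how plausible a candidate state is, which is the basis for blending between variations.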
Task-Parameterized Kernelized Movement Primitives (KMPs) improve robotic skill execution by defining movements relative to specific task frames, rather than a global coordinate system. This is achieved by augmenting the KMP’s parameterization with task-relevant variables; the primitive’s basis functions are then conditioned on these variables, effectively transforming the movement to align with the current task context. This approach allows a single KMP to represent a skill applicable to various locations and orientations within the robot’s workspace, increasing adaptability and reducing the need for pre-defined trajectories for each specific scenario. The resulting parameterized movement is then computed using the standard KMP framework, leveraging Gaussian Mixture Models to represent probabilistic variations and ensure smooth, reliable execution.
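Geometrically, task parameterization means expressing the skill relative to a movable frame. The following sketch assumes a planar skill and a rigid task frame, both simplifications introduced here for illustration rather than taken from the paper:

```python
import math

def to_task_frame(points, origin, theta):
    """Map local via-points into the world pose of a task frame.

    The task frame sits at `origin` with orientation `theta` (radians);
    each demonstrated point is rotated and translated accordingly, so one
    skill serves any object pose in the workspace.
    """
    c, s = math.cos(theta), math.sin(theta)
    return [(origin[0] + c * x - s * y, origin[1] + s * x + c * y)
            for x, y in points]

# A short approach motion demonstrated in the object's local frame...
local_skill = [(0.0, 0.0), (0.1, 0.0)]
# ...replayed against an object now at (1, 2), rotated 90 degrees.
world = to_task_frame(local_skill, origin=(1.0, 2.0), theta=math.pi / 2)
```

A single demonstrated primitive thus covers every object placement; only the frame parameters change, not the skill itself.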

Beyond Modification: Generating Novel Robotic Actions
Recent advancements in robotic manipulation have moved beyond simply modifying existing actions to generating entirely new trajectories from natural language instructions, building upon the foundations laid by systems like CLIPORT. These newer systems, such as LaTTe, enhance CLIPORT’s capabilities by directly translating linguistic commands into sequences of robotic actions, rather than relying on pre-defined skill libraries. This represents a significant leap towards more flexible and intuitive robot control, allowing users to specify what they want a robot to achieve, without needing to explicitly program how to do it. By learning a direct mapping from language to trajectories, these systems demonstrate an ability to generalize to novel tasks and environments, opening up possibilities for robots to assist in a wider range of unstructured, real-world scenarios.
While many robotic manipulation systems rely on pre-defined tools and modular action sequences, OVITA presents a distinct approach by directly generating executable code from natural language instructions. This method allows for greater flexibility in task planning and execution, potentially enabling robots to handle scenarios not explicitly programmed within a tool-based framework. However, this divergence from modularity introduces new challenges; ensuring the generated code is both functionally correct and safe for robotic execution requires robust verification and error-handling mechanisms, as the system must translate abstract language into precise motor commands without the benefit of pre-defined, tested modules.
KITE elevates language-conditioned robotic manipulation through a novel approach centered on keypoint control. Rather than directing an entire action at once, the system learns policies conditioned on specific keypoints – critical locations within an object or scene. This allows for significantly finer-grained control during task execution, enabling the robot to adjust its movements with precision based on visual feedback. By focusing on these keypoints, KITE achieves a more nuanced understanding of the required manipulation, leading to robust performance even with complex tasks and variations in object pose. The resulting system doesn’t simply replicate known actions; it dynamically generates trajectories tailored to the specific situation, improving adaptability and success rates in real-world scenarios.
Rigorous experimentation reveals a remarkably effective system capable of flawlessly translating natural language instructions into successful robotic actions. Across a diverse set of tasks, the framework consistently achieves a 100% Command Success Rate (CSR), indicating complete accuracy in understanding user requests. This is coupled with a 100% Interpretation Success Rate (ISR), confirming the system’s ability to correctly map language to actionable steps, and a 100% Task Completion Rate (TCR), demonstrating consistent and reliable execution of those steps. These results collectively highlight the system’s robust adaptability and its potential to reliably perform complex manipulations based solely on human language guidance, suggesting a significant advancement in human-robot interaction.
The Future of Adaptive Robotics: Seamless Human-Robot Collaboration
For real-time robotic control, deploying large language models (LLMs) locally is proving crucial to minimizing latency and maximizing responsiveness. Traditional cloud-based LLM access introduces significant delays due to network transmission, hindering a robot’s ability to react swiftly to dynamic environments. By embedding these models directly onto the robot’s hardware, processing occurs instantaneously, enabling quicker decision-making and more fluid movements. This localized approach avoids communication bottlenecks and allows the robot to interpret sensor data and execute commands with a speed essential for tasks requiring precision and immediate adaptation, such as collaborative manufacturing or complex surgical procedures. The result is a robotic system capable of not just responding to changes, but anticipating them and adapting to them in real time.
The convergence of recent robotic advancements is poised to redefine human-robot interaction, fostering environments where robots move beyond pre-programmed tasks and exhibit genuine adaptability. These systems are engineered to perceive and react to dynamic changes in their surroundings – a shifted object, an unexpected obstacle, or a novel human gesture – not as disruptions, but as cues for immediate, intelligent response. This capability extends beyond simple reaction; robots are increasingly able to anticipate needs and collaborate with humans in a fluid, intuitive manner, mirroring natural teamwork. Such seamless interaction promises to unlock unprecedented levels of efficiency and safety across diverse fields, allowing robots to function not as isolated tools, but as integrated partners in complex tasks.
The convergence of advanced robotics and localized large language models promises a substantial broadening of automation’s reach, extending far beyond traditional industrial applications. Manufacturing facilities stand to gain through more flexible assembly lines capable of handling customized orders with greater efficiency, while the healthcare sector anticipates robotic assistance in complex surgeries, personalized patient care, and automated drug dispensing. Beyond these core areas, logistics operations will benefit from optimized warehouse management and autonomous delivery systems, and even agriculture could see increased yields through precision planting and harvesting. This technological synergy isn’t simply about replacing human labor; it’s about augmenting human capabilities and creating safer, more productive work environments across a diverse spectrum of industries, fostering innovation and economic growth.
Recent advancements in robotic control demonstrate a significant leap in responsiveness through a novel architectural approach. When benchmarked against OVITA, a comparable system utilizing the same locally deployed Large Language Model, this methodology achieves a remarkable 43% reduction in response time. This improvement isn’t merely incremental; it fundamentally alters the potential for real-time interaction, allowing robots to react to dynamic environments and human input with greater immediacy and precision. The accelerated processing enables more fluid and natural collaboration, bridging the gap between human intention and robotic action and paving the way for increasingly sophisticated automation solutions.

The presented architecture prioritizes a streamlined approach to robot skill adaptation, echoing a fundamental principle of efficient information transfer. Andrey Kolmogorov observed, “The most important thing in science is not to be afraid of making mistakes.” This sentiment directly aligns with the iterative nature of interfacing large language models with kernelized movement primitives. The system isn’t striving for immediate perfection, but rather a continuous refinement of trajectories based on natural language input. This acceptance of incremental improvement, of learning through adjustment, is central to the paper’s contribution – a deployable solution for industrial robotics that embraces adaptability over rigid pre-programming. Unnecessary complexity is avoided; the focus remains on interpretable and safe trajectory modification.
Where To Next?
The presented architecture, while a demonstrable convergence of linguistic instruction and robotic action, ultimately highlights the enduring gap between representation and reality. The system functions, but to claim ‘adaptation’ implies a level of genuine generalization not yet achieved. Current limitations reside not in the mechanics of trajectory modification – kernelized movement primitives are, after all, a known quantity – but in the brittle nature of the language interface. A truly robust system will not respond to language, but understand intent, a distinction requiring, perhaps, a re-evaluation of the very notion of ‘understanding’ itself.
Future work will undoubtedly focus on refining the large language model component. However, the more pressing challenge is not simply increased scale, but increased precision. The current paradigm favors breadth of vocabulary over depth of semantic grounding. A system capable of distinguishing between ‘slightly faster’ and ‘immediately’ would represent a substantial advance. Equally important is the development of metrics beyond simple task completion; a successful adaptation should not merely achieve a goal, but do so efficiently and safely – qualities difficult to quantify, yet essential for deployment in complex industrial environments.
Ultimately, the pursuit of adaptive robotic systems is a search for a minimal sufficient model of the world. The temptation is to add layers of complexity, to account for every contingency. The more fruitful path, however, lies in relentless subtraction, in identifying the core principles that govern action and interaction. The goal is not to build a robot that knows everything, but one that needs to know very little.
Original article: https://arxiv.org/pdf/2603.03897.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-03-06 04:41