Author: Denis Avetisyan
A hierarchical vision-language model empowers robots to intelligently select and execute complex assembly tasks from natural language commands.

This work introduces a framework leveraging vision-language models and imitation learning for robust skill selection and parameterization in long-horizon robotic assembly sequences.
Despite advances in robotic manipulation, enabling robots to autonomously execute complex assembly tasks remains challenging due to the need for robust skill selection and parameterization. This paper introduces a novel framework, ‘VLM-driven Skill Selection for Robotic Assembly Tasks’, which integrates vision-language models with imitation learning to address this critical gap. By grounding skill selection in natural language instructions and visual perception, our approach enables a gripper-equipped robot to perform long-horizon assembly sequences with improved success rates and interpretability. Could this hierarchical, multi-modal framework pave the way for more adaptable and intelligent robotic assembly systems capable of tackling increasingly complex manufacturing challenges?
The Inevitable Drift of Automation
Traditional robotic assembly systems, constrained by pre-programmed sequences, struggle with the variability of the real world. Human intervention remains frequent. True dexterity demands understanding, not just execution. Researchers are exploring how large language models (LLMs) can imbue robots with cognitive ability, allowing them to interpret sensory data and generate flexible action plans. By framing assembly as language prompts, robots can leverage LLM knowledge to adapt and overcome unforeseen circumstances. The system isn’t built – it’s cultivated, and every adjustment is a glimpse into the inevitability of change.
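What "framing assembly as a language prompt" might look like in practice can be sketched simply; the prompt template and skill names below are illustrative assumptions, not the paper's actual prompts.

```python
# A minimal sketch of turning an assembly instruction plus a scene summary
# into an LLM prompt. Skill names and wording are hypothetical.

ASSEMBLY_PROMPT = """You control a gripper-equipped robot.
Task instruction: {instruction}
Visible objects: {objects}
Available skills: pick(object), place(object, location), insert(object, target)
Reply with the single next skill call to execute."""

def frame_prompt(instruction: str, objects: list[str]) -> str:
    """Fill the template with the current instruction and observed objects."""
    return ASSEMBLY_PROMPT.format(instruction=instruction,
                                  objects=", ".join(objects))

if __name__ == "__main__":
    print(frame_prompt(
        instruction="Mount the small gear onto the center shaft.",
        objects=["small gear", "large gear", "center shaft", "base plate"],
    ))
```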

Seeing is Reasoning
Vision-Language Models (VLMs) offer a pathway to integrating visual perception with high-level reasoning in robotic control. Unlike rigid programming, VLMs enable robots to interpret visual inputs and dynamically select skills for complex operations. Models such as GPT-4.1-2025-04-14 and GPT-5-mini-2025-08-07 translate vision into executable skill sequences: they analyze the environment, identify objects, and generate plans. Current research focuses on decomposing complex tasks into sequences of primitive actions, with accuracy improved by training on extensive visual and textual data.
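The sketch below shows how a returned skill sequence might be parsed and validated; the query_vlm stub is a hypothetical stand-in for a real VLM call so the snippet runs offline, and the JSON schema is invented for illustration rather than taken from the paper.

```python
import json
from typing import Any

def query_vlm(image_path: str, instruction: str) -> str:
    """Hypothetical stand-in for a real VLM call (e.g. a chat completion with an
    image attachment); stubbed with a canned answer so the sketch runs offline."""
    return json.dumps([
        {"skill": "pick",   "args": {"object": "small gear"}},
        {"skill": "insert", "args": {"object": "small gear", "target": "center shaft"}},
    ])

def plan_skills(image_path: str, instruction: str) -> list[dict[str, Any]]:
    """Ask the VLM for a skill sequence and check it against the known primitives."""
    allowed = {"pick", "place", "insert"}
    plan = json.loads(query_vlm(image_path, instruction))
    for step in plan:
        if step["skill"] not in allowed:
            raise ValueError(f"VLM proposed an unknown skill: {step['skill']}")
    return plan

if __name__ == "__main__":
    for step in plan_skills("scene.png", "Assemble the gear train."):
        print(step["skill"], step["args"])
```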

Two Stages of Perception
A novel Two-Stage VLM Architecture addresses limitations in visual reasoning for skill selection by decomposing the process into distinct visual analysis and skill reasoning stages. The first stage extracts relevant visual information through Visual Annotation and Object Recognition, accurately identifying the key elements on which skill selection depends. Mark-Based Visual Prompting then sharpens attentional focus, guiding the VLM toward critical regions and further improving precision.
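A minimal sketch of such a two-stage pipeline, assuming a hypothetical mark data structure and stubbed stage functions; a real system would run a detector, overlay numbered marks on the image, and prompt the VLM with those marks.

```python
from dataclasses import dataclass

@dataclass
class Mark:
    """One numbered visual mark overlaid on a detected object (mark-based prompting)."""
    mark_id: int
    label: str
    bbox: tuple[int, int, int, int]  # x, y, width, height in pixels

def stage1_visual_analysis(image) -> list[Mark]:
    """Stage 1: object recognition and annotation, stubbed with fixed detections."""
    return [Mark(1, "small gear", (120, 80, 40, 40)),
            Mark(2, "center shaft", (200, 150, 20, 60))]

def stage2_skill_reasoning(marks: list[Mark], instruction: str) -> dict:
    """Stage 2: reason over the marked objects (not raw pixels) to select a skill;
    a trivial rule stands in for the VLM's reasoning here."""
    gear = next(m for m in marks if "gear" in m.label)
    x, y, w, h = gear.bbox
    return {"skill": "pick", "target_mark": gear.mark_id,
            "grasp_center": (x + w // 2, y + h // 2)}

if __name__ == "__main__":
    marks = stage1_visual_analysis(image=None)  # the image is omitted in this stub
    print(stage2_skill_reasoning(marks, "Mount the small gear on the shaft."))
```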

Building from the Ground Up
Complex robotic assembly tasks are decomposed into manageable skills built on three Primitive Skills: pick, place, and insert. These primitives serve as foundational building blocks and enable a hierarchical approach to task planning. Imitation Learning trains the policies that execute each skill, allowing the robot to learn from human demonstrations and generalize to novel situations. Action Chunking refines this process by having each policy predict a short sequence of future actions at once, enhancing temporal consistency and enabling smooth, efficient execution.
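A rough sketch of action chunking at execution time, assuming a hypothetical chunk length and a stubbed policy and environment; in the actual framework the policies are learned from demonstrations rather than hard-coded.

```python
import numpy as np

CHUNK = 8  # actions predicted per policy call; the length here is an assumption

class DummyEnv:
    """Stand-in environment so the sketch runs; a real system would drive the robot."""
    def reset(self) -> np.ndarray:
        return np.zeros(10)
    def step(self, action: np.ndarray) -> tuple[np.ndarray, bool]:
        return np.zeros(10), False

def policy(observation: np.ndarray) -> np.ndarray:
    """Placeholder imitation-learned policy: one observation in, a chunk of
    CHUNK future 7-DoF actions out (stubbed with zeros)."""
    return np.zeros((CHUNK, 7))

def run_skill(env: DummyEnv, max_steps: int = 64) -> None:
    """Execute a primitive skill by replaying whole predicted chunks rather than
    re-querying the policy at every control step, keeping the motion temporally consistent."""
    obs = env.reset()
    executed = 0
    while executed < max_steps:
        for action in policy(obs):
            obs, done = env.step(action)
            executed += 1
            if done or executed >= max_steps:
                return

if __name__ == "__main__":
    run_skill(DummyEnv())
```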

The Illusion of Mastery
The proposed framework was evaluated on the Gear Assembly task, a challenging benchmark demanding precise coordination and adaptability, with performance assessed in both simulated and real-world environments. In simulation, the system achieved success rates of 0.93 for Pick and 0.97 for Insert; real-world deployments yielded 0.77-0.87 for Pick and 0.80-0.83 for Insert. The results validate the system's adaptability despite discrepancies between simulation and reality. A system designed to solve a problem ultimately reveals the depth of its own unknowability.

The pursuit of robust robotic assembly, as detailed in this work, echoes a fundamental truth about complex systems. This paper’s hierarchical Vision-Language Model, attempting to bridge visual understanding and imitation learning, isn’t about building a solution, but cultivating one. It acknowledges the inevitable decay of any rigid structure, seeking instead a framework capable of adapting to long-horizon sequences. As Brian Kernighan observed, “Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.” This sentiment applies directly to the challenges of robotic control; attempting perfect pre-programming is a denial of the entropy inherent in real-world interaction. The system’s focus on skill selection and parameterization isn’t about achieving flawless execution, but about gracefully handling inevitable deviations.
The Looming Horizon
This work, like all attempts to codify action, reveals less about controlling systems and more about the illusions of control. The hierarchical decomposition, the vision-language bridge—these are not solutions, but elaborately constructed interfaces with the inevitable entropy of long-horizon tasks. Each successful assembly, each learned parameter, merely postpones the moment when unforeseen circumstances will expose the fragility of the chosen representation. The system will not fail to adapt; it will adapt in ways unanticipated by its architects, becoming something subtly, irrevocably other.
The true challenge lies not in perfecting skill selection, but in accepting the inherent incompleteness of any model. Future efforts will inevitably turn toward embracing ambiguity, toward systems that anticipate their own limitations and negotiate failure gracefully. The focus will shift from imposing structure onto the world to cultivating resilience within the system, allowing it to reconfigure itself in response to the unpredictable currents of reality.
One suspects the ultimate metric of success will not be task completion, but the elegance with which the system acknowledges its own ignorance. Every refactor begins as a prayer and ends in repentance, and the most sophisticated architecture is, at its core, a beautifully rendered map of all the ways things can go wrong.
Original article: https://arxiv.org/pdf/2511.05680.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/