Robots Learn to Disassemble with a Little Help from Skills

Author: Denis Avetisyan


Researchers have developed a new framework that combines the power of large language models with a library of pre-defined skills to dramatically improve robotic disassembly success rates.

The SELF-VLA framework proposes a system where growth, rather than construction, defines architecture, acknowledging that every design choice inevitably forecasts future points of failure.

The SELF-VLA framework integrates structured skills and failure recovery within an agentic vision-language-action system for contact-rich manipulation tasks.

Despite advances in robotic manipulation, automating the disassembly of end-of-life electronics remains a significant challenge due to the inherent variability and complexity of the task. This paper introduces ‘SELF-VLA: A Skill Enhanced Agentic Vision-Language-Action Framework for Contact-Rich Disassembly’, a novel agentic framework that integrates explicit disassembly skills with a vision-language-action model to improve robustness and generalization. Experimental results demonstrate that SELF-VLA significantly outperforms existing end-to-end approaches on contact-rich disassembly tasks, achieving higher success rates through learned skill adaptation and failure recovery. Could this skill-augmented approach pave the way for truly autonomous robotic disassembly systems capable of handling the growing volume of electronic waste?


The Inevitable Tide: E-Waste and the Promise of Recovery

The escalating tide of electronic waste, or e-waste, represents a significant and growing threat to both environmental health and resource availability. Driven by rapid technological advancements and consumer demand, the volume of discarded electronics, from smartphones and laptops to refrigerators and televisions, is increasing at an unprecedented rate. This surge isn’t merely a landfill issue; e-waste contains a complex mixture of hazardous materials like lead, mercury, and cadmium, which can leach into soil and water sources, posing risks to ecosystems and human health. Simultaneously, the disposal of these devices represents a lost opportunity to recover valuable and increasingly scarce resources, including precious metals and rare earth elements essential for manufacturing new technologies, thereby exacerbating the demand for virgin materials and creating a cycle of resource depletion. The sheer scale of this waste stream demands innovative solutions to mitigate its harmful effects and unlock the potential for a circular economy.

The escalating volume of electronic waste presents a significant opportunity for resource recovery, particularly concerning rare earth elements. These elements – crucial for modern technologies ranging from smartphones to electric vehicles – are often embedded within complex devices, making extraction challenging. However, accessing these valuable materials isn’t primarily a chemical problem; it’s a logistical one. Efficient disassembly of end-of-life electronics is the critical first step, enabling targeted extraction processes and maximizing material yields. Without streamlined disassembly techniques, the economic viability of rare earth element recovery diminishes, hindering a circular economy and perpetuating reliance on environmentally damaging mining practices. Consequently, innovations in disassembly – encompassing automation, design for disassembly, and improved material tracking – are paramount to unlocking the hidden value within the growing e-waste stream.

Current electronic waste recycling largely depends on manual disassembly, a process that demands significant human labor to separate components and materials from discarded devices. This reliance on labor introduces substantial costs, particularly as devices become increasingly complex and miniaturized, making component separation more difficult and time-consuming. Consequently, the economic viability of recovering valuable materials – especially rare earth elements – is frequently undermined, as the cost of disassembly can exceed the value of the recovered resources. This creates a bottleneck in the circular economy for electronics, hindering widespread adoption of sustainable recycling practices and encouraging the disposal of valuable materials in landfills or through less environmentally responsible methods.

Extracting the CPU represents a significant complexity within end-of-life desktop computer disassembly processes.

From Sequential Pipelines to the Illusion of Control

Current robotic disassembly systems predominantly employ a sequential pipeline architecture, functionally separating the disassembly process into three core stages: perception, planning, and manipulation. The perception stage utilizes sensors – typically cameras and force sensors – to gather data about the target electronic waste (e-waste) assembly. This data is then processed to create a representation of the object’s geometry and identify key features. The planning stage leverages this representation to generate a sequence of actions required for disassembly, considering factors such as tool selection and trajectory optimization. Finally, the manipulation stage executes these planned actions using robotic arms and end-effectors, physically disassembling the e-waste. This sequential structure necessitates the successful completion of each stage before proceeding to the next, creating a rigid workflow susceptible to errors and limitations when encountering variations in real-world e-waste.
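The three-stage pipeline described above can be sketched as a simple chained control flow. This is a minimal illustration of the architecture's rigidity, not the paper's implementation; all function names and data structures here are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class SceneModel:
    """Geometric representation produced by the perception stage."""
    screw_poses: list  # (x, y, z, yaw) tuples for detected screws

def perceive(camera_image) -> SceneModel:
    # Placeholder: a real system would run sensing and detection here.
    return SceneModel(screw_poses=[(0.10, 0.20, 0.05, 0.0)])

def plan(scene: SceneModel) -> list:
    # One "unscrew" action per detected screw, in detection order.
    return [("unscrew", pose) for pose in scene.screw_poses]

def manipulate(actions: list) -> bool:
    # Executes the planned actions; any failure aborts the pipeline.
    for name, pose in actions:
        print(f"executing {name} at {pose}")
    return True

# Strictly sequential: perception -> planning -> manipulation.
ok = manipulate(plan(perceive(camera_image=None)))
```

Because each stage consumes the previous stage's output wholesale, an error introduced in perception propagates unchecked through planning and manipulation, which is exactly the fragility the article describes.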

Effective robotic disassembly pipelines depend on accurate geometric data and reliable component identification. Specifically, precise Computer-Aided Design (CAD) models are required to define component shapes and relationships, enabling the robot to plan disassembly sequences. Robust object detection, frequently implemented using frameworks such as YOLOv4, is critical for locating features like screws within the e-waste stream. YOLOv4, a convolutional neural network, provides real-time object detection capabilities, identifying the position and orientation of screws to facilitate targeted manipulation. The performance of these systems is directly correlated to the quality of the CAD data and the accuracy of the object detection algorithms; inaccuracies in either area lead to failed disassembly attempts or potential damage to components.
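Downstream of a YOLO-style detector, screw localization typically reduces to filtering detections and converting bounding boxes to target coordinates. The sketch below assumes a generic `(label, confidence, bbox)` output format and an illustrative confidence cutoff; it is not YOLOv4's actual API.

```python
# Each detection: (class_label, confidence, (x_min, y_min, x_max, y_max))
detections = [
    ("screw", 0.92, (100, 50, 120, 70)),
    ("screw", 0.41, (300, 200, 318, 222)),   # low confidence, discarded
    ("capacitor", 0.88, (40, 40, 60, 80)),   # wrong class, discarded
]

CONF_THRESHOLD = 0.5  # illustrative cutoff, tuned per deployment

def screw_targets(dets):
    """Keep confident screw detections; return bbox centers for targeting."""
    targets = []
    for label, conf, (x0, y0, x1, y1) in dets:
        if label == "screw" and conf >= CONF_THRESHOLD:
            targets.append(((x0 + x1) / 2, (y0 + y1) / 2))
    return targets

print(screw_targets(detections))  # [(110.0, 60.0)]
```

The pixel centers would then be projected into the robot's workspace using the camera calibration and the CAD model, which is where inaccuracies in either input translate directly into failed grasps.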

Current robotic disassembly systems, while effective in controlled environments, exhibit limited adaptability when processing real-world electronic waste. E-waste streams are characterized by significant variability in object pose, lighting conditions, and the presence of unexpected components or damage. This variability frequently disrupts the performance of perception and planning algorithms reliant on precise CAD data and object detection. Specifically, inaccuracies in screw localization – a critical step in disassembly – increase due to these factors, leading to failed grasps or collisions. Consequently, systems designed for a narrow range of input conditions require frequent human intervention or struggle to maintain consistent performance when faced with the diversity inherent in unsorted e-waste.

Current robotic disassembly systems, dependent on pre-programmed sequences, exhibit limited performance when confronted with the inconsistencies of real-world electronic waste. These rigid systems struggle with variations in object pose, lighting conditions, and the presence of unexpected components or damage. A flexible approach necessitates integrating capabilities such as reinforcement learning and sensor fusion to allow robots to adapt to unforeseen circumstances and learn optimal disassembly strategies on-the-fly. This shift towards adaptable systems requires moving beyond reliance on precise, pre-existing CAD models and instead leveraging real-time perception to build an understanding of the object’s current state, enabling robust manipulation even in the absence of perfect information.

The proposed SELF-VLA framework consistently improves task success rates for OpenVLA-OFT, [latex]\pi_{0.5}[/latex], and [latex]\pi_{0.5}[/latex]-Droid across a range of disassembly tasks at a control frequency of 10 Hz.

SELF-VLA: An Agentic Framework, a Temporary Stay of Entropy

Vision-Language-Action (VLA) models represent a significant advancement in robotic intelligence by enabling systems to interpret natural language instructions and translate them into concrete physical actions. These models are typically trained on large datasets of paired visual inputs, linguistic descriptions, and corresponding robotic behaviors, allowing them to generalize to new tasks and environments. Unlike traditional robotic systems reliant on pre-programmed sequences or manual control, VLA models facilitate a more intuitive human-robot interaction and allow robots to perform complex tasks based on high-level instructions. The core functionality involves processing visual information from sensors, understanding the semantics of language commands, and generating appropriate action sequences to achieve the desired outcome, effectively bridging the gap between human intention and robotic execution.

SELF-VLA is a robotic framework built upon Vision-Language-Action (VLA) models, designed to enhance disassembly tasks through the integration of pre-defined skill libraries and a robust error recovery system. This framework moves beyond simple sequential execution by enabling the robot to access and utilize a catalog of established skills – such as grasping, lifting, or applying force – rather than requiring task-specific programming for each unique scenario. Furthermore, SELF-VLA incorporates an error recovery mechanism that allows the robot to identify failures during skill execution and automatically restart the process, guided by a designated ‘Stop Token’ signal to halt current actions and initiate corrective measures, improving overall task completion rates and reliability.

The VLA-Planner component is responsible for translating natural language instructions and processing sensor observations into a sequence of robotic actions. Upon execution of these actions, the VLA-Corrector monitors for potential failures. If an error is detected – indicated by a Stop Token signal – the VLA-Corrector initiates a recovery process. This process involves halting the current skill execution and restarting it from the beginning, allowing for iterative attempts and improved robustness in task completion. The Stop Token serves as a critical interrupt signal, enabling the framework to dynamically adapt to unexpected events during disassembly.
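The planner/corrector interaction amounts to a retry loop gated by the Stop Token. The sketch below is a minimal rendering of that control flow under stated assumptions: the skill names, the `STOP_TOKEN` sentinel value, and the retry budget are all illustrative, not taken from the paper.

```python
STOP_TOKEN = "<STOP>"  # sentinel emitted when the corrector detects failure
MAX_RETRIES = 3        # illustrative retry budget

def vla_planner(instruction: str) -> list:
    """Map a language instruction to a skill sequence (stubbed)."""
    return ["approach", "grasp", "lift"]

def vla_corrector(skill: str, attempt: int) -> str:
    """Monitor execution; return STOP_TOKEN on a detected failure."""
    # Stub: fail the first 'grasp' attempt to exercise recovery.
    if skill == "grasp" and attempt == 0:
        return STOP_TOKEN
    return "ok"

def execute(instruction: str) -> bool:
    for skill in vla_planner(instruction):
        for attempt in range(MAX_RETRIES):
            if vla_corrector(skill, attempt) != STOP_TOKEN:
                break  # skill succeeded, move on to the next skill
            # Stop Token seen: halt and restart this skill from scratch.
        else:
            return False  # retry budget exhausted
    return True

print(execute("remove the CPU"))  # True: 'grasp' fails once, then recovers
```

The essential point is that the Stop Token interrupts at skill granularity, so recovery restarts a single skill rather than the whole task.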

The SELF-VLA framework relies on pre-defined skill libraries containing parameterized robotic actions, such as grasping, lifting, and inserting, to execute disassembly tasks. These skills are not monolithic but are broken down into sequences of movements defined by specific Waypoints – discrete spatial coordinates that the robot end-effector must pass through. Utilizing waypoints enables precise trajectory planning and control, allowing the robot to navigate complex geometries and avoid collisions during disassembly. The system selects appropriate skills from the library based on the current task and observed state, and then generates a trajectory through the defined waypoints to achieve the desired action. This approach facilitates both efficient execution and adaptability to variations in object pose and environment configuration.
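A skill parameterized by waypoints can be represented as a named sequence of end-effector positions, with a trajectory generated by interpolating between consecutive waypoints. The data structures and the linear interpolation below are a simplified sketch, assuming position-only waypoints; a real controller would also track orientation and enforce collision constraints.

```python
from dataclasses import dataclass
from typing import List, Tuple

Waypoint = Tuple[float, float, float]  # (x, y, z) end-effector position

@dataclass
class Skill:
    """A parameterized action defined by waypoints to pass through."""
    name: str
    waypoints: List[Waypoint]

def interpolate(a: Waypoint, b: Waypoint, steps: int) -> List[Waypoint]:
    """Linear segment between two waypoints (simplest trajectory)."""
    return [tuple(a[i] + (b[i] - a[i]) * t / steps for i in range(3))
            for t in range(1, steps + 1)]

def trajectory(skill: Skill, steps_per_segment: int = 5) -> List[Waypoint]:
    """Chain interpolated segments through all of the skill's waypoints."""
    path = [skill.waypoints[0]]
    for a, b in zip(skill.waypoints, skill.waypoints[1:]):
        path += interpolate(a, b, steps_per_segment)
    return path

# A toy "lift" skill: hover above the part, descend to it, then raise it.
lift = Skill("lift", [(0.1, 0.2, 0.15), (0.1, 0.2, 0.05), (0.1, 0.2, 0.30)])
path = trajectory(lift)
```

Because the skill is defined by its waypoints rather than hard-coded joint commands, adapting to a new object pose only requires shifting the waypoints, which is what makes the library reusable across task variations.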

Our SELF-VLA framework offers advantages over current vision-language-action (VLA) approaches by dynamically adapting to task requirements, as demonstrated in this comparative analysis.

Performance and Adaptability: Measuring the Inevitable Decay

Comparative analysis was conducted to evaluate the performance of SELF-VLA against established Vision-Language-Action (VLA) models, specifically OpenVLA, π0.5, and π0.5-Droid. These models served as baselines to assess the efficacy of the SELF-VLA approach in robotic disassembly tasks. The evaluation framework involved consistent task parameters and environmental conditions to ensure a fair comparison of performance metrics, including task completion rates, efficiency, and robustness across different component extractions. Results detailed in subsequent analysis demonstrate the quantitative improvements achieved by SELF-VLA relative to these baseline VLA models.

Low-Rank Adaptation (LoRA) was implemented as a parameter-efficient fine-tuning technique to improve the performance of Vision-Language-Action (VLA) models within the robotic disassembly domain. This approach freezes the pre-trained model weights and introduces trainable low-rank decomposition matrices, significantly reducing the number of trainable parameters compared to full fine-tuning. By focusing adaptation on these smaller matrices, LoRA facilitates faster training and reduces the risk of catastrophic forgetting, enabling the models to effectively transfer knowledge learned from general vision-language tasks to the specific requirements of robotic disassembly, such as component localization and manipulation planning.
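The LoRA idea can be shown numerically: the frozen weight `W` is augmented with a trainable low-rank product `BA`, so the adapted layer computes `Wx + B(Ax)`. The dimensions and rank below are illustrative choices, not the paper's training settings.

```python
import numpy as np

d, k, r = 512, 512, 8  # layer dims and LoRA rank (illustrative sizes)

rng = np.random.default_rng(0)
W = rng.standard_normal((d, k))         # frozen pre-trained weight
A = rng.standard_normal((r, k)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                    # trainable up-projection (zero init)

def lora_forward(x: np.ndarray) -> np.ndarray:
    """y = Wx + B(Ax): the frozen path plus the low-rank update."""
    return W @ x + B @ (A @ x)

x = rng.standard_normal(k)
y = lora_forward(x)

# With B initialized to zero, the adapted layer starts identical to the
# frozen one; only A and B are trained, not the d*k entries of W.
full = d * k            # 262144 parameters for full fine-tuning
lora = r * k + d * r    # 8192 trainable parameters with rank 8
print(f"trainable params: {lora} vs full fine-tune: {full}")
```

The 32x reduction in trainable parameters at rank 8 is what makes per-domain adaptation cheap and limits catastrophic forgetting, since `W` is never modified.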

Experimental evaluation of SELF-VLA on complex disassembly tasks indicates substantial performance gains compared to end-to-end Vision-Language-Action (VLA) baselines. Specifically, CPU extraction success rates increased by up to 80% when utilizing SELF-VLA. This improvement demonstrates enhanced robustness and efficiency in completing the disassembly process. The observed performance difference suggests SELF-VLA effectively addresses challenges present in these tasks that limit the performance of standard VLA models.

Implementation of SELF-VLA resulted in a 31% average improvement in CPU extraction success rate across the tested Vision-Language-Action models. The average success rate for RAM module removal also improved, registering a 17% increase across the same models. These figures represent the average performance gain observed after integrating SELF-VLA into the existing model architectures.

Quantitative analysis of task completion times demonstrates the efficiency gains achieved with SELF-VLA. Specifically, CPU extraction was completed in an average of 63 seconds using SELF-VLA, representing a 54% reduction compared to the 136 seconds required by the end-to-end baseline model. Similarly, SELF-VLA reduced the average time for RAM module removal to 45 seconds, a 10% improvement over the baseline’s 50 seconds. These results indicate a consistent acceleration of task completion when utilizing SELF-VLA across both CPU extraction and RAM removal procedures.

Following human interruption during a manipulation task, the VLA-corrector successfully resumes and completes the remaining steps, demonstrating its ability to recover from disturbances.

Towards a Circular Economy: Delaying the Inevitable, One Component at a Time

Robotic disassembly, particularly when driven by the SELF-VLA framework, presents a compelling pathway towards substantial economic and environmental gains. This automated approach moves beyond traditional recycling by enabling the precise and efficient deconstruction of products, recovering high-value materials often lost in conventional processes. The economic benefits stem from reduced material costs and the creation of secondary resource streams, while the environmental advantages arise from diminished reliance on virgin resource extraction, lowered energy consumption in manufacturing, and a significant reduction in landfill waste. By automating the disassembly process, valuable components and materials can be reintegrated into the supply chain, fostering a truly circular economy and mitigating the detrimental impacts of the growing global e-waste problem. The potential for scalable implementation suggests a transformative shift in how products are designed, utilized, and ultimately, repurposed.

The escalating volume of electronic waste presents not only an environmental challenge but also a significant loss of valuable materials. Automated disassembly systems offer a pathway to recapture these resources, diminishing the need to continually extract virgin materials from the Earth. Through precise robotic deconstruction, components containing precious metals – like gold, silver, and palladium – as well as rare earth elements and plastics, can be efficiently separated and reintroduced into the manufacturing supply chain. This closed-loop approach minimizes landfill waste, reduces the energy expenditure associated with mining and refining, and lessens the environmental impact of resource depletion, ultimately fostering a more sustainable and circular economy for electronic goods.

The escalating volume of electronic waste presents a formidable environmental and economic challenge, but scalable and adaptable robotic disassembly systems offer a promising pathway towards a circular economy. Current e-waste recycling often relies on labor-intensive manual processes or inefficient shredding, resulting in material loss and hazardous waste streams. Robotic solutions, capable of intelligently deconstructing products, enable the high-value recovery of critical raw materials – from rare earth elements in smartphones to precious metals in circuit boards – lessening the dependence on environmentally damaging primary resource extraction. The key lies in developing systems that aren’t limited to a single product type; instead, adaptable robotic platforms, coupled with advanced sensing and artificial intelligence, can be rapidly reprogrammed to disassemble a wide range of devices, paving the way for a truly sustainable and resource-efficient future.

Continued development centers on broadening the repertoire of disassembly skills within the robotic system, moving beyond current capabilities to address a wider range of product designs and materials. This expansion necessitates enhanced perception and manipulation algorithms to navigate the complexities of real-world e-waste streams – often characterized by damage, variability, and unpredictable arrangements. Researchers are prioritizing improvements in the system’s robustness, aiming for reliable operation even in unstructured and dynamic environments where parts may be missing, orientations are unknown, and lighting conditions are suboptimal. Ultimately, this focus on adaptability and resilience is crucial for transitioning robotic disassembly from a laboratory demonstration to a scalable solution for closing the loop on material lifecycles and fostering a truly circular economy.

The pursuit of autonomous robotic systems, as demonstrated by SELF-VLA, reveals a fundamental truth about complex architectures. The framework’s reliance on a skill library and failure recovery isn’t simply about achieving higher disassembly success rates; it acknowledges the inevitability of systemic fragility. As Alan Turing observed, “There is no longer any boundary between the world outside and the machine.” This mirrors the agent’s interaction with the physical world, where unforeseen contact and environmental factors necessitate robust adaptation. The system isn’t built to avoid failure, but to anticipate and recover from it, accepting dependency as an inherent condition of operation within a dynamic, unpredictable environment. This mirrors the core idea that systems aren’t tools, they’re ecosystems, and ecosystems are defined by their capacity to absorb disruption.

The Inevitable Drift

SELF-VLA, as presented, represents a localized reduction in entropy, a temporary reprieve from the thermodynamic imperative. The integration of skill libraries within an agentic framework demonstrably improves performance, but this improvement isn’t a destination. It’s merely a shifting of the failure modes. The system’s present reliance on pre-defined skills constitutes a brittle core; unforeseen variations in object geometry, material properties, or even ambient lighting will inevitably expose the limitations of this structured approach. A guarantee of complete disassembly success remains, predictably, beyond reach; a guarantee is simply a contract with probability.

Future work will, of course, focus on expanding the skill library and refining the failure recovery mechanisms. However, a more fundamental challenge lies in moving beyond discrete skill acquisition. The true horizon isn’t about more skills, but about a system capable of generating them, of adapting its internal model of action in real time. This demands a move away from treating skills as static primitives, and toward a representation that acknowledges their inherent fluidity. Stability, after all, is merely an illusion that caches well.

The pursuit of robotic autonomy isn’t a quest for control, but a negotiation with chaos. Contact-rich manipulation, by its very nature, amplifies the sensitivity to initial conditions. The system will not be built so much as grown: a complex, adaptive network responding to the constant influx of unpredictable events. Chaos isn’t failure; it’s nature’s syntax.


Original article: https://arxiv.org/pdf/2603.11080.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/


2026-03-15 02:23