Author: Denis Avetisyan
A new framework empowers robots to tackle complex, multi-step tasks by breaking them down into reusable, object-focused actions.

LiLo-VLA combines motion planning with object-centric vision-language-action policies for robust long-horizon manipulation and improved generalization.
Achieving robust, long-horizon manipulation remains a key challenge for general-purpose robotics due to the combinatorial complexity of sequencing skills and susceptibility to environmental disturbances. This work introduces LiLo-VLA, a novel framework for ‘Compositional Long-Horizon Manipulation via Linked Object-Centric Policies’ that decouples global motion from object-centric interaction via a modular design. By linking a reaching module with an object-centric Vision-Language-Action (VLA) policy, LiLo-VLA demonstrates significant improvements in both compositional generalization and failure recovery across challenging simulation and real-world benchmarks. Could this modular approach unlock truly scalable and adaptable robotic systems capable of tackling increasingly complex tasks in unstructured environments?
The Fragility of Precision: Why Robots Struggle with Extended Tasks
Conventional robotic systems frequently encounter difficulties when executing tasks demanding a prolonged series of coordinated actions. This limitation stems from the cumulative effect of errors inherent in each step; even minor inaccuracies can propagate throughout the sequence, ultimately leading to failure. Consequently, these robots struggle with real-world applications – such as assembling complex products, preparing meals, or providing extended care – that necessitate not just precise movements, but also the ability to maintain that precision over a significant duration. The challenge isn’t simply performing each individual action correctly, but reliably chaining together dozens, or even hundreds, of actions without succumbing to compounding errors, effectively restricting their deployment in genuinely complex and dynamic environments.
Task and Motion Planning (TAMP) represents a significant step towards autonomous robotic manipulation, yet current implementations frequently falter when confronted with the unpredictable nature of real-world settings. These systems typically rely on pre-defined plans and models of the environment, proving brittle in the face of even minor deviations or unexpected obstacles. A robot executing a TAMP-generated plan might struggle, for example, if an object is slightly out of reach or if a previously unseen person enters its workspace. The rigidity stems from the difficulty in seamlessly integrating high-level task reasoning – understanding what needs to be done – with low-level motion planning – determining how to do it – while simultaneously accounting for environmental uncertainty and dynamically replanning in response to unforeseen events. Consequently, research is heavily focused on developing more robust and adaptive TAMP algorithms capable of handling the inherent complexities and dynamism of real-world manipulation tasks.

Deconstructing Complexity: A Modular Approach to Long-Horizon Interaction
LiLo-VLA facilitates long-horizon manipulation tasks by decoupling global transport from local interaction. This modular architecture addresses the challenges of complex manipulation by first planning a coarse trajectory to approach the target object (the transport phase), then executing precise, atomic actions to achieve the desired manipulation (the interaction phase). This separation allows for independent optimization of each phase: the transport module can prioritize efficient navigation, while the interaction module focuses on accurate and robust execution of specific actions. By dividing the problem into these distinct stages, LiLo-VLA improves scalability and adaptability to varied environments and manipulation goals, reducing the computational burden of planning the entire long-horizon sequence at once.
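The two-phase structure above can be sketched as a simple control loop: a coarse transport step followed by a local interaction step. This is a minimal illustration, not the paper's actual API; all class and function names (ReachingModule, InteractionModule, run_skill) are hypothetical, and the planner is a linear interpolation placeholder.

```python
from dataclasses import dataclass

@dataclass
class Pose:
    x: float
    y: float
    z: float

class ReachingModule:
    """Transport phase: move the end-effector near the target object."""
    def plan(self, current: Pose, target: Pose) -> list[Pose]:
        # A real planner would return a collision-free trajectory;
        # here we interpolate linearly as a stand-in.
        steps = 5
        return [
            Pose(
                current.x + (target.x - current.x) * t / steps,
                current.y + (target.y - current.y) * t / steps,
                current.z + (target.z - current.z) * t / steps,
            )
            for t in range(1, steps + 1)
        ]

class InteractionModule:
    """Interaction phase: execute an atomic manipulation primitive."""
    def execute(self, primitive: str, target: Pose) -> bool:
        # Placeholder: a VLA policy would run closed-loop from wrist images.
        return primitive in {"grasp", "push", "rotate"}

def run_skill(reacher, interactor, current, target, primitive):
    # Transport first, then hand off to local interaction.
    trajectory = reacher.plan(current, target)
    end = trajectory[-1]
    ok = interactor.execute(primitive, end)
    return end, ok
```

Because each phase is a separate object, either can be swapped or tuned independently, which is the point of the decoupling.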
The Reaching Module employs motion planning algorithms to compute collision-free trajectories for the robot’s end-effector, facilitating efficient navigation to the proximity of desired target objects. This module operates by defining a search space based on the robot’s kinematic constraints and environmental map, utilizing techniques such as probabilistic roadmaps or rapidly-exploring random trees to identify feasible paths. The output of the motion planner is a sequence of robot joint configurations that, when executed, move the robot towards the target object’s location, positioning it for subsequent interaction. This stage prioritizes speed and coarse positioning, reducing the complexity required for the fine-grained manipulation handled by the Interaction Module.
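The sampling-based planners mentioned above can be illustrated with a toy rapidly-exploring random tree (RRT) in a 2-D workspace. This is a generic textbook sketch under simplifying assumptions (point robot, caller-supplied collision check), not the Reaching Module's implementation.

```python
import math
import random

def rrt(start, goal, is_free, step=0.5, iters=2000, goal_tol=0.5, seed=0):
    """Tiny 2-D RRT: grow a tree of collision-free points toward random
    samples (with 10% goal bias) until a node lands within goal_tol of
    the goal, then backtrack parents to recover the path."""
    rng = random.Random(seed)
    nodes = [start]
    parent = {0: None}
    for _ in range(iters):
        sample = goal if rng.random() < 0.1 else (rng.uniform(0, 10), rng.uniform(0, 10))
        # Extend the nearest existing node one step toward the sample.
        i = min(range(len(nodes)), key=lambda k: math.dist(nodes[k], sample))
        nx, ny = nodes[i]
        d = math.dist((nx, ny), sample)
        if d == 0:
            continue
        new = (nx + (sample[0] - nx) / d * min(step, d),
               ny + (sample[1] - ny) / d * min(step, d))
        if not is_free(new):
            continue  # reject nodes that collide with obstacles
        nodes.append(new)
        parent[len(nodes) - 1] = i
        if math.dist(new, goal) <= goal_tol:
            path, k = [], len(nodes) - 1
            while k is not None:
                path.append(nodes[k])
                k = parent[k]
            return path[::-1]
    return None
```

The output plays the role described in the text: a sequence of configurations that coarsely positions the robot near the target, leaving fine manipulation to the Interaction Module.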
The Interaction Module within LiLo-VLA employs Vision-Language-Action (VLA) models to execute discrete manipulation primitives. These models are trained to interpret visual input and natural language instructions, translating them into specific robotic actions. This approach allows for complex tasks to be broken down into a sequence of atomic manipulations – such as grasping, pushing, or rotating – which are then performed by the robot. By grounding actions in both visual perception and linguistic commands, the Interaction Module achieves robustness to variations in object pose, scene clutter, and task specifications, and offers adaptability to novel instructions without requiring explicit re-programming for each new scenario.
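The decomposition of an instruction into atomic primitives can be shown with a deliberately crude stand-in: a keyword lookup from verb to primitive sequence. A real VLA policy grounds this mapping with a learned vision-language model; the table and names here are purely illustrative.

```python
# Hypothetical verb-to-primitive table; a VLA model would learn this
# grounding from visual and linguistic context rather than a lookup.
PRIMITIVES = {
    "pick up": ["approach", "grasp", "lift"],
    "push": ["approach", "push"],
    "rotate": ["approach", "grasp", "rotate", "release"],
}

def ground_instruction(instruction: str) -> list[str]:
    """Map a natural-language instruction to a sequence of atomic
    manipulation primitives via substring matching."""
    text = instruction.lower()
    for verb, sequence in PRIMITIVES.items():
        if verb in text:
            return sequence
    raise ValueError(f"no primitive matches: {instruction!r}")
```

The point of the sketch is the interface, not the matching: complex tasks reduce to sequences drawn from a small, reusable primitive vocabulary.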

The Attentive Gaze: Focusing Perception on What Matters
The Interaction Module’s object-centric Vision-Language-Action (VLA) policy operates by prioritizing visual information derived from wrist-mounted camera observations. This approach centers the VLA model’s focus on the target object through selective weighting of observed features. By emphasizing data originating from the wrist-view perspective, the system actively downweights background clutter and irrelevant visual elements. This prioritization strategy minimizes distractions and improves the model’s ability to accurately perceive and interact with the designated object, enhancing the robustness and precision of manipulation tasks.
Visual masking is implemented as a preprocessing step within the Perception Module to enhance the performance of the VLA model. This technique dynamically identifies and occludes regions of the image that do not pertain to the target object, effectively reducing background noise and computational load. The masking process utilizes segmentation data – derived from FoundationPose and YOLOE – to create a binary mask where pixels corresponding to the target object are retained and all other pixels are set to zero. This results in an input image where only the relevant object is visible to the VLA model, improving both the speed and accuracy of subsequent perception and manipulation tasks. The mask is updated with each frame to accommodate changes in object position and orientation.
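The masking step described above amounts to an element-wise product between the image and a binary mask. A minimal sketch, using plain nested lists in place of the arrays a real pipeline would use (e.g. numpy images and a YOLOE-predicted mask):

```python
def apply_object_mask(image, mask):
    """image: H x W x C pixel values; mask: H x W of 0/1.
    Returns a copy where every pixel with mask == 0 is zeroed,
    leaving only the target object visible."""
    return [
        [
            [channel * mask[i][j] for channel in pixel]
            for j, pixel in enumerate(row)
        ]
        for i, row in enumerate(image)
    ]
```

Recomputing the mask each frame, as the text notes, keeps the visible region aligned with the object as it moves.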
Robust 6D pose estimation and object segmentation are achieved through the integration of FoundationPose and YOLOE. FoundationPose provides accurate and stable pose estimation, crucial for determining the object’s position and orientation in 3D space. Simultaneously, YOLOE, a state-of-the-art object detection model, performs pixel-level segmentation, precisely outlining the object’s boundaries. This combined perceptual input – 6D pose and segmented object mask – is then utilized by the manipulation system, enabling accurate grasping and interaction with the target object while minimizing errors caused by imperfect perception.

Beyond Fragility: Towards Resilient and Generalizable Robotic Intelligence
LiLo-VLA distinguishes itself through an integrated Closed-Loop Recovery mechanism, designed to maintain operational continuity even when faced with failed manipulation attempts. This system leverages a dedicated Reaching Module, which proactively intervenes following unsuccessful skill execution to intelligently reset the workspace to a known state. Rather than halting upon failure, the robot autonomously repositions objects or tools, effectively creating a fresh starting point for subsequent actions. This capability is critical for complex, long-horizon tasks where a single error could otherwise derail the entire operation, and fundamentally contributes to the system’s overall resilience and ability to operate continuously without human intervention.
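The recovery behavior can be captured as a retry loop: on a failed skill, reset the workspace to a known state and try again rather than aborting. This is an illustrative control-flow sketch, with hypothetical names; the paper's Reaching Module performs the actual reset.

```python
def execute_with_recovery(skill, reset_workspace, max_attempts=3):
    """Run `skill()` (returns True on success). On failure, call
    `reset_workspace()` to restore a known state and retry, up to
    max_attempts times. Returns the successful attempt number,
    or None if all attempts fail."""
    for attempt in range(1, max_attempts + 1):
        if skill():
            return attempt
        reset_workspace()  # e.g. the reaching module repositions objects
    return None
```

In a long-horizon pipeline, wrapping each primitive this way is what keeps a single failed grasp from derailing the whole sequence.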
The architecture facilitates a capacity for zero-shot compositional generalization, meaning the robot can assemble and execute entirely new manipulation tasks without requiring any task-specific training examples. This is achieved through the flexible combination of pre-trained, modular skills, such as reaching, grasping, and placing, allowing the system to ‘understand’ how these components can be sequenced to achieve previously unseen objectives. Rather than learning each task from scratch, the robot leverages its existing skillset to infer the necessary actions, demonstrating a significant step towards adaptable and versatile robotic systems capable of operating in dynamic and unpredictable environments. This ability to generalize reduces the need for extensive, time-consuming retraining whenever a new task is introduced, improving efficiency and broadening the scope of potential applications.
Extensive validation of the framework within the LIBERO environment demonstrates a substantial advancement in robotic manipulation capabilities. The system achieved a 69% average success rate across a suite of complex, long-horizon tasks – a figure that markedly surpasses the performance of current state-of-the-art baselines, including Pi0.5 at 28% and OpenVLA-OFT at a mere 2%. This proficiency extends beyond simulation; real-world validation, encompassing eight distinct long-horizon tasks, yielded an impressive 85% average success rate, confirming the system’s capacity to reliably execute intricate manipulation sequences in practical settings and highlighting its robustness in transferring learned skills to previously unseen scenarios.

The pursuit of compositional generalization, as demonstrated by LiLo-VLA, echoes a fundamental principle of systems analysis: understanding how parts interact to create a whole. This framework doesn’t merely solve manipulation tasks; it dissects them into manageable, reusable components. As John McCarthy observed, “In fact, as far as I can see, all of the important ideas in computer science have been discovered before.” LiLo-VLA exemplifies this sentiment – it isn’t inventing new robotics principles, but rather, intelligently composing existing ones. The system’s ability to recover from failure isn’t a fortunate accident, but a consequence of modularity; when one component falters, the others continue, echoing a resilient system designed to be probed and understood.
Beyond the Horizon
LiLo-VLA establishes a functional, if predictably brittle, foothold in long-horizon manipulation. The framework’s modularity is a virtue-systems built on rigidly defined components invariably reveal their weaknesses first, allowing for targeted dismantling and reconstruction. However, the current reliance on pre-defined object categories, while pragmatic, implicitly accepts the limitations of human labeling. True generalization will demand policies capable of discovering relevant object affordances, not merely recognizing pre-assigned ones – a shift from supervised learning to genuine reverse-engineering of physical reality.
Failure recovery, touted as an advantage, feels less like resilience and more like a sophisticated patching of inevitable errors. Each recovered failure is, fundamentally, a lesson in what doesn’t work. A more ambitious direction lies in anticipating these failures – not through exhaustive simulations, but by building policies that actively seek instability, treating it as a source of information. After all, a system that cannot be broken is a system not fully understood.
The true test won’t be achieving task completion, but creating a system that gracefully degrades, revealing its internal logic through its errors. LiLo-VLA offers a promising toolkit for this endeavor, but the ultimate goal isn’t flawless manipulation – it’s the construction of a machine that teaches, even (and perhaps especially) through its spectacular failures.
Original article: https://arxiv.org/pdf/2602.21531.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-02-26 18:20