Robots That ‘See’ Your Preferences: Guiding Motion with Visual Language

Author: Denis Avetisyan


New research shows that robots can interpret visual cues and language to select movement paths that align with human preferences for style and object avoidance.

Vision-language models demonstrate varying accuracy in identifying appropriate paths within images during manipulation tasks, suggesting an inherent limitation in their capacity to reliably interpret spatial relationships crucial for embodied intelligence.

This study evaluates the ability of Visual Language Models to reason about robot motion and select trajectories based on user-defined spatial preferences.

While robots increasingly assist in complex tasks, translating nuanced human instructions, particularly those concerning spatial relationships and preferred motion, remains a challenge. This is addressed in ‘Evaluating VLMs’ Spatial Reasoning Over Robot Motion: A Step Towards Robot Planning with Motion Preferences’, which investigates the capacity of state-of-the-art Vision-Language Models (VLMs) to interpret spatial reasoning for robot trajectory selection. The study demonstrates that VLMs, such as Qwen2.5-VL, can effectively identify robot motions aligning with user-defined preferences for both object proximity and path style, achieving up to 75% accuracy after fine-tuning. Could this represent a crucial step towards more intuitive and adaptable human-robot collaboration in dynamic environments?


The Illusion of Control: Traditional Motion Planning’s Limits

Conventional robot motion planning often depends on algorithms such as Probabilistic Roadmaps and BiRRT, which, while functional, demand significant computational resources. These methods typically involve extensive pre-processing to map and analyze the environment, creating a roadmap or tree of possible paths. This upfront cost limits a robot’s ability to react swiftly to dynamic changes or unexpected obstacles. The computational burden also hinders real-time adaptability, particularly in complex or large-scale environments where generating and searching these maps becomes increasingly slow and inefficient. Consequently, robots employing these traditional techniques can appear rigid and unresponsive, struggling to navigate unpredictable situations with the fluidity expected in human-robot collaboration or unstructured settings.

Current robotic motion planning algorithms often falter when confronted with the intricacies of real-world environments and the subtleties of human intention. Traditional approaches, while effective in simplified scenarios, struggle to navigate cluttered spaces or dynamically adjust to unforeseen obstacles, leading to rigid and unnatural movements. More significantly, these systems typically lack the capacity to interpret or integrate nuanced user preferences, such as a simple desire for a robot to approach an object ‘gently’ or follow a specific, non-direct path, resulting in interactions that feel clumsy and unresponsive. This inability to seamlessly blend robotic action with human expectation creates a barrier to truly collaborative work, limiting the potential for robots to function as intuitive and helpful partners in complex, everyday situations.

Qwen2.5-VL successfully plans robot trajectories (dotted lines) that adhere to specified language constraints in diverse motion planning scenarios.

The Echo of Intention: Visual Language Models Emerge

Visual Language Models (VLMs) represent a significant advancement in human-robot interaction by moving beyond traditional programming methods that require precise instructions. These models process both natural language input – such as commands like “bring me the red cup” – and visual information from the robot’s sensors. This combined input allows VLMs to infer user intent from imprecise or ambiguous requests and contextual visual cues. Unlike systems reliant on pre-defined actions, VLMs can interpret preferences expressed through demonstrations, pointing gestures, or even vague descriptions, effectively translating high-level user goals into actionable robotic behaviors without explicit, step-by-step programming.

Visual Language Models (VLMs) address the challenge of translating abstract, human-defined goals into concrete robotic actions by leveraging their ability to process both linguistic instructions and visual input. Traditional robot control often requires precise, low-level commands detailing specific movements; VLMs, however, can interpret high-level directives such as “avoid obstacles” or “stay close to the table” and map these to a range of possible robot trajectories. This is achieved through training on datasets that correlate natural language descriptions with corresponding visual scenes and robot behaviors, enabling the model to predict feasible and desirable robot paths given new, unseen instructions and environmental observations. The model effectively functions as an intermediary, converting qualitative goals into quantifiable movement plans suitable for robot execution.

Presenting candidate trajectories for evaluation by a Visual Language Model (VLM) necessitates methods beyond simple coordinate lists due to the VLM’s reliance on multimodal input. Direct presentation of raw trajectory data lacks the contextual information required for effective preference assessment. Therefore, our system employs a rendering process that generates a sequence of images depicting the robot executing each candidate trajectory in a simulated environment. These rendered views, combined with a natural language description of the intended goal, provide the VLM with the necessary visual and semantic data to determine which trajectory best aligns with user preferences. This approach facilitates a more intuitive and accurate evaluation process, enabling the VLM to effectively bridge the gap between high-level instructions and feasible robot actions.
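The render-then-ask pattern can be sketched in a provider-agnostic way. The `Candidate` record and the chat-message dictionary below are illustrative assumptions about the interface, not the paper's actual code:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:
    label: str        # e.g. "A", "B", "C"
    image_path: str   # rendered view of the robot executing this path

def build_preference_prompt(goal: str, candidates: List[Candidate]) -> dict:
    """Compose a multimodal query pairing rendered trajectory images with a
    natural-language preference, in a generic chat-message format."""
    labels = ", ".join(c.label for c in candidates)
    text = (
        f"User preference: {goal}\n"
        f"Each image shows the robot executing one candidate trajectory, "
        f"labelled {labels}. Answer with the single label that best "
        f"satisfies the preference."
    )
    content = [{"type": "text", "text": text}]
    content += [{"type": "image", "path": c.image_path} for c in candidates]
    return {"role": "user", "content": content}
```

The resulting message can then be handed to whichever VLM client is in use; only the text/image content structure matters here.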

Evaluations of the proposed system demonstrate successful trajectory selection with up to 71.4% accuracy in satisfying stated user preferences. This performance metric was determined through a series of experiments wherein the VLM-driven robot control system was tasked with selecting optimal trajectories based on natural language and visual input. The achieved accuracy indicates a statistically significant level of feasibility for implementing VLMs as a core component of intuitive robot control interfaces, suggesting the approach effectively translates user intent into actionable robotic movements.

Vision-language models demonstrate varying accuracy in identifying correct navigation paths within images.

The Fragile Signal: Optimizing Queries for Perception

Image query methods for Visual Language Models (VLMs) utilize different approaches to present potential object paths for evaluation. A Single-Image Trajectory presents candidate paths derived from a single input image, requiring the VLM to interpret the scene and assess path validity solely from that frame. Conversely, a Multi-Image Trajectory Trail utilizes a sequence of images, effectively providing a temporal context for path analysis. This allows the VLM to leverage changes across frames to better understand object movement and assess the plausibility of various trajectories, potentially increasing accuracy but also increasing computational demands.
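The two query styles differ mainly in how waypoints are composited onto frames: one image carrying the whole trail, versus a sequence of frames that reveals the path step by step. A minimal sketch, assuming grayscale NumPy frames and integer pixel waypoints (all function names are hypothetical):

```python
import numpy as np

def draw_point(img: np.ndarray, xy: tuple, value: int = 255) -> np.ndarray:
    """Mark a single (x, y) waypoint on the image in place."""
    x, y = xy
    img[y, x] = value
    return img

def single_image_trail(frame: np.ndarray, path: list) -> np.ndarray:
    """Overlay every waypoint on one copy of the scene: the VLM sees the
    entire candidate path at once in a single image."""
    out = frame.copy()
    for xy in path:
        draw_point(out, xy)
    return out

def multi_image_trail(frame: np.ndarray, path: list) -> list:
    """One frame per step, each showing the path up to that step, giving the
    VLM temporal context at a higher image-token cost."""
    return [single_image_trail(frame, path[: i + 1]) for i in range(len(path))]
```

The multi-image variant trades tokens for temporal context, which matches the accuracy-versus-cost trade-off described above.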

The Single-Image Trajectory Trail method utilizes Visual Language Models (VLMs) to generate descriptive textual representations of the image scene, providing contextual information that improves the accuracy of trajectory selection. By prompting the VLM to articulate the visual elements and relationships within the image, the system gains a more nuanced understanding of potential paths. This descriptive output is then used to evaluate candidate trajectories, enabling the identification of paths that are not only geometrically feasible but also semantically aligned with the scene’s content, ultimately resulting in a more precise and reliable visual query process.

To address the computational demands of evaluating numerous candidate trajectories in visual query systems, K-means clustering was investigated as a method for efficient grouping and assessment. This approach involves partitioning the candidate trajectory space into k clusters, where each trajectory is assigned to the cluster with the nearest mean. By evaluating representative trajectories from each cluster, rather than every individual trajectory, the overall computational burden is significantly reduced. The efficacy of this method relies on the appropriate selection of k and the distance metric used to determine cluster membership, balancing computational savings with the preservation of trajectory diversity and accuracy.
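A plain k-means pass over flattened waypoints is enough to illustrate the grouping step. This sketch uses deterministic first-k initialisation and Euclidean distance for simplicity; the study's actual initialisation and distance metric may differ:

```python
import numpy as np

def kmeans_trajectories(trajs: np.ndarray, k: int, iters: int = 25) -> list:
    """Cluster candidate trajectories of shape (n, waypoints, dims) with
    plain k-means on flattened waypoints, then return one representative
    index per non-empty cluster (the member nearest its centroid)."""
    flat = trajs.reshape(len(trajs), -1).astype(float)
    centers = flat[:k].copy()  # deterministic first-k initialisation
    for _ in range(iters):
        dists = np.linalg.norm(flat[:, None] - centers[None], axis=2)  # (n, k)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = flat[assign == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    # Final assignment, then pick the closest member of each cluster.
    dists = np.linalg.norm(flat[:, None] - centers[None], axis=2)
    assign = dists.argmin(axis=1)
    reps = []
    for j in range(k):
        members = np.where(assign == j)[0]
        if len(members):
            reps.append(int(members[dists[members, j].argmin()]))
    return reps
```

Only the returned representatives are rendered and sent to the VLM, cutting query cost from n trajectories to at most k.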

Supervised Fine-Tuning (SFT) was applied to the LLaVa1.5 Visual Language Model to specifically enhance performance on object proximity tasks. This process involved training the model on a dataset of labeled examples detailing spatial relationships between objects within images. Quantitative results demonstrate that this fine-tuning approach yielded a greater than 60% improvement in accuracy when assessing object proximity compared to the base, pre-trained LLaVa1.5 model. The improvement was measured using a standardized evaluation set designed to test the model’s ability to correctly identify objects located near each other in visual scenes.

Finetuning the Qwen2.5-VL-7B large multimodal model demonstrated a performance increase exceeding 20% when evaluated on object proximity tasks. This improvement was achieved through supervised fine-tuning, adapting the pre-trained model to more accurately assess spatial relationships between objects identified within visual queries. The performance gain indicates that while Qwen2.5-VL-7B possesses inherent visual understanding capabilities, targeted training data focused on proximity reasoning significantly enhances its ability to solve these specific types of visual question answering problems.

Visual language models demonstrate varying accuracy in identifying correct paths within images, with performance dependent on the query method used and averaging across preferences for object proximity and path style.

The Ghost in the Machine: Accuracy and the Specter of Hallucination

Evaluating the precision with which Visual Language Models (VLMs) select appropriate trajectories is paramount to establishing their viability in robotic control systems. Recent research demonstrates that accuracy serves as the key indicator of successful VLM performance in this domain, showcasing the potential for these models to guide robots through complex tasks. By assessing how consistently a VLM chooses the correct path from a set of candidates, researchers can quantify its understanding of both visual input and natural language instructions. A high level of accuracy not only validates the model’s reasoning capabilities but also provides a foundation for developing safe and reliable autonomous systems, moving beyond theoretical possibilities toward practical robotic applications.

Recent evaluations demonstrate that the Qwen2.5-VL model achieves an overall accuracy of 71.4% in trajectory selection tasks, establishing a new benchmark in vision-language model-driven robot control. This performance notably surpasses that of the widely-regarded GPT-4o model, indicating Qwen2.5-VL’s superior capability in interpreting visual input and selecting appropriate robot actions. The achievement signifies substantial progress toward reliable and intelligent robotic systems, as accuracy is paramount for safe and effective operation in real-world environments. This high level of performance suggests the potential for broader deployment of VLMs in complex robotic applications, paving the way for more autonomous and adaptable machines.

Evaluations reveal a nuanced performance profile for the Qwen2.5-VL model in trajectory selection, demonstrating a strong capability in maintaining object proximity during robotic tasks. Specifically, the model achieved a 74.4% accuracy rate in ensuring the robot remained appropriately close to target objects. However, performance dipped to 63.9% when assessing adherence to desired path styles, suggesting a greater challenge in coordinating complex movements beyond simple proximity maintenance. These results indicate that while the model excels at basic spatial awareness, further refinement is needed to fully control the aesthetic and procedural qualities of robot motion.

Despite demonstrated capabilities, Visual Language Models (VLMs) are susceptible to ‘hallucination’ – a tendency to propose robot trajectories that do not correspond to any of the pre-defined, safe options available to the system. This represents a critical safety concern, as an unconstrained VLM could instruct a robot to perform actions outside of its operational limits or in potentially hazardous scenarios. Unlike simply misidentifying an object, hallucination involves the creation of novel, unimplemented actions, bypassing established safeguards and demanding robust mitigation strategies to ensure reliable and predictable robot behavior. Addressing this issue is paramount for deploying VLMs in real-world robotic applications where safety and dependability are non-negotiable.
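One lightweight mitigation is to constrain the model's free-text answer to the known candidate set and fall back to a safe default otherwise. The parsing heuristic below is an assumption for illustration, not the paper's method; note that single-letter labels like ‘A’ can collide with ordinary words, so real prompts should demand a bare label:

```python
import re
from typing import Optional, Sequence

def parse_choice(reply: str, labels: Sequence[str]) -> Optional[str]:
    """Return the first candidate label mentioned in the model's reply, or
    None if the reply names no known candidate (a hallucinated choice)."""
    known = {label.upper() for label in labels}
    for token in re.findall(r"[A-Za-z0-9]+", reply):
        if token.upper() in known:
            return token.upper()
    return None

def select_trajectory(reply: str, labels: Sequence[str], fallback: str) -> str:
    """Constrain the VLM's answer to the pre-defined, safe candidate set,
    substituting a known-safe default when the reply is out of set."""
    choice = parse_choice(reply, labels)
    return choice if choice is not None else fallback
```

The guard never lets an out-of-set answer reach the controller, so a hallucinated trajectory degrades to a conservative default rather than an unvetted motion.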

A crucial element in refining Visual Language Model (VLM) control of robotic systems involves directly observing the potential trajectories considered by the model. To facilitate this, a Single-Image Screenshot Gallery was implemented, presenting a clear visual representation of each candidate path. This gallery allows researchers to pinpoint instances where the VLM selects improbable or unsafe trajectories, effectively debugging the decision-making process. By providing a readily interpretable overview of the available options, the gallery moves beyond simple accuracy metrics and fosters a deeper understanding of why a VLM chooses a particular course of action. This detailed visualization is particularly valuable in identifying and mitigating ‘hallucinations’ – instances where the model selects trajectories not present in the defined candidate set – and ultimately improves the safety and reliability of VLM-driven robotic control.

Path selection accuracy improves with increasing maximum allowed tokens, indicating that larger image sizes support more precise trajectory planning.

The study meticulously charts a course through the complexities of imparting preference to robotic systems. It’s a curious endeavor, this attempt to codify the nuances of ‘path style’ and ‘object proximity’: essentially, to predict which trajectory a robot should take, not just which it can. This echoes a sentiment articulated by Alan Turing: “The question is not whether a machine can think, but whether a machine can be made to behave intelligently.” The paper doesn’t aim for sentience, of course, but a demonstrable capacity to interpret and act upon human-defined spatial reasoning: to grow a system capable of anticipating desired outcomes, rather than simply executing commands. Each deployed trajectory, then, is a small prediction, a prophecy of success or, inevitably, a minor failure in the ongoing evolution of human-robot interaction.

Where Do the Paths Lead?

This work, while demonstrating a capacity for aligning robot action with expressed desire, reveals less about planning and more about the art of choosing between pre-laid paths. A system isn’t a cartographer, it’s a curator – it doesn’t create routes, it selects from what already exists. The true challenge lies not in interpreting a preference for ‘nearby’ or ‘smooth,’ but in cultivating a landscape of possibilities from which those preferences can meaningfully emerge. A richer vocabulary of motion, and a means of composing it dynamically, remains elusive.

The current paradigm implicitly assumes a static world. Yet, the most compelling interactions unfold in environments that shift and change. Resilience lies not in isolating the model from uncertainty, but in forgiving the inevitable discrepancies between perception and reality. A system built on precise spatial reasoning will falter; one that anticipates and gracefully accommodates error will endure. The question, then, isn’t whether the model knows where things are, but how readily it adapts when they aren’t.

Ultimately, this research illuminates a familiar truth: a system isn’t a tool, it’s a garden. It requires constant tending, not to eliminate weeds, but to guide their growth. The future of human-robot interaction won’t be defined by flawless execution, but by the elegance with which these systems navigate imperfection, and the surprising paths they reveal along the way.


Original article: https://arxiv.org/pdf/2603.13100.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-03-16 17:12