Author: Denis Avetisyan
Researchers introduce a comprehensive evaluation of robotic systems that translate natural language into physical actions, revealing the challenges and opportunities for more intuitive robot control.

This paper presents CEBench, a practical benchmark for vision-language-action models, and LLaVA-VLA, a lightweight model achieving cross-embodiment and mobile manipulation without extensive pretraining.
Despite the promise of generalist robotic agents, current Vision-Language-Action (VLA) models are hampered by substantial computational demands and limited adaptability to diverse robotic embodiments. This work, ‘Rethinking the Practicality of Vision-language-action Model: A Comprehensive Benchmark and An Improved Baseline’, addresses these challenges by introducing CEBench, a comprehensive benchmark for evaluating VLA practicality, and LLaVA-VLA, a lightweight model designed for efficient deployment on consumer hardware. LLaVA-VLA achieves strong performance across embodiments, including the first end-to-end demonstration of mobile manipulation, through innovations in multi-view perception, action chunking, and a novel two-stage training paradigm. Could this approach unlock truly versatile and accessible robotic systems capable of operating in real-world environments?
The Illusion of Intelligence: Why Robots Still Can’t Adapt
Historically, robotic control has depended on meticulously designed, hand-coded policies for even simple tasks. This approach, while offering precision in constrained environments, severely restricts a robot’s ability to adapt to unforeseen circumstances or generalize learned behaviors to novel situations. Each new environment or slight variation in task demands a complete re-engineering of the control system, proving both time-consuming and brittle. The limitations of handcrafted policies become particularly evident when robots encounter the inherent ambiguity and complexity of real-world environments, hindering their deployment in dynamic settings where flexibility and robustness are paramount. Consequently, advancements in artificial intelligence are now focused on creating systems capable of learning and adapting, moving beyond the rigid constraints of pre-programmed instructions.
The burgeoning field of embodied artificial intelligence necessitates a paradigm shift beyond isolated capabilities, demanding systems that can perceive the world visually, comprehend natural language instructions, and translate those instructions into purposeful action. This convergence has fueled the development of Vision-Language-Action (VLA) models, designed to bridge the gap between sensing, thinking, and doing. These models aren’t simply about recognizing objects or understanding commands in isolation; they aim to create a unified framework where visual input grounds language, and language guides physical interaction with the environment. Effectively, a VLA model strives to replicate the human ability to see a task, hear a request, and then skillfully execute it, opening doors to more adaptable, intuitive, and generally intelligent robotic systems capable of navigating and manipulating the complexities of real-world scenarios.
Despite significant advancements in artificial intelligence, scaling Vision-Language-Action (VLA) models remains a substantial hurdle. Current methodologies often encounter performance bottlenecks when transitioning from controlled laboratory settings to the complexities of real-world environments. The primary limitation isn’t necessarily the individual components – perception, language, or action – but rather their synergistic integration at scale. As models grow in size and are tasked with increasingly intricate scenarios, computational demands escalate rapidly, leading to prohibitive costs and diminished efficiency. Furthermore, the accumulation of data required to train these large-scale models presents a logistical challenge, and the risk of overfitting to specific datasets hinders generalization to novel situations. This difficulty in scaling effectively thus restricts the deployment of VLA models in practical applications, impeding progress toward truly adaptable and intelligent robotic systems.

LLaVA-VLA: Another Layer of Abstraction on a House of Cards
LLaVA-VLA presents a new methodology for Vision-Language-Action (VLA) modeling distinguished by its capacity to achieve competitive performance with significantly reduced reliance on large-scale pre-training. Traditional VLA models often require substantial pre-training on extensive datasets to establish foundational visual and linguistic understanding; however, LLaVA-VLA demonstrates strong capabilities using a comparatively smaller model size and a targeted training strategy. This is achieved through architectural innovations and training techniques that prioritize efficient learning and robust generalization from limited data, enabling deployment in resource-constrained environments without compromising performance on complex manipulation tasks.
LLaVA-VLA employs LLaVA-OneVision-0.5B as its foundational large language model, capitalizing on its existing vision and language capabilities. To facilitate robust state understanding for robotic control, the system integrates Proprioception Tokenization, which encodes robot joint angles and velocities as input tokens alongside visual and language data. Furthermore, LLaVA-VLA utilizes a Unified Action Space, representing all possible robot actions within a single token vocabulary; this allows the model to directly predict actions based on the combined visual, linguistic, and proprioceptive input, streamlining the control process and enabling more complex manipulation tasks.
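The core idea of proprioception tokenization and a unified action space can be sketched as a shared discrete vocabulary in which blocks of token ids are reserved for robot state and for actions. A minimal illustration follows; the bin count, vocabulary offsets, and value ranges are assumptions for the example, not values from the paper.

```python
import math

# Illustrative sketch: map continuous robot state and actions into one shared
# token vocabulary. N_BINS, the offsets, and the ranges are assumed values.
N_BINS = 256          # discrete bins per continuous dimension
STATE_OFFSET = 1000   # where proprioception tokens start in the vocabulary
ACTION_OFFSET = 2000  # where action tokens start

def discretize(value, low, high, n_bins=N_BINS):
    """Clip a continuous value to [low, high] and map it to a bin index."""
    value = min(max(value, low), high)
    scaled = (value - low) / (high - low)
    return min(int(scaled * n_bins), n_bins - 1)

def proprio_to_tokens(joint_angles, low=-math.pi, high=math.pi):
    """Encode joint angles as tokens in the proprioception block."""
    return [STATE_OFFSET + discretize(a, low, high) for a in joint_angles]

def action_to_tokens(action, low=-1.0, high=1.0):
    """Encode a normalized action vector as tokens in the action block."""
    return [ACTION_OFFSET + discretize(a, low, high) for a in action]

state_tokens = proprio_to_tokens([0.0, 1.2, -0.5])
action_tokens = action_to_tokens([0.3, -0.7])
```

Because state and action tokens live in disjoint ranges of the same vocabulary, a language model can consume the state tokens alongside visual and text tokens and emit action tokens directly as next-token predictions.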
The LLaVA-VLA architecture incorporates multi-view image inputs to provide a more comprehensive understanding of the robot’s surroundings, effectively increasing environmental awareness beyond what is achievable with single-view perception. Simultaneously, the system employs Action Chunking, a technique that decomposes complex actions into a sequence of smaller, discrete steps. This decomposition enhances planning robustness by reducing the cumulative error associated with long-horizon predictions and improves execution stability by allowing for more frequent feedback and correction during task completion. The combined effect of these features allows LLaVA-VLA to operate effectively in dynamic and partially observable environments.
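The control pattern behind action chunking is simple: instead of one model call per action, the policy emits a short chunk of actions that is executed open-loop, then replans from a fresh observation. A minimal sketch of that loop, with an assumed chunk size and a toy policy standing in for the model:

```python
# Sketch of action chunking: the policy predicts a chunk of K actions per
# call, the chunk is executed, and planning resumes from a new observation.
# CHUNK_SIZE and the toy policy are assumptions for illustration.
CHUNK_SIZE = 8

def chunk_actions(actions, chunk_size=CHUNK_SIZE):
    """Split a long action sequence into fixed-size chunks (last may be short)."""
    return [actions[i:i + chunk_size] for i in range(0, len(actions), chunk_size)]

def execute(policy, observe, horizon, chunk_size=CHUNK_SIZE):
    """Closed-loop control that replans every chunk_size steps."""
    executed = []
    steps = 0
    while steps < horizon:
        obs = observe()                   # fresh observation before each chunk
        chunk = policy(obs)[:chunk_size]  # model predicts the next chunk
        for action in chunk:
            if steps >= horizon:
                break
            executed.append(action)       # sent to the robot, open-loop
            steps += 1
    return executed

def toy_policy(obs):
    """Stand-in for the learned model: proposes a fixed 8-step chunk."""
    return [obs + i for i in range(CHUNK_SIZE)]

trajectory = execute(toy_policy, observe=lambda: 0, horizon=20)
```

The design trade-off is visible in the loop: larger chunks amortize model calls and reduce compounding per-step prediction error, while smaller chunks give more frequent feedback and correction.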
The LLaVA-VLA training process utilizes a two-stage paradigm to optimize learning efficiency and adaptability. Initially, the model undergoes pre-training on a diverse collection of datasets comprising vision-language-action data, establishing a foundational understanding of multimodal relationships. Subsequently, a task-specific fine-tuning stage is employed, allowing the model to specialize in particular manipulation tasks using datasets tailored to those tasks. This staged approach mitigates the need for extensive end-to-end training for each new task, as the pre-trained backbone provides a strong starting point, thereby accelerating convergence and improving performance on downstream applications.
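The staging described above amounts to two passes through a shared training driver: a broad pretraining pass over mixed vision-language-action data, then task-specific fine-tuning, typically at a lower learning rate. The skeleton below is a placeholder sketch; the dataset names, epoch counts, and learning rates are assumptions, and the loop only counts updates rather than computing gradients.

```python
# Placeholder sketch of the two-stage paradigm; no real optimization happens,
# the loop just records how many update steps each stage would take.
def train(model, dataset, epochs, lr):
    for _ in range(epochs):
        for _batch in dataset:
            model["updates"] += 1
            model["lr_history"].append(lr)
    return model

model = {"updates": 0, "lr_history": []}

# Stage 1: broad pretraining on mixed vision-language-action corpora (assumed names).
pretrain_data = [f"mixed_batch_{i}" for i in range(10)]
model = train(model, pretrain_data, epochs=2, lr=1e-4)

# Stage 2: task-specific fine-tuning at a lower learning rate.
finetune_data = [f"task_batch_{i}" for i in range(5)]
model = train(model, finetune_data, epochs=4, lr=1e-5)
```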

CEBench: A More Complex Way to Demonstrate Inevitable Failure
CEBench is designed as a holistic evaluation suite for Vision-Language-Action (VLA) models, moving beyond single-embodiment or single-environment testing. The benchmark incorporates a range of robotic platforms – including both simulated and real-world robots – and presents tasks within diverse environments, such as household spaces and cluttered scenes. This diversity is achieved through variations in robot arm length, gripper type, and simulated physics parameters. CEBench facilitates assessment of a VLA model’s ability to generalize across different robotic morphologies and environmental conditions, providing a more representative measure of its overall performance and robustness compared to benchmarks focused on a single setup. The suite incorporates metrics evaluating both task success rate and trajectory efficiency, offering a comprehensive analysis of model capabilities.
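The two families of metric mentioned above can be made concrete with a small sketch. The episode record fields and the particular efficiency definition (shortest known path length divided by executed path length, over successful episodes) are assumptions for illustration; the benchmark's exact formulas may differ.

```python
# Sketch of benchmark-style metrics: success rate over all episodes, and a
# trajectory-efficiency ratio over the successful ones. Field names and the
# efficiency formula are assumed for this example.
def success_rate(episodes):
    """Fraction of episodes whose task was completed."""
    return sum(1 for e in episodes if e["success"]) / len(episodes)

def trajectory_efficiency(episodes):
    """Mean (optimal length / executed length), over successful episodes only."""
    successes = [e for e in episodes if e["success"]]
    if not successes:
        return 0.0
    return sum(e["optimal_len"] / e["actual_len"] for e in successes) / len(successes)

episodes = [
    {"success": True,  "optimal_len": 10, "actual_len": 20},
    {"success": True,  "optimal_len": 10, "actual_len": 10},
    {"success": False, "optimal_len": 10, "actual_len": 50},
    {"success": False, "optimal_len": 10, "actual_len": 50},
]
```

Reporting both numbers matters: a policy can succeed often while wandering inefficiently, or fail rarely but only on the episodes where a short path exists.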
LLaVA-VLA exhibits robust performance within the CEBench benchmark, indicating a capacity for generalization to novel situations. It performs competitively with larger models on the simulated CALVIN benchmark, exceeds diffusion and VAE baselines on RoboTwin both with and without domain randomization, and transfers to real-world mobile manipulation at a level comparable to the ACT baseline; the evaluations below detail each setting.
Evaluations using the CALVIN benchmark demonstrate that the LLaVA-VLA model achieves a 96.2% task success rate. This performance is notably comparable to that of a larger model, which attained a 97.4% success rate on the same benchmark. Furthermore, LLaVA-VLA completed an average trajectory length of 3.68, indicating its ability to effectively plan and execute actions to fulfill task requirements within the simulated environment.
Domain Randomization (DR) is a key component of the CEBench evaluation framework, designed to assess and improve the generalization capabilities of Vision-Language-Action (VLA) models like LLaVA-VLA. This technique involves training and evaluating the model across a broad spectrum of simulated environments and conditions, systematically varying parameters such as lighting, textures, object poses, and camera viewpoints. By exposing LLaVA-VLA to this diversity during both training and testing, CEBench effectively increases the model’s robustness to variations encountered in novel, real-world scenarios, as demonstrated by its performance on the RoboTwin benchmark where it achieved a 28.6% success rate in DR settings, exceeding the performance of diffusion and VAE baseline models.
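Mechanically, domain randomization is just per-episode sampling of scene parameters from broad ranges, so the policy never overfits one rendering. A minimal sketch, in which the specific parameters and their ranges are assumptions chosen for illustration:

```python
import random

# Sketch of domain randomization: each episode draws lighting, texture,
# object pose, and camera parameters from wide ranges. The ranges below are
# illustrative assumptions, not the benchmark's actual settings.
def sample_domain(rng):
    return {
        "light_intensity": rng.uniform(0.2, 2.0),
        "texture_id": rng.randrange(100),
        "object_xy": (rng.uniform(-0.3, 0.3), rng.uniform(-0.3, 0.3)),
        "camera_yaw_deg": rng.uniform(-15.0, 15.0),
    }

rng = random.Random(0)  # seeded so the sampled domains are reproducible
domains = [sample_domain(rng) for _ in range(1000)]
```

Seeding the generator is worth the extra line: it makes a randomized evaluation repeatable, so two models can be compared on the identical set of perturbed scenes.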
Evaluation on the RoboTwin benchmark demonstrates LLaVA-VLA’s performance capabilities in robotic task completion. Specifically, LLaVA-VLA achieves a 40.3% success rate when evaluated on tasks within the seen distribution. When tested with Domain Randomization (DR) applied – exposing the model to variations in simulated conditions – the success rate is 28.6%. These results indicate that LLaVA-VLA outperforms baseline models utilizing diffusion and Variational Autoencoder (VAE) architectures on the RoboTwin benchmark, both with and without the application of domain randomization techniques.
Evaluation of LLaVA-VLA in real-world mobile manipulation scenarios demonstrates a success rate exceeding 10%. This performance level is comparable to that achieved by the ACT baseline model under identical testing conditions. The metric assesses the model’s ability to successfully complete manipulation tasks in physical environments, providing a practical measure of its generalization capability beyond simulated data. While the success rate represents a relatively modest level of performance, its alignment with the ACT baseline suggests a functional level of competency in real-world application.

Trimming the Fat: Or, How to Make Complexity Slightly Less Unmanageable
Vision-Language-Action (VLA) models, traditionally demanding in computational resources, are becoming increasingly accessible thanks to techniques like Low-Rank Adaptation (LoRA), as exemplified by models such as TinyVLA. LoRA operates by freezing the pre-trained weights of a large model and introducing a smaller set of trainable parameters – low-rank matrices – that are adapted during fine-tuning. This significantly reduces the number of parameters requiring gradient updates, leading to faster training times and reduced memory footprint without substantial performance degradation. TinyVLA demonstrates that this approach is particularly effective for VLA tasks, enabling efficient inference even on resource-constrained devices. By focusing adaptation on a limited number of parameters, LoRA not only accelerates the development cycle but also facilitates broader participation in VLA research and deployment, paving the way for more practical applications of multimodal AI.
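The parameter arithmetic behind LoRA is easy to show directly: a frozen weight matrix $W$ is augmented with a trainable low-rank product $BA$, so only $r(d_{in}+d_{out})$ parameters are updated instead of $d_{in} \cdot d_{out}$. The sketch below uses plain lists and small dimensions for clarity; it illustrates the general LoRA construction, not TinyVLA's actual layers.

```python
import random

# Minimal LoRA sketch: y = W x + B(A x), with W frozen, A random-initialized,
# and B zero-initialized so training starts from the frozen model's behavior.
# Dimensions and rank are illustrative assumptions.
def matmul(X, Y):
    rows, inner, cols = len(X), len(Y), len(Y[0])
    return [[sum(X[i][k] * Y[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

d_in, d_out, r = 64, 64, 4
rng = random.Random(0)
W = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]  # frozen
A = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(r)]      # trainable
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, zero-init

def lora_forward(x_col):
    """Apply the adapted layer to a column vector x (d_in x 1)."""
    base = matmul(W, x_col)
    update = matmul(B, matmul(A, x_col))
    return [[base[i][0] + update[i][0]] for i in range(d_out)]

full_params = d_in * d_out          # 4096 weights in the frozen matrix
lora_params = r * (d_in + d_out)    # 512 trainable weights: an 8x reduction
```

Zero-initializing $B$ is the standard trick: at the start of fine-tuning the adapter contributes nothing, so training departs smoothly from the pretrained model.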
NORA presents a significant advancement in how actions are represented within Vision-Language-Action models. Traditionally, action tokens require substantial computational resources. To address this, NORA leverages FAST – a technique for efficient quantization – to compress these action tokens into a more manageable form. This quantization process doesn’t simply reduce size; it effectively creates a “diffusion expert” within the model. By learning to reconstruct detailed action information from these compressed tokens, the model gains a robust understanding of temporal dynamics and nuanced movements, even with limited data. The resulting system demonstrates improved performance and efficiency, allowing for more complex action recognition and generation tasks without a corresponding increase in computational cost.
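The flavor of frequency-domain action compression can be sketched without the full tokenizer: transform a smooth action trajectory with a discrete cosine transform, keep only the leading coefficients, and reconstruct. This is a loose illustration in the spirit of such compression schemes, not the actual FAST pipeline, which also quantizes and entropy-codes the coefficients; the coefficient count is an assumption.

```python
import math

# Loose sketch of frequency-domain action compression: DCT a trajectory,
# truncate to the first few coefficients, reconstruct. The keep-count of 8
# is an assumed value for illustration.
def dct(x):
    """Type-II DCT (unnormalized)."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi / N * (n + 0.5) * k) for n in range(N))
            for k in range(N)]

def idct(X):
    """Inverse of the type-II DCT above."""
    N = len(X)
    return [(X[0] / 2 + sum(X[k] * math.cos(math.pi / N * (n + 0.5) * k)
                            for k in range(1, N))) * 2 / N for n in range(N)]

# A smooth 1-D action trajectory (e.g. one joint over 32 timesteps).
traj = [math.sin(2 * math.pi * t / 32) for t in range(32)]
coeffs = dct(traj)
kept = coeffs[:8] + [0.0] * (len(coeffs) - 8)  # keep 8 of 32 coefficients
recon = idct(kept)
max_err = max(abs(a - b) for a, b in zip(traj, recon))
```

The point of the sketch: smooth trajectories concentrate their energy in low frequencies, so a 4x shorter representation reconstructs the motion with small error, which is exactly why compressed action tokens can be cheap without being lossy in practice.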
SmolVLA represents a significant step in streamlining Vision-Language-Action models without substantial performance loss. The architecture achieves enhanced efficiency through strategic modifications, notably by selectively skipping layers during processing – a technique that reduces computational load without compromising crucial feature extraction. Complementing this is a method of pruning visual tokens, effectively discarding less informative visual data to further decrease processing demands. Importantly, SmolVLA doesn’t abandon the benefits of Action Chunking – a process that organizes action sequences for improved understanding – but rather integrates it with these optimization strategies, demonstrating that substantial efficiency gains can be achieved while maintaining robust action recognition and understanding capabilities.
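Both efficiency tricks have simple skeletons: rank visual tokens by a salience score and keep only the top fraction, and apply only a subset of transformer layers per forward pass. The scores, keep ratio, skipping schedule, and toy "layers" below are all assumptions for illustration, not SmolVLA's actual heuristics.

```python
# Sketch of visual-token pruning and layer skipping. Scores, the keep ratio,
# and the every-other-layer schedule are illustrative assumptions.
def prune_visual_tokens(tokens, scores, keep_ratio=0.5):
    """Keep the highest-scoring fraction of visual tokens, preserving order."""
    k = max(1, int(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])
    return [tokens[i] for i in keep]

def run_layers(x, layers, skip_every=2):
    """Apply only every skip_every-th layer (a crude layer-skipping schedule)."""
    for i, layer in enumerate(layers):
        if i % skip_every == 0:
            x = layer(x)
    return x

tokens = ["t0", "t1", "t2", "t3", "t4", "t5"]
scores = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3]
kept = prune_visual_tokens(tokens, scores)          # keeps t0, t2, t4
layers = [lambda x, i=i: x + i for i in range(4)]   # toy stand-in "layers"
out = run_layers(0, layers)                         # applies layers 0 and 2 only
```

Because attention cost grows with sequence length, halving the visual tokens cuts far more than half of the attention compute, which is why pruning pairs well with layer skipping rather than replacing it.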
The release of OpenVLA, a 7 billion parameter open-source model, represents a significant step towards democratizing research in the rapidly evolving field of Vision-Language-Action understanding. By making a fully capable model freely available, OpenVLA lowers the barrier to entry for researchers and developers who may lack the resources to train such models from scratch. This accessibility fosters greater collaboration, accelerates innovation, and allows for wider scrutiny and improvement of Vision-Language-Action techniques. The open-source nature also promotes reproducibility, a cornerstone of scientific validity, as others can readily verify and build upon the model’s foundations, ultimately driving the field forward with shared knowledge and collective progress.
The pursuit of increasingly complex vision-language-action models feels predictably optimistic. This paper, with CEBench and LLaVA-VLA, attempts a pragmatic assessment, a rare exercise. It’s a useful contribution, not because it solves the general problem, but because it acknowledges the limitations of current approaches and focuses on practical, lightweight solutions. As Bertrand Russell observed, “The difficulty lies not so much in developing new ideas as in escaping from old ones.” The field consistently chases architectural novelty, ignoring the inherent difficulties of transferring models across robotic embodiments and real-world scenarios. This work, with its emphasis on action chunking and domain randomization, is a small step toward accepting that ‘better one monolith than a hundred lying microservices’ also applies to robotic control.
What’s Next?
The pursuit of generalized robotic action from language remains, predictably, a game of shifting tolerances. CEBench establishes a useful, if temporary, standard for evaluating vision-language-action models, but benchmarks are, at best, a snapshot of current failure modes. Every carefully curated domain randomization strategy will eventually encounter an edge case that exposes the brittleness hidden within. The architecture isn’t the diagram; it’s the compromise that survived deployment, plus the accruing technical debt of all the scenarios not tested.
LLaVA-VLA’s pretraining-free approach is a pragmatic step, acknowledging the unsustainable resource demands of ever-larger models. But lightweight solutions invariably trade off expressiveness. The field will likely see a continued oscillation between capacity and efficiency – everything optimized will one day be optimized back. The real challenge isn’t just enabling robots to execute instructions, but to gracefully degrade when faced with ambiguity or unforeseen circumstances.
Cross-embodiment, while promising, highlights the uncomfortable truth that robotic intelligence isn’t about abstract reasoning; it’s about exquisitely calibrated physical interactions. The current focus on action chunking is a useful heuristic, but ultimately, the robot doesn’t see “open drawer”; it experiences a cascade of torques, forces, and sensor readings. The work doesn’t refactor code; it resuscitates hope.
Original article: https://arxiv.org/pdf/2602.22663.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/