Seeing is Believing: How Models Learn to Connect Vision, Language, and Action

Author: Denis Avetisyan


New research dissects the internal workings of vision-language-action models, revealing a strong reliance on visual pathways and a surprising separation of motor programs from intended goals.

The study demonstrates a semantic organization within the feature space of an OpenVLA-OFT layer 16, as revealed by a UMAP projection of 4,096 SAE features, effectively mapping high-dimensional data into a lower-dimensional space while preserving semantic relationships between features.

A mechanistic analysis using sparse autoencoders demonstrates pathway specialization and highlights current limitations in representing complex action sequences.

Despite advances in embodied artificial intelligence, the mechanisms by which vision-language-action (VLA) models translate multimodal inputs into coherent behavior remain poorly understood. This work, ‘Not All Features Are Created Equal: A Mechanistic Study of Vision-Language-Action Models’, dissects six VLA models, ranging from 80M to 7B parameters, using activation injection, sparse autoencoders, and causal analysis to reveal a strong reliance on visual pathways and a separation of motor program encoding from goal semantics. Specifically, we find that visual information often dominates action generation, while language sensitivity emerges only when visual context is ambiguous, suggesting current architectures may not fully leverage the potential of multimodal reasoning. How can we better align VLA model design with the nuanced interplay between perception, language, and action to unlock more robust and interpretable robotic intelligence?


The Erosion of Handcrafted Control: A Paradigm Shift

Historically, imparting robotic dexterity demanded painstaking engineering of specific visual features – identifying edges, corners, or objects – and then constructing elaborate pipelines to translate these perceptions into motor commands. This approach, while sometimes successful in constrained environments, proved brittle when confronted with novelty; a slight change in lighting, an unexpected object, or even a variation in the target’s pose could derail the entire system. These handcrafted features and complex pipelines lacked the inherent adaptability necessary for robots to function reliably in the real world, necessitating constant recalibration and limiting their ability to generalize beyond the specific scenarios for which they were programmed. Consequently, progress toward truly autonomous robots capable of navigating and interacting with unstructured environments was significantly hindered by these limitations in traditional control methodologies.

Emerging Vision-Language-Action (VLA) models represent a significant departure from traditional robotics by consolidating three core functionalities into a unified system. Rather than relying on separate pipelines for visual processing, natural language interpretation, and action planning, these models ingest visual data and linguistic commands simultaneously, learning to map them directly to desired robotic behaviors. This integrated approach allows for more intuitive human-robot interaction – a robot can respond to instructions like “pick up the red block” without needing pre-programmed routines for object recognition or grasp planning. The promise of VLA models lies in their potential for greater adaptability and generalization; by learning a common representation of vision, language, and action, robots can potentially perform novel tasks based on simple linguistic descriptions, reducing the need for extensive, task-specific engineering.

Despite the increasing sophistication of Vision-Language-Action (VLA) models in robotic control, a crucial area of inquiry centers on deciphering how these models actually translate perception and instruction into effective action. Researchers are actively investigating the interplay between visual and linguistic inputs, attempting to quantify the relative importance of each modality in different scenarios. Determining whether a model relies more heavily on visual cues for precise manipulation, or on linguistic guidance for high-level task planning, is paramount to improving robustness and generalization. This understanding isn’t merely about attribution; it’s about revealing the internal representations and reasoning processes within these models, paving the way for more interpretable and trustworthy robotic systems. Ultimately, pinpointing the contributions of vision and language will allow for targeted improvements in model architecture and training strategies, optimizing performance and adaptability in complex, real-world environments.

Removing color information via grayscale processing completely prevents the robot from successfully stacking the cube, demonstrating its reliance on color cues for object differentiation.

Visual Primacy: Evidence of a Hierarchical System

Observations indicate a significant influence of the visual pathway on agent behavior within vision-language-action (VLA) models. This dominance is demonstrated by the capacity of visual inputs to independently drive action generation, even when linguistic instruction is conflicting or absent. Studies utilizing activation injection techniques have confirmed that visual activations are frequently sufficient to initiate task completion, suggesting that linguistic prompts are not always essential for successful outcomes. This prioritization of visual information implies a hierarchical processing structure in which the visual pathway can effectively override or bypass linguistic input in directing agent behavior.

Activation injection techniques, involving the direct stimulation of a vision-language-action (VLA) model with visual features, provide empirical evidence for the dominance of the visual pathway. These techniques demonstrate that providing only visual activations – bypassing linguistic input entirely – is frequently sufficient to elicit appropriate action generation. Specifically, researchers have been able to trigger desired behaviors by injecting activations corresponding to visual stimuli, confirming that the VLA can effectively interpret and respond to visual information independently of language. This capability highlights a fundamental aspect of the VLA’s architecture: a strong reliance on, and prioritization of, visual input for driving task completion.
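The logic of an injection test can be sketched with a toy policy. Everything below – the two-layer network, the weight shapes, the 7-free joint dimensions – is invented for illustration and is not the architecture of any model in the study; the point is only that overwriting an intermediate activation makes the output independent of the language input.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer "policy": layer 1 mixes vision and language, layer 2 maps to actions.
W1 = rng.normal(size=(8, 16))   # (vision 4 + language 4) -> hidden 16
W2 = rng.normal(size=(16, 3))   # hidden -> 3-DoF action

def forward(vision, language, injected_hidden=None):
    """Run the toy policy; optionally overwrite the hidden layer
    with a recorded activation (activation injection)."""
    x = np.concatenate([vision, language])
    hidden = np.tanh(x @ W1)
    if injected_hidden is not None:
        hidden = injected_hidden          # bypass both input streams
    return hidden @ W2

vision = rng.normal(size=4)
language = rng.normal(size=4)

# Record the hidden activation from a "source" rollout...
source_hidden = np.tanh(np.concatenate([vision, language]) @ W1)

# ...and inject it while feeding a *different* language instruction.
other_language = rng.normal(size=4)
injected_action = forward(vision, other_language, injected_hidden=source_hidden)
original_action = forward(vision, language)

# The injected activation fully determines the action, regardless of language.
print(np.allclose(injected_action, original_action))  # → True
```

In a real VLA the equivalent operation is a forward hook that replaces a transformer layer's residual-stream activation during the rollout; the comparison between injected and baseline actions is the same.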

Counterfactual prompting experiments reveal substantial variability in language sensitivity across different tasks performed by vision-language-action (VLA) models. Cross-task injection tests, conducted on five separate models, consistently yielded a 0% success rate on the task named by the altered or removed language instruction: rather than following the new instruction, the models executed the task implied by the visual input. This outcome strongly indicates that, in these scenarios, the VLA relies primarily on visual input to determine and execute actions, demonstrating a clear dominance of the visual pathway over linguistic guidance – and that accurate language instructions are not a necessary component for the visually cued behavior to succeed.

Experiments utilizing the [latex]\pi_{0.5}[/latex] model demonstrate a significant capacity for task completion based solely on visual input. Specifically, even when language-based prompts were replaced with null injections – effectively removing linguistic guidance – the model achieved a 77% task success rate. This performance was accompanied by a high cosine similarity score of 99.9%, indicating that the model’s actions closely aligned with the expected outputs despite the absence of language. These results strongly suggest the existence of a robust, underlying visual program within the model that can independently drive successful task execution.
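The 99.9% agreement figure is a cosine similarity between action trajectories. A minimal sketch of that metric, on made-up trajectory arrays (the trajectory shapes and noise scale are assumptions, not values from the paper):

```python
import numpy as np

def action_cosine_similarity(a, b):
    """Cosine similarity between two flattened action trajectories --
    the kind of agreement metric reported for null-injection runs."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(7)
with_language = rng.normal(size=(50, 7))              # baseline rollout actions
null_injected = with_language + 1e-3 * rng.normal(size=(50, 7))  # prompt nulled

print(action_cosine_similarity(with_language, null_injected) > 0.999)  # → True
```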

Despite layer 17 classifiers distinguishing between prompts with [latex]99.3\%[/latex] accuracy, no behavioral difference was observed across 3,396 episodes ([latex]p > 0.24[/latex]): the agent ignores linguistic variations in prompts.

Deconstructing the Black Box: Interpretable Representations

Sparse Autoencoders (SAEs) decompose the high-dimensional activations found within vision-language-action models (VLAs) into a dictionary of sparsely active features. By forcing the network to reconstruct the input under a sparsity constraint, SAEs identify the most salient features necessary to represent the VLA’s internal state. This process yields sparse representations, in which only a small subset of features is active for any given input, enhancing interpretability. The reconstruction error serves as a metric for how faithfully the features capture the activations, allowing researchers to isolate and analyze the components of the VLA that contribute most significantly to observed behaviors. Empirical results demonstrate that SAEs can explain 83-99% of the variance in GR00T layers, contingent on the pooling strategy employed, highlighting their effectiveness in capturing essential information within VLA representations.
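The SAE recipe can be sketched in a few lines. The block below shows an untrained tied-weight SAE on synthetic activations – the shapes, initialization, and data are all invented, and in practice the weights would be trained to minimize reconstruction error plus an L1 penalty on the feature activations. It illustrates the two quantities the text relies on: sparse non-negative features and the fraction of variance explained by the reconstruction.

```python
import numpy as np

rng = np.random.default_rng(1)

d_model, d_feat, n = 32, 128, 500          # overcomplete: more features than dims
acts = rng.normal(size=(n, d_model))       # stand-in for residual-stream activations

# Tied-weight SAE: one matrix serves as encoder (W) and decoder (W.T).
W = rng.normal(size=(d_model, d_feat)) / np.sqrt(d_model)
b = np.zeros(d_feat)

def sae(x):
    z = np.maximum(x @ W + b, 0.0)         # sparse, non-negative feature activations
    x_hat = z @ W.T                        # linear reconstruction
    return z, x_hat

z, x_hat = sae(acts)

# Fraction of variance explained -- the fidelity metric quoted for GR00T layers.
fvu = ((acts - x_hat) ** 2).sum() / ((acts - acts.mean(axis=0)) ** 2).sum()
explained_variance = 1.0 - fvu

print(z.min() >= 0.0, x_hat.shape == acts.shape)  # → True True
```

A trained SAE would push `explained_variance` toward the 83-99% range reported in the study while keeping most entries of `z` at exactly zero.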

The Action Atlas is a platform designed to facilitate the investigation of vision-language-action (VLA) representations through the use of Sparse Autoencoders (SAEs). By employing SAEs, the platform enables interactive exploration of the learned features within VLAs, allowing researchers to visualize and analyze how control policies are encoded. This interactive capability permits users to probe specific latent dimensions and correlate them with corresponding actions or behavioral traits. The platform’s architecture supports detailed inspection of the VLA’s internal structure, providing insights into the functional organization and underlying principles of the learned control signals. Data visualization tools within Action Atlas allow for both qualitative assessment and quantitative analysis of VLA representations.

Linear probes are employed alongside Sparse Autoencoders (SAEs) to assess the degree to which action-related information is linearly separable within the intermediate representations learned by a VLA. This technique involves training a linear classifier to predict actions directly from the output of a specific VLA layer, as reconstructed by the SAE. The performance of this linear classifier – typically measured by accuracy or other classification metrics – indicates the extent to which control signals are explicitly encoded in a linearly decodable manner within that layer’s representation. High performance suggests a relatively straightforward encoding, while low performance implies more complex, non-linear relationships between the representation and the corresponding action.
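A linear probe in this sense is just a linear classifier trained on frozen activations. The sketch below uses a fully synthetic setup – the activation dimensions, the three action classes, and the rule generating the labels are all hypothetical – so that the probe is guaranteed to have something linearly decodable to find:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic "layer activations": the first three dimensions linearly encode
# which of three actions follows, so a linear probe should recover the label.
n, d = 600, 32
X = rng.normal(size=(n, d))
y = np.argmax(X[:, :3], axis=1)            # hypothetical action classes 0, 1, 2

# A linear probe is multinomial logistic regression on the frozen features.
W = np.zeros((d, 3))
for _ in range(500):                       # plain batch gradient descent
    logits = X @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    W -= 0.5 * X.T @ (p - np.eye(3)[y]) / n

accuracy = float((np.argmax(X @ W, axis=1) == y).mean())
print(accuracy)   # high accuracy => linearly decodable action information
```

Low probe accuracy on real activations would instead indicate that action information, if present at all, is encoded non-linearly at that layer.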

Analysis utilizing Sparse Autoencoders (SAEs) demonstrates a high degree of representational capacity within GR00T layers of VLAs, providing evidence for pathway specialization. Specifically, SAE reconstruction accounts for 83-99% of the variance observed in these layers; the precise percentage is influenced by the pooling strategy employed during the reconstruction process. This substantial variance explained indicates that SAEs effectively capture the core information present in GR00T layer activations, suggesting these layers encode functionally meaningful pathways related to the learned control policies.

Cross-task injection experiments demonstrate a high degree of trajectory dominance by the source task, with displacement rates reaching 99.8%. This methodology involves injecting activations from one trained VLA into another trained on a different task, and then measuring the extent to which the injected activations displace the target VLA’s behavior. The near-complete displacement observed indicates that the injected activations strongly influence the target VLA, effectively overriding the learned control policy for the second task and suggesting a limited degree of information mixing between the tasks during learning.
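One way to operationalize "displacement" is to ask, at each timestep, whether the injected rollout is closer to the source task's trajectory than to the target task's. The metric and all the numbers below are an illustrative assumption, not the paper's exact definition:

```python
import numpy as np

def displacement_rate(target_traj, injected_traj, source_traj, tol=0.1):
    """Fraction of timesteps where the injected rollout tracks the *source*
    task's trajectory rather than the target task's (hypothetical metric)."""
    to_source = np.linalg.norm(injected_traj - source_traj, axis=1)
    to_target = np.linalg.norm(injected_traj - target_traj, axis=1)
    return float((to_source + tol < to_target).mean())

rng = np.random.default_rng(3)
T = 200
source = rng.normal(size=(T, 7))                  # source-task action trajectory
target = source + 5.0 * rng.normal(size=(T, 7))   # a clearly distinct target task
injected = source + 0.01 * rng.normal(size=(T, 7))  # injection reproduces the source

print(displacement_rate(target, injected, source))  # → 1.0
```

Near-total displacement, as in this toy case, corresponds to the ~99.8% figure: the injected activations carry the source task's motor program essentially intact.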

This methodology identifies causal relationships between neural network activations and behavior by recording activations during rollouts, replaying them under counterfactual conditions, decomposing them into sparse features using sparse autoencoders (SAEs), and then validating these features through ablation and steering experiments visualized in an Action Atlas.

OpenVLA-OFT: Democratizing Embodied Intelligence

OpenVLA-OFT is an open-source, trajectory-based Vision-Language-Action (VLA) model employing continuous L1 regression as its core action generation mechanism. This approach formulates action prediction as a regression problem, minimizing the L1 norm of the predicted action trajectory. By utilizing continuous regression, OpenVLA-OFT aims to produce smooth and physically plausible actions, unlike discrete action space methods. The model’s source code and pre-trained weights are publicly available, facilitating reproducibility and further development within the research community. This contrasts with closed-source or commercially restricted VLA implementations, enabling broader accessibility and collaborative innovation in the field of embodied AI.
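The L1 objective itself is simple to state. The sketch below is a minimal illustration, not OpenVLA-OFT's code; the 8-step chunk of 7-DoF actions is an assumed shape chosen for concreteness:

```python
import numpy as np

def l1_action_loss(pred, target):
    """Continuous L1 regression objective over an action chunk:
    mean absolute error across timesteps and action dimensions."""
    return np.abs(pred - target).mean()

# A hypothetical 8-step chunk of 7-DoF actions (e.g. joint deltas + gripper).
target = np.zeros((8, 7))
pred = np.full((8, 7), 0.5)
print(l1_action_loss(pred, target))  # → 0.5

# The subgradient is just sign(pred - target): every residual is pulled toward
# zero at the same rate, one reason L1 tolerates occasional outlier timesteps
# better than squared error does.
grad = np.sign(pred - target)
```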

The OpenVLA-OFT architecture employs Flow Matching and Conditional Variational Autoencoders (CVAEs) to facilitate the generation of accurate and temporally consistent action sequences. Flow Matching operates by defining a continuous normalizing flow that maps data distributions to a simple prior, enabling efficient sampling of plausible actions. CVAEs are utilized to model the conditional distribution of actions given the current state and desired goal, allowing for goal-directed behavior. By integrating these techniques, the model learns a latent space representation of actions that promotes smoothness and reduces abrupt transitions, resulting in more natural and effective control policies.
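The core of flow matching is a regression target for a velocity field. The block below sketches the linear-path conditional variant on made-up action samples – the chunk shapes are assumptions, and a real implementation would regress a conditional network [latex]v_\theta(x_t, t, \text{context})[/latex] onto these targets rather than the zero baseline used here:

```python
import numpy as np

rng = np.random.default_rng(5)

def flow_matching_targets(x0, x1, t):
    """Linear-path conditional flow matching: interpolate noise x0 toward an
    action sample x1; the regression target for the velocity field is x1 - x0."""
    xt = (1.0 - t)[:, None] * x0 + t[:, None] * x1
    return xt, x1 - x0

n, a = 64, 7
x1 = rng.normal(size=(n, a))          # expert action chunks (data samples)
x0 = rng.normal(size=(n, a))          # Gaussian noise samples (the prior)
t = rng.random(size=n)                # random interpolation times in [0, 1]

xt, v_target = flow_matching_targets(x0, x1, t)

# A trained velocity network would minimize mean((v_theta(xt, t) - v_target)^2);
# integrating the learned field from x0 then samples smooth action trajectories.
baseline_loss = (v_target ** 2).mean()

print(xt.shape == (n, a) and v_target.shape == (n, a))  # → True
```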

OpenVLA-OFT utilizes PaliGemma, a pre-trained language model developed by Google DeepMind, as its core linguistic component. This integration establishes a functional connection between language instructions and robotic action generation within the VLA framework. PaliGemma provides the model with pre-existing knowledge of language structure and semantics, enabling it to interpret natural language prompts and translate them into actionable commands for a robotic agent. By leveraging a pre-trained model, OpenVLA-OFT reduces the need for extensive language training specific to the robotic task, streamlining the development process and enhancing generalization capabilities to novel instructions.

The release of open-source VLA implementations, such as OpenVLA-OFT, coupled with standardized benchmark suites like LIBERO, MetaWorld, and SimplerEnv, is intended to significantly accelerate progress in the field. These benchmarks provide a common evaluation framework, enabling researchers to objectively compare different VLA approaches and track performance improvements. The availability of open-source code lowers the barrier to entry for new researchers and facilitates rapid iteration and collaborative development. Furthermore, consistent evaluation on these benchmarks aids in assessing the generalizability of VLAs and their potential for successful sim-to-real transfer, a critical step for deploying these agents in real-world applications.

OpenVLA-OFT exhibits a 92% zero-effect rate when subjected to concept ablation testing. This metric signifies the proportion of ablated concepts – individual features or components within the model – that result in no measurable change in the generated action. A high zero-effect rate indicates a robust control policy where the model is not overly reliant on any single concept, and that many of the learned features are redundant or contribute minimally to the final output. This suggests a level of interpretability, as the absence of contribution from specific concepts does not critically impact performance, and the model demonstrates resilience to the removal of potentially noisy or irrelevant features.
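The zero-effect rate reduces to counting ablations whose resulting action is unchanged within tolerance. The sketch below simulates 100 hypothetical concept ablations of which 8 matter; the tolerance, shapes, and split are illustrative assumptions, not the paper's protocol:

```python
import numpy as np

def zero_effect_rate(base_actions, ablated_actions, tol=1e-3):
    """Share of ablated concepts whose removal leaves the action unchanged
    (within tol) -- the metric behind the 92% vs 28% comparison."""
    deltas = np.linalg.norm(ablated_actions - base_actions, axis=1)
    return float((deltas < tol).mean())

rng = np.random.default_rng(6)
base = rng.normal(size=(1, 7))               # the unablated action
baseline = np.repeat(base, 100, axis=0)      # one row per concept ablation
ablated = baseline.copy()
ablated[:8] += rng.normal(size=(8, 7))       # 8 concepts whose removal matters

rate = zero_effect_rate(baseline, ablated)
print(rate)  # → 0.92
```

A high rate (OpenVLA-OFT's 92%) indicates redundancy and robustness to feature removal; a low one (SmolVLA's 28%) indicates a policy sensitive to nearly every learned concept.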

SmolVLA demonstrates a zero-effect rate of 28%, indicating a comparatively higher sensitivity to perturbations in its input data. This metric, calculated during concept ablation testing, reflects the percentage of instances where altering specific input concepts does not result in a corresponding change in the generated action. A lower zero-effect rate, as seen with SmolVLA, suggests that the model’s control policy is more reliant on all input features and less robust to minor variations or noise, potentially limiting its generalization capabilities and sim-to-real transfer performance when compared to models like OpenVLA-OFT which achieved a 92% zero-effect rate.

A comparative analysis of five vision-language-action models reveals that OFT prioritizes a single operational pathway, while SmolVLA and GR00T, along with [latex]\pi_{0.5}[/latex], demonstrate superior pathway specialization capabilities across various benchmarks including baseline success, visual override strength, language sensitivity, SAE fidelity, and cross-task transfer.

The study’s dissection of Vision-Language-Action models, revealing a dominance of visual pathways and separation of motor programs, echoes a fundamental tenet of computational elegance. As Ken Thompson once stated, “Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.” This sentiment aligns with the research’s emphasis on interpretability; the complexity inherent in these models necessitates a rigorous understanding of their internal mechanisms – a ‘provable’ architecture – rather than simply observing successful outputs. The sparse autoencoder analysis serves as a crucial debugging tool, illuminating how these systems encode action primitives and, importantly, where limitations reside in their ability to generalize beyond observed data.

What’s Next?

The demonstrated reliance of Vision-Language-Action models on visual pathways, while unsurprising given the inherent dimensionality of sensory input, highlights a fundamental brittleness. The separation of motor programs from high-level goals, revealed through sparse autoencoder analysis, is not necessarily an architectural flaw, but rather an acknowledgment of the inherent modularity of action. Yet, this modularity introduces vulnerabilities; a disruption in the visual pathway, however minor, risks cascading failure across the entire system. Future work must move beyond merely observing these pathways and focus on establishing provable guarantees regarding their robustness.

The current emphasis on scale, while yielding superficially impressive results, offers diminishing returns. Increasing parameters does not equate to increasing understanding. The field would be better served by a return to first principles – a rigorous mathematical treatment of action primitives and their composition. The goal should not be to build systems that mimic intelligence, but systems that embody provably correct reasoning about the physical world.

Ultimately, in the chaos of data, only mathematical discipline endures. The limitations exposed by this mechanistic study are not dead ends, but rather signposts pointing towards a more elegant, more reliable, and ultimately more correct foundation for artificial intelligence. The true challenge lies not in building models that work, but in building models that can be proven to work.


Original article: https://arxiv.org/pdf/2603.19233.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-03-21 09:55