Robots That Reason: A New Approach to Vision and Action

Author: Denis Avetisyan


Researchers have developed a novel model that allows robots to plan and execute complex manipulation tasks by reasoning about the world around them in a more human-like way.

The LaST0 framework establishes a unified vision-language-action model built on a dual-system architecture: a slow reasoning expert and a fast-acting expert that interact via shared self-attention. The former constructs a spatio-temporal latent chain-of-thought (CoT) through autoregressive prediction of latent tokens, while the latter generates actions via flow matching conditioned on high-frequency observations and periodically updated latent representations. A training procedure with staged parameter updates ensures both latent reliability and robust action generation.
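To make the flow-matching step concrete, here is a toy sketch, not the authors' implementation: we assume a perfectly learned velocity field for a single target action chunk `a_star`, namely the straight-line conditional flow v(x, t) = (a_star − x) / (1 − t), and integrate it from Gaussian noise with Euler steps. In practice the velocity field is a neural network conditioned on observations and latent tokens; everything below is illustrative.

```python
import numpy as np

# Toy flow-matching sampler (NOT the LaST0 implementation).
# Assumption: the velocity field is known in closed form for one target
# action chunk a_star, v(x, t) = (a_star - x) / (1 - t), the straight-line
# conditional flow. Inference integrates this ODE from noise with Euler steps.

def generate_action(a_star, n_steps=10, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(a_star.shape)   # start from noise at t = 0
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        v = (a_star - x) / (1.0 - t)        # velocity pointing along the flow
        x = x + dt * v                      # Euler integration step
    return x

a_star = np.array([0.3, -0.7, 1.2])         # hypothetical 3-DoF action chunk
action = generate_action(a_star)
print(np.allclose(action, a_star))          # the straight-line flow lands on a_star
```

Because the straight-line flow contracts toward the target at every step, the Euler trajectory reaches `a_star` regardless of the noise sample; a learned velocity field would only approximate this behavior.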

LaST0 leverages a latent spatio-temporal chain-of-thought within a dual-system architecture to enable efficient and coherent robotic vision-language-action control.

Explicit reasoning is often a bottleneck in robotic manipulation, hindering the temporal responsiveness needed for complex tasks. This limitation motivates ‘LaST0: Latent Spatio-Temporal Chain-of-Thought for Robotic Vision-Language-Action Model’, which introduces a framework leveraging a compact latent space to model future visual dynamics, 3D structure, and robot proprioception. By enabling temporally consistent, implicit reasoning via a dual-system Mixture-of-Transformers architecture, LaST0 achieves improved action accuracy and substantially faster inference compared to existing Vision-Language-Action models. Could this approach unlock more adaptable and efficient robotic systems capable of truly interactive manipulation?


The Inevitable Bottleneck: Reasoning Beyond Brute Force

Many language models currently function by breaking down problems into discrete, symbolic steps – a process akin to meticulously tracing a logical argument. While effective for certain tasks, this explicit reasoning can quickly become computationally prohibitive as problem complexity increases. Each step demands processing power and memory, leading to significant slowdowns and increased resource demands. Furthermore, this approach proves brittle; even minor variations in input phrasing or problem structure can disrupt the carefully constructed symbolic chain, leading to errors. This reliance on precise, step-by-step deduction limits the model’s ability to generalize and adapt to novel situations, hindering robust performance in real-world applications where ambiguity and nuance are commonplace.

Current artificial intelligence systems often struggle with reasoning tasks due to the limitations of their working memory. While increasing model size – and therefore explicit memory – can improve performance, this approach quickly becomes unsustainable and computationally prohibitive. A significant hurdle in advancing AI lies in developing methods that enable efficient reasoning without continually expanding the model’s memory footprint. Researchers are actively investigating techniques to compress information and perform inferences within a more constrained, lower-dimensional space, mirroring the efficiency of human cognition. This pursuit involves not just storing facts, but also learning to prioritize relevant information and generalize from limited data, effectively allowing the system to ‘think’ smarter, not just ‘remember’ more.

Current research increasingly focuses on enabling reasoning processes within compressed representational spaces, moving beyond the limitations of explicitly storing and manipulating symbolic information. This approach acknowledges that effective reasoning doesn’t necessarily require vast memory; instead, intelligence may reside in the ability to distill information into essential features and perform computations on these lower-dimensional embeddings. By projecting complex problems into a compact space, models can potentially achieve comparable, or even superior, reasoning performance with significantly reduced computational cost and improved generalization capabilities. This shift necessitates the development of novel techniques – such as information bottleneck methods and dimensionality reduction algorithms – capable of preserving the core relational structure of knowledge while discarding irrelevant details, ultimately paving the way for more efficient and scalable artificial intelligence systems.
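The core intuition, that a compact space can preserve what matters, can be illustrated with a minimal sketch. The dimensions below are hypothetical: we generate states that secretly live on a 4-D manifold embedded in 256 dimensions, compress them 64-fold via a rank-4 SVD projection, and confirm that essentially nothing is lost.

```python
import numpy as np

# Sketch: compress high-dimensional states into a compact latent space and
# verify the essential structure survives. All dimensions are hypothetical.
rng = np.random.default_rng(0)
Z = rng.standard_normal((200, 4))           # "true" 4-D latent factors
W = rng.standard_normal((4, 256))           # lift them into a 256-D observation space
X = Z @ W                                   # observed high-dimensional states

# Project onto the top-4 singular directions: a 64x compression.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
X_compact = U[:, :4] * S[:4]                # 4-D code, one per state
X_rebuilt = X_compact @ Vt[:4]              # map back to 256-D

err = np.linalg.norm(X - X_rebuilt) / np.linalg.norm(X)
print(f"relative reconstruction error: {err:.2e}")  # ~0: nothing essential lost
```

Real observations are not exactly low-rank, of course; the point is only that when structure is low-dimensional, computation can happen in the compact code rather than the raw space.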

The model combines a fast-acting expert and a slow-reasoning expert, trained with varying reasoning-to-action frequencies, to flexibly capture long-horizon dependencies and generate appropriate responses from both high-frequency observations and periodically updated latent knowledge.

Encoding Thought: Reasoning Within a Latent Space

Latent Spatio-Temporal Chain-of-Thought (LaST CoT) represents a departure from traditional reasoning methods by embedding the reasoning process within a low-dimensional, continuous latent space. This technique avoids discrete symbolic representations and instead models reasoning as a trajectory through this space, where each point represents a state in the reasoning process. By operating directly in the latent space, LaST CoT aims to improve computational efficiency and enable the capture of nuanced relationships inherent in complex reasoning tasks. The dimensionality reduction inherent in using a latent space allows for a more compact representation of the reasoning process, potentially facilitating generalization and reducing the computational burden associated with manipulating large symbolic structures.

LaST CoT represents reasoning processes as continuous trajectories within a latent space, enabling the efficient capture of complex spatio-temporal dynamics. This encoding allows the model to represent sequential reasoning steps as movements through the latent space, where the position at any given point reflects the accumulated reasoning state. By leveraging the properties of this space, LaST CoT avoids the computational bottlenecks associated with discrete symbolic manipulation, particularly when dealing with extended reasoning chains. The continuous representation facilitates the modeling of dependencies between reasoning steps and allows for generalization to unseen scenarios by interpolating within the learned latent trajectories. This approach is particularly effective in tasks requiring the integration of information over time, such as video understanding or robotic navigation, where the temporal evolution of states is crucial.

Traditional reasoning systems often rely on explicit symbolic manipulation, requiring discrete representations and rule-based operations. LaST CoT diverges from this approach by representing reasoning as continuous trajectories within a latent space, mirroring the analog computational properties observed in biological neural networks. This shift offers computational advantages due to the efficiency of operating on continuous representations, and avoids the combinatorial explosion associated with symbolic search. Furthermore, the latent space formulation allows for generalization to unseen scenarios by interpolating between learned trajectories, a capability not readily available in systems dependent on precise symbolic matching.
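The notion of reasoning as a trajectory can be sketched as an autoregressive rollout in latent space. The weights below are random stand-ins rather than trained parameters, and the dimensions are hypothetical; the point is only that each "reasoning step" is a continuous update of a latent vector rather than a discrete symbolic operation.

```python
import numpy as np

# Toy autoregressive rollout of a latent chain-of-thought. Weights are random
# stand-ins, NOT trained parameters. Each step predicts the next latent token
# from the current one plus a fixed task embedding, tracing a continuous
# trajectory through latent space instead of emitting symbolic steps.
rng = np.random.default_rng(0)
d = 16                                          # hypothetical latent width
A = rng.standard_normal((d, d)) / np.sqrt(d)    # latent transition map
B = rng.standard_normal((d, d)) / np.sqrt(d)    # task conditioning map
task = rng.standard_normal(d)                   # e.g. an instruction embedding

z = rng.standard_normal(d)                      # initial latent token
trajectory = [z]
for _ in range(8):                              # 8 reasoning steps
    z = np.tanh(A @ z + B @ task)               # next point on the trajectory
    trajectory.append(z)

traj = np.stack(trajectory)                     # (9, 16): the latent CoT
print(traj.shape)
```

In LaST0 the analogous update is performed by a transformer predicting latent tokens autoregressively, but the shape of the computation, state in, next state out, is the same.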

Attention heatmaps reveal that incorporating chain-of-thought (CoT) reasoning, either explicitly in CoT-VLA or via the LaST CoT method, sharpens the model’s focus compared to the baseline LaST0 model without reasoning.

LaST0: A Dual-System Architecture for Efficient Action

LaST0 is a Vision-Language-Action (VLA) model distinguished by its dual-system architecture, built upon the Latent Spatio-Temporal Chain-of-Thought (LaST CoT) framework. This design enables efficient ‘reason-before-act’ behavior by separating the reasoning and acting processes within the model. Specifically, LaST0 leverages a latent space to represent environmental observations and planned actions, allowing for abstract reasoning prior to executing physical commands. This contrasts with single-system VLAs, which often map observations directly to actions, potentially limiting complex planning and adaptability in dynamic environments. The dual-system approach facilitates more deliberate and informed decision-making, enhancing performance on tasks requiring multi-step reasoning and manipulation.

LaST0 employs a Mixture-of-Transformers (MoT) architecture, a sparsely-activated expert model, to enhance information encoding and processing within its latent space. This approach divides the model’s parameters into multiple “experts,” and a gating network selectively routes input tokens to a small subset of these experts for processing. By activating only relevant parameters based on the input, MoT reduces computational cost and allows the model to scale capacity without a proportional increase in inference time. This selective activation enables LaST0 to effectively represent and manipulate complex relationships inherent in reasoning and manipulation tasks, contributing to its improved performance and efficiency.
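The routing behavior described above can be sketched with a minimal top-1 mixture-of-experts layer. The dimensions, weights, and top-1 choice below are illustrative, not the actual Mixture-of-Transformers used by LaST0; the sketch shows only the mechanism by which a gating network activates one expert's parameters per token.

```python
import numpy as np

# Minimal top-1 expert routing in the spirit of a sparsely activated
# mixture-of-experts layer (dimensions and weights are illustrative, NOT
# the actual Mixture-of-Transformers used by LaST0).
rng = np.random.default_rng(0)
n_tokens, d, n_experts = 6, 8, 4
x = rng.standard_normal((n_tokens, d))

W_gate = rng.standard_normal((d, n_experts))            # gating network
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]

logits = x @ W_gate
chosen = logits.argmax(axis=1)                          # top-1 expert per token
gates = np.exp(logits) / np.exp(logits).sum(1, keepdims=True)

y = np.zeros_like(x)
for e in range(n_experts):
    idx = np.where(chosen == e)[0]                      # tokens routed to expert e
    if idx.size:                                        # only this expert's weights run
        y[idx] = gates[idx, e, None] * (x[idx] @ experts[e])

print(y.shape, np.bincount(chosen, minlength=n_experts))
```

Each token touches exactly one expert's weight matrix, so compute per token stays constant while total capacity grows with the number of experts, the property the paragraph above attributes to MoT.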

LaST0 utilizes DeepSeek-LLM 1B as its core language model, providing the foundational reasoning capacity for the entire system. This 1 billion parameter language model was selected to balance performance with computational efficiency, serving as the basis for the subsequent Mixture-of-Transformers (MoT) architecture. The MoT then builds upon DeepSeek-LLM 1B’s inherent capabilities, encoding and processing information within a latent space to facilitate the ‘reason-before-act’ behavior characteristic of LaST0. Effectively, DeepSeek-LLM 1B provides the pretrained weights and knowledge that enable LaST0 to approach and solve complex manipulation tasks.

Evaluations of LaST0 on real-world manipulation tasks demonstrate a mean success rate of 72%. This performance represents a substantial improvement over baseline models including SpatialVLA, which achieved a 39% success rate, π0.5 at 59%, and CoT-VLA with 53%. These results indicate LaST0’s enhanced ability to successfully complete manipulation tasks compared to existing methodologies, as quantified by the observed success rates across the tested task set.

Evaluation of LaST0 on the RLBench benchmark yielded an 82% success rate in completing assigned tasks. This performance represents a considerable advancement over existing vision-language-action (VLA) models; comparative results show LaST0 significantly surpasses previous state-of-the-art approaches on the same benchmark. The achieved success rate demonstrates the model’s capacity for robust performance in a standardized robotic manipulation environment, highlighting its ability to effectively translate visual input and language instructions into successful physical actions.

LaST0 exhibits consistent performance across multiple execution steps, achieving success rates of 0.66 for tasks requiring one execution, 0.47 for two executions, and 0.33 for three executions. This indicates the model’s ability to maintain accurate reasoning and action planning over extended horizons, a capability where it surpasses the performance of the π0.5 model in similar multi-step task scenarios. These success rates were determined through empirical evaluation on benchmark tasks designed to assess long-term sequential decision-making.

LaST0 achieves an inference speed of 15.4 Hz, representing a significant performance increase over the CoT-VLA model, which operates at 1.1 Hz. This approximately 14-fold improvement in inference speed allows for faster response times and more efficient execution of reasoning tasks. The increased speed is a direct result of the model’s architecture and optimizations, enabling more real-time application potential compared to prior VLA approaches.

An ablation study of LaST0 reveals that performance, measured as average success rate across 10 RLBench tasks, is most sensitive to the inclusion of diverse latent modalities, sufficient tokens per modality, broad temporal coverage in reasoning, and frequent collaboration between reasoning and action experts.

The presented LaST0 model embodies a pragmatic acknowledgement of systemic decay. Any improvement, as demonstrated by the model’s efficient spatio-temporal reasoning, ages faster than expected, necessitating continuous refinement. LaST0’s dual-system architecture, with its compact latent space, attempts to mitigate this decay by prioritizing temporally coherent control – a journey back along the arrow of time when faced with unforeseen circumstances. Vinton Cerf aptly observes, “If you don’t see a future for the Internet, you aren’t seeing very far.” This resonates with the core principle of anticipating, and gracefully managing, the inevitable entropy inherent in complex systems like robotic vision-language-action models.

The Long View

LaST0, with its compact latent representation of spatio-temporal reasoning, represents a necessary, if incremental, step toward robust robotic agency. The architecture acknowledges a fundamental truth: efficient action isn’t born of brute force computation, but of distilling experience into manageable form. Yet, the current iteration, like all initial architectures, is inherently brittle. The latent space, while promising, remains a black box, susceptible to the inevitable drift of real-world entropy. Every delay in fully understanding its dynamics is the price of building a system that doesn’t merely perform tasks, but endures them.

Future work must address the fragility inherent in any model reliant on pre-defined tasks. A truly adaptive system will not simply chain together learned actions, but actively reshape its internal representation of the world. The pursuit of a universally applicable latent space feels, perhaps, overly ambitious-a quest for static perfection in a dynamic universe. More fruitful may be the development of mechanisms for continual, self-supervised refinement, allowing the system to gracefully accommodate the unexpected.

Ultimately, architecture without history is fragile and ephemeral. LaST0 establishes a foundation, but the true test lies in its capacity for accumulated experience. The longevity of any VLA model will not be measured in benchmark scores, but in its ability to degrade slowly, learning not just how to act, but when to adapt, and, crucially, what to forget.


Original article: https://arxiv.org/pdf/2601.05248.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-01-09 23:46