Seeing is Controlling: Steering Soft Robots with Visual Learning

Author: Denis Avetisyan


Researchers have demonstrated a novel approach to controlling soft continuum robots by directly linking visual input to movement, enabling accurate, camera-free operation.

The system demonstrates precise trajectory tracking, verified using a VON model, across all tested movements except an upswing maneuver, where a discernible discrepancy between commanded and actual pressure inputs exposes a limitation in maintaining control during that specific dynamic phase.

This work introduces a data-driven control method leveraging visually learned latent dynamics and a custom visual oscillator network to achieve stable tracking of visually defined trajectories in soft robots.

Achieving reliable, long-horizon control of soft continuum robots remains challenging due to their inherent complexities and lack of direct actuation feedback. This work, ‘Accurate Open-Loop Control of a Soft Continuum Robot Through Visually Learned Latent Representations’, introduces a novel approach leveraging video-learned latent dynamics and Visual Oscillator Networks (VONs) to enable accurate open-loop control without relying on camera feedback. By mapping visually specified trajectories to interpretable latent waypoints, we demonstrate stable tracking and extrapolated equilibria on a two-segment pneumatic robot, achieving the lowest mean squared errors with a VON and Koopman-based model. Could this data-driven, latent space control paradigm unlock more robust and adaptable soft robotic systems capable of navigating complex, unstructured environments?


Decoding Complexity: The Illusion of Detail

The challenge of understanding complex systems – be it weather patterns, financial markets, or biological networks – often stems from the sheer volume of data required to describe them. These systems are characterized by numerous interacting components, each generating data points that contribute to a high-dimensional space. While seemingly comprehensive, this abundance of information can actually obscure the fundamental principles governing the system’s behavior. The relationships between variables become diluted within this complexity, making it difficult to discern meaningful patterns or predict future states. Consequently, researchers find themselves grappling with a ‘curse of dimensionality’, where computational demands increase exponentially with each added variable, and the signal of true underlying dynamics is lost in noise. Effectively, a detailed description does not necessarily equate to genuine understanding; the core principles remain hidden within the vastness of the data.

A central challenge in systems modeling lies in distilling complex behaviors into a manageable, low-dimensional ‘state’ representation. This condensed view isn’t merely a simplification; it’s the identification of the essential variables that govern the system’s evolution. Imagine a chaotic pendulum – while its precise trajectory is sensitive to initial conditions and atmospheric disturbances, its state can be effectively described by just its angle and angular velocity. Discovering such latent representations allows models to focus on the core dynamics, ignoring irrelevant noise and facilitating accurate predictions and control strategies. This process isn’t always straightforward, as the relevant state variables may not be directly observable, requiring sophisticated techniques to infer them from available data and build a robust understanding of the underlying system.

Conventional approaches to system analysis frequently falter when confronted with the inherent complexities of temporal data. These methods often treat time as a static variable, failing to adequately capture the evolving relationships within a dynamic system – a critical limitation when attempting to forecast future states or exert meaningful control. Consequently, predictions generated from these analyses can be significantly inaccurate, particularly over extended periods, and interventions based on such forecasts may yield unintended or suboptimal results. The inability to discern the full scope of temporal dependencies hinders the development of robust models, limiting their practical utility in real-world applications ranging from climate modeling to financial forecasting and robotic control. This necessitates innovative techniques capable of unlocking the full richness of time-dependent data and revealing the underlying principles governing system behavior.

ABCD-based models demonstrate superior stability and produce more reasonable predictions across diverse stress tests – including static pressure, cosine ramp-up, and dynamic excitation – as evidenced by lower image-space MSEs and more realistic decoded observations compared to alternative approaches.

Unveiling the Hidden State: An Encoder-Dynamics-Decoder Framework

The Encoder-Dynamics-Decoder model functions by initially processing raw observation data – such as images or sensor readings – through an encoder network, which reduces the dimensionality of the input and generates a compressed, lower-dimensional representation known as the latent space. This latent space serves as a compact encoding of the essential information contained within the original observations. Subsequently, the dynamics model predicts the evolution of this latent representation over time, effectively capturing the system’s state transitions in a reduced dimensionality. Finally, a decoder network reconstructs the original observations from the predicted latent state, allowing for the recovery of information from the compressed representation and enabling analysis and prediction within the latent space.
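The encoder-dynamics-decoder pipeline can be sketched with linear stand-ins for the three learned networks. This is a minimal numpy illustration, not the paper’s architecture; all dimensions and weights are invented for the sake of the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 64x64 grayscale observations, an 8-dim latent state.
OBS_DIM, LATENT_DIM = 64 * 64, 8

# Linear stand-ins for the three learned networks (random, untrained weights).
W_enc = rng.normal(scale=0.01, size=(LATENT_DIM, OBS_DIM))   # encoder
A = 0.95 * np.eye(LATENT_DIM)                                # latent dynamics
W_dec = rng.normal(scale=0.01, size=(OBS_DIM, LATENT_DIM))   # decoder

def encode(obs):
    """Compress a raw observation into a low-dimensional latent state."""
    return W_enc @ obs

def step_dynamics(z):
    """Predict the next latent state from the current one."""
    return A @ z

def decode(z):
    """Reconstruct an observation from a latent state."""
    return W_dec @ z

obs = rng.normal(size=OBS_DIM)
z_now = encode(obs)               # observation -> latent
z_next = step_dynamics(z_now)     # advance the latent state one step
recon = decode(z_next)            # latent -> predicted next observation
```

In the actual system the three maps are deep networks trained jointly, but the contract is the same: prediction happens entirely in the small latent space, and decoding is only needed to read the result back out as an image.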

The β-VAE architecture is employed to learn latent representations by optimizing a loss function consisting of a reconstruction loss and a Kullback-Leibler (KL) divergence term. The reconstruction loss, typically measured using mean squared error or binary cross-entropy, ensures the decoder can accurately reconstruct the input from the latent code. The KL divergence term, weighted by the parameter β, regularizes the latent space by forcing the learned latent distribution to remain close to a standard normal distribution [latex]N(0, I)[/latex]. Increasing β encourages a more disentangled and well-structured latent space, at the potential cost of reconstruction accuracy, while decreasing it prioritizes reconstruction fidelity. This balance enables the learning of informative latent representations suitable for downstream tasks and facilitates interpolation and manipulation within the latent space.
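The loss itself is compact. A hedged sketch, using the closed-form KL divergence of a diagonal Gaussian against N(0, I); the shapes and the β value are illustrative, not taken from the paper:

```python
import numpy as np

def beta_vae_loss(x, x_recon, mu, log_var, beta=4.0):
    """Reconstruction MSE plus beta-weighted KL(N(mu, sigma^2) || N(0, I))."""
    recon = np.mean((x - x_recon) ** 2)
    # Closed-form KL divergence for a diagonal Gaussian against N(0, I).
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))
    return recon + beta * kl

rng = np.random.default_rng(1)
x = rng.normal(size=100)
x_recon = x + rng.normal(scale=0.1, size=100)   # an imperfect reconstruction
mu, log_var = np.zeros(8), np.zeros(8)          # latent already matches the prior
loss = beta_vae_loss(x, x_recon, mu, log_var)
```

With `mu = 0` and `log_var = 0` the KL term vanishes, so the example isolates the reconstruction cost; raising β trades reconstruction fidelity for a latent distribution closer to the prior.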

The training process utilizes loss functions specifically designed for static and dynamic image reconstruction to ensure accurate temporal modeling. Static image reconstruction loss minimizes the difference between the input observation and the image decoded from the latent representation at a single time step. Dynamic image reconstruction loss extends this by evaluating the reconstruction error over a sequence of time steps, thereby encouraging the model to learn the system’s temporal dependencies and predict future states. These combined loss functions effectively constrain the latent space to encode information crucial for representing and predicting the evolution of the observed system, improving the quality of learned dynamics.
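One way to see how the two objectives combine is to treat the dynamic loss as the static loss averaged over a predicted rollout. A toy sketch (frame sizes and values are invented):

```python
import numpy as np

def static_recon_loss(frame, decoded):
    """Per-frame loss: decoded image vs. the observation at one time step."""
    return np.mean((frame - decoded) ** 2)

def dynamic_recon_loss(frames, decoded_seq):
    """Sequence loss: average per-frame error over a predicted rollout,
    which pushes the latent dynamics to match the system's evolution."""
    return np.mean([static_recon_loss(f, d) for f, d in zip(frames, decoded_seq)])

frames = np.zeros((5, 8, 8))        # toy 5-step sequence of 8x8 frames
decoded = np.full((5, 8, 8), 0.1)   # a constant (imperfect) reconstruction
total = static_recon_loss(frames[0], decoded[0]) + dynamic_recon_loss(frames, decoded)
```

The static term anchors the encoder-decoder pair at each instant, while the dynamic term penalizes errors that accumulate when the latent model is rolled forward in time.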

Open-loop control experiments demonstrate model performance across trajectories, as measured by image mean squared error (MSE), and reveal the mean absolute error (MAE) of predicted input pressure for multi-step predictions, highlighting the impact of different ablation strategies.
Open-loop control experiments demonstrate model performance across trajectories, as measured by image mean squared error (MSE), and reveal the mean absolute error (MAE) of predicted input pressure for multi-step predictions, highlighting the impact of different ablation strategies.

Stabilizing the System: Latent Consistency and Dynamic Prediction

The Latent Consistency Loss is implemented as a regularization term within the training process, specifically designed to mitigate the effects of perceptual ambiguities and sensor noise. This loss function operates by enforcing proximity between the latent representations of similar input observations; the encoder is penalized for producing disparate latent states given nearly identical inputs. Mathematically, this is often achieved by minimizing a distance metric – such as mean squared error or cosine similarity – between the latent vectors generated from perturbed versions of the same observation. By encouraging a smoother and more stable latent space, the model becomes less sensitive to minor variations in input data and exhibits improved generalization performance to unseen conditions.

The Latent Consistency Loss functions by minimizing the distance between latent state representations generated by the encoder for closely related observations. This is achieved by defining a metric within the latent space and penalizing the encoder for producing disparate representations given similar input data. Specifically, data augmentation or slight perturbations of the input observations are encoded, and the resulting latent states are then compared; reducing the variance in these latent states for similar inputs directly improves the model’s ability to generalize to unseen data and handle noisy or ambiguous observations. This regularization technique enhances the robustness of the learned representations by creating a smoother and more consistent latent space.
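The mechanism reduces to a few lines: encode an observation, encode perturbed copies of it, and penalize the spread among the resulting latent vectors. A minimal sketch, assuming a generic `encode` callable (the perturbation scale and linear toy encoder are invented):

```python
import numpy as np

def latent_consistency_loss(encode, obs, noise_scale=0.01, n_perturb=4, seed=0):
    """Penalize the encoder for mapping near-identical inputs far apart."""
    rng = np.random.default_rng(seed)
    z_ref = encode(obs)
    dists = [
        np.mean((z_ref - encode(obs + rng.normal(scale=noise_scale, size=obs.shape))) ** 2)
        for _ in range(n_perturb)
    ]
    return float(np.mean(dists))

# Toy linear encoder: a smooth map keeps perturbed inputs close in latent space.
W = np.random.default_rng(3).normal(size=(4, 16))
loss = latent_consistency_loss(lambda x: W @ x, np.ones(16))
```

In training this term is added to the reconstruction objectives, so the encoder is rewarded for smoothness as well as fidelity.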

Accurate modeling of dynamics within the learned latent space enables the prediction of future states based on current and past observations. This predictive capability is achieved by training a model to estimate the evolution of latent variables over time, effectively creating a forward model. The resulting predictions can then be used as input to a controller, allowing for interventions that influence the system’s behavior and enabling tasks requiring sequential decision-making. The precision of these predictions is directly correlated to the quality of the learned latent space and the effectiveness of the dynamic model, impacting the performance of any control strategies implemented within this framework.
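When the latent dynamics also take an action input, the forward model becomes a rollout: apply the transition repeatedly to predict a whole trajectory from an initial state and a planned action sequence. A sketch with a hypothetical linear controlled model (the matrices A and B are invented, not learned):

```python
import numpy as np

# Hypothetical linear latent dynamics with control input: z' = A z + B u.
LATENT_DIM, ACTION_DIM = 4, 2
A = 0.9 * np.eye(LATENT_DIM)
B = 0.1 * np.ones((LATENT_DIM, ACTION_DIM))

def rollout(z0, actions):
    """Predict future latent states from an initial state and action sequence."""
    z, traj = z0, []
    for u in actions:
        z = A @ z + B @ u
        traj.append(z)
    return np.array(traj)

z0 = np.zeros(LATENT_DIM)
actions = np.ones((5, ACTION_DIM))   # a 5-step planned input sequence
traj = rollout(z0, actions)          # predicted latent trajectory, shape (5, 4)
```

A controller can evaluate candidate action sequences entirely through such rollouts, without ever leaving the latent space.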

This study validates an open-loop optimal control method for a physical SCR by utilizing a live SCR simulator to generate target states and demonstrate feasibility across various trajectories.

From Prediction to Influence: Open-Loop System Manipulation

Open-loop control of the system is achieved through the utilization of learned latent dynamics, effectively allowing for pre-planned action sequences without relying on real-time feedback. This approach hinges on accurately modeling the system’s behavior – its response to various inputs – and embedding that understanding within a predictive framework. By learning these underlying dynamics, the system can forecast future states and, crucially, determine the sequence of actions needed to reach a desired outcome. This predictive capability allows for the design of control strategies that anticipate system evolution, enabling precise manipulation even in the absence of continuous sensory input or closed-loop correction – a significant advancement for applications where real-time feedback is limited or unreliable.

The system’s ability to manipulate dynamics relies on a control strategy centered around Single-Shooting Optimal Control, a technique designed to efficiently determine the most effective sequence of actions needed to achieve a desired outcome. Unlike iterative methods that refine a control policy over time, this approach formulates the control problem as a single, direct optimization – essentially, it ‘shoots’ a potential control trajectory and evaluates its overall performance. By directly optimizing this entire trajectory at once, the system swiftly identifies the action sequence that minimizes a defined cost function, effectively guiding the system’s evolution over a predicted horizon. This method proves particularly advantageous when dealing with complex, non-linear systems where traditional control techniques may struggle, offering a computationally efficient pathway to precise, open-loop manipulation of the observed dynamics.
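The idea can be made concrete with a minimal single-shooting sketch over a hypothetical linear latent model: the entire action sequence is one decision variable, and a terminal cost is minimized in a single optimization. A real implementation would hand the cost to a proper solver with analytic gradients; simple finite-difference gradient descent stands in here, and every dimension and constant is invented:

```python
import numpy as np

# Hypothetical linear latent model, z' = A z + B u, over a fixed horizon.
LATENT_DIM, ACTION_DIM, HORIZON = 4, 2, 10
A = 0.9 * np.eye(LATENT_DIM)
B = 0.1 * np.ones((LATENT_DIM, ACTION_DIM))

def rollout_final(z0, actions):
    """Roll the latent state forward through the whole action sequence."""
    z = z0
    for u in actions.reshape(HORIZON, ACTION_DIM):
        z = A @ z + B @ u
    return z

def cost(actions, z0, z_goal):
    """Terminal-state cost for a full 'shot' of the action sequence."""
    return np.sum((rollout_final(z0, actions) - z_goal) ** 2)

# Single shooting: optimize the full sequence at once via
# finite-difference gradient descent (a stand-in for a real solver).
z0, z_goal = np.zeros(LATENT_DIM), np.full(LATENT_DIM, 1.0)
actions = np.zeros(HORIZON * ACTION_DIM)
eps, lr = 1e-4, 0.5
for _ in range(200):
    base = cost(actions, z0, z_goal)
    grad = np.zeros_like(actions)
    for i in range(actions.size):
        d = np.zeros_like(actions); d[i] = eps
        grad[i] = (cost(actions + d, z0, z_goal) - base) / eps
    actions -= lr * grad
```

Because the whole trajectory is evaluated per candidate sequence, the approach avoids maintaining a feedback policy at all, which is exactly what makes it suitable for open-loop execution.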

Accurate system modeling is central to enabling control, and this work leverages techniques including Koopman Theory, Extended Dynamic Mode Decomposition (DMD), and Spectral Submanifolds (SSMs) to predict how the system will respond to external actions. These methods facilitate the creation of a predictive model capable of anticipating future states, which is crucial for designing effective control strategies. The resulting model demonstrated high fidelity, achieving a lowest overall image Mean Squared Error (MSE) of [latex]9.80 \times 10^{-3}[/latex] across diverse trajectories – a significant indicator of its predictive power and the efficacy of the chosen modeling approach. This level of accuracy allows for precise manipulation of the system, paving the way for targeted interventions and desired outcomes.
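The Koopman/EDMD idea is that a nonlinear system becomes (approximately) linear after lifting the state through a dictionary of observables, so a linear operator can be fit by least squares. A toy sketch with an invented scalar system and a minimal polynomial dictionary, not the paper’s model:

```python
import numpy as np

def lift(x):
    """Minimal dictionary of observables: [x, x^2]."""
    return np.concatenate([x, x ** 2])

# Invented nonlinear one-step map to generate trajectory data.
def f(x):
    return 0.9 * x - 0.1 * x ** 2

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(200, 1))   # sampled states
Y = f(X)                                # their one-step successors

# EDMD: least-squares fit of a linear operator K on the lifted states.
Phi_X = np.array([lift(x) for x in X])
Phi_Y = np.array([lift(y) for y in Y])
K, *_ = np.linalg.lstsq(Phi_X, Phi_Y, rcond=None)

def predict(x):
    """One-step prediction: lift, apply K, read off the state coordinate."""
    return (lift(x) @ K)[0]
```

Here the dynamics are exactly linear in the chosen observables, so the fitted operator reproduces the map; in practice the art of Koopman-based modeling lies in choosing (or learning) a dictionary rich enough for the system at hand.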

Investigations into open-loop control demonstrate a significant performance advantage for the Visual Oscillator Network (VON) when paired with an ABCD decoder; this combination yielded a remarkably low Image MSE of [latex]9.80 \times 10^{-3}[/latex]. This result surpasses the performance of the Koopman approach utilizing the identical ABCD decoder, which achieved an Image MSE of [latex]1.03 \times 10^{-2}[/latex]. The substantial reduction in error indicates the VON’s superior capacity to model and predict system dynamics, enabling more precise control actions and ultimately, a more accurate reconstruction of desired states. This difference highlights the potential of oscillatory neural networks in enhancing the fidelity of predictive control systems compared to traditional Koopman-based methodologies.

Rigorous ablation studies demonstrate the superior performance of the Visual Oscillator Network (VON) in controlling complex systems, as evidenced by its remarkably low multi-step Mean Squared Error (MSE) of [latex]1.71 \times 10^{-3}[/latex]. This indicates a highly accurate prediction of the system’s evolution over multiple steps, surpassing the performance of Koopman-based control. While both approaches exhibit strong control capabilities, the VON also achieved a competitive Mean Absolute Error (MAE) in pressure measurements of 15.44 kPa, closely following Koopman’s 13.79 kPa, suggesting a nuanced ability to maintain desired system states with precision. These results collectively highlight the VON’s efficacy in both predictive accuracy and stable control, establishing it as a promising architecture for advanced system manipulation.

The research presented dismantles conventional robotic control by eschewing closed-loop feedback, a seemingly rigid requirement. It instead proposes open-loop control guided by visually learned latent representations – a bold maneuver akin to navigating a complex system with only a predictive model. This approach mirrors a core tenet of understanding through deconstruction; the system isn’t reacted to, but anticipated. As Tim Berners-Lee stated, “The Web is more a social creation than a technical one,” and similarly, this work prioritizes learning the inherent dynamics of the robot, rather than imposing external corrective measures. By embracing the inherent properties of the soft continuum robot, researchers unlock a level of control previously constrained by traditional methodologies.

Beyond the Visible Horizon

The demonstrated decoupling of control from immediate visual feedback is, predictably, not an end but a provocation. This work successfully navigates a soft robot through space using predicted states, a feat achieved by distilling observation into a manageable, learned representation. Yet, the very act of distillation introduces a fundamental question: what is lost in translation? The system performs well on specified trajectories, but true autonomy demands resilience to the unforeseen – the unexpected perturbation, the novel environment. The latent space, however elegantly constructed, remains a model, and all models are, by definition, incomplete.

Future investigations will undoubtedly focus on expanding the repertoire of learned dynamics. But a more fruitful avenue might lie in deliberately introducing controlled error. A system that anticipates its own fallibility, that incorporates uncertainty into its predictive framework, is arguably more robust than one striving for unattainable precision. The challenge isn’t simply to map the world, but to understand the limits of that map – to engineer a robot that knows what it doesn’t know.

Ultimately, this research highlights a broader principle: control isn’t about imposing order, but about skillfully exploiting chaos. The soft robot, by its very nature, embodies that principle. The next step isn’t to tame the inherent unpredictability of a continuum body, but to leverage it – to build a system that dances with, rather than resists, the inevitable imperfections of reality.


Original article: https://arxiv.org/pdf/2603.19655.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/


2026-03-23 15:44