The Robot That Learns by Listening and Looking

Author: Denis Avetisyan

CAVER autonomously builds an understanding of the physical world through audiovisual exploration, opening doors for more adaptable and intelligent robotic systems.

CAVER constructs an audiovisual representation through intrinsically motivated interaction. It incrementally builds a nearest-neighbor knowledge network, prioritizes the most uncertain points (identified by comparing their visual features to prior experience), then collects and integrates the corresponding audio samples, growing an understanding of the environment rather than simply perceiving it.

This work introduces CAVER, a curiosity-driven robot capable of learning audiovisual representations for material classification, sound prediction, and manipulation imitation using a KNN-based approach.

While robots excel at vision and manipulation, linking an object’s visual appearance to the sounds it makes during interaction remains a key challenge for truly versatile robotic learning. This work introduces CAVER: Curious Audiovisual Exploring Robot, a novel system designed to autonomously build rich audiovisual representations of objects through self-directed exploration. CAVER combines a specialized impact tool, a learned audiovisual representation, and a curiosity-driven algorithm to efficiently discover correlations between visual and auditory features. Could this approach unlock new capabilities in material classification, audio-based robotic imitation, and ultimately, more intuitive human-robot interaction?

The Fragility of Knowing

Traditional robotics relies on pre-programmed knowledge, limiting adaptability in novel environments. Robots struggle to generalize beyond controlled settings. Truly autonomous systems require active learning through physical interaction – efficiently gathering information and understanding object properties. The core challenge lies in balancing exploration and exploitation. Inefficient exploration wastes effort, while excessive exploitation limits understanding.

CAVER efficiently learns correlations between an object’s visual appearance and acoustic properties through curious exploration, utilizing a KNN model with features from foundation vision models to predict impact sounds and employing distance in visual feature space to identify and sample the most uncertain interaction points for building an informative audio-visual representation.

A system that never breaks is, in its own way, already dead.

Echoes in the Material World

CAVER, a robotic material discovery system, strikes objects with an ‘Impact Tool’ and perceives materials by analyzing the resulting sounds. Exploration isn’t random: under ‘Uncertainty-Guided Exploration’, an intrinsic ‘Curiosity’ signal steers CAVER toward points whose visual features lie farthest from anything it has already sampled, with the aim of maximizing information gain. A ‘Farthest-First Selection’ method spreads those samples out for broad surface coverage.
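The selection step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes uncertainty is the Euclidean distance from a candidate's visual feature to its nearest already-explored feature, and the names `uncertainty`, `select_next_points`, and `budget` are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def uncertainty(candidate, explored):
    """Distance from a candidate's visual feature to its nearest explored feature.

    Assumption: with no prior experience, every candidate is maximally uncertain.
    """
    if not explored:
        return math.inf
    return min(euclidean(candidate, e) for e in explored)

def select_next_points(candidates, explored, budget):
    """Greedy farthest-first selection of up to `budget` candidate indices.

    Each pick is the candidate farthest from everything explored so far,
    including picks made earlier in this call, so the selections spread out.
    """
    explored = list(explored)  # copy so the caller's list is untouched
    remaining = list(range(len(candidates)))
    chosen = []
    while remaining and len(chosen) < budget:
        best = max(remaining, key=lambda i: uncertainty(candidates[i], explored))
        chosen.append(best)
        explored.append(candidates[best])
        remaining.remove(best)
    return chosen

# Toy 2-D "visual features"; real embeddings would be high-dimensional.
explored = [(0.0, 0.0)]
candidates = [(0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 3.0)]
print(select_next_points(candidates, explored, budget=2))  # prints [2, 3]
```

Note that the second pick is index 3 rather than index 1: once the point at (5.1, 5.0) has been chosen, its close neighbor at (5.0, 5.0) is no longer unfamiliar, which is how farthest-first selection avoids sampling the same region twice.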

The unified audio-visual embeddings developed by CAVER demonstrate significantly improved material classification accuracy compared to relying solely on visual or audio data, indicating that incorporating both modalities is crucial for robust performance across diverse environments like the kitchen, garage, and playroom.

CAVER integrates audio and visual data through unified embeddings, improving material classification accuracy and demonstrating the importance of multi-sensory perception for robust performance.
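To see why a unified embedding can help, consider a toy sketch in which two materials look alike but sound different. The fusion here is assumed to be plain concatenation followed by a nearest-neighbor majority vote; the article does not specify how CAVER fuses the modalities, and `unify`, `knn_classify`, and the sample values are hypothetical.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def unify(visual, audio):
    """Assumed fusion: plain concatenation of the two feature vectors."""
    return list(visual) + list(audio)

def knn_classify(query, labelled, k=3):
    """Majority vote over the k labelled embeddings nearest to `query`."""
    ranked = sorted(labelled, key=lambda item: euclidean(item[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy data: similar visual features, clearly different audio features.
samples = [
    (unify([0.5, 0.5], [0.9, 0.1]), "metal"),
    (unify([0.5, 0.6], [0.8, 0.2]), "metal"),
    (unify([0.5, 0.5], [0.1, 0.9]), "plastic"),
    (unify([0.6, 0.5], [0.2, 0.8]), "plastic"),
]
query = unify([0.5, 0.5], [0.85, 0.15])
print(knn_classify(query, samples, k=3))  # prints metal
```

The visual halves of these vectors barely separate the two classes, so the audio half carries the decision. In practice the two modalities would likely need scaling so that neither dominates the distance.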

Building a Resonance Map

CAVER constructs an ‘Audiovisual Representation’ linking interaction points, ‘Visual Embeddings’, and ‘Audio Descriptors’ – the foundation for correlating visual stimuli with auditory feedback. ‘Visual Embeddings’ are extracted using ‘Grounded SAM’ and ‘ResNet’ to capture multi-scale object features. ‘Audio Descriptors’ are derived from impact sounds using ‘Mel-Frequency Cepstral Coefficients’ (MFCCs). A ‘KNN Model’ enables retrieval in both directions – visual to audio and vice versa – creating a fully connected representation.
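The two-way retrieval described above might look like the following sketch. It assumes the embeddings and descriptors are already computed (the Grounded SAM, ResNet, and MFCC steps are omitted) and uses short toy vectors in their place; the class `AudiovisualMemory` and its methods are hypothetical names, not the paper's API.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class AudiovisualMemory:
    """Stores (interaction point, visual embedding, audio descriptor) triples."""

    def __init__(self):
        self.entries = []

    def add(self, point, visual, audio):
        self.entries.append({"point": point, "visual": visual, "audio": audio})

    def nearest(self, query, key, k=1):
        """Return the k entries whose `key` vector is closest to `query`."""
        ranked = sorted(self.entries, key=lambda e: euclidean(e[key], query))
        return ranked[:k]

    def predict_audio(self, visual_query, k=1):
        """Visual to audio: average the audio descriptors of the k nearest visual neighbors."""
        neighbors = self.nearest(visual_query, "visual", k)
        if not neighbors:
            raise ValueError("memory is empty")
        dims = len(neighbors[0]["audio"])
        return [sum(n["audio"][d] for n in neighbors) / len(neighbors) for d in range(dims)]

    def find_points(self, audio_query, k=1):
        """Audio to visual: interaction points whose recorded sound is closest to the query."""
        return [n["point"] for n in self.nearest(audio_query, "audio", k)]

# Toy vectors standing in for real visual embeddings and MFCC descriptors.
memory = AudiovisualMemory()
memory.add("mug_rim", [1.0, 0.0], [0.9, 0.1])
memory.add("mug_body", [0.9, 0.1], [0.7, 0.3])
memory.add("wood_table", [0.0, 1.0], [0.1, 0.8])

print(memory.predict_audio([0.95, 0.05], k=2))  # roughly [0.8, 0.2]
print(memory.find_points([0.0, 1.0], k=1))      # prints ['wood_table']
```

Because both directions query the same stored triples, adding one new interaction improves visual-to-audio prediction and audio-to-visual lookup at once, which fits the incremental way the representation is described as growing.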

A 3D-printed, spring-loaded impact tool was designed and integrated with the robot’s gripper to generate consistent impact sounds, leveraging a cam-follower mechanism to build spring tension and deliver a controlled impact, with a directional microphone recording the resulting acoustic signal.

The Ghosts in the Machine

CAVER demonstrates robust ‘Material Classification’ with 87% accuracy, learned through audiovisual correlations. This stems from training involving diverse material interactions and soundscapes. The system extends beyond identification, exhibiting ‘Audio Prediction’ capabilities and suggesting an internal model of physical interactions. CAVER also performs ‘Action Recognition’ with 42% accuracy, surpassing a human baseline.

CAVER achieves 66% accuracy in audio-based imitation of melodies, connecting sound and action in a coordinated manner. Every action, once performed, echoes in the system, waiting for a future resonance.

The development of CAVER highlights a crucial tenet of complex systems: predictability is often an illusion. As the robot explores and learns audiovisual representations, it doesn’t simply acquire knowledge, it cultivates an understanding through interaction. This resonates with a remark often attributed to Dijkstra, “It’s not enough to have a good idea; you have to make it work.” CAVER isn’t pre-programmed with material properties or sound associations; it discovers them through curiosity-driven exploration, iteratively refining its predictions. The system’s capacity for audio prediction and material classification emerges not from rigid design, but from a forgiving interplay between sensory input and learned associations, mirroring a garden where resilience stems from adaptation rather than control.

What Lies Ahead?

The creation of CAVER, a system capable of linking observation and consequence through audiovisual means, does not solve the problem of autonomy—it merely relocates it. The robot learns to predict sound, to classify material. But prediction is anticipation of failure, and classification is the first step toward constraint. Each successful categorization is a narrowing of possibility, a pre-commitment to a future state. The system’s curiosity, its drive to explore, is itself a dependency—a need for novel stimuli that will, inevitably, diminish.

Future iterations will undoubtedly refine the representation, increase the granularity of classification, and broaden the repertoire of manipulation. Yet, these are improvements to the surface of the problem. The core issue remains: a complex system, however elegantly constructed, accrues dependencies faster than it generates resilience. CAVER, like all such creations, will eventually encounter a stimulus for which its learned representation is inadequate – a sound it cannot predict, a material it cannot classify, a manipulation it cannot imitate.

The true challenge, then, isn’t building a robot that learns more, but designing systems that gracefully accommodate—even invite—their own eventual failures. The goal isn’t to eliminate uncertainty, but to distribute its impact. Every connection made is a potential point of systemic collapse; every learned association a prophecy of brittle dependence.

Original article: https://arxiv.org/pdf/2511.07619.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
