Author: Denis Avetisyan
Deploying robots to continuously collect real-world data offers a powerful pathway to improve and adapt foundation models for a wider range of tasks.

This paper introduces the Robot-Powered Data Flywheel framework, demonstrating significant performance gains in vision-language models and multilingual OCR through in-the-wild learning.
Despite the impressive zero-shot capabilities of foundation models, their reliance on static internet data limits performance in messy, real-world settings. This work introduces the Robot-Powered Data Flywheel, a framework where robots autonomously collect data during deployment to continually refine these models. We demonstrate that deploying a mobile manipulator in a library setting not only aids task completion, but also generates a dataset that improves vision-language model performance on both book identification and multilingual optical character recognition. Could this approach unlock a virtuous cycle of robotic learning and adaptation, bridging the gap between simulated and real-world intelligence?
The Inherent Limitations of Empirically Derived Models
Foundation models, despite their impressive ability to perform tasks without explicit training examples – a feat known as zero-shot learning – are fundamentally shaped by the data upon which they are built. These models ingest enormous volumes of internet data, a resource that, while vast, presents inherent biases and limitations in representing the full spectrum of real-world scenarios. This reliance introduces a critical challenge: the data distribution encountered during deployment often diverges significantly from the model’s training data, leading to diminished performance and unreliable outcomes. Consequently, a model trained predominantly on digital content may struggle to accurately interpret information from physical environments or underrepresented demographics, highlighting the need for more diverse and representative datasets to ensure equitable and robust artificial intelligence systems.
The impressive capabilities of pre-trained foundation models are often contingent on a critical factor: the similarity between training data and real-world application. A performance gap emerges when these models encounter data distributions that deviate from those seen during their initial training phase. Because these models learn patterns from the data they are fed, any significant shift in data characteristics, such as image quality, lighting conditions, or the prevalence of specific objects, can lead to a substantial drop in accuracy and reliability. This limitation underscores the importance of anticipating distributional shift when deploying these models in dynamic, unpredictable environments, and highlights the need for techniques that enhance their ability to generalize beyond the confines of their training data.
The capacity of current foundation models to function effectively diminishes considerably when transitioned from curated datasets to the complexities of real-world settings. These models, while proficient with neatly labeled digital information, encounter significant challenges in unstructured environments such as physical libraries or retail stores. Factors like inconsistent lighting, occluded objects, varying perspectives, and the sheer density of items create a data distribution vastly different from their training data. This discrepancy results in diminished accuracy and reliability, hindering the deployment of these models in practical applications requiring robust perception and adaptability. The inability to generalize beyond familiar data highlights a critical limitation, demanding innovative strategies to bridge the gap between artificial intelligence and the unpredictable nature of physical spaces.
The successful integration of artificial intelligence into real-world applications hinges on a system’s ability to adapt to unpredictable environments, a challenge currently limiting the utility of many pre-trained foundation models. Existing models, despite impressive capabilities, often falter when faced with data differing significantly from their training sets; however, recent work demonstrates substantial progress in bridging this performance gap. Specifically, a novel approach to book identification has yielded a dramatic improvement in accuracy, increasing from a baseline of 32.4% to an impressive 71.8%. This advancement highlights the potential for targeted interventions that enhance adaptability and pave the way for more robust and reliable AI deployments in practical settings, moving beyond controlled laboratory conditions and into dynamic, unstructured environments.

A Self-Refining System for Continuous Learning
The Robot-Powered Data Flywheel utilizes robotic systems for the autonomous acquisition of real-world data. This involves deploying robots within target environments to gather data relevant to specific AI tasks, eliminating the reliance on pre-collected, static datasets. Robotic data collection allows for continuous and iterative data sourcing, adapting to the nuances of the operational environment and providing a stream of information for model training. The framework is designed to support various data modalities, including visual, tactile, and auditory data, gathered through onboard sensors and effectors, and focuses on minimizing human intervention in the data acquisition process.
Foundation models are adapted and improved through a process of iterative fine-tuning utilizing data collected from robotic deployments. This involves taking a pre-trained model and further training it on a dataset specific to the target environment and task. Domain-specific adaptation focuses the model’s learning on relevant features and nuances, increasing performance in real-world applications. This process is iterative; data collected after each fine-tuning cycle is used to further refine the model, leading to continuous improvement in accuracy and generalization capabilities. Demonstrated results include an increase in English OCR accuracy from 24.8% to 46.6% and Chinese OCR accuracy from 30.8% to 38.0% through this adaptation process.
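To make the adaptation step concrete, the sketch below fine-tunes an off-the-shelf OCR model (TrOCR) on robot-collected book-spine crops. This is a minimal stand-in under stated assumptions, not the paper's actual recipe: the authors' model choice, hyperparameters, and data pipeline are not specified here, and `spine_dataset` is a hypothetical dataset object.

```python
import torch
from torch.utils.data import DataLoader
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# Stand-in OCR model; the paper's actual VLM and training recipe may differ.
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

def collate(batch):
    # batch: list of (PIL image of a book spine, ground-truth title string)
    images, texts = zip(*batch)
    pixel_values = processor(images=list(images), return_tensors="pt").pixel_values
    labels = processor.tokenizer(list(texts), padding=True,
                                 return_tensors="pt").input_ids
    labels[labels == processor.tokenizer.pad_token_id] = -100  # ignore padding in the loss
    return pixel_values, labels

# `spine_dataset`: a hypothetical torch Dataset of (PIL image, title) pairs
# assembled from one deployment round.
loader = DataLoader(spine_dataset, batch_size=8, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for pixel_values, labels in loader:
    loss = model(pixel_values=pixel_values, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Note that this particular checkpoint targets printed English; the multilingual setting would require a model with appropriate tokenizer coverage, but the loop structure stays the same.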
Continuous data collection within the target operational environment is central to the Robot-Powered Data Flywheel’s learning process. Unlike traditional machine learning approaches reliant on fixed datasets, this system actively gathers data from real-world interactions. This iterative process allows models to refine their understanding of the specific nuances and variations present in the deployment environment. The resulting data informs ongoing model adaptation, enhancing generalization capabilities and reducing the domain gap, the discrepancy between training data and real-world input.
Traditional machine learning relies on static datasets, limiting performance in changing environments. In contrast, the Robot-Powered Data Flywheel draws on continuous data collection from real-world robotic deployments, creating a dynamic system capable of adaptation and improved generalization; the OCR gains reported above were obtained through exactly this kind of iterative refinement on autonomously collected data.
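At a high level, the flywheel reduces to a short loop. The sketch below is purely schematic: every function name is a placeholder for a subsystem described in the paper, not a real API.

```python
# Schematic of the data-flywheel cycle. Every function below stands in for a
# substantial subsystem (deployment, curation, training); none is a real API.
model = load_foundation_model()              # start from a pre-trained checkpoint

for cycle in range(num_cycles):
    episodes = deploy_robot(model)           # robot performs the task, logging sensor data
    dataset = curate(episodes)               # filter, label, and deduplicate the logs
    model = fine_tune(model, dataset)        # domain-adapt on the fresh data
    report(evaluate(model, heldout_set))     # track OCR / identification accuracy per cycle
```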
Real-World Validation: The Scanford Deployment
The Scanford system employs a TidyBot++ mobile robot as its base, providing autonomous navigation capabilities within the library environment. A Franka FR3 robotic arm is mounted on the robot, facilitating precise positioning and manipulation of sensors. Data acquisition is achieved through a wrist-mounted Intel RealSense D435 camera, which captures visual information, and a Unitree L2 LiDAR sensor, used for generating 3D spatial maps and enabling obstacle avoidance. This sensor suite allows the system to systematically scan bookshelves and collect the necessary data for training foundation models.
Using this sensor suite, the Scanford platform methodically scans library bookshelves, capturing both visual imagery and three-dimensional spatial data of the books in situ: the RealSense camera provides RGB and depth information, while the LiDAR generates point clouds for precise spatial mapping. The process systematically records the position, orientation, and visual appearance of each book across a representative library environment, creating a dataset suitable for training computer vision and robotic manipulation algorithms. The resulting data captures variations in lighting, book density, shelf height, and occlusion, reflecting the complexities of a real-world library setting.
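For the camera side of this pipeline, a minimal capture loop with Intel's pyrealsense2 SDK might look like the following. This is a sketch only: the LiDAR integration, robot motion, and logging logic are omitted, and the stream settings are illustrative rather than the deployment's actual configuration.

```python
import numpy as np
import pyrealsense2 as rs

# Configure synchronized color and depth streams from the D435.
pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
pipeline.start(config)

try:
    for _ in range(300):                       # e.g. ~10 s of scanning at 30 fps
        frames = pipeline.wait_for_frames()    # blocks until a coherent frame set arrives
        color = frames.get_color_frame()
        depth = frames.get_depth_frame()
        if not color or not depth:
            continue
        rgb = np.asanyarray(color.get_data())  # H x W x 3 uint8 image
        z = np.asanyarray(depth.get_data())    # H x W uint16, in device depth units (~1 mm)
        # ... hand (rgb, z) to the logging / book-detection pipeline ...
finally:
    pipeline.stop()
```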
Data acquired during the Scanford deployment is applied directly to the training and refinement of foundation models for library book identification. The collected visual and spatial data enhance the models’ ability to recognize books under varying conditions, improving robustness against challenges such as occlusion, differing lighting, and variations in book cover appearance and condition. This iterative training cycle, leveraging real-world library data, aims to increase identification accuracy and generalization beyond what curated or synthetic datasets provide.
The Scanford deployment demonstrated the practical application of robotic data collection within the complex and unstructured environment of a library. Operating autonomously, the system scanned 2103 shelves, saving approximately 18.7 hours of librarian effort that the same task would otherwise have required. This indicates significant potential for automating data acquisition in similar real-world settings, offering both efficiency gains and the reallocation of human effort from repetitive physical tasks to more complex duties.
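A back-of-the-envelope division, assuming the savings scale uniformly across shelves, puts the replaced effort at roughly half a minute per shelf:

```latex
\frac{18.7\,\text{h} \times 3600\,\text{s/h}}{2103\,\text{shelves}} \approx 32\,\text{s of librarian effort per shelf}
```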

Expanding the Horizon: Implications and Future Trajectory
The architecture underpinning Scanford’s success extends far beyond the realm of book organization, offering a blueprint for advancements in fields demanding nuanced visual understanding and flexible response. The demonstrated ability to reliably perceive, interpret, and act upon complex, real-world scenes suggests powerful applications in retail environments – for automated inventory management and planogram compliance – and within logistics, optimizing warehouse operations and package handling. Perhaps most significantly, the framework holds promise for healthcare, potentially assisting with tasks like medical image analysis, robotic surgery, and even automated diagnostics, all areas where robust perception and adaptable intelligence are paramount to both efficiency and accuracy.
The efficiency of vision-language models hinges significantly on the quality and relevance of the data they are trained on, yet acquiring such data is often a laborious and expensive undertaking. Automated curation techniques address this challenge by employing algorithms to filter, validate, and refine datasets, minimizing the reliance on human annotation. These methods can identify and remove noisy or irrelevant data points, correct errors, and even augment existing data with synthetic examples. Consequently, models trained on automatically curated datasets demonstrate improved performance, faster training times, and increased robustness – a critical advantage as these systems scale to tackle increasingly complex real-world problems. By prioritizing data quality through automation, the development cycle is streamlined, and the potential for deploying adaptable and reliable vision-language applications is substantially broadened.
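One common curation recipe, sketched below under the assumption that the model exposes a per-prediction confidence score, keeps only high-confidence, non-duplicate samples as pseudo-labels for the next fine-tuning round. The record schema, threshold, and deduplication choice are all illustrative, not taken from the paper.

```python
import hashlib

def curate(samples, conf_threshold=0.9):
    """Filter robot-collected samples into a training set.

    `samples` is an iterable of dicts with keys (illustrative schema):
      'image_bytes' - raw encoded image,
      'text'        - predicted label,
      'confidence'  - model confidence in [0, 1].
    """
    seen = set()
    kept = []
    for s in samples:
        if s["confidence"] < conf_threshold:
            continue                       # drop low-confidence pseudo-labels
        digest = hashlib.sha1(s["image_bytes"]).hexdigest()
        if digest in seen:
            continue                       # drop exact-duplicate frames
        seen.add(digest)
        kept.append((s["image_bytes"], s["text"]))
    return kept
```

The confidence gate trades dataset size for label quality; lowering the threshold admits more data but risks reinforcing the model's own errors.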
The Scanford framework doesn’t simply excel at replicating the precise task it was trained on (identifying books) but exhibits a remarkable capacity for domain-adjacent generalization. This means the underlying principles learned during book identification can be successfully applied to visually similar, yet distinct, challenges. The system demonstrates an ability to perceive and interpret visual information in a way that transcends specific object categories, allowing it to adapt to tasks like recognizing different types of media, locating objects within complex scenes, or even assisting in inventory management with minimal retraining. This adaptability stems from the framework’s focus on building a robust understanding of visual features and their relationships, rather than memorizing specific instances, paving the way for more flexible and broadly applicable vision-language models.
Ongoing development centers on refining the data flywheel, the self-improving cycle of data generation and model training, to incorporate multilingual Optical Character Recognition (OCR). This expansion isn’t merely about translating text; it aims to unlock knowledge embedded in books and documents across diverse languages, dramatically increasing the volume of accessible training data. Simultaneously, researchers are working to enhance the adaptability of vision-language models, allowing them to generalize more effectively to novel environments and tasks with minimal retraining. This focus on adaptability is crucial for deploying these models in real-world scenarios where perfect data alignment is rare, promising a future where AI can seamlessly interpret and interact with the visual world, regardless of language or context.
The pursuit of a robust Robot-Powered Data Flywheel, as detailed in the study, echoes a pragmatic engineering tenet. Grace Hopper famously stated, “It’s easier to ask forgiveness than it is to get permission.” The sentiment applies directly to the flywheel’s iterative data collection: rather than meticulously pre-labeling vast datasets, the system learns from its interactions with the real world, adapting foundation models through continuous feedback. This approach favors a practical, evolving solution over a theoretically perfect but static one, circumventing the limitations of relying solely on curated data and embracing the inherent messiness of in-the-wild learning. The system’s gains in multilingual OCR demonstrate the power of this approach.
What’s Next?
The notion of a “data flywheel” propelled by robotic agents is superficially appealing, yet rests on a precarious foundation. The presented work demonstrates improvement – a necessary, but insufficient, condition for genuine progress. The gains observed in multilingual OCR, while encouraging, ultimately quantify correlation, not causation. A truly robust system demands formal verification of the data augmentation process itself; simply showing that more data leads to better performance evades the fundamental question of why that data is beneficial.
Future work must address the inherent biases introduced by the robotic data collection process. A robot, by its very nature, samples the world according to its capabilities and programmed objectives – a severely constrained view compared to the true distribution of real-world data. To claim a foundation model is ‘adapted’ is misleading if the adaptation relies on a systematically skewed dataset. Rigorous analysis of the collected data’s statistical properties, and comparison against independently verified ground truth, are paramount.
Ultimately, the field must move beyond empirical demonstration and embrace mathematical formalism. A provably correct algorithm for data augmentation, guaranteeing improvement under defined conditions, is the only path towards a truly elegant and trustworthy system. Until then, the “flywheel” remains a clever mechanism, but one built on shifting sands.
Original article: https://arxiv.org/pdf/2511.19647.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/