Beyond Real-World Data: Building Smarter Mobile Agents

Author: Denis Avetisyan


A new framework, OpenMobile, tackles the challenge of training robust mobile agents by generating synthetic data and incorporating error recovery mechanisms.

OpenMobile facilitates robust task completion by first constructing a comprehensive environmental memory, then synthesizing context-aware instructions from both short- and long-term recollections, and finally employing an error-intervention policy that leverages expert correction when the agent deviates from successful execution.

OpenMobile leverages data synthesis and vision-language models to create diverse training data for Android applications, achieving competitive performance with systems trained on closed datasets.

Despite recent advances in vision-language models for mobile robotics, a lack of open datasets and transparency in training methodologies hinders broader research progress. To address this, we introduce OpenMobile: Building Open Mobile Agents with Task and Trajectory Synthesis, a framework for generating high-quality, diverse instructions and agent trajectories, incorporating a novel policy-switching strategy to capture crucial error-recovery data. Our synthetic data enables agents, including fine-tuned Qwen2.5-VL and Qwen3-VL, to achieve competitive performance on dynamic mobile agent benchmarks, surpassing existing open-data approaches on AndroidWorld. Will this open-source approach unlock a new era of accessible and robust mobile agent research, and how can synthetic data further bridge the gap between simulated and real-world performance?


Bridging the Gap Between Simulation and Reality in Mobile AI

Conventional methods of training mobile agents often depend on pre-collected, static datasets, a practice that fundamentally clashes with the ever-shifting nature of real-world mobile applications. These datasets represent only a snapshot in time, unable to account for frequent app updates, evolving user interfaces, and the unpredictable diversity of user interactions. Consequently, agents trained on such limited data struggle to adapt to even minor deviations from the training environment, exhibiting a lack of robustness and hindering their ability to perform reliably in dynamic, authentic scenarios. The inherent rigidity of static datasets creates a significant ‘reality gap’, limiting the practical applicability of these agents and necessitating more adaptive training methodologies.

The limitations of current mobile AI agents become strikingly apparent when confronted with the nuances of real-world applications, revealing a fragility stemming from poor generalization. These agents, meticulously trained on fixed datasets, often falter when encountering even minor deviations in app layouts or user interactions – a misplaced button, a slightly altered text prompt, or an unexpected user gesture can disrupt performance. This brittleness isn’t a matter of simple error; it indicates a failure to truly understand the underlying task, instead relying on memorized patterns. Consequently, agents struggle to adapt to the inherent variability of mobile environments, hindering their ability to function reliably outside the carefully controlled conditions of the training phase and ultimately limiting their practical utility.

Our models achieve state-of-the-art task success rates on dynamic mobile agent benchmarks, demonstrating improved performance with scaled data and enhanced error recovery in live environments.

OpenMobile: Synthesizing Data for Robust Mobile Agents

OpenMobile addresses the challenge of creating comprehensive training datasets for mobile agents by separating the processes of task synthesis and instruction generation. Traditional methods often link these, limiting the diversity of scenarios encountered during training. By decoupling them, OpenMobile enables the independent creation of tasks – defining what the agent should do – and instructions – detailing how to accomplish those tasks. This allows for a systematic exploration of the application’s functionality, generating a wider range of training examples and ensuring broad coverage of available actions and potential user interactions. The resulting data is not constrained by a fixed relationship between task and instruction, leading to more robust and adaptable agent behavior.
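The decoupling described above can be sketched in a few lines. This is a hypothetical illustration, not OpenMobile's actual API: tasks (what the agent should do) and instruction phrasings (how the task is described) are generated by independent stages and then combined, so each stage can be scaled or diversified on its own.

```python
# Illustrative sketch of decoupled task and instruction synthesis.
# All names and templates here are invented for demonstration.

def synthesize_tasks(app_actions):
    """Enumerate candidate tasks from an app's discoverable actions."""
    return [{"action": a, "target": t} for a, t in app_actions]

def synthesize_instructions(task, templates):
    """Render one task into several natural-language instructions."""
    return [tpl.format(**task) for tpl in templates]

actions = [("tap", "Settings"), ("toggle", "Wi-Fi")]
templates = ["{action} the {target} item", "please {action} {target}"]

dataset = [
    (task, instr)
    for task in synthesize_tasks(actions)
    for instr in synthesize_instructions(task, templates)
]
# Two tasks x two phrasings -> four distinct training pairs.
```

Because task coverage and linguistic diversity grow independently, the number of training pairs scales multiplicatively rather than being fixed by a one-to-one task/instruction pairing.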

OpenMobile utilizes a ‘Global Environment Memory’ (GEM) to facilitate comprehensive data generation for mobile agent training. The GEM is a structured repository documenting all discoverable actions within an application, alongside the resulting state changes. This is achieved through systematic exploration – a process of initiating actions and recording the subsequent application states – allowing OpenMobile to map the entire action space. The GEM is not simply a static record; it is dynamically updated as new actions and states are discovered during exploration, ensuring a complete and accurate representation of the application’s functionality. This exhaustive mapping is critical for generating diverse training scenarios and identifying edge cases for robust agent development.
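A minimal sketch of such a memory, assuming a graph-like store keyed by UI state (the class and method names are illustrative, not OpenMobile's implementation):

```python
# Hypothetical Global Environment Memory: a transition map recording
# which actions were observed in each state and where they lead.

class GlobalEnvironmentMemory:
    def __init__(self):
        self.transitions = {}  # (state, action) -> next_state

    def record(self, state, action, next_state):
        """Log an explored transition; revisits overwrite stale entries,
        keeping the map current as the app changes."""
        self.transitions[(state, action)] = next_state

    def actions_from(self, state):
        """All actions discovered so far from a given state."""
        return [a for (s, a) in self.transitions if s == state]

gem = GlobalEnvironmentMemory()
gem.record("home", "tap:Settings", "settings")
gem.record("settings", "toggle:Wi-Fi", "settings")
gem.record("home", "tap:Clock", "clock")
```

Overwriting on revisit is one simple way to realize the dynamic-update property described above: stale transitions are replaced as exploration re-encounters a state.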

Policy-Switching Rollout in OpenMobile is a technique used to generate data specifically highlighting agent failure cases. The process involves intermittently switching between the agent’s current policy and a randomly selected, alternative policy during data collection. This intentional introduction of errors forces the agent to navigate unexpected states and allows the system to identify instances where the agent deviates from successful task completion. These identified failure points, representing scenarios requiring correction, are then flagged for expert annotation, creating a dataset focused on error recovery and improving the robustness of the training process. The resulting data prioritizes situations where the agent demonstrably struggles, offering a targeted source of learning signals.
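The rollout logic can be sketched as follows. This is a simplified stand-in, assuming the switch is a per-step coin flip and that deviating steps are the ones flagged for annotation; the actual switching schedule in OpenMobile may differ.

```python
import random

# Sketch of a policy-switching rollout: with probability switch_prob,
# a step is taken by a perturbing policy instead of the agent, and the
# resulting deviations are flagged as candidate failure points.

def rollout(agent_policy, perturb_policy, env_step, start_state,
            steps=10, switch_prob=0.3, rng=None):
    rng = rng or random.Random(0)
    state, flagged = start_state, []
    for t in range(steps):
        if rng.random() < switch_prob:
            action = perturb_policy(state)
            flagged.append((t, state, action))  # flag for expert review
        else:
            action = agent_policy(state)
        state = env_step(state, action)
    return state, flagged

# Toy usage: the agent always steps +1, the perturbation steps -1.
final, flagged = rollout(
    agent_policy=lambda s: 1,
    perturb_policy=lambda s: -1,
    env_step=lambda s, a: s + a,
    start_state=0,
)
```

The flagged steps are exactly where the trajectory diverged from the agent's own policy, giving annotators a targeted queue of recovery situations rather than a uniform sample of rollout data.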

Scaling synthesized instructions improves functionality coverage on AndroidWorld tasks, with OpenMobile consistently outperforming the coupled baseline, and simpler tasks with greater synthetic coverage demonstrating higher success rates.

Refining Agent Learning Through Error Intervention and Reinforcement

The Error Intervention strategy, implemented within the Policy-Switching Rollout framework, functions by actively identifying and correcting erroneous actions during agent training. This is achieved through a process of providing corrective signals when the agent deviates from successful task completion pathways. These signals, generated during rollout, serve as immediate feedback, allowing the agent to adjust its policy and avoid repeating the same errors. By focusing on specific instances of failure and providing targeted corrections, the Error Intervention strategy accelerates learning and improves the agent’s overall performance, leading to a more robust and reliable policy compared to training without such intervention.

OpenMobile employs a dual reinforcement learning approach, utilizing both step-level and trajectory-level techniques to refine agent behavior. Step-level reinforcement learning focuses on optimizing individual actions within a task, providing immediate feedback and adjusting the agent’s policy after each step. Complementing this, trajectory-level reinforcement learning evaluates and rewards the agent based on the completion of entire task sequences, encouraging the development of effective, long-term strategies. This combined approach allows for granular control over action selection while simultaneously promoting holistic task completion, resulting in improved agent performance and adaptability in complex mobile environments.
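One common way to combine the two signal levels is to add a trajectory-level success bonus to the terminal step and propagate it back through discounted returns. The sketch below illustrates that idea only; it is not OpenMobile's exact objective, and the bonus and discount values are invented.

```python
# Illustrative combination of step-level rewards with a
# trajectory-level success bonus via discounted returns.

def combined_returns(step_rewards, task_succeeded,
                     traj_bonus=1.0, gamma=0.99):
    """Discounted returns from per-step rewards, with a terminal
    bonus credited only when the whole task succeeds."""
    rewards = list(step_rewards)
    rewards[-1] += traj_bonus if task_succeeded else 0.0
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

rets = combined_returns([0.1, 0.0, 0.2], task_succeeded=True)
```

The per-step rewards shape individual actions immediately, while the terminal bonus, discounted backward, credits every step on a successful trajectory, matching the granular-plus-holistic intent described above.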

Agents trained using the OpenMobile synthesized dataset demonstrate significant performance gains on mobile automation benchmarks. Specifically, the agents achieve a 64.7% task success rate on the AndroidWorld environment, exceeding the performance of systems trained on publicly available datasets and reaching a level comparable to leading, proprietary data-driven systems. Furthermore, on the more complex MobileWorld benchmark, these agents attain a 17.4% task success rate, representing an 85.1% improvement over a baseline performance of 9.4% achieved using the Qwen2.5-VL-7B model.

Despite exhibiting moderate semantic similarity to AndroidWorld instructions (3.5% of synthesized examples exceed a similarity score of 0.7), removing the highly similar training examples causes only a minor performance decrease, suggesting limited benchmark overfitting.

Vision-Language Models and the Promise of Adaptive Mobile AI

Recent research highlights a significant performance boost in mobile AI agents through the strategic refinement of large Vision-Language Models (VLMs). Specifically, models like ‘Qwen2.5-VL-7B’ and ‘Qwen3-VL-8B’ undergo a fine-tuning process utilizing data generated by the OpenMobile platform. This targeted training allows the agents to more effectively interpret and respond to the complex interplay of visual and textual cues within mobile applications. The result is a substantial improvement in task completion and overall agent proficiency, demonstrating that a synergy between synthetic data and VLM optimization is key to advancing the capabilities of AI in mobile environments.

The convergence of synthetic data generation and large Vision-Language Model (VLM) fine-tuning creates a powerful pathway for mobile AI agents to perceive and respond to the complexities of application interfaces. By training VLMs on data specifically crafted to represent mobile environments – including visual screenshots paired with descriptive text – agents develop a nuanced understanding of both the appearance and functionality of apps. This allows them to interpret on-screen elements, decipher textual instructions, and ultimately, interact with mobile applications in a more human-like and effective manner. The resulting agents are not simply recognizing images; they are associating visual cues with semantic meaning, enabling robust performance even when faced with variations in app design or user interface elements.
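A fine-tuning example of the kind described above typically pairs a screenshot with an instruction and a grounded target action. The record shape below is a plausible sketch for illustration only; the field names and schema are assumptions, not the paper's actual data format.

```python
import json

# Hypothetical shape of one supervised fine-tuning example pairing a
# screenshot with an instruction and the grounded action the VLM
# should predict. All field names and values are illustrative.

sample = {
    "image": "screens/settings_0012.png",
    "instruction": "Turn off Wi-Fi",
    "history": ["tap:Settings"],
    "target_action": {
        "type": "toggle",
        "element": "Wi-Fi",
        "bbox": [40, 220, 680, 280],  # on-screen location of the element
    },
}

record = json.dumps(sample)
```

Pairing the visual observation with both the instruction and a grounded action is what lets the model associate on-screen elements with semantic meaning rather than memorizing pixel patterns.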

The refined agents exhibit a marked ability to adapt to unseen mobile applications and navigate variations in user interface design, representing a significant advancement in mobile AI. This robustness stems from the synergistic effect of data synthesis and Vision-Language Model fine-tuning, allowing for more reliable performance across diverse app environments. Demonstrating this capability, the Qwen2.5-VL-7B model achieved a noteworthy 25-point absolute improvement on the challenging AndroidWorld benchmark, signifying a substantial leap in the ability of AI agents to effectively understand and interact with complex mobile interfaces and paving the way for more versatile and user-friendly mobile automation.

The pursuit of robust mobile agents, as demonstrated by OpenMobile, necessitates a parsimonious approach to data representation. The framework’s emphasis on synthesizing diverse instructions and error recovery signals aligns with a core principle of efficient information transfer. As Andrey Kolmogorov stated, “The most important thing in science is not to be right, but to be useful.” OpenMobile’s contribution isn’t merely achieving competitive performance; it’s providing a methodology – a synthetic dataset – that unlocks further research. The utility lies in the framework’s ability to circumvent the limitations of real-world data collection, offering a streamlined path toward increasingly adaptable and resilient agents. This prioritizes practical application over exhaustive, yet potentially redundant, data gathering.

Further Steps

The proliferation of synthetic data, as demonstrated, offers a palliative for the perennial scarcity of labelled examples. However, it does not address the fundamental question of representation. Current vision-language models, while proficient at mapping observations to actions, remain tethered to the biases inherent in their training corpora. The true test lies not in mimicking human performance on contrived tasks, but in achieving robust generalization to genuinely novel situations.

Error recovery, incorporated here as a signal, hints at a deeper need: agents must not simply avoid failure, but understand it. A system capable of diagnosing its own shortcomings, and adapting accordingly, would move beyond imitation towards something resembling intelligence. The framework’s dependence on pre-existing applications, while pragmatic, introduces an external constraint. A wholly synthetic environment, though computationally demanding, would offer greater control and potentially reveal more fundamental limitations.

The pursuit of ‘open’ systems, those capable of operating in unrestricted environments, will inevitably confront the problem of scale. The combinatorial explosion of possible states demands a parsimonious approach to learning, one that prioritizes essential features and discards the superfluous. The goal, perhaps, is not to replicate the complexity of the world, but to distill its essence.


Original article: https://arxiv.org/pdf/2604.15093.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/


2026-04-18 14:32