Author: Denis Avetisyan
A new framework leverages artificial intelligence to automatically explore and test previously unreachable code within Android applications.

CovAgent combines agentic AI and dynamic instrumentation to break through the 30% activity-coverage ceiling in mobile application testing.
Despite advances in automated testing, Android GUI testing tools consistently reach only around 30% activity coverage. This paper introduces ‘CovAgent: Overcoming the 30% Curse of Mobile Application Coverage with Agentic AI and Dynamic Instrumentation’, a novel framework that leverages agentic AI and dynamic instrumentation to address this challenge. The approach significantly improves test coverage by automatically identifying and satisfying previously unreachable activity launch conditions, achieving up to 179.7% higher coverage than state-of-the-art techniques. Could this agentic approach herald a new era of more comprehensive and effective automated mobile application testing?
The Inevitable Drift: From Randomness to Directed Exploration
Early approaches to Android application testing frequently relied on techniques like the ‘Monkey’ tool, which simulates random user events. While useful for basic stress testing, these methods often prove inefficient in achieving thorough coverage of an app’s functionality. The inherent randomness struggles to navigate complex application states, frequently missing edge cases or specific sequences of actions that trigger critical bugs. Consequently, developers may face a false sense of security, as a high volume of random interactions doesn’t necessarily translate to meaningful test coverage. The limitations of such brute-force methods underscore the need for more intelligent and targeted testing strategies capable of systematically exploring app features and uncovering hidden vulnerabilities.
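For concreteness, the Monkey fuzzer ships with the Android SDK and is typically driven through adb. The sketch below, which assumes adb is on the PATH, a device or emulator is attached, and a hypothetical package name of com.example.app, fires a fixed-seed burst of random events; it illustrates the brute-force style discussed above rather than anything specific to the paper.

```python
import subprocess

PACKAGE = "com.example.app"  # hypothetical package under test

# Fire 10,000 pseudo-random UI events at the target package.
# -s fixes the seed so a crashing run can be replayed; --throttle paces events;
# the --pct-* flags bias the event mix toward touches and gestures.
cmd = [
    "adb", "shell", "monkey",
    "-p", PACKAGE,
    "-s", "42",
    "--throttle", "100",
    "--pct-touch", "60",
    "--pct-motion", "20",
    "-v", "10000",
]
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout[-2000:])  # tail of the event log, where crashes are reported
```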
Simply generating a large number of random interactions with an Android application, while seemingly thorough, often fails to adequately assess the underlying code. Achieving genuine code coverage necessitates smarter techniques that prioritize exploring relevant application states and functionalities. Researchers are increasingly focused on methods like symbolic execution and concolic testing, which analyze code paths and intelligently generate test cases designed to trigger specific code segments. These approaches move beyond the limitations of ‘brute-force’ testing by focusing on boundary conditions, exception handling, and complex logic, ultimately leading to more efficient bug detection and a higher degree of confidence in application reliability. The goal isn’t just to run many tests, but to execute tests that meaningfully exercise the application’s codebase and reveal potential vulnerabilities.
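A toy example, not drawn from the paper, makes the contrast concrete: random inputs essentially never satisfy a narrow path condition, whereas a symbolic or concolic engine solves that condition and emits a single directed test case.

```python
import random

def handle_request(code: int, flag: int) -> str:
    """Toy function: the 'deep' branch is hard to hit by chance."""
    if code == 0x5A5A:            # roughly 1-in-2^32 for a random 32-bit int
        if flag % 97 == 13:       # further narrows the path
            return "deep-branch"  # e.g. where a latent bug hides
    return "common-path"

# Random (Monkey-style) inputs almost never reach the deep branch.
hits = sum(
    handle_request(random.getrandbits(32), random.getrandbits(32)) == "deep-branch"
    for _ in range(100_000)
)
print("random hits:", hits)  # almost certainly 0

# A symbolic/concolic approach instead solves the path condition
#   code == 0x5A5A  and  flag % 97 == 13
# and emits one directed test case:
print("directed:", handle_request(0x5A5A, 13))  # -> "deep-branch"
```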
Efficient Android application testing necessitates a shift from exhaustive, yet often unproductive, methods towards strategies that intelligently navigate an app’s functional landscape. Current approaches are increasingly focused on techniques like state-aware testing and reinforcement learning, allowing automated systems to learn optimal paths through an application to uncover edge cases and critical bugs. These methods prioritize exploring unique app states and user flows, rather than simply generating random interactions, which drastically improves bug detection rates. By concentrating testing efforts on areas most likely to contain vulnerabilities, developers can enhance app quality, reduce time-to-market, and deliver a more robust user experience. This targeted exploration proves crucial, especially given the increasing complexity of modern Android applications and the diverse range of device configurations they must support.
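The shift from random to state-aware exploration can be sketched in a few lines: always take the action that leads to the least-visited state of an abstract app model. The model and action names below are invented for illustration and stand in for whatever state abstraction a real tool maintains.

```python
from collections import defaultdict

# Hypothetical app model: state -> {action: next_state}
APP = {
    "home":     {"open_menu": "menu", "tap_banner": "home"},
    "menu":     {"settings": "settings", "back": "home"},
    "settings": {"toggle_dark": "settings", "back": "menu"},
}

visits = defaultdict(int)

def pick_action(state: str) -> str:
    """Prefer the action whose destination state has been seen least often."""
    actions = APP[state]
    return min(actions, key=lambda a: visits[actions[a]])

state = "home"
for _ in range(20):
    visits[state] += 1
    action = pick_action(state)
    state = APP[state][action]

print(dict(visits))  # exploration spreads across states instead of looping on 'home'
```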

The Illusion of Control: From Random Walks to Guided Searches
Automated testing tools such as Sapienz, APE, and Stoat enhance GUI testing by pairing a model of the application’s GUI with systematic search. The model, capturing possible GUI states and the transitions between them, is built from static analysis or refined dynamically during exploration, and it informs a search strategy (in Sapienz’s case a multi-objective evolutionary algorithm) that generates sequences of actions. These sequences are evaluated against code coverage or other defined metrics, and the search iteratively refines itself, prioritizing paths that maximize test efficiency and reach previously untested GUI elements. This approach contrasts with purely random or manual testing by providing a structured, systematic exploration of the application’s user interface.
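The search-based half of such tools can be sketched as a small genetic loop over action sequences. The fitness function here (distinct consecutive action pairs) is merely a stand-in for real coverage feedback, and all action names are invented.

```python
import random

ACTIONS = ["tap_login", "tap_search", "scroll", "back", "open_settings"]

def coverage(seq):
    """Stand-in fitness: distinct action pairs approximate the distinct
    GUI transitions a sequence would exercise."""
    return len({(a, b) for a, b in zip(seq, seq[1:])})

def mutate(seq):
    seq = list(seq)
    seq[random.randrange(len(seq))] = random.choice(ACTIONS)
    return seq

# Initial population of random event sequences (Monkey-like).
population = [[random.choice(ACTIONS) for _ in range(12)] for _ in range(30)]

for generation in range(50):
    population.sort(key=coverage, reverse=True)
    survivors = population[:10]                                   # truncation selection
    population = survivors + [mutate(random.choice(survivors)) for _ in range(20)]

best = max(population, key=coverage)
print(coverage(best), best)
```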
Directed search techniques, exemplified by A3E (Automatic Android App Explorer), depart from purely random GUI testing by prioritizing exploration according to how likely it is to uncover errors. These techniques use heuristics and models, often derived from program analysis or runtime observations, to identify the UI elements or input sequences most likely to trigger crashes or unexpected behavior. Rather than uniformly sampling the input space, A3E and similar methods concentrate their computational budget on areas deemed interesting based on code coverage, event handling, or historical failure rates. This targeted approach improves test efficiency by raising the probability of discovering bugs within a limited number of interactions, rather than relying on chance encounters during random exploration.
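In code, the directing heuristic often amounts to a priority queue over candidate actions; the actions and scores below are made up purely to illustrate the mechanism.

```python
import heapq

# Frontier of candidate GUI actions, scored by a heuristic estimate of how
# likely they are to expose a failure (new code reached, past crash rate, etc.).
frontier = [
    (-0.9, "submit_form_with_empty_fields"),
    (-0.7, "rotate_screen_during_upload"),
    (-0.2, "tap_home_banner"),
    (-0.1, "scroll_feed"),
]
heapq.heapify(frontier)  # min-heap on negated score == max-heap on score

budget = 3  # limited interaction budget
while frontier and budget > 0:
    score, action = heapq.heappop(frontier)
    print(f"exploring {action!r} (priority {-score:.1f})")
    budget -= 1
```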
Humanoid employs a reinforcement learning approach to GUI testing, specifically utilizing a Deep Q-Network (DQN) to model the testing process as a Markov Decision Process. The DQN is trained on a history of app interactions – actions taken within the GUI and the resulting app states – allowing it to predict which actions are most likely to lead to interesting or problematic states, such as crashes or exceptions. This learned policy then guides future exploration by prioritizing actions with higher predicted rewards, effectively shifting away from purely random input generation and focusing on test paths with a greater probability of uncovering bugs. The system continuously updates its model as it observes the outcomes of its actions, enabling it to adapt to the specific behavior of the application under test and improve its testing efficiency over time.
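As a drastically simplified stand-in for the deep Q-network described above, the tabular Q-learning loop below shows the same learn-then-exploit idea on a toy app model; the fixed rewards substitute for a real “new state or crash discovered” signal, and none of this is Humanoid’s actual model.

```python
import random
from collections import defaultdict

# Toy MDP: (state, action) -> (next_state, reward). Reward 1.0 marks a
# transition that uncovered something new (stand-in for a novelty signal).
ENV = {
    ("home", "open_menu"): ("menu", 1.0),
    ("home", "scroll"):    ("home", 0.0),
    ("menu", "settings"):  ("settings", 1.0),
    ("menu", "back"):      ("home", 0.0),
    ("settings", "back"):  ("menu", 0.0),
}
ACTIONS = {"home": ["open_menu", "scroll"],
           "menu": ["settings", "back"],
           "settings": ["back"]}

Q = defaultdict(float)            # Q[(state, action)] -> expected reward
alpha, gamma, eps = 0.5, 0.9, 0.2

for episode in range(200):
    state = "home"
    for _ in range(10):
        # epsilon-greedy: mostly exploit the learned policy, sometimes explore
        if random.random() < eps:
            action = random.choice(ACTIONS[state])
        else:
            action = max(ACTIONS[state], key=lambda a: Q[(state, a)])
        nxt, reward = ENV[(state, action)]
        best_next = max(Q[(nxt, a)] for a in ACTIONS[nxt])
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = nxt

print(max(ACTIONS["home"], key=lambda a: Q[("home", a)]))  # learned first move
```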

The Glimmer of Intelligence: Reasoning About App Behavior
Recent advancements in mobile application testing utilize Large Language Models (LLMs) through tools like LLMDroid, GPTDroid, and DroidAgent. These systems function by analyzing an application’s User Interface (UI) elements and associated text to create concise summaries of available functionalities. This process enables the LLM to identify potentially new or less-utilized features within the app. Consequently, testing efforts can be focused on these specific areas, moving beyond broad, generalized test suites and achieving more targeted and efficient application coverage. The LLM’s UI summarization capability serves as a prerequisite for intelligent test case generation and prioritization.
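A rough sketch of the summarization step might look as follows; the widget list, visited-activity set, and the query_llm stub are hypothetical placeholders, since the tools’ exact prompts are not reproduced here.

```python
# Hypothetical UI snapshot: (widget type, visible text / content description)
ui_elements = [
    ("Button",   "Log in"),
    ("EditText", "Email address"),
    ("EditText", "Password"),
    ("TextView", "Forgot password?"),
    ("Button",   "Create account"),
]

visited = {"LoginActivity", "SplashActivity"}

prompt = (
    "You are testing an Android app. The current screen contains:\n"
    + "\n".join(f"- {kind}: '{text}'" for kind, text in ui_elements)
    + f"\nActivities already covered: {sorted(visited)}.\n"
    + "Summarize this screen's purpose in one sentence, then list the "
    + "interactions most likely to reach functionality not yet covered."
)

# `query_llm` is a placeholder for whatever model client a framework would use.
def query_llm(text: str) -> str:
    raise NotImplementedError("wire up your LLM client here")

print(prompt)
```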
Large Language Models (LLMs) are increasingly utilized for software testing by incorporating reasoning capabilities through techniques like Chain-of-Thought (CoT). CoT prompting enables the LLM to decompose complex testing problems into a series of intermediate reasoning steps, rather than directly predicting test inputs. This allows the model to analyze the application’s state, consider potential outcomes of actions, and generate test cases that target specific functionalities or edge cases. Consequently, LLMs can produce more intelligent test inputs – those that go beyond simple random inputs – by focusing on areas likely to reveal bugs or unexpected behavior, ultimately improving test coverage and effectiveness.
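A chain-of-thought prompt for test-input generation might be structured like this; the screen description is invented and the format is only one plausible way to elicit intermediate reasoning before the final inputs.

```python
screen = "Checkout form with fields: card number, expiry (MM/YY), CVV, promo code"

# Chain-of-Thought prompt: ask for explicit intermediate reasoning before the
# final test inputs, so boundary cases are derived rather than guessed.
cot_prompt = f"""You are generating GUI test inputs for this screen:
{screen}

Think step by step:
1. List each field's implicit constraints (length, format, valid ranges).
2. For each constraint, derive one boundary value and one invalid value.
3. Only then output the final test cases as a JSON list of field/value pairs.
"""
print(cot_prompt)
```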
The ModelContextProtocol is a defined interface enabling communication between Large Language Model (LLM) agents and external tools necessary for app interaction and observation. This protocol standardizes the exchange of information, allowing the LLM to issue commands to tools – such as those automating UI interactions or accessing device sensors – and receive structured feedback regarding the app’s response. Specifically, it outlines the format for requests sent to tools, detailing the desired action and any necessary parameters, and the format for responses from tools, providing observable data about the app’s state or behavior following the action. This structured communication is critical for LLM-powered testing, as it allows the agent to iteratively explore the application, observe the results of its actions, and refine subsequent testing strategies based on observed behavior.
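The exchange can be pictured as a pair of JSON messages in a JSON-RPC style; the tool name, arguments, and field layout below are illustrative rather than the protocol’s or the framework’s literal schema.

```python
import json

# Request from the LLM agent: invoke a registered tool that taps a UI element.
request = {
    "jsonrpc": "2.0",
    "id": 7,
    "method": "tools/call",
    "params": {
        "name": "tap_element",  # hypothetical tool name
        "arguments": {"resource_id": "com.example.app:id/login_button"},
    },
}

# Structured observation returned by the tool after acting on the device.
response = {
    "jsonrpc": "2.0",
    "id": 7,
    "result": {
        "content": [{"type": "text",
                     "text": "Tapped login_button; now on MainActivity"}],
        "isError": False,
    },
}

print(json.dumps(request, indent=2))
print(json.dumps(response, indent=2))
```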

The Illusion of Completion: A Holistic Approach to Intelligent Testing
CovAgent introduces a new methodology for Android application testing by integrating three core techniques: dynamic instrumentation, static analysis, and large language models. Dynamic instrumentation, facilitated by tools like Frida, allows for runtime analysis of the application’s behavior. This is complemented by static analysis, which examines the application’s code without execution. The data derived from these two analyses is then fed into a large language model, enabling intelligent test case generation and improved application coverage. This combined approach aims to surpass the limitations of traditional testing methods by leveraging the strengths of each individual technique and providing a more holistic understanding of the application under test.
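At a very high level, the interplay of the three techniques can be pictured as the loop below. Every function is a placeholder returning canned data so the sketch runs end to end; it is a reading aid, not CovAgent’s actual architecture or API.

```python
# Placeholder components; each returns canned data so the sketch executes.
def static_analysis(apk_path):
    return {"activities": ["MainActivity", "SettingsActivity", "PaymentActivity"]}

def runtime_snapshot():
    return {"current_activity": "MainActivity", "blocked_launches": ["PaymentActivity"]}

def ask_llm(context):
    # Real system: prompt an LLM with the static model plus runtime observations.
    return {"action": "launch", "target": context["static"]["activities"][-1]}

def execute(action):
    return {"ok": True, "action": action}

def test_app(apk_path, budget=3):
    static_model = static_analysis(apk_path)       # offline structure of the app
    history = []
    for _ in range(budget):
        context = {"static": static_model, "runtime": runtime_snapshot(),
                   "history": history}
        history.append(execute(ask_llm(context)))  # LLM decides, device executes
    return history

print(test_app("app-release.apk"))
```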
CovAgent employs the Frida dynamic instrumentation toolkit to observe and intercept runtime execution within Android applications. This allows for the extraction of data regarding method calls, variable values, and control flow without requiring modification of the application’s source code. The collected runtime information is then formatted and provided as input to the integrated Large Language Model (LLM), enabling it to understand the app’s behavior and inform more effective test case generation. Frida’s capabilities facilitate a detailed analysis of the application’s state during execution, supplementing static analysis and providing the LLM with contextual data crucial for identifying potential vulnerabilities and coverage gaps.
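A minimal, generic Frida hook gives the flavor of this runtime observation. The script below is not CovAgent’s actual instrumentation; it assumes frida-server is running on the device and that the hypothetical package com.example.app is already running, and it reports every Activity creation back to the host, where such events could be folded into the LLM’s context.

```python
import frida  # Frida's Python bindings (pip install frida)

JS_HOOK = """
Java.perform(function () {
  var Activity = Java.use('android.app.Activity');
  Activity.onCreate.overload('android.os.Bundle').implementation = function (bundle) {
    // Report every activity launch back to the host as structured data.
    send({event: 'activity_created', name: this.getClass().getName()});
    return this.onCreate(bundle);
  };
});
"""

def on_message(message, data):
    if message.get("type") == "send":
        print("runtime event:", message["payload"])  # candidate LLM context

device = frida.get_usb_device()
session = device.attach("com.example.app")  # hypothetical package name
script = session.create_script(JS_HOOK)
script.on("message", on_message)
script.load()
input("Interact with the app; press Enter to stop.\n")
```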
CovAgent demonstrates a substantial improvement in activity coverage when compared to the APE testing framework. Specifically, CovAgent achieves up to 49.5% activity coverage, roughly 2.8 times APE’s 17.7% (the 179.7% relative improvement cited in the abstract). This metric indicates the proportion of distinct application activities explored during testing; a higher percentage suggests more thorough examination of the application’s functional components and user interface flows. The significant difference in coverage highlights CovAgent’s enhanced ability to systematically exercise and validate the application’s activity-based behavior.
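For reference, the 179.7% figure quoted in the abstract follows directly from these two coverage numbers:

```python
covagent, ape = 49.5, 17.7   # activity coverage (%) reported for each tool
ratio = covagent / ape
print(f"{ratio:.2f}x APE's coverage, i.e. {100 * (ratio - 1):.1f}% higher")  # 2.80x, 179.7%
```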
The ComponentTransitionGraph (CTG) serves as a crucial input to CovAgent’s Large Language Model (LLM), providing a formalized depiction of the Android application’s structure and navigational pathways. This graph represents application components as nodes and transitions between them as edges, effectively mapping the app’s architectural blueprint. By consuming the CTG, the LLM gains contextual understanding beyond the raw code, enabling it to more accurately predict user interactions, identify potential test cases, and reason about application behavior. This structured representation significantly enhances the LLM’s ability to generate effective test sequences and improve overall testing efficiency compared to approaches lacking such architectural awareness.
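The graph itself can be as simple as an adjacency map that gets serialized into the prompt; the activities, triggers, and wording below are invented for illustration.

```python
# Toy component transition graph: activity -> list of (trigger, target activity).
ctg = {
    "SplashActivity":   [("auto", "LoginActivity")],
    "LoginActivity":    [("login_ok", "MainActivity")],
    "MainActivity":     [("menu_settings", "SettingsActivity"),
                         ("tap_cart", "CheckoutActivity")],
    "CheckoutActivity": [],   # no outgoing edges discovered yet
    "SettingsActivity": [("back", "MainActivity")],
}

reached = {"SplashActivity", "LoginActivity", "MainActivity"}
unreached = [a for a in ctg if a not in reached]

# Serialize the graph so an LLM can reason about how to reach uncovered nodes.
lines = [f"{src} --{label}--> {dst}" for src, edges in ctg.items() for label, dst in edges]
prompt = ("App transition graph:\n" + "\n".join(lines)
          + f"\nNot yet covered: {unreached}."
          + "\nPropose the shortest interaction sequence reaching each uncovered activity.")
print(prompt)
```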
CovAgent achieves a 54.8% Activity Launch Success Rate, representing a substantial improvement over Scenedroid, which demonstrates a 15.8% success rate. This metric quantifies the percentage of attempted activity launches within the target Android application that are successfully completed without errors or crashes. The significant difference in performance indicates CovAgent’s enhanced ability to navigate and interact with application components, suggesting a more robust and reliable testing process compared to Scenedroid.
Comparative analysis demonstrates CovAgent’s superior code coverage metrics when tested against the APE framework. Specifically, CovAgent achieved 56.6% class coverage, representing a 14.3 percentage point improvement over APE’s 42.3%. Method coverage with CovAgent reached 45.2%, exceeding APE’s 32.1% by 13.1 percentage points. Furthermore, CovAgent attained 39.8% line coverage, a substantial increase from APE’s 28.5%, indicating an 11.3 percentage point difference in the extent of executable code lines reached during testing.

The Inevitable Decay: Towards Self-Healing and Adaptive Testing
Ongoing research prioritizes the creation of self-healing testing frameworks designed to minimize the maintenance burden associated with evolving applications. These frameworks aim to automatically detect and address broken tests resulting from user interface changes or code refactoring, employing techniques like dynamic element location and machine learning to adapt to modifications. Rather than simply flagging failures, a self-healing system will attempt to autonomously repair tests by updating locators, adjusting assertions, or even regenerating test steps. This adaptive capability promises to significantly reduce the time and resources currently dedicated to test maintenance, allowing development teams to focus on innovation and faster release cycles. The envisioned systems will not only identify discrepancies but also learn from changes, improving their resilience and accuracy over time, ultimately fostering a more robust and efficient software development process.
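One concrete ingredient of such frameworks is a self-healing element locator. The sketch below, using an invented screen representation rather than any real driver’s API, falls back to looser attributes when the primary selector breaks and records the healed selector for the next run.

```python
def find_element(screen, locator):
    """Try the stored selector first, then progressively looser fallbacks."""
    strategies = ["resource_id", "content_desc", "visible_text"]
    for strategy in strategies:
        wanted = locator.get(strategy)
        if wanted is None:
            continue
        for element in screen:
            if element.get(strategy) == wanted:
                if strategy != "resource_id" and "resource_id" in element:
                    # Primary locator broke (e.g. an id was renamed); heal it
                    # so the test updates itself for the next run.
                    locator["resource_id"] = element["resource_id"]
                return element
    raise LookupError(f"no element matches {locator}")

screen = [{"resource_id": "btn_sign_in", "visible_text": "Log in"}]
locator = {"resource_id": "btn_login", "visible_text": "Log in"}  # stale id
print(find_element(screen, locator))
print("healed locator:", locator)
```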
Emerging research indicates a powerful synergy between Large Language Models (LLMs) and specialized tools like ‘ActivityLaunch’ and ‘InstrumentationScripts’ to revolutionize mobile application testing. By integrating LLMs, which excel at understanding and generating human-like text, with ‘ActivityLaunch’ (a mechanism for initiating specific app actions) and ‘InstrumentationScripts’ (code that monitors app behavior), testing frameworks can move beyond simple, predefined test cases. This combination allows for the dynamic creation of tests tailored to specific app states and user interactions, and enables more intelligent analysis of test results. The LLM can interpret app code and UI elements, then generate targeted actions via ‘ActivityLaunch’ and interpret the resulting data collected through ‘InstrumentationScripts’, effectively creating a closed-loop system for sophisticated test case generation and execution, ultimately leading to more robust and adaptable testing procedures.
The envisioned future of software testing centers on a fully autonomous system, perpetually vigilant in its monitoring of application quality. This system transcends traditional reactive bug fixing, instead proactively identifying potential issues before they manifest as failures for end-users. Such a system would leverage advanced algorithms and machine learning models to analyze code changes, user behavior, and system logs, predicting where vulnerabilities might arise. Upon detection of a potential issue, the system wouldn’t simply report it, but would autonomously generate and execute targeted tests, and, crucially, implement corrections, effectively self-healing the application. This continuous cycle of monitoring, prediction, testing, and correction promises a paradigm shift, moving software quality assurance from a cost center to a self-optimizing, integral component of the development lifecycle, drastically reducing time-to-market and enhancing user experience.
The pursuit of complete code coverage, as CovAgent attempts with its agentic AI and dynamic instrumentation, is often framed as a technical challenge. However, this work reveals a deeper truth: systems don’t fail – they evolve. The framework doesn’t simply find unreachable activities; it grows pathways to them, revealing the inherent adaptability within the Android application itself. As Vinton Cerf once observed, “The Internet treats everyone the same.” Similarly, CovAgent doesn’t impose order; it responds to the latent potential already present, allowing the system to demonstrate its full, complex shape. The 30% coverage curse isn’t a barrier, but a symptom of a system revealing itself over time.
What Lies Ahead?
CovAgent addresses a symptom, not the disease. The persistent struggle for adequate Android application coverage reveals a deeper truth: testing isn’t about achieving a percentage, but about acknowledging the inherent unknowability of complex systems. Each activity successfully launched is merely a postponement of the inevitable – the undiscovered edge case, the unanticipated user flow. The framework itself will become a dependency, a brittle layer atop an already shifting foundation.
Future work will undoubtedly focus on refining the Large Language Model’s understanding of intent, perhaps attempting to predict unexplored states. This is a seductive path, but one built on the assumption that complete knowledge is attainable. A more fruitful, if less glamorous, direction lies in embracing the unknown. Tools that don’t seek to solve coverage, but to reveal the boundaries of understanding, might prove more resilient.
The architecture isn’t structure – it’s a compromise frozen in time. Technologies change, dependencies remain. The real challenge isn’t building a better agent, but cultivating a system that can gracefully degrade as the landscape shifts, and which acknowledges that the map will always be less detailed than the territory.
Original article: https://arxiv.org/pdf/2601.21253.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/