Fixing Software, Faster: The Rise of AI-Powered Issue Resolution

Author: Denis Avetisyan


New research explores how artificial intelligence can streamline the process of identifying, understanding, and resolving software bugs and maintenance issues.

This review analyzes the application of AI, including large language models, to improve issue report quality, developer workflows, and automated bug localization in software maintenance.

Despite the critical role of software maintenance, issue resolution remains hampered by challenges like ambiguous reports and a lack of automated support. This research, ‘Studying and Automating Issue Resolution for Software Quality’, addresses these limitations through a three-pronged approach: improving issue report clarity, characterizing developer workflows with and without AI assistance, and automating demanding tasks like bug localization. Our work demonstrates that leveraging large language models and machine learning can significantly enhance both the efficiency and effectiveness of resolving software issues. Will these advances pave the way for truly self-healing software systems?


The Inevitable Cascade of Ambiguous Reports

Software development frequently stalls due to poorly constructed issue reports, which consistently lack the crucial details needed for swift resolution. When reports fail to clearly articulate observed behavior – what actually happened – alongside a description of the expected behavior, developers are left to spend valuable time deciphering the problem rather than fixing it. This ambiguity is compounded when steps to reproduce the issue are missing; without a reliable path to consistently recreate the bug, diagnosis becomes significantly more difficult and time-consuming. Consequently, resolution times lengthen, impacting project timelines and increasing the overall cost of software maintenance, ultimately leading to user frustration and a diminished product experience.

The consequences of poorly written issue reports extend far beyond simple inconvenience; they actively erode the efficiency of software development and diminish the quality of the final product. When reports lack crucial details, developers are forced into cycles of clarification and guesswork, consuming valuable time and resources that could be dedicated to actual problem-solving. This miscommunication frequently results in wasted effort – investigating phantom bugs or implementing incorrect fixes – and contributes to delays in releasing updates and new features. Ultimately, these inefficiencies manifest as a degraded user experience, characterized by persistent bugs, frustrating interactions, and a loss of confidence in the software itself. The cumulative effect of these seemingly minor deficiencies can be substantial, impacting both developer productivity and user satisfaction.

Current methods for evaluating software issue reports frequently fall short in identifying and rectifying deficiencies in crucial details. While some tools offer basic validation – checking for required fields, for example – they rarely assess the clarity or reproducibility of the information provided. This means reports can still be ambiguous, lack sufficient context, or omit vital steps for developers to effectively diagnose and resolve the underlying problem. Consequently, organizations often rely on manual review – a time-consuming and inconsistent process – or accept a steady stream of incomplete reports that significantly impede development velocity and contribute to frustrating user experiences. The limitations of existing automated systems necessitate more sophisticated approaches capable of analyzing the semantic quality of issue descriptions and proactively guiding reporters towards providing comprehensive and actionable information.

AstroBR: Augmenting Reports, Not Replacing Reporters

AstroBR employs a combined methodology of Large Language Models (LLMs) and Dynamic Application Analysis to automatically refine ‘Steps to Reproduce’ (S2R) descriptions within issue reports. This technique moves beyond static analysis by actively observing application behavior during the execution of reported steps. The LLM component, specifically GPT-4, is utilized to interpret the S2R, identify potential ambiguities or inaccuracies, and propose improvements based on the runtime observations gathered through Dynamic Application Analysis. This process results in enhanced S2R descriptions that are more precise and reliable for issue reproduction and resolution.

AstroBR enhances issue report quality by integrating Large Language Model (LLM) reasoning with data derived from dynamic application analysis. This process allows the system to assess ‘Steps to Reproduce’ descriptions while the application is running, identifying discrepancies between the reported steps and the actual application behavior. Ambiguities or inaccuracies within the reported steps are flagged by comparing the expected application state, as defined by the steps, with the observed runtime state. This comparison enables AstroBR to automatically rectify issues like missing preconditions, incorrect actions, or incomplete sequences, leading to more accurate and reproducible reports.
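The mechanism can be pictured as a small refinement loop. The sketch below is illustrative only: the expected/observed state dictionaries and the `refine_step` helper are hypothetical stand-ins for AstroBR’s actual dynamic-analysis instrumentation, and it assumes the OpenAI SDK with an `OPENAI_API_KEY` in the environment.

```python
# Illustrative refinement loop in the spirit of AstroBR. The expected/
# observed state dictionaries stand in for the tool's dynamic-analysis
# instrumentation, which is out of scope here.
from openai import OpenAI

client = OpenAI()

def refine_step(step: str, expected: dict, observed: dict) -> str:
    """Rewrite one S2R entry using runtime evidence (hypothetical helper)."""
    mismatches = {key: (expected[key], observed.get(key))
                  for key in expected if observed.get(key) != expected[key]}
    if not mismatches:
        return step  # the step already matches observed behavior
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                f"Reported step: {step}\n"
                f"Expected vs. observed UI state mismatches: {mismatches}\n"
                "Rewrite the step so it is precise, complete, and reproducible."
            ),
        }],
    )
    return response.choices[0].message.content.strip()

# Example: the report implies a confirmation dialog that never appeared.
print(refine_step(
    "Tap Save",
    expected={"dialog": "Save confirmation"},
    observed={"dialog": None},
))
```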

AstroBR employs GPT-4 to process initial ‘Steps to Reproduce’ (S2R) reports, extracting critical information such as actions, objects, and preconditions. This extracted data is then used to construct a revised S2R description, aiming for increased clarity and completeness. The GPT-4 integration facilitates the identification of missing steps or ambiguous phrasing within the original report. The LLM refines the description by adding necessary details and ensuring logical flow, ultimately generating a more robust and reproducible account of the reported issue. This process leverages GPT-4’s natural language understanding and generation capabilities to improve the quality of issue reports without requiring manual intervention.
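A plausible shape for this extraction step is a single structured prompt. The prompt wording and JSON schema below are assumptions made for illustration, not the paper’s actual prompts:

```python
# Illustrative extraction prompt; the wording and schema are assumptions
# for this sketch, not AstroBR's actual prompts.
import json
from openai import OpenAI

client = OpenAI()

EXTRACTION_PROMPT = """Extract from the following 'Steps to Reproduce' a JSON
object with keys "actions" (ordered user actions), "objects" (UI elements
acted upon), and "preconditions" (state required before step 1).
Respond with JSON only.

Steps:
{s2r}"""

def extract_s2r_structure(s2r: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": EXTRACTION_PROMPT.format(s2r=s2r)}],
    )
    # Production code would validate the reply before parsing.
    return json.loads(response.choices[0].message.content)
```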

Evaluations demonstrate AstroBR’s superior performance in enhancing Steps to Reproduce (S2R) descriptions when compared to the current state-of-the-art technique, Euler. Specifically, AstroBR achieves a 25.2% improvement in S2R quality annotation, as measured by the $F_1$ score. Furthermore, the system exhibits a substantial 71.4% improvement in identifying missing steps within issue reports, also quantified using the $F_1$ score. These metrics indicate AstroBR’s ability to generate more accurate and complete S2R descriptions than existing methods, leading to improved issue reproduction rates.
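For reference, both figures use the standard $F_1$ score, the harmonic mean of precision and recall:

$$F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$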

Accelerating Resolution: A Symptom of Systemic Clarity

Effective issue resolution depends on swiftly and accurately matching new problems to previously implemented solutions, which makes a robust solution identification process paramount. AstroBR-enhanced reports facilitate this process by providing a standardized and detailed account of each issue, including relevant contextual data and diagnostic information. This improved reporting quality serves as a solid foundation for both machine learning models and large language models tasked with matching current issues to historical solutions, ultimately accelerating the resolution lifecycle. The completeness and clarity of AstroBR reports minimize ambiguity and increase the likelihood of a correct solution match.

To identify relevant past solutions, the system employs Large Language Models (LLMs), including the Llama-3ft model. This is achieved through two primary techniques: embeddings and prompting. Issue reports and previously resolved solutions are converted into vector embeddings, allowing for semantic similarity searches to find analogous cases. Subsequently, LLMs are prompted with the current issue description and the retrieved embeddings to generate a ranked list of potential solutions. The LLM assesses the relevance of each past solution based on the semantic similarity and the specific details of the current issue, providing a targeted set of recommendations for issue resolution.
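A minimal sketch of the first, embedding-based stage might look as follows; the `all-MiniLM-L6-v2` encoder is an assumed stand-in, and the paper’s fine-tuned Llama-3ft re-ranker is only described in the trailing comment rather than reproduced:

```python
# Sketch of the embedding stage: encode the new issue and all past
# solutions, then rank by cosine similarity. The encoder checkpoint is
# an assumed stand-in; the paper's models are not reproduced here.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder

def top_k_solutions(issue: str, past_solutions: list[str], k: int = 5) -> list[str]:
    issue_vec = encoder.encode([issue])[0]
    solution_vecs = encoder.encode(past_solutions)
    # Cosine similarity between the new issue and every past solution.
    sims = solution_vecs @ issue_vec / (
        np.linalg.norm(solution_vecs, axis=1) * np.linalg.norm(issue_vec)
    )
    return [past_solutions[i] for i in np.argsort(-sims)[:k]]

# Stage two (not shown): the top-k candidates are placed into the LLM's
# prompt, and the model re-ranks them against the new issue description.
```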

While Machine Learning Models (MLMs) continue to be incorporated into solution identification workflows, comparative analysis demonstrates consistently lower performance metrics than those achieved with Large Language Models (LLMs). MLMs typically rely on feature engineering and require substantial labeled data for training, limiting their adaptability to novel issue descriptions. Evaluation has shown that MLMs struggle with semantic understanding and often fail to retrieve relevant solutions when faced with variations in phrasing or complex problem statements. This contrasts with LLMs, which leverage pre-training on massive datasets to achieve superior performance in understanding natural language and identifying conceptually similar solutions, even with limited task-specific training data.
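For contrast, the kind of feature-engineered baseline this comparison refers to can be as simple as TF-IDF retrieval. The sketch below is an illustrative baseline, not the paper’s exact MLM setup; because it matches on surface vocabulary, a report saying “app dies on startup” may fail to retrieve a solution filed under “crash at launch”:

```python
# Classic feature-engineered baseline: TF-IDF vectors plus
# nearest-neighbour lookup over past solutions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_top_k(issue: str, past_solutions: list[str], k: int = 5) -> list[str]:
    vectorizer = TfidfVectorizer().fit(past_solutions + [issue])
    sims = cosine_similarity(vectorizer.transform([issue]),
                             vectorizer.transform(past_solutions))[0]
    return [past_solutions[i] for i in sims.argsort()[::-1][:k]]
```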

The integration of enhanced issue reporting with advanced solution identification methods has demonstrably reduced time-to-resolution. Quantitative evaluation, using an ensemble of the highest-performing models, yielded a solution identification F1 score of 0.737. This metric represents a harmonic mean of precision and recall, indicating a balanced performance in correctly identifying relevant solutions from past issues. The improvement is attributed to both the increased quality of incoming issue reports, providing richer context, and the efficacy of the implemented solution identification algorithms in leveraging that data.
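One common way to realize such an ensemble, offered here as a sketch rather than the paper’s exact method, is to normalize each model’s candidate scores and average them:

```python
# One common ensembling scheme (a sketch, not necessarily the paper's):
# min-max normalize each model's scores, average across models, and keep
# the best-scoring candidates.
import numpy as np

def ensemble_rank(scores_per_model: list[np.ndarray], k: int = 5) -> list[int]:
    """Each array holds one model's scores, aligned by candidate index."""
    stacked = np.stack(scores_per_model)
    lo = stacked.min(axis=1, keepdims=True)
    hi = stacked.max(axis=1, keepdims=True)
    normed = (stacked - lo) / np.where(hi > lo, hi - lo, 1)
    return list(np.argsort(-normed.mean(axis=0))[:k])
```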

Proactive Defect Localization: A System Revealing its Faults

AstroBR’s capabilities extend beyond traditional text-based bug report analysis to pinpoint the precise user interface elements causing issues, a process known as Buggy UI Localization. This innovative approach leverages the textual descriptions within bug reports – often detailing where a problem occurs – to identify the corresponding faulty UI components. Rather than relying solely on developer interpretation, the system actively correlates the language of the bug with visual elements within the application. This allows for a more direct and efficient workflow, reducing the time spent on manual investigation and accelerating the path to resolution. By translating natural language into actionable UI component identification, AstroBR facilitates a proactive approach to defect localization, ultimately contributing to a more streamlined and effective software development process.

To pinpoint faulty user interface elements from textual bug reports, sophisticated techniques rooted in artificial intelligence are being utilized. Deep learning models, capable of discerning complex patterns, are central to this process, working alongside information retrieval methods to efficiently sift through vast amounts of data. Recent advancements pair multi-modal models such as BLIP and CLIP, which jointly process text and visual information, with text encoders such as SBERT, enabling a more holistic understanding of the reported issue. These systems consistently outperform simpler approaches like Lucene, which relies on keyword matching and lacks the nuanced comprehension offered by modern machine learning techniques.
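A compressed illustration of the multi-modal idea: embed the bug report text and candidate screenshots into a shared space and rank screens by similarity. The CLIP checkpoint name and overall setup below are assumptions for this sketch, not the paper’s exact pipeline:

```python
# Rank candidate screens by text-image similarity in CLIP's shared
# embedding space (assumed checkpoint; illustrative setup only).
from PIL import Image
from sentence_transformers import SentenceTransformer, util

clip = SentenceTransformer("clip-ViT-B-32")

def rank_screens(bug_text: str, screenshot_paths: list[str]) -> list[tuple[str, float]]:
    text_emb = clip.encode(bug_text)
    image_embs = clip.encode([Image.open(p) for p in screenshot_paths])
    sims = util.cos_sim(text_emb, image_embs)[0]
    return sorted(zip(screenshot_paths, sims.tolist()), key=lambda pair: -pair[1])
```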

Recent advancements in automated defect localization demonstrate a significant capability in pinpointing faulty user interface elements directly from bug reports. The most effective approaches currently place the correct screen containing the bug within their top three recommendations 52% of the time, and the specific faulty UI component within the top three 60% of the time. These results indicate a substantial improvement over traditional methods and suggest a path toward more efficient and targeted bug fixing, ultimately reducing the time and resources required to resolve software defects.
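The reported figures correspond to the standard top-k accuracy metric, which a few lines of Python make concrete (illustrative helper, not the paper’s evaluation code):

```python
# Top-k accuracy: the metric behind the figures above.
def top_k_accuracy(rankings: list[list[str]], gold: list[str], k: int = 3) -> float:
    """Fraction of reports whose correct screen/component is in the top k."""
    hits = sum(g in ranked[:k] for ranked, g in zip(rankings, gold))
    return hits / len(gold)
```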

The integration of advanced defect localization techniques into the issue resolution workflow promises a significant streamlining of the software development lifecycle. By automatically pinpointing the precise UI elements associated with reported bugs, organizations can drastically reduce the time developers spend on manual investigation and reproduction. This accelerated triage not only boosts developer productivity but also minimizes delays in addressing critical issues, ultimately contributing to the delivery of higher-quality software with fewer defects reaching end-users. The enhanced workflow allows for more focused debugging efforts, enabling quicker resolutions and a more efficient allocation of development resources, thereby fostering a continuous improvement cycle and a more responsive approach to software maintenance.

The principles of proactive defect localization and workflow enhancement found concrete validation through a case study centered on Mozilla Firefox. Researchers integrated the AstroBR system – leveraging deep learning, information retrieval, and multi-modal models – into Firefox’s existing issue resolution process. This application demonstrated a tangible improvement in identifying the specific UI elements related to reported bugs, achieving a 52% accuracy rate for screen localization and 60% for component localization within the top three recommendations. The results suggest that by automatically pinpointing faulty UI components from bug descriptions, development teams can significantly reduce the time spent on manual investigation, accelerate bug fixing, and ultimately deliver a more polished and reliable user experience for Firefox users. This practical implementation underscores the potential for broader adoption of similar techniques within other software development organizations.

The pursuit of automated issue resolution, as detailed in this research, echoes a fundamental truth about complex systems. One anticipates that even the most meticulously crafted workflows will eventually succumb to entropy. As Brian Kernighan observed, “Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.” This sentiment applies equally to the design of AI-integrated systems for software maintenance; the quest for elegant automation must acknowledge the inherent unpredictability of real-world issues and the inevitable need for adaptation. The research’s focus on enhancing issue report quality, for instance, isn’t about achieving perfect input; it’s about building a more resilient system capable of gracefully handling imperfect data, recognizing that chaos is not a bug, but a feature.

What Lies Ahead?

Automated issue resolution, as detailed within, inevitably reveals less about ‘fixing’ software and more about the inherent fragility of constructed systems. Each successfully localized bug, each automatically proposed solution, is merely a temporary reprieve. The ecosystem will, without fail, generate new, more subtle failures. A system that never breaks is, after all, a dead one – a monument to stagnation, incapable of adapting to the pressures of use and the inevitable decay of its constituent parts.

Future work will not, therefore, be measured by the percentage of issues ‘resolved’, but by the richness of the failure modes revealed. The true metric lies not in efficiency, but in the system’s capacity to become more interesting as it degrades. This requires a shift in focus – away from brittle, pre-defined ‘solutions’ and towards mechanisms for graceful degradation, for allowing human ingenuity to flourish within the cracks. Perfection, it bears remembering, leaves no room for people.

The current emphasis on LLMs and workflow analysis is but a scaffolding. The next stage demands an understanding of issue reports not as data points, but as symptoms – expressions of a complex interplay between user expectation, code implementation, and the unpredictable nature of human interaction. The goal is not to eliminate these reports, but to cultivate them – to treat them as vital signals within a constantly evolving system, and acknowledge that a healthy software ecosystem is, at its core, a beautifully messy one.


Original article: https://arxiv.org/pdf/2512.10238.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
