Author: Denis Avetisyan
A new approach uses collaborative AI to predict software flaws by accounting for how code changes over time, overcoming limitations of traditional methods.

This work introduces a change-aware, file-level defect prediction framework leveraging multi-agent debate to mitigate label persistence bias and improve accuracy.
Despite reported advancements, much of the progress in file-level software defect prediction (SDP) may be illusory, stemming from biases introduced by persistent file labels across software versions. This paper, ‘From Illusion to Insight: Change-Aware File-Level Software Defect Prediction Using Agentic AI’, reframes SDP as a change-aware task, focusing on reasoning about code modifications rather than static snapshots, and introduces a multi-agent debate framework driven by large language models. Experiments demonstrate that this approach yields more balanced and sensitive defect predictions, particularly for critical defect transitions, exposing fundamental flaws in conventional evaluation practices. Could a shift toward change-aware reasoning unlock genuinely reliable and insightful defect prediction for evolving software systems?
The Illusion of Predictability: Why Static Analysis Falls Short
Software defect prediction stands as a cornerstone of efficient software engineering, directly impacting both financial resources and product dependability. By proactively identifying potential flaws within source code, development teams can strategically allocate testing and debugging efforts, dramatically reducing the substantial costs associated with post-release maintenance. This predictive capability isn’t simply about fixing bugs; it’s about preventing them from ever reaching end-users, bolstering user trust and safeguarding the reputation of the software itself. A robust defect prediction strategy allows for earlier intervention in the development lifecycle, shifting from reactive bug fixing to a proactive, preventative approach that ultimately delivers higher-quality, more reliable software.
Conventional software defect prediction frequently leverages static code metrics – lines of code, cyclomatic complexity, and similar quantifiable characteristics – to assess the likelihood of bugs. However, this reliance can be deceptively fragile. These models often struggle to differentiate between meaningful changes that genuinely impact defect risk and superficial alterations, such as code formatting or variable renaming. Consequently, a seemingly improved codebase, according to static metrics, may not actually exhibit fewer bugs, and conversely, minor refactoring can be misinterpreted as a significant risk factor. This sensitivity to cosmetic changes limits the practical applicability of these models in dynamic development environments where code evolves rapidly, necessitating more robust approaches that focus on semantic understanding rather than surface-level features.
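To make the fragility concrete, the following minimal sketch (a toy example, not from the paper) computes two common static metrics, non-blank lines of code and a crude cyclomatic-complexity proxy, for a function before and after a purely cosmetic rename; the metrics come out identical, so a metric-based model cannot distinguish this edit from a semantically meaningful one.

```python
# Toy illustration (not the paper's setup): two versions of a function that
# differ only by variable renaming yield identical static metrics.

def loc(source: str) -> int:
    """Count non-blank source lines."""
    return sum(1 for line in source.splitlines() if line.strip())

def cyclomatic_proxy(source: str) -> int:
    """Crude cyclomatic-complexity proxy: 1 + number of branching keywords."""
    keywords = ("if ", "elif ", "for ", "while ")
    return 1 + sum(source.count(k) for k in keywords)

before = """
def total(items):
    s = 0
    for x in items:
        if x > 0:
            s += x
    return s
"""

# Same logic, variables renamed: a purely cosmetic change.
after = """
def total(values):
    acc = 0
    for v in values:
        if v > 0:
            acc += v
    return acc
"""

print(loc(before), loc(after))                         # identical
print(cyclomatic_proxy(before), cyclomatic_proxy(after))  # identical
```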
A significant limitation of many software defect prediction models lies in their susceptibility to ‘label persistence bias’. Rather than genuinely learning the relationship between code characteristics and actual defects, these models frequently memorize previously assigned labels – meaning a component historically flagged as buggy may continue to be predicted as defective even if the underlying issues have been resolved. This creates a self-fulfilling prophecy, where past classifications unduly influence future predictions, hindering the model’s ability to accurately identify new defects stemming from current code. Consequently, superficial code changes – refactoring without bug fixes, for example – can be misinterpreted as indicators of continued problems, leading to wasted effort and a false sense of security regarding genuinely problematic areas of the software.
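As a small illustration of how this bias inflates results (a toy example, not the paper's experimental setup), consider an evaluation across two consecutive releases in which most files keep their previous label: a "model" that simply copies the prior release's labels scores deceptively well while missing every genuine status change.

```python
# Toy illustration of label persistence bias (not the paper's dataset).
# Labels: 1 = defective, 0 = clean, for the same files in two releases.
prior_labels   = [1, 1, 0, 0, 0, 1, 0, 0, 1, 0]
current_labels = [1, 1, 0, 0, 0, 0, 0, 1, 1, 0]  # one fix (idx 5), one new defect (idx 7)

# A "predictor" that just replays last release's labels.
predictions = list(prior_labels)

accuracy = sum(p == y for p, y in zip(predictions, current_labels)) / len(current_labels)
print(f"accuracy by copying old labels: {accuracy:.0%}")  # 80%: looks strong

# But on the files whose status actually changed, it is wrong every time.
changed = [i for i, (a, b) in enumerate(zip(prior_labels, current_labels)) if a != b]
hits = sum(predictions[i] == current_labels[i] for i in changed)
print(f"correct on changed files: {hits}/{len(changed)}")  # 0/2
```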

Deep Learning: Trading Simple Errors for Complex Ones
Deep learning approaches to software defect prediction (SDP) take raw code and version-control commit data as input, enabling the automated learning of complex code representations without hand-engineered features. This contrasts with traditional methods that rely on expert-defined code characteristics. By processing code directly, including source code, diffs, and associated commit messages, these models can identify patterns correlating code changes with the introduction or resolution of defects. The learned representations capture nuanced relationships within the codebase, potentially surpassing the accuracy of feature-based SDP techniques. Because relevant features are learned directly from the data rather than extracted explicitly, these models can also generalize more readily to new codebases and programming languages.
Several deep learning models are currently employed for software defect prediction, each with distinct architectural designs and pre-training methodologies. ASTNN processes code structure directly through an abstract syntax tree-based neural network, while DeepJIT targets just-in-time defect prediction, scoring individual commits as they are made. Transformer-based models, including CodeBERT, CodeT5+, XLNet, and the more recent StarCoder2, rely on self-attention and benefit from pre-training on large code corpora, typically with masked or causal language modeling objectives that teach the models contextual code representations. Variations in model size, training data, and specific pre-training tasks account for the performance differences across these architectures.
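As a hedged illustration of how such transformer-based predictors consume raw change data, the sketch below uses the Hugging Face transformers library and the public microsoft/codebert-base checkpoint; the checkpoint choice, input formatting, and classification head are assumptions for illustration, not the paper's exact setup.

```python
# Sketch of a feature-free defect classifier over raw change data.
# Assumes: pip install torch transformers; microsoft/codebert-base is a public checkpoint.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
encoder = AutoModel.from_pretrained("microsoft/codebert-base")
classifier = torch.nn.Linear(encoder.config.hidden_size, 2)  # clean vs. defective

commit_message = "fix: handle empty list in total()"
diff = "-    s = items[0]\n+    s = 0 if not items else items[0]"

# The commit message and diff are passed as a sentence pair; the model learns
# its own representation of the change instead of relying on hand-crafted metrics.
inputs = tokenizer(commit_message, diff, truncation=True, max_length=512,
                   return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state[:, 0]  # [CLS]-style pooled token
logits = classifier(hidden)
print(logits.softmax(dim=-1))  # untrained probabilities; training loop omitted
```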
While increasing the size of deep learning models used for Software Defect Prediction (SDP) – such as ASTNN, CodeBERT, and StarCoder2 – consistently improves predictive accuracy on benchmark datasets, this scaling does not inherently address the problem of explainability. These models primarily identify correlations between code changes and defects; they do not establish causal relationships or provide insights into the underlying reasons a specific alteration introduces or resolves a vulnerability. Consequently, developers still require manual analysis to understand the root cause of defects, limiting the practical utility of scaled models for debugging and preventative code modification, and failing to deliver actionable intelligence beyond simple prediction.
Beyond Static Analysis: Modeling Debate for More Robust Prediction
The Multi-Agent Debate framework approaches software defect prediction by modeling the reasoning process as a debate between specialized agents. It moves beyond traditional SDP methods that rely on static snapshots of the code by simulating a dynamic evaluation of code changes. The framework defines distinct agent roles (Proposer, Analyzer, Skeptic, and Judge), each contributing to the assessment of potential defects and their resolution. This simulated debate allows for a more nuanced understanding of how defects are introduced and how code modifications affect them, with the aim of improving the accuracy and reliability of defect prediction compared to methods that consider only static code characteristics.
The Multi-Agent Debate framework utilizes four distinct agents to simulate a reasoned evaluation of code changes. The Proposer initiates the debate by suggesting a potential defect. The Analyzer then performs a dual-sided assessment, considering both supporting and opposing evidence for the proposed defect. The Skeptic challenges the Analyzer’s findings, probing for weaknesses in the reasoning. Finally, the Judge reviews the arguments presented by all agents and renders a decision regarding the validity of the proposed defect, effectively modeling a comprehensive code review process through structured argumentation.
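A compact sketch of how such a debate might be orchestrated is shown below. The `llm` callable, the prompt wording, and the two-round protocol are assumptions for illustration; the paper defines these roles, but this is not its exact implementation.

```python
# Illustrative orchestration of a Proposer / Analyzer / Skeptic / Judge debate.
# `llm(prompt)` is a stand-in for any chat-completion call; prompts are simplified.
from typing import Callable

def debate(file_diff: str, llm: Callable[[str], str], rounds: int = 2) -> str:
    transcript = []

    # Proposer opens with an initial defect hypothesis for the change.
    proposal = llm("Proposer: given this change, state whether the file is likely "
                   f"defective and why.\n{file_diff}")
    transcript.append(("Proposer", proposal))

    for _ in range(rounds):
        # Analyzer weighs evidence on both sides of the current proposal.
        analysis = llm("Analyzer: weigh evidence FOR and AGAINST the proposal.\n"
                       f"Change:\n{file_diff}\nProposal:\n{proposal}")
        transcript.append(("Analyzer", analysis))

        # Skeptic probes the analysis for weak reasoning.
        challenge = llm(f"Skeptic: attack the weakest step in this analysis.\n{analysis}")
        transcript.append(("Skeptic", challenge))
        proposal = analysis  # the next round argues over the refined assessment

    # Judge reviews the full transcript and renders the final decision.
    verdict = llm("Judge: given the transcript below, answer DEFECTIVE or CLEAN "
                  "with a one-sentence justification.\n" +
                  "\n".join(f"[{role}] {text}" for role, text in transcript))
    return verdict
```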
The Analyzer agent within the Multi-Agent Debate framework employs a ‘Dual-Sided Reasoning’ process to evaluate potential defects. This involves explicitly considering both evidence supporting the presence of a defect and evidence suggesting its absence. By analyzing arguments for and against a given code change as a potential defect, the Analyzer generates a more comprehensive assessment than methods focusing solely on identifying problematic code. This approach aims to reduce false positives and improve the overall accuracy of defect prediction by providing a balanced evaluation of each status-transition subset, contributing to the framework’s reported F1 score of 0.57.
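The sketch below shows one hypothetical way to represent such a dual-sided assessment as structured data rather than free text; the field names and the toy tie-breaking heuristic are assumptions, not the paper's schema.

```python
# Hypothetical structure for the Analyzer's dual-sided output: evidence is
# collected on both sides before any verdict is suggested.
from dataclasses import dataclass, field

@dataclass
class DualSidedAssessment:
    evidence_for_defect: list[str] = field(default_factory=list)
    evidence_against_defect: list[str] = field(default_factory=list)

    def leaning(self) -> str:
        """Summarize which side currently has more support (toy heuristic)."""
        delta = len(self.evidence_for_defect) - len(self.evidence_against_defect)
        return "defective" if delta > 0 else "clean" if delta < 0 else "undecided"

assessment = DualSidedAssessment(
    evidence_for_defect=["null check removed in error path"],
    evidence_against_defect=["new unit test covers the empty-input case",
                             "change only renames a local variable"],
)
print(assessment.leaning())  # "clean"
```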
The Multi-Agent Debate framework assesses file evolution not through analysis of static code, but by evaluating performance on ‘Status-Transition Subsets’. These subsets represent changes in file status, allowing the framework to reason about the impact of modifications. This approach focuses on the difference introduced by the change, rather than the inherent characteristics of the code itself. Quantitative results demonstrate a Harmonic Mean of Benign Prior Status (HMB) of 0.57 when using this status-transition methodology, indicating a balanced performance between identifying true defects and avoiding false positives when evaluating changes.
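A minimal sketch of how status-transition subsets and a harmonic-mean summary could be computed is given below, assuming simple binary labels per file in consecutive versions; the exact subset definitions and the HMB/HMD formulas in the paper may differ, so treat this as an interpretation rather than the paper's metric code.

```python
# Sketch: group files by (prior -> current) status and summarize prediction
# quality per transition. Subset names and the HMB-style aggregation below are
# an interpretation of the paper's metrics, not its exact definitions.
from statistics import harmonic_mean

def transition_subsets(prior, current):
    """Map each transition, e.g. (0, 1) = defect introduced, to file indices."""
    subsets = {}
    for i, (p, c) in enumerate(zip(prior, current)):
        subsets.setdefault((p, c), []).append(i)
    return subsets

def subset_accuracy(indices, y_true, y_pred):
    return sum(y_pred[i] == y_true[i] for i in indices) / len(indices)

prior   = [0, 0, 0, 0, 1, 1, 1, 0]
current = [0, 0, 1, 1, 1, 0, 1, 0]   # two defects introduced, one fixed
preds   = [0, 0, 1, 0, 1, 1, 1, 0]

subsets = transition_subsets(prior, current)
scores = {t: subset_accuracy(idx, current, preds) for t, idx in subsets.items()}
print(scores)

# HMB-style summary: harmonic mean over the two benign-prior transitions, so a
# model cannot look good by nailing "stays clean" while missing "becomes defective".
hmb = harmonic_mean([scores[(0, 0)], scores[(0, 1)]])
print(round(hmb, 2))
```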
Performance evaluations of the Multi-Agent Debate framework indicate an overall F1 Score of 0.57. This represents a measurable improvement of 5-10% when compared to results obtained using traditional SDP baselines. Notably, the framework exhibits enhanced performance specifically on the D01 subset, which focuses on the challenging task of defect introduction, suggesting its efficacy in identifying and reasoning about newly introduced errors within a code base. These results demonstrate the framework’s ability to more accurately balance precision and recall in the context of software defect prediction.

Shifting the Focus: Predicting Change, Not Just Code
Traditional software defect prediction (SDP) methods often treat code as a static entity, overlooking the crucial impact of file evolution on the introduction of defects. The ‘Change-Aware Formulation’ addresses this limitation by explicitly modeling how files change over time and, critically, evaluating defect prediction models not simply on the code itself, but on those changes. This approach moves beyond identifying potentially problematic code regions to understanding how modifications – additions, deletions, or alterations – contribute to the likelihood of defects. By focusing on the dynamics of software development, the framework captures subtle yet significant relationships between code churn and quality, offering a more nuanced and accurate assessment of defect risk throughout the software lifecycle. The ability to reason about change allows for more timely and effective interventions, ultimately improving the robustness and reliability of software systems.
Traditional static analysis, while valuable, often struggles to accurately predict defects in actively evolving software projects due to its reliance on a single snapshot of the code. This change-aware formulation directly addresses this limitation by incorporating the history of file modifications into the defect prediction process. Rather than simply analyzing the current state of the codebase, the methodology examines how changes impact the likelihood of introducing defects, considering factors like the size and complexity of modifications, the developers involved, and the specific code areas affected. This dynamic approach significantly improves prediction accuracy, enabling developers to proactively identify and address potential issues before they manifest as bugs in live systems, and ultimately leading to more robust and reliable software.
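The sketch below illustrates one plausible way to assemble change-aware instances from two consecutive releases, pairing each file's diff with its label transition rather than with a single snapshot label; the data structure, field names, and diffing approach are assumptions for illustration, not the paper's pipeline.

```python
# Sketch: build change-aware prediction instances from two consecutive releases.
# Each instance carries the textual change and the label transition, so a model
# is evaluated on the modification, not on a persistent snapshot label.
import difflib
from dataclasses import dataclass

@dataclass
class ChangeInstance:
    path: str
    diff: str              # what changed between version N and N+1
    prior_defective: bool  # label in version N
    now_defective: bool    # label in version N+1 (prediction target)

def build_instances(old_files, new_files, old_labels, new_labels):
    """old_files/new_files: {path: source}; old_labels/new_labels: {path: bool}."""
    instances = []
    for path in sorted(set(old_files) & set(new_files)):
        diff = "\n".join(difflib.unified_diff(
            old_files[path].splitlines(), new_files[path].splitlines(),
            fromfile=f"a/{path}", tofile=f"b/{path}", lineterm=""))
        # Unchanged files get an empty diff; how the paper treats them may differ.
        instances.append(ChangeInstance(path, diff,
                                        old_labels[path], new_labels[path]))
    return instances
```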
Traditional software maintenance often reacts to defects after they appear, but this framework shifts the focus to anticipating issues before they manifest. It accomplishes this by moving beyond simply identifying where changes occur and instead concentrating on understanding why those changes introduce defects. The system analyzes the logical connections between code modifications and potential failure points, reasoning about the impact of each alteration on the system’s overall integrity. This capability enables a more proactive approach, allowing developers to address vulnerabilities during the modification process rather than after deployment. Consequently, software maintenance evolves from a reactive, corrective practice to a predictive and preventative strategy, promising substantial improvements in software reliability and a reduction in long-term maintenance costs.
Evaluations of the change-aware formulation reveal a Harmonic Mean of Defective Prior Status (HMD) reaching 0.53, a key indicator of its enhanced ability to pinpoint the introduction of defects during software evolution. This metric, which balances precision and recall in identifying defective code changes, signifies a notable improvement over traditional static analysis techniques. The achieved HMD score demonstrates that the framework isn’t simply flagging potential issues, but accurately predicting which modifications are most likely to introduce errors, offering a practical advantage for developers aiming to proactively address vulnerabilities and maintain code integrity. This performance suggests a move toward more reliable software maintenance, ultimately contributing to the development of more secure and robust applications.
The advent of change-aware formulation signals a departure from traditional static software quality assurance, moving towards a dynamic understanding of code evolution and its impact on system reliability. Rather than solely assessing a codebase at a single point in time, this methodology actively models the process of change, allowing for a more nuanced prediction of defect introduction. By focusing on the causal links between modifications and potential vulnerabilities, it facilitates proactive maintenance and allows developers to address issues before they manifest as failures. This shift isn’t merely about improving defect detection rates – evidenced by a Harmonic Mean of Defective Prior Status reaching 0.53 – but fundamentally reimagines quality assurance as an ongoing, adaptive process, ultimately contributing to the creation of demonstrably more robust and secure software systems.
The pursuit of defect prediction, as outlined in this paper, feels predictably circular. It attempts to refine accuracy by accounting for ‘change-aware’ factors and multi-agent debate, essentially layering complexity atop existing complexities. One anticipates production environments will invariably discover novel failure modes, rendering even the most sophisticated ‘status-transition subsets’ irrelevant. As Tim Berners-Lee observed, “The Web is more a social creation than a technical one.” This feels apt; the ‘social creation’ of software, with its constant churn and unpredictable user interactions, will always outpace the technical attempts to predict its flaws. Everything new is just the old thing with worse docs, and in this case, a larger model to train.
What’s Next?
This work, while presenting a nuanced approach to change-aware defect prediction, merely shifts the inevitable horizon of technical debt. The multi-agent debate, a clever mechanism for mitigating label persistence bias, will itself become a point of failure. Production systems, relentlessly evolving, will discover novel edge cases, new forms of code entropy, and interaction patterns the agentic framework did not anticipate. The elegance of the status-transition subsets, the carefully constructed debate protocols – all will eventually succumb to the sheer chaos of real-world software lifecycles.
Future effort will undoubtedly focus on automating the adaptation of these agentic systems, perhaps employing reinforcement learning to tune debate strategies based on observed prediction failures. However, this risks an escalating arms race between prediction models and the increasingly complex bugs they attempt to foresee. A more fruitful, though less glamorous, path may lie in accepting inherent unpredictability and focusing on faster, more robust rollback mechanisms – admitting that every abstraction dies in production, and preparing for the crash.
Ultimately, the true challenge isn’t achieving perfect prediction, but building systems resilient enough to survive imperfect ones. The pursuit of ever-more-sophisticated models will continue, but the field should also invest in the unglamorous work of failure analysis and operational robustness. Everything deployable will eventually crash; the art lies in minimizing the damage when it does.
Original article: https://arxiv.org/pdf/2512.23875.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/