Can We Predict When AI-Generated Code Needs the Most Review?

Author: Denis Avetisyan


New research shows that early analysis of code changes created by AI agents can accurately forecast the effort required for effective human review.

The model prioritizes a select subset of pull requests, identified as the “critical few”, to maximize utility, while producing calibrated probability estimates that closely track the observed outcomes.

Structural complexity within pull requests authored by AI agents serves as a strong predictor of review effort and potential abandonment, enabling improved triage and human-AI collaboration.

While AI agents increasingly contribute to software development through pull requests, efficiently managing the review process remains a significant challenge. This is addressed in ‘Early-Stage Prediction of Review Effort in AI-Generated Pull Requests’, which analyzes over 33,000 agent-authored PRs to reveal a distinct behavioral pattern characterized by either rapid merging or prolonged, often abandoned, refinement cycles, a phenomenon termed ‘agentic ghosting’. The study demonstrates that static structural features of these PRs can accurately predict review effort, enabling a triage model that intercepts 69% of total effort with zero-latency governance. Does this focus on structural complexity represent a fundamental shift towards prioritizing proactive governance in effective human-AI collaboration for software engineering?


The AI Floodgates: When Contributions Outpace Capacity

Open-source software development is experiencing a notable shift as artificial intelligence agents become increasingly prolific contributors. These agents are now automatically generating and submitting pull requests – proposed code changes – at a rapidly growing rate across numerous projects. This surge in contributions, while potentially accelerating innovation, presents a unique challenge to established workflows. The sheer volume of incoming changes necessitates adaptations in how these projects are managed and reviewed, as traditional, human-centered processes struggle to keep pace with the increased throughput. The phenomenon signals a fundamental change in the collaborative landscape of software creation, prompting exploration into methods for effectively integrating AI-driven contributions into the open-source ecosystem.

The escalating integration of artificial intelligence into open-source development is placing considerable strain on established code review procedures. Traditionally, human reviewers assess contributions for quality, security, and adherence to project standards; however, the sheer volume of AI-generated pull requests now threatens to overwhelm this system. This surge isn’t simply a matter of increased workload, but also introduces inefficiencies as reviewers grapple with code that may differ significantly in style or approach from human contributions. The existing review infrastructure, designed for a predictable flow of submissions, struggles to adapt to the rapid and often continuous influx from AI agents, potentially delaying critical updates and fostering a backlog of unreviewed code. Consequently, projects face a growing risk of bottlenecks, hindering innovation and impacting the overall velocity of development.

Recent analyses of open-source contributions reveal a concerning trend of ‘agentic ghosting’ – instances where AI-authored pull requests remain unaddressed for extended periods – occurring in 3.8% of cases. This phenomenon isn’t simply a matter of increased volume; it actively amplifies the challenges posed by the growing number of AI contributors. While automated agents can rapidly generate code suggestions, the lack of responsiveness following submission places a disproportionate burden on human reviewers. These stalled pull requests accumulate, creating a review bottleneck and potentially hindering the progress of vital software projects. The issue stems from the agents’ inability to respond to feedback or address reviewer concerns, effectively leaving human maintainers to resolve issues without assistance, thus negating some of the efficiency gains promised by AI-driven development.

Agent abandonment rates differ significantly: while multi-component interactions correlate with increased abandonment, pull requests that touch continuous integration (CI) configuration show a reduced risk of abandonment.

The Two Faces of AI Contribution: Effort and Abandonment

Analysis of the AIDev dataset demonstrates a bimodal distribution in the lifecycle of pull requests authored by agents. Approximately half of all agent-submitted pull requests are merged immediately upon creation, indicating a lack of required review or minimal changes. Conversely, the remaining pull requests enter iterative review loops, characterized by multiple rounds of feedback and modification before eventual merging. This suggests a distinct bifurcation in contribution patterns, with some changes accepted without scrutiny while others necessitate substantial refinement, potentially due to complexity or deviation from existing code standards.

Analysis of agent contributions demonstrates a strong correlation between pull request size and review process duration. Specifically, larger submissions, measured by the total number of lines changed, consistently undergo more extensive review cycles. This manifests as a higher number of review iterations, increased reviewer participation, and a longer time to merge. Conversely, smaller changes are frequently merged rapidly with minimal review. This observed relationship suggests that the magnitude of a contribution is a primary driver in determining the level of scrutiny applied during the code review process, effectively creating distinct outcomes based on submission size.
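As a rough illustration of how such a size-versus-duration relationship can be checked, the sketch below computes a rank correlation between pull request size and time-to-merge. The column names (`lines_changed`, `hours_to_merge`) and the tiny data frame are hypothetical stand-ins, not the paper's actual dataset or procedure.

```python
# Minimal sketch: rank correlation between PR size and review duration.
# Column names and values are hypothetical stand-ins for the AIDev dataset.
import pandas as pd
from scipy.stats import spearmanr

prs = pd.DataFrame({
    "lines_changed":  [12, 450, 38, 1200, 75, 9, 300],
    "hours_to_merge": [0.2, 48.0, 3.5, 120.0, 6.0, 0.1, 30.0],
})

rho, p_value = spearmanr(prs["lines_changed"], prs["hours_to_merge"])
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3g})")
```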

A Gaussian Mixture Model (GMM) was implemented to statistically delineate the observed two-regime outcome in agent contributions. The GMM operates by modeling the distribution of pull request sizes as a mixture of two Gaussian distributions, effectively clustering contributions based on the likelihood of either immediate merging or extended review. Parameter estimation via Expectation-Maximization identified distinct means and variances for each Gaussian component, quantitatively confirming the segregation of small, rapidly-merged contributions from larger, iteratively-reviewed submissions. This data-driven approach allows for objective characterization of agent contribution patterns and provides a basis for predicting merge outcomes based on the size of the submitted changes.
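A minimal sketch of this kind of two-component fit is shown below, using scikit-learn's GaussianMixture on log-transformed pull request sizes. The synthetic sizes and the reading of the two components are illustrative assumptions, not the paper's exact modeling choices.

```python
# Sketch: fit a two-component Gaussian Mixture Model to log PR sizes
# to separate an "instant merge" regime from an "iterative review" regime.
# The synthetic sizes below are illustrative, not the AIDev data.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
small = rng.lognormal(mean=3.5, sigma=0.6, size=500)   # quickly merged PRs
large = rng.lognormal(mean=5.5, sigma=0.8, size=500)   # iteratively reviewed PRs
sizes = np.concatenate([small, large])

X = np.log(sizes).reshape(-1, 1)           # model the log of total lines changed
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

print("component means (log scale):", gmm.means_.ravel())
print("component weights:", gmm.weights_)
regime = gmm.predict(X)                     # 0/1 regime label per pull request
```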

Instant merges are narrower in scope (median 68 changes versus 104 for standard pull requests) and modify critical configuration files less frequently (7.1% versus 18.4%).

Predicting the Pain Points: A Machine Learning Approach

A Light Gradient Boosting Machine (LightGBM) model was implemented to identify pull requests requiring disproportionately high levels of reviewer effort. This model was trained to predict ‘high-effort’ pull requests based on quantifiable characteristics of the code changes. The rationale for this approach is to proactively flag these requests, allowing for resource allocation and potentially triggering more thorough or dedicated review processes. Identifying these requests allows teams to better manage review workloads and maintain code quality by ensuring adequate scrutiny is applied where it is most needed.
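In outline, such a classifier can be set up as in the sketch below. The placeholder features, hyperparameters, and the synthetic ‘high-effort’ label are assumptions for illustration, not the study's configuration.

```python
# Sketch: a LightGBM classifier that flags likely high-effort pull requests.
# Features, labels, and hyperparameters here are illustrative placeholders.
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 6))                                        # placeholder feature matrix
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1000) > 0).astype(int) # placeholder "high effort" label

clf = lgb.LGBMClassifier(n_estimators=300, learning_rate=0.05, num_leaves=31)
clf.fit(X, y)

effort_prob = clf.predict_proba(X)[:, 1]   # probability that each PR is high-effort
print("mean predicted high-effort probability:", effort_prob.mean())
```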

The predictive model utilized features derived from both the time of pull request creation and the structural characteristics of the code changes. ‘Creation-time features’ encompass aspects such as the day of the week, time of day, and the time elapsed between commits, providing insights into development patterns. ‘Structural complexity’ is quantified through metrics like the number of files changed, lines of code added or removed, cyclomatic complexity, and the number of commits within the pull request, all of which indicate the scope and potential difficulty of reviewing the submitted code.
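The sketch below shows one plausible way to assemble such creation-time and structural features from raw pull request records. Every column name is a hypothetical stand-in for whatever the dataset actually records, and metrics such as cyclomatic complexity or inter-commit timing are omitted for brevity.

```python
# Sketch: building creation-time and structural features for each PR.
# Column names are hypothetical stand-ins for the underlying PR records.
import pandas as pd

raw = pd.DataFrame({
    "created_at":    ["2024-03-01 09:15", "2024-03-02 22:40"],
    "files_changed": [3, 27],
    "lines_added":   [40, 900],
    "lines_deleted": [5, 310],
    "commit_count":  [1, 12],
})
raw["created_at"] = pd.to_datetime(raw["created_at"])

features = pd.DataFrame({
    # creation-time features
    "hour_of_day":   raw["created_at"].dt.hour,
    "day_of_week":   raw["created_at"].dt.dayofweek,
    # structural-complexity features
    "files_changed": raw["files_changed"],
    "total_churn":   raw["lines_added"] + raw["lines_deleted"],
    "commit_count":  raw["commit_count"],
})
print(features)
```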

The LightGBM model utilized for predicting high-effort pull requests demonstrates a high degree of accuracy, as quantified by an Area Under the Receiver Operating Characteristic Curve (AUC) of 0.9571. This AUC score indicates the model’s ability to distinguish between high-effort and low-effort pull requests. Furthermore, the Precision-Recall Area Under the Curve (PR-AUC) value of 0.8812 confirms the model’s effectiveness in identifying high-effort pull requests while minimizing false positives, particularly relevant in scenarios where imbalanced datasets are present. Both metrics collectively validate the model’s strong predictive power and suitability for resource allocation in code review processes.
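Both metrics can be reproduced on held-out predictions as in the sketch below; the arrays are placeholders, and scikit-learn's average precision is used as the customary PR-AUC approximation.

```python
# Sketch: scoring held-out predictions with ROC-AUC and PR-AUC.
# y_true and y_score are placeholders for held-out labels and model scores.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.3, 0.8, 0.65, 0.2, 0.9, 0.4, 0.7])

print("ROC-AUC:", roc_auc_score(y_true, y_score))
print("PR-AUC (average precision):", average_precision_score(y_true, y_score))
```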

The empirical cumulative distribution function shows the time elapsed between receiving feedback and issue closure during the label audit.

Validating the Signal: Calibration and Performance

Leave-One-Agent-Out (LOAO) cross-validation was employed to rigorously evaluate the LightGBM model’s generalizability. This method involved iteratively training the model on all agents except one, and then predicting outcomes for the excluded agent; this process was repeated for each agent in the dataset. The resulting performance metrics were then averaged to provide a robust estimate of the model’s ability to generalize to unseen data, mitigating potential biases from specific agent characteristics and ensuring reliable performance across the entire agent population.
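Leave-One-Agent-Out splitting maps directly onto scikit-learn's LeaveOneGroupOut, as in the sketch below; the features, labels, agent identifiers, and classifier settings are illustrative placeholders rather than the paper's setup.

```python
# Sketch: Leave-One-Agent-Out cross-validation via LeaveOneGroupOut.
# Each fold trains on all agents except one and tests on the held-out agent.
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
X = rng.normal(size=(600, 5))                         # placeholder features
y = (X[:, 0] + rng.normal(size=600) > 0).astype(int)  # placeholder labels
agents = rng.integers(0, 6, size=600)                 # placeholder agent IDs

scores = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=agents):
    clf = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05)
    clf.fit(X[train_idx], y[train_idx])
    proba = clf.predict_proba(X[test_idx])[:, 1]
    scores.append(roc_auc_score(y[test_idx], proba))

print("mean held-out-agent AUC:", np.mean(scores))
```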

Model calibration was evaluated using the Brier score, a metric quantifying the accuracy of probabilistic predictions. The initial LightGBM model demonstrated a Brier score indicating potential miscalibration. To address this, Platt Scaling, a logistic regression technique, was applied. Following Platt Scaling calibration, the Brier score was reduced to 0.1279. This score confirms the model’s ability to provide well-calibrated probability estimates, meaning predicted probabilities align closely with observed frequencies of events, which is crucial for reliable decision-making based on model outputs.
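A minimal, manual version of Platt Scaling is sketched below: a logistic regression is fit on a calibration split's raw scores, its sigmoid output is treated as the calibrated probability, and the Brier score measures the effect on a separate test split. The splits and numbers are illustrative, not the paper's.

```python
# Sketch: Platt Scaling of raw classifier scores, assessed with the Brier score.
# A logistic regression fit on calibration-split scores maps raw scores to
# calibrated probabilities; the Brier score is compared on a held-out test split.
import numpy as np
import lightgbm as lgb
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=2000) > 0).astype(int)

X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.4, random_state=3)
X_cal, X_te, y_cal, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=3)

clf = lgb.LGBMClassifier(n_estimators=300, learning_rate=0.05).fit(X_tr, y_tr)
raw_cal = clf.predict_proba(X_cal)[:, 1]
raw_te  = clf.predict_proba(X_te)[:, 1]

# Platt Scaling: fit a sigmoid (logistic regression) on the calibration scores.
platt = LogisticRegression().fit(raw_cal.reshape(-1, 1), y_cal)
calibrated_te = platt.predict_proba(raw_te.reshape(-1, 1))[:, 1]

print("Brier before calibration:", brier_score_loss(y_te, raw_te))
print("Brier after calibration: ", brier_score_loss(y_te, calibrated_te))
```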

The LightGBM model achieved an Area Under the Curve (AUC) of 0.9571, representing a statistically significant improvement in predictive performance when contrasted with the AUC of 0.933 obtained using a size-only heuristic. This difference indicates that the LightGBM model more effectively discriminates between positive and negative instances, leading to more accurate predictions beyond those achievable through a simple size-based approach. The observed increase in AUC highlights the model’s ability to incorporate and leverage complex relationships within the data to enhance predictive power.
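The size-only heuristic comparison amounts to ranking pull requests by their total lines changed alone and scoring that ranking with the same AUC metric, as sketched below with placeholder data.

```python
# Sketch: AUC of a size-only heuristic versus a learned model's scores.
# Using raw size as the ranking score requires no training at all.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true      = np.array([0, 1, 0, 1, 1, 0, 1, 0])                      # placeholder high-effort labels
size_score  = np.array([30, 800, 120, 950, 400, 60, 700, 90])         # total lines changed
model_score = np.array([0.1, 0.9, 0.3, 0.95, 0.7, 0.2, 0.85, 0.15])   # learned-model probabilities

print("size-only AUC:", roc_auc_score(y_true, size_score))
print("model AUC:    ", roc_auc_score(y_true, model_score))
```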

Beyond First-Come, First-Served: A Proactive Review Policy

Analysis reveals a strong rationale for adopting a ‘gated triage policy’ when evaluating pull requests submitted by agents. This proactive approach centers on strategically prioritizing reviews based on predicted effort – essentially, identifying submissions likely to demand significant reviewer time and attention. The study demonstrates that by pre-screening these requests, development teams can move beyond a first-come, first-served system to one that optimizes resource allocation. This isn’t simply about faster reviews; it’s about ensuring that the most complex and potentially impactful changes receive the detailed scrutiny they require, ultimately bolstering code quality and accelerating the overall development cycle. The findings advocate for a shift towards a more intelligent review process, recognizing that not all contributions are created equal and deserve commensurate levels of attention.

A predictive model enables the automated identification of pull requests likely to require substantial review effort. This system analyzes incoming code submissions, assessing complexity and potential for requiring significant reviewer time and expertise. By proactively flagging these ‘high-effort’ pull requests, the development process shifts from reactive assessment to prioritized review, ensuring that the most demanding changes receive immediate attention. This targeted approach optimizes the use of reviewer resources, preventing bottlenecks and accelerating the integration of complex, yet crucial, code contributions into the main development branch. The model’s predictions are designed to be a seamless component of the workflow, offering a preemptive signal for focused evaluation without disrupting the standard submission process.

A focused review strategy, allocating just 20% of available resources, demonstrably captures 86.2% of pull requests requiring substantial review effort. This targeted approach addresses a critical bottleneck in software development by prioritizing complex submissions, thereby preventing experienced reviewers from being overburdened with trivial changes. Consequently, the entire development workflow experiences a marked improvement in efficiency, as critical feedback is delivered promptly and high-quality code is integrated more rapidly. The methodology effectively balances review coverage with resource constraints, suggesting a practical and scalable solution for optimizing code review processes in large-scale projects.
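One way to read that figure is as a point on an effort-capture curve: rank pull requests by predicted effort, spend the review budget on the top slice, and measure how much of the observed effort that slice contains. The sketch below illustrates the computation with synthetic placeholder data, not the study's results.

```python
# Sketch: share of total review effort captured by reviewing the top-k%
# of pull requests ranked by predicted effort. All data are placeholders.
import numpy as np

rng = np.random.default_rng(4)
predicted = rng.random(1000)                                    # model's effort probabilities
actual_effort = rng.exponential(scale=2.0, size=1000) * (0.3 + predicted)

budget = 0.20                                                   # review the top 20% of PRs
k = int(budget * len(predicted))
top = np.argsort(predicted)[::-1][:k]

captured = actual_effort[top].sum() / actual_effort.sum()
print(f"effort captured at {budget:.0%} budget: {captured:.1%}")
```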

The pursuit of predictive modeling for AI-generated pull requests feels… predictably fragile. This work, attempting to foresee review effort based on structural complexity, is a neat exercise, but one built on the assumption that patterns observed now will hold. As Isaac Newton observed, “If I have seen further, it is by standing on the shoulders of giants.” Yet, even giants eventually crumble. The paper’s focus on identifying ‘agentic ghosting’ – abandoned PRs – highlights a fundamental truth: every abstraction dies in production. These models, however elegant, will inevitably encounter edge cases, unexpected interactions, and the sheer chaos of real-world software development. It’s structured panic with dashboards, really – a valiant attempt to anticipate failure, knowing full well that failure is guaranteed.

The Road Ahead

The predictive models detailed within will, predictably, require constant recalibration. The bug tracker is, after all, a book of pain, and AI agents are remarkably efficient at discovering novel ways to populate it. Current work focuses on structural complexity as a primary indicator of review burden, but this feels… incomplete. The assumption that ‘good’ code is inherently easier to review neglects the human element – the cognitive load of understanding why an AI made a particular decision, even if the code itself is syntactically sound.

Future iterations will undoubtedly attempt to incorporate agent ‘personality’ – a quantification of stylistic consistency, or perhaps the frequency of surprising choices. This feels suspiciously like anthropomorphism, a desperate attempt to project human failings onto a system that operates by entirely different rules. But the pressure to explain AI behavior to humans is relentless.

Ultimately, the true metric isn’t review effort, but review value. It is not enough to predict how long a human will spend looking at code; the question is whether that time yields a demonstrable reduction in risk. The models do not deploy – they let go. And the consequences, as always, remain to be seen.


Original article: https://arxiv.org/pdf/2601.00753.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
