AI Pair Programming: How Do Bots Stack Up in Code Review?

Author: Denis Avetisyan


A new study examines the impact of AI coding assistants on the quality and characteristics of pull requests, comparing them directly to human-authored contributions.

Empirical analysis reveals that human-agent pull requests contain substantially more test code than human-authored ones, with comparable test quality, and that AI agents demonstrate a preference for adding new tests over maintaining existing ones.

Despite the increasing prevalence of AI coding agents in software development, the impact of human-agent collaboration on crucial practices like software testing remains poorly understood. This paper, ‘Human-Agent versus Human Pull Requests: A Testing-Focused Characterization and Comparison’, presents an empirical analysis of over 9,700 pull requests to characterize and compare testing behaviors in human-authored versus human-agent collaborative contributions. Our findings reveal that while the extent of testing is significantly higher in human-agent pull requests (nearly doubling the test-to-source line ratio), test quality is comparable, with agents demonstrating a preference for adding new tests rather than modifying existing ones. How might these evolving collaboration patterns influence long-term software maintainability and the effectiveness of testing strategies?


The Inevitable Burden of Test Maintenance

Contemporary software development is fundamentally reliant on comprehensive testing; however, the maintenance of these vital test suites presents a considerable and ongoing challenge. As software evolves, even well-crafted tests require constant attention, adaptation, and refactoring to remain relevant and effective. The sheer volume of tests needed to cover complex systems, coupled with the frequency of code changes, can quickly lead to test suite bloat and fragility. This necessitates significant investment in test automation frameworks, continuous integration pipelines, and dedicated resources focused on test maintenance – all to prevent regressions and ensure software quality isn’t eroded by the very mechanisms designed to protect it. A neglected test suite, while seemingly innocuous, can ultimately become a major bottleneck and a significant source of risk in the software lifecycle.

Software development frequently involves a dynamic interplay between modifying application code and simultaneously updating the tests designed to validate it – a process termed co-evolution. This isn’t simply adding tests after code is written; rather, as developers refine functionality, the corresponding tests must also be adapted to reflect these changes. This constant, interwoven modification dramatically increases the complexity of software projects; a change intended to improve one component can inadvertently break existing tests, requiring further adjustments to both code and tests. The result is a system where the codebase and its validation suite evolve in tandem, creating a challenge for developers to maintain stability and ensure ongoing reliability as the project scales and new features are introduced.

This simultaneous modification of software and its accompanying tests presents inherent risks to software quality. As developers alter application code, the tests validating that code must also be updated, creating a feedback loop that, if not carefully managed, can introduce unintended consequences. Maintaining test effectiveness requires dedicated resources; failing to do so can lead to false positives, where the tests incorrectly signal a failure, or, more critically, false negatives, where defects slip through undetected. This necessitates ongoing investment in test maintenance, refactoring, and rigorous validation to ensure the test suite accurately reflects the evolving codebase and continues to provide a reliable safety net against regressions. Consequently, organizations are increasingly recognizing the importance of proactive test management strategies as a core component of sustainable software development.

AI Assistance: A Temporary Reprieve, Not a Revolution

AI-assisted software engineering leverages tools, prominently AI Coding Agents, to automate traditionally manual development tasks. These agents function by generating code, suggesting improvements, and completing repetitive coding patterns, thereby augmenting developer productivity. Capabilities include code completion, bug detection, and automated unit test creation. The integration of these agents into software development lifecycles allows developers to focus on higher-level design and problem-solving, while the AI handles implementation details. This automation extends beyond simple code generation to encompass refactoring, documentation, and even the creation of entire software components based on natural language prompts or specifications.

The integration of AI coding agents into software development workflows is manifesting as a new type of pull request, termed Human-Agent PRs (HAPRs). These HAPRs differ from traditional pull requests created solely by human developers in that they contain code contributions generated by both human engineers and the AI agent. This collaborative approach represents a shift in the development process, where AI is not simply a tool but an active participant in code creation and modification, directly contributing lines of code submitted for review. The increasing prevalence of HAPRs indicates a growing trend towards utilizing AI agents as integral components of the software engineering lifecycle.

Analysis of pull requests generated through human-AI collaboration, specifically Human-Agent PRs (HAPRs), reveals a significant increase in test code contribution compared to traditional Human PRs. The study quantifies this difference with a test-to-source line ratio of 0.327 for HAPRs, that is, the lines of test code relative to the lines of source code changed in the pull request. This contrasts with a ratio of 0.185 observed in traditional Human PRs, indicating that collaborative efforts involving AI agents nearly double the proportion of test code accompanying new functionality or modifications.
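As a concrete illustration, the sketch below computes such a test-to-source ratio of added lines for a single pull request. The input format, the `is_test` flag, and the example numbers are assumptions for illustration, not the paper's actual pipeline.

```python
# Minimal sketch (illustrative assumptions): compute a test-to-source ratio
# of added lines for one pull request, given per-file change counts that
# have already been labelled as test or source code.

def test_to_source_ratio(changed_files):
    """changed_files: iterable of dicts like
    {'path': 'tests/test_parser.py', 'added': 40, 'is_test': True}."""
    test_lines = sum(f["added"] for f in changed_files if f["is_test"])
    source_lines = sum(f["added"] for f in changed_files if not f["is_test"])
    if source_lines == 0:
        return None  # test-only PR; would need separate handling in practice
    return test_lines / source_lines

pr = [
    {"path": "src/parser.py", "added": 200, "is_test": False},
    {"path": "tests/test_parser.py", "added": 66, "is_test": True},
]
print(round(test_to_source_ratio(pr), 3))  # 0.33
```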

The integration of AI coding agents into software development workflows demonstrably impacts testing practices and code quality. Analysis of Human-Agent Pull Requests (HAPRs) reveals a significantly increased ratio of test code relative to production code – 0.327 compared to 0.185 for traditional Human Pull Requests. This nearly twofold increase suggests that AI assistance actively promotes a more test-driven development approach, potentially leading to earlier detection of bugs, reduced technical debt, and an overall improvement in the reliability and maintainability of the code base. The automation capabilities of these agents allow developers to focus on more complex testing scenarios and edge cases, rather than being burdened by the creation of basic unit tests.

The Inevitable Decay of Test Code: Recognizing the Symptoms

Test code, while crucial for software verification, is equally prone to quality issues as production code. These issues, commonly referred to as ‘Test Smells’, are surface indications of deeper problems that may affect the maintainability, readability, or effectiveness of the test suite. Examples include duplicated test code, complex test logic, fragile tests dependent on implementation details, and tests that lack clear assertions. The presence of these smells doesn’t necessarily indicate failing tests, but rather suggests that the test suite may become difficult and costly to evolve alongside the system under test, potentially leading to reduced confidence in the software’s reliability over time. Identifying and addressing these smells is therefore a critical component of maintaining a healthy and sustainable testing practice.

AromaDr is a static analysis tool designed to identify ‘test smells’ within test codebases. These smells, such as AssertionCoupling, ConditionalTestLogic, and TestCodeDuplication, are indicative of potential maintainability or reliability issues. The tool operates by parsing test code and applying a predefined set of rules to detect these patterns. Upon detection, AromaDr provides developers with specific locations within the codebase where the smells occur, along with a description of the issue and, in some cases, suggested remediation strategies. This automated detection allows developers to proactively address potential problems before they manifest as bugs or hinder future development efforts, improving the overall quality and maintainability of the test suite.
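To make the rule-based idea concrete, the sketch below implements a single hypothetical check for conditional test logic (branching inside test functions) using Python's `ast` module. It is an illustrative stand-in, not AromaDr's actual detector.

```python
# Illustrative sketch of one rule-based test-smell check (not AromaDr's
# implementation): flag Conditional Test Logic, i.e. if/for/while statements
# inside functions whose names start with "test".
import ast

def conditional_test_logic(source: str):
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef) and node.name.startswith("test"):
            for child in ast.walk(node):
                if isinstance(child, (ast.If, ast.For, ast.While)):
                    findings.append((node.name, child.lineno))
    return findings

# The sample below is only parsed, never executed.
sample = """
def test_discount():
    for price in [10, 20]:
        if price > 15:
            assert discount(price) == 2
        else:
            assert discount(price) == 1
"""
print(conditional_test_logic(sample))  # [('test_discount', 3), ('test_discount', 4)]
```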

Quantifying the impact of identified test smells is crucial for effective prioritization of remediation efforts, and statistical methods provide the necessary rigor. The Mann-Whitney U test, a non-parametric test, determines if there is a statistically significant difference between two groups – in this case, tests exhibiting a specific smell versus those that do not – regarding a defined metric, such as test execution time or code complexity. Cliff’s Delta then calculates the effect size, providing a standardized measure of the magnitude of the difference between these groups. A larger Cliff’s Delta value indicates a more substantial and practically significant impact from the identified test smell, allowing developers to focus on addressing the smells with the greatest negative consequences on test suite quality and maintainability. These methods move beyond simple smell counts to provide data-driven insights for targeted improvement.
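A minimal sketch of this comparison is shown below, using SciPy's Mann-Whitney U implementation and deriving Cliff's Delta from the U statistic; the two smell-count samples are fabricated for illustration and are not data from the study.

```python
# Sketch: compare a metric (e.g. smell count per PR) between two groups with
# the Mann-Whitney U test, then derive Cliff's delta from the U statistic.
# The sample data below is fabricated for illustration only.
from scipy.stats import mannwhitneyu

hapr_smells = [0, 1, 1, 2, 0, 3, 1, 0, 2, 1]   # hypothetical human-agent PRs
hpr_smells  = [1, 0, 2, 1, 1, 0, 3, 2, 1, 0]   # hypothetical human-only PRs

u_stat, p_value = mannwhitneyu(hapr_smells, hpr_smells, alternative="two-sided")

# Cliff's delta follows directly from U: delta = 2U / (n*m) - 1, ranging from
# -1 to 1; values near 0 indicate a negligible effect size.
n, m = len(hapr_smells), len(hpr_smells)
cliffs_delta = 2 * u_stat / (n * m) - 1

print(f"U={u_stat}, p={p_value:.3f}, delta={cliffs_delta:.3f}")
```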

Analysis of 6,582 human-agent pull requests (HAPRs) and 3,122 human pull requests (HPRs) demonstrates that the integration of Artificial Intelligence (AI) into the development process does not significantly impact the prevalence of test smells within the codebase. This finding indicates that any increase in the volume of test code resulting from AI-assisted development does not correlate with a reduction in test code quality or an increase in maintainability issues, as measured by the presence of established test smells. The observed effect size was negligible, suggesting that the benefits of AI integration are not offset by a corresponding degradation in test suite quality.

Automated Analysis: A Necessary Expedient, Not a Panacea

Accurate identification of test files is a foundational step in effective code analysis, enabling developers to focus resources on relevant portions of a project. Heuristic-based test file identification employs naming conventions, file extensions (e.g., .test, _test.go), and directory structures to automatically categorize files as tests. Complementing this approach, tools like Linguist leverage language detection to further refine identification, even in projects lacking strict naming conventions. Automation via these methods is essential for large codebases where manual inspection is impractical, and ensures that analysis tools operate on the correct files to provide meaningful results regarding test quality and coverage.
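A rough sketch of such a heuristic classifier follows; the path patterns are common ecosystem conventions chosen for illustration, not the exact rules used by the study or by Linguist.

```python
# Sketch of heuristic test-file identification. The patterns below are common
# conventions across ecosystems and are illustrative, not the study's rules.
import re

TEST_PATH_PATTERNS = [
    r"(^|/)tests?/",           # tests/ or test/ directories
    r"(^|/)test_[^/]+\.py$",   # pytest-style test_*.py
    r"[^/]+_test\.(go|py)$",   # Go-style *_test.go (and *_test.py)
    r"[^/]+\.test\.(js|ts)$",  # Jest-style *.test.js / *.test.ts
    r"[^/]+Test\.java$",       # JUnit-style *Test.java
]

def looks_like_test_file(path: str) -> bool:
    return any(re.search(p, path) for p in TEST_PATH_PATTERNS)

for p in ["src/utils.py", "tests/test_utils.py", "pkg/parser_test.go",
          "web/app.test.ts", "src/main/java/FooTest.java"]:
    print(p, looks_like_test_file(p))
```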

Automated test file identification and analysis techniques facilitate efficient codebase analysis by enabling developers to prioritize areas requiring attention. Rather than manual inspection of large codebases, these methods allow for focused examination of sections with critical functionality or known quality concerns. This targeted approach reduces the time and resources needed for quality assurance, allowing developers to concentrate on areas presenting the highest risk or greatest potential for improvement. The ability to quickly pinpoint areas needing further testing, or where existing tests are inadequate, directly contributes to improved software reliability and faster development cycles.

Test coverage analysis quantifies the extent to which a software’s source code is executed by its test suite. Common metrics include statement coverage, branch coverage, and path coverage, each measuring a different aspect of code execution during testing. Low coverage percentages indicate areas of the code base that are not adequately tested, potentially harboring undetected faults. Identifying these gaps allows developers to prioritize the creation of new tests specifically designed to exercise the uncovered code, thereby increasing the fault detection capability of the test suite and improving overall software quality. Tools utilizing these metrics provide reports detailing uncovered lines or branches, enabling targeted test development efforts.
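The arithmetic behind statement coverage is straightforward, as the sketch below shows; the sets of executable and executed lines are made up, and in practice a tool such as coverage.py collects them by tracing test runs.

```python
# Sketch of the statement-coverage calculation: executed executable lines
# divided by all executable lines. The line sets here are invented for
# illustration; real tools gather them by tracing test execution.

def statement_coverage(executable_lines: set, executed_lines: set) -> float:
    covered = executable_lines & executed_lines
    return len(covered) / len(executable_lines)

executable = {1, 2, 3, 5, 6, 8, 9}   # lines that could run
executed = {1, 2, 3, 5, 8}           # lines the test suite actually ran

print(f"{statement_coverage(executable, executed):.0%} statement coverage")
print("uncovered lines:", sorted(executable - executed))  # [6, 9]
```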

Analysis of development workflows revealed a statistically significant difference in test creation timing between human-agent pull requests (HAPRs) and human-only pull requests (HPRs). Specifically, 68.4% of new tests were added concurrently with code changes during co-evolution in HAPRs, compared to 54.8% in HPRs. This 13.6 percentage point difference indicates that the integration of AI tools encourages developers to write tests as part of the initial development process, rather than as an afterthought or during dedicated testing phases, potentially improving overall software quality and reducing technical debt.
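One way to operationalize "tests added concurrently with code changes" is sketched below: a pull request counts as co-evolving when the same change set both adds test code and touches source files. This criterion and the example data are illustrative assumptions, not necessarily the paper's exact definition.

```python
# Sketch: classify a PR as "co-evolving" when the same change set both adds
# test code and modifies source code. The criterion and data are illustrative.

def is_co_evolving(changed_files) -> bool:
    """changed_files: iterable of dicts like
    {'path': ..., 'added': int, 'is_test': bool}."""
    adds_tests = any(f["is_test"] and f["added"] > 0 for f in changed_files)
    touches_source = any(not f["is_test"] for f in changed_files)
    return adds_tests and touches_source

pr = [
    {"path": "src/auth.py", "added": 35, "is_test": False},
    {"path": "tests/test_auth.py", "added": 18, "is_test": True},
]
print(is_co_evolving(pr))  # True
```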

The Inevitable Cycle: Managing Complexity, Not Eliminating It

The evolution of software development is increasingly marked by the synergistic relationship between human expertise and artificial intelligence. Recent advancements have seen the integration of AI coding agents capable of assisting with code generation, completion, and even bug detection, coupled with automated test quality analysis tools. This pairing represents a substantial leap forward from traditional methods, enabling developers to accelerate the creation of robust and reliable software. These AI agents don’t simply automate tasks; they proactively contribute to the development process, identifying potential issues before they manifest as costly errors, and bolstering the efficiency of testing procedures. Consequently, the industry is witnessing a shift towards more streamlined workflows, reduced development times, and ultimately, software that meets increasingly stringent quality standards.

Addressing “test smells”, patterns in test code that suggest deeper problems, offers a powerful strategy for controlling technical debt and fostering code longevity. These smells, such as brittle tests prone to failure with minor code changes, or tests that duplicate implementation logic, accumulate over time and hinder future development. Proactive identification and mitigation of these issues, through techniques like test refactoring and improved test design, reduces the cost of maintenance and enhances the code’s adaptability. By investing in test quality, developers create a more robust and understandable codebase, ultimately simplifying future modifications and reducing the risk of introducing new defects, a practice that directly translates to lower long-term costs and increased system reliability.

The convergence of automated testing and AI-driven development isn’t simply about streamlining processes; it’s fundamentally reshaping the software lifecycle to deliver tangible improvements in quality and dependability. By proactively addressing potential issues early in the development pipeline, teams experience accelerated cycles, reducing the time and resources needed to bring a product to market. This isn’t merely faster development, however, but better development; the resulting software exhibits fewer defects and greater stability, fostering increased user trust and minimizing costly post-release maintenance. Consequently, organizations benefit from enhanced operational efficiency and a stronger reputation for delivering reliable systems, ultimately boosting confidence in every deployment and paving the way for sustained innovation.

Recent investigations into software development workflows reveal a measurable increase in testing thoroughness when utilizing AI coding assistants. Analysis of pull requests demonstrates that contributions generated through human-agent collaboration – where developers work alongside AI tools – include, on average, 0.204 tests per file. This figure subtly, yet significantly, surpasses the 0.194 observed in pull requests crafted solely by human developers. While seemingly minor, this difference suggests that AI assistance encourages – or facilitates – the inclusion of more tests alongside code changes, potentially bolstering software quality and reducing the risk of regressions. The data supports the notion that integrating AI into the development process isn’t merely about accelerating coding, but about fostering a more comprehensive and reliable software lifecycle.

The study’s findings regarding agent-created tests (a preference for novelty over maintenance) feel predictably human. It echoes a common pattern: shipping new features always seems to outweigh addressing technical debt. Alan Turing observed, “We can only see a short distance ahead, but we can see plenty there that needs to be done.” This resonates with the empirical results; agents, like developers under pressure, add tests to satisfy immediate requirements, but the long-term health of the test suite (its maintainability and resistance to test smells) doesn’t necessarily improve. It’s a reminder that even sophisticated AI tools will inherit the pragmatic compromises inherent in software evolution.

What’s Next?

The observation that agents maintain a preference for adding tests, rather than tending to the existing flock, feels less like a bug and more like a fundamental truth about optimization. Everything optimized will one day be optimized back. The long-term cost of this novelty bias remains to be seen; architectures aren’t diagrams, they’re compromises that survived deployment. It is tempting to frame this as a ‘technical debt’ problem, but that suggests a fix exists, and history rarely supports such optimism. The real question isn’t whether agents can write tests, but whether they can participate in the slow, tedious work of archaeological code maintenance.

Future work should move beyond simply measuring test volume and quality. The subtle shifts in developer workflow (the types of bugs caught earlier, the cognitive load of reviewing agent-generated tests, the eventual decay of those tests) demand closer scrutiny. Empirical studies focusing on the evolution of these contributions will be critical. It is not enough to ask if an agent’s code passes tests; the field must determine if it survives the inevitable refactoring.

Ultimately, this research highlights a familiar pattern: automation excels at creation, but struggles with care. The challenge isn’t building smarter agents, it’s building systems that can gracefully accommodate their limitations. It’s not about replacing developers, it’s about finding a sustainable equilibrium where humans and agents can co-exist in a shared codebase. And, as always, it’s about resuscitating hope.


Original article: https://arxiv.org/pdf/2601.21194.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/


2026-01-30 23:23