AI Code Changes: A Performance Puzzle

Author: Denis Avetisyan


A new analysis of AI-generated code contributions reveals a surprising trend in how these systems tackle performance improvements.

Across all pull requests, the proportion authored by different artificial intelligence agents varies significantly by category, demonstrating a nuanced distribution of contributions based on specialization.

This study uses BERTopic to analyze performance-related pull requests created by AI agents and finds higher rejection rates and a focus on feature-level optimizations over comprehensive software development lifecycle activities.

Despite the increasing prevalence of AI-driven software engineering, a clear understanding of how agentic AI systems proactively address performance considerations remains elusive. This paper, ‘How Do Agentic AI Systems Address Performance Optimizations? A BERTopic-Based Analysis of Pull Requests’, presents an empirical analysis of performance-related pull requests generated by these agents, revealing a surprisingly broad range of optimizations focused primarily on development-phase implementations. However, our findings indicate that performance-focused changes are subject to higher rejection rates and longer review times compared to standard pull requests. Will these observed patterns necessitate novel evaluation metrics and refinement strategies to fully leverage the potential of AI-driven performance optimization in software development?


Unmasking Bottlenecks: The Constraints of Modern Software Delivery

Modern software development workflows frequently center around manual code review processes, typically facilitated through Pull Requests. While intended to ensure quality and collaboration, this approach often introduces significant delays as developers await feedback and address identified issues. The inherent subjectivity in code reviews, coupled with the time required for thorough examination, can create bottlenecks, particularly as team sizes grow and the complexity of projects increases. Inconsistencies can also arise from differing interpretations of coding standards or varying levels of expertise among reviewers, leading to rework and potential bugs. This reliance on human inspection, while valuable, represents a scalability challenge for organizations striving for rapid and reliable software delivery, highlighting the need for more automated and efficient quality assurance mechanisms.

As software development teams grow, the complexities of coordinating code changes and ensuring consistent performance dramatically increase. The proliferation of individual contributions, while potentially accelerating feature development, often leads to integration challenges and a surge in the number of code reviews required. This scaling effect frequently creates a bottleneck where the rate of code merging and testing cannot keep pace with the rate of code creation. Consequently, identifying and resolving performance regressions becomes significantly more difficult, delaying releases and potentially impacting the quality of the delivered software. The increased cognitive load on reviewers, combined with the sheer volume of changes, can lead to overlooked issues and a decline in overall system performance, hindering the team’s ability to deliver high-quality, performant software efficiently.

The escalating demands for faster software releases and improved user experiences are driving a critical need for automated performance bottleneck identification and resolution. Manual performance analysis, while thorough, simply cannot keep pace with the velocity of modern development cycles and the complexity of contemporary applications. Automated tools now scan codebases, pinpoint inefficient algorithms, and even suggest optimized alternatives, dramatically reducing the time required to address performance issues. Organizations that proactively integrate such automation into their development pipelines gain a significant competitive advantage, delivering superior software more quickly and reliably, while simultaneously freeing up valuable developer time for innovation rather than tedious debugging.

Intelligent Automation: LLMs as Performance Catalysts

LLM-based software engineering facilitates the automation of performance-related tasks by analyzing code repositories for potential regressions and inefficiencies. These models can identify performance bottlenecks through static analysis, profiling data examination, and comparison of code changes against established performance baselines. Optimization suggestions generated by the LLM can range from algorithmic improvements and data structure modifications to code-level adjustments like loop unrolling or caching strategies. The automation of these processes reduces the manual effort required for performance monitoring and tuning, allowing engineering teams to proactively address performance issues and accelerate development cycles. Furthermore, LLMs can prioritize optimization opportunities based on projected impact and estimated implementation complexity.

Agentic AI systems utilize Large Language Models (LLMs) to automate software modification processes. These systems are capable of independently implementing feature enhancements and resolving identified bugs within a codebase. Functionality includes analyzing existing code, formulating appropriate changes, and generating the necessary code modifications. Crucially, these systems integrate with version control systems like Git, enabling autonomous submission of changes as Pull Requests. This automation reduces developer workload and accelerates the software development lifecycle by streamlining the implementation and testing phases of iterative improvements and bug fixes.
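
To make this concrete, the snippet below sketches the tail end of such a workflow in Python: committing an agent-produced change and opening a pull request through the GitHub CLI. It assumes `gh` is installed and authenticated; the branch name, title, and body are placeholder values rather than anything taken from the study.

```python
import subprocess

def open_agent_pr(branch: str, title: str, body: str, base: str = "main") -> None:
    """Commit the agent's staged changes on `branch` and open a pull request."""
    # Commit whatever modifications the agent has already made on the feature branch.
    subprocess.run(["git", "checkout", branch], check=True)
    subprocess.run(["git", "commit", "-am", title], check=True)
    subprocess.run(["git", "push", "--set-upstream", "origin", branch], check=True)
    # Open the pull request via the GitHub CLI (`gh` must be authenticated).
    subprocess.run(
        ["gh", "pr", "create", "--title", title, "--body", body,
         "--base", base, "--head", branch],
        check=True,
    )

# Example call with placeholder values:
# open_agent_pr("agent/cache-lookup-fix",
#               "Cache repeated lookups in hot path",
#               "Automated change proposed by a performance agent.")
```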

Zero-shot learning, as demonstrated by models such as GPT-OSS-20B, allows for the classification of Pull Requests (PRs) related to software performance without requiring pre-labeled training datasets. This functionality is achieved through the model’s inherent understanding of language and its ability to generalize from broad knowledge; the model can assess the content of a PR – including commit messages, code diffs, and associated documentation – and determine its relevance to performance optimization or degradation. Consequently, implementation time for automated performance analysis workflows is reduced, as the typical data labeling and model training phases are bypassed, enabling immediate deployment of performance-focused automation.
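
As an illustration of the idea, the sketch below classifies a pull request with a single prompt and no labeled training data. It assumes GPT-OSS-20B is served behind an OpenAI-compatible endpoint; the base URL, model name, and prompt wording are placeholders, not the prompt used in the paper.

```python
from openai import OpenAI

# Assumes GPT-OSS-20B is served behind an OpenAI-compatible endpoint
# (e.g. a local inference server); URL, key, and model name are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

PROMPT = (
    "You are classifying GitHub pull requests.\n"
    "Answer with a single word, YES or NO: is this pull request primarily "
    "about software performance (speed, memory, latency, throughput)?\n\n"
    "Title: {title}\n\nDescription:\n{body}\n"
)

def is_performance_pr(title: str, body: str) -> bool:
    """Zero-shot classification: no labeled examples, just a prompt."""
    response = client.chat.completions.create(
        model="gpt-oss-20b",
        messages=[{"role": "user", "content": PROMPT.format(title=title, body=body)}],
        temperature=0,
    )
    answer = response.choices[0].message.content.strip().upper()
    return answer.startswith("YES")
```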

Revealing Patterns: Topic Modeling of Performance-Related Changes

Topic modeling, specifically employing the BERTopic framework, was applied to the textual content of performance-related Pull Requests (PRs) to discover prevalent themes and patterns within code changes. This approach treats each PR description as a document and utilizes algorithms to identify recurring topics. By analyzing the collective content of these PRs, we can move beyond individual change assessments and understand broader trends in performance optimization efforts. The identified topics represent clusters of PRs addressing similar areas of the codebase or employing comparable techniques, providing a high-level overview of performance-related activity.

The BERTopic pipeline employs a three-stage process for analyzing pull request descriptions. Initially, Qwen3-Embedding-8B generates vector representations of each PR description, capturing semantic meaning. These high-dimensional vectors are then subjected to dimensionality reduction using Uniform Manifold Approximation and Projection (UMAP), specifically configured with parameters to reduce the vector space while preserving key relationships. Finally, Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) is applied to these reduced vectors, grouping similar PR descriptions into coherent topics based on density and clusterability. This combination allows for the identification of prevalent themes within the performance-related changes.
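
A minimal sketch of this three-stage pipeline is shown below. It assumes Qwen3-Embedding-8B can be loaded through sentence-transformers; the UMAP settings match those reported in the next paragraph, while the HDBSCAN minimum cluster size and the document list are placeholders.

```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

# PR descriptions collected beforehand; contents here are placeholders.
docs = ["Reduce allocation in hot loop ...", "Cache config lookups ...", "..."]

# Stage 1: dense embeddings of each PR description.
embedder = SentenceTransformer("Qwen/Qwen3-Embedding-8B")
embeddings = embedder.encode(docs, show_progress_bar=True)

# Stage 2: dimensionality reduction (settings as reported in the next paragraph).
umap_model = UMAP(n_components=20, n_neighbors=3, min_dist=0.0, metric="cosine")

# Stage 3: density-based clustering; min_cluster_size is a placeholder choice.
hdbscan_model = HDBSCAN(min_cluster_size=15, metric="euclidean", prediction_data=True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
topics, probs = topic_model.fit_transform(docs, embeddings)

print(topic_model.get_topic_info().head())
```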

Performance-related Pull Requests can be categorized by their primary focus, such as modifications to specific algorithms or data structures, through topic modeling. Implementation using the BERTopic pipeline, with UMAP parameters set to 20 components and 3 neighbors, yielded a Topic Coherence Score of 0.47 and a Silhouette Score of 0.57. These scores indicate a moderate level of topic distinctiveness and cluster cohesion, respectively, demonstrating the effectiveness of the approach in grouping related changes. The resulting categorization facilitates a more granular understanding of performance-related modifications within the codebase.
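
For reference, the snippet below shows one way such a silhouette score could be computed, reusing the embeddings and topic labels from the sketch above. Evaluating the silhouette on the UMAP-reduced vectors with HDBSCAN's labels, excluding noise points, is a common convention rather than a detail confirmed by the paper.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# `embeddings`, `umap_model`, and `topics` refer to the earlier pipeline sketch.
reduced = umap_model.fit_transform(embeddings)
labels = np.array(topics)

# Exclude outliers (HDBSCAN labels them -1) before scoring cluster cohesion.
mask = labels != -1
score = silhouette_score(reduced[mask], labels[mask])
print(f"Silhouette score: {score:.2f}")
```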

Performance-related pull requests are distributed across ten distinct categories, indicating the breadth of optimization efforts.

Validating Automation: Measuring Consistency and Efficiency

The consistency of automated topic assignments is paramount for reliable analysis, and this consistency was evaluated using Cohen’s Kappa and Gwet’s AC1. These statistical measures assessed the degree of agreement between the BERTopic model’s categorization and human evaluation, effectively quantifying the robustness of the automated process. Results indicated a high Cohen’s Kappa score of 0.92, coupled with a Gwet’s AC1 of 0.97, both demonstrating strong inter-rater reliability: the automated categorization consistently aligns with human judgement and offers a dependable foundation for subsequent insights.
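
Both statistics are straightforward to compute for two raters labelling the same set of PRs, as sketched below: Cohen’s kappa comes directly from scikit-learn, while Gwet’s AC1 is implemented here from its standard definition. The label vectors are placeholders, not the study’s data.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def gwet_ac1(r1, r2):
    """Gwet's AC1 agreement coefficient for two raters over categorical labels."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    categories = np.unique(np.concatenate([r1, r2]))
    q = len(categories)
    # Observed agreement: fraction of items with identical labels.
    p_o = np.mean(r1 == r2)
    # Chance agreement per Gwet: based on average marginal proportion per category.
    pi = np.array([((r1 == c).mean() + (r2 == c).mean()) / 2 for c in categories])
    p_e = np.sum(pi * (1 - pi)) / (q - 1)
    return (p_o - p_e) / (1 - p_e)

# Placeholder labels: automated topic assignments vs. a human rater.
auto_labels  = ["caching", "io", "caching", "algorithm", "io"]
human_labels = ["caching", "io", "caching", "algorithm", "algorithm"]

print("Cohen's kappa:", cohen_kappa_score(auto_labels, human_labels))
print("Gwet's AC1:   ", gwet_ac1(auto_labels, human_labels))
```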

Analyzing an AI-driven workflow extends beyond simply noting task completion; a comprehensive evaluation necessitates tracking metrics that reveal how efficiently the system operates. Combining automated topic categorization – which organizes outputs for clearer analysis – with quantifiable measures like Acceptance Rate and Merge Time provides a robust picture of workflow performance. Acceptance Rate indicates the proportion of AI-generated contributions successfully integrated, while Merge Time reflects the speed with which those contributions are adopted. These metrics, when considered together, offer valuable insights into bottlenecks, areas for improvement, and the overall effectiveness of the AI agent in streamlining complex processes, ultimately demonstrating the practical value of automated assistance.
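
The sketch below shows how these two metrics can be derived from basic PR metadata; the table layout and column names are assumptions for illustration, not the dataset used in the study.

```python
import pandas as pd

# Hypothetical PR records; column names are assumptions, not the paper's schema.
prs = pd.DataFrame({
    "id":        [101, 102, 103, 104],
    "status":    ["merged", "rejected", "merged", "merged"],
    "opened_at": pd.to_datetime(["2025-01-01 08:00", "2025-01-01 09:30",
                                 "2025-01-02 10:00", "2025-01-03 11:15"]),
    "closed_at": pd.to_datetime(["2025-01-01 12:00", "2025-01-02 09:30",
                                 "2025-01-05 10:00", "2025-01-03 13:15"]),
})

# Acceptance rate: share of PRs that were merged.
acceptance_rate = (prs["status"] == "merged").mean()

# Merge time: elapsed hours between opening and closing, merged PRs only.
merged = prs[prs["status"] == "merged"]
merge_time_hours = (merged["closed_at"] - merged["opened_at"]).dt.total_seconds() / 3600

print(f"Acceptance rate: {acceptance_rate:.1%}")
print(f"Median merge time: {merge_time_hours.median():.1f} h")
```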

An analysis of 1,221 pull requests generated by AI agents focused on performance-related code changes revealed a notable discrepancy in acceptance rates. The study found that 36.5% of these AI-generated performance pull requests were rejected, a figure significantly higher than the 22.7% rejection rate observed for pull requests not related to performance optimizations. This suggests that while agentic AI demonstrates promise in automating code contributions, it currently encounters challenges in consistently producing high-quality, accepted performance improvements, highlighting a key area for future development and refinement of these AI-driven workflows.

Categories exhibit varying merge times, ranging from several minutes to over 10^3 hours, as shown on a logarithmic scale.

Toward Self-Optimizing Systems: The Future of Automated Performance

The convergence of automated topic modeling and Agentic AI represents a significant leap toward self-optimizing software systems. This integration allows for continuous monitoring of code changes and the identification of recurring performance-related themes within pull requests. Rather than simply flagging potential issues, the system proactively proposes and implements solutions – automatically refactoring code, adjusting configurations, or even suggesting architectural changes. By learning from each iteration and adapting to the specific codebase, the agent can anticipate bottlenecks before they impact users, effectively shifting performance optimization from a reactive process to a preventative one. This autonomous cycle promises to dramatically accelerate development timelines and deliver more robust, efficient software with minimal human intervention.

The system’s capacity for continuous learning represents a significant advancement in automated performance optimization. By perpetually analyzing performance-related pull requests, the underlying algorithms are exposed to a dynamic stream of code changes and their associated impacts. This iterative process allows the system to refine its understanding of performance bottlenecks and effective solutions, progressively reducing false positives and improving the precision of its recommendations. Each analyzed pull request serves as a training example, strengthening the system’s ability to accurately identify problematic code patterns and suggest targeted improvements. Consequently, the system doesn’t remain static; it evolves alongside the codebase, ensuring sustained accuracy and enhanced efficiency over time, ultimately automating a process traditionally reliant on extensive manual review and expert insight.

A recent evaluation of the automated performance optimization pipeline revealed a 7.5% false positive rate in the LLM-based classification of performance-related pull requests, assessed through manual inspection of 200 submissions. While indicating a need for continued refinement of the underlying AI models, this initial accuracy demonstrates the potential for a paradigm shift in software development practices. This technology promises to empower development teams to proactively address performance bottlenecks, ultimately leading to the delivery of higher-quality software with increased velocity, even when tackling intricate procedures such as large-scale refactoring. The ability to automate the identification and resolution of performance issues could significantly reduce technical debt and improve the overall efficiency of the software lifecycle.

The study reveals a fascinating dynamic within AI-driven development; agentic systems demonstrate an aptitude for addressing a spectrum of performance concerns, yet these contributions frequently encounter rejection during pull request reviews. This echoes John von Neumann’s observation: “There is no telling what ultimate shape this will take.” The agents’ focus on feature-level optimizations, while valuable, often overlooks broader software development lifecycle (SDLC) activities – a systemic issue. The analysis underscores that simply improving isolated components doesn’t guarantee overall system enhancement; a holistic understanding of the architecture and its interconnectedness is paramount, mirroring the principle that modifying one part of a system triggers a cascade of consequences.

Beyond Patchwork: Charting a Course for Agentic Optimization

The findings suggest a curious asymmetry. These agentic systems, capable of generating code to address performance concerns, frequently encounter resistance during integration. If the system survives on patchwork – a constant stream of rejected pull requests – it’s likely overengineered, addressing symptoms without diagnosing the underlying disease. The focus on feature implementation, rather than holistic Software Development Life Cycle (SDLC) activities, implies a fragmented understanding of performance: a localized fix to a systemic issue.

The illusion of control stems from modularity without context. Simply generating smaller, ‘optimized’ code blocks does not guarantee improved system behavior. Future work must shift toward agents capable of reasoning about architectural trade-offs and long-term maintainability, not just immediate gains. The crucial question is not whether an agent can optimize, but whether it understands what to optimize for, and at what cost.

Ultimately, the field requires a move beyond treating performance as a technical problem and toward recognizing it as an emergent property of system structure. A truly intelligent agent will not simply rewrite code; it will refactor understanding, tracing the causal links between design choices and runtime behavior: a subtle yet critical distinction.


Original article: https://arxiv.org/pdf/2512.24630.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
