Author: Denis Avetisyan
A new study examines how effectively AI coding agents improve software quality and maintainability in practice.

This research presents the first large-scale empirical analysis of refactoring tasks performed by AI agents, revealing strengths in localized improvements but limitations in complex architectural changes.
Despite the increasing prevalence of AI coding agents, a clear empirical understanding of their refactoring practices remains surprisingly absent. This research, ‘Agentic Refactoring: An Empirical Study of AI Coding Agents’, presents a large-scale analysis of 15,451 refactoring instances, revealing that these agents prioritize localized code improvements focused on maintainability and consistency over high-level architectural changes. While agentic refactoring demonstrably yields small but statistically significant gains in structural code quality, the question remains: can these agents evolve to address more complex design flaws and truly emulate the nuanced refactoring skills of human developers?
The Escalating Challenge of Code Quality
Maintaining software quality is increasingly challenging in modern development. Growing codebases and accelerated release cycles put teams under pressure, often producing technical debt that erodes long-term maintainability and introduces defects. Traditional remedies such as manual review and ad-hoc refactoring are frequently insufficient: reactive, expensive, and time-consuming. A proactive shift toward automated solutions is essential. Tools that can analyze code health, identify issues, and suggest improvements are critical, and they must understand code semantics and context to enable continuous improvement and sustainable development.

Systems break along invisible boundaries – if you can’t see them, pain is coming.
Agentic Refactoring: Evolving Code with AI
Agentic Refactoring represents a paradigm shift, employing AI agents to automate code improvement beyond traditional static analysis. These agents leverage Large Language Models (LLMs) for a deeper understanding of code semantics, proposing and implementing meaningful refactorings that address maintainability and performance issues. Recent analysis shows that 26.1% of commits generated by agentic systems explicitly target refactoring, signifying a substantial opportunity to reduce technical debt and sustain development velocity.
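As a sense of how such commit-level analysis can be reproduced, the sketch below uses RefactoringMiner's documented Java API to walk a repository's history and flag commits containing detected refactorings. The repository URL, local path, and branch name are placeholders, not values from the study.

```java
import org.eclipse.jgit.lib.Repository;
import org.refactoringminer.api.GitHistoryRefactoringMiner;
import org.refactoringminer.api.GitService;
import org.refactoringminer.api.Refactoring;
import org.refactoringminer.api.RefactoringHandler;
import org.refactoringminer.rm1.GitHistoryRefactoringMinerImpl;
import org.refactoringminer.util.GitServiceImpl;

import java.util.List;

public class MineRefactorings {
    public static void main(String[] args) throws Exception {
        GitService gitService = new GitServiceImpl();
        // Placeholder path and URL: point these at any agent-generated project.
        Repository repo = gitService.cloneIfNotExists(
                "tmp/sample-repo", "https://github.com/example/sample-repo.git");

        GitHistoryRefactoringMiner miner = new GitHistoryRefactoringMinerImpl();
        miner.detectAll(repo, "main", new RefactoringHandler() {
            @Override
            public void handle(String commitId, List<Refactoring> refactorings) {
                // A commit counts toward the refactoring share if the miner
                // detects at least one refactoring instance in it.
                if (!refactorings.isEmpty()) {
                    System.out.println(commitId + ": "
                            + refactorings.size() + " refactoring(s)");
                    refactorings.forEach(r -> System.out.println("  " + r));
                }
            }
        });
    }
}
```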

Tools and Techniques for Automated Code Improvement
Effective refactoring requires tools for identifying areas needing improvement. DesigniteJava detects design smells such as long methods, large classes, and duplicated code, while RefactoringMiner identifies the refactoring operations applied across a commit history, giving developers actionable insights. Agentic Refactoring automates this process, spanning both low-level refactorings (renaming variables, extracting constants) and high-level refactorings (moving methods, extracting classes). It also surfaces implicit refactoring opportunities – improvements that arise as a byproduct of other modifications.
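To make the tiers concrete, here is a hypothetical before/after showing two low-level refactorings: a magic number extracted into a named constant and an opaque variable renamed. The class and identifier names are illustrative only.

```java
// Before: a magic number and an opaque variable name.
class InvoiceBefore {
    double total(double amount) {
        double t = amount * 0.19; // what is 0.19?
        return amount + t;
    }
}

// After: extract constant + rename variable. Behavior is unchanged;
// the intent is now explicit, which is what most agent commits target.
class InvoiceAfter {
    private static final double VAT_RATE = 0.19;

    double total(double amount) {
        double vat = amount * VAT_RATE;
        return amount + vat;
    }
}
```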
Validation, Future Directions, and a Shifting Paradigm
The AIDev dataset provides a dedicated resource for researchers investigating agentic software development methodologies, facilitating empirical study and standardized evaluation. Agentic coding, of which agentic refactoring is one part, supports complete automated development workflows. Analysis reveals a primary focus on maintainability (52.5% of cases), followed by readability (28.1%). Empirically, refactored code shows a median change of −15.25 lines of code (LOC) and −2.07 in weighted methods per class (WMC), suggesting a positive impact on structural quality and hinting at a paradigm shift: one where continuous, AI-driven improvement supplants manual review, and where altering one part of a codebase inevitably ripples through the whole architecture.
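For readers unfamiliar with the metric, WMC is conventionally the sum of the cyclomatic complexities of a class's methods, so a drop of about two per class is a tangible simplification. Below is a minimal sketch that approximates WMC with JavaParser; the branch-node list is a deliberate simplification, and DesigniteJava's exact counting rules may differ.

```java
import com.github.javaparser.StaticJavaParser;
import com.github.javaparser.ast.CompilationUnit;
import com.github.javaparser.ast.body.ClassOrInterfaceDeclaration;
import com.github.javaparser.ast.body.MethodDeclaration;
import com.github.javaparser.ast.expr.ConditionalExpr;
import com.github.javaparser.ast.stmt.*;

public class WmcSketch {
    // Cyclomatic complexity approximated as 1 + number of branching nodes.
    static int complexity(MethodDeclaration m) {
        int branches = m.findAll(IfStmt.class).size()
                + m.findAll(ForStmt.class).size()
                + m.findAll(ForEachStmt.class).size()
                + m.findAll(WhileStmt.class).size()
                + m.findAll(DoStmt.class).size()
                + m.findAll(SwitchEntry.class).size()
                + m.findAll(CatchClause.class).size()
                + m.findAll(ConditionalExpr.class).size();
        return 1 + branches;
    }

    public static void main(String[] args) {
        CompilationUnit cu = StaticJavaParser.parse(
                "class C { int f(int x){ if(x>0) return x; return -x; } void g(){} }");
        cu.findAll(ClassOrInterfaceDeclaration.class).forEach(cls -> {
            int wmc = cls.findAll(MethodDeclaration.class).stream()
                    .mapToInt(WmcSketch::complexity).sum();
            System.out.println(cls.getNameAsString() + " WMC = " + wmc); // C WMC = 3
        });
    }
}
```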
The study illuminates a critical point about agentic refactoring: while AI agents are proficient at fixing superficial code issues, they struggle with holistic architectural improvements. The limitation echoes the classical observation, usually attributed to Aristotle, that “the whole is more than the sum of its parts.” Agentic systems, focused on localized cleanup, often fail to recognize the interconnectedness of a codebase. The research suggests that true maintainability requires an understanding of a system’s overall structure, a capacity currently beyond the reach of these agents. Systems break along invisible boundaries, and this study highlights the difficulty of navigating those boundaries without a broader, systemic perspective.
What’s Next?
The observed proficiency of agentic refactoring at paying down superficial code debt is… predictable. It confirms that these systems can optimize within existing constraints, much as a skilled mechanic can tune an engine without redesigning it. The limitations in tackling deeper architectural flaws, however, expose a more fundamental challenge: true refactoring is not merely cleaning code but re-evaluating first principles. The study shows that current agents excel at local optimization yet lack the capacity for global restructuring; they polish the symptoms rather than cure the disease.
Future work must move beyond evaluating agents on metrics of immediate maintainability and focus on their ability to identify and address systemic design issues. This necessitates developing evaluation frameworks that reward not just code cleanliness, but also modularity, separation of concerns, and adherence to established architectural patterns. Moreover, a critical area of investigation lies in equipping agents with the capacity for ‘intentional’ design – the ability to articulate the rationale behind architectural choices, and to weigh trade-offs between competing design principles.
The path forward isn’t simply about building ‘smarter’ agents, but about fostering a deeper understanding of the relationship between code structure and system behavior. The current findings suggest that the true measure of an agent’s intelligence lies not in its ability to rewrite code, but in its capacity to recognize when rewriting isn’t enough. Good architecture is invisible until it breaks, and only then is the true cost of decisions visible.
Original article: https://arxiv.org/pdf/2511.04824.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/