Author: Denis Avetisyan
A new wave of intelligent systems is emerging, promising to automate the tedious and critical task of software issue resolution.

This survey examines the latest advancements in large language model-based agents for automated software debugging, repair, and maintenance, outlining key challenges and future research directions.
While fully autonomous software maintenance remains a significant challenge, recent advances in artificial intelligence offer promising solutions. This paper, ‘Agentic Software Issue Resolution with Large Language Models: A Survey’, systematically examines the burgeoning field of LLM-based agentic systems designed for automated issue resolution in software engineering. Our analysis of 126 studies reveals a clear trend toward leveraging agentic capabilities, namely reasoning, planning, and iterative feedback, to address complex software defects and optimization tasks. As these systems mature, can we expect a paradigm shift towards truly self-healing and continuously improving software ecosystems?
The Inevitable Decay of Software: A New Frontier
Software development historically necessitates significant human effort in identifying and rectifying errors, a process frequently characterized by substantial financial and temporal costs. The traditional workflow often involves developers meticulously reviewing code, reproducing bugs, and crafting fixes – a cycle that can consume a disproportionate amount of project resources. This manual approach isn’t merely labor-intensive; it’s also prone to inconsistencies, as different developers may approach the same problem with varying strategies and levels of expertise. Consequently, the expense associated with issue resolution frequently eclipses the cost of initial development, creating a bottleneck in software delivery and hindering innovation. The sheer volume of potential errors in modern, complex software systems further exacerbates this challenge, demanding more efficient and scalable solutions.
The escalating complexities of modern software development are increasingly addressed by LLM-based agentic systems, a rapidly evolving field explored in a comprehensive survey of 126 research papers through late 2025. These systems leverage large language models to autonomously diagnose and resolve software issues, promising a significant reduction in the traditionally manual – and costly – debugging process. However, despite substantial progress, consistent and reliable performance remains a key hurdle; the survey highlights that achieving robust automation capable of handling the diverse range of software faults requires further innovation in areas like error generalization, contextual understanding, and validation techniques to ensure solutions don’t inadvertently introduce new problems.

The Foundations of Automated Repair: Beyond Superficial Correction
Supervised Fine-Tuning (SFT) establishes a foundational level of competence in Large Language Models (LLMs) for targeted issue resolution. This process involves training a pre-trained LLM on a labeled dataset consisting of input prompts and corresponding desired outputs. The model adjusts its internal parameters to minimize the difference between its generated responses and the provided labels, effectively learning to map specific inputs to correct outputs. Datasets for SFT typically include examples of user queries and expertly crafted solutions, enabling the LLM to initially mimic desired behavior. The quality and size of the labeled dataset directly influence the performance of the SFT model; larger, more diverse datasets generally yield more robust and accurate results. While essential for establishing a baseline, SFT alone often proves insufficient for complex reasoning tasks and requires further refinement through techniques like Reinforcement Learning.
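As a concrete illustration, a minimal fine-tuning loop might look like the sketch below, which pairs issue descriptions with expert-written fixes and minimizes the standard causal language-modeling loss. The checkpoint name, the toy dataset, and the decision to leave prompt and padding tokens unmasked in the loss are simplifying assumptions for illustration, not details drawn from the surveyed systems.

```python
# Minimal supervised fine-tuning sketch: the model learns to map an issue
# description (prompt) to its human-written fix (target) by minimizing
# cross-entropy over the concatenated sequence.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; any causal LM checkpoint works the same way

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Labeled examples: (issue report, expert-written patch). Purely illustrative.
pairs = [
    ("Issue: off-by-one error in pagination.\nFix:", "end_index = start + page_size - 1"),
    ("Issue: crash when config file is missing.\nFix:", "if not os.path.exists(path): return DEFAULTS"),
]

def collate(batch):
    # Concatenate prompt and target; a real setup would mask the prompt
    # and padding tokens out of the loss.
    texts = [prompt + " " + target for prompt, target in batch]
    return tokenizer(texts, return_tensors="pt", padding=True, truncation=True)

loader = DataLoader(pairs, batch_size=2, collate_fn=collate)

model.train()
for epoch in range(3):
    for batch in loader:
        # Causal-LM objective: predict every next token, including the patch tokens.
        outputs = model(**batch, labels=batch["input_ids"])
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```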
While supervised fine-tuning establishes a foundational capability for LLM agents, exclusive reliance on labeled datasets introduces limitations in real-world applicability. Labeled data, by its nature, represents a finite set of known scenarios, hindering an agent’s performance when encountering novel or previously unseen situations. This constraint arises because the agent learns to map inputs to outputs based solely on the provided examples; extrapolation to unfamiliar contexts becomes problematic, resulting in decreased accuracy and reliability. Consequently, agents trained exclusively on labeled data may struggle with complex reasoning, ambiguous prompts, or tasks requiring adaptability beyond the scope of the training set.
Reinforcement Learning (RL) addresses the limitations of supervised learning in LLM agents by enabling learning through trial and error, optimizing for specific outcomes rather than relying solely on labeled examples. This process involves defining a reward function that incentivizes desired behaviors, such as successful issue resolution or efficient reasoning, and allowing the agent to explore different action sequences to maximize cumulative reward. Domain-specific fine-tuning of open-source LLMs using RL techniques has demonstrated performance improvements of up to 5% compared to solely supervised fine-tuning, indicating the potential of RL to enhance agent capabilities in complex and nuanced scenarios.
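The sketch below illustrates the general idea with a bare-bones REINFORCE-style update: sample a candidate patch, score it with a test-based reward, and nudge the policy toward rewarded generations. Production systems typically rely on more sophisticated algorithms such as PPO or GRPO; the run_tests stub, the checkpoint name, and the single-sample update here are illustrative assumptions only.

```python
# REINFORCE-style sketch: reward a generated patch according to a (stubbed)
# test run, then increase the log-probability of rewarded generations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def run_tests(patch_text: str) -> float:
    """Hypothetical test harness: 1.0 if the patched project passes, else 0.0."""
    return 1.0 if "return" in patch_text else 0.0  # stand-in for a real test run

prompt = "Issue: division by zero in average().\nFix:"
inputs = tokenizer(prompt, return_tensors="pt")

for step in range(10):
    # Sample a candidate patch from the current policy.
    sample = model.generate(**inputs, do_sample=True, max_new_tokens=32,
                            pad_token_id=tokenizer.eos_token_id)
    patch = tokenizer.decode(sample[0][inputs["input_ids"].shape[1]:])
    reward = run_tests(patch)

    # Recompute the sequence under the current policy to get gradients;
    # loss is mean NLL per token, so scale back to an approximate total log-prob.
    out = model(sample, labels=sample)
    log_prob = -out.loss * sample.shape[1]

    # REINFORCE update: raise the probability of rewarded samples.
    policy_loss = -(reward * log_prob)
    policy_loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```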
The Subtle Art of Agentic Refinement: Process and Representation
Effective reinforcement learning for code repair necessitates a nuanced reward system beyond simply rewarding correct outputs. A balanced approach, combining outcome-based rewards with process-oriented rewards, is crucial for training successful agents. An outcome-based reward assigns positive reinforcement solely for generating functionally correct code, while a process-oriented reward incentivizes desirable behaviors during the code repair process, such as minimizing the number of edits, reducing code complexity, or adhering to specific coding standards. Relying solely on outcome-based rewards can lead to agents discovering functionally correct but inefficient or poorly structured solutions. Incorporating process-oriented rewards guides the agent towards not only solving the problem but also developing a robust and maintainable solution, improving generalization and reasoning capabilities.
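A blended reward of this kind can be expressed compactly; the sketch below assumes a test-based outcome signal and a simple process signal built from patch size and added branching complexity. The weighting and the specific heuristics are illustrative choices, not values reported in the surveyed work.

```python
# Sketch of a blended reward: the outcome term checks functional correctness,
# the process term penalizes oversized or convoluted edits. Weights and the
# complexity heuristic are illustrative assumptions.

def outcome_reward(tests_passed: bool) -> float:
    return 1.0 if tests_passed else 0.0

def process_reward(num_edited_lines: int, cyclomatic_delta: int) -> float:
    # Prefer small, simple patches: penalize large diffs and added branching.
    size_penalty = min(num_edited_lines / 50.0, 1.0)
    complexity_penalty = min(max(cyclomatic_delta, 0) / 10.0, 1.0)
    return 1.0 - 0.5 * (size_penalty + complexity_penalty)

def combined_reward(tests_passed: bool, num_edited_lines: int,
                    cyclomatic_delta: int, alpha: float = 0.7) -> float:
    """alpha trades off outcome correctness against process quality."""
    return (alpha * outcome_reward(tests_passed)
            + (1 - alpha) * process_reward(num_edited_lines, cyclomatic_delta))

# A correct but sprawling patch scores lower than a correct, minimal one.
print(combined_reward(True, 120, 8))   # correct, but large and complex
print(combined_reward(True, 4, 0))     # correct and minimal
```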
Code representation defines how source code is formatted and provided as input to the language model (LLM). Variations in this representation, including the presence or absence of comments, whitespace, variable naming conventions, and code structure, directly affect the LLM’s parsing accuracy and ability to identify semantic meaning. Specifically, LLMs demonstrate improved performance when code is presented with consistent formatting, clear variable names, and relevant comments, as these features facilitate pattern recognition and reduce ambiguity during the code analysis process. Conversely, obfuscated, minimally formatted, or inconsistently styled code can significantly hinder the LLM’s comprehension and ability to accurately manipulate the code, impacting the agent’s overall problem-solving efficacy.
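To make the point concrete, the sketch below contrasts two representations of the same function and assembles one into a repair prompt. The prompt layout and helper function are hypothetical; real agent frameworks typically add file paths, stack traces, and retrieved repository context.

```python
# Two representations of the same function. The formatted version, with a
# docstring and descriptive names, gives the model far more to anchor on
# than the minified variant when building a repair prompt.

FORMATTED = '''
def moving_average(values: list[float], window: int) -> list[float]:
    """Return the sliding-window mean over `values`."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]
'''

MINIFIED = "def f(v,w):return [sum(v[i:i+w])/w for i in range(len(v)-w+1)]"

def build_repair_prompt(issue: str, code: str) -> str:
    # Illustrative prompt layout only.
    return (
        f"Issue report:\n{issue}\n\n"
        f"Relevant code:\n{code}\n\n"
        "Proposed fix:"
    )

issue = "moving_average crashes when window is larger than the input list."
print(build_repair_prompt(issue, FORMATTED))
```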
The integration of carefully tuned reward signals with effective code representation facilitates a learning paradigm where agents develop procedural knowledge alongside solution identification. Specifically, balancing rewards for achieving correct outcomes with rewards for demonstrating efficient and logical reasoning steps allows the agent to internalize a problem-solving process. This contrasts with systems focused solely on outcome-based rewards, which may achieve solutions through suboptimal or brittle methods. By learning both what constitutes a correct fix and how to systematically arrive at that fix, agents demonstrate increased robustness, adaptability to novel challenges, and improved generalization capabilities when addressing code defects.

The Rigor of Validation: Measuring True Progress Against Ephemeral Success
Rigorous evaluation of agentic systems is fundamental to their responsible development and deployment, demanding a comprehensive testing strategy that extends beyond initial functionality. Crucially, this involves not only verifying that a system resolves presented issues, but also ensuring the reliability of those fixes through reproducibility tests – confirming the same issue can be consistently addressed after modifications. Equally important are regression tests, designed to detect unintended consequences of updates, preventing the introduction of new problems while resolving existing ones. Without this dual focus on both consistent resolution and the avoidance of side effects, even seemingly successful agentic systems risk becoming unstable and unpredictable in real-world applications, undermining trust and hindering practical utility.
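The two checks can be sketched as a thin harness around the project’s own test suite, as below. The Issue record, the agent.resolve() interface, and the pytest invocation pattern are assumptions about a hypothetical setup rather than an API prescribed by the survey.

```python
# Sketch of the two validation layers described above: repeated resolution of
# the same issue (reproducibility) and a full-suite run after patching (regression).
import subprocess
from dataclasses import dataclass

@dataclass
class Issue:
    repo_path: str      # local checkout of the buggy repository
    failing_test: str   # pytest node id that reproduces the bug
    description: str

def apply_patch(repo_path: str, patch_text: str) -> None:
    # Apply a unified diff with `git apply`; revert_patch below undoes it.
    subprocess.run(["git", "apply", "-"], input=patch_text.encode(),
                   cwd=repo_path, check=True)

def revert_patch(repo_path: str) -> None:
    subprocess.run(["git", "checkout", "--", "."], cwd=repo_path, check=True)

def reproducibility_rate(agent, issue: Issue, runs: int = 5) -> float:
    """Fraction of independent runs whose patch makes the failing test pass."""
    successes = 0
    for _ in range(runs):
        patch = agent.resolve(issue)  # hypothetical agent interface
        apply_patch(issue.repo_path, patch)
        result = subprocess.run(["pytest", issue.failing_test],
                                cwd=issue.repo_path)
        successes += int(result.returncode == 0)
        revert_patch(issue.repo_path)
    return successes / runs

def passes_regression(issue: Issue, patch: str) -> bool:
    """Run the whole suite: a fix that breaks previously passing tests fails here."""
    apply_patch(issue.repo_path, patch)
    result = subprocess.run(["pytest"], cwd=issue.repo_path)
    revert_patch(issue.repo_path)
    return result.returncode == 0
```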
The development of robust and reliable agentic systems hinges on effective evaluation, and SWE-bench emerges as a crucial resource in this pursuit. This meticulously curated dataset offers a standardized platform for benchmarking and comparing the performance of diverse issue resolution systems, enabling researchers and developers to objectively assess their capabilities. SWE-bench isn’t simply a collection of bugs; it’s a carefully constructed environment that simulates real-world software maintenance scenarios, providing a consistent and repeatable method for measuring progress. By offering a common ground for evaluation, SWE-bench fosters innovation and accelerates the development of more intelligent and dependable automated issue resolution tools, ultimately contributing to improved software quality and reduced maintenance costs.
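For orientation, the benchmark’s instances can be loaded in a few lines via the Hugging Face datasets library; the dataset identifier and field names below reflect the publicly released SWE-bench Lite split and should be treated as assumptions about its current packaging rather than details from the survey.

```python
# Loading SWE-bench Lite from the Hugging Face hub (dataset id and field
# names assumed from the public release).
from datasets import load_dataset

swe_bench = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")

instance = swe_bench[0]
# Each instance pairs a real GitHub issue with the repository state at the
# time it was filed and the tests that the reference fix makes pass.
print(instance["repo"])                     # e.g. a project like "astropy/astropy"
print(instance["problem_statement"][:200])  # the issue text given to the agent
print(instance["FAIL_TO_PASS"])             # tests a candidate patch must turn green
```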
A crucial aspect of validating agentic systems extends beyond simply fixing identified issues; it demands careful consideration of benchmark integrity. Evaluations must accurately mirror the complexities of real-world software development scenarios to avoid inadvertently rewarding systems that exploit benchmark quirks rather than demonstrating genuine problem-solving capabilities. Current evaluations reveal a surprisingly low rate of reproducibility – only 49% of issues can be consistently resolved across different runs, even with state-of-the-art systems. This suggests that many apparent successes may be fragile, dependent on specific conditions within the benchmark, and therefore unreliable when applied to actual software maintenance tasks. Ensuring benchmarks are robust and representative is therefore paramount to fostering the development of truly dependable agentic systems.
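A simple way to surface this fragility is to score each issue by whether it is resolved in every independent run rather than in at least one, as in the sketch below; the issue identifiers and per-run outcomes are purely illustrative.

```python
# Sketch of the consistency metric implied above: an issue counts as reliably
# resolved only if every independent run of the system fixes it.

# run_results[issue_id] = list of booleans, one per independent run (illustrative data)
run_results: dict[str, list[bool]] = {
    "issue-a": [True, True, True],
    "issue-b": [True, False, True],   # flaky: resolved in some runs only
    "issue-c": [False, False, False],
}

resolved_any = sum(any(runs) for runs in run_results.values())
resolved_all = sum(all(runs) for runs in run_results.values())

print(f"resolved in at least one run: {resolved_any / len(run_results):.0%}")
print(f"resolved in every run:        {resolved_all / len(run_results):.0%}")
```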
Software Engineering for Agents: A New Discipline for a Decaying System
A new discipline, Software Engineering for Agentic Systems (SE4AS), is rapidly gaining prominence as developers increasingly create autonomous agents capable of complex tasks. Unlike traditional software, these agents require a fundamentally different approach to engineering, focusing on emergent behavior, continuous learning, and robust adaptation to unpredictable environments. SE4AS tackles unique challenges, including ensuring agent safety, managing long-term reliability, and providing effective debugging tools for systems that operate with a degree of independence. This emerging field integrates principles from artificial intelligence, software engineering, and human-computer interaction to build and maintain these complex, self-directed systems, ultimately aiming to establish best practices and standardized methodologies for a future increasingly reliant on intelligent automation.
The widespread integration of agentic systems – autonomous entities capable of problem-solving and task execution – hinges significantly on overcoming inherent technical limitations. Current challenges include ensuring robust reasoning capabilities in complex, real-world scenarios, managing the unpredictable nature of agentic behavior, and guaranteeing the safety and reliability of their actions. Specifically, agents often struggle with tasks requiring common sense knowledge or adapting to unforeseen circumstances, leading to errors or suboptimal outcomes. Addressing these deficiencies requires advancements in areas such as verifiable AI, explainable decision-making, and robust error handling mechanisms. Without substantial progress in these areas, the potential benefits of agentic systems – including increased automation and enhanced productivity – will remain largely unrealized, hindering their practical application across diverse industries and limiting public trust in their capabilities.
The seamless integration of agentic systems into established software development workflows promises a substantial uplift in productivity and a faster pace of innovation. Rather than requiring a complete overhaul of current practices, these systems are designed to augment existing tools and processes – automating repetitive tasks like bug triage, code review, and even preliminary debugging. This allows engineers to concentrate on more complex problem-solving and creative endeavors, effectively multiplying their output. Furthermore, by rapidly identifying and addressing issues early in the development cycle, agentic systems can dramatically reduce technical debt and accelerate the delivery of high-quality software, fostering a more agile and responsive development environment. The potential extends beyond mere efficiency gains; it enables exploration of novel solutions and a quicker iteration on ideas, ultimately driving significant advancements across various technological domains.
The survey of agentic software issue resolution highlights a transient nature inherent in all complex systems. It observes that current LLM-based agents, while promising, are subject to the inevitable entropy of evolving software landscapes. This echoes Brian Kernighan’s observation: “Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.” The pursuit of fully autonomous software maintenance, as detailed in the study, isn’t about achieving a static perfection, but rather developing systems capable of adapting to perpetual change – acknowledging that each ‘improvement’ introduces new avenues for decay and requiring continuous refinement. The lifecycle of software, much like any architecture, is one of constant evolution, not lasting stability.
What Lies Ahead?
The pursuit of agentic systems for issue resolution, as this survey details, is not a search for permanence, but a negotiation with inevitable decay. Each automated fix, each line of generated code, is merely a localized reduction in entropy – a temporary stay against the universe’s preference for disorder. Uptime is not a destination; it’s the transient state between failures. The current reliance on benchmark evaluations, while useful for charting progress, obscures a critical point: benchmarks measure performance within a defined context, failing to account for the infinite, unpredictable edge cases that constitute real-world software operation.
Future work must address this inherent limitation, shifting focus from achieving high scores on contrived datasets to building systems capable of graceful degradation. Reinforcement learning, despite its promise, remains vulnerable to reward hacking and brittle generalization. A more fruitful avenue lies in exploring approaches that prioritize detectability of failure over its prevention – systems designed to quickly identify and isolate issues, even if they cannot consistently avoid them. Stability is an illusion cached by time, and minimizing latency, the tax every request must pay, will become paramount.
Ultimately, the ambition of fully autonomous software maintenance is not about eliminating bugs, but about managing their lifecycle. The goal is not to create perfect software, but software that ages gracefully, adapting and self-repairing in the face of constant, relentless change. The true measure of success will not be the absence of failure, but the speed and efficiency with which systems recover from it.
Original article: https://arxiv.org/pdf/2512.22256.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/