Taming the Uncertainty: AI for Reliable Software

Author: Denis Avetisyan


A new formal framework aims to bridge the gap between the probabilistic nature of modern AI and the demand for dependable software systems.

This review proposes a system built around ‘Atomic Action Pairs’ and deterministic verification to manage the stochasticity of Large Language Models in software engineering contexts.

Current AI coding agents often conflate the strengths of Large Language Models with the need for deterministic control in software engineering, leading to unreliable outputs. This paper, ‘Managing the Stochastic: Foundations of Learning in Neuro-Symbolic Systems for Software Engineering’, proposes a formal neuro-symbolic architecture that separates stochastic generation from deterministic workflow control. By formalizing ‘Atomic Action Pairs’ and ‘Guard Functions’, the framework enables reliable code generation by coupling LLM outputs with immediate verification. Could architectural constraints, rather than simply increasing model scale, be the key to unlocking truly dependable AI-driven software development?


The Fragility of Legacy Systems

A significant obstacle to digital transformation lies within the very systems organizations rely upon most: their legacy infrastructure. Many critical applications, despite delivering ongoing business value, suffer from a profound lack of comprehensive test coverage. This absence isn’t simply a matter of technical debt; it actively hinders modernization efforts, transforming necessary updates into high-risk operations. Without robust validation, even minor refactoring – intended to improve performance or security – can introduce unforeseen regressions and system failures. Consequently, organizations find themselves trapped, hesitant to innovate or adapt due to the potential for disruption, and perpetually vulnerable to escalating maintenance costs and security breaches. The lack of testing doesn’t just impede progress; it fundamentally increases the risk associated with continuing to operate these vital, yet fragile, systems.

Legacy systems, often decades old and critical to ongoing operations, present a unique testing challenge due to their inherent complexity and lack of documentation – a condition known as ‘opacity’. Traditional testing approaches, designed for well-defined and transparent codebases, falter when confronted with intricate dependencies, undocumented business rules, and data flows that are poorly understood even by long-time maintainers. These systems frequently lack unit tests, and integration testing becomes exponentially more difficult as the number of interwoven components increases. The result is a situation where even minor changes carry a significant risk of unintended consequences, and comprehensive validation is often impractical, forcing organizations to choose between maintaining brittle infrastructure or undertaking costly and potentially disruptive rewrites.

Refactoring legacy systems – modifying existing code without altering its external behavior – becomes extraordinarily risky in the absence of robust validation infrastructure. Without automated tests and comprehensive monitoring, even seemingly minor code changes can introduce unintended consequences, potentially destabilizing critical functions and creating difficult-to-diagnose errors. This lack of confidence in the system’s integrity actively discourages innovation; developers become hesitant to improve or extend functionality, fearing the ripple effects of untested modifications. Consequently, organizations reliant on these systems find themselves trapped in a cycle of technical debt, where the cost of change outweighs the benefits, ultimately hindering their ability to adapt to new challenges and opportunities. The result is not merely a maintenance burden, but a significant impediment to growth and competitiveness.

Systematic Validation: A Pragmatic Approach

The ‘BootstrappingWorkflow’ is an iterative process designed to introduce automated validation to codebases lacking comprehensive testing. It moves beyond simply adding tests to existing functionality by prioritizing the systematic discovery of current system behavior as the foundation for validation suites. This workflow emphasizes building a safety net around the existing code before significant refactoring or feature additions, mitigating regression risks. The core principle involves incrementally expanding validation coverage, starting with observable behavior and progressively increasing confidence in the system’s stability through automated checks. This approach is particularly useful for ‘LegacySystem’ integration where documentation may be incomplete or outdated.
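
As a rough sketch of the idea, the following Python snippet freezes a toy legacy routine’s observed outputs and turns them into a regression check; every name here is illustrative rather than drawn from the paper.

```python
def legacy_pricing(qty: int) -> float:
    """Stand-in for an undocumented legacy routine."""
    return qty * 9.99 * (0.9 if qty >= 10 else 1.0)

def bootstrap(system, probe_inputs):
    """Freeze the system's observed outputs as the baseline."""
    return {x: system(x) for x in probe_inputs}

def check(system, baseline):
    """Re-run the probes; any drift from the baseline is a regression."""
    return {x: (want, system(x)) for x, want in baseline.items()
            if system(x) != want}

baseline = bootstrap(legacy_pricing, [1, 5, 10, 50])
assert check(legacy_pricing, baseline) == {}   # safety net established
```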

Characterization Testing involves executing existing code with representative inputs and recording the observed outputs, which then become the assertions that future runs must satisfy. This process doesn’t require prior knowledge of intended functionality; instead, it empirically determines what the system already does. The resulting tests, often referred to as ‘characterization tests’, serve as a baseline against which future changes can be measured, preventing regressions. They are built by capturing the actual outputs of the system for a given set of inputs, effectively creating a snapshot of the current functionality and establishing initial ‘guardrails’ to maintain that behavior during development.
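
In test-suite form, a characterization test simply pins those captured values down. The routine and the expected outputs below are hypothetical; in practice the expectations are recorded from the running system.

```python
import unittest

def normalize_code(raw: str) -> str:
    """Hypothetical legacy routine under characterization."""
    return raw.strip().upper().replace(" ", "-")

class CharacterizationTests(unittest.TestCase):
    # Expected values were captured from the running system, not derived
    # from a spec: they pin down behavior exactly as it is today.
    def test_observed_behavior(self):
        self.assertEqual(normalize_code("  ab cd "), "AB-CD")
        self.assertEqual(normalize_code("x"), "X")

if __name__ == "__main__":
    unittest.main()
```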

Dependency mapping within a legacy system involves identifying and documenting all internal and external connections between code modules, data stores, and external services. This process typically utilizes static analysis tools, code review, and architectural documentation to create a visual representation of these relationships. Accurate dependency mapping is crucial for designing effective testing strategies because it highlights potential ripple effects of code changes, pinpoints critical integration points requiring specific test coverage, and allows for the prioritization of tests based on the impact of failing components. Without a clear understanding of these dependencies, testing efforts may be misdirected, resulting in incomplete coverage and increased risk of unforeseen issues in production.
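
For Python codebases, one plausible starting point is static extraction of import relationships with the standard `ast` module; a real mapping effort would also need to cover data stores and external services.

```python
import ast

def module_imports(source: str) -> set:
    """Return the top-level modules a source file depends on."""
    deps = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            deps.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            deps.add(node.module.split(".")[0])
    return deps

src = "import os\nfrom billing.rules import discount\n"
print(module_imports(src))  # {'os', 'billing'}
```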

Defining Workflow Integrity Through State and Guards

A ‘WorkflowState’ serves as the foundational element for validation by providing a fully deterministic snapshot of the system’s progression. This representation encapsulates all relevant parameters and data defining the current status of the workflow, enabling consistent and repeatable verification processes. Critically, the ‘WorkflowState’ is designed to be independent of external, potentially non-deterministic factors, ensuring that any validation failure can be directly attributed to the system’s internal logic. This deterministic nature is essential for debugging, testing, and the reliable automation of refinement procedures, as it allows for precise identification of the point at which a workflow deviates from its expected behavior.
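
One natural encoding, going no further than what the paragraph describes, is an immutable record whose transitions produce new values instead of mutating in place; the field names below are illustrative.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class WorkflowState:
    step: int            # position in the workflow
    artifact: str        # latest generated output
    checks_passed: int   # guard results so far

    def advance(self, artifact: str, passed: bool) -> "WorkflowState":
        """Each transition yields a fresh, fully reproducible snapshot."""
        return replace(self, step=self.step + 1, artifact=artifact,
                       checks_passed=self.checks_passed + int(passed))

s0 = WorkflowState(step=0, artifact="", checks_passed=0)
s1 = s0.advance("def f(): ...", passed=True)
assert s0.step == 0 and s1.step == 1   # earlier states remain intact
```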

Guard Functions are discrete validation checks implemented at each step of a workflow to confirm that generated outputs meet predefined behavioral expectations. These functions operate on the output of an action and return a boolean value indicating pass or fail status. Implementation involves defining specific criteria for acceptable output, such as data type, range, or format, and expressing these criteria as testable conditions within the function. Failure of a Guard Function halts further workflow progression, preventing the system from proceeding with invalid data or actions and triggering appropriate error handling or corrective measures. The use of Guard Functions is critical for maintaining system integrity and ensuring reliable operation, particularly in automated workflows where human oversight is limited.
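
A guard in this sense is just a pure predicate over an action’s output. The sketch below uses syntactic validity as the pass criterion; the paper’s actual guard criteria may of course be richer.

```python
import ast

def syntax_guard(generated_code: str) -> bool:
    """Pass only if the output is syntactically valid Python."""
    try:
        ast.parse(generated_code)
        return True
    except SyntaxError:
        return False

def run_step(generate, guard):
    out = generate()
    if not guard(out):
        # Halt immediately: invalid output must not propagate downstream.
        raise RuntimeError("guard failed: halting workflow")
    return out

print(run_step(lambda: "x = 1 + 1", syntax_guard))   # passes the guard
```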

An ‘AtomicActionPair’ integrates the generation of a system output with its subsequent verification, creating a mechanism for dense reward signaling. This pairing allows for immediate feedback on the validity of each action taken, facilitating automated refinement processes. Instead of sparse rewards received only upon completion of a larger task, the system receives a reward, or penalty, for each action-verification cycle. This granular feedback is crucial for algorithms employing reinforcement learning or similar iterative improvement techniques, as it allows for more efficient exploration of the solution space and faster convergence towards optimal behavior. The density of these signals directly correlates to the speed and stability of the automated refinement process.
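
A minimal sketch of the pairing: one stochastic generation coupled to one deterministic check under a retry bound, with a reward emitted per attempt. The 1/attempt shaping is one possible choice made here for illustration, not the paper’s.

```python
import random

def atomic_action_pair(generate, verify, r_max=3):
    """Return (output, reward); reward arrives per action, not per task."""
    for attempt in range(1, r_max + 1):
        out = generate()
        if verify(out):
            return out, 1.0 / attempt   # earlier success, larger reward
    return None, 0.0                    # retries exhausted: penalty

gen = lambda: random.choice(["def f(): return 1", "def f( return"])
ver = lambda code: code.count("(") == code.count(")")
print(atomic_action_pair(gen, ver))
```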

The ‘DualStateSolutionSpace’ architecture distinguishes between workflow state and environment state, representing them as separate, independent data structures. This separation enhances system clarity by isolating the internal progression of the workflow logic from the external, potentially dynamic, environment. Implementation of this architecture has demonstrated reliability gains of up to 66 percentage points in testing, attributed to reduced state-space complexity and simplified debugging processes. Maintaining distinct state representations allows for targeted validation and refinement of each component without impacting the other, contributing to improved maintainability and reduced error propagation.
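
A minimal rendering of the separation, with illustrative names: the workflow state is a pure value that is replaced wholesale, while only the environment is ever mutated.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class WorkflowState:               # internal progression: a pure value
    step: int
    last_guard_ok: bool

@dataclass
class EnvironmentState:            # external world: files, services, etc.
    files: dict = field(default_factory=dict)

def apply_action(wf, env, path, code, guard):
    ok = guard(code)
    if ok:
        env.files[path] = code     # environment changes live here only
    return WorkflowState(wf.step + 1, ok)

wf, env = WorkflowState(0, True), EnvironmentState()
wf = apply_action(wf, env, "util.py", "def f(): return 1",
                  guard=lambda c: c.startswith("def"))
assert wf.step == 1 and "util.py" in env.files
```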

Model performance varies significantly by task under a guarded configuration ($R_{max}=3$): DeepSeek-Coder (1.3B) fails across all tasks, Phi4-Mini shows task-specific reliability (58% on LRU, 0% on password), and a data corruption issue causes Qwen2.5-Coder (14B) to fail on password tasks.

Toward Robust Refactoring and Automated Validation

Refactoring safety, a cornerstone of robust software development, is demonstrably achieved through the systematic implementation of ‘GuardFunction’ checks. These functions act as preemptive validation points integrated throughout the codebase, meticulously verifying preconditions and invariants before, during, and after any modification. By establishing a network of these automated tests, the system proactively identifies potential regressions or unintended consequences arising from refactoring efforts. Each ‘GuardFunction’ essentially poses a question – “Is this change safe?” – and halts execution if the answer is negative, thereby preventing the introduction of errors. This rigorous approach ensures that refactoring doesn’t compromise the existing functionality, allowing developers to confidently restructure code while maintaining its integrity and reliability. The consistent application of these checks moves software maintenance from reactive debugging to proactive prevention.
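
As a hedged illustration, the same behavioral guards can be run against both the old and the candidate implementation, and the change accepted only if every guard still answers yes.

```python
from statistics import mean

def old_impl(xs):                  # behavior to preserve
    return sum(xs) / len(xs)

def new_impl(xs):                  # candidate refactoring
    return mean(xs)

GUARDS = [
    lambda f: f([2, 4]) == 3,
    lambda f: abs(f([1, 2, 3]) - 2) < 1e-9,
]

def is_safe(candidate) -> bool:
    """'Is this change safe?' Every guard must answer yes."""
    return all(guard(candidate) for guard in GUARDS)

assert is_safe(old_impl) and is_safe(new_impl)
```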

The implementation of new features often introduces unforeseen errors, but a ‘TDDWorkflow’ – Test-Driven Development workflow – mitigates this risk through a cyclical process of writing tests before implementing the feature itself. This approach ensures that every addition is rigorously validated against predefined expectations, fostering a system built on demonstrable correctness. Rather than discovering issues post-implementation, potential problems are identified and addressed during the development phase, leading to a more stable and reliable system. Consequently, developers gain increased confidence in their changes, allowing for faster iteration and reduced debugging time, ultimately streamlining the development lifecycle and minimizing the potential for disruptive errors in production environments.
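
The cycle is easiest to see in miniature. In the (hypothetical) example below the test was written first, failed, and the feature was then implemented only to satisfy it.

```python
import unittest

def slugify(title: str) -> str:
    # Implemented after the test below existed and failed.
    return title.strip().lower().replace(" ", "-")

class TestSlugify(unittest.TestCase):
    def test_spec_written_first(self):
        self.assertEqual(slugify(" Hello World "), "hello-world")

if __name__ == "__main__":
    unittest.main()
```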

A central component of ensuring refactoring safety and automation lies in the implementation of ‘WorkflowSpecification’, a declarative format designed for comprehensive workflow definition and management. Rather than relying on imperative code to dictate process, this specification allows developers to describe the desired workflow state, including preconditions, postconditions, and allowable transitions. This approach offers several advantages; it promotes clarity and maintainability, enabling easier auditing and modification of complex processes. Furthermore, by separating the workflow logic from its execution, the specification facilitates automated verification and validation, ultimately bolstering confidence in the refactoring process and minimizing the potential for unforeseen errors. The declarative nature also allows for dynamic workflow adaptation, responding to changing conditions or requirements without necessitating extensive code rewrites.
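
One plausible shape for such a specification, invented here for illustration rather than taken from the paper, is a plain declarative structure that can be checked before anything executes.

```python
SPEC = {
    "workflow": "add_feature",
    "steps": [
        {"action": "generate_test", "guard": "test_fails_initially"},
        {"action": "generate_code", "guard": "all_tests_pass"},
        {"action": "refactor",      "guard": "all_tests_pass"},
    ],
    "transitions": {"on_guard_failure": "retry", "max_retries": 3},
}

def validate_spec(spec: dict) -> bool:
    """Verify the declared structure before executing any step."""
    steps = spec.get("steps", [])
    return bool(steps) and all({"action", "guard"} <= set(step)
                               for step in steps)

assert validate_spec(SPEC)
```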

The system effectively manages intricate scenarios by strategically incorporating ‘HumanGuard’ checkpoints when fully automated verification proves inadequate. This hybrid approach, blending automated testing with targeted human oversight, introduces a remarkably low computational overhead – only 1.2 to 2.1 times that of standard sampling methods. Critically, workflow execution remained reliable, completing successfully on the first attempt in all observed cases. This balance between automation and human intervention allows for robust validation of complex processes without incurring prohibitive computational costs or compromising workflow stability, suggesting a scalable solution for demanding applications.
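
A sketch of the escalation logic, with the human checkpoint reduced to a stand-in callable: automated guards run first, and only cases they cannot resolve reach a person.

```python
def guarded_step(generate, auto_guard, human_guard, r_max=3):
    """Try automated verification up to r_max times, then escalate."""
    for _ in range(r_max):
        out = generate()
        if auto_guard(out):
            return out                    # automated path: no human cost
    out = generate()
    if human_guard(out):                  # HumanGuard checkpoint
        return out
    raise RuntimeError("rejected at human checkpoint")

# In practice human_guard might open a review ticket or a CLI prompt;
# here it is a stand-in for demonstration.
result = guarded_step(lambda: "print('ok')",
                      auto_guard=lambda s: s.endswith(")"),
                      human_guard=lambda s: True)
print(result)
```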

The pursuit of deterministic control, central to this framework of Atomic Action Pairs, echoes a fundamental tenet of rigorous systems. Alan Turing observed, “Sometimes people who are untidy enjoy efficiency.” This seeming paradox aptly describes the approach detailed within the study. The framework doesn’t seek to eliminate the inherent stochasticity of Large Language Models – an impossibility, and perhaps an inefficiency – but to manage it. By pairing atomic actions with formal verification, the system achieves predictability not through absolute certainty, but through constrained probability. This mirrors Turing’s insight: embracing a degree of ‘untidiness’ – the probabilistic nature of LLMs – while imposing order through verification to achieve effective and reliable software engineering outcomes.

Where to Next?

The proposition – to impose deterministic control upon stochastic systems – is, predictably, more easily stated than achieved. This work establishes a foundation, yes, but highlights the enduring tension: can a fundamentally probabilistic agent ever truly be verified? Future efforts must address the inevitable discrepancies between formal specification and the lived experience of LLM-driven software. The current framework, focused on Atomic Action Pairs, offers a useful granularity, but scaling this approach to complex systems will demand not merely more pairs, but a more intelligent method of abstraction.

A critical limitation remains the reliance on complete specification. Real-world software is rarely, if ever, defined with such precision. Research should investigate methods for learning specifications from observed behavior, or, more radically, for building systems robust enough to tolerate incomplete or even contradictory instructions. The pursuit of formal guarantees should not become an excuse for ignoring the inherent messiness of the problems being solved.

Ultimately, the question is not whether neuro-symbolic systems can be made reliable, but whether the cost of that reliability – in terms of expressiveness, adaptability, and ultimately, usefulness – is acceptable. Simplicity, it bears repeating, is not limitation; it is intelligence. The next stage demands a ruthless pruning of complexity, a commitment to clarity, and a willingness to abandon guarantees that prove illusory.


Original article: https://arxiv.org/pdf/2512.20660.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
