Author: Denis Avetisyan
A new approach uses artificial intelligence to proactively identify and address requirements defects within the software development lifecycle.
This review details the development and evaluation of an agentic AI-powered simulation lab for assessing and improving requirements quality in DevOps environments.
Despite growing recognition of requirements quality’s impact, evaluation remains largely reliant on anecdotal evidence and intuition. This paper, ‘The Software Engineering Simulations Lab: Agentic AI for RE Quality Simulations’, introduces agentic AI simulations as a novel approach to systematically assess requirements quality within modern DevOps pipelines. We demonstrate that replicating software engineering processes with standardized agents enables executable simulations, offering a fast and cost-effective means to quantify the impact of requirement defects. Could this simulation-driven methodology fundamentally reshape how we understand and optimize requirements engineering for increasingly AI-driven systems?
Static Analysis is a Fool’s Errand: The Rise of Agentic AI
Conventional requirements engineering frequently employs static analysis techniques – assessments of documentation performed without executing code or modeling system behavior – an approach that proves increasingly inadequate for modern software development. It struggles to represent the inherent dynamism of complex systems, where interactions between components and evolving user needs create a constantly shifting landscape. Static methods often treat requirements as fixed entities, overlooking the emergent properties and unforeseen consequences that arise from their interplay during execution. Consequently, critical flaws and inconsistencies may remain undetected until late in the development cycle, leading to costly rework and potentially compromised system quality. The limitations of static analysis highlight the need for more proactive and dynamic approaches that can better capture the complexities of modern software and anticipate potential issues before they manifest as tangible problems.
Agentic AI simulations represent a significant departure from traditional methods of assessing requirements quality. Rather than static analysis of documents, these simulations construct a dynamic, virtual environment populated by autonomous agents – each embodying a role within the software development lifecycle, such as developer, tester, or product owner. This allows for the exploration of requirements not as isolated statements, but as they interact within a functioning, albeit simulated, system. By observing agent interactions and emergent behaviors, researchers and developers can proactively identify ambiguities, inconsistencies, or incompleteness in the requirements before code is even written. The evolving nature of the simulation, with agents adapting and responding to changes, mirrors the realities of modern software development, offering a more nuanced and predictive understanding of potential issues and fostering a shift towards more robust and adaptable systems.
Agentic AI simulations offer a powerful method for preemptive problem-solving in software development by modeling the collaborative dynamics of various development roles. These simulations don’t just test code; they recreate the process of building software, with individual agents representing developers, testers, and project managers, each pursuing their objectives and interacting with one another. By observing these interactions within a virtual environment, potential conflicts, communication breakdowns, or flawed assumptions can be identified long before they manifest in a real-world project. This allows for iterative refinement of requirements and processes, reducing the risk of costly errors and delays, and ultimately fostering a more robust and adaptable software development lifecycle. The system essentially functions as a digital ‘dress rehearsal’, enabling teams to proactively address issues and optimize their workflows before committing to implementation.
Orchestrating Chaos: The Simulation Engine at Work
The Simulation Process is the foundational element of our methodology, functioning as a replicable instantiation of a software project environment. This process enables autonomous agents to execute defined tasks within a controlled, virtualized setting. Each instantiation mirrors key aspects of a live project, including code repositories, build systems, and testing frameworks. By replicating these elements, the Simulation Process facilitates the observation and analysis of agent behavior and task completion rates, providing data for iterative refinement of the agents themselves and the underlying processes. This allows for experimentation and optimization without impacting active development pipelines.
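To make that concrete, here is a minimal sketch of how one such replicable project instantiation might be described in code; the field names and tooling choices are illustrative assumptions, not the paper's actual configuration schema.

```python
from dataclasses import dataclass, field

@dataclass
class SimulationClone:
    """One replicable instantiation of a simulated software project.

    All field names are illustrative assumptions; the paper does not
    publish its configuration schema.
    """
    repo_url: str                     # seed code repository for the clone
    build_command: str                # how the build system is invoked
    test_command: str                 # how the testing framework is invoked
    requirements: list[str] = field(default_factory=list)  # requirements under study

    def describe(self) -> str:
        return (f"clone of {self.repo_url} with {len(self.requirements)} "
                f"requirements, built via '{self.build_command}'")

# Spinning up one isolated clone for a simulation run (hypothetical project)
clone = SimulationClone(
    repo_url="https://example.org/sample-project.git",
    build_command="make build",
    test_command="pytest -q",
    requirements=["The system shall export reports as PDF."],
)
print(clone.describe())
```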
The simulation engine supports multiple simulation types to comprehensively analyze system behavior. Dynamic simulations model systems evolving over time with continuous state changes. Stochastic simulations incorporate randomness and probability distributions to represent uncertainty. Event-driven simulations focus on discrete events and their impact on system state, optimizing for scenarios where activity is not constant. Finally, qualitative simulations utilize descriptive, non-numerical data to assess system characteristics and identify potential issues not readily apparent through quantitative analysis. The combined use of these methodologies facilitates a robust and multifaceted understanding of complex system interactions.
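A simple way to picture how an engine might dispatch across these four modes is a tagged selection over simulation types; the enum and the handler behaviors below are assumptions made purely for illustration.

```python
import random
from enum import Enum, auto

class SimulationType(Enum):
    DYNAMIC = auto()       # continuous state evolution over time
    STOCHASTIC = auto()    # randomness drawn from probability distributions
    EVENT_DRIVEN = auto()  # state changes only at discrete events
    QUALITATIVE = auto()   # descriptive, non-numerical assessment

def run_step(sim_type: SimulationType, state: dict) -> dict:
    """Illustrative dispatch over the four simulation modes (hypothetical handlers)."""
    if sim_type is SimulationType.DYNAMIC:
        state["t"] = state.get("t", 0.0) + 0.1            # advance continuous time
    elif sim_type is SimulationType.STOCHASTIC:
        state["load"] = random.gauss(100, 15)              # sample an uncertain load
    elif sim_type is SimulationType.EVENT_DRIVEN:
        state["events"] = state.get("events", 0) + 1       # consume one discrete event
    else:  # QUALITATIVE
        state["notes"] = "requirement R1 appears ambiguous"  # descriptive finding
    return state
```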
The Model Context Protocol facilitates integration between simulation agents and existing DevOps pipelines, establishing a closed-loop system for iterative refinement. This connection allows data generated from agent actions within the simulated environment to be fed back into the pipeline, triggering adjustments and optimizations to the processes being modeled. Each complete simulation run, or ‘clone’, currently requires approximately 5.5 hours for execution, encompassing data generation, analysis, and feedback loop completion. This timeframe is a key performance indicator for evaluating the efficiency of both the simulation engine and the underlying DevOps processes it mirrors.
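A rough sketch of that closed loop follows: results from one simulated clone feed a hypothetical pipeline adjustment before the next run. The function names and the adjustment rule are assumptions; the actual Model Context Protocol wiring is not reproduced here.

```python
def run_clone(config: dict) -> dict:
    """Placeholder for one ~5.5 hour simulation run; returns observed results."""
    return {"defects_found": 3, "coverage": 0.40}    # illustrative output only

def adjust_pipeline(config: dict, results: dict) -> dict:
    """Hypothetical feedback step: tighten the quality gate when defects surface."""
    if results["defects_found"] > 0:
        config["required_coverage"] = max(config.get("required_coverage", 0.3),
                                          results["coverage"])
    return config

# Closed-loop refinement over successive clones
config = {"required_coverage": 0.3}
for iteration in range(3):
    results = run_clone(config)
    config = adjust_pipeline(config, results)
```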
LLMs and Digital Pawns: Powering the Agents
The agent framework utilizes Large Language Models (LLMs), specifically the DeepSeek-V3.2-Exp model, to automate the creation and validation of requirements artifacts. This LLM enables each agent to perform four core functions: planning the requirements process, generating code based on those requirements, executing tests to verify functionality, and reviewing the artifacts for defects. DeepSeek-V3.2-Exp was selected for its capabilities in code generation and reasoning, allowing agents to move beyond simple text completion and engage in more complex, iterative development tasks. The model’s outputs are directly integrated into the agent workflow, facilitating a closed-loop system for requirements engineering.
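The sketch below shows how one such agent action might call the model, assuming DeepSeek's OpenAI-compatible endpoint; the model identifier, prompts, and API key are assumptions rather than details taken from the paper.

```python
from openai import OpenAI  # assumes DeepSeek's OpenAI-compatible endpoint

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY")  # hypothetical key

ROLE_PROMPTS = {
    "plan":   "Decompose this requirement into implementable tasks:\n{req}",
    "code":   "Write Python code implementing this requirement:\n{req}",
    "test":   "Write pytest unit tests for this requirement:\n{req}",
    "review": "Review this requirement and its artifacts for defects:\n{req}",
}

def agent_step(function: str, requirement: str) -> str:
    """One agent action (plan / code / test / review) backed by the LLM.

    The model name below is an assumption about how DeepSeek-V3.2-Exp is
    exposed through the API; the prompts are illustrative, not the paper's.
    """
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "user",
                   "content": ROLE_PROMPTS[function].format(req=requirement)}],
    )
    return response.choices[0].message.content
```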
The agent system utilizes a defined role structure, assigning each agent one of four functions: planner, developer, tester, or reviewer. This specialization is intended to mimic realistic collaborative software development workflows. The planner agent is responsible for initial requirements definition and decomposition. The developer agent translates these requirements into executable code. The tester agent then validates the implemented code against the original requirements, and finally, the reviewer agent assesses both the code and test results to identify potential issues or inconsistencies. This division of labor allows for focused task execution and facilitates the observation of inter-agent communication patterns during the requirements engineering process.
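The handoff between roles can be pictured as a short sequential pipeline; the chaining order and prompt composition below are assumptions about how such a scheduler could look, with the LLM call abstracted behind a callable like the `agent_step` helper sketched above.

```python
from typing import Callable

def simulate_requirement(requirement: str,
                         step: Callable[[str, str], str]) -> dict:
    """Illustrative planner -> developer -> tester -> reviewer handoff.

    `step(function, prompt)` is any LLM-backed helper (e.g. the agent_step
    sketch above); the chaining order is an assumption, not the paper's scheduler.
    """
    plan   = step("plan", requirement)                            # planner decomposes
    code   = step("code", f"{requirement}\n\nPlan:\n{plan}")      # developer implements
    tests  = step("test", f"{requirement}\n\nCode:\n{code}")      # tester validates
    review = step("review", f"Code:\n{code}\n\nTests:\n{tests}")  # reviewer inspects both
    return {"plan": plan, "code": code, "tests": tests, "review": review}
```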
Our agent-based system enables the observation of interactions between specialized agents – planner, developer, tester, and reviewer – and their impact on requirements quality. Prototype implementation results demonstrate a 34% success rate in merging proposed requirements changes, indicating effective collaborative refinement. Furthermore, 23% of requirements successfully passed all associated unit tests following agent-driven development and validation, providing a quantitative measure of defect identification and resolution capabilities. These metrics are derived from evaluating the output of agents as they fulfill their defined roles within the system.
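Both figures reduce to simple ratios over simulation outcomes; the record format assumed below is hypothetical, but the 34% and 23% values in the text are ratios of exactly this kind.

```python
def summarize(outcomes: list[dict]) -> dict:
    """Compute merge-success and all-tests-passed rates over simulated requirements.

    Each outcome record is assumed to carry two booleans; the field names
    are illustrative, not taken from the paper.
    """
    n = len(outcomes)
    merged = sum(o["change_merged"] for o in outcomes)
    passed = sum(o["all_tests_passed"] for o in outcomes)
    return {"merge_rate": merged / n, "pass_rate": passed / n}

# e.g. summarize([{"change_merged": True, "all_tests_passed": False}, ...])
```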
The Cost of Prediction: Beyond Functionality and Into Reality
The intricacies of software development are increasingly illuminated through detailed simulation of realistic project scenarios. These computational experiments reveal a strong correlation between the initial quality of requirements and the efficiency of the entire development lifecycle. By modeling factors like ambiguity, completeness, and consistency in requirements, researchers can observe how these attributes propagate – or fail to propagate – through design, implementation, and testing phases. Simulations demonstrate that poor requirements quality often leads to increased rework, delayed timelines, and ultimately, higher project costs. This approach allows for proactive identification of potential bottlenecks and provides a quantifiable basis for investing in improved requirements engineering practices, offering a powerful tool for optimizing software development processes and bolstering project success rates.
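One way such an experiment could be framed is defect injection: run the same simulated project with a clean and a deliberately degraded requirement set, then compare downstream rework. The defect taxonomy and degradation rules below are illustrative assumptions, not the paper's method.

```python
import copy

DEFECT_TYPES = ("ambiguity", "incompleteness", "inconsistency")  # illustrative taxonomy

def inject_defect(requirements: list[str], kind: str) -> list[str]:
    """Degrade the first requirement to mimic a quality defect (hypothetical rules)."""
    degraded = copy.deepcopy(requirements)
    if kind == "ambiguity":
        degraded[0] = degraded[0].replace("shall", "should probably")
    elif kind == "incompleteness":
        degraded[0] = degraded[0].split(",")[0]   # drop any qualifying clauses
    else:  # inconsistency: add a contradicting statement
        degraded.append("The system shall NOT " + degraded[0].lower())
    return degraded

reqs = ["The system shall export reports as PDF, including charts and footnotes."]
print(inject_defect(reqs, "incompleteness"))
# A clean clone and a degraded clone can then be run side by side to measure added rework.
```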
Through rigorous simulation, the economic benefits of prioritizing requirement quality become demonstrably clear, a process facilitated by frameworks such as ABRE-QM. This methodology doesn’t simply advocate for better requirements; it establishes a quantifiable link between their precision and overall project success. Current simulations reveal an average unit test line coverage of 40%, a metric directly correlated with reduced defect rates and streamlined development cycles. By meticulously analyzing simulation outcomes, specific areas for improvement within the requirements engineering process are identified, allowing for targeted interventions and optimized resource allocation. The result is a data-driven approach to requirements management, moving beyond subjective assessments to provide concrete evidence of value and a pathway toward continuous improvement in software development.
The pursuit of comprehensive software development insights through large-scale simulations carries a significant environmental cost. Each instance, or ‘clone,’ of the simulation generates 0.6 kilograms of carbon dioxide emissions, a figure stemming from the intensive computational resources required. This process demands substantial token processing – approximately 94.2 million input tokens and 269,600 output tokens per clone – highlighting the energy consumption inherent in these analyses. Currently, the financial expenditure for each simulation clone is $3.27, a cost that, while seemingly modest, accumulates rapidly with extensive testing and underscores the need for resource-conscious methodologies in computational research.
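The dollar figure is, at bottom, token arithmetic. In the sketch below the per-million-token prices are assumptions chosen so the result roughly reproduces the reported $3.27; they are not published rates.

```python
# Per-clone cost as simple token arithmetic. The per-million-token prices are
# assumptions picked to roughly reproduce the reported $3.27, not actual pricing.
INPUT_TOKENS = 94_200_000     # ~94.2 M input tokens per clone (reported)
OUTPUT_TOKENS = 269_600       # ~269,600 output tokens per clone (reported)
PRICE_IN_PER_M = 0.034        # assumed USD per million input tokens
PRICE_OUT_PER_M = 0.25        # assumed USD per million output tokens

cost = (INPUT_TOKENS / 1e6) * PRICE_IN_PER_M + (OUTPUT_TOKENS / 1e6) * PRICE_OUT_PER_M
print(f"approx. cost per clone: ${cost:.2f}")   # ~$3.27 with these assumed prices
```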
The pursuit of automated requirements validation, as detailed in this paper, feels…familiar. It’s a beautifully complex system built to predict chaos, a digital attempt to tame the inevitable. It reminds one of David Hilbert’s assertion: “We must be able to answer, yes or no, to any mathematical question.” This research aims for a similar definitive answer regarding requirement quality – a ‘yes’ meaning a functional implementation, a ‘no’ signaling impending disaster. Of course, production always has a way of introducing novel failure modes, turning elegant simulations into notes for future digital archaeologists. The promise of agentic AI to model complex interactions within a CI/CD pipeline is compelling, but the system will ultimately be judged by its ability to withstand the relentless assault of real-world usage, not just simulated scenarios. If it crashes consistently, at least it’s predictable.
What’s Next?
The demonstrated coupling of agentic AI with existing CI/CD pipelines is, predictably, not a panacea. The simulations reveal defects, certainly, but translating those simulated impacts into actionable pre-implementation changes remains a significant hurdle. The agents dutifully expose fragility, but production will always find novel failure modes; that's a feature, not a bug. The current work provides a fascinating, if idealized, view of requirements quality assessment, but scaling these simulations to reflect the chaotic reality of large-scale software development is a considerable challenge.
Future iterations will inevitably grapple with the cost of fidelity. More realistic simulations demand exponentially more computational resources and more accurate agent behaviors. The question then becomes not whether the system can detect defects, but at what cost. Every abstraction dies in production, and this framework, while elegant, will eventually succumb to the same fate. The crucial next step isn't simply improving the agents, but developing methods to gracefully degrade simulation fidelity without losing predictive power.
Ultimately, the true test lies in demonstrating long-term value. Will these simulations measurably reduce post-release defects, or will they simply become another layer of pre-deployment anxiety, a beautifully rendered dashboard displaying the inevitable? The research hints at potential, but it’s a safe bet that the most interesting bugs are still waiting to be discovered.
Original article: https://arxiv.org/pdf/2511.17762.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/