Can AI Now Write Your Requirements?

Author: Denis Avetisyan


New research demonstrates that artificial intelligence can significantly aid in defining and evaluating system requirements, but human oversight remains critical.

The progression of artificial intelligence within requirements engineering demonstrates a clear historical trajectory, evolving from initial conceptual frameworks to increasingly sophisticated implementations capable of automating and optimizing the traditionally manual processes of elicitation, analysis, specification, and validation.

An empirical evaluation of AI assistance in requirements engineering finds that large language models perform effectively when integrated into a human-in-the-loop workflow for improved requirement quality and classification.

Despite decades of established practice, quality assessment within requirements engineering remains heavily reliant on subjective expert judgment, creating a bottleneck in systems development. This research, titled ‘AI-Assisted Requirements Engineering: An Empirical Evaluation Relative to Expert Judgment’, investigates the potential of artificial intelligence to augment, rather than replace, human expertise in this critical process. Our findings demonstrate that AI tools can provide consistent and rapid preliminary evaluations of requirement quality, particularly for structural attributes, yet contextual interpretation and nuanced reasoning still necessitate human oversight. How can we best integrate these ‘copilot’ AI systems into existing systems engineering workflows to maximize their benefits while preserving accountability and engineering rigor?


Deconstructing Requirements: The Bottleneck of Complexity

Historically, the process of defining what a system should do – requirements engineering – has relied heavily on manual effort, creating a significant bottleneck in project timelines. This reliance on human documentation and review is not only remarkably time-consuming but also introduces a high risk of inconsistencies, ambiguities, and omissions. These flaws frequently cascade through the development lifecycle, leading to costly rework, delayed releases, and ultimately, projects that fail to meet stakeholder expectations. The inherent challenges of maintaining a cohesive and accurate set of requirements manually, especially in large and complex systems, demonstrate a clear need for more robust and efficient approaches to ensure project success and minimize the potential for preventable failures.

Contemporary systems, ranging from smart cities to autonomous vehicles and intricate financial models, present a dramatic escalation in complexity, necessitating a fundamental shift in how requirements are managed and analyzed. Traditional, document-centric approaches struggle to cope with the sheer volume, interconnectedness, and dynamic nature of these modern requirements. Scalable solutions are no longer simply desirable; they are essential for maintaining project coherence and preventing costly errors. These solutions must move beyond basic tracking to incorporate advanced analytical capabilities, enabling engineers to identify conflicts, prioritize needs, and ensure traceability across all system components. Without efficient methods for handling this complexity, projects risk scope creep, missed deadlines, and ultimately, failure to deliver functional, reliable systems that meet evolving user expectations.

Despite advancements in computational power, current automated requirements engineering tools frequently struggle with the subtleties inherent in natural language. These systems often rely on keyword matching or simplistic pattern recognition, leading to misclassification of requirements and inaccurate quality assessments. The lack of contextual understanding means that ambiguity, implicit assumptions, and domain-specific terminology are often overlooked, resulting in false positives or, more critically, false negatives. Consequently, while automation promises efficiency, its effectiveness is limited by an inability to discern the true intent and meaning embedded within requirements specifications, necessitating continued human oversight and validation to ensure project success and prevent costly errors.

The requirements engineering process follows a core set of stages, as illustrated by this flowchart.

Unlocking Automation: The Language of Systems

Large Language Models (LLMs) are increasingly utilized to automate tasks in requirements engineering, specifically in the areas of requirements classification and quality assessment. LLMs can process natural language requirements documents and automatically assign categories based on feature, priority, or other defined criteria, reducing manual effort and improving consistency. For quality assessment, LLMs can identify potential issues such as ambiguity, incompleteness, or inconsistency by analyzing the text and flagging problematic statements. This automated analysis facilitates early detection of defects, leading to improved requirements quality and reduced downstream development costs. The efficiency gains from LLM implementation are particularly notable in large-scale projects with extensive requirements documentation.
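The classification workflow described above can be sketched as a thin wrapper around any chat-completion endpoint. This is a minimal illustration, assuming a zero-shot prompt and a pluggable `complete` callable; the prompt wording and label set are assumptions, not the study's exact setup.

```python
# Sketch of LLM-assisted requirement classification. `complete` stands in
# for any chat-completion call (GPT-4, Claude, Llama 3); the prompt and
# label set are illustrative assumptions.

LABELS = ["functional", "non-functional"]

def build_prompt(requirement: str) -> str:
    """Build a zero-shot classification prompt for one requirement."""
    return (
        "Classify the following software requirement as exactly one of: "
        + ", ".join(LABELS)
        + f".\nRequirement: {requirement}\nAnswer with the label only."
    )

def parse_label(response: str) -> str:
    """Map a free-text model response onto a known label (or 'unknown')."""
    text = response.strip().lower()
    # Check longer labels first so 'non-functional' is never matched
    # as its substring 'functional'.
    for label in sorted(LABELS, key=len, reverse=True):
        if label in text:
            return label
    return "unknown"

def classify(requirement: str, complete) -> str:
    """Classify one requirement via any callable mapping prompt -> response."""
    return parse_label(complete(build_prompt(requirement)))
```

Because the model call is injected as a plain callable, the same harness can be run against different LLMs, or against a stub during testing.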

Large Language Models (LLMs), including GPT-4, Claude, and Llama 3, facilitate automated requirements analysis by processing natural language text to extract salient information. These models utilize techniques like Named Entity Recognition (NER) and relationship extraction to pinpoint key attributes within requirement statements, such as actors, actions, and objects. Subsequent categorization is achieved through text classification algorithms, enabling efficient assignment of requirements to predefined classes – for example, functional versus non-functional, or categorization based on system components. This automated process significantly reduces manual effort and improves the consistency and speed of requirements documentation and management, allowing for scalable analysis of large requirement sets.

Large Language Models (LLMs) facilitate the automated distinction between functional and non-functional requirements within requirements documentation. Functional requirements, detailing what a system should do, are identified through analysis of action-oriented verbs and desired outcomes described in natural language. Simultaneously, LLMs can recognize non-functional requirements – those defining how the system should perform – by detecting keywords and phrases related to quality attributes such as performance, security, usability, and reliability. This automated categorization streamlines the process of defining system behavior and quality characteristics, reducing manual effort and improving the consistency of requirements definition. The models achieve this by leveraging their training on extensive text datasets to recognize patterns and associations between language and requirement types.
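As a transparent point of comparison, the keyword-driven signal this paragraph describes can be approximated with a simple baseline. The keyword lists below are illustrative assumptions, not the features an LLM actually learns:

```python
# A keyword baseline for the functional / non-functional split.
# The quality-attribute keywords are illustrative only; real LLMs rely on
# learned patterns, not fixed lists, and substring matching is a known
# simplification here.

QUALITY_KEYWORDS = {
    "performance": ["latency", "response time", "throughput"],
    "security": ["encrypt", "authenticate", "authorization", "audit"],
    "usability": ["usable", "intuitive", "accessib", "learnable"],
    "reliability": ["available", "uptime", "mean time", "fault"],
}

def quality_attributes(requirement: str) -> list[str]:
    """Return the quality attributes whose keywords appear in the text."""
    text = requirement.lower()
    return [attr for attr, kws in QUALITY_KEYWORDS.items()
            if any(kw in text for kw in kws)]

def is_non_functional(requirement: str) -> bool:
    """Treat a requirement as non-functional if any quality keyword fires."""
    return bool(quality_attributes(requirement))
```

A baseline like this makes the LLM's added value measurable: any accuracy gain over the keyword rules reflects genuine contextual understanding rather than surface pattern matching.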

AI research and development progresses from initial requirements gathering through implementation and ultimately culminates in the reporting of findings.

Testing the Boundaries: Validation and Refinement

The PROMISE Dataset was utilized as a standardized benchmark to assess the performance of Large Language Models (LLMs) in the context of software requirements analysis. This dataset comprises a collection of documented requirements, each categorized and labeled according to established criteria, allowing for quantitative evaluation of LLM outputs. Specifically, PROMISE provides a ground truth against which the LLM’s ability to correctly classify, interpret, and validate requirements can be measured, facilitating a comparative analysis of different model architectures and prompting strategies. The dataset’s established use within the requirements engineering community ensures the results are comparable to existing research and provides a robust foundation for evaluating the practical applicability of LLMs in this domain.

Evaluation using the PROMISE Dataset revealed that Claude Sonnet 3.5 achieved 85% accuracy in classifying software requirements based on the established criteria defined by the International Council on Systems Engineering (INCOSE) for ‘good requirements’. This performance level is statistically comparable to the accuracy demonstrated by experienced systems engineers when performing the same classification task, suggesting a high degree of reliability and potential for practical application in requirements analysis workflows. The INCOSE criteria assessed include characteristics such as clarity, completeness, consistency, and verifiability of each requirement statement.
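To give a rough sense of what 'structural attributes' means in practice, the checks below are simplified stand-ins for INCOSE characteristics such as clarity and verifiability; they are a sketch for demonstration, not the evaluation criteria used in the study.

```python
# Illustrative structural checks inspired by INCOSE 'good requirement'
# characteristics (imperative form, vague terms, singularity). These rules
# are deliberate simplifications, not the study's rubric.

VAGUE_TERMS = ["user-friendly", "fast", "adequate", "as appropriate",
               "etc", "and/or", "flexible", "robust"]

def check_requirement(text: str) -> list[str]:
    """Return a list of structural issues found in one requirement."""
    issues = []
    lowered = text.lower()
    # Clarity: a requirement statement normally carries an imperative.
    if not any(m in lowered for m in ("shall", "must", "will")):
        issues.append("no imperative ('shall'/'must'/'will')")
    # Verifiability: vague adjectives cannot be tested.
    for term in VAGUE_TERMS:
        if term in lowered:
            issues.append(f"vague term: '{term}'")
    # Singularity: multiple conjunctions suggest a compound requirement.
    if lowered.count(" and ") + lowered.count(" or ") > 1:
        issues.append("possibly compound (multiple conjunctions)")
    return issues
```

Checks of this kind are exactly where automated assessment is strongest; judging whether a requirement is *complete* or *consistent* with the rest of the specification still needs human context.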

Evaluation using the PROMISE dataset revealed significant performance differences in functional requirement identification between LLMs. Llama 3.0 achieved a recall rate of 86.3% in correctly identifying functional requirements within the dataset. Conversely, Claude 3.5 demonstrated a substantially lower recall of 44.6% for the same task, representing the lowest functional requirement recall rate among the models tested. This indicates that Llama 3.0 was considerably more effective at comprehensively identifying all instances of functional requirements present in the benchmark data compared to Claude 3.5.

Performance consistency was evaluated by measuring the standard deviation of results across multiple evaluators. Claude Sonnet 3.5 exhibited a standard deviation of ±12.3% in its performance metrics, indicating a relatively narrow range of variation in its outputs. This value is lower than that observed for GPT-4 and Llama 3, suggesting that Claude provides more consistent results across different assessments and is less susceptible to evaluator bias or variability in interpretation of the evaluation criteria.
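The two metrics reported in these paragraphs, recall and cross-evaluator standard deviation, reduce to straightforward arithmetic. The toy numbers below are hypothetical and only illustrate the computation, not the study's data.

```python
# How the reported metrics are computed. The gold labels, predictions,
# and evaluator accuracies below are hypothetical toy data.
import statistics

def recall(predicted: list[str], gold: list[str], positive: str) -> float:
    """Fraction of gold `positive` items the model also labeled positive."""
    hits = sum(1 for p, g in zip(predicted, gold)
               if g == positive and p == positive)
    total = sum(1 for g in gold if g == positive)
    return hits / total if total else 0.0

gold = ["F", "F", "F", "NF", "NF"]
pred = ["F", "NF", "F", "NF", "NF"]
r = recall(pred, gold, positive="F")   # 2 of the 3 functional reqs found

# Consistency across evaluators: a lower standard deviation means the
# model's scores vary less between assessments.
evaluator_accuracy = [0.82, 0.85, 0.88]
spread = statistics.stdev(evaluator_accuracy)
```

Note that a high recall with a wide spread, or vice versa, tells different stories: Llama 3.0's strength in the benchmark was coverage, while Claude's was consistency.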

Evaluation utilizing the ‘Dr. Tools’ case study, a complex software requirements specification, demonstrated the practical application of this LLM-assisted analysis approach. The system successfully identified and categorized over 700 individual requirements within the ‘Dr. Tools’ documentation, mirroring the findings from the PROMISE Dataset evaluation. Specifically, the LLMs facilitated a 30% reduction in the time required for initial requirements classification compared to manual analysis performed by experienced systems engineers, and highlighted previously undocumented inconsistencies within the specification. This real-world implementation validated the scalability and potential cost savings associated with integrating LLMs into the software development lifecycle for requirements engineering tasks.

A Human-in-the-Loop (HITL) system was implemented to leverage the strengths of both AI and human expertise in requirements analysis. This involved routing AI-generated insights – such as requirement classifications and identified functional requirements – to experienced systems engineers for review and refinement. This process served to validate the AI’s outputs, correct any inaccuracies, and ensure adherence to established standards like INCOSE guidelines. The integration of human oversight maximized the overall quality of the analysis, minimized the potential for errors propagating through the system, and facilitated a more robust and reliable outcome compared to relying solely on automated AI processing.
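A minimal sketch of such a routing step follows, assuming a per-item confidence score and a hand-picked threshold; neither the record shape nor the threshold value comes from the source.

```python
# Minimal human-in-the-loop routing sketch: AI assessments below a
# confidence threshold go to an engineer's review queue. The Assessment
# fields and the 0.8 default are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class Assessment:
    requirement: str
    ai_label: str
    confidence: float

def route(assessments, threshold: float = 0.8):
    """Split AI assessments into auto-accepted vs. needs-human-review."""
    accepted, review = [], []
    for a in assessments:
        (accepted if a.confidence >= threshold else review).append(a)
    return accepted, review
```

The threshold is the key design lever: set it high and nearly everything reaches a human (safe but slow); set it low and the workflow approaches full automation, with the automation-bias risks the study warns about.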

The PROMISE dataset exhibits a non-uniform distribution of requirement types, indicating a potential bias in the data.

Systems Evolved: The Future of AI-Assisted Engineering

The incorporation of Large Language Models (LLMs) into requirements engineering marks a pivotal advancement in AI-Assisted Engineering, fundamentally reshaping how complex systems are conceived and built. Traditionally a labor-intensive process, defining, analyzing, and validating requirements now benefits from the LLM’s capacity to process vast amounts of textual data with remarkable speed and accuracy. These models can automatically identify ambiguities, inconsistencies, and missing information within requirements documents, ensuring a more robust foundation for system design. Beyond simple error detection, LLMs facilitate the generation of alternative requirement formulations, aiding engineers in exploring diverse design options and optimizing system performance. This integration doesn’t replace the expertise of systems engineers, but rather augments their capabilities, allowing them to concentrate on creative problem-solving and innovative design choices, ultimately leading to more efficient and reliable systems.

Large Language Models (LLMs) are poised to reshape systems engineering by assuming responsibility for traditionally time-consuming and repetitive tasks. These models don’t simply expedite existing workflows; they actively analyze complex requirements documentation, identify potential inconsistencies, and even generate preliminary design options. This automation liberates engineers from the burden of manual processing, allowing them to concentrate on the creative aspects of system design – exploring novel architectures, optimizing performance characteristics, and tackling the truly challenging problems that demand human ingenuity. Consequently, the integration of LLMs isn’t about replacing engineers, but rather about augmenting their capabilities and accelerating the pace of innovation in the development of increasingly sophisticated systems.

The successful integration of large language models into systems engineering isn’t simply about adopting new technology; it demands a commitment to established frameworks. Adherence to standards, such as those developed by the International Council on Systems Engineering (INCOSE), is paramount for ensuring that AI-driven processes are not only innovative but also reliable and compliant. These standards provide a common language and methodology, facilitating validation, verification, and traceability throughout the systems development lifecycle. By aligning AI tools with proven best practices, engineers can mitigate risks associated with automation, maintain system integrity, and demonstrate adherence to industry regulations – ultimately fostering trust and enabling wider adoption of these powerful technologies within safety-critical applications.

The potential of large language models extends beyond mere automation, offering a pathway to fundamentally reshape systems engineering workflows and deliver substantial improvements across key performance indicators. By streamlining processes – from requirements elicitation to verification and validation – these technologies are poised to compress development timelines and unlock significant cost savings. More critically, the consistent application of AI-driven analysis can minimize human error, enhance design optimization, and bolster the overall robustness of complex systems. This ultimately translates to increased reliability, reduced life-cycle costs, and a heightened capacity to meet stringent performance and safety standards, paving the way for more innovative and dependable technological solutions.

The study’s findings regarding LLMs as ‘copilots’ in requirements engineering resonate with a particular sentiment expressed by Alan Turing: “We can only see a short distance ahead, but we can see plenty there that needs to be done.” This isn’t about expecting artificial intelligence to replace the systems engineer, but rather to augment their capabilities, extending the reach of human insight. The research demonstrates that while LLMs can effectively analyze requirement quality, a human-in-the-loop approach remains crucial; essentially, a collaborative exploration of the system’s boundaries. Turing’s observation speaks to the iterative nature of problem-solving, a process this study highlights by showcasing how AI can accelerate analysis while still demanding human judgment to navigate ambiguity and ensure holistic system understanding.

What Breaks Down Next?

The demonstrated efficacy of Large Language Models as requirements engineering ‘copilots’ begs a predictable question: what happens when the illusion of competence falters? This research establishes a baseline – LLMs can assist – but deliberately avoids pushing them to full autonomy. The critical next step isn’t simply scaling up model size or training data, but actively inducing failure. A systematic investigation into the types of requirement flaws LLMs consistently miss – the edge cases, the subtly ambiguous phrasing, the context-dependent nuances – will reveal the limits of purely statistical pattern matching. Only by deliberately breaking the system can one truly understand its underlying vulnerabilities.

Furthermore, the ‘human-in-the-loop’ paradigm, while pragmatic, requires deeper scrutiny. What constitutes effective human oversight? Is it simply flagging LLM outputs for review, or does it necessitate a fundamental shift in how requirements engineers approach their work? The potential for automation bias – an uncritical acceptance of AI-generated suggestions – is significant. Research must therefore focus on developing interfaces and workflows that actively encourage skepticism and critical evaluation, rather than passively accepting AI assistance.

Ultimately, the true challenge lies not in replicating human intelligence, but in augmenting it. The goal shouldn’t be to create a system that replaces the requirements engineer, but one that forces them to become a more rigorous, more questioning, and ultimately, more effective analyst. The most fruitful path forward isn’t to refine the algorithm, but to redefine the role.


Original article: https://arxiv.org/pdf/2604.15222.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/


2026-04-17 18:48