Author: Denis Avetisyan
A novel approach uses generative AI to automatically build production code directly from test cases and natural language, potentially reshaping how software is created.
This review introduces Test-Oriented Programming, a paradigm leveraging generative AI and large language models for automated code generation and increased abstraction.
While current generative AI approaches for software development largely focus on assisting with or automating portions of code creation, a fundamentally higher level of abstraction remains unexplored. This paper, ‘Test-Oriented Programming: rethinking coding for the GenAI era’, proposes a novel paradigm, Test-Oriented Programming (TOP), wherein developers primarily validate test code derived from natural language specifications, delegating production code generation to large language models. This shifts the focus from how to solve a problem to what the solution should achieve, potentially streamlining development workflows. Could this approach unlock a new era of developer productivity and accessibility in software creation?
The Inevitable Decay of Assumptions: Reframing Software Origins
Historically, software projects frequently encounter difficulties stemming from poorly defined initial requirements and the inevitable shifts in those requirements as development progresses. This ambiguity often leads to a cycle of coding, testing, and subsequent rework, as implemented features fail to align with evolving user expectations or newly understood needs. The cost of this rework isn’t merely measured in developer time; it extends to delayed releases, frustrated stakeholders, and a potentially compromised final product. Consequently, projects can experience significant budget overruns and, in some instances, outright failure due to a fundamental mismatch between the delivered software and the problem it was intended to solve. This pervasive challenge has spurred exploration into alternative development methodologies focused on proactive clarification and verifiable specifications.
Test-Oriented Programming represents a fundamental shift in software construction, prioritizing the articulation of desired system behavior through executable tests before any implementation begins. This approach doesn’t simply verify code; it defines the system itself, with tests serving as both specification and validation criteria. By formulating concrete, verifiable outcomes upfront, developers effectively create a living document of system requirements, dramatically reducing ambiguity and the potential for misinterpretation. This pre-implementation testing isn’t about catching bugs; it’s about establishing a clear contract between the system’s intent and its eventual realization, fostering a development process driven by demonstrable functionality rather than assumed interpretations.
Test-Oriented Programming distinguishes itself by centering development around provable results, rather than abstract interpretations of user requirements. This approach posits that a system’s functionality is best defined not by what it should do, but by what it demonstrably does – as verified by a comprehensive suite of executable tests. By establishing these verifiable outcomes upfront, the methodology aims to mitigate the risks associated with ambiguous specifications and evolving needs, common pitfalls in traditional software development. The result is a system built on a foundation of certainty, where each component’s behavior is explicitly defined and validated, ultimately delivering a more robust and reliable product that genuinely addresses user expectations.
Test-Oriented Programming represents a significant evolution of Test-Driven Development, moving beyond simply verifying individual units of code to achieving complete system specification through executable tests. While Test-Driven Development often focuses on implementing functionality to pass pre-written tests, TOP emphasizes that these tests should comprehensively define the entire expected behavior of the system. This means crafting tests not just for happy paths, but for edge cases, error handling, and all conceivable interactions – essentially, writing a formal, executable specification. By prioritizing this complete test suite as the primary source of truth, TOP aims to drastically reduce ambiguity and rework, fostering a development process where implementation confidently aligns with demonstrably verified requirements. The result is a system whose functionality is not merely tested, but explicitly defined by its passing tests.
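As a minimal sketch of what such an executable specification might look like (the function and its behavior are hypothetical illustrations, not examples from the paper), a TOP-style test suite would pin down the happy path, edge cases, and error handling before any production code exists:

```python
# Hypothetical TOP-style specification: these tests fully define the
# desired behavior of a discount function and would be written (and
# agreed upon) before any implementation is generated.

def apply_discount(price, percent):
    """Stand-in for code a generator would have to produce to satisfy the tests."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)

def test_happy_path():
    assert apply_discount(100.0, 25) == 75.0

def test_edge_case_zero_discount():
    assert apply_discount(19.99, 0) == 19.99

def test_error_handling():
    try:
        apply_discount(10.0, 150)
    except ValueError:
        pass  # invalid input must be rejected, not silently accepted
    else:
        raise AssertionError("expected ValueError for invalid percent")

test_happy_path()
test_edge_case_zero_discount()
test_error_handling()
print("all specification tests passed")
```

The point of the sketch is that the three tests, taken together, are the specification: any implementation passing all of them is, by definition, correct with respect to the stated intent.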
Onion: A Seed for Test-Oriented Growth
Onion is a research tool created to demonstrate the feasibility of Test-Oriented Programming (TOP). Unlike tools that derive tests from an existing implementation, Onion aims to generate a complete system solely from a suite of executable acceptance tests. This approach positions tests not as verification steps, but as the primary source of system definition and the sole driver for code creation. The tool’s current implementation is a proof-of-concept, intended to validate the core principles of TOP and provide a platform for exploring related research questions rather than serving as a production-ready software development environment.
Onion employs a methodology where system requirements are formalized as executable tests written in a natural language format. These tests aren’t merely validation steps; they function as the primary input for an automated code generation process. The tool parses these natural language tests, interpreting them as specifications for desired system behavior, and subsequently generates Python code intended to satisfy those specified requirements. This approach shifts the development paradigm towards a test-driven methodology where the code is directly derived from, and validated against, the initial natural language specifications, reducing the potential for discrepancies between intended functionality and implemented code.
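To make the pipeline concrete, the following is a deliberately simplified sketch of the idea (the test syntax, pattern, and `add` function are illustrative assumptions, not Onion’s actual format): natural-language test lines are parsed into executable checks, which the generated code must satisfy.

```python
import re

# Toy natural-language tests; a real tool would accept far richer phrasing.
NL_TESTS = [
    "adding 2 and 3 gives 5",
    "adding 10 and -4 gives 6",
]

# Code the generator is assumed to have produced from the tests above.
def add(a, b):
    return a + b

# Parse each test line into operands and an expected result.
PATTERN = re.compile(r"adding (-?\d+) and (-?\d+) gives (-?\d+)")

def run_nl_tests(tests):
    results = []
    for line in tests:
        match = PATTERN.fullmatch(line)
        a, b, expected = (int(g) for g in match.groups())
        results.append(add(a, b) == expected)
    return results

print(run_nl_tests(NL_TESTS))  # -> [True, True]
```

Even in this toy form, the direction of derivation is the key point: the natural-language tests exist first, and the implementation is judged purely by whether it satisfies them.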
Onion’s implementation centers on Python as the target language for generated code, a deliberate choice to facilitate rapid prototyping and deployment. Utilizing Python allows for a streamlined development cycle, as the language’s dynamic typing and extensive libraries minimize boilerplate code and accelerate testing of systems defined by the acceptance test suite. This approach enables developers to quickly translate natural language specifications, embodied in the executable tests, into functional Python code, reducing time-to-market and simplifying iterative development processes. The selection of Python also allows for easy integration with existing Python-based infrastructure and tools.
Onion’s functionality is governed by configuration files, which serve as the primary input for the tool. These files, typically in a human-readable format such as YAML or JSON, detail project-specific information including the target system’s name, version control settings, and dependency declarations. Critically, they also contain the acceptance tests, written as executable specifications, that define the desired behavior of the system. System descriptions, outlining the components and their interactions, are also embedded within these configuration files, providing Onion with the necessary information to perform automated code generation based on the defined tests and specifications. The structure and content of these configuration files dictate the entire build process and resulting output.
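A hypothetical configuration sketch in YAML (field names are illustrative guesses, not Onion’s documented schema) might combine project metadata, a system description, and the acceptance tests in one file:

```yaml
# Illustrative Onion-style configuration; the schema is assumed, not documented.
project:
  name: inventory-service
  version: 0.1.0
target_language: python
system_description: >
  A service that tracks item quantities and rejects
  withdrawals that would drive stock below zero.
acceptance_tests:
  - "adding 5 units of 'widget' then removing 2 leaves 3 in stock"
  - "removing more units than available raises an error"
```

Keeping the tests and the system description in a single declarative file means the entire build is reproducible from one artifact, which is consistent with the article’s claim that these files dictate the whole generation process.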
Verifying the Machine: Code Generation Through the Lens of LLMs
Onion’s code generation capabilities were evaluated using two state-of-the-art Large Language Models (LLMs): GPT-4o-mini and Gemini 2.5-Flash. These models were selected for their advanced reasoning and code analysis abilities, allowing for an objective assessment of the generated code. GPT-4o-mini is a streamlined version of the GPT-4 architecture, known for its efficiency and performance, while Gemini 2.5-Flash represents Google’s latest advancements in LLM technology, optimized for speed and scalability. Utilizing both models provided a comprehensive benchmark for Onion’s performance across different LLM implementations.
Onion’s code generation was evaluated by asking each LLM to compare the generated code against the predefined test specifications, providing an objective assessment of functional correctness. Across all evaluation attempts, Onion achieved 100% successful code generation as verified by both models, indicating full adherence to the requirements specified in the test suite.
Assessment of generated code extended beyond simple pass/fail functionality tests to include evaluations of code quality attributes. These attributes encompassed metrics such as cyclomatic complexity, lines of code, and the presence of code smells. Efficiency was determined by benchmarking execution time and memory usage against established performance baselines. Adherence to coding standards was verified through automated linting tools configured with pre-defined style guides, ensuring consistency in formatting, naming conventions, and documentation practices. These multi-faceted evaluations provided a comprehensive understanding of the generated code’s overall suitability for production environments.
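One such quality gate can be sketched with the standard library alone: approximating cyclomatic complexity by counting branching nodes in the AST. This is a rough proxy, not the tooling the evaluation actually used (which is not specified beyond linters), and the threshold is illustrative.

```python
import ast

# Branching constructs that each add one to the complexity estimate.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.Try,
                ast.BoolOp, ast.ExceptHandler)

def approx_complexity(source: str) -> int:
    """Rough cyclomatic-complexity proxy: 1 + number of branch points."""
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, BRANCH_NODES)
                   for node in ast.walk(tree))

# A sample of (hypothetical) generated code to score.
GENERATED = """
def clamp(x, lo, hi):
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x
"""

score = approx_complexity(GENERATED)
print(score)  # two `if` nodes -> complexity 3
assert score <= 10, "generated code exceeds complexity budget"
```

Gates like this can run fully automatically after every generation attempt, which is what makes multi-faceted quality assessment practical at the scale LLM-driven workflows require.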
Comparative analysis of LLM-based code verification revealed differences in the degree of test code modification required prior to successful implementation validation. GPT-4o-mini necessitated test code adjustments in only 20% of attempts, indicating a higher degree of compatibility with the generated code. Conversely, Gemini 2.5-Flash required modifications in a larger proportion of attempts, with issues commonly stemming from the need to provide or correct supporting code elements such as necessary imports or execution dependencies. This suggests variance in the models’ ability to generate self-contained and readily verifiable code segments.
The Shifting Landscape: TOP in an Era of Generative AI
The landscape of software development is undergoing a swift and substantial shift, driven by the increasing prevalence of generative AI tools – most notably, coding assistants. These intelligent systems, powered by large language models, are no longer limited to simple auto-completion; they are capable of generating entire code blocks, suggesting optimal algorithms, and even translating between programming languages. This rapid advancement promises to dramatically increase developer productivity, allowing engineers to focus on higher-level design and problem-solving rather than repetitive coding tasks. However, this transformation also presents challenges, requiring a re-evaluation of traditional development methodologies to effectively integrate and validate AI-generated code, ensuring both functionality and reliability in the face of increasingly complex software systems.
Test-Oriented Programming (TOP) provides a robust framework for integrating generative AI into software development by prioritizing verification from the outset. Rather than simply assessing whether AI-generated code runs, TOP emphasizes confirming that the code demonstrably fulfills intended functionality as defined by pre-existing, rigorously tested specifications. This approach shifts the focus from syntactic correctness to semantic alignment, ensuring that generated code isn’t merely free of errors, but actively delivers the desired behavior. By building tests before code generation – whether human or AI-driven – TOP establishes a clear benchmark for success and enables automated validation of AI’s output. Consequently, developers can confidently leverage generative AI to accelerate development while maintaining a high degree of software quality and reliability, effectively transforming AI from a potential source of errors into a powerful tool for building trustworthy applications.
The integration of Vibe Coding and multi-agent systems with Test-Oriented Programming (TOP) promises a fundamentally altered software development process. Vibe Coding, focusing on capturing the intent behind code through expressive tests, provides a robust foundation for generative AI tools. These tools, acting as specialized agents, can then work collaboratively: one agent generating code, another verifying it against the ‘vibe’ tests, and a third refining based on feedback, creating a continuous loop of improvement. This multi-agent approach isn’t simply about automation; it’s about distributing expertise and accelerating the development cycle. By combining the rigor of TOP with the dynamic interaction of multiple AI agents, software projects can achieve higher quality, faster iteration, and a more adaptable response to evolving requirements, ultimately shifting development from a largely manual process to a sophisticated, AI-orchestrated collaboration.
Test-Oriented Programming (TOP) fundamentally reshapes software engineering in the age of Large Language Models (LLMs) by embedding assurances of utility directly into the development lifecycle. Traditionally, evaluating LLM-generated code relied heavily on post-hoc testing, often revealing functional gaps after significant implementation effort. TOP reverses this approach, beginning with the explicit definition of desired software behavior through executable tests before any code generation. These tests then serve as both a specification for the LLM and a validation mechanism for its output; generated code is automatically assessed against the predefined tests, ensuring alignment with requirements. This process isn’t merely about confirming correctness, but about building verifiable claims regarding the software’s capabilities – a crucial step towards trustworthy and reliable LLM-based software engineering, where functionality isn’t assumed, but demonstrably proven.
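The generate-and-validate loop described above can be sketched as follows. Everything here is a hedged stand-in: `fake_llm_generate` returns canned candidate code instead of calling a real model, and the palindrome task is an arbitrary illustration. The structural point is that the tests are fixed first and a candidate is accepted only if it passes all of them.

```python
# Fixed specification: (arguments, expected result) pairs defined before
# any generation happens.
SPEC_TESTS = [
    (("radar",), True),
    (("hello",), False),
    (("",), True),
]

# Canned "LLM outputs" standing in for real model calls.
CANDIDATES = [
    "def is_palindrome(s): return s == s[::-1][1:]",   # buggy attempt
    "def is_palindrome(s): return s == s[::-1]",       # correct attempt
]

def fake_llm_generate(attempt):
    """Placeholder for an LLM call; returns a candidate implementation."""
    return CANDIDATES[attempt]

def validate(source):
    """Execute a candidate and check it against every specification test."""
    namespace = {}
    exec(source, namespace)
    fn = namespace["is_palindrome"]
    return all(fn(*args) == expected for args, expected in SPEC_TESTS)

accepted = None
for attempt in range(len(CANDIDATES)):
    code = fake_llm_generate(attempt)
    if validate(code):
        accepted = code
        break

print(accepted is not None)  # -> True: the second candidate passes all tests
```

Because the buggy first candidate fails the specification, it is rejected automatically; no human needs to inspect it. This is the sense in which functionality is demonstrably proven rather than assumed.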
The pursuit of abstraction, central to Test-Oriented Programming, echoes a timeless principle of system design. The article posits that generating production code from tests, rather than the other way around, allows for a more resilient and adaptable architecture. This resonates with the spirit of enduring systems. As Paul Erdős once stated, “A mathematician knows a lot of things, but a good mathematician knows where to find them.” Similarly, TOP doesn’t seek to eliminate coding, but rather to locate the essential logic within the tests, using generative AI as a tool for efficient expression. The elegance lies not in the absence of complexity, but in its careful encapsulation, ensuring the system ages gracefully rather than crumbling under its own weight.
What Lies Ahead?
The proposal of Test-Oriented Programming, while logically sound as a response to the current generative AI landscape, merely shifts the inevitable decay point of any software system. The core challenge isn’t code generation itself – that is a matter of diminishing returns – but the enduring difficulty of specifying correct intent. A perfect generator operating on flawed requirements produces only flawlessly incorrect software. The abstraction promised by TOP is not a simplification, but a deferral of complexity, a smoothing of the surface over deeper, unresolved ambiguities.
Future work will undoubtedly focus on refining the natural language interfaces, attempting to bridge the gap between human intention and machine understanding. Yet, this pursuit risks becoming an exercise in asymptotic approximation. Perhaps a more fruitful line of inquiry lies not in better specifications, but in systems designed to gracefully degrade in the face of inevitable specification errors: systems that prioritize resilience over absolute correctness. Stability, after all, is frequently a temporary reprieve, not a permanent state.
The Onion Tool, as presented, is a necessary component, but ultimately a scaffolding. The true test of TOP won’t be its ability to generate code, but its capacity to manage the entropy inherent in any complex system. Time, as always, will reveal whether this is a genuine evolution, or simply a more elegant arrangement of the same fundamental limitations.
Original article: https://arxiv.org/pdf/2604.08102.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-04-11 16:54