Can AI Pass the Air Traffic Control Exam?

Author: Denis Avetisyan


A new assessment framework evaluates AI agents by measuring their performance against established human training standards for air traffic control.

Average error rates across the three summative exercises, used to evaluate agent performance in varied assessment scenarios, reveal aircraft-specific inconsistencies. Each data point represents an average over multiple simulations conducted by trainee air traffic controllers; one particularly notable example is detailed in Figure 10.

Researchers propose a competency-based evaluation using digital twins and human-in-the-loop testing to validate AI readiness for ATC automation.

Despite increasing automation in aviation, robust evaluation of AI agents for high-stakes roles like Air Traffic Control remains a significant challenge due to a frequent disconnect between academic benchmarks and real-world operational complexities. This paper, ‘Human-in-the-Loop Testing of AI Agents for Air Traffic Control with a Regulated Assessment Framework’, introduces a novel evaluation framework leveraging a regulator-certified training curriculum and expert human instructors to assess AI performance authentically. Our approach aligns machine capabilities with established human competency targets, offering a more domain-accurate measurement than traditional methods. Will this framework pave the way for seamless and trustworthy human-machine teaming in increasingly complex airspace environments?


The Uncompromising Demands of Airspace Management

Air traffic control represents a uniquely demanding field where even minor errors can have catastrophic consequences, establishing an uncompromising need for both precision and rapid response. For decades, the safety of air travel has fundamentally depended on the skill and judgment of highly trained human controllers, who expertly manage complex airspace with a blend of procedural knowledge and real-time adaptation. This reliance on human expertise isn’t simply a matter of technological limitation; it’s a recognition that unpredictable events, ambiguous data, and the need for nuanced decision-making often exceed the capabilities of rigid, pre-programmed systems. The profession requires continuous vigilance, the ability to synthesize information from multiple sources, and a deep understanding of aircraft dynamics, weather patterns, and emergency procedures – qualities that have historically proven difficult to replicate in automated solutions.

The relentless growth of air travel is pushing the limits of current air traffic control systems, creating an urgent need to investigate automated solutions. However, transitioning to greater automation proves remarkably difficult due to the inherent complexity and unpredictability of airspace. Unlike static, predictable environments, air traffic is characterized by constantly shifting variables – weather patterns, aircraft performance, unexpected route changes, and the nuanced decision-making of pilots. Existing automation, often relying on pre-programmed rules, struggles to adapt to these dynamic conditions, frequently requiring human intervention to resolve conflicts or manage unforeseen circumstances. This limitation highlights a critical challenge: automation must not only handle routine operations efficiently, but also possess the flexibility and intelligence to navigate the countless anomalies that define real-world flight operations, a feat that demands a significant leap beyond current technological capabilities.

Current automation within air traffic control predominantly relies on systems programmed with explicit, pre-defined rules. While effective in narrowly defined circumstances, these rule-based approaches demonstrate limited capacity to handle the unpredictable and nuanced situations inherent in live air traffic management. The complexity arises from the sheer number of variables – weather patterns, aircraft types, unforeseen mechanical issues, and constantly shifting flight plans – that demand real-time assessment and flexible responses. Consequently, these systems often require frequent human intervention to address anomalies or operate safely, hindering the potential for genuine autonomous operation. The inability to learn from past experiences, adapt to changing conditions, or generalize solutions beyond programmed scenarios represents a significant obstacle to achieving the full benefits of automation in this critical safety domain.

The pursuit of automated air traffic control demands a fundamental shift beyond conventional methods to truly overcome existing limitations and ensure unwavering safety. Current systems, often reliant on pre-defined rules, struggle with the unpredictable nature of airspace and the need for nuanced decision-making. A novel paradigm necessitates the integration of advanced technologies, such as machine learning and artificial intelligence, capable of not merely reacting to situations, but proactively anticipating and adapting to them. This involves developing algorithms that can learn from vast datasets of flight information, weather patterns, and unforeseen events, allowing for a level of flexibility and responsiveness previously unattainable. Crucially, such a system requires rigorous validation and verification processes, alongside robust fail-safe mechanisms, to build trust and guarantee the highest standards of operational safety in a highly critical domain.

NATS trains aspiring Air Traffic Control Officers (ATCOs) through a structured progression of licensing stages.

Standardized Evaluation Through Curriculum-Driven Assessment

Machine Basic Training (MBT) is an evaluation methodology designed to assess the performance of Air Traffic Control (ATC) automation agents using the curriculum established by the NATS Basic Course, which serves as the standard training program for human air traffic controllers. By adapting this existing, validated framework, MBT ensures a consistent and comparable assessment of both human and automated control systems. This approach focuses on evaluating agents against the core competencies required of human controllers, including skills in planning, controlling, and coordinating air traffic flow. The utilization of a standardized curriculum facilitates objective performance measurement and allows for a direct comparison of machine and human capabilities within the ATC system.

Machine Basic Training utilizes the established NATS Basic Course curriculum to provide a standardized evaluation framework for ATC automation agents. This approach ensures assessments mirror the competencies required of human controllers, specifically targeting core skills categorized as Planning, Controlling, and Coordination. By aligning agent evaluation with existing human training standards, MBT facilitates a direct comparison of performance across both groups, allowing for objective measurement of automation system capabilities against established benchmarks for safe and efficient air traffic management. The framework evaluates agents on identical competencies, fostering consistent and meaningful performance metrics.

The Machine Basic Training (MBT) framework employs BluebirdDT, a probabilistic digital twin, to generate realistic ATC scenarios and facilitate agent testing. This simulation environment incorporates trajectory prediction algorithms that model aircraft flight paths with a demonstrated error rate below 8%, where trajectory error is quantified as the percentage of aircraft that deviate beyond a 5 flight level vertical or 2.5 nautical mile lateral threshold during simulated operations. This level of fidelity ensures that agent performance is evaluated against dynamically generated, yet highly accurate, representations of ATC airspace conditions.
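The error definition above can be sketched directly in code. This is a minimal, hypothetical illustration of the metric as described, not an implementation from the paper; all function and variable names are invented.

```python
# Hypothetical sketch of the trajectory-error metric described above:
# an aircraft counts as an error if its simulated track deviates from
# the reference by more than 5 flight levels vertically or 2.5 nautical
# miles laterally. All names here are illustrative assumptions.

def trajectory_error_rate(tracks, max_fl_dev=5.0, max_nm_dev=2.5):
    """tracks: list of (vertical_dev_fl, lateral_dev_nm) worst-case
    deviations, one tuple per simulated aircraft. Returns the fraction
    of aircraft that breach either threshold."""
    errors = sum(
        1 for v_dev, l_dev in tracks
        if v_dev > max_fl_dev or l_dev > max_nm_dev
    )
    return errors / len(tracks)

# Example: 2 of 20 aircraft breach a threshold -> 10% error rate,
# comfortably above the reported sub-8% figure would require < 1.6 here.
sample = [(1.0, 0.4)] * 18 + [(6.0, 0.2), (2.0, 3.1)]
print(trajectory_error_rate(sample))  # 0.1
```

Under this reading, the reported sub-8% figure means fewer than 8 in 100 simulated aircraft ever cross either deviation threshold.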

Machine Basic Training (MBT) evaluation incorporates Inter-Rater Reliability (IRR) assessments to validate the consistency and objectivity of performance scoring. Across 19 simulated ATC scenarios, and with participation from a minimum of seven instructors, Spearman’s rho coefficient of 0.59 and Kendall’s W of 0.64 were achieved. These statistical measures indicate a moderate to substantial level of agreement among raters, demonstrating the robustness of the evaluation framework in minimizing subjective bias and ensuring reliable assessment of automation agent performance against established ATC competencies.
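The two agreement statistics cited above can be computed from their textbook definitions. The sketch below assumes untied rankings for simplicity, and the rating matrix is invented for illustration; it is not the study's data.

```python
# Hedged sketch of Spearman's rho and Kendall's W, computed from their
# standard definitions (no tied ranks assumed). The instructor scores
# below are invented examples, not values from the paper.

def rank(xs):
    """Rank values 1..n (assumes no ties)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0] * len(xs)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def spearman_rho(x, y):
    """Pairwise rank correlation: rho = 1 - 6*sum(d^2) / (n(n^2-1))."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def kendalls_w(ratings):
    """Concordance across m raters over n items:
    W = 12*S / (m^2 * (n^3 - n)), S = squared deviations of rank sums."""
    ranks = [rank(r) for r in ratings]
    m, n = len(ranks), len(ranks[0])
    sums = [sum(col) for col in zip(*ranks)]
    mean = sum(sums) / n
    s = sum((t - mean) ** 2 for t in sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))

scores = [
    [4, 2, 5, 3, 1],   # instructor A's scenario scores (illustrative)
    [5, 2, 4, 3, 1],   # instructor B
    [4, 1, 5, 3, 2],   # instructor C
]
print(round(kendalls_w(scores), 3))        # 0.911: strong concordance
print(spearman_rho(scores[0], scores[1]))  # 0.9: strong pairwise agreement
```

Values of rho = 0.59 and W = 0.64, as reported, sit in the moderate-to-substantial band of these scales, well above chance-level agreement.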

BluebirdDT’s trajectory predictor accurately emulates aircraft flight paths (red) as demonstrated by close alignment with real college data trajectories (blue) over 15 simulations.

Agent Prototyping and Comparative Performance Analysis

Two agents, designated Hawk and Falcon, were developed as prototypes to specifically assess the capabilities of the Machine Basic Training (MBT) framework. Hawk operates utilizing a rules-based artificial intelligence approach, while Falcon employs optimization-based algorithms. This dual implementation allows for a comparative analysis of performance characteristics achievable with differing agent architectures within the MBT environment. The agents were designed to be fully integrated with, and evaluated by, the MBT curriculum to determine the framework’s efficacy in testing and refining automated air traffic control (ATC) systems.

Both the Hawk and Falcon agents were subjected to a standardized training regimen defined by the Machine Basic Training (MBT) curriculum, which included exposure to a diverse set of simulated air traffic control scenarios. Following training, agent performance was quantitatively assessed using metrics aligned with established human air traffic controller performance indicators, such as conflict resolution time, trajectory efficiency, and adherence to separation standards. This facilitated a direct comparison of agent capabilities against a baseline of experienced human controller performance, allowing for a statistically relevant evaluation of the automation’s potential to meet or exceed human performance levels in specific operational contexts.
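One of the indicators above, adherence to separation standards, can be sketched as a simple pairwise check. The 5 NM lateral / 1000 ft vertical minima and all names below are assumptions for illustration, not values taken from the paper.

```python
# Illustrative sketch of a separation-adherence metric of the kind used
# to score the agents. The 5 NM / 1000 ft minima are assumed typical
# en-route values, not figures from the study.
from math import hypot

def separation_ok(a, b, min_lateral_nm=5.0, min_vertical_ft=1000.0):
    """a, b: (x_nm, y_nm, alt_ft) aircraft positions. A pair is safely
    separated if EITHER the lateral OR the vertical minimum holds."""
    lateral = hypot(a[0] - b[0], a[1] - b[1])
    vertical = abs(a[2] - b[2])
    return lateral >= min_lateral_nm or vertical >= min_vertical_ft

def adherence_rate(pairs):
    """pairs: list of (a, b) position tuples sampled during a run;
    returns the fraction of pairs that remain safely separated."""
    ok = sum(1 for a, b in pairs if separation_ok(a, b))
    return ok / len(pairs)

pairs = [
    ((0, 0, 35000), (8, 0, 35000)),  # 8 NM apart: separated laterally
    ((0, 0, 35000), (2, 0, 36000)),  # 1000 ft apart: separated vertically
    ((0, 0, 35000), (3, 0, 35500)),  # neither minimum holds: loss
]
print(adherence_rate(pairs))  # 2/3
```

A per-agent score of this form makes the Hawk-versus-Falcon comparison a like-for-like number rather than a qualitative judgment.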

Machine Basic Training (MBT) facilitated a granular assessment of both the Hawk and Falcon agents, moving beyond aggregate performance metrics to pinpoint specific operational strengths and weaknesses. This analysis revealed Hawk’s proficiency in predictable scenarios but identified limitations in handling unexpected deviations from established rules. Conversely, Falcon demonstrated adaptability but exhibited inconsistencies in achieving optimal solutions, requiring refinement of its optimization algorithms. The detailed data generated through MBT allowed for targeted improvements to each agent’s core logic and decision-making processes, informing subsequent iterations of development and addressing previously unidentified failure modes.

Evaluations conducted within the Machine Basic Training (MBT) framework yielded performance data supporting its utility for assessing and improving Air Traffic Control (ATC) automation agents. Specifically, the MBT environment facilitated the identification of agent strengths and weaknesses, allowing for targeted development efforts focused on safety and reliability. The framework’s capacity to provide quantifiable metrics on agent behavior under various simulated conditions demonstrates its potential as a standardized testing and refinement tool for ATC automation, moving beyond traditional, less rigorous evaluation methods. This data-driven approach is crucial for validating agent performance before deployment in live air traffic environments and ensuring adherence to critical safety standards.

Round 1 summative results demonstrate performance differences between Falcon and Hawk.

Towards Uncompromising Safety and Future Automation Capabilities

The implementation of the Machine Basic Training (MBT) framework represents a proactive strategy for mitigating the inherent safety risks associated with automating Air Traffic Control (ATC). This approach systematically verifies the behavior of automated ATC agents against a comprehensive suite of formally defined scenarios, ensuring adherence to stringent Civil Aviation Authority (CAA) regulations. Crucially, the MBT framework isn’t solely reliant on simulated testing; it incorporates Human-in-the-Loop Verification, where qualified air traffic controllers actively evaluate agent responses to complex and unanticipated situations. This collaborative process provides an essential layer of oversight, validating that the automated system not only functions as designed but also demonstrates the necessary judgment and adaptability expected of human controllers, thereby building confidence in the system’s safety and reliability before deployment in live airspace.

The successful implementation of automated air traffic control relies heavily on effectively incorporating the extensive knowledge possessed by human controllers. This is achieved through meticulous reward engineering and comprehensive data labeling processes. Reward engineering defines the parameters that incentivize the AI agent to prioritize actions mirroring established, safe ATC protocols, while detailed data labeling provides the agent with examples of correct responses to a vast array of scenarios. By carefully crafting these inputs, developers ensure the AI doesn’t simply optimize for task completion, but learns to operate as an experienced controller would, adhering to crucial safety margins and established procedures. This integration of expert knowledge is not merely about achieving functional automation; it’s about building trust and ensuring the system’s behavior aligns with decades of proven ATC practices, ultimately paving the way for wider acceptance and deployment.
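The reward-engineering idea described above can be illustrated with a toy shaping function: task rewards are small, while safety penalties dominate, so an agent that optimizes the reward is pushed toward controller-like conservatism. All weights, names, and thresholds below are invented for illustration.

```python
# A minimal, hypothetical sketch of reward engineering for an ATC agent:
# progress toward the task earns small rewards, while eroding separation
# margins incurs a dominant penalty. Every constant here is an assumed
# illustrative value, not one from the paper.

def atc_reward(separation_nm, on_route, resolved_conflict,
               min_sep_nm=5.0, safety_weight=100.0):
    reward = 0.0
    if separation_nm < min_sep_nm:
        # Dominant term: a loss of separation outweighs any task gain,
        # and grows with the severity of the breach.
        reward -= safety_weight * (min_sep_nm - separation_nm)
    if on_route:
        reward += 1.0    # small shaping term for efficient routing
    if resolved_conflict:
        reward += 10.0   # bonus for a procedurally correct resolution
    return reward

print(atc_reward(6.0, True, True))   # 11.0: safe and effective
print(atc_reward(3.0, True, True))   # -189.0: safety penalty dominates
```

The asymmetry is the point: no combination of task bonuses can compensate for a separation loss, mirroring how labeled expert data teaches the agent that safety margins are non-negotiable.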

The culmination of model-based testing and human-in-the-loop verification signifies a pivotal advancement in the pursuit of fully autonomous Air Traffic Control. This integrated system isn’t merely about replacing human controllers with algorithms; it’s about establishing a foundation for air traffic management that demonstrably improves upon current capabilities. By prioritizing safety through rigorous testing and expert knowledge integration, the framework promises to not only maintain, but elevate, the reliability of air travel. Consequently, increased efficiency becomes achievable through optimized flight paths and reduced congestion, while simultaneously expanding the capacity of airspace to accommodate growing demand – ultimately reshaping the future of aviation through intelligent automation.

Ongoing research aims to significantly enhance the capabilities of automated air traffic control by embedding sophisticated artificial intelligence techniques within the established Machine Basic Training (MBT) framework. Specifically, the integration of Deep Reinforcement Learning promises to refine agent decision-making through complex simulations and reward systems, enabling more nuanced and adaptive responses to dynamic airspace scenarios. Complementing this, the application of Multi-Agent Methods will allow for the development of collaborative agent networks, mirroring the interactions between human controllers and fostering more robust and efficient traffic management. This synergistic approach not only aims to improve the precision and reliability of automated systems but also to bolster their resilience against unforeseen events and increasingly complex operational demands, ultimately paving the way for truly autonomous and safe air traffic control solutions.

Requirements cascade from high-level Civil Aviation Authority (CAA) regulations down to fundamental machine-level specifications.

The presented framework establishes a rigorous evaluation process for AI agents in air traffic control, mirroring the competency-based assessment of human controllers. This approach acknowledges that demonstrable correctness, not merely functional operation, is paramount. As G.H. Hardy stated, “Mathematics may not prepare you for the harsh realities of life, but it will give you the tools to understand them.” The study’s focus on a regulated assessment, a formal and provable method for gauging AI performance, directly embodies this sentiment. By grounding the evaluation in existing training curricula and focusing on demonstrable competencies, the research strives for a solution that isn’t simply ‘working’ on tests, but is mathematically sound and demonstrably correct within the complex domain of ATC automation.

Beyond Simulation: Charting a Course for Verifiable Autonomy

The presented framework, while a necessary refinement of existing evaluation methodologies, merely addresses the symptoms of a deeper challenge. Current approaches, even those incorporating human-in-the-loop testing within a regulated curriculum, remain fundamentally empirical. Demonstrating that an AI agent performs adequately during trials is not equivalent to proving its correctness. A statistically significant result, however impressive, offers no guarantee against unforeseen failures in novel or edge-case scenarios. The field must shift from accumulating evidence of performance to constructing formal proofs of safety and optimality – a pursuit often dismissed as ‘academic,’ yet essential for truly autonomous systems.

Future work should prioritize the integration of formal verification techniques. Digital twins, as employed here, are valuable for scenario generation, but insufficient as a substitute for mathematically rigorous analysis. The challenge lies not in creating more realistic simulations, but in developing algorithms whose behavior can be proven to conform to specified safety constraints. Consider, for instance, the application of temporal logic to define and verify the agent’s response to dynamic airspace changes – a far more demanding, but ultimately more reliable, approach than relying on extensive testing.

The current emphasis on competency-based assessment, while laudable, risks perpetuating a pragmatic, rather than principled, approach. It is tempting to optimize for ‘acceptable’ performance, but such compromises introduce subtle errors that accumulate over time. The pursuit of perfect automation, though perhaps unattainable, remains the only intellectually honest goal. Anything less is simply applied approximation masquerading as intelligence.


Original article: https://arxiv.org/pdf/2601.04288.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-01-11 07:44