Author: Denis Avetisyan
A new benchmark environment, Stargazer, challenges artificial intelligence to move beyond statistical analysis and demonstrate genuine physical reasoning in the search for exoplanets.
![The Stargazer framework facilitates iterative agent development by generating tasks from both synthetic physics and real-world data, then evaluating submissions based on [latex]\Delta\Delta BIC[/latex], RMS error, matching criteria, and count statistics to provide per-criterion feedback.](https://arxiv.org/html/2604.15664v1/x2.png)
Stargazer provides a high-fidelity, scalable testbed for evaluating AI agents performing complex radial velocity analysis under realistic astrophysical constraints.
Despite advances in artificial intelligence, reliably translating statistical optimization into scientifically grounded reasoning remains a significant challenge. To address this, we introduce Stargazer: A Scalable Model-Fitting Benchmark Environment for AI Agents under Astrophysical Constraints, a novel platform designed to evaluate AI agents on the iterative, physics-informed task of exoplanet discovery via radial velocity analysis. Our results reveal that while current agents can achieve good statistical fits to data, they frequently fail to accurately recover underlying physical parameters, even with access to basic astrophysical skills. This raises a critical question: can we develop AI agents capable of not just fitting models to data, but truly understanding the underlying scientific principles?
Unveiling Hidden Worlds: The Challenge of Exoplanet Detection
The search for planets beyond our solar system – exoplanets – presents a formidable observational challenge. Unlike direct imaging, which is often overwhelmed by a star’s brightness, astronomers frequently rely on detecting subtle “wobbles” in a star’s motion caused by the gravitational pull of orbiting planets. This technique, known as Radial Velocity (RV) Spectroscopy, measures variations in the star’s light spectrum as it moves slightly towards and away from Earth. These shifts are incredibly small – often corresponding to the star drifting toward or away from us at no more than a human walking pace – necessitating extraordinarily precise instruments and years of observation to confirm a planetary signal. The amplitude of this wobble shrinks for less massive planets and wider orbits, making the detection of smaller, more distant exoplanets particularly difficult and highlighting the ingenuity required to unveil these hidden worlds.
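The scale of this signal can be made concrete with the standard radial-velocity semi-amplitude formula, [latex]K = (2\pi G/P)^{1/3}\, m_p \sin i \,/\, \big((M_\star + m_p)^{2/3}\sqrt{1-e^2}\big)[/latex]. The sketch below is illustrative only; the constants and the example system (Jupiter orbiting the Sun) are assumptions for demonstration, not values from the paper.

```python
import math

G = 6.674e-11      # gravitational constant, m^3 kg^-1 s^-2
M_SUN = 1.989e30   # solar mass, kg
M_JUP = 1.898e27   # Jupiter mass, kg

def rv_semi_amplitude(m_planet, m_star, period_s, ecc=0.0, sin_i=1.0):
    """Stellar reflex semi-amplitude K in m/s for a single planet (masses in kg)."""
    return ((2.0 * math.pi * G / period_s) ** (1.0 / 3.0)
            * m_planet * sin_i
            / ((m_star + m_planet) ** (2.0 / 3.0) * math.sqrt(1.0 - ecc ** 2)))

# Jupiter orbiting the Sun (P ~ 11.86 yr) tugs the Sun by only about 12 m/s;
# an Earth analogue induces a wobble of roughly 0.09 m/s.
k_jup = rv_semi_amplitude(M_JUP, M_SUN, 11.86 * 365.25 * 86400.0)
```

Running this gives a semi-amplitude near 12.5 m/s for the Jupiter case, which is why detecting Earth-mass planets demands instruments an order of magnitude more precise still.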
The search for planets orbiting distant stars, while increasingly successful, is hampered by the demanding nature of radial velocity (RV) analysis. This technique, which detects planets by measuring the wobble they induce in their host star, traditionally relies on computationally expensive methods. Keplerian fitting, used to model the orbital parameters of a potential planet, and periodogram analysis, which searches for repeating patterns in the stellar wobble, both require significant processing power and time. As the volume of data from exoplanet surveys continues to grow, these traditional methods create a substantial bottleneck, limiting the speed at which new planetary candidates can be identified and confirmed. Researchers are actively exploring innovative algorithms and computational techniques to streamline these analyses, aiming to accelerate the discovery of worlds beyond our solar system and unlock the secrets they hold.
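A periodogram search of the kind described above can be sketched with a Lomb-Scargle analysis of irregularly sampled data. This is a generic illustration using `scipy.signal.lombscargle`, not the paper’s pipeline; the injected signal parameters are invented toy values.

```python
import numpy as np
from scipy.signal import lombscargle

rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 200.0, 80))    # irregular observation epochs (days)
true_period = 14.0                          # toy injected period (days)
rv = 30.0 * np.sin(2 * np.pi * t / true_period) + rng.normal(0.0, 3.0, t.size)

# Scan trial periods; lombscargle expects angular frequencies.
periods = np.linspace(2.0, 50.0, 5000)
power = lombscargle(t, rv - rv.mean(), 2 * np.pi / periods)
best_period = periods[np.argmax(power)]     # peak should land near 14 days
```

The periodogram peak recovers the injected period despite the uneven sampling, which is precisely why Lomb-Scargle-style methods dominate RV searches even though scanning dense period grids is computationally expensive at survey scale.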

Stargazer: An Automated System for Exoplanet Discovery
Stargazer is a novel testbed designed to facilitate end-to-end reasoning for the discovery of exoplanets. The system integrates Large Language Model (LLM) Agents to automate the complex pipeline typically performed by astronomers. This includes data acquisition, processing of radial velocity (RV) measurements, signal detection, and the subsequent characterization of planetary orbits. By leveraging LLM Agents, Stargazer aims to replicate and potentially surpass the analytical capabilities of human researchers in identifying exoplanetary candidates from astronomical data, offering a platform for autonomous exoplanet discovery and analysis.
Stargazer employs simulated environments and synthetic data to facilitate agent training and validation prior to deployment on actual observational datasets. This approach addresses the limitations of real-world astronomical data, which can be sparse, noisy, and expensive to acquire. Synthetic datasets allow for controlled experimentation and the generation of labeled examples necessary for supervised learning techniques. The simulation environment replicates the characteristics of radial velocity (RV) data, including instrument noise and stellar activity, enabling agents to learn robust feature extraction and signal processing capabilities. Performance is evaluated within the simulated environment using quantifiable metrics, ensuring agents meet pre-defined criteria before being applied to real-world RV observations, thereby minimizing the risk of erroneous detections and maximizing the efficiency of exoplanet searches.
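A minimal version of such a synthetic generator might look like the following. The planet, activity, and noise parameters here are invented for illustration and do not reproduce Stargazer’s actual task generator.

```python
import numpy as np

def synthetic_rv_dataset(n_obs=60, baseline_days=300.0, seed=42):
    """Toy labeled dataset: one circular planet plus a slow stellar-activity
    drift and heteroscedastic instrument noise (ground truth returned)."""
    rng = np.random.default_rng(seed)
    t = np.sort(rng.uniform(0.0, baseline_days, n_obs))
    truth = {"period": 23.0, "K": 8.0, "phase": 1.3}    # labels for supervision
    planet = truth["K"] * np.sin(2 * np.pi * t / truth["period"] + truth["phase"])
    activity = 2.5 * np.sin(2 * np.pi * t / 90.0)       # rotation-like modulation
    sigma = rng.uniform(1.0, 2.0, n_obs)                # per-point error bars
    rv = planet + activity + rng.normal(0.0, sigma)
    return t, rv, sigma, truth

t, rv, sigma, truth = synthetic_rv_dataset()
```

Because the ground-truth parameters are returned alongside the noisy observations, an agent’s recovered period and amplitude can be scored directly against them, which is impossible with real survey data.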
Stargazer’s workflow execution is structured to autonomously analyze radial velocity (RV) data from astronomical observations. This process begins with data ingestion and preprocessing, followed by the application of algorithms to identify potential planetary signals – specifically, periodic variations in stellar velocity. Once candidate signals are detected, the system automatically performs Keplerian fitting to determine orbital parameters such as period, eccentricity, and semi-major axis. The system is designed to iteratively refine these parameters through statistical analysis and validation, ultimately producing a characterized set of planetary candidates with associated uncertainties. This autonomous execution minimizes manual intervention and enables efficient processing of large RV datasets.
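The Keplerian fitting step rests on the standard single-planet RV model, [latex]v(t) = K[\cos(\omega + \nu) + e\cos\omega] + \gamma[/latex], where the true anomaly [latex]\nu[/latex] follows from solving Kepler’s equation [latex]M = E - e\sin E[/latex]. A self-contained sketch of that model (not the paper’s implementation):

```python
import numpy as np

def kepler_solve(M, e, tol=1e-10, max_iter=50):
    """Eccentric anomaly E from mean anomaly M by Newton iteration."""
    E = np.array(M, dtype=float, copy=True)
    for _ in range(max_iter):
        dE = (E - e * np.sin(E) - M) / (1.0 - e * np.cos(E))
        E -= dE
        if np.max(np.abs(dE)) < tol:
            break
    return E

def keplerian_rv(t, period, K, e, omega, t_peri, gamma=0.0):
    """Single-planet stellar RV: K * (cos(omega + nu) + e*cos(omega)) + gamma."""
    M = np.mod(2 * np.pi * (t - t_peri) / period, 2 * np.pi)
    E = kepler_solve(M, e)
    # True anomaly from eccentric anomaly via the half-angle identity.
    nu = 2 * np.arctan2(np.sqrt(1 + e) * np.sin(E / 2),
                        np.sqrt(1 - e) * np.cos(E / 2))
    return K * (np.cos(omega + nu) + e * np.cos(omega)) + gamma

t_grid = np.linspace(0.0, 30.0, 200)
model = keplerian_rv(t_grid, period=10.0, K=5.0, e=0.3, omega=0.7, t_peri=2.0)
```

With [latex]e = 0[/latex] the model reduces to a pure sinusoid, which is the sanity check most fitting pipelines apply before exploring eccentric solutions.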

Deciphering Complex Systems: Multi-Planet Dynamics and Resonance
Stargazer utilizes advanced algorithms to process astrometric and radial velocity data from multi-planet systems, enabling the detection of minute gravitational perturbations. These perturbations, often below the threshold of traditional analytical methods, are caused by planet-planet interactions and can reveal information about planetary masses, eccentricities, and orbital inclinations. By modeling the combined gravitational forces, Stargazer can reconstruct the dynamical history of the system and identify potential instabilities or resonant configurations. The system’s ability to discern these subtle effects is crucial for understanding the long-term evolution and stability of planetary architectures, particularly in systems with closely packed planets or significant eccentricity.
Orbital resonance occurs in multi-planet systems when planets exhibit orbital periods with simple integer ratios, such as 2:1 or 3:2. In a 2:1 resonance the inner planet completes exactly two orbits for every one orbit of the outer planet; in a 3:2 resonance, three orbits for every two. These ratios are not coincidental; they indicate a stable configuration maintained by gravitational interactions, preventing close encounters and long-term orbital instability. The presence of resonance provides strong constraints on models of planetary formation, suggesting that planets often migrate and settle into these configurations. Identifying these resonant chains – where multiple planets are locked in such relationships – allows astronomers to infer the system’s evolutionary history and assess its long-term dynamical stability. The relationship is defined by [latex]T_{1}/T_{2} = n/m[/latex], where [latex]T[/latex] represents the orbital period, and [latex]n[/latex] and [latex]m[/latex] are small integers.
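Under the period-ratio definition above, a simple commensurability check can flag candidate resonances. The tolerance and the example periods below are arbitrary choices for illustration:

```python
def nearest_resonance(p_inner, p_outer, max_int=5, tol=0.02):
    """Return (n, m) if p_outer/p_inner lies within fractional tolerance
    `tol` of a small-integer ratio n/m (n >= m), else None."""
    ratio = p_outer / p_inner
    best = None
    for m in range(1, max_int + 1):
        for n in range(m, max_int + 1):
            err = abs(ratio - n / m) / (n / m)
            if err < tol and (best is None or err < best[0]):
                best = (err, n, m)
    return None if best is None else (best[1], best[2])

# A pair of toy periods (days) near the 3:2 commensurability:
print(nearest_resonance(13.84, 20.86))   # (3, 2)
```

Real resonance confirmation requires dynamical modeling of the resonant argument, not just a period-ratio match, but a check like this is how candidate chains are typically shortlisted.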
Performance evaluations of the multi-planet system analysis agents demonstrate a substantial discrepancy between capability on simplified tasks and complex scenarios. Agents successfully completed approximately 70% of easier tasks, indicating proficiency in basic statistical fitting of orbital data. However, when presented with more challenging datasets requiring nuanced understanding of gravitational interactions, the pass rate decreased dramatically to under 6%. This significant performance drop suggests a limitation in the agents’ ability to move beyond purely statistical analysis and engage in the type of physical reasoning necessary to accurately model and predict the behavior of complex multi-planet systems.
![Statistical metrics [latex]\Delta BIC[/latex] and RMS (blue) maintain high pass rates across difficulty tiers, while physical criteria – Match Score and Planet Count (red) – exhibit a significant performance decline as difficulty increases.](https://arxiv.org/html/2604.15664v1/x4.png)
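The [latex]\Delta BIC[/latex] criterion shown above compares competing models by penalized goodness of fit: for Gaussian errors, [latex]BIC = k\ln N + \chi^2[/latex] up to an additive constant. The following sketch illustrates the idea on toy data; it is not Stargazer’s exact scoring code.

```python
import numpy as np

def bic(residuals, sigma, n_params):
    """BIC = k*ln(N) + chi^2, for Gaussian errors (constant terms dropped)."""
    chi2 = np.sum((residuals / sigma) ** 2)
    return n_params * np.log(residuals.size) + chi2

rng = np.random.default_rng(1)
t = np.sort(rng.uniform(0.0, 100.0, 50))
sigma = np.full(50, 2.0)
rv = 10.0 * np.sin(2 * np.pi * t / 7.0) + rng.normal(0.0, sigma)

# Flat (no-planet) model vs. the injected sinusoid: a strongly negative
# Delta BIC favours keeping the planet despite its extra parameters.
bic_flat = bic(rv - rv.mean(), sigma, n_params=1)
bic_planet = bic(rv - 10.0 * np.sin(2 * np.pi * t / 7.0), sigma, n_params=3)
delta_bic = bic_planet - bic_flat
```

The figure’s central point is that agents can drive this statistical score down while still reporting the wrong number of planets, which is what the red Match Score and Planet Count curves capture.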
Amplifying Agent Performance: The Power of Foundational Skills
The incorporation of ‘bootstrapped skills’ into large language model (LLM) agents represents a substantial advancement in artificial intelligence capabilities. This technique involves equipping the agent with a foundational set of pre-trained abilities – essentially, a toolkit of problem-solving strategies – before tackling more complex challenges. Rather than learning entirely from scratch, the agent leverages these existing skills to accelerate the learning process and achieve markedly improved performance. This approach has been shown to not only increase the speed at which the agent masters new tasks, but also to enhance the overall quality of its solutions, paving the way for more efficient and robust automated systems capable of handling increasingly intricate problems.
Efficient token usage proves paramount when deploying large language model agents for complex data analysis. Studies reveal a substantial disparity in computational cost between successful and unsuccessful analytical runs: effective processing requires approximately 68,000 tokens, roughly a tenth of the 730,000 tokens consumed during failed attempts. This optimization isn’t merely about cost reduction – it directly enables the scalable analysis of expansive datasets that would otherwise be computationally prohibitive. By minimizing token expenditure, researchers can unlock the potential for broader, more comprehensive investigations across diverse scientific domains, fostering insights previously inaccessible due to resource limitations.
Even with statistically sound calculations, current agent performance on complex tasks – specifically, identifying potential exoplanets – remains remarkably low, achieving a mere 5% ‘Match Score’. This suggests that simply increasing computational power or data analysis isn’t sufficient for breakthroughs in fields demanding robust physical reasoning. The limitation isn’t processing speed, but a fundamental gap in the agent’s ability to understand and apply physical principles to the data. Overcoming this hurdle promises to unlock genuinely automated exoplanet discovery, shifting the paradigm from human-guided analysis to scalable, self-directed exploration of astronomical datasets and potentially revolutionizing the search for life beyond Earth.
The pursuit of scalable model-fitting, as demonstrated by Stargazer, demands a relentless simplification of process. The benchmark environment intentionally presents a complex astrophysical workflow, yet the core challenge lies in distilling that complexity into manageable, logical steps for AI agents. As Robert Tarjan once stated, “The most effective programs are always the shortest.” This resonates with the paper’s implicit argument: current AI models often excel at statistical fitting but falter when confronted with the need for physical reasoning – a deficiency stemming from unnecessarily convoluted approaches. A system that requires extensive parameters and intricate calculations has already conceded a portion of its potential elegance. The true test isn’t the quantity of data processed, but the purity of the resulting insight.
Further Horizons
The Stargazer environment, by design, does not offer solutions. It merely clarifies the nature of the difficulty. Current large language models demonstrate proficiency in the application of statistical methods, but exhibit a notable failure to integrate those methods with underlying physical principles. This is not a failure of scale, but a failure of representation. The models manipulate symbols; they do not, as yet, meaningfully understand the systems those symbols describe.
Future work must therefore prioritize the development of agentic systems capable of constructing and validating internal models of physical phenomena. Simply increasing the volume of training data will not suffice. The emphasis should shift towards architectures that facilitate causal reasoning, and towards benchmarks that actively penalize solutions reliant solely on correlative patterns. The true test lies not in discovering that a signal exists, but in articulating why it exists.
Ultimately, the limitations revealed by Stargazer are not unique to exoplanet discovery. They reflect a broader challenge in artificial intelligence: the transition from pattern recognition to genuine comprehension. The pursuit of artificial intelligence is, at its core, a pursuit of understanding. And understanding, it appears, requires more than just data.
Original article: https://arxiv.org/pdf/2604.15664.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/