From Physics Papers to Executable Code: An AI-Powered Leap for HEP Analysis

Author: Denis Avetisyan


Researchers are exploring how large language models can automatically translate the methods described in high-energy physics publications into functional code.

A two-stage workflow iteratively distills selection criteria from a target paper and its references, then leverages those criteria to sequentially generate, execute, and validate analysis code, ultimately achieving successful reproduction through a process of automated refinement.

This work details a proof-of-concept system for generating code from HEP papers, aiming to improve reproducibility and workflow automation with a human-in-the-loop approach using open-weight models.

Ensuring the reproducibility of results remains a significant challenge in high-energy physics, despite increasing emphasis on open data and analysis preservation. This paper details the development of an LLM-based system for automatic code generation from HEP publications, presenting a proof-of-concept workflow that extracts analysis procedures from published papers and translates them into executable code. Initial evaluations, benchmarked against the ATLAS [latex]H \to ZZ^{*} \to 4\ell[/latex] analysis using Open Data, demonstrate that recent open-weight large language models can recover documented selection criteria and, in some instances, generate event selections matching a baseline implementation. While stochasticity and execution failures currently limit full autonomy, these findings suggest that LLMs hold promise as valuable human-in-the-loop tools – but can we fully automate the complex task of HEP data analysis with these models?


Bridging the Analytical Gap: Automating High-Energy Physics

High-energy physics (HEP) traditionally progresses through a painstaking process of data analysis, where physicists manually translate experimental data into meaningful insights. This workflow demands substantial code development – crafting algorithms to filter, reconstruct, and interpret particle interactions – followed by rigorous validation to ensure accuracy. However, this manual approach increasingly creates bottlenecks as experiments generate ever-larger datasets and more complex analyses are required. The time-consuming nature of this process limits the speed at which new discoveries can be made, hindering the exploration of fundamental physics. Consequently, researchers are actively seeking automated solutions to streamline data analysis, reduce human error, and accelerate the pace of scientific advancement in the field.

Modern high-energy physics experiments are generating datasets of unprecedented size and intricacy, pushing the limits of traditional data analysis techniques. Consequently, researchers are actively developing automated workflows capable of translating the complex procedures detailed in scientific publications directly into executable code. This shift isn’t merely about speed; it addresses a fundamental challenge posed by the sheer volume of information and the nuanced statistical methods employed. These automated systems aim to parse published analyses – including selections, kinematic constraints, and systematic uncertainty evaluations – and reconstruct them as functional analysis pipelines. Successfully implementing such automation promises to significantly accelerate the pace of discovery by reducing the reliance on manual coding and validation, minimizing errors, and enabling rapid reprocessing of data as new theoretical understandings emerge.

From Publication to Pipeline: An LLM-Driven Workflow

The analysis workflow is structured as a sequential two-stage pipeline. The initial stage focuses on the automated extraction of structured selection criteria directly from High Energy Physics (HEP) publications. These criteria, which define specific parameters for data analysis, are parsed and converted into a machine-readable format. Subsequently, the second stage utilizes these extracted criteria as input to generate executable code, typically in Python, designed to perform the defined analysis on relevant datasets. This modular approach allows for automation of the analysis process, reducing manual intervention and enabling efficient exploration of large volumes of HEP data.
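The two-stage flow can be sketched in miniature. Everything below (the `SelectionCut` schema, the toy line parser, the generated predicate) is an illustrative stand-in for the paper's LLM-driven stages, not the actual implementation:

```python
from dataclasses import dataclass

@dataclass
class SelectionCut:
    """Hypothetical machine-readable cut record (variable, operator, threshold)."""
    variable: str
    operator: str
    threshold: float

def extract_criteria(paper_text: str) -> list[SelectionCut]:
    """Stage 1 stand-in: the real workflow uses an LLM to parse the paper;
    here we only scan for lines shaped like 'pT > 20' to show the output schema."""
    cuts = []
    for line in paper_text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[1] in {">", "<", ">=", "<="}:
            cuts.append(SelectionCut(parts[0], parts[1], float(parts[2])))
    return cuts

def generate_code(cuts: list[SelectionCut]) -> str:
    """Stage 2 stand-in: render the structured cuts as an executable predicate."""
    conds = [f"event['{c.variable}'] {c.operator} {c.threshold}" for c in cuts]
    body = " and ".join(conds) if conds else "True"
    return f"def passes_selection(event):\n    return {body}\n"

paper = "pT > 20\neta < 2.5"
namespace = {}
exec(generate_code(extract_criteria(paper)), namespace)
print(namespace["passes_selection"]({"pT": 35.0, "eta": 1.1}))  # True
```

The point of the two-stage split is visible even in this toy: the structured intermediate representation can be inspected and validated by a human before any code is generated from it.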

The workflow automation is achieved through the integration of Open-Weight Large Language Models (LLMs) with orchestration frameworks. Specifically, LangChain and LangGraph are utilized to construct and manage LLM-based pipelines. LangChain facilitates the modular construction of chains, enabling the connection of various LLM components and data sources. LangGraph builds upon this by adding graph-based execution capabilities, allowing for more complex and dynamic analysis workflows. This combination permits the automated extraction of information and the subsequent processing of that data according to defined criteria, reducing the need for manual intervention in the analysis process.
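The generate–execute–validate loop that such a graph encodes can be illustrated with a plain-Python state machine. This only mimics the control flow of a LangGraph-style pipeline; it does not use the library's actual API, and the callback names are assumptions:

```python
def run_refinement_loop(generate, execute, validate, max_iters=5):
    """Repeatedly generate code, run it, and validate the result,
    feeding failures back into the next generation attempt — the
    'automated refinement' loop described in the workflow."""
    feedback = None
    for attempt in range(1, max_iters + 1):
        code = generate(feedback)          # LLM call in the real system
        ok, result = execute(code)         # sandboxed execution
        if ok and validate(result):        # compare against expectations
            return {"status": "success", "attempt": attempt, "code": code}
        feedback = result                  # error message or mismatch report
    return {"status": "failed", "attempt": max_iters, "code": None}
```

In a graph framework the three callbacks become nodes, and the `feedback` edge becomes a conditional transition back to the generation node.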

The initial stage of the workflow, extracting selection criteria from High Energy Physics (HEP) publications, relies on converting Portable Document Format (PDF) files into machine-readable text. This conversion process represents a significant computational bottleneck due to the varied formatting and complex layouts commonly found in scientific literature. Factors contributing to this limitation include the need for Optical Character Recognition (OCR) to process scanned documents, handling of equations and special characters, and the computational cost of maintaining document structure during the conversion. Performance is further impacted by document size and image content, necessitating optimized PDF parsing techniques and potentially pre-processing steps to reduce computational load.
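One such pre-processing step, splitting the converted text into overlapping chunks so that each LLM call stays within its context window, might look like the following sketch; the sizes are illustrative, and this corresponds to the "Chunk" setting (with "Bulk" being a single call over the whole text):

```python
def chunk_text(text: str, max_chars: int = 4000, overlap: int = 200) -> list[str]:
    """Split extracted paper text into overlapping windows. The overlap
    reduces the chance that a selection cut straddling a boundary is
    lost by every chunk."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back so adjacent chunks share context
    return chunks
```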

The implementation of the LLM workflow pipeline utilizes vLLM, a fast and easy-to-use library for LLM inference and serving. vLLM employs PagedAttention, which stores the attention key-value (KV) cache in fixed-size blocks rather than one contiguous buffer per sequence, much as an operating system pages virtual memory. Because blocks are allocated on demand and can be shared across requests, memory fragmentation is nearly eliminated, increasing throughput and enabling the processing of longer sequences. Furthermore, vLLM supports continuous batching of incoming requests to maximize GPU utilization and minimize latency, making it well-suited for production environments and high-volume data analysis tasks within the HEP context.
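The bookkeeping idea behind paged KV-cache management can be caricatured in a few lines. This toy block table is purely conceptual and bears no relation to vLLM's actual internals:

```python
class PagedKVCache:
    """Toy illustration: each sequence maps logical cache positions to
    physical blocks allocated lazily, so memory is reserved per block
    actually used, not per maximum sequence length."""

    def __init__(self, num_physical_blocks: int, block_size: int):
        self.block_size = block_size
        self.free = list(range(num_physical_blocks))   # free physical blocks
        self.block_tables = {}                         # seq_id -> [block ids]

    def append_token(self, seq_id: int, position: int) -> int:
        """Return the physical block holding this token's KV entry,
        allocating a new block only when a block boundary is crossed."""
        table = self.block_tables.setdefault(seq_id, [])
        if position // self.block_size >= len(table):
            table.append(self.free.pop())  # lazy allocation
        return table[position // self.block_size]
```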

Validating Analytical Integrity: Reproducibility and Mitigation

The LLM workflow’s performance was benchmarked against the ATLAS [latex]H \to ZZ^{*} \to 4\ell[/latex] analysis, a well-defined particle physics analysis with documented selection criteria and expected outcomes. This analysis served as a rigorous test case to assess the workflow’s ability to accurately interpret and reproduce a complex scientific process. Successful reproduction of the ATLAS analysis results demonstrates the workflow’s foundational reliability and provides a quantifiable metric for evaluating subsequent improvements and modifications. The selection of this specific analysis was motivated by its established documentation and the availability of ground truth data for comparison, enabling objective validation of the LLM’s output.

Large Language Models (LLMs) exhibit inherent stochasticity, meaning their outputs are not deterministic and vary even with identical inputs. This characteristic introduces a challenge for scientific applications requiring consistent and verifiable results. Furthermore, LLMs are susceptible to generating “hallucinations”: outputs that appear plausible but are factually incorrect or lack grounding in the provided data. These hallucinations stem from the models’ training process, where they learn to predict the most probable sequence of tokens rather than strictly adhering to truthfulness. Mitigating both stochasticity and hallucinations requires strategies such as employing techniques to control output randomness, implementing robust validation procedures, and integrating human oversight to verify the factual correctness of generated information before its use in critical analyses.
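One simple mitigation sketch, not taken from the paper: run the extraction several times and keep only the cuts supported by a majority of runs, discarding rare outputs that are likely hallucinations:

```python
from collections import Counter

def majority_vote(extractions: list[frozenset]) -> frozenset:
    """Aggregate several stochastic extraction runs: a cut survives only
    if it appears in more than half of the runs, so one-off hallucinated
    cuts are filtered out."""
    counts = Counter()
    for cuts in extractions:
        counts.update(cuts)
    quorum = len(extractions) / 2
    return frozenset(cut for cut, n in counts.items() if n > quorum)
```

A human reviewer can then be shown the discarded minority cuts explicitly, which fits the human-in-the-loop framing of the next paragraph.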

The Large Language Model (LLM) workflow operates within a Human-in-the-Loop (HITL) framework, prioritizing human oversight and collaboration throughout the analysis process. This design choice deliberately avoids full automation, recognizing the current limitations of LLMs in maintaining factual accuracy and avoiding potentially erroneous outputs. Human experts are integrated to verify the LLM’s extracted criteria, validate the logical consistency of the analysis chain, and resolve ambiguities. This collaborative approach ensures the reliability of the results and allows for iterative refinement of the LLM’s performance, leveraging human expertise to correct errors and improve the overall accuracy of the [latex]H \to ZZ^{*} \to 4\ell[/latex] analysis reproduction.

The LLM-based workflow accurately extracted all 27 documented selection criteria, often referred to as “cuts”, from the ATLAS [latex]H \to ZZ^{*} \to 4\ell[/latex] analysis documentation. These cuts define the parameters used to isolate signal events from background noise, and their successful extraction confirms the workflow’s ability to process and interpret complex, multi-layered analysis requirements. The identified criteria encompass a range of variables, including lepton momenta, invariant masses, and detector-specific conditions, demonstrating the workflow’s capability to handle both kinematic and detector-related constraints within a high-energy physics analysis.

Structured Selection Criteria (SSC) represent the extracted analysis requirements in a standardized, machine-readable format, specifically utilizing a defined schema to detail each selection cut’s logical components and associated parameters. This structured approach moves beyond simple text-based descriptions, allowing for automated validation against the original analysis documentation and facilitating precise reconstruction of the event selection process. The SSC format explicitly defines the variables, operators, and thresholds used in each cut, thereby increasing interpretability for both human experts and automated systems. Furthermore, this standardized representation enables systematic testing and verification of the LLM-extracted criteria, improving confidence in the workflow’s accuracy and reproducibility; [latex] \text{Cut} = \{ \text{Variable}, \text{Operator}, \text{Threshold} \} [/latex].
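A minimal encoding of that schema might look as follows; the JSON field names and the `passes_all` helper are assumptions for illustration, not the paper's actual SSC format:

```python
import json
import operator

# Map the schema's Operator field onto executable comparisons.
OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}

# Two illustrative cuts following Cut = {Variable, Operator, Threshold}.
ssc_json = """[
  {"variable": "lepton_pt", "operator": ">", "threshold": 20.0},
  {"variable": "abs_eta",   "operator": "<", "threshold": 2.5}
]"""

def passes_all(event: dict, cuts: list[dict]) -> bool:
    """Apply every structured cut to one event record."""
    return all(OPS[c["operator"]](event[c["variable"]], c["threshold"])
               for c in cuts)

cuts = json.loads(ssc_json)
print(passes_all({"lepton_pt": 35.2, "abs_eta": 1.1}, cuts))  # True
```

Because each cut is data rather than prose, the full set can be diffed against the published selection table automatically, which is exactly the validation property the SSC format is designed to provide.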

Comparing the Bulk and Chunk settings, the results demonstrate successful cut extraction (medians indicated by horizontal bars) and reveal a comparable rate of hallucinated cuts, excluding failed runs.

Expanding the Analytical Horizon: Future Directions

Integrating Retrieval-Augmented Generation (RAG) represents a significant advancement for this automated workflow, promising enhanced accuracy and reliability by directly connecting the large language model (LLM) to a curated and comprehensive knowledge base. Currently, the LLM relies on its pre-existing training data, which may be incomplete or outdated regarding specific experimental details or theoretical nuances within high-energy physics. RAG addresses this limitation by enabling the LLM to retrieve relevant information from a dedicated database – encompassing experimental data, simulation results, and published literature – before formulating a response. This grounding in verified knowledge mitigates the risk of hallucination or the generation of factually incorrect code, ultimately bolstering the trustworthiness of the analysis and allowing for more robust scientific conclusions. By dynamically accessing and incorporating external knowledge, the system transcends the limitations of static pre-training, offering a pathway towards continually improved performance and adaptability.
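The retrieve-then-prompt pattern can be sketched as follows; the word-overlap scorer is a deliberately naive stand-in for the embedding search a real RAG system would use, and all names here are illustrative:

```python
def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank documents by naive word overlap with the query and return the
    top k. A production system would use dense embeddings instead."""
    q = set(query.lower().split())
    scored = sorted(corpus,
                    key=lambda doc: len(q & set(doc.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query: str, corpus: list[str]) -> str:
    """Ground the LLM prompt in retrieved passages before asking it to
    extract selection criteria, rather than relying on pre-training alone."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, corpus))
    return f"Context:\n{context}\n\nTask: {query}"
```

The essential change versus plain prompting is that the model's answer can now be checked against the retrieved passages, giving the human reviewer a concrete provenance trail.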

The automation of high-energy physics (HEP) data analysis, as demonstrated by this framework, promises a significant acceleration in the rate of scientific discovery. Traditionally, physicists dedicate substantial time and effort to crafting bespoke code for each new analysis, a process demanding specialized expertise and often becoming a bottleneck in research. By automating this process, the workflow substantially reduces the manual coding burden, allowing researchers to explore larger datasets and test hypotheses more rapidly. This shift from code development to scientific investigation frees up valuable time for physicists to focus on interpreting results, refining theoretical models, and formulating new research questions – ultimately fostering a more dynamic and productive research environment and potentially unveiling new physics beyond current understanding.

Recent evaluations indicate a promising capacity for large language models to generate code that precisely replicates established data analysis procedures. Specifically, the Qwen3-Coder:80B and GPT-OSS:120B models successfully produced code matching baseline event selections in 3 and 2 out of 10 independent test runs, respectively. This achievement suggests these models aren’t simply generating syntactically correct code, but are learning to embody the logic of established analytical techniques. The ability to consistently reproduce known results is a crucial step towards automating complex scientific workflows and builds confidence in the models’ potential for independent discovery, as it demonstrates an understanding of underlying scientific principles rather than just pattern matching.

The automated analytical framework, initially developed for High Energy Physics (HEP), possesses a remarkable capacity for adaptation across diverse scientific disciplines. Its core principles – automating data processing, code generation, and result validation – are not unique to particle physics; rather, they address common challenges inherent in any data-intensive scientific pursuit. Fields such as genomics, materials science, astrophysics, and climate modeling routinely grapple with massive datasets and complex analytical pipelines, making them ideally suited for this technology. The framework’s ability to translate scientific requirements into executable code, combined with its validation procedures, offers a pathway to significantly reduce the time and resources required for data analysis in these areas, ultimately accelerating discovery and innovation beyond the realm of HEP.

The automation of data analysis, as demonstrated by this framework, fundamentally shifts the role of the high-energy physics researcher. By handling the traditionally time-consuming and meticulous tasks of data reduction, event selection, and initial pattern recognition, the technology liberates scientists to concentrate on formulating novel hypotheses, designing more insightful experiments, and interpreting results with greater nuance. This transition isn’t merely about increased efficiency; it represents a qualitative leap towards more ambitious scientific inquiry, allowing exploration of previously inaccessible areas of research and fostering a more rapid cycle of discovery and innovation within the field and potentially beyond.

The pursuit of automated code generation from complex publications, as demonstrated in this work, echoes a fundamental principle of systemic design. Every attempt to optimize one facet (in this case, streamlining data analysis through LLMs) introduces new complexities and potential tension points within the broader scientific workflow. As Max Planck observed, “An appeal to the authority of science is useless, unless it is backed by the authority of experiment.” This highlights the critical need for a human-in-the-loop approach; the generated code, while representing a significant advancement in workflow automation, must be rigorously validated against experimental results to ensure accuracy and maintain the integrity of the scientific process. The system’s behavior over time, therefore, is defined not solely by the LLM’s capabilities, but by the iterative refinement driven by empirical verification.

The Road Ahead

This work demonstrates a tantalizing, if provisional, step towards automated knowledge extraction from the high-energy physics literature. The system, while promising, reveals a fundamental truth: every new dependency, here on the performance of a large language model, is the hidden cost of freedom from manual coding. The apparent ease of code generation obscures the intricate feedback loop of model training, data curation, and validation required to maintain reliability. A truly robust system will necessitate not merely larger models, but a deeper understanding of how knowledge is represented within them, and how that representation maps to the nuanced demands of scientific analysis.

Current limitations highlight a critical structural issue. The workflow, predicated on extracting procedures from publications, inherits the inherent ambiguities and omissions of any human-authored text. Reproducibility, the stated goal, demands more than just executable code; it requires a complete, unambiguous specification of the intent behind that code. Future work must therefore explore methods for capturing not just how an analysis was performed, but why specific choices were made, a challenge that extends beyond the capabilities of current language models.

The long-term trajectory suggests a shift from simply automating existing workflows to fundamentally re-thinking how scientific knowledge is structured and disseminated. A successful system will not merely translate publications into code, but will act as a dynamic, evolving knowledge base, capable of adapting to new discoveries and facilitating collaborative analysis. The ultimate measure of progress will not be the lines of code generated, but the reduction in cognitive load for the physicist striving to understand the universe.


Original article: https://arxiv.org/pdf/2604.14696.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/


2026-04-20 00:13