Author: Denis Avetisyan
New research demonstrates how artificial intelligence can automate complex scientific data analysis and visualization, reducing reliance on manual coding.

Prompt engineering techniques significantly improve the reliability of AI-generated code for analyzing HDF5 data and producing accurate visualizations.
Despite increasing data volumes in modern science, programming expertise remains a significant barrier to timely insight for many researchers. This challenge motivates the work ‘Toward Automated and Trustworthy Scientific Analysis and Visualization with LLM-Generated Code’, which systematically evaluates the capacity of large language models to autonomously generate Python scripts for scientific data analysis and visualization. Our findings reveal that while LLMs demonstrate promise, unassisted code generation suffers from limited reliability due to ambiguous prompts and insufficient domain understanding, issues mitigated by techniques such as data-aware prompt engineering and iterative error repair. Can these advancements pave the way for truly inclusive, AI-assisted research tools that democratize access to complex data analysis?
Unraveling the Data Deluge: A Modern Predicament
Contemporary scientific research is characterized by an exponential increase in data volume and intricacy, a phenomenon that frequently overwhelms conventional analytical techniques. This isn’t merely a matter of ‘more data’; the datasets now routinely generated are often multi-dimensional, encompassing diverse data types and requiring substantial computational resources. Investigations across fields like genomics, astrophysics, and climate science are producing datasets measured in terabytes and petabytes, demanding new approaches to storage, processing, and interpretation. The limitations of traditional methods – often designed for smaller, simpler datasets – manifest as bottlenecks in research workflows, hindering the ability to identify patterns, validate hypotheses, and ultimately, accelerate scientific discovery. Consequently, researchers are actively developing and adopting novel computational tools and algorithms capable of effectively managing and extracting knowledge from these increasingly complex data landscapes.
The proliferation of large-scale scientific datasets is increasingly outpacing the capacity of conventional analytical tools. While data acquisition methods have advanced rapidly, the software and techniques for effectively processing and interpreting this information lag behind, creating a substantial bottleneck in the research process. This inefficiency doesn’t merely slow down progress; it actively hinders discovery, as potentially valuable patterns and correlations remain hidden within unanalyzed data. The struggle to extract meaningful insights stems not only from the sheer volume of information, but also from its inherent complexity: high dimensionality, noise, and intricate relationships between variables all contribute to the challenge. Consequently, researchers are spending disproportionate amounts of time on data wrangling and preliminary analysis, diverting resources from hypothesis generation and the pursuit of novel scientific understanding.
Modern scientific endeavors routinely produce data far exceeding the capacity of conventional file formats, necessitating specialized solutions for storage and access. Formats like HDF5, NetCDF, and FITS are designed to address this challenge by efficiently managing the sheer volume and intricate structure inherent in scientific datasets. HDF5, for instance, allows for the storage of large, heterogeneous data with metadata, enabling complex analyses; NetCDF excels at representing multidimensional data commonly found in climate and oceanographic studies; and FITS is the standard for astronomical data, accommodating vast arrays of information alongside calibration data. These formats aren’t merely storage containers; they facilitate data sharing, interoperability, and the preservation of scientific knowledge, ultimately enabling researchers to unlock insights hidden within increasingly complex information landscapes.
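As a concrete illustration of working with one of these formats, the sketch below uses h5py to walk a hypothetical HDF5 file and print each dataset’s shape, dtype, and attached metadata. The file name and structure are assumptions for illustration, not drawn from the paper’s datasets.

```python
# Minimal sketch: inspecting the structure of a hypothetical HDF5 file with h5py.
# The file name and dataset layout are illustrative, not taken from the paper.
import h5py

with h5py.File("observations.h5", "r") as f:
    def describe(name, obj):
        # Report shape and dtype for datasets, plus any attached metadata attributes.
        if isinstance(obj, h5py.Dataset):
            print(f"{name}: shape={obj.shape}, dtype={obj.dtype}")
        for key, value in obj.attrs.items():
            print(f"  attr {key} = {value}")

    f.visititems(describe)  # recursively visit every group and dataset
```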
The proliferation of large-scale scientific datasets, originating from endeavors like NASA’s Earth Observing System (EOS) and fast Magnetic Resonance Imaging (fastMRI), is driving a critical need for automated analytical techniques. These datasets, often exceeding terabytes in size and characterized by intricate structures, routinely overwhelm traditional analytical pipelines, leading to diminished script executability and reduced rates of correct output. The study demonstrates that manual intervention becomes impractical, and even automated scripts face difficulties in reliably processing the data without errors. Consequently, researchers are increasingly focused on developing algorithms and software capable of autonomously handling data ingestion, quality control, analysis, and visualization, ensuring that the potential insights within these massive datasets are not lost due to computational bottlenecks or inaccuracies.

The Algorithm as Alchemist: A New Paradigm for Scientific Workflows
Large Language Models (LLMs), including GPT-4 and Claude 3.5, are increasingly investigated for their potential to automate processes within scientific workflows. These models demonstrate capability in handling tasks such as data cleaning, statistical analysis, and the creation of data visualizations, traditionally requiring significant manual effort from researchers. By leveraging natural language processing, LLMs can interpret user requests expressed in plain language and translate them into executable code or commands for data manipulation and graphical representation. Initial studies suggest LLMs can accelerate research by reducing the time required for routine data processing, allowing scientists to focus on interpretation and hypothesis generation. However, the reliability and accuracy of LLM-driven automation remain areas of ongoing research and validation.
Large Language Models (LLMs) demonstrate significant capability in automated code generation, specifically producing Python scripts tailored for scientific data handling and visualization. These models accept instructions provided in natural language – for example, a request to “generate a script to plot a histogram of column ‘X’ from the dataset ‘data.csv’” – and translate these into functional code. The generated scripts commonly utilize libraries such as NumPy, Pandas, Matplotlib, and SciPy for data manipulation, analysis, and graphical representation. This allows researchers to automate repetitive coding tasks, rapidly prototype data analysis pipelines, and explore datasets without requiring extensive programming expertise. The efficiency of this process is contingent on the LLM’s training data and its ability to correctly interpret the user’s intent from the natural language prompt.
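For a sense of what this looks like in practice, the snippet below sketches the kind of script an LLM might return for the histogram request quoted above, assuming a CSV file named data.csv with a numeric column ‘X’; it is an illustrative reconstruction, not output reproduced from the study.

```python
# Sketch of an LLM-style response to: "plot a histogram of column 'X' from 'data.csv'".
# File and column names mirror the illustrative prompt, not a real dataset.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")                 # load the tabular data
plt.hist(df["X"].dropna(), bins=30)          # histogram of column 'X', skipping missing values
plt.xlabel("X")
plt.ylabel("Frequency")
plt.title("Distribution of X")
plt.savefig("histogram_X.png", dpi=150)      # write the figure to disk
```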
The efficacy of Large Language Models (LLMs) in generating functional code for scientific workflows is directly correlated with the quality of the provided prompt. LLMs require detailed and unambiguous instructions to produce syntactically correct and logically sound code; vague or incomplete prompts frequently result in errors or unintended functionality. Specifically, prompts must clearly define the desired input data format, the required data transformations, the specific visualization techniques to employ, and any relevant parameters or constraints. The study indicates that simply increasing prompt length is insufficient; prompts must be comprehensively detailed to achieve high rates of script executability and accurate output, with successful prompts often incorporating example inputs and expected outputs to guide the LLM’s code generation process.
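A hypothetical prompt in this spirit might look like the following; the file, dataset path, and attributes are invented for illustration rather than taken from the paper’s benchmark prompts.

```python
# Illustrative prompt template (an assumption, not the paper's exact wording) showing
# the level of detail that tends to improve script executability.
PROMPT = """
You are given an HDF5 file 'observations.h5' containing a dataset '/sensor/temperature'
with shape (time, lat, lon), dtype float32, units kelvin, and fill value -9999.
Write a Python script that:
  1. Loads the dataset with h5py and masks fill values.
  2. Computes the time-mean temperature at each (lat, lon) grid point.
  3. Plots the result as a Matplotlib pcolormesh with a colorbar labeled in kelvin.
  4. Saves the figure to 'mean_temperature.png'.
Return only runnable Python code.
"""
```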
Large Language Models (LLMs) facilitate scientific visualization by generating code for libraries such as Matplotlib and VTK, effectively converting raw data into graphical representations. Research indicates a strong correlation between the quality of LLM-generated visualization scripts and their successful execution. Specifically, script executability – the ability to run without errors – and the accuracy of the resulting output are significantly impacted by prompt engineering and the implementation of techniques designed to improve LLM performance. These techniques include, but are not limited to, providing detailed data descriptions, specifying desired plot types, and incorporating example outputs to guide the LLM’s code generation process. Variations in prompt construction directly influence the rate of successful script execution and the correctness of the visualized data.

Refining the Machine: Prompt Engineering and Error Correction
Retrieval-Augmented Prompt Enhancement and Data-Aware Prompt Disambiguation are techniques used to improve the quality of prompts supplied to Large Language Models (LLMs). Retrieval-Augmentation involves supplementing the initial prompt with relevant information retrieved from an external knowledge source, providing the LLM with necessary context it may not inherently possess. Data-Aware Prompt Disambiguation focuses on clarifying the prompt’s intent by incorporating metadata about the data the LLM will process; this includes data types, units of measurement, and relevant attributes. Both methods aim to reduce ambiguity and provide the LLM with sufficient information to generate more accurate and contextually appropriate responses, thereby increasing the overall performance and reliability of the model’s output.
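One plausible way to implement data-aware disambiguation, sketched below under the assumption that the target data lives in HDF5, is to harvest dataset names, shapes, dtypes, and units and prepend that summary to the user’s request. The helper names are illustrative, not the paper’s implementation.

```python
# Minimal sketch of data-aware prompt disambiguation: summarize the file's actual
# structure so the LLM codes against real dataset names, shapes, and units.
import h5py

def summarize_hdf5(path: str) -> str:
    lines = []
    with h5py.File(path, "r") as f:
        def collect(name, obj):
            if isinstance(obj, h5py.Dataset):
                units = obj.attrs.get("units", "unknown")
                lines.append(f"- /{name}: shape={obj.shape}, dtype={obj.dtype}, units={units}")
        f.visititems(collect)
    return "\n".join(lines)

def build_prompt(user_request: str, path: str) -> str:
    # Prepend the structural summary so the request is unambiguous.
    return (
        "The target file contains the following datasets:\n"
        f"{summarize_hdf5(path)}\n\n"
        f"Task: {user_request}\n"
        "Generate a complete, runnable Python script."
    )
```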
Efficient information retrieval is crucial for augmenting prompts with relevant context. The All-MiniLM-L6-v2 model provides high-quality sentence embeddings, allowing for semantic similarity searches within large datasets. Faiss, a library for efficient similarity search and clustering of dense vectors, is utilized to quickly identify the most pertinent information based on these embeddings. This combination enables the system to retrieve relevant data points from a knowledge base and incorporate them into prompts, thereby improving the accuracy and relevance of the LLM’s responses. The use of Faiss is particularly beneficial for scaling to datasets containing millions of vectors, providing low-latency retrieval times necessary for real-time prompt augmentation.
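A minimal sketch of this retrieval step, assuming the sentence-transformers and faiss-cpu packages and a small placeholder knowledge base of documentation snippets, might look like this:

```python
# Sketch of embedding-based retrieval with All-MiniLM-L6-v2 and Faiss.
# The documentation snippets are placeholders standing in for a real knowledge base.
import faiss
from sentence_transformers import SentenceTransformer

snippets = [
    "h5py.File(path, 'r') opens an HDF5 file read-only.",
    "Dataset.attrs stores per-dataset metadata such as units and fill values.",
    "matplotlib.pyplot.pcolormesh draws a 2-D gridded field.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(snippets, normalize_embeddings=True)  # shape (n, 384), float32

index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product == cosine on unit vectors
index.add(embeddings)

query = model.encode(["how do I read units from an HDF5 dataset?"], normalize_embeddings=True)
scores, ids = index.search(query, 2)            # top-2 most similar snippets
retrieved = [snippets[i] for i in ids[0]]       # context to splice into the prompt
```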
Iterative Error Repair leverages execution feedback to refine LLM-generated code, improving its functionality and accuracy. This process involves executing the generated script and analyzing any resulting errors; the LLM then utilizes this feedback to modify and correct the code iteratively. Research demonstrated that combining Iterative Error Repair with other prompt engineering techniques resulted in significant gains in script executability and correct output rates. Specifically, testing on NASA EOS datasets indicated a marked improvement in performance when compared to approaches lacking this error correction loop, highlighting the method’s efficacy in complex scientific computing tasks.
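A bare-bones version of such a loop is sketched below; `generate_code` is a placeholder for whatever LLM API call is in use, and the three-attempt budget is an arbitrary choice rather than the paper’s configuration.

```python
# Minimal sketch of an iterative error-repair loop: run the generated script,
# capture the traceback, and feed it back to the model for correction.
import subprocess
import sys

def repair_loop(generate_code, prompt: str, max_attempts: int = 3) -> str:
    code = generate_code(prompt)
    for _ in range(max_attempts):
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=120,
        )
        if result.returncode == 0:
            return code  # executable script; output correctness still needs checking
        # Ask the model to fix the specific failure it just produced.
        code = generate_code(
            f"{prompt}\n\nThe previous script failed with:\n{result.stderr}\n"
            "Return a corrected, complete script."
        )
    return code
```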
Large Language Models (LLMs) generating scripts for numerical and scientific tasks heavily rely on foundational libraries such as NumPy and SciPy. NumPy provides support for multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these structures, enabling efficient numerical computation. SciPy builds upon NumPy, offering more advanced scientific computing tools including optimization, integration, interpolation, signal processing, and statistics. The integration of these libraries within LLM-generated code is crucial for performing complex calculations, data analysis, and simulations, particularly when dealing with datasets commonly encountered in fields like engineering and data science. The effective utilization of these libraries directly impacts the accuracy, performance, and reliability of the generated scripts and their outputs, whether the task at hand is a curve fit, a numerical integration, or a summary statistic.
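A small example of the kind of NumPy/SciPy operation such generated scripts routinely lean on, here a least-squares fit of an exponential decay to synthetic noisy samples (the model and data are invented for illustration):

```python
# Example of typical NumPy/SciPy usage in generated analysis code:
# fit an exponential decay to synthetic noisy measurements.
import numpy as np
from scipy.optimize import curve_fit

def decay(t, amplitude, rate):
    return amplitude * np.exp(-rate * t)

t = np.linspace(0, 5, 100)
rng = np.random.default_rng(0)
y = decay(t, 2.0, 1.3) + rng.normal(scale=0.05, size=t.size)  # synthetic data

params, covariance = curve_fit(decay, t, y, p0=(1.0, 1.0))
print(f"fitted amplitude={params[0]:.3f}, rate={params[1]:.3f}")
```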

The Validation Imperative: Benchmarking LLM-Driven Scientific Workflows
The emergence of Large Language Models (LLMs) in scientific workflows necessitates robust evaluation metrics, and benchmarks like MatPlotBench are designed to address this need by providing a standardized assessment of LLM performance in generating scientific visualizations. These benchmarks operate by presenting LLMs with specific data analysis tasks – often requiring the creation of plots and charts – and then rigorously comparing the generated code and resulting visualizations against known correct solutions. This process isn’t simply about verifying functional code; it examines adherence to best practices in data visualization, ensuring clarity, accuracy, and effective communication of scientific findings. By offering a common yardstick, MatPlotBench facilitates meaningful comparisons between different LLMs and tracks improvements in their ability to automate complex data exploration, ultimately accelerating the pace of scientific discovery across disciplines.
The validity of code generated by large language models for scientific applications hinges on meticulous testing against well-established datasets. Researchers are employing benchmarks – collections of known inputs and expected outputs – to systematically evaluate the accuracy and robustness of LLM-driven workflows. This process isn’t simply about confirming whether the code runs, but whether it produces scientifically correct results, mirroring the output of traditional, vetted methods. By subjecting LLM-generated code to these rigorous tests, potential errors or biases can be identified and addressed, ensuring the reliability of automated analyses and fostering confidence in the insights derived from these increasingly powerful tools. Such validation is paramount before deploying LLMs to tackle complex scientific challenges, particularly when dealing with sensitive or critical data where inaccuracies could have significant consequences.
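A minimal sketch of such a validation harness, assuming generated scripts are stored as files and defining success as error-free execution plus production of an expected output artifact (both simplifying assumptions, not the benchmark’s scoring rules):

```python
# Sketch of a validation harness: run each generated script in a subprocess and
# record whether it executes cleanly and produces the expected output file.
import subprocess
import sys
from pathlib import Path

def evaluate(script_paths, expected_output: str):
    """Record executability and output presence for each generated script."""
    results = {}
    for script in script_paths:
        out = Path(expected_output)
        out.unlink(missing_ok=True)  # start each run from a clean slate
        run = subprocess.run(
            [sys.executable, script], capture_output=True, text=True, timeout=300
        )
        results[script] = {
            "executable": run.returncode == 0,
            "produced_output": out.exists(),
            "stderr_tail": run.stderr[-500:],  # keep the end of any traceback
        }
    return results
```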
Recent advancements in large language models (LLMs) offer a compelling pathway to accelerate scientific discovery, as evidenced by a study evaluating LLM-driven workflows across multiple datasets. The research demonstrates measurable improvements in automated data analysis and visualization pipelines when employing iterative error repair techniques. While all tested datasets benefited from this approach, the impact proved particularly pronounced when processing NASA Earth Observing System (EOS) data – suggesting LLMs excel at tasks involving complex geospatial information. Conversely, the study revealed limited gains when applied to fastMRI datasets, indicating the effectiveness of these techniques is dataset-dependent and requires careful consideration of data characteristics and potential biases. This variability highlights the need for ongoing refinement and benchmarking to optimize LLM performance across the diverse landscape of scientific disciplines.
The automation of intricate data analysis and visualization pipelines represents a paradigm shift in scientific methodology, allowing researchers to transcend the burdens of computational tasks and dedicate their expertise to the core of discovery. By handling the complexities of data processing, cleaning, and graphical representation, these automated systems effectively function as force multipliers for scientific inquiry. This transition isn’t merely about efficiency; it allows investigators to concentrate on formulating hypotheses, interpreting nuanced results, and identifying previously unseen patterns within datasets. The capacity to rapidly generate and explore visualizations facilitates a more intuitive understanding of complex phenomena, ultimately accelerating the pace of innovation and enabling breakthroughs across diverse scientific disciplines. Consequently, researchers are empowered to move beyond data manipulation and towards higher-level cognitive tasks that drive genuine scientific advancement.
The pursuit of automated scientific analysis, as detailed in the paper, fundamentally relies on challenging the boundaries of existing systems. The study demonstrates this by systematically probing large language models’ capacity to translate intent into functional code. This process of iterative refinement, of deliberately introducing queries to expose weaknesses, echoes a core tenet of understanding any complex system. As Blaise Pascal observed, “The eloquence of angels is no more than the silence of the wise.” In this context, the ‘silence’ represents the initial limitations of the LLM, and the work actively elicits responses, the ‘eloquence’, through carefully crafted prompts, revealing and correcting flaws in the model’s ‘design’ until a trustworthy analytical pipeline emerges. The paper isn’t merely about using LLMs; it’s about testing them, reverse-engineering their limitations, and ultimately, building a more robust system through deliberate provocation.
What Breaks Down Next?
The demonstrated improvements in LLM-driven scientific code generation are, predictably, not the end. The system still functions within the boundaries of solvable problems – those neatly packaged in HDF5, readily addressed by existing libraries, and expressible in natural language prompts. A more aggressive approach requires deliberately crafting ambiguous requests, introducing incomplete datasets, or demanding analyses that necessitate novel algorithmic combinations. What happens when the model encounters data structures it hasn’t ‘seen’ before, or when a seemingly straightforward scientific question requires a leap in computational thinking? The inevitable failures will reveal the true limits of pattern recognition versus genuine understanding.
Current methods prioritize correctness – code that runs and produces a numerical result. But scientific inquiry demands more. It requires a system capable of challenging assumptions embedded within the data itself. Can an LLM be prompted to identify potential biases in a dataset, or to suggest alternative analytical approaches that might reveal hidden relationships? The focus must shift from simply automating existing workflows to augmenting the scientist’s capacity for critical thinking, even if that means occasionally generating incorrect – but insightful – code.
Ultimately, the utility of these models hinges not on their ability to flawlessly execute instructions, but on their capacity to systematically break them. Only by pushing the boundaries of what’s possible – and meticulously documenting the resulting failures – can one truly reverse-engineer the complexities of scientific discovery.
Original article: https://arxiv.org/pdf/2511.21920.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/