AI Scientists: Automating Discovery with Intelligent Agents

Author: Denis Avetisyan

New research showcases how AI agents are being developed to autonomously curate data, analyze complex information, and synthesize knowledge for scientific advancement.

This review details the development and application of agentic AI frameworks, DeepCollector and DeepScribe, for automated scientific data curation, multimodal analysis, and knowledge synthesis using knowledge graphs and retrieval-augmented generation.

Current large language models often struggle with complex reasoning and maintaining context within extended scientific workflows. This limitation motivates the work presented in ‘Experiments in Agentic AI for Science’, which details two novel agentic AI frameworks-DeepCollector and DeepScribe-designed to automate data curation, multimodal analysis, and knowledge synthesis using a hybrid local-remote architecture. Through practical implementations-including granular attribute extraction and distributed concurrency controls-we demonstrate how these systems overcome existing limitations to rigorously support scientific tasks, from time-series data management to converting complex physics lectures into structured reports. Could this approach unlock new avenues for autonomous scientific discovery and accelerate progress across diverse fields like high-energy physics, where knowledge graph construction is paramount?

The Inevitable Deluge: Beyond Data Storage

The proliferation of sensors and interconnected devices has ignited an unprecedented surge in time-series data – measurements recorded sequentially over time. This exponential growth, fueled by fields like environmental monitoring, financial markets, and industrial automation, is rapidly eclipsing the capacity of conventional analytical techniques. Traditional methods, often reliant on manual inspection and static algorithms, struggle to process the sheer volume, velocity, and variety of incoming data streams. Consequently, critical insights are often delayed, obscured, or entirely missed, hindering scientific progress and informed decision-making. The challenge isn’t simply storing this data, but effectively extracting meaningful patterns and predictions from a deluge that overwhelms existing infrastructure and analytical workflows.

The accelerating generation of time-series data across diverse scientific disciplines has created a critical impasse: manual data curation is no longer a viable path to knowledge. While historically essential for ensuring data quality and reliability, the sheer volume now overwhelms the capacity of human experts to efficiently ingest, validate, and structure information. This bottleneck doesn’t merely slow down research; it actively impedes scientific discovery and introduces significant risks to the integrity of derived insights. Errors or inconsistencies, often subtle, can propagate through analyses, leading to flawed conclusions and potentially misdirected efforts. The limitations of manual processes now represent a substantial impediment to fully leveraging the potential held within rapidly expanding datasets, demanding innovative automated solutions to maintain the pace of scientific progress.

The escalating volume of time-series data necessitates a shift towards automated curation systems. Traditional methods, reliant on manual inspection and correction, are proving insufficient to handle the sheer scale and velocity of modern datasets. An autonomous system addresses this challenge by integrating data ingestion, cleaning, and structuring into a cohesive, self-operating pipeline. Such a system utilizes algorithms to identify and rectify inconsistencies, outliers, and missing values, while simultaneously organizing the data into a format suitable for advanced analytical techniques like machine learning and statistical modeling. This not only accelerates the pace of scientific discovery but also enhances the reliability and reproducibility of insights derived from complex time-series data, ultimately enabling more informed decision-making across diverse fields.

DeepTS: A Necessary Automation

DeepTS addresses the challenges of time-series data management by providing a complete pipeline for data curation, extraction, and deduplication. This end-to-end functionality minimizes the need for manual intervention, automating processes traditionally requiring significant human effort. The system is designed to ingest raw time-series data, identify and resolve inconsistencies or redundancies, and output a clean, curated dataset ready for analysis. By automating these steps, DeepTS aims to improve the efficiency and scalability of time-series data projects, reducing both time and resource expenditure associated with data preparation.

The DeepTS architecture is structured around a ‘Local Body Remote Brain’ paradigm, designed to optimize resource allocation and processing efficiency. This approach segregates operations into two distinct categories: local data handling and remote cognitive reasoning. Local processing, executed directly on the data source, focuses on tasks such as data ingestion, cleaning, and basic transformation, minimizing data transfer overhead. Conversely, more complex analytical and deductive reasoning, including anomaly detection, data deduplication, and schema inference, is offloaded to a remote server leveraging large language models (LLMs). This distribution allows DeepTS to scale effectively by utilizing the computational resources of the remote server for demanding tasks while preserving the responsiveness of local data operations.

DeepTS achieves integration with Large Language Models (LLMs) through adherence to the OpenAI API Specification and the implementation of a Model Context Protocol (MCP). This allows for efficient communication and data exchange between the system and external LLMs. Performance profiling, conducted across six separate runs, demonstrates substantial LLM utilization, totaling 2268 API calls. The MCP facilitates the transfer of relevant contextual information to the LLM, enabling informed decision-making during time-series data curation, extraction, and deduplication processes.

Retrieval-Augmented Reality: Filling the Knowledge Gaps

DeepTS utilizes Retrieval-Augmented Generation (RAG) to improve the performance of Large Language Models (LLMs) by integrating external data sources into the generation process. This is achieved through LlamaIndex, a data framework enabling LLMs to access and reason about private or domain-specific data. Rather than relying solely on the LLM’s pre-trained knowledge, RAG dynamically retrieves relevant information from these external sources and provides it as context to the LLM, resulting in more accurate, informed, and contextually relevant responses. This approach mitigates issues such as hallucination and knowledge cut-off inherent in standard LLM deployments.

DeepTS extends Retrieval-Augmented Generation (RAG) capabilities through the implementation of GraphRAG, a technique that utilizes graph structures to represent and query knowledge. Unlike traditional RAG which relies on sequential document retrieval, GraphRAG represents information as nodes and edges, enabling the LLM to traverse relationships between concepts and improve reasoning capabilities. This approach allows for more complex queries and inferences by considering the interconnectedness of data, potentially overcoming limitations of purely text-based retrieval methods and enhancing the accuracy and contextual relevance of generated responses.

DeepCollector, a component within the DeepTS framework, employs Cellular RAG to improve the dependability of its inferences. This implementation achieves average latencies ranging from 6.9 to 33.8 seconds when integrated with the gemini-3-flash-preview model, and 20.2 seconds with gemini-3.1-pro-preview. These latency figures represent the time taken for DeepCollector to retrieve relevant data, augment the LLM prompt, and generate a response utilizing the specified Gemini models.

DeepScribe: The Inevitable Automation of Expertise

DeepScribe showcases a significant advancement in automated knowledge distillation, moving beyond simple text summarization to construct fully-formed scientific reports directly from lecture content. This system doesn’t merely transcribe spoken words; it actively interprets the presented physics concepts, identifies key arguments, and structures them into a coherent, publication-ready document. By autonomously converting spoken lectures into formalized reports – complete with sections, equations rendered via [latex]\frac{d}{dx}[/latex], and logical flow – DeepScribe suggests a pathway toward scaling scientific knowledge synthesis. The framework’s success with physics lectures demonstrates its potential applicability to diverse fields, promising a future where complex information can be efficiently transformed into accessible and rigorously structured scientific literature.

The system efficiently transforms raw lecture footage into polished scientific reports by utilizing a powerful combination of open-source tools. Specifically, FFmpeg handles the complex task of video processing – including segmentation, frame extraction, and audio analysis – preparing the content for textual conversion. This processed data is then seamlessly integrated into LaTeX, a sophisticated typesetting system renowned for its ability to produce professional, publication-quality documents. LaTeX ensures accurate rendering of complex equations, such as [latex]E=mc^2[/latex], and facilitates the creation of visually appealing and structurally sound reports, ready for dissemination within the scientific community. This automated pipeline minimizes manual effort and maximizes the fidelity of the final document, offering a robust solution for converting spoken knowledge into formal scientific literature.

The DeepScribe system capitalizes on the computational resources of Google Colab, enabling efficient and scalable autonomous scientific report generation. This cloud-based execution not only circumvents the need for substantial local computing power but also facilitates rapid processing of complex data derived from sources like physics lectures. Crucially, the system is engineered to maintain a consistent average latency of 600 seconds for its Deep Research agent, ensuring timely report completion without sacrificing analytical rigor. This constraint encourages optimization of the entire workflow, from video processing with FFmpeg to the final LaTeX formatting of publication-ready documents, and demonstrates a practical balance between computational demand and turnaround time for automated scientific summarization.

Towards a Generalized Architecture: The Seeds of True Automation

DeepKG functions as a foundational, generalized agentic framework designed to facilitate both the construction and application of knowledge graphs, serving as the core technology behind more specialized systems like DeepTS and DeepScribe. Rather than being a single, monolithic application, DeepKG provides the underlying architecture and tools necessary to create dynamic knowledge representations and empower agents to reason and act upon them. This modular design allows for adaptability across diverse domains, enabling the creation of agents capable of complex tasks such as time-series analysis – handled by DeepTS – and automated scientific writing, as demonstrated by DeepScribe. Essentially, DeepKG abstracts the complexities of knowledge graph management, offering a unified platform for building intelligent systems that leverage structured knowledge for enhanced performance and decision-making.

The versatility of DeepKG extends beyond general knowledge domains, finding powerful application in specialized fields such as high-energy physics through the DeepQCD project. This implementation showcases DeepKG’s ability to ingest, reason with, and generate insights from the complex datasets characteristic of particle physics research. By representing physical principles and experimental data as a knowledge graph, DeepQCD enables automated hypothesis generation and validation, effectively acting as a collaborative research assistant. The system can, for example, predict the outcomes of particle collisions or identify potentially overlooked relationships within experimental results, accelerating the pace of discovery in a field traditionally reliant on human intuition and painstaking analysis. This success demonstrates DeepKG’s potential to become a foundational tool for scientific exploration across diverse and data-rich disciplines.

The creation of these autonomous agentic frameworks represents a substantial collaborative undertaking, drawing upon the expertise of thirty researchers affiliated with twenty-six distinct organizations globally. This broad participation underscores the growing interest and concerted effort within the artificial intelligence community to move beyond narrowly defined tasks and towards systems capable of independent reasoning and action. The diverse backgrounds and perspectives contributed to the development process, fostering innovation and ensuring the robustness of the resulting technologies. Such widespread collaboration not only accelerates progress in agentic AI but also highlights the inherently interdisciplinary nature of building truly intelligent systems, demanding contributions from computer science, physics, and beyond.

The pursuit of autonomous scientific discovery, as detailed in this work, inherently embraces a level of controlled chaos. The systems DeepCollector and DeepScribe aren’t engineered for rigid predictability, but rather designed to navigate the inherent messiness of scientific data. As Marvin Minsky observed, “The more we learn about intelligence, the more we realize how much of it is just good guessing.” This ‘good guessing’ manifests in the agentic AI’s ability to synthesize knowledge from multimodal sources and curate data – a process far removed from deterministic programming. Stability, in this context, is merely an illusion that caches well, as the system adapts and evolves with each iteration of data analysis and knowledge refinement. A guarantee of perfect curation is impossible; the value lies in the probabilistic improvement of understanding.

What Lies Ahead?

The architectures presented here-DeepCollector and DeepScribe-are, predictably, solutions in search of problems they will inevitably exacerbate. The automation of scientific data curation, while appearing efficient, merely shifts the bottleneck. It isn’t a removal of cognitive load, but a re-distribution, concentrating it on the design and maintenance of these very systems. Scalability is just the word used to justify complexity, and the promise of autonomous discovery feels less like a liberation of intellect and more like an abdication of responsibility.

The current focus on knowledge graphs and retrieval-augmented generation offers a compelling illusion of understanding. But meaning isn’t simply the sum of connected data points. True synthesis demands nuance, context, and a healthy dose of skepticism-qualities not easily encoded in algorithms. Everything optimized will someday lose flexibility, and the pursuit of perfectly curated datasets risks blinding researchers to serendipitous anomalies-the very seeds of genuine innovation.

The perfect architecture is a myth to keep people sane. Future work shouldn’t focus on building ‘smarter’ agents, but on fostering symbiotic relationships between human intuition and machine precision. The goal isn’t to replace the scientist, but to augment them – to create tools that enhance, not diminish, the inherently messy, unpredictable process of scientific inquiry. The real challenge lies not in automating discovery, but in preserving the capacity for wonder.

Original article: https://arxiv.org/pdf/2605.26305.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-05-27 17:03