Decoding Science with AI: A New Era of Discovery

Author: Denis Avetisyan


Large language models are poised to transform how we understand and analyze scientific knowledge, offering powerful tools for accelerating research.

This review explores the application of large language models to the field of Science of Science, encompassing scientometrics, research forecasting, and knowledge graph construction.

Amid the ever-increasing volume of scientific literature, synthesizing knowledge and forecasting emerging research fronts remain persistent challenges. This manuscript, ‘The Empowerment of Science of Science by Large Language Models: New Tools and Methods’, comprehensively reviews how recent advances in large language models (LLMs) can address these limitations within the field of Science of Science (SciSci). We demonstrate the potential of LLMs – from prompt engineering to AI agent design – to revolutionize scientific evaluation, knowledge graph construction, and research forecasting. Could these tools ultimately reshape how we understand, assess, and accelerate scientific discovery itself?


The Shifting Sands of Knowledge

The progression of scientific understanding isn’t a simple linear climb, but a complex, shifting landscape where topics evolve in meaning and relationship to one another. Traditional methods of charting this progress, which rely on keyword searches or rigid categorization systems, often fail to capture these subtleties. These approaches struggle with semantic nuance, where a single term can acquire new connotations or be used in fundamentally different contexts over time, and are ill-equipped to handle the dynamic shifts inherent in scientific discovery. Consequently, they can misrepresent the true trajectory of research, obscuring emerging connections and hindering a comprehensive understanding of how knowledge accumulates and transforms. This limitation underscores the need for more sophisticated analytical tools capable of discerning not just what is being researched, but how the meaning of that research is changing.

The exponential growth of scientific publications demands computational methods to discern evolving research landscapes. With an estimated 10 million new scientific papers published annually, manual tracking of emerging trends becomes untenable. This surge is further compounded by a rapidly expanding artificial intelligence sector, evidenced by the proliferation of large language models – 61 originating from the US, 21 from the EU, and 15 from China as of 2023. These LLMs, while potentially assisting in literature review, simultaneously contribute to the data deluge, necessitating automated systems capable of not only indexing information but also identifying pivotal research fronts and the semantic relationships between them, ultimately allowing researchers to navigate the increasingly complex world of scientific discovery.

Mapping the Connections, Not Just the Papers

Multilayer networks represent a significant advancement over traditional citation analysis by enabling the representation of multiple relationship types between scientific concepts. While citation analysis solely focuses on references between publications, multilayer networks allow for the modeling of diverse connections such as collaborations between researchers, shared methodologies, or conceptual dependencies. This is achieved by defining nodes as concepts and establishing edges that represent specific relationships between them, with each layer of the network dedicated to a distinct relationship type. The result is a more comprehensive and nuanced understanding of the scientific landscape, facilitating more accurate knowledge discovery and research front forecasting than methods limited to bibliographic data alone. These networks move beyond simple pairwise connections to capture the complex interplay of ideas and actors within a field.
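
As a concrete illustration, a multilayer network of this kind can be sketched with the networkx library, tagging each edge with the layer (relationship type) it belongs to. The concepts and relations below are illustrative placeholders, not data from the reviewed work.

```python
import networkx as nx

# Minimal sketch of a multilayer concept network: each edge carries a
# "layer" attribute naming the relationship type it belongs to.
# Concept names and relations are purely illustrative.
G = nx.MultiDiGraph()

G.add_edge("transformer models", "protein folding", layer="methodology",
           relation="applied_to")
G.add_edge("structural biology group", "protein folding", layer="collaboration",
           relation="studies")
G.add_edge("protein folding", "drug discovery", layer="conceptual",
           relation="enables")

def layer_view(graph, layer_name):
    """Project out a single layer to analyse one relationship type in isolation."""
    H = nx.DiGraph()
    for u, v, data in graph.edges(data=True):
        if data["layer"] == layer_name:
            H.add_edge(u, v, relation=data["relation"])
    return H

print(layer_view(G, "conceptual").edges(data=True))
```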

The construction of multilayer networks for representing relationships between concepts relies on the identification and extraction of subject-action-object triples from textual data. These triples function as the foundational units defining connections; the subject represents the entity performing an action, the action describes the relationship, and the object is the entity acted upon. For example, in the sentence “Researchers study protein folding,” “Researchers” is the subject, “study” is the action, and “protein folding” is the object. By systematically identifying these triples within a corpus of scientific literature, the network can be populated with nodes representing concepts and edges representing the relationships between them, enabling a more nuanced understanding of complex scientific domains beyond simple co-occurrence or citation patterns.
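
A rough sense of how such triples can be pulled from text is given by the dependency-parsing sketch below. It uses spaCy purely as a stand-in for the extraction step (the reviewed approach relies on an LLM), and assumes the small English model has been installed.

```python
import spacy

# Rough subject-action-object extraction via dependency parsing. A real
# pipeline would use an LLM or a trained extractor; this only illustrates
# the triple structure. Assumes: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def sao_triples(text):
    triples = []
    doc = nlp(text)
    for token in doc:
        if token.pos_ == "VERB":
            subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in token.children if c.dep_ in ("dobj", "attr")]
            for s in subjects:
                for o in objects:
                    # Use the object's subtree so "protein folding" stays intact.
                    obj_span = " ".join(w.text for w in o.subtree)
                    triples.append((s.text, token.lemma_, obj_span))
    return triples

print(sao_triples("Researchers study protein folding."))
# e.g. [('Researchers', 'study', 'protein folding')]
```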

DeepSeek-V3, a large language model, automates the construction of multilayer networks used in research front forecasting by extracting subject-action-object triples from text. Its selection over alternative models – GPT-4o, Moonshot-v1-8k, QwQ-32B-Preview, and Gemini-Pro-1.5 – was based on a comparative evaluation of performance metrics. DeepSeek-V3 demonstrated superior efficiency in terms of cost, the relevance of extracted triples to research fronts, and processing speed during network construction. This combination of factors makes it a suitable tool for large-scale analysis of scientific literature and prediction of emerging research areas.
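
In practice, triple extraction with a model such as DeepSeek-V3 amounts to a structured prompt sent through its API. The sketch below assumes DeepSeek’s OpenAI-compatible endpoint and the deepseek-chat model name; the exact prompt wording and output parsing used in the reviewed work are not specified here.

```python
import json
from openai import OpenAI

# Hedged sketch of prompting an LLM for SAO triples. The base_url and model
# name follow DeepSeek's publicly documented OpenAI-compatible API and may
# differ from the paper's exact setup.
client = OpenAI(api_key="YOUR_KEY", base_url="https://api.deepseek.com")

PROMPT = (
    "Extract subject-action-object triples from the abstract below. "
    "Return only a JSON list of [subject, action, object] triples.\n\n{abstract}"
)

def extract_triples(abstract: str):
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "user", "content": PROMPT.format(abstract=abstract)}],
        temperature=0.0,
    )
    return json.loads(response.choices[0].message.content)
```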

The Engine Under the Hood: LLMs and Their Tricks

Large Language Models (LLMs) achieve advanced natural language processing through pre-training on extremely large and diverse datasets, often encompassing trillions of tokens. This process allows the models to learn statistical relationships between words and phrases, enabling capabilities such as text completion, summarization, and translation. The scale of these corpora is critical; GPT-3, for example, drew on a raw corpus of roughly 45 terabytes of Common Crawl text, filtered to several hundred gigabytes for training, far exceeding the datasets used for prior models. This pre-training establishes a broad base of linguistic knowledge, which can then be adapted to specific tasks through fine-tuning or prompt engineering, resulting in unprecedented performance on a variety of natural language understanding and generation benchmarks.

Fine-tuning and Retrieval-Augmented Generation (RAG) are critical techniques for improving the reliability of Large Language Models (LLMs) in scientific applications. Fine-tuning involves further training a pre-trained LLM on a smaller, domain-specific dataset, adapting the model’s parameters to enhance performance on targeted scientific tasks. RAG addresses the issue of “hallucinations” (the generation of factually incorrect or unsupported statements) by allowing the LLM to access and incorporate information from external knowledge sources during inference. This process involves retrieving relevant documents or data based on the input query and using this retrieved context to inform the LLM’s response, thereby grounding its output in verifiable evidence and increasing accuracy in complex scientific reasoning.
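
A minimal sketch of the retrieval step is shown below, using TF-IDF similarity as a stand-in for a dense retriever and leaving the generation call abstract; the corpus and query are invented examples.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Minimal retrieval-augmented generation loop: retrieve the most relevant
# passages for a query, then ground the LLM prompt in them. TF-IDF stands
# in for a dense retriever; the actual generation call is left abstract.
corpus = [
    "AlphaFold predicts protein structures from amino-acid sequences.",
    "Citation networks reveal the diffusion of scientific ideas.",
    "Retrieval grounding reduces hallucination in language models.",
]

def retrieve(query, k=2):
    vectorizer = TfidfVectorizer().fit(corpus + [query])
    doc_vecs = vectorizer.transform(corpus)
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, doc_vecs)[0]
    top = scores.argsort()[::-1][:k]
    return [corpus[i] for i in top]

def build_prompt(query):
    context = "\n".join(f"- {p}" for p in retrieve(query))
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")

print(build_prompt("How does retrieval help language models?"))
```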

Prompt engineering and tool learning significantly extend the analytical capabilities of Large Language Models (LLMs) beyond their pre-training data. By crafting specific prompts, users can guide LLMs to leverage external resources, such as databases, APIs, and specialized software, to perform complex tasks and access real-time information. This interaction is facilitated through tool learning, where LLMs are trained to utilize these external tools effectively. The underlying scaling of LLMs is illustrated by the jump from GPT-2 to GPT-3, which used over a hundred times more parameters and roughly an order of magnitude more training data than its predecessor, resulting in improved performance and expanded capabilities in scientific reasoning and data analysis.
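
The sketch below illustrates the basic tool-learning loop in a library-agnostic form: the model is told which tools it may call, emits a JSON tool call, and receives the result back before answering. The tool names and call protocol are invented for illustration.

```python
import json

# Toy tool-learning loop: the model is told which tools exist, emits a JSON
# tool call, and the program executes it and returns the result for a final
# answer. llm() is a stand-in for any chat-completion call.
TOOLS = {
    "search_papers": lambda query: [f"(stub) top result for '{query}'"],
}

SYSTEM = (
    "You may call tools by replying with JSON: "
    '{"tool": "search_papers", "arguments": {"query": "..."}}'
)

def run_agent(llm, question):
    reply = llm(SYSTEM, question)
    try:
        call = json.loads(reply)
        result = TOOLS[call["tool"]](**call["arguments"])
        # Feed the tool output back so the model can ground its answer.
        return llm(SYSTEM, f"{question}\nTool result: {result}")
    except (json.JSONDecodeError, KeyError, TypeError):
        return reply  # the model answered directly, no tool needed
```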

The Future Isn’t About Doing Science, It’s About Seeing It

Recent advancements demonstrate that artificially intelligent agents, fueled by large language models and complex multilayer networks, are now capable of independently analyzing vast quantities of scientific literature to pinpoint burgeoning areas of research. These agents don’t simply search for keywords; they assess the relationships between concepts, track citation patterns, and identify shifts in research focus, effectively mapping the evolving landscape of scientific inquiry. By autonomously evaluating publications, these systems can detect emerging research fronts – topics gaining traction and potentially representing future breakthroughs – with a speed and scale previously unattainable. This capability promises to accelerate scientific discovery by allowing researchers to quickly orient themselves within a field, identify knowledge gaps, and prioritize investigations into the most promising avenues of exploration, ultimately augmenting human intellect in the pursuit of knowledge.

Scientific understanding increasingly relies on navigating vast and interconnected datasets, a task for which Knowledge Graphs are proving invaluable. These graphs don’t simply store information; they represent knowledge as a network of entities – concepts, genes, proteins, diseases – and the relationships between them. This structured approach moves beyond traditional keyword searches, allowing for complex queries and inferences. For example, a researcher can identify not only papers mentioning a specific gene, but also those exploring proteins that interact with it, or diseases affected by its expression – a level of nuanced retrieval previously unattainable. The result is dramatically improved efficiency in literature review, hypothesis generation, and the identification of previously hidden connections within the scientific landscape, ultimately accelerating the pace of discovery and fostering a more holistic understanding of complex systems.
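
The gene-protein-disease traversal described above corresponds to a simple multi-hop query over such a graph. The sketch below uses networkx with placeholder entities; real biomedical knowledge graphs draw on curated ontologies and far richer relation sets.

```python
import networkx as nx

# Sketch of the multi-hop query described above: starting from a gene,
# find interacting proteins and downstream diseases. Entities and edges
# are illustrative placeholders, not real biomedical data.
kg = nx.DiGraph()
kg.add_edge("GENE_X", "PROTEIN_A", relation="encodes")
kg.add_edge("PROTEIN_A", "PROTEIN_B", relation="interacts_with")
kg.add_edge("PROTEIN_B", "DISEASE_Y", relation="implicated_in")

def neighbors_by_relation(graph, node, relation):
    return [v for _, v, d in graph.out_edges(node, data=True)
            if d["relation"] == relation]

# Two-hop traversal: gene -> encoded protein -> interaction partners.
for protein in neighbors_by_relation(kg, "GENE_X", "encodes"):
    partners = neighbors_by_relation(kg, protein, "interacts_with")
    print(protein, "interacts with", partners)
```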

Within the architecture of these AI agents, BERT Models and Graph Convolutional Networks collaboratively refine the process of scientific discovery. BERT, leveraging its transformer-based understanding of language, excels at discerning semantic relationships within research papers, enabling accurate citation recommendation by identifying papers with conceptually similar content. Simultaneously, Graph Convolutional Networks operate on the network of citations themselves – treating scientific literature as nodes connected by citations – to reveal broader research characteristics and identify influential papers or emerging trends. This synergistic approach allows the agents not merely to suggest relevant papers, but to map the intellectual landscape of a field, uncovering hidden connections and accelerating the pace of knowledge synthesis. The combined power of these models effectively transforms citation analysis from a simple bibliographic exercise into a dynamic exploration of scientific thought, facilitating a deeper understanding of research fronts and promising avenues for future investigation.
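
A toy version of this pairing is sketched below: sentence-transformer embeddings of paper titles (standing in for a BERT or SciBERT encoder) are smoothed with one normalized GCN-style propagation step over a small citation adjacency matrix, then used for similarity-based citation recommendation. The encoder choice, titles, and adjacency are all illustrative assumptions rather than the paper’s actual configuration.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Sketch: text embeddings smoothed over a citation graph with one
# normalized GCN-style propagation step (no learned weights), then used
# for citation recommendation by cosine similarity.
titles = [
    "Attention is all you need",
    "BERT: pre-training of deep bidirectional transformers",
    "Graph convolutional networks for citation networks",
]
encoder = SentenceTransformer("all-MiniLM-L6-v2")
H = encoder.encode(titles)                        # node features (n x d)

A = np.array([[0, 1, 0],                          # toy citation adjacency
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
A_hat = A + np.eye(len(titles))                   # add self-loops
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
H_prop = D_inv_sqrt @ A_hat @ D_inv_sqrt @ H      # one propagation step

def recommend(query_idx, k=1):
    sims = H_prop @ H_prop[query_idx]
    sims = sims / (np.linalg.norm(H_prop, axis=1)
                   * np.linalg.norm(H_prop[query_idx]))
    order = np.argsort(-sims)
    return [titles[i] for i in order if i != query_idx][:k]

print(recommend(0))
```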

The pursuit of automating scientific understanding with Large Language Models, as detailed in this paper, feels…familiar. It’s a predictably optimistic cycle. This eagerness to build elegant systems for forecasting and evaluation recalls countless other ‘revolutionary’ frameworks. One anticipates the inevitable accumulation of technical debt. As Brian Kernighan famously stated, “Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not going to be able to debug it.” The ambition to build AI agents to navigate the scientific landscape is admirable, but one suspects production realities (the messy, unpredictable nature of actual research) will rapidly expose the limitations of even the most sophisticated models. It’s the same mess, just more expensive, really. The paper speaks of knowledge graphs and LLM-based evaluation; one imagines future digital archaeologists meticulously tracing the provenance of these automated insights, wondering what seemed so clever at the time.

What’s Next?

The application of Large Language Models to Science of Science undoubtedly offers a compelling set of new instruments. However, the field appears poised to rediscover a familiar pattern: the initial elegance of automated analysis yielding to the messy realities of production data. The promise of ‘scientific forecasting’ rings with the echoes of countless prior attempts, all ultimately constrained by the limitations of the data itself – and the enduring human capacity to confound predictions. One anticipates a rapid proliferation of LLM-based metrics, followed by the inevitable struggle to interpret, validate, and ultimately, trust them.

A crucial next step involves a rigorous examination of the biases embedded within both the LLMs and the scientific literature they analyze. The current focus on surface-level correlations risks obscuring deeper, more meaningful relationships. Simply put, if all tests pass, it is likely because they test nothing of consequence. The challenge lies not merely in automating existing workflows, but in formulating genuinely new questions that these models can address – questions that go beyond simply identifying trends in citation counts or keyword frequencies.

Ultimately, the true test of this approach will not be its ability to predict the next Nobel laureate, but its capacity to expose the inherent limitations of its own predictions. The field should brace for a period of disillusionment, followed by a necessary, and likely painful, recalibration. The tools are intriguing, certainly, but the fundamental problems of understanding science remain stubbornly resistant to automation.


Original article: https://arxiv.org/pdf/2511.15370.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
