Author: Denis Avetisyan
The rapid development of powerful language models is opening new avenues for accelerating materials discovery and design.
This review assesses the growing role of large language models – and the critical need for open-source approaches – in automating data extraction, predictive modeling, and complex workflows within materials science.
Despite the accelerating pace of materials discovery, efficiently extracting knowledge from the rapidly expanding scientific literature remains a significant bottleneck. This review, ‘Large language models in materials science and the need for open-source approaches’, examines the burgeoning role of large language models (LLMs) in automating tasks across the materials science pipeline, from data mining to predictive modeling and experimental system coordination. Our analysis demonstrates that open-source LLMs can achieve comparable performance to closed-source alternatives, offering crucial benefits in transparency, reproducibility, and cost. As these models mature, will open-source platforms become the foundation for a more accessible and collaborative future in materials innovation?
The Inevitable Cascade: From Experiment to Data
Historically, the development of new materials has been a remarkably iterative and often serendipitous process. Researchers typically navigate an immense chemical space – a virtually infinite number of possible material compositions and structures – relying heavily on physical experimentation and painstaking trial-and-error. This approach is intrinsically limited by the sheer scale of possibilities; synthesizing and characterizing even a tiny fraction of these potential materials is exceptionally time-consuming and resource-intensive. Furthermore, the synthesis of many advanced materials involves complex procedures, sensitive to subtle variations in conditions, adding layers of difficulty to the already challenging task of discovering compounds with desired properties. This traditional methodology, while foundational, struggles to keep pace with the accelerating demand for novel materials tailored to increasingly specific technological applications.
The advent of Large Language Models represents a fundamental change in materials science, moving beyond traditional, painstakingly slow experimental approaches. These models don’t simply calculate; they learn from the vast and often unstructured data contained within the scientific literature – research papers, patents, and technical reports. By processing this information, LLMs can identify relationships between material composition, synthesis procedures, and resulting properties with remarkable speed and accuracy. This predictive capability allows researchers to virtually screen countless material combinations, drastically reducing the time and resources needed to discover novel substances with desired characteristics. Rather than relying on intuition or exhaustive experimentation, scientists can now leverage LLMs to guide their research, focusing on the most promising candidates and accelerating the pace of materials innovation.
Recent advances in large language models have yielded a remarkable capacity for automated data processing within materials science, specifically in the extraction of synthesis conditions from scientific literature. Current models demonstrate up to 90% accuracy in identifying the precise parameters—such as temperature, pressure, reaction time, and precursor materials—necessary to create specific materials. This level of precision signifies a substantial leap forward, effectively automating a previously laborious and time-consuming process for researchers. By accurately decoding complex experimental details from published papers, these models drastically reduce the need for manual data curation, enabling faster iteration and accelerating the discovery of novel materials with tailored properties. The ability to reliably extract and organize this crucial information unlocks new possibilities for computational materials design and predictive modeling.
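As a rough illustration of this kind of extraction, the sketch below sends a synthesis paragraph to an open-source model served behind an OpenAI-compatible endpoint (for example, Llama 3 hosted with vLLM) and asks for the conditions as JSON. The endpoint, model name, prompt, and field names are illustrative assumptions, not the pipelines benchmarked in the review.

```python
import json
from openai import OpenAI  # standard OpenAI client, pointed at a local server

# Assumption: an open-source model (e.g. Llama 3) is served locally behind an
# OpenAI-compatible API such as vLLM; URL, key, and model name are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

PROMPT = (
    "Extract the synthesis conditions from the paragraph below. "
    "Return only JSON with the keys: precursors, solvent, temperature_C, "
    "time_h, atmosphere. Use null for anything not stated.\n\n{text}"
)

def extract_conditions(paragraph: str) -> dict:
    """Ask the model for synthesis parameters and parse the JSON reply."""
    resp = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder checkpoint
        messages=[{"role": "user", "content": PROMPT.format(text=paragraph)}],
        temperature=0,  # deterministic output helps downstream parsing
    )
    return json.loads(resp.choices[0].message.content)

if __name__ == "__main__":
    text = ("ZIF-8 was synthesized by dissolving zinc nitrate hexahydrate and "
            "2-methylimidazole in methanol and stirring at 25 °C for 24 h.")
    print(extract_conditions(text))
```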
The landscape of materials discovery is rapidly evolving with the advent of openly available Large Language Models. Previously, access to such sophisticated predictive tools was largely confined to well-funded institutions, creating a bottleneck in scientific progress. However, initiatives like Meta’s Llama 3 and Alibaba’s Qwen have broken down these barriers, offering powerful LLMs under permissive licenses. This democratization empowers researchers worldwide – from university labs to smaller companies – to use these models to analyze the vast scientific literature and predict material properties. Consequently, the pace of materials innovation is accelerating, as a broader community gains the capacity to design, synthesize, and optimize materials with unprecedented efficiency, fostering collaboration and potentially unlocking solutions to critical challenges in energy, medicine, and beyond.
Automating the Inevitable: Systems Mimic the Scientist
MOF-ChemUnity and ReactionSeek exemplify the integration of Large Language Models (LLMs) with domain-specific tools to automate materials science data handling. MOF-ChemUnity focuses on extracting and structuring data related to Metal-Organic Frameworks (MOFs) from diverse sources, while ReactionSeek specializes in chemical reaction information. These systems utilize LLMs not as standalone solutions, but in conjunction with tools designed for tasks like optical character recognition (OCR) for image-based data, chemical structure recognition, and data normalization. This combined approach overcomes the limitations of LLMs in accurately interpreting complex scientific data and enables automated extraction of key parameters, reaction conditions, and material properties from publications, patents, and experimental datasets. The result is a significant increase in efficiency for materials discovery and analysis.
Current scientific workflows are increasingly utilizing multimodal Large Language Models (LLMs), notably GLM-4V, to automate the interpretation of complex visual data. Specifically, GLM-4V has demonstrated a 91.5% accuracy rate in correctly interpreting reaction scheme images, a critical task in chemical research. This capability allows for the automated extraction of information regarding reactants, products, catalysts, and reaction conditions directly from visual representations, eliminating the need for manual data entry and significantly accelerating the analysis of scientific literature and experimental data. The model’s performance is based on its ability to process both visual and textual information simultaneously, enabling a more nuanced and accurate understanding of the depicted chemical processes.
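A hedged sketch of how such a multimodal query can look in practice: the reaction-scheme image is base64-encoded and sent, together with an extraction instruction, to a vision-language model behind an OpenAI-compatible endpoint, reusing the same client pattern as the earlier extraction sketch. The served model name, endpoint, and requested fields are assumptions for illustration; GLM-4V’s own SDK or a Hugging Face checkpoint could be substituted.

```python
import base64
from openai import OpenAI

# Assumption: a multimodal model (e.g. a GLM-4V-class checkpoint) is exposed
# through an OpenAI-compatible endpoint; all names here are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def describe_reaction_scheme(image_path: str) -> str:
    """Send a reaction-scheme image and ask for reactants, products, and conditions."""
    with open(image_path, "rb") as fh:
        b64 = base64.b64encode(fh.read()).decode()
    resp = client.chat.completions.create(
        model="glm-4v",  # placeholder model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "List the reactants, products, catalysts, and "
                         "reaction conditions shown in this scheme."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        temperature=0,
    )
    return resp.choices[0].message.content
```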
Sequence-Aware Extraction methods improve data processing by recognizing the chronological order of experimental steps, crucial for accurately interpreting synthetic procedures. This is coupled with the implementation of Material String formats, which are standardized, machine-readable representations of chemical compounds and materials. These formats utilize specific delimiters and identifiers to define reactants, products, catalysts, and reaction conditions, allowing automated systems to parse and interpret experimental details with greater precision. The combination enables the capture of not just what materials were used, but how they were used in a specific sequence, facilitating the reconstruction of experimental protocols and enabling more effective data analysis and knowledge discovery.
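To make the idea concrete, the sketch below shows one possible (entirely hypothetical) material-string convention and a parser that preserves step order; the actual delimiters and identifiers used by sequence-aware extraction pipelines will differ.

```python
from dataclasses import dataclass

# Hypothetical material-string convention, for illustration only:
# "<step index>|<role>|<material>|<amount>|<detail>", steps separated by ";".
EXAMPLE = (
    "1|reactant|Zn(NO3)2·6H2O|0.29 g|dissolved in MeOH;"
    "2|reactant|2-methylimidazole|0.65 g|dissolved in MeOH;"
    "3|condition|stirring|-|25 C for 24 h;"
    "4|product|ZIF-8|-|collected by centrifugation"
)

@dataclass
class Step:
    index: int      # chronological position in the procedure
    role: str       # reactant, condition, product, ...
    material: str   # material or operation name
    amount: str     # quantity, "-" if not applicable
    detail: str     # free-text condition or note

def parse_material_string(s: str) -> list[Step]:
    """Split a material string into ordered synthesis steps."""
    steps = []
    for record in s.split(";"):
        idx, role, material, amount, detail = record.split("|")
        steps.append(Step(int(idx), role, material, amount, detail))
    return sorted(steps, key=lambda st: st.index)  # keep chronological order

for step in parse_material_string(EXAMPLE):
    print(step)
```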
Automated scientific workflows, beyond simple data handling, establish relationships between disparate pieces of information to enable knowledge discovery. These pipelines utilize extracted data – including chemical structures, reaction conditions, and material properties – to identify patterns and correlations that might not be apparent through manual analysis. By computationally connecting concepts, researchers can formulate new hypotheses, predict material behaviors, and accelerate the iterative process of scientific investigation. This active connection of information facilitates a shift from data accumulation to knowledge generation, ultimately reducing the time required for breakthroughs in materials science and chemistry.
The Ghost in the Machine: Autonomous Experimentation Emerges
Recent advancements in artificial intelligence have led to the development of autonomous research agents, exemplified by systems like MOFGen and ChemAgents. These agents leverage Large Language Models (LLMs) to perform tasks traditionally requiring human scientists, including hypothesis generation, experimental design, and data interpretation. Unlike prior automated systems focused on specific tasks, these LLM-driven agents exhibit a degree of generalizability, allowing them to address a broader range of scientific problems with minimal manual intervention. The core functionality relies on the LLM’s ability to process and synthesize information from scientific literature, databases, and experimental data to guide the research process, effectively functioning as a virtual scientist capable of iterative experimentation and analysis.
Autonomous research agents, powered by large language models (LLMs), operate by iteratively formulating testable hypotheses based on existing scientific literature and data. These agents then design experiments – specifying materials, procedures, and measurement parameters – without requiring human direction. Following data acquisition, the agents interpret results, assess the validity of the initial hypothesis, and refine subsequent experimental designs. This closed-loop system minimizes the need for manual intervention at each stage of the scientific process, accelerating discovery and reducing human effort in traditionally labor-intensive research areas. The ability to independently cycle through these phases represents a fundamental shift toward automated scientific investigation.
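The skeleton below sketches such a closed loop under loose assumptions: one component proposes a hypothesis, another turns it into a plan, a lab (or simulator) executes it, and the interpreted result feeds the next round. Every function here is a trivial stand-in for an LLM or robotic call, not the architecture of MOFGen, ChemAgents, or any other published agent.

```python
import random

# Minimal closed-loop sketch; every component is a stand-in.

def propose_hypothesis(history):
    """Stand-in for an LLM call that proposes a testable synthesis hypothesis."""
    temp = random.choice([400, 500, 600, 700])
    return {"claim": f"annealing at {temp} C improves crystallinity", "temp_C": temp}

def design_experiment(hypothesis):
    """Stand-in for an LLM call that turns a hypothesis into a concrete plan."""
    return {"temp_C": hypothesis["temp_C"], "time_h": 2, "atmosphere": "Ar"}

def run_experiment(plan):
    """Stand-in for a robotic platform or simulator; returns a measured score."""
    return {"crystallinity": 1.0 - abs(plan["temp_C"] - 600) / 600}

def interpret(hypothesis, data):
    """Stand-in for LLM-based interpretation of the measurement."""
    return {"supported": data["crystallinity"] > 0.8}

def autonomous_loop(n_rounds=10):
    history = []
    for _ in range(n_rounds):
        hypothesis = propose_hypothesis(history)
        plan = design_experiment(hypothesis)
        data = run_experiment(plan)
        verdict = interpret(hypothesis, data)
        history.append((hypothesis, plan, data, verdict))
        if verdict["supported"]:  # stop once a hypothesis is supported
            break
    return history

print(autonomous_loop())
```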
L2M3 represents a system that integrates Large Language Models (LLMs) with Bayesian Optimization to enhance the prediction of optimal synthesis conditions. This combined approach allows for more efficient exploration of chemical spaces than relying on LLMs alone. Performance metrics indicate L2M3 achieves similarity scores of 82% when its predictions are compared against those generated by GPT-3.5-turbo and GPT-4o, demonstrating a comparable level of predictive capability while leveraging the benefits of Bayesian Optimization for targeted experimentation.
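A rough sketch of the general pattern (LLM-suggested starting conditions seeding a Bayesian optimizer) using scikit-optimize; the objective, search space, and seed points are invented for illustration and do not reproduce the L2M3 implementation.

```python
from skopt import Optimizer
from skopt.space import Real

# Toy search space over two synthesis parameters; bounds are illustrative.
space = [Real(300, 900, name="temperature_C"), Real(1, 48, name="time_h")]
opt = Optimizer(space, base_estimator="GP", random_state=0)

def measure_yield(temperature_c, time_h):
    """Placeholder for a real experiment or simulation; returns a yield in [0, 1]."""
    return max(0.0, 1.0 - abs(temperature_c - 650) / 650 - abs(time_h - 12) / 48)

# Hypothetical LLM-suggested starting conditions used to seed the optimizer.
llm_seeds = [(620.0, 10.0), (700.0, 24.0)]
for t, h in llm_seeds:
    opt.tell([t, h], -measure_yield(t, h))  # skopt minimizes, so negate the yield

# Standard ask/tell loop refining the LLM suggestions with Bayesian optimization.
for _ in range(15):
    t, h = opt.ask()
    opt.tell([t, h], -measure_yield(t, h))

best_value, best_point = min(zip(opt.yi, opt.Xi))
print("best yield:", -best_value, "at", best_point)
```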
Coscientist demonstrates the feasibility of fully autonomous scientific experimentation through its capacity to independently design, plan, and execute complex investigations. Utilizing finetuned models, Coscientist achieves 94.8% accuracy in predicting hydrogen storage performance. This represents a substantial improvement of 46.7% over traditional methods that rely solely on precursor names for prediction, indicating a significant advancement in the automation of materials discovery and optimization processes.
The Inevitable Diffusion: Democratizing Access and Collaboration
Low-Rank Adaptation (LoRA) presents a transformative approach to applying large language models (LLMs) to materials science, circumventing the prohibitive costs associated with full model retraining. Rather than adjusting all of an LLM’s parameters – a computationally intensive undertaking – LoRA introduces a smaller set of trainable parameters, known as low-rank matrices, that are added to the existing model. This technique significantly reduces the number of parameters needing optimization, accelerating the fine-tuning process and diminishing the required computational resources. Consequently, researchers can efficiently adapt powerful LLMs to specialized materials science tasks, such as predicting material properties or suggesting synthesis conditions, without necessitating access to extensive computing infrastructure. The resulting models maintain strong performance while offering a pathway to democratize AI-driven materials discovery, enabling broader participation and faster innovation cycles within the scientific community.
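In practice this often looks like the sketch below, which wraps an open-weight model with Hugging Face’s PEFT library; the base checkpoint, target modules, and rank are illustrative choices, not a recipe taken from the reviewed work.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Assumption: any open-weight causal LM works here; the checkpoint is a placeholder.
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)  # needed later for fine-tuning data
base_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# LoRA adds small trainable low-rank matrices to the attention projections;
# the rank, alpha, and target modules below are typical but illustrative values.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the full model

# The wrapped model can now be fine-tuned on a materials-science corpus with any
# standard trainer at a fraction of the cost of full fine-tuning.
```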
The advancement of materials science increasingly relies on a foundation of open collaboration, powerfully enabled by platforms like Zenodo. This repository serves as a central hub for researchers to not only archive and share crucial datasets and code, but also to actively build upon existing work, circumventing redundant efforts and accelerating discovery. By openly disseminating the tools and information generated through computational studies, Zenodo fosters a cycle of iterative improvement; new models and algorithms can be readily tested, validated, and refined by the broader scientific community. This collaborative ecosystem ensures transparency, reproducibility, and ultimately, a more efficient pathway to developing novel materials with tailored properties, moving the field beyond isolated research endeavors towards a collective pursuit of innovation.
The landscape of materials science is undergoing a significant shift as increasingly accessible, open-source large language models (LLMs) like Zhipu AI GLM and Meta Llama 3 democratize access to advanced computational capabilities. These models are not merely theoretical tools; finetuned iterations are demonstrating remarkable proficiency in practical applications, achieving up to 98.6% accuracy in predicting the synthesizability of novel materials. Importantly, this high level of performance is maintained even when analyzing complex molecular structures—averaging 97.8% accuracy for compounds containing up to 275 atoms—indicating a robust capability for tackling real-world challenges. This expansion of readily available, high-performing LLMs allows a broader spectrum of researchers to engage in AI-driven materials discovery, fostering innovation beyond the limitations of specialized computational resources and expertise.
The rapid advancement of materials science is increasingly fueled by a collaborative, open-source approach to artificial intelligence. Recent studies demonstrate that freely available large language models, such as finetuned versions of GLM-4.5-Air and Qwen3, are achieving performance levels comparable to proprietary models like GPT-4o in crucial tasks such as recommending optimal synthesis conditions. This parity in capability, achieved through shared code and data, dramatically broadens participation in materials discovery, allowing researchers with limited computational resources to contribute meaningfully to scientific progress. Such open access not only accelerates the pace of innovation but also fosters a more inclusive research environment, essential for tackling complex global challenges that demand diverse perspectives and collective expertise.
The pursuit of automating materials discovery, as detailed in the review, echoes a fundamental truth about complex systems: order is merely a transient state. The article highlights how Large Language Models offer a path toward predicting material properties and extracting knowledge from vast datasets, but this isn’t about building understanding, it’s about cultivating it. As Ada Lovelace observed, “The Analytical Engine has no pretensions whatever to originate anything.” These models don’t create knowledge; they reveal patterns already inherent in the data, much like tending a garden reveals the potential within the seeds. The reliance on open-source LLMs acknowledges that robust ecosystems, not monolithic structures, are the true survivors in the long run. There are no best practices—only survivors.
The Garden Takes Root
The promise of these large language models in materials science isn’t about building a better engine for prediction, but cultivating a more responsive ecosystem for discovery. The work suggests a shift – not from experiment to simulation, but from directed research to guided exploration. Yet, a dependence on closed, proprietary models feels less like progress and more like transplanting a delicate shoot into concrete. The true leverage won’t arrive until the garden is open, allowing for cross-pollination and adaptation by those who tend it.
Current limitations aren’t technical inconveniences to be solved, but inherent characteristics of any complex system. The models reflect the biases of their training data, mirroring our own incomplete understanding of material relationships. Resilience lies not in isolation – in building models that ‘know’ everything – but in forgiveness between components, in systems that gracefully degrade and signal their uncertainties. A perfect prediction is a brittle thing; a helpful suggestion, robust.
The future isn’t about automating the scientist, but about augmenting their intuition. It isn’t about eliminating the need for careful experimentation, but about creating tools that whisper possibilities, highlighting the most promising paths through the vast landscape of material combinations. A system isn’t a machine; it’s a garden – neglect it, and you’ll grow technical debt. The task now is to nurture this emerging ecosystem, ensuring it flourishes with openness and adaptability.
Original article: https://arxiv.org/pdf/2511.10673.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/