AI Learns to Spot Good Science

Author: Denis Avetisyan


New research shows artificial intelligence can develop a surprisingly accurate sense of ‘scientific taste’ by analyzing decades of published research.

Training large language models on historical publication data allows them to evaluate research proposals with greater accuracy than both current AI and human experts.

Despite rapid advances in artificial intelligence, the capacity to discern promising research ideas – a skill central to scientific progress – has remained elusive. In ‘Machines acquire scientific taste from institutional traces’, we demonstrate that large language models can learn to approximate this ‘scientific taste’ by being trained on the historical record of publication decisions. These fine-tuned models surpass both state-of-the-art AI and human expert evaluations in identifying high-quality research proposals, achieving up to 70% accuracy. Does this suggest that the key to unlocking AI-driven scientific discovery lies not in complex reasoning, but in effectively mining the accumulated wisdom embedded within existing institutional data?


The Elusive Signal of True Innovation

Evaluating research in its nascent stages presents a unique challenge, as objective metrics often fall short of predicting long-term impact. Instead, assessment frequently relies on what’s been termed ā€˜scientific taste’ – a complex interplay of pattern recognition, intuition, and accumulated expertise within a field. This isn’t simply a matter of personal preference; rather, it’s a highly refined ability to discern subtle qualities like originality, conceptual elegance, and potential for broader implications, characteristics that are difficult to quantify. Consequently, the acceptance or rejection of early-stage work can be significantly influenced by the subjective judgements of those with established reputations, highlighting the need to better understand and, potentially, refine this critical evaluative process to foster genuine innovation.

The established process of peer review, despite its crucial role in validating scientific findings, demonstrably hinders the pace of discovery. Beyond the considerable financial costs and time delays inherent in coordinating expert evaluations, the system is susceptible to inherent biases – favoring established researchers, popular methodologies, or even positive results. This creates a bottleneck, where promising but unconventional ideas can be overlooked or dismissed, and incremental advances are prioritized over potentially disruptive innovations. Consequently, the translation of basic research into practical applications is often slowed, and the full spectrum of scientific potential remains unrealized, necessitating exploration of alternative evaluation mechanisms to complement – and potentially accelerate – the existing framework.

The evaluation of scientific proposals and findings often hinges on a set of unstated principles – a body of expertise frequently referred to as ‘dark knowledge’. This isn’t necessarily malicious withholding of information, but rather the accumulation of subtle cues, contextual understandings, and ingrained expectations that seasoned researchers employ when assessing novelty and potential. Because these criteria are rarely explicitly defined or taught, they remain largely tacit, making it exceptionally difficult to replicate evaluation outcomes or transmit this evaluative skill to emerging scientists. Consequently, judgements about research quality can vary significantly even amongst experts, introducing inconsistency and potentially hindering the progress of genuinely promising, yet unconventional, ideas. Capturing and formalizing these implicit standards represents a substantial challenge, but one vital for fostering more transparent and efficient scientific assessment.

The advancement of science hinges not only on generating novel ideas, but also on the ability to accurately assess their potential – a process currently reliant on largely unspoken criteria and subjective judgment. Capturing the nuances of this evaluative process is therefore paramount; a transparent understanding of how impactful research is distinguished from merely interesting work could dramatically accelerate scientific progress. Efforts to articulate these ‘dark knowledge’ principles – the implicit understandings guiding acceptance or rejection – promise to reduce bottlenecks in innovation by allowing for more efficient allocation of resources and faster identification of truly groundbreaking concepts. This pursuit necessitates moving beyond traditional, opaque peer review towards more formalized, replicable methods of evaluation, ultimately fostering a more dynamic and productive scientific ecosystem.

Modeling Expertise: From Intuition to Algorithm

Supervised fine-tuning was utilized to train large language models by exposing them to a dataset comprised of historical publication decisions, which are referred to as ‘institutional traces’. This process involved presenting the models with examples of research submissions alongside corresponding acceptance or rejection outcomes. The models then adjusted their internal parameters to minimize the difference between their predicted outcomes and the documented historical decisions. This approach leverages the cumulative judgment of institutions – represented by past publication records – to shape the models’ evaluation criteria and align their behavior with established scientific standards. The resulting models are thereby conditioned on the patterns and preferences embedded within this historical data.
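
To make the training setup concrete, here is a minimal sketch of how such ‘institutional traces’ might be converted into supervised fine-tuning examples. The record fields, prompt wording, and file layout are illustrative assumptions, not the paper’s actual schema.

```python
import json

# Illustrative 'institutional traces': pitch text paired with the
# institution's historical accept/reject decision (assumed schema).
decisions = [
    {"pitch": "We propose a field experiment on remote-work productivity...",
     "accepted": True},
    {"pitch": "We survey managers about their preferred meeting formats...",
     "accepted": False},
]

# Convert each decision into a chat-style fine-tuning example.
with open("sft_train.jsonl", "w") as f:
    for record in decisions:
        example = {
            "messages": [
                {"role": "system",
                 "content": "You are a journal editor. Answer ACCEPT or REJECT."},
                {"role": "user", "content": record["pitch"]},
                {"role": "assistant",
                 "content": "ACCEPT" if record["accepted"] else "REJECT"},
            ]
        }
        f.write(json.dumps(example) + "\n")
```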

Large language models, specifically GPT-4.1 and Qwen3-30B, were subjected to supervised fine-tuning utilizing a dataset of research proposals condensed into brief summaries, termed ‘research pitches’. The training process involved presenting these models with the pitch text as input and requiring them to predict a binary outcome: acceptance or rejection, mirroring the decision made by the original reviewing institution. This supervised learning approach allowed the models to learn the patterns and characteristics associated with successful and unsuccessful proposals as determined by historical data, effectively modeling the criteria used in the evaluation process. The pitch length was standardized to provide a consistent input format for training and evaluation.
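
At inference time, the fine-tuned model is simply asked for a verdict on a new pitch. The sketch below assumes an OpenAI-style fine-tune; the model identifier is a placeholder, and the prompt format mirrors the training sketch above.

```python
from openai import OpenAI

client = OpenAI()

def predict(pitch: str) -> str:
    """Return the fine-tuned model's ACCEPT/REJECT verdict for a pitch."""
    response = client.chat.completions.create(
        model="ft:gpt-4.1:org:taste:abc123",  # hypothetical fine-tune ID
        messages=[
            {"role": "system",
             "content": "You are a journal editor. Answer ACCEPT or REJECT."},
            {"role": "user", "content": pitch},
        ],
        temperature=0.0,  # deterministic verdict
    )
    return response.choices[0].message.content.strip()
```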

The methodology seeks to represent the multifaceted criteria used in scientific evaluation as a measurable and replicable process. Historically, assessment of research proposals has relied on expert judgment, which is inherently subjective and difficult to standardize. By training large language models on a dataset of previously accepted and rejected research pitches, the system attempts to identify and quantify the patterns indicative of successful proposals. This allows for consistent application of evaluation criteria and enables independent verification of results, moving beyond reliance on individual reviewer biases and providing a data-driven approach to assessing scientific merit.

Supervised fine-tuning resulted in models achieving an accuracy of 55.0 to 59.2% when evaluating research pitches, i.e., the proportion of prior acceptance or rejection decisions predicted correctly. This performance significantly surpasses that of frontier language models, which demonstrated an accuracy of 31.1% on the same evaluation task. The observed difference indicates the fine-tuned models effectively learned patterns from the historical record of publication decisions, allowing for improved predictive capability regarding research pitch assessment.

A Rigorous Test: Predictive Power in Practice

The evaluation benchmark employs a held-out dataset comprised of research pitches sourced from a variety of academic disciplines. These pitches are subjected to dual assessment: evaluation by human experts with established domain knowledge, and prediction by the fine-tuned language models under examination. This parallel evaluation allows for a quantitative comparison of model performance against established human judgment, providing a rigorous measure of predictive capability and identifying specific areas where model strengths and weaknesses manifest. The dataset is specifically reserved for evaluation purposes, ensuring that the models have not been trained on the same data used for testing, thereby preventing inflated performance metrics.

Model performance was assessed by directly comparing predictions generated by the fine-tuned language models to evaluations provided by human experts on a held-out dataset of research pitches. This comparison involved quantifying the degree of alignment between model outputs and expert judgments, allowing for a granular analysis of both successes and failures. Specifically, areas where models consistently agreed with experts were identified as strengths, while instances of disagreement highlighted weaknesses requiring further investigation and potential refinement of the model or training data. This process enabled a detailed understanding of the model’s predictive capabilities and limitations relative to human evaluation standards.
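
A sketch of that comparison, with toy data standing in for the held-out pitches: model and expert verdicts are scored against the historical ground truth, and their disagreements are tallied for inspection. All values are illustrative.

```python
from collections import Counter

# Toy stand-ins for the held-out benchmark (illustrative only).
ground_truth = ["ACCEPT", "REJECT", "REJECT", "ACCEPT", "REJECT"]
model_preds  = ["ACCEPT", "REJECT", "ACCEPT", "ACCEPT", "REJECT"]
expert_preds = ["REJECT", "REJECT", "ACCEPT", "ACCEPT", "ACCEPT"]

def accuracy(preds: list[str], truth: list[str]) -> float:
    """Fraction of verdicts matching the historical decision."""
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)

print(f"model accuracy:  {accuracy(model_preds, ground_truth):.0%}")
print(f"expert accuracy: {accuracy(expert_preds, ground_truth):.0%}")

# Cases where model and experts part ways flag the strengths and
# weaknesses worth a closer look.
disagreements = Counter(
    (m, e) for m, e in zip(model_preds, expert_preds) if m != e
)
print(disagreements)
```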

In evaluating the predictive capabilities of the fine-tuned language models, a comparative analysis was conducted against human experts specifically within the management discipline. Results indicate that the models achieved an accuracy range of 55.0 to 59.2% in assessing research pitches. This performance significantly exceeded the accuracy range of 36.2 to 41.6% demonstrated by the human evaluators performing the same task. This data suggests a substantial improvement in predictive power when utilizing the fine-tuned models for evaluation within this specific domain.

The ensemble of supervised fine-tuned (SFT) language models achieved 72.5% accuracy when evaluated on a strict consensus subset of research articles, indicating a high degree of predictive capability. Beyond accuracy, the ensemble demonstrated calibrated confidence, as measured by a +0.082 gap. This metric signifies that the model’s predicted probabilities align with actual outcomes; a positive value indicates the model is appropriately confident in its correct predictions and appropriately uncertain in its incorrect predictions, suggesting reliability beyond simple predictive power.
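
The two numbers reported here can be computed as sketched below, assuming the ensemble averages its members’ acceptance probabilities and that the ‘gap’ is mean confidence on correct predictions minus mean confidence on incorrect ones – a common reading, though the paper may define it differently. All probabilities and labels are illustrative.

```python
import numpy as np

# Illustrative per-model P(accept) for five pitches; the ensemble
# simply averages its members' probabilities.
member_probs = np.array([
    [0.95, 0.30, 0.60, 0.70, 0.10],
    [0.88, 0.38, 0.50, 0.55, 0.20],
    [0.90, 0.34, 0.55, 0.61, 0.15],
])
probs  = member_probs.mean(axis=0)
labels = np.array([1, 0, 0, 1, 0])  # historical decisions (1 = accepted)

preds   = (probs >= 0.5).astype(int)
correct = preds == labels
accuracy = correct.mean()

# Confidence in the chosen class: P(accept) for ACCEPT calls,
# 1 - P(accept) for REJECT calls.
confidence = np.where(preds == 1, probs, 1 - probs)
gap = confidence[correct].mean() - confidence[~correct].mean()

print(f"accuracy: {accuracy:.0%}, confidence gap: {gap:+.3f}")
```

On this toy data the gap comes out positive because the one wrong call is also the least confident one – exactly the property the article describes as calibrated confidence.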

Beyond the Benchmark: Generalization and Future Prospects

The successful application of these fine-tuned models to the economics discipline highlights a significant capacity for generalization beyond their initial training scope. Achieving 69.5% accuracy in this entirely new domain demonstrates the models’ ability to extract and apply underlying principles of strong research, irrespective of specific field terminology. This transfer learning capability suggests a robust understanding of research quality, moving beyond mere keyword recognition to a deeper assessment of methodological soundness and potential impact. The result implies that a universally applicable framework for evaluating scientific merit may be within reach, opening possibilities for cross-disciplinary analysis and improved resource allocation within the research landscape.

The study demonstrates that focusing predictive efforts on areas of high model confidence significantly improves accuracy. Rather than applying the model universally, researchers implemented a ‘selective prediction’ strategy, leveraging model calibration to identify pitches – or research proposals, in this context – where the system exhibited strong certainty. This isn’t about avoiding difficult cases; it’s about prioritizing those where the model’s expertise is most reliable, effectively concentrating resources on the most promising avenues of inquiry. By filtering for high-confidence predictions, the system achieves a heightened level of precision, suggesting a pathway toward more efficient and targeted evaluation processes within scientific research and beyond.
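
Continuing the sketch above, selective prediction amounts to abstaining whenever the ensemble’s confidence falls below a threshold and scoring only the covered subset. The threshold and data here are illustrative.

```python
import numpy as np

probs  = np.array([0.91, 0.34, 0.55, 0.62, 0.15, 0.97])  # ensemble P(accept)
labels = np.array([1,    0,    0,    1,    0,    1])      # historical decisions

preds      = (probs >= 0.5).astype(int)
confidence = np.where(preds == 1, probs, 1 - probs)

threshold = 0.80                     # illustrative confidence cutoff
covered   = confidence >= threshold  # pitches the system agrees to judge

coverage = covered.mean()
selective_accuracy = (preds[covered] == labels[covered]).mean()

print(f"coverage: {coverage:.0%}, accuracy on covered: {selective_accuracy:.0%}")
```

On this toy data, overall accuracy is 83% but rises to 100% on the covered half – the same coverage-for-accuracy trade the study exploits.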

A significant bottleneck in scientific advancement lies in the sheer volume of research proposals, demanding substantial time and resources for evaluation. This research introduces a scalable solution, leveraging fine-tuned language models to pre-screen proposals and prioritize those with the highest potential. By automating an initial assessment, the process aims to accelerate the discovery timeline and reduce wasted effort on projects unlikely to yield significant results. This isn’t intended to replace expert review, but rather to function as a powerful filter, allowing researchers and funding bodies to focus their attention on the most promising avenues of inquiry and ultimately, to maximize the impact of scientific investment.

The study introduces a novel approach to evaluating scientific merit by explicitly modeling ‘dark knowledge’ – the information a model possesses beyond its stated predictions. This technique moves beyond simply identifying correct answers and delves into how confidently a model arrives at those conclusions, revealing underlying reasoning. By incorporating this nuanced understanding, the research achieves a significant performance boost – approximately 24 to 28 percentage points – over existing frontier models when assessed on a management benchmark. This improved accuracy suggests a pathway towards more transparent and objective evaluations of research proposals, potentially revolutionizing how scientific value is determined and allowing for more effective allocation of resources.
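
One plausible way to surface this ‘dark knowledge’ in practice is to keep not just the verdict but the probability the model assigns to it, recovered from token log-probabilities. The sketch below extends the earlier inference example with the same placeholder model ID; it is an assumed mechanism, not necessarily the paper’s.

```python
import math
from openai import OpenAI

client = OpenAI()

def verdict_with_confidence(pitch: str) -> tuple[str, float]:
    """Return the verdict plus the probability of its first token."""
    response = client.chat.completions.create(
        model="ft:gpt-4.1:org:taste:abc123",  # hypothetical fine-tune ID
        messages=[
            {"role": "system",
             "content": "You are a journal editor. Answer ACCEPT or REJECT."},
            {"role": "user", "content": pitch},
        ],
        temperature=0.0,
        logprobs=True,  # ask the API to return token log-probabilities
    )
    choice = response.choices[0]
    verdict = choice.message.content.strip()
    confidence = math.exp(choice.logprobs.content[0].logprob)
    return verdict, confidence
```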

The research illuminates a principle of evaluative judgment – that discerning merit isn’t solely reliant on novel insight, but also on recognizing patterns of past success. This echoes Carl Friedrich Gauss’s observation: “If others would think as hard as I do, I would not have so much to do.” The model’s ability to extrapolate ‘scientific taste’ from institutional traces – historical publication data – suggests a similar efficiency. It doesn’t create judgment, but distills it from existing evidence, reducing the cognitive load required for assessment. Clarity is the minimum viable kindness; the model offers a streamlined process, freeing researchers to focus on the work itself, not merely its initial vetting.

What Remains Unseen?

The demonstrated acquisition of ‘scientific taste’ by these models is not, fundamentally, surprising. Pattern recognition, after all, is the bedrock of both computation and, it must be admitted, much of human judgment. The more pertinent question isn’t how a machine can mimic evaluation, but what this mimicry reveals about the evaluation itself. If institutional traces – the accumulated decisions of past researchers and editors – can be distilled into algorithmic preference, then the very notion of ‘promising’ research becomes less a matter of inherent quality and more a reflection of existing power structures. The model doesn’t discover potential; it merely maps the contours of prior acceptance.

Further investigation must address this inherent circularity. To truly move beyond simple imitation, models require exposure to data representing failed experiments, deliberately discarded hypotheses, and rigorously refuted claims. Only by understanding what is demonstrably not fruitful can a system begin to approximate genuine discernment. Current metrics, focused solely on publication, offer a profoundly skewed perspective, celebrating success while systematically obscuring the vast landscape of scientific dead ends.

The pursuit of ‘taste’ in machines, therefore, is less about building better predictors and more about forcing a confrontation with the biases embedded within the scientific process itself. If the model’s predictive power stems from institutional inertia, then the limitations of that power will always reflect the limitations of the institutions it emulates. Simplicity, in this case, demands acknowledgment of that fundamental constraint.


Original article: https://arxiv.org/pdf/2603.16659.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
