Doctor and AI: A Powerful Partnership for Smarter Diagnoses

Author: Denis Avetisyan

A new system combines the reasoning power of artificial intelligence with physician expertise to improve diagnostic accuracy, particularly for challenging rare diseases.

The system distills complex clinical data, including patient history, examination findings, and lab results, into a spectrum of candidate diagnoses. It iteratively refines these hypotheses through targeted evidence retrieval from resources like PubMed, then combines clinical reasoning with the scientific literature to arrive at a primary diagnosis (such as malignant lymphoma arising in the context of Hashimoto’s thyroiditis) alongside a ranked differential assessment.

This review details PULSE, an evidence-integrated language agent demonstrating expert-level performance in clinical diagnosis when used in collaborative human-AI workflows.

Diagnostic reasoning remains a challenge even for experienced clinicians, particularly when confronting rare or complex presentations. This limitation motivates the development of artificial intelligence tools, as explored in ‘Human-AI Co-reasoning for Clinical Diagnosis with Evidence-Integrated Language Agent’, which introduces PULSE, a medical reasoning agent that integrates large language models with scientific literature retrieval. PULSE showed expert-competitive accuracy and stable performance across disease incidence levels: it matched senior specialists and also improved diagnostic performance when used collaboratively with physicians. How can we best leverage such agents to mitigate risks like automation bias and realize their full potential in augmenting clinical decision-making?

The Erosion of Diagnostic Certainty

The evaluation of intricate endocrine disorders presents a significant cognitive challenge, frequently demanding a breadth of knowledge and speed of analysis that surpasses the capabilities of any single physician. These cases often involve subtle symptom presentations, overlapping pathologies, and the need to integrate information from diverse laboratory tests and imaging studies. The human brain, while powerful, has limitations in its capacity to simultaneously process and correlate such extensive datasets, increasing the risk of diagnostic delays or inaccuracies. This inherent limitation underscores the need for innovative diagnostic support tools capable of augmenting physician expertise and ensuring timely, effective patient care, particularly as the field of endocrinology continues to advance and present increasingly complex clinical scenarios.

Traditional endocrine diagnostics, reliant on sequential testing and clinical assessment, frequently encounter limitations when faced with complex presentations. The process of systematically excluding possibilities – while foundational – becomes protracted in cases of rare disorders or those manifesting with unusual symptoms, delaying crucial interventions. Diagnostic error rates are demonstrably higher in these scenarios, stemming from the cognitive challenges of interpreting nuanced data and the potential for anchoring bias – prematurely fixating on a single diagnosis. Furthermore, atypical presentations can mimic more common conditions, leading to misdiagnosis and inappropriate treatment, particularly as clinicians may unconsciously prioritize frequently encountered pathologies over less familiar ones. This inherent vulnerability underscores the need for innovative diagnostic tools capable of accelerating the process and mitigating the risk of human error in complex endocrine cases.

The relentless expansion of medical knowledge presents a significant hurdle for endocrine specialists and general practitioners alike. Each year, thousands of peer-reviewed articles, clinical guidelines, and research findings flood the scientific landscape, making it virtually impossible for any single physician to remain comprehensively current. This information overload isn’t merely a matter of volume; the nuanced interpretations of complex data, coupled with the rapid emergence of new diagnostic techniques and treatment protocols, demand continuous learning. Maintaining expertise requires not only diligent study but also sophisticated methods for filtering, synthesizing, and applying this ever-growing body of knowledge – a challenge that underscores the need for innovative support systems in modern endocrine diagnosis.

Evaluated on 82 endocrinology cases, the PULSE agent achieved diagnostic accuracy statistically comparable to that of physicians across experience levels, while also exploring a broad search space of diagnostic hypotheses.

The Architecture of Augmentation: Introducing PULSE

PULSE functions as a diagnostic assistant by combining a Large Language Model (LLM) with a dedicated scientific literature retrieval system. The LLM processes clinical input and formulates queries to the retrieval system, which then accesses and filters relevant medical literature. This integration allows PULSE to synthesize information from both structured clinical data and unstructured text found in scientific publications, providing a comprehensive knowledge base for diagnostic support. The LLM then interprets the retrieved literature to generate potential diagnoses or supporting evidence, effectively bridging the gap between raw data and clinically relevant insights.
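The loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not PULSE’s actual implementation: the names `run_diagnostic_loop`, `propose`, `make_query`, and `retrieve`, the three-round cap, and the stopping rule are all hypothetical. The toy stand-ins replace the LLM and the retrieval system with hard-coded behavior so the example runs offline.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class DiagnosticState:
    """Running state for one case: ranked hypotheses plus gathered evidence."""
    case_summary: str
    hypotheses: List[str] = field(default_factory=list)
    evidence: List[str] = field(default_factory=list)


def run_diagnostic_loop(
    case_summary: str,
    propose: Callable[[DiagnosticState], List[str]],   # hypothetical LLM step
    make_query: Callable[[DiagnosticState], str],      # hypothetical LLM step
    retrieve: Callable[[str], List[str]],              # hypothetical retrieval step
    max_rounds: int = 3,                               # assumed cap, not from the source
) -> DiagnosticState:
    state = DiagnosticState(case_summary)
    for _ in range(max_rounds):
        # 1. Rank candidate diagnoses from the case and the evidence so far.
        state.hypotheses = propose(state)
        # 2. Turn the current hypotheses into a literature query.
        query = make_query(state)
        # 3. Keep only evidence that has not been seen before.
        new_evidence = [e for e in retrieve(query) if e not in state.evidence]
        if not new_evidence:
            break  # nothing new to learn; stop early
        state.evidence.extend(new_evidence)
    # Final re-ranking with all accumulated evidence.
    state.hypotheses = propose(state)
    return state


# Toy stand-ins (hard-coded, for illustration only).
def toy_propose(state: DiagnosticState) -> List[str]:
    if any("lymphoma" in e for e in state.evidence):
        return ["malignant lymphoma", "Hashimoto thyroiditis"]
    return ["Hashimoto thyroiditis", "malignant lymphoma"]


def toy_make_query(state: DiagnosticState) -> str:
    return " AND ".join(state.hypotheses[:2])


def toy_retrieve(query: str) -> List[str]:
    return ["abstract: lymphoma arising in Hashimoto thyroiditis"]


result = run_diagnostic_loop(
    "rapidly enlarging goiter", toy_propose, toy_make_query, toy_retrieve
)
print(result.hypotheses)
```

In the toy run, the retrieved abstract promotes lymphoma to the top of the ranking on the second round, and the loop then stops because no further new evidence arrives.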

PULSE utilizes the National Center for Biotechnology Information’s (NCBI) E-utilities API to programmatically query and retrieve data from the PubMed database. This API access enables the system to perform complex literature searches based on user inputs or internal diagnostic reasoning. The E-utilities suite provides a standardized interface for accessing PubMed’s extensive collection of biomedical literature, including abstracts, citations, and full-text links where available. By directly interfacing with PubMed via its API, PULSE ensures that its diagnostic support is informed by the most current and comprehensive medical research, with data retrieval occurring in real-time based on query parameters and filtering criteria.
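As a rough illustration of what a PubMed lookup through E-utilities involves, the sketch below builds an ESearch request URL and extracts article IDs (PMIDs) from a JSON response. The endpoint and the `db`, `term`, `retmax`, and `retmode` parameters belong to the public E-utilities interface. The helper names, the example search term, and the sample response with placeholder IDs are assumptions made for illustration, and nothing here reflects how PULSE itself composes its queries.

```python
import json
from typing import List
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(term: str, retmax: int = 5) -> str:
    """Build an ESearch URL asking PubMed for matching article IDs as JSON."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def parse_pmids(esearch_json: str) -> List[str]:
    """Extract the list of PMIDs from an ESearch JSON response body."""
    return json.loads(esearch_json)["esearchresult"]["idlist"]


# Hypothetical query derived from two competing hypotheses.
url = build_esearch_url("thyroid lymphoma AND Hashimoto")
print(url)

# Placeholder response body with made-up IDs. A live call would fetch `url`,
# for example with urllib.request.urlopen(url).read().decode().
sample_response = '{"esearchresult": {"count": "2", "idlist": ["111", "222"]}}'
pmids = parse_pmids(sample_response)
print(pmids)
```

The returned PMIDs would then typically be passed to a second call (EFetch or ESummary) to obtain abstracts or citation metadata.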

PULSE is architected to integrate into existing clinical workflows and accommodates multiple collaboration patterns with physicians. In serial collaboration, a physician initiates a diagnostic query, PULSE provides an initial assessment, and the physician reviews the result and, where needed, refines the query iteratively. In concurrent collaboration, multiple physicians can access and contribute to the same diagnostic case simultaneously, with the system managing version control and highlighting conflicting interpretations. This is achieved through a role-based access control system and a shared data repository, which preserve data integrity and auditability within team-based diagnostic processes.

PULSE-assisted diagnostic performance, evaluated through Top@1 and Top@4 accuracies and diagnostic concordance, reveals that concurrent collaboration consistently outperforms serial collaboration across physicians and residents, resulting in a greater proportion of consensus-correct diagnoses and fewer instances where either the AI or physician alone arrives at the correct answer.

Performance Under Scrutiny: Validating PULSE Across Expertise Levels

PULSE was evaluated on a dataset of 82 authentic clinical cases from endocrinology. Its diagnostic performance was compared against that of physicians at three levels of clinical expertise: senior specialists, junior specialists, and residents. The comparison places PULSE’s performance in a realistic medical context, measured against clinicians at different stages of training and professional experience.

On the dataset of 82 authentic endocrinology cases, PULSE achieved a Top@1 diagnostic accuracy of 57.32%, where Top@1 measures whether the single top-ranked diagnosis is the correct one. This performance is statistically comparable to that of senior specialist physicians. PULSE also significantly outperformed both junior specialists and residents on the same dataset, indicating diagnostic capability approaching that of experienced clinicians.

On the dataset of 82 authentic endocrinology cases, PULSE achieved a Top@4 diagnostic accuracy of 79.27%, meaning the correct diagnosis appeared among the model’s four highest-ranked possibilities. This performance is statistically indistinguishable from that of senior specialist physicians, which indicates a comparable ability to generate a relevant differential diagnosis on this dataset.
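To make the Top@k metrics concrete, here is a small sketch of how such an accuracy can be computed from ranked predictions. It assumes exact string matching between predicted and true diagnoses, which is a simplification; the study may well have judged equivalence differently. The function name and the toy cases are hypothetical. The final two lines only check arithmetic: with 82 cases, the reported percentages correspond to 47 and 65 correct cases respectively, counts that are inferred here from the percentages rather than quoted from the source.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    truths: Sequence[str],
    k: int,
) -> float:
    """Fraction of cases whose true diagnosis is among the first k predictions."""
    if not truths:
        raise ValueError("at least one case is required")
    hits = sum(
        truth in preds[:k] for preds, truth in zip(ranked_predictions, truths)
    )
    return hits / len(truths)


# Toy cases with placeholder diagnosis labels (not real data).
toy_preds = [
    ["A", "B", "C", "D"],  # truth "A" is ranked first
    ["B", "A", "C", "D"],  # truth "A" is ranked second
    ["C", "D", "E", "F"],  # truth "Z" is missing from the list
]
toy_truths = ["A", "A", "Z"]

top1 = top_k_accuracy(toy_preds, toy_truths, k=1)  # 1 of 3 cases
top4 = top_k_accuracy(toy_preds, toy_truths, k=4)  # 2 of 3 cases

# Arithmetic check against the reported figures on 82 cases.
reported_top1 = round(47 / 82 * 100, 2)  # 57.32
reported_top4 = round(65 / 82 * 100, 2)  # 79.27
print(top1, top4, reported_top1, reported_top4)
```

By construction, Top@4 can never be lower than Top@1 for the same predictions, so the gap between the two figures reflects cases where the correct diagnosis was ranked second to fourth.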

Analysis of the length of PULSE’s generated reasoning chains demonstrates a positive correlation with the inherent complexity of each clinical case. Specifically, cases requiring more extensive diagnostic consideration, as determined by expert review, consistently elicited longer output lengths from the model. This indicates that PULSE does not apply a uniform reasoning process, but rather dynamically adjusts its output – and presumably its internal reasoning effort – in response to the challenges presented by each case. This adaptive behavior suggests the model is capable of allocating computational resources proportionally to the diagnostic demands of the clinical problem.
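One way to quantify a relationship like this is a rank correlation between case complexity and output length. The sketch below computes Spearman’s coefficient in plain Python. The source does not say which statistic the authors used, so the choice of Spearman is an assumption, and the complexity scores and output lengths are invented toy values, not data from the study.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i : j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Invented toy values: expert complexity scores and output lengths in tokens.
complexity = [1, 2, 2, 3, 4, 5]
output_length = [300, 450, 500, 700, 650, 1200]

rho = spearman(complexity, output_length)
print(round(rho, 3))
```

A coefficient near 1 would indicate that more complex cases consistently produce longer outputs. As the paragraph above notes, output length is only a proxy for reasoning effort.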

Diagnostic accuracy of PULSE, senior specialists, junior specialists, and residents varied with disease incidence, with increasing clinical experience among junior specialists correlating with improved performance as measured by Top@1 and Top@4 accuracy and granular analysis across incidence tiers.

Beyond Automation: Augmenting Expertise and Mitigating Bias

PULSE demonstrates a sophisticated level of reasoning that dynamically adjusts to the complexity of each medical case, much like an experienced physician. Instead of applying a uniform analytical approach, the system effectively prioritizes effort; it dedicates more computational resources to challenging diagnoses and streamlines analysis for straightforward cases. This adaptive capability isn’t simply about speed, but about intelligent allocation of resources, mirroring the cognitive strategies honed by clinicians over years of practice. By focusing deeper analysis where it’s most needed, PULSE not only enhances diagnostic accuracy but also optimizes efficiency, representing a significant step towards truly collaborative intelligence in healthcare.

The potential for automation bias – an overreliance on artificial intelligence, even when evidence suggests errors – is a critical concern in medical diagnostics. PULSE addresses this by functioning not as an autonomous decision-maker, but as a sophisticated second opinion generator. Beyond simply offering an alternative diagnosis, the system actively synthesizes relevant medical literature, presenting physicians with the supporting evidence behind its conclusions. This dual approach – a contrasting perspective coupled with readily accessible justification – compels clinicians to critically evaluate the AI’s reasoning, fostering a more nuanced and informed decision-making process. By demanding cognitive engagement with the underlying evidence, PULSE effectively safeguards against the uncritical acceptance of potentially flawed AI outputs and promotes a more robust application of artificial intelligence in healthcare.

The study shows that PULSE significantly narrows the performance disparity between medical residents and experienced specialists through real-time collaborative diagnosis. When residents used PULSE as a diagnostic partner, their Top@1 accuracy (the frequency with which their single top-ranked diagnosis was correct) rose to between 48.8% and 62.2%. This suggests that PULSE doesn’t simply offer answers, but actively supports a resident’s reasoning process, helping to bridge the knowledge and experience gap. The observed improvement indicates that artificial intelligence, when implemented as a collaborative tool, can substantially elevate the diagnostic capabilities of those still in training, bringing their performance closer to that of seasoned professionals.

The integration of artificial intelligence into medical diagnostics isn’t about replacing physician expertise, but rather augmenting it. Recent studies demonstrate a collaborative model where AI serves as a powerful partner, synthesizing complex medical literature and offering a secondary assessment to complement human intuition. This approach allows physicians to focus on the nuanced aspects of patient care – considering the patient’s individual circumstances and emotional well-being – while benefiting from the AI’s capacity for data analysis and recall. By effectively bridging the gap between experience levels, this synergy empowers clinicians to make more thoroughly informed decisions, ultimately leading to improved patient outcomes and a more robust healthcare system that harnesses the strengths of both human and artificial intelligence.

Sankey diagrams and quantitative analysis reveal that reviewing the agent’s output consistently improved diagnostic accuracy across five junior specialists, shifting diagnoses from incorrect to correct, while trade-off analysis demonstrates a net positive benefit through increased correction rates and diagnostic stability, as indicated by the cohort means and favorable regimes highlighted in the plots.

The pursuit of diagnostic accuracy, as demonstrated by PULSE, echoes a fundamental principle of system design: all solutions are temporary. This agent, while exhibiting expert-level performance in clinical reasoning and rare disease identification, represents a snapshot in time. The continual refinement of large language models and literature retrieval techniques necessitates ongoing adaptation. As John McCarthy observed, “Every abstraction carries the weight of the past,” and PULSE, built upon existing datasets and algorithms, inherently reflects prior limitations. The true measure of its value lies not in static achievement, but in its capacity to evolve and integrate new knowledge, preserving resilience against the inevitable decay of information and shifting medical understanding.

What’s Next?

The presentation of PULSE, a system capable of approximating expert diagnostic reasoning, does not resolve the fundamental challenges inherent in complex medical decision-making; it merely shifts them. The system’s success in identifying rare diseases, while encouraging, suggests not a triumph over diagnostic uncertainty, but an increased capacity to discover the edges of what remains unknown. Time will inevitably reveal cases that challenge even this augmented reasoning, exposing the limitations of any knowledge base, however vast.

Future work will likely focus on refining the integration of literature retrieval, but this is akin to polishing the instruments of a sinking ship. The true limitation isn’t access to information, but the inherent ambiguity of biological systems. A system can become proficient at pattern recognition, but it cannot truly understand pathophysiology. The goal shouldn’t be to replicate human intelligence, but to accept that stability in diagnostic accuracy is often a temporary reprieve before inevitable entropy.

Ultimately, the trajectory of this research points toward a continual cycle of refinement and recalibration. Systems age not because of errors, but because time is inevitable. The next generation of medical reasoning agents will likely not be defined by their ability to solve diagnostic problems, but by their capacity to gracefully navigate the ever-expanding frontier of medical uncertainty.

Original article: https://arxiv.org/pdf/2603.10492.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
