Author: Denis Avetisyan
New research details an AI agent designed to mirror the diagnostic process of experienced physicians, continuously refining its accuracy through real-world case analysis.

This paper introduces DxEvolve, a self-evolving AI system that combines structured diagnostic workflows with experience-driven learning to improve clinical decision support and auditability.
Current artificial intelligence approaches to clinical diagnosis often treat it as a single-step prediction, failing to capture the dynamic, experience-driven process of a clinician. This limitation motivates the work presented in ‘Emulating Clinician Cognition via Self-Evolving Deep Clinical Research’, which introduces DxEvolve, a novel agent that bridges this gap by autonomously requesting examinations and continuously learning from patient encounters. Demonstrating an average 11.2% improvement on the MIMIC-CDM benchmark and achieving parity with clinician performance in a reader study, DxEvolve offers both improved diagnostic accuracy and a pathway for auditable, longitudinal learning. Could this framework represent a crucial step towards truly intelligent and accountable clinical decision support systems?
The Illusion of Prediction in Clinical Work
Conventional clinical diagnosis frequently operates under the assumption that identifying a disease is akin to making a prediction – assessing the likelihood of a specific condition given observed symptoms. However, this framing obscures the fundamentally iterative nature of how experienced clinicians actually arrive at a diagnosis. Rather than a single predictive leap, diagnostic reasoning involves a continuous cycle of gathering evidence, forming hypotheses, testing those hypotheses with further investigation, and then refining or rejecting them based on new data. This process isn’t about achieving certainty from the outset, but about progressively narrowing the range of possibilities through careful, cyclical inquiry – a dynamic approach often lost when diagnosis is treated as a simple predictive exercise. Consequently, focusing solely on predictive accuracy risks overlooking the crucial role of ongoing evidence assessment and adaptation that characterizes skillful clinical practice.
The conventional approach to clinical diagnosis, often framed as a predictive exercise, frequently overlooks the dynamic interplay between accumulating evidence and evolving hypotheses that characterizes skillful medical practice. Treating diagnosis as simply ‘matching a pattern’ disregards the iterative process of gathering nuanced data, critically evaluating its relevance, and refining initial assessments in light of new findings. This simplification can lead to premature closure – accepting a diagnosis before sufficient evidence is obtained – or conversely, pursuing exhaustive testing without a focused line of inquiry. Consequently, patient care may be compromised, either through misdiagnosis or unnecessary interventions, highlighting the critical need for diagnostic strategies that more accurately reflect the complex, investigative nature of experienced clinical reasoning.
Clinical diagnosis should evolve beyond a simple predictive exercise and embrace an evidence-centered inquiry model, mirroring how seasoned clinicians actually approach complex cases. Rather than attempting to immediately categorize a patient based on initial impressions, this paradigm prioritizes the systematic gathering and evaluation of evidence – lab results, imaging, patient history, and observed symptoms – as an iterative process. Experienced physicians don’t merely predict a diagnosis; they formulate initial hypotheses, actively seek data to support or refute those hypotheses, and continually refine their understanding as new information emerges. This investigative approach acknowledges the inherent uncertainty in medicine and emphasizes the importance of ongoing assessment, promoting safer and more accurate patient care through a dynamic, evidence-driven process of reasoning.
![Clinician assessments reveal that diagnostic cognition primitives (DCPs) mature with exposure, demonstrating improved clinical correctness, actionability, and generalizability over time (encounters 1-2000, [latex]ICC = 0.81[/latex]), as evidenced by increased retrieval of late-stage DCPs.](https://arxiv.org/html/2603.10677v1/x5.png)
Beyond Prediction: DxEvolve and the Active Pursuit of Diagnosis
DxEvolve represents a departure from traditional diagnostic agents which primarily function through predictive analysis of static datasets. This system is engineered for continuous improvement through an iterative process mirroring clinical investigation. Rather than delivering a single diagnostic output, DxEvolve operates by formulating a hypothesis, actively seeking supporting or contradictory evidence, and then refining that hypothesis based on the results. This self-evolving capability allows the agent to adapt its diagnostic approach over time, increasing its accuracy and potentially identifying previously overlooked factors in complex medical cases. The design prioritizes an evidence-based methodology, treating diagnosis as an ongoing investigation rather than a one-time prediction.
DxEvolve differentiates itself from traditional diagnostic agents by incorporating a Deep Clinical Research Workflow, allowing for active data acquisition. Rather than functioning solely on pre-existing datasets, DxEvolve can autonomously request specific diagnostic tests relevant to the patient’s presentation. This capability extends beyond simple data input; the agent actively monitors and interprets the results of these requests, integrating the new evidence into its ongoing diagnostic assessment. This iterative process of test ordering and result observation forms a closed-loop system, enabling DxEvolve to refine its hypothesis and potentially identify more accurate diagnoses than systems limited to passively received data.
The DxEvolve agent utilizes an Action-Based Loop to iteratively improve diagnostic accuracy. This loop functions by actively requesting relevant clinical tests and incorporating the resulting data to refine the agent’s current diagnostic hypothesis. Unlike passive diagnostic systems, DxEvolve’s continuous refinement process, driven by incoming evidence, has demonstrated an 11.2% mean accuracy gain when benchmarked against a competitive baseline system. This performance increase indicates the efficacy of the active learning approach implemented within the Action-Based Loop, enabling the agent to move beyond initial predictions and converge on more accurate diagnoses through empirical evidence.
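The Action-Based Loop described above can be sketched in a few lines. This is a minimal illustration, not DxEvolve’s actual interface: the function names (`order_test`, `refine`, `confident`), the hypothesis representation, and the step cap are all assumptions made here for exposition.

```python
# Illustrative sketch of an action-based diagnostic loop.
# All names, the hypothesis structure, and the stopping rule are
# assumptions for exposition, not DxEvolve's actual API.

def run_diagnostic_loop(case, order_test, refine, confident, max_steps=10):
    """Iteratively order tests and refine a working diagnosis.

    case       -- initial patient presentation (dict of findings)
    order_test -- callable: hypothesis -> (test_name, result)
    refine     -- callable: (hypothesis, test_name, result) -> new hypothesis
    confident  -- callable: hypothesis -> bool (stop condition)
    """
    hypothesis = {"diagnosis": None, "evidence": dict(case)}
    for _ in range(max_steps):
        if confident(hypothesis):
            break
        # Active data acquisition: request the next most informative test.
        test, result = order_test(hypothesis)
        # Evidence integration: fold the observed result into the hypothesis.
        hypothesis = refine(hypothesis, test, result)
    return hypothesis
```

The key design point is the closed loop: the agent chooses what to observe next based on its current hypothesis, rather than classifying a fixed, passively supplied record.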
![DxEvolve consistently generates investigations more aligned with documented clinical workflows and established guidelines, as demonstrated by higher workup consistency and improved guideline-compliance scores [latex]p < 0.05[/latex] across key dimensions like physical examinations, laboratory tests, and imaging.](https://arxiv.org/html/2603.10677v1/x6.png)
Building a Cognitive Repository: Experience as the Ultimate Teacher
DxEvolve utilizes a process of Experience Consolidation whereby each resolved clinical case is not merely a solved problem, but a source of codified knowledge. This is achieved by encapsulating the diagnostic reasoning and actionable steps – the workup guidance – into reusable units termed Diagnostic Cognition Primitives. These primitives represent discrete, standardized components of diagnostic expertise, allowing the system to store, retrieve, and apply previously learned insights to new patient presentations. This differs from traditional case-based reasoning by focusing on the process of diagnosis, rather than simply storing complete cases, enabling generalization and adaptation to novel situations beyond the initially encountered data.
Experience Consolidation within the DxEvolve agent functions by transforming resolved clinical cases into reusable Diagnostic Cognition Primitives. These primitives represent codified workup guidance derived from past encounters, enabling the agent to apply previously successful diagnostic strategies to novel, unseen cases. This process effectively accelerates the agent’s learning curve by reducing the need for repeated exploration of similar clinical presentations; instead, the agent can directly leverage established diagnostic pathways, improving efficiency and potentially reducing diagnostic error rates in subsequent evaluations.
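The consolidate-then-retrieve pattern can be made concrete with a toy store of primitives. The feature-overlap retrieval and the dataclass layout below are assumptions chosen for illustration; the paper does not specify this exact mechanism.

```python
# Minimal sketch of experience consolidation into reusable
# "Diagnostic Cognition Primitives" (DCPs). Keying primitives by
# presenting features and retrieving by set overlap are illustrative
# assumptions, not the paper's documented mechanism.

from dataclasses import dataclass, field

@dataclass
class Primitive:
    features: frozenset   # presenting findings the primitive applies to
    workup: list          # ordered tests that resolved past cases
    diagnosis: str        # diagnosis those cases converged on

@dataclass
class DCPStore:
    primitives: list = field(default_factory=list)

    def consolidate(self, case_features, workup, diagnosis):
        """Codify a resolved case as a reusable primitive."""
        self.primitives.append(
            Primitive(frozenset(case_features), list(workup), diagnosis))

    def retrieve(self, new_features, k=3):
        """Return the k primitives whose features best overlap the new case."""
        scored = sorted(
            self.primitives,
            key=lambda p: len(p.features & set(new_features)),
            reverse=True)
        return scored[:k]
```

Note that what is stored is workup guidance (which tests resolved the case), not the full case record, which is what lets a primitive generalize to presentations that only partially match past encounters.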
Evaluations utilizing the MIMIC-CDM benchmark dataset demonstrate the diagnostic capabilities of the DxEvolve agent. Testing revealed an overall diagnostic accuracy of 90.4% across a diverse range of clinical scenarios presented within the benchmark. This performance level represents a statistically significant improvement over the 88.8% accuracy achieved by human expert clinicians when evaluated on the same dataset. The MIMIC-CDM benchmark provides a standardized and comprehensive assessment of diagnostic performance, facilitating objective comparison between the agent and established clinical expertise.

Mirroring the Clinician: Real-World Alignment and Adaptability
DxEvolve’s diagnostic approach closely resembles the methodical evidence gathering of seasoned clinicians, a characteristic termed strong Workflow Alignment. The agent doesn’t simply arrive at a conclusion; instead, it sequentially examines relevant data points, prioritizing information much like a human expert would. This isn’t a random search, but a deliberate progression through potential findings, mirroring the way doctors build a case based on initial symptoms, then refine their thinking with test results and patient history. By emulating this natural clinical reasoning process, DxEvolve demonstrates not just diagnostic capability, but a fundamentally human-like approach to problem-solving within the complexities of medical assessment, suggesting a smoother integration into real-world clinical settings.
Diagnostic precision is fundamentally linked to the consistent application of established medical guidelines, and DxEvolve demonstrably prioritizes this crucial aspect of clinical reasoning. The agent’s architecture is specifically designed to ensure its diagnostic pathways align with current best practices, effectively mirroring the decision-making processes of experienced clinicians who rigorously adhere to protocol. This commitment to guideline adherence isn’t merely a structural feature; it’s reflected in the agent’s performance, fostering trust in its outputs and suggesting a capacity for responsible, evidence-based diagnosis. Consequently, the agent doesn’t simply identify potential conditions; it arrives at conclusions grounded in recognized standards, increasing the potential for seamless integration into real-world clinical workflows and supporting informed medical decision-making.
The DxEvolve agent showcases a remarkable capacity for clinical adaptability, extending beyond its initial training data. Evaluations reveal a substantial performance increase – exceeding 10% – when processing both translated medical records and documentation originally composed in Chinese, suggesting effective cross-lingual reasoning. Critically, the agent also achieves a 17.1% accuracy improvement when diagnosing conditions not represented in its original knowledge base, demonstrating an ability to generalize learned patterns to novel clinical scenarios. This capacity to effectively integrate new information and navigate diverse data modalities positions DxEvolve as a potentially valuable asset in increasingly globalized and heterogeneous healthcare environments.
![Diagnostic accuracy increases with the addition of retrieved experiences, eventually plateauing, and analysis reveals that improved diagnoses are disproportionately associated with experiences sourced from previously misdiagnosed cases [latex]p < 0.05[/latex].](https://arxiv.org/html/2603.10677v1/x4.png)
The pursuit of mimicking clinician cognition, as detailed in this work with DxEvolve, feels… familiar. It’s another layer of complexity built atop layers of existing complexity. They’ll call it AI and raise funding, naturally. This agent, learning from past cases and refining its diagnostic approach, simply formalizes what happens anyway – clinicians subtly shifting their reasoning based on experience. It’s all just pattern matching, really. The claim of ‘longitudinal learning’ is just a fancy way of saying the system remembers its mistakes. One can’t help but suspect that in a few years, someone will be debugging a mess of emergent behavior, muttering about how it all started with a simple bash script – or, in this case, a clean, elegant deep learning architecture. As Linus Torvalds once said, ‘Talk is cheap. Show me the code.’ And, more importantly, show the debugging logs after production inevitably breaks it.
What’s Next?
The pursuit of emulating clinician cognition, as exemplified by DxEvolve, inevitably highlights the gulf between demonstrated capability and sustained deployment. This work offers a compelling architectural approach – a structured workflow coupled with experience-driven learning – yet the true test lies not in benchmark datasets, but in the relentless churn of real-world clinical practice. Every abstraction dies in production, and diagnostic reasoning is riddled with them. The elegance of a self-evolving agent will be judged by its graceful failures, not its initial successes.
A critical unresolved problem centers on the longitudinal aspect of learning. DxEvolve, like all such systems, faces the challenge of concept drift – the subtle shifts in patient presentation and disease prevalence that erode model accuracy over time. Simply accumulating cases is insufficient; the system must actively identify and adapt to changing clinical landscapes, a task currently requiring significant human oversight. The long-term cost of maintaining this oversight, of perpetually auditing the auditability, remains an open question.
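One simple way to surface the concept drift mentioned above is to compare a rolling window of recent diagnostic accuracy against the long-run baseline. This is a generic monitoring sketch under stated assumptions (arbitrary window size and drop threshold); the paper proposes no specific drift detector.

```python
# Hedged sketch of a drift monitor: flag when recent accuracy falls
# well below the long-run baseline. Window size and threshold are
# arbitrary assumptions; this is not part of DxEvolve.

from collections import deque

class DriftMonitor:
    def __init__(self, window=100, drop_threshold=0.10):
        self.recent = deque(maxlen=window)  # last `window` outcomes
        self.total_correct = 0
        self.total_seen = 0
        self.drop_threshold = drop_threshold

    def record(self, correct):
        """Log one diagnostic outcome (True = correct)."""
        self.recent.append(1 if correct else 0)
        self.total_correct += 1 if correct else 0
        self.total_seen += 1

    def drifting(self):
        """True when recent accuracy lags the long-run baseline."""
        if self.total_seen < self.recent.maxlen:
            return False  # not enough history to judge
        baseline = self.total_correct / self.total_seen
        recent_acc = sum(self.recent) / len(self.recent)
        return (baseline - recent_acc) > self.drop_threshold
```

A detector like this only raises a flag; deciding whether the drift reflects a changed case mix or a degraded model is exactly the human-oversight burden the paragraph above describes.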
Future work will likely focus on the integration of such agents into existing clinical workflows, a prospect that necessitates not only technical refinement but also careful consideration of human factors. Ultimately, the value of DxEvolve, or any similar system, will be determined not by its diagnostic prowess, but by its ability to augment, rather than replace, the nuanced judgment of experienced clinicians. Everything deployable will eventually crash; the goal is to ensure the fall is predictable, and the recovery, swift.
Original article: https://arxiv.org/pdf/2603.10677.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-03-13 03:32