AI Passes the Medical Exam: A New Model Outperforms GPT-4

Author: Denis Avetisyan


Researchers have developed a medical AI assistant, Erkang-Diagnosis-1.1, demonstrating advanced diagnostic capabilities and exceeding the performance of established models like GPT-4.

Erkang-Diagnosis-1.1 leverages the Qwen-3 large language model and a comprehensive medical knowledge graph to achieve state-of-the-art results in medical examinations.

Despite the increasing potential of large language models in healthcare, building reliable and comprehensive medical AI assistants remains a significant challenge. This paper details the development of Erkang-Diagnosis-1.1, a novel AI healthcare consulting assistant built upon the Alibaba Qwen-3 model and fortified with approximately 500GB of structured medical knowledge. Through a hybrid pre-training and retrieval-augmented generation approach, Erkang-Diagnosis-1.1 surpasses GPT-4 in comprehensive medical exam performance, offering accurate preliminary analysis and guidance within a few conversational turns. Could this represent a scalable solution for empowering primary healthcare and personalized health management?


The Illusion of Diagnostic Precision

Traditional artificial intelligence systems in medicine frequently struggle with the subtleties of patient assessment, often failing to integrate complex symptoms, medical history, and individual patient factors into a cohesive diagnostic picture. Current AI models can excel at pattern recognition, but often lack the capacity for the higher-level reasoning – the ‘clinical judgment’ – that experienced physicians employ. This limitation stems from a reliance on statistical correlations rather than a true understanding of underlying physiological processes, leading to diagnoses that, while technically correct, may not fully address the patient’s unique circumstances or anticipate potential complications. Consequently, these systems can generate generic recommendations, overlook crucial details, and ultimately fall short of delivering the personalized, patient-centric care necessary for optimal health outcomes.

Erkang-Diagnosis-1.1 distinguishes itself by building on Qwen-3, a general-purpose large language model that the developers further refined for the complexities of medical consultation. The adapted model underwent intensive additional training on a vast dataset of medical literature, clinical guidelines, and real-world case studies. This specialization allows the system to move beyond simple symptom matching and engage in more nuanced reasoning, interpreting patient information with a deeper understanding of medical context. The result is an AI assistant capable of formulating more accurate differential diagnoses, suggesting relevant follow-up questions, and ultimately providing a more comprehensive and reliable initial assessment, a marked improvement over models not purpose-built for healthcare applications.

Erkang-Diagnosis-1.1 represents a significant step forward in the application of artificial intelligence to preliminary medical assessments. Designed to improve both the speed and precision of initial consultations, this AI assistant utilizes the Qwen-3 large language model, meticulously adapted for the complexities of medical diagnosis. Rigorous evaluations demonstrate that Erkang-Diagnosis-1.1 consistently surpasses the performance of GPT-4 in comprehensive medical examinations, suggesting a heightened capacity for nuanced understanding and accurate assessment of patient information. This enhanced diagnostic capability promises to alleviate the burden on healthcare professionals and potentially lead to earlier, more effective interventions, ultimately improving patient outcomes through a more efficient and reliable initial consultation process.

The Weight of the Dataset

The Erkang-Diagnosis-1.1 model utilizes a substantial 500GB knowledge base constructed from a variety of medical sources. This dataset incorporates complete medical textbooks, current clinical practice guidelines, and a large volume of anonymized patient data. The inclusion of diverse data types aims to provide a comprehensive foundation for the model’s understanding of medical concepts and conditions. Data anonymization procedures were implemented to ensure patient privacy and compliance with relevant regulations during the dataset’s creation and utilization.

Continued pre-training of the Qwen-3 model utilizes a large corpus of medical data to iteratively refine its understanding of complex medical concepts. This process goes beyond initial training by exposing the model to a wider range of clinical information and relationships, allowing it to better discern nuances in medical terminology, pathophysiology, and treatment protocols. The refinement focuses on enhancing the model’s ability to not only recognize medical entities but also to interpret their interdependencies and apply this knowledge to novel clinical scenarios, thereby improving its capacity for accurate reasoning and inference within the medical domain.

Continued pre-training of the Qwen-3 model specifically enhances its capabilities in two key areas: Medical Entity Recognition and Diagnostic Logic Modeling. Medical Entity Recognition involves the accurate identification of clinical terms, such as diseases, symptoms, medications, and anatomical structures, within unstructured medical text. Diagnostic Logic Modeling focuses on the model’s ability to process and interpret relationships between these entities, for example, understanding that certain symptoms are indicative of specific conditions. The refinement process enables the model to not only recognize these elements but also to synthesize them into a coherent understanding of a patient’s medical presentation, which is crucial for effective diagnosis.

The Erkang-Diagnosis-1.1 model achieves a 90% consistency rate with expert diagnostic consensus, as validated on an internal test set comprising over 200 common diseases. This performance metric indicates the model’s ability to accurately analyze provided patient symptom data and generate a list of potential diagnoses aligned with those of experienced medical professionals. The evaluation methodology involved comparing the model’s diagnostic suggestions to established expert opinions for a standardized dataset, quantifying the degree of agreement and establishing a baseline for diagnostic accuracy.
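
As a rough illustration of how a consistency rate of this kind could be computed, the sketch below counts a case as consistent when the expert consensus diagnosis appears among the model's top-k suggestions. The function name, the top-k matching rule, and the toy cases are assumptions made for illustration; the source does not describe the exact scoring rule or publish the test set.

```python
from typing import Sequence


def consistency_rate(predictions: Sequence[Sequence[str]],
                     expert_labels: Sequence[str],
                     k: int = 3) -> float:
    """Fraction of cases whose expert diagnosis is in the model's top-k list.

    Hypothetical scoring rule: case-insensitive exact match within the
    first k ranked suggestions.
    """
    if len(predictions) != len(expert_labels):
        raise ValueError("predictions and expert_labels must have equal length")
    if not expert_labels:
        raise ValueError("at least one case is required")
    hits = sum(
        1
        for ranked, expert in zip(predictions, expert_labels)
        if expert.lower() in (d.lower() for d in ranked[:k])
    )
    return hits / len(expert_labels)


# Toy cases invented for this example (not data from the paper).
toy_predictions = [
    ["influenza", "common cold", "covid-19"],
    ["migraine", "tension headache"],
    ["gastritis", "peptic ulcer"],
    ["asthma", "bronchitis"],
]
toy_expert = ["common cold", "migraine", "appendicitis", "asthma"]

rate_top3 = consistency_rate(toy_predictions, toy_expert, k=3)
rate_top1 = consistency_rate(toy_predictions, toy_expert, k=1)
print(f"top-3 consistency: {rate_top3:.2f}, top-1 consistency: {rate_top1:.2f}")
```

The choice of k matters: a lenient top-k rule and a strict top-1 rule can yield very different headline numbers on the same cases, which is one reason a reported consistency rate is hard to interpret without the scoring rule.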

Retrieval as a Necessary Illusion

Erkang-Diagnosis-1.1 utilizes Retrieval-Augmented Generation (RAG) as a core mechanism for improving response quality and accuracy. RAG functions by first retrieving relevant information from an external knowledge source – in this case, a vector database containing medical data – based on the user’s query. This retrieved information is then combined with the model’s pre-existing parametric knowledge before generating a final response. This process allows the model to ground its answers in factual, up-to-date information, mitigating the risk of hallucination and improving the reliability of its diagnostic and informational outputs. By integrating external knowledge, RAG enables Erkang-Diagnosis-1.1 to provide more comprehensive and contextually appropriate responses than would be possible using its internal knowledge alone.

A Vector Database facilitates efficient semantic retrieval of medical knowledge by representing data points – such as medical concepts, symptoms, and treatments – as high-dimensional vectors. This process, known as embedding, allows the database to understand the meaning of the data, rather than simply matching keywords. When a query is submitted, it is also converted into a vector, and the database identifies the vectors most similar to the query vector using algorithms like cosine similarity. This similarity search enables the retrieval of relevant information even if the query doesn’t contain the exact terms present in the stored data, offering a significant improvement over traditional keyword-based search methods. The resulting vectors and associated medical knowledge are then utilized by the Retrieval-Augmented Generation process to formulate responses.
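
The similarity search described above can be sketched in a few lines. This is a brute-force version under stated assumptions: the three-dimensional vectors are hand-made stand-ins for real embeddings, the document identifiers are hypothetical, and a production vector database would typically use an approximate nearest-neighbor index rather than scoring every entry.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query: List[float],
          index: Dict[str, List[float]],
          k: int = 2) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(doc_id, cosine_similarity(query, vec))
              for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings; real ones have hundreds of dimensions.
toy_index = {
    "fever_management": [0.9, 0.1, 0.0],
    "headache_causes": [0.1, 0.9, 0.1],
    "fracture_care": [0.0, 0.1, 0.9],
}
query_vec = [0.8, 0.2, 0.0]  # stands in for an embedded user query

results = top_k(query_vec, toy_index, k=2)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

Because the ranking depends only on vector direction, a query can match a stored entry without sharing any literal keywords with it, provided the embedding model places the two close together.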

The medical knowledge utilized by Erkang-Diagnosis-1.1 is structured as a Medical Knowledge Graph, which represents medical concepts and their relationships as nodes and edges. This graph-based organization facilitates efficient semantic retrieval during the RAG process by enabling the model to traverse related concepts and identify relevant information beyond simple keyword matching. Specifically, the Knowledge Graph allows for the identification of indirect relationships – for example, recognizing that a symptom is associated with a disease through an intermediary condition – thereby providing a more comprehensive and nuanced knowledge base for response generation and streamlining the information retrieval stage of RAG.
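
To make the idea of indirect relationships concrete, here is a minimal sketch that treats the graph as an adjacency list and uses breadth-first search to find a short path from a symptom to a disease through an intermediary condition. The graph contents, the `find_path` helper, and the hop limit are illustrative assumptions; the source does not describe the graph schema or the traversal method actually used.

```python
from collections import deque
from typing import Dict, List, Optional

Graph = Dict[str, List[str]]


def find_path(graph: Graph, start: str, goal: str,
              max_hops: int = 3) -> Optional[List[str]]:
    """Breadth-first search for the shortest path of at most max_hops edges."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        if len(path) - 1 >= max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Toy "associated with" edges, invented for this example.
toy_graph: Graph = {
    "fatigue": ["anemia"],
    "anemia": ["chronic kidney disease", "iron deficiency"],
    "iron deficiency": [],
    "chronic kidney disease": [],
}

path = find_path(toy_graph, "fatigue", "chronic kidney disease")
print(path)
```

A hop limit keeps retrieval focused: without one, almost any two concepts in a large graph are connected by some long and clinically meaningless chain.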

The Erkang-Diagnosis-1.1 model utilizes a Retrieval-Augmented Generation (RAG) process where information retrieved from a vector database and medical knowledge graph is integrated with the model’s pre-existing parametric knowledge. This combination allows the model to move beyond generating responses solely based on its training data; instead, it can leverage external, up-to-date medical information to formulate answers. The result is enhanced response accuracy, increased contextual relevance, and the ability to provide more informed outputs addressing specific user queries, effectively mitigating the risks associated with potential knowledge gaps or outdated information within the model’s core parameters.
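
The integration step can be pictured as prompt assembly: retrieved passages are placed in front of the user's question so the model can ground its answer in them. The sketch below shows only that step and makes no model call. The prompt wording, the `build_grounded_prompt` name, and the passage limit are assumptions for illustration, not the system's actual template.

```python
from typing import List


def build_grounded_prompt(question: str,
                          passages: List[str],
                          max_passages: int = 3) -> str:
    """Combine retrieved passages and the user question into one prompt."""
    selected = passages[:max_passages]  # passages are assumed pre-ranked
    if selected:
        context = "\n".join(
            f"[{i}] {text}" for i, text in enumerate(selected, start=1)
        )
    else:
        context = "(no passages retrieved)"
    return (
        "Use the numbered reference passages to answer. "
        "If they are insufficient, say so.\n\n"
        f"References:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Placeholder passages standing in for retrieved medical text.
toy_passages = ["PASSAGE-ONE", "PASSAGE-TWO", "PASSAGE-THREE", "PASSAGE-FOUR"]
prompt = build_grounded_prompt("What could cause a persistent cough?",
                               toy_passages)
print(prompt)
```

Numbering the passages lets the generated answer cite which reference supports each statement, which in turn makes it easier to check an answer against its sources.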

The Architecture of Trust

Erkang-Diagnosis-1.1 integrates a robust Medical Risk Control system as a foundational element, proactively addressing the potential for patient harm stemming from diagnostic inaccuracies. This system doesn’t simply flag problematic responses; it employs a multi-layered approach to scrutinize both user inputs and the AI’s generated advice. Critical medical claims are subjected to validation against established clinical guidelines and curated knowledge bases, while the system also assesses the overall risk level associated with any suggested course of action. By identifying and mitigating potentially harmful advice – such as incorrect diagnoses or inappropriate treatment suggestions – the platform prioritizes patient safety and establishes a critical safeguard against the inherent uncertainties of AI-driven healthcare solutions. This proactive approach aims to ensure the technology functions as a supportive tool for medical professionals, rather than a replacement for qualified judgment.

Erkang-Diagnosis-1.1 incorporates a robust content security filtering system as a crucial safeguard against inappropriate or dangerous interactions. This filter actively blocks prompts and requests concerning harmful activities, unethical advice, or topics outside the scope of responsible healthcare consultation. By preemptively identifying and rejecting such content, the system ensures that the AI remains focused on providing safe, constructive, and medically sound guidance. This proactive measure not only protects patients from potentially damaging information but also reinforces the AI’s commitment to ethical practice and responsible innovation within the healthcare landscape, establishing a firm boundary against misuse and promoting trustworthy dialogue.

Erkang-Diagnosis-1.1 employs a sophisticated instruction fine-tuning process, meticulously guided by a State Machine, to ensure consistently effective and prudent consultations. This isn’t simply about providing answers; the State Machine actively directs the conversational flow, preventing potentially harmful or irrelevant tangents. By predefining permissible conversational states and transitions, the system proactively steers interactions towards clinically sound reasoning and appropriate advice. This methodology allows the AI to navigate complex medical inquiries with a focused approach, prioritizing patient safety and responsible information delivery – a significant advancement over less structured models where conversations can easily drift into unhelpful or even dangerous territory. The result is a healthcare assistant designed not just to respond, but to consult with a carefully controlled and beneficial exchange.
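
The notion of predefined states and permitted transitions can be sketched as a small finite-state machine. The state names and transition table below are hypothetical, and the sketch enforces transitions at runtime purely for illustration; the source says the State Machine guides instruction fine-tuning but does not list the states or explain how they are applied.

```python
from enum import Enum
from typing import Dict, List, Set


class State(Enum):
    GREETING = "greeting"
    SYMPTOM_COLLECTION = "symptom_collection"
    PRELIMINARY_ANALYSIS = "preliminary_analysis"
    GUIDANCE = "guidance"
    CLOSED = "closed"


# Hypothetical transition table: each state maps to the states it may enter.
ALLOWED: Dict[State, Set[State]] = {
    State.GREETING: {State.SYMPTOM_COLLECTION},
    State.SYMPTOM_COLLECTION: {State.SYMPTOM_COLLECTION,
                               State.PRELIMINARY_ANALYSIS},
    State.PRELIMINARY_ANALYSIS: {State.SYMPTOM_COLLECTION, State.GUIDANCE},
    State.GUIDANCE: {State.CLOSED},
    State.CLOSED: set(),
}


class Consultation:
    """Tracks the current conversational state and rejects illegal jumps."""

    def __init__(self) -> None:
        self.state = State.GREETING
        self.history: List[State] = [State.GREETING]

    def advance(self, target: State) -> None:
        if target not in ALLOWED[self.state]:
            raise ValueError(
                f"transition {self.state.value} -> {target.value} not allowed"
            )
        self.state = target
        self.history.append(target)


session = Consultation()
session.advance(State.SYMPTOM_COLLECTION)
session.advance(State.PRELIMINARY_ANALYSIS)
session.advance(State.GUIDANCE)
print([s.value for s in session.history])
```

In this toy table, guidance cannot be reached directly from the greeting, which captures the idea that advice should only follow symptom collection and a preliminary analysis.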

Erkang-Diagnosis-1.1 isn’t simply an advancement in diagnostic AI; it represents a deliberate effort to establish a new standard for trustworthiness in the field. Beyond achieving comparable or superior performance to models like GPT-4, the system prioritizes patient safety through integrated risk controls and content security filters. This focus on responsible AI extends to the very core of its conversational design, utilizing a State Machine-guided instruction fine-tuning process to ensure consultations remain effective and ethically sound. The culmination of these measures isn’t merely about building a more accurate assistant, but about fostering a reliable healthcare companion capable of earning and maintaining patient confidence – a crucial distinction as AI increasingly integrates into sensitive medical contexts.

The pursuit of Erkang-Diagnosis-1.1, as detailed in this report, reveals a fundamental truth about complex systems: they are not constructed, but cultivated. The model’s reliance on a vast medical knowledge graph, augmenting the Qwen-3 base, isn’t about achieving perfect recall, but fostering an environment where relevant information emerges. This echoes Blaise Pascal’s observation: “The dignity of man lies in thought.” The system doesn’t know medicine; it provides the infrastructure for thoughtful diagnosis. Monitoring its performance, therefore, isn’t simply about error detection; it’s the art of fearing consciously, anticipating the inevitable revelations that expose the limits of its current understanding and guide its continued evolution. True resilience isn’t in avoiding failure, but in embracing the lessons revealed within it.

What Lies Ahead?

The construction of Erkang-Diagnosis-1.1, like all such endeavors, merely clarifies the shape of the inevitable compromises. A system boasting superiority today, even against a benchmark as shifting as GPT-4, has already begun its decay. The medical knowledge graph, however meticulously curated, will invariably lag behind the relentless churn of new findings, becoming a monument to past certainties. Performance gains are, after all, temporary reprieves from the chaos of incomplete information.

The true challenge isn’t building a more exhaustive database, but accepting the inherent fragility of knowledge itself. Future work will likely focus not on scaling these systems, but on building in mechanisms for graceful degradation, for admitting error not as a bug, but as a fundamental property of the domain. The aim shouldn’t be to solve medical diagnosis, but to create systems that are predictably, and perhaps even elegantly, wrong.

One anticipates a proliferation of these specialized models, each a carefully constructed illusion of competence. The interesting questions won’t be about accuracy scores, but about the ethical implications of deploying systems designed to fail in subtly different ways. Every deploy is, inevitably, a small apocalypse.


Original article: https://arxiv.org/pdf/2512.20632.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2025-12-26 23:04