Author: Denis Avetisyan
A new framework empowers clinicians to directly translate their medical expertise into functional AI models for image analysis.

Autonomous coding agents enable a clinician-driven approach to developing deep learning pipelines for medical imaging tasks, accelerating AI refinement and deployment.
Traditional clinical AI development relies on iterative communication between clinicians and specialized AI teams, often creating a bottleneck in translating clinical intent into executable models. This study, ‘From Clinical Intent to Clinical Model: An Autonomous Coding-Agent Framework for Clinician-driven AI Development’, introduces an autonomous coding-agent framework designed to empower clinicians to directly build AI models using natural language. Across diverse medical imaging tasks, including lesion classification, fracture detection, and debiased pneumothorax classification, the system successfully generated functional deep learning pipelines from clinician requests, even mitigating shortcut learning in complex scenarios. Could this approach fundamentally shift the landscape of clinical AI, making development more accessible and responsive to real-world clinical needs?
Whispers of Chaos: Bridging the Clinical Divide
Historically, the development of artificial intelligence for medical applications has frequently stumbled due to a disconnect between technical creation and clinical reality. Many AI models, trained on datasets lacking the subtle complexities of patient care, demonstrate diminished performance when implemented in actual healthcare settings. This discrepancy arises because traditional AI pipelines often prioritize statistical accuracy over the nuanced judgment inherent in medical decision-making; critical elements such as patient history, clinical context, and the art of differential diagnosis are frequently underrepresented or entirely absent in training data. Consequently, algorithms may excel in controlled environments but falter when confronted with the unpredictable variability of real-world cases, leading to decreased clinician trust and limited practical utility. The result is a need for AI systems that are not merely accurate, but also clinically relevant and capable of integrating seamlessly into existing workflows.
Clinician-Driven AI represents a significant shift in artificial intelligence development, allowing medical professionals to directly shape AI models using everyday language. Rather than relying on data scientists to interpret complex clinical needs, this paradigm empowers doctors and specialists to articulate their requirements in natural language, effectively becoming the architects of the AI solutions they employ. This direct involvement ensures the resulting models are not only technically sound but, crucially, clinically relevant and immediately usable within existing workflows. By bypassing the traditional translation layer, the approach minimizes the risk of misinterpretation and accelerates the deployment of AI tools tailored to specific medical challenges, ultimately promising more effective and impactful healthcare applications.
The realization of Clinician-Driven AI hinges on the development of an autonomous coding agent – a system capable of interpreting complex clinical requests expressed in natural language and converting them into executable artificial intelligence solutions. This agent doesn't simply execute pre-programmed tasks; it actively writes code, iteratively refining algorithms and models based on the clinician's specifications. Such a system demands advanced capabilities in natural language understanding, code generation, automated testing, and debugging – effectively functioning as a tireless, AI-powered software engineer dedicated to medical applications. The agent must also manage data integration, ensuring compatibility with diverse healthcare datasets and maintaining patient privacy, all while providing clinicians with transparent, understandable outputs that validate the model's logic and performance. Ultimately, this autonomous coder bridges the gap between medical expertise and technical implementation, democratizing AI development within healthcare.

From Intention to Algorithm: The Automated Pipeline
The initial step in automating the development pipeline involves a semantic parser which transforms clinician-provided requests, expressed in natural language, into a formalized, structured task representation. This representation explicitly defines both the clinical problem to be addressed and the desired outcomes of the resulting model. The parser identifies key entities, relationships, and constraints within the natural language input, converting them into a machine-readable format suitable for downstream processing. This structured format typically includes specifications for input data types, expected output formats, relevant clinical guidelines, and quantifiable performance metrics used to evaluate success. The output of the semantic parser serves as the foundational blueprint for subsequent automated stages, ensuring alignment between clinical need and technical implementation.
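As a rough illustration, the structured task representation might resemble the following Python sketch. The `TaskSpec` fields and the keyword-based `parse_request` stand-in are our assumptions for demonstration only; the paper's actual semantic parser is presumably far more sophisticated than keyword matching:

```python
# Hypothetical sketch of a structured task representation; field names
# and the toy keyword parser are illustrative, not from the paper.
from dataclasses import dataclass, field

@dataclass
class TaskSpec:
    clinical_problem: str            # e.g. "pneumothorax classification"
    input_modality: str              # e.g. "chest radiograph"
    output_format: str               # e.g. "binary label"
    metrics: list = field(default_factory=list)  # evaluation metrics

def parse_request(text: str) -> TaskSpec:
    """Toy keyword-based parser standing in for the NLU component."""
    text_l = text.lower()
    problem = ("pneumothorax classification"
               if "pneumothorax" in text_l else "unknown")
    modality = ("chest radiograph"
                if "x-ray" in text_l or "radiograph" in text_l else "unknown")
    metrics = [m for m in ("auc", "sensitivity", "specificity") if m in text_l]
    return TaskSpec(problem, modality, "binary label", metrics)

spec = parse_request(
    "Build a model that flags pneumothorax on chest X-ray images; "
    "report AUC and sensitivity."
)
```

The key point is the output contract: downstream stages consume a machine-readable specification rather than free text.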
Upon receiving a structured task representation derived from clinical input, the task initializer automatically configures the core components for model development. This process involves selecting an appropriate model architecture – encompassing choices regarding layer types, network depth, and connectivity – based on the task's characteristics. Simultaneously, a training recipe is generated, defining parameters such as batch size, learning rate, optimization algorithm, and data augmentation strategies. Finally, an evaluation protocol is established, specifying the metrics used to assess model performance – including accuracy, precision, recall, and F1-score – and the methodology for validating the model's generalizability to unseen data. This automated configuration streamlines the development process and ensures consistency across tasks.
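A minimal sketch of what an auto-generated configuration could contain. The architecture names, hyperparameter defaults, and the `initialize_task` lookup are illustrative assumptions, not the framework's actual initializer:

```python
# Illustrative task initializer; all names and defaults are assumed.
def initialize_task(task_type: str) -> dict:
    """Map a structured task type to an architecture, training recipe,
    and evaluation protocol (toy lookup, not the real component)."""
    architecture = {"classification": "resnet50", "detection": "yolov8"}[task_type]
    recipe = {
        "batch_size": 32,
        "learning_rate": 1e-4,
        "optimizer": "adamw",
        "augmentations": ["flip", "rotate", "intensity_jitter"],
    }
    evaluation = {
        "metrics": (["auc", "precision", "recall", "f1"]
                    if task_type == "classification" else ["map50"]),
        "validation": "held-out split",
    }
    return {"architecture": architecture,
            "recipe": recipe,
            "evaluation": evaluation}

cfg = initialize_task("detection")
```

Packaging architecture, recipe, and protocol into one object keeps every downstream experiment reproducible from a single configuration.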
The autonomous development process utilizes iterative experimentation to optimize model performance. This involves automatically generating and evaluating numerous model variations, with each iteration informed by pre-defined clinical priorities – such as diagnostic accuracy or reduced false positive rates – and quantitative performance metrics including area under the ROC curve (AUC), precision, and recall. The system doesn’t rely on manual intervention for hyperparameter tuning or architectural changes; instead, it employs algorithms to intelligently explore the solution space and identify improvements across a range of clinical tasks. This cycle of experimentation and refinement continues until pre-defined performance thresholds are met or a maximum iteration limit is reached, resulting in a model optimized for clinical utility.
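The experimentation cycle described above can be sketched generically. The `refine` loop, the toy `propose`/`evaluate` stand-ins, and the AUC target below are all hypothetical illustrations, not the paper's algorithm:

```python
# Generic refinement loop: keep the best-scoring variant until the
# clinical target is met or the iteration budget runs out.
import random

def refine(evaluate, propose, target_score=0.90, max_iters=20, seed=0):
    random.seed(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(max_iters):
        candidate = propose()              # new model/hyperparameter variant
        score = evaluate(candidate)        # e.g. validation AUC
        if score > best_score:
            best_cfg, best_score = candidate, score
        if best_score >= target_score:     # clinical threshold reached
            break
    return best_cfg, best_score

# Toy stand-ins: the "AUC" improves as the learning rate nears 1e-4.
propose = lambda: {"lr": 10 ** random.uniform(-5, -3)}
evaluate = lambda cfg: 0.95 - abs(cfg["lr"] - 1e-4) * 500
cfg, score = refine(evaluate, propose)
```

In the real system, `propose` would be driven by the coding agent's reasoning over clinical priorities rather than random search.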

Evidence in Action: Clinical Applications
The proposed methodology demonstrates applicability across diverse medical imaging tasks, including pneumothorax classification from chest radiographs, wrist fracture detection in skeletal radiographs, and dermoscopic lesion classification from skin lesion images. These use cases were selected to represent a range of imaging modalities, anatomical locations, and clinical challenges. Successful implementation across these varied applications highlights the generalizability of the approach and its potential for broader deployment in diagnostic radiology and dermatology. Datasets specific to each task – SIIM-ACR Pneumothorax, GRAZPEDWRI-DX, and ISIC 2019 – were utilized for both training and rigorous validation of model performance.
The SIIM-ACR Pneumothorax, GRAZPEDWRI-DX, and ISIC 2019 datasets are publicly available resources crucial for developing and validating the proposed approach. The SIIM-ACR Pneumothorax dataset contains chest radiographs with and without pneumothorax, facilitating the training of algorithms for its detection. GRAZPEDWRI-DX provides pediatric wrist X-rays labeled for fracture presence, enabling the development of wrist fracture detection models. ISIC 2019 consists of dermoscopic images of skin lesions, categorized into multiple classes, and is used for training and evaluating algorithms for lesion classification, including melanoma detection. These datasets offer standardized, labeled data, allowing for quantitative performance assessment and comparison of different methodologies in medical image analysis.
Refinement of the dermoscopic lesion classification model resulted in significant performance gains across multiple metrics. The area under the receiver operating characteristic curve (AUC) for 8-class lesion categorization improved from 0.8786 to 0.9153. Specifically focusing on melanoma detection, the AUC increased from 0.8422 to 0.9155. A substantial improvement was observed in melanoma sensitivity at a fixed 80% specificity, increasing dramatically from 0.6021 to 0.9089, indicating a considerable reduction in false negative diagnoses.
Evaluation of the wrist fracture detection model demonstrated a performance increase following refinement, as measured by the mean Average Precision at an Intersection over Union threshold of 50% (mAP@50). Initially, the model achieved a mAP@50 of 0.7943. Following the refinement process, this metric improved to 0.8517, indicating enhanced accuracy in identifying wrist fractures within the validation dataset. This represents a quantifiable improvement in the model's ability to correctly localize and classify fracture instances.
Model performance can be negatively impacted by the presence of spurious correlations within training datasets. Specifically, in pneumothorax imaging, the presence of chest drains – a common treatment for pneumothorax – can be misinterpreted by the model as a defining characteristic of the condition itself, rather than an artifact of treatment. This leads to false positives and reduced generalization ability. Addressing this requires the implementation of robust debiasing techniques, such as data augmentation strategies that include and exclude images with chest drains, or the application of attention mechanisms that focus on relevant anatomical features and minimize the influence of these confounding factors.

Beyond Accuracy: Strengthening Robustness and Generalization
Gradient reversal represents a compelling strategy for enhancing the robustness of machine learning models when faced with confounding variables. This technique effectively flips the sign of the gradient during backpropagation for specific features suspected of introducing bias. By penalizing the model for relying on these confounding factors, gradient reversal encourages it to focus on more relevant and informative signals. This process doesn't eliminate the confounding features entirely – which might be important for other tasks – but rather diminishes their undue influence on the primary prediction. Consequently, the model becomes less susceptible to spurious correlations and generalizes more effectively to unseen data, offering improved reliability in real-world applications where confounding factors are pervasive.
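A minimal sketch of a gradient reversal layer (GRL), written with a hand-rolled backward pass to avoid framework dependencies. In the forward pass the GRL is the identity; in the backward pass it scales the incoming gradient by -λ, so an encoder feeding a confounder head is pushed to make the confounder *harder* to predict. The toy scalar chain below is our illustration, not the paper's implementation:

```python
# Gradient reversal layer: identity forward, negated gradient backward.
def grl_forward(x):
    return x  # identity in the forward direction

def grl_backward(grad_output, lam=1.0):
    return -lam * grad_output  # flip (and scale) the gradient

# Toy chain: feature f -> GRL -> confounder head y = w * f,
# with loss = 0.5 * y**2.
f, w = 2.0, 0.5
y = w * grl_forward(f)
dloss_dy = y                       # d(0.5*y^2)/dy
grad_into_grl = dloss_dy * w       # gradient arriving at the GRL w.r.t. f
grad_to_feature = grl_backward(grad_into_grl, lam=1.0)
# Without the GRL the feature gradient would be +0.5; the GRL flips it
# to -0.5, so the encoder update *degrades* the confounder prediction.
```

In a deep learning framework this is typically implemented as a custom autograd function inserted between the shared encoder and the confounder classification head.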
The integration of labeled and unlabeled data, known as mixed supervision, presents a powerful strategy for improving model performance when fully annotated datasets are scarce. This approach leverages the abundance of readily available, yet unannotated, data to complement the information contained within a smaller labeled set. By employing techniques like self-training or consistency regularization, models can learn robust representations from both data sources, effectively generalizing to unseen examples. The benefit extends beyond simply increasing data volume; unlabeled data can provide crucial information about the underlying data distribution, helping the model to disentangle relevant features and reduce overfitting, ultimately leading to enhanced accuracy and reliability even with limited annotations.
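One common form of mixed supervision, consistency regularization, can be sketched with scalar stand-ins. The linear `model`, the additive-noise "augmentations", and the loss weighting `lam` are all illustrative assumptions:

```python
# Toy mixed-supervision objective: supervised loss on labeled data plus
# a consistency penalty on unlabeled data (disagreement between two
# augmented views of the same input). All components are stand-ins.
def model(x, w=0.8):
    return w * x  # stand-in for a network's predicted score

def supervised_loss(x, y):
    return (model(x) - y) ** 2

def consistency_loss(x, noise=0.1):
    view_a, view_b = x + noise, x - noise  # two "augmentations"
    return (model(view_a) - model(view_b)) ** 2

def mixed_loss(labeled, unlabeled, lam=0.5):
    sup = sum(supervised_loss(x, y) for x, y in labeled) / len(labeled)
    cons = sum(consistency_loss(x) for x in unlabeled) / len(unlabeled)
    return sup + lam * cons

loss = mixed_loss(labeled=[(1.0, 1.0), (0.0, 0.0)],
                  unlabeled=[0.3, 0.7, 0.5])
```

The unlabeled term carries no ground-truth labels; it only rewards predictions that are stable under perturbation, which is what lets the abundant unannotated data shape the learned representation.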
Group-balanced sampling represents a critical strategy for mitigating biases inherent in imbalanced datasets, a common challenge in medical image analysis where certain conditions or demographics may be significantly underrepresented. This technique doesn't simply oversample minority groups; instead, it strategically constructs training batches to ensure each predefined subgroup (defined by characteristics like age, sex, or disease severity) contributes a proportionate representation. By preventing the model from being disproportionately influenced by the majority class, group-balanced sampling fosters equitable performance across all subgroups, leading to more reliable and fair diagnostic outcomes. The approach actively combats the tendency of algorithms to prioritize accuracy on the dominant groups while potentially misclassifying or overlooking those with limited representation, ultimately enhancing the clinical utility and trustworthiness of the AI system.
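The batch-construction idea can be sketched as follows; the subgroup labels (`drain`/`no_drain`) and equal per-group quotas are illustrative assumptions:

```python
# Sketch of group-balanced batch construction: each subgroup gets an
# equal share of the batch regardless of its prevalence in the dataset.
import random
from collections import defaultdict

def group_balanced_batch(samples, batch_size, rng):
    """samples: list of (item, group) pairs; returns one balanced batch."""
    by_group = defaultdict(list)
    for item, group in samples:
        by_group[group].append(item)
    groups = sorted(by_group)
    per_group = batch_size // len(groups)
    batch = []
    for g in groups:
        # Sample with replacement so tiny subgroups can still fill a quota.
        batch.extend(rng.choices(by_group[g], k=per_group))
    return batch

rng = random.Random(0)
data = ([(f"d{i}", "drain") for i in range(5)]        # rare subgroup
        + [(f"n{i}", "no_drain") for i in range(95)])  # common subgroup
batch = group_balanced_batch(data, batch_size=8, rng=rng)
# Each subgroup contributes 4 of the 8 items despite the 5/95 imbalance.
```

Sampling with replacement is the standard trade-off here: it guarantees balance but means rare-subgroup images are revisited often within an epoch.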
Recent advancements in medical image analysis demonstrate a significant reduction in diagnostic bias through targeted debiasing techniques. Specifically, a study focusing on pneumothorax detection revealed a substantial decrease in false positive rates linked to the presence of chest drains – dropping from 60% to 31%. This improvement wasn’t simply a matter of overall accuracy; the methodology also successfully diminished the statistical dependence between predicted pneumothorax probability and the mere presence of a chest drain by 47%. This suggests the model learned to focus on actual radiographic evidence of pneumothorax, rather than being unduly influenced by the procedural context, paving the way for more reliable and equitable diagnostic tools in clinical settings.
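One way to quantify the kind of dependence the study reports reducing is the correlation between predicted pneumothorax probability and a binary chest-drain indicator. This specific metric is our assumption for illustration; the paper may use a different dependence measure:

```python
# Pearson correlation between model scores and a 0/1 chest-drain flag,
# as a stand-in measure of prediction/confounder dependence.
import statistics as st

def correlation(preds, drains):
    mp, md = st.mean(preds), st.mean(drains)
    cov = sum((p - mp) * (d - md) for p, d in zip(preds, drains)) / len(preds)
    return cov / (st.pstdev(preds) * st.pstdev(drains))

# Toy scores on four images (first two have drains): a biased model's
# scores track the drain flag tightly; a debiased model's scores less so.
biased = correlation([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
debiased = correlation([0.9, 0.3, 0.6, 0.1], [1, 1, 0, 0])
```

Tracking such a statistic alongside accuracy is what distinguishes a genuinely debiased model from one that merely trades sensitivity for specificity.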

The pursuit of clinician-driven AI, as detailed in this framework, isn't about achieving perfect accuracy – it's about coaxing order from the inherent ambiguity of medical data. The agent functions as a scribe, translating intention into a language the machine understands, but the true magic lies in the iterative refinement. As Geoffrey Hinton once observed, "The problem with deep learning is that it's a black box, and we don't really understand how it works." This holds true; the agent doesn't solve the clinical problem, it provides a vessel for clinicians to shape the model, to guide the darkness and, through feedback, to refine the spell until it momentarily holds against the chaos of real-world data. It's not about creating a perfect model, but a responsive one.
The Algorithm’s Due
This exercise in automated translation from clinical intent to executable code merely highlights how little the question was ever about intelligence. The agent doesn't understand a diagnostic task; it correlates syntax. Any success suggests only that clinicians, perhaps unconsciously, already speak the language of shortcuts – a troubling thought. The true measure won't be accuracy on curated datasets, but the elegance with which the resulting pipelines fail in production – a testament to the inevitable discord between the model's promise and the messy reality it attempts to tame.
The pursuit of "clinician-driven AI" risks becoming an echo chamber. If the agent consistently validates pre-existing clinical biases, it doesn't augment expertise; it embalms it. The interesting problems aren't those solved by the framework, but those it consistently misinterprets, revealing the unspoken assumptions embedded within the clinical request. Those errors, diligently ignored, will be the seeds of future, more spectacular, failures.
The next iteration won’t be about refinement, but about controlled demolition. A system designed to actively seek out its own limitations, to generate adversarial examples not of the data, but of the request itself. Only then might the whispers of chaos be heard above the din of calculated confidence.
Original article: https://arxiv.org/pdf/2604.17110.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-04-21 13:23