Author: Denis Avetisyan
Researchers have developed an artificial intelligence system that autonomously discovers and refines its own image analysis techniques, promising more adaptable and accurate clinical decision support.

This work introduces MACRO, a self-evolving agent that learns reusable, multi-step procedures for complex medical image interpretation via reinforcement learning and tool discovery.
Current medical AI agents struggle with the dynamic nature of clinical image interpretation, often relying on pre-defined toolsets that degrade across tasks and evolving diagnostic needs. This work, ‘Evolving Medical Imaging Agents via Experience-driven Self-skill Discovery’, introduces MACRO, a self-evolving agent capable of autonomously discovering and integrating reusable, multi-step procedures, or composite tools, from its own successful experiences. By shifting from static tool composition to experience-driven skill discovery, MACRO demonstrates improved orchestration accuracy and cross-domain generalization on diverse medical imaging datasets. Could this approach unlock truly adaptive, context-aware clinical AI assistance that continuously learns and improves alongside practitioners?
The Inevitable Limits of Static Diagnosis
Conventional medical diagnosis frequently depends on a finite collection of diagnostic tools, coupled with the subjective assessment of experienced clinicians. This reliance creates inherent bottlenecks in both speed and precision; the process is limited by the availability of pre-defined tests and the time required for manual interpretation. As medical imaging generates increasingly complex datasets, and patient presentations become more nuanced, these static approaches struggle to keep pace. The cognitive load on clinicians rises as they sift through vast amounts of information, increasing the potential for oversight or misdiagnosis. Consequently, the efficiency of healthcare systems is hampered, and the timely delivery of accurate diagnoses – crucial for effective treatment – is often delayed.
Conventional diagnostic methods often falter when confronted with the inherent messiness of real-world medical imaging. Biological systems are rarely textbook perfect; images frequently exhibit noise, artifacts, and subtle variations that challenge even experienced clinicians. Beyond image quality, a complete understanding necessitates integrating complex patient context – encompassing medical history, genetic predispositions, lifestyle factors, and even environmental exposures. Static diagnostic tools, designed to identify pre-defined patterns, struggle to reconcile these nuances, frequently overlooking critical details or generating false positives. This limitation isn’t simply a matter of technological inadequacy; it reflects the fundamental difficulty of reducing an individual’s unique health status to a limited set of quantifiable features, highlighting the need for more adaptable and holistic approaches to diagnosis.
Contemporary medical diagnosis is frequently hampered by workflows demanding substantial clinician effort and mental processing. A typical case often necessitates a series of discrete steps – image acquisition, initial review, potential consultation with specialists, further imaging, and finally, a conclusive report – each contributing to significant time delays and placing a heavy cognitive burden on healthcare professionals. This multi-stage process isn’t simply additive in terms of time; each step requires focused attention and the integration of potentially conflicting data, increasing the risk of oversight or misinterpretation. The cumulative effect is a system susceptible to human error and prone to inefficiencies, ultimately impacting both the quality of patient care and the overall sustainability of healthcare systems.
The escalating complexity of modern medicine demands a shift towards diagnostic systems capable of more than static analysis; increasingly critical are adaptive, intelligent agents. These agents aren’t intended to replace clinicians, but to augment their abilities by synthesizing information from diverse sources – medical imaging, patient history, genomic data, and real-time monitoring. Such systems move beyond simple pattern recognition, employing complex reasoning to weigh probabilities, identify subtle anomalies often missed by the human eye, and contextualize findings within the individual patient’s unique presentation. This ability to integrate visual and contextual data promises to not only accelerate diagnostic timelines, but also to improve accuracy and personalize treatment strategies, ultimately leading to more effective patient care and a reduction in diagnostic errors.

Orchestrating Intelligence: The Medical Agent Framework
The Medical Agent is a system designed to facilitate complex diagnostic and treatment processes by integrating a Large Language Model (LLM) as its central orchestrator. This LLM doesn’t function in isolation; instead, it directs and manages a collection of specialized tools, effectively acting as an intelligent intermediary. The LLM receives input relating to a medical case, determines the appropriate tools needed for analysis or action, and then sequences their execution. This allows the system to move beyond simple question answering and engage in multi-step reasoning and problem-solving, leveraging the strengths of both the LLM’s natural language processing capabilities and the precision of dedicated tools. The agent’s architecture prioritizes flexibility and scalability through this modular design, enabling the incorporation of new tools and capabilities as they become available.
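The orchestration loop described above can be illustrated with a minimal Python sketch. This is not MACRO's implementation: the `plan` function is a hard-coded stand-in for the LLM, and the tool names (`segment`, `measure`, `classify`), the placeholder measurement, and the file name are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical primitive tools; each takes and returns a shared state dict.
def segment(state: dict) -> dict:
    return {**state, "mask": f"mask({state['image']})"}

def measure(state: dict) -> dict:
    return {**state, "ratio": 0.7}  # placeholder measurement, not real data

def classify(state: dict) -> dict:
    label = "suspicious" if state["ratio"] > 0.6 else "normal"
    return {**state, "label": label}

TOOLS: Dict[str, Callable[[dict], dict]] = {
    "segment": segment,
    "measure": measure,
    "classify": classify,
}

def plan(question: str) -> List[str]:
    """Stand-in for the LLM planner: maps a question to a tool sequence."""
    return ["segment", "measure", "classify"]

def run_agent(image: str, question: str) -> dict:
    """Execute the planned tools in order, recording which tools ran."""
    state = {"image": image, "question": question, "trace": []}
    for name in plan(question):
        state = TOOLS[name](state)
        state["trace"] = state["trace"] + [name]
    return state

result = run_agent("fundus_001.png", "Is glaucoma likely?")
```

Keeping the tools behind a name-to-function registry is what makes the design modular: a new tool only needs a new entry in `TOOLS`, and the planner can refer to it by name.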
The Medical Agent employs a dynamic “Tool Integration” mechanism, eschewing pre-defined, static toolsets in favor of accessing and combining capabilities from external resources as needed. This involves identifying relevant tools – encompassing APIs, databases, and specialized algorithms – based on the specific requirements of a diagnostic task. Rather than being limited to a fixed inventory, the agent can query and utilize a broad spectrum of existing functionalities, enabling adaptability to novel challenges and access to the most appropriate resources for each situation. This integration is performed programmatically, allowing for automated orchestration of tools and the seamless flow of data between them to achieve a desired outcome.
Composite Tool Synthesis enables the Medical Agent to automatically generate and verify complex procedures by chaining together primitive tools. This process involves iteratively discovering sequences of tool applications, evaluating their performance against defined criteria, and refining the resulting composite tool for optimal execution. Validation incorporates both simulated testing environments and, where feasible, retrospective analysis of clinical data to ensure reliability and safety. Successfully synthesized composite tools are then stored and made available for reuse across multiple diagnostic tasks, reducing redundant reasoning and improving overall efficiency. The system prioritizes tools with established provenance and documented performance characteristics during synthesis to minimize risk and maximize the quality of the generated procedures.
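One simple way to picture the discovery step is frequency-based mining over successful tool trajectories. The sketch below rests on assumptions: the article does not specify how candidate sequences are found, and the function `mine_composites`, its thresholds, and the tool names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

def mine_composites(
    trajectories: List[List[str]],
    min_len: int = 2,
    max_len: int = 3,
    min_support: int = 2,
) -> Dict[str, Tuple[str, ...]]:
    """Register contiguous tool subsequences that recur across trajectories."""
    counts: Counter = Counter()
    for traj in trajectories:
        seen = set()  # count each pattern at most once per trajectory
        for n in range(min_len, max_len + 1):
            for i in range(len(traj) - n + 1):
                seen.add(tuple(traj[i:i + n]))
        counts.update(seen)
    # Keep only patterns supported by enough distinct trajectories.
    return {"+".join(seq): seq for seq, c in counts.items() if c >= min_support}

# Toy trajectories from (hypothetically) successful cases.
successful = [
    ["crop", "segment", "measure", "classify"],
    ["segment", "measure", "classify"],
    ["crop", "enhance", "classify"],
]
registry = mine_composites(successful)
```

Here `registry` ends up holding `segment+measure`, `measure+classify`, and `segment+measure+classify`, since each appears in two trajectories. A real system would also need the verification step the paragraph mentions before a candidate is stored for reuse.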
The Medical Agent’s diagnostic reasoning is dynamically adjusted based on the requirements of each individual case. Rather than applying a pre-defined analytical pathway, the system evaluates the presented clinical data and selects, combines, and sequences tools – including both single-step utilities and previously synthesized Composite Tools – to address the specific information gaps and complexities of the problem. This involves assessing the relevance of different data types, prioritizing lines of inquiry, and iteratively refining the analytical strategy as new evidence emerges, ultimately allowing for a more focused and efficient diagnostic process compared to static, rule-based systems.
![MACRO operates by retrieving relevant experiences based on image similarity, generating tool sequences with rewards for utilizing registered composite tools, and dynamically updating a memory store [latex]\mathcal{M}[/latex] and composite tool registry [latex]\mathcal{C}[/latex] with successful trajectories to facilitate the online discovery of new tools.](https://arxiv.org/html/2603.05860v1/x2.png)
Reinforcement and Refinement: Learning to Diagnose
The Medical Agent is trained using Reinforcement Learning (RL), a technique enabling the agent to learn optimal diagnostic strategies through trial and error. Specifically, the Group Relative Policy Optimization (GRPO) algorithm is used to encourage the discovery and use of effective composite tools – reusable, multi-step sequences of image-analysis tools. For each input, GRPO samples a group of candidate responses from the current policy, scores each one, and computes its advantage relative to the group’s mean reward, which avoids the need for a separately learned value function. Tool sequences that outperform the group average are reinforced, steering the agent toward beneficial tool combinations and improved diagnostic accuracy. This approach differs from single-tool analysis by rewarding the combined effect of multi-step diagnostic pathways.
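The group-relative idea at the heart of GRPO can be sketched in a few lines of Python: sample several rollouts for the same case, score them, and normalize each reward against the group's mean and standard deviation. The reward shaping below (a correctness term plus a bonus for using a registered composite tool) follows the figure caption only loosely; the function names and the bonus value of 0.2 are assumptions, not values from the paper.

```python
from statistics import mean, pstdev
from typing import List

def shaped_reward(correct: bool, used_composite: bool, bonus: float = 0.2) -> float:
    """Hypothetical reward: 1 for a correct answer, plus a composite-tool bonus."""
    return float(correct) + (bonus if used_composite else 0.0)

def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its own group's mean and std deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rollouts sampled for the same input.
rewards = [
    shaped_reward(True, True),
    shaped_reward(True, False),
    shaped_reward(False, True),
    shaped_reward(False, False),
]
advantages = group_relative_advantages(rewards)
```

Rollouts above the group mean receive positive advantages and are reinforced; those below receive negative ones. The policy-gradient update itself (clipping, KL penalty) is omitted here.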
Performance validation of the Medical Agent utilized three distinct datasets representing different medical domains: the MITEA Dataset, focused on heart disease diagnosis; the REFUGE2 Dataset, specializing in glaucoma detection through optic disc imaging; and the RAM-W600 Dataset, which provides data for bone erosion analysis in radiographic images. These datasets were chosen to ensure the agent’s diagnostic capabilities were assessed across a range of pathologies and imaging modalities, providing a comprehensive evaluation of its generalization ability and robustness to diverse clinical presentations. Data from these sources facilitated quantitative assessment of the agent’s performance metrics, including F1 score and Balanced Accuracy, enabling objective comparison and refinement of its diagnostic reasoning.
Supervised learning serves as a critical initialization phase for the Medical Agent, leveraging labeled datasets to establish a foundational understanding of diagnostic patterns and best practices. This involves training the agent on examples of known medical conditions and corresponding diagnoses, allowing it to learn the relationships between patient data and established clinical guidelines. The resulting model provides a starting point for reinforcement learning, guiding the agent towards behaviors consistent with expert medical reasoning and reducing the exploration time required to achieve optimal performance. This pre-training also mitigates the risk of the agent developing potentially harmful or inaccurate diagnostic strategies during the reinforcement learning phase.
The Medical Agent utilizes an “Experience Memory” to facilitate ongoing learning and adaptation. This memory stores records of interactions deemed successful based on performance metrics during diagnostic tasks. Stored data includes input medical data, the agent’s actions taken, and the resulting outcome, enabling the agent to revisit and analyze effective strategies. By repeatedly accessing and learning from these stored experiences, the agent refines its diagnostic reasoning capabilities and improves its ability to generalize to novel cases and varying data distributions without explicit retraining, contributing to a continuously improving diagnostic performance.
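A memory of this kind can be sketched as a store that admits only successful episodes and retrieves the nearest ones by similarity. The example below is illustrative only: the class and field names are hypothetical, the two-dimensional embeddings are toy values, and cosine similarity is an assumed choice of similarity measure.

```python
import math
from dataclasses import dataclass, field
from typing import List, Sequence

@dataclass
class Experience:
    embedding: Sequence[float]  # hypothetical image embedding
    tools: List[str]            # tool sequence that was executed
    outcome: str                # final answer produced

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

@dataclass
class ExperienceMemory:
    items: List[Experience] = field(default_factory=list)

    def add_if_successful(self, exp: Experience, success: bool) -> None:
        """Only episodes judged successful are stored."""
        if success:
            self.items.append(exp)

    def retrieve(self, query: Sequence[float], k: int = 1) -> List[Experience]:
        """Return the k stored experiences most similar to the query."""
        ranked = sorted(
            self.items, key=lambda e: cosine(query, e.embedding), reverse=True
        )
        return ranked[:k]

memory = ExperienceMemory()
memory.add_if_successful(Experience([1.0, 0.0], ["segment", "measure"], "glaucoma"), True)
memory.add_if_successful(Experience([0.0, 1.0], ["crop", "classify"], "normal"), True)
memory.add_if_successful(Experience([0.9, 0.1], ["classify"], "wrong"), False)
nearest = memory.retrieve([0.8, 0.2], k=1)
```

The failed episode is never stored, so the query retrieves the `segment`, `measure` trajectory, which the agent could then use as an in-context hint for a similar image.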
Evaluation of the diagnostic framework on established datasets yielded an F1 score of 80.3% when tested against the REFUGE2 dataset, which focuses on glaucoma detection. Performance on the MITEA dataset, assessing heart disease diagnosis, resulted in a Balanced Accuracy (BACC) of 77.2%. These scores demonstrate a quantifiable improvement in diagnostic capability achieved through the framework’s ability to discover and utilize composite tools – combinations of diagnostic methods – over single-method approaches. The reported metrics represent the agent’s performance after reinforcement learning training and validation against these datasets.
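For readers unfamiliar with the two metrics, the following Python sketch computes F1 and Balanced Accuracy for a binary task from first principles. The labels are toy values chosen for illustration and have no connection to the reported results.

```python
from typing import Sequence, Tuple

def confusion(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Harmonic mean of precision and recall: 2TP / (2TP + FP + FN)."""
    tp, fp, fn, _ = confusion(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of sensitivity (recall on positives) and specificity."""
    tp, fp, fn, tn = confusion(y_true, y_pred)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return (sensitivity + specificity) / 2

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
f1 = f1_score(y_true, y_pred)            # 2/3
bacc = balanced_accuracy(y_true, y_pred)  # (2/3 + 4/5) / 2 = 11/15
```

Balanced Accuracy averages per-class recall, so it is less flattering than plain accuracy when one class dominates, which is common in medical datasets.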
Towards Predictive Healthcare: Impact and Future Directions
The increasing complexity of modern medicine often presents clinicians with intricate diagnostic challenges, demanding significant time and expertise to synthesize information from various sources. This research addresses this burden through an automated framework capable of constructing and executing complex diagnostic workflows. By intelligently orchestrating the analysis of patient data – including medical history, imaging results, and laboratory tests – the system aims to not only accelerate the diagnostic process but also enhance its precision. The automation minimizes the potential for human error and cognitive biases, leading to more reliable diagnoses, particularly in cases requiring the integration of multifaceted clinical information. This ultimately frees up valuable clinician time, allowing them to focus on patient care and complex cases requiring nuanced judgment, while simultaneously improving overall diagnostic accuracy and patient outcomes.
This diagnostic framework distinguishes itself through a remarkable capacity to generalize across diverse medical challenges. Beyond a single ailment, the system’s architecture facilitates effective analysis for conditions ranging from the complexities of heart disease – identifying subtle indicators of cardiac dysfunction – to the nuanced detection of glaucoma, a leading cause of blindness, and even the early stages of bone erosion indicative of osteoporosis or other skeletal disorders. This broad applicability stems from the framework’s design, which prioritizes the identification of core diagnostic reasoning patterns rather than memorizing condition-specific rules, allowing it to be readily adapted to new medical domains with minimal retraining and offering a versatile tool for comprehensive healthcare assessment.
The framework’s versatility stems from the implementation of parameter-efficient fine-tuning techniques, notably Low-Rank Adaptation (LoRA), coupled with the AdamW optimizer. This methodology allows the system to rapidly adapt to new medical datasets and varying clinical contexts without requiring extensive retraining of all model parameters. By freezing the pretrained weights and training only small low-rank update matrices, LoRA significantly reduces computational costs and training time, facilitating quick deployment in diverse healthcare settings. The AdamW optimizer complements this by applying weight decay decoupled from the gradient-based update, a form of regularization that helps limit overfitting and promotes generalization to unseen data, ultimately improving diagnostic performance across a spectrum of medical conditions and ensuring the system remains responsive to evolving clinical needs.
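The low-rank idea behind LoRA can be shown with a small pure-Python sketch: a frozen weight matrix W is combined with a trainable update (alpha / r) · B · A, where B and A are much smaller than W. The matrix sizes, values, and helper names below are illustrative assumptions and are unrelated to the model used in the paper.

```python
from typing import List

Matrix = List[List[float]]

def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x n) and b (n x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(w: Matrix, b: Matrix, a: Matrix, alpha: float) -> Matrix:
    """Return W + (alpha / r) * B @ A, where r is the rank (rows of A)."""
    r = len(a)
    delta = matmul(b, a)
    scale = alpha / r
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]

def trainable_params(d_out: int, d_in: int, r: int) -> int:
    """Parameters in B (d_out x r) plus A (r x d_in)."""
    return r * (d_out + d_in)

w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # frozen 2 x 3 weight
b = [[1.0], [2.0]]                       # trainable 2 x 1
a = [[0.5, 0.0, 1.0]]                    # trainable 1 x 3
w_eff = lora_effective_weight(w, b, a, alpha=2.0)

full = 4096 * 4096                       # full fine-tuning of one square layer
lora = trainable_params(4096, 4096, r=8)
```

For a single 4096 × 4096 layer at rank 8, the adapter has 65,536 trainable parameters against 16,777,216 for the full matrix, which is where the savings in compute and memory come from.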
The progression of this diagnostic framework extends beyond the laboratory, with ongoing research dedicated to seamless integration into existing clinical workflows. This involves not simply presenting diagnostic suggestions, but actively collaborating with electronic health record systems and decision support tools to provide clinicians with readily accessible, contextually relevant insights. Simultaneously, exploration into the realm of personalized medicine is underway, leveraging the framework’s adaptable architecture to tailor diagnostic approaches based on individual patient characteristics, genetic predispositions, and lifestyle factors. The ultimate aim is to move beyond generalized diagnoses towards predictive and preventative healthcare, where early detection and individualized treatment plans become the standard of care, optimizing patient outcomes and reducing the burden on healthcare systems.
The diagnostic framework demonstrates a substantial leap forward in heart disease detection, achieving a remarkable 74.9% improvement in the F1 score when contrasted with currently available medical agentic systems. This metric, a harmonic mean of precision and recall, signifies a heightened ability to both correctly identify patients with heart disease and avoid false positives – a critical balance in clinical application. This performance establishes a new state-of-the-art benchmark, suggesting the framework’s potential to significantly enhance diagnostic accuracy and reduce the burden on healthcare professionals. The considerable gain over existing methods underscores the efficacy of the approach and its readiness for further evaluation and eventual integration into clinical practice, promising more reliable and timely diagnoses for patients at risk.

The pursuit of adaptable intelligence, as demonstrated by MACRO’s self-skill discovery, echoes a fundamental truth about complex systems. It isn’t about pre-programmed perfection, but about fostering an environment where components can learn and forgive each other’s imperfections. As Claude Shannon observed, “The most important thing in a complex system is the way in which its parts interact.” This interaction, allowing for the emergence of “composite tools” from successful procedures, isn’t simply about efficiency; it’s about building resilience through redundancy and adaptation. The system doesn’t strive for a fixed solution, but rather cultivates a garden of possibilities, pruning failures and nurturing promising new growth. This echoes the core idea of the article: that adaptability stems not from rigid design, but from a system’s capacity to learn from its own experience.
What’s Next?
The pursuit of self-evolving agents, as demonstrated by this work, is less about constructing intelligence and more about cultivating a capacity for graceful failure. A system that never discovers a better tool is, functionally, already deceased. The current iteration, while demonstrating the potential of experience-driven skill discovery, remains tethered to the specifics of imaging interpretation. The true test will lie in its capacity to extrapolate – to apply the principle of composite tool creation to domains where the initial scaffolding is absent.
Limitations are, of course, inherent. The reliance on successful interactions as the sole driver of evolution risks entrenching existing biases and overlooking genuinely novel, yet initially unsuccessful, approaches. A system optimized solely for positive reinforcement will inevitably prize predictability over exploration. The next stage must therefore grapple with the uncomfortable truth that error is not a bug, but a vital component of any adaptive process.
Perfection, in this context, is a theoretical dead end. A perfectly optimized agent, capable of flawlessly executing a predefined set of tasks, leaves no room for people – for the serendipitous discoveries and contextual adaptations that define true clinical expertise. The goal, then, is not to replace interpretation, but to augment it – to create a symbiotic system where agent and clinician evolve together, each learning from the other’s failures.
Original article: https://arxiv.org/pdf/2603.05860.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-03-10 04:42