Beyond Automation: Reclaiming Human Insight in the Age of AI

Author: Denis Avetisyan


New research argues that data science workflows should prioritize human reasoning and knowledge contribution, not just automated results.

Designing AI data science processes around transparent intermediate artifacts empowers users to refine analytical questions and improve model outcomes.

While generative AI promises to democratize data science, its end-to-end automation often obscures the critical reasoning processes needed for complex analytical tasks. This paper, ‘More Than “Means to an End”: Supporting Reasoning with Transparently Designed AI Data Science Processes’, investigates how intentionally designed intermediate artifacts (readable outputs generated within AI workflows) can empower users to evaluate choices and refine questions. Our analysis of AI systems in the medical domain reveals that these artifacts facilitated effective reasoning and knowledge contribution, even amidst opaque underlying processes. How can the HCI community best leverage intermediate artifacts to cultivate more thoughtful and effective human-in-the-loop data science?


Decoding the Clinical Signal: Beyond the Bottleneck

The sheer volume of patient data locked within unstructured clinical notes presents a significant bottleneck for medical advancement. Historically, researchers and clinicians have relied on manual chart reviews or rule-based systems to glean insights from this text, a process that is both time-consuming and prone to error. These traditional methods struggle to scale with the exponential growth of healthcare data, hindering efforts to identify patterns, improve patient care, and accelerate medical discovery. The nuanced and often ambiguous nature of clinical language – including abbreviations, jargon, and variations in documentation styles – further complicates the task, making it difficult to reliably extract key information using conventional approaches. Consequently, valuable insights remain hidden within these notes, awaiting more efficient and sophisticated methods of analysis.

The promise of artificial intelligence to revolutionize medical research faces significant hurdles due to the inherent ambiguity of human language. Clinical notes, while rich with potentially vital information, are rarely structured in a way that machines can easily interpret; nuanced phrasing, abbreviations, and the sheer variety of ways a single medical concept can be expressed all contribute to this challenge. Simply identifying keywords isn’t enough; accurate concept identification requires AI systems to understand the context of each phrase and differentiate between similar terms, a task demanding sophisticated natural language processing capabilities. Consequently, extracting meaningful insights from these textual datasets requires more than just computational power; it necessitates algorithms capable of discerning subtle linguistic cues and reliably mapping them to standardized medical knowledge, a process that remains a central focus of ongoing research.
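To make the concept-identification challenge concrete, here is a minimal sketch of mapping free-text variants and abbreviations onto standardized concept labels. The synonym table and concept names are entirely hypothetical; production systems map to controlled vocabularies such as SNOMED CT or UMLS with far richer matching logic than word-boundary lookups.

```python
import re

# Hypothetical synonym table: surface phrases -> standardized concept.
# Real normalizers resolve to controlled-vocabulary codes, not strings.
CONCEPT_SYNONYMS = {
    "myocardial infarction": "MYOCARDIAL_INFARCTION",
    "mi": "MYOCARDIAL_INFARCTION",
    "heart attack": "MYOCARDIAL_INFARCTION",
    "htn": "HYPERTENSION",
    "hypertension": "HYPERTENSION",
    "high blood pressure": "HYPERTENSION",
}

def normalize_concepts(note: str) -> set:
    """Return the standardized concepts mentioned in a clinical note.

    Word-boundary matching prevents short abbreviations like 'mi'
    from firing inside unrelated words such as 'admitted'.
    """
    text = note.lower()
    found = set()
    for phrase, concept in CONCEPT_SYNONYMS.items():
        if re.search(r"\b" + re.escape(phrase) + r"\b", text):
            found.add(concept)
    return found

note = "Pt w/ hx of HTN, presented with suspected heart attack."
print(sorted(normalize_concepts(note)))
# ['HYPERTENSION', 'MYOCARDIAL_INFARCTION']
```

Even this toy version shows why keyword search alone fails: two different surface forms ("HTN" and "high blood pressure") must collapse to one concept, while spurious substring hits must be suppressed.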

Data Integrity: The Ghost in the Machine

Data leakage in healthcare AI model development occurs when information from the future, unavailable at the time of prediction, is used to train the model, leading to unrealistically optimistic performance estimates and flawed real-world application. This can manifest through several mechanisms, including the inclusion of post-diagnosis information as a predictor, utilizing data collected after the event being predicted, or improper cross-validation techniques that allow future data to influence past predictions. The result is a model that appears accurate during testing but fails to generalize effectively to new, unseen data, potentially leading to incorrect diagnoses or treatment recommendations. Rigorous data handling protocols and validation strategies are therefore essential to identify and mitigate data leakage before model deployment.

HACHI employs a multi-faceted approach to mitigate data leakage, beginning with prospective model development where data is strictly partitioned into training, validation, and testing sets based on temporal order. Rigorous validation procedures include hold-out testing on unseen future data, as well as cross-validation techniques adapted to time-series data to prevent information from future time points influencing past predictions. Feature engineering is carefully controlled, excluding any variables derived from future knowledge. Finally, all model outputs are subject to post-hoc analysis for potential leakage indicators, such as unusually high predictive accuracy on the test set or unexpected correlations between predicted and actual values, ensuring the integrity and reliability of findings.
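The core of the temporal partitioning idea can be sketched in a few lines. HACHI's actual pipeline is not public, so the record structure and cutoff below are illustrative only; the point is the invariant that no training record postdates any test record.

```python
from datetime import date

# Hypothetical monthly patient encounters for 2021.
encounters = [{"date": date(2021, m, 1), "patient": f"p{m}"}
              for m in range(1, 13)]

def temporal_split(records, cutoff):
    """Partition strictly by time: everything before the cutoff trains,
    everything at or after it is held out for testing. A random split
    would let future encounters leak into training."""
    train = [r for r in records if r["date"] < cutoff]
    test = [r for r in records if r["date"] >= cutoff]
    return train, test

train, test = temporal_split(encounters, date(2021, 10, 1))

# The leakage-prevention invariant: training data strictly precedes test data.
assert max(r["date"] for r in train) < min(r["date"] for r in test)
print(len(train), len(test))
# 9 3
```

A plain random 75/25 split would satisfy the same size ratio while silently violating the assertion above, which is exactly the failure mode described in the preceding paragraphs.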

HACHI prioritizes algorithmic fairness through the implementation of weighted performance metrics designed to mitigate biased outcomes in healthcare predictions. Initial model performance, measured by Area Under the Curve (AUC), was 0.71 across two hospital campuses. Following iterative refinement incorporating human feedback and adjustments to weighting mechanisms, the AUC improved to 0.93, demonstrating a substantial reduction in bias and improved predictive accuracy. This process ensures equitable performance across patient demographics and minimizes the potential for disparities in healthcare delivery based on algorithmic predictions.
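The paper reports the AUC figures but not the weighting mechanism itself. The sketch below shows one common form such a mechanism can take: a pairwise AUC in which each positive/negative pair is weighted by the product of its sample weights, so that errors on an up-weighted subgroup cost more. This is an assumption about the general technique, not HACHI's actual formula.

```python
def weighted_auc(y_true, y_score, weights):
    """Pairwise AUC with per-sample weights.

    Each (positive, negative) pair contributes w_pos * w_neg to the
    denominator; correctly ranked pairs add the same amount to the
    numerator (ties count half). With all weights equal to 1 this
    reduces to the ordinary AUC.
    """
    num = den = 0.0
    for yi, si, wi in zip(y_true, y_score, weights):
        for yj, sj, wj in zip(y_true, y_score, weights):
            if yi == 1 and yj == 0:
                w = wi * wj
                den += w
                if si > sj:
                    num += w
                elif si == sj:
                    num += 0.5 * w
    return num / den

# Unweighted: one of four positive/negative pairs is mis-ranked.
print(weighted_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.2], [1, 1, 1, 1]))
# 0.75
```

Up-weighting the well-ranked positive (weights `[2, 1, 1, 1]`) shifts the score to 5/6, illustrating how subgroup weights move the metric without touching the predictions themselves.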

Tempo: Extracting Order from Chaos

Tempo is an artificial intelligence system engineered to identify and extract temporal event data – information relating to specific occurrences and their timing – from electronic health records (EHRs). Unlike traditional methods that rely on keyword searches, which can be imprecise and miss nuanced data, Tempo employs AI algorithms to understand the context of clinical notes and accurately pinpoint events and their associated timestamps. This capability extends beyond simple event detection to include the extraction of relationships between events, durations, and sequences, providing a more comprehensive and granular understanding of patient histories than is achievable through conventional data retrieval techniques. The system is designed to process unstructured text within EHRs, converting it into structured, analyzable data points for improved clinical research, decision support, and patient care.

Tempo employs TempoQL, a purpose-built query language designed for the specific challenges of extracting and aggregating temporal event data from electronic health records. Unlike general-purpose query languages, TempoQL facilitates precise identification of events based on time and sequence, addressing the complexity of clinical narratives. This functionality extends the capabilities of the existing EHR-Agent system, providing a more robust and nuanced approach to data retrieval beyond simple keyword-based searches. The language is structured to handle the inherent ambiguity and variability found in clinical documentation, enabling researchers and clinicians to define complex temporal relationships between events with greater accuracy.
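TempoQL's actual syntax is not reproduced in this article, so the sketch below only mirrors the semantics of the kind of question it is described as expressing, for example "event A followed by event B within N days". Patient IDs, event names, and dates are invented for illustration.

```python
from datetime import date, timedelta

# Hypothetical per-patient event stream: (patient_id, event, timestamp).
events = [
    ("p1", "diagnosis:T2DM", date(2022, 1, 5)),
    ("p1", "rx:metformin",   date(2022, 1, 20)),
    ("p2", "diagnosis:T2DM", date(2022, 3, 1)),
    ("p2", "rx:metformin",   date(2022, 6, 15)),
]

def followed_within(events, first, second, window_days):
    """Patients in whom `first` is followed by `second` within the window.

    This is the temporal-sequence predicate that plain keyword search
    cannot express: both events must occur, in order, for the SAME
    patient, within a bounded interval.
    """
    hits = set()
    for pid_a, ev_a, t_a in events:
        for pid_b, ev_b, t_b in events:
            if (pid_a == pid_b and ev_a == first and ev_b == second
                    and t_a <= t_b <= t_a + timedelta(days=window_days)):
                hits.add(pid_a)
    return hits

print(followed_within(events, "diagnosis:T2DM", "rx:metformin", 30))
# {'p1'}
```

Only p1 started metformin within 30 days of diagnosis; widening the window to 200 days would also admit p2. A dedicated query language makes such windowed, ordered, per-patient constraints first-class rather than hand-rolled loops.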

The Tempo system incorporates an AI Assistant to facilitate data retrieval through Natural Language Query (NLQ). This allows users to pose questions in standard English, which are then translated into the TempoQL query language. Performance evaluations demonstrate the AI Assistant generates correct TempoQL queries 2.5 times more frequently than equivalent queries written directly in SQL when addressing the same temporal data extraction tasks. This improved accuracy reduces the need for manual query refinement and enhances the efficiency of data access within electronic health records.

Beyond the Black Box: Exposing the Algorithmic Mind

Both HACHI and Tempo are designed with a crucial component for human collaboration: ‘Intermediate Artifacts’. These aren’t simply outputs, but rather visible representations of the AI’s thought process – the specific data transformations, feature selections, and modeling choices made during analysis. By externalizing these analytical steps, the systems allow researchers to scrutinize how a conclusion was reached, not just the conclusion itself. This approach facilitates a collaborative workflow where human expertise can guide the AI, correct errors in reasoning, and ensure the findings align with domain knowledge. The resulting transparency isn’t merely about understanding the ‘black box’ – it’s about actively steering the analysis and building confidence in the AI’s insights.

The ability to scrutinize an AI’s decision-making process is paramount to fostering confidence in its results, and both HACHI and Tempo address this through detailed intermediate artifacts. Rather than presenting only final outputs, these artifacts expose the analytical steps undertaken, allowing researchers to trace the logic behind each conclusion. This validation process, while demanding approximately one to two hours per team review, isn’t merely about error correction; it’s a crucial step in understanding how the AI arrived at its findings, strengthening trust and ensuring responsible application of the technology. By exposing the ‘reasoning’ behind the data analysis, these systems move beyond black-box predictions, offering a pathway to collaborative intelligence and verifiable insights.

The PCS Framework establishes a formalized structure for data science workflows, moving beyond ad-hoc processes to ensure reproducibility and reliability. This framework meticulously encodes established best practices – encompassing data handling, model selection, and analytical techniques – into a standardized, reusable system. By defining clear protocols and automating repetitive tasks, PCS minimizes the potential for human error and facilitates consistent application of rigorous methodologies. This structured approach not only streamlines data science projects but also enables easier auditing, validation, and knowledge transfer, ultimately fostering greater confidence in the resulting analyses and predictive models.

The Emerging Intelligence: LLMs as the Foundation

The integration of generative artificial intelligence, driven by Large Language Models (LLMs), represents a fundamental shift in the operational capacity of both HACHI and Tempo. These LLMs are no longer simply supplemental tools; they are becoming deeply interwoven with the core functionality of each system, enabling previously unattainable levels of automated reasoning and insight. This isn’t merely about automating simple tasks, but about empowering the platforms to understand, interpret, and synthesize complex medical data with a degree of sophistication mirroring human expertise. Consequently, both HACHI and Tempo are evolving from data repositories into active knowledge engines, poised to accelerate the pace of medical discovery through increasingly intelligent data processing and analysis.

Large Language Models function as the core reasoning component within complex data processing pipelines, moving beyond simple keyword searches to nuanced understanding. These models don’t merely identify information; they interpret relationships, contextualize data points, and infer meaning from unstructured text, such as clinical notes or research publications. This capability facilitates sophisticated data extraction, pinpointing not just what information exists, but how it relates to a specific query or hypothesis. Consequently, analyses previously requiring significant manual effort – like identifying patterns in patient histories or synthesizing findings across multiple studies – are now achievable with greater speed and accuracy, accelerating the pace of medical discovery and enabling more informed decision-making.

The trajectory of medical discovery is increasingly intertwined with the evolution of Large Language Models. Current applications, such as sophisticated data extraction and analysis within systems like HACHI and Tempo, represent only a nascent stage of their potential. As LLMs continue to advance – exhibiting improved reasoning, contextual understanding, and the ability to synthesize information from vast datasets – their capacity to accelerate biomedical research will expand exponentially. Future iterations promise not only to refine existing analytical capabilities but also to facilitate de novo hypothesis generation, predict treatment efficacy with greater precision, and even personalize medicine based on individual patient profiles. This ongoing refinement positions LLMs as critical tools, capable of transforming the landscape of healthcare and unlocking insights previously inaccessible to researchers.

The pursuit of genuinely useful AI, as outlined in the paper, isn’t simply about achieving a desired outcome; it’s about understanding how that outcome is reached. This echoes Linus Torvalds’ sentiment: “Talk is cheap. Show me the code.” The article champions a workflow built on transparently designed intermediate artifacts – essentially, the ‘code’ of the data science process – allowing users to dissect analytical choices and validate assumptions. By forcing explicit representation of reasoning, the system isn’t a black box delivering answers, but a landscape for exploration, aligning with the core idea of empowering users to contribute unique knowledge and refine their questions through direct engagement with the analytical process.

What’s Next?

The insistence on ‘transparent’ AI often fixates on explaining decisions after the fact. This work suggests the more fruitful path lies in designing systems that invite scrutiny during creation, making the process of analysis itself legible. It raises the question: if the goal isn’t simply prediction, but augmented cognition, then what constitutes a useful ‘intermediate artifact’? The current landscape treats these as implementation details, but they are, fundamentally, the levers by which a human can steer, refine, and interrogate the machine’s reasoning.

A persistent limitation remains the assumption of a stable ‘question.’ Data science, in practice, is rarely about answering a pre-defined query. Instead, it’s a conversation – a series of approximations, misinterpretations, and unexpected revelations. Future work should explore how these intermediate artifacts can actively prompt reformulation of the initial question, and how to represent the provenance of those changes. Every exploit starts with a question, not with intent.

Ultimately, the challenge isn’t building more powerful AI, but building AI that tolerates, even encourages, being broken. A system that resists human intervention is not transparent; it is merely opaque. The next step is to move beyond simply visualizing the machine’s logic and to create tools that allow a user to actively dismantle and rebuild it, piece by piece.


Original article: https://arxiv.org/pdf/2603.24877.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/


2026-03-27 07:55