Why AI Explanations Often Mislead

Author: Denis Avetisyan


A new analysis reveals that commonly used methods for understanding artificial intelligence can produce unreliable results, offering a false sense of certainty.

Even in randomly initialized neural networks, spurious correlations emerge between principal components of token representations and semantic labels, as demonstrated with BERT embeddings of IMDb sentences. Despite their untrained origins, these representations can be leveraged by simple probes to achieve nontrivial cross-validated accuracy, suggesting apparent structure even before training.
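To make the kind of experiment behind this observation concrete, the sketch below probes a randomly initialized BERT encoder with a PCA-plus-logistic-regression probe. The sentences and labels are a small toy stand-in for the IMDb data, and the mean-pooling and probe choices are our own assumptions rather than the paper's exact protocol.

```python
# Minimal sketch: a linear probe on token representations from a *randomly
# initialized* BERT encoder. The sentences/labels are a toy stand-in for IMDb;
# only the pretrained tokenizer vocabulary is used, never pretrained weights.
import torch
from transformers import BertConfig, BertModel, BertTokenizerFast
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

sentences = ["a beautiful, moving film", "utterly dull and lifeless",
             "I loved every minute", "a waste of two hours"] * 25
labels = [1, 0, 1, 0] * 25

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel(BertConfig())          # random weights: no training at all
model.eval()

with torch.no_grad():
    enc = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**enc).last_hidden_state             # (batch, seq, dim)
    mask = enc["attention_mask"].unsqueeze(-1)
    feats = (hidden * mask).sum(1) / mask.sum(1)         # mean-pooled tokens

# PCA + logistic-regression probe, scored with cross-validation.
probe = make_pipeline(PCA(n_components=16), LogisticRegression(max_iter=1000))
acc = cross_val_score(probe, feats.numpy(), labels, cv=5).mean()
print(f"cross-validated probe accuracy on random features: {acc:.2f}")
```

Swapping in actual IMDb sentences reproduces the setting described above; the toy labels here keep the pipeline runnable but make the classes trivially separable.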

Framing AI interpretability as a problem of statistical-causal inference is crucial for addressing identifiability issues and enabling robust uncertainty quantification in neural networks.

Despite advances in artificial intelligence, explaining how these systems arrive at decisions remains surprisingly fragile, mirroring the unsettling finding that analyses of brain activity can detect ‘social cognition’ even in a dead salmon. In ‘The Dead Salmons of AI Interpretability’, we argue that current AI interpretability methods are susceptible to similar artifacts due to fundamental issues of identifiability, leading to unreliable explanations and poor generalizability. We propose reframing interpretability as a statistical-causal inference problem, treating explanations as parameters of a model and demanding rigorous uncertainty quantification against explicit alternative hypotheses. Can this shift towards a more statistically grounded approach finally transform AI interpretability into a truly rigorous and pragmatic science?


The Evolving Illusion of Understanding

The increasing complexity of modern machine learning models presents a substantial hurdle in understanding why they make specific decisions. While algorithms achieve impressive predictive accuracy, discerning the underlying rationale often proves elusive, and reported explanations can be surprisingly unreliable. This isn’t simply a matter of needing better tools; the fundamental challenge lies in the inherent difficulty of reverse-engineering a complex function. Even with sophisticated interpretability techniques, explanations are frequently unstable – minor changes to the model or input data can dramatically alter the reported reasons for a prediction. Consequently, reliance on these explanations for critical applications – such as healthcare or finance – requires careful consideration, as the perceived understanding may not accurately reflect the model’s actual behavior and can lead to flawed conclusions.

The pursuit of understanding why machine learning models make certain decisions is plagued by inherent statistical vulnerabilities. Researchers have long recognized issues such as the “dead salmon artifact”, in which apparently task-related brain activity was detected in a dead salmon because of uncorrected statistical noise, and the “multiple comparison problem”, in which repeated statistical tests inflate the chance of finding spurious correlations. This work shows that many current interpretability methods exhibit a similar fragility, suffering from what is termed “non-identifiability”: multiple distinct model configurations can produce identical explanations, so a reported feature importance, for example, does not uniquely pinpoint the actual drivers of a model’s behavior. This casts doubt on the reliability of explanations generated by these tools and underscores the need for more robust approaches, since a statistically insignificant explanation can easily be mistaken for a genuine insight.
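A compact way to see the multiple-comparison problem in this setting, using a purely synthetic setup of our own, is to test many random representation components against random labels and count how many clear the usual significance threshold:

```python
# Minimal sketch of the multiple-comparison problem: testing many random
# "components" against random labels produces some nominally significant
# correlations purely by chance. Sizes and thresholds are illustrative.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_samples, n_components = 100, 500
X = rng.normal(size=(n_samples, n_components))   # random "representations"
y = rng.integers(0, 2, size=n_samples)           # random binary labels

pvals = np.array([pearsonr(X[:, j], y)[1] for j in range(n_components)])
print("components 'significant' at p<0.01:", int((pvals < 0.01).sum()))
print("expected by chance:", int(0.01 * n_components))
```

Roughly five of the 500 components clear the 1% threshold despite there being no signal at all, which is exactly the trap the dead salmon study dramatized.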

An interpretability task is defined by a hypothesis space, a distribution of causal queries about the system, and an error measure to evaluate those queries.
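The sketch below is our own schematic rendering of that definition, not code from the paper: a task bundles a hypothesis space, a sampler over causal queries, and an error measure, and a candidate explanation is scored by its average error over sampled queries.

```python
# Schematic rendering (ours) of an interpretability task as a triple of
# hypothesis space, causal-query distribution, and error measure.
from dataclasses import dataclass
from typing import Any, Callable, Iterable


@dataclass
class InterpretabilityTask:
    hypothesis_space: Iterable[Any]            # candidate explanations
    sample_queries: Callable[[int], list]      # draws causal queries about the system
    error: Callable[[Any, Any], float]         # compares an answer to the ground truth

    def risk(self, hypothesis, answer_with, ground_truth, n_queries=100):
        """Average error of one hypothesis over sampled causal queries."""
        queries = self.sample_queries(n_queries)
        return sum(self.error(answer_with(hypothesis, q), ground_truth(q))
                   for q in queries) / len(queries)
```

A concrete instance might use linear probes as the hypothesis space, interventions on input tokens as the queries, and squared error on the predicted effect as the error measure.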

The Limits of Definitive Identification

Identifiability, in the context of model analysis, refers to the degree to which the values of a model’s internal parameters can be uniquely estimated from observed data. Overparameterized (underdetermined) systems, those with more adjustable parameters than independent data points, frequently lack identifiability: multiple combinations of parameter values can produce the same observed output, preventing the recovery of a single, true parameter set. Consequently, even with abundant data, accurately determining the specific contribution of each parameter becomes statistically impossible, leading to uncertainty in model interpretation and potentially unreliable conclusions. The issue is not simply one of data scarcity; it is an inherent property of the model structure itself when the number of adjustable variables outstrips the constraints the data can impose.
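A minimal numerical illustration (ours, with arbitrary sizes): when a linear model has more parameters than independent observations, visibly different parameter vectors reproduce the data exactly, so the data alone cannot identify the “true” one.

```python
# Minimal sketch of non-identifiability: with more parameters than independent
# observations, distinct parameter vectors fit the data exactly, so no single
# "true" parameter set can be recovered from the data alone.
import numpy as np

rng = np.random.default_rng(1)
n_obs, n_params = 5, 20                      # more parameters than data points
X = rng.normal(size=(n_obs, n_params))
y = rng.normal(size=n_obs)

w1, *_ = np.linalg.lstsq(X, y, rcond=None)   # minimum-norm exact solution
null_dir = np.linalg.svd(X)[2][-1]           # a direction in the null space of X
w2 = w1 + 10.0 * null_dir                    # a very different parameter vector

print(np.allclose(X @ w1, y), np.allclose(X @ w2, y))   # both fit the data exactly
print(np.linalg.norm(w1 - w2))                           # yet the parameters differ
```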

The inability to uniquely determine model parameters – a condition known as non-identifiability – directly impacts the reliability of techniques used to explain model behavior. Methods such as feature attribution and probing, while intended to identify which input features or internal components contribute most to a specific output, can produce misleading results when the underlying model is non-identifiable. This is because multiple combinations of parameters can yield the same observed behavior, making it impossible to confidently assign causality to any particular feature or component. Consequently, interpretations derived from these techniques should be treated with caution, as they may reflect correlations rather than true dependencies within the model.

Feature attribution and probing methods, while intended to reveal the basis for model decisions, exhibit substantial statistical fragility. Evaluations on sentiment analysis and part-of-speech tagging show that probe accuracy is only meaningful relative to an explicit baseline: p-values below 0.01 were observed when comparing probe performance against baselines built by randomly reinitializing model weights, yet such significance alone does not establish genuine feature importance, since high probe accuracy can also arise from chance correlations within the training data or biases in the evaluation process. Consequently, interpretations derived from these methods should be treated with caution, as they may not generalize beyond the specific experimental setup.
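The mechanics of such a comparison can be sketched as follows; the “encoder” here is a toy stand-in for a network whose weights are re-initialized, and all numbers illustrate the procedure rather than the paper’s results.

```python
# Minimal sketch: compare a probe's accuracy against a null distribution of
# accuracies obtained from randomly re-initialized feature extractors.
# Everything here is a synthetic stand-in, not BERT.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X_raw, y = make_classification(n_samples=300, n_features=50, random_state=0)

def encoder(X, seed):
    """Stand-in feature extractor; different seeds mimic weight re-initializations."""
    W = np.random.default_rng(seed).normal(size=(X.shape[1], 32))
    return np.tanh(X @ W)

def probe_accuracy(features, labels):
    """Cross-validated accuracy of a linear probe on the given features."""
    return cross_val_score(LogisticRegression(max_iter=1000), features, labels, cv=5).mean()

# In the real experiment, `observed` would come from the pretrained model's features.
observed = probe_accuracy(encoder(X_raw, seed=123), y)
null = np.array([probe_accuracy(encoder(X_raw, seed=s), y) for s in range(50)])
p_value = (1 + (null >= observed).sum()) / (1 + len(null))
print(f"probe accuracy {observed:.3f}, re-init null mean {null.mean():.3f}, "
      f"empirical p-value {p_value:.3f}")
```

The point of the baseline is not to bless the observed accuracy, but to make explicit what an equally confident probe can achieve on features that carry no learned structure.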

Probing pretrained BERT models reveals that learned representations encode sentiment and syntactic information more effectively than random computations; similar probes uncover rudimentary world-modeling capabilities even in small language models such as pythia-160m.

Reconstructing Understanding Through Causal Inquiry

Statistical-causal inference reframes model interpretability not as a question of understanding the model itself, but as the problem of constructing a surrogate model capable of accurately answering specific causal queries about the system being modeled. This approach moves beyond simply identifying correlations within the data and focuses on determining the effect of interventions or changes to specific variables. By defining interpretability as the ability to predict outcomes under hypothetical scenarios, this framework allows for a more rigorous and quantifiable assessment of model understanding. The surrogate model, inferred from the original model’s behavior, then serves as a proxy for explaining how the system responds to external stimuli, enabling a focus on counterfactual reasoning and causal effect estimation.
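As a toy concretization (ours; the paper supplies the framing, not this code), the sketch below fits an interpretable surrogate to a black-box regressor and judges it on a single interventional query: the average change in prediction when one feature is set to a fixed value.

```python
# Minimal sketch: a surrogate model is evaluated not by overall fidelity but by
# how well it answers a specific interventional query about the black box.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = 3 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=1000)

black_box = GradientBoostingRegressor().fit(X, y)            # the opaque system
surrogate = DecisionTreeRegressor(max_depth=3).fit(X, black_box.predict(X))

def effect_of_do_x0(model, X, value=2.0):
    """Average effect on predictions of intervening: set feature 0 to `value`."""
    X_do = X.copy()
    X_do[:, 0] = value
    return (model.predict(X_do) - model.predict(X)).mean()

print("black-box answer :", round(effect_of_do_x0(black_box, X), 3))
print("surrogate answer :", round(effect_of_do_x0(surrogate, X), 3))
```

The gap between the two answers is the surrogate’s error on that causal query, which is precisely the quantity this framing asks us to measure.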

Establishing identifiability within a statistical-causal inference framework necessitates determining whether the causal effect of interest can be uniquely estimated from the observed data. This often involves assumptions about the underlying causal structure and the absence of unobserved confounders. Bayesian Inference provides a mechanism to address identifiability concerns by treating model parameters as random variables with prior distributions. These priors are then updated based on observed evidence, yielding posterior distributions that reflect the uncertainty in parameter estimates. This principled approach allows for a quantitative assessment of model behavior, moving beyond point estimates to provide a probabilistic understanding of causal effects and their associated confidence intervals, thereby facilitating more robust and reliable interpretations.
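A minimal example of what this buys in practice (a conjugate Beta-Binomial sketch of ours, with made-up counts): treat a probe’s true accuracy as an unknown parameter, place a prior on it, and report a posterior credible interval rather than a single number.

```python
# Minimal sketch of Bayesian uncertainty quantification for a probe's accuracy:
# a Beta prior updated with hypothetical correct/incorrect counts.
from scipy.stats import beta

correct, total = 143, 200          # hypothetical probe results on held-out data
prior_a, prior_b = 1.0, 1.0        # uniform Beta(1, 1) prior on accuracy

posterior = beta(prior_a + correct, prior_b + (total - correct))
lo, hi = posterior.ppf([0.025, 0.975])
print(f"posterior mean accuracy: {posterior.mean():.3f}")
print(f"95% credible interval  : [{lo:.3f}, {hi:.3f}]")
```

The same logic extends to richer explanation parameters; the point is that the uncertainty, not just the point estimate, becomes part of the explanation.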

Causal approaches to interpretability seek to recover genuine relationships within a model rather than mere correlations, by framing interpretability as a process of inferring causal effects. An evaluation based on spatial representation analysis quantified the differences between conditions using R² scores: models using embeddings reached 0.12, randomized models 0.38, and pretrained models the highest score of 0.45. These results suggest that framing explanations as causal hypotheses and scoring them against explicit baselines yields more robust, interpretable conclusions, with measurable differences in how well spatial structure is captured.

The Inevitable Fragility and the Pursuit of Robust Validation

Hypothesis testing continues to be a foundational element in validating artificial intelligence systems, yet its efficacy hinges on meticulous application and a keen awareness of potential pitfalls. Simply achieving statistical significance isn’t enough; researchers must prioritize the creation of robust null models through rigorous randomization techniques. This process effectively establishes a baseline against which observed results can be reliably compared, and critically, helps to disentangle genuine effects from spurious correlations. By carefully controlling for confounding factors – variables that could independently influence outcomes – randomization minimizes the risk of drawing incorrect conclusions and strengthens the validity of any claims made about a model’s performance. Without this careful approach, even seemingly significant findings can prove fragile and unreliable, ultimately hindering responsible AI development and deployment.
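One standard way to build such a null model, shown below on a synthetic stand-in dataset, is to permute the labels many times and re-score the analysis, so that the unshuffled result is compared against a distribution produced when no real signal exists (a different randomization from the weight re-initialization used earlier):

```python
# Minimal sketch of a randomization-based null model: shuffle the labels many
# times and re-score the probe to obtain a null distribution of accuracies.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

def score(labels):
    """Cross-validated accuracy of a linear probe for the given labels."""
    return cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=5).mean()

observed = score(y)
null = np.array([score(rng.permutation(y)) for _ in range(100)])
p_value = (1 + (null >= observed).sum()) / (1 + len(null))
print(f"observed accuracy {observed:.3f}, permutation p-value {p_value:.3f}")
```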

Despite advancements in interpretability techniques, machine learning models remain vulnerable to statistical fragility, even when employing methods designed to illuminate their inner workings. Tools like sparse autoencoders, which distill data into essential features, and concept-based explanations, which link model decisions to human-understandable concepts, offer valuable insights into model representations. However, these techniques are not immune to the inherent noise and variability within datasets. Subtle perturbations in training data or model parameters can lead to significant shifts in these explanations, creating a false sense of understanding. Consequently, relying solely on these methods to validate model robustness is insufficient; careful consideration of statistical significance and rigorous testing against adversarial examples are essential to ensure reliable and trustworthy artificial intelligence.
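For concreteness, here is a minimal sparse autoencoder of the kind referenced above, trained on synthetic stand-in activations with an L1 sparsity penalty; the dimensions, penalty weight, and training length are arbitrary choices of ours.

```python
# Minimal sketch of a sparse autoencoder: a single hidden layer trained to
# reconstruct activations, with an L1 penalty pushing most code units to zero.
import torch
import torch.nn as nn

torch.manual_seed(0)
activations = torch.randn(2048, 128)         # stand-in for model activations

class SparseAutoencoder(nn.Module):
    def __init__(self, d_in, d_code):
        super().__init__()
        self.encoder = nn.Linear(d_in, d_code)
        self.decoder = nn.Linear(d_code, d_in)

    def forward(self, x):
        code = torch.relu(self.encoder(x))   # non-negative, sparse features
        return self.decoder(code), code

sae = SparseAutoencoder(d_in=128, d_code=512)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_weight = 1e-3

for step in range(200):
    recon, code = sae(activations)
    loss = ((recon - activations) ** 2).mean() + l1_weight * code.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    _, code = sae(activations)
print("fraction of near-zero code units:", (code.abs() < 1e-3).float().mean().item())
```

The fragility discussed above enters through exactly these arbitrary choices: a different penalty weight, code size, or seed can yield a noticeably different dictionary of “features”.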

Acknowledging the inherent fragility of artificial intelligence models is paramount for their ethical and effective deployment. Recent investigations demonstrate that even sophisticated architectures can produce statistically significant, yet potentially misleading, results – spatial representations, for instance, yielded Z-scores of 100 for embeddings and 25 for pretrained models, showcasing a clear distinction from random baselines but not necessarily reflecting genuine understanding. This highlights the necessity for transparent development practices, where model limitations are openly communicated, and outputs are interpreted with appropriate caution; relying solely on performance metrics can be deceptive. Responsible AI necessitates a shift toward prioritizing robust validation, careful consideration of potential biases, and a nuanced understanding of what these models actually represent, fostering trust and preventing unintended consequences.

The pursuit of AI interpretability, as detailed in this work, often feels like attempting to chart a course through increasingly turbulent waters. Many current methods, while appearing to offer insight, prove statistically fragile, akin to building explanations on shifting sands. This instability echoes a fundamental truth about complex systems: they inevitably decay. Arthur C. Clarke observed, “Any sufficiently advanced technology is indistinguishable from magic.” While magic implies mystery, the true challenge isn’t simply observing the system’s behavior, but rigorously quantifying the uncertainty inherent in its explanations. Framing interpretability as a problem of statistical-causal inference, as the paper advocates, is a necessary step towards acknowledging this decay and building systems that age gracefully, offering explanations that remain reliable even as the underlying technology evolves.

What Lies Ahead?

The pursuit of interpretability, as this work suggests, is less about illumination and more about charting the inevitable decay of understanding. Systems, even those constructed of logic gates and weighted connections, are not immune to the passage of time, and explanations, however carefully constructed, are merely snapshots of a fleeting present. The identification of ‘dead salmon’ artifacts, explanations that persist even in randomly initialized networks where nothing has been learned, is not a bug in the method, but a symptom of a deeper truth: correlation masquerades as causation, and stability is often just a delay of disaster.

Future efforts must confront the fundamental non-identifiability inherent in complex systems. Framing interpretability as a statistical-causal inference problem is a logical step, yet it risks simply shifting the locus of uncertainty. Robust uncertainty quantification is vital, not to solve the problem of explanation, but to accurately map the boundaries of ignorance. The field will likely progress not toward perfect understanding, but toward increasingly sophisticated methods for acknowledging what cannot be known.

Ultimately, the question is not whether an AI’s reasoning can be fully explained, but whether its failures can be anticipated. Systems age not because of errors, but because time is inevitable. The challenge, then, is to build systems that age gracefully, and to develop tools that accurately reflect the eroding foundations of their explanations.


Original article: https://arxiv.org/pdf/2512.18792.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/


2025-12-23 14:26