Why Neural Networks Make Those Decisions: A New Explanation Framework

Author: Denis Avetisyan


Researchers have developed a method for formally explaining the reasoning behind prototype-based neural networks, offering greater insight into their decision-making processes.

The ProtoPNet architecture demonstrates how a complex classification can be distilled into an explanation anchored by a limited set of eleven prototypical parts, effectively reducing the dimensionality needed to discriminate between two classes.

This paper introduces Abductive Latent Explanations (ALE), a framework for generating rigorous, subset-minimal explanations within the latent space of prototype networks.

While prototype-based networks offer explanations alongside predictions, these explanations can be misleading, hindering their use in critical applications. This paper, ‘Formal Abductive Latent Explanations for Prototype-Based Networks’, addresses this limitation by introducing a framework for generating rigorous, formal explanations within the latent space of these networks. Specifically, we propose Abductive Latent Explanations (ALEs) – sufficient conditions on the instance’s representation that logically imply the prediction – offering guarantees beyond typical ‘interpretable by design’ approaches. Can this formalism unlock a new level of trustworthiness and reliability in prototype-based machine learning models, particularly in safety-sensitive domains?


The Opaque Oracle: Deconstructing the Black Box of Deep Learning

Despite achieving remarkable performance across diverse tasks, deep neural networks frequently function as opaque ‘black boxes’. This characteristic poses significant challenges to trust and accountability, particularly in high-stakes applications like healthcare or autonomous driving. The complexity of these networks – often comprising millions or even billions of interconnected parameters – makes it exceedingly difficult to discern why a specific input yields a particular output. While a model might accurately predict a diagnosis or navigate a vehicle, the lack of transparency in its decision-making process raises concerns about potential biases, vulnerabilities, and the ability to reliably generalize to novel situations. Consequently, the inscrutability of these powerful systems necessitates the development of techniques to illuminate their inner workings and foster greater confidence in their predictions.

Despite advances in explainable AI, techniques designed to illuminate the decision-making processes of deep neural networks often fall short of providing genuine understanding. Post-hoc explanation methods, including gradient-based approaches such as gradient backpropagation and axiomatic attribution techniques, typically identify input features that correlate with a given prediction. However, these methods frequently mistake correlation for causation, highlighting elements the model used rather than the underlying reasons for its choice. This limitation stems from the fact that these techniques analyze a trained model without access to the original training data or the designer’s intent; they essentially reverse-engineer the decision without knowing the initial blueprint. Consequently, the explanations generated can be misleading or incomplete, hindering efforts to build trust and accountability in complex AI systems and failing to offer genuinely justifying insights into the model’s reasoning.
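
To make the distinction concrete, the following minimal sketch computes a gradient-backpropagation saliency map in PyTorch; the model, input tensor, and class index are hypothetical placeholders. The resulting map shows where the model’s score is locally sensitive, which is evidence of correlation with the prediction, not a causal justification of it.

```python
import torch

def gradient_saliency(model, x, target_class):
    """Minimal gradient-backpropagation attribution sketch.

    Returns |d score / d input| per input feature; large values indicate
    features the class score is locally sensitive to, which correlates
    with, but does not prove, causal relevance.
    """
    model.eval()
    x = x.clone().detach().requires_grad_(True)   # track gradients w.r.t. the input
    score = model(x)[0, target_class]             # scalar logit for the class of interest
    score.backward()                              # backpropagate the score to the input
    return x.grad.abs()                           # saliency map: magnitude of sensitivity

# Usage (hypothetical model and preprocessed image batch of shape (1, C, H, W)):
# saliency = gradient_saliency(model, image, target_class=3)
```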

Current techniques for explaining the decisions of deep neural networks frequently fall short of providing genuine justification, instead revealing correlations that may not reflect underlying causal relationships. While methods like gradient-based attribution and axiomatic approaches can identify input features that influence a model’s output, they often highlight what appears important statistically, rather than what truly causes the decision. This distinction is critical; a feature strongly correlated with the outcome might simply be a byproduct of the actual reasoning process, leading to explanations that are misleading or even untrustworthy. Consequently, relying solely on these post-hoc explanations can create a false sense of understanding, obscuring the potential for bias or flawed logic within the network and hindering efforts toward building truly accountable artificial intelligence.

Formalizing Reason: The Rise of FXAI and Deductive Explanation

Formal Explainable AI (FXAI) departs from traditional explainability techniques by establishing explanations on a foundation of formal logic. Instead of relying on approximations or visualizations of model behavior, FXAI seeks to represent both the model’s functionality and its explanations as logical statements. This allows for the application of formal methods, specifically mathematical proof, to verify the validity of explanations. By translating model behavior into logical axioms and explanations into logical consequences, FXAI enables a rigorous assessment of whether an explanation is not simply plausible, but demonstrably true given the model’s internal logic. This contrasts with post-hoc interpretability methods that offer insights but lack the guarantees of formal verification.

Formal Explainable AI (FXAI) builds upon the established logic of abductive reasoning, a process of inferring the most likely explanation for a given observation. Traditionally used in fields like medical diagnosis and fault detection, abductive reasoning identifies hypotheses that, if true, would best account for the available evidence. FXAI adapts this principle to neural networks by framing the task of explanation as the search for logical rules that justify a model’s prediction for a specific input. Instead of simply providing post-hoc interpretations, FXAI aims to generate explanations expressed as logical statements, such as “If $x$ is true, then the model predicts $y$,” which can then be formally verified. This allows for a more rigorous and systematic approach to understanding why a neural network made a particular decision, moving beyond intuitive understanding to logically grounded justifications.

Formal Explainable AI (FXAI) utilizes Automated Theorem Provers (ATPs) to validate the logical soundness of explanations generated for neural network decisions. These ATPs, systems designed to mechanically prove mathematical theorems, are applied to formally verify whether a proposed explanation, typically a set of rules or conditions, is consistent with the model’s internal logic and observed outputs. Specifically, the model’s behavior is translated into logical statements, and the explanation is similarly formalized; the ATP then attempts to prove that the explanation logically implies the observed behavior. Successful verification provides a guarantee, within the bounds of the formalization, that the explanation accurately reflects the model’s reasoning process, moving beyond post-hoc interpretation to demonstrable logical consistency.
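
As an illustration of that verification step, the sketch below uses the Z3 SMT solver on a toy two-class linear score rather than a real network; the variables, the box-constraint explanation, and the linear margin are all assumptions made for the example. The explanation is accepted only if “explanation AND NOT(prediction)” is unsatisfiable, i.e. the implication holds for every input the explanation admits.

```python
from z3 import Real, Solver, And, Not, unsat

# Toy two-class linear "model": class 1 wins iff 2*x1 - x2 + 1 > 0.
x1, x2 = Real("x1"), Real("x2")
margin = 2 * x1 - x2 + 1

# Candidate explanation: a box constraint on the two features.
explanation = And(x1 >= 1, x2 <= 2)

# The explanation is sound iff "explanation AND NOT(prediction)" has no model.
s = Solver()
s.add(explanation, Not(margin > 0))
if s.check() == unsat:
    print("Explanation verified: it logically implies the class-1 prediction.")
else:
    print("Counterexample found:", s.model())
```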

Traditional explainable AI (XAI) methods typically focus on interpreting a model’s decisions post hoc, providing insights into why a specific prediction was made. Formal Explainable AI (FXAI) shifts this paradigm towards certification of reasoning; rather than simply offering an explanation, FXAI aims to provide a mathematically verifiable guarantee that the model’s behavior adheres to specified logical constraints. This is achieved by formally representing the model’s functionality and then using automated theorem provers to confirm whether a given explanation is logically consistent with the model’s established rules and inputs, effectively moving from plausibility to provability in AI explanations.

Latent Reasoning: Abductive Explanation Within the Network’s Mind

Abductive Latent Explanations (ALE) represent a departure from traditional explanation methods by framing explanations not in the input feature space, but directly within the latent space of a neural network. This is achieved by identifying conditions on the latent representation that, whenever they hold, are sufficient to yield a prediction consistent with the original input’s classification. Defining explanations in the latent space allows for a more nuanced understanding of the network’s internal reasoning, as it moves beyond simply highlighting input features and instead focuses on the network’s abstracted representation of the data. This approach enables the generation of explanations that are internal to the model and directly tied to its decision-making process, rather than being interpretations of external inputs.

The Abductive Latent Explanation (ALE) framework requires that any proposed explanation satisfy a specific precondition to establish logical justification for a network’s prediction. This ALE precondition dictates that the latent representation associated with the explanation must lie closer to the input’s latent representation than the representation of the correct class does. Formally, if $x$ is the input, $h(x)$ its latent representation, and $e$ a potential explanation, the precondition is met if $d(h(x), h(e)) < d(h(x), h(y))$, where $d$ is a distance metric and $y$ represents the correct class label. Failure to meet this precondition invalidates the explanation, as it indicates that the explanation’s latent representation is less representative of the input than that of the correct class, and therefore cannot logically support the network’s output.
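
A minimal sketch of that check, taking the inequality above at face value with a Euclidean distance and hypothetical latent vectors:

```python
import numpy as np

def ale_precondition_holds(h_x, h_e, h_y, dist=None):
    """Check the ALE precondition d(h(x), h(e)) < d(h(x), h(y)).

    h_x : latent representation of the input
    h_e : latent representation associated with the candidate explanation
    h_y : latent representation standing in for the correct class
          (e.g. its nearest class prototype)
    dist: distance metric, Euclidean by default
    """
    if dist is None:
        dist = lambda a, b: np.linalg.norm(a - b)
    return dist(h_x, h_e) < dist(h_x, h_y)

# Hypothetical 4-dimensional latent vectors:
h_x = np.array([0.2, 1.0, -0.5, 0.3])
h_e = np.array([0.1, 0.9, -0.4, 0.2])   # close to the input
h_y = np.array([1.5, -0.8, 0.7, 1.1])   # far from the input
print(ale_precondition_holds(h_x, h_e, h_y))  # True: the explanation is admissible
```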

Spatial constraints leverage the geometric properties of the latent space to improve explanation validity. The underlying principle is that valid explanations should reside within a topologically consistent region relative to the input data point and the model’s decision boundary. Specifically, the search for explanatory prototypes is restricted to areas of the latent space where similar data instances also yield the same prediction, thereby reducing the combinatorial space of possible explanations. This refinement is achieved by analyzing the neighborhood structure of the latent space; prototypes distant from the input or residing in regions associated with different classes are systematically excluded, increasing the efficiency and reliability of the abductive reasoning process.

Analysis of prototype-based explanations reveals a substantial underestimation of the number of prototypes required for guaranteed correctness. Previous work often relied on a limited number of prototypes per explanation; however, the analysis here indicates that achieving full explanatory power typically requires more than ten prototypes. This finding is particularly relevant when assessing the fidelity of explanations, as fewer prototypes may not adequately represent the reasoning process within the neural network. The increased prototype requirement stems from the complexity of the latent space and the need to comprehensively account for the factors contributing to a given prediction, suggesting that existing methods may provide incomplete or potentially misleading explanations.

Analysis of incorrect classifications within the Abductive Latent Explanations framework reveals a requirement to evaluate all possible prototype pairs ($P \times L$) to accurately determine the reasons for misclassification. This stems from the observation that a limited selection of prototypes is insufficient to cover the necessary explanation space when a prediction is incorrect. Evaluating only a subset of prototypes overlooks potentially critical combinations that would otherwise expose the underlying rationale for the error, indicating that correctness guarantees necessitate considering the complete set of prototype pairings to avoid incomplete explanations.
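
The sketch below illustrates the exhaustive enumeration this implies, reading $P \times L$ as prototypes crossed with latent locations (an assumption made for the example); the sizes and the distance-based scoring are hypothetical.

```python
from itertools import product
import numpy as np

def enumerate_prototype_pairs(latent_patches, prototypes):
    """Exhaustively score every (prototype, latent location) pair.

    latent_patches : array of shape (L, d), the L spatial patches of h(x)
    prototypes     : array of shape (P, d), the network's learned prototypes
    Returns a (P, L) matrix of distances; covering all P x L pairs is what
    the text argues is needed to explain misclassifications soundly.
    """
    P, L = len(prototypes), len(latent_patches)
    distances = np.zeros((P, L))
    for p, l in product(range(P), range(L)):
        distances[p, l] = np.linalg.norm(prototypes[p] - latent_patches[l])
    return distances

# Hypothetical sizes: 11 prototypes, 49 latent locations (7x7 feature map), 128-dim space
rng = np.random.default_rng(0)
d = enumerate_prototype_pairs(rng.normal(size=(49, 128)), rng.normal(size=(11, 128)))
print(d.shape)  # (11, 49)
```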

Universal Applicability: Architectural Independence and the Future of Explainable AI

The strength of this proposed framework lies in its adaptability; it intentionally avoids dependence on any specific deep learning architecture. This deliberate design choice allows it to seamlessly integrate with, and benefit from, the rapidly evolving landscape of neural networks. Whether utilizing established models like VGG, ResNet, or the latest innovations in network design, the explanation process remains consistent and effective. This architectural agnosticism not only future-proofs the framework against obsolescence but also unlocks the potential to leverage increasingly powerful feature extractors as they emerge, ultimately enhancing the fidelity and comprehensibility of the resulting explanations.

Contemporary deep learning relies heavily on convolutional neural networks like VGG, ResNet, and WideResNet, not merely for their predictive power, but also for their capacity to generate meaningful feature representations. These architectures, pre-trained on massive datasets, effectively act as robust feature extractors, automatically learning hierarchical representations of visual information. Integrating these pre-trained networks into explanation processes allows for a decoupling of feature extraction from the explanation itself; rather than attempting to interpret the raw pixel data, explanations can focus on the high-level features already identified by the network. This approach significantly simplifies the explanation process and improves its fidelity, as the explanations are grounded in features that the network demonstrably uses for prediction. Consequently, explanations become more interpretable and actionable, providing insights into why a model made a certain decision based on learned visual concepts.
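
A minimal sketch of this decoupling, assuming a recent torchvision: a ResNet backbone is frozen and used purely as a feature extractor, and any explanation machinery then operates on its output features. Weight loading is left as a placeholder.

```python
import torch
import torchvision.models as models

# Reuse a ResNet backbone as a frozen feature extractor; explanations then
# work on its latent features rather than on raw pixels.
backbone = models.resnet18(weights=None)          # load pretrained weights as appropriate
backbone = torch.nn.Sequential(*list(backbone.children())[:-1])  # drop the classification head
backbone.eval()
for param in backbone.parameters():
    param.requires_grad = False                   # freeze: explanations see fixed features

with torch.no_grad():
    image = torch.randn(1, 3, 224, 224)           # hypothetical preprocessed input
    features = backbone(image).flatten(1)         # (1, 512) latent representation
print(features.shape)
```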

The ability to understand why a deep learning model makes a certain prediction is dramatically improved through techniques like concept learning and prototype learning. Rather than treating the network as a black box, these approaches actively identify and isolate the high-level concepts – such as ‘stripes’ or ‘pointed ears’ in an image of a cat – that most strongly influence its decisions. Concept learning establishes the relationship between these concepts and the model’s output, while prototype learning finds the most representative examples from the training data that embody those concepts. By pinpointing these influential factors, the system moves beyond simply stating what it predicted to articulating how it arrived at that conclusion, offering a significant step towards truly interpretable artificial intelligence and fostering greater trust in these complex systems.
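
The prototype side of this idea can be sketched as follows; this is an illustrative ProtoPNet-style similarity computation, not the authors’ exact implementation, and the tensor shapes are assumptions.

```python
import torch

def prototype_similarities(latent_map, prototypes, eps=1e-4):
    """ProtoPNet-style similarity sketch (illustrative only).

    latent_map : (L, d) patches of the latent feature map for one image
    prototypes : (P, d) learned prototypical parts
    Each prototype's activation decreases with its distance to the *closest*
    patch ("some part of this image looks like that prototypical part").
    """
    # (P, L) pairwise Euclidean distances via broadcasting
    dists = ((prototypes[:, None, :] - latent_map[None, :, :]) ** 2).sum(-1).sqrt()
    min_dists, _ = dists.min(dim=1)                        # best-matching patch per prototype
    return torch.log((min_dists + 1) / (min_dists + eps))  # large value = strong match

# Hypothetical sizes: 11 prototypes, 49 latent patches (7x7 map), 128-dim latent space
sims = prototype_similarities(torch.randn(49, 128), torch.randn(11, 128))
print(sims.shape)  # torch.Size([11])
```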

The framework’s adaptability extends to case-based reasoning, offering a powerful method for explaining individual predictions. Instead of relying solely on abstract features or concepts, the system can identify the training examples most similar to the current input. By highlighting these analogous cases, the model essentially demonstrates its reasoning process – “this prediction was made because it resembles this other instance we’ve seen before.” This approach provides a more intuitive and human-understandable explanation, circumventing the need to interpret complex internal representations. Furthermore, it allows for a form of explanation grounded in concrete data, fostering trust and enabling users to assess the validity of the model’s decision-making by examining the supporting examples.
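
A minimal sketch of such case retrieval, assuming latent codes have already been computed for the training set (all names and sizes hypothetical):

```python
import numpy as np

def nearest_training_cases(h_x, train_latents, k=3):
    """Case-based reasoning sketch: retrieve the k training examples whose
    latent representations are closest to the current input's.

    h_x           : (d,) latent representation of the query input
    train_latents : (N, d) latent representations of the training set
    Returns the indices of the k most similar cases, to be shown to the user.
    """
    dists = np.linalg.norm(train_latents - h_x, axis=1)
    return np.argsort(dists)[:k]

# Hypothetical latent codes for 1000 training images in a 128-dimensional space
rng = np.random.default_rng(1)
train_latents = rng.normal(size=(1000, 128))
query = rng.normal(size=128)
print(nearest_training_cases(query, train_latents))  # indices of the 3 closest cases
```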

The pursuit of formal explainability, as demonstrated in this work on Abductive Latent Explanations, echoes a fundamental principle of mathematical elegance. The framework’s emphasis on deriving subset-minimal explanations within the latent space aligns with the necessity of identifying core, irreducible components – a harmony of symmetry and purpose. Tim Berners-Lee aptly stated, “Data is just stuff. It’s the structure that gives it meaning.” This paper doesn’t merely present a method for interpreting prototype networks; it aims to reveal the underlying structure, the logical scaffolding, that transforms raw data into meaningful predictions, striving for provable interpretations rather than empirically ‘working’ ones.

What’s Next?

The pursuit of formally verifiable explanations in neural networks, as demonstrated by Abductive Latent Explanations, necessarily encounters a fundamental limit: the inherent approximation introduced by discretization. The framework’s reliance on subset-minimal explanations, while elegant, raises the question of optimality. A truly rigorous approach demands not merely a minimal explanation, but a provably unique one – a characteristic currently absent. Future work must address the complexity of establishing such uniqueness, potentially through invariants tied to the network’s manifold structure.

Moreover, the current instantiation confines itself to prototype networks. While this provides a valuable initial testbed, the generalizability to more complex architectures remains unproven. The latent space, while offering a convenient abstraction, is itself a learned representation – and thus subject to the biases and imperfections of the training data. Investigating methods to formally bound the error introduced by this abstraction is critical. The ultimate goal is not simply to describe a decision, but to certify its correctness, a standard currently unmet by most interpretability techniques.

One anticipates a shift toward explanations predicated on counterfactual reasoning – determining not just why a decision was made, but what minimal perturbation to the input would have altered it. However, this introduces a new class of problems related to the continuity of the decision boundary – a boundary which, in high-dimensional spaces, may be infinitely complex. The elegance of a formal solution, it seems, is always counterbalanced by the intractability of its computation.
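
For intuition only, here is a toy sketch of counterfactual search by gradient descent on an additive perturbation; it offers none of the formal guarantees discussed above, and the model, step count, and trade-off weight are hypothetical.

```python
import torch

def counterfactual_search(model, x, target_class, steps=200, lr=0.05, lam=1.0):
    """Toy counterfactual sketch: find a small perturbation of x (shape (1, ...))
    that pushes the prediction toward `target_class`. Illustrative only."""
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        logits = model(x + delta)                      # logits of shape (1, num_classes)
        # Trade off flipping the class against keeping the perturbation small.
        loss = torch.nn.functional.cross_entropy(
            logits, torch.tensor([target_class])
        ) + lam * delta.norm()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (x + delta).detach()
```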


Original article: https://arxiv.org/pdf/2511.16588.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
