Author: Denis Avetisyan
Researchers have developed a new framework to generate human-understandable explanations for how vision models process images, moving beyond the constraints of pre-defined categories.

This work introduces a method for compositional explanation generation using open vocabulary semantic segmentation to align with neuron activations in deep neural networks.
Despite advances in deep neural networks, understanding how individual neurons encode information remains a central challenge in AI interpretability. This paper, ‘Open Vocabulary Compositional Explanations for Neuron Alignment’, addresses limitations in existing compositional explanation methods, which typically rely on fixed, human-annotated datasets. The authors introduce a novel framework leveraging open vocabulary semantic segmentation to generate compositional explanations for neuron activations using arbitrary concepts and datasets. This approach unlocks greater flexibility in probing neuron function and raises the question of how effectively model-generated annotations can replace human labels in fostering more transparent and adaptable AI systems.
The Illusion of Understanding: Peering into the Neural Network Black Box
Deep Neural Networks, often characterized as “black boxes”, consistently demonstrate impressive capabilities across diverse tasks, from image recognition to natural language processing. However, this performance comes at a cost: a fundamental lack of transparency. While these networks excel at what they do, understanding how they arrive at specific conclusions remains a significant challenge. This opacity isn’t merely an academic concern; it actively hinders the refinement of these systems. Without insight into the internal logic driving network decisions, debugging errors or improving performance becomes a process of trial and error, rather than informed optimization. Furthermore, the inability to explain a network’s reasoning erodes trust, particularly in high-stakes applications where accountability and reliability are paramount. The consequence is a tension between achieving state-of-the-art results and ensuring these powerful tools are both dependable and understandable.
Current techniques for analyzing deep neural networks often fall short when attempting to elucidate the reasoning behind their decisions. While performance metrics can indicate what a network achieves, they offer little insight into how it arrives at a particular conclusion. This presents a substantial obstacle to effective debugging; identifying the source of an error is difficult when the internal logic remains hidden. Consequently, improving network accuracy or robustness becomes a process of trial and error, rather than informed refinement. The inability to trace the flow of information and understand the contribution of individual neurons severely limits the potential for optimization and hinders the development of more reliable and trustworthy artificial intelligence systems.
The deployment of Deep Neural Networks (DNNs) into high-stakes applications, such as medical diagnosis, autonomous vehicle control, and financial risk assessment, is significantly hampered by their inherent lack of interpretability. While these systems can achieve remarkable accuracy, the inability to understand why a particular decision was reached presents substantial challenges. This isn’t merely a question of trust; in regulated industries, demonstrating the rationale behind automated decisions is often a legal requirement. Furthermore, the opacity of DNNs hinders debugging – identifying and correcting errors is difficult when the internal logic remains a mystery. Consequently, the absence of understandable reasoning limits the potential for refinement and adaptation, preventing these powerful tools from reaching their full potential in critical real-world scenarios where accountability and reliability are non-negotiable.
Addressing the challenge of ‘black box’ neural networks necessitates a shift towards interpretability, demanding techniques that correlate neuron activations with human-understandable concepts. Current research focuses on methods that probe internal representations, attempting to decode what features or patterns trigger specific neuron responses. This involves not simply identifying that a neuron fires, but why – linking its activity to recognizable elements within the input data, such as edges, textures, or even abstract ideas. Successful implementation of these methods promises to move beyond purely predictive models, enabling researchers to diagnose biases, refine network architecture, and ultimately build more trustworthy and robust artificial intelligence systems capable of explaining how they arrive at their conclusions, fostering confidence in critical applications like medical diagnosis or autonomous vehicle control.

Deconstructing the Decision: Logical Formulations for Neuron Behavior
Compositional explanations use logical formulas to represent the behavior of individual neurons within a neural network. This approach moves beyond simply noting a neuron’s activation; it aims to define the specific concepts to which that activation corresponds. Each neuron’s response is thus linked to a logical statement, effectively translating its internal processing into a human-interpretable rule. For example, a neuron might be represented by the formula $\text{has\_stripes} \land \text{is\_mammal} \rightarrow \text{activate}$, indicating it responds to the combined presence of those features. This representation allows for a precise articulation of a neuron’s function, connecting its activation not just to input patterns, but to the underlying concepts those patterns represent.
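To make this concrete, the sketch below (an illustration under assumed data structures, not the paper’s implementation) evaluates a small compositional formula over boolean concept masks and scores it against a binarized neuron activation map using intersection-over-union; all names are hypothetical.

```python
import numpy as np

def formula_mask(concept_masks, formula):
    """Evaluate a compositional formula over boolean concept masks.

    `formula` is a nested tuple, e.g. ("AND", "has_stripes", ("NOT", "is_water")).
    Leaf strings index into `concept_masks` (H x W boolean arrays).
    """
    if isinstance(formula, str):
        return concept_masks[formula]
    op, *args = formula
    if op == "NOT":
        return ~formula_mask(concept_masks, args[0])
    left = formula_mask(concept_masks, args[0])
    right = formula_mask(concept_masks, args[1])
    return left & right if op == "AND" else left | right

def iou(neuron_mask, explanation_mask):
    """Overlap between where the neuron fires and where the formula holds."""
    inter = np.logical_and(neuron_mask, explanation_mask).sum()
    union = np.logical_or(neuron_mask, explanation_mask).sum()
    return inter / union if union else 0.0

# Toy example: a neuron hypothesized to fire on "striped mammals".
h, w = 8, 8
concept_masks = {
    "has_stripes": np.zeros((h, w), bool),
    "is_mammal": np.zeros((h, w), bool),
}
concept_masks["has_stripes"][2:6, 2:6] = True
concept_masks["is_mammal"][3:7, 3:7] = True
neuron_mask = np.zeros((h, w), bool)
neuron_mask[3:6, 3:6] = True

explanation = ("AND", "has_stripes", "is_mammal")
print(iou(neuron_mask, formula_mask(concept_masks, explanation)))
```

A higher score means the formula’s region of the image better matches where the neuron actually responds.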
Concept alignment is quantitatively determined by assessing the statistical correlation between a neuron’s activation strength and the presence of specific concepts within the input data. This assessment typically involves calculating a correlation coefficient, such as Pearson’s $r$, between the neuron’s response and a concept indicator variable – a binary value denoting the concept’s presence or absence. Higher absolute values of the correlation coefficient indicate stronger concept alignment, signifying a reliable relationship between the neuron’s activity and the identified concept. The methodology requires a labeled dataset where both input stimuli and associated concepts are known to enable accurate quantification of this alignment.
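A minimal sketch of that measurement, assuming per-input activations and binary concept labels have already been collected (the function name is made up):

```python
import numpy as np

def concept_alignment(activations, concept_present):
    """Pearson correlation between a neuron's activation strength and a
    binary indicator of whether the concept appears in each input."""
    activations = np.asarray(activations, dtype=float)
    concept_present = np.asarray(concept_present, dtype=float)
    return np.corrcoef(activations, concept_present)[0, 1]

# Toy data: the neuron tends to fire more strongly when the concept is present.
acts = np.array([0.9, 0.1, 0.8, 0.2, 0.7, 0.05])
labels = np.array([1, 0, 1, 0, 1, 0])
print(concept_alignment(acts, labels))  # close to 1.0 -> strong alignment
```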
Representing neuron behavior with logical expressions enables the creation of human-readable explanations by translating neuron activations into statements about detected concepts. Specifically, a neuron’s response to an input is formalized as a logical formula, such as $A \land B \rightarrow C$, indicating that if concepts A and B are present, the neuron activates, contributing to the identification of concept C. This allows for a direct mapping between neuronal activity and the specific features or concepts in the input data that trigger it. Consequently, the contribution of each neuron to the network’s overall decision-making process becomes transparent, as its logical expression defines the conditions under which it influences the final output.
The logical representation of neuron behavior, achieved through compositional explanations, facilitates both verification and manipulation of the network’s internal logic. Specifically, expressing neuron responses as logical formulas allows for formal verification of whether a neuron is functioning as intended – for example, confirming it activates only when a specified concept is present. Beyond verification, these formulas enable targeted manipulation; altering the logical formula associated with a neuron directly modifies its behavior, offering a mechanism to refine network function, correct errors, or implement specific constraints without retraining the entire model. This capability moves beyond simply understanding what a network does to actively controlling how it operates.
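As a hedged illustration of the verification side: over a labelled probe set, one can check whether the neuron exceeds its firing threshold exactly when its formula holds. The data layout and helper below are invented for the example.

```python
import numpy as np

def verify_neuron(activations, formula_satisfied, threshold):
    """Fraction of probe inputs on which the neuron's firing behaviour agrees
    with its logical formula (fires if and only if the formula holds)."""
    fires = np.asarray(activations) > threshold
    holds = np.asarray(formula_satisfied, dtype=bool)
    return np.mean(fires == holds)

acts = np.array([0.92, 0.08, 0.85, 0.15])      # neuron responses on a probe set
holds = np.array([True, False, True, False])   # formula truth value per input
print(verify_neuron(acts, holds, threshold=0.5))  # 1.0 -> behaves as specified
```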

Seeing What the Network Sees: Semantic Segmentation for Concept Discovery
Semantic segmentation identifies and labels individual pixels within an image, effectively creating pixel-level masks that delineate objects or regions of interest. Open-vocabulary semantic segmentation expands on this capability by utilizing foundational models to recognize a broader, potentially unlimited, range of concepts without being restricted to pre-defined categories established during training. The output of this process, the segmentation masks, provides a detailed spatial understanding of image content, indicating the precise location and extent of identified concepts. This granular level of detail is crucial for downstream tasks requiring precise object localization and contextual understanding, enabling applications beyond simple object recognition, such as scene understanding and image editing.
Open-vocabulary semantic segmentation differs from closed-vocabulary approaches by utilizing foundational models – typically large-scale, pre-trained vision-language models – to enable recognition of objects not present in the training dataset. Closed-vocabulary systems are limited to a predefined set of categories, requiring retraining for novel objects. In contrast, open-vocabulary methods leverage the knowledge embedded within these foundational models to generalize to unseen objects through techniques like zero-shot learning or few-shot adaptation. This capability stems from the models’ exposure to vast amounts of textual data associating visual features with a broad range of concepts, allowing them to infer the identity of objects based on their visual characteristics even if those objects weren’t explicitly included during training. Consequently, open-vocabulary segmentation provides a more flexible and adaptable solution for identifying concepts in images.
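As one concrete way to produce such masks (the paper’s choice of segmenter may differ), a publicly available vision-language model such as CLIPSeg can score arbitrary text prompts against every pixel of an image:

```python
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

# CLIPSeg is one publicly available open-vocabulary segmenter, used here only
# to illustrate the idea of prompting for arbitrary concepts.
processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

image = Image.open("example.jpg").convert("RGB")   # any probe image
prompts = ["striped fur", "beak", "water", "sky"]  # arbitrary concepts, no fixed label set

inputs = processor(text=prompts, images=[image] * len(prompts),
                   padding=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits        # one low-resolution heatmap per prompt

masks = torch.sigmoid(logits) > 0.5        # binarize into per-concept segmentation masks
for prompt, mask in zip(prompts, masks):
    print(prompt, int(mask.sum().item()), "pixels")
```

The resulting per-concept masks can then be fed into the compositional-formula scoring sketched earlier.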
Semantic segmentation models, when applied to images, produce pixel-level classifications that delineate instances of identified concepts. This process yields segmentation masks, where each pixel is assigned a label corresponding to a specific concept present in the image. These pixel-level labels are not simply object detections; they provide a detailed spatial understanding of the scene, outlining the precise boundaries and extent of each identified concept. This granular output is essential for generating compositional explanations, as it provides the necessary data to reason about the relationships between different concepts and their spatial arrangement within the image, enabling a more nuanced and detailed understanding of the visual content.
The MMESH heuristic is implemented as a multi-stage filtering process designed to reduce the computational cost of identifying relevant concepts for explanation generation. Initially, a fast, approximate method identifies a candidate set of concepts present in an image. This set is then refined by evaluating concept saliency based on the size of the corresponding segmentation mask; smaller masks are prioritized as they often represent more focused, salient objects. Finally, a mesh-based connectivity analysis is performed on the segmentation masks to eliminate redundant or highly correlated concepts, ensuring that only the most distinct and informative concepts are selected for further processing. This heuristic significantly reduces the search space, enabling efficient concept discovery even with large vocabularies.
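Read as pseudocode rather than the paper’s exact algorithm, the stages described above might look roughly like the following; thresholds and helper names are invented for illustration.

```python
import numpy as np

def filter_concepts(concept_masks, max_area_frac=0.5, max_overlap=0.9):
    """Multi-stage filtering over candidate concept masks:
    1) keep only concepts that actually appear, 2) prefer smaller (more
    focused) masks and drop near-background ones, 3) greedily drop concepts
    whose masks are nearly redundant with an already-kept concept."""
    # Stage 1: candidate set of concepts present in the image.
    candidates = {c: m for c, m in concept_masks.items() if m.any()}

    # Stage 2: rank by mask size, smallest first; discard very large masks.
    total = next(iter(candidates.values())).size if candidates else 1
    ranked = sorted(candidates, key=lambda c: candidates[c].sum())
    ranked = [c for c in ranked if candidates[c].sum() / total <= max_area_frac]

    # Stage 3: greedy redundancy removal by mask overlap (IoU).
    kept = []
    for c in ranked:
        m = candidates[c]
        redundant = any(
            np.logical_and(m, candidates[k]).sum()
            / max(np.logical_or(m, candidates[k]).sum(), 1) > max_overlap
            for k in kept
        )
        if not redundant:
            kept.append(c)
    return kept

# Example: two near-duplicate concepts plus a background-sized one.
h = w = 64
sky = np.zeros((h, w), bool); sky[:60] = True                  # too large -> dropped
bird = np.zeros((h, w), bool); bird[20:30, 20:30] = True
animal = np.zeros((h, w), bool); animal[20:30, 20:31] = True   # near-duplicate of "bird"
print(filter_concepts({"sky": sky, "bird": bird, "animal": animal}))  # ['bird']
```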

Trust, But Verify: Validating and Refining the Explanation
The integrity of compositional explanations hinges on acknowledging the potential for adversarial manipulation, a subtle but critical consideration in machine learning interpretability. Studies reveal that intentionally crafted, imperceptible perturbations to input data – known as adversarial attacks – can significantly alter a model’s reasoning process, leading to explanations that appear plausible yet are fundamentally misleading. These manipulations don’t necessarily change the model’s output – the prediction remains correct – but they can warp the internal justification the model uses to arrive at that conclusion. Consequently, robust validation procedures must actively test for susceptibility to such attacks, ensuring that explanations reflect genuine model behavior rather than artifacts of carefully designed noise. Failing to account for this vulnerability risks overconfidence in interpretability methods and the potential for flawed decision-making based on deceptive rationales.
Determining the range of activation for individual neurons is fundamental to validating the plausibility of compositional explanations. A neuron’s activation range – the specific inputs that consistently trigger a response – defines the scope of its functional role within the network. Explanations positing a neuron’s involvement in a task outside this established range are inherently suspect, suggesting a misattribution or an overestimation of its contribution. Analyzing this range allows researchers to assess whether an explanation aligns with the neuron’s demonstrated capabilities, ensuring that the identified compositional elements are genuinely supported by the network’s behavior and are not simply artifacts of the explanation method itself. This focus on activation range provides a critical check on the fidelity and interpretability of complex neural network models.
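One common way to operationalize this check (a sketch, not necessarily the paper’s procedure) is to estimate a firing threshold from a high quantile of the neuron’s responses on probe data and treat only inputs that push the neuron past it as lying within its effective range:

```python
import numpy as np

def activation_range(activations, quantile=0.995):
    """Estimate a neuron's effective activation range from probe data:
    the observed span plus the threshold above which it counts as 'on'."""
    activations = np.asarray(activations, dtype=float)
    threshold = np.quantile(activations, quantile)
    return activations.min(), activations.max(), threshold

# An explanation claiming the neuron detects a concept should be backed by
# inputs containing that concept actually exceeding `threshold`.
acts = np.random.default_rng(0).normal(0.0, 1.0, size=10_000)
lo, hi, thr = activation_range(acts)
print(f"range=[{lo:.2f}, {hi:.2f}], on-threshold={thr:.2f}")
```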
Rigorous validation of compositional explanations necessitates quantitative assessment against established ground truth data, and datasets like the CUB dataset – featuring detailed, human-annotated descriptions of bird characteristics – provide precisely this benchmark. By comparing automatically generated explanations to these human annotations, researchers can move beyond subjective evaluation and statistically measure both the accuracy and fidelity of the generated results. This approach allows for a precise determination of how well the model’s explanations align with human understanding, enabling improvements to the interpretability and trustworthiness of complex machine learning systems.
Evaluations conducted on the challenging CUB Dataset reveal that this framework generates compositional explanations with performance closely mirroring that of human annotation, as demonstrated by comparable scores in Alignment, Precision, and Relevance. Notably, the system achieves a significantly higher Relevance Score – a statistically significant improvement with a p-value of less than 0.001 compared to existing methods – suggesting a superior ability to identify truly salient features. While generating these detailed explanations requires approximately four to five minutes per neuron, the resulting fidelity and accuracy offer a valuable tool for interpreting complex neural network decision-making processes.

The pursuit of explainability in deep neural networks often feels like chasing a mirage. This paper’s approach to compositional explanations, sidestepping the need for painstakingly curated datasets through open vocabulary segmentation, is… predictably clever. It addresses a practical bottleneck, certainly. One anticipates, however, that these ‘elegant’ compositional explanations will, in production, inevitably become entangled with unforeseen edge cases and data drift. As Yann LeCun once stated, “The problem with AI today is that we’re very good at curve fitting, but very bad at understanding.” This research is a step towards understanding, but one suspects the ‘tangled monolith’ of real-world deployment awaits, regardless of how neatly the explanations are initially composed.
What’s Next?
The pursuit of ‘explainability’ often feels like rearranging deck chairs on the Titanic. This work, elegantly sidestepping the need for meticulously annotated datasets – a limitation that plagued prior attempts – is a step forward, certainly. But the relief is likely temporary. The system trades one brittle dependency for another: the reliability of the open vocabulary segmentation itself. Someone, somewhere, is already preparing a paper detailing the adversarial attacks that will break it. They’ll call it AI and raise funding.
The core problem isn’t generating explanations, it’s trusting them. This framework will produce neat, compositional breakdowns, but what happens when those breakdowns confidently misattribute causality? The documentation will lie again, of course. It always does. The current focus on visual explanations also feels… quaint. It’s a good starting point, but one suspects that true complexity will reside in the interplay of neurons, not the pixels they react to.
One anticipates a near future filled with increasingly elaborate explanation generators, each more convincing and ultimately more fragile than the last. The system, inevitably, will start as a simple bash script, and end as a Kafkaesque tangle of dependencies. Tech debt is just emotional debt with commits, after all. The question isn’t whether this approach will fail, but how spectacularly.
Original article: https://arxiv.org/pdf/2511.20931.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/