What Video AI ‘Sees’: Decoding Action and Outcome

Author: Denis Avetisyan


New research reveals how video AI models internally represent the nuances of actions, distinguishing between successful and unsuccessful attempts even when the final classification remains the same.

The study demonstrates that individual Multi-Layer Perceptron (MLP) blocks are causally sufficient to generate the layer-11 signal: focused intervention at these components alone is enough to influence the system’s outcome, without needing to account for broader network interactions, a testament to localized control within a complex architecture.

A causal analysis of a Video Vision Transformer demonstrates an internal circuit where attention mechanisms gather evidence and multilayer perceptrons compose concepts to represent action outcomes.

Despite increasing performance, deep video models often obscure how nuanced semantic information, beyond simple classification, is represented internally. This is addressed in ‘Attention Gathers, MLPs Compose: A Causal Analysis of an Action-Outcome Circuit in VideoViT’, which reverse-engineers the internal circuit responsible for representing action outcomes in a pre-trained video vision transformer, revealing a division of labor where attention heads gather evidence and MLP blocks compose concepts to generate a ‘success’ signal. This work demonstrates that even models trained solely for classification develop sophisticated internal representations of complex outcomes, exhibiting robustness to ablation. What implications does this ‘hidden knowledge’ have for building truly Explainable and Trustworthy AI systems?


The Unfolding of Vision: Deconstructing the Video Vision Transformer

The Video Vision Transformer (VVT) establishes a new benchmark in the field of video understanding, leveraging the power of the Transformer architecture – originally prominent in natural language processing – to interpret visual information across time. This innovative model doesn’t simply analyze individual frames; it processes video as a sequence of visual ‘tokens’, enabling it to capture both spatial and temporal relationships within the footage. Recent evaluations on the large-scale Kinetics-400 Dataset, which challenges models to recognize 400 different human actions, demonstrate the VVT’s superior performance compared to previous state-of-the-art methods. Its ability to accurately classify complex activities – from playing a guitar to riding a horse – signals a significant leap forward in machine vision, offering potential applications ranging from automated video surveillance to more intuitive human-computer interaction.

Beyond simply attaining high accuracy scores, a comprehensive understanding of a model’s decision-making process is paramount for developing reliable artificial intelligence. While the Video Vision Transformer demonstrates impressive performance in video analysis, its internal logic remains a critical area of investigation; knowing what a model predicts is insufficient without grasping how it arrives at that prediction. This focus on interpretability is not merely an academic exercise, but a necessity for ensuring robustness against adversarial attacks, identifying and mitigating biases, and ultimately fostering trust in AI systems deployed in real-world applications where accountability is essential. The pursuit of ‘explainable AI’ moves beyond black-box functionality, demanding transparency into the features and patterns that drive these complex video understanding algorithms.

Attention analysis reveals that a specific head in layer 10 functions as an outcome detector, focusing on key frames in videos depicting either a successful ‘Strike’ or a failed ‘Gutter’ in bowling.

Mapping the Internal Landscape: A Mechanistic Approach

Mechanistic Interpretability offers a systematic approach to understanding the computations performed by large neural networks, specifically focusing on reverse-engineering the Video Vision Transformer (VVT). This framework moves beyond simply observing input-output relationships; it aims to decompose the network into functionally interpretable components. By analyzing the network’s internal states and tracing information flow, researchers can identify specific neurons or groups of neurons responsible for particular computations. This dissection process involves techniques designed to isolate and characterize the roles of individual components, ultimately revealing the key internal representations the VVT utilizes to process video data and arrive at its predictions. The goal is not to create a simplified model, but rather to build a comprehensive understanding of the existing model’s architecture and computational strategy.

Delta Analysis and Activation Patching are utilized to identify specific activations within the VVT that contribute to differentiating between contrasting video inputs. Delta Analysis involves presenting two similar video stimuli – one eliciting a specific response and another that does not – and measuring the resulting differences in the model’s activations. Activation Patching then systematically ablates or modifies individual activations, observing the impact on the model’s output to determine which activations are causally responsible for the observed difference. By iteratively perturbing and analyzing activations, these techniques allow researchers to pinpoint the specific internal features that encode information relevant to the distinction between the input videos, effectively reverse-engineering the model’s decision-making process.
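The patching idea can be illustrated with a minimal numpy sketch. The toy two-layer model, its width, and the ‘clean’/‘corrupted’ embeddings below are purely illustrative stand-ins, not the VVT itself; the point is only the mechanics of caching an activation from one run and splicing it into another.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # hidden width of the toy model (illustrative, not the VVT's)

# Two random weight matrices stand in for two transformer blocks.
W1 = rng.normal(size=(D, D))
W2 = rng.normal(size=(D, D))

def forward(x, patch=None):
    """Run the toy model; optionally overwrite the layer-1 activation."""
    h1 = np.tanh(x @ W1)        # layer-1 activation
    if patch is not None:
        h1 = patch              # activation patching: splice in a cached state
    return np.tanh(h1 @ W2)     # final representation

x_clean = rng.normal(size=D)    # stands in for, e.g., a 'strike' input
x_corrupt = rng.normal(size=D)  # stands in for, e.g., a 'gutter' input

# Cache the clean run's layer-1 activation, then patch it into the corrupted run.
h1_clean = np.tanh(x_clean @ W1)
out_clean = forward(x_clean)
out_patched = forward(x_corrupt, patch=h1_clean)

# In this toy, layer 1 fully determines the output, so patching the clean
# activation into the corrupted run restores the clean output exactly.
```

In a real network the recovery is partial, and the fraction of the clean behaviour restored by patching a given component is exactly the kind of causal evidence the study reports.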

The identification of an ‘Outcome Signal’ represents a key finding derived from mechanistic interpretability techniques applied to Video Vision Transformers (VVTs). This internal representation, localized within the network, functions as a prediction of the ultimate result of an action depicted in the video. Specifically, analyses utilizing Delta Analysis and Activation Patching consistently highlight particular activations that encode information about what will happen as a consequence of the observed action – for example, whether an object will be moved, a goal will be reached, or a specific state change will occur. The presence of this signal suggests the VVT doesn’t merely process visual input, but actively models and predicts future states based on observed actions, making it a crucial component in understanding the model’s decision-making process.

A token-wise heatmap reveals how contributions of individual tokens evolve across frames during a “strike run” to determine the final predicted output class.

Dissecting the Prediction: Attention and MLP Blocks in Concert

The VVT architecture incorporates Attention Mechanisms to selectively aggregate information from input video frames. These mechanisms operate by weighting different spatial and temporal locations within the video based on their relevance to the anticipated outcome. This process allows the VVT to prioritize the most informative portions of the input sequence, effectively filtering out irrelevant background noise or redundant information. The attention weights are dynamically calculated based on the current state of the network, enabling the model to focus on different aspects of the video at different points in time. This selective attention enhances the model’s ability to extract meaningful spatio-temporal features, improving performance in action outcome recognition tasks.
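The weighting-and-aggregation step can be sketched in a few lines of numpy. The frame tokens, the [CLS]-style query, and the planted ‘salient frame’ below are illustrative assumptions, not values from the model; the sketch only shows how scaled dot-product attention concentrates weight on the most relevant token.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())    # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(1)
d = 4                                        # toy embedding width (illustrative)
frames = rng.normal(size=(6, d))             # six frame tokens from a video
frames[3] = np.array([5.0, 0.0, 0.0, 0.0])   # plant one outcome-relevant frame

cls_query = np.array([1.0, 0.0, 0.0, 0.0])   # query direction of a [CLS]-like token

# Scaled dot-product attention: score each frame, normalise, aggregate.
scores = frames @ cls_query / np.sqrt(d)
weights = softmax(scores)
summary = weights @ frames                   # weighted pooling of frame evidence
```

The weights sum to one and the planted frame dominates, so `summary` is pulled toward the informative token, which is the selective-aggregation behaviour described above.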

MLP (Multilayer Perceptron) blocks within the VVT architecture function to synthesize attended features into conceptual representations of action outcomes. These blocks receive the outputs from attention mechanisms, which have identified relevant spatio-temporal information in the video frames. The MLP blocks then process these features through multiple layers of non-linear transformations, effectively composing high-level concepts that describe the nuanced details of the observed action. This compositional process refines the representation, enabling the VVT to accurately interpret and categorize complex action outcomes based on the attended visual evidence.
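For concreteness, a standard transformer MLP block has the shape sketched below. The 4x expansion factor and GELU nonlinearity are the common convention; the toy width and random weights are assumptions for illustration, not the VVT’s actual parameters.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of GELU, common in transformer implementations."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp_block(x, w_in, w_out):
    """Transformer MLP block with a residual connection:
    expand to 4x width, apply the nonlinearity, project back, add the input."""
    h = gelu(x @ w_in)          # non-linear composition of attended features
    return x + h @ w_out        # residual stream carries the composed concept

rng = np.random.default_rng(2)
d = 8                           # toy width (the real model is much wider)
x = rng.normal(size=d)          # an attended token representation
w_in = rng.normal(size=(d, 4 * d)) * 0.1
w_out = rng.normal(size=(4 * d, d)) * 0.1

y = mlp_block(x, w_in, w_out)
```

Because the block writes its output back into the residual stream, its contribution to downstream representations is additive, which is what makes the per-component attribution analyses described later tractable.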

Quantitative analysis of the VVT architecture demonstrates that both Attention and Multi-Layer Perceptron (MLP) blocks contribute significantly to the construction of the ‘Outcome Signal’. Specifically, MLPs are identified as the primary drivers of signal recovery, accounting for 42-60% of the contribution. Attention blocks contribute 37-54% to the ‘Outcome Signal’, indicating a complementary role in gathering and weighting spatio-temporal evidence. These results suggest that while Attention mechanisms effectively focus on relevant input features, the MLP blocks are crucial for composing these features into higher-level concepts that define the action outcome.

The attention heatmap visualizes that the [CLS] token at Layer 9, Head 8 focuses on specific input features during processing.

Establishing Causal Links: Ablation and Probing as Validation

Automated Top-K Ablation functions by iteratively removing the k most salient tokens – as determined by a chosen saliency metric – from the input sequence and observing the resulting change in the model’s ‘Outcome Signal’. This process is systematic, varying k to assess the cumulative effect of feature removal. A significant drop in the ‘Outcome Signal’ following the ablation of specific tokens indicates those tokens are crucial for generating the correct output. By quantifying the impact of each ablation, we can identify which input features the model relies on most heavily, effectively establishing feature importance without manual intervention or predefined hypotheses.
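The loop structure of such an ablation study can be sketched as follows. The saliency metric, the mean-pool-and-project ‘Outcome Signal’ readout, and the two planted informative tokens are all illustrative assumptions; the sketch only shows the iterative remove-top-k-and-remeasure procedure.

```python
import numpy as np

rng = np.random.default_rng(3)
n_tokens, d = 10, 4
tokens = rng.normal(size=(n_tokens, d)) * 0.1   # mostly uninformative tokens
tokens[2] += 3.0                                # plant two tokens that carry
tokens[7] += 3.0                                # the outcome information

readout = np.ones(d)        # toy 'Outcome Signal' readout direction (assumed)

def outcome_signal(keep):
    """Mean-pool the kept tokens and project onto the readout direction."""
    return tokens[keep].mean(axis=0) @ readout

saliency = np.abs(tokens @ readout)     # per-token importance scores
order = np.argsort(saliency)[::-1]      # most salient first

all_idx = np.arange(n_tokens)
baseline = outcome_signal(all_idx)
drops = []
for k in range(1, n_tokens):
    keep = np.setdiff1d(all_idx, order[:k])        # ablate the top-k tokens
    drops.append(baseline - outcome_signal(keep))  # measure the signal drop
```

The signal drop grows sharply exactly while the planted tokens are being removed, which is the signature used to flag features the model relies on.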

Linear probe analysis assesses the degree to which distinct semantic concepts are encoded within the internal representations of a neural network. This technique involves training a linear classifier on top of the frozen internal activations of the network to predict a target variable. If the internal representations are linearly separable – meaning a linear classifier can achieve high accuracy – it suggests that the network has organized its internal state to represent these concepts in a distinguishable manner. Conversely, low accuracy indicates that the relevant information is not readily accessible through linear combinations of the internal activations, potentially requiring more complex decoding mechanisms or suggesting the concepts are not clearly represented within those activations.
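A minimal linear-probe sketch, assuming synthetic ‘frozen activations’ in which the success concept is planted along one direction (the real probe is trained on actual layer activations, and the separable direction is discovered, not planted):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 200, 16

# Hypothetical frozen activations: the 'success' concept lies along one
# direction, with unit Gaussian noise everywhere (purely illustrative).
labels = rng.integers(0, 2, size=n)            # 1 = 'strike', 0 = 'gutter'
acts = rng.normal(size=(n, d))
acts[:, 0] += 4.0 * (2 * labels - 1)           # linearly separable signal

# Fit a linear probe by least squares on the frozen activations.
X = np.hstack([acts, np.ones((n, 1))])         # append a bias column
targets = 2.0 * labels - 1.0                   # +/-1 regression targets
w, *_ = np.linalg.lstsq(X, targets, rcond=None)

preds = (X @ w > 0).astype(int)                # threshold the probe output
accuracy = float((preds == labels).mean())
```

High probe accuracy, as here, is the operational evidence that the concept is linearly decodable from the activations; a near-chance probe would indicate the information is absent or encoded non-linearly.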

Direct Logit Attribution, applied through the CLS token, was employed to determine the influence of internal model representations on the final classification output. This technique traces the contribution of individual components to the predicted logits, effectively mapping the flow of information from input to prediction. Analysis revealed that component ablation, systematically removing portions of the model, resulted in negligible changes to classification accuracy for both ‘strike’ and ‘gutter’ videos. Rather than undermining the causal account, this indicates the outcome representation is redundantly encoded: the identified components carry the signal, yet no single one is strictly necessary for the classification decision, consistent with the robustness to ablation noted above.
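The arithmetic behind direct logit attribution is a direct consequence of the residual architecture: the CLS state is a sum of per-component outputs, and the logits are linear in that state. A toy sketch (component names, dimensions, and weights are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
d, n_classes = 8, 3

# Toy final classifier mapping the CLS representation to class logits.
W_cls = rng.normal(size=(d, n_classes))

# Thanks to the residual design, the CLS state is a *sum* of per-component
# outputs (embedding, attention heads, MLP blocks); names are hypothetical.
components = {
    "embed":    rng.normal(size=d),
    "attn_L10": rng.normal(size=d),
    "mlp_L11":  rng.normal(size=d),
}
cls_state = sum(components.values())
logits = cls_state @ W_cls

# Direct logit attribution: because the logits are linear in the residual
# stream, each component's contribution is simply its own projection.
attributions = {name: v @ W_cls for name, v in components.items()}
total = sum(attributions.values())   # attributions sum exactly to the logits
```

The exact additive decomposition is what lets the per-component contributions to the ‘strike’ versus ‘gutter’ logits be read off without any approximation.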

The trained probe achieved 100% accuracy in differentiating between successful and failed bowling runs (“strike” vs. “gutter”), effectively functioning as a superficial performance fingerprint.

Beyond the Dissection: Implications for Artificial Intelligence

The dissection of the Video Vision Transformer (VVT) offers a crucial window into the ‘black box’ of artificial neural networks, illuminating how these systems internally represent and process visual information. Researchers have discovered that the VVT doesn’t simply memorize training data, but instead constructs abstract, hierarchical features – akin to how the human visual cortex operates – allowing it to generalize and recognize patterns in novel inputs. This newfound understanding transcends mere model analysis; it provides concrete evidence for the emergence of interpretable visual concepts within artificial intelligence, moving beyond correlational observations to reveal the underlying computational logic. By pinpointing which internal components respond to specific visual attributes – edges, textures, object parts – scientists can begin to map the network’s ‘visual vocabulary’ and, ultimately, build more transparent and reliable AI systems capable of robust visual reasoning.

A deeper comprehension of how visual reasoning unfolds within artificial neural networks offers a pathway toward building AI systems that are not simply ‘black boxes’. By illuminating the internal logic driving these models, researchers can proactively address potential biases and vulnerabilities, fostering greater transparency and accountability. This increased interpretability is crucial for mitigating the risk of unintended consequences in critical applications, from autonomous vehicles and medical diagnostics to financial modeling and criminal justice, where flawed decision-making could have significant repercussions. Ultimately, prioritizing interpretability alongside performance will be essential for establishing public trust and ensuring the responsible deployment of increasingly powerful AI technologies.

Investigations are now shifting toward applying these analytical techniques to a broader range of sophisticated models, extending beyond the specific architecture initially examined. This expansion aims to determine if the discovered principles of visual representation are universally applicable across different neural network designs. Crucially, researchers are also exploring how this newfound understanding of internal logic can be actively used to enhance model capabilities – not simply to observe them. The hypothesis is that by consciously designing networks to align with these observed principles, it may be possible to improve both their performance on existing tasks and, importantly, their ability to generalize to unseen data, fostering more robust and adaptable artificial intelligence systems.

The study of internal representations within the Video Vision Transformer reveals a fascinating architecture, one where attention mechanisms gather evidence and MLPs synthesize concepts. This echoes a fundamental truth about all complex systems: they evolve, and their internal logic, while seemingly static, is perpetually reshaped by incoming data. As Claude Shannon observed, “The most important thing in communication is to convey the meaning, not the signal.” Here, the ‘signal’ is the video data, but the meaningful distinction between action outcomes – strike versus miss – is meticulously composed within the network, highlighting an inherent, evolving computational process. The architecture lives a life, and this research is merely witnessing its subtle shifts.

The Long View

The disentanglement of action outcome signals within the VideoViT, as demonstrated by this work, is not an endpoint, but a necessary excavation. The architecture has yielded a glimpse of its internal logic, but the fragility of such understandings must be acknowledged. Every delay in fully deciphering these networks is, in effect, the price of a more robust comprehension. A system understood only at a single moment is a system perpetually on the verge of obsolescence. The distinction between ‘strike’ and ‘miss’ is, after all, a fleeting judgment within a continuous stream of sensory input.

Future efforts should not focus solely on replicating this outcome disentanglement in larger models. That would be mere scaling, a postponement of true understanding. More critical is the development of tools that can trace these signals through time, mapping the evolution of these internal representations. How do these outcome signals interact with prior expectations? How are they modulated by contextual information? An architecture without a history is, inevitably, ephemeral.

The true test will not be in identifying what the model computes, but in understanding why it computes it in that particular way. The current methodology, activation patching and delta analysis, provides a valuable lens, but it is limited by its static nature. The field requires a more dynamic approach, one that can capture the temporal unfolding of these internal computations and reveal the underlying principles of their operation.


Original article: https://arxiv.org/pdf/2603.11142.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
