Author: Denis Avetisyan
Researchers are leveraging the power of graph neural networks to better understand human actions by modeling the connections between visual cues and skeletal data.

This paper introduces PAN, a human-centric framework that represents RGB frames as visual token graphs aligned with skeletal data for state-of-the-art multimodal action recognition.
Despite advances in human action recognition, effectively fusing complementary information from RGB and skeletal data remains a challenge due to inherent modality heterogeneity. This paper introduces a novel framework, ‘Patch as Node: Human-Centric Graph Representation Learning for Multimodal Action Recognition’, which models RGB video as a graph of visual tokens aligned with human skeletal data. By representing patches as nodes, this human-centric approach facilitates more coherent cross-modal fusion and achieves state-of-the-art performance on multiple benchmarks. Could this graph-based paradigm unlock new levels of understanding in complex activity analysis and beyond?
Decoding Human Action: Bridging the Gap Between Observation and Understanding
The ability to accurately interpret human actions holds significant promise for a range of critical applications, from enhancing public safety through intelligent surveillance systems to revolutionizing healthcare with proactive patient monitoring and assistive technologies. However, current approaches to automated action recognition often falter when attempting to synthesize information from multiple data streams – such as visual RGB footage and skeletal joint data. These methods typically process each data type in isolation, overlooking the inherent correlations that exist between them and limiting their ability to discern subtle, yet crucial, aspects of human behavior. This fragmented approach creates a bottleneck, hindering the development of truly robust and reliable systems capable of understanding the complexities of human movement and intent.
Current methods for interpreting human actions frequently analyze visual RGB data and skeletal joint positions as distinct information sources, a practice that inadvertently hinders overall accuracy. This separation overlooks the fundamental relationship between what a person looks like performing an action and the underlying biomechanics represented by their skeletal structure. For instance, a bending elbow is visible in RGB imagery, but its precise location and movement are directly encoded in the skeletal data; treating these as independent inputs forces the system to re-learn this inherent connection. Consequently, performance plateaus as the system struggles to reconcile discrepancies and fails to leverage the complementary strengths of each modality, ultimately limiting its ability to robustly understand complex human behavior.
A comprehensive understanding of human action requires moving beyond the analysis of individual data streams, such as visual RGB data or skeletal joint positions. Current limitations in action recognition stem from treating these modalities in isolation, neglecting the crucial interplay between what a person does and how they do it. Researchers are increasingly focused on developing unified frameworks capable of reasoning across multiple modalities, allowing algorithms to leverage the complementary information present in both visual appearance and body kinematics. This synergistic approach enables a more nuanced interpretation of behavior, as subtle visual cues can validate or refine skeletal data, and vice versa. By effectively bridging these modalities, systems can achieve greater robustness to occlusions, varying viewpoints, and complex environmental conditions, ultimately leading to more accurate and reliable human action understanding for applications ranging from automated surveillance to personalized healthcare.

PAN: A Graph-Centric Framework for Integrated Perception
The PAN framework utilizes a graph-based representation to process RGB video frames and corresponding 2D skeleton data. RGB frames are constructed as visual graphs where nodes represent image tokens and edges denote spatial relationships between them. Simultaneously, skeleton data is represented as a skeletal graph, with joints as nodes and bone connections defining edges. This dual-graph structure allows for explicit modeling of intra-modal relationships – how elements within the RGB or skeleton data relate to each other – and crucially, inter-modal relationships, defining how visual features connect to skeletal information. By representing both modalities as graphs, PAN facilitates the application of graph-based deep learning techniques to fuse information and reason about the relationships between visual appearance and human pose.
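The dual-graph construction can be made concrete with a small sketch. The snippet below builds a 4-neighbour adjacency over a grid of visual tokens and a bone adjacency for a toy skeleton; the grid size, joint count, and bone list are illustrative choices, not the paper's exact configuration.

```python
import torch

def patch_grid_adjacency(h: int, w: int) -> torch.Tensor:
    """4-neighbour adjacency over an h x w grid of visual tokens (patches)."""
    n = h * w
    adj = torch.zeros(n, n)
    for r in range(h):
        for c in range(w):
            i = r * w + c
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < h and 0 <= cc < w:
                    adj[i, rr * w + cc] = 1.0
    return adj

# Hypothetical 5-joint skeleton: bones connect (head, torso), (torso, hands), (torso, hips).
BONES = [(0, 1), (1, 2), (1, 3), (1, 4)]

def skeleton_adjacency(num_joints: int, bones) -> torch.Tensor:
    """Symmetric adjacency from a list of bone (joint, joint) pairs."""
    adj = torch.zeros(num_joints, num_joints)
    for a, b in bones:
        adj[a, b] = adj[b, a] = 1.0
    return adj

visual_adj = patch_grid_adjacency(14, 14)   # e.g. a 14x14 token grid from a ViT-B/16 on 224x224 input
skeletal_adj = skeleton_adjacency(5, BONES)
print(visual_adj.shape, skeletal_adj.shape)  # torch.Size([196, 196]) torch.Size([5, 5])
```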
PAN utilizes Graph Convolutional Networks (GCNs) to model relationships within and between RGB and skeleton data. GCNs operate directly on graph structures, allowing the framework to represent joints as nodes and their connections as edges, thereby explicitly encoding spatial dependencies. Temporal dependencies are captured by applying GCNs across consecutive frames, processing sequences of graph data. This graph-based approach enables cross-modal reasoning; information propagates between the RGB and skeleton graphs via shared nodes and edges, allowing the network to learn correlations and integrate features from both modalities. The aggregation process within GCN layers effectively fuses information, enabling the model to leverage complementary cues from visual and skeletal data for improved action recognition.
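As a rough illustration of how message passing and cross-modal aggregation might look, the sketch below pairs a plain GCN layer with a joint-to-patch pooling step. The soft assignment matrix and the residual fusion rule are assumptions made for the demo, not the authors' exact fusion mechanism.

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """Plain GCN layer: row-normalised adjacency followed by a linear projection."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # x: (N, in_dim), adj: (N, N) with self-loops added by the caller
        deg = adj.sum(-1, keepdim=True).clamp(min=1.0)
        return torch.relu(self.proj((adj / deg) @ x))

def fuse(visual_feats, joint_feats, assign):
    """Toy cross-modal step: pool the visual tokens assigned to each joint.

    assign is assumed to come from projecting 2D joints onto the patch grid.
    visual_feats: (P, D), joint_feats: (J, D), assign: (J, P) soft assignment.
    """
    pooled = assign @ visual_feats   # (J, D) visual evidence gathered per joint
    return joint_feats + pooled      # residual fusion of the two modalities

P, J, D = 196, 17, 64
gcn = GraphConv(D, D)
visual = gcn(torch.randn(P, D), torch.eye(P))     # identity adjacency just for the demo
skeletal = gcn(torch.randn(J, D), torch.eye(J))
fused = fuse(visual, skeletal, torch.softmax(torch.randn(J, P), dim=-1))
print(fused.shape)  # torch.Size([17, 64])
```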
PAN utilizes token embeddings generated from pre-trained Visual Foundation Models (VFMs) to represent visual features as nodes within the graph structure. These VFMs, trained on large-scale image datasets, provide high-dimensional feature vectors that capture complex visual information. Rather than directly using raw pixel data or hand-engineered features, PAN projects these VFM outputs into token embeddings, effectively creating a learned representation of visual elements. Each token embedding then corresponds to a node in the visual graph, allowing the framework to leverage the semantic richness of the VFM features for downstream multimodal fusion and reasoning. This approach allows for a more robust and informative visual representation compared to traditional methods.
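A minimal sketch of this tokenisation step follows, assuming a timm ViT backbone stands in for the visual foundation model; the projection dimension, the single class-token prefix, and the exact token shape are illustrative and may vary by backbone and timm version.

```python
import timm
import torch
import torch.nn as nn

# pretrained=False keeps the example self-contained; in practice the VFM weights
# would be loaded (pretrained=True) so the tokens carry learned visual semantics.
vfm = timm.create_model("vit_base_patch16_224", pretrained=False)
vfm.eval()

frame = torch.randn(1, 3, 224, 224)                 # one RGB frame
with torch.no_grad():
    tokens = vfm.forward_features(frame)            # (1, 1 + 196, 768) in recent timm versions
patch_tokens = tokens[:, 1:, :]                     # assuming a single class token, keep patch tokens

# Project VFM features into node embeddings for the visual graph.
node_proj = nn.Linear(patch_tokens.shape[-1], 256)
nodes = node_proj(patch_tokens)                     # (1, 196, 256) -> one node per patch
print(nodes.shape)
```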
Channel-wise topology refinement addresses limitations in standard Graph Convolutional Networks (GCNs) by adaptively adjusting the adjacency matrix used for message passing. This is achieved through learnable channel-wise weights applied to each edge in the graph, allowing the model to prioritize or suppress specific relationships between nodes. By modulating the influence of each connection, the refinement process enables GCNs to better capture nuanced dependencies and improve representation learning, particularly in complex multimodal graphs where relationships may vary in significance across different feature channels. This adaptive weighting facilitates more effective information propagation and allows the model to focus on the most relevant connections for a given task.
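The idea can be sketched along the lines of CTR-GCN-style refinement: a shared adjacency is corrected by learnable, per-channel edge weights derived from pairwise node relations. The module below is an assumed illustration under those conventions, not the paper's exact layer.

```python
import torch
import torch.nn as nn

class ChannelwiseTopologyRefinement(nn.Module):
    """Refine a shared adjacency with learnable, per-channel edge weights."""
    def __init__(self, dim, reduction=4):
        super().__init__()
        self.q = nn.Linear(dim, dim // reduction)
        self.k = nn.Linear(dim, dim // reduction)
        self.expand = nn.Linear(dim // reduction, dim)   # lifts relations to per-channel edge weights
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, adj):
        # x: (N, C) node features, adj: (N, N) shared topology
        q, k = self.q(x), self.k(x)                      # (N, C/r)
        diff = q.unsqueeze(1) - k.unsqueeze(0)           # (N, N, C/r) pairwise relations
        refine = torch.tanh(self.expand(diff))           # (N, N, C) channel-wise corrections
        topo = adj.unsqueeze(-1) + refine                # per-channel adjacency
        out = torch.einsum("uvc,vc->uc", topo, x)        # channel-wise message passing
        return torch.relu(self.proj(out))

layer = ChannelwiseTopologyRefinement(dim=64)
y = layer(torch.randn(17, 64), torch.eye(17))
print(y.shape)  # torch.Size([17, 64])
```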

Refining Temporal Understanding: Multi-Scale Convolutions and Post-Calibration
PAN utilizes multi-scale temporal convolutions to address the challenge of modeling temporal relationships within action recognition tasks. This approach involves applying convolutional filters of varying kernel sizes to the input sequence, enabling the network to capture dependencies across different time windows. Smaller kernel sizes focus on short-term dependencies, while larger kernel sizes model long-term relationships. By aggregating features extracted from these multiple scales, PAN gains a more comprehensive understanding of the temporal dynamics of the observed actions, improving its ability to differentiate between similar movements and recognize complex activities.
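A compact way to picture this is a bank of parallel temporal convolutions with different kernel sizes whose outputs are concatenated along the channel dimension; the kernel sizes and channel split below are illustrative, not the paper's exact settings.

```python
import torch
import torch.nn as nn

class MultiScaleTemporalConv(nn.Module):
    """Parallel temporal convolutions with different kernel sizes over node features."""
    def __init__(self, channels, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels // len(kernel_sizes),
                      kernel_size=(k, 1), padding=(k // 2, 0))   # convolve over time only
            for k in kernel_sizes
        )

    def forward(self, x):
        # x: (B, C, T, N) -- batch, channels, frames, graph nodes
        return torch.cat([branch(x) for branch in self.branches], dim=1)

msc = MultiScaleTemporalConv(channels=96)
out = msc(torch.randn(2, 96, 64, 17))   # 64 frames, 17 nodes
print(out.shape)                        # torch.Size([2, 96, 64, 17])
```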
Attention-Based Post Calibration operates on the token embeddings generated during the feature extraction process to improve action recognition accuracy. This refinement involves applying an attention mechanism that weights the importance of each visual feature within the embeddings. By focusing on the most salient features, the calibration process reduces the influence of irrelevant or noisy information, resulting in a more discriminative representation of the action being performed. This targeted refinement allows the model to prioritize key visual cues, enhancing its ability to correctly classify actions, particularly in complex scenarios with cluttered backgrounds or partial occlusions.
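One plausible form of such a calibration is a lightweight scoring head that normalises token importance and re-weights the embeddings accordingly; the sketch below is an assumed instantiation of that idea, not the paper's exact module.

```python
import torch
import torch.nn as nn

class AttentionPostCalibration(nn.Module):
    """Re-weight token embeddings with a learned attention score per token."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(dim, dim // 4), nn.Tanh(), nn.Linear(dim // 4, 1)
        )

    def forward(self, tokens):
        # tokens: (B, N, D) token embeddings from the backbone
        attn = torch.softmax(self.score(tokens), dim=1)   # (B, N, 1) importance per token
        calibrated = tokens * attn * tokens.shape[1]      # emphasise salient tokens, preserve scale
        return calibrated, attn

calib = AttentionPostCalibration(dim=256)
feats, weights = calib(torch.randn(2, 196, 256))
print(feats.shape, weights.shape)   # torch.Size([2, 196, 256]) torch.Size([2, 196, 1])
```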
Rigorous evaluation of PAN on established benchmark datasets demonstrates its superior performance in action recognition. Specifically, on the NTU RGB+D 120 cross-subject (CSub) split, PAN achieves a Top-1 Accuracy of 91.7%, exceeding the performance of current state-of-the-art methods. This metric indicates that PAN correctly identifies the most likely action label in 91.7% of the test cases within the cross-subject evaluation of NTU RGB+D 120, a widely used benchmark for evaluating action recognition models.
Evaluations further demonstrate that the proposed PAN model achieves state-of-the-art performance on two additional benchmarks. Specifically, PAN attains a Top-1 Accuracy of 88.3% on the NTU RGB+D 120 cross-setup (CSet) split, representing an improvement over existing methods. Furthermore, the model achieves a Mean Class Accuracy of 67.7% on the Toyota Smarthome dataset, indicating robust performance across that dataset's diverse set of daily-living activity categories.

Expanding the Framework: Architectural Versatility and Real-World Impact
The PAN-Ensemble architecture enhances performance by employing a dual-path Graph Convolutional Network (GCN). This design incorporates two distinct GCN pathways that independently process graph data, allowing the model to capture complementary information and improve feature representation. Crucially, the outputs from these pathways are not merged prematurely; instead, a late fusion strategy is applied. This approach delays the integration of features until the final stages of processing, which contributes to increased robustness against noisy or incomplete data. By combining the strengths of both pathways through late fusion, PAN-Ensemble demonstrably achieves improved accuracy and stability across a range of graph-based tasks, offering a more reliable and effective solution compared to single-path GCN models.
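A score-level late fusion of two independent pathways can be sketched as follows; the tiny GCN pathways and the fixed mixing weight are illustrative stand-ins for the actual PAN-Ensemble branches.

```python
import torch
import torch.nn as nn

class TinyGCNPath(nn.Module):
    """Minimal GCN pathway that produces class logits from node features."""
    def __init__(self, dim, num_classes):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x, adj):
        # x: (N, dim), adj: (N, N)
        deg = adj.sum(-1, keepdim=True).clamp(min=1.0)
        h = torch.relu(self.proj((adj / deg) @ x))
        return self.head(h.mean(dim=-2))      # pool nodes -> class logits

def late_fusion(logits_a, logits_b, alpha=0.5):
    """Combine pathway predictions only at the score level."""
    return alpha * logits_a.softmax(-1) + (1 - alpha) * logits_b.softmax(-1)

num_nodes, dim, num_classes = 17, 64, 120
path_a, path_b = TinyGCNPath(dim, num_classes), TinyGCNPath(dim, num_classes)
adj = torch.eye(num_nodes)
scores = late_fusion(path_a(torch.randn(num_nodes, dim), adj),
                     path_b(torch.randn(num_nodes, dim), adj))
print(scores.shape)   # torch.Size([120])
```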
The PAN-Unified architecture represents a significant simplification of the original PAN framework by consolidating graph representation learning into a single Graph Convolutional Network (GCN). This innovative approach eliminates the need for multiple GCN pathways and subsequent fusion, drastically reducing computational complexity and enabling faster processing speeds. By performing all graph-based feature extraction within a unified structure, PAN-Unified achieves comparable, and in some cases superior, performance to its dual-path counterpart while requiring fewer parameters and less processing power. This streamlined design not only enhances efficiency but also facilitates easier deployment on resource-constrained devices, broadening the potential applications of PAN to a wider range of real-world scenarios where computational resources are limited.
The PAN framework, through these extensions, adapts readily across a wide spectrum of applications. This adaptability stems from its modular design, which allows diverse graph convolutional network (GCN) configurations to be swapped in: dual-path ensembles when robustness matters, or the streamlined unified variant when computational efficiency is the priority. The flexibility is not merely theoretical; consistent performance gains across datasets spanning healthcare monitoring, security and surveillance, and nuanced human-computer interaction indicate that the framework generalises to the complexity and variability of real-world data. The ability to tailor PAN to specific needs without substantial architectural overhaul positions it as a practical tool for researchers and practitioners working on problems that depend on complex relational data and pattern recognition, offering a pathway toward more intelligent and responsive systems.
The pursuit of robust action recognition, as demonstrated by PAN, necessitates a shift from treating video as a continuous stream to discerning underlying structural dependencies. Each frame isn’t merely a visual input, but a collection of tokens forming a graph, intrinsically linked to skeletal data. This echoes Andrew Ng’s sentiment: “Machine learning is about learning the right representation.” PAN effectively engineers a representation where visual tokens and skeletal information aren’t treated as disparate modalities, but as interconnected nodes within a unified graph. The method’s success lies not just in achieving state-of-the-art results, but in revealing how humans naturally decompose actions into recognizable patterns, and then encoding those patterns in a way that a machine can understand and replicate.
Looking Ahead
The construction of human-centric graph representations, as demonstrated by PAN, offers a compelling, if not predictable, trajectory for multimodal action recognition. The alignment of visual tokens with skeletal data suggests a move toward systems that attempt to understand action, rather than simply detect it. However, the inherent challenge remains: are these graphs truly capturing the underlying semantics, or merely reflecting correlations learned from datasets? Reproducibility, of course, will be crucial in distinguishing between these possibilities.
Future work should address the limitations of current cross-modal fusion techniques. Simply concatenating or averaging features, even within a graph structure, feels increasingly… convenient. A deeper exploration of how information flows between modalities – and, more importantly, why certain modalities dominate in specific contexts – is warranted. The reliance on pre-trained vision transformers also introduces a potential bottleneck; disentangling learned representations from these models will be essential for truly generalizable systems.
Ultimately, the field risks becoming fixated on performance benchmarks. The pursuit of ever-higher accuracy should not overshadow the need for explainability. Visualizing and interpreting the learned graph structures – understanding which visual tokens and skeletal joints contribute most to a given action classification – will be vital for building trust and ensuring responsible deployment of these technologies. Perhaps, then, the true measure of progress will not be the speed of recognition, but the clarity of understanding.
Original article: https://arxiv.org/pdf/2512.21916.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/