Sensing Context: A New Approach to Understanding Human Activity

Author: Denis Avetisyan


Researchers are combining the strengths of centralized and federated learning with Transformer models to build more accurate and privacy-preserving human activity recognition systems.

Despite a consistent central tendency across iterations of the federated learning process, client-specific performance, measured by balanced accuracy ([latex]BA[/latex]), exhibits substantial variability in its range and susceptibility to outlier values, suggesting inherent instability within the distributed system.

This paper introduces FED-HARGPT, a hybrid learning framework leveraging Transformers for human activity recognition on decentralized, non-IID data from edge devices.

The increasing volume of data generated by wearable sensors presents both opportunities and challenges for human activity recognition, particularly concerning data privacy. This paper introduces ‘FED-HARGPT: A Hybrid Centralized-Federated Approach of a Transformer-based Architecture for Human Context Recognition’, a novel framework that combines centralized and federated learning techniques with a Transformer-based model to address this need. Experimental results demonstrate that the hybrid approach achieves performance comparable to fully centralized models on non-IID data while preserving data privacy. Could this approach unlock more robust and privacy-conscious human activity monitoring in real-world applications?


The Illusion of Centralized Knowledge

Human Activity Recognition systems traditionally depend on gathering data from users’ devices and transmitting it to a central server for processing. This centralized approach, while simplifying analysis, introduces significant privacy vulnerabilities as sensitive behavioral information leaves the user’s control. Beyond privacy, this model struggles with scalability; as the number of users and the volume of collected data increase, the central server faces immense computational strain and potential bottlenecks. Furthermore, reliance on constant data transmission drains device batteries and requires consistent network connectivity, limiting the feasibility of widespread, continuous monitoring. Consequently, the field is shifting towards decentralized and on-device processing methods to address these inherent limitations and empower users with greater control over their personal data.

Analyzing activity from mobile devices presents unique challenges beyond simply identifying movements. Real-world sensor data is inherently noisy and variable, stemming from diverse device placements, differing sensor qualities, and the sheer unpredictability of human behavior. This complexity is often compounded by imbalanced datasets, such as the widely-used ExtraSensory Dataset, where certain activities – like running or cycling – are significantly less represented than more common actions like walking or sitting. This disparity can lead algorithms to prioritize frequently observed activities, severely hindering their ability to accurately recognize rarer, yet potentially critical, events. Effectively addressing these imbalances and accounting for the inherent noise within mobile sensor data is crucial for developing robust and reliable human activity recognition systems.

Conventional machine learning algorithms, including Support Vector Machines and K-Nearest Neighbors, frequently encounter limitations when applied to mobile activity recognition. While historically effective, these methods demand significant computational resources, posing challenges for deployment on mobile devices with constrained processing power and battery life. Furthermore, their performance often degrades when encountering data from users with varying demographics, physical conditions, or usage patterns – a phenomenon known as poor generalization. This is because these algorithms struggle to effectively learn robust feature representations that are invariant to individual differences, requiring substantial, balanced datasets for each user to achieve acceptable accuracy. Consequently, developing more efficient and adaptable techniques remains crucial for realizing the full potential of mobile activity recognition across diverse populations and real-world scenarios.

A logarithmic-scale histogram of label distributions from the ExtraSensory dataset demonstrates a significant class imbalance across the six labels.

The Seeds of Distributed Intelligence

Federated Learning (FL) offers a decentralized approach to training Human Activity Recognition (HAR) models by processing data locally on edge devices – such as smartphones or wearables – eliminating the need for centralized data storage. This paradigm shifts how data is handled: instead of collecting raw data on a server, the training process itself is distributed. Each device uses its locally stored data to compute model updates, which are then sent to a central server for aggregation. Only these model updates, not the raw data, are transmitted, which addresses privacy concerns and reduces bandwidth requirements. This approach is particularly advantageous in scenarios where data sensitivity or regulatory restrictions prohibit centralized data collection, and it enables model personalization based on individual user behavior.

Flower is a scalable framework for decentralized learning, designed to coordinate machine learning model training across numerous client devices. It operates on a client-server architecture, orchestrating training processes without requiring data centralization. A key feature of Flower is its ability to handle ‘Non-IID Data’ – data that is not independently and identically distributed – which is common in real-world deployments where each client possesses a unique data distribution. Flower achieves this through flexible aggregation strategies and supports various machine learning frameworks, including TensorFlow, PyTorch, and MXNet, allowing for customization to address the challenges presented by heterogeneous and statistically diverse datasets across a federated network.

The Federated Averaging (FedAvg) algorithm is a key component in federated learning, designed to create a robust global model from decentralized data. It operates by distributing the current global model to a subset of client devices. Each client then trains this model locally using its private dataset. Following local training, clients transmit only their model updates – specifically, the changes made to the model weights – back to a central server. The server then averages these updates, weighted by the size of each client’s dataset, to create a new, improved global model. This process is repeated iteratively, allowing the global model to converge even though the data is non-IID (not independent and identically distributed) across clients. The weighting by dataset size ensures that contributions from clients with more data have a proportionally larger impact on the global model, improving overall performance and generalization.
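The server-side aggregation step described above can be sketched in a few lines. This is a minimal illustration in pure Python, not the paper's implementation or any framework's API; the function name and toy weight vectors are hypothetical.

```python
# Minimal sketch of one FedAvg aggregation round (illustrative only).
# Each client reports its updated weight vector and local dataset size;
# the server averages weights proportionally to dataset size.

def fedavg(client_weights, client_sizes):
    """Weighted average of client weight vectors by local dataset size."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    aggregated = [0.0] * dim
    for weights, size in zip(client_weights, client_sizes):
        for i, w in enumerate(weights):
            aggregated[i] += w * (size / total)
    return aggregated

# Two clients: one with 300 samples, one with 100. The larger client's
# update pulls the new global model three times as strongly.
global_model = fedavg(
    client_weights=[[1.0, 2.0], [5.0, 6.0]],
    client_sizes=[300, 100],
)
print(global_model)  # [2.0, 3.0]
```

In a real round only a sampled subset of clients participates, and the vectors are full model parameter tensors rather than two numbers, but the weighted average is the same operation.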

Echoes of Language in Movement

The Human Activity Recognition (HAR) model utilizes a Transformer architecture, specifically the GPT-2 model, to process sequential data from mobile sensors. GPT-2, originally designed for natural language processing, is adapted to analyze time-series data generated by accelerometers, gyroscopes, and other onboard sensors. This approach capitalizes on the Transformer’s inherent ability to model long-range temporal dependencies within the sensor data, effectively capturing the order and duration of movements that characterize different activities. Unlike recurrent neural networks, Transformers process the entire input sequence in parallel, improving computational efficiency and allowing for better capture of relationships across the entire time series, rather than being limited by sequential processing.
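The parallel, all-pairs comparison that distinguishes Transformers from recurrent networks comes down to scaled dot-product attention. The toy sketch below, in pure Python, shows that core operation on a three-step "sensor" sequence; real models like GPT-2 add learned query/key/value projections, multiple heads, and causal masking on top of this, and the input values here are invented for illustration.

```python
import math

# Toy scaled dot-product attention over a short sensor sequence.
# Every time step is related to every other step in one parallel pass,
# with no sequential recurrence.

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    """For each query, return an attention-weighted mix of all values."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)  # weights sum to 1 per query
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Three time steps of 2-D "sensor features" (self-attention: the sequence
# attends to itself). Step 0 can draw on step 2 directly, rather than
# through intermediate recurrent states.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
result = attention(x, x, x)
print(len(result), len(result[0]))  # 3 2
```

Each output row is a convex combination of all input rows, which is why long-range temporal dependencies in a sensor window are captured in a single layer.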

The activity recognition model is engineered for multi-label classification, enabling the simultaneous identification of multiple activities occurring within a single data instance. Unlike single-label classification which assigns only one activity per instance, this approach accommodates scenarios where users perform several actions concurrently – for example, walking while talking on the phone. The model’s output therefore consists of a vector of probabilities, each representing the likelihood of a specific activity being present, allowing for the detection of all relevant activities exceeding a predefined confidence threshold. This functionality is crucial for accurately representing complex human behavior as captured by mobile sensor data.
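The thresholding step this paragraph describes can be made concrete with a short sketch. The label names and the 0.5 cutoff below are illustrative assumptions, not values from the paper; the point is that each label gets an independent probability, so several can fire at once.

```python
# Sketch of multi-label prediction from per-activity probabilities,
# assuming a sigmoid output head (one independent probability per label).
# Labels and threshold are illustrative, not taken from the paper.

LABELS = ["walking", "sitting", "talking_on_phone", "running"]

def predict_labels(probabilities, threshold=0.5):
    """Return every activity whose probability meets the threshold."""
    return [label for label, p in zip(LABELS, probabilities) if p >= threshold]

# A user walking while talking on the phone: two labels fire at once,
# which a single-label (softmax) classifier could not express.
print(predict_labels([0.91, 0.08, 0.77, 0.02]))
# ['walking', 'talking_on_phone']
```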

Historically, human activity recognition (HAR) systems relied on dedicated signal processing techniques to extract relevant features from raw sensor data. Methods like Wavelet Transform, Principal Component Analysis (PCA), and Independent Component Analysis (ICA) were commonly employed to reduce dimensionality and highlight discriminative information. The adoption of Transformer architectures, specifically GPT-2 in this work, has shifted this paradigm. These models possess an intrinsic ability to learn feature representations directly from the time-series data, effectively automating the feature engineering process and reducing the need for such hand-crafted pre-processing steps. While these earlier techniques may still be valuable for data preparation or noise reduction, the Transformer’s learned representations generally achieve superior performance in capturing the complex temporal dependencies crucial for accurate activity classification.

The Illusion of Accuracy, the Promise of Adaptation

Evaluations of the Federated Learning-based Transformer model, rigorously assessed using the ‘Balanced Accuracy’ metric, reveal an average performance of approximately 0.75 across multiple data folds. This indicates a robust capability in accurately classifying human activities while addressing the challenges of data heterogeneity inherent in real-world applications. The ‘Balanced Accuracy’ – a measure particularly suited to imbalanced datasets – consistently ranged from 0.718 to 0.779, demonstrating stable and reliable performance across different data partitions. This level of accuracy signifies a promising advancement in Human Activity Recognition, suggesting the model’s potential for practical deployment in personalized health monitoring and ambient assisted living scenarios.
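Balanced accuracy, the metric used throughout these evaluations, is the mean of per-class recall, which keeps rare activities from being drowned out by frequent ones. A minimal sketch, with made-up labels for illustration:

```python
# Balanced accuracy: the mean of recall computed separately per class.
# On imbalanced data this penalizes a model that ignores rare classes,
# unlike plain accuracy.

def balanced_accuracy(y_true, y_pred):
    """Mean recall over each class present in y_true."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        relevant = [(t, p) for t, p in zip(y_true, y_pred) if t == c]
        hits = sum(1 for t, p in relevant if p == c)
        recalls.append(hits / len(relevant))
    return sum(recalls) / len(classes)

# 8 "sitting" (0) samples, 2 "running" (1) samples. Predicting "sitting"
# for everything gives 0.8 plain accuracy yet misses every rare-class
# sample; balanced accuracy exposes this, scoring only 0.5.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

Against this baseline of 0.5 for a degenerate majority-class predictor, the reported range of 0.718 to 0.779 reflects genuine discrimination across all classes.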

Evaluations revealed a consistent level of performance across multiple data partitions, with the model’s ‘Balanced Accuracy’ fluctuating between 0.718 and 0.779. This range indicates robustness; the model did not exhibit significant performance drops when tested on different subsets of the data. Such stability is crucial for real-world deployment, suggesting the model generalizes well and is not overly sensitive to variations in user activity patterns or sensor data. The relatively narrow margin between the highest and lowest accuracy scores further validates the approach’s reliability and predictable performance across diverse data folds, offering confidence in its potential for widespread application in human activity recognition systems.

Evaluations reveal the proposed Federated Learning-based Transformer model demonstrates competitive efficacy when contrasted with existing state-of-the-art methods for Human Activity Recognition. Across a diverse set of sixty client devices, the model achieved performance equal to or exceeding that of current benchmarks on nineteen of those clients. This indicates a significant capability to not only match, but surpass, established techniques in personalized activity monitoring, particularly within specific user contexts and data characteristics. The results suggest the model’s architecture and federated approach offer a robust foundation for further optimization and broader applicability in real-world scenarios.

The developed Federated Learning-based Transformer model offers a significant advancement in Human Activity Recognition (HAR) by prioritizing data privacy and system scalability. Traditional HAR systems often require centralized data collection, raising concerns about sensitive user information; this approach, however, enables personalized activity monitoring directly on user devices, processing data locally and only sharing model updates. This decentralized structure not only safeguards individual privacy but also allows the system to adapt to diverse user behaviors and environments without the need to transmit raw data. The inherent scalability of Federated Learning further facilitates deployment across a large number of users and devices, making continuous, personalized health and wellness monitoring a practical reality – all while maintaining robust data protection standards.

Continued development centers on refining the model’s core architecture, with investigations planned into more sophisticated Transformer designs to capture intricate temporal dependencies within human activity data. Simultaneously, researchers aim to harness the benefits of on-device learning, a paradigm that shifts computational demands from centralized servers to the individual user’s device. This approach not only promises to reduce latency and enhance data privacy but also offers the potential for continuous model adaptation, tailoring performance to the unique movement patterns and environmental conditions experienced by each user. By combining architectural innovation with decentralized learning strategies, the system aspires to achieve increasingly accurate and personalized human activity recognition capabilities.

The pursuit of increasingly complex architectures, as demonstrated by FED-HARGPT’s hybrid centralized-federated Transformer model, echoes a fundamental truth about systems. This work, striving for enhanced human activity recognition through distributed learning, doesn’t build a solution so much as cultivate an environment for emergence. Søren Kierkegaard observed that “Life can only be understood backwards; but it must be lived forwards.” Similarly, this research acknowledges the forward momentum of data complexity – the non-IID nature of edge data – while retrospectively attempting to impose order. The system isn’t simply designed; it’s grown from the inherent uncertainties of real-world input, a testament to the fact that every architectural choice is, inevitably, a prophecy of future adaptation – or failure.

What’s Next?

The pursuit of human context recognition, as illustrated by this work, isn’t a problem of model architectures; it’s an exercise in deferred compromise. Each refinement of centralized-federated learning, each layer added to the Transformer, merely postpones the inevitable: the mismatch between the model’s abstracted reality and the messy, unpredictable nature of lived experience. The preservation of data privacy, while laudable, is a symptom of a deeper truth: the system doesn’t know the data, it merely holds its ghost.

The handling of non-IID data suggests a coming reckoning with the fallacy of distribution. To believe a model can generalize beyond the conditions of its training is to misunderstand the fundamental nature of context. Future efforts will likely not focus on correcting for data heterogeneity, but on embracing it: building systems that are exquisitely sensitive to the unique signature of each data source, even as those signatures shift and decay.

Ultimately, the true challenge isn’t achieving higher accuracy on benchmark datasets. It’s acknowledging that every successful prediction is a temporary truce with uncertainty. The system doesn’t learn what a human is; it learns what a human has been, and extrapolates, always imperfectly, into what a human might do. And when the system is silent, it is not resting – it is calculating the probability of its own failure.


Original article: https://arxiv.org/pdf/2603.24601.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/


2026-03-28 15:06