Beyond Prediction: Measuring Confidence in Smart Systems

Author: Denis Avetisyan


As machine learning increasingly powers everyday applications like activity recognition, understanding when a model is unsure is just as crucial as knowing what it predicts.

The distributions of distances to nearest clusters for training and validation datasets demonstrate a consistent relationship, with the 99th percentile of the validation set serving as a clear demarcation for characterizing cluster proximity.

This review presents a framework for quantifying uncertainty in deep learning models used for human activity recognition, addressing challenges posed by distributional shift through reconstruction loss, latent space analysis, and Monte Carlo Dropout.

While machine learning increasingly powers pervasive systems, its reliance on data-driven models introduces inherent uncertainty absent in traditional software. This paper, ‘Quantifying Uncertainty in Machine Learning-Based Pervasive Systems: Application to Human Activity Recognition’, addresses this challenge by proposing a framework to quantify prediction reliability, specifically within the complex domain of human activity recognition. The approach combines reconstruction loss, latent space distance, and Monte Carlo Dropout to effectively detect distributional shifts and assess model confidence. Can robust uncertainty quantification unlock more trustworthy and adaptable AI systems for real-world applications?


The Fragility of Intelligence: Navigating Shifting Realities

Machine learning algorithms are rapidly becoming integral to daily life, underpinning applications from medical diagnosis and financial modeling to autonomous vehicles and personalized recommendations. However, the performance of these algorithms is predicated on the assumption that the data encountered during deployment mirrors the data used during training. This assumption frequently fails in real-world scenarios, where changes in data distribution – known as distributional shifts – are commonplace. These shifts can manifest in various ways, such as alterations in user behavior, seasonal variations, or the introduction of novel conditions not present in the training set. Consequently, a model that performed flawlessly during development may experience significant and unpredictable degradation in accuracy and reliability when exposed to these altered conditions, highlighting a critical vulnerability in the widespread deployment of machine learning systems.

Machine learning models, despite their growing sophistication, are acutely sensitive to alterations in the data they encounter, a phenomenon manifesting as distributional shifts. These shifts take several forms: covariate shift, where the distribution of input features changes; label shift, where the distribution of output labels changes; and domain shift, where the entire data-generating process differs from the training environment. Consequently, a model performing with high accuracy in a controlled setting can experience significant performance degradation when deployed in the real world, leading to unreliable predictions and potentially harmful outcomes. The core issue lies in the assumption that the data encountered during deployment will mirror the training data; when this assumption is violated, the model’s learned relationships become less valid, demanding robust strategies for detection and mitigation of these shifts to ensure continued dependability.

A critical limitation of many conventional machine learning systems is their difficulty in expressing the confidence – or lack thereof – in their own predictions. While adept at identifying patterns within training data, these models frequently offer outputs without accompanying measures of uncertainty, presenting a significant obstacle to reliable deployment in real-world scenarios. This absence of calibrated confidence can lead to dangerous outcomes, particularly in high-stakes applications like autonomous driving or medical diagnosis, where acting on a seemingly plausible but ultimately incorrect prediction could have severe consequences. Unlike human decision-making, which often incorporates intuitive assessments of risk, these “black box” algorithms struggle to signal when a given input falls outside of their learned distribution or when a prediction is likely to be flawed, demanding new approaches to quantify and communicate predictive uncertainty for truly robust and safe operation.

Current machine learning research increasingly focuses on developing techniques to grapple with the inherent instability caused by shifts in data distributions. These approaches move beyond simply achieving high accuracy on static datasets and instead prioritize robustness – the ability to maintain performance when faced with previously unseen data. Researchers are exploring methods such as domain adaptation, which aims to transfer knowledge from one distribution to another, and techniques for quantifying aleatoric and epistemic uncertainty – differentiating between noise inherent in the data and the model’s lack of knowledge. Furthermore, advancements in continual learning are allowing models to adapt to evolving data streams without catastrophic forgetting, while causal inference methods are being leveraged to identify and mitigate spurious correlations that can lead to brittle performance when distributions change. Ultimately, these efforts seek to build machine learning systems capable of reliable operation in dynamic, real-world environments.

The reconstruction loss distributions for training and validation datasets align, indicating good generalization, with the 0.99 quantile of the validation set serving as a reference threshold.

Acknowledging the Unknown: The Power of Uncertainty Quantification

Uncertainty Quantification (UQ) in machine learning involves characterizing and reducing the impact of uncertainties inherent in model inputs, model parameters, and the modeling process itself. Rather than providing single-point predictions, UQ aims to define a probability distribution over possible outcomes, allowing for a more complete assessment of model reliability. This is achieved through techniques that propagate uncertainty from input variables through the model to quantify the uncertainty in the predictions. Key benefits of implementing UQ include improved risk assessment, enhanced decision-making under uncertainty, and the ability to identify areas where further data collection or model refinement are needed. UQ methods are particularly crucial in high-stakes applications, such as medical diagnosis, financial modeling, and safety-critical systems, where understanding the potential for error is paramount.
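
As a toy illustration of the propagation idea (not the paper's framework), the sketch below pushes Monte Carlo samples of a noisy input through a stand-in model and reads off the spread of the resulting outputs; the model, noise levels, and sample count are all assumptions made for the example.

```python
# Minimal sketch of propagating input uncertainty through a model via
# Monte Carlo sampling. The "model" is a stand-in analytic function; in a
# real pipeline it would be a trained classifier or regressor.
import numpy as np

rng = np.random.default_rng(0)

def model(x):
    # Hypothetical deterministic model: a smooth nonlinear mapping.
    return np.sin(x[..., 0]) + 0.5 * x[..., 1] ** 2

# Nominal input and an assumed Gaussian measurement noise on each feature.
x_nominal = np.array([0.8, -0.3])
input_std = np.array([0.05, 0.10])

# Draw perturbed inputs, push each through the model, summarize the spread.
samples = x_nominal + rng.normal(0.0, input_std, size=(1000, 2))
outputs = model(samples)

print(f"mean prediction: {outputs.mean():.3f}")
print(f"predictive std due to input noise: {outputs.std():.3f}")
```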

Bayesian models, unlike traditional machine learning approaches that yield point estimates, produce predictive distributions representing the probability of different outcomes. This is achieved by treating model parameters as random variables with associated prior distributions. Through Bayes’ theorem, these priors are updated with observed data to generate posterior distributions over the parameters. Consequently, predictions are not single values but probability distributions – $P(y|x)$ – reflecting the uncertainty inherent in both the estimated parameters and the prediction itself. This allows for a quantifiable assessment of prediction reliability, providing not only an estimate of the most likely outcome but also a measure of the confidence associated with that estimate, often expressed as variance or credible intervals.
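
Written out, the predictive distribution described above is obtained by marginalizing over the posterior on the parameters; this is the standard Bayesian formulation rather than anything specific to the reviewed framework:

$$P(y \mid x, \mathcal{D}) = \int P(y \mid x, \theta)\, P(\theta \mid \mathcal{D})\, d\theta, \qquad P(\theta \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \theta)\, P(\theta)}{P(\mathcal{D})}$$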

Monte Carlo Dropout (MC Dropout) provides an efficient approximation of Bayesian inference in neural networks by enabling multiple predictions from a single trained model. Dropout is kept active not only during training but also at test time, effectively sampling an ensemble of thinned sub-networks from the same trained model. Each forward pass with a different dropout mask yields a slightly different prediction. By repeating this process numerous times – typically tens to a few hundred forward passes – a distribution of predictions is generated. The variance of this distribution serves as an estimate of the model’s predictive uncertainty; higher variance indicates greater uncertainty. This technique avoids the computational expense of traditional Markov Chain Monte Carlo (MCMC) methods while still providing a quantifiable measure of confidence in the model’s output, without requiring modifications to the training process beyond standard dropout.
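
A minimal PyTorch sketch of the procedure is shown below; the tiny network, dropout rate, and number of forward passes are illustrative choices, not the architecture or settings used in the paper.

```python
# Minimal MC Dropout sketch (PyTorch): keep dropout active at inference and
# treat repeated stochastic forward passes as samples from an approximate
# predictive distribution.
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(
    nn.Linear(16, 64), nn.ReLU(), nn.Dropout(p=0.2),
    nn.Linear(64, 6),                 # e.g. six activity classes
)

def mc_dropout_predict(model, x, n_samples=50):
    model.train()                     # keep dropout stochastic at test time
    with torch.no_grad():
        probs = torch.stack([
            torch.softmax(model(x), dim=-1) for _ in range(n_samples)
        ])                            # shape: (n_samples, batch, classes)
    mean = probs.mean(dim=0)          # averaged class probabilities
    var = probs.var(dim=0)            # per-class variance = uncertainty signal
    return mean, var

x = torch.randn(4, 16)                # four dummy sensor-feature windows
mean, var = mc_dropout_predict(model, x)
print(mean.argmax(dim=-1), var.max(dim=-1).values)
```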

Autoencoders facilitate anomaly detection and out-of-distribution (OOD) data identification by learning a compressed, latent space representation of normal data. During training, the autoencoder minimizes reconstruction loss – the difference between the input and its reconstructed output – effectively learning to accurately reproduce typical data points. Anomalies or OOD data, by definition, deviate from this learned distribution, resulting in significantly higher reconstruction losses. Furthermore, incorporating a latent space distance metric – such as the distance from a sample’s latent vector to the nearest cluster of training representations – flags inputs whose latent representations fall far from the typical latent distribution, enhancing the separation between normal and anomalous data. A higher combined loss – reconstruction loss plus latent space distance – indicates a greater probability that the input is an anomaly or originates from a different distribution than the training data.
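
The sketch below combines the two signals into a single score, using k-means centroids over training latents as the clusters and a 0.99-quantile threshold in the spirit of the figures above; the tiny architecture, weighting factor, and cluster count are assumptions, and the training loop is omitted (the randomly initialized networks here are placeholders for trained ones).

```python
# Minimal OOD-score sketch: reconstruction loss plus distance to the nearest
# training cluster in latent space. Untrained nets stand in for trained ones.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

torch.manual_seed(0)

encoder = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 3))
decoder = nn.Sequential(nn.Linear(3, 8), nn.ReLU(), nn.Linear(8, 16))

def ood_score(x, centers, alpha=1.0):
    with torch.no_grad():
        z = encoder(x)
        recon = decoder(z)
    recon_loss = ((recon - x) ** 2).mean(dim=-1)           # per-sample MSE
    nearest = torch.cdist(z, centers).min(dim=-1).values   # nearest-cluster distance
    return recon_loss + alpha * nearest                    # combined anomaly score

# Fit clusters on encoded training data, then threshold new samples.
train_x = torch.randn(500, 16)
with torch.no_grad():
    train_z = encoder(train_x)
centers = torch.tensor(
    KMeans(n_clusters=4, n_init=10).fit(train_z.numpy()).cluster_centers_,
    dtype=torch.float32,
)
# In the article the threshold is taken as the 99th percentile on the validation set.
threshold = ood_score(train_x, centers).quantile(0.99)
print(ood_score(torch.randn(5, 16) * 3, centers) > threshold)  # flag likely OOD inputs
```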

The 99th percentile of the variance distribution of dropout-induced predictions on the validation dataset is indicated by the red line, demonstrating the range of prediction uncertainty.

Calibrating Confidence: Measuring the Reliability of Predictions

Uncertainty Accuracy (UA) serves as a primary evaluation metric for calibrated machine learning models by assessing the alignment between predicted confidence levels and actual observation frequencies. Specifically, UA quantifies how well a model’s stated uncertainty reflects the true likelihood of an incorrect prediction; a high UA indicates that when the model expresses low confidence, it is, in fact, more likely to be incorrect. This is typically measured by binning predictions based on their confidence scores, and then calculating the average accuracy within each bin; a well-calibrated model will exhibit a strong positive correlation between confidence and accuracy. Unlike traditional accuracy metrics which only indicate if a prediction is correct or incorrect, UA provides insight into the reliability of the model’s predictions, crucial for risk-sensitive applications.
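
The binning check described above can be inspected with a few lines of code; the confidences and outcomes below are synthetic, and this is a sanity check in the spirit of the metric rather than necessarily the exact UA computation from the paper.

```python
# Group predictions by confidence and report per-bin accuracy: for a
# well-calibrated model, accuracy should rise with confidence.
import numpy as np

def binned_accuracy(confidences, correct, n_bins=10):
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences >= lo) & (confidences < hi)
        if mask.any():
            rows.append((lo, hi, int(mask.sum()), correct[mask].mean()))
    return rows

rng = np.random.default_rng(0)
conf = rng.uniform(0.3, 1.0, size=2000)
# Synthetic "well-calibrated" outcomes: correctness probability equals confidence.
correct = rng.uniform(size=2000) < conf

for lo, hi, n, acc in binned_accuracy(conf, correct):
    print(f"confidence [{lo:.1f}, {hi:.1f}): n={n:4d}  accuracy={acc:.2f}")
```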

Differentiating between aleatory and epistemic uncertainty is fundamental to developing reliable machine learning models. Aleatory uncertainty, also known as data uncertainty, represents the inherent noise or randomness present in the data itself; it is irreducible even with more data. Epistemic uncertainty, conversely, arises from a lack of knowledge about the model parameters, stemming from limited training data or model misspecification; this type of uncertainty can be reduced with increased or more informative data. Accurate identification of each type enables targeted improvements: addressing epistemic uncertainty through data acquisition or model refinement, and acknowledging aleatory uncertainty as a fundamental limit to prediction accuracy.
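
A common way to separate the two in practice (frequently paired with Monte Carlo sampling over parameters $\theta$, though not necessarily the exact formulation used in the paper) is the law-of-total-variance decomposition, where $\mu(x,\theta)$ and $\sigma^{2}(x,\theta)$ are the mean and noise variance predicted under a particular parameter draw:

$$\operatorname{Var}[y \mid x] = \underbrace{\mathbb{E}_{\theta}\big[\sigma^{2}(x,\theta)\big]}_{\text{aleatory}} + \underbrace{\operatorname{Var}_{\theta}\big[\mu(x,\theta)\big]}_{\text{epistemic}}$$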

Quantifying both aleatory and epistemic uncertainty provides critical insight into a model’s reliability and applicability. Aleatory uncertainty, representing inherent noise in the data, defines a lower bound on prediction error; understanding this allows for appropriate loss function selection and data augmentation strategies. Epistemic uncertainty, reflecting a model’s lack of knowledge due to limited training data or exposure to specific scenarios, highlights areas where the model may generalize poorly. By separately assessing these components, developers can identify situations where a model’s predictions are unreliable and implement mitigation strategies such as active learning, data collection focused on high-uncertainty regions, or the deployment of ensemble methods. This detailed understanding facilitates informed decision-making regarding model deployment, risk assessment, and the appropriate level of human oversight required for critical applications.

Evaluations of the proposed framework demonstrate an achieved Uncertainty Accuracy (UA) ranging from approximately 70% to 80% when subjected to various distributional shifts. These shifts include label shifts, where the distribution of output classes changes; covariate shifts, involving alterations in the input data distribution; and domain shifts, representing more substantial differences between training and testing environments. Performance improvements were observed with the implementation of cascading methods, which particularly enhance UA scores in the presence of domain shifts, suggesting a greater ability to generalize to unseen data distributions. These results indicate a quantifiable level of reliability in the model’s uncertainty estimates across diverse conditions.

Real-World Impact: Enhancing Human Activity Recognition

Human Activity Recognition (HAR) systems, which aim to identify actions like walking, running, or sitting from sensor data, are substantially improved through the application of Uncertainty Quantification (UQ) methods. Traditional HAR often provides only a single prediction without indicating the confidence level associated with that prediction; UQ addresses this limitation by providing a measure of reliability. This is crucial because sensor data is inherently noisy and can vary greatly between individuals, leading to misclassifications. By quantifying uncertainty, systems can flag ambiguous data points, request additional information, or even abstain from making a prediction when confidence is low. The result is a more robust and dependable HAR system, particularly valuable in applications like healthcare monitoring, where inaccurate activity recognition could have serious consequences, or in assistive technologies requiring precise and trustworthy input.
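
A minimal sketch of such a flag-or-abstain rule, thresholding a per-window uncertainty score at the 99th percentile of validation scores (mirroring the thresholds shown in the figures), might look as follows; the scores, labels, and distributions are synthetic.

```python
# Abstain-when-unsure rule: predictions whose uncertainty exceeds a
# validation-derived threshold are flagged instead of being trusted.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-window uncertainty scores (e.g. MC Dropout variance).
val_uncertainty = rng.gamma(shape=2.0, scale=0.01, size=5000)
test_uncertainty = rng.gamma(shape=2.0, scale=0.015, size=10)
test_labels = np.array(["walk", "run", "sit", "walk", "stand",
                        "run", "sit", "walk", "stand", "run"])

threshold = np.quantile(val_uncertainty, 0.99)

for label, u in zip(test_labels, test_uncertainty):
    if u > threshold:
        print(f"ABSTAIN / request more data (uncertainty {u:.3f} > {threshold:.3f})")
    else:
        print(f"predict {label:5s} (uncertainty {u:.3f})")
```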

Recent advancements in human activity recognition increasingly leverage the power of transformer architectures, particularly when paired with masked autoencoders. This combination enhances performance by enabling the model to learn robust representations from sequential sensor data, even when portions of the input are obscured or missing. Masked autoencoders function by randomly masking parts of the input sequence and training the model to reconstruct the missing data, forcing it to develop a deeper understanding of the underlying patterns and dependencies within the activity. The resulting models demonstrate improved generalization capabilities and resilience to noisy or incomplete data, crucial for real-world applications where sensor readings can be unreliable. This approach effectively addresses the challenges of varying activity intensities, individual user differences, and the complexities of real-world movement patterns, leading to more accurate and reliable activity classification.
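
The masking step at the heart of this kind of pretraining can be sketched in a few lines; the window length, channel count, and mask ratio below are illustrative and not the configuration from Ek et al. (2024).

```python
# MAE-style masking of a sensor window: hide a random subset of time steps
# and (conceptually) train the model to reconstruct only the hidden ones.
import torch

torch.manual_seed(0)

window = torch.randn(128, 6)        # 128 time steps, 6 IMU channels
mask_ratio = 0.6

n_masked = int(mask_ratio * window.shape[0])
masked_idx = torch.randperm(window.shape[0])[:n_masked]  # steps the encoder never sees

corrupted = window.clone()
corrupted[masked_idx] = 0.0         # replace hidden steps with a placeholder value

# Training objective (conceptually): reconstruct only the hidden steps, e.g.
# loss = ((reconstruction[masked_idx] - window[masked_idx]) ** 2).mean()
print(corrupted.shape, masked_idx.shape)
```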

The effectiveness of uncertainty quantification (UQ) techniques in human activity recognition (HAR) is fundamentally dependent on evaluation against established datasets. Utilizing publicly available, real-world datasets – such as the HHAR Dataset, which captures a diverse range of physical activities – provides a standardized benchmark for comparing different UQ approaches. Rigorous evaluation on these datasets allows researchers to assess not only the accuracy of activity classification, but also the reliability of the quantified uncertainty – crucial for safety-critical applications. By testing UQ methods against established baselines and diverse sensor data, the robustness and generalizability of these techniques can be comprehensively determined, fostering advancements in the field and ensuring practical applicability beyond controlled laboratory settings.

The developed methodology demonstrates a compelling capacity for rapid processing, achieving execution times of under 0.1 seconds when analyzing 10,000 samples on an NVIDIA L4 GPU. This performance benchmark is critical, as it positions the approach as highly viable for deployment in real-time pervasive applications – scenarios demanding immediate data interpretation, such as continuous health monitoring or responsive assistive technologies. Importantly, this speed is not achieved at the expense of accuracy; the system maintains a high level of uncertainty awareness, ensuring reliable and trustworthy activity recognition even within dynamic and complex environments. This combination of speed and accuracy represents a significant step towards practical and dependable human activity recognition systems.

The model architecture integrates a Masked Autoencoder (MAE) with a Hierarchical Attention Transformer, as detailed in Ek et al. (2024).

The pursuit of robust pervasive systems, as detailed in this work, necessitates a clear understanding of model limitations. It’s a delicate balance; the framework’s combination of reconstruction loss, latent space distance, and Monte Carlo Dropout to quantify uncertainty demonstrates this well. As Claude Shannon once stated, “The most important thing is to get the message across.” This paper isn’t merely about achieving high accuracy in Human Activity Recognition; it’s about reliably communicating how confident the system is in its predictions. If the system looks clever by simply outputting a label, it’s probably fragile – the true strength lies in acknowledging what it doesn’t know, especially when faced with distributional shifts. Structure, in this case the uncertainty quantification, dictates behavior, and a well-defined structure allows the system to gracefully degrade rather than fail catastrophically.

Where Do We Go From Here?

The pursuit of quantifying uncertainty in machine learning, particularly within the context of pervasive systems, reveals a fundamental tension. This work, by combining reconstruction error, latent space analysis, and probabilistic dropout, offers a pragmatic step toward robust human activity recognition. However, it simultaneously highlights the inherent fragility of relying solely on model-internal assessments of confidence. If a design feels clever, it probably is fragile. The true test lies not in how well a model estimates its own uncertainty, but in how gracefully the system degrades under unforeseen distributional shifts.

Future efforts must move beyond isolated model evaluations. A complete system will require mechanisms for detecting shifts before they propagate into erroneous predictions, perhaps through continual learning strategies or by integrating external sources of corroborating evidence. The focus should shift from achieving ever-higher accuracy on benchmark datasets to building systems that demonstrably understand their limitations. A simple, well-understood system that acknowledges its ignorance will always outperform a complex one attempting to feign omniscience.

Ultimately, the question isn’t whether machines can quantify uncertainty, but whether they can tolerate it. Structure dictates behavior, and a truly robust pervasive system will be defined not by its ability to predict the future, but by its resilience in the face of the unpredictable.


Original article: https://arxiv.org/pdf/2512.09775.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
