Author: Denis Avetisyan
New research demonstrates an automated approach to understanding how people feel during interactions with robots, moving beyond reliance on subjective feedback.

This study classifies user satisfaction in human-robot interaction using time series analysis of nonverbal social signals, eliminating the need for manual annotation.
Assessing user experience remains a critical bottleneck in the deployment of socially interactive agents despite their increasing presence in real-world scenarios. This paper, ‘Classification of User Satisfaction in HRI with Social Signals in the Wild’, introduces a method for automatically classifying user satisfaction during human-robot interaction by analyzing time series data derived from nonverbal cues like body pose, facial expressions, and proxemics. Our results demonstrate the feasibility of identifying interactions with low user satisfaction without relying on manually labeled datasets, offering a pathway toward automated feedback mechanisms for improving robot performance. Could this approach ultimately enable truly adaptive and user-centered social robots capable of seamless interaction?
Decoding the Nuances of Human Connection
The development of truly effective Socially Interactive Agents (SIAs) hinges on their ability to perceive and interpret the subtle language of human nonverbal communication. Current artificial intelligence systems, while proficient in tasks requiring logical reasoning, often struggle with the ambiguities inherent in body language, facial expressions, and proxemics – the use of space. Unlike the clearly defined rules governing computer code, social signals are frequently context-dependent and expressed with varying degrees of intensity, making reliable decoding exceptionally difficult. This poses a significant hurdle; an SIA unable to accurately read cues such as a furrowed brow or averted gaze risks misinterpreting a user’s emotional state, leading to awkward or inappropriate interactions. Consequently, advancements in AI must prioritize the nuanced understanding of these nonverbal cues to create agents capable of fostering genuine connection and building trust with human users.
Current approaches to interpreting human social signals often falter due to the inherent complexity of nonverbal communication. Existing systems typically analyze body language, facial expressions, and proxemics – the use of spatial distance – in isolation, neglecting the crucial interplay between them. A slight shift in posture, for example, can dramatically alter the meaning of a facial expression, while maintaining a comfortable distance can signal trust, even in the absence of overt positive cues. These systems struggle to account for the subtle, context-dependent nuances that humans effortlessly process, leading to misinterpretations and hindering the development of truly socially intelligent agents. The difficulty lies not simply in recognizing individual signals, but in deciphering how these signals dynamically combine to convey meaning, a challenge demanding more sophisticated computational models and significantly larger, richly annotated datasets.
The creation of truly effective Socially Interactive Agents (SIAs) hinges on their capacity to accurately interpret the subtle language of human interaction, and this interpretation directly impacts user experience. Beyond simply recognizing facial expressions or body postures, successful SIAs must synthesize these signals – along with proxemics, vocal tone, and contextual cues – to gauge genuine engagement and respond appropriately. Misinterpreting these signals can lead to awkward or frustrating interactions, hindering the development of trust and rapport. Consequently, research focuses on refining algorithms that move beyond basic recognition to achieve a nuanced understanding of social cues, ultimately fostering more natural, satisfying, and productive relationships between humans and artificial intelligence.

Capturing the Unspoken: A Video-Based Observation
Video data was utilized to analyze nonverbal communication during human-robot interaction. The Furhat Robot, a socially interactive robot platform, was employed to capture these interactions in a manner designed to elicit naturalistic behavior from participants. This platform facilitates the recording of facial expressions, head movements, and body language, providing a dataset representative of real-world social cues. The robot’s expressive capabilities also allow for controlled stimulus presentation, enabling researchers to investigate responses to specific nonverbal prompts. Data captured via the Furhat platform forms the basis for subsequent analysis using computer vision techniques.
The system utilizes the YOLO (You Only Look Once) algorithm for real-time person detection within video streams, enabling accurate bounding box identification of individuals present in the scene. Complementing this, MediaPipe provides precise facial landmark detection, identifying key points on the face – including eyes, mouth, and nose – to track facial movements and expressions. These landmarks are represented as 2D coordinates, allowing for quantitative analysis of facial features and their changes over time. The combination of YOLO and MediaPipe ensures robust person identification and detailed facial data extraction, forming the foundation for subsequent behavioral analysis.
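A minimal sketch of this detection stage is shown below, assuming the ultralytics YOLO package and MediaPipe's Face Mesh solution; the exact models, versions, and configuration used in the study are not specified, so these choices are illustrative.

```python
# Sketch of per-frame person detection and facial landmark extraction.
# Assumes the ultralytics YOLO package and MediaPipe Face Mesh; the paper
# does not name specific model versions, so these are illustrative choices.
import cv2
import mediapipe as mp
from ultralytics import YOLO

person_detector = YOLO("yolov8n.pt")                      # pretrained COCO model; class 0 = person
face_mesh = mp.solutions.face_mesh.FaceMesh(max_num_faces=1)

cap = cv2.VideoCapture("interaction.mp4")                 # hypothetical recording of one session
frames = []
while True:
    ok, frame = cap.read()
    if not ok:
        break

    # Person bounding boxes as (x1, y1, x2, y2) for everyone in the scene.
    boxes = person_detector(frame, classes=[0], verbose=False)[0].boxes.xyxy

    # 2D facial landmarks, normalized to [0, 1] image coordinates.
    result = face_mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    landmarks = []
    if result.multi_face_landmarks:
        landmarks = [(lm.x, lm.y) for lm in result.multi_face_landmarks[0].landmark]

    frames.append({"boxes": boxes, "landmarks": landmarks})
cap.release()
```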
The py-feat library was instrumental in quantifying facial expressions from video data, enabling analysis of Action Units (AUs) as defined by the Facial Action Coding System. This allowed for the detection and measurement of specific muscle movements associated with emotions, such as AU1 for inner brow raising, AU6 for cheek raising, and AU12 for lip corner pulling. By tracking AU intensity over time, the system generated a dynamic profile of facial activity, providing a granular data stream used to infer user emotional states like happiness, sadness, anger, or surprise. The library’s capabilities included automated AU detection, as well as the calculation of metrics like facial expression symmetry and the overall intensity of displayed emotions, providing a robust dataset for subsequent analysis and modeling.
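As a rough sketch, py-feat's Detector can produce per-frame Action Unit and emotion estimates directly from a video file; the file name below is a placeholder, and exact column layouts can vary between py-feat versions.

```python
# Sketch of Action Unit extraction with py-feat; detector settings and output
# column names may differ between library versions, so treat this as illustrative.
from feat import Detector

detector = Detector()                               # default face, AU, and emotion models
fex = detector.detect_video("interaction.mp4")      # hypothetical session recording

aus = fex.aus                # per-frame AU intensities, e.g. AU01, AU06, AU12
emotions = fex.emotions      # per-frame emotion probabilities (happiness, anger, ...)

# A simple per-interaction summary: mean and variability of each AU over time.
au_profile = aus.agg(["mean", "std"])
print(au_profile)
```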
From Data to Insight: Feature Engineering and Classification
Feature engineering combined manually designed features with features extracted automatically from the time series data. The tsfresh and catch22 libraries were used for the automated extraction, computing a large number of time-series characteristics such as statistical measures, complexity estimates, and change-related features. Handcrafted features were designed from domain expertise and hypothesized relationships with user satisfaction. Both approaches produced feature vectors that served as inputs to the machine learning classifiers.
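The sketch below illustrates both automated extractors on a toy long-format table; the column names and signal values are placeholders rather than the study's actual data layout.

```python
# Sketch of automated time-series feature extraction, assuming nonverbal signals
# (e.g. AU intensities, interpersonal distance) are stored in long format;
# column names and values here are illustrative, not taken from the paper.
import pandas as pd
import pycatch22
from tsfresh import extract_features

# One row per frame: which interaction it belongs to, a time index, and a signal value.
signals = pd.DataFrame({
    "interaction_id": [0, 0, 0, 1, 1, 1],
    "time":           [0, 1, 2, 0, 1, 2],
    "au12_intensity": [0.1, 0.4, 0.6, 0.0, 0.1, 0.1],
})

# tsfresh: hundreds of statistical, complexity, and change-based features per series.
tsfresh_features = extract_features(
    signals, column_id="interaction_id", column_sort="time"
)

# catch22: a fixed set of 22 canonical time-series features for a single series.
series = signals.loc[signals.interaction_id == 0, "au12_intensity"].tolist()
c22 = pycatch22.catch22_all(series)
catch22_features = dict(zip(c22["names"], c22["values"]))
```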
Several machine learning classifiers were employed to evaluate the effectiveness of the extracted features. Random Forest, a versatile ensemble learning method, was utilized for its ability to handle high-dimensional data and reduce overfitting. Support Vector Machines (SVM) were implemented to identify optimal hyperplanes for classification, leveraging kernel functions to map data into higher-dimensional spaces. Logistic Regression, a linear model, provided a probabilistic assessment of class membership, offering interpretable results and serving as a baseline for comparison against more complex algorithms. Performance was assessed using standard metrics such as accuracy, precision, recall, and F1-score to determine the optimal classification approach for user satisfaction prediction.
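A minimal sketch of this evaluation loop with scikit-learn follows; the feature matrix, labels, split, and hyperparameters are placeholders, not settings reported in the paper.

```python
# Sketch of the classification stage with scikit-learn; data, split, and
# hyperparameters below are illustrative placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X: one feature vector per interaction (tsfresh / catch22 / handcrafted features);
# y: binary satisfaction label (e.g. 1 = satisfied, 0 = low satisfaction).
rng = np.random.default_rng(0)
X = rng.normal(size=(46, 20))          # placeholder: 46 interactions x 20 features
y = rng.integers(0, 2, size=46)        # placeholder satisfaction labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)

classifiers = {
    "random_forest": RandomForestClassifier(n_estimators=200),
    "svm": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name)
    print(classification_report(y_test, clf.predict(X_test)))  # precision, recall, F1
```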
A user satisfaction classification study was conducted with 46 participants, each interacting with the system for an average of 2 minutes and 12 seconds. Results indicated that the tsfresh feature extraction method achieved the highest classification accuracy at 97.8%. Machine learning models trained with tsfresh features also yielded strong performance: Random Forest and Naive Bayes both reached 92% accuracy. The catch22 feature extraction approach resulted in a range of 80.4% to 91.3% accuracy, demonstrating its viability as an alternative method for classifying user satisfaction.
Beyond Accuracy: Towards Truly Empathetic Interactions
The efficacy of sophisticated Socially Intelligent Agents (SIAs) hinges on their ability to perceive and interpret the subtle cues of human communication, extending beyond merely processing spoken words. This approach centers on accurately decoding nonverbal signals – encompassing facial expressions, body language, and interpersonal distance – to create a more nuanced understanding of a user’s state. By leveraging these indicators, the system dynamically adjusts its responses, moving beyond pre-programmed scripts to offer interactions that feel intuitively appropriate and responsive. This enhanced perception allows the SIA to recognize signs of frustration, confusion, or engagement, prompting tailored assistance or adjustments to the interaction style – ultimately fostering a more natural and effective dialogue between human and machine.
Sophisticated Socially Intelligent Agents (SIAs) are increasingly capable of dynamically adjusting their behavior based on a nuanced comprehension of the user’s current state. This responsiveness extends beyond simply acknowledging expressed needs; it involves interpreting subtle cues to proactively tailor interactions for optimal user experience. By discerning factors like user frustration, confusion, or engagement, SIAs can modify parameters such as response speed, complexity of language, or even the overall tone of the conversation. Such adaptive behavior fosters a sense of personalized connection, ultimately leading to demonstrably higher levels of user satisfaction and a more positive, productive interaction. The capacity to move beyond standardized responses towards truly empathetic communication represents a significant advancement in human-computer interaction.
The research indicates a significant opportunity to move beyond purely functional Socially Intelligent Agents (SIAs). Statistical analysis, specifically a Cronbach’s Alpha of 0.78 achieved with the user satisfaction scale, confirms the reliability of the assessment and supports the notion that these agents can elicit genuine positive responses from users. This level of internal consistency within the satisfaction metrics suggests that improvements in decoding nonverbal cues and adapting interactions aren’t merely increasing efficiency, but are fostering a sense of connection and engagement. Consequently, the development of SIAs capable of recognizing and responding to emotional states promises a future where technology isn’t just helpful, but also genuinely considerate and enjoyable to interact with.
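For reference, Cronbach's Alpha quantifies how consistently the items of a questionnaire measure the same construct; the sketch below implements the standard formula on placeholder ratings, since the actual scale items and responses are not reproduced here.

```python
# Sketch of the standard Cronbach's Alpha computation; the ratings below are
# placeholders, not data from the study.
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """scores: participants x items matrix of scale ratings."""
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1).sum()
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

ratings = np.array([[4, 5, 4], [2, 2, 3], [5, 5, 5], [3, 4, 3]])  # hypothetical
print(round(cronbach_alpha(ratings), 2))
```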
The pursuit of automated user satisfaction classification, as detailed in this work, aligns with a fundamental principle of efficient understanding. It distills complex interactions – the nuanced flow of social signals – into quantifiable data, enabling objective assessment without reliance on subjective human annotation. This echoes David Hilbert’s assertion: “We must be able to answer yes or no to any definite question.” The research seeks to formulate a ‘definite question’ – is the user satisfied? – and provide a definitive, data-driven answer, stripping away ambiguity through rigorous time series analysis and feature engineering. The core concept of eliminating manual annotation demonstrates a commitment to precision and automation, furthering the possibilities of truly responsive human-robot interaction.
What Lies Ahead?
The demonstrated feasibility of automated satisfaction assessment represents a reduction, not a resolution. The current work skillfully navigates the problem of annotation – a costly and subjective bottleneck – but merely shifts the burden to feature engineering. The true challenge isn’t recognizing that satisfaction changed, but understanding why. The signal itself is cheap; the interpretation, profoundly expensive. Future iterations must move beyond correlating movement with valence, and towards modeling the causal pathways linking robotic action to human emotional response.
A persistent limitation lies in the assumption of universality. Social signals, while appearing fundamental, are demonstrably culturally modulated. A smile, a nod, even sustained eye contact – these are not absolute metrics, but negotiated behaviours. To deploy such systems effectively requires acknowledging – and quantifying – this variability. The pursuit of a ‘one-size-fits-all’ model will invariably introduce error, masking genuine dissatisfaction with culturally-induced misinterpretations.
Ultimately, the goal should not be to perfectly read human emotion, but to design interactions that minimize the need for such readings. A truly intelligent system anticipates user needs, preventing dissatisfaction before it manifests. This requires a shift in focus: from passive observation to proactive adaptation – from decoding signals to shaping circumstances. The most elegant solution, as always, is to eliminate the noise altogether.
Original article: https://arxiv.org/pdf/2512.03945.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/