Beyond the Smile: Reading True Emotion with Eyes and Faces

Author: Denis Avetisyan


Researchers have created a new dataset and model that combine facial expressions and eye movements to achieve more accurate emotion recognition.

The EMER dataset, a multimodal, participant-rich collection with multi-view annotations, presents a new avenue for researching the discrepancy between expressed and felt emotion.

A novel multimodal dataset and Transformer-based model integrate eye-tracking and facial expression analysis to bridge the gap between observed behavior and underlying emotional state.

While facial expressions are often considered primary indicators of emotion, they can be deliberately masked or socially influenced, creating a disconnect between perceived and genuine feeling. This limitation motivates the research presented in ‘Smile on the Face, Sadness in the Eyes: Bridging the Emotion Gap with a Multimodal Dataset of Eye and Facial Behaviors’, which introduces a novel dataset and model designed to integrate eye behavior with facial expression analysis. The authors demonstrate that incorporating eye movement data significantly enhances emotion recognition accuracy, bridging the gap between outward display and inner state via a Transformer-based model. Could a more comprehensive understanding of subtle eye behaviors unlock truly robust and reliable emotion recognition systems?


The Illusion of Emotional Clarity: Why Faces Tell Only Part of the Story

Emotion recognition systems commonly prioritize analyzing facial expressions, a practice that inadvertently diminishes the significance of gaze patterns as crucial emotional indicators. While expressions offer valuable insight, they can be consciously controlled or culturally influenced, potentially misleading automated assessments. Research demonstrates that where a person looks – their gaze direction and patterns – often reveals underlying emotional states with greater fidelity, offering clues about attention, intent, and even concealed feelings. Subtle shifts in gaze, such as prolonged eye contact or avoidance, can signal discomfort, interest, or deception, nuances frequently missed by systems solely focused on facial muscle movements. Consequently, a more comprehensive understanding of human emotion requires integrating gaze data alongside traditional facial expression analysis, allowing for a more robust and accurate interpretation of affective states.

Current emotion recognition systems frequently falter because they struggle to cohesively analyze data from multiple sources – what experts term ‘multimodal’ data. While a system might process facial expressions and vocal tones, it often treats these as independent streams of information, missing crucial correlations. This fragmented approach leads to inaccuracies; a furrowed brow, for instance, could signal concentration rather than sadness if not considered alongside gaze direction and body language. Consequently, these systems exhibit limited generalizability, performing well only on the specific datasets they were trained on and failing to accurately interpret emotions in new contexts or across diverse populations. The inability to synthesize information effectively hinders the development of truly robust and reliable emotion AI, as nuanced emotional states require a holistic assessment of behavioral cues.

The accurate decoding of human emotion extends beyond simply identifying facial expressions; it necessitates a comprehensive analysis of how attention interacts with those expressions. Current research indicates that emotional states aren’t solely communicated through muscle movements, but are profoundly shaped by where a person directs their gaze. This dynamic interplay – the coordinated dance between facial action and visual attention – provides critical context often missed by traditional methods. Studies reveal that subtle shifts in gaze, such as increased attention to specific features or avoidance of direct eye contact, can significantly alter the perceived emotional intensity and even the interpretation of the expressed emotion. Consequently, a truly nuanced understanding requires computational models capable of integrating these multimodal signals – tracking both what is expressed on the face and where attention is focused – to achieve a more reliable and ecologically valid assessment of emotional experience.

The EMER dataset offers a comprehensive resource for emotion analysis, integrating facial expression videos, eye movements, and multi-view emotion annotations, including both facial expression recognition (FER) and emotion recognition (ER) labels.

EMER: A Dataset Grounded in How We Actually See

The EMER dataset consists of synchronized video recordings of facial expressions coupled with corresponding eye-tracking data. Specifically, each instance within the dataset includes a video depicting a participant’s facial expressions, a time-aligned sequence of eye movement data capturing gaze positions over time, and eye fixation maps visualizing areas of focused visual attention. This integration of visual and ocular data allows researchers to analyze emotional responses not only through overt facial displays, but also through the patterns of visual attention that accompany those expressions. The dataset provides a holistic view of emotional processing by linking observable behavior with underlying attentional mechanisms.
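
To make that structure concrete, the sketch below models a single EMER-style record and aligns raw gaze samples to video frames. The field names, array shapes, and alignment logic are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class EMERSample:
    """One hypothetical EMER-style record: synchronized face video,
    raw gaze samples, a fixation map, and multi-view labels."""
    face_frames: np.ndarray    # (T_video, H, W, 3) facial-expression frames
    gaze_points: np.ndarray    # (T_gaze, 3) columns: timestamp_ms, x_norm, y_norm
    fixation_map: np.ndarray   # (H, W) heatmap of accumulated fixations
    fer_label: int             # facial expression recognition label
    er_labels: list[int]       # emotion recognition labels from multiple annotators

def align_gaze_to_frames(sample: EMERSample, fps: float = 30.0) -> np.ndarray:
    """Bucket gaze samples into video-frame windows so each frame gets
    the mean gaze position recorded during its display interval."""
    frame_ids = (sample.gaze_points[:, 0] / 1000.0 * fps).astype(int)
    n_frames = sample.face_frames.shape[0]
    aligned = np.full((n_frames, 2), np.nan)
    for f in range(n_frames):
        pts = sample.gaze_points[frame_ids == f, 1:3]
        if len(pts):
            aligned[f] = pts.mean(axis=0)
    return aligned
```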

The EMER dataset incorporates multi-view emotion annotations obtained through assessments from multiple annotators. This approach moves beyond single-rater evaluations, enabling researchers to assess inter-rater reliability and capture the subjective nature of emotional expression. Specifically, each stimulus within the dataset is labeled with emotion categories by several independent observers, providing a more robust and nuanced understanding of the emotional content. This multi-view labeling strategy facilitates a comprehensive analysis of emotional states, accounting for variations in perception and interpretation, and allowing for the identification of consensus and disagreement in emotional assessment.
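
As a minimal illustration of how such multi-annotator labels can be analyzed, the snippet below computes Fleiss' kappa, a standard inter-rater agreement measure, over a toy vote matrix; the votes and category count are hypothetical, not drawn from EMER.

```python
import numpy as np

def fleiss_kappa(labels: np.ndarray, n_categories: int) -> float:
    """Fleiss' kappa for multi-annotator agreement.
    labels: (n_items, n_raters) integer category chosen by each rater."""
    n_items, n_raters = labels.shape
    # Count how many raters chose each category for each item.
    counts = np.zeros((n_items, n_categories))
    for j in range(n_categories):
        counts[:, j] = (labels == j).sum(axis=1)
    p_i = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()                       # observed agreement
    p_j = counts.sum(axis=0) / (n_items * n_raters)
    p_e = (p_j ** 2).sum()                   # chance agreement
    return (p_bar - p_e) / (1 - p_e)

# Toy example: 4 clips, 3 annotators, 3 emotion classes.
votes = np.array([[0, 0, 1],
                  [2, 2, 2],
                  [1, 0, 1],
                  [2, 1, 2]])
print(f"Fleiss' kappa: {fleiss_kappa(votes, n_categories=3):.3f}")
```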

Data for the EMER dataset was collected using the Tobii Pro Fusion eye-tracking device, a system known for its high temporal and spatial resolution. This device operates at a sampling rate of up to 1200 Hz, enabling precise capture of saccades and fixations. The Pro Fusion utilizes infrared illumination and corneal reflection analysis to track gaze position with an accuracy of 0.5 degrees of visual angle or less. This level of precision is crucial for detailed analysis of visual attention patterns related to emotional responses, allowing researchers to accurately map where participants focus their gaze and for how long while experiencing or observing emotional stimuli.
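
A common way to turn raw gaze samples into the fixations mentioned above is a dispersion-threshold (I-DT) procedure; the sketch below is one generic implementation, with thresholds chosen for illustration rather than taken from the authors' pipeline.

```python
import numpy as np

def detect_fixations(t_ms, x, y, max_dispersion=1.0, min_duration_ms=100.0):
    """Dispersion-threshold (I-DT) fixation detection on numpy arrays.
    A run of consecutive samples counts as a fixation if its spatial spread
    (max(x)-min(x) + max(y)-min(y)) stays under max_dispersion (degrees)
    for at least min_duration_ms."""
    fixations, start, n = [], 0, len(t_ms)
    while start < n:
        end = start
        # Grow the window until it spans the minimum duration.
        while end < n and t_ms[end] - t_ms[start] < min_duration_ms:
            end += 1
        if end >= n:
            break
        wx, wy = x[start:end], y[start:end]
        if (wx.max() - wx.min()) + (wy.max() - wy.min()) <= max_dispersion:
            # Extend the window while dispersion stays under the threshold.
            while end < n:
                wx, wy = x[start:end + 1], y[start:end + 1]
                if (wx.max() - wx.min()) + (wy.max() - wy.min()) > max_dispersion:
                    break
                end += 1
            fixations.append((t_ms[start], t_ms[end - 1],
                              x[start:end].mean(), y[start:end].mean()))
            start = end
        else:
            start += 1
    return fixations  # list of (t_start, t_end, mean_x, mean_y)
```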

Our EMERT method improves emotion recognition by effectively linking facial expressions with corresponding eye behaviors to bridge the emotion gap.

EMERT: A Transformer That Looks Where It Counts

The EMERT model is a Transformer-based architecture designed for multimodal emotion recognition, specifically integrating eye-tracking data with facial expression analysis. It utilizes a multi-task learning framework, simultaneously optimizing for both emotion recognition (ER) and facial expression recognition (FER). Adversarial learning is incorporated to enhance the robustness and generalization capabilities of the model. The core architecture employs a MER (Multimodal Emotion Recognition) Transformer, which processes both visual and eye-tracking inputs to learn correlations between gaze patterns, facial cues, and underlying emotional states. This approach allows EMERT to leverage the complementary information provided by these modalities for improved performance.
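
The sketch below captures the general shape of such a multi-task fusion model in PyTorch, assuming pre-extracted face and eye feature sequences. Layer sizes, pooling, and head definitions are illustrative, and the adversarial component is omitted; this is not the authors' EMERT implementation.

```python
import torch
import torch.nn as nn

class MultimodalEmotionTransformer(nn.Module):
    """Sketch of a multi-task fusion Transformer: face and eye feature
    sequences are projected to a shared width, concatenated along time,
    encoded jointly, then pooled into two task heads (ER and FER)."""
    def __init__(self, face_dim=512, eye_dim=64, d_model=256,
                 n_heads=4, n_layers=4, n_er=3, n_fer=7):
        super().__init__()
        self.face_proj = nn.Linear(face_dim, d_model)
        self.eye_proj = nn.Linear(eye_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.er_head = nn.Linear(d_model, n_er)    # emotion recognition head
        self.fer_head = nn.Linear(d_model, n_fer)  # facial expression recognition head

    def forward(self, face_feats, eye_feats):
        # face_feats: (B, T_face, face_dim); eye_feats: (B, T_eye, eye_dim)
        tokens = torch.cat([self.face_proj(face_feats),
                            self.eye_proj(eye_feats)], dim=1)
        fused = self.encoder(tokens).mean(dim=1)   # temporal average pooling
        return self.er_head(fused), self.fer_head(fused)

# Joint multi-task loss over both heads.
model = MultimodalEmotionTransformer()
face = torch.randn(2, 16, 512)
eyes = torch.randn(2, 120, 64)
er_logits, fer_logits = model(face, eyes)
loss = nn.functional.cross_entropy(er_logits, torch.tensor([0, 2])) \
     + nn.functional.cross_entropy(fer_logits, torch.tensor([3, 5]))
```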

The EMERT model incorporates eye movement data as a key input feature to enhance emotion recognition capabilities. By processing gaze patterns alongside other modalities, the model learns correlations between specific eye behaviors and underlying emotional states. This integration resulted in performance gains of up to 8.19% in Emotion Recognition (ER) tasks and 1.76% in Facial Expression Recognition (FER) evaluations, demonstrating the significance of eye-behavior data in improving the accuracy of affective computing systems.

In three-class Emotion Recognition (ER) evaluations, the EMERT model demonstrated performance metrics exceeding those of the Self_MM model. Specifically, EMERT achieved a Weighted Average Recall (WAR) of 59.28%, representing a 5.2% improvement over Self_MM. The Unweighted Average Recall (UAR) for EMERT was 52.62%, which is 9.73% higher than Self_MM, and the F1-score reached 55.71%, an 8.23% increase compared to the baseline model. These results indicate a substantial enhancement in the model’s ability to correctly identify and classify emotional states.
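
Assuming the definitions commonly used in emotion recognition work, WAR, UAR, and F1 can be computed with scikit-learn as sketched below; whether the reported F1 is weighted or macro is not stated here, so the weighted variant is an assumption.

```python
from sklearn.metrics import recall_score, f1_score

def emotion_metrics(y_true, y_pred):
    """WAR weights each class recall by its support (equivalent to accuracy),
    UAR averages per-class recalls equally, F1 here is the weighted F1-score."""
    war = recall_score(y_true, y_pred, average="weighted")
    uar = recall_score(y_true, y_pred, average="macro")
    f1 = f1_score(y_true, y_pred, average="weighted")
    return war, uar, f1

# Toy 3-class example.
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]
war, uar, f1 = emotion_metrics(y_true, y_pred)
print(f"WAR={war:.4f}  UAR={uar:.4f}  F1={f1:.4f}")
```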

The EMERT model’s training regimen included the addition of Gaussian Noise with a variance of 0.01 to input data. This technique promotes the development of resilient feature representations within the model, enhancing its ability to generalize to unseen data and maintain performance under noisy conditions. Quantitative results demonstrate a 2.29% reduction in performance degradation when evaluated with noise, compared to alternative training methods that do not incorporate this adversarial noise injection.
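
A minimal sketch of this kind of noise injection, assuming it is applied to input features at training time only:

```python
import torch

def add_gaussian_noise(features: torch.Tensor, variance: float = 0.01) -> torch.Tensor:
    """Perturb features with zero-mean Gaussian noise; the standard deviation
    is the square root of the requested variance."""
    return features + torch.randn_like(features) * variance ** 0.5

# Inside a training loop, e.g.: noisy_face = add_gaussian_noise(face_feats)
```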

EMERT’s attention maps reveal that the emotion recognition (ER) head focuses on facial details like eyes, mouth, and nose corners, while the facial expression recognition (FER) head prioritizes broader facial regions.

Beyond the Algorithm: What Gaze Tells Us About How We See Each Other

EMERT, a novel approach to emotion recognition, highlights the critical, yet often overlooked, connection between where a person looks and how emotions are conveyed and interpreted. The system doesn’t simply analyze facial features; it meticulously simulates human eye movements, recognizing that gaze patterns are integral to emotional expression. This explicit modeling of visual attention allows EMERT to capture subtle cues – the fleeting glances, the sustained focus on specific areas – that contribute significantly to understanding another’s emotional state. By mirroring the way humans naturally process visual information, the model demonstrates that emotional recognition isn’t solely based on what is seen, but crucially, how it is seen, revealing a nuanced interplay between visual attention and the expression of feelings.

Recent advancements in modeling human visual behavior have yielded significant improvements in automated emotion recognition. Specifically, a novel approach explicitly incorporating gaze patterns achieved a Weighted Average Recall (WAR) of 51.18% when classifying emotions into seven distinct categories on a standard facial expression recognition benchmark. This represents a substantial leap forward: the model also demonstrated an Accuracy-5 improvement of up to 23.04% on the SIMS dataset compared with the MulT model. These results underscore the critical role that subtle eye movements and attentional focus play in deciphering emotional states, suggesting that algorithms sensitive to these visual cues can markedly enhance the accuracy of emotion-detecting systems and provide valuable insights into the mechanisms of social cognition.

The demonstrated link between gaze patterns and emotion recognition extends beyond mere accuracy, suggesting a fundamental reassessment of how the brain processes social cues. Researchers can now investigate the precise cognitive mechanisms – the neural pathways and computational processes – that translate subtle shifts in eye movement into meaningful emotional interpretations. This opens possibilities for exploring how atypical gaze behaviors, observed in conditions like autism or social anxiety, contribute to difficulties in social interaction. Further study could also reveal the extent to which these cognitive processes are universal across cultures or shaped by individual experiences, ultimately providing a deeper understanding of the complex interplay between perception, emotion, and social cognition.

Correlation analyses using Pearson’s, Spearman’s, and Kendall’s coefficients reveal a strong relationship between eye movement data and both 7-class emotion recognition (ER, blue) and facial expression recognition (FER, orange), with values approaching 1 indicating a stronger correlation.
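
For reference, the three coefficients named in the caption can be computed with SciPy as sketched below; the paired series are hypothetical placeholders standing in for an eye-movement feature and a model score.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

# Hypothetical per-sample values: an eye-movement-derived feature versus
# the model's confidence for the annotated emotion class.
eye_feature = np.array([0.12, 0.45, 0.33, 0.80, 0.56, 0.91])
model_score = np.array([0.20, 0.50, 0.30, 0.75, 0.60, 0.88])

print("Pearson :", pearsonr(eye_feature, model_score)[0])
print("Spearman:", spearmanr(eye_feature, model_score)[0])
print("Kendall :", kendalltau(eye_feature, model_score)[0])
```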

The pursuit of truly understanding emotion, as this paper details with EMER and EMERT, feels…familiar. It’s a beautifully constructed system, layering eye-tracking with facial expressions, hoping to move beyond simple recognition to genuine understanding. One recalls Andrew Ng’s words: “If you can’t measure it, you can’t improve it.” This dataset is an attempt to measure the nuances often lost in static facial analysis. Yet, the history of machine learning is littered with ‘breakthrough’ datasets that solved a problem in isolation, only to reveal unforeseen limitations in production. The elegance of the Transformer architecture, while promising, will inevitably encounter the messy reality of real-world data, and the gap between recognized expression and felt emotion will likely persist, demanding yet another ‘novel’ approach. It’s a cycle, really.

The Road Ahead

The introduction of EMER and EMERT represents a predictable escalation in the feature arms race. More data, larger models: it’s a cycle. The claim of ‘bridging the emotion gap’ feels generous; it has simply created a more complex system for mapping observable behaviors to subjective labels. Someone will inevitably discover the edge cases: the subtle cultural variations, the deliberately masked expressions, the individuals for whom the training data is systematically biased. And then, the retraining begins.

The reliance on adversarial learning, while currently fashionable, should be viewed with a healthy skepticism. It’s an expensive way to complicate everything, and it often introduces new, less obvious failure modes. The question isn’t whether EMERT performs well on a benchmark (it will, initially) but how gracefully it degrades when faced with the messy, unpredictable reality of human interaction. If code looks perfect, no one has deployed it yet.

Future work will undoubtedly explore the integration of even more modalities: voice tone, body language, physiological signals. Each addition will increase the model’s complexity and, inevitably, its susceptibility to overfitting. The core problem remains that translating external signals into internal states is, at best, an educated guess. Expect more datasets, more architectures, and a slow, incremental improvement in accuracy, until the next revolutionary framework arrives, bringing with it a fresh set of problems.


Original article: https://arxiv.org/pdf/2512.16485.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2025-12-22 01:03