Reading Minds: Predicting Human Reactions to Robots

Author: Denis Avetisyan


New research shows that AI can accurately anticipate how people will perceive a robot’s performance in social situations, offering a path toward more natural human-robot interactions.

Large Language Models demonstrate the capacity to infer human perception of a mobile robot’s navigational competence – assessed through binary performance levels – even with limited examples provided via In-Context Learning, suggesting a potential for nuanced understanding of robotic guidance despite sparse data.

Large language models, trained with limited examples, demonstrate superior performance in predicting human perception of robot navigation compared to traditional methods.

Accurately gauging human responses to robotic behavior remains a challenge, particularly given the resource-intensive nature of traditional user studies. This limitation motivates the work presented in ‘Few-Shot Inference of Human Perceptions of Robot Performance in Social Navigation Scenarios’, which explores a novel approach leveraging the few-shot learning capabilities of Large Language Models (LLMs) to predict human perceptions of robot performance in dynamic social environments. The results demonstrate that LLMs can achieve state-of-the-art prediction accuracy with significantly less labeled data than conventional supervised learning methods, and that performance is further enhanced by incorporating personalized examples. Could this paradigm shift towards LLM-driven perception prediction pave the way for truly adaptive and user-centered robotic navigation?


The Fragility of Prediction: Navigating the Human Element

For a social robot to navigate human environments effectively, it must move beyond simple obstacle avoidance and instead predict how people will react to its movements. Traditional robotic systems, designed for predictable factory floors or warehouse spaces, lack the capacity to anticipate the nuanced, often irrational, responses of humans – a quick step to avoid a perceived collision, a change in walking direction to maintain personal space, or even a startled jump from an unexpected approach. This predictive ability isn’t about forecasting specific actions, but rather modeling the probability of various responses based on the robot’s perceived intent and the surrounding social context. Failing to account for this inherent human unpredictability results in robotic movements that feel clumsy, intrusive, or even threatening, hindering genuine interaction and limiting the robot’s acceptance in social settings.

For robots to navigate social spaces effectively, they must move beyond simply avoiding obstacles and begin to anticipate how humans will interpret their actions. Human perception isn’t solely based on what a robot does, but on judgments of its competence – whether it seems capable and reliable – its intent – the perceived goal behind the action – and, critically, how surprising that action is. A predictable, yet capable, robot fosters trust and ease of interaction, while unexpected, yet harmless, deviations from the norm can even enhance engagement. However, misinterpreting these cues – a robot appearing clumsy, acting without clear purpose, or behaving erratically – can quickly lead to discomfort, distrust, and even avoidance. Therefore, accurately modeling these nuanced human perceptions of robot behavior is not merely a technical challenge, but a fundamental requirement for creating truly seamless and positive human-robot interactions.

Current approaches to predicting human reactions to robots often fall short due to an inability to fully capture the nuanced interplay of factors influencing perception. Existing models frequently rely on simplified assumptions about how humans interpret robot behavior, struggling to account for the complex cognitive processes involved in assessing competence, intent, and, crucially, the degree of surprisingness elicited by a robot’s actions. This limitation prevents robots from genuinely adapting to dynamic social environments; instead of proactively responding to unarticulated expectations, they react after a misinterpretation occurs. Consequently, the development of truly adaptive robots – those capable of fluid, natural interaction – remains hindered by the difficulty in creating computational models that accurately reflect the intricacies of human social cognition and the often-subtle cues that govern our interactions with others.

In-context learning (ICL) enables an LLM to predict a person’s perception of a robot by leveraging demonstrations, either from diverse users or specifically from the person providing the evaluation.

Large Language Models: A Potential for Anticipatory Behavior

Recent progress in Large Language Models (LLMs) presents a viable method for anticipating human perception of robot actions. Traditionally, predicting this required explicitly programmed behavioral rules or extensive datasets linking robot movements to human responses. LLMs, pre-trained on massive text corpora, demonstrate an ability to infer human expectations regarding behavior; this capability extends to robotic actions when appropriately prompted. By framing robot behavior as a textual description – for example, “the robot approaches a person and extends its arm” – LLMs can predict likely human interpretations, such as perceived friendliness or threat. This approach bypasses the need for task-specific data collection and allows for generalization across diverse interaction scenarios, offering a significant advancement over conventional methods in human-robot interaction.
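As a concrete illustration of this framing, the sketch below renders a robot action as plain text and poses a zero-shot perception query. The OpenAI chat-completions client and model name are stand-ins for whatever LLM interface is actually used; the article does not reproduce the authors’ exact setup.

```python
# Illustrative zero-shot query: describe the robot's action in natural language
# and ask the model how a bystander would likely perceive it. The client and
# model name are stand-ins, not the paper's actual configuration.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

action = "The robot approaches a seated person from behind at walking speed and extends its arm."
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": "You predict how a human bystander perceives a robot's behavior."},
        {"role": "user", "content": f"Robot action: {action}\n"
                                    "How is this likely to be perceived (e.g. friendly, intrusive, threatening)? "
                                    "Answer in one sentence."},
    ],
)
print(response.choices[0].message.content)
```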

Few-shot learning addresses the challenge of deploying Large Language Models (LLMs) in dynamic, real-world scenarios where extensive training datasets for each specific context are impractical. Traditional machine learning approaches require substantial labeled data for each new situation; however, few-shot learning enables LLMs to generalize from only a small number of examples – typically between one and ten – to accurately predict human perception of robotic actions in novel environments. This is particularly critical for social robotics, where interactions occur within highly variable contexts and obtaining large, labeled datasets for every possible scenario is infeasible. The efficiency of few-shot learning significantly reduces the data requirements and computational cost associated with adapting LLMs to diverse social settings, facilitating more robust and flexible robot behavior prediction.

In-Context Learning (ICL) represents a paradigm shift in applying Large Language Models (LLMs) to perception prediction by eliminating the need for traditional parameter updates during training. Instead of modifying the model’s weights, ICL provides the LLM with a limited number of example demonstrations – pairings of robot behaviors and corresponding human perceptions – directly within the input prompt. The LLM then leverages its pre-trained knowledge to infer the likely human perception for a new, unseen robot behavior, based solely on the provided examples. This approach has demonstrated significant efficacy, achieving up to 92% accuracy in predicting human perceptions of robotic actions without requiring any gradient descent or backpropagation.
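The following minimal sketch shows how such demonstrations might be packed into a single prompt; the scenario wording and the GOOD/BAD labels are illustrative choices matching the binary competence levels mentioned above, not the paper’s exact template.

```python
def build_icl_prompt(demonstrations, query_description):
    """Assemble a few-shot prompt from (scenario description, human label) pairs.

    No model weights are updated: the demonstrations simply become part of the
    input text, and the LLM infers the labeling pattern at inference time.
    """
    lines = ["Rate the robot's navigational competence as GOOD or BAD.", ""]
    for description, label in demonstrations:
        lines.append(f"Scenario: {description}")
        lines.append(f"Human rating: {label}")
        lines.append("")
    lines.append(f"Scenario: {query_description}")
    lines.append("Human rating:")
    return "\n".join(lines)


# Two hypothetical demonstrations (K = 2), followed by the scenario to be rated.
demos = [
    ("Robot yields to a pedestrian at a doorway and proceeds smoothly.", "GOOD"),
    ("Robot cuts across a pedestrian's path, forcing them to stop.", "BAD"),
]
print(build_icl_prompt(demos, "Robot follows a pedestrian closely through a narrow corridor."))
```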

The prompt structure requests a large language model to evaluate robot competence based on a defined input format.

The SEAN TOGETHER v2 Dataset: Grounding Perception in Shared Experience

The SEAN TOGETHER v2 Dataset comprises multimodal data recorded during human-robot interactions, specifically focusing on scenarios involving collaborative tasks. This dataset includes synchronized recordings of robot states (joint angles, velocities), pedestrian motion capture data (position, velocity), and associated environmental sensor data. It contains over 100 unique interaction sequences, each lasting several minutes, and features a diverse range of human participants and robotic platforms. Data is provided in ROS bag format, facilitating straightforward integration with robotic simulation and machine learning pipelines. The dataset is designed to support the training and evaluation of Large Language Models (LLMs) intended for applications in human-robot collaboration, allowing for the development of models capable of understanding and predicting human behavior in shared workspaces.
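If the recordings are indeed distributed as ROS1 bag files as described above, a sketch like the one below could extract a robot trajectory from a single recording. The topic name and message type (nav_msgs/Odometry) are assumptions made for illustration.

```python
# Minimal sketch of pulling a robot trajectory out of a ROS1 bag file.
# The topic name and nav_msgs/Odometry message layout are assumptions.
import rosbag

def extract_odometry(bag_path, topic="/robot/odom"):
    """Return a list of (time_sec, x, y) samples for the given odometry topic."""
    samples = []
    bag = rosbag.Bag(bag_path)
    try:
        for _, msg, t in bag.read_messages(topics=[topic]):
            position = msg.pose.pose.position
            samples.append((t.to_sec(), position.x, position.y))
    finally:
        bag.close()
    return samples
```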

The SEAN TOGETHER v2 Dataset incorporates trajectory data consisting of time-series records of both robot and pedestrian positions, velocities, and accelerations. This data is captured in three-dimensional space and includes information on multiple interaction scenarios, such as navigation in crowded environments and collaborative task execution. The granularity of the trajectory data allows for the analysis of subtle behavioral cues, including proxemics, gaze direction, and predicted paths, which are critical for understanding the dynamics of human-robot interaction. Each trajectory is time-stamped and linked to contextual information, enabling the modeling of interaction intent and the development of predictive models for safe and efficient navigation.
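A hypothetical in-memory representation of such records might look like the sketch below; the field names, units, and identifiers are illustrative rather than the dataset’s actual schema.

```python
# Hypothetical container types for timestamped trajectory records; the schema
# and units (meters, seconds) are assumptions, not the dataset's actual format.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TrajectoryPoint:
    """One timestamped motion sample for a single agent."""
    t: float                                  # seconds since the start of the sequence
    position: Tuple[float, float, float]      # x, y, z in meters
    velocity: Tuple[float, float, float]      # m/s
    acceleration: Tuple[float, float, float]  # m/s^2

@dataclass
class InteractionTrajectory:
    """A robot or pedestrian trajectory within one interaction sequence."""
    agent_id: str       # e.g. "robot" or "pedestrian_03" (hypothetical labels)
    scenario: str       # contextual tag, e.g. "crowded_corridor"
    points: List[TrajectoryPoint]
```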

The Large Language Model (LLM) was adapted to predict human perceptions of robot behavior through prompt engineering rather than parameter updates. This involved constructing specific textual prompts that provided the LLM with contextual information about the robot’s actions and the surrounding environment. These prompts were designed to elicit reasoning from the LLM, effectively guiding it to infer what a human observer would likely perceive in the given situation. The prompts included details of the robot’s trajectory, the pedestrian’s movements, and relevant environmental factors, allowing the LLM to generate predictions about human understanding and potential reactions to the robot’s behavior. Variations in prompt structure and content were tested to optimize the LLM’s predictive accuracy and consistency.
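One way such a prompt might be assembled is sketched below; the wording and the binary GOOD/BAD question are illustrative, not the exact template used in the paper.

```python
def build_perception_prompt(robot_summary, pedestrian_summary, context_notes):
    """Assemble an illustrative prompt asking an LLM to rate perceived robot competence.

    The field names and question wording are assumptions; only the general
    structure (trajectory, pedestrians, environment, binary rating) follows
    the description above.
    """
    return (
        "You are observing a mobile robot navigating around pedestrians.\n"
        f"Robot trajectory: {robot_summary}\n"
        f"Nearby pedestrians: {pedestrian_summary}\n"
        f"Environment: {context_notes}\n"
        "Question: Would a human observer rate the robot's navigational "
        "competence as GOOD or BAD? Answer with a single word."
    )

# Example usage with hand-written summaries.
print(build_perception_prompt(
    "moves forward at 0.8 m/s, then slows and veers left to keep 1.2 m clearance",
    "one pedestrian approaching head-on at 1.3 m/s",
    "narrow indoor corridor with moderate foot traffic",
))
```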

A Random Forest model served as a comparative baseline against the Large Language Model (LLM) approach for predicting human perceptions in interactive scenarios. Quantitative analysis demonstrated the LLM significantly outperformed the Random Forest model in accuracy. Moreover, model performance was substantially improved when the LLM was prompted with four personalized demonstrations; this increase in accuracy was statistically significant ($p < 0.0001$), indicating a high level of confidence in the observed improvement due to the personalized data.
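For reference, the sketch below shows the general shape of such a Random Forest baseline using scikit-learn; the synthetic features and labels are placeholders for features actually computed from the dataset’s trajectories.

```python
# Shape of a Random Forest baseline for binary perception labels.
# Features and labels here are synthetic placeholders, not the paper's data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Stand-in features (e.g. mean speed, min pedestrian distance, path smoothness).
X = rng.normal(size=(500, 3))
# Stand-in binary "good/bad competence" labels with some noise.
y = (X[:, 1] - 0.5 * X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print("baseline accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```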

Beyond the Horizon: Towards Truly Adaptive Interactions

Large language models, despite their impressive capabilities, are constrained by a finite context window – a limit to the amount of input data they can effectively process at once. This limitation necessitates innovative strategies for efficiently encoding and selecting the most relevant information. Researchers are actively exploring techniques like data summarization, key phrase extraction, and hierarchical data structures to compress information without significant loss of meaning. Furthermore, methods for intelligently retrieving and prioritizing data based on its relevance to the current task are crucial; simply providing more data isn’t necessarily better if the model cannot discern the signal from the noise. Optimizing data input in this manner isn’t merely about circumventing a technical hurdle, but about fundamentally improving the model’s ability to learn, reason, and ultimately, perform complex tasks within realistic constraints.
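As a toy example of relevance-based selection under a fixed context budget, the sketch below ranks candidate demonstration descriptions against a query scenario using TF-IDF similarity; a deployed system might use learned embeddings instead, and none of this is drawn from the paper itself.

```python
# Toy relevance-based demonstration selection under a fixed context budget.
# TF-IDF similarity is a simple stand-in for learned embeddings.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_demonstrations(candidates, query, k=4):
    """Return the k candidate descriptions most similar to the query scenario."""
    vectorizer = TfidfVectorizer().fit(candidates + [query])
    scores = cosine_similarity(
        vectorizer.transform([query]), vectorizer.transform(candidates)
    ).ravel()
    ranked = sorted(zip(scores, candidates), reverse=True)
    return [text for _, text in ranked[:k]]

# Example: keep the two candidates closest to a corridor-following scenario.
pool = [
    "Robot yields to a pedestrian at a doorway.",
    "Robot follows a pedestrian through a narrow corridor.",
    "Robot waits at an elevator while people exit.",
]
print(select_demonstrations(pool, "Robot trails a person in a tight hallway.", k=2))
```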

The nuanced understanding of social interactions benefits greatly from the integration of multimodal data streams beyond simple text. Research demonstrates that incorporating visual cues – such as facial expressions and body language from video – alongside spatial information regarding proximity and orientation significantly enhances a model’s ability to interpret intent and context. This approach moves beyond relying solely on linguistic content, allowing the system to decipher subtleties often conveyed non-verbally. By processing these combined data types, models can more accurately predict actions, recognize emotional states, and ultimately respond in a more appropriate and human-like manner, leading to more effective communication and collaboration between humans and artificial intelligence.

Recent progress in multimodal data integration is enabling robots to exhibit increasingly adaptive behavior, shifting away from pre-programmed routines towards responses tailored to individual user perceptions. Studies demonstrate that personalized demonstrations – providing the robot with examples specific to a user’s preferences – significantly enhance performance, with accuracy improvements reaching statistical significance ($p < 0.0001$). Notably, a configuration utilizing $K = 68$ demonstrations proved particularly effective in specific scenarios, suggesting an optimal balance between learning from personalized data and generalizing to novel situations. This ability to dynamically adjust behavior not only improves task completion but also fosters more natural and intuitive interactions, promising a future where robots seamlessly integrate into human environments and collaborate with users in a truly responsive manner.

The study reveals a fascinating dynamic: the capacity of Large Language Models to anticipate human judgment in complex social scenarios. This echoes Bertrand Russell’s observation that, “The difficulty lies not so much in developing new ideas as in escaping from old ones.” The model’s success isn’t merely about processing data; it’s about circumventing ingrained expectations of how robots should behave, and instead, predicting how humans will perceive that behavior. Just as architecture requires an understanding of its historical context to endure, this research demonstrates that effective human-robot interaction necessitates a model capable of learning and adapting to nuanced human perceptions – a system that, while technologically novel, must acknowledge the enduring principles of social navigation.

What’s Next?

The demonstrated capacity of Large Language Models to anticipate human perception in dynamic social spaces is, predictably, a transient advantage. Any improvement ages faster than expected; the fidelity of these predictions will inevitably degrade as the implicit contract between model and observer shifts. The current reliance on ‘few-shot’ learning, while effective, merely delays the inevitable need for a more fundamental understanding of how humans encode, and react to, robotic behavior. The personalized prompting shows promise, but reveals a deeper issue: perception is not a static property of action, but a negotiation – a continually recalibrated assessment based on past interactions.

Future work must address the inherent instability of this predictive system. The challenge isn’t simply refining the model’s accuracy, but acknowledging that ‘correct’ prediction is a moving target. A robust system will require mechanisms for continuous learning, not just from explicit feedback, but from subtle shifts in human response – a form of implicit recalibration. This necessitates moving beyond isolated scenarios; the true test lies in maintaining predictive coherence across extended interactions, accounting for the accumulation of perceptual ‘debt’.

Ultimately, the pursuit of accurate perception prediction is a journey back along the arrow of time – an attempt to reconstruct the complex history of human-robot interaction. Rollback is inevitable, as the model’s internal representation diverges from the lived experience. The longevity of any approach will be determined not by its initial accuracy, but by its capacity to gracefully accommodate – and learn from – that decay.


Original article: https://arxiv.org/pdf/2512.16019.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
