The Art of the Approach: Teaching Robots When to Talk

Author: Denis Avetisyan


Researchers have developed a system that allows robots to learn the best moments to initiate conversations with people, leading to more natural and engaging interactions.

The system anticipates inevitable failures in interaction initiation, structuring itself not as a rigid command sequence but as a flow diagram: a network of potential breakdowns and recovery paths, recognizing that any attempt at perfect control merely prefigures the points at which the system will inevitably unravel.

A novel Interaction Initiation System utilizes machine learning and time series analysis of human pose and language to predict optimal conversational openings in a real-world museum setting.

Successfully initiating interactions remains a challenge for social robots, often resulting in awkward or ineffective communication. This paper, ‘When to Say “Hi” – Learn to Open a Conversation with an in-the-wild Dataset’, introduces the Interaction Initiation System (IIS), a machine learning approach designed to predict optimal conversation start times based on human body language. Through a field study at the Deutsches Museum Bonn with N = 201 user interactions, we demonstrate the IIS’s ability to accurately identify both the appropriate greeting period and the individual likely to initiate conversation. Could this system pave the way for more natural and engaging human-robot interactions in public spaces?


The Illusion of Timing: Anticipating the Human Rhythm

Despite advancements in robotics, achieving truly natural interaction remains elusive, often hampered by a robot’s inability to time its responses effectively. Current systems frequently initiate communication or action at moments that feel awkward or intrusive to a human partner, disrupting the flow of conversation or task completion. This isn’t a matter of slow processing speed, but rather a deficit in anticipating when a response is welcome – a nuanced skill humans effortlessly employ. A robot might offer assistance before a person has fully formulated a need, or interject into a thought before it’s complete, creating a disjointed experience. This mistiming stems from a reliance on reacting to explicit cues, rather than proactively forecasting a human’s intentions and readiness for engagement, ultimately hindering the development of truly seamless and acceptable human-robot collaboration.

The difficulty in anticipating a human’s next move represents a significant hurdle in creating truly natural robotic interactions. Current systems frequently struggle to discern subtle cues indicating a person’s willingness to engage, often initiating communication at inopportune moments. This isn’t simply a matter of technical lag; it’s a failure to model the complex cognitive states underpinning human behavior. A person might be momentarily lost in thought, focused on a task, or simply not desiring conversation, and a robot lacking this understanding risks interrupting or intruding. Consequently, even technically proficient robots can feel clumsy and unsettling, hindering the development of trust and seamless collaboration, as effective communication relies heavily on respecting these unstated boundaries and unspoken signals of readiness.

Predictive systems are crucial for robots to move beyond simple reactivity and achieve truly natural dialogue with humans. Current research focuses on developing algorithms that analyze subtle cues – such as gaze direction, body posture, and even micro-expressions – to anticipate a person’s next action or communicative intent. These systems employ machine learning models, often trained on vast datasets of human interaction, to forecast the precise moment a person is receptive to robotic input. By anticipating needs and responding proactively, rather than reactively, robots can avoid interrupting ongoing tasks or delivering information at inopportune times, fostering a sense of seamless collaboration and significantly improving the overall user experience. The ultimate goal is to create interactions where the robot feels less like a tool and more like a perceptive partner, capable of understanding and responding to unspoken cues.

Robotic interactions lacking predictive capability often result in experiences perceived as clumsy or even unwelcome by humans. The sensation of a robot acting out of sync with natural conversational rhythms can create a significant disconnect, leading to discomfort and diminished trust. This isn’t merely a matter of politeness; poorly timed robotic responses demand additional cognitive effort from the human participant, who must actively adjust to the machine’s pace rather than engaging effortlessly. Consequently, usability suffers, and the potential for seamless integration into daily life is undermined, as individuals may subconsciously avoid or resist interacting with a system that fails to anticipate their needs and respond at appropriate moments.

The experimental setup is located within a designated interaction area inside the Deutsches Museum Bonn, as viewed from the entrance.

Orchestrating Engagement: The Interaction Initiation System

The Interaction Initiation System employs a sequential methodology for anticipating user receptivity to robotic interaction. This process begins with forecasting human pose and activity, followed by classification of predicted actions into categories informing the robot’s subsequent behavior. The system then selects from a defined set of responses – Wait, Speak, or Listen – based on this classified prediction. This multi-stage approach allows the robot to proactively assess the user’s state and initiate interactions at moments deemed most likely to be well-received, rather than reacting to explicit cues.
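As a rough illustration, the staged decision flow described above can be sketched in a few lines of Python, with the forecasting and classification steps stubbed out. All function names and the feature dictionary here are hypothetical stand-ins, not the paper's actual implementation:

```python
from enum import Enum

class Action(Enum):
    WAIT = "wait"
    SPEAK = "speak"
    LISTEN = "listen"

def forecast_pose(pose_history):
    """Stub: a real system would run a time-series model over the pose history."""
    return pose_history[-1]  # naive persistence forecast

def classify_timing(forecast):
    """Stub timing classifier: True when the moment looks opportune."""
    return forecast.get("facing_robot", False)

def select_action(forecast):
    """Stub action classifier mapping a forecast to Wait / Speak / Listen."""
    if not classify_timing(forecast):
        return Action.WAIT
    return Action.LISTEN if forecast.get("speaking", False) else Action.SPEAK

# Example: a visitor facing the robot and not speaking -> robot should speak.
history = [{"facing_robot": True, "speaking": False}]
print(select_action(forecast_pose(history)))  # Action.SPEAK
```

The point of the staged structure is that each classifier can fail independently, with Wait as the safe default whenever the timing looks wrong.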

The Timing Classifier is a core component of the Interaction Initiation System, responsible for predicting opportune moments for robotic interaction. This prediction is achieved by analyzing forecasted human poses, enabling the system to proactively initiate engagement. Performance was evaluated using a weighted F1-score of 74% on a dedicated test dataset, indicating the classifier’s ability to balance precision and recall in identifying suitable initiation times. This metric assesses the system’s accuracy in determining when a user is receptive to interaction, minimizing disruptive or unwanted engagement attempts.
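For context, a weighted F1-score averages per-class F1 values weighted by each class's support, so imbalanced classes (e.g. far more "not opportune" moments than "opportune" ones) are handled fairly. A minimal stdlib computation, with made-up counts rather than the paper's data:

```python
def f1(tp, fp, fn):
    """Per-class F1 from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def weighted_f1(per_class):
    """per_class: list of (support, tp, fp, fn) tuples, one per class."""
    total = sum(support for support, *_ in per_class)
    return sum(support * f1(tp, fp, fn) for support, tp, fp, fn in per_class) / total

# Two illustrative classes: "opportune" (support 40), "not opportune" (support 60).
print(round(weighted_f1([(40, 30, 10, 10), (60, 50, 10, 10)]), 3))  # 0.8
```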

The Action Classifier determines the optimal robotic response – Wait, Speak, or Listen – based on predictions of human behavior. This component utilizes a Support Vector Machine (SVM) and has demonstrated 75.3% accuracy in classifying the appropriate action. Input to the SVM consists of features derived from the forecasted human pose and contextual information, enabling the robot to select a response designed to maximize engagement. The classifier’s output directly influences the robot’s subsequent action, facilitating a more natural and responsive interaction.
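A scikit-learn SVM on toy data gives a feel for this component. The single "engagement" feature and the class labels below are illustrative only; the paper's actual feature set is derived from forecasted pose and context:

```python
from sklearn.svm import SVC

# Toy stand-in feature: one engagement score per observation (hypothetical).
X = [[0.0], [0.1], [0.2], [0.8], [0.9], [1.0]]
y = ["wait", "wait", "wait", "speak", "speak", "speak"]

clf = SVC(kernel="linear")
clf.fit(X, y)
print(clf.predict([[0.05], [0.95]]))  # ['wait' 'speak']
```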

The Interaction Initiation System operates on a proactive model, aiming to begin robotic interaction only when a positive user response is probable. This is achieved by forecasting user behavior and preemptively selecting an appropriate robotic action – Wait, Speak, or Listen – based on predicted states. The system avoids unsolicited interaction by continuously assessing the likelihood of a welcomed response, utilizing a multi-stage classification process to determine optimal initiation timing and action selection. This approach prioritizes user experience by minimizing disruptive or unwanted engagement, and relies on data-driven predictions to maximize the potential for successful communication.

The timing classifier utilizes a detailed architecture to accurately categorize temporal patterns.

The Algorithmic Skeleton: A Technical Pipeline

Body Landmark Extraction is the initial step in our system, leveraging the MediaPipe library to perform multi-person pose estimation from video input. MediaPipe identifies 33 2D keypoints representing various anatomical locations – including joints and extremities – on each person detected in a frame. These landmarks are output as normalized coordinates, ranging from 0 to 1, relative to the image dimensions, providing a consistent and scalable representation of human pose. The system is capable of processing standard video formats and frame rates, delivering landmark data in real-time for subsequent processing stages.
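Because MediaPipe emits normalized coordinates, each landmark must be scaled by the frame dimensions before any pixel-space reasoning. A small helper illustrating the convention (hypothetical code, not taken from the paper):

```python
def to_pixels(landmark, width, height):
    """Convert a MediaPipe-style normalized (x, y) landmark to pixel coordinates."""
    x, y = landmark
    return (round(x * width), round(y * height))

def in_frame(landmark):
    """Normalized coordinates outside [0, 1] indicate an off-frame estimate."""
    x, y = landmark
    return 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0

print(to_pixels((0.5, 0.25), 1920, 1080))  # (960, 270)
print(in_frame((0.5, 1.2)))                # False
```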

Following body landmark extraction, a feature extraction process transforms the raw landmark coordinates into a format compatible with the Human Pose Forecasting Model. This involves calculating joint angles and relative positions, normalizing the data to a consistent scale, and potentially applying dimensionality reduction techniques. The resulting feature vector represents the skeletal pose at each time step, providing a concise and informative input for the BlockRNN. This pre-processing step is crucial for improving model performance and ensuring stable predictions by reducing noise and highlighting relevant pose information.
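The joint-angle computation mentioned above reduces to basic vector geometry on landmark triples. A stdlib sketch (the specific joints chosen are an assumption for illustration):

```python
import math

def joint_angle(a, b, c):
    """Angle at joint b (degrees) formed by points a-b-c, e.g. shoulder-elbow-wrist."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    cos_theta = max(-1.0, min(1.0, dot / (n1 * n2)))  # clamp rounding error
    return math.degrees(math.acos(cos_theta))

# A right angle at the elbow:
print(joint_angle((0, 1), (0, 0), (1, 0)))  # 90.0
```

Angles computed this way are invariant to translation and scale, which is exactly why they make better model inputs than raw coordinates.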

The Human Pose Forecasting Model utilizes a BlockRNN architecture, a type of recurrent neural network specifically designed for time series prediction. This model takes sequences of body landmark data as input and forecasts future body positions. Performance is quantified using the Root Mean Squared Error (RMSE), a standard metric for evaluating the accuracy of continuous predictions; the model achieves an RMSE of 0.0426 on the evaluation dataset. This value represents the average magnitude of the error between predicted and actual body joint positions, normalized by the scale of the input data, indicating a high degree of accuracy in forecasting human pose over short time horizons.
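The RMSE metric itself is straightforward to compute over flattened coordinate sequences; a stdlib sketch with made-up numbers (the 0.0426 above is the paper's reported value, not reproduced here):

```python
import math

def rmse(predicted, actual):
    """Root mean squared error over flattened coordinate sequences."""
    assert len(predicted) == len(actual)
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)
    )

# Toy normalized coordinates for two predicted vs. observed joints.
pred = [0.50, 0.25, 0.52, 0.27]
obs = [0.48, 0.26, 0.55, 0.25]
print(round(rmse(pred, obs), 4))  # 0.0212
```

Because the landmarks are normalized to [0, 1], an RMSE on this scale corresponds to roughly 4% of the frame dimension per joint.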

Person detection is implemented using YOLOv5, a convolutional neural network optimized for speed and accuracy. This system identifies and localizes individuals within the video stream, providing bounding box coordinates for each detected person. Critically, subsequent pose estimation and forecasting processes are then constrained to operate only on individuals present within these defined bounding boxes, effectively limiting analysis to the designated interaction area and reducing computational load from irrelevant background elements. The YOLOv5 implementation achieves a mean average precision (mAP) of 0.68 on the COCO dataset, providing reliable person detection performance for this application.
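The gating step, restricting pose analysis to points inside detected person boxes, can be sketched as a simple geometric filter (hypothetical helper names; in practice the boxes would come from the YOLOv5 detector):

```python
def inside(box, point):
    """box: (x1, y1, x2, y2); point: (x, y). All in pixel coordinates."""
    x1, y1, x2, y2 = box
    x, y = point
    return x1 <= x <= x2 and y1 <= y <= y2

def gate_landmarks(person_boxes, landmarks):
    """Keep only landmarks that fall inside some detected person box."""
    return [p for p in landmarks if any(inside(b, p) for b in person_boxes)]

boxes = [(100, 50, 300, 400)]        # one detected visitor
points = [(150, 200), (500, 500)]    # the second point lies outside every box
print(gate_landmarks(boxes, points))  # [(150, 200)]
```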

Beyond the Lab: Validation and Ethical Considerations

The Interaction Initiation System underwent rigorous testing not within a controlled laboratory, but amidst the bustling exhibits of the Deutsches Museum Bonn. This deployment allowed researchers to capture a unique dataset reflecting the spontaneity and unpredictability of genuine human-robot encounters. By observing how individuals reacted to the robot’s attempts at engagement in a public setting, the system’s algorithms were exposed to a breadth of behaviors and conversational cues far exceeding those achievable in simulated environments. The resulting data, a comprehensive record of these interactions, proved invaluable in refining the system’s ability to accurately predict opportune moments for initiating conversation, ultimately bridging the gap between robotic responsiveness and natural social exchange.

The resulting Video Dataset proved instrumental in iteratively improving the Interaction Initiation System’s performance. Analyzing recordings of genuine interactions allowed researchers to pinpoint instances where the system accurately predicted opportune moments for engagement, as well as cases where it faltered. This granular feedback facilitated a cycle of refinement, optimizing the algorithms responsible for assessing social cues and predicting conversational openings. The dataset enabled quantitative evaluation of the system’s success rate, measured by its ability to initiate interactions that were perceived as natural and well-timed by participants. Ultimately, the availability of this real-world data moved the system beyond simulated environments and towards robust, reliable performance in dynamic social settings.

A central tenet of the Interaction Initiation System’s deployment involved a robust consent mechanism, designed to prioritize user privacy and fully comply with stringent Data Protection Regulation standards. Participants interacting with the Furhat robot at the Deutsches Museum Bonn were presented with clear, concise information regarding data collection practices – specifically, that video and audio data would be recorded for research purposes. Explicit, informed consent was obtained from each individual before any interaction commenced, ensuring voluntary participation and the right to withdraw at any time. This process wasn’t merely a formality; it was integral to the study’s ethical framework, safeguarding personal information and fostering trust in human-robot interactions within a public setting. The collected data was anonymized and securely stored, further reinforcing the commitment to responsible data handling and upholding participant rights.

The experiments relied on the expressive capabilities of the Furhat Robot, a socially interactive platform designed to facilitate realistic human-robot encounters. Deploying the Interaction Initiation System on Furhat within the Deutsches Museum Bonn allowed researchers to observe nuanced behaviors in a dynamic, public setting, moving beyond the constraints of controlled laboratory environments. This public space presented a unique opportunity to capture spontaneous interactions, revealing how individuals naturally respond to a robot’s attempts to initiate conversation – their hesitations, approvals, and non-verbal cues. The robot’s human-like appearance and ability to convey emotion were crucial in eliciting genuine reactions, providing valuable data for refining the system’s ability to predict appropriate moments for interaction and ultimately fostering more seamless and comfortable social robotics experiences.

The pursuit of an ‘Interaction Initiation System’ feels less like engineering and more like tending a garden. This work, attempting to predict the precise moment for a robot to engage, highlights a fundamental truth: timing isn’t a calculation, but a feeling. G.H. Hardy observed, “The essence of mathematics is its freedom.” Similarly, this system doesn’t build interaction; it attempts to discern patterns within the inherent chaos of human behavior. The system’s reliance on time series analysis and pose estimation reveals a desire for predictability, yet the very nature of human interaction resists such rigid control. It is a beautifully flawed endeavor, a prophecy of inevitable missteps, and a testament to the fact that scalability is merely a justification for complexity.

The Unfolding Conversation

This work, like all attempts to choreograph interaction, merely delays the inevitable drift toward complexity. The Interaction Initiation System (IIS) offers a prediction, not a solution. It maps a present moment onto a future possibility, believing that a statistically favorable time exists for utterance. But the museum, the human, the robot – all are systems growing in unpredictable directions. Each successful initiation is merely a temporary alignment, a fleeting resonance before divergence. The true challenge isn’t when to speak, but accepting that every greeting is also a goodbye, every opening a foreshadowing of closure.

The reliance on pose estimation and time series analysis feels, predictably, like building walls against the tide. The IIS models external behavior, yet the core of interaction lies in internal state: the shifting currents of attention, intention, and, ultimately, boredom. Future work will inevitably grapple with this inner landscape, perhaps through increasingly sophisticated (and increasingly fragile) models of human cognition. The question isn’t whether the robot can predict the opportune moment, but whether it can gracefully respond to the absence of one.

It is a comfort to believe one can engineer naturalness. This research, like so many before it, will discover that the most elegant architectures are often the first to buckle under the weight of real-world variance. The system will not solve conversation; it will simply become another node in the complex web of signals, another voice contributing to the beautiful, chaotic murmur of the museum. And in that, perhaps, lies a certain quiet dignity.


Original article: https://arxiv.org/pdf/2512.03991.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
