Author: Denis Avetisyan
Researchers have developed a tour guide robot that combines spoken explanations with coordinated gestures to draw visitors’ attention and create a more engaging museum experience.

This paper details CLIO, a robot utilizing co-speech actions and a large language model to improve visual attention guidance and user engagement in museum settings.
While audio guides offer informative museum experiences, directing visitor focus to specific exhibit details remains a challenge. This paper introduces CLIO: A Tour Guide Robot with Co-speech Actions for Visual Attention Guidance and Enhanced User Engagement, a system designed to address this limitation through coordinated audio-gestural cues. Our experiments demonstrate that CLIO effectively guides visual attention and significantly enhances user engagement compared to traditional audio-only tours, leveraging a Large Language Model to seamlessly integrate narrative with embodied actions. Could such interactive robotic guides redefine the future of museum education and visitor experience?
Early Robots: Acknowledging the Limits of Pure Mechanics
Early robotic tour guides such as RHINO and MINERVA proved that autonomous navigation within intricate spaces was achievable. These pioneering systems could reliably map, localize, and traverse environments like museums and historical sites without direct human control, excelling at the core problems of path planning and localization. Their capabilities, however, were largely limited to the physical act of guiding. Crucial elements of a successful human tour – maintaining eye contact, using gestures, adapting to visitor reactions and non-verbal cues – were absent: the robots moved around the exhibits but never truly connected with the people viewing them. The focus remained heavily weighted towards the technical challenge of ‘getting there’ rather than enriching the journey for the visitor, leaving a significant gap between functional autonomy and genuinely engaging social interaction.
Replicating the skillset of a human tour guide in a robotic system presents a significant hurdle, extending far beyond simple navigation. Effective guides don’t just convey information; they dynamically adjust their presentation based on audience engagement, employing subtle non-verbal cues like eye contact, pacing, and gesture to maintain interest and understanding. Translating these complex social dynamics – the ability to read a room, respond to individual questions, and tailor the experience accordingly – into algorithms requires more than just mapping physical space. It demands a nuanced understanding of human psychology and the development of robotic behaviors that convincingly simulate empathy, enthusiasm, and adaptability – qualities not easily quantified or programmed, and essential for creating truly engaging and memorable experiences for visitors.

Beyond Navigation: Coordinating Guidance with Gestures
Coordinated Audio-Gestural Guidance was investigated as a means of addressing the shortcomings of prior robotic guidance systems. This approach integrates verbal narration with non-verbal communication methods, specifically the use of directed gaze – establishing eye contact with the visitor – and deictic gestures such as pointing. The intent is to provide a multi-modal communication channel that enhances clarity and engagement, leveraging both auditory and visual cues to direct attention and convey information about points of interest. This differs from systems reliant solely on spoken instructions, which can be less effective in complex or noisy environments.
The Action Queue is a sequentially ordered list of commands that dictates the robot’s behavior during interaction. Each element within the queue represents a discrete action, such as a verbal instruction, gaze orientation, or physical movement. This structure is critical for maintaining coherence: actions are executed in the predefined order, preventing conflicting or abrupt transitions. The queue allows for the interleaving of multiple action types (for example, a verbal cue followed immediately by a corresponding gaze shift), creating a fluid and understandable interaction for the user. Prioritization within the queue is also possible, allowing the system to interrupt lower-priority actions with higher-priority ones, such as responding to unexpected user input or performing obstacle avoidance.
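As a rough illustration, such a queue can be modeled as a priority-aware FIFO. The sketch below is a minimal example of the idea; the action names, priority scheme, and API are illustrative assumptions, not CLIO’s actual implementation.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import Callable

# Illustrative action record; CLIO's actual command set is not specified here.
@dataclass(order=True)
class Action:
    priority: int                  # lower value = more urgent (0 = preemption)
    seq: int                       # insertion order breaks priority ties (FIFO)
    name: str = field(compare=False)
    execute: Callable[[], None] = field(compare=False)

class ActionQueue:
    """Sequentially ordered, priority-aware queue of robot actions."""
    def __init__(self):
        self._heap: list[Action] = []
        self._counter = itertools.count()

    def enqueue(self, name, execute, priority=10):
        heapq.heappush(self._heap, Action(priority, next(self._counter), name, execute))

    def preempt(self, name, execute):
        """High-priority interruption, e.g. obstacle avoidance or user input."""
        self.enqueue(name, execute, priority=0)

    def run(self):
        # Execute actions in priority order, FIFO within equal priority.
        while self._heap:
            heapq.heappop(self._heap).execute()

# Interleaving a verbal cue with a matching gaze shift and pointing gesture:
q = ActionQueue()
q.enqueue("Speak",         lambda: print("Notice the brushwork on this canvas..."))
q.enqueue("LookAtExhibit", lambda: print("Orienting gaze toward the exhibit"))
q.enqueue("Point",         lambda: print("Extending arm toward the exhibit"))
q.run()
```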
Successful implementation of Coordinated Audio-Gestural Guidance necessitated advancements in robotic perception and actuation. Specifically, the “TrackVisitor” functionality relies on real-time visitor localization using sensor data – typically a combination of depth cameras and 2D visual tracking – to maintain awareness of the user’s position within the interaction space. Concurrently, the “LookAtExhibit” capability demands precise control of the robot’s head and eye mechanisms, enabling it to dynamically orient its gaze towards points of interest and establish visual contact with the visitor, which is achieved through inverse kinematics and servo motor control.
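The “LookAtExhibit” capability ultimately reduces to pan-tilt inverse kinematics. Below is a minimal sketch under the assumption of a simple two-axis head; the frame convention and joint limits are illustrative, not taken from CLIO’s hardware.

```python
import math

def pan_tilt_to_target(x, y, z, pan_limit=math.radians(120), tilt_limit=math.radians(45)):
    """Compute pan/tilt angles (radians) that orient a two-axis head toward
    a target at (x, y, z) in the head's base frame, where x points forward,
    y left, and z up. A real controller would then drive the servos toward
    these setpoints."""
    pan = math.atan2(y, x)                  # rotation about the vertical axis
    tilt = math.atan2(z, math.hypot(x, y))  # elevation toward the target
    # Clamp to illustrative joint limits; out-of-range targets would require
    # a base rotation rather than a head motion alone.
    pan = max(-pan_limit, min(pan_limit, pan))
    tilt = max(-tilt_limit, min(tilt_limit, tilt))
    return pan, tilt

# Exhibit 2 m ahead, 1 m to the left, slightly above head height:
pan, tilt = pan_tilt_to_target(2.0, 1.0, 0.3)
print(f"pan={math.degrees(pan):.1f} deg, tilt={math.degrees(tilt):.1f} deg")
```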

Quantifying Engagement: Tracking Where Visitors Actually Look
A mobile eye tracker was utilized to quantitatively assess visual attention guidance, providing objective data on whether the robotic guide, CLIO, effectively directed participant gaze to designated exhibits. This involved measuring where participants looked and for how long, with data recorded throughout the museum tour. The eye tracker’s precision allowed for the determination of fixation points and saccades, enabling researchers to analyze the temporal sequence of visual attention. Specifically, the system tracked time to first fixation (TFF) – the duration between the robot’s deictic gesture and the participant’s initial visual focus on the intended exhibit – as a key performance indicator of attentional capture.
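For concreteness, here is a minimal sketch of how TFF might be computed from segmented fixation data; the field names and the mapping of fixations to exhibit regions are assumptions, not details of the study’s pipeline.

```python
from dataclasses import dataclass

@dataclass
class Fixation:
    onset_s: float     # fixation start, seconds from recording start
    duration_s: float
    target: str        # area of interest the fixation landed on

def time_to_first_fixation(fixations, gesture_onset_s, exhibit):
    """Seconds from the robot's deictic gesture to the first fixation on
    the intended exhibit; None if the exhibit was never fixated."""
    for fx in sorted(fixations, key=lambda f: f.onset_s):
        if fx.onset_s >= gesture_onset_s and fx.target == exhibit:
            return fx.onset_s - gesture_onset_s
    return None

# Toy trace: gesture at t=10 s, first look at the exhibit at t=12.02 s.
trace = [Fixation(9.5, 0.3, "floor"),
         Fixation(10.8, 0.4, "robot"),
         Fixation(12.02, 1.1, "duncan")]
print(time_to_first_fixation(trace, 10.0, "duncan"))  # -> 2.02
```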
Analysis of visitor gaze data revealed a substantial decrease in TFF when interacting with the ‘duncan’ exhibit: the mean dropped from 6.92 seconds in the audio-only control condition to 2.02 seconds when CLIO actively directed attention with deictic gestures, a reduction of roughly 71%. This indicates a markedly improved ability of the robotic guide to quickly and effectively draw visitor attention to points of interest, suggesting CLIO’s deictic actions successfully guided visual attention.
Subjective evaluation of the robotic tour guide was conducted using the Godspeed Questionnaire and User Engagement Scale to assess visitor perceptions and engagement levels. Statistical analysis revealed significant improvements ($p < 0.001$) in three key metrics when compared to the audio-only baseline condition. Specifically, participants rated the robot as exhibiting higher Animacy, demonstrating increased Attention to Exhibits, and reporting greater overall Tour Engagement, as measured by a composite score incorporating Focused Attention and Reward Factor. These findings indicate that the robotic intervention positively influenced visitor experience and enhanced engagement with the museum exhibits, as perceived by the participants themselves.
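The article does not specify which statistical test underlies these comparisons; a paired, within-subject comparison is one plausible analysis. The sketch below uses invented Likert ratings purely to illustrate the computation.

```python
import numpy as np
from scipy import stats

# Illustrative within-subject ratings (1-5 Likert) for the same participants
# under the audio-only baseline and the CLIO condition; real scores would
# come from the Godspeed Questionnaire and User Engagement Scale.
baseline = np.array([2.8, 3.1, 2.5, 3.0, 2.7, 3.2, 2.9, 2.6])
clio     = np.array([4.1, 4.3, 3.8, 4.0, 3.9, 4.4, 4.2, 3.7])

t, p = stats.ttest_rel(clio, baseline)                         # paired t-test
d = (clio - baseline).mean() / (clio - baseline).std(ddof=1)   # Cohen's d
print(f"t={t:.2f}, p={p:.4g}, d={d:.2f}")
```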
The software architecture for this research was built upon the Robot Operating System 2 (ROS2) framework, chosen for its real-time capabilities, distributed communication features, and support for a wide range of robotic hardware and software components. Environmental mapping and localization were performed using FAST-LIO2, a LiDAR-based odometry and mapping package optimized for speed and accuracy. This combination of ROS2 and FAST-LIO2 provided a robust and reliable platform for experimentation, enabling consistent data collection and minimizing the impact of software or mapping failures on the results. The resulting system facilitated iterative development and allowed for seamless integration of various modules, including robot control, data logging, and user interface components.
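A minimal rclpy node consuming FAST-LIO2’s odometry stream might look like the sketch below; the ‘/Odometry’ topic name is an assumption drawn from fast_lio’s common defaults rather than from CLIO’s configuration.

```python
import rclpy
from rclpy.node import Node
from nav_msgs.msg import Odometry

class PoseLogger(Node):
    """Minimal ROS2 node consuming FAST-LIO2 odometry. The '/Odometry'
    topic name is an assumption; check the fast_lio launch files for
    the actual name in a given setup."""
    def __init__(self):
        super().__init__("pose_logger")
        self.create_subscription(Odometry, "/Odometry", self.on_odom, 10)

    def on_odom(self, msg: Odometry):
        p = msg.pose.pose.position
        self.get_logger().info(f"robot at x={p.x:.2f}, y={p.y:.2f}")

def main():
    rclpy.init()
    rclpy.spin(PoseLogger())
    rclpy.shutdown()

if __name__ == "__main__":
    main()
```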

Beyond Simple Guidance: The Promise of Adaptive Tours
Research indicates that strategically employed deictic cues – specifically, gestures like pointing – significantly enhance visitor attention and understanding during museum tours. This study demonstrated that when a robotic tour guide, equipped with a laser-pointing system, directed attention to specific exhibits, participants exhibited improved comprehension of the displayed information. The effect stems from the innate human tendency to follow gaze and pointing gestures, effectively pre-focusing attention and reducing cognitive load. By leveraging this natural behavior, the robotic guide could successfully draw visitor focus to key elements, fostering a more engaging and informative museum experience and suggesting a powerful method for directing attention in complex environments.
The development of CLIO signifies a notable advancement in the creation of immersive and educational museum tours. This tour guide robot isn’t simply a repository of information; it actively directs visitor attention using deictic cues, specifically pointing gestures, to highlight key artifacts and enhance comprehension. Through the integration of robotics and attentional signaling, CLIO moves beyond traditional audio guides or static displays, fostering a more dynamic and engaging learning environment. The robot’s capabilities demonstrate the potential for personalized, interactive experiences that cater to individual visitor needs and ultimately enrich their cultural understanding, paving the way for a future where museums actively guide and shape the visitor journey.
Ongoing development centers on transforming the museum experience from a generalized tour into a highly personalized journey for each visitor. Researchers aim to implement algorithms that analyze individual interests – potentially through pre-tour questionnaires or real-time observation of engagement – and dynamically adjust the content delivered by the robotic tour guide. This includes not only selecting relevant exhibits but also tailoring the depth of information presented and the style of interaction, creating a responsive and adaptive experience. By monitoring visitor reactions – such as gaze direction, dwell time, and even subtle physiological cues – the robot will learn to optimize its behavior in real-time, ensuring maximum engagement and fostering a more meaningful connection with the museum’s collection. This adaptive approach promises to move beyond simple information delivery towards a truly interactive and individualized learning environment.
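A toy policy for this kind of adaptation might look like the sketch below; the engagement signal, thresholds, and tag matching are invented for illustration and do not reflect CLIO’s planned algorithms.

```python
def select_narration(engagement: float, interests: set[str], exhibit_tags: set[str]) -> str:
    """Toy policy: choose narration depth from an engagement estimate
    (0..1, e.g. derived from gaze dwell time) and the topical overlap
    between stated visitor interests and an exhibit's tags."""
    overlap = len(interests & exhibit_tags)
    if engagement < 0.3:
        return "brief"        # visitor disengaged: keep it short, move on
    if overlap >= 2 and engagement > 0.7:
        return "in_depth"     # strong interest match and high attention
    return "standard"

print(select_narration(0.8,
                       {"impressionism", "portraiture"},
                       {"portraiture", "19th_century", "impressionism"}))  # -> in_depth
```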

The pursuit of elegant solutions in robotics, as demonstrated by CLIO’s coordinated audio-gestural guidance, often feels like building a sandcastle against the tide. The system aims to direct visual attention, a noble goal, but production environments will inevitably introduce edge cases the designers never anticipated. As Carl Friedrich Gauss observed, “If others would think as hard as I do, they would not have so many questions.” The sentiment resonates; the more complex the interaction (LLM integration, co-speech gestures, visual tracking), the more opportunities for unpredictable failures. It’s not about preventing crashes; it’s about accepting that they will happen and hoping, at least, that they’re consistently reproducible. One suspects future archaeologists will be studying the failure modes of ‘engaging’ museum robots long after the novelty wears off. It’s the same mess, just with more sensors.
The Road Ahead
The presentation of CLIO, a robot attempting to orchestrate attention in a complex environment, is predictably elegant. The coordination of speech and gesture, coupled with a large language model, addresses a narrow slice of the problem – getting a user to look at the intended object. The larger, unstated problem, of course, is what happens after they do. Does directed attention translate to actual engagement, or simply a politely-nodded acknowledgement before the visitor returns to their phone? The metrics will be…interesting.
Future work will inevitably focus on robustness. Museums aren’t controlled laboratories. Lighting shifts, unexpected obstacles, and the sheer unpredictability of human movement will quickly expose the limitations of any carefully-tuned system. The question isn’t whether CLIO will fail, but where it will fail, and how gracefully. Perhaps a more fruitful avenue lies in accepting a degree of imperfection, and designing systems that can recover from, and even learn from, unexpected interactions.
The promise of ‘infinite scalability’ always seems to hover around these projects. Extending CLIO’s capabilities to larger museums, or even outdoor environments, will require more than just increased processing power. It will demand a fundamental re-evaluation of what constitutes ‘engagement’ in the first place. Because if all tests pass, it’s likely they’re testing for something remarkably trivial.
Original article: https://arxiv.org/pdf/2512.05389.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/