Author: Denis Avetisyan
Researchers have released Bi3, a comprehensive dataset designed to improve the ability of robots to safely and effectively navigate complex, real-world social environments alongside people.

Bi3 provides diverse, multi-platform data on human motion and robot behavior, addressing critical gaps in existing resources for social robot navigation research.
Despite advances in robot navigation, reliably predicting human behavior in crowded spaces remains a significant challenge. To address this, we introduce Bi3: A Biplatform, Bicultural, Biperson Dataset for Social Robot Navigation, comprising 10.5 hours of multimodal data, including motion tracks and video, collected from 74 participants across the USA and France during close-proximity human-robot interactions. Our analysis reveals Bi3 to be a uniquely diverse benchmark, featuring five navigation algorithms and two robot platforms, suitable for training models of human motion prediction and robot control. Will this dataset accelerate the development of robots capable of seamlessly navigating and collaborating within dynamic human environments?
Navigating the Human World: Anticipating Interaction
As robots transition from controlled factory floors to the unpredictable bustle of human environments, the ability to anticipate pedestrian movements becomes critically important for their safe and effective operation. Unlike static obstacles, people are dynamic and often exhibit goal-oriented behavior, requiring robots to move beyond simple collision avoidance. Accurate prediction isn’t merely about preventing accidents; it’s about fostering seamless, intuitive interactions – allowing robots to navigate crowded spaces, assist individuals, and generally coexist harmoniously. This necessitates a shift towards predictive models that consider not just where a person is, but where they are likely to go, demanding sophisticated algorithms capable of interpreting subtle cues in human posture, gaze, and velocity to proactively adjust robotic trajectories and ensure both safety and efficiency.
Conventional robotic navigation systems frequently categorize humans as static obstructions, a simplification that proves inadequate when encountering the nuanced, predictive nature of human movement. This approach fails to recognize that people rarely travel in straight lines or exhibit wholly reactive behaviors; instead, individuals constantly anticipate their surroundings and adjust their trajectories accordingly. Consequently, robots relying on such rudimentary models struggle to interpret intentions, leading to hesitant movements, inefficient paths, and potentially unsafe interactions. The inherent complexity of human behavior, including subtle cues like gaze direction, body language, and social conventions, requires a more sophisticated understanding than simply registering a human presence as an impediment to avoid.
The development of robots capable of navigating human environments is hampered by a critical limitation in training data; existing datasets often fail to capture the nuanced and varied interactions characteristic of real-world encounters. To address this, researchers have compiled the Bi3 Dataset, a resource comprising 10.5 hours of footage documenting natural, close-proximity navigation between a mobile robot and 74 different human participants. This extensive collection goes beyond simple obstacle avoidance scenarios, capturing a breadth of behaviors – including yielding, negotiation, and subtle adjustments in trajectory – offering a significantly richer learning environment for robots striving to achieve socially aware and safe navigation capabilities. By exposing algorithms to this diversity of human responses, the Bi3 Dataset aims to foster the development of more robust and adaptable robotic systems ready to operate seamlessly alongside people.
The inability of robots to reliably anticipate human behavior presents a significant obstacle to their widespread integration into everyday life. Current limitations in predictive capabilities restrict robotic deployment to highly structured or isolated settings, preventing effective operation in the unpredictable environments humans naturally inhabit – from crowded shopping malls and busy sidewalks to collaborative workspaces and domestic homes. Without a nuanced understanding of human intention and movement, robots struggle to navigate these dynamic spaces safely and efficiently, leading to hesitant, awkward, or even potentially hazardous interactions. This shortfall ultimately confines robotic assistance to limited applications, hindering the realization of their full potential as collaborators and companions in the human world.

Building a Foundation: Expanding the Scope of Data
The development of comprehensive datasets is essential for advancing human motion prediction models. Earlier datasets often lacked the complexity to accurately represent real-world interactions. Datasets like Bi3 address this limitation by capturing a wider range of scenarios and incorporating detailed data on human-robot interactions. This increased diversity enables more robust training and evaluation of prediction algorithms, moving beyond simplistic motion patterns to encompass the nuances of dynamic, real-world environments. The availability of such datasets facilitates the development of models capable of generalizing to previously unseen interactions and improving performance in practical applications.
The THOR and THOR MAGNI datasets address the need for multi-agent motion prediction by focusing on interactions within shared workspaces. THOR, and its extended version THOR MAGNI, simulate realistic human-robot collaboration scenarios, allowing for the collection of data detailing how humans navigate and interact with both static obstacles and other active agents – including robots – during task completion. These datasets differ from single-agent benchmarks by explicitly capturing the complexities of joint motion planning, collision avoidance, and cooperative behavior, providing a more comprehensive evaluation of prediction models intended for real-world collaborative robotics applications. The inclusion of multiple agents necessitates models capable of reasoning about the intentions and trajectories of others, increasing the difficulty and relevance of the prediction task.
The Edinburgh, Stanford Drone, and ETH/UCY datasets serve as established benchmarks for human motion prediction research. These datasets provide standardized data against which new methodologies can be quantitatively compared, facilitating objective evaluation of progress in the field. The Edinburgh dataset focuses on pedestrian motion in crowded spaces, the Stanford Drone dataset captures human trajectories from an overhead drone viewpoint, and ETH/UCY provides pedestrian trajectories recorded in real-world urban scenes. Utilizing these datasets allows researchers to demonstrate improvements in prediction accuracy, efficiency, and robustness relative to existing state-of-the-art approaches, and ensures reproducibility of results across different studies.
The Bi3 Dataset facilitates detailed analysis of human-robot interaction through data collected using the Hello Robot Stretch and Willow Garage PR2 robotic platforms. A key characteristic of Bi3 is its capture of closer proximity interactions – exhibiting a lower minimum distance between humans and robots compared to existing datasets. Furthermore, Bi3 data demonstrates greater diversity in human motion, as indicated by a significantly higher standard deviation in human velocity measurements; this suggests a broader range of movement speeds and patterns during interaction, providing a more robust basis for training predictive models.
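The two statistics highlighted above, minimum human-robot distance and the standard deviation of human velocity, are straightforward to compute from synchronized position tracks. A minimal sketch (the function names and toy trajectories are illustrative, not from the Bi3 release):

```python
import numpy as np

def min_human_robot_distance(human_xy: np.ndarray, robot_xy: np.ndarray) -> float:
    """Smallest Euclidean distance between synchronized human and robot tracks.

    Both arrays are (T, 2) positions in meters, sampled at the same timestamps.
    """
    return float(np.linalg.norm(human_xy - robot_xy, axis=1).min())

def velocity_std(human_xy: np.ndarray, dt: float) -> float:
    """Standard deviation of human speed, from finite-difference velocities."""
    vel = np.diff(human_xy, axis=0) / dt    # (T-1, 2) velocity vectors
    speed = np.linalg.norm(vel, axis=1)     # scalar speeds in m/s
    return float(speed.std())

# Toy example: human walks along x while the robot approaches along y.
human = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
robot = np.array([[2.0, 3.0], [2.0, 2.0], [2.0, 1.0]])
print(min_human_robot_distance(human, robot))  # closest approach: 1.0 m
print(velocity_std(human, dt=1.0))             # constant speed -> 0.0
```

A lower minimum distance across a dataset indicates closer interactions were captured, while a higher speed standard deviation indicates more varied motion, which is exactly the axis on which Bi3 is claimed to differ from prior datasets.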

Advanced Modeling Techniques: Refining Predictive Capacity
The Human Scene Transformer and the CoHAN (Cooperative Human-Aware Navigation) planner represent advancements in modeling complex human behaviors. The Human Scene Transformer employs a transformer network, an architecture originally developed for natural language processing, to encode both human pose data and scene context, allowing it to better anticipate future movements based on environmental understanding. CoHAN, by contrast, focuses on human-aware navigation, explicitly reasoning about people and their relationships to the robot in order to plan socially acceptable motion. Both approaches move beyond traditional methods by operating on rich representations of human motion, enabling the handling of non-linear and multimodal trajectories and improving performance in scenarios with complex interactions or occlusions. These models rely on substantial datasets for training and validation, and prediction performance is typically evaluated using metrics such as Average Displacement Error (ADE) and Final Displacement Error (FDE).
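ADE and FDE, the standard evaluation metrics mentioned above, have simple definitions: ADE averages the L2 error over every predicted timestep, while FDE measures it only at the final one. A minimal sketch with toy trajectories:

```python
import numpy as np

def ade(pred: np.ndarray, gt: np.ndarray) -> float:
    """Average Displacement Error: mean L2 error over all predicted timesteps."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def fde(pred: np.ndarray, gt: np.ndarray) -> float:
    """Final Displacement Error: L2 error at the last predicted timestep."""
    return float(np.linalg.norm(pred[-1] - gt[-1]))

# Toy case: the prediction is offset by 0.5 m in x at every step.
gt = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
pred = gt + np.array([0.5, 0.0])
print(ade(pred, gt))  # 0.5
print(fde(pred, gt))  # 0.5
```

A model can have a low ADE but a high FDE when its predictions start accurate and drift over the horizon, which is why the two metrics are usually reported together.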
Social Robot Navigation Systems are increasingly utilizing advanced trajectory prediction models in conjunction with Model Predictive Control (MPC) to enhance navigational safety and efficiency. MPC operates by repeatedly solving an optimization problem over a finite time horizon, predicting future system states – including human trajectories – and calculating control inputs that minimize a cost function. By incorporating predicted human movements into the MPC framework, the robot can proactively adjust its path to avoid collisions and maintain a comfortable distance from pedestrians. This contrasts with reactive methods that respond to immediate sensor data. The optimization process considers both predicted human behavior and robot dynamics, enabling the generation of smooth, dynamically feasible trajectories. Successful implementation requires accurate prediction models and computationally efficient MPC solvers to operate in real-time.
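The coupling of trajectory prediction with receding-horizon control described above can be illustrated with a deliberately simplified, sampling-based MPC step: candidate velocity commands are rolled out over a short horizon, scored by progress toward the goal plus a soft penalty for approaching the predicted human path, and the cheapest command is executed. All names, horizons, and cost weights here are assumptions for illustration, not the controllers used in the paper:

```python
import numpy as np

def mpc_step(robot, goal, human_pred, dt=0.2, horizon=5, safe_dist=0.8):
    """One receding-horizon step: pick the velocity whose rolled-out path
    best trades off goal progress against predicted human proximity.

    robot, goal: (2,) positions; human_pred: (horizon, 2) predicted human path.
    Returns the chosen (2,) velocity command.
    """
    candidates = [np.array([vx, vy])
                  for vx in np.linspace(-1.0, 1.0, 9)
                  for vy in np.linspace(-1.0, 1.0, 9)]
    best_v, best_cost = None, np.inf
    for v in candidates:
        pos, cost = robot.astype(float), 0.0
        for t in range(horizon):
            pos = pos + v * dt                       # constant-velocity rollout
            cost += np.linalg.norm(pos - goal)       # progress-toward-goal term
            d = np.linalg.norm(pos - human_pred[t])  # distance to predicted human
            if d < safe_dist:                        # soft collision penalty
                cost += 100.0 * (safe_dist - d)
        if cost < best_cost:
            best_v, best_cost = v, cost
    return best_v

# The robot heads for (2, 0); a person is predicted to stand near (1, 0.1).
v = mpc_step(np.array([0.0, 0.0]), np.array([2.0, 0.0]),
             np.tile(np.array([1.0, 0.1]), (5, 1)))
print(v)  # advances in +x while veering away from the predicted human
```

In practice the optimization is gradient-based rather than exhaustive, the robot model includes dynamics constraints, and the human prediction comes from a learned model, but the structure of the trade-off is the same.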
The validation of advanced predictive navigation systems is heavily dependent on the availability of comprehensive datasets that represent realistic human behavior. Several datasets are commonly utilized for this purpose, including JRDB, L-CAS, and SCAND. These datasets provide trajectories and interactions of pedestrians in diverse environments, enabling researchers to train and evaluate the performance of prediction models. Performance metrics derived from these datasets, such as minimum distance to humans and robot acceleration, are crucial for quantifying safety and efficiency improvements over simpler modeling approaches. Data from the LAAS laboratory demonstrates the impact of data quality and dataset characteristics on system behavior, as observed variations in robot movement suggest platform limitations influence path planning even with advanced prediction models.
Recent advancements in predictive navigation represent a departure from simpler models such as “No Prediction” and Constant Velocity approaches. Data collected at LAAS demonstrated that implementation of these advanced techniques resulted in consistently greater minimum distances maintained between robots and humans during trials. However, this increased separation was coupled with higher observed robot acceleration rates, indicating a trade-off between proximity and smoothness of movement. This suggests that current platform characteristics, including robotic drivetrain capabilities and control algorithms, are influencing path planning and limiting the ability to achieve both close proximity and fluid, natural motion in collaborative scenarios.
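The two baselines named above are easy to state precisely: "No Prediction" treats the person as frozen at their last observed position, while "Constant Velocity" extrapolates their last observed velocity forward. A minimal sketch of the latter (function and variable names are illustrative):

```python
import numpy as np

def constant_velocity_predict(history: np.ndarray, horizon: int, dt: float) -> np.ndarray:
    """Constant Velocity baseline: extrapolate the last observed velocity.

    history: (T, 2) observed positions; returns (horizon, 2) future positions.
    """
    v = (history[-1] - history[-2]) / dt               # last finite-difference velocity
    steps = np.arange(1, horizon + 1).reshape(-1, 1)   # 1, 2, ..., horizon
    return history[-1] + steps * v * dt

obs = np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 0.0]])   # walking 0.5 m per step
print(constant_velocity_predict(obs, horizon=3, dt=1.0))
# extrapolates to [[1.5, 0.], [2.0, 0.], [2.5, 0.]]
```

Despite its simplicity, constant velocity is a notoriously strong baseline for short horizons, which is why improvements over it in minimum-distance and smoothness metrics are meaningful evidence for the more advanced predictors.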
[Figure: participants (1) and (2) while a robot navigates a predefined path between assembly goals. Image: https://arxiv.org/html/2605.06863v1/x3.png]
Measuring Human Perception: Towards Seamless Integration
Determining whether humans will readily accept robots into their daily lives extends far beyond simply assessing a robot’s ability to navigate a space; social acceptance is paramount. Researchers recognize that a robot’s success isn’t solely defined by its functional capabilities, but also by how readily people perceive it as trustworthy, safe, and agreeable. Consequently, standardized metrics are vital for gauging these subjective qualities. The Robotic Social Attributes Scale (RoSAS), for example, provides a framework for evaluating a robot’s perceived warmth, competence, and discomfort, allowing designers to move beyond purely technical benchmarks and prioritize the development of robots that are not just efficient, but also genuinely welcomed by those they interact with. This shift in focus is critical for ensuring the seamless integration of robotic technology into human society.
Understanding the mental effort required during human-robot interaction is paramount for designing truly collaborative systems. The NASA Task Load Index (TLX) offers a standardized, subjective assessment of this cognitive burden, measuring workload across six dimensions: mental demand, physical demand, temporal demand, performance, effort, and frustration. By quantifying these factors, researchers gain crucial insight into how humans perceive and respond to robotic partners. This data informs design choices aimed at minimizing mental strain, optimizing task allocation, and ultimately fostering more comfortable and efficient collaboration. A lower TLX score suggests a more seamless interaction, indicating the robot effectively supports, rather than hinders, human performance, paving the way for wider acceptance and integration of robotic assistance in various domains.
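Scoring the TLX is mechanical once the six subscale ratings are collected. The common Raw TLX (RTLX) variant simply averages the six 0-100 ratings (the full TLX additionally weights them by pairwise comparisons); the example ratings below are hypothetical:

```python
def nasa_tlx_raw(ratings: dict) -> float:
    """Raw TLX (RTLX): unweighted mean of the six 0-100 subscale ratings."""
    dims = ("mental_demand", "physical_demand", "temporal_demand",
            "performance", "effort", "frustration")
    return sum(ratings[d] for d in dims) / len(dims)

# Hypothetical ratings after a collaborative navigation trial.
score = nasa_tlx_raw({
    "mental_demand": 60, "physical_demand": 20, "temporal_demand": 40,
    "performance": 30, "effort": 50, "frustration": 10,
})
print(score)  # 35.0
```

Comparing such scores across conditions (say, with and without human motion prediction enabled on the robot) is how workload differences between navigation algorithms are typically quantified.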
Advancements in robotic perception and labeling are heavily reliant on the availability of comprehensive datasets, and resources like INTERACT and TBD are proving instrumental in accelerating this progress. These datasets provide researchers with the large-scale, annotated information necessary to train and evaluate algorithms designed for object recognition, scene understanding, and human-robot interaction. By offering meticulously labeled images and videos capturing a diverse range of scenarios, they enable the development of more robust and adaptable robotic systems. The increasing size and complexity of these datasets directly correlate with improvements in a robot’s ability to accurately perceive its environment and, crucially, to interpret human intentions, ultimately pushing the boundaries of what’s possible in collaborative robotics and autonomous operation.
Ongoing development in human-robot interaction centers on creating systems capable of greater resilience and flexibility in diverse environments. Researchers are actively integrating recent advances in perception and social understanding – informed by datasets like the expansive Bi3 collection, which features 10.5 hours of interaction data contributed by 74 participants – to facilitate truly seamless collaboration. This work aims to move beyond controlled laboratory settings and enable robots to function effectively in real-world applications, from assisting in homes and hospitals to working alongside humans in complex industrial scenarios. The ultimate goal is to build robotic partners that not only understand human intent but also adapt to individual preferences and unforeseen circumstances, creating a more intuitive and productive relationship.
The introduction of the Bi3 dataset exemplifies a principle of systemic understanding. This dataset, designed to capture the nuances of human-robot interaction in diverse settings, acknowledges that isolated components – robot behaviors, environmental factors, user perceptions – do not function in isolation. As Donald Davies observed, “The way to approach a complex problem is to break it down into simpler parts, but always remembering that the whole is more than the sum of its parts.” Bi3’s bicultural and biperson data collection directly addresses this holistic view, aiming to move beyond fragmented understandings of social navigation and build truly robust, adaptive systems. The dataset’s focus on multiple perspectives (robot, human, and environment) highlights the interconnectedness crucial for effective interaction.
What Lies Ahead?
The introduction of Bi3, with its explicit focus on bicultural interaction and diverse robotic behaviors, exposes a quiet assumption within the field: that navigation is merely a technical problem. Systems break along invisible boundaries: if one cannot model the subtle shifts in human expectation arising from cultural context, pain is coming in the form of unpredictable, even unsafe, interactions. This dataset is not an end, but a necessary widening of the lens.
Anticipating weaknesses requires a move beyond simple motion prediction. The current emphasis on ‘successful’ trajectories obscures the critical data residing in failures – the near misses, the awkward adjustments, the moments where human trust visibly erodes. Future work must prioritize capturing and analyzing these negative spaces, treating them not as outliers, but as fundamental signals of systemic fragility.
Ultimately, the true test of a social robot lies not in its ability to reach a destination, but in its capacity to navigate the complex topography of human social expectation. The elegance of a solution will not be measured by computational efficiency, but by its ability to disappear, to fade into the background of everyday life without disrupting the delicate balance of human interaction.
Original article: https://arxiv.org/pdf/2605.06863.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/