Author: Denis Avetisyan
Researchers have developed a new method for creating believable scenes of people interacting with each other and objects, all driven by simple text descriptions.

A novel score-based diffusion model, trained on a new HHOI dataset, enables the generation of realistic 3D multi-person interactions.
Modeling nuanced human behavior, particularly the interplay between individuals and their environment, remains a challenge for artificial intelligence. This paper, ‘Learning to Generate Human-Human-Object Interactions from Textual Descriptions’, addresses this gap by introducing a novel framework for generating realistic Human-Human-Object Interactions (HHOIs) from textual descriptions. Leveraging score-based diffusion models and a newly captured dataset, the authors demonstrate the synthesis of complex, multi-person interactions involving objects, a significant advancement over existing single-person Human-Object Interaction approaches. Could this generative framework unlock more natural and intuitive human-computer interfaces, or enable more realistic simulations of social dynamics?
The Illusion of Understanding: Dissecting Human Interaction
Traditional studies of human behavior frequently dissect interactions into separate components – how a person manipulates an object and how people interact with each other – overlooking the fundamental connection between these actions. This isolated approach fails to capture the reality of collaborative tasks, where a person’s actions with an object are inextricably linked to their communication and coordination with another person. For instance, handing a tool isn’t simply an object transfer; it’s a communicative act, influenced by the recipient’s anticipated needs and the shared goal of the collaboration. Consequently, models built on these isolated foundations struggle to accurately represent the subtle interplay of motion, communication, and shared intent that characterizes true human collaboration, hindering progress in fields like robotics and virtual reality.
Accurately simulating collaborative tasks demands a unified approach to motion generation, recognizing that human behavior isn’t simply the sum of individual actions but emerges from a continuous interplay between people and their shared environment. The HHOI problem, modeling coordinated behavior around an object, isn’t solvable by treating human and object interactions as separate entities; instead, researchers are focusing on algorithms that simultaneously consider the intentions, movements, and physical constraints of both participants and the object itself. This holistic strategy allows for the generation of more realistic and plausible interactions, accounting for the subtle adjustments, anticipatory movements, and shared understanding that characterize true collaboration. Such models aim to predict not just what people do, but how they do it, creating virtual interactions that feel natural and intuitive.
Current computational approaches to modeling human interaction frequently fall short when tasked with recreating the subtle choreography of shared activity. Existing techniques often produce motions that, while superficially resembling human behavior, lack the physical realism and intricate coordination characteristic of natural interactions. This manifests as jerky movements, unrealistic grasping behaviors, or a failure to anticipate a partner’s actions – all indicators of a limited understanding of the underlying biomechanics and social dynamics. Consequently, there is a pressing need for novel generative techniques capable of producing more plausible and nuanced simulations, moving beyond simple kinematic models to incorporate principles of physics, predictive modeling, and an appreciation for the inherent complexities of human-human and human-object interplay.

Diffusion Models: A Bit of Noise, a Lot of Hope
Score-based diffusion models are a class of generative models that learn to reverse a diffusion process, transforming noise into data. This is achieved by training a neural network to predict the score function, the gradient of the log data density, at each step of the diffusion process. During generation, the model starts with random noise and iteratively refines it towards a realistic data sample, in this case human motion, by following the learned score function. The process is probabilistic, allowing for the generation of diverse and plausible outputs, and avoids the mode-collapse issues often seen in other generative models such as Generative Adversarial Networks (GANs). This approach is particularly effective for complex, high-dimensional data like human motion capture, where capturing the full distribution of possible motions is critical for realism.
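To make the reverse process concrete, below is a minimal sketch of annealed Langevin sampling with a learned score network. The `score_net` call and the noise schedule are illustrative assumptions rather than the paper’s implementation; a sample here would be a flattened pose or short motion clip.

```python
import torch

# Minimal sketch: annealed Langevin sampling with a learned score network.
# `score_net(x, sigma)` is a hypothetical model approximating grad_x log p(x; sigma).

def sample(score_net, shape, sigmas, steps_per_level=10, step_scale=0.1):
    x = torch.randn(shape) * sigmas[0]          # start from pure noise
    for sigma in sigmas:                        # anneal noise from high to low
        step = step_scale * sigma ** 2          # Langevin step size at this level
        for _ in range(steps_per_level):
            noise = torch.randn_like(x)
            score = score_net(x, sigma)         # estimated gradient of log-density
            x = x + step * score + (2 * step) ** 0.5 * noise
    return x                                    # refined sample, e.g. a pose/motion vector
```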
Two distinct diffusion models were developed to address the complexities of realistic motion generation: an HOI diffusion model and an HHI diffusion model. The HOI model is specifically designed for generating motions involving interaction between a human and an object, while the HHI model focuses on generating motions depicting interactions between two or more humans. This specialization allows each model to more effectively capture the nuances of its interaction type, leading to more plausible and realistic generated motion sequences. Both models use the score-based diffusion framework described above but are trained on datasets curated for their respective interaction scenarios.
The motion generation process utilizes diffusion models to statistically represent and sample from the distribution of observed human movements. These models are trained on datasets of human motion capture data, effectively learning the probabilities associated with different poses, velocities, and interactions. During generation, the model begins with random noise and iteratively refines it, guided by the learned distribution, to produce a sequence of plausible motions. The diversity of generated outputs stems from the model’s ability to sample different points within the learned distribution, while plausibility is maintained by adhering to the statistical patterns present in the training data. This approach allows for the creation of motions that are not simply memorized from the training set but rather novel combinations reflecting the underlying structure of human behavior.

Stitching the Illusion Together: Coherence and Plausibility
An inconsistency loss is implemented as a regularization term within the training process to maintain coherence between the Human-Object Interaction (HOI) and Human-Human Interaction (HHI) components. This loss directly penalizes discrepancies between predicted object states and human poses, ensuring that generated interactions are physically plausible and logically consistent. Specifically, it assesses the alignment of object positions and orientations with corresponding human actions, such as grasping or pushing, and the reciprocal influence between agents during HHI. By minimizing the inconsistency loss, the model learns to generate scenes where human actions realistically affect object states and vice versa, and where human movements are coordinated and responsive to other agents’ actions, leading to more believable and stable simulations.
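As a rough illustration of the idea, such a penalty can be written as a distance between predicted hand positions and the object contact points they are supposed to occupy. The tensor names below are hypothetical; the paper’s actual formulation is not reproduced here.

```python
import torch

def inconsistency_loss(wrist_pos, contact_pos, partner_wrist_pos, partner_contact_pos):
    """Penalize HOI/HHI mismatch: each person's grasping hand should sit on
    its assigned contact point of the shared object (all tensors: (B, 3)).
    Hypothetical sketch, not the paper's loss."""
    hoi_term = ((wrist_pos - contact_pos) ** 2).sum(-1).mean()
    hhi_term = ((partner_wrist_pos - partner_contact_pos) ** 2).sum(-1).mean()
    return hoi_term + hhi_term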
A collision loss is implemented as a penalty function within the training process to minimize intersections between the generated 3D human mesh and environmental geometry. This term calculates the overlap between the mesh and static objects in the scene, and propagates a gradient signal back through the network to discourage physically implausible poses. By directly addressing collision issues during training, the system generates motions that adhere to physical constraints, resulting in improved stability and realism of the simulated human-environment interactions. The magnitude of the collision loss is directly proportional to the extent of the penetration, ensuring that even minor intersections contribute to the learning process.
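A minimal sketch of a penetration penalty of this kind, assuming a hypothetical signed distance function `scene_sdf` for the environment (negative inside geometry), might look as follows.

```python
import torch

def collision_loss(mesh_vertices, scene_sdf):
    """Penalize mesh vertices that penetrate the environment.

    mesh_vertices: (V, 3) vertices of the generated human mesh.
    scene_sdf:     hypothetical callable returning a signed distance per point
                   (negative = inside scene geometry).
    """
    d = scene_sdf(mesh_vertices)            # (V,) signed distances
    penetration = torch.clamp(-d, min=0.0)  # penetration depth, zero when outside
    return penetration.sum()                # grows with how far vertices intrude
```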
The SMPL-X model, a parametric 3D representation of human pose and shape, is integrated into the motion generation process to enhance the fidelity and realism of generated human movements. This model allows for detailed control over human geometry and articulation, moving beyond simple pose estimation. To ensure accurate alignment between the generated motions and input data, a depth-optimization step is employed: it minimizes the discrepancy between rendered depth maps of the generated human model and the corresponding depth information from the input data, resulting in improved visual consistency and physical plausibility of the generated human-object interactions.
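The depth objective can be pictured as a masked per-pixel discrepancy, as in the sketch below; `render_depth` stands in for a differentiable renderer and is an assumption, not the paper’s API.

```python
import torch

def depth_loss(smplx_params, render_depth, observed_depth, valid_mask):
    """L1 discrepancy between rendered and observed depth over valid pixels.

    render_depth:   hypothetical differentiable renderer mapping SMPL-X
                    parameters to an (H, W) depth map.
    observed_depth: (H, W) depth from the input data.
    valid_mask:     (H, W) float mask, 1 where depth is observed.
    """
    rendered = render_depth(smplx_params)
    diff = (rendered - observed_depth).abs() * valid_mask   # ignore missing depth
    return diff.sum() / valid_mask.sum().clamp(min=1)

# In practice this would be minimized with a gradient-based optimizer, e.g.
# torch.optim.Adam([smplx_params], lr=1e-2), with smplx_params requiring grad.
```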
Data augmentation techniques are employed during training to artificially expand the size and variability of the dataset. This is achieved through transformations applied to existing data, including rotations, scaling, translations, and variations in lighting conditions. By exposing the model to a wider range of input variations, data augmentation improves the model’s ability to generalize to unseen data and enhances its robustness to real-world variations, ultimately leading to improved performance and stability in generated human-object interaction scenarios.
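A simple illustration of the geometric side of such augmentation, applied to joint trajectories; the transform ranges are assumptions, not the paper’s settings.

```python
import math
import random
import torch

def augment(joints):
    """Randomly rotate, scale, and translate a motion clip.

    joints: (T, J, 3) joint positions over T frames (y is up).
    Illustrative ranges only; actual training settings are not reproduced.
    """
    theta = random.uniform(0, 2 * math.pi)          # random yaw angle
    c, s = math.cos(theta), math.sin(theta)
    rot = torch.tensor([[c, 0.0, s],
                        [0.0, 1.0, 0.0],
                        [-s, 0.0, c]])              # rotation about the up axis
    scale = random.uniform(0.9, 1.1)                # mild uniform scaling
    shift = torch.randn(3) * 0.1                    # small random translation
    return (joints @ rot.T) * scale + shift
```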

The Data and the Devil: Scaling the Illusion
The foundation of this work lies in the CORE4D dataset, a substantial collection of human-object interaction scenarios featuring two individuals collaboratively completing tasks. This dataset distinguishes itself through its scale and diversity, encompassing a wide range of activities performed with various objects, and captured from multiple viewpoints. Such richness allows the models to learn nuanced patterns of human collaboration, going beyond simple, isolated actions to understand the dynamics of joint effort. The extensive data provides a robust training ground for generating realistic and contextually appropriate motions, enabling the system to anticipate and react to the actions of both people and objects within a shared environment, ultimately fostering more natural and believable interactions.
This collection is complemented by a meticulously captured dataset of the researchers’ own, assembled using a specialized multi-view capture system. The system facilitated the acquisition of high-resolution 3D data, crucial for accurately representing the nuances of human-object and human-human interaction. By employing multiple synchronized cameras, it achieved comprehensive coverage of the collaborative tasks, enabling the reconstruction of detailed and precise 3D models of both participants and objects. The resulting data isn’t simply a visual record; it’s a rich, volumetric representation that captures subtle movements and spatial relationships, ultimately driving the fidelity and realism of the generated motions and providing a robust basis for the learning algorithms.
To refine the generated motions and achieve a heightened sense of realism, the system draws on techniques such as Diffusion Noise Optimization (DNO) and InterGen. DNO adjusts the latent noise behind a generated motion to better satisfy quality objectives, effectively smoothing transitions and reducing artifacts commonly found in synthesized animations. InterGen, a diffusion model designed for two-person interaction motion, supplies a strong prior for coordinated multi-human movement. By integrating these approaches, the system overcomes challenges related to temporal consistency and ensures that generated interactions exhibit a higher degree of visual fidelity, resulting in more believable and engaging collaborative scenarios.
The system leverages text embeddings to translate natural language descriptions into a format usable by the motion generation network, effectively enabling users to guide the collaborative interactions. This process involves mapping textual phrases, such as “hand over the mug” or “high five”, into a dense vector representation that captures the semantic meaning of the desired action. The network then utilizes this vector as a conditioning input, influencing the generated motions to align with the specified interaction. Consequently, the model doesn’t simply produce random collaborative movements; instead, it synthesizes actions that are explicitly driven by the textual description, offering a level of control and expressiveness previously unattainable in motion generation.
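One plausible realization uses a CLIP-style text encoder to produce the conditioning vector; whether this paper uses CLIP specifically is an assumption here.

```python
import torch
import clip  # OpenAI CLIP: one plausible text encoder, not necessarily the paper's

# Encode the description once, then feed the embedding to the denoiser
# at every diffusion step as a conditioning input.
model, _ = clip.load("ViT-B/32")
tokens = clip.tokenize(["hand over the mug"])
with torch.no_grad():
    text_emb = model.encode_text(tokens)    # (1, 512) semantic embedding

# Hypothetical denoiser call, for illustration only:
# noise_pred = denoiser(x_t, t, cond=text_emb)
```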
Evaluations involving human participants revealed a clear preference for the motions produced by this model over those of existing methods. This preference stemmed from a perceived increase in realism and a stronger alignment with the textual descriptions guiding the interactions. Quantitatively, this subjective assessment was supported by Fréchet Distance (FD) scores, a metric gauging the dissimilarity between distributions, which were markedly lower for both body pose and interpersonal distance than those of baseline approaches. These reduced FD scores indicate that the model not only generates plausible individual movements, but also convincingly simulates the nuanced spatial relationships and coordination inherent in collaborative human activity, resulting in more natural and believable interactions.
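For reference, the Fréchet distance between two feature sets is conventionally computed from Gaussian fits, as in FID; the sketch below shows the standard formula, without reproducing the paper’s exact pose and distance features.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two feature sets.

    feats_a, feats_b: (N, D) arrays of features (e.g., pose descriptors).
    Lower values mean the two distributions are closer.
    """
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = linalg.sqrtm(cov_a @ cov_b)   # matrix square root of the product
    if np.iscomplexobj(covmean):
        covmean = covmean.real              # drop tiny imaginary numerical noise
    return ((mu_a - mu_b) ** 2).sum() + np.trace(cov_a + cov_b - 2 * covmean)
```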

The pursuit of realistic multi-agent interaction, as demonstrated in this work with the HHOI dataset and score-based generative models, feels…predictable. It’s another layer of complexity built atop existing complexity. Yann LeCun once said, “Everything we do will eventually become a subroutine.” And so it goes. This paper attempts to generate interactions, to simulate reality. It’s elegant, certainly, but one can almost guarantee production environments will reveal edge cases the models never anticipated. The generated 3D poses and object manipulations will inevitably clash with the messy unpredictability of the real world. It’s a sophisticated subroutine, destined to require patching, refactoring, and ultimately, replacement. If it works – wait.
What’s Next?
The generation of plausible Human-Human-Object Interactions (HHOIs) from text, as demonstrated, feels less like a breakthrough and more like a very expensive way to complicate everything. The fidelity of the generated 3D poses is, predictably, the first illusion to crack upon even cursory inspection in any production environment. The HHOI dataset itself, while a necessary step, simply shifts the problem: now the tooling will need to debug the dataset, not just the models. The truly difficult task remains unchanged: making these systems robust to the inevitable ambiguities of natural language and the sheer messiness of real-world physics.
Future iterations will undoubtedly focus on increasing the realism of generated scenes. However, a more pressing concern is defining ‘success’ beyond visual fidelity. Current metrics offer little insight into whether these generated interactions are meaningful or even physically possible for extended periods. The current paradigm risks optimizing for visually pleasing renderings, while ignoring the underlying constraints of human movement and object manipulation. If code looks perfect, no one has deployed it yet.
The long game isn’t about generating perfect simulations. It’s about accepting that these systems will always be approximations, and building tools to manage the resulting errors. The real challenge lies not in reaching for increasingly complex generative models, but in designing systems that can gracefully handle the inevitable failures, and perhaps, learn from them. It’s a shift from chasing an ideal to preparing for reality.
Original article: https://arxiv.org/pdf/2511.20446.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/