Building Blocks for Interaction: A Unified Framework for HOI Generation and Editing

Author: Denis Avetisyan


Researchers have developed a new approach to realistically generate and manipulate scenes involving people and objects, opening doors for advanced image and video editing capabilities.

OneHOI provides a unified workflow for both generating and editing human-object interactions. A single model can synthesize complex scenes from layouts and shapes, edit multiple interactions simultaneously (for example, changing a person's action from holding a dog to lying on a bench), modify a single interaction such as swapping the object being held, and even transform an entire environment from a calm day to a stormy ocean, demonstrating diverse conditional control over scene composition and object attributes.

OneHOI leverages Diffusion Transformers and relational reasoning to jointly model human-object interactions, achieving state-of-the-art performance in both generation and editing tasks.

Modeling human-object interaction (HOI) remains challenging due to the disconnect between generating realistic scenes and precisely editing existing ones. This paper introduces ‘OneHOI: Unifying Human-Object Interaction Generation and Editing’, a novel diffusion transformer framework that jointly learns both tasks by explicitly representing relational structures within scenes. At its core, OneHOI leverages a Relational Diffusion Transformer to model verb-mediated relations and disentangle complex multi-HOI scenarios, achieving state-of-the-art results across generation and editing benchmarks. Could this unified approach pave the way for more intuitive and controllable methods for creating and manipulating complex visual narratives?


Unraveling Visual Narratives: The Importance of Human-Object Interaction

Truly comprehending a visual scene extends beyond simply identifying the objects within it; accurate scene understanding necessitates recognizing how those objects, and particularly people, interact with one another. This area of study, known as Human-Object Interaction (HOI), delves into the nuanced relationships – a person holding a cup, a dog chasing a ball, a car approaching a pedestrian – that define the context and meaning of an image. It’s not enough to know that a person and a bicycle are present; understanding whether the person is riding, repairing, or obstructing the bicycle is critical for complete scene interpretation. Consequently, advancements in computer vision increasingly focus on HOI analysis, as it forms a fundamental building block for tasks ranging from robotic navigation to image captioning and detailed visual search.

Current image editing tools, while proficient at altering individual objects or backgrounds, frequently falter when tasked with realistically modifying the relationships between people and things. Attempts to change an action – for example, swapping a person’s grip on an object or altering their posture during an interaction – often yield visually jarring results. These tools struggle to maintain consistency in lighting, shadows, and perspective, leading to scenes that appear disjointed or physically impossible. The resulting incoherence stems from a failure to account for the complex interplay of spatial reasoning, physical constraints, and contextual understanding inherent in natural human-object interactions; a simple color adjustment or object replacement doesn’t address the nuanced changes required to depict a believable shift in activity.

Successfully editing scenes to alter human-object interactions presents a significant technical hurdle: maintaining consistent identity and spatial relationships. Current image manipulation techniques often fail to account for how changes to one element impact the believability of others; simply moving an object or person without adjusting surrounding context can lead to jarring inconsistencies. Preserving these relationships requires algorithms capable of understanding not just the individual components of a scene, but their complex interdependencies – the relative positions, orientations, and even implied forces between them. This demands new approaches to image manipulation that move beyond pixel-level edits, instead focusing on semantically aware transformations capable of ensuring that altered scenes remain visually coherent and physically plausible, ultimately necessitating a deeper integration of computer vision and geometric reasoning.

Our human-object interaction editing method reliably generates new interactions while maintaining subject identity, unlike baseline approaches which commonly introduce artifacts or fail to modify pose.

OneHOI: A Unified Framework for Coherent Scene Manipulation

OneHOI introduces a unified framework for both generating and editing Human-Object Interactions (HOIs) using a single model, eliminating the need for separate systems for each task. This consolidation is achieved through a novel architecture built upon Diffusion Transformers (DiTs), allowing for consistent and coherent manipulation of HOI scenes. By integrating generation and editing within a single model, OneHOI streamlines the process of creating and modifying interactions between humans and objects, offering increased efficiency and control over the resulting imagery. This approach contrasts with prior methods requiring distinct models for generating new HOIs versus altering existing ones.

OneHOI utilizes Diffusion Transformers (DiTs) as its foundational architecture due to their established proficiency in generative modeling and precise spatial control. DiTs employ a diffusion process, progressively adding noise to data and then learning to reverse this process to generate new samples. This approach enables OneHOI to create high-resolution, realistic images depicting human-object interactions. Furthermore, the transformer component of DiTs facilitates accurate spatial reasoning, allowing the model to understand and manipulate the relative positions of humans and objects within a scene, which is critical for generating coherent and plausible interactions. This combination of generative and spatial capabilities directly contributes to the high-quality results achieved by OneHOI.
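The forward noising process that underlies any diffusion model, DiTs included, can be sketched in a few lines. This is a generic illustration of the standard closed-form noising step under a linear beta schedule, not OneHOI's actual implementation; all names and the schedule values are illustrative.

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng=np.random.default_rng(0)):
    """Noise a clean sample x0 to timestep t in closed form:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps,
    with eps ~ N(0, I). The model is trained to predict eps
    (reversing this process yields new samples)."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)[t]
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return xt, eps

# Linear schedule over 1000 steps (a common choice in the literature).
betas = np.linspace(1e-4, 0.02, 1000)
x0 = np.zeros((64, 64))                 # stand-in for an image latent
xt, eps = forward_diffuse(x0, 999, betas)
# At the final timestep alpha_bar is near zero, so xt is almost pure noise.
```

The transformer backbone operates on patch tokens of `xt`, which is what gives DiT-based models the spatial control the paragraph above describes.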

The OneHOI framework facilitates detailed control over Human-Object Interaction (HOI) generation and editing by directly manipulating the core components of an HOI triplet: the person, the action, and the object. This architecture allows for granular adjustments to each element, enabling nuanced scene alterations and realistic depictions of interactions. Quantitative evaluation demonstrates the model’s performance, achieving a PickScore of 21.41 and a Human Preference Score (HPS) of 0.2617 on standard HOI generation benchmarks, indicating a high degree of realism and alignment with human expectations.
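The triplet structure described above can be made concrete with a small sketch. The field names and the `edit_action` helper are illustrative assumptions; the paper does not specify its internal representation at this level of detail. The point is that an identity-preserving edit touches only the verb, leaving person and object fixed.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class HOITriplet:
    """One human-object interaction: <person, verb, object>."""
    person_id: int   # which human in the scene
    verb: str        # the action mediating the relation
    obj: str         # object category label

def edit_action(triplet: HOITriplet, new_verb: str) -> HOITriplet:
    """Swap only the verb; person and object are untouched,
    mirroring the identity-preserving edits described above."""
    return replace(triplet, verb=new_verb)

# A scene is a set of triplets; editing one leaves the rest intact.
scene = [HOITriplet(0, "hold", "dog"), HOITriplet(1, "sit on", "bench")]
edited = [edit_action(t, "walk") if t.obj == "dog" else t for t in scene]
```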

OneHOI streamlines human-object interaction (HOI) generation and editing through a unified, multi-step workflow enabling mixed-condition synthesis, layout-free and layout-guided HOI manipulation, and attribute modification to create complex scenes with arbitrary shapes.

HOI-Edit-44K: A Benchmark for Robust Interaction Editing

HOI-Edit-44K is a newly introduced dataset designed to support the development and benchmarking of human-object interaction (HOI) editing models. It comprises 44,000 paired examples, each demonstrating an edit to an HOI while preserving the identity of the involved human subject and object. The dataset’s large scale facilitates robust model training, and the paired structure allows for quantitative evaluation of edit accuracy and identity preservation. These pairs consist of original images and their corresponding edited versions, providing a direct comparison for assessing model performance in manipulating HOIs.

The HOI-Edit-44K dataset underwent a rigorous curation and filtering process utilizing the PViC Human-Object Interaction (HOI) detector to ensure data quality and suitability for model training. This involved employing PViC to identify and validate the presence of both humans and objects within each scene, as well as to confirm the accurate depiction of interactions between them. Instances with low confidence scores from the PViC detector, or those exhibiting ambiguous or incorrect HOI annotations, were systematically removed from the dataset. This filtering step minimizes noise and inaccuracies, resulting in a high-quality dataset specifically designed to facilitate robust training and evaluation of HOI editing models.
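The filtering logic can be sketched as a confidence-threshold pass over candidate pairs. Note that `detect_hois`, the record layout, and the 0.5 cutoff are all hypothetical stand-ins; PViC's actual interface and the threshold used to build HOI-Edit-44K are not specified here.

```python
def filter_pairs(pairs, detect_hois, threshold=0.5):
    """Keep only (original, edited) pairs whose source and target
    interactions are detected with sufficient confidence in the
    corresponding image; low-confidence or ambiguous pairs drop out."""
    kept = []
    for original, edited, hoi in pairs:
        det_orig = detect_hois(original)   # maps "verb object" -> confidence
        det_edit = detect_hois(edited)
        if (det_orig.get(hoi["source"], 0.0) >= threshold
                and det_edit.get(hoi["target"], 0.0) >= threshold):
            kept.append((original, edited, hoi))
    return kept

# Toy usage with a fake detector that returns fixed scores.
fake_detector = lambda img: {"hold dog": 0.9, "lie on bench": 0.8}
pairs = [
    ("img_a", "img_b", {"source": "hold dog", "target": "lie on bench"}),
    ("img_c", "img_d", {"source": "ride horse", "target": "feed horse"}),
]
kept = filter_pairs(pairs, fake_detector)  # second pair is filtered out
```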

The HOI-Edit-44K dataset, comprising 44,000 paired examples of human-object interaction edits, demonstrably improves the generalization capability of the OneHOI model to novel scenes and intricate interactions. Performance evaluations using the IEBench benchmark indicate a HOI Editability score of 0.638 and an Editability-Identity Score of 0.638, quantifying OneHOI’s ability to both successfully modify interactions and maintain the original identities of subjects within those interactions. These scores represent a measurable increase in performance attributable to the dataset’s scale and the diversity of represented human-object interactions.

The HOI-Edit-44K dataset provides examples for evaluating human-object interaction editing capabilities.

Expanding the Canvas: Towards Dynamic Scene Understanding

The strength of OneHOI lies in its ability to move beyond isolated image manipulations and address the complexities of complete scenes. Rather than altering a single interaction between objects, the model can simultaneously refine multiple relationships within an image, representing a significant advancement in human-object interaction (HOI) editing. This capability is vital for achieving realistic and coherent scene understanding, as real-world visuals rarely consist of isolated events; instead, they depict interconnected actions and relationships. By tackling multiple HOIs at once, OneHOI doesn’t just change what is happening in an image, but can fundamentally alter the narrative and context of the entire scene, offering a level of control previously unattainable in automated image editing.

The architecture of OneHOI benefits significantly from its foundation in Flux.1, a generative model capable of producing high-quality synthetic data. This capability extends beyond simply training the initial model; Flux.1 allows for the creation of diverse and customized datasets tailored to specific editing tasks or scenarios. By generating synthetic data, researchers can overcome limitations imposed by real-world datasets – such as biases or lack of representation for rare interactions – and effectively augment training data to improve model robustness and generalization. This synthetic data generation capability isn’t limited to refinement of existing functionalities; it actively expands the potential applications of OneHOI into areas where labeled data is scarce or unavailable, offering a pathway towards more adaptable and versatile image manipulation tools.

The development of OneHOI signifies a considerable leap toward nuanced image manipulation and AI-assisted content creation, offering users fine-grained control over visual storytelling. The model does not merely alter images; it reframes them by precisely editing the interactions between people and objects, a capability substantiated by a 26.4% performance increase in human-object interaction (HOI) generation over task-specific models. OneHOI also demonstrates a 21.1% improvement in layout-free HOI editing, indicating its capacity to modify scenes without rigid structural constraints. Validated by metrics such as an ImageReward score of 0.5524, this work promises tools that let creators construct and refine visual narratives with greater fidelity and artistic command.

The model achieves versatile human-object interaction (HOI) generation by conditioning on arbitrary shapes and composing both HOI and object-only inputs within a single scene.

The presented OneHOI framework underscores the importance of relational reasoning in generative modeling, a principle echoed in Andrew Ng's assertion: "The key to AI is not to create artificial general intelligence, but to build tools that help people." OneHOI does not aim to replace human understanding of human-object interaction, but rather to augment it through a system capable of generating and editing complex scenes. By explicitly modeling how humans interact with objects, the framework achieves state-of-the-art performance, effectively becoming a powerful tool for scene manipulation and a demonstration of how understanding underlying patterns unlocks significant advancements in AI capabilities.

Where Do We Go From Here?

The OneHOI framework, viewed as a particularly refined microscope for examining the dance between humans and objects, reveals a landscape far more intricate than initially suspected. The ability to not only generate but also edit human-object interactions suggests a shift from simply creating plausible scenes to manipulating the very grammar of action. However, the model remains, at its core, a pattern-matching engine. It excels at rearranging existing elements, but true creativity, the ability to conceive of entirely novel interactions, remains elusive. The current emphasis on relational reasoning, while powerful, skirts the question of intentionality. A model can predict how a person will use an object, but it does not understand why.

Future investigations should therefore move beyond the purely visual. Integrating tactile and auditory information, for instance, could ground the model in a richer, more embodied understanding of the world. Furthermore, addressing the limitations in handling complex, multi-agent scenarios, where numerous individuals interact with multiple objects simultaneously, will be crucial. The current approach, though elegant, risks becoming bogged down in combinatorial complexity.

Ultimately, the goal isn’t merely to build a model that mimics human-object interaction, but one that can anticipate it. This requires a move away from treating interactions as isolated events and toward understanding them as part of a larger, ongoing narrative. The model, in essence, must learn to tell a story-and that is a challenge far beyond the scope of current generative frameworks.


Original article: https://arxiv.org/pdf/2604.14062.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-04-17 18:46