Author: Denis Avetisyan
Researchers are pushing the boundaries of computer vision by enabling systems to recognize and reason about a limitless range of human interactions with objects.

This work introduces Unconstrained Human-Object Interaction (U-HOI) and demonstrates its feasibility using multimodal large language models with a novel refinement strategy.
Traditional human-object interaction (HOI) detection relies on predefined interaction vocabularies, limiting adaptability to dynamic, real-world scenarios. This work, ‘Towards Unconstrained Human-Object Interaction’, introduces a novel task, Unconstrained HOI (U-HOI), and investigates the potential of multimodal large language models (MLLMs) to recognize interactions without such constraints. Our findings demonstrate that MLLMs, coupled with a post-generation refinement pipeline, can effectively address U-HOI, though current models still face limitations. Could this approach unlock a new era of flexible and robust HOI understanding in complex environments?
The Inherent Limitations of Closed-Set Human-Object Interaction
Conventional Human-Object Interaction (HOI) detection systems typically operate under a ‘closed-set’ assumption, meaning they are trained to recognize only a predetermined list of interactions – things like ‘person holding cup’ or ‘person riding bicycle’. While effective within these defined boundaries, this approach severely limits real-world applicability. The environment is dynamic and presents countless nuanced interactions not accounted for in these fixed categories. A system trained solely on predefined interactions will inevitably fail when encountering novel or uncommon actions, such as a person delicately adjusting a miniature sculpture or using a tool in an unconventional way. This reliance on a finite vocabulary necessitates extensive manual annotation to cover a wider range of possibilities, creating a significant and ongoing bottleneck in the development of truly adaptable and intelligent HOI systems.
Current Human-Object Interaction (HOI) detection systems, often built on a ‘closed-set’ of predefined actions, face considerable challenges when encountering previously unseen interactions. The reliance on a fixed vocabulary necessitates exhaustive manual annotation of training data, a process that is both time-consuming and expensive. This annotation bottleneck severely limits the scalability of these systems and hinders their ability to adapt to the dynamic and open-ended nature of real-world scenarios. Because any interaction falling outside of these predefined categories is typically ignored, the practical utility of these methods is substantially reduced, demanding a shift toward more flexible and adaptable approaches to HOI understanding.
The limitations of current Human-Object Interaction (HOI) systems, which depend on a predetermined list of possible actions, are driving a demand for more adaptable approaches. As real-world scenarios present an infinite variety of interactions, the inability to generalize beyond known categories significantly hinders practical application. Researchers are increasingly focused on developing systems capable of open-vocabulary HOI detection – methods that can reason about interactions without being explicitly trained on them. This necessitates a shift toward models that understand the underlying semantics of actions, enabling them to infer interactions based on visual context and relationships, rather than simply matching predefined labels. Such advancements promise to unlock truly intelligent systems capable of navigating and interpreting the complexities of human behavior in dynamic environments.

Vision-Language Models: A Foundation for Open-Vocabulary Interaction
Open-vocabulary Human-Object Interaction (HOI) detection overcomes the constraints of traditional, closed-set methods which are limited to pre-defined interaction categories. This is achieved through the application of Vision-Language Models (VLMs). Closed-set detectors require explicit training on every possible interaction, hindering generalization to novel scenarios. In contrast, VLMs, by learning a shared embedding space between visual and textual data, can infer interactions based on contextual understanding. This allows for the recognition of HOIs not encountered during training, as the model can leverage its knowledge of both visual features and linguistic descriptions to reason about unseen combinations of humans, objects, and actions. This contextual understanding is crucial for adapting to diverse and previously unknown HOI instances.
Vision-Language Models (VLMs) such as CLIP and BLIP demonstrate the capability to generalize to Human-Object Interactions (HOIs) outside of their training data through the use of learned visual-semantic embeddings. Traditional HOI detection relies on predefined interaction categories, limiting performance to known interactions; VLMs, however, leverage knowledge transferred from large-scale image-text pre-training. This allows the models to assess the semantic compatibility between detected objects and potential actions, even for previously unseen combinations. Specifically, CLIP utilizes contrastive learning to align visual and textual representations, while BLIP employs a bootstrapping approach to enhance image-text understanding, both resulting in improved zero-shot and few-shot HOI detection performance and greater adaptability to novel scenarios.
Vision-Language Models (VLMs) facilitate Human-Object Interaction (HOI) detection by establishing a correspondence between visual features extracted from images and semantic embeddings representing object and action classes. This is achieved through pre-training on large-scale image-text datasets, enabling the model to learn a shared representation space where visual and linguistic concepts are aligned. Consequently, HOI detection transitions from a classification task limited to predefined interaction categories to a retrieval process where the model identifies the most semantically consistent interaction given a visual scene, leading to improved generalization and the ability to recognize novel interactions not explicitly present in the training data.
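To make this retrieval framing concrete, the following sketch scores a set of candidate interaction phrases against an image using an off-the-shelf CLIP checkpoint from Hugging Face. The prompt template, candidate verbs, and file name are illustrative placeholders, not the configuration used in the paper.

```python
# Minimal sketch: open-vocabulary interaction scoring framed as retrieval with CLIP.
# The prompt template and candidate verbs are illustrative placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("scene.jpg")  # e.g. a cropped human-object pair
candidate_verbs = ["holding", "riding", "repairing", "washing"]
prompts = [f"a photo of a person {v} a bicycle" for v in candidate_verbs]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity; softmax gives a ranking over verbs.
scores = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
for verb, score in zip(candidate_verbs, scores.tolist()):
    print(f"{verb}: {score:.3f}")
```

Because the candidate phrases are plain text, the same loop works for verbs that never appeared in any HOI training set, which is precisely what closed-set classifiers cannot do.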

AnyHOI: A Refinement of Multimodal Large Language Model Outputs
The AnyHOI system initiates human-object interaction (HOI) recognition by utilizing Multimodal Large Language Models (MLLMs) to generate initial interaction hypotheses. This approach provides a flexible foundation as MLLMs, pre-trained on extensive text and image data, can propose a diverse range of potential interactions without requiring task-specific engineering. The generated interactions serve as a starting point for subsequent refinement stages within the AnyHOI pipeline, allowing for the incorporation of more precise contextual and factual information to improve accuracy and consistency. This initial generation step decouples the system from reliance on fixed interaction categories, enabling the handling of unconstrained and novel HOI scenarios.
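As an illustration of this generation step, the sketch below queries a publicly available MLLM for free-form interaction phrases. The prompt wording and the LLaVA 1.5 checkpoint are stand-ins chosen for availability; the paper's experiments use models such as LLaVA OneVision, Qwen2-VL, and Idefics2, and its exact prompts are not reproduced here.

```python
# Minimal sketch: asking an off-the-shelf MLLM for free-form interaction hypotheses.
# The prompt wording is hypothetical and the checkpoint is a stand-in; it is not
# the paper's exact setup.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
model = LlavaForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("scene.jpg")
prompt = (
    "USER: <image>\n"
    "List every interaction between people and objects in this image as "
    "'<verb> <object>' phrases, one per line. ASSISTANT:"
)

inputs = processor(text=prompt, images=image, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# The decoded text is the unconstrained hypothesis set passed to later refinement.
print(processor.decode(out[0], skip_special_tokens=True))
```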
Post-generation refinement techniques address inconsistencies and inaccuracies present in initial interactions generated by Multimodal Large Language Models (MLLMs). Specifically, the Factual text-to-scene-graph transformer analyzes the generated text and compares it to the visual scene, identifying discrepancies between stated relationships and observed objects. This transformer constructs a scene graph representation from both modalities, enabling the model to revise the interaction description to align with the visual evidence. This process enhances the factual correctness and overall coherence of the generated interactions, resulting in more reliable and consistent outputs.
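The snippet below illustrates the spirit of this refinement with a deliberately simplified stand-in: free-form MLLM output is parsed into (verb, object) pairs, and pairs whose object has no visual support are discarded. The actual pipeline relies on the Factual text-to-scene-graph transformer rather than this rule-based parse.

```python
# Simplified stand-in for post-generation refinement: parse the MLLM's output into
# (verb, object) pairs and keep only pairs whose object is visually grounded.
# The paper uses a Factual text-to-scene-graph transformer for this step.
from typing import List, Tuple

def parse_interactions(mllm_output: str) -> List[Tuple[str, str]]:
    """Split lines of the form '<verb> <object ...>' into (verb, object) pairs."""
    pairs = []
    for line in mllm_output.strip().splitlines():
        tokens = line.strip().lower().split()
        if len(tokens) >= 2:
            pairs.append((tokens[0], " ".join(tokens[1:])))
    return pairs

def refine(pairs: List[Tuple[str, str]], detected_objects: List[str]) -> List[Tuple[str, str]]:
    """Drop hypotheses whose object was not found by the object detector."""
    detected = {obj.lower() for obj in detected_objects}
    return [(verb, obj) for verb, obj in pairs if obj in detected]

raw = "riding bicycle\nwearing helmet\njuggling torches"
print(refine(parse_interactions(raw), detected_objects=["bicycle", "helmet"]))
# -> [('riding', 'bicycle'), ('wearing', 'helmet')]
```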
Test-Time Compute (TTC) is a strategy employed to augment the reasoning capacity of Multimodal Large Language Models (MLLMs) during inference, particularly in scenarios involving complex Human-Object Interaction (HOI) recognition. Implementation of TTC with the LLaVA 0.5B model resulted in a quantifiable performance gain of +2% measured by mean Average Precision (mAP) on the HICO-DET dataset. This improvement indicates that TTC enables the model to better leverage contextual information and refine its predictions without requiring updates to model weights, offering a computationally efficient method for enhancing performance at inference time.
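The exact TTC recipe is not spelled out in this summary, so the sketch below shows one common instantiation, self-consistency: sample several generations and keep interactions that recur across samples. It is meant only to illustrate spending extra inference compute without updating model weights.

```python
# One common test-time compute recipe (self-consistency), shown as an assumption:
# sample several generations from the MLLM and keep interactions that recur.
from collections import Counter

def self_consistent_interactions(samples, min_votes=2):
    """samples: list of lists of (verb, object) pairs, one list per sampled generation."""
    votes = Counter(pair for sample in samples for pair in set(sample))
    return [pair for pair, count in votes.items() if count >= min_votes]

samples = [
    [("riding", "bicycle"), ("wearing", "helmet")],
    [("riding", "bicycle"), ("holding", "bottle")],
    [("riding", "bicycle"), ("wearing", "helmet")],
]
print(self_consistent_interactions(samples))
# -> [('riding', 'bicycle'), ('wearing', 'helmet')]
```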
Performance evaluations using the HICO-DET dataset demonstrate the effectiveness of the AnyHOI pipeline when integrated with different Multimodal Large Language Models. Specifically, the system achieves a mean Average Precision (mAP) of 14.68% when utilizing Qwen2-VL 7B, indicating competitive performance relative to established Human-Object Interaction (HOI) detectors. An alternative configuration employing Idefics2 8B yields a mAP of 10.28% on the same HICO-DET dataset, further validating the adaptability and potential of the AnyHOI framework across various MLLM architectures.

Evaluating Robustness: Metrics and Datasets for Comprehensive Assessment
Mean Average Precision (mAP) continues to serve as a foundational metric for gauging the precision of human-object interaction (HOI) detection systems. However, applying mAP to realistic, unconstrained environments demands nuanced consideration; traditional benchmarks often feature simplified scenes and limited interaction diversity. A high mAP score doesn’t necessarily translate to robust performance when a model encounters the complexities of everyday life: occlusions, unusual viewpoints, or previously unseen combinations of actions and objects. Consequently, researchers are increasingly focused on refining mAP’s application, exploring methods to weigh rare but critical interactions more heavily, and supplementing it with metrics that assess a model’s ability to generalize beyond the training data and handle the long tail of possible HOIs.
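For reference, the following compressed sketch shows how per-category average precision is accumulated and averaged into mAP. Real HICO-DET evaluation additionally matches predicted human and object boxes to ground truth by IoU, which is omitted here.

```python
# Compressed sketch of mAP: non-interpolated average precision per interaction
# category, then the mean over categories. Box-pair IoU matching is omitted.
import numpy as np

def average_precision(scores, is_correct, num_gt):
    """scores: confidences; is_correct: whether each prediction matches a ground truth."""
    order = np.argsort(-np.asarray(scores))
    hits = np.asarray(is_correct, dtype=float)[order]
    tp = np.cumsum(hits)
    precision = tp / (np.arange(len(hits)) + 1)
    # Non-interpolated AP: area under the precision-recall curve.
    return float(np.sum(precision * hits) / max(num_gt, 1))

def mean_average_precision(per_category):
    """per_category: list of (scores, is_correct, num_gt) tuples, one per HOI class."""
    return float(np.mean([average_precision(*cat) for cat in per_category]))

ap = average_precision(scores=[0.9, 0.8, 0.3], is_correct=[True, False, True], num_gt=2)
print(f"AP = {ap:.2f}")  # 0.83
```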
Beyond simply detecting the presence of a human-object interaction, a robust evaluation also requires assessing whether the model understands the interaction itself. Semantic Recall directly addresses this need, functioning as a complementary metric to traditional measures like Mean Average Precision. It gauges a model’s ability to correctly identify interactions based on their meaning – for example, distinguishing between a person holding a cup and a person drinking from it. This is achieved by evaluating if the predicted interaction aligns with the semantic relationships inherent in the scene, even if the visual appearance is novel or ambiguous. A high Semantic Recall score indicates the system isn’t merely recognizing patterns, but demonstrating a degree of conceptual understanding about how humans and objects relate to one another – a crucial step towards truly intelligent scene interpretation.
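One plausible way to operationalize such a measure is sketched below: a ground-truth interaction counts as recalled if any predicted phrase lies close to it in a sentence-embedding space. The embedding model, similarity threshold, and matching rule are assumptions for illustration, not necessarily the paper's exact definition.

```python
# Illustrative semantic recall: a ground-truth interaction is "recalled" if any
# predicted phrase is sufficiently close in embedding space. The threshold,
# embedding model, and matching rule are assumptions, not the paper's definition.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_recall(predicted, ground_truth, threshold=0.7):
    if not ground_truth:
        return 1.0
    if not predicted:
        return 0.0
    pred_emb = encoder.encode(predicted, convert_to_tensor=True)
    gt_emb = encoder.encode(ground_truth, convert_to_tensor=True)
    sims = util.cos_sim(gt_emb, pred_emb)               # [num_gt, num_pred]
    recalled = (sims.max(dim=1).values >= threshold).sum().item()
    return recalled / len(ground_truth)

preds = ["person sipping from a mug", "person riding a bike"]
gts = ["person drinking from a cup", "person riding a bicycle"]
print(semantic_recall(preds, gts))  # both ground truths matched semantically
```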
Current human-object interaction (HOI) detection relies heavily on benchmark datasets such as HICO-DET for standardized evaluation, yet these datasets possess inherent limitations. While providing a valuable starting point, they often fail to fully represent the diversity and complexity of real-world interactions, frequently exhibiting biases toward common scenarios and neglecting the long tail of less frequent, yet plausible, combinations. This restricted scope can lead to inflated performance metrics that do not accurately reflect a model’s ability to generalize to novel or nuanced interactions encountered outside of the training distribution. Consequently, researchers are increasingly recognizing the need to supplement existing benchmarks with more comprehensive and representative datasets, or to develop novel evaluation protocols that specifically address the challenges of rare and complex HOI detection to obtain a more robust and reliable assessment of model capabilities.
Recent advancements in human-object interaction (HOI) detection have yielded promising results, notably with the LLaVA OV 0.5B model combined with the AnyHOI pipeline and test-time compute techniques. This configuration achieved a mean Average Precision (mAP) of 5.93% on rare interactions, signifying a substantial improvement in identifying less common, yet crucial, pairings between humans and objects. This performance boost suggests the model is better equipped to handle the long tail of interactions found in real-world scenarios, moving beyond frequently observed combinations to recognize nuanced and less predictable actions. The successful integration of these components demonstrates the potential for enhancing HOI detection systems to be more robust and adaptable to the complexities of visual understanding.
![AnyHOI and CLIP[52] demonstrate qualitative human-object interaction detection on the HICO-DET[5] dataset.](https://arxiv.org/html/2604.14069v1/figures/main_qualitatives.png)
The pursuit of unconstrained human-object interaction, as detailed in this work, demands a rigorous approach to problem-solving. Every unnecessary parameter introduces potential for error, obscuring the fundamental correctness of the system. This aligns perfectly with Edsger Dijkstra’s assertion: “Simplicity is prerequisite for reliability.” The paper’s methodology, focusing on distilling complex interactions into a refined output via MLLMs, exemplifies this principle. By minimizing reliance on predefined vocabularies and maximizing semantic recall, the research seeks a provably correct solution – an elegant implementation reflecting the inherent mathematical purity of the interactions themselves. Redundancy, even in the realm of large language models, must be meticulously pruned to achieve true reliability.
What’s Next?
The presented work, while a pragmatic step towards genuinely unconstrained human-object interaction, merely shifts the locus of the problem. The reliance on Multimodal Large Language Models (MLLMs) introduces a new dependency: one on the opaque complexities of scale. Demonstrating performance on a task divorced from predefined vocabularies is commendable, but the true test lies in demonstrable understanding, not merely successful token generation. The refinement pipeline, however necessary for current results, feels suspiciously akin to patching a fundamentally flawed architecture.
Future research must address the core issue of semantic grounding. Current MLLMs excel at statistical correlations, but lack the capacity for robust, compositional reasoning. A system capable of truly ‘understanding’ interaction requires a formal representation of both action and object properties, coupled with a mechanism for verifying the logical consistency of observed events. The focus should move beyond simply generating plausible descriptions, towards constructing provably correct models of the physical world.
Ultimately, the pursuit of unconstrained HOI demands a departure from the current paradigm of brute-force scaling. The elegance of a solution will not be measured in parameters, but in algorithmic efficiency and the capacity to generalize beyond the limitations of the training data. A genuinely intelligent system should require less compute, not more, to achieve a given level of performance.
Original article: https://arxiv.org/pdf/2604.14069.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/