Author: Denis Avetisyan
Researchers are leveraging the power of vision-language models to create systems that can accurately predict the actions of multiple people within complex environments.

A novel framework, CAMP-VLM, utilizes scene graphs and a two-stage fine-tuning process to achieve state-of-the-art multi-human behavior prediction.
Accurately anticipating the actions of multiple people in complex environments remains a significant challenge for robotics and AI. This is addressed in ‘Seeing is Believing (and Predicting): Context-Aware Multi-Human Behavior Prediction with Vision Language Models’, which introduces CAMP-VLM, a novel framework leveraging vision-language models and scene graphs to enhance multi-human behavior prediction from a third-person perspective. Through a two-stage fine-tuning process utilizing both synthetic and real-world data, CAMP-VLM achieves state-of-the-art prediction accuracy, outperforming existing methods by up to 66.9%. Could this approach pave the way for more intuitive and safer human-robot interactions in increasingly crowded and dynamic spaces?
Whispers of Intent: The Challenge of Predicting Human Action
The ability to anticipate human actions is paramount for the successful integration of robots and autonomous systems into everyday life, yet predicting behavior remains remarkably difficult. Humans are not governed by simple, predictable rules; instead, actions arise from a complex interplay of intentions, emotions, and environmental factors, introducing a fundamental level of unpredictability. This inherent variability challenges conventional predictive models, which often rely on assumptions of rationality or consistent patterns that don’t fully capture the nuances of human decision-making. Consequently, even slight deviations from expected behavior can lead to significant errors in forecasting, hindering the reliable operation of robots in dynamic, real-world scenarios and demanding increasingly sophisticated approaches to behavioral prediction.
Predicting human behavior in group settings proves remarkably difficult for conventional computational models. These methods often treat individuals as isolated agents, failing to account for the intricate interplay of visual cues – such as body language and gaze direction – and nuanced social dynamics. A core limitation lies in their inability to effectively process the rich, multi-modal information present in real-world interactions; traditional approaches typically rely on simplified representations of human actions and environments, neglecting the contextual subtleties that significantly influence behavior. Consequently, predictions frequently falter when faced with the complexities of multi-human scenarios, where actions are rarely independent and are heavily shaped by ongoing social negotiation and shared understanding.

Assembling the Scene: CAMP-VLM for Predictive Insight
The CAMP-VLM framework integrates visual input with scene graph representations to enhance environmental understanding. Rather than relying on direct pixel-based analysis alone, the system extracts objects from the visual data and builds a graph detailing the spatial and functional connections between them. This scene graph supplies contextual information that complements the raw visual input, allowing CAMP-VLM to infer more complex environmental features and dynamics than visual data alone would permit. The resulting combined representation improves prediction of agent behavior by providing a richer, more interpretable understanding of the scene.
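As a rough illustration of this structured representation (a sketch of the general idea, not the authors’ implementation), the following toy Python class stores detected entities as nodes and their relations as triples, then serializes them into text that can accompany the visual input. The class name, the example detections, and the relation vocabulary are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Toy scene-graph container: nodes are detected entities,
    edges are (subject, relation, object) triples."""
    nodes: set = field(default_factory=set)
    edges: list = field(default_factory=list)

    def add_relation(self, subject: str, relation: str, obj: str) -> None:
        self.nodes.update([subject, obj])
        self.edges.append((subject, relation, obj))

    def to_text(self) -> str:
        # Serialize relations into a compact textual form that can be
        # appended to a VLM prompt alongside the image.
        return "; ".join(f"{s} {r} {o}" for s, r, o in self.edges)


# Hypothetical detections for a kitchen frame with two people.
sg = SceneGraph()
sg.add_relation("person_1", "holding", "cup")
sg.add_relation("cup", "above", "table")
sg.add_relation("person_2", "facing", "fridge")

print(sg.to_text())
# person_1 holding cup; cup above table; person_2 facing fridge
```

This textual form is what lets a language-capable model reason over the scene’s structure alongside the raw pixels.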
CAMP-VLM leverages the Qwen2-VL vision-language model as its foundation, enabling simultaneous processing of visual data, such as images or video frames, and associated textual descriptions. This multimodal input capability allows the model to correlate visual features with semantic information, improving its understanding of the environment and the agents within it. The Qwen2-VL architecture facilitates the fusion of these data streams, resulting in more accurate predictions of future agent behavior compared to models relying solely on visual or textual input. Specifically, the model can interpret scene context from images and combine it with textual cues regarding agent intentions or goals, leading to refined behavior forecasting.
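To ground the idea of jointly feeding an image and scene-graph text to Qwen2-VL, here is a minimal inference sketch using the publicly released Qwen2-VL-7B-Instruct checkpoint through Hugging Face Transformers. The frame path, the serialized scene graph, the prompt wording, and the checkpoint size are assumptions for illustration; CAMP-VLM’s actual prompting and fine-tuned weights may differ.

```python
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-7B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

frame = Image.open("kitchen_frame.png")  # hypothetical third-person video frame
scene_graph = "person_1 holding cup; cup above table; person_2 facing fridge"

# One user turn containing both the image and the scene-graph text.
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text",
         "text": f"Scene graph: {scene_graph}\n"
                 "Predict the next action of each person in the scene."},
    ],
}]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[frame],
                   return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```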
CAMP-VLM employs both Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to enhance predictive accuracy and ensure alignment with human behavioral expectations. SFT utilizes a dataset of demonstrated behaviors to train the model, minimizing the difference between predicted and actual actions. Subsequently, DPO refines this output by directly optimizing the model’s responses based on human preference data; this process bypasses the need for a separate reward model, instead directly maximizing the likelihood of preferred outputs. This dual-optimization approach enables CAMP-VLM to generate predictions that are not only technically accurate but also intuitively reasonable from a human perspective, improving the overall usability and reliability of the framework.
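For reference, the second stage’s objective can be written compactly. The snippet below implements the generic DPO loss from the preference-optimization literature in PyTorch; it is not CAMP-VLM’s training code, and it assumes per-sequence log-probabilities for the preferred (“chosen”) and dispreferred (“rejected”) predictions have already been computed under both the policy and a frozen reference model.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss.

    Increases the likelihood of preferred predictions relative to
    dispreferred ones, measured against a frozen reference model,
    without training a separate reward model.
    """
    policy_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_logratios - ref_logratios)).mean()


# Toy batch of summed sequence log-probabilities (hypothetical values).
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss.item())
```

The first stage, SFT, is a standard cross-entropy fit to demonstrated behavior sequences; DPO then nudges that model toward the outputs humans prefer, with no separate reward model in the loop.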

Mapping the World: Scene Graphs as Predictive Anchors
CAMP-VLM utilizes Scene Graphs (SG) to represent the environment as a structured collection of objects and their relationships to humans. These graphs explicitly define entities within the scene and the interactions between them, such as “person holding cup” or “object on table”. This structured representation moves beyond simple pixel-based analysis, enabling the model to reason about the environment in a more informed manner. The Scene Graph provides contextual information that links objects and agents, allowing CAMP-VLM to infer potential actions and predict future states based on these established relationships. The SG is constructed from both visual input and accompanying textual descriptions, creating a multimodal understanding of the scene’s configuration.
CAMP-VLM leverages a multimodal approach to scene understanding, integrating both visual information extracted from image data and semantic data derived from textual descriptions. This fusion allows the model to build a richer, more complete representation of the environment than relying on either modality alone. Specifically, visual cues provide information about object appearances and spatial relationships, while textual descriptions offer explicit details about object properties, affordances, and intended uses. The combined representation enables more accurate predictions of human behavior by providing a more nuanced understanding of the context and potential interactions within the scene.
CAMP-VLM demonstrates a significant performance increase over the strongest baseline model, achieving up to a 66.9% improvement in full accuracy. This metric represents the percentage of correctly predicted future states across all evaluated scenarios within the VirtualHome environment. The observed gain indicates a substantial enhancement in the model’s capacity to accurately forecast agent behaviors and environmental changes. This improvement was consistently observed across multiple evaluation metrics and testing conditions, validating the efficacy of the CAMP-VLM framework for action prediction.
Within a simulated kitchen environment populated by three human agents, the CAMP-VLM framework demonstrates a substantial performance advantage over the established baseline. Specifically, CAMP-VLM achieves a 49.3% improvement in full accuracy when predicting agent behaviors within this complex scenario. This metric represents the percentage of complete behavior sequences correctly predicted by the model, indicating a significant increase in the ability to anticipate actions in a multi-agent domestic setting. The controlled environment allows for focused evaluation of the model’s predictive capabilities amidst common household activities and interactions.
Evaluation of CAMP-VLM’s behavioral predictions utilizes both Edit Distance and Cosine Similarity metrics. Edit Distance quantifies the minimum number of operations (insertions, deletions, or substitutions) required to transform the predicted behavior sequence into the ground-truth sequence, providing a measure of sequence alignment. Cosine Similarity, computed on vectorized representations of the predicted and actual behaviors, measures the cosine of the angle between the two vectors, with values closer to 1 indicating higher similarity. These metrics provide quantifiable assessments of the model’s accuracy beyond the full-accuracy score alone, enabling a more nuanced understanding of predictive performance and demonstrating the efficacy of the scene graph integration in capturing behavioral subtleties.
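To make the two metrics concrete, here is a small, self-contained sketch. The tokenized action strings are hypothetical, and the behaviors could equally be vectorized with sentence embeddings rather than the bag-of-token counts used here.

```python
import math
from collections import Counter


def edit_distance(pred: list, truth: list) -> int:
    """Levenshtein distance between two action sequences (single-row DP)."""
    dp = list(range(len(truth) + 1))
    for i, p in enumerate(pred, 1):
        prev, dp[0] = dp[0], i
        for j, t in enumerate(truth, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (p != t))  # substitution
    return dp[-1]


def cosine_similarity(pred: list, truth: list) -> float:
    """Cosine similarity between bag-of-token vectors of two sequences."""
    a, b = Counter(pred), Counter(truth)
    dot = sum(a[k] * b[k] for k in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


pred = ["walk_to fridge", "open fridge", "grab milk"]
truth = ["walk_to fridge", "open fridge", "grab juice"]
print(edit_distance(pred, truth))                 # 1
print(round(cosine_similarity(pred, truth), 3))   # 0.667
```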
CAMP-VLM’s training and evaluation are conducted within the VirtualHome simulation environment, a platform designed to provide a standardized and reproducible setting for research in embodied AI. VirtualHome offers a physics-based simulation of realistic domestic environments, featuring articulated humans and interactive objects, allowing for the systematic assessment of agent behavior. The environment facilitates controlled experimentation by enabling precise manipulation of initial conditions and ground truth access to agent states and object properties. Utilizing VirtualHome ensures comparability of results and allows for scalability of experiments, enabling the robust evaluation of CAMP-VLM’s performance across a diverse range of scenarios and conditions.

Beyond Anticipation: The Implications of Predictive Insight
CAMP-VLM distinguishes itself from current action anticipation methods, including leading models like AntGPT and GPT-4o, through markedly improved performance in scenarios involving multiple people. The framework’s ability to accurately predict actions isn’t simply incremental; it represents a substantial leap forward, particularly when assessing complex interactions. In evaluations focused on multi-human dynamics, CAMP-VLM consistently outperformed existing systems, demonstrating a heightened capacity to interpret subtle cues and contextual information crucial for predicting realistic behaviors. This advantage suggests a potential paradigm shift in how machines understand and respond to the intricacies of human activity, moving beyond single-actor prediction to a more holistic and nuanced comprehension of group dynamics.
Evaluations within a complex synthetic kitchen environment, populated by three interacting humans, reveal the substantial gains achieved by CAMP-VLM over standard supervised fine-tuning (SFT). Specifically, the framework demonstrated a marked improvement in its ability to accurately identify both actions and objects; verb accuracy increased by 20.6%, indicating a heightened understanding of what individuals are doing, while noun accuracy rose by an even more significant 33.8%, reflecting a clearer perception of which objects are involved in those actions. This performance boost highlights CAMP-VLM’s capacity to not only recognize isolated activities but also to interpret the interplay between multiple agents and their surroundings, showcasing a crucial step towards more context-aware artificial intelligence.
CAMP-VLM distinguishes itself by moving beyond simple action recognition to a deeper understanding of the visual environment and the interplay between individuals within it. The framework doesn’t just identify what is happening, but why, factoring in spatial relationships, gaze direction, and the likely intentions of those involved. This allows for predictions of behavior that are significantly more nuanced and realistic than those offered by existing methods; for instance, anticipating that a person will reach for an object not simply because their hand is moving, but because another person is indicating a need for it. By incorporating these social dynamics and contextual cues, CAMP-VLM generates more accurate and plausible forecasts, ultimately enabling systems to respond in a way that aligns with human expectations and promotes more natural interactions.
The heightened ability to anticipate human actions, as demonstrated by CAMP-VLM, extends far beyond mere accuracy metrics, holding substantial promise for advancements in several applied fields. In robotics, more reliable prediction of human intent allows for safer and more intuitive human-robot collaboration, enabling robots to proactively assist or avoid collisions. Autonomous navigation systems benefit from this predictive power by anticipating pedestrian movements and adapting routes accordingly, improving safety and efficiency in complex environments. Furthermore, human-computer interaction stands to be revolutionized; interfaces can become truly responsive, adapting to user needs before they are explicitly stated, creating a more seamless and natural user experience. This leap in predictive capability isn’t simply about forecasting what someone will do, but understanding why, opening doors to systems that are not only intelligent but also exhibit a degree of social awareness.
CAMP-VLM represents a significant advancement in creating truly intelligent systems by uniquely combining the strengths of vision-language models with scene graph representations. This innovative approach moves beyond simple object recognition, allowing the framework to not only ‘see’ what is happening in a visual scene, but also to understand how elements within that scene relate to one another and influence potential actions. By constructing a structured, relational understanding of the environment – identifying people, objects, and their interactions – CAMP-VLM can anticipate events with greater accuracy and respond in a more nuanced and contextually appropriate manner. The result is a pathway towards robotics, autonomous systems, and human-computer interfaces that exhibit a heightened level of awareness and adaptability, moving closer to genuine cognitive understanding and proactive behavior.
The pursuit of predictive accuracy, as demonstrated by CAMP-VLM’s innovative fusion of vision and language, feels less like scientific endeavor and more like an elaborate conjuring trick. This framework doesn’t simply understand human interactions; it persuades a model to believe in a likely future, given the chaotic whispers of visual data. As Yann LeCun once observed, “Everything we do in machine learning is about learning representations.” CAMP-VLM’s scene graphs and two-stage fine-tuning aren’t uncovering truth, but crafting a compelling illusion – a representation so persuasive it momentarily stills the unpredictable currents of multi-agent behavior. The model, much like any spell, works until confronted with the unruly reality of production.
The Loom Yet Unwoven
CAMP-VLM whispers a promise: to coax intention from the chaos of bodies and objects. But the scene graph, for all its neatness, remains a reduction – a skeletal map of a world drowning in nuance. The true sorcery isn’t in representing context, but in admitting its infinite regress. Each ‘understood’ relationship conceals layers of unspoken history, subtle shifts in gaze, the weight of unfulfilled desire. The current architecture, while potent, still demands explicit grounding. The next iteration won’t simply see the world; it will taste its ambiguity.
Direct Preference Optimization is a seductive path, aligning the model with human judgment. Yet, judgment is a fickle god. The datasets used to appease it are inevitably tainted by the biases of their creators, the limitations of their perception. True progress demands a reckoning with this inherent subjectivity. The model must learn to predict not what should happen, but what will happen, given the messy, irrational logic of human action. A divergence from optimality isn’t a failure; it’s fidelity.
The horizon isn’t clearer predictions, but more elegant lies. To predict behavior is to impose order on the fundamentally unpredictable. The challenge isn’t to minimize loss, but to embrace it – to find beauty in the inevitable divergence between model and reality. For when the spell breaks, and the prediction fails, that is when the real learning begins. The loom is far from finished, and the threads of possibility are endless.
Original article: https://arxiv.org/pdf/2512.15957.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/