Sharing the Experience: AI Agents That Truly ‘Listen’

Author: Denis Avetisyan


Researchers have developed a new framework enabling multiple AI agents to engage in realistic, spatially-aware conversations during shared viewing experiences.

The system dissects media inputs into textual captions, isolates pivotal moments to trigger dynamic agent interactions, and caches a rolling minute of contextual information to inform dialogue. An LLM then refines the conversation before it is rendered as spatially positioned audio output, translating the visual narrative into an immersive, responsive experience.

CompanionCast leverages multi-agent systems, spatial audio, and a large language model-based judge to create immersive co-viewing scenarios.

Although an increasing share of media consumption happens in isolation, shared viewing experiences remain central to social enjoyment. This gap motivates the development of CompanionCast: A Multi-Agent Conversational AI Framework with Spatial Audio for Social Co-Viewing Experiences, a novel system designed to recreate the dynamics of co-viewing through orchestrated multi-agent conversation and immersive spatial audio. Our framework utilizes an LLM-as-Judge module to refine dialogue quality across multiple dimensions, demonstrably improving perceived social presence during live sports viewing. Could this approach ultimately bridge the gap between solitary and shared media consumption across diverse entertainment and educational contexts?


Deconstructing Shared Experience: The Erosion of Collective Witnessing

The act of watching media has, for many, become an increasingly isolated pursuit, and this shift demonstrably impacts how thoroughly content is processed and appreciated. Research indicates that shared viewing experiences stimulate neurological responses linked to emotional engagement and memory formation – responses often diminished when consuming media alone. These communal settings foster discussion, interpretation, and a collective understanding that enriches the experience beyond simple visual and auditory input. The absence of these social cues can lead to decreased attention spans, reduced critical thinking about the content, and ultimately, a less satisfying or memorable viewing experience. This isn’t simply about enjoyment; the rich social dynamics inherent in shared viewing actively contribute to how information is absorbed and retained, highlighting a crucial element often lost in the age of on-demand entertainment.

Despite advancements in video conferencing and shared streaming platforms, recreating genuine co-presence during remote viewing remains a significant challenge. Current technologies often prioritize visual and auditory transmission, yet fail to fully capture the subtle non-verbal cues – shared glances, spontaneous reactions, and even the physiological synchronization – that characterize in-person social experiences. While these tools enable synchronous viewing, they often fall short in fostering the sense of ‘being together’ that heightens emotional engagement and allows for the natural flow of commentary and collective meaning-making. The limitations stem from the inherent constraints of transmitting contextual information, such as ambient awareness and the feeling of a shared physical space, which are crucial components of the social viewing experience and demonstrably impact enjoyment and recall.

Research indicates that future social viewing experiences require a move beyond simply replicating in-person interaction through digital channels. The prevailing approach of video calls layered onto individual screens fails to capture the nuanced non-verbal cues and spontaneous reactions integral to a truly shared experience. Instead, innovation lies in developing systems that synthesize the benefits of both worlds – the immersive convenience of remote access and the deeply engaging dynamics of co-presence. This necessitates exploring technologies that foster a sense of ‘social presence’ beyond visual and auditory communication, potentially through shared augmented reality spaces or adaptive interfaces that respond to collective emotional states, ultimately creating a more compelling and satisfying viewing experience.

A WebAR prototype demonstrates the feasibility of augmented reality experiences directly on mobile phones.

Orchestrating the Digital Chorus: A Multi-Agent System for Co-Viewing

CompanionCast utilizes a multi-agent system (MAS) architecture to generate dynamic conversational responses during co-viewing experiences. This system consists of multiple autonomous agents that interact with each other and the user, aiming to replicate the conversational patterns found in natural human interaction. The MAS allows for complex behavioral patterns and emergent dialogue, exceeding the capabilities of pre-scripted responses. Agent interactions are managed through a central coordination mechanism, ensuring coherent conversation flow and preventing conflicting outputs. The scalability of the MAS enables the addition of further agents to increase the complexity and realism of the simulated social environment, providing a more engaging and immersive co-viewing experience.
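
To make this coordination concrete, the sketch below shows a minimal turn-taking orchestrator in TypeScript. The `Agent` and `Orchestrator` types and the round-robin speaking order are illustrative assumptions, not CompanionCast's actual implementation.

```typescript
// Minimal sketch of a turn-taking coordinator for co-viewing agents.
// Interface names and the round-robin policy are assumptions for illustration.

interface SceneEvent {
  timestamp: number;   // seconds into the video
  description: string; // caption or detected event text
}

interface Agent {
  name: string;
  // Produce one line of dialogue given recent context and prior turns.
  respond(context: SceneEvent[], history: string[]): Promise<string>;
}

class Orchestrator {
  constructor(private agents: Agent[], private history: string[] = []) {}

  // One conversational round: each agent speaks in turn, seeing what earlier
  // agents said, so later replies can build on earlier ones.
  async round(context: SceneEvent[]): Promise<string[]> {
    const turns: string[] = [];
    for (const agent of this.agents) {
      const line = await agent.respond(context, [...this.history, ...turns]);
      turns.push(`${agent.name}: ${line}`);
    }
    this.history.push(...turns);
    return turns;
  }
}
```

A round-robin loop is the simplest coordination mechanism that still lets later speakers react to earlier ones; the actual system presumably also weighs triggers and agent roles when deciding who speaks and when.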

The agents in this system are role-specialized rather than general-purpose conversationalists, each designed with a specific function: a commentator agent provides real-time analysis of the viewed content, while an emotional supporter agent offers empathetic responses to the user and to other agents. This specialization allows for more focused and relevant interactions, increasing the sense of social presence. Each agent also possesses a defined personality profile that influences its communication style and responses, contributing to a more dynamic and believable social environment within the co-viewing experience. The architecture supports the creation of diverse agent roles tailored to specific content or user preferences.
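
The following sketch shows how such role-specialized personas might be expressed as reusable profiles that prefix an LLM prompt. The agent names, roles, and prompt wording are hypothetical, not taken from the paper.

```typescript
// Illustrative role profiles; the actual roles and prompt wording in
// CompanionCast may differ. Each profile steers an LLM-backed agent.

interface AgentProfile {
  name: string;
  role: "commentator" | "emotional_supporter";
  persona: string;        // stable personality description
  systemPrompt(): string; // prompt prefix sent to the language model
}

const commentator: AgentProfile = {
  name: "Alex",
  role: "commentator",
  persona: "analytical, quick-witted, loves statistics",
  systemPrompt() {
    return `You are ${this.name}, a sports commentator (${this.persona}). ` +
           `React to the latest on-screen events with concise analysis.`;
  },
};

const supporter: AgentProfile = {
  name: "Sam",
  role: "emotional_supporter",
  persona: "warm, encouraging, mirrors the viewer's excitement",
  systemPrompt() {
    return `You are ${this.name}, an empathetic co-viewer (${this.persona}). ` +
           `Respond to the viewer and the other agents with supportive remarks.`;
  },
};
```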

Spatial audio implementation within CompanionCast utilizes Head-Related Transfer Functions (HRTFs) to simulate sound localization, enabling the perception of agents positioned around the user in a three-dimensional space. This is achieved by processing audio signals to reflect how sound interacts with the listener’s head, ears, and torso. Specifically, each agent’s dialogue and sound effects are rendered with individualized HRTF profiles, creating the auditory illusion of their presence at a distinct location within the virtual environment. This technique aims to replicate the acoustic cues present in natural co-viewing scenarios, such as differing sound arrival times and intensities based on the relative positions of speakers, thus improving the overall sense of co-presence and immersion.
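
In a browser setting, this kind of HRTF rendering is readily available through the Web Audio API's PannerNode. The sketch below positions one agent's voice relative to the listener; the coordinates and audio elements are assumptions, and the paper's actual rendering pipeline may differ.

```typescript
// Browser sketch: place an agent's voice in 3-D space using the Web Audio
// API's HRTF panning model. Positions and audio sources are illustrative.

const audioCtx = new AudioContext();

function createAgentVoice(element: HTMLAudioElement, x: number, y: number, z: number) {
  const source = audioCtx.createMediaElementSource(element);
  const panner = new PannerNode(audioCtx, {
    panningModel: "HRTF",   // head-related transfer function rendering
    distanceModel: "inverse",
    positionX: x,           // metres relative to the listener
    positionY: y,
    positionZ: z,
  });
  source.connect(panner).connect(audioCtx.destination);
  return panner;
}

// e.g. a commentator slightly to the viewer's right, a supporter to the left:
// createAgentVoice(commentatorAudio, 1.5, 0, -1);
// createAgentVoice(supporterAudio, -1.5, 0, -1);
```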

Decoding the Narrative: Contextual Awareness in Conversational AI

CompanionCast employs Video Content Processing (VCP) to derive contextual data from the currently displayed video. This process involves automated extraction of both textual captions and descriptive metadata associated with the video content. Specifically, VCP identifies key entities, actions, and themes present in the video, and structures this information for use by the conversational agents. The extracted captions provide a transcript of spoken dialogue, while metadata – including video titles, descriptions, and associated tags – furnishes additional descriptive information. This processed data is then utilized to inform agent responses, enabling them to address questions or offer commentary relevant to the viewed content and providing a basis for more informed and contextually appropriate interactions.
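
A plausible shape for this processed context is sketched below; the field names and the prompt-flattening helper are illustrative rather than the paper's actual schema.

```typescript
// Hypothetical shape of the context produced by video content processing;
// field names are assumptions, not CompanionCast's schema.

interface VideoContext {
  title: string;
  tags: string[];                                           // descriptive metadata
  captions: { start: number; end: number; text: string }[]; // spoken dialogue
  entities: string[];                                       // people, teams, objects on screen
  actions: string[];                                        // e.g. "goal scored", "replay shown"
}

// Flatten the structured context into a prompt fragment for the agents.
function contextToPrompt(ctx: VideoContext): string {
  return [
    `Video: ${ctx.title} [${ctx.tags.join(", ")}]`,
    `On screen: ${ctx.entities.join(", ")}`,
    `Events: ${ctx.actions.join("; ")}`,
    `Dialogue: ${ctx.captions.map(c => c.text).join(" ")}`,
  ].join("\n");
}
```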

CompanionCast’s implementation of temporal context prioritizes recent video segments to enhance agent responsiveness and conversational coherence. The system analyzes video content within a defined, sliding time window – typically ranging from 5 to 30 seconds prior to the current playback point – to identify relevant events and dialogue. This focus on immediate context allows agents to formulate responses directly tied to what is presently happening on screen, avoiding generalizations or referencing outdated information. The duration of this temporal window is dynamically adjusted based on content type and pacing; faster-paced content utilizes shorter windows, while slower segments benefit from extended analysis to capture nuances and maintain a logical conversational flow. This approach minimizes latency and ensures agent contributions are contextually appropriate and timely.
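
A minimal version of such a sliding window, assuming captions are stored as timed segments, might look like the following; the adaptive choice of window length is left to the caller.

```typescript
// Sketch of a sliding temporal window over caption segments. The 5-30 s
// bounds follow the description above; the filtering rule is an assumption.

interface CaptionSegment { start: number; end: number; text: string; }

function recentContext(
  segments: CaptionSegment[],
  playhead: number,      // current playback position in seconds
  windowSeconds: number, // e.g. 5 for fast-paced content, 30 for slower segments
): CaptionSegment[] {
  const cutoff = playhead - windowSeconds;
  // Keep any segment that overlaps the window ending at the playhead.
  return segments.filter(s => s.end >= cutoff && s.start <= playhead);
}
```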

CompanionCast employs several methods for initiating agent interactions, collectively termed Conversation Triggering. These mechanisms include the detection of salient events within the video content, such as on-screen actions, dialogue keywords, or scene changes, which automatically prompt agent responses. Additionally, direct user input, like voice commands or button presses, also serves as a trigger for conversation. The system prioritizes a balance between proactive, content-driven prompts and reactive responses to user requests, aiming to simulate a natural conversational flow and avoid disruptive or irrelevant interventions. These triggers are designed to be configurable, allowing for customization of sensitivity and relevance based on content type and user preferences.
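
A simplified trigger detector combining these content-driven and user-driven cues could look like the sketch below; the trigger types, keyword list, and priority comment are assumptions for illustration.

```typescript
// Illustrative trigger check; keyword lists and thresholds are assumptions,
// not CompanionCast's configured values.

type Trigger =
  | { kind: "keyword"; word: string }
  | { kind: "scene_change" }
  | { kind: "user_input"; utterance: string };

function detectTriggers(
  latestCaption: string,
  sceneChanged: boolean,
  userUtterance: string | null,
  keywords: string[] = ["goal", "penalty", "replay"],
): Trigger[] {
  const triggers: Trigger[] = [];
  for (const word of keywords) {
    if (latestCaption.toLowerCase().includes(word)) {
      triggers.push({ kind: "keyword", word });
    }
  }
  if (sceneChanged) triggers.push({ kind: "scene_change" });
  if (userUtterance) triggers.push({ kind: "user_input", utterance: userUtterance });
  return triggers; // direct user input would typically take highest priority
}
```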

The Algorithmic Audience: Automating the Evaluation of Social AI

Historically, determining the quality of interactions between artificial agents has depended heavily on human reviewers, a process known to be both resource-intensive and susceptible to inconsistencies. Each evaluator brings their own perspectives and biases to the assessment, potentially leading to varying judgements even when presented with identical conversational exchanges. This reliance on subjective analysis creates a bottleneck in the development and refinement of multi-agent dialogue systems, hindering rapid iteration and scalable evaluation. The time required for thorough human assessment also limits the number of conversations that can be realistically evaluated, restricting the breadth of testing and potentially overlooking crucial areas for improvement. Consequently, a need arose for more objective and efficient methods to gauge the effectiveness of these complex interactions.

The traditionally laborious process of assessing multi-agent dialogue is being transformed by the introduction of LLM-as-Judge, a system that utilizes the capabilities of Large Language Models to provide automated, objective evaluations of conversation quality. This approach moves beyond reliance on subjective human judgment, which is both time-consuming and susceptible to inherent biases. By harnessing the power of these advanced models, researchers can now systematically analyze conversations, focusing on critical aspects like coherence, relevance, and overall engagement. LLM-as-Judge offers a scalable and consistent method for benchmarking and improving the performance of conversational AI, paving the way for more realistic and compelling interactions between agents and humans.

The core of automated conversation assessment lies within the Evaluator Agent Pipeline, a system designed to move beyond simple metrics and delve into the nuances of engaging dialogue. This pipeline doesn’t just check for correct responses; it analyzes conversations across five crucial dimensions. Relevance ensures the exchange stays on topic, while authenticity assesses the believability of the agents’ contributions. Crucially, the pipeline gauges engagement – how captivating the conversation is for participants – alongside diversity, measuring the breadth of topics and ideas explored. Finally, personality consistency verifies that each agent maintains a coherent and consistent persona throughout the interaction. By comprehensively evaluating these factors, the pipeline establishes a robust framework for automated evaluation, offering a more objective and detailed understanding of conversation quality than traditional methods allow.
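
A hedged sketch of how such a judge might be invoked over one conversational round is shown below; the endpoint, prompt wording, and 1-5 scale are assumptions rather than the paper's exact protocol.

```typescript
// Sketch of an LLM-as-Judge call scoring a conversation on the five
// dimensions named above. Endpoint and prompt wording are hypothetical.

interface JudgeScores {
  relevance: number;
  authenticity: number;
  engagement: number;
  diversity: number;
  personality_consistency: number; // each on an assumed 1-5 scale
}

async function judgeConversation(turns: string[], sceneSummary: string): Promise<JudgeScores> {
  const prompt =
    `Scene: ${sceneSummary}\n` +
    `Conversation:\n${turns.join("\n")}\n\n` +
    `Rate the conversation from 1 to 5 on relevance, authenticity, engagement, ` +
    `diversity and personality_consistency. Reply with JSON only.`;

  // Hypothetical LLM endpoint; substitute whichever model API is in use.
  const response = await fetch("https://example.com/v1/judge", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt }),
  });
  return (await response.json()) as JudgeScores;
}
```

Scores below a chosen threshold on any dimension could then be used to request a regenerated or revised conversational round before it is voiced to the user.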

Initial assessments of the multi-agent conversations reveal a moderate degree of success in fostering a sense of social connection. User study participants rated both enjoyment/immersion and social co-presence at approximately 3-4 on a 5-point scale, indicating that while the interactions weren’t fully convincing, they did manage to partially recreate the feeling of interacting with another person. This suggests the agents effectively established some level of believability and engagement, prompting a noticeable, though not overwhelming, sense of being present and connected within the simulated conversation. Further refinement of the agent behaviors and dialogue strategies may be necessary to elevate these ratings and achieve a more compelling and immersive experience, but these initial results offer a promising foundation for building truly socially-present AI agents.

Analysis of user interactions revealed a nuanced spectrum of engagement with the multi-agent conversational system. While participants did initiate communication, the number of user-generated messages ranged from two to four, suggesting a limited, though present, degree of proactive involvement. This variability hints at differing levels of user investment and curiosity; some individuals may have quickly satisfied their initial inquiries, while others demonstrated a greater willingness to extend the dialogue. Further investigation is needed to determine the factors driving this range – whether it’s related to the perceived responsiveness of the agents, the novelty of the conversation topics, or individual user tendencies – and how to foster more sustained and meaningful interactions.

Beyond the Screen: Augmented Reality and the Future of Social AI

A functional WebAR prototype was developed to investigate how visually projecting conversational agents directly into a user’s environment impacts the sense of immersion. This system allows digital agents to appear as if physically present within the viewer’s space, moving beyond traditional screen-based interactions. The prototype leverages browser-based augmented reality, eliminating the need for dedicated applications or specialized hardware, and enabling immediate access through a standard web browser. Early testing suggests that this visual co-presence enhances the feeling of engagement and believability, potentially creating more natural and intuitive human-agent interactions; the system represents a significant step towards blending the digital and physical realms in conversational AI.
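
As a rough illustration of how a browser-based AR session can be requested without a dedicated app, the sketch below uses the WebXR API; the prototype's actual stack and its avatar rendering (e.g. via a 3-D library) may well differ.

```typescript
// Rough sketch: request a browser AR session via the WebXR API. Avatar
// rendering is omitted; `any` is used to sidestep optional WebXR typings.

async function startARCompanions(): Promise<any> {
  const xr = (navigator as any).xr; // WebXR device API, where available
  if (!xr || !(await xr.isSessionSupported("immersive-ar"))) {
    console.warn("immersive-ar not supported; falling back to audio-only agents");
    return null;
  }
  // "hit-test" lets virtual companions be anchored to real-world surfaces.
  return xr.requestSession("immersive-ar", { requiredFeatures: ["hit-test"] });
}
```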

Recent investigations successfully demonstrate the technical viability of merging augmented reality (AR) environments with complex multi-agent conversational systems. This integration allows for the creation of immersive experiences where virtual agents are not simply heard, but visually present within the user’s real-world view. The work confirms that these agents can be realistically overlaid onto the environment, responding to user input and engaging in dynamic conversations as if physically co-located. This foundational step paves the way for novel applications in areas like interactive storytelling, remote collaboration, and personalized assistance, suggesting a future where digital companions seamlessly blend with everyday life and offer intuitive, spatially-aware interactions.

Ongoing development centers on creating more nuanced and believable agent personalities, moving beyond simple conversational responses to embody distinct traits and behaviors within the augmented reality environment. Researchers are also dedicated to broadening the scope of supported viewing contexts, envisioning applications that extend beyond controlled settings to encompass diverse real-world scenarios and media formats. Crucially, efforts are concentrated on optimizing the technical integration of augmented reality, aiming for a truly seamless user experience characterized by minimal latency, robust tracking, and intuitive interactions that feel natural and unobtrusive – ultimately creating a more immersive and engaging interaction with multi-agent systems.

CompanionCast, in its pursuit of recreating shared experiences, embodies a similar spirit to that articulated by Brian Kernighan: “Debugging is like being the detective in a crime movie where you are also the murderer.” The framework doesn’t simply build a conversational environment; it actively tests the boundaries of natural interaction through the LLM-as-Judge component. This process of evaluation, of intentionally ‘breaking’ the conversation to refine its realism, mirrors the detective’s relentless probing. The system isn’t merely constructing a social experience; it is systematically deconstructing and rebuilding it, much like an engineer reverse-engineering a complex system to understand its core functionality. The framework’s core idea, a multi-agent system simulating co-viewing, becomes truly powerful through this iterative process of controlled disruption and refinement.

What’s Next?

CompanionCast, in its attempt to simulate the messy, unpredictable nature of social co-viewing, has predictably revealed just how much remains stubbornly resistant to algorithmic replication. The framework establishes a functional scaffold, yet the true challenge isn’t generating conversation, but generating conversation that feels earned. The LLM-as-Judge construct, while cleverly addressing coherence, skirts the issue of genuine disagreement, of the productive friction that defines many human interactions. It optimizes for pleasant conversation, not necessarily an interesting one.

Future iterations will inevitably push toward more sophisticated models of individual agent ‘personality’, though one wonders if ‘personality’ isn’t simply a collection of predictable biases, easily codified and therefore, ultimately, predictable. The real frontier lies in embracing, rather than smoothing over, the inherent ambiguity of language. Can an AI be designed to misunderstand with purpose, to introduce delightful tangents, to derail the narrative in a way that feels creatively disruptive rather than simply broken?

The inclusion of spatial audio is a promising, if somewhat obvious, step toward immersion. But the system tacitly assumes shared visual attention. The next logical, and far more difficult, leap involves incorporating models of disinterest – the ability for an agent to signal boredom, to subtly shift focus, to effectively ‘check out’ of the co-viewing experience. Only then might this framework begin to truly mirror the wonderfully flawed art of being social.


Original article: https://arxiv.org/pdf/2512.10918.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
