Author: Denis Avetisyan
Researchers explore how large language models can assist human presenters in live, interactive planetarium shows, enhancing engagement and reducing cognitive strain.
This study investigates the feasibility of using Large Language Models to support real-time visualization control and natural language interaction within science center planetariums.
Delivering engaging and dynamic experiences in live scientific visualizations presents a unique challenge given the need for real-time responsiveness and intricate control. This paper, 'Piloting Planetarium Visualizations with LLMs during Live Events in Science Centers', investigates the feasibility of employing Large Language Models (LLMs) as assistive "co-pilots" within planetarium software, specifically to manage camera movements, simulation time, and visual elements. Our results indicate that while LLMs currently lack the nuanced skills of experienced human pilots, they demonstrate potential for reducing operator workload and enabling more interactive presentations. Could proactive AI agents fundamentally reshape the role of the live visualization presenter and unlock new avenues for public science engagement?
Deconstructing the Pilot: The Human Bottleneck in Visualization
The creation of compelling live visualizations, such as those experienced in planetariums, currently depends heavily on a dedicated "Human Pilot" who directly manipulates complex software like OpenSpace. This individual isn't simply presenting information; they are simultaneously operating the underlying system, a task demanding significant cognitive resources. Controlling trajectory, managing data layers, adjusting viewpoints, and responding to real-time requests all contribute to a substantial mental workload. This manual control isn't merely about technical skill; it actively diverts the Pilot's attention from the narrative itself and, crucially, from observing and engaging with the audience. The inherent complexity necessitates a division of labor, yet this division inadvertently creates a bottleneck, limiting the potential for truly fluid and immersive experiences.
Current visualization systems, while capable of rendering stunning imagery, often place a considerable burden on the presenter, known as the "Human Guide". Instead of intuitively guiding an audience through a cosmic journey, these systems demand precise, explicit commands for every navigational action – a detailed list of instructions for panning, zooming, and transitioning between celestial objects. This reliance on direct manipulation disrupts the presenter's flow, forcing them to split attention between content delivery and technical operation. The result is a fragmented experience, diminishing the immersive quality of the visualization and hindering the Guide's ability to establish a compelling narrative for the audience. Truly seamless presentations require a shift towards systems that anticipate the Guide's intent, allowing for more natural and fluid interaction with the visualized data.
The current reliance on manual control within live visualizations presents a substantial barrier to truly immersive experiences. When a presenter, or "Guide", is preoccupied with the technical operation of software – adjusting viewpoints, managing simulations, and executing commands – their cognitive resources are diverted from crafting compelling narratives and responding to audience cues. This division of attention diminishes the Guide's capacity for spontaneous elaboration, personalized explanations, and dynamic adaptation to the evolving interests of those present. Consequently, the potential for a seamless, captivating journey through complex data is compromised, leaving audiences with a less engaging and impactful experience than the technology promises. A shift towards more intuitive control mechanisms is therefore crucial to unlock the full potential of these powerful visualization tools.
The Ghost in the Machine: LLMs as Visualization Co-Pilots
The system architecture incorporates a Large Language Model (LLM) to facilitate interpretation of spoken instructions from a human operator, representing a shift from traditional command-line interfaces. This LLM-driven approach allows for the processing of complex, natural language requests, enabling a more intuitive and flexible interaction paradigm. Rather than requiring pre-defined commands, the LLM analyzes the semantic content of the spoken intent, identifying the user's goals and translating them into system actions. This moves beyond simple stimulus-response cycles to enable nuanced understanding of the operator's needs within the OpenSpace environment.
The Large Language Model (LLM) within the system employs Natural Language Reasoning (NLR) to parse user requests that extend beyond simple keyword recognition. This capability allows the LLM to interpret the intent behind spoken instructions, even when expressed with ambiguity or complexity. Following intent recognition, the LLM formulates specific "Tool Calls": structured requests targeting functionalities within the OpenSpace environment. These Tool Calls define the precise action to be performed – such as manipulating a dataset, altering a visualization parameter, or initiating a simulation – and are formatted for direct execution by OpenSpace, enabling a dynamic and responsive interaction.
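The paper does not publish its exact tool interface, but the general shape of a Tool Call can be sketched as follows. The tool names, argument schemas, and the OpenSpace script strings below are illustrative assumptions rather than the system's actual API.

```python
# Minimal sketch: mapping an LLM-produced "Tool Call" to an OpenSpace-bound
# command. Tool names and argument schemas are illustrative assumptions.

TOOLS = [
    {
        "name": "set_camera_focus",
        "description": "Point the camera at a named celestial object.",
        "parameters": {"target": "string"},
    },
    {
        "name": "set_simulation_time",
        "description": "Jump the simulation clock to an ISO 8601 timestamp.",
        "parameters": {"time": "string"},
    },
]

def dispatch(tool_call: dict) -> str:
    """Translate a structured tool call into a Lua script string for OpenSpace."""
    name, args = tool_call["name"], tool_call["arguments"]
    if name == "set_camera_focus":
        # Hypothetical scripting call; the real OpenSpace property path may differ.
        return f"openspace.setPropertyValue('NavigationHandler.OrbitalNavigator.Anchor', '{args['target']}')"
    if name == "set_simulation_time":
        return f"openspace.time.setTime('{args['time']}')"
    raise ValueError(f"Unknown tool: {name}")

# Example: the LLM has interpreted "take us to Mars" as a tool call.
print(dispatch({"name": "set_camera_focus", "arguments": {"target": "Mars"}}))
```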
A Speech-to-Text (STT) model serves as the initial processing stage, converting audible human speech into a digital text stream. This conversion is essential because Large Language Models (LLMs) operate on textual data; they cannot directly interpret audio signals. The STT model's accuracy directly impacts the LLM's ability to correctly interpret user intent. Modern STT systems utilize deep learning architectures, typically based on recurrent or transformer networks, trained on massive datasets of spoken language to achieve high transcription rates and robustness to variations in accent, speech rate, and background noise. The output of the STT model – the transcribed text – is then provided as input to the LLM for subsequent natural language reasoning and tool call generation.
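The study does not name the specific STT model used; the sketch below drops in the open-source Whisper model purely to show where transcription sits in the pipeline, upstream of the LLM.

```python
# Sketch of the speech-to-text stage, using Whisper (pip install openai-whisper)
# as a stand-in; the paper does not specify which STT system is used.
import whisper

stt_model = whisper.load_model("base")  # small general-purpose model

def transcribe(audio_path: str) -> str:
    """Convert an audio clip of the Guide's speech into text for the LLM."""
    result = stt_model.transcribe(audio_path)
    return result["text"].strip()

# The transcript, not the audio, is what the LLM reasons over.
guide_utterance = transcribe("guide_request.wav")  # hypothetical file name
```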
Two Modes of Operation: Reactive and Proactive Control
In Reactive Mode, the agent functions as a direct extension of the Human Guide's control, executing actions solely in response to explicit commands received through the communication channel. This mode prioritizes precision and predictability; the agent does not initiate any action independently and strictly adheres to the Guide's instructions. The system is designed to interpret and immediately enact these commands, providing granular control over the visualization and allowing the Guide to manage the presentation flow with detailed accuracy. This direct command structure ensures that the Guide maintains complete authority over the experience, facilitating interventions or adjustments as needed without any autonomous behavior from the agent.
The system's Proactive Mode leverages the Large Language Model (LLM) to perform actions without direct instruction from the Human Guide, with the goal of streamlining the presentation experience. This functionality isn't based on pre-programmed sequences; rather, the LLM dynamically assesses the presentation context and predicts subsequent actions that align with the overall narrative flow. Autonomous actions may include advancing to the next visualization element, adjusting viewpoint parameters, or initiating specific data queries – all executed preemptively to maintain a smooth and engaging presentation for the audience. This predictive capability aims to reduce the cognitive load on the Guide by automating routine tasks and allowing them to focus on higher-level communication and interpretation.
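A rough way to picture the two modes is a per-cycle decision that always honors explicit commands and only admits LLM-initiated actions when proactive behavior is enabled. The control logic below is a simplified assumption, not the published implementation.

```python
from enum import Enum

class Mode(Enum):
    REACTIVE = "reactive"    # act only on explicit Guide commands
    PROACTIVE = "proactive"  # may also act on LLM-predicted next steps

def step(mode: Mode, guide_command: str | None, llm_suggestion: str | None) -> str | None:
    """Decide which action (if any) to send to OpenSpace in this cycle."""
    if guide_command:                        # explicit instructions always win
        return guide_command
    if mode is Mode.PROACTIVE and llm_suggestion:
        return llm_suggestion                # autonomous action, e.g. advancing the tour
    return None                              # reactive mode with no command: do nothing
```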
Contextual Awareness within the presentation agent relies on continuous monitoring of both the visualization state and the narrative flow. The Large Language Model (LLM) processes data regarding the currently displayed visual elements, including object properties, relationships, and active selections. Simultaneously, it tracks the spoken or textual narrative provided by the Human Guide, identifying key topics, entities, and intended focus. This combined understanding allows the LLM to infer the Guide's likely next steps or desired modifications to the visualization, enabling proactive actions such as pre-fetching data, highlighting relevant objects, or initiating transitions without explicit command.
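One plausible way to realize this contextual awareness is to fold the current scene state and a rolling transcript into the prompt that drives proactive suggestions. The field names and prompt wording below are assumptions for illustration; the paper does not disclose its prompting scheme.

```python
# Illustrative prompt assembly for proactive suggestions. The state fields and
# transcript handling are assumptions, not the system's actual prompt format.
def build_context_prompt(scene_state: dict, transcript: list[str]) -> str:
    recent_narration = " ".join(transcript[-5:])  # last few Guide utterances
    return (
        "You are assisting a live planetarium presentation.\n"
        f"Camera anchor: {scene_state.get('anchor')}\n"
        f"Simulation time: {scene_state.get('time')}\n"
        f"Visible layers: {', '.join(scene_state.get('layers', []))}\n"
        f"Recent narration: {recent_narration}\n"
        "Suggest at most one next visualization action, or reply 'none'."
    )
```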
Communication between the agent and the OpenSpace platform is facilitated by the WebSocket protocol, a full-duplex communication system over a single TCP connection. This allows for persistent, real-time data exchange, eliminating the overhead of repeatedly establishing new connections for each interaction. WebSocket enables the agent to both receive commands from the Human Guide and transmit autonomous actions to OpenSpace with minimal latency, crucial for maintaining a fluid and responsive presentation experience. The protocol supports text and binary data formats, accommodating a variety of control signals and visualization updates, and operates across standard HTTP ports, simplifying firewall traversal.
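A minimal client along these lines, using the Python `websockets` library, might look like the sketch below. The endpoint, port, and JSON payload shape are assumptions; the actual OpenSpace server protocol should be taken from its documentation.

```python
# Minimal sketch of a persistent WebSocket link to OpenSpace. The URI and the
# JSON message shape are illustrative assumptions, not the documented protocol.
import asyncio
import json
import websockets

async def send_script(lua_script: str, uri: str = "ws://localhost:4682/websocket") -> None:
    async with websockets.connect(uri) as ws:
        # One long-lived connection carries both Guide commands and autonomous
        # agent actions with minimal per-message overhead.
        await ws.send(json.dumps({"type": "luascript", "payload": {"script": lua_script}}))
        reply = await ws.recv()
        print("OpenSpace replied:", reply)

asyncio.run(send_script("openspace.time.setDeltaTime(3600)"))
```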
Beyond Automation: The Evolving Role of the Visual Storyteller
The system's automation of repetitive visualization tasks fundamentally shifts the role of the presenter, termed the "Human Guide", from a technical operator to a dedicated storyteller. By handling the intricacies of data navigation and display, the technology liberates the Guide to concentrate on crafting a narrative, interpreting the visualized information, and directly engaging the audience. This focus on communication, rather than control, cultivates a more compelling and impactful presentation experience, allowing nuanced insights to take center stage and fostering a stronger connection between the data and those receiving it. The result is not simply a display of information, but a carefully curated journey designed to inform, inspire, and provoke thought.
This innovative system operates as a sophisticated visualization interaction recommender, significantly diminishing the demands for direct manual operation of the OpenSpace environment. By proactively suggesting appropriate actions – such as viewpoint adjustments, data overlays, or specific explorations – the system anticipates the needs of the "Human Guide". This automation frees the Guide from repetitive tasks, allowing a greater focus on crafting a compelling narrative and engaging the audience with the visualized data. The result is a more fluid and immersive experience, as the system handles the technical intricacies of navigation and data presentation, effectively acting as an intelligent assistant within the visualization workflow.
Recent research indicates the practical viability of Large Language Model (LLM)-based co-pilots for complex visualization tasks. A study exploring user interaction with such a system revealed a nuanced preference for how these AI assistants operate; while the technology proved functional across interaction styles, a majority of participants – three out of five – expressed a stronger inclination towards a "reactive" mode. This suggests users favored an approach where the LLM responds to direct prompts or actions, rather than proactively suggesting visualizations. This finding is crucial for refining the design of future AI-driven visualization tools, highlighting the importance of maintaining user control and agency in the exploratory data analysis process.
A detailed error analysis revealed critical areas for refining the performance of the AI visualization pilot. Researchers categorized errors across four distinct dimensions: Detection, assessing the AI's ability to correctly perceive the current state of the visualization; Reasoning, evaluating the logic behind its proposed actions; Context, examining its understanding of the broader narrative and user intent; and Naturalness, judging the fluency and appropriateness of the AI's interaction style. This four-dimensional framework highlighted that improvements in accurately detecting user needs and strengthening the AI's reasoning capabilities are paramount to producing more effective and seamless co-piloting experiences. By focusing on these specific areas, developers can address the root causes of errors and move towards creating AI agents that genuinely augment, rather than hinder, the process of visual exploration and storytelling.
Advancing this automated visualization system necessitates the implementation of robust error detection mechanisms. These systems will not simply execute commands, but actively monitor actions for misalignment with the "Human Guide's" overarching intent. Such safeguards will analyze proposed visualizations, predict potential misinterpretations, and proactively offer alternative approaches before unintended consequences arise. This proactive error detection goes beyond simple syntactic correctness; it requires a contextual understanding of the narrative being constructed and the audience's likely interpretation, ensuring the agent's contributions consistently enhance, rather than detract from, the storytelling process. Ultimately, integrating these checks will foster a collaborative environment where the AI functions as a truly supportive co-pilot, anticipating and preventing errors before they impact the final presentation.
The capacity for a language model to truly understand and respond to complex visualization requests hinges on its ability to move beyond text-based input. Integrating the language model with a multimodal model offers a pathway to interpret visual cues directly from the data being explored – recognizing patterns, anomalies, or significant features within the visualization itself. This expanded understanding allows the system to anticipate the "Human Guide's" needs more effectively, proactively suggesting relevant interactions or highlighting key areas of interest. By grounding its reasoning in both linguistic input and visual data, the system transcends simple command execution and begins to function as a genuinely insightful co-pilot, capable of enhancing the storytelling process and fostering a more dynamic exploration of complex datasets.
The exploration of LLMs as "co-pilots" within live planetarium visualizations mirrors a fundamental principle of understanding any complex system: probing its boundaries. This research doesn't aim to replace the human pilot, but rather to augment their capabilities, offloading cognitive burden and opening avenues for more dynamic interaction. As Brian Kernighan aptly stated, "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." Similarly, the team acknowledges that a fully autonomous system isn't the immediate goal; instead, the focus lies on intelligently assisting the human operator, accepting the inherent complexity and leveraging LLMs to navigate it. This approach recognizes that true mastery comes not from flawless execution, but from informed experimentation and iterative refinement.
Beyond the Co-Pilot
The pursuit of "proactive AI" in live visualization reveals a predictable truth: systems designed to anticipate need inevitably stumble on the unpredictable nature of inquiry. This work demonstrates the LLM's capacity to assist, not supplant, human expertise – a nuance often lost in the rush to automate. The true limitation isn't the LLM's knowledge base, but the difficulty of translating spontaneous questions into queries a machine can meaningfully process. Expect future efforts to focus less on broader knowledge and more on refining the interface – the messy, ambiguous space between a thought and a command.
Consider the implications. The planetarium, historically a curated experience, edges towards a genuinely interactive environment. But interaction demands resilience. The system must gracefully handle the irrelevant, the nonsensical, and the delightfully unexpected tangents that characterize human curiosity. This isn't about building a smarter AI; it's about building a more forgiving one. A system that doesn't punish exploration, but rather rewards it with a reasonable attempt at response – even if that response is simply acknowledging the limits of its understanding.
Ultimately, the value lies not in automating the presentation of facts, but in amplifying the presenter's ability to navigate the unknown. The LLM becomes a tool for cognitive offloading – a digital assistant capable of handling the mundane, freeing the human pilot to focus on the truly challenging aspects of live interpretation. The next iteration will not be about a more knowledgeable co-pilot, but one better at admitting what it doesn't know – a surprisingly rare quality in artificial intelligence.
Original article: https://arxiv.org/pdf/2601.20466.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/