Author: Denis Avetisyan
Researchers have developed a system that uses artificial intelligence to translate descriptive scenes into dynamic actions for virtual agents, enabling procedurally generated interactive narratives.

This work demonstrates a method for authoring agent-based narratives by combining large language models with a modular behavior system and scene descriptions.
Creating compelling, dynamic narratives for virtual agents remains a challenge due to the labor-intensive nature of scripting detailed behaviors. This paper introduces a system – LLM-Based Authoring of Agent-Based Narratives through Scene Descriptions – that leverages large language models to procedurally generate agent actions from simple scene descriptions. Our approach reliably translates high-level prompts into executable behaviors, enabling rapid prototyping of interactive agent-based stories. Could this methodology unlock new possibilities for scalable and adaptive storytelling in virtual environments and beyond?
The Unfolding Now: Beyond Linear Narrative
For centuries, storytelling has largely functioned as a linear presentation of predetermined events. This reliance on pre-scripted narratives, while effective in delivering a specific message, inherently restricts audience engagement and replayability. The traditional format often positions the recipient as a passive observer, limited to experiencing the story as it is told rather than actively participating in its unfolding. Consequently, opportunities for unique, personalized experiences are diminished, and the narrative’s potential for emergent meaning – the kind discovered through individual interpretation and interaction – remains largely untapped. This static quality contrasts sharply with the dynamic, unpredictable nature of real-world experiences, creating a disconnect that Agent-Based Narrative seeks to address by prioritizing interaction and emergence over rigid predetermination.
Agent-Based Narrative represents a significant departure from conventional storytelling, moving beyond pre-defined plots to embrace a system where narratives unfold organically. Instead of dictating events, this approach establishes a world populated by autonomous agents – characters driven by their own goals, motivations, and internal logic. Stories aren’t told; they happen as a consequence of these agents interacting with each other and their environment. The resulting narratives are dynamic and unpredictable, offering a level of emergent complexity rarely found in traditional media. Each playthrough or iteration can yield a unique story, shaped by the specific choices and actions of the agents, creating a deeply engaging experience where the audience witnesses, rather than receives, the unfolding drama. This methodology prioritizes believable behavior and realistic consequences, fostering a sense of immersion and allowing for truly novel narrative arcs.
Creating compelling, emergent narratives through agent interactions demands more than simply programming individual behaviors; it requires a sophisticated system for converting broad narrative aims – such as “establish a sense of dread” or “forge an unlikely alliance” – into concrete, believable actions for each autonomous agent. This translation isn’t a matter of direct instruction, but rather of defining internal motivations, perceptual biases, and reactive tendencies within each agent. The system must then simulate how these agents, operating with their unique parameters, would independently respond to a shared environment and each other, allowing a story to unfold as a natural consequence of their interactions. Successfully bridging this gap between high-level intent and granular behavior is the central challenge, necessitating advancements in areas like behavioral modeling, artificial intelligence, and computational psychology to ensure agents don’t simply act but convincingly feel and react within the narrative context.

The Logic of Action: From Prompt to Performance
The system architecture leverages Large Language Models (LLMs) as the primary means of defining agent behavior. Rather than directly programming actions, desired agent activities are expressed as natural language text prompts submitted to the LLM. The LLM processes these prompts and generates a corresponding textual description of the action the agent should perform. This text-based output serves as an intermediary representation, decoupling high-level behavioral goals from the specific implementation details of agent control, and enabling flexibility in agent design and response.
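A minimal sketch of this prompt-to-action loop is shown below, using the OpenAI Python client as a stand-in for any of the tested models; the system prompt, model name, and scene text are illustrative assumptions, not the paper’s actual prompts.

```python
# Hypothetical sketch: turning a scene description into a proposed agent
# action via an LLM. The prompt wording and model choice are assumptions,
# not the paper's actual implementation.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You control a virtual agent. Given a scene description and a goal, "
    "reply with a single short sentence describing the agent's next action."
)

def propose_action(scene_description: str, goal: str) -> str:
    """Ask the LLM for a text description of the agent's next action."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Scene: {scene_description}\nGoal: {goal}"},
        ],
    )
    return response.choices[0].message.content

# e.g. propose_action("A dim tavern with a bartender behind the counter.",
#                     "Order a drink.")
# might yield: "The agent walks to the counter and greets the bartender."
```

The text returned here is exactly the intermediary representation described above: it commits to no engine-specific call until a later component interprets it.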
The SceneDirector functions as the intermediary between the Large Language Model (LLM) and the virtual agent’s actions. It receives text-based action proposals generated by the LLM and processes this natural language output into a structured format understandable by the agent’s behavioral systems. This parsing involves identifying the intended action, relevant objects or characters, and any necessary parameters, then translating these elements into specific function calls or behavior tree traversals within the agent’s control system. The SceneDirector’s functionality is critical for bridging the gap between the LLM’s linguistic output and the agent’s physical execution of tasks in the virtual environment.
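The paper does not publish the SceneDirector’s parsing code, but its role can be sketched as a validator-dispatcher that routes a structured action proposal to a behavior handler; the JSON schema and handler names below are hypothetical.

```python
# Hypothetical sketch of the SceneDirector's parsing/dispatch role.
# The action schema and handler names are assumptions for illustration.
import json

def walk_to(agent, target): print(f"{agent} walks to {target}")
def pick_up(agent, target): print(f"{agent} picks up {target}")

HANDLERS = {"walk_to": walk_to, "pick_up": pick_up}

def dispatch(agent: str, llm_output: str) -> None:
    """Parse an LLM action proposal like
    '{"action": "walk_to", "target": "door"}' and invoke the matching
    behavior, rejecting proposals the agent does not support."""
    try:
        proposal = json.loads(llm_output)
        handler = HANDLERS[proposal["action"]]
    except (json.JSONDecodeError, KeyError) as err:
        # Malformed or unknown proposals are dropped rather than executed.
        print(f"rejected proposal: {err}")
        return
    handler(agent, proposal.get("target"))

dispatch("guard", '{"action": "walk_to", "target": "gate"}')
```

The key design point is the rejection path: anything the LLM emits that does not map onto a known behavior is filtered out before it can reach the agent’s control system.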
Empirical testing of Large Language Model (LLM) response times for agent action generation revealed that ChatGPT consistently exhibited the fastest performance. Across a series of experiments, ChatGPT generated responses in a range of 0.79 to 3.50 seconds. This timeframe represents a statistically significant improvement over the response times observed from Claude, Gemini, and Grok during the same testing scenarios. The observed performance difference indicates that ChatGPT currently offers a substantial advantage in applications requiring low-latency agent behavior generation.
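A sketch of how such per-request latencies could be collected is given below, with a generic `query` callable standing in for each provider’s client; the paper’s exact measurement protocol is not specified.

```python
# Hypothetical timing harness for comparing LLM response latencies.
# `query` stands in for any provider's blocking request function.
import time
import statistics

def measure_latency(query, prompt: str, trials: int = 10) -> dict:
    """Time repeated calls to an LLM endpoint and summarize the results."""
    samples = []
    for _ in range(trials):
        start = time.perf_counter()
        query(prompt)  # blocking request to the model under test
        samples.append(time.perf_counter() - start)
    return {
        "min_s": min(samples),
        "max_s": max(samples),
        "mean_s": statistics.mean(samples),
    }
```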

The Embodied Agent: Movement and Presence
Realistic agent embodiment in virtual environments relies on the coordinated function of three core systems: animation, navigation, and Inverse Kinematics (IK). The animation system provides the visual representation of movement, while the navigation system enables agents to autonomously traverse the environment, pathfinding around obstacles and towards defined goals. Crucially, IK integrates with both systems, dynamically adjusting agent posture and limb positions to maintain balance and believability during movement and in response to environmental interactions. This integrated approach ensures that agent actions are not simply pre-defined animations, but dynamically generated responses that appear physically plausible and contextually appropriate, contributing to a higher degree of immersion for the user.
The navigation system is built around a navigation mesh (NavMesh), a traversable graph generated from the free space in the Scene Description. Agents use search algorithms, such as A* pathfinding, to compute optimal paths across this graph from a starting location to a designated goal. The system incorporates obstacle avoidance by dynamically recalculating paths or employing steering behaviors to navigate around static and dynamic obstructions. It also accounts for agent dimensions, ensuring paths are viable given the agent’s physical size, and can be configured with parameters like maximum slope angle and step height to influence movement feasibility within the environment.
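For concreteness, here is a compact A* search in the spirit described above; real NavMesh systems search a polygon adjacency graph rather than a grid, so this is a simplified stand-in.

```python
# Minimal A* pathfinding on a 4-connected grid, a simplified stand-in for
# search over a NavMesh's polygon adjacency graph.
import heapq

def astar(grid, start, goal):
    """grid: 2D list where 0 is walkable and 1 is blocked.
    Returns a list of (row, col) cells from start to goal, or None."""
    rows, cols = len(grid), len(grid[0])
    manhattan = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
    open_set = [(manhattan(start, goal), 0, start, None)]
    came_from, best_cost = {}, {start: 0}
    while open_set:
        _, cost, node, parent = heapq.heappop(open_set)
        if node in came_from:
            continue  # already expanded via a cheaper path
        came_from[node] = parent
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = came_from[node]
            return path[::-1]
        r, c = node
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nxt = (nr, nc)
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_cost = cost + 1
                if new_cost < best_cost.get(nxt, float("inf")):
                    best_cost[nxt] = new_cost
                    heapq.heappush(
                        open_set,
                        (new_cost + manhattan(nxt, goal), new_cost, nxt, node),
                    )
    return None
```

Dynamic obstacle avoidance then amounts to invalidating the affected cells (or polygons) and re-running the search, or blending the planned path with local steering.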
The integration of Inverse Kinematics (IK) and the Animation System is critical for generating realistic agent behavior. IK solves for joint angles to achieve desired end-effector positions, allowing agents to reach for objects or maintain balance even on uneven terrain. This contrasts with traditional animation techniques where all joint angles are pre-defined. The Animation System then layers these IK-driven poses with pre-authored animations – such as walking, running, or idle stances – to create a fluid and believable performance. By dynamically adjusting poses based on environmental interaction and combining them with established animation cycles, the system minimizes unnatural or robotic movements, significantly enhancing the user’s sense of immersion.
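For intuition, the classic analytic two-bone IK solve (law of cosines) is sketched below in 2D; production systems typically use three-dimensional, constraint-aware solvers such as FABRIK or CCD, so this is only a simplified illustration.

```python
# Analytic two-bone IK in 2D via the law of cosines (simplified sketch).
import math

def two_bone_ik(l1: float, l2: float, tx: float, ty: float):
    """Return (shoulder, elbow) angles in radians so a chain of segment
    lengths l1, l2 rooted at the origin reaches (tx, ty); unreachable
    targets are clamped to the chain's reachable annulus."""
    d = math.hypot(tx, ty)
    d = max(max(abs(l1 - l2), 1e-6), min(d, l1 + l2))
    clamp = lambda x: max(-1.0, min(1.0, x))
    # Law of cosines gives the elbow's interior angle of the triangle
    # (root, elbow, target); the joint bend is its supplement.
    elbow = math.pi - math.acos(clamp((l1**2 + l2**2 - d**2) / (2 * l1 * l2)))
    # Aim the shoulder at the target, then rotate back by the base angle.
    base = math.acos(clamp((l1**2 + d**2 - l2**2) / (2 * l1 * d)))
    shoulder = math.atan2(ty, tx) - base
    return shoulder, elbow

# e.g. two_bone_ik(1.0, 1.0, 1.2, 0.0) bends the elbow just enough for
# the end effector to land on the target 1.2 units away.
```

An animation layer would then blend these solved joint angles over the base walk or idle cycle, which is what keeps reaching and balancing from looking robotic.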
Beyond the Script: A Future of Emergent Storytelling
Traditional interactive narrative systems frequently rely on Behavior Trees, a methodology demanding extensive and painstaking manual authoring of every possible scenario and response. This process, while offering precise control, proves incredibly time-consuming and inflexible, particularly when aiming for dynamic, open-ended experiences. A novel approach diverges from this rigid structure, instead prioritizing a system where narrative elements are not pre-defined but generated through computational means. This shift reduces the burden on developers, enabling the creation of richer, more complex interactions without the limitations imposed by exhaustive, hand-crafted scripting. The result is a move toward narratives that feel less like pre-determined paths and more like genuinely responsive and evolving stories.
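To make the contrast concrete, here is what a hand-authored fragment of a behavior tree looks like; every condition and action below must be anticipated and scripted by a designer, which is precisely the burden the generative approach removes. Node and blackboard names are illustrative.

```python
# Minimal hand-authored behavior tree: every branch must be anticipated
# and scripted in advance (names are illustrative).
SUCCESS, FAILURE = True, False

class Sequence:
    """Runs children in order; fails as soon as one child fails."""
    def __init__(self, *children): self.children = children
    def tick(self, agent):
        return all(child.tick(agent) for child in self.children)

class Condition:
    def __init__(self, predicate): self.predicate = predicate
    def tick(self, agent): return self.predicate(agent)

class Action:
    def __init__(self, effect): self.effect = effect
    def tick(self, agent): self.effect(agent); return SUCCESS

greet_visitor = Sequence(
    Condition(lambda a: a["sees_visitor"]),
    Action(lambda a: print("walk to visitor")),
    Action(lambda a: print("say hello")),
)

greet_visitor.tick({"sees_visitor": True})
```

Every new scenario means another such subtree, hand-wired by a designer; the LLM-driven approach instead generates the equivalent action sequence from a scene description at runtime.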
The advent of large language models (LLMs) is fundamentally reshaping interactive storytelling, moving beyond pre-defined scripts toward experiences that genuinely respond and adapt. Instead of narratives unfolding along rigidly programmed paths, LLMs facilitate a dynamic process where the story evolves organically based on player actions and the internal ‘reasoning’ of the virtual world. This isn’t simply branching dialogue; it’s a system capable of generating novel plot points, character motivations, and world details on the fly, creating a sense of unpredictability and immersion previously unattainable. Consequently, the narrative isn’t told to the player, but emerges through interaction, fostering a uniquely personal and engaging experience where each playthrough can be distinct and surprising.
Recent advancements showcase a compelling synergy between large language models and agent-based systems, resulting in the real-time generation of dynamic, coherent narratives. This innovative approach bypasses the limitations of traditional scripting methods by allowing narratives to unfold procedurally, driven by the interplay between an LLM’s planning capabilities and the behavioral execution of virtual agents. The system effectively orchestrates agent actions and dialogue, ensuring that the unfolding story remains logically consistent and engaging without requiring pre-authored content. This marks a significant step towards truly interactive experiences where stories aren’t simply told to the user, but emerge from their interactions and the autonomous actions of the virtual world, promising a future of adaptable and endlessly replayable narratives.

The Language of Action: Affordances and Context
Affordance modeling is fundamental to creating realistic agent behavior, as it establishes the potential actions an agent can perform with any given object or within a specific environment. This process doesn’t simply define what an agent can do, but crucially, how it understands its interaction possibilities; a chair, for instance, doesn’t inherently tell an agent it’s suitable for sitting, but rather, the agent’s internal model, built through affordance analysis, reveals this potential. By representing objects not as static entities, but as collections of actionable properties – such as ‘supportive’, ‘graspable’, or ‘occludable’ – the system enables agents to reason about their surroundings and select appropriate actions. This careful consideration of interaction potential is paramount; without it, agents may perform illogical or unrealistic actions, breaking immersion and hindering believability, while accurate affordance modeling paves the way for seamless and intuitive interactions within a virtual world.
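One way to realize this “objects as collections of actionable properties” view is a simple affordance set gated by contextual preconditions; the property names and checks below are illustrative assumptions, not the paper’s schema.

```python
# Hypothetical affordance model: objects carry actionable properties, and
# an action is offered only when its contextual precondition also holds.
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    name: str
    affordances: set = field(default_factory=set)

# Preconditions gate affordances on context, e.g. sitting needs clear space.
PRECONDITIONS = {
    "sit_on": lambda ctx: ctx.get("space_clear", False),
    "grasp": lambda ctx: ctx.get("within_reach", False),
}

def available_actions(obj: SceneObject, context: dict) -> set:
    """Return the affordances of obj that are valid in this context."""
    return {
        act for act in obj.affordances
        if PRECONDITIONS.get(act, lambda ctx: True)(context)
    }

chair = SceneObject("chair", {"sit_on", "grasp"})
print(available_actions(chair, {"space_clear": True, "within_reach": False}))
# -> {'sit_on'}
```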
The believability of an agent’s behavior hinges on its ability to act in a manner consistent with the surrounding environment and the objects within it. Accurate representation of potential actions – what an object allows the agent to do – ensures contextual appropriateness. If an agent attempts to pour water from a solid block, or fails to grasp a handle when reaching for a door, the illusion of intelligence breaks down. This necessitates a robust system where agents don’t simply know what actions are possible, but also understand when those actions make sense given the situation; a chair affords sitting, but only when a suitable posture and space are available. Successfully modeling these affordances allows for fluid, natural interactions, fostering a stronger sense of presence and immersion for the user, and ultimately, a more convincing artificial intelligence.
Current interaction systems often rely on pre-defined affordances – the qualities of an object that suggest how it should be used. However, research is actively progressing towards equipping agents with the ability to dynamically discover affordances as situations unfold. This involves developing algorithms that allow agents to perceive novel objects and, through observation or limited interaction, infer potential uses without prior programming. Such a system would enable agents to navigate unforeseen circumstances and interact with unfamiliar environments in a more robust and believable manner, moving beyond scripted behaviors to truly adaptive responses. The ultimate goal is to create agents capable of independent problem-solving and seamless integration into complex, real-world scenarios by continually learning and refining their understanding of how objects and environments afford action.
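A toy illustration of such dynamic discovery, assuming perceived physical properties as input, is sketched below; real systems would learn these rules or infer them from interaction rather than hard-coding thresholds.

```python
# Toy rule-based affordance inference from perceived object properties.
# Property names and thresholds are illustrative assumptions.
def infer_affordances(props: dict) -> set:
    """Map perceived physical properties to candidate affordances."""
    rules = [
        (lambda p: p.get("flat_top") and 0.3 <= p.get("height_m", 0) <= 0.6,
         "sit_on"),
        (lambda p: p.get("mass_kg", float("inf")) < 2.0, "pick_up"),
        (lambda p: p.get("has_opening"), "put_into"),
    ]
    return {label for test, label in rules if test(props)}

crate = {"flat_top": True, "height_m": 0.45, "mass_kg": 8.0}
print(infer_affordances(crate))  # -> {'sit_on'}
```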
The presented work inherently acknowledges the transient nature of even meticulously crafted systems. While the system aims to generate dynamic narratives through agent behavior and LLM-driven scene interpretation, its continued efficacy relies on adaptation. As Edsger W. Dijkstra stated, “It’s not enough to have good code; you need to have good explanations of what that code is doing.” This highlights a crucial parallel: the system’s procedural generation, while innovative, requires continuous monitoring and refinement – a form of ‘explanation’ – to maintain narrative coherence and believability over time. Any improvement to agent behavior or LLM prompting, however sophisticated, will inevitably age faster than expected, necessitating a continual cycle of assessment and adjustment to prevent decay within the interactive storytelling experience.
What Lies Ahead?
The coupling of large language models with agent-based systems, as demonstrated, offers a pathway to narrative generation, but it is a pathway that, like all things, will inevitably encounter entropy. The current architecture excels at translating description into action, yet the true challenge isn’t simply doing – it’s the emergence of genuine, unpredictable behavior. Systems learn to age gracefully when their internal inconsistencies are not merely patched, but become generative forces. The focus may shift from perfecting the translation itself to deliberately introducing controlled ‘noise’ into the process, allowing for deviations from the described scene.
A critical area for future work resides in addressing the inherent limitations of relying on descriptive prompts. The system’s capacity is currently bound by the quality and completeness of those descriptions. Perhaps the next iteration will not seek to interpret a scene, but to imagine one, to fill in the gaps and construct a world beyond the immediate prompt.
Ultimately, the value may not reside in achieving perfect procedural generation, but in understanding the limitations of such endeavors. Sometimes observing the process of a system’s decline, its graceful aging, is more illuminating than attempting to accelerate its progress. The pursuit of seamless interaction may yield to an appreciation for the beauty of imperfection, the telltale signs of a system evolving, rather than merely executing.
Original article: https://arxiv.org/pdf/2512.20550.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/