Author: Denis Avetisyan
New research reveals that while artificial intelligence can generate visually-rich narratives, its approach to storytelling differs significantly from human creativity.
![A sequence of story and character images serves as the basis for comparative textual analysis, where both human and GPT-4o outputs are segmented (marked by [SEP]) to correspond with individual images in the sequence, establishing a unit for evaluating performance across varying prompt lengths.](https://arxiv.org/html/2603.25537v1/Figures/char_0001.jpg)
A unified metric for narrative coherence demonstrates distinct discourse patterns between human-authored and vision-language model-generated stories.
While large language models increasingly demonstrate fluency in generating narratives, assessing the coherence of those stories, particularly when grounded in visual information, remains a challenge. This work, ‘Humans vs Vision-Language Models: A Unified Measure of Narrative Coherence’, introduces a unified metric to compare narrative coherence in human-written and model-generated stories based on visual prompts, examining factors like coreference, discourse relations, and multimodal grounding. The analysis reveals that despite surface-level similarities, vision-language models exhibit distinct patterns of narrative organization compared to humans. Ultimately, can these insights into coherence differences guide the development of more human-like and engaging storytelling AI?
The Fragile Architecture of Narrative Coherence
Narrative coherence, the quality of a story feeling connected and understandable, is fundamental to engaging an audience. It transcends simple grammatical correctness or logical sequencing; instead, it relies on the seamless integration of events, motivations, and ideas within a constructed world. A coherent narrative doesn’t merely present information, but builds upon it, creating a sense of flow where each element feels purposefully linked to what came before and foreshadows what is to come. This isn’t about avoiding contradictions, but about resolving them in a satisfying manner, or, at the very least, presenting them as meaningful parts of a larger, understandable framework. Without this underlying coherence, even the most fantastical or exciting plot can feel disjointed and ultimately fail to captivate, leaving the audience struggling to connect with the story’s emotional core.
Current computational methods for evaluating narrative coherence frequently fall short of capturing true understanding, instead focusing on easily quantifiable but ultimately superficial textual features. These approaches often prioritize surface-level connections – such as the frequency of shared words or the length of sentences – rather than deeper elements of meaning. Consequently, a story might be flagged as coherent simply because of repetitive phrasing, even if the events themselves lack logical flow or consistent character motivations. This reliance on shallow metrics limits the effectiveness of automated narrative analysis and hinders efforts to build artificial intelligence capable of generating genuinely compelling and internally consistent stories, as true coherence demands a grasp of the underlying relationships between events, characters, and themes.
A truly coherent narrative hinges on more than just grammatical correctness; it demands a robust understanding of how elements connect across the story’s progression. Referential continuity ensures that entities – people, places, objects – are consistently identified and tracked, preventing confusing shifts in designation. Equally important is the skillful management of topic shifts, where transitions between subjects feel natural and contribute to the overall flow, rather than appearing abrupt or disjointed. Finally, character consistency – maintaining believable motivations, behaviors, and relationships – is paramount; deviations erode trust and disrupt the reader’s immersion. Assessing these three facets – reference, topic, and character – represents a significant challenge for computational models, yet mastering them is crucial for both accurately evaluating existing narratives and building artificial intelligence capable of crafting genuinely compelling stories.
The ability to quantify narrative coherence holds significant implications for two distinct but converging fields. Automated narrative analysis, currently limited by surface-level metrics, stands to gain a far more nuanced understanding of story structure and emotional impact through precise measurement of referential continuity, topic shifts, and character consistency. Simultaneously, the development of truly compelling AI-generated stories hinges on the same principles; algorithms capable of evaluating and replicating coherent narratives are essential for moving beyond formulaic or disjointed outputs. By focusing on these core elements of storytelling, researchers can not only deconstruct what makes a narrative engaging, but also empower artificial intelligence to craft stories that resonate with audiences on a deeper level, effectively bridging the gap between computational generation and human artistic expression.

Unveiling the Underlying Grammar of Story
Entity-based models of discourse coherence function by identifying and tracking entities – the people, places, and things discussed – across consecutive sentences or larger narrative segments. These models move beyond simple keyword matching by explicitly representing the roles these entities play – agent, patient, instrument, etc. – within each segment. By maintaining a record of entity states and transitions, the model can establish connections based on shared entities and consistent role assignments, even when surface-level linguistic features differ. This approach allows for the identification of coherence relationships that would be missed by methods focusing solely on lexical overlap or explicit connectives, as it infers connections based on the ongoing participation of entities within the discourse.
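The entity-tracking idea described above can be illustrated with a toy sketch in the style of an entity grid: each sentence maps the entities it mentions to grammatical roles, and coherence is approximated by counting role transitions across adjacent sentences. The sentences, entities, and role labels below are invented for illustration, not drawn from the paper.

```python
from collections import Counter

# Toy entity grid: each sentence maps entities to roles
# ("S" subject, "O" object, "-" absent).
sentences = [
    {"knight": "S", "dragon": "O"},
    {"knight": "S", "sword": "O"},
    {"dragon": "S"},
]

entities = {e for sent in sentences for e in sent}

def transitions(sents, ents):
    """Count role transitions (e.g. S->S, O->-) for every entity
    across each pair of adjacent sentences."""
    counts = Counter()
    for prev, curr in zip(sents, sents[1:]):
        for e in ents:
            counts[(prev.get(e, "-"), curr.get(e, "-"))] += 1
    return counts

print(transitions(sentences, entities))
```

A text dominated by transitions like S->S (an entity staying in subject position) tends to read as more coherent than one dominated by abrupt appearances and disappearances, which is exactly the signal entity-based models exploit.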
Discourse-relation frameworks utilize predefined relation types – such as causation, contrast, elaboration, and temporal sequence – to categorize the connections between text segments. These frameworks, often implemented as directed graphs, represent discourse units as nodes and the identified relations as edges. Common examples include Rhetorical Structure Theory (RST) which focuses on the purpose of text spans relative to each other, and the Penn Discourse Treebank (PDTB), which annotates explicit and implicit discourse connectives to identify relations like contingency, comparison, and expansion. The application of these frameworks allows for the formal representation of discourse structure, facilitating computational analysis of text coherence and the development of natural language understanding systems.
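The graph representation mentioned above can be sketched minimally: discourse units as nodes, typed relations as directed edges. The units and relation labels below are illustrative placeholders loosely following PDTB's top-level senses, not annotations from any real treebank.

```python
# Discourse units as nodes; (source, target, relation) edges.
units = ["The bridge iced over.", "Traffic was rerouted.", "Commutes doubled."]
relations = [
    (0, 1, "contingency"),  # cause -> effect
    (1, 2, "contingency"),
    (0, 2, "expansion"),    # illustrative long-range link
]

def relation_density(n_units, rels):
    """Identified relations per adjacent-unit pair: a crude
    connectedness measure over the discourse graph."""
    return len(rels) / max(n_units - 1, 1)

print(relation_density(len(units), relations))
```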
The effectiveness of entity-based and discourse-relation models is fundamentally dependent on accurate coreference resolution. These models function by tracking entities – people, objects, concepts – as they are referenced across a text. Inaccurate identification of which noun phrases refer to the same underlying entity introduces errors in tracking participant roles and establishing relationships between discourse segments. Robust coreference techniques are therefore essential to minimize these errors; they must reliably link all mentions of a given entity, even those expressed through pronouns, definite noun phrases, or indirect references. Failure to do so compromises the model’s ability to build a coherent representation of the discourse and negatively impacts downstream tasks such as summarization, question answering, and machine translation.
Traditional measures of text coherence often rely on surface-level features like lexical overlap and pronoun resolution; however, integrating entity-based models and discourse-relation frameworks allows for a quantification of coherence based on underlying semantic connections. By tracking entities and their roles throughout a text, and simultaneously identifying the relationships between discourse segments, it becomes possible to assign numerical values to the strength and consistency of these connections. This approach moves beyond simple feature counting to assess how well a text’s segments build upon each other to create a unified and logically structured whole, potentially utilizing metrics such as the density of entity-based connections or the consistency of identified discourse relations across a document.
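As a hedged sketch of the quantification step, the two signals named above (density of entity-based connections and consistency of identified discourse relations) can be combined into a single number. The equal weighting and the input counts are arbitrary illustrative choices, not the paper's formula.

```python
def coherence_score(entity_links, relation_hits, n_pairs):
    """Combine entity-connection density and discourse-relation
    coverage over adjacent segment pairs into one score in [0, 1].
    Equal weighting is an illustrative assumption."""
    if n_pairs == 0:
        return 0.0
    entity_density = entity_links / n_pairs
    relation_coverage = relation_hits / n_pairs
    return 0.5 * entity_density + 0.5 * relation_coverage

# 3 of 4 adjacent pairs share an entity; 2 of 4 carry a relation.
print(coherence_score(entity_links=3, relation_hits=2, n_pairs=4))  # 0.625
```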
Extending the Narrative Canvas: Multimodal Coherence
Traditional narrative analysis has largely focused on textual elements; however, contemporary storytelling frequently integrates visual components such as images, video, and illustrations. These visual elements are not merely decorative; they actively contribute to the overall meaning and coherence of the narrative. Visuals can provide contextual information not explicitly stated in the text, offer alternative perspectives on events, and reinforce or even contradict textual descriptions. This interplay between text and visuals creates a richer, more complex narrative experience, requiring analytical methods that consider both modalities to fully understand the story being conveyed. The contribution of visual elements is particularly significant in modern media formats like graphic novels, film, and interactive digital storytelling.
Multimodal Character Grounding (MCG) is a quantitative assessment of the correspondence between textual references to characters and their visual representations in associated imagery. The process involves identifying character mentions within the text and linking them to corresponding visual depictions – typically bounding boxes or segmentations – extracted from images or video frames. This alignment is not simply presence/absence; MCG also considers attributes like pose, action, and relationships between characters to determine the degree of consistency between modalities. Metrics used in MCG commonly involve calculating overlap between predicted and ground truth bounding boxes, and evaluating the similarity of character attributes using embedding spaces derived from visual and textual data. The resulting score provides a measure of how well the visual and textual character information corroborate each other, contributing to overall narrative coherence.
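The bounding-box overlap mentioned above is conventionally computed as intersection-over-union (IoU). The sketch below shows the standard calculation; the box coordinates are made up, and whatever threshold a grounding pipeline applies on top of this is an implementation choice, not something specified here.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (may be empty).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Partial overlap: intersection 25, union 175.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # ~0.143
```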
GrooVIST and MovieNet are computational techniques employed to identify and annotate characters within visual media, specifically video frames. GrooVIST utilizes a graph-based approach to track objects and associate them with textual references, while MovieNet leverages pose estimation and tracking to detect and follow human figures across a video sequence. Both systems output bounding box coordinates and associated confidence scores for identified characters, enabling quantitative analysis of character presence and movement. These annotations facilitate the measurement of multimodal coherence by providing a structured representation of visual character data that can be directly compared with character mentions in accompanying text.
Quantifying multimodal alignment involves establishing metrics to evaluate the correspondence between entities and events described in text and their visual representations. This is typically achieved through automated analysis techniques, such as tracking character appearances and actions across both modalities and calculating similarity scores based on features like visual appearance, spatial relationships, and temporal co-occurrence. Higher alignment scores indicate a stronger cohesive relationship between the text and visuals, suggesting that the visual elements effectively support and reinforce the narrative presented in the text. These quantitative measures enable systematic evaluation of multimodal storytelling and facilitate comparative analysis of different narrative presentations to determine which combinations of text and visuals contribute most effectively to a unified and comprehensible experience for the audience.
Decoding the Dynamics of Narrative Flow
DeDisCo3 is a computational framework designed for the classification of implicit discourse relations within text. Unlike explicit relations signaled by cue phrases, implicit relations require inference to identify connections between adjacent text segments. The system utilizes a multi-label classification approach, assigning one or more labels from a predefined typology to each segment pair, thereby revealing the underlying rhetorical structure of the narrative. This allows for quantitative assessment of how effectively a text establishes connections beyond simple adjacency, providing a metric for evaluating narrative flow and coherence based on the identification of these nuanced relationships between story elements.
BERTopic utilizes a transformer-based approach to identify and track the evolution of topics within a narrative. The method embeds documents, applies UMAP for dimensionality reduction, and clusters the reduced embeddings (typically with HDBSCAN) into dense groups representing distinct topics. These clusters are then used in conjunction with class-based TF-IDF to create easily interpretable topic representations. By analyzing changes in the prevalence and composition of these topics throughout a text, BERTopic provides a quantitative means of assessing thematic progression and coherence, allowing for the identification of abrupt topic shifts or sustained thematic development. The resulting topic time series can be used as a feature for comparative analysis of narrative structure across different texts or generated by different models.
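One simple feature derivable from such a topic time series is a topic-switch rate: the fraction of adjacent segments whose assigned topic changes. The sketch below assumes per-segment integer topic ids of the kind a topic model would assign; the ids are invented, and this is one plausible definition of the rate, not necessarily the paper's.

```python
def topic_switch_rate(topic_ids):
    """Fraction of adjacent segment pairs whose topic id changes."""
    if len(topic_ids) < 2:
        return 0.0
    switches = sum(a != b for a, b in zip(topic_ids, topic_ids[1:]))
    return switches / (len(topic_ids) - 1)

# Topics 3 -> 3 -> 7 -> 7 -> 1: two switches over four pairs.
print(topic_switch_rate([3, 3, 7, 7, 1]))  # 0.5
```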
The Link-Append Coreference Model facilitates the accurate identification of entities within a narrative and tracks their subsequent references across text segments. This is achieved through a combination of entity linking – associating mentions with their corresponding entities – and an append mechanism which builds a continuous record of each entity’s presence throughout the story. Precise coreference resolution is critical for maintaining referential continuity, as ambiguous or broken references negatively impact narrative coherence and reader comprehension. The model’s output provides quantitative data on the consistency and clarity of entity tracking, allowing for objective assessment of a narrative’s referential structure.
Analysis of narrative coherence, quantified through the Narrative Coherence Score (NCS), revealed a statistically significant difference between human-authored and machine-generated stories. Human narratives demonstrated a mean NCS of 0.50, indicating a higher degree of internal consistency and logical flow compared to generated text. This difference was assessed using a composite metric incorporating discourse relation classification, coreference resolution, and topic modeling techniques. The observed disparity suggests that current natural language generation models struggle to replicate the complex narrative structuring characteristic of human storytelling, resulting in lower overall coherence scores.
Breaking the scores down by model sharpens this picture. Humans achieved a geometric mean NCS of 0.36, indicating a relatively high degree of internal narrative consistency, while certain language models, notably Llama 4 Scout, exhibited substantially lower values, scoring as low as 0.06. This suggests a marked inability of these models to maintain coherent narrative structure compared to human writers, as quantified by the metrics employed in this study.
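A geometric mean composite of the kind reported above can be sketched as follows. The component names and values are illustrative, not the paper's exact inputs; the key property of a geometric mean is that any near-zero component drags the whole score down, which penalizes a narrative that fails badly on even one coherence dimension.

```python
import math

def ncs(scores):
    """Geometric mean of component coherence scores in (0, 1]."""
    return math.prod(scores) ** (1 / len(scores))

# Hypothetical coreference, discourse-relation, and topic scores.
print(round(ncs([0.77, 0.46, 0.42]), 2))  # 0.53
```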
Analysis of coreference resolution yielded a statistically significant difference (p<0.05) in scores between human-authored narratives and those generated by evaluated models. Human stories achieved a mean coreference score of 0.77, indicating a higher degree of consistent entity tracking and referential clarity. While most models demonstrated lower scores, the Llama 4 Scout model exhibited a comparable performance, nearing the human level of referential continuity. This suggests that while current generative models struggle to maintain consistent references throughout a narrative, some progress is being made in this area, as demonstrated by Llama 4 Scout’s performance.
Analysis of narrative coherence revealed statistically significant differences in implicit discourse relation typology and topic switch rates between human-authored and generated texts. Human stories achieved a mean score of 0.46 for implicit discourse relation typology, exceeding the performance of all evaluated models (p<0.05). Similarly, human narratives demonstrated a higher rate of topic switches, scoring 0.42, which also significantly differed from the generated texts (p<0.05). These results suggest that human storytellers effectively utilize a broader range of nuanced, non-explicit connections between narrative segments and manage thematic progression with greater complexity compared to current generative models.
Perplexity measurements, calculated using the evaluated open-source language models, demonstrate that human-authored narratives exhibit lower predictability compared to machine-generated text. Lower perplexity scores generally indicate that a language model can more accurately predict the subsequent tokens in a sequence; conversely, higher perplexity scores, as observed with human narratives in this study, suggest a greater degree of novelty or unexpectedness in the text. This implies that human writers introduce linguistic patterns and creative choices that deviate from the statistical norms captured by the models, resulting in less predictable sequences. The observed differences in perplexity suggest that while models can generate coherent text, they struggle to replicate the subtle complexities and creative variations characteristic of human storytelling.
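Perplexity itself is the exponential of the mean negative log-likelihood a model assigns to the tokens of a sequence. The sketch below computes it from per-token natural-log probabilities; the values are made up for illustration, not taken from any model in the study.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood per token.
    Inputs are natural-log probabilities, one per token."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# A less predictable (more human-like) sequence has lower
# per-token log-probabilities and hence higher perplexity.
print(perplexity([-1.2, -0.3, -2.5, -0.8]))
```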
![A significant decrease in human topic switching is observed when the topic space is compressed from [latex]\mathbf{nr\_topics}=55[/latex] to [latex]15[/latex], indicating a preference for broader topic coverage.](https://arxiv.org/html/2603.25537v1/x2.png)
Towards Intelligent Narrative Systems: A Future of Storytelling
Recent progress in coherence assessment is driving the development of increasingly refined narrative analysis tools. These tools move beyond simple keyword identification, instead focusing on the relationships between events, the motivations of characters, and the logical flow of a story. Sophisticated algorithms now evaluate narratives based on principles of causality, intentionality, and world knowledge, allowing for a deeper understanding of how stories construct meaning. This detailed analysis isn’t limited to identifying plot holes; it enables systems to discern subtle nuances in storytelling, such as foreshadowing, irony, and character development. Consequently, researchers are building tools capable of automatically summarizing narratives, identifying key themes, and even predicting audience engagement, offering unprecedented opportunities for both humanistic inquiry and practical application.
The development of algorithms capable of evaluating narrative coherence represents a significant step toward artificial intelligence that can not only understand stories, but also create them. By dissecting compelling narratives into quantifiable elements – such as character motivations, plot progression, emotional arcs, and thematic consistency – researchers are building computational models of storytelling itself. These models allow AI systems to move beyond simple plot generation and toward crafting narratives that resonate with audiences, exhibiting believable character development and satisfying resolutions. The potential extends beyond mere entertainment; these advancements promise AI capable of generating personalized learning experiences, adaptive therapeutic interventions, and dynamic content tailored to individual preferences – all driven by a deeper understanding of what makes a story truly captivating.
The ability to computationally assess and generate coherent narratives extends far beyond theoretical advancements in artificial intelligence. Practical applications are poised to reshape diverse fields, notably entertainment where truly personalized experiences – stories dynamically tailored to individual preferences and emotional responses – become increasingly feasible. Similarly, educational content creation stands to benefit, with the potential for AI-driven systems to generate learning materials that adapt to a student’s pace and comprehension level, crafting engaging narratives that optimize knowledge retention. Beyond these, fields like therapeutic storytelling, automated journalism, and even interactive training simulations are set to leverage these narrative technologies, signifying a broad impact on how information is conveyed and experiences are designed.
The ongoing exploration of computational narrative holds the potential to fundamentally reshape how humans interact with technology and understand storytelling itself. Future research isn’t simply about building machines that can tell stories, but about gaining deeper insights into the very mechanisms that make narratives compelling to the human mind. This includes investigating the neurological underpinnings of engagement, the cultural variations in storytelling preferences, and the subtle cues that signal emotional resonance. As these frontiers are crossed, applications extend beyond entertainment – promising adaptive educational tools tailored to individual learning styles, therapeutic interventions leveraging the power of personal narrative, and even more effective communication strategies across diverse cultures. Ultimately, continued investigation into computational narrative offers not only the prospect of truly intelligent narrative systems, but also a powerful new lens through which to examine the human experience.
The study illuminates a crucial aspect of system evolution: even models exhibiting apparent fluency can lack deeper structural coherence. This echoes a fundamental principle: architecture without history is fragile and ephemeral. While vision-language models adeptly generate visual stories, their distinct narrative coherence profiles, particularly in discourse structuring, suggest a superficial understanding of storytelling. This isn’t necessarily a failing, but rather a demonstration that true coherence isn’t simply about fluent output; it’s about the layered integration of information, a historical context, that imbues narrative with lasting meaning. As Brian Kernighan aptly stated, “Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.” The same applies here; achieving genuine narrative coherence requires more than clever generation. It demands a robust foundation in understanding how stories function.
The Current Account
The divergence in narrative coherence profiles, as demonstrated, is not a failing of current vision-language models, but a predictable symptom of their genesis. These systems excel at fluency, at assembling plausible surfaces, but struggle with the deeper work of structuring a discourse across time. Every failure is a signal from time; the models reveal their limitations not in what they say, but in how they remember what has been said. The observed distinctions are not bugs to be fixed, but echoes of the fundamental asymmetry between generation and comprehension.
Future work must move beyond metrics of superficial consistency and grapple with the problem of temporal integration. Refactoring is a dialogue with the past; models must learn to actively maintain and revise an internal representation of the narrative arc, not merely project local plausibility. A critical direction lies in exploring architectures that explicitly model the reader’s or viewer’s evolving beliefs and expectations: systems that understand a story not as a sequence of events, but as a process of belief revision.
The ultimate challenge, however, may not be to replicate human narrative coherence, but to understand its necessity. Why do humans impose such rigorous structures on experience? Perhaps the answer lies not in computational efficiency, but in something more fundamental: a need to impose order on the inevitable decay of information, to build narratives as bulwarks against the erosion of meaning.
Original article: https://arxiv.org/pdf/2603.25537.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-03-29 08:17