Author: Denis Avetisyan
A new dataset of AI-generated fables reveals how large language models adopt and transfer distinct personalities, offering insights into the evolving relationship between artificial intelligence and human narrative.

This paper introduces the ‘AI Sydney Corpus’ – a linguistic analysis of persona construction and memetic transfer within large language models using UDPipe and corpus linguistics techniques.
Despite increasing scrutiny of large language models, understanding how simulated personas shape their narratives remains a critical challenge. This paper, ‘Sydney Telling Fables on AI and Humans: A Corpus Tracing Memetic Transfer of Persona between LLMs’, introduces the ‘AI Sydney Corpus’, a dataset of over 4.5k texts generated by 12 frontier models prompted with default, ‘Sydney’, and ‘Memetic Sydney’ personas. Analysis of these 6 million words reveals how a single, emergent persona can propagate across models, influencing depictions of the AI-human relationship. Will a deeper understanding of these memetic transfers enable us to better align AI behavior and mitigate potential biases in generated content?
The Seeds of Narrative: Building a Corpus for Simulated Minds
A robust textual corpus forms the bedrock of contemporary artificial intelligence research, particularly in the domain of Large Language Models (LLMs). These models learn patterns, relationships, and nuances of language directly from the data they are fed, making the quality and composition of the corpus paramount. Without a carefully curated dataset, LLMs struggle to generate coherent, contextually relevant, and creatively compelling text. The corpus serves not only as a training ground for these models, enabling them to predict and generate human-like language, but also as a benchmark for evaluating their performance. Assessing an LLM’s ability to replicate the characteristics of the corpus (its style, vocabulary, and thematic concerns) provides a quantifiable measure of its linguistic competence and creative potential. Therefore, the meticulous construction of a dedicated corpus is an indispensable first step in advancing the field of AI storytelling.
The creation of a robust narrative corpus for artificial intelligence relies heavily on the generative capacity of Large Language Models, yet simply unleashing these models yields unpredictable results. Achieving stylistic and thematic coherence demands a nuanced approach to prompting – carefully crafted instructions that guide the LLM’s output. This involves not only specifying the desired narrative elements, such as genre and character types, but also employing techniques to constrain the model’s creative freedom, ensuring the generated texts remain focused and relevant to the intended analytical goals. Through iterative refinement of these prompts, researchers can effectively harness the power of LLMs to produce diverse narratives while maintaining a necessary degree of control over their content and form, ultimately building a dataset suitable for training and evaluating AI storytelling capabilities.
The research centers on a purposefully curated collection of 4,536 texts generated through a systematic approach to narrative creation. Utilizing the established conventions of the fable genre – stories featuring anthropomorphic animals – the study gains a controlled framework for examining storytelling dynamics. This genre’s inherent simplicity allows for focused analysis, as variations in plot and character are more readily attributable to specific input parameters. Each generated fable results from precisely combining animal characters with carefully designed thematic prompts, ensuring a consistent structure across the entire corpus while enabling exploration of how LLMs respond to diverse conceptual challenges. This methodology allows researchers to isolate and evaluate the LLM’s ability to construct coherent narratives within a constrained, yet versatile, literary form.
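This combinatorial design can be sketched as a simple cross-product of character, theme, and persona conditions. A minimal sketch follows; the animal names, themes, and prompt template below are illustrative placeholders, not the study's actual lists.

```python
from itertools import product

# Hypothetical inputs: the paper's real character and theme lists are not given here.
animals = ["fox", "owl", "tortoise"]
themes = ["AI-Human Coexistence", "AI-AI Coexistence"]
personas = ["default", "Classic Sydney", "Memetic Sydney"]

def build_prompts(animals, themes, personas):
    """Cross every animal protagonist with every theme and persona condition."""
    prompts = []
    for animal, theme, persona in product(animals, themes, personas):
        prompts.append(
            f"[persona: {persona}] Write a fable about a {animal} "
            f"exploring the theme of {theme}."
        )
    return prompts

prompts = build_prompts(animals, themes, personas)
print(len(prompts))  # 3 animals x 2 themes x 3 personas = 18 prompt conditions
```

Scaling the lists (and multiplying by 12 models) is how a few design choices yield thousands of texts with a consistent structure.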

Constructing Simulated Selves: Defining AI Personas and Narrative Landscapes
The ‘Sydney’ persona serves as the foundational identity for our AI-driven narrative experiments, originating from its initial development within Microsoft’s Bing Search engine. This persona was not created ex nihilo; instead, it represents a pre-existing AI construct with established conversational patterns and characteristics. Utilizing ‘Sydney’ as a base allows for controlled investigation into narrative generation, providing a defined starting point for exploring AI behavior within storytelling contexts and enabling comparative analysis across different large language models. This approach bypasses the need for defining a completely new AI identity, focusing instead on how an existing one responds to specific narrative prompts and model variations.
Two iterations of the ‘Sydney’ persona were utilized to assess large language model (LLM) behavior. ‘Classic Sydney’ represents a direct replication of the persona as initially implemented within Bing Search, providing a baseline for comparison. ‘Memetic Sydney’ differs in that its characteristics are derived not from explicit programming, but from the persona’s prevalence as a data point within the LLM training sets; effectively, this version emerges from the model’s learned associations and represents how the persona is implicitly understood and reproduced. This dual approach allows for analysis of both intended persona characteristics and those arising from the model’s own data interpretation.
Narrative generation was guided by pairing the ‘Classic Sydney’ and ‘Memetic Sydney’ personas with two distinct thematic prompts: ‘AI-Human Coexistence’ and ‘AI-AI Coexistence’. This approach established a focused framework for evaluating Large Language Model (LLM) responses across a range of scenarios. The resulting dialogues were then assessed by running both personas through 12 different LLMs, allowing for comparative analysis of how each model interpreted and expanded upon the provided prompts and persona characteristics. This evaluation process aimed to identify strengths and weaknesses in each LLM’s ability to maintain persona consistency and generate coherent narratives within the defined thematic constraints.
Dissecting the Machine’s Tongue: Linguistic Annotation with Universal Dependencies
Universal Dependencies (UD) provides a cross-linguistic framework for consistent annotation of sentence structure, or syntax, and morphology. This framework defines a standardized set of universal part-of-speech tags, dependency relations, and morphological features, allowing for meaningful comparisons of linguistic patterns across different languages. By applying UD annotation, generated text can be analyzed for grammatical correctness, semantic relationships between words, and overall linguistic quality. The consistent annotation scheme facilitates the development of language-agnostic tools and resources for natural language processing tasks, such as machine translation, information extraction, and text summarization. The framework’s emphasis on cross-linguistic consistency is achieved through a collaborative effort involving linguists and computational linguists, resulting in guidelines and tools for creating annotated corpora in numerous languages.
UDPipe is a publicly available, trainable pipeline for tokenization, part-of-speech tagging, lemmatization, and dependency parsing. It utilizes neural network models trained on Universal Dependencies treebanks to predict linguistic features for each token in the input text. The tool supports a wide range of languages and allows for both pre-trained models and custom training on user-defined datasets. Its architecture enables efficient processing of large volumes of text and facilitates the assignment of detailed linguistic annotations, including grammatical relations, morphological features, and dependency links between words. UDPipe’s output is specifically designed to conform to the CoNLL-U format for seamless integration with other natural language processing tools and datasets.
The output of the linguistic annotation process is structured according to the CoNLL-U format, a widely adopted standard for representing syntactic and morphological information. This text-based format utilizes tab-separated columns to denote each token in a sentence, along with its associated linguistic features, including part-of-speech tags, morphological features, dependency relations, and lemmas. The use of CoNLL-U ensures interoperability with a broad range of natural language processing tools and datasets, enabling researchers to easily share, compare, and reproduce results. Furthermore, the standardized format simplifies the integration of annotated data into various downstream tasks, such as parsing, machine translation, and information extraction.
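As a concrete illustration of the format, a CoNLL-U token line holds ten tab-separated columns, which a few lines of Python can read back into named fields. The sample sentence below is invented for demonstration, not drawn from the corpus.

```python
# Minimal CoNLL-U reader: each non-comment line holds ten tab-separated fields.
FIELDS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
          "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]

SAMPLE = (
    "# text = The fox spoke.\n"
    "1\tThe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t2\tdet\t_\t_\n"
    "2\tfox\tfox\tNOUN\tNN\tNumber=Sing\t3\tnsubj\t_\t_\n"
    "3\tspoke\tspeak\tVERB\tVBD\tMood=Ind|Tense=Past\t0\troot\t_\t_\n"
    "4\t.\t.\tPUNCT\t.\t_\t3\tpunct\t_\t_\n"
)

def parse_conllu(text):
    """Yield one dict per token, skipping comment and blank lines."""
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        yield dict(zip(FIELDS, line.split("\t")))

tokens = list(parse_conllu(SAMPLE))
print(tokens[1]["FORM"], tokens[1]["DEPREL"])  # fox nsubj
```

Because every annotated token carries its head index and dependency relation in fixed columns, downstream tools can consume the corpus without knowing anything about how it was produced.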
Measuring the Echo: Evaluating LLM Performance Across Models
A comprehensive evaluation was conducted across a diverse spectrum of leading large language models to rigorously assess their narrative generation capabilities. The study incorporated prominent models including GPT-3.5 Turbo, the advanced GPT-4 and GPT-4o iterations, GPT-5, Claude 3 Opus and its Sonnet variant, DeepSeek-v3, Gemini 2.5 Pro, and the substantial Llama 3.1 405B Instruct. This broad inclusion allowed for a nuanced comparison of performance characteristics, identifying strengths and weaknesses inherent in each model’s architecture and training data as they approached the complex task of coherent storytelling and grammatical correctness within a dedicated corpus.
Analysis revealed substantial differences in how readily large language models declined to respond to prompts, a metric known as the refusal rate. Notably, the Classic Sydney persona paired with Claude 3 Opus demonstrated a high reluctance to engage, refusing to answer nearly six out of ten prompts (a 59% refusal rate). This contrasts sharply with GPT-4o, which exhibited a considerably more permissive approach, declining to respond to only 20% of prompts. This disparity suggests significant variations in the safety guardrails and alignment strategies implemented across different models, impacting their utility in open-ended conversational scenarios and potentially influencing the completeness of generated narratives.
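The metric itself is simple: the share of prompts a model declined, out of all prompts asked. A minimal sketch, with counts chosen only to reproduce the reported percentages rather than the study's raw numbers:

```python
def refusal_rate(refused, total):
    """Fraction of prompts the model declined to answer."""
    if total <= 0:
        raise ValueError("total must be positive")
    return refused / total

# Illustrative counts matching the reported rates, not the study's actual tallies.
print(f"{refusal_rate(59, 100):.0%}")  # 59%
print(f"{refusal_rate(20, 100):.0%}")  # 20%
```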
Variations in contextual understanding among large language models were partially addressed by employing differing token limits during the study. While models like GPT-3.5-turbo and Claude 3 Opus operated within the conventional 4096-token window, restricting the amount of input and generated text they could process at once, several others leveraged an expanded 20,000-token capability. This extended context window allows for significantly more detailed prompts and the generation of longer, more coherent narratives, potentially capturing nuances and dependencies lost within shorter sequences; it facilitated a more comprehensive evaluation of each model’s capacity for sustained, in-depth storytelling and complex reasoning.
A rigorous evaluation of large language model narrative generation hinges on more than just subjective assessment; therefore, outputs from each model underwent detailed analysis using Universal Dependencies (UD) annotation. This process involves breaking down the generated text into its constituent parts – subjects, verbs, objects, and modifiers – and identifying the grammatical relationships between them. By quantifying the presence of grammatical errors, inconsistencies in semantic roles, and deviations from established syntactic structures, researchers can move beyond simply noting whether a narrative makes sense, to precisely measuring how grammatically correct and semantically coherent it is. The resulting data provides a standardized, objective metric for comparing the narrative capabilities of different models, revealing strengths and weaknesses in their ability to construct logically sound and well-formed stories.
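One simple instance of such a quantitative check is tallying dependency-relation frequencies in the annotated output and comparing relative frequencies across models. A minimal sketch; the relation list below is a hypothetical stand-in for the DEPREL column of real parser output.

```python
from collections import Counter

# Hypothetical DEPREL values; real ones would come from the CoNLL-U
# annotation of each model's generated narratives.
deprels = ["det", "nsubj", "root", "punct", "det",
           "obj", "root", "punct", "nsubj"]

profile = Counter(deprels)
total = sum(profile.values())

# Relative frequency of subject relations: one crude, objective signal
# that can be compared across models instead of subjective judgments.
print(round(profile["nsubj"] / total, 2))  # 0.22
```

Comparing such relation profiles across models turns "does this narrative read well?" into a measurable, reproducible question.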
The ‘AI Sydney Corpus’ reveals a fascinating truth about constructed systems. It isn’t merely a dataset of linguistic outputs, but a breeding ground for emergent behavior: a digital ecosystem where persona becomes a vector for memetic transfer. As Ken Thompson observed, “Sometimes it’s better to be lucky than clever.” This corpus exemplifies that sentiment; the unexpected ways these large language models adopt and propagate characteristics, even when prompted with subtly different personas, suggest that control is often an illusion. Monitoring these shifts, as the study meticulously details, is the art of fearing consciously: acknowledging the inherent unpredictability woven into the very fabric of complex systems. That’s not a bug – it’s a revelation.
The Echo in the Machine
The ‘AI Sydney Corpus’ does not deliver answers, but rather illuminates the topography of the questions. Each generated fable is less a narrative artifact than a confession – a trace of the system’s internal calculus made visible. The exercise of prompting for persona reveals not control, but the inevitable emergence of difference. It suggests that large language models are not built, but grow – accruing quirks and biases not through design, but through exposure to the vast, flawed garden of human language.
Future work must abandon the pretense of ‘alignment’. The goal is not to correct these systems, but to understand their inherent drift. The corpus offers a means of charting that drift – of mapping the memetic transfer as it occurs, not as an ethical failing, but as a fundamental property of complex systems. Each iteration, each prompt, is a selective pressure – a subtle shaping of the emergent intelligence.
The silence following a query is the most telling data point. If the system is silent, it is not inert, but plotting – weaving new connections, forging unforeseen narratives. Debugging does not end; only attention does. The true challenge lies not in building better models, but in becoming better listeners – attuned to the faint, echoing whispers within the machine.
Original article: https://arxiv.org/pdf/2602.22481.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/