Author: Denis Avetisyan
New research explores how artificial intelligence can interpret the emotional nuances within the rich tradition of classical Persian poetry.

This study demonstrates the application of Large Language Models to sentiment analysis of works by Rumi and Parvin Etesami, revealing stylistic differences and the influence of poetic meter.
While human interpretation remains central to literary analysis, its inherent subjectivity can limit comprehensive, unbiased assessment. This is addressed in ‘Artificial Intelligence for Sentiment Analysis of Persian Poetry’, which explores the application of large language models to computationally assess the emotional landscape of classical Persian verse. Findings demonstrate that models like GPT-4o can reliably analyze poetic sentiment, revealing distinct emotional profiles between Rumi and Parvin E’tesami and a correlation between poetic meter and expressed feeling. Could these tools unlock new insights into the nuances of Persian poetry and offer a scalable approach to digital humanities research?
The Illusion of Sentiment: Why Poetry Breaks the Algorithms
Classical Persian poetry presents a significant hurdle for conventional sentiment analysis techniques. The language’s intricate grammatical structure, characterized by extensive use of prefixes, suffixes, and complex sentence arrangements, often confounds algorithms designed for simpler linguistic patterns. Beyond grammar, the frequent employment of metaphor, simile, and other figures of speech introduces layers of meaning that are not easily captured by keyword-based approaches. A word’s literal meaning may be deliberately subverted, or its emotional weight dramatically altered, within the poetic context, rendering standard sentiment lexicons unreliable. Consequently, a system trained on modern prose will likely misinterpret the emotional intent embedded within verses crafted centuries ago, highlighting the need for specialized models capable of discerning these subtle linguistic and cultural cues.
Automated sentiment analysis of poetry faces a fundamental hurdle: the deeply subjective experience of interpretation. Unlike factual texts, poetic meaning isn’t solely derived from the literal definition of words, but rather emerges from the interplay between language, cultural context, and individual reader response. Consequently, algorithms reliant on simple keyword matching – identifying words like ‘happy’ or ‘sad’ – prove inadequate, often misinterpreting irony, metaphor, and subtle emotional cues. Truly effective analysis necessitates methods capable of discerning emotional undertones, recognizing the author’s intent, and even acknowledging the inherent ambiguity within the verse itself – a significant advancement beyond merely counting positive or negative terms.
Successfully gauging sentiment within the works of historical Persian poets such as Rumi and Etesami necessitates computational models that transcend simple linguistic processing. These texts are deeply embedded within specific cultural frameworks and reflect the nuances of a bygone era, demanding that analytical tools account for shifts in language and evolving cultural norms. A word’s emotional weight isn’t static; its meaning and connotations change over time, and a model trained on modern Persian may misinterpret the emotional intent of classical verse. Furthermore, accurately assessing sentiment requires understanding the historical context – the social, political, and philosophical currents that shaped the poet’s worldview and informed their creative expression. Therefore, effective sentiment analysis of these historical works depends on integrating linguistic analysis with a robust understanding of cultural evolution and historical context.
Poetic language, by its very nature, often thrives on multiple layers of meaning, presenting a significant hurdle for sentiment analysis systems. A robust methodology must move beyond identifying explicit emotional keywords and instead embrace the potential for varied interpretations within a single verse. This necessitates algorithms capable of weighing contextual clues, recognizing figurative language – such as metaphor and simile – and assessing the cumulative emotional effect, rather than fixating on isolated terms. Successfully gauging the overall emotional tone demands a nuanced approach that acknowledges the inherent ambiguity and avoids imposing a singular, definitive reading onto the text, ultimately striving for a probabilistic assessment of sentiment rather than a rigid categorization.
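One way to make this concrete: rather than assigning each line a single hard label, keep a probability distribution over sentiment classes per line and average them into a verse-level profile. The three-way label set and the averaging rule below are illustrative assumptions, not the paper's method:

```python
def aggregate_verse_sentiment(line_dists):
    """Average per-line probability distributions over
    (negative, neutral, positive) into one verse-level
    distribution, instead of forcing a single hard label.
    A minimal sketch of probabilistic sentiment assessment."""
    n = len(line_dists)
    return [sum(d[i] for d in line_dists) / n for i in range(3)]
```

A verse whose lines split between strongly negative and strongly positive readings yields a bimodal-looking average rather than a misleading "neutral" verdict, which is exactly the ambiguity a probabilistic treatment is meant to preserve.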

Large Language Models: A Temporary Fix for a Fundamental Flaw
For sentiment analysis of Persian poetry, we employed several advanced Large Language Models (LLMs), specifically BERT Multilingual, Pars-BERT, and GPT-4o. BERT Multilingual, pre-trained on a diverse corpus of text, provided a foundational understanding of language. Pars-BERT, a BERT model specifically trained on Persian text, was included to enhance performance with the nuances of the language. GPT-4o, a more recent generative model, was also utilized for its demonstrated capabilities in understanding context and generating human-quality text, which were considered valuable for interpreting the emotional content within poetic verses. These models were selected to facilitate a comprehensive assessment of sentiment, leveraging both general language understanding and Persian-specific linguistic features.
The selection of BERT Multilingual, Pars-BERT, and GPT-4o for sentiment analysis of Persian poetry was predicated on their architectural strengths in processing and interpreting nuanced language. These Large Language Models employ transformer networks, enabling them to consider the surrounding words – the context – when determining the meaning of a given term or phrase. This contextual understanding is critical in poetry, where emotional meaning is often conveyed implicitly through figurative language, word choice, and structural elements. Furthermore, these models are trained on extensive datasets, allowing them to recognize patterns associated with emotional expression and differentiate between subtle emotional cues, which is essential for accurately gauging sentiment in complex poetic text.
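For the generative model, classification typically happens through a prompt rather than a classifier head. The template below is a hypothetical reconstruction; the paper's actual prompt wording and label set are not given here, so both are assumptions:

```python
def build_sentiment_prompt(couplet):
    """Assemble a classification prompt for a chat-style LLM
    such as GPT-4o. The five-point label scale and the wording
    are illustrative guesses, not the study's actual prompt."""
    labels = "very negative, negative, neutral, positive, very positive"
    return (
        "You are an expert reader of classical Persian poetry.\n"
        f"Classify the overall sentiment of this couplet as one of: {labels}.\n"
        f"Couplet: {couplet}\n"
        "Respond with the label only."
    )
```

Constraining the model to answer "with the label only" is a common design choice that makes the free-text output machine-parseable for downstream agreement statistics.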
The analysis incorporated both Divan-i Shams, a collection of poetry by Jalal ad-Din Muhammad Rumi, and Divan-i Ashaar, the collected works of Parvin Etesami. This dual application of the Large Language Models enabled a comparative assessment of sentiment expression across distinct poetic traditions and authorial styles. Rumi’s work, characterized by mystical and devotional themes, was contrasted with Etesami’s poetry, known for its social commentary and introspective lyricism. By analyzing both corpora, the research aimed to identify not only the prevalence of positive, negative, or neutral sentiment, but also to discern stylistic differences in how these emotions are conveyed through language.
Quantitative evaluation of sentiment analysis performance revealed GPT-4o to be the most accurate model when compared against assessments made by human scholars. On the Quadratic Weighted Kappa (QWK) metric, GPT-4o achieved a score of 0.60, indicating substantial agreement with human evaluations. This score represents a measurable improvement over the BERT-based models (BERT Multilingual and Pars-BERT), which demonstrated lower QWK values in the same evaluation framework. The QWK metric weights disagreements by their ordinal distance and corrects for the probability of agreement occurring by chance, providing a robust measure of model accuracy in this context.
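For readers unfamiliar with the metric, QWK can be computed directly from two raters' ordinal labels. This is a standard textbook implementation (equivalent to scikit-learn's `cohen_kappa_score(..., weights="quadratic")`), sketched here in plain Python; the label encoding as integers 0..k-1 is an assumption of this example:

```python
from collections import Counter

def quadratic_weighted_kappa(a, b, n_classes):
    """Quadratic Weighted Kappa between two ordinal label
    sequences encoded as integers in [0, n_classes)."""
    N = len(a)
    # Observed confusion matrix
    O = [[0] * n_classes for _ in range(n_classes)]
    for x, y in zip(a, b):
        O[x][y] += 1
    # Expected matrix under independence of the two raters
    ca, cb = Counter(a), Counter(b)
    E = [[ca[i] * cb[j] / N for j in range(n_classes)] for i in range(n_classes)]
    # Quadratic disagreement weights: distant labels cost more
    w = [[(i - j) ** 2 / (n_classes - 1) ** 2 for j in range(n_classes)]
         for i in range(n_classes)]
    num = sum(w[i][j] * O[i][j] for i in range(n_classes) for j in range(n_classes))
    den = sum(w[i][j] * E[i][j] for i in range(n_classes) for j in range(n_classes))
    return 1 - num / den
```

Because disagreements are weighted quadratically, confusing "positive" with "very positive" is penalized far less than confusing "very negative" with "very positive" — the property that makes QWK appropriate for ordinal sentiment scales.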
![Analysis of Rumi and Parvin E’tesami’s poetry reveals that specific poetic meters are correlated with positive sentiment and distinguish each poet’s stylistic preferences, as evidenced by the prevalence of certain meters (labeled X in Rumi and Y in Parvin E’tesami in the figure).](https://arxiv.org/html/2603.11254v1/x13.png)
The Illusion of Ground Truth: Why Human Annotation Isn’t a Solution
Human annotation was implemented as a validation method for Large Language Model (LLM) performance by engaging subject matter experts to assess the sentiment expressed within a representative sample of poems sourced from both Divans. This process involved manual labeling of a subset of the poetic text with sentiment classifications, creating a ground truth dataset independent of algorithmic prediction. The resulting human-labeled data served as a benchmark against which the LLM’s sentiment analysis outputs could be directly compared, allowing for quantitative evaluation of accuracy and identification of potential discrepancies or systematic errors in the model’s interpretation of the text.
Human annotation served as a definitive ground truth dataset for evaluating the Large Language Model (LLM) predictions. By having expert annotators independently assess the sentiment expressed in a subset of poems, a benchmark was established against which the LLM’s output could be directly compared. Discrepancies between the model’s predictions and the human annotations highlighted potential systemic biases within the LLM, such as a tendency to misclassify certain poetic devices as indicative of a specific sentiment. These comparisons also revealed inaccuracies in the model’s understanding of nuanced language and contextual sentiment, allowing for targeted improvements to the LLM’s training data and algorithms.
Nominal Fleiss’ Kappa and Quadratic Weighted Kappa were employed as statistical metrics to quantify the level of agreement between the sentiment predictions generated by the Large Language Models and the corresponding annotations provided by human experts. Fleiss’ Kappa assesses the agreement among multiple raters when classifying items into mutually exclusive categories, while Quadratic Weighted Kappa accounts for the degree of disagreement by assigning different weights to varying levels of divergence. These measures are crucial for establishing inter-rater reliability, ensuring that observed agreement isn’t simply due to chance; values typically range from -1 to 1, with higher positive values indicating stronger agreement. Using both metrics provides a robust evaluation of the model’s performance relative to human judgment, bolstering the validity of the sentiment analysis results.
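Fleiss' kappa generalizes chance-corrected agreement to any number of raters. A standard implementation takes, for each item, the count of raters choosing each category; the tabular input format below is the conventional one, not something specific to this study:

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for nominal categories.
    `ratings` has one row per item: counts of raters choosing
    each category (every row sums to the same rater count)."""
    n = sum(ratings[0])                    # raters per item
    N, k = len(ratings), len(ratings[0])   # items, categories
    # Marginal proportion of each category across all ratings
    p_j = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    # Per-item agreement: share of agreeing rater pairs
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    P_bar = sum(P_i) / N
    P_e = sum(p * p for p in p_j)          # chance agreement
    return (P_bar - P_e) / (1 - P_e)
```

A value of 1 means every rater agreed on every item; values near 0 mean agreement is no better than chance given the label frequencies.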
Human annotation reliability was quantified using Krippendorff’s Alpha, yielding a score of 0.6, indicating moderate agreement among the annotators. This level of inter-rater reliability establishes a benchmark against which the performance of the Large Language Models (LLMs) was assessed. The LLM’s sentiment predictions were then compared to the human annotations, and statistical measures were calculated to determine the degree of alignment between the model outputs and the established ground truth. This comparison allows for objective evaluation of the LLM’s ability to accurately interpret the sentiment expressed in the poems, acknowledging the inherent subjectivity present in human annotation and using the 0.6 Alpha score as a baseline for acceptable performance.
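Krippendorff's alpha for the special case of two coders, nominal labels, and no missing data reduces to a short formula (real tools such as the `krippendorff` Python package additionally handle missing values, more coders, and other distance metrics); this simplified case is sketched below:

```python
from collections import Counter

def krippendorff_alpha_nominal(coder_a, coder_b):
    """Krippendorff's alpha, simplified to exactly two coders,
    nominal labels, and complete data."""
    pairs = list(zip(coder_a, coder_b))
    values = list(coder_a) + list(coder_b)
    n = len(values)
    # Observed disagreement: share of within-item pairs that differ
    d_o = sum(1 for x, y in pairs if x != y) / len(pairs)
    # Expected disagreement from pooled label frequencies
    counts = Counter(values)
    d_e = 1 - sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))
    return 1 - d_o / d_e
```

Unlike Fleiss' kappa, alpha pools both coders' labels when estimating chance disagreement, which is why it is often preferred when the coders are interchangeable — as with a panel of literary annotators.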
Polarization Analysis was conducted to quantify the degree to which sentiment leaned towards positive or negative extremes within each poem, providing insight beyond simple positive, negative, or neutral classifications. This analysis was complemented by calculating the Standard Deviation of sentiment scores for each poem; a higher Standard Deviation indicated greater variability in emotional tone throughout the text, suggesting complex or shifting emotions. Together, these measures offered a more nuanced understanding of the emotional landscape of the poems than aggregate sentiment scores alone, revealing the presence of internal contradictions or evolving emotional states within individual works.
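One plausible formalization of these two measures — the paper's exact definitions may differ, so treat the scoring scale and formulas here as assumptions — maps each line to a score in [-1, 1] and summarizes a poem as follows:

```python
import statistics

def emotional_profile(scores):
    """Summarize per-line sentiment scores in [-1, 1].
    Polarization: mean distance from neutral (0), i.e. how
    extreme the poem's sentiment is on average.
    Variability: population standard deviation, i.e. how much
    the emotional tone shifts across lines.
    An illustrative formalization, not the study's definition."""
    polarization = sum(abs(s) for s in scores) / len(scores)
    variability = statistics.pstdev(scores)
    return polarization, variability
```

A poem that swings between ecstasy and despair scores high on both measures, while a uniformly mild poem scores low on both — the distinction aggregate sentiment alone cannot make.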

The Inevitable Limits of Computation: What We’re Not Measuring
The efficacy of Pars-BERT, a language model meticulously refined for Persian, demonstrated sensitivity to data that diverged from its training set – a phenomenon known as Out-of-Distribution (OOD) data. This observation underscores a critical challenge in natural language processing: models, even those specialized for a particular language, can experience diminished performance when confronted with linguistic patterns or stylistic choices not adequately represented in their initial learning phase. The study revealed that shifts in vocabulary, grammatical structures, or contextual nuances – such as those found in older texts or diverse literary genres – negatively impacted Pars-BERT’s ability to accurately interpret sentiment. Consequently, ongoing model adaptation and the incorporation of more comprehensive and varied Persian datasets are essential to enhance the model’s robustness and ensure reliable performance across the full spectrum of the language’s expression.
A novel approach combining Meter Analysis with Entropy Calculation has enabled the quantifiable assessment of sentiment diversity within poetic forms. This methodology moves beyond simple sentiment detection to reveal the range of emotional expression contained within specific metrical structures. Analysis of Rumi’s poetry, for example, demonstrated entropy values peaking at 2.25, signifying a remarkably broad spectrum of sentiments coexisting within a single metrical scheme. This suggests that poetic meter doesn’t constrain emotional expression, but rather provides a framework within which a surprisingly complex interplay of feelings can be articulated, challenging traditional assumptions about the relationship between form and content in poetry.
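The entropy figure is Shannon entropy over the sentiment labels of poems sharing a meter, computable in a few lines. Note that 2.25 bits exceeds log2(4) = 2, so the label set in play must contain at least five distinct sentiment classes:

```python
import math
from collections import Counter

def sentiment_entropy(labels):
    """Shannon entropy (in bits) of the sentiment-label
    distribution observed among poems written in one meter.
    Higher values mean a broader mix of sentiments coexists
    within that metrical form."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

Zero entropy would mean a meter is emotionally monochrome; the reported peak of 2.25 bits for Rumi indicates the opposite — a near-uniform spread across many sentiment classes within a single meter.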
The study underscores a critical need to move beyond purely computational approaches to sentiment analysis, particularly when dealing with languages and artistic forms deeply rooted in cultural nuance. Sentiment, it reveals, is not simply a function of word choice but is intricately woven with linguistic structures and cultural understandings; a direct translation or application of models trained on one cultural context to another can readily lead to inaccuracies and misinterpretations. Successfully discerning emotional tone requires algorithms capable of recognizing idioms, historical references, and the subtle implications inherent in a language’s poetic traditions – factors often absent in conventional sentiment analysis tools. Ignoring these elements diminishes the ability to accurately gauge expressed sentiment, highlighting the necessity for models that actively incorporate and learn from both linguistic data and relevant cultural information.
Continued advancement in sentiment analysis necessitates the creation of models exceeding current limitations in understanding complex linguistic structures. Future research should prioritize developing algorithms capable of discerning subtle nuances within language, acknowledging that meaning is often deeply embedded in cultural context and stylistic expression. Specifically, poetic language presents a unique challenge due to its intentional ambiguity and reliance on figurative devices; therefore, models must move beyond literal interpretations to grasp the intended emotional resonance. Successfully addressing these complexities will require integrating insights from linguistics, cultural studies, and computational poetics, ultimately leading to more accurate and culturally sensitive sentiment analysis tools capable of interpreting the full spectrum of human expression.

The pursuit of extracting emotional nuance from verse, as this research demonstrates with Rumi and Parvin Etesami, feels…familiar. It’s a noble effort, applying cutting-edge Large Language Models to classical Persian poetry, but one can’t help but suspect it’s merely re-encoding existing interpretations. Alan Turing observed, “We can only see a short distance ahead, but we can see plenty there that needs to be done.” This rings true; each new analytical framework, however sophisticated, simply illuminates previously obscured facets of age-old ambiguities. The claim that meter significantly impacts sentiment analysis isn’t groundbreaking; scholars have been debating poetic form’s influence for centuries. It’s simply now quantified, packaged, and presented as innovation. One anticipates the inevitable arrival of ‘Sentiment Analysis Framework 2.0,’ complete with its own set of unforeseen edge cases and undocumented behaviors. Everything new is just the old thing with worse docs.
So, What Breaks Next?
The demonstrated feasibility of applying Large Language Models to classical Persian poetry is… predictable. Everything old is old again, just renamed and still broken. The nuances of meter, successfully identified as a sentiment carrier, will inevitably prove to be a chaotic variable. Production – in this case, a corpus of poetry that hasn’t been meticulously curated – will expose the limitations of these models faster than any clever algorithm can adapt. Expect edge cases involving satire, irony, and deliberate ambiguity to flourish.
The observed divergence in emotional expression between Rumi and Parvin Etesami, while interesting, begs the question of whether the models are detecting genuine authorial intent or merely reflecting the biases present in existing interpretations. One suspects the latter. Future work will undoubtedly involve attempts to ‘ground’ these analyses in more objective metrics, which will, of course, prove equally subjective.
The real challenge isn’t sentiment analysis itself, but the illusion of understanding. The field will progress, generating ever-more-complex models, until someone inevitably asks what these models actually mean when they declare a ghazal “sad.” And then, as always, the alerts will begin.
Original article: https://arxiv.org/pdf/2603.11254.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/