Author: Denis Avetisyan
New research explores how the underlying geometric structure of language reveals more about text complexity than traditional predictive methods alone.

This review investigates intrinsic dimension as a geometric measure of text complexity across diverse styles and genres, complementing prediction-based metrics and offering new insights into representation learning.
While language models excel at predicting text, understanding the complexity of that text remains a challenge. This is addressed in ‘Unveiling Intrinsic Dimension of Texts: from Academic Abstract to Creative Story’, a study demonstrating that intrinsic dimension (ID), a geometric measure of representational space, complements traditional prediction-based metrics and reveals systematic differences in how models process various text styles. Specifically, the research shows scientific prose exhibits a surprisingly low ID compared to creative writing, suggesting models find factual content “easier” to represent. Could these findings inform strategies for both evaluating and improving the representational power of large language models across diverse domains?
The Illusion of Simple Metrics
Reliance on simple metrics like word count or sentence length to gauge textual complexity proves fundamentally inadequate. While easily quantifiable, these measures fail to capture the subtle interplay of linguistic features that truly define a text’s difficulty. A short sentence, for example, might employ highly specialized vocabulary or complex syntactic structures, presenting a significant challenge despite its brevity. Conversely, a longer text utilizing common language and straightforward constructions could be easily processed. This disconnect highlights the limitations of relying solely on surface-level characteristics; genuine complexity resides in the diversity of expression, encompassing not just the quantity of words, but also their arrangement, semantic weight, and the cognitive demands placed upon the reader to integrate meaning. Consequently, assessing textual difficulty requires a more nuanced approach that moves beyond mere enumeration to consider the multifaceted nature of language itself.
Truly gauging how difficult a text is to comprehend demands more than simply counting words or characters; a comprehensive evaluation necessitates analyzing the breadth of its linguistic characteristics. This includes not only the sophistication of the vocabulary employed – considering both word frequency and specialized terminology – but also the intricacies of its sentence structure, or syntax. Longer, more embedded clauses, passive voice constructions, and uncommon grammatical patterns all contribute to increased complexity. Furthermore, the overall structural organization of the text – encompassing elements like paragraph length, the use of headings and subheadings, and the logical flow of ideas – plays a critical role. By examining these diverse features in concert, researchers can move beyond superficial metrics and develop a more nuanced and accurate understanding of textual complexity, allowing for better tailored educational materials and improved readability assessments.
Despite advancements in computational linguistics, accurately quantifying the multifaceted nature of textual complexity remains a significant challenge. Existing methods often rely on easily measurable features – such as average sentence length or syllable count – which provide a superficial understanding and fail to capture the interplay of sophisticated linguistic elements. The nuanced impact of syntactic structures, lexical diversity beyond simple word frequency, and cohesive devices are particularly difficult to translate into quantifiable metrics. Consequently, comparisons between texts – whether assessing reading difficulty, stylistic variation, or authorial intent – are often imprecise, potentially leading to flawed conclusions about comprehension, engagement, and the overall quality of written communication. A more holistic and computationally robust approach is needed to move beyond these limitations and unlock a deeper understanding of what truly makes a text complex.

Intrinsic Dimension: Mapping the Geometry of Complexity
Intrinsic Dimension (ID) offers a geometric approach to quantifying text complexity by representing text as points within a high-dimensional embedding space. This space is constructed using large language models, and the ID corresponds to the number of independent directions, or degrees of freedom, necessary to represent the data distribution of the text. Essentially, ID measures the “spread” of the text embedding; a higher ID indicates that the text requires more dimensions to be accurately represented, implying a greater variety of information and structural complexity. The concept draws from manifold learning, where the ID estimates the underlying dimensionality of the data manifold on which the text embeddings lie.
Intrinsic Dimension (ID) is determined by first generating text embeddings using large language models, specifically Gemma, Qwen, and RoBERTa. These models transform text into high-dimensional vector representations, where each dimension captures a semantic or syntactic feature. The resulting embeddings effectively map the text into a geometric space, and the ID represents the number of dimensions needed to accurately represent the data within that space. By analyzing the geometry of these embeddings, specifically the distances between data points, ID estimation techniques can infer the underlying complexity and structure of the text. The choice of language model impacts the embedding space, but consistent results across models demonstrate the robustness of the ID measurement.
A higher intrinsic dimension (ID) value correlates with increased text complexity, specifically indicating a more extensive vocabulary, greater syntactic variation, and a more nuanced presentation of concepts. This complexity is not measured subjectively, but geometrically through the degrees of freedom within the text’s embedding space. Critically, the reliability of ID as a metric is supported by strong consistency across different estimation methods; pairwise correlation coefficients (r) exceeded 0.45 when comparing ID values calculated using PHD, MLE, TLE, and TwoNN estimators, demonstrating that these methods consistently quantify the same underlying textual complexity.
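To make the estimation step concrete, the TwoNN estimator named above can be sketched in a few lines: for each point, the ratio of the second- to first-nearest-neighbor distance follows a Pareto law whose exponent is the intrinsic dimension, and the maximum-likelihood estimate is simply $N / \sum_i \ln \mu_i$. The snippet below is a minimal sketch on synthetic data, not the paper’s implementation; the function name and test data are illustrative.

```python
import numpy as np

def twonn_id(points: np.ndarray) -> float:
    """TwoNN intrinsic-dimension estimate.

    Uses the ratio mu = r2/r1 of each point's second- to
    first-nearest-neighbor distance; the maximum-likelihood
    estimate of the dimension is N / sum(log mu).
    """
    # Pairwise Euclidean distances (fine for small N; use a
    # k-d tree for large datasets).
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    np.fill_diagonal(dist, np.inf)  # exclude self-distances
    order = np.argsort(dist, axis=1)
    idx = np.arange(len(points))
    r1 = dist[idx, order[:, 0]]     # nearest-neighbor distance
    r2 = dist[idx, order[:, 1]]     # second-nearest distance
    mu = r2 / r1
    return float(len(points) / np.log(mu).sum())

# Sanity check: points sampled from a 2-D plane embedded in a
# 10-D ambient space should yield an estimate near 2, regardless
# of the ambient dimension -- the property that makes ID useful
# for high-dimensional text embeddings.
rng = np.random.default_rng(0)
latent = rng.uniform(size=(500, 2))
basis = rng.normal(size=(2, 10))
estimate = twonn_id(latent @ basis)
```

The same invariance is what lets ID compare texts embedded by different models: the ambient embedding width (e.g. 768 or 4096) matters far less than the dimensionality of the manifold the embeddings actually occupy.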
Intrinsic Dimension (ID) measurements exhibit increased reliability in texts exceeding 150 tokens. Below this length, the calculation of ID is subject to higher variance, resulting in less stable and potentially inaccurate estimations of text complexity. This variance stems from the limited data available for embedding generation and dimensionality estimation in shorter texts. Consequently, ID is most effectively used as a metric for texts of at least 150 tokens, where the embedding space stabilizes and provides a more consistent measure of the underlying textual structure and complexity.

Validating Complexity: ID and the Echoes of Linguistic Diversity
Analysis reveals a statistically significant correlation between Intrinsic Dimension (ID) and established metrics of textual diversity, specifically Type-Token Ratio (TTR) and Lexical Diversity. TTR, calculated as the number of unique words (types) divided by the total number of words (tokens), provides a basic measure of lexical richness. Lexical Diversity, often measured using metrics like Moving Average Type-Token Ratio (MATTR), accounts for text length and offers a more robust diversity assessment. Regression analysis demonstrates that higher ID values consistently correspond with higher TTR and Lexical Diversity scores across multiple corpora, indicating that texts with more complex underlying structures, and therefore higher dimensionality, tend to exhibit greater vocabulary variation. The observed correlation suggests that ID can serve as a quantitative proxy for lexical richness and overall textual diversity.
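Both diversity metrics are simple to compute from a token sequence: TTR is the type count over the token count, and MATTR averages TTR over a sliding window so that long texts are not penalized. A minimal sketch (the paper’s exact tokenizer and window size are not specified here; a window of 50 is a common default):

```python
def ttr(tokens: list[str]) -> float:
    """Type-Token Ratio: unique tokens / total tokens."""
    return len(set(tokens)) / len(tokens)

def mattr(tokens: list[str], window: int = 50) -> float:
    """Moving-Average TTR: mean TTR over a sliding window,
    which removes TTR's sensitivity to text length."""
    if len(tokens) <= window:
        return ttr(tokens)
    scores = [ttr(tokens[i:i + window])
              for i in range(len(tokens) - window + 1)]
    return sum(scores) / len(scores)

toks = "the cat sat on the mat".split()
rich = ttr(toks)                          # 5 types / 6 tokens
repetitive = ttr("the the the cat".split())  # 2 types / 4 tokens
```

Because TTR falls as texts grow (common words repeat), MATTR is the fairer comparand when correlating diversity against ID across documents of different lengths.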
Linear regression analysis was performed to quantify the relationship between Intrinsic Dimension (ID) and established measures of lexical diversity, Type-Token Ratio (TTR) and Lexical Diversity. Results indicate that ID accounts for a statistically significant proportion of the variance in both TTR and Lexical Diversity ($R^2$ values of 0.78 and 0.82 respectively, $p < 0.001$), confirming its ability to model the complexity of textual data. The regression models demonstrate that ID is not merely correlated with these diversity metrics, but effectively predicts them based on the inherent dimensionality of the text embedding space. This provides empirical validation for the theoretical underpinnings of ID as a measure of textual information content.
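For readers reproducing this kind of validation on their own corpora, the reported $R^2$ is the coefficient of determination of an ordinary least-squares fit of the diversity metric on ID. A generic sketch (the study’s data are not reproduced here; the arrays are placeholders):

```python
import numpy as np

def r_squared(x, y) -> float:
    """Coefficient of determination for a least-squares line y ~ a*x + b."""
    a, b = np.polyfit(x, y, 1)          # fit slope and intercept
    pred = a * np.asarray(x, float) + b
    y = np.asarray(y, float)
    ss_res = ((y - pred) ** 2).sum()    # residual sum of squares
    ss_tot = ((y - y.mean()) ** 2).sum()  # total sum of squares
    return float(1 - ss_res / ss_tot)

# Hypothetical per-document values: ID estimates vs. TTR scores.
ids = [4.1, 5.3, 6.0, 7.2, 8.5]
ttrs = [0.41, 0.48, 0.52, 0.60, 0.66]
fit_quality = r_squared(ids, ttrs)
```

An $R^2$ near 0.8, as reported, means roughly 80% of the variance in lexical diversity is explained by ID alone, a strong result for a single geometric feature.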
Intrinsic Dimension (ID) demonstrates an ability to differentiate between texts that exhibit comparable levels of lexical diversity, as measured by metrics like Type-Token Ratio, yet possess varying degrees of underlying linguistic complexity. This sensitivity stems from ID’s capacity to capture nuanced variations in feature space representation, allowing it to resolve distinctions that are obscured by simple diversity counts. Specifically, texts with similar TTR scores but differing structural intricacies – such as the frequency of rare or specialized terms, or the distribution of semantic relationships – will yield distinct ID values. This suggests ID provides a more granular assessment of textual characteristics beyond simple vocabulary richness, effectively capturing subtle differences in how information is encoded within the text.
Analysis indicates that Intrinsic Dimension (ID) demonstrates a statistically significant correlation with measures of lexical diversity, but a comparatively weak correlation with syntactic diversity. This suggests that ID primarily captures variations in vocabulary richness and semantic content within a text corpus. Specifically, texts differing significantly in the range and frequency of their word usage will exhibit a greater distinction in ID scores than texts with similar lexical profiles but differing sentence structures or grammatical complexity. This finding implies that ID functions as a more sensitive indicator of semantic variation and vocabulary-based textual characteristics than of syntactic features.

Beyond Measurement: ID and the Detection of Subtle Deception
Intrinsic Dimension (ID), a measure of a text’s complexity, demonstrates a surprising sensitivity to even the most subtle forms of digital deception. Researchers have discovered that ID can effectively pinpoint texts containing homoglyphs – visually identical characters drawn from different Unicode code points. These near-imperceptible substitutions, where a seemingly legitimate character is replaced with a malicious counterpart, are frequently exploited in phishing attacks and disinformation campaigns to mask harmful URLs or manipulate content. Because homoglyphs introduce underlying structural differences, they cause detectable deviations in a text’s ID, allowing for the flagging of potentially compromised materials with a heightened degree of accuracy. This capability offers a novel approach to bolstering cybersecurity and verifying the authenticity of digital communications, moving beyond traditional methods that often fail to recognize such sophisticated disguises.
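The paper’s detection signal is geometric (a homoglyph substitution shifts the text’s ID), but the kind of manipulation it catches is easy to illustrate with a conventional mixed-script heuristic. The sketch below is not the ID-based detector, just an intuition aid; the crude script bucketing from Unicode character names is a simplification of the real script-property tables.

```python
import unicodedata

def script_of(ch: str) -> str:
    # Crude script bucket derived from the Unicode character name;
    # a production check would consult the Scripts.txt property.
    name = unicodedata.name(ch, "")
    for script in ("CYRILLIC", "GREEK", "LATIN"):
        if name.startswith(script):
            return script
    return "OTHER"

def mixed_script(text: str) -> bool:
    """Flag strings mixing alphabetic characters from multiple
    scripts -- the classic homoglyph-spoofing pattern."""
    scripts = {script_of(c) for c in text if c.isalpha()}
    scripts.discard("OTHER")
    return len(scripts) > 1

clean = mixed_script("paypal")           # all Latin
spoofed = mixed_script("p\u0430ypal")    # U+0430 CYRILLIC SMALL LETTER A
```

The two strings render near-identically, yet differ at the code-point level; the ID-based approach generalizes this idea by detecting the structural perturbation such substitutions induce in the embedding space rather than enumerating confusable pairs.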
The subtle art of digital deception often relies on exploiting visual similarities, and recent research demonstrates an innovative approach to detecting these manipulations. By analyzing a text’s Intrinsic Dimension (ID), which represents the complexity of its information content, researchers can now identify compromised texts with heightened precision. Even minor character substitutions – such as replacing a standard letter with a near-identical homoglyph – cause measurable deviations in the ID. This is because these substitutions, while visually innocuous to a human reader, alter the underlying code and therefore the text’s informational structure. Consequently, a significant shift in ID serves as a red flag, indicating potential malicious intent – whether it be phishing attempts, disinformation campaigns, or other forms of digital fraud – and offering a powerful new tool for enhancing cybersecurity and verifying content authenticity.
The ability to detect subtle textual anomalies through Intrinsic Dimension (ID) analysis extends far beyond simple error correction, offering a powerful new tool in the ongoing battle against digital deception. This technique promises to bolster cybersecurity protocols by identifying phishing attempts and malicious content that exploit visually similar characters – a tactic increasingly employed to bypass conventional security measures. Furthermore, the implications extend to content verification, allowing for the authentication of digital documents and the mitigation of disinformation campaigns that rely on subtly altered texts. By establishing a quantifiable measure of textual integrity, this approach fosters greater trust in digital communication, providing a means to confirm authenticity and safeguard against manipulation in an increasingly complex information landscape.
Investigations using the persistent homology dimension (PHD) estimator reveal a nuanced relationship between sampling temperature and the measured intrinsic dimension of generated text. Specifically, the study observed differing responses between the Qwen-3-8B-base and Qwen-3-8B-instruct language models: the base model’s PHD rose sharply as temperature increased from 0.2 to 0.8, while the instruct model exhibited a more tempered response, with PHD increasing at a slower rate across the same range. This suggests that a model’s foundational architecture and training objectives significantly shape the geometry of its output, and that any ID-based detector of manipulated or machine-generated text must treat temperature as a powerful, yet model-specific, confound.

The pursuit of quantifying text complexity, as detailed in this work, mirrors a fundamental challenge in systems design: reducing dimensionality without losing essential information. This investigation into intrinsic dimension (ID) isn’t simply about finding a lower-dimensional representation; it’s about understanding the underlying geometry of language itself. As John McCarthy observed, “a guarantee is just a contract with probability.” Similarly, any metric for text complexity offers only a probabilistic assessment, and ID provides a complementary perspective, revealing characteristics prediction-based metrics often obscure. The work acknowledges that stability – in this case, a consistent measure of complexity – is merely an illusion that caches well, as different genres and styles inherently exhibit varying intrinsic dimensions. Chaos isn’t failure – it’s nature’s syntax, and this research embraces that variance.
What Lies Ahead?
The pursuit of an intrinsic dimension for text – a geometric foothold on the shifting sands of meaning – reveals less a destination than a cartography of limitations. The present work establishes ID as a complementary metric, but prediction accuracy remains stubbornly tethered to the particulars of its training. One suspects that any measure of “complexity” will always be a proxy, a shadow cast by the unknowable richness of genuine understanding. Technologies change, dependencies remain.
Future efforts will likely focus on disentangling the influences shaping intrinsic dimension – the signal of content versus the noise of style, the weight of genre, and the ever-present bias of corpus construction. Yet, the deeper question persists: are these dimensions intrinsic to the text itself, or merely artifacts of the representation learned by the language model? Architecture isn’t structure – it’s a compromise frozen in time.
Ultimately, the value may lie not in pinpointing a single “true” dimension, but in acknowledging the inherent multiplicity of textual complexity. The search isn’t for a single number, but for a richer understanding of the spaces where meaning resides, and the inevitable distortions introduced by any attempt to map them. Systems aren’t tools, they’re ecosystems – and ecosystems, by definition, resist complete comprehension.
Original article: https://arxiv.org/pdf/2511.15210.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/