Beyond Clever Mimicry: What Can Language Models Actually *Do*?

Author: Denis Avetisyan


A new perspective frames the apparent reasoning abilities of large language models as a form of inference tied to inherent invariances within their core mechanisms.

This review argues for a scientifically rigorous approach to evaluating language model capabilities by examining the model’s Markov kernel and its implications for epistemic uncertainty.

Claims of reasoning in large language models conceal a critical gap between observed performance and underlying computational mechanisms. This paper, ‘On the Notion that Language Models Reason’, critically examines definitions of reasoning within natural language processing and proposes an alternative framework: viewing these models as implementing implicit finite-order Markov kernels that map contexts to token distributions. On this view, seemingly rational outputs emerge not from explicit logical inference but from statistical regularities and approximate invariances learned during training. Does this distinction fundamentally alter how we evaluate epistemic uncertainty and the potential for genuine intelligence in these systems?


The Illusion of Reasoning: Patterns, Not Deduction

The prevalent characterization of large language models as “reasoners” obscures the fundamental principles governing their operation. While these models can generate text that appears logical or insightful, this capacity stems not from genuine reasoning abilities, but from sophisticated pattern recognition and probabilistic prediction. These models don’t deduce conclusions; instead, they calculate the statistical likelihood of a given token appearing next in a sequence, based on the vast datasets they were trained on. This process, rooted in the mathematical framework of a Markov kernel, dictates that the model’s output is always a prediction, inherently lacking the certainty associated with deductive reasoning. Framing language models as reasoners therefore misrepresents their core mechanism, potentially leading to unrealistic expectations and flawed evaluation metrics.

At their core, large language models operate not through logical deduction, but through probabilistic prediction. These models are fundamentally designed to calculate the conditional probability of the next token in a sequence, given all preceding tokens. This process is mathematically formalized by a Markov kernel, which defines the probability distribution over possible next tokens. Essentially, the model doesn’t ‘think’ about the correct answer; it assesses which token is most likely to follow, based on the vast patterns learned from its training data. This probabilistic nature means the model generates text by sampling from this distribution, creating outputs that are statistically plausible given the input, rather than necessarily ‘true’ or ‘logical’ in a human sense. Understanding this core mechanism is crucial for appropriately evaluating and interpreting the capabilities of these increasingly prevalent systems.
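To make the mechanism concrete, the sketch below implements a toy finite-order Markov kernel and samples a next token from its conditional distribution. The vocabulary, the order, and the probability table are invented placeholders, not taken from the paper; the point is only the shape of the computation: context in, distribution over tokens out, sample drawn.

```python
import random

# Hypothetical vocabulary and order-2 kernel, invented purely for illustration.
VOCAB = ["the", "cat", "sat", "mat", "."]

def markov_kernel(context: tuple) -> dict:
    """Return P(next token | last k tokens) for a toy order-2 kernel."""
    k = 2  # finite order: only the last two tokens influence the prediction
    key = tuple(context[-k:])
    table = {
        ("the", "cat"): {"sat": 0.70, "mat": 0.10, "the": 0.05, "cat": 0.05, ".": 0.10},
        ("cat", "sat"): {".": 0.80, "the": 0.05, "cat": 0.05, "sat": 0.05, "mat": 0.05},
    }
    # Unseen contexts fall back to a uniform distribution over the vocabulary.
    return table.get(key, {w: 1 / len(VOCAB) for w in VOCAB})

def sample_next(context: tuple) -> str:
    """Sample the next token from the kernel's conditional distribution."""
    dist = markov_kernel(context)
    tokens, probs = zip(*dist.items())
    return random.choices(tokens, weights=probs, k=1)[0]

print(sample_next(("the", "cat")))  # most often "sat": a prediction, not a deduction
```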

Current evaluations of language models often rely on benchmarks designed to assess ‘reasoning’ abilities, but this work argues for a fundamental recalibration. Because these models operate as Markov processes, predicting subsequent tokens according to conditional probabilities defined by a Markov kernel, assessment should instead focus on measurable epistemic properties. Rather than determining if a model arrives at a ‘correct’ answer, evaluation should quantify the model’s uncertainty, its calibration (how well its predicted probabilities reflect actual outcomes), and its ability to update beliefs given new information. This shift moves beyond judging the result of a process and instead analyzes the probabilistic mechanisms within the model, offering a more nuanced and accurate understanding of its capabilities and limitations as a probabilistic inference engine.
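One concrete example of such an epistemic measurement is calibration. The sketch below computes a simple expected calibration error by binning predictions by confidence and comparing mean confidence to empirical accuracy in each bin; the bin count and the sample data are arbitrary assumptions for illustration, not a metric prescribed by the paper.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence and compare mean confidence to accuracy.

    confidences: the probability the model assigned to its chosen answer.
    correct:     1 if that answer matched the reference, else 0.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight each bin by its share of samples
    return ece

# Hypothetical predictions: small gaps between confidence and accuracy give a small ECE.
print(expected_calibration_error([0.9, 0.8, 0.6, 0.3], [1, 1, 1, 0], n_bins=5))
```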

The Tightrope Walk: Context Windows and Long-Range Dependencies

The context window, a core limitation of current Language Models (LLMs), defines the maximum input sequence length the model can process in a single pass. This constraint is measured in tokens, with each token roughly corresponding to a word or part of a word. Typical context window sizes range from 2048 to 32,768 tokens, although some models are expanding this limit. When the input exceeds this window, it must either be truncated, discarding information from the beginning of the sequence, or condensed with techniques like summarization or retrieval so that it fits within the allowable length. The size of the context window directly impacts the model’s ability to understand long-range dependencies and maintain consistency across extended texts; larger context windows generally allow for more comprehensive analysis but also increase computational cost and memory requirements.
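The simplest of those handling strategies, truncation, can be sketched as below. The whitespace split stands in for a real subword tokenizer, and the 2048-token window is just one of the typical sizes mentioned above; both are simplifying assumptions.

```python
def truncate_to_window(text: str, max_tokens: int = 2048) -> list:
    """Keep only the most recent max_tokens tokens of the input."""
    tokens = text.split()  # placeholder for a real subword tokenizer
    if len(tokens) <= max_tokens:
        return tokens
    # Everything before the window boundary is silently discarded,
    # which is precisely where long-range information is lost.
    return tokens[-max_tokens:]

document = "token " * 5000
window = truncate_to_window(document, max_tokens=2048)
print(len(window))  # 2048: the first ~3000 tokens never reach the model
```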

The finite size of a Language Model’s context window directly impacts its capacity to maintain coherence and consistency in longer-form text generation. As the input sequence length approaches the window limit, the model exhibits a decreasing ability to accurately reference information from the beginning of the sequence when processing later tokens. This manifests as topic drift, contradictory statements, or a failure to adhere to previously established constraints or character definitions. Specifically, information presented earlier in the input becomes increasingly attenuated as the model processes subsequent tokens, leading to a degradation in the logical flow and internal consistency of the output. While techniques exist to mitigate this, the underlying constraint of a fixed context length remains a core limitation for extended text tasks.

While prompting strategies such as Chain-of-Thought prompting can improve performance by structuring the model’s reasoning process, they do not fundamentally overcome the limitations imposed by the finite context window. Language models process input text sequentially, and the context window defines the maximum input length; any information exceeding this limit is discarded. Consequently, even with optimized prompting, the model’s ability to reliably generate consistent and coherent outputs diminishes as the required context extends beyond the window’s capacity. This constraint introduces a potential source of error, as the model may lack access to crucial information necessary for accurate prediction or completion, impacting the overall reliability of generated text.

Beyond Task Scores: Measuring True Inferential Reliability

Traditional reasoning benchmarks for Language Models (LLMs) often prioritize performance on specific tasks, leading to evaluations that are susceptible to superficial patterns and lack generalizability. These benchmarks frequently fail to assess the model’s underlying reasoning process or its ability to maintain consistency across logically equivalent inputs. In contrast, an approach centered on epistemic invariance offers a more robust evaluation by focusing on whether a model’s outputs remain consistent when presented with semantically equivalent variations of the same input. This methodology shifts the focus from achieving high scores on pre-defined tasks to assessing the model’s inherent capacity for logical consistency and reliable inference, providing a more comprehensive measure of model trustworthiness beyond task-specific accuracy.

Epistemic invariance, as a reliability measure, evaluates a language model’s consistency by testing whether its outputs change when presented with logically equivalent reformulations of the same input. This means that if an input sentence is transformed – through paraphrasing, reordering of clauses that don’t alter meaning, or other logical manipulations – a model exhibiting epistemic invariance should produce statistically similar outputs as it would for the original input. The principle centers on the idea that a reliable model’s inferences should be grounded in the underlying logical structure of the input, rather than superficial phrasing. Consequently, evaluating invariance determines if a model’s responses are sensitive to irrelevant changes in input presentation, highlighting potential vulnerabilities in its reasoning process.

Epistemic invariance is quantitatively assessed through metrics like Total Variation Distance (TVD), which measures the largest difference in probability that two distributions assign to any event. The formalization $\forall t \in T,\ \mathrm{TV}\big(\kappa_\theta(\cdot \mid x), \kappa_\theta(\cdot \mid t(x))\big) \le \epsilon_T$ defines this assessment; it states that for all transformations $t$ within a set $T$, the TVD between the model’s output distribution $\kappa_\theta(\cdot \mid x)$ for input $x$ and its output distribution for the transformed input $t(x)$ must be less than or equal to a threshold $\epsilon_T$. A lower $\epsilon_T$ indicates greater consistency, demonstrating the model’s ability to maintain similar output distributions despite logically equivalent input variations, thus providing an objective measure of its inferential reliability.
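Read operationally, the bound amounts to a pairwise check: for each transformation $t$ in $T$, compute the total variation distance between the next-token distributions for $x$ and $t(x)$ and compare it against $\epsilon_T$. The sketch below does exactly that on invented distributions; the paraphrase pair, the probabilities, and the threshold are hypothetical.

```python
def total_variation(p: dict, q: dict) -> float:
    """TV(p, q) = 1/2 * sum over y of |p(y) - q(y)|."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(y, 0.0) - q.get(y, 0.0)) for y in support)

def epistemically_invariant(originals, transformed, epsilon=0.05) -> bool:
    """Check TV(kappa(.|x), kappa(.|t(x))) <= epsilon for every transformation t."""
    return all(total_variation(p, q) <= epsilon for p, q in zip(originals, transformed))

# Hypothetical next-token distributions for an input and a logically equivalent paraphrase.
p = {"yes": 0.82, "no": 0.10, "maybe": 0.08}
q = {"yes": 0.79, "no": 0.13, "maybe": 0.08}
print(total_variation(p, q))              # 0.03
print(epistemically_invariant([p], [q]))  # True under epsilon = 0.05
```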

Evaluation using inferential consistency, quantified by the condition $\forall (x,y) \in \mathcal{R}_r,\ \kappa_\theta(y \mid x) \ge 1 - \delta_r$, moves beyond assessing performance on specific tasks and instead focuses on the fundamental reliability of a language model’s reasoning process. Here, $\kappa_\theta(y \mid x)$ represents the model’s confidence in predicting output $y$ given input $x$, and $\delta_r$ is a permissible margin of error. This metric assesses whether the model consistently assigns a high probability, at least $1 - \delta_r$, to plausible outputs given a range of valid inputs, thereby providing a measure of its inherent trustworthiness and stability in drawing inferences, irrespective of the task at hand.
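The consistency condition translates into an equally direct check: every $(x, y)$ pair in the relation $\mathcal{R}_r$ must receive probability at least $1 - \delta_r$. The relation, the toy kernel, and the tolerance below are made up for illustration.

```python
def inferentially_consistent(kernel, relation, delta=0.1) -> bool:
    """Check kappa(y | x) >= 1 - delta for every (x, y) in the relation."""
    return all(kernel(x).get(y, 0.0) >= 1.0 - delta for x, y in relation)

# Hypothetical relation: contexts paired with the outputs they should support.
relation = [
    ("All birds fly. Tweety is a bird. Does Tweety fly?", "yes"),
    ("2 + 2 = ?", "4"),
]

def toy_kernel(x: str) -> dict:
    # Stand-in for a model's answer distribution given context x.
    if "Tweety" in x:
        return {"yes": 0.93, "no": 0.07}
    return {"4": 0.97, "5": 0.03}

print(inferentially_consistent(toy_kernel, relation, delta=0.1))  # True
```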

The Illusion of Understanding: From Vectors to Meaning

Language Models don’t inherently understand words; rather, they process them as discrete symbols. To bridge this gap, word embeddings transform each word into a dense vector in a high-dimensional space, where the geometric relationships between these vectors reflect semantic meaning. Words with similar meanings – such as “king” and “queen” – are positioned closer together in this space, allowing the model to recognize analogies and contextual similarities. This representation isn’t arbitrary; it’s learned from massive text corpora, identifying patterns in how words are used together. Consequently, the effectiveness of a Language Model is heavily reliant on the quality of these embeddings; a nuanced and accurate semantic representation is paramount for tasks requiring understanding, generalization, and the production of coherent, meaningful text.
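A minimal sketch of that geometric picture, using made-up three-dimensional vectors rather than embeddings learned from any corpus: related words have high cosine similarity, and the familiar king/queen analogy reduces to vector arithmetic.

```python
import numpy as np

# Toy 3-d embeddings, invented for illustration; learned embeddings
# typically have hundreds of dimensions.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.2, 0.8]),
    "man":   np.array([0.1, 0.8, 0.1]),
    "woman": np.array([0.1, 0.2, 0.8]),
}

def cosine(u, v) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Semantically related words sit close together in the space...
print(cosine(emb["king"], emb["queen"]))        # ~0.71
# ...and analogies correspond to shared offsets: king - man + woman ~= queen.
analogy = emb["king"] - emb["man"] + emb["woman"]
print(cosine(analogy, emb["queen"]))            # 1.0 for these toy vectors
```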

Pointwise Mutual Information (PMI) Factorization offers a statistically grounded approach to constructing word embeddings from extensive text datasets. This technique assesses the co-occurrence of words – how often they appear together within a defined context – and calculates a PMI score representing the statistical dependence between them. Rather than treating words as discrete symbols, PMI Factorization decomposes the co-occurrence matrix, effectively learning a lower-dimensional vector representation for each word. These vectors capture semantic relationships; words frequently appearing in similar contexts are positioned closer together in the vector space. The resulting embeddings allow Language Models to move beyond simple keyword matching, enabling them to recognize analogies, understand synonyms, and ultimately, process language with greater sophistication, all derived directly from the patterns present within the training data.
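The pipeline described above can be sketched in a few steps: count co-occurrences within a sliding window, convert counts to positive PMI scores, and factorize the resulting matrix with a truncated SVD to obtain low-dimensional vectors. The corpus, window size, and embedding dimension below are toy placeholders; this is a schematic of the technique, not the paper’s procedure.

```python
import numpy as np
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat and a dog played",
]
window = 2  # co-occurrence window, chosen arbitrarily for this sketch

# 1. Count word frequencies and symmetric pair co-occurrences within the window.
sentences = [s.split() for s in corpus]
vocab = sorted({w for sent in sentences for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
word_counts, pair_counts = Counter(), Counter()
for sent in sentences:
    for i, w in enumerate(sent):
        word_counts[w] += 1
        for c in sent[max(0, i - window):i]:
            pair_counts[tuple(sorted((w, c)))] += 1

# 2. Build the positive PMI matrix: max(0, log[ p(w, c) / (p(w) p(c)) ]).
total_pairs = sum(pair_counts.values())
total_words = sum(word_counts.values())
ppmi = np.zeros((len(vocab), len(vocab)))
for (w, c), n in pair_counts.items():
    pmi = np.log((n / total_pairs) /
                 ((word_counts[w] / total_words) * (word_counts[c] / total_words)))
    ppmi[idx[w], idx[c]] = ppmi[idx[c], idx[w]] = max(0.0, pmi)

# 3. Factorize with a truncated SVD; rows of U * sqrt(S) serve as word vectors.
U, S, _ = np.linalg.svd(ppmi)
dim = 4
embeddings = U[:, :dim] * np.sqrt(S[:dim])
print(embeddings.shape)  # (vocabulary size, 4)
```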

The effectiveness of a language model hinges significantly on the quality of its word embeddings – the numerical representations of words that capture their semantic meaning. High-quality embeddings allow the model to recognize relationships between words, even if those words haven’t been encountered together frequently during training. This ability is crucial for generalization; a model with robust embeddings can accurately process and understand novel sentences or contexts. Conversely, poor embeddings result in a limited understanding of nuance, potentially leading to misinterpretations, inaccurate predictions, and a failure to grasp subtle differences in meaning. Essentially, these vector representations serve as the foundation for linguistic intelligence, directly impacting the model’s capacity to move beyond rote memorization and achieve genuine language comprehension.

The consistent and trustworthy performance of any Language Model is fundamentally linked to its ability to accurately represent the meaning of language. Enhancing semantic representation – moving beyond simple word association to capture contextual relationships and nuanced definitions – directly addresses inconsistencies in model outputs. When a model possesses a richer understanding of what words truly signify, it is less prone to generating illogical or contradictory statements. This improvement stems from a more robust internal framework for evaluating the plausibility of different word sequences; a model grounded in accurate semantics can better discern between valid and nonsensical combinations. Consequently, refinements in semantic representation translate to increased reliability, allowing Language Models to deliver more predictable, coherent, and ultimately, useful results, even when confronted with ambiguous or complex prompts.

The pursuit of ‘reasoning’ in these models feels less like building intelligence and more like documenting the inevitable entropy of complex systems. This paper’s focus on the Markov kernel and on identifying invariances (essentially, what doesn’t break when pushed) is a pragmatic, if cynical, approach. It acknowledges that production will always stress-test the theory. As Henri Poincaré observed, “Mathematics is the art of giving reasons.” But these models aren’t giving reasons; they’re revealing patterns within probability distributions. The search for true inference is noble, but the real work lies in mapping the boundaries of what these systems can reliably, if predictably, fail at. It’s a meticulous cataloging of controlled failures, dressed up as progress.

What’s Next?

The framing of language model capability as inference via Markov kernels, while offering a welcome dose of mathematical rigor, predictably shifts the goalposts rather than truly clearing them. It simply re-contextualizes the problem: the challenge is no longer whether these models reason, but how to characterize the precise nature of the invariances they exploit. One suspects this will quickly devolve into a taxonomy of clever prompts designed to expose increasingly brittle ‘reasoning’ shortcuts. The field will undoubtedly produce ever more complex benchmarks, each more meticulously crafted to circumvent the latest mitigation, until production systems inevitably break them all.

A genuine advancement will likely require abandoning the quest for general ‘reasoning’ altogether. The emphasis should instead be placed on formally defining the specific tasks for which these models are demonstrably reliable, and accepting that extrapolation beyond those boundaries is a fool’s errand. After all, the history of computing is littered with systems that performed beautifully on contrived examples, only to fail spectacularly when confronted with the messy reality of real-world data.

Ultimately, this work serves as a reminder that everything new is just the old thing with worse docs. The underlying problems remain stubbornly consistent: insufficient data, biased training, and a fundamental inability to grapple with genuine novelty. Perhaps, instead of seeking artificial intelligence, a more fruitful path lies in building systems that are simply good enough, and accepting their inherent limitations.


Original article: https://arxiv.org/pdf/2511.11810.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/


2025-11-19 01:39