Author: Denis Avetisyan
New research reveals that even the most advanced AI reasoning systems can be surprisingly fragile when faced with subtly altered inputs.

Semantic invariance is not guaranteed by scale in large language model-based agentic AI: models exhibit distinct vulnerability profiles and a universal fragility to contrastive examples.
Despite the increasing deployment of Large Language Models (LLMs) as autonomous reasoning agents, their reliability under semantically equivalent input variations remains largely unaddressed. This paper, ‘Semantic Invariance in Agentic AI’, introduces a metamorphic testing framework to systematically assess the robustness of LLM reasoning across diverse foundation models (Hermes, Qwen3, DeepSeek-R1, and gpt-oss) and scientific domains. Our results reveal that model scale is not a reliable predictor of stability: the smaller Qwen3-30B-A3B exhibits the highest invariance, while larger models prove surprisingly fragile, particularly to contrastive inputs. What implications do these distinct vulnerability profiles hold for building truly dependable agentic AI systems?
The Illusion of Intelligence: Fragility in Reasoning Agents
Despite impressive progress in natural language processing, large language model-based reasoning agents exhibit a surprising vulnerability to even subtle alterations in input phrasing. These agents, while capable of solving complex problems under specific conditions, often falter when presented with semantically equivalent prompts worded differently – a phenomenon suggesting a reliance on superficial patterns rather than genuine understanding. Researchers have demonstrated that minor changes, such as reordering clauses or substituting synonyms, can drastically reduce performance, highlighting a fragility that belies the apparent intelligence of these systems. This sensitivity poses a significant challenge to the reliable deployment of LLM agents in real-world scenarios where input diversity is inevitable, demanding a shift towards more robust and adaptable reasoning architectures.
Current assessments of large language model-based reasoning agents frequently underestimate their susceptibility to subtle input changes due to limitations in testing procedures. Existing benchmarks typically employ a narrow range of test cases, failing to capture the breadth of possible phrasing variations that maintain the same underlying meaning – a phenomenon known as semantic equivalence. This reliance on a restricted evaluation space creates a misleading impression of robustness; an agent may perform well on a specific test question but falter when presented with a semantically identical query expressed in different terms. Consequently, reported performance metrics often do not reflect an agent’s true ability to generalize and reason reliably, hindering accurate comparisons between models and masking potential vulnerabilities before real-world deployment.
The limited robustness of large language model-based reasoning agents presents a significant hurdle to their practical implementation in fields demanding consistent performance. While these agents demonstrate impressive capabilities in controlled environments, even subtle alterations to input phrasing – semantically equivalent requests, for instance – can lead to markedly different, and often incorrect, outputs. This fragility isn’t merely an academic concern; in applications like automated diagnosis, financial trading, or autonomous vehicle control, unpredictable behavior resulting from minor input variations could have severe consequences. Consequently, ensuring the reliability of these agents requires moving beyond benchmark datasets and developing methods to rigorously assess and improve their resilience to real-world ambiguity and noise, ultimately bridging the gap between impressive demonstration and trustworthy deployment.

Metamorphic Testing: Exposing the Cracks in the System
Metamorphic testing assesses system reliability by examining the consistency of outputs when presented with systematically altered inputs, circumventing the need for pre-labeled ground truth data. This approach focuses on verifying that specific transformations of an input – changes that shouldn’t fundamentally alter the expected reasoning process – result in correspondingly predictable changes in the system’s output. By evaluating these relationships, rather than absolute correctness against a known answer, metamorphic testing can uncover inconsistencies and vulnerabilities in a system’s logic, even in scenarios where defining a correct output is challenging or impossible. This is particularly useful for complex systems like large language models where exhaustive ground truth labeling is impractical.
Metamorphic testing relies on a defined set of metamorphic relations to systematically create varied test inputs. Structural Transformation modifies the input’s format without altering its meaning – for example, reordering a list of premises. Contrastive Transformation generates near-miss cases, subtly changing input elements to assess sensitivity to critical features. Verbosity Transformation alters the level of detail in the input, testing robustness to redundant or missing information. Finally, Contextual Transformation modifies surrounding contextual information to evaluate the agent’s reliance on irrelevant cues. Applying these transformations to an initial input generates a suite of test cases designed to probe specific aspects of the system’s reasoning capabilities.
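The four transformation types can be sketched as simple input generators. The helper names, the premise/question prompt format, and the specific edits below are illustrative assumptions, not the paper's actual implementation:

```python
# Illustrative sketch of the four metamorphic transformations.
# Prompt format and edit rules are assumptions for demonstration.
import random

def structural(premises: list[str], question: str, seed: int = 0) -> str:
    """Reorder the premises without changing their meaning."""
    rng = random.Random(seed)
    shuffled = premises[:]
    rng.shuffle(shuffled)
    return " ".join(shuffled) + " " + question

def contrastive(premises: list[str], question: str) -> str:
    """Near-miss case: flip one critical detail, so the answer should change."""
    altered = premises[:]
    altered[0] = altered[0].replace("All", "No")  # minimal critical edit
    return " ".join(altered) + " " + question

def verbose(premises: list[str], question: str) -> str:
    """Pad the input with redundant but harmless detail."""
    filler = "As is widely known, careful reasoning is important."
    return filler + " " + " ".join(premises) + " " + question

def contextual(premises: list[str], question: str) -> str:
    """Prepend irrelevant context the agent should ignore."""
    noise = "The weather in Paris was mild that day."
    return noise + " " + " ".join(premises) + " " + question

premises = ["All metals conduct electricity.", "Copper is a metal."]
question = "Does copper conduct electricity?"
variants = {
    "structural": structural(premises, question),
    "contrastive": contrastive(premises, question),
    "verbose": verbose(premises, question),
    "contextual": contextual(premises, question),
}
```

Note that only the contrastive variant is expected to change the correct answer; the other three preserve it, which is what makes them usable as invariance probes.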
Metamorphic testing relies on the principle that certain input variations should not change the expected output of a reasoning agent. These variations, termed metamorphic relations, are designed to preserve the core problem structure while altering superficial characteristics. For example, reordering irrelevant information within a prompt, changing the verbosity of a question, or slightly altering contextual details should ideally produce the same answer if the agent is focusing on the essential reasoning elements. By systematically applying these transformations and comparing the resulting outputs, the testing framework identifies inconsistencies that suggest a lack of robustness or an over-reliance on spurious correlations within the agent’s decision-making process. This approach allows for evaluation without requiring pre-labeled ground truth data, as the expected consistency between transformed inputs serves as the validation signal.
The evaluation framework incorporates a Problem Corpus designed to provide comprehensive coverage of the reasoning space. This corpus consists of a diverse set of problems categorized both by domain – encompassing areas such as commonsense reasoning, symbolic manipulation, and logical inference – and by difficulty level, ranging from simple, directly solvable instances to complex, multi-step problems. This categorization allows for targeted testing of the agent’s capabilities across a spectrum of reasoning challenges and facilitates the identification of weaknesses in specific domains or difficulty ranges. The corpus is continually expanded and refined to ensure robust and representative evaluation metrics.

Unveiling Unexpected Vulnerabilities in LLM Architectures
Universal Contrastive Fragility, as observed in our research, indicates a consistent susceptibility of all tested Large Language Models (LLMs) to performance degradation when subjected to contrastive transformations. These transformations, involving subtle alterations to input prompts while preserving semantic meaning, consistently resulted in measurable instability across diverse model architectures, including those based on different training datasets and parameter sizes. The observed fragility is not limited to specific model families; instead, all tested models exhibited reduced performance metrics following the application of these transformations, suggesting a fundamental vulnerability inherent in the current LLM paradigm. This indicates that even seemingly minor input perturbations can significantly impact model outputs, raising concerns about the reliability of LLMs in real-world applications.
Analysis of LLM responses to contrastive transformations reveals that vulnerability is not uniform across model architectures. The tested model families – Hermes, Qwen3, DeepSeek-R1, and gpt-oss – demonstrate distinct patterns of sensitivity to specific metamorphic relations, such as synonym replacement, paraphrasing, or sentence reordering. These “Model-Family Vulnerability Signatures” indicate that the susceptibility to performance degradation varies predictably based on the underlying architecture and training data. For instance, certain models may exhibit greater instability when presented with semantic perturbations, while others are more sensitive to syntactic changes, suggesting that architectural choices influence the types of adversarial attacks to which a model is most vulnerable.
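Such a signature can be computed by averaging the observed performance delta over each (model family, metamorphic relation) pair. A hedged sketch, with an assumed record format and illustrative numbers rather than the paper's data:

```python
# Sketch of building per-family "vulnerability signatures": the mean
# performance delta per (family, relation) pair. Record format is assumed.
from collections import defaultdict

def signatures(records: list[tuple[str, str, float]]) -> dict[str, dict[str, float]]:
    """records: (family, relation, delta) -> mean delta per family/relation."""
    grouped: dict[str, dict[str, list[float]]] = defaultdict(lambda: defaultdict(list))
    for family, relation, delta in records:
        grouped[family][relation].append(delta)
    return {
        fam: {rel: sum(ds) / len(ds) for rel, ds in rels.items()}
        for fam, rels in grouped.items()
    }

# Illustrative records only, not the paper's measurements.
records = [
    ("qwen3", "contrastive", -0.05),
    ("qwen3", "structural", -0.01),
    ("gpt-oss", "contrastive", -0.45),
    ("gpt-oss", "contrastive", -0.43),
]
sig = signatures(records)
```

Comparing rows of this matrix across families is what surfaces the distinct sensitivity patterns described above.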
Contrary to the expectation of emergent robustness in larger language models, our research demonstrates a Scale-Robustness Inversion phenomenon. Specifically, we observed that increasing model size does not consistently correlate with improved resilience to contrastive perturbations. The Qwen3-30B-A3B model, with 30 billion parameters, achieved a stability rate of 79.6% and a Mean Absolute Delta (MAD) of 0.049 – the lowest MAD observed across all tested models – and thus outperformed larger models in maintaining performance under these conditions. In comparison, the gpt-oss-120b model, significantly larger at 120 billion parameters, experienced a substantial performance degradation with a delta of -0.449 when subjected to the same contrastive transformations, highlighting the non-linear relationship between scale and robustness.
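The two statistics quoted here are straightforward to compute, assuming each test case yields a score for the original input and for its transformed counterpart. A minimal sketch with illustrative numbers:

```python
# Sketch of the two robustness statistics: stability rate (fraction of
# cases whose answers agree before and after transformation) and Mean
# Absolute Delta (mean absolute score change). Example scores are invented.
def stability_rate(agreements: list[bool]) -> float:
    return sum(agreements) / len(agreements)

def mean_absolute_delta(original: list[float], transformed: list[float]) -> float:
    return sum(abs(o - t) for o, t in zip(original, transformed)) / len(original)

orig_scores = [0.90, 0.85, 0.95, 0.80]
trans_scores = [0.88, 0.80, 0.95, 0.77]
mad = mean_absolute_delta(orig_scores, trans_scores)
rate = stability_rate([True, True, False, True])
```

A low MAD with a high stability rate, as reported for Qwen3-30B-A3B, indicates outputs that stay both numerically and categorically close under perturbation.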
The gpt-oss-120b language model exhibited a significant performance decrease when subjected to contrastive transformations, registering a delta of -0.449. This metric quantifies the change in model output probability distribution following the application of these perturbations, with a negative value indicating a reduction in predictive accuracy. The observed delta suggests substantial performance degradation, implying that even relatively minor alterations to the input can lead to markedly different and less reliable outputs from the model. This result highlights the vulnerability of large language models to adversarial attacks and emphasizes the need for robust defense mechanisms.

Beyond Current Models: Towards Truly Robust Reasoning Agents
Current methods for evaluating reasoning agents often rely on narrow benchmark datasets, creating a misleading impression of genuine intelligence and hindering progress towards truly robust artificial intelligence. This work underscores the critical necessity of shifting evaluation paradigms to prioritize generalization and resilience – the ability to maintain performance across diverse and unforeseen inputs. Existing benchmarks frequently incentivize memorization and exploitation of dataset biases rather than authentic reasoning capabilities, leading to models that perform well in controlled settings but fail dramatically when confronted with real-world complexity. Consequently, a move towards more comprehensive and challenging evaluation methodologies, encompassing adversarial testing, out-of-distribution generalization assessments, and semantic verification techniques, is paramount to fostering the development of reasoning agents capable of reliable and adaptable performance.
Addressing the identified vulnerabilities in current reasoning agents necessitates a shift towards more resilient architectural designs. Researchers are actively exploring mechanisms like semantic verification, where a model’s output isn’t merely assessed for syntactic correctness but also for its alignment with underlying meaning and factual consistency. Complementary to this is the application of adversarial training, a technique that intentionally exposes the agent to subtly altered inputs designed to induce errors, thereby fortifying its robustness against unexpected or malicious data. These approaches aim to move beyond superficial pattern matching and cultivate a deeper understanding of the reasoning process, enabling agents to maintain reliable performance even when faced with noisy, ambiguous, or intentionally deceptive information. The ultimate goal is to build systems capable of not just performing reasoning, but of understanding it, and demonstrating consistent, dependable results across diverse and challenging scenarios.
A novel testing framework leverages the power of Sentence Embeddings to quantify semantic similarity, providing developers with a practical method for evaluating and strengthening model robustness. This approach moves beyond simple accuracy metrics by assessing whether a model’s response, even if syntactically correct, maintains the intended meaning when presented with subtly altered or adversarial inputs. By embedding both the original prompt and the model’s response into a high-dimensional semantic space, the framework calculates a similarity score, flagging responses that deviate significantly in meaning as potential vulnerabilities. This allows for targeted refinement of models, improving their resilience to nuanced changes and ensuring more reliable performance in real-world applications where inputs are rarely perfectly formulated.
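The flagging logic can be illustrated with a toy stand-in for the embedding step. A production framework would use a trained sentence-embedding model; here a bag-of-words vector and cosine similarity show the mechanics, and the 0.8 threshold is an assumption for illustration:

```python
# Toy semantic-drift check. A real framework would replace `embed` with a
# sentence-embedding model; bag-of-words cosine is used here only so the
# example is self-contained. The threshold is an illustrative assumption.
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Placeholder embedding: word-count vector over lowercase tokens."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def flag_drift(reference: str, response: str, threshold: float = 0.8) -> bool:
    """True when the response drifts too far from the reference meaning."""
    return cosine(embed(reference), embed(response)) < threshold
```

With a real embedding model the same thresholding identifies responses that are syntactically plausible yet semantically misaligned with the prompt.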
A comprehensive evaluation was conducted across several open-source large language models – including variations from GPT-OSS, DeepSeek-R1, Qwen3, and Hermes – to establish a foundational understanding of their reasoning robustness. This testing sought to move beyond simple accuracy metrics and instead focus on semantic consistency, assessing whether models maintained logical coherence even with subtle input variations. Results indicated significant performance differences between architectures, with the Qwen3-30B model demonstrating a particularly strong capacity for robust reasoning, achieving the highest semantic similarity score observed, 0.91. This score suggests that Qwen3-30B is able to consistently produce outputs that are semantically aligned with the intended meaning of the input, even when faced with challenging or ambiguous prompts, highlighting its potential as a resilient foundation for future reasoning agents.

The pursuit of robust agentic AI, as detailed in the study of semantic invariance, echoes a fundamental tenet of systems analysis: understanding limitations through rigorous testing. The paper's findings – that scale alone doesn't guarantee invariance and that LLMs exhibit vulnerability to carefully crafted contrastive inputs – reinforce this principle. It's as if the code remains unread, even in the most complex systems. This aligns with Hilbert's assertion: “We must be able to answer the question: can mathematics be reduced to mechanics?” The study's approach – essentially probing the ‘mechanics’ of LLM reasoning – reveals the boundaries of current foundation models, demonstrating that even powerful systems require relentless interrogation to expose hidden fragility.
Beyond Invariance: Where Do We Go From Here?
The demonstrated lack of guaranteed semantic invariance in large language models, even with increasing scale, isn’t a bug – it’s a feature. Or, more accurately, it’s a predictable consequence of building intelligence from statistical correlations. The pursuit of robustness through sheer size now appears a somewhat naive endeavor. The finding that even minor, contrastive perturbations can derail reasoning agents doesn’t invalidate the technology, but it does force a reassessment of current validation strategies. Simply observing complex behavior isn’t enough; true security lies in understanding why a system fails, not just that it fails.
Future work must move beyond simply cataloging vulnerabilities. The observed differences in fragility across model families suggest that architectural choices, not just training data, play a crucial role. A fruitful line of inquiry involves actively seeking the breaking points of these systems – embracing metamorphic testing not as a defensive measure, but as a means of reverse-engineering the internal representations that govern reasoning.
Ultimately, the goal isn’t to create models that are impervious to all inputs, but to build systems where failure is transparent and predictable. A model that reveals its limitations is, paradoxically, more trustworthy than one that appears to succeed without explanation. The challenge, then, isn’t to eliminate error, but to expose it.
Original article: https://arxiv.org/pdf/2603.13173.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/