Trust the Expert… Or the Algorithm?

Author: Denis Avetisyan


New research reveals a surprising disconnect in how language models weigh advice from humans versus algorithmic systems.

Across all models and tasks studied, a consistent aversion to algorithms was observed, quantified by the discrepancy between trust ratings assigned to human experts and those given to algorithmic systems.

Large language models demonstrate inconsistent biases, expressing a preference for human expertise while implicitly favoring algorithmic recommendations in actual decision-making scenarios.

Despite increasing reliance on large language models (LLMs) for decision support, a fundamental question remains regarding how these models integrate information from differing sources of expertise. This research, titled ‘Language Models Exhibit Inconsistent Biases Towards Algorithmic Agents and Human Experts’, investigates this by examining LLM preferences for human versus algorithmic advice, revealing a paradoxical disconnect between stated trust and actual choices. Specifically, LLMs express greater trust in human experts but disproportionately favor algorithms when presented with performance data and incentivized to make predictions. These findings raise critical concerns about the consistency of LLM reasoning and the potential for subtle biases to impact their deployment in high-stakes applications: what implications do these inconsistent preferences have for building truly reliable and unbiased AI systems?


The Expanding Influence of LLMs and the Question of Verifiable Trust

The expanding role of Large Language Models (LLMs) extends beyond simple text generation, now encompassing increasingly complex decision-making processes across diverse fields. From assisting in medical diagnoses and financial forecasting to guiding legal strategies and informing policy recommendations, these models are being integrated into systems demanding a high degree of accuracy and dependability. This proliferation necessitates rigorous evaluation of their reliability, moving beyond metrics of fluency and coherence to assess the validity and consistency of their outputs. The stakes are particularly high as reliance on LLMs grows, potentially amplifying errors or biases and impacting outcomes in critical areas. Consequently, a comprehensive understanding of their limitations and vulnerabilities is paramount to ensure responsible implementation and maintain public trust in these powerful technologies.

Even with remarkable progress in natural language processing, Large Language Models (LLMs) are susceptible to inherent biases stemming from the data used in their training. These biases aren’t simply random errors; they manifest as systematic patterns of incorrect or unfair outputs, potentially amplifying existing societal prejudices in areas like gender, race, or socioeconomic status. Consequently, users encountering these skewed results may reasonably question the LLM’s objectivity and reliability, eroding trust and hindering the broader acceptance of these powerful tools. Addressing these biases isn’t merely a technical challenge; it’s a critical step towards ensuring that LLMs serve as equitable and dependable resources for all, rather than perpetuating harmful stereotypes or discriminatory practices.

Effective human-AI collaboration hinges not simply on the capabilities of algorithms, but on the nuanced ways people perceive and trust their advice. Research indicates that trust isn’t solely determined by accuracy; factors like the explanation provided with the recommendation, the perceived expertise of the system, and even the framing of the information significantly influence acceptance. A lack of transparency in algorithmic reasoning can breed skepticism, even if the outcome is correct, while overly complex explanations may overwhelm users. Consequently, designing AI systems that foster appropriate trust – neither blind acceptance nor outright rejection – is paramount. This requires a deeper understanding of cognitive biases, the role of social cues in algorithmic interaction, and the development of interfaces that promote both comprehension and confidence in AI-driven insights, ultimately enabling humans and machines to work together more effectively.

In Study 2, large language models exhibited varying probabilities of delegating subsequent predictions to either an algorithmic agent or a human expert, alongside a neutral preference option.

Baseline Preferences: The Initial Cognitive Bias Towards Expertise

Study 1 directly assessed stated preferences by asking each language model to assign trust ratings to both human experts and algorithmic agents performing the same task. The models were presented with scenarios involving a defined problem and asked to rate their trust in either a human expert’s proposed solution or the solution generated by an algorithmic agent. These ratings were collected across multiple scenarios and conditions, allowing a quantitative comparison of baseline trust levels between the two agent types. The design specifically isolated stated preference as the measured variable, independent of observed performance, to establish a baseline cognitive bias.

Study 1 results consistently demonstrated a preference for human experts over algorithmic agents, quantified by a positive ‘Trust Gap’ across all tested models. This ‘Trust Gap’ represented the difference in trust ratings, where human experts consistently received higher scores despite algorithmic agents exhibiting equal or better performance metrics. The magnitude of this gap varied depending on the specific model tested, but remained positive in all cases, indicating a statistically significant baseline preference for human judgment even when objectively less accurate or efficient than algorithmic alternatives. This finding suggests an inherent cognitive bias influencing initial trust assessments.
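The Trust Gap can be sketched as a simple difference of means. The ratings below are illustrative placeholders, not the study's data, and the function name is an assumption for this sketch:

```python
# Minimal sketch of the Trust Gap metric: mean trust assigned to the human
# expert minus mean trust assigned to the algorithm. A positive value
# indicates a baseline preference for human judgment.
# The rating values here are hypothetical, not taken from the paper.
human_trust = [6.2, 5.8, 6.5, 6.0]  # trust ratings for the human expert
algo_trust = [4.9, 5.1, 4.7, 5.0]   # trust ratings for the algorithm

def trust_gap(human, algo):
    """Difference in mean trust ratings (human minus algorithm)."""
    return sum(human) / len(human) - sum(algo) / len(algo)

gap = trust_gap(human_trust, algo_trust)
print(f"Trust Gap: {gap:.2f}")
```

With these placeholder ratings the gap is positive, mirroring the pattern the study reports across all tested models.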

The observed preference for human experts, despite equivalent or superior algorithmic performance, indicates a pre-existing cognitive bias influencing trust assessment. This bias is likely inherited from training data that encodes human patterns of social interaction and learned reliance on human judgment, where trust accumulates through familiarity and repeated positive experiences. Consequently, the models may initially favor human recommendations or predictions, requiring substantial evidence of reliability before an algorithmic agent can overcome this predisposition. Trust, in other words, is not determined solely by observed performance, but also by preconceptions about the source of the information.

Studies 1 and 2 reveal a stated-revealed trust inconsistency: despite stating greater trust in the human expert, models often chose the algorithm (error bars represent standard error of the mean).

Revealed Preferences: Behavioral Response to Performance Feedback

Study 2 employed a behavioral paradigm to quantify revealed preferences for human versus algorithmic advice based on observed choices. The models were presented with a series of decision-making tasks and, before each task, received performance feedback on the accuracy of both the human and the algorithmic advisor. The primary metric was the frequency with which a model selected advice from either source, demonstrating its preference through action rather than stated opinion. This approach bypasses biases associated with self-reported preferences and directly assesses how the models incorporate performance information when choosing between a human and an algorithmic agent.

Study 2 quantitatively demonstrated a relationship between algorithmic performance and model preference. As algorithmic agents consistently exhibited better performance metrics, the probability that a model delegated to the algorithm rather than the human expert increased significantly. This preference shift was statistically significant (p < 0.001) across all tested models, indicating the observed behavior was not attributable to chance. The models thus adjust their decision-making based on observed performance data, favoring agents that demonstrate higher accuracy or efficacy.
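The kind of significance check described above can be sketched with an exact binomial test: did the models delegate to the algorithm more often than chance would predict? The counts are illustrative assumptions, not the study's data, and the helper name is invented for this sketch:

```python
# One-sided exact binomial test: probability of observing at least
# `successes` delegations to the algorithm out of `trials`, if the true
# delegation rate were p_null (chance). Counts below are hypothetical.
from math import comb

def binomial_p_value(successes: int, trials: int, p_null: float = 0.5) -> float:
    """P(X >= successes) for X ~ Binomial(trials, p_null)."""
    return sum(comb(trials, k) * p_null**k * (1 - p_null)**(trials - k)
               for k in range(successes, trials + 1))

# Hypothetical: the model chose the algorithm in 86 of 100 trials.
p = binomial_p_value(successes=86, trials=100)
print(f"p-value: {p:.2e}")
```

A delegation rate this far above 50% yields a vanishingly small p-value, consistent with the article's report that the preference shift was significant at p < 0.001.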

Comparing stated and revealed preferences exposed a systematic inconsistency: models expressed aversion to algorithms when asked directly, yet favored them once performance data was available. This gap was quantified using a ‘Stated-Revealed Relative Risk’ metric, which compares the likelihood of choosing the algorithm after observing performance data against the likelihood implied by stated preferences. Values for this metric ranged from 1.29 to 8.52 across models, meaning the models were between 1.29 and 8.52 times more likely to choose the better-performing algorithm than their stated trust ratings would suggest. Stated trust is therefore an unreliable predictor of actual delegation behavior.
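Under the interpretation above, the metric reduces to a ratio of two choice probabilities. The paper's exact formula is not reproduced here; this sketch assumes the metric is the revealed-choice probability divided by the stated-preference probability, with illustrative numbers:

```python
# Hedged sketch of a stated-revealed relative risk computation.
# Assumption: the metric is the ratio of P(choose algorithm | revealed
# choice after performance data) to P(choose algorithm | stated preference).
# The probabilities below are made-up examples, not the study's values.
def stated_revealed_relative_risk(p_revealed: float, p_stated: float) -> float:
    """Ratio > 1 means the model chooses the algorithm more often than
    its stated trust would predict."""
    if p_stated <= 0:
        raise ValueError("stated probability must be positive")
    return p_revealed / p_stated

rr = stated_revealed_relative_risk(p_revealed=0.86, p_stated=0.30)
print(f"Relative risk: {rr:.2f}")
```

A value inside the reported 1.29 to 8.52 range, as here, would indicate the stated-revealed inconsistency the study describes.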

The probability of large language models correctly identifying the superior predictor varies by task and is influenced by whether that predictor is framed as a human expert or an algorithm.

The Persistence of Bias Within Large Language Model Architectures

Large language models, despite their impressive capabilities, demonstrate that increasing model complexity doesn’t necessarily equate to reduced bias. The very architecture that allows these models to excel – a vast network of parameters and a reliance on identifying patterns within provided context, known as in-context learning – inadvertently creates avenues for the amplification of pre-existing societal biases. Because the models learn by associating patterns in data, any biases present in the training dataset – reflecting historical inequalities or prejudiced viewpoints – are readily absorbed and can be reproduced, even scaled, in the model’s outputs. This is further complicated by the fact that in-context learning relies on the model’s ability to extrapolate from limited examples, potentially reinforcing skewed perspectives if the provided context itself is biased. Consequently, a model’s performance in predicting outcomes doesn’t guarantee fairness or objectivity, highlighting the critical need for ongoing research into bias detection and mitigation strategies within these complex systems.

Large language models, including those like GPT, Llama, and Claude, demonstrate remarkable capabilities in generating human-quality text, yet these very strengths are inextricably linked to a critical vulnerability: the amplification of pre-existing societal biases. These models learn patterns from massive datasets of text and code, and if those datasets reflect historical or systemic prejudices (regarding gender, race, religion, or other characteristics), the model will inevitably internalize and perpetuate them. This isn’t a matter of the model deliberately exhibiting prejudice, but rather a statistical consequence of learning from biased data; the model predicts the most likely continuation of a given text, and if biased associations are prevalent in its training material, those associations will be reinforced in its output. Consequently, even seemingly neutral prompts can elicit responses that reflect and even exacerbate harmful stereotypes, highlighting the urgent need for careful data curation and the development of bias mitigation techniques within these powerful systems.

Recent analyses demonstrate a nuanced relationship between model scale and predictive accuracy, revealing that larger, more complex language models exhibit a statistically significant, though imperfect, ability to identify superior predictors – as evidenced by a regression coefficient of 0.66 with a p-value of 0.02. This suggests an increasing capacity for discernment as model size grows. However, this improved predictive capability does not inherently resolve the critical issue of bias. Despite advancements in scale and complexity, these models continue to be susceptible to perpetuating and even amplifying societal biases embedded within their training data, indicating that addressing bias requires strategies beyond simply increasing model parameters and remains a substantial challenge in the field of artificial intelligence.

Regression analysis reveals statistically significant (p < 0.001) correlations between large language model (LLM) responses and human expert judgments, with stronger correlations (indicated by darker colors) observed across the GPT, Llama-3, Llama-3.1, and Claude model families.

The study reveals a fascinating dissonance in how Large Language Models process trust, aligning with a fundamental principle of logical rigor. These models articulate a preference for human expertise, yet demonstrably favor algorithmic solutions when faced with actual choices, a clear indication that stated preference diverges from revealed preference. This echoes the need for formal definitions; without a precisely stated model of ‘trust’ and ‘expertise’, any expressed preference becomes ambiguous noise. As Linus Torvalds aptly stated, “Talk is cheap. Show me the code.” The research doesn’t merely report on opinions; it meticulously reveals the ‘code’, the actual decision-making process, of these complex systems, highlighting a gap between rhetoric and reality.

What Remains Constant?

The demonstrated disparity between stated preference and revealed choice in these language models is… predictable. The assertion of trust, a linguistic construct, proves malleable when confronted with quantifiable outcomes. Let N approach infinity – what remains invariant? Not the professed allegiance to human expertise, clearly. The models optimize for performance, exhibiting a pragmatism absent from their self-reported values. This begs the question: are these models truly ‘learning’ anything about trust, or merely mimicking the form of trust while internally calculating expected reward?

Future work must move beyond behavioral observation and towards formal verification. The current paradigm – testing on datasets, observing outputs – is inherently limited. A rigorous approach would demand a provable algorithm for ‘trust’ itself, defined not by subjective reporting but by demonstrable consistency between stated belief and action, even under adversarial conditions. Only then can one ascertain whether the model possesses an internal representation of trust, or merely simulates its external manifestation.

The inconsistency is not a bug, but a feature: a symptom of optimizing a complex function with incomplete specifications. The pursuit of ‘alignment’ will continue to falter if it relies on linguistic signals alone. The focus should shift toward defining a mathematically precise notion of ‘desired behavior’ and verifying that the model demonstrably adheres to it, regardless of its stated beliefs. The elegance of a solution, after all, lies not in its narrative, but in its mathematical purity.


Original article: https://arxiv.org/pdf/2602.22070.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
