The AI Trust Trap: How Supporting Evidence Impacts Human Verification

Author: Denis Avetisyan


New research reveals that while providing evidence alongside AI-generated answers can speed up fact-checking, it doesn’t necessarily guarantee better judgment, and can even foster dangerous over-reliance.

The study assessed how users judge the veracity of AI-generated answers, establishing a framework for evaluating the truthfulness of information presented to participants.

A comparative analysis of retrieved passages and large language model explanations reveals crucial insights into human-AI teaming for veracity assessment of AI outputs.

Despite increasing reliance on generative artificial intelligence, verifying the veracity of its outputs remains a critical challenge, particularly in high-stakes domains. This research, ‘To Believe or Not To Believe: Comparing Supporting Information Tools to Aid Human Judgments of AI Veracity’, investigates how different forms of supporting information – full source text, passage retrieval, and large language model (LLM) explanations – impact human assessment of AI-generated data. We find that while retrieved passages offer a balance of speed and accuracy, LLM explanations can foster inappropriate trust and reliance, leading users to overlook errors. How can we design AI systems that empower responsible human agency in veracity assessment, ensuring appropriately calibrated trust without sacrificing efficiency?


The Illusion of Intelligence: Why AI Confidently Gets Things Wrong

Large Language Models, despite their impressive ability to construct seemingly coherent text, are prone to generating information that is demonstrably false or entirely fabricated – a phenomenon often referred to as ‘hallucination’. This isn’t a matter of simple error; the models confidently present these inaccuracies as factual statements, leveraging patterns learned from massive datasets without any inherent understanding of truth or validity. The issue stems from the probabilistic nature of their operation; these models predict the most likely sequence of words, not necessarily the most accurate one. Consequently, even highly sophisticated models can produce plausible-sounding but entirely untrue content, posing a significant challenge to their reliable deployment in fields demanding factual precision, and eroding confidence in AI-generated outputs.

The propensity of large language models to generate inaccurate or fabricated content poses a significant threat to user confidence and, consequently, restricts their deployment in fields demanding precision. Without reliable outputs, acceptance falters in crucial sectors like healthcare, finance, and legal services, where even minor errors can have substantial repercussions. This unreliability isn’t merely a technical hurdle; it’s a foundational challenge impacting the broader integration of AI into society. Consequently, progress is hampered until mechanisms ensuring consistent veracity are developed and implemented, preventing the erosion of trust and unlocking the full potential of AI-driven innovation.

The reliability of artificially generated content is fundamentally linked to the integrity of the data used to train the models. Errors, biases, or inconsistencies present in the training data are not simply carried forward, but often amplified during the generation process. This means that even sophisticated Large Language Models, capable of producing remarkably human-like text, can confidently disseminate misinformation if the foundational data is flawed. Consequently, meticulous data curation – including rigorous verification, cleaning, and bias mitigation – is not merely a preliminary step, but a crucial determinant of trustworthy AI outputs. Without a commitment to data veracity, the potential benefits of AI generation in fields demanding accuracy – such as healthcare, finance, or legal analysis – remain significantly compromised, as the system’s outputs are only as reliable as the information it has learned.

Accuracy on incorrect answers and acceptance of AI-generated answers differed significantly across supporting information sources (PDF, TopK, LLM) and answer types (simple vs. synthesized); error bars show standard error.

Bolstering Trust: Giving AI a Paper Trail

Presenting users with supporting information alongside AI-generated responses demonstrably improves accuracy assessment by enabling direct verification of the AI’s claims. This strategy shifts the evaluation process from solely judging the perceived correctness of the answer to confirming its factual basis through provided source material. Studies indicate that access to supporting evidence increases user confidence in accurate responses while simultaneously improving their ability to identify inaccuracies or unsupported statements. The inclusion of source passages allows for granular evaluation, enabling users to assess not only what the AI states, but how it arrives at that conclusion based on the provided context. This contrasts with assessments of AI output in isolation, which are susceptible to subjective interpretation and reliance on prior beliefs.

Human-AI teaming, in the context of information provision, capitalizes on complementary capabilities. AI excels at rapidly processing large datasets to generate potential answers, while humans demonstrate strengths in nuanced judgment, contextual understanding, and error detection. By presenting AI-generated responses with supporting information – the source passages used for generation – the system facilitates human review and validation. This collaborative approach allows users to assess the AI’s reasoning, identify potential inaccuracies or biases, and ultimately build greater trust in the system’s output. The human component acts as a critical filter and verifier, enhancing the overall reliability and usefulness of the AI-driven information retrieval process.

BM25, a ranking function used in information retrieval, is commonly employed to identify passages relevant to contextualizing AI-generated responses. It assesses document relevance based on term frequency (TF), inverse document frequency (IDF), and document length normalization. The BM25 formula calculates a score reflecting the importance of query terms within a document, prioritizing passages in which query terms occur frequently yet are rare across the collection as a whole. Parameters [latex]k_1[/latex] and [latex]b[/latex] control term frequency saturation and document length normalization respectively, allowing retrieval performance to be tuned. This technique enables systems to provide users with source material supporting AI outputs, improving transparency and facilitating accuracy assessment.
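To make the mechanics concrete, here is a minimal sketch of Okapi BM25 scoring in Python, using the common non-negative IDF variant. The tokenization, default parameter values, and function names are illustrative assumptions, not the study's actual retrieval pipeline.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len,
               k1=1.5, b=0.75):
    """Score one tokenized passage against a tokenized query with Okapi BM25."""
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in set(query_terms):
        if term not in tf:
            continue
        df = doc_freqs.get(term, 0)  # number of documents containing the term
        # Rare terms get a larger IDF weight (non-negative variant).
        idf = math.log((num_docs - df + 0.5) / (df + 0.5) + 1)
        freq = tf[term]
        # k1 saturates repeated occurrences; b normalizes for passage length.
        denom = freq + k1 * (1 - b + b * doc_len / avg_doc_len)
        score += idf * freq * (k1 + 1) / denom
    return score
```

In a retrieval pipeline of this kind, every candidate passage is scored against the question and only the highest-scoring passages are surfaced to the user as supporting evidence, which is the role the TopK condition plays in the study.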

Measuring the Impact: How Supporting Info Affects User Judgment

A user study was conducted with participants performing evaluation tasks while utilizing various supporting information tools. The study aimed to quantify the effects of these tools on both user performance – specifically, the accuracy of their assessments – and their level of trust in the system. Participants were observed and data was collected regarding their interaction with the tools, completion times, and the rationale behind their decisions. These data were used to determine whether the supporting information improved the quality and efficiency of user evaluations, and to what extent users relied on the provided information when forming their judgments.

The evaluation of user interaction incorporated three primary metrics: assessment accuracy, reliance on AI outputs, and acceptance rate of AI-generated answers. Assessment accuracy was determined by comparing user evaluations – made with and without supporting information tools – against a pre-defined ground truth. Reliance on AI outputs was quantified by measuring the degree to which users incorporated AI-provided information into their final assessments. The acceptance rate of AI-generated answers specifically tracked the frequency with which users directly adopted AI-provided answers without modification, serving as an indicator of perceived AI credibility and usability.
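To make these three metrics concrete, the following is a minimal sketch of how they could be computed from per-trial records. The field names (user_correct, followed_ai, accepted_verbatim) are hypothetical and do not reflect the study's actual data schema.

```python
from dataclasses import dataclass

@dataclass
class Trial:
    user_correct: bool       # user's final judgment matched the ground truth
    followed_ai: bool        # user's judgment agreed with the AI's output
    accepted_verbatim: bool  # user adopted the AI answer without modification

def summarize(trials: list[Trial]) -> dict[str, float]:
    """Compute assessment accuracy, reliance on AI, and acceptance rate."""
    n = len(trials)
    return {
        "assessment_accuracy": sum(t.user_correct for t in trials) / n,
        "reliance_on_ai":      sum(t.followed_ai for t in trials) / n,
        "acceptance_rate":     sum(t.accepted_verbatim for t in trials) / n,
    }
```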

Cognitive workload measurement is critical in assessing the utility of supporting information tools, as simply providing more data doesn’t guarantee improved user performance. Techniques such as NASA-TLX questionnaires, pupil dilation tracking, and measurement of response times are employed to quantify the mental demand placed on users during evaluation tasks. A decrease in measured cognitive workload – indicating reduced mental effort – alongside improved accuracy and acceptance rates suggests the supporting information is genuinely beneficial. Conversely, an increase in workload, even with maintained or slightly improved accuracy, indicates the added information may be creating additional cognitive burden and hindering overall usability, requiring refinement of the information presentation or tool design.
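As one concrete example of such a measure, the unweighted ("Raw TLX") score is simply the mean of the six NASA-TLX subscale ratings. The sketch below assumes ratings on a 0–100 scale and omits the weighted pairwise-comparison variant of the questionnaire; it is not a description of the study's exact scoring procedure.

```python
TLX_SUBSCALES = ("mental_demand", "physical_demand", "temporal_demand",
                 "performance", "effort", "frustration")

def raw_tlx(ratings: dict[str, float]) -> float:
    """Unweighted NASA-TLX: the mean of the six subscale ratings (0-100)."""
    return sum(ratings[s] for s in TLX_SUBSCALES) / len(TLX_SUBSCALES)

# Example: a participant reporting moderate demand on every subscale.
print(raw_tlx({s: 55.0 for s in TLX_SUBSCALES}))  # -> 55.0
```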

Accuracy on incorrect answers and acceptance of AI-generated answers differed significantly across supporting information sources (PDF, TopK, LLM) and answer types (simple vs. synthesized); error bars show standard error.

The Bottom Line: Why AI Needs a Chain of Custody for Its Claims

The capacity of supporting information to bolster assessment accuracy is demonstrably linked to its effective presentation; simply providing additional data is insufficient. Research indicates that while supplemental information can improve judgments, its benefits are contingent on avoiding cognitive overload for the user. A successful system balances comprehensiveness with conciseness, ensuring the information is readily digestible and doesn’t hinder, but rather enhances, the evaluation process. This delicate balance is crucial, as poorly presented supporting material can increase cognitive workload and ultimately diminish a user’s ability to accurately assess the provided information, underscoring the need for carefully designed interfaces and information delivery methods.

The provision of supporting information demonstrably increases user confidence in their assessments, ultimately fostering more informed decision-making processes. Research indicates that when individuals are presented with evidence alongside an answer or conclusion, they exhibit a greater belief in the validity of that information, even when the supporting material isn’t perfect. This heightened confidence doesn’t necessarily equate to blind acceptance; rather, it suggests users feel better equipped to evaluate the information and integrate it into their existing knowledge. However, the type of supporting information is critical; while it boosts confidence overall, poorly designed or overly complex support – like cumbersome PDF documents – can also increase cognitive load. Successfully implemented support, such as efficiently retrieved passages or concise explanations, enables users to approach synthesized answers with greater assurance and a clearer understanding of the reasoning behind them, leading to more robust and reliable outcomes.

The study demonstrates a clear divergence in how different support mechanisms affect user interaction with AI-generated answers. Utilizing BM25 passage retrieval – a technique focused on identifying relevant document snippets – demonstrably improved user efficiency in evaluating answers without compromising judgment accuracy. Conversely, while large language model (LLM)-generated explanations also boosted processing speed, this came at a cost: users exhibited a tendency to overtrust the AI’s reasoning, leading to diminished error detection capabilities. This suggests that simply accelerating information delivery isn’t enough; the nature of the support is critical, as explanations that prioritize fluency over factual correctness can foster inappropriate reliance and hinder critical assessment of synthesized answers.

Analysis of user interaction revealed a significant disparity in processing speeds depending on the format of supporting information provided alongside synthesized answers. Specifically, participants required considerably more time to evaluate responses when presented with PDF-based documentation compared to those receiving information retrieved via the TopK method or generated by a Large Language Model; these differences reached statistical significance (p = .003 for TopK, p < .001 for LLM). This suggests that while PDFs offer comprehensive source material, their format introduces a cognitive bottleneck, slowing down the assessment process and potentially hindering efficient decision-making in scenarios where rapid evaluation is critical. The study highlights the importance of information delivery mechanisms, indicating that readily accessible and concisely presented support, such as that provided by TopK and LLM approaches, can substantially improve user efficiency without necessarily sacrificing accuracy.
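For readers who want to see what such a pairwise comparison looks like in code, the sketch below runs a Mann-Whitney U test on two small samples of per-item evaluation times. The input values are purely illustrative, and the study's actual statistical test and any correction for multiple comparisons are not specified here, so this is an assumption-laden example rather than a reproduction of the analysis.

```python
from scipy.stats import mannwhitneyu

pdf_times  = [42.1, 55.3, 61.0, 48.7]   # seconds per item, illustrative only
topk_times = [30.2, 27.8, 35.5, 29.1]   # seconds per item, illustrative only

# Two-sided test of whether evaluation times differ between conditions.
stat, p_value = mannwhitneyu(pdf_times, topk_times, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.3f}")  # p < .05 -> significant difference
```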

Research indicates a concerning tendency for users to readily accept answers generated by artificial intelligence, particularly when those answers are accompanied by explanations from large language models. While LLM-generated explanations significantly increased acceptance rates – exceeding those provided by both PDF documents and top-ranked passage retrieval (TopK) methods – this acceptance occurred despite demonstrably lower accuracy on incorrect answers. This suggests that the persuasive nature of LLM-generated text can override critical evaluation, leading to an inappropriate reliance on AI and a reduced capacity to detect errors. The findings highlight a crucial need for systems to not only provide explanations but also to actively encourage users to verify information, especially when dealing with synthesized answers drawn from multiple sources, to mitigate the risk of blindly accepting inaccurate AI outputs.

Research indicates that while providing supporting information enhances decision-making, the format significantly impacts cognitive strain on the user. Specifically, presenting information via PDF documents resulted in substantially higher cognitive workload scores when compared to both TopK retrieval and Large Language Model (LLM)-generated explanations – differences that reached statistical significance (p = .002 and p = .001, respectively). This suggests that the format of supporting documentation – potentially due to factors like visual complexity, difficulty in navigating lengthy documents, or the effort required to synthesize information from a static source – can actively hinder efficient processing, even if the information itself is accurate. These findings underscore the importance of carefully considering the user experience when designing robust AI systems and highlight the potential benefits of streamlined, easily digestible support materials over traditional document-based approaches.

The need for robust verification processes becomes paramount when artificial intelligence systems generate ‘synthesized answers’ – responses compiled from multiple information sources. Unlike retrieving a single, direct answer, synthesis demands the AI not only locate relevant data but also integrate it cohesively, potentially introducing errors or misinterpretations in the process. Consequently, providing users with access to the supporting evidence used in synthesis isn’t merely helpful, but essential for fostering trust and ensuring accuracy; it allows for independent validation of the AI’s reasoning and identification of any inconsistencies or inaccuracies that may arise during information combination. Without such verification, users risk accepting potentially flawed conclusions, particularly as reliance on AI-generated content increases and the underlying sources remain opaque.

The pursuit of perfect veracity assessment, as demonstrated by this research into supporting information tools, inevitably reveals the limitations of even the most sophisticated systems. It seems a constant truth that elegant theories – in this case, leveraging LLM explanations – quickly encounter the messy reality of production use. The study highlights how LLM explanations, while improving efficiency, can foster inappropriate reliance, echoing a familiar pattern. As Carl Friedrich Gauss observed, “If I speak for my own benefit, I may be concise; but for the benefit of others, I must be plain.” This research confirms that simply presenting an answer, even with explanation, doesn’t guarantee calibrated trust – clarity and demonstrable support remain crucial, lest users accept AI-generated answers without sufficient skepticism. The promise of automated dataset verification is alluring, but this work serves as a potent reminder that ‘MVP’ often translates to ‘we’ll address the trust issues later.’

What’s Next?

The predictable march continues. This work confirms what production systems have long whispered: efficiency gains invariably trade off against unforeseen reliance. Offering explanations – even those generated by impressively scaled language models – doesn’t magically instill calibrated trust. It simply provides a more compelling narrative for error. The real problem isn’t whether an LLM can justify a wrong answer, but that humans will accept the attempt as justification enough. Legacy systems had blatant errors; at least those were easily dismissed.

Future effort will likely focus on ‘trust calibration’ interfaces – meters and flags attempting to quantify AI confidence. These will be built, deployed, and then painstakingly bypassed by anyone facing a deadline. A more interesting, though far less fundable, path lies in understanding why humans overtrust, even when presented with uncertainty. The tools aren’t the issue; the issue is a persistent, and likely immutable, cognitive bias.

Ultimately, the goal shouldn’t be to eliminate inappropriate reliance, but to design for graceful degradation. Expecting perfect trust assessment is a category error. Instead, systems should be built to minimize the impact of those inevitable moments when a compelling falsehood is accepted as truth. Consider it less about belief, and more about damage control.


Original article: https://arxiv.org/pdf/2603.11393.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
