Can AI Truly Understand Scientific Tables?

Author: Denis Avetisyan


A new benchmark reveals that current AI systems struggle not with planning how to answer questions about scientific data, but with accurately executing the necessary computations.

The system assesses model performance across languages by first isolating the relevant data – specifically, rows corresponding to Qwen2-Audio – then calculating the average accuracy for each language and pinpointing the most challenging one; this process is repeated for every model in the dataset to reveal comparative linguistic weaknesses.
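The per-model procedure described in the caption can be sketched in plain Python; the records and accuracy scores below are invented for illustration, not taken from the benchmark:

```python
from collections import defaultdict

# Invented per-(model, language) accuracy records; a real evaluation would
# load these from the benchmark's result files.
records = [
    ("Qwen2-Audio", "en", 0.91), ("Qwen2-Audio", "de", 0.84),
    ("Qwen2-Audio", "th", 0.62), ("Qwen2-Audio", "th", 0.60),
    ("Other-LM",    "en", 0.88), ("Other-LM",    "de", 0.71),
]

def hardest_language(records, model):
    """Mean accuracy per language for one model; return the weakest language."""
    by_lang = defaultdict(list)
    for m, lang, acc in records:
        if m == model:                      # isolate this model's rows
            by_lang[lang].append(acc)
    means = {lang: sum(v) / len(v) for lang, v in by_lang.items()}
    return min(means, key=means.get)        # language with the lowest mean

# Repeat for every model in the dataset.
weakest = {m: hardest_language(records, m) for m in {m for m, _, _ in records}}
```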

SciTaRC, a challenging question-answering dataset for scientific tables, highlights execution fidelity as the primary limitation for large language models in complex reasoning tasks.

Despite advances in artificial intelligence, reliably answering questions requiring both language understanding and complex computation over scientific tabular data remains a significant challenge. This is addressed in ‘SciTaRC: Benchmarking QA on Scientific Tabular Data that Requires Language Reasoning and Complex Computation’, which introduces a new benchmark revealing substantial failure rates – even for state-of-the-art models like Llama-3.3-70B-Instruct, which falters on 65.5% of tasks – stemming not from flawed planning, but from an inability to faithfully execute reasoning steps. Our analysis highlights a universal “execution bottleneck” affecting both code- and language-based approaches, with brittle performance on raw tables and errors in comprehension and calculation, respectively. Can future neuro-symbolic systems overcome these execution limitations to unlock the full potential of scientific data analysis?


The Illusion of Understanding: Beyond Pattern Matching

Though contemporary Large Language Models demonstrate impressive abilities in retrieving and presenting factual information, effectively answering complex scientific questions necessitates capabilities beyond simple pattern recognition. These models often succeed by identifying correlations within vast datasets, but struggle when confronted with problems requiring genuine understanding of underlying principles, logical deduction, or the application of scientific reasoning. A system capable of merely recalling previously encountered information will falter when tasked with synthesizing new knowledge or extrapolating from existing data-a critical skill in scientific inquiry. Consequently, achieving true competency in scientific question answering demands a shift from statistical association to genuine comprehension, pushing the boundaries of current artificial intelligence methodologies.

Conventional approaches to scientific question answering often falter when confronted with the need for accurate calculation and sequential logic applied to structured data formats, such as the tables prevalent in scientific literature. These methods frequently rely on pattern recognition within text, proving inadequate when a problem necessitates extracting numerical values, performing operations – like unit conversions or averaging – and then integrating those results across multiple rows or columns. The inherent complexity of deciphering table structure, understanding data relationships, and executing precise computations presents a significant hurdle; a system might correctly identify relevant information but fail to arrive at the correct answer due to errors in the processing of quantitative data. This limitation underscores the necessity for advanced techniques capable of bridging the gap between linguistic comprehension and rigorous numerical reasoning, allowing systems to not only find information but also process it accurately.
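The extraction-conversion-aggregation chain described above is trivial once made explicit, which is exactly what pattern-matching systems fail to do reliably. A minimal sketch, using an invented table with mixed units:

```python
# A toy table as it might appear in a paper: latencies reported in mixed
# units that must be normalized before they can be averaged.
table = [
    {"condition": "A", "latency": "120 ms"},
    {"condition": "B", "latency": "0.3 s"},
    {"condition": "C", "latency": "180 ms"},
]

def to_ms(value: str) -> float:
    """Parse a '<number> <unit>' cell and convert it to milliseconds."""
    number, unit = value.split()
    factor = {"ms": 1.0, "s": 1000.0}[unit]
    return float(number) * factor

# Extract each value, convert units, then aggregate across rows.
mean_latency_ms = sum(to_ms(row["latency"]) for row in table) / len(table)
```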

Current artificial intelligence systems, despite advancements in processing language, still struggle with the nuanced demands of scientific reasoning, particularly when it requires combining textual comprehension with numerical analysis. The SciTaRC benchmark, a challenging test designed to evaluate this integrated skill, reveals a significant performance gap; even the most sophisticated models currently achieve only 76.8% accuracy. This result underscores a critical need for novel methodologies that move beyond simple pattern recognition and instead facilitate a true synthesis of linguistic understanding and precise computational abilities – enabling machines to not only ‘read’ scientific text, but also to rigorously interpret and manipulate the quantitative data it contains.

Despite a general performance decline with increasing context length, leading models demonstrate improved reasoning across multiple tables, suggesting an ability to effectively link and utilize information from diverse sources as indicated by sample sizes on the x-axis.

SciTaRC: A Controlled Collapse of Expectation

SciTaRC is a dataset comprised of 800 question-answer pairs derived from scientific tables sourced from published research papers. The dataset is specifically designed to assess a model’s ability to perform multi-hop reasoning over tabular data, requiring the integration of information from both the textual question and the numerical and textual content of the tables. Unlike traditional question answering datasets focused on unstructured text, SciTaRC emphasizes the challenges presented by structured data, aiming to identify limitations in current models’ capacity to handle scientific information presented in a tabular format and drive development towards more robust reasoning capabilities. The tables cover a range of scientific domains, including biology, chemistry, and materials science, with questions requiring calculations, comparisons, and the application of domain-specific knowledge.

SciTaRC questions necessitate the integration of natural language understanding with numerical reasoning skills. Each question requires models to first interpret the semantics of the query, then locate relevant data within the provided table, and finally perform calculations – including addition, subtraction, multiplication, division, and comparisons – on the table’s numerical values to arrive at the correct answer. This differs from traditional question answering which often focuses solely on textual retrieval or simple fact extraction; SciTaRC specifically targets the ability to process and manipulate quantitative information presented in a structured tabular format. The complexity of these calculations varies, demanding models capable of handling multi-step reasoning and potentially requiring the application of summation (Σ) or averaging operations.
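The interpret-locate-compute sequence can be sketched in a few lines; the table and question below are invented for illustration, not actual SciTaRC items:

```python
# Invented table and question: "Which method gains the most accuracy
# over the Baseline?"
table = [
    {"method": "Baseline", "accuracy": 71.2},
    {"method": "CoT",      "accuracy": 74.8},
    {"method": "PoT",      "accuracy": 76.1},
]

def best_gain(table):
    # Step 1: locate the reference row named in the question.
    baseline = next(r["accuracy"] for r in table if r["method"] == "Baseline")
    # Step 2: compute each candidate's gain (subtraction over table values).
    gains = {r["method"]: r["accuracy"] - baseline
             for r in table if r["method"] != "Baseline"}
    # Step 3: compare the computed values and select the answer.
    return max(gains, key=gains.get)
```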

SciTaRC employs metrics of Reasoning Complexity and Input Complexity to provide a nuanced evaluation of question answering models on scientific tabular data. Reasoning Complexity assesses the number of steps and types of calculations required to arrive at the answer, while Input Complexity quantifies the amount of information – including table size and the number of referenced cells – needed to process the question. Analysis using these metrics indicates that state-of-the-art models achieve an overall accuracy of 76.8% on the benchmark, though performance varies significantly based on both complexity scores, suggesting areas for targeted model improvement and a need for more robust reasoning capabilities.
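The two axes could be operationalized along these lines; the step weights and field names are illustrative assumptions, not SciTaRC's actual scoring scheme:

```python
# Hypothetical scoring of the two complexity axes described above.
def reasoning_complexity(steps):
    """Count of reasoning steps, weighted by operation type (assumed weights)."""
    weights = {"lookup": 1, "arithmetic": 2, "aggregate": 3}
    return sum(weights[s] for s in steps)

def input_complexity(n_rows, n_cols, n_referenced_cells):
    """Table size plus the number of cells the question actually touches."""
    return n_rows * n_cols + n_referenced_cells

# Example: a question needing two lookups and one average over a 10x4 table.
rc = reasoning_complexity(["lookup", "lookup", "aggregate"])
ic = input_complexity(10, 4, 3)
```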

Model performance varies with question difficulty, as demonstrated by the agreement matrix where rows are sorted by ease of answering and columns by model accuracy, with blue indicating correct responses and a gray band highlighting questions no model solved.

The Anatomy of Failure: Where the System Breaks Down

Model performance analysis consistently demonstrates errors in the localization phase of scientific question answering. Localization errors occur when a model fails to correctly pinpoint the relevant data within a provided table to answer a given question. These errors are not necessarily failures in calculation or comprehension; the model may understand the question and the necessary operation, but incorrectly identifies where to find the required values in the table structure. This misidentification can manifest as selecting the wrong row, column, or cell, leading to an incorrect answer despite potentially accurate reasoning. The frequency of localization errors suggests a need for improved techniques in table understanding and data retrieval, particularly in handling complex table layouts and ambiguous cell references.

Calculation errors in scientific question answering systems manifest as incorrect numeric results despite accurate data localization and question comprehension. These errors are not limited to simple arithmetic; they frequently involve multi-step calculations, unit conversions, and the application of ratio formulas of the form [latex]\frac{\text{numerator}}{\text{denominator}}[/latex] to retrieved values. Analysis indicates that current models struggle with maintaining precision throughout complex calculations and often fail to correctly propagate units, leading to dimensionally inconsistent or logically flawed answers. Addressing this requires advancements in numerical reasoning capabilities, potentially through the integration of symbolic computation techniques or dedicated numerical reasoning modules within the model architecture.
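One mitigation hinted at above, exact symbolic arithmetic, can be sketched with Python's standard-library `fractions`, which keeps a multi-step chain free of floating-point drift; the values are illustrative:

```python
from fractions import Fraction

# The same mean computed two ways: floating point accumulates rounding
# error over the chain, while exact rationals preserve fidelity.
values = ["0.1", "0.2", "0.3"]

float_mean = sum(float(v) for v in values) / len(values)
exact_mean = sum(Fraction(v) for v in values) / len(values)

# float_mean has drifted away from the exact value 1/5;
# exact_mean is exactly Fraction(1, 5).
```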

Comprehension errors in scientific question answering systems manifest as misinterpretations of either the user’s query or the semantic meaning of data presented in tables. These errors are not typically due to failures in information retrieval or numerical computation, but rather a lack of understanding regarding what the question is asking or how table attributes relate to the requested information. Specifically, models may incorrectly identify the relevant columns or rows, or fail to recognize the relationships between different attributes, leading to inaccurate or irrelevant answers. Analysis of these errors reveals difficulties in tasks requiring nuanced understanding of scientific terminology and the contextual meaning of table headers and data values.

Performance decreases consistently with increasing algorithmic complexity across metrics of calculation intensity [latex]I_{calc}[/latex], retrieval demand [latex]I_{retr}[/latex], plan horizon [latex]L_{plan}[/latex], and control flow [latex]C_{flow}[/latex], with trends at the extremes tempered by sample size (N).

Benchmarking the Illusion: Measuring Degrees of Failure

The SciTaRC benchmark was utilized to assess the performance of several large language models, including GPT-5, Grok-4.1, DeepSeek-V3.2, Llama-3.3, and Qwen3. SciTaRC specifically tests a model’s ability to answer complex, multi-hop questions requiring reasoning over scientific tables, focusing on comprehension and inference within the domain of scientific knowledge. Evaluation on this benchmark provides a standardized method for comparing the reasoning capabilities of these models in a scientific context, allowing for quantitative assessment of their performance on challenging question-answering tasks.

The LLM as Judge methodology employs a large language model to assess the correctness of responses generated by other language models, moving beyond simple accuracy metrics to provide a more detailed evaluation. This approach involves prompting the judging LLM with both the original question and the candidate answer, then instructing it to determine the answer’s correctness, potentially identifying nuanced errors or incomplete reasoning. Unlike traditional evaluation methods relying on predefined ground truth datasets, LLM as Judge can assess complex, open-ended questions where multiple valid answers may exist, and offer justifications for its evaluations. This allows for a more granular understanding of model performance, highlighting not only whether an answer is correct, but how and why it is correct or incorrect.
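A minimal sketch of the prompt-and-parse half of LLM-as-Judge; the template wording and verdict format are assumptions, and the judge's reply is canned rather than sampled from a real model:

```python
# Hypothetical judge prompt; a real system would send this to an LLM API.
JUDGE_TEMPLATE = """You are grading an answer to a question about a scientific table.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Reply with exactly VERDICT: CORRECT or VERDICT: INCORRECT on the first line,
then one line of justification."""

def build_judge_prompt(question, reference, candidate):
    return JUDGE_TEMPLATE.format(
        question=question, reference=reference, candidate=candidate)

def parse_verdict(judge_reply: str) -> bool:
    """Extract the binary verdict from the judge's free-form reply."""
    first_line = judge_reply.strip().splitlines()[0]
    return first_line.strip().upper() == "VERDICT: CORRECT"

# Canned reply standing in for the judging model's output:
reply = "VERDICT: CORRECT\nThe candidate matches the reference within rounding."
```

The justification line is what distinguishes this from plain accuracy scoring: it records why an answer was judged right or wrong, not just whether.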

Performance on the SciTaRC benchmark demonstrates quantifiable differences between large language models. Specifically, Kimi-K2-Thinking achieved a 24% improvement in accuracy compared to the baseline model, indicating a substantial gain in its ability to address the benchmark’s challenges. Furthermore, DeepSeek-R1-Distill outperformed the Llama-3.3-70B model by 10.5%, suggesting an advantage in processing and responding to the SciTaRC dataset. These results provide concrete data points for comparative analysis, highlighting specific areas where certain models excel over others in scientific and technical reasoning tasks.

Planning consistently improves performance for code models across task difficulty, while generalist models exhibit regression on easier tasks [latex]\text{(red shading)}[/latex] but gain on previously unsolved ones [latex]\text{(green shading)}[/latex].

The Inevitable Future: Growing, Not Building

Recent advancements in artificial intelligence highlight the efficacy of prompting models to articulate their reasoning processes, moving beyond simple input-output mappings. Techniques such as Chain-of-Thought and Program-of-Thought encourage large language models to decompose complex problems into a series of intermediate steps, mirroring human cognitive strategies. By explicitly generating these rationales, the models not only enhance the transparency of their decision-making but also significantly improve accuracy in tasks demanding multi-step inference. This deliberate step-by-step approach allows for error detection and correction during the reasoning process, fostering a more reliable and explainable form of artificial intelligence capable of tackling increasingly sophisticated scientific questions.
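A bare-bones Program-of-Thought harness might look like the following; the "model output" is canned rather than sampled from an LLM, and a real system would sandbox the execution:

```python
# Canned stand-in for code an LLM would generate in a Program-of-Thought
# setting; a real harness would sample this from the model.
model_output = """
growth_2022 = 150
growth_2023 = 180
answer = (growth_2023 - growth_2022) / growth_2022 * 100  # percent change
"""

def execute_program_of_thought(code: str):
    """Run the generated program and return its `answer` variable."""
    namespace = {}
    exec(code, namespace)   # a production harness would sandbox this call
    return namespace["answer"]

result = execute_program_of_thought(model_output)
```

Offloading the arithmetic to an interpreter is precisely what makes this family of techniques attractive: the model plans, the runtime executes.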

The consistent application of intermediate results remains a significant hurdle in achieving robust reasoning capabilities in large language models, particularly when tackling complex, multi-step problems. Current models frequently exhibit ‘memory errors’ – failures to accurately recall and utilize previously computed information, even within the same reasoning sequence. This isn’t simply a matter of forgetting; rather, the model’s internal representations can become corrupted or misapplied as the reasoning chain lengthens, leading to demonstrably incorrect conclusions despite seemingly logical initial steps. Addressing this requires innovations in architectural design and training methodologies, focusing on mechanisms that strengthen the persistence and reliable retrieval of intermediate states, and ensure that information remains consistently accessible throughout the entire problem-solving process. Without mitigating these memory failures, the potential of step-by-step reasoning techniques like Chain-of-Thought will remain limited in domains demanding precise and sustained computational accuracy.
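One way to make intermediate state explicit rather than implicit is an external scratchpad in which each step declares the earlier results it depends on, so a missing or corrupted intermediate fails loudly instead of propagating silently. This is a sketch of the idea, not a description of any model's internals:

```python
# Each step names its result, the function that computes it, and the
# earlier results it reads; a missing dependency raises KeyError at once.
def run_steps(steps):
    scratchpad = {}
    for name, fn, deps in steps:
        args = [scratchpad[d] for d in deps]  # fail loudly on a missing intermediate
        scratchpad[name] = fn(*args)
    return scratchpad

steps = [
    ("total",   lambda: 120 + 300 + 180, []),
    ("count",   lambda: 3,               []),
    ("average", lambda t, c: t / c,      ["total", "count"]),
]
results = run_steps(steps)
```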

The pursuit of genuinely robust scientific question answering increasingly centers on hybrid reasoning approaches, integrating the strengths of neural networks with the precision of symbolic methods. Recent studies demonstrate the potential of this synergy, though outcomes vary significantly between models; DeepSeek-V3.2, for example, achieved notable accuracy gains – 1.9% with autonomous planning and a more substantial 6.2% when guided by oracle planning. Conversely, Qwen3-30B exhibited a 5.4% regression when employing autonomous planning, highlighting the sensitivity of these systems to architectural design and training methodologies. These findings suggest that combining the pattern recognition capabilities of neural networks with the logical rigor of symbolic reasoning is a viable path forward, but also underscores the need for continued research to optimize these hybrid systems and address inconsistencies in performance across different model types.


The pursuit of automated reasoning over scientific tables, as evidenced by SciTaRC, reveals a familiar truth: systems are not built, they accrue. The benchmark highlights execution fidelity as the critical failing, not the initial planning stages. This echoes a sentiment expressed by Robert Tarjan: “The most effective programs are the ones that don’t run.” He didn’t speak of halting problems, but of the inherent limitations of translating intent into reliable action. The benchmark isn’t a measure of intelligence, but a mapping of where the inevitable compromises – the frozen architecture – begin to crack under the weight of complexity. Models may generate pseudo-code, but the execution, the actual doing, remains the unpredictable element. The table isn’t conquered, merely charted – a temporary respite before the next failure manifests.

What Lies Ahead?

The SciTaRC benchmark doesn’t simply reveal limitations in current systems; it articulates a fundamental truth about intelligence applied to structured data. The observed bottleneck isn’t a failure of ‘thinking’ – of generating plans, however sophisticated – but a crisis of execution. Models can propose a path, yet consistently stumble on the steps themselves. Monitoring is the art of fearing consciously, and the benchmark highlights precisely what must be feared: the fragility of fidelity. Each successful computation isn’t a victory, but a temporary reprieve from inevitable error.

Future work will undoubtedly explore increasingly elaborate planning mechanisms. However, the true challenge isn’t to build more complex architectures, but to cultivate systems that anticipate their own failures. Resilience begins where certainty ends. Perhaps the focus should shift from generating pseudo-code to generating diagnostics – mechanisms for self-assessment and error correction built into the very fabric of computation.

This isn’t about creating ‘bug-free’ systems – that’s a childish fantasy. It’s about accepting that every system is a prophecy of future failure, and designing for graceful degradation. The value lies not in preventing errors, but in understanding them, and building systems that reveal, rather than conceal, their own limitations.


Original article: https://arxiv.org/pdf/2603.08910.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
