Author: Denis Avetisyan
A new study rigorously tests the problem-solving abilities of artificial intelligence on challenging, algebra-based physics questions.

This research presents a comparative evaluation of large language models’ performance on AP Physics free-response problems, highlighting strengths in algebra and weaknesses in visual and spatial reasoning.
Despite recent advances in artificial intelligence, reliably replicating human-level problem-solving in complex, qualitative scientific domains remains challenging. This is investigated in ‘How Well Do AI Systems Solve AP Physics? A Comparative Evaluation of Large Language Models on Algebra-Based Free Response Questions’, which comparatively assesses the performance of leading large language models (ChatGPT, Gemini, Claude, and DeepSeek) on algebra-based AP Physics free-response questions. While the models demonstrated strong capabilities in structured algebraic manipulation, achieving mean scores of 82-92%, their consistency varied considerably, and all exhibited recurring errors in areas demanding spatial reasoning, visual interpretation, and conceptual integration, such as vector analysis and circuit diagrams. These findings raise a critical question: can AI truly serve as an effective pedagogical tool in physics without addressing these fundamental limitations in qualitative and multimodal understanding?
The Limits of Correlation: Why LLMs Struggle with Physics
Despite remarkable progress in natural language processing, Large Language Models (LLMs) consistently demonstrate limitations when confronted with complex reasoning challenges, notably those demanding both quantitative calculation and spatial awareness. While proficient at identifying patterns and recalling information from vast datasets, these models struggle to apply foundational principles to novel situations requiring more than simple memorization. This difficulty isn’t merely a matter of lacking specific knowledge; LLMs often fail to correctly interpret physical relationships, manipulate numerical data to arrive at accurate solutions, or visualize scenarios in a way that aligns with real-world physics. The core issue lies in their architecture, which prioritizes statistical correlations over genuine understanding of causal mechanisms, hindering their ability to extrapolate beyond the training data and reliably solve problems requiring deeper cognitive processing.
Recent evaluations of Large Language Models (LLMs) utilizing the standardized format of Advanced Placement (AP) Physics exams demonstrate a surprising limitation in their capacity for genuine physics problem-solving. While achieving seemingly high mean scores – ranging from 82 to 92 percent on both AP Physics 1 and 2 – these results belie a struggle with the application of fundamental principles. The tests reveal that LLMs, despite their proficiency in processing language, often fail to consistently translate textual descriptions into accurate physical models and quantitative solutions. This performance suggests that current models excel at pattern recognition and information retrieval, but lack the deeper conceptual understanding required to reliably navigate the complexities of physics, indicating a need for advancements in their reasoning capabilities beyond simple statistical correlations.
AP Physics exams utilize a free-response format that necessitates more than simply arriving at the correct numerical answer; students must articulate their reasoning process, demonstrating a clear understanding of underlying physical principles and justifying each step taken to solve a problem. Current Large Language Models, while capable of performing calculations, frequently stumble at this crucial explanatory stage. They often provide answers without demonstrating how those answers were derived, lacking the capacity to synthesize a coherent and logically sound argument – a deficiency that exposes a fundamental limitation in their ability to truly ‘understand’ and apply physics concepts, rather than simply pattern-match within a dataset. This inability to provide comprehensive justifications underscores the need for advancements in LLM architectures to prioritize not just answer generation, but also transparent and logically consistent reasoning.
The observed limitations in Large Language Model performance on physics problem-solving necessitate a shift towards more discerning evaluation methodologies. Current metrics often prioritize final answers, overlooking the crucial reasoning process required to arrive at a solution, a capability demonstrably lacking in these models. Consequently, the field requires benchmarks that specifically assess the clarity, completeness, and logical consistency of explanations, rather than simply verifying correctness. This demand extends beyond metric development to encompass targeted improvements in LLM architecture and training data, focusing on instilling a deeper understanding of fundamental principles and fostering the ability to translate those principles into coherent, step-by-step reasoning, ultimately bridging the gap between pattern recognition and genuine scientific understanding.
![Model rankings exhibit low stability across exam years for AP Physics 1, indicated by frequent rank reversals and a concordance of [latex]W=0.182[/latex], but demonstrate greater consistency in AP Physics 2 ([latex]W=0.532[/latex]), where Gemini and DeepSeek consistently outperform ChatGPT.](https://arxiv.org/html/2603.07457v1/model_rankings_over_time.png)
Deconstructing Physical Intuition: Core Reasoning Skills
Quantitative reasoning is fundamental to success on AP Physics assessments, requiring the consistent application of mathematical principles to derive numerical solutions. These problems necessitate the accurate identification of relevant variables, the selection of appropriate physical formulas – such as [latex]F = ma[/latex] for Newton’s Second Law or [latex]E = mc^2[/latex] for mass-energy equivalence – and the precise execution of algebraic manipulations. Students are expected to not only arrive at a correct numerical answer, but also to demonstrate proficiency in unit conversions, significant figures, and the proper handling of vectors and scalars within these calculations. The ability to translate word problems into mathematical expressions and to interpret the physical meaning of the resulting values is a key indicator of quantitative reasoning skill and a major component of exam scoring.
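To make that chain of steps concrete, the short Python sketch below works one such calculation end to end: convert the givens to SI units, apply [latex]F = ma[/latex], and report the result to a sensible number of significant figures. The numbers are invented for illustration, not drawn from any exam.

```python
# Worked example of the quantitative-reasoning chain an AP-style problem
# expects: identify variables -> convert units -> apply the formula ->
# round sensibly. All values here are invented for illustration.

mass_g = 2500.0         # given mass in grams
accel_cm_s2 = 980.0     # given acceleration in cm/s^2

# Step 1: convert to SI base units before applying any formula.
mass_kg = mass_g / 1000.0            # 2.50 kg
accel_m_s2 = accel_cm_s2 / 100.0     # 9.80 m/s^2

# Step 2: apply Newton's second law, F = m * a.
force_N = mass_kg * accel_m_s2

# Step 3: report to three significant figures, matching the input data.
print(f"F = {force_N:.3g} N")  # F = 24.5 N
```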
Vector reasoning is fundamental to solving numerous physics problems because physical quantities such as force, velocity, and displacement are often directional and therefore best represented by vectors. These quantities are not fully described by magnitude alone; their direction is equally important. Problems involving these quantities necessitate the application of vector addition, subtraction, and decomposition techniques to determine resultant vectors and their components. For example, calculating the net force on an object requires vector summation of all applied forces, considering both magnitude and direction. Similarly, determining an object’s resultant velocity after multiple velocity vectors are applied demands a similar approach. Understanding vector components – typically expressed in terms of [latex]x[/latex] and [latex]y[/latex] coordinates – is crucial for analyzing motion in two or three dimensions and resolving forces into manageable components for calculation.
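The decomposition step is easy to show explicitly. The sketch below resolves two forces into [latex]x[/latex] and [latex]y[/latex] components, sums them, and recovers the magnitude and direction of the resultant; the magnitudes and angles are illustrative.

```python
import math

# Resolve each force into x/y components, sum them, then recover the
# magnitude and direction of the resultant. Angles are measured from +x;
# the specific values are illustrative, not from any exam problem.
forces = [
    (10.0, 30.0),   # (magnitude in N, angle in degrees)
    (6.0, 120.0),
]

fx = sum(m * math.cos(math.radians(a)) for m, a in forces)
fy = sum(m * math.sin(math.radians(a)) for m, a in forces)

magnitude = math.hypot(fx, fy)
direction_deg = math.degrees(math.atan2(fy, fx))

print(f"Resultant: {magnitude:.2f} N at {direction_deg:.1f} degrees")
```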
Effective problem-solving in physics, beyond arriving at a numerical answer, necessitates qualitative reasoning – the ability to articulate the underlying principles and logical steps employed. Large Language Models (LLMs) are increasingly evaluated on their capacity to not only compute a result but to also explain the reasoning process, detailing why a specific formula was chosen, how variables relate to one another, and the physical significance of the solution. This includes justifying assumptions made during the problem-solving process and identifying potential limitations or edge cases. Demonstrating this capability involves constructing a coherent narrative that connects the initial problem statement to the final answer, effectively communicating the physics concepts involved in a way that a knowledgeable user can readily follow and validate.
Large Language Models (LLMs) consistently demonstrate deficiencies in spatial reasoning and the interpretation of diagrams common in physics problem sets. This weakness manifests as an inability to accurately identify relevant geometric relationships, extract quantitative data presented visually (such as angles, lengths, and component vectors), and translate diagrammatic information into symbolic representations suitable for calculation. Specifically, LLMs often fail to correctly decompose vectors into their [latex]x[/latex] and [latex]y[/latex] components based on a provided diagram, misinterpret the direction of forces or velocities, or overlook crucial geometric constraints implied by the visual representation of the problem. These limitations hinder the LLM’s ability to formulate correct problem-solving strategies and arrive at accurate numerical solutions.

A Rigorous Assessment: Standardizing the Evaluation of LLMs
A standardized evaluation framework was implemented to rigorously assess the capabilities of Large Language Models (LLMs) in the domain of physics. This framework utilized complete, publicly available AP Physics 1 and AP Physics 2 exams as the basis for evaluation. The selection of these exams provided a consistent and well-defined set of problems covering a broad range of physics concepts at the college preparatory level. Employing full exams, rather than isolated questions, allowed for the assessment of the LLMs’ ability to integrate knowledge and solve complex, multi-step problems, mirroring real-world application of physics principles. The use of standardized exams ensured comparability of results across different LLMs and facilitated a quantitative analysis of their performance.
Rubric scoring was implemented to standardize the evaluation of Large Language Model (LLM) responses on physics problems. This methodology involved predefined scoring criteria detailing the elements required for a complete and correct answer, ranging from identifying relevant physical principles to accurate calculations and appropriate unit usage. Each response was assessed against these rubrics by trained evaluators, assigning points based on the presence and quality of each component. Utilizing a rubric-based system minimized subjective bias and ensured that all LLMs were assessed using the same criteria, enabling a fair and objective comparison of their problem-solving capabilities and facilitating quantifiable performance metrics.
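The paper does not publish its rubric data structures, but the bookkeeping is straightforward to picture. The following sketch, with hypothetical field names, models one free-response part as a list of point-bearing criteria and scores a response by summing the criteria an evaluator marks as met.

```python
from dataclasses import dataclass, field

@dataclass
class Criterion:
    # One rubric element, e.g. "applies conservation of energy" or
    # "reports the answer with correct units". Names are hypothetical.
    description: str
    points: float = 1.0

@dataclass
class RubricPart:
    label: str                               # e.g. "Q1(a)"
    criteria: list = field(default_factory=list)

    def score(self, met: list) -> float:
        """Sum points for criteria the evaluator marked as satisfied."""
        return sum(c.points for c, ok in zip(self.criteria, met) if ok)

part = RubricPart("Q1(a)", [
    Criterion("identifies the relevant physical principle"),
    Criterion("performs the calculation correctly"),
    Criterion("states the answer with appropriate units"),
])

# One evaluator's judgments for one LLM response: met, met, missed.
print(part.score([True, True, False]))  # 2.0 of 3.0 points
```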
Inter-rater reliability was established through the calculation of Intraclass Correlation Coefficients (ICC), a statistical measure of agreement among raters. Values consistently fell between 0.75 and 0.93, indicating a high degree of consistency in the application of the evaluation rubric. These ICC scores demonstrate that the rubric scoring process was not significantly affected by subjective interpretation, bolstering the validity and objectivity of the LLM performance assessment. A score of 0.75 or higher is generally accepted as demonstrating acceptable levels of rater agreement, and the observed range confirms a robust and reliable scoring methodology.
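The study does not state which ICC variant it computed, so as a minimal illustration the sketch below implements the simple one-way ICC(1,1) over a targets-by-raters score matrix, using the standard ANOVA mean-square formula; it shows concretely what an agreement value between 0.75 and 0.93 is measuring.

```python
import numpy as np

def icc_oneway(scores: np.ndarray) -> float:
    """One-way random-effects ICC(1,1) for an (n_targets, k_raters) matrix.

    The paper's exact ICC variant is not specified; this is the simplest
    form, built from one-way ANOVA mean squares.
    """
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)
    ms_between = k * ((row_means - grand) ** 2).sum() / (n - 1)
    ms_within = ((scores - row_means[:, None]) ** 2).sum() / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Three raters scoring five responses (illustrative numbers only).
ratings = np.array([
    [8, 9, 8],
    [5, 5, 6],
    [9, 9, 9],
    [3, 4, 3],
    [7, 6, 7],
], dtype=float)
print(f"ICC(1,1) = {icc_oneway(ratings):.2f}")
```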
Evaluation of large language models including ChatGPT, Claude, Gemini, and DeepSeek, using the AP Physics 2 exam, revealed statistically significant performance differences (p-value = 0.0012). Analysis indicated that Gemini and DeepSeek exhibited greater consistency in their responses. Specifically, DeepSeek demonstrated a coefficient of variation of 4.7%, a metric used to quantify the dispersion of scores and therefore the consistency of performance across different problems within the exam.
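The coefficient of variation itself is just the standard deviation expressed as a percentage of the mean, so a figure like DeepSeek's 4.7% comes from a two-line computation over the model's score set; the scores below are made up for illustration.

```python
import statistics

# Hypothetical per-problem percentage scores for one model; the study's
# 4.7% figure is computed the same way over its actual score set.
scores = [88.0, 90.5, 86.0, 91.0, 89.5]

# Sample standard deviation as a percentage of the mean.
cv = statistics.stdev(scores) / statistics.mean(scores) * 100
print(f"coefficient of variation = {cv:.1f}%")
```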
Beyond Calculation: Enhancing LLM Reasoning Through Context
Chain-of-Thought prompting represents a significant advancement in eliciting more than just answers from large language models; it encourages a demonstration of reasoning. This technique involves crafting prompts that specifically request the model to explain its thought process, breaking down a problem into intermediate steps before arriving at a final solution. Studies have shown this simple adjustment dramatically improves accuracy, particularly in complex tasks like mathematical problem-solving and common-sense reasoning. By forcing the model to verbalize its internal logic, researchers gain insight into how it arrives at conclusions, and importantly, can identify and correct flawed reasoning patterns. The resulting increase in transparency and reliability positions Chain-of-Thought prompting as a cornerstone for building more trustworthy and capable artificial intelligence systems.
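In practice, eliciting a chain of thought often amounts to restructuring the prompt. The template below is one illustrative way to do so for a physics free-response item; its wording is hypothetical, not the prompt used in the study.

```python
# A minimal chain-of-thought prompt template; the wording is illustrative,
# not the exact prompt used in the study under discussion.
COT_TEMPLATE = """You are solving an AP Physics free-response question.

Problem:
{problem}

Before giving a final answer:
1. List the known quantities and what is being asked.
2. State the physical principle(s) that apply and why.
3. Show each algebraic step with units.
4. Check the result for reasonableness (magnitude, sign, units).

Then state the final answer on its own line."""

prompt = COT_TEMPLATE.format(
    problem="A 2.0 kg cart accelerates from rest to 6.0 m/s in 3.0 s. "
            "Find the net force on the cart."
)
print(prompt)
```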
Large language models, despite their proficiency with textual data, frequently struggle with tasks demanding spatial reasoning or the interpretation of visual diagrams. While skillful at processing language, these models lack the inherent understanding of physical relationships and geometric principles that humans develop through embodied experience. Consequently, simply providing a text-based prompt, even a detailed one, often proves inadequate when the problem requires visualizing or mentally manipulating objects in space. This limitation is particularly evident in fields like physics, where diagrams and spatial arrangements are crucial for problem-solving; the models may correctly identify relevant formulas but fail to apply them correctly due to an inability to accurately ‘see’ the problem depicted visually. This suggests that a purely linguistic approach is insufficient, and that grounding language in visual information is essential to unlock more robust reasoning capabilities.
The capacity of large language models to solve physics problems is significantly boosted when text-based input is paired with visual information, a process known as multimodal grounding. While LLMs excel at processing language, they often struggle with tasks requiring spatial reasoning or the interpretation of diagrams – core components of many physics challenges. By integrating images, such as free-body diagrams or circuit schematics, alongside textual problem descriptions, researchers are enabling these models to ‘see’ the physical situation, effectively augmenting their understanding. This allows the LLM to not only process the quantitative relationships expressed in words but also to visually assess the forces, connections, and arrangements crucial for accurate problem-solving, paving the way for more robust and insightful AI assistance in scientific domains.
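As one concrete pattern, current multimodal APIs accept a single message containing both text and image parts. The sketch below uses the OpenAI Python SDK's chat-completions image-input format as an example; the model name and diagram URL are placeholders, and other providers expose analogous interfaces.

```python
# Sketch of multimodal grounding: send the diagram alongside the problem
# text in one request. Model name and diagram URL are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Using the circuit diagram, find the current "
                     "through the 4-ohm resistor. Explain each step."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/circuit.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```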
The continued advancement of large language models in physics necessitates a focused investigation into synergistic combinations of prompting techniques and multimodal input. While chain-of-thought prompting demonstrates success in articulating reasoning, its efficacy is limited when applied to visually-dependent problems. Future studies must systematically evaluate how different prompting strategies – varying levels of detail, question decomposition, and constraint specification – interact with various forms of multimodal grounding, such as diagrams, simulations, and real-world images. Determining the optimal balance between textual guidance and visual information promises to significantly enhance LLM performance, not only in solving complex physics problems but also in fostering a deeper, more intuitive understanding of physical concepts, ultimately benefiting both educational applications and cutting-edge research endeavors.
The evaluation of large language models against AP Physics free-response questions reveals a predictable fragility. These systems excel at manipulating symbols, demonstrating robust algebraic skills, yet consistently falter when asked to integrate visual information or apply fundamental physics principles with reliable accuracy. This mirrors a broader challenge in AI: the ability to do something is not the same as understanding why it works. As James Clerk Maxwell observed, “The true voyage of discovery…never reveals its destination.” The models, much like early explorers, can navigate the mathematical landscape, but lack a deeper comprehension of the underlying physical reality, and their performance serves not as a demonstration of intelligence, but as a map of limitations. The consistent errors highlight that correlation, in this case between symbolic manipulation and physical understanding, is grounds for suspicion, not proof.
What’s Next?
The exercise, predictably, reveals that current large language models excel at the appearance of understanding. They manipulate symbols with impressive fluency – algebra, it seems, is largely pattern matching. However, a correct answer, divorced from a coherent interpretation of the physical scenario, is a compromise between knowledge and convenience. The models demonstrate that ‘solving’ a physics problem is not necessarily the same as understanding physics. The persistent failure with visual-spatial reasoning isn’t merely a technical hurdle; it suggests a fundamental disconnect between the models’ symbolic processing and the embodied cognition arguably necessary for genuine scientific insight.
Future work will undoubtedly focus on multimodal learning, attempting to bridge this gap. But a more pressing question remains largely unaddressed: optimal performance for whom? Is the goal to produce models that mimic human error, or to surpass human limitations? The pursuit of ‘general’ artificial intelligence often obscures the fact that intelligence itself is deeply contextual. A model that excels at AP Physics, but fails to generalize to novel, even slightly altered, scenarios, is a beautifully polished solution to a poorly defined problem.
Ultimately, the limitations revealed here aren’t about the models themselves, but about the metrics used to evaluate them. A high score on a standardized test is a convenient proxy for understanding, but a poor substitute for it. The true challenge lies not in building models that appear intelligent, but in developing methods to rigorously assess what ‘intelligence’ – and, more specifically, ‘scientific reasoning’ – actually means.
Original article: https://arxiv.org/pdf/2603.07457.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/