Can Machines Learn Grammar?

Author: Denis Avetisyan


New research shows that large language models are surprisingly adept at recognizing and applying complex grammatical rules without explicit programming.

The accuracy of large language models in evaluating parasitic gap constructions varies considerably across languages, suggesting that linguistic structure – rather than simply scale – remains a crucial factor in achieving robust natural language understanding.

This review examines how large language models learn syntactic structures – like subject-auxiliary inversion and parasitic gap licensing – through statistical learning from training data, challenging traditional views of innate grammatical competence.

Despite decades of inquiry, the foundations of syntactic competence remain debated – are grammatical constraints innate, or do they emerge from experience? This paper, ‘Grammaticality Judgments in Humans and Language Models: Revisiting Generative Grammar with LLMs’, investigates whether large language models (LLMs) can exhibit sensitivity to core syntactic structures – specifically subject-auxiliary inversion and parasitic gap licensing – solely through predictive training on surface text. Our results demonstrate that LLMs reliably distinguish grammatical from ungrammatical constructions, suggesting structural generalizations can arise from statistical learning without explicit encoding of rules. This raises the question of whether LLMs offer a novel pathway for understanding the origins of grammatical competence itself.


The Enduring Riddle of Linguistic Structure

The debate surrounding Universal Grammar, the theory positing an innate human capacity for language, continues to shape the study of language acquisition. This framework suggests that humans are born not with specific linguistic knowledge, but with a fundamental understanding of the rules governing language structure, allowing children to rapidly acquire the complexities of their native tongue. Researchers investigating this concept explore how children consistently demonstrate an ability to generalize linguistic rules, even when exposed to incomplete or imperfect language data – a phenomenon difficult to explain without some pre-wired linguistic ability. The enduring relevance of Universal Grammar lies in its attempt to explain not just how language is learned, but also why it is learned with such remarkable speed and consistency across diverse cultures, offering a foundational lens through which to examine the cognitive architecture underlying human communication.

The remarkable fluency exhibited by Large Language Models (LLMs) has ignited a vigorous debate concerning the nature of their linguistic competence. Though capable of generating human-like text, the question remains whether these models truly understand language, or simply excel at predicting the most probable sequence of words based on vast datasets. This distinction is critical, as genuine linguistic competence implies an ability to process novelty, grasp nuanced meaning, and adhere to grammatical rules beyond mere statistical correlation. Consequently, researchers are compelled to reassess the benchmarks used to evaluate LLMs, moving beyond superficial assessments of fluency to probe the depth of their understanding and the underlying mechanisms driving their performance. The current wave of LLMs therefore serves not only as powerful tools, but also as catalysts for a renewed investigation into the very foundations of language and cognition.

The ability to dissect and understand complex sentences hinges on recognizing their hierarchical structure – the way words group into phrases, which then combine to form larger meaningful units. Current research investigates whether Large Language Models (LLMs) truly grasp this organization, or if their apparent fluency stems from identifying statistical patterns within vast datasets. While LLMs can convincingly generate grammatically correct text, discerning whether they internally represent these nested relationships – akin to parsing a sentence into its component parts – remains a significant challenge. Some studies suggest LLMs struggle with tasks requiring genuine structural understanding, such as center-embedded clauses, indicating a reliance on surface-level correlations rather than deep linguistic competence. Determining if LLMs are simply masterful pattern matchers, or if they are approaching a more human-like capacity for hierarchical processing, is crucial for evaluating their potential and limitations in natural language understanding.

LLM accuracy in identifying parasitic gap constructions varies significantly depending on both the language and the specific model used.

Probing Syntactic Sensitivity: Methods of Dissection

Assessment of syntactic sensitivity in language models necessitates evaluation techniques extending beyond simple measures of fluency or surface-level correctness. Researchers utilize tests designed to probe understanding of core syntactic operations, such as Subject-Auxiliary Inversion – where the auxiliary verb inverts with the subject in questions – and the processing of complex constituent structures like nested clauses or center-embedded sentences. These constructions present challenges due to increased computational load and potential ambiguities, requiring models to demonstrate an ability to parse and interpret hierarchical relationships beyond mere sequential prediction of tokens. Performance on these tasks indicates whether a model has acquired a robust understanding of grammatical rules and the ability to generalize beyond the training data.
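
One common way to make such a probe concrete is to score a minimal pair – a grammatical sentence and its ungrammatical counterpart – by the total log-probability a causal language model assigns to each, counting the trial correct when the grammatical version scores higher. The sketch below illustrates this with an off-the-shelf GPT-2 model and an invented subject-auxiliary inversion pair; it is a minimal illustration of the general technique, not necessarily the exact protocol used in the paper under review.

```python
# Minimal sketch: surprisal-based acceptability scoring of a minimal pair.
# "gpt2" is a stand-in model and the sentences are invented examples.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_logprob(sentence: str) -> float:
    """Total log-probability the model assigns to the sentence's tokens."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels == input_ids, the returned loss is the mean negative
        # log-likelihood over the predicted (shifted) tokens.
        mean_nll = model(ids, labels=ids).loss.item()
    return -mean_nll * (ids.shape[1] - 1)

# Invented subject-auxiliary inversion minimal pair (not from the paper).
grammatical = "Is the boy who is sleeping happy?"
ungrammatical = "Is the boy who sleeping is happy?"

print(sentence_logprob(grammatical) > sentence_logprob(ungrammatical))
```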

The ‘Proxy View’ frames Large Language Models (LLMs) not as simulations of human cognitive processes, but as instruments for investigating the potential learnability of linguistic structures from data. This approach acknowledges the fundamental differences between LLM architecture and human cognition, avoiding claims of direct cognitive modeling. Instead, LLMs are utilized to determine what syntactic patterns and constraints could be acquired given sufficient linguistic exposure, effectively serving as a proxy for learning. Performance on challenging linguistic constructions then informs hypotheses regarding the information available within the input data and the capacity of a learning system – regardless of its implementation – to extract relevant generalizations.

Grammaticality judgments, a core component of this methodology, involve presenting human subjects or language models with sentences and eliciting evaluations of their correctness; these judgments are typically binary – either grammatical or ungrammatical – although graded scales are also utilized. Rigorous analysis focuses on model performance specifically on complex syntactic constructions such as center-embedded clauses, relative clauses with multiple modifiers, and sentences requiring non-local dependencies. This analysis involves calculating precision and recall metrics for identifying grammatical and ungrammatical sentences, as well as error analysis to pinpoint specific syntactic phenomena that consistently challenge the model. Performance is often compared against human baseline judgments to establish a quantitative measure of syntactic sensitivity and to identify areas where the model deviates from human linguistic competence.
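
Once binary judgments are collected, scoring them against a human baseline is straightforward. The toy snippet below uses hypothetical placeholder judgments, not data from the paper, to show how accuracy, precision, and recall for the “grammatical” class can be computed.

```python
# Hypothetical binary judgments (1 = grammatical, 0 = ungrammatical);
# placeholders for illustration, not data from the paper.
human = [1, 1, 0, 0, 1, 0, 1, 0]   # human baseline judgments
model = [1, 0, 0, 1, 1, 0, 1, 0]   # model judgments on the same sentences

tp = sum(h == 1 and m == 1 for h, m in zip(human, model))  # true positives
fp = sum(h == 0 and m == 1 for h, m in zip(human, model))  # false positives
fn = sum(h == 1 and m == 0 for h, m in zip(human, model))  # false negatives

accuracy = sum(h == m for h, m in zip(human, model)) / len(human)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"accuracy={accuracy:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```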

Tracing the Roots of Structure: Training and Representation

Autoregressive training methodologies, which task a model with predicting subsequent tokens within a sequence, are generally conducive to the development of structural sensitivity in language models. This approach encourages the creation of globally coherent representations as the model learns to maintain context and dependencies over extended sequences to accurately forecast the next token. By inherently requiring an understanding of grammatical structure to make accurate predictions, autoregressive training facilitates the acquisition of knowledge regarding hierarchical relationships and long-range dependencies within language. This contrasts with methods that focus on local context, potentially leading to fragmented representations and a diminished capacity to process complex linguistic structures.

Masked Language Modeling (MLM) training, wherein models are tasked with predicting randomly masked portions of input text, can result in a fragmented representational space compared to autoregressive approaches. This fragmentation occurs because the model processes local contexts in isolation to predict masked tokens, potentially diminishing its capacity to establish and maintain relationships between distant elements within a sequence. Consequently, the ability to learn and represent long-range dependencies and hierarchical syntactic structures may be impaired, as the model prioritizes local context reconstruction over the development of a globally coherent understanding of the input.
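
The difference between the two objectives is easiest to see in how they construct training targets. The toy sketch below, using made-up token IDs and a hypothetical mask token, contrasts the left-to-right targets of autoregressive training with the in-place reconstruction targets of masked language modeling; it illustrates only target construction, not model architecture or optimization.

```python
# Toy contrast of autoregressive vs. masked-LM training targets.
import random

tokens = [12, 7, 93, 4, 55, 8]   # a toy tokenized sentence (made-up IDs)
MASK_ID = 0                       # hypothetical [MASK] token id

# Autoregressive objective: predict each token from its full left context,
# so every position contributes a target and the context grows monotonically.
causal_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Masked-LM objective: corrupt a random subset of positions and predict
# only those, with the rest of the sentence visible on both sides.
masked_positions = sorted(random.sample(range(len(tokens)), k=2))
corrupted = [MASK_ID if i in masked_positions else t for i, t in enumerate(tokens)]
mlm_pairs = [(corrupted, tokens[i]) for i in masked_positions]

print("autoregressive targets:", causal_pairs)
print("masked LM targets:     ", mlm_pairs)
```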

Investigations into large language models trained with autoregressive objectives, specifically GPT-4 and LLaMA-3, indicate emerging syntactic sensitivity. Performance on challenging constructions such as English parasitic gaps and inversion reaches 100% accuracy, suggesting the capacity to learn structural generalizations from input data alone. While GPT-4 achieves 83% accuracy on English Across-the-Board Extraction, this sensitivity is not uniform across languages; performance on equivalent constructions in Norwegian declines, most sharply on Across-the-Board Extraction (29%), indicating that language-specific factors may influence the development of structural understanding.

Evaluations of large language models, specifically GPT-4, indicate a high capacity for structural generalization based solely on formal input. The model achieves 100% accuracy on English parasitic gaps and inversion tasks, demonstrating an ability to correctly process and understand complex grammatical constructions without explicit semantic cues. Performance extends to more complex structures, with GPT-4 attaining 83% accuracy on English Across-the-Board Extraction, a task requiring the identification of long-distance dependencies. These results suggest that these models can learn and apply structural rules from the patterns present in training data, even in the absence of meaning-based supervision.

Evaluations of GPT-4’s structural generalization capabilities reveal performance discrepancies across languages. While the model achieves high accuracy on Norwegian parasitic gap constructions, its performance declines with increased structural complexity; accuracy on Norwegian inversion drops to 78%, and performance on Norwegian Across-the-Board (ATB) Extraction is significantly lower at 29%. This indicates that the model’s ability to learn and apply structural rules is not uniformly robust across different grammatical phenomena within a single language, and that cross-linguistic transfer of structural understanding may be limited or require further refinement.

The Echo of Structure: Implications for Linguistic Theory

The capacity of large language models to discern constituent structure – how phrases and clauses combine to form sentences – represents a significant step toward achieving human-level linguistic competence. This ability isn’t merely about identifying parts of speech, but understanding the hierarchical relationships within a sentence, a skill traditionally assessed using tools like Phrase Structure Trees and Dependency Structures. Successful modeling of these structures demonstrates that LLMs are not simply memorizing patterns, but are developing an internal representation of grammatical organization. This suggests a potential to move beyond statistical prediction and toward a more generative, rule-governed understanding of language, opening doors to exploring the underlying cognitive processes that enable human language use and potentially creating models capable of true linguistic creativity.
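
For readers unfamiliar with these representations, the sketch below encodes one toy sentence both as a phrase structure (constituency) tree and as a set of dependency relations. The sentence and labels are purely illustrative, not output from any model discussed here.

```python
# Toy sentence: "the cat chased the dog" (illustrative example only).

# Phrase structure tree: nested constituents down to the individual words.
constituency = (
    "S",
    ("NP", ("Det", "the"), ("N", "cat")),
    ("VP", ("V", "chased"), ("NP", ("Det", "the"), ("N", "dog"))),
)

# Dependency structure: (head, relation, dependent) triples.
dependencies = [
    ("chased", "nsubj", "cat"),
    ("chased", "obj", "dog"),
    ("cat", "det", "the"),
    ("dog", "det", "the"),
]

print(constituency)
print(dependencies)
```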

Recent advancements in large language models, specifically GPT-4, showcase an unprecedented capacity for syntactic sensitivity, evidenced by achieving 100% accuracy on notoriously difficult English constructions like parasitic gaps and inversion. This feat isn’t merely a benchmark of performance; it positions LLMs as potentially invaluable tools for linguistic research. Previously, exploring the intricacies of these complex grammatical phenomena demanded extensive human annotation and analysis. Now, researchers can leverage LLMs to rapidly generate and evaluate hypotheses about linguistic structure, identify patterns in language use, and test the boundaries of grammatical theory. The ability to accurately process and predict these challenging constructions suggests that LLMs are capturing something fundamental about the rules governing human language, opening new avenues for computational modeling of linguistic competence and offering a powerful new lens through which to investigate the cognitive processes underlying language understanding.

The demonstrated capacity of large language models to process complex syntactic structures carries significant promise for the advancement of natural language processing systems. By accurately interpreting nuanced linguistic input – including constructions previously considered challenging for machines – these models move beyond simple pattern recognition toward a more genuine understanding of language. This improved comprehension translates directly into the potential for generating text that is not only grammatically correct but also more coherent, natural-sounding, and contextually appropriate. Consequently, future NLP applications – ranging from machine translation and text summarization to conversational AI – stand to benefit from increased robustness and a greater capacity to handle the inherent ambiguities and subtleties of human communication, ultimately fostering more effective and intuitive interactions between people and machines.

Despite recent advancements demonstrating impressive syntactic sensitivity, a comprehensive understanding of the boundaries of large language models’ structural comprehension remains a crucial area of inquiry. Current models, while capable of processing complex grammatical constructions, often lack the efficiency and adaptability characteristic of human language processing; they require substantial data and computational resources to achieve performance levels easily surpassed by native speakers. Future investigations should therefore focus on identifying the specific limitations of existing architectures, potentially exploring novel designs inspired by the hierarchical and recursive nature of human linguistic competence. This includes research into sparse activation, dynamic computation, and the integration of symbolic reasoning to create models that not only process language but also understand its underlying structure with greater nuance and efficiency.

The study’s findings resonate with a fundamental truth about complex systems: their behavior emerges from accumulated experience, not preordained design. As Robert Tarjan aptly stated, “Program structure is more important than program content.” This principle applies directly to large language models and their surprising ability to internalize grammatical structures. The models aren’t taught grammar in the traditional sense; they discern patterns within the vast dataset, effectively building their own ‘program structure’ for language. The observed sensitivity to phenomena like subject-auxiliary inversion and parasitic gap licensing isn’t evidence of innate linguistic ability, but rather a demonstration of how statistical learning creates a functional, if organically developed, competence. The system ages, adapts, and reveals its internal logic through performance – a timeline etched in parameters.

The Gradient of Competence

The demonstrated capacity of large language models to approximate human grammaticality judgments, without explicit encoding of generative rules, does not resolve the central questions – it merely reframes them. The models learn patterns, of course, but the persistence of error – the failures to fully mirror human assessments – is not noise. Every failure is a signal from time, indicating the limitations of statistical learning when confronted with the full complexity of language. The models achieve competence through scale, but scale does not guarantee understanding, only increasingly subtle mimicry.

Future work must move beyond simply measuring performance on established benchmarks. The true test lies in probing the boundaries of competence, constructing stimuli that reveal the underlying mechanisms – or lack thereof – driving these judgments. A focus on error analysis, on the specific types of grammatical structures that consistently challenge these models, will be more illuminating than chasing marginal gains in overall accuracy. Refactoring is a dialogue with the past; the next generation of models should not simply replicate human performance, but illuminate the processes that give rise to it.

Ultimately, the question is not whether these models can learn grammar, but whether their approach to grammatical competence offers a viable model of human linguistic acquisition. The gradient of competence is a steep one, and time will reveal whether these models merely climb it, or truly transcend it.


Original article: https://arxiv.org/pdf/2512.10453.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/


2025-12-14 08:14