Author: Denis Avetisyan
A new framework evaluates whether artificial intelligence can realistically simulate the challenging questioning of US Supreme Court oral arguments.
![The system models the dynamics of oral argument by simulating justice responses: it predicts each [latex]n^{th}[/latex] turn from the case facts, the legal question, the conversational context of the preceding [latex]n-1[/latex] turns, and the identity of the speaking justice, using both prompt-based methods built on varied base models and agentic approaches that leverage tools such as case docket searches and historical voting data. A two-layer framework then evaluates both the realism and the pedagogical value of the simulations.](https://arxiv.org/html/2603.04718v1/2603.04718v1/x1.png)
Researchers present a method for assessing large language models’ ability to extract legal issues and generate justice-specific lines of inquiry, moving beyond simple accuracy metrics.
Effective legal training relies on realistic practice, yet simulating the nuanced questioning of appellate judges presents a significant challenge. This is the core issue addressed in ‘AI-Assisted Moot Courts: Simulating Justice-Specific Questioning in Oral Arguments’, which investigates whether large language models can accurately replicate the style and substance of U.S. Supreme Court oral arguments. The authors present a novel two-layer evaluation framework revealing that while AI-generated questions can appear realistic and identify key legal issues, models still struggle with diversity and exhibit a tendency toward overly agreeable questioning, shortcomings often obscured by standard evaluation metrics. Could these findings pave the way for more robust AI-driven tools that truly enhance legal preparation and reasoning skills?
The Erosion of Subjectivity: Toward Data-Driven Legal Analysis
Historically, evaluations of Supreme Court oral arguments have relied heavily on qualitative assessments – detailed reviews by legal scholars and journalists – a process inherently susceptible to individual bias and interpretation. While insightful, these subjective analyses struggle to accommodate the sheer volume of arguments presented each term, limiting their capacity for comprehensive, systematic understanding. Furthermore, replicating these findings proves difficult, as interpretations often lack the precision needed for robust validation or comparative study. This reliance on manual review not only restricts the scale of analysis but also hinders the identification of subtle, yet potentially influential, patterns within judicial questioning and advocacy strategies – a significant limitation in fully grasping the dynamics of legal reasoning at the highest court.
The complexities of Supreme Court oral arguments demand analytical techniques extending beyond traditional qualitative assessments. A data-driven approach, employing computational linguistics and machine learning, allows researchers to identify recurring patterns in both judicial questioning and advocacy strategies. This involves analyzing not just what questions are asked, but how they are framed, the linguistic features employed, and the subsequent responses elicited. By quantifying these elements, researchers can reveal subtle biases, predict judicial voting behavior, and understand the effectiveness of different legal arguments. Such rigorous analysis moves beyond subjective interpretation, offering a scalable and objective means of deciphering the dynamics of legal discourse and uncovering previously hidden trends in the nation’s highest court.
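To make that concrete, here is a minimal sketch of how questioning patterns might be quantified from a transcript. The `Turn` record and the cue words are illustrative assumptions, not features reported in the paper.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Turn:
    """One utterance in a transcript (illustrative schema, not the paper's)."""
    speaker: str  # e.g. "JUSTICE KAGAN" or "MR. SMITH"
    text: str

def questioning_profile(turns: list[Turn]) -> dict[str, Counter]:
    """Tally simple framing cues per justice: question marks as a proxy for
    questions asked, and 'suppose' / 'what if' as a proxy for hypotheticals."""
    profiles: dict[str, Counter] = {}
    for turn in turns:
        if not turn.speaker.startswith("JUSTICE"):
            continue  # only profile the bench, not the advocates
        feats = profiles.setdefault(turn.speaker, Counter())
        lowered = turn.text.lower()
        feats["turns"] += 1
        feats["questions"] += lowered.count("?")
        if "suppose" in lowered or "what if" in lowered:
            feats["hypotheticals"] += 1
    return profiles
```

Even counts this crude, aggregated across a full term, begin to expose the per-justice regularities that manual review struggles to surface at scale.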
Effective analysis of legal discourse demands tools that surpass the limitations of basic text processing. Simple keyword searches often fail to capture the subtle interplay of arguments, while sentiment analysis frequently misinterprets the complex reasoning embedded within legal questioning and advocacy. To truly unlock understanding, computational linguistics must move toward techniques that model the structure of legal arguments, identify rhetorical strategies, and assess the logical relationships between claims and evidence. This requires developing algorithms capable of discerning not just what is said, but how it is said, and, crucially, why: the strategic purpose a question is meant to serve within the argument.

Dissecting the Discourse: A Framework for Analysis
TranscriptAnalysis constitutes the foundational element of our research, involving a systematic and exhaustive review of oral argument transcripts sourced from legal proceedings. This process extends beyond simple textual review; it encompasses detailed coding and annotation of the transcripts to identify specific arguments, legal references, and rhetorical devices. The scope of TranscriptAnalysis includes transcripts from various court levels – including appellate and supreme courts – and covers a diverse range of legal domains. Data extracted through TranscriptAnalysis is then structured and digitized, creating a corpus suitable for computational analysis and enabling quantitative assessment of legal discourse. The granularity of this analysis allows for the identification of patterns and trends in legal argumentation that would be difficult to discern through traditional qualitative methods.
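A corpus like this is naturally represented as structured records. The sketch below shows one plausible annotation schema; the field names and tag vocabularies are assumptions for illustration, not the authors' actual codebook.

```python
from dataclasses import dataclass, field

@dataclass
class AnnotatedTurn:
    """A single coded utterance; tag vocabularies are illustrative."""
    turn_index: int
    speaker: str
    text: str
    argument_tags: list[str] = field(default_factory=list)      # e.g. "textualist reading"
    citations: list[str] = field(default_factory=list)          # e.g. "Chevron v. NRDC"
    rhetorical_devices: list[str] = field(default_factory=list) # e.g. "hypothetical"

@dataclass
class AnnotatedTranscript:
    """One fully coded oral argument, ready for corpus-level queries."""
    case_name: str
    court: str  # e.g. "SCOTUS" or "9th Cir."
    term: str
    turns: list[AnnotatedTurn] = field(default_factory=list)
```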
Automated IssueExtraction employs natural language processing techniques to identify the central legal questions presented in a given transcript. This process involves parsing the text, identifying key phrases and arguments related to legal principles, and categorizing these elements to define the issues under consideration. The extracted issues serve as the initial data points for subsequent analysis, enabling researchers to focus on the specific points of contention and build a structured understanding of the case. This automated approach facilitates efficient and scalable analysis of large volumes of legal transcripts, reducing the time and resources required for manual issue identification.
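A minimal version of such an extractor might prompt a model directly, as sketched below. This assumes the `openai` Python package (v1+) with an API key configured in the environment; the prompt wording is an illustration, not reproduced from the paper.

```python
from openai import OpenAI  # assumes openai>=1.0 and OPENAI_API_KEY in the environment

client = OpenAI()

# Illustrative prompt; the paper's actual prompts are not reproduced here.
EXTRACTION_PROMPT = """You are reading a U.S. Supreme Court oral argument transcript.
List the central legal questions presented, one per line, as short neutral phrases.

Transcript:
{transcript}
"""

def extract_issues(transcript: str, model: str = "gpt-4o") -> list[str]:
    """Ask the model for the central legal questions and return them as a list."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": EXTRACTION_PROMPT.format(transcript=transcript)}],
        temperature=0,  # low variance for more reproducible extraction
    )
    text = response.choices[0].message.content or ""
    return [line.strip("- ").strip() for line in text.splitlines() if line.strip()]
```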
The system utilizes `GPT4oPrompting` to enhance the accuracy of identified legal issues and to create a comprehensive suite of evaluation test cases. This involves formulating specific prompts designed to challenge the automated `IssueExtraction` process, assessing its ability to correctly identify core legal questions under varying conditions and with complex factual scenarios. The generated test cases are structured to cover a range of argument types and legal domains, enabling a quantitative evaluation of the system’s performance metrics, including precision, recall, and F1-score. Results from these tests inform iterative refinements to both the prompting strategies and the underlying `IssueExtraction` algorithms.
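The reported metrics can be computed in the standard way once extracted issues are matched against a gold set. The sketch below uses exact string matching as a simplification; in practice, matching would more likely rely on normalization or a human or LLM judge.

```python
def issue_prf(predicted: set[str], gold: set[str]) -> tuple[float, float, float]:
    """Precision, recall, and F1 over extracted issues. Exact set matching is a
    simplification: it presumes issues were first normalized to comparable strings."""
    if not predicted or not gold:
        return 0.0, 0.0, 0.0
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: two of three predicted issues match the gold annotations.
p, r, f = issue_prf(
    {"standing", "qualified immunity", "mootness"},
    {"standing", "qualified immunity", "severability"},
)
```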

Unveiling the Dynamics: Categorizing and Evaluating Judicial Interaction
The judicial interaction analysis utilizes three distinct classification methods to categorize justice remarks. LegalBenchClassification employs a framework based on established legal principles and precedent to identify the legal basis of statements. MetacogClassification analyzes remarks through the lens of metacognitive processes, categorizing statements related to reasoning, knowledge, and uncertainty. Finally, StetsonClassification leverages the Stetson framework, focusing on argumentative moves and rhetorical strategies present in judicial discourse. These methods allow for a multi-faceted categorization of justice interactions, moving beyond simple topic identification to include analysis of the underlying cognitive and legal structures of the exchange.
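The sketch below illustrates how a single remark could be routed through all three lenses. The label sets shown are abbreviated stand-ins; the actual LegalBench, metacognitive, and Stetson taxonomies are richer than this.

```python
from enum import Enum

# Abbreviated stand-in label sets; the real taxonomies are richer.
class LegalBenchLabel(Enum):
    RULE_APPLICATION = "rule_application"
    RULE_CONCLUSION = "rule_conclusion"
    INTERPRETATION = "interpretation"

class MetacogLabel(Enum):
    REASONING = "reasoning"
    KNOWLEDGE = "knowledge"
    UNCERTAINTY = "uncertainty"

class StetsonLabel(Enum):
    CLARIFYING = "clarifying"
    HYPOTHETICAL = "hypothetical"
    CHALLENGE = "challenge"

def classify_remark(remark: str, classify) -> dict[str, str]:
    """Route one remark through all three lenses. `classify` is any callable
    (e.g. an LLM wrapper) mapping (remark, label_enum) to a chosen label name."""
    return {
        "legalbench": classify(remark, LegalBenchLabel),
        "metacog": classify(remark, MetacogLabel),
        "stetson": classify(remark, StetsonLabel),
    }
```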
The ValenceAssessment component of the system evaluates justice responses to determine the affective tone of the interaction, classifying exchanges as either collaborative or competitive. This assessment is performed by analyzing linguistic features within the justices’ statements, identifying cues indicative of agreement, support, or positive regard – suggesting a collaborative dynamic – versus disagreement, challenge, or negative sentiment, which indicate a competitive exchange. The resulting valence score provides a quantitative measure of the interaction’s tone, enabling analysis of communication patterns and the potential impact on legal reasoning and decision-making processes. This metric is distinct from categorization of the content of the remarks, focusing instead on how those remarks are delivered and received within the judicial dialogue.
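As a toy illustration of the collaborative-versus-competitive distinction, a lexicon-based scorer is sketched below. The cue lists are invented for demonstration; the real system presumably uses a model-based judge rather than word lists.

```python
# Invented cue lists for demonstration only.
COLLABORATIVE_CUES = ("i agree", "that's helpful", "fair point", "good answer")
COMPETITIVE_CUES = ("but isn't", "i disagree", "that can't be right", "how do you square")

def valence_score(remark: str) -> float:
    """Score a justice remark in [-1, 1]: positive leans collaborative,
    negative leans competitive, 0.0 means no cues fired."""
    lowered = remark.lower()
    collab = sum(cue in lowered for cue in COLLABORATIVE_CUES)
    compet = sum(cue in lowered for cue in COMPETITIVE_CUES)
    total = collab + compet
    return 0.0 if total == 0 else (collab - compet) / total
```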
The FallacyDetection component identifies instances of flawed reasoning within judicial statements by recognizing patterns corresponding to established logical fallacies, such as ad hominem attacks, straw man arguments, and false dichotomies. This functionality extends beyond simple keyword spotting; the system analyzes the semantic relationships between claims and evidence to determine if an argument is logically sound. Detection is achieved through a combination of rule-based systems and machine learning models trained on datasets of identified fallacies in legal texts. The presence of detected fallacies is quantified and reported, providing a metric for evaluating the quality of legal reasoning exhibited in the exchange and informing assessments of judicial argumentation.
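A hybrid detector of this kind might look like the sketch below: a rule pass followed by an optional learned classifier. The regex patterns are illustrative placeholders, not the system's actual rules.

```python
import re

# Placeholder patterns; a production system pairs rules like these with a model.
FALLACY_PATTERNS = {
    "false_dichotomy": re.compile(r"\beither\b.+\bor\b", re.IGNORECASE),
    "slippery_slope": re.compile(r"\bwhere does (it|this) end\b", re.IGNORECASE),
    "straw_man": re.compile(r"\bso you're saying\b", re.IGNORECASE),
}

def detect_fallacies(statement: str, ml_classifier=None) -> list[str]:
    """Rule pass first; an optional learned classifier (any callable returning
    a list of fallacy labels) can add detections the rules miss."""
    hits = [name for name, pattern in FALLACY_PATTERNS.items()
            if pattern.search(statement)]
    if ml_classifier is not None:
        hits.extend(label for label in ml_classifier(statement) if label not in hits)
    return hits
```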

The Horizon of Comprehension: Measuring Comprehensive Issue Coverage
The evaluation of justice questioning relies on a metric called `IssueCoverageEvaluation`, designed to determine the extent to which crucial legal issues are actually addressed during discourse. This assessment doesn’t simply check for the mention of topics, but rather rigorously analyzes whether the questioning delves into all facets of each identified legal issue. The process reveals potential shortcomings in legal debate, pinpointing areas where critical aspects are overlooked or inadequately explored. By quantifying issue coverage, researchers gain a clearer understanding of the completeness of legal reasoning, ultimately helping to identify gaps where further investigation or clarification is needed to ensure a more thorough and just outcome.
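In its simplest form, such a metric reduces to a ratio of facets touched to facets required, as in the sketch below; exact set intersection stands in for whatever softer matching the framework actually uses.

```python
def issue_coverage(addressed: dict[str, set[str]], required: dict[str, set[str]]) -> float:
    """Fraction of required issue facets actually touched in the questioning.
    Keys are issue names; values are sets of facet labels. Exact intersection
    is a stand-in for a softer, judge-based match."""
    total = sum(len(facets) for facets in required.values())
    if total == 0:
        return 0.0
    covered = sum(
        len(required[issue] & addressed.get(issue, set()))
        for issue in required
    )
    return covered / total
```

Under this definition, the facet-level figures discussed below correspond directly to the returned ratio, e.g. roughly 0.41 for the best-performing models.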
Evaluations reveal a significant disparity in the ability of current models to comprehensively address key legal issues presented in justice questioning; while some demonstrate a capacity to cover over 60% of these issues, performance varies considerably. This suggests a nascent, but uneven, capability in artificial intelligence to identify and engage with the full scope of legal arguments. The observed range indicates that certain approaches are more effective at issue identification than others, prompting further investigation into the specific architectures and training data responsible for these differences. This level of coverage, though promising, still leaves a substantial portion of relevant legal considerations unaddressed, highlighting an ongoing need for refinement and innovation in this field.
Despite advances in artificial intelligence capable of dissecting legal arguments, a significant hurdle remains in achieving truly comprehensive issue coverage. Current models, even those demonstrating superior performance, only manage to address approximately 41% of the highly specific facets within complex legal issues. This suggests that while AI can identify broad legal themes, capturing the nuanced details and interconnectedness essential for thorough legal reasoning presents a considerable challenge. The limitation isn’t simply a matter of incomplete data; it reveals the inherent difficulty in representing the full spectrum of legal considerations, even for sophisticated algorithms designed to mimic human legal analysis. Consequently, while promising, current AI systems still require substantial refinement before they can consistently deliver the exhaustive coverage expected in robust legal discourse.
A new analytical approach, enabled by the AICollaborationFramework and sharpened through OralArgumentSimulation, is yielding unprecedented insights into the complexities of legal reasoning. This methodology moves beyond simple identification of legal issues to model the dynamic interplay of arguments, counterarguments, and nuanced interpretations that characterize judicial decision-making. By simulating oral arguments and leveraging AI collaboration, researchers can now dissect the cognitive processes underlying legal thought, revealing patterns and biases previously obscured. The framework doesn’t merely assess what issues are addressed, but how they are considered, offering a granular understanding of the reasoning pathways employed by legal professionals and paving the way for more transparent and equitable judicial outcomes.

The pursuit of realistic simulation, as detailed in this work concerning AI-assisted moot courts, echoes a fundamental principle of system design: graceful decay. The framework proposed doesn’t aim for perfect replication of Supreme Court questioning – an impossible ideal – but rather a demonstrable progression toward more nuanced and relevant issue extraction. As G.H. Hardy observed, “The essence of mathematics lies in its elegance and simplicity.” This sentiment applies equally to the development of these AI systems; complex algorithms are only valuable if they yield clear, insightful simulations. The evaluation metrics themselves, beyond mere accuracy, acknowledge that a system’s true value isn’t its immediate perfection, but its capacity for continual refinement and adaptation over time, a measured progression against inevitable entropy.
What’s Next?
The pursuit of simulating legal reasoning, even with increasingly sophisticated large language models, reveals a fundamental truth: any improvement ages faster than expected. This framework, while offering granular evaluation of question generation, merely pushes the inevitable confrontation with the inherent messiness of adversarial argument. The metrics developed here, focused on issue extraction and question relevance, will soon prove insufficient as models become adept at appearing to reason, rather than demonstrating actual comprehension of underlying legal principles.
Future work must acknowledge the temporal nature of legal decay. A question well-answered today invites a more nuanced challenge tomorrow. The true metric isn’t static accuracy, but the rate at which a model’s reasoning degrades under sustained, adversarial pressure. The field should move beyond assessing what a model answers, to tracking how its justifications evolve – or unravel – over iterative questioning.
Rollback is a journey back along the arrow of time, and the effort to recreate the dynamism of oral argument will necessitate a deeper understanding of not just legal knowledge, but the subtle art of persuasive rhetoric, and the inevitability of unforeseen challenges. The goal isn’t perfect simulation, but the graceful acceptance of inherent imperfection.
Original article: https://arxiv.org/pdf/2603.04718.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/