<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">On Evaluating Legal-Reasoning Capabilities of Generative AI</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Henry</forename><surname>Prakken</surname></persName>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Department of Information and Computing Sciences</orgName>
								<orgName type="department" key="dep2">Faculty of Science</orgName>
								<orgName type="institution">Utrecht University</orgName>
								<address>
									<settlement>Utrecht</settlement>
									<country key="NL">The Netherlands</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">On Evaluating Legal-Reasoning Capabilities of Generative AI</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">147AC1F863C4D2F2E1B627579F0BD4C4</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T20:16+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Legal argumentation</term>
					<term>Large language models</term>
					<term>Evaluation</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper critically examines some recent studies of the legal-reasoning capabilities of generative AI. It also discusses which roles traditional symbolic approaches can have in the era of generative AI.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The introduction of ChatGPT by OpenAI in November 2022 was a 'big bang' in AI. Never before was an AI tool available to so many people and so easy to use for so many different tasks. The ease with which it generates flawless natural language for a wide variety of tasks such as summarising documents, writing essays about any given topic, writing poems, drafting travel plans, outlining presentations, and even solving computer-programming exercises is amazing. And all this essentially with the simple technique of predicting the most likely next word in a sequence of words. It is therefore easy to think that traditional symbolic AI research on reasoning and argumentation is now obsolete and that the right way to let the computer engage in reasoning and argumentation is by using generative AI founded on large language models. This paper addresses this issue for argumentation in the law, which is an important application domain of computational argument. Several experiments have already been conducted on how large language models (LLMs) perform on legal reasoning tasks. This paper reviews some of these experiments and more generally discusses the potential of generative AI to engage in legal argumentation. We first briefly summarise AI &amp; law research on legal argument in Section 2. Then we make some methodological observations in Section 3 and review recent experiments in applying LLMs to legal reasoning in Section 4. We then discuss what the field of computational argumentation can learn from these studies in Section 5, after which we conclude.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>The 24th International Workshop on Computational Models of Natural Argument (CMNA'24)</head><p>h.prakken@uu.nl (H. Prakken); https://webspace.science.uu.nl/~prakk101/ (H. Prakken); ORCID 0000-0002-3431-7757 (H. Prakken)</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Brief overview of AI &amp; law research on modelling legal argument</head><p>Argumentation is "…the giving of reasons to support or criticize a claim that is questionable, or open to doubt" <ref type="bibr">[1, p. 285</ref>]. The field of AI &amp; Law has developed formal and computational models of legal argumentation since <ref type="bibr" target="#b1">[2]</ref>. For overviews see <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b4">5]</ref>. Both rule- and case-based approaches have been applied, initially as alternatives but more recently as complementing each other, since case-based reasoning is often about whether a rule's conditions are satisfied.</p><p>Rule-based approaches have to account for the defeasibility of legal rules. Since rule-makers cannot foresee everything, rule-appliers sometimes have to make exceptions in unforeseen circumstances. Defeasibility also arises because of presumptions and allocations of burdens of proof <ref type="bibr" target="#b5">[6]</ref>. Several early rule-based accounts of legal reasoning used some form of logic programming <ref type="bibr" target="#b6">[7,</ref><ref type="bibr" target="#b7">8]</ref>. Later, explicitly argument-based formalisms were applied or developed <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b9">10,</ref><ref type="bibr" target="#b10">11]</ref>, as well as formalisms with an argumentative flavour such as Defeasible Logic <ref type="bibr" target="#b11">[12]</ref> and abstract dialectical frameworks <ref type="bibr" target="#b10">[11]</ref>.</p><p>Case-based approaches were initially developed to account for the fact that in Anglo-American jurisdictions case law, rather than legislation, is traditionally the main source of law, where courts have to decide new cases by drawing analogies to decided cases. 
Case-based approaches have to account for the fact that cases often have not just similarities but also differences. Seminal work was done on HYPO <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b13">14]</ref>, in which cases were modelled with sets of features called dimensions, which have partially ordered values that make cases better or worse for a particular outcome. HYPO generated three-ply arguments between a plaintiff and a defendant in a civil dispute, drawing analogies between or distinguishing cases from their respective points of view. In later work this was refined in many ways, for instance, by distinguishing features at various levels of abstraction <ref type="bibr" target="#b14">[15]</ref> or by comparing cases in terms of how case decisions promote or demote legal or social values <ref type="bibr" target="#b15">[16,</ref><ref type="bibr" target="#b16">17]</ref>. Perhaps the most ambitious approaches are coherence-based accounts, which model the construction of legal theories of some kind that explain a set of cases and where the most coherent theory that does so should be adopted <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b17">18,</ref><ref type="bibr" target="#b18">19]</ref>.</p><p>While early on rule- and case-based approaches were presented as alternatives, later the awareness arose that they complement each other, since case-based reasoning is often about whether a rule's conditions are satisfied. A challenge for rule-based accounts is that the conditions of legal rules are often vague and general, and no clear rules can be given for when they are satisfied. 
Here case-based approaches can complement rule-based approaches by providing forms of argumentation for interpreting legal concepts <ref type="bibr" target="#b19">[20,</ref><ref type="bibr" target="#b20">21,</ref><ref type="bibr" target="#b3">4]</ref>.</p><p>In sum, AI &amp; law has developed rich models of various forms of legal argument, including rule-based, case-based and value-based accounts, which draw on various sources, including legislation, case law and social and moral value considerations. Moreover, all this work takes a knowledge-based approach: the required knowledge is encoded in a symbolic form that is understandable to the machine, and the computer reasons with it in a formally defined way, ideally based on the laws of logic and rational reasoning. The advantages of this approach are transparency and explainability: humans can see which knowledge the machine uses and the machine can explain its outcomes by showing how it reasoned with this knowledge. A big disadvantage of this approach is that it is often hard to acquire and represent a sufficient amount of knowledge in a form that can be manipulated by the machine. This is the notorious knowledge-acquisition bottleneck. Hence the attractiveness of large language models as a means to generate legal argument.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methodological remarks</head><p>Evaluation of knowledge-based AI applications can be done at three levels: evaluating the knowledge put into the system, evaluating the reasoning mechanism (for instance, on whether it implements some philosophically acceptable model of rational reasoning, or on soundness and completeness properties with respect to such a model) and evaluating the output. When evaluating applications of generative AI, evaluation at the first two levels becomes hard, so often only the system's output is evaluated. Moreover, this output is natural language instead of some formal language, with all the ambiguity and vagueness that comes with it, so interpreting the output is not always an easy task. In consequence, evaluation studies of generative AI are inherently experimental, often statistical, and can involve subjective elements.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Terminology: prompt engineering</head><p>A well-known drawback of LLMs is that there is no connection between the large statistical language model learned by the LLM and reality. All that an LLM 'knows' is how often words go together in similar contexts. This often causes an LLM to 'hallucinate' facts. While this may not be a problem for creative applications like writing prose or poetry, it is a serious problem when an LLM is asked to produce high-quality information or arguments in high-stakes contexts; and legal contexts are often high-stakes.</p><p>There is much research on addressing this problem. Much of it involves prompt engineering, that is, applying ingenious ways of writing the prompts that form the user input of LLM applications. Zero-shot prompts do not contain any examples of desired output but directly ask a question or specify a task. Few-shot prompts do provide such examples. Chain-of-thought (CoT) prompting consists of ways to ask the model to 'think' step by step rather than solving a complex problem at once. Zero-shot CoT does just that, while few-shot CoT methods combine it with examples of desired output. Such prompts are often formulated as problems of pattern completion, which consist of showing the model a pattern of expected answers: this increases the probability that the model will indeed give an answer in terms of this pattern. Yet another way in which prompts can be engineered is to include one or more documents that have to be taken into account in the model's answers. When these documents are retrieved from other sources after entering a prompt, this is called retrieval-augmented generation. In legal applications it makes sense to include or retrieve legislation or case law.</p></div>
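To make these distinctions concrete, the prompting styles can be sketched as plain string templates. A minimal Python illustration follows, in which the question, the example answers and the retrieved statute are invented placeholders, not prompts from any of the reviewed studies:

```python
# Minimal sketches of the prompting styles discussed above.
# The question, examples and statute text are invented placeholders.

question = "Is a contract concluded by e-mail valid under Article X?"

# Zero-shot: the task is stated directly, without example answers.
zero_shot = f"Answer the following legal question: {question}"

# Few-shot: the prompt first shows examples of the desired output,
# so the model completes the pattern.
few_shot = (
    "Q: Is an oral contract for the sale of land valid? A: No.\n"
    "Q: Is a signed written lease valid? A: Yes.\n"
    f"Q: {question} A:"
)

# Zero-shot chain-of-thought: ask the model to reason step by step.
zero_shot_cot = f"{question}\nLet's think step by step."

# Retrieval-augmented generation: a retrieved document is prepended
# as context that the answer must take into account.
retrieved_statute = "Article X: A contract may be concluded in any form."
rag = (
    f"Context: {retrieved_statute}\n"
    f"Using only the context above, answer: {question}"
)
```

A few-shot CoT prompt would simply combine the last two ideas: worked example answers that each spell out their reasoning steps, followed by the new question.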
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Questions asked about the studies</head><p>In this paper we will review all serious recent studies of the legal-reasoning capabilities of generative AI that we know of, with attention to, among other things, their prompt-engineering methods. We will ask the following questions about the reviewed studies.</p><p>• Which reasoning capability is tested and according to which reasoning model?</p><p>• How direct was the testing? Were proxies for reasoning abilities used?</p><p>• Which method of prompt engineering is used?</p><p>• How systematic is the evaluation? Is it subjective or objective, qualitative or quantitative?</p><p>• What is compared? LLMs or prompting methods against each other or also against human performance?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Recent experiments on legal reasoning by LLMs</head><p>All reviewed studies concern tasks that involve legal reasoning, though in different ways. Some studies are about answering exam questions, other studies involve specific reasoning tasks (mostly rule application) and some studies are about the generation of legal documents that typically contain argumentation. Many studies apply or refer to the IRAC method of legal reasoning, popular in Anglo-American legal education. IRAC stands for Issue-Rule-Application-Conclusion.</p><p>Here, Issue is the task of determining the legal issue of a case, Rule is the task of identifying the relevant legal rules (which can also be precedents), Application is the task of determining how the rules should be applied to the facts, and Conclusion is the task of drawing a legal conclusion from the rule application. While in reality issue spotting can be far from trivial, in all studies reported below the issue is in fact given. Note also that the IRAC model in fact abstracts from all the AI &amp; law models of legal reasoning discussed above in Section 2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Studies on document generation</head><p>There are a few studies on legal document generation.</p><p>Perlman <ref type="bibr" target="#b21">[22]</ref>, in an informal experiment with zero-shot prompts, asks ChatGPT, among other things, to suggest arguments to make in a brief about a particular legal issue, to draft a legal complaint and to perform an initial legal analysis of a brief factual scenario. Perlman then gives his own informal qualitative opinion, observing among other things that ChatGPT's output is "surprisingly sophisticated" though "incomplete and problematic in numerous ways". The outputs "would not be sufficiently helpful in their current forms for most people".</p><p>Iu &amp; Wong <ref type="bibr" target="#b22">[23]</ref> conduct a similar informal experiment on the basis of a simplified description of the facts in a well-known American case, asking ChatGPT with zero-shot prompts to perform various writing tasks. Some of these tasks involve the production of legal arguments, such as drafting a pleading claim, drafting a skeleton argument with the support of case law and drafting a judgement considering both sides. The authors then subjectively evaluate the documents, observing among other things that ChatGPT "demonstrated its ability to understand simple facts and articulate the legal basis of a claim", "was able to …summarise the key facts of relevant case law to support the plaintiff's case", and "was able to apply the reasoning of case law to the simple facts of the case, thus demonstrating an ability to follow the IRAC approach in writing the skeleton argument". 
When applied to a second, more complicated case, ChatGPT "performed excellently" in drafting skeleton arguments and "was able to draft the judgment by considering the arguments of both sides with logical reasoning".</p><p>In sum, neither of these studies explicitly tests reasoning capability, so testing is indirect; the prompts were zero-shot, evaluation is unsystematic and subjective, and no comparisons are made. Because of the lack of explicit and objective evaluation standards, these studies cannot provide valid and reliable results, though they can have heuristic value.</p><p>Trozze et al. <ref type="bibr" target="#b23">[24]</ref>, in the domain of cryptocurrency security cases, tested ChatGPT on writing a complaint for a class action lawsuit. The complaint was compared to one written by a lawyer by letting a mock jury decide on the basis of both. The prompt only asked ChatGPT to write the various parts of a complaint and did not give reasoning instructions. ChatGPT was then evaluated in terms of how often the jury gave the same decision on the basis of both complaints. In 88% of the cases the jurors were convinced by the lawyer-drafted complaint that the allegations were proven; for the AI-drafted complaints this figure was 80%. The authors conclude from this that "Overwhelmingly, ChatGPT drafted convincing complaints, which performed only slightly worse than the lawyer-drafted ones". More generally, the authors conclude that ChatGPT is better at drafting legal documents than at statutory reasoning (citing others for the same conclusion).</p><p>In sum, the prompt did not give any reasoning instructions, and there was no explicit testing of reasoning capability. Testing is thus indirect, with as proxy how often ChatGPT agrees with the human lawyer. Systematic quantitative evaluation, but no real comparison with human performance, since the human lawyer's performance is used as the evaluation standard.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Studies on exam performance</head><p>Several studies let the models take legal exams or answer exam questions. These studies mostly indirectly evaluate legal-reasoning capacities, since many exam questions do not directly test the student's reasoning or argumentation skills. These studies can only be regarded as evaluating such skills on the assumption that successfully passing legal exams requires such skills. This may not always be the case since questions can also test for the possession of legal knowledge.</p><p>Yu et al. <ref type="bibr" target="#b24">[25]</ref> let GPT-3 answer a type of question from the Japanese bar exam, modelled as an entailment task from the COLIEE competition <ref type="bibr" target="#b25">[26]</ref>. Given a legal rule and a legal question (hypothesis), GPT-3 has to answer whether the hypothesis is true or false, with a brief explanation. This looks like rule application without chaining of rules. The authors test several prompting methods: zero-shot (simply asking whether the hypothesis is true given the rule), few-shot (giving 1, 3 or 8 examples of desired output) and a two-stage form of CoT prompting, first asking 'let's think step-by-step' and using the output as the input for the prompt 'therefore, the hypothesis is (true or false)'. Answers are quantitatively evaluated in terms of the known correct answer. The accuracy is between 61 and 75%. Then the authors finetune GPT-3 with the COLIEE data set. Accuracy is between 61 and 77%. Finally, the authors ask GPT-3 in the prompt to apply particular reasoning methods, all of which are variants of the IRAC method. The prompts just mention the required approach but do not explain it. Accuracy is between 66 and 81%. The authors observe qualitatively that the models appear to apply the indicated reasoning method. 
The authors observe that the few-shot approaches with example- and reasoning prompts outperform previous winners of the COLIEE competition, but they do not compare with human performance.</p><p>In sum, explicit testing of reasoning abilities, so testing is direct; zero-shot and few-shot prompting, with some prompts asking to apply a particular mentioned but undefined reasoning method; systematic quantitative evaluation; comparisons between prompting methods and with other NLP methods but not with human performance.</p><p>Choi et al. <ref type="bibr" target="#b26">[27]</ref> tested ChatGPT on four law school exams, each consisting of an essay part and a multiple-choice part. They used zero-shot prompts consisting of the exam question, and for the multiple-choice part they alternatively tested CoT prompting, asking to provide a chain of reasoning as well as giving a letter answer to the question. Three of the authors blindly graded exams taken both by ChatGPT and by students. The authors found that ChatGPT passed all exams but that compared to the human students it "generally scored at or near the bottom of each class". Also, ChatGPT scored better on the multiple-choice questions than on the essay questions, and CoT prompting performed worse than the zero-shot prompts, although the difference was not statistically significant. As regards the essay questions, the authors qualitatively observed that ChatGPT was poor at arguing why a rule applied to given facts and that it did not systematically answer in terms of IRAC or some other reasoning model.</p><p>In sum, no explicit testing of reasoning capability/model, except for a basic form of CoT prompting for the multiple-choice questions. Testing is indirect, with the exam scores as proxy for legal-reasoning abilities. Systematic quantitative evaluation in terms of exam scores, comparisons between prompting methods and with human performance.</p><p>Katz et al. 
<ref type="bibr" target="#b27">[28]</ref> tested the performance of GPT-4 on a simulated version of the American bar exam. The exam consists of a part with essay questions and a part with multiple-choice questions. The answers to the essay questions were evaluated by two academic legal experts on the basis of a collection of "representative" good answers available online. <ref type="bibr" target="#b27">[28]</ref> make various claims about GPT-4's performance, the most important one being that it has passed the exam. Although <ref type="bibr" target="#b28">[29]</ref> casts doubt on some of <ref type="bibr" target="#b27">[28]</ref>'s claims, he agrees that their main claim is justified. This implies that GPT-4 performs comparably to human legal experts on the bar exam.</p><p>In sum, no explicit testing of reasoning capability. Testing is indirect, with exam score as a proxy for legal-reasoning abilities. The zero-shot prompts correspond to the exam questions. Systematic quantitative evaluation and comparisons with human performance.</p><p>Nay <ref type="bibr" target="#b29">[30]</ref> made their own selection of multiple-choice questions (four options, one of which is correct) for American tax law, with randomly generated facts, names and numbers to ensure that the questions cannot occur in a model's training set. They compare several prompting methods that inject legal information into the prompt to a zero-shot prompt that simply asks the question. One method injects into the prompt potentially relevant statutes resulting from a similarity search. Another method directly provides as context the relevant part of the law. A final method provides context in the form of a lecture note relevant to the question type, written by a law professor (one of the authors). Then various LLMs, including GPT-4, are compared on accuracy, where in some experiments the prompting method is combined with CoT prompting. 
Generally, GPT-4 performs the best, while CoT improves performance but not consistently.</p><p>In sum, implicit testing of reasoning capability (deductive rule application without chaining). Testing is indirect. The prompts provide relevant legal information but give no information about or explicit examples of the expected reasoning. Systematic quantitative evaluation and comparisons between prompting methods but no comparisons with human performance.</p></div>
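Most of the studies above score outputs quantitatively as accuracy against known correct answers, per prompting method. The evaluation loop behind such scores can be sketched in a few lines of Python; the gold labels and model answers below are invented and bear no relation to the reviewed results:

```python
# Hypothetical gold answers and model outputs for a small batch of
# true/false entailment questions (COLIEE-style); all data invented.
gold = ["true", "false", "true", "true", "false"]
predictions = {
    "zero-shot": ["true", "true", "true", "false", "false"],
    "few-shot": ["true", "false", "true", "true", "true"],
}

def accuracy(gold_labels, predicted_labels):
    """Fraction of questions answered with the known correct label."""
    correct = sum(g == p for g, p in zip(gold_labels, predicted_labels))
    return correct / len(gold_labels)

for method, pred in predictions.items():
    print(f"{method}: {accuracy(gold, pred):.0%}")
    # prints: zero-shot: 60%  /  few-shot: 80%
```

Comparisons "between prompting methods" in the reviewed studies amount to comparing such per-method accuracies, sometimes with statistical significance tests on the differences.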
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Studies on specific reasoning tasks</head><p>Jiang &amp; Yang <ref type="bibr" target="#b30">[31]</ref> study how GPT-3 classifies brief factual scenarios as a criminal offence (a choice of one out of eight). They include a brief explanation of the legal syllogism (basically modus ponens with legal rules) in the prompt, without examples: 'In the legal syllogism, the major premise is the law article, the minor premise is the facts of the case and the conclusion is the outcome of the judgment'. Then the prompt gives a brief factual scenario and asks GPT-3 to 'use the legal syllogism to think and output the judgment'. The output gives a major and a minor premise and a conclusion. The evaluation standard is the given 'correct' classification. GPT-3 has a higher accuracy with this method (68.5%) than with simply giving the case and asking for the judgment (64.5%) or with zero-shot CoT prompting with 'let's think step-by-step' (58.8%).</p><p>In sum, explicit testing of reasoning capability, namely, the legal syllogism (deductive rule application without chaining). Testing is direct. The prompts contain an explanation of the reasoning method and ask to apply it. Systematic quantitative evaluation and comparisons between prompting methods but no comparisons with human performance.</p><p>Similar work is that of Deng et al. <ref type="bibr" target="#b31">[32]</ref>, who use four successive prompts corresponding to the stages of an IRAC-like process (article retrieval, recognising criminal elements in facts, applying articles, providing judgment) as part of the overall task to predict judgments and penalties. The four-stage process is compared on predictive performance with a 'plain-text' method and is found to generally but not always outperform the latter.</p><p>In sum, explicit testing of IRAC-like reasoning capability. Testing is indirect in terms of predictive performance. 
The breakdown into four prompts corresponds to IRAC-style reasoning. Systematic quantitative evaluation and comparisons between prompting methods but no comparisons with human performance.</p><p>A limitation of both <ref type="bibr" target="#b30">[31]</ref> and <ref type="bibr" target="#b31">[32]</ref> is that the test data apparently only contain convictions, so that the models cannot reason about whether a suspect is guilty.</p><p>Kang et al. <ref type="bibr" target="#b32">[33]</ref> let ChatGPT evaluate scenarios of which the correct analysis is formulated with the IRAC method in a semi-structured logical language, where the issues are given. This thus tests how well ChatGPT identifies the rules, the conclusion and the reasoning steps from the facts to the conclusion. It seems that the scenarios are all chains of if-then rules, but this is not fully clear from the appendix. ChatGPT's outputs are evaluated by humans in terms of "the marking rubrics used by law schools". Then the quantitative measures precision, recall and F1 are calculated. The scores vary but are never very high. The authors first give zero-shot prompts without knowledge or examples and no request to use IRAC. When only the conclusion (yes/no) has to be provided, ChatGPT performs rather well, but the reasoning in particular is poor. Next they add, respectively, 20, 40 and 80% of the reasoning paths and observe improved scores. The same happens when examples are given in the prompt and when the problems are decomposed into subquestions (a kind of CoT prompting). This in fact encodes the rules used in the reasoning paths in the subquestions.</p><p>In sum, explicit testing of reasoning capability, namely, IRAC. Testing is direct, since ChatGPT is evaluated on how well it can reproduce pre-encoded IRAC structures. Various zero- and few-shot prompting methods are used, giving more or less of the desired solution. 
Systematic quantitative evaluation, comparisons between prompting methods but not with human performance. An important thing to note is that considerable structure is added to the prompts.</p><p>Blair-Stanek et al. <ref type="bibr" target="#b33">[34]</ref> test how well GPT-3 can perform "statutory reasoning", which they essentially see as deductive rule application including chaining. They use a data set containing non-ambiguous tax laws and test cases with unique correct answers. The questions GPT-3 has to answer are of the form 'Premise -Hypothesis' and GPT-3 has to answer whether the relation between them is 'entailment' or 'contradiction'. Several zero- and few-shot prompting methods are used, with and without including a relevant statute or examples, and some also including 'let's think step-by-step'. The prompts ask to do 'Entailment/Contradiction reasoning' but do not explain what it is. GPT-3 is numerically evaluated in terms of accuracy and scores between 38 and 74%, which the authors regard as disappointing. Interestingly, the authors also tested GPT-3 on a set of simple 'synthetic' statutes with meaningless terms (rules with 2 or 3 conditions, chains with 2 or 3 rules), to test to what extent GPT-3 uses implicit knowledge. Here GPT-3 performed even worse. The issue of implicit knowledge is also discussed more generally by the authors, as well as the possibility that GPT-3 may have 'seen' the data set (which is public). The authors conclude that their experiments raise "doubts about GPT-3's ability to handle basic legal work". Here it should be noted that GPT-3 is no longer state-of-the-art and that its successor GPT-4 generally performs much better on many tasks.</p><p>In sum, explicit testing of reasoning capability, namely, deductive rule application with chaining, with awareness that the model might apply implicit knowledge of the statutes. Testing is direct. 
The prompts ask to do 'Entailment/Contradiction reasoning' but do not explain what it is. Systematic quantitative evaluation and comparisons between prompting methods but no comparisons with human performance.</p><p>Guha et al. <ref type="bibr" target="#b34">[35]</ref> present the LegalBench legal reasoning benchmark for six legal tasks, corresponding to the stages of the IRAC model plus two related tasks. The datasets for the six tasks are restricted to clear cases with objectively correct answers. The authors then apply various LLMs to these tasks, where the prompts contain between zero and eight example answers and an instruction to the LLM to explain its reasoning. For all tasks, GPT-4 performed the best, with accuracies between 59.2 and 89.9%. The authors note that their experiments should be seen as providing lower bounds on performance since they see considerable scope for improvements.</p><p>In sum, explicit testing of reasoning capability, namely, IRAC with chaining. Testing is thus direct. The prompts can contain examples of the expected reasoning and give the instruction to explain the reasoning. Systematic quantitative evaluation and comparisons between LLMs in terms of accuracy but no comparisons between prompting methods or with human performance.</p><p>Trozze et al. <ref type="bibr" target="#b23">[24]</ref> also tested ChatGPT with GPT-3.5 on the task of identifying laws that are potentially being violated in a brief factual scenario. The evaluation was in terms of the laws that were actually mentioned in the case. The prompt asked ChatGPT to apply the IRAC method, but it only mentioned IRAC and did not explain it. Moreover, its application was not explicitly tested. Instead, the quantitative measures precision (0.658), recall (0.252) and F1 (0.324) were calculated. 
The authors concluded from these scores that ChatGPT's performance was overall poor.</p><p>In sum, the prompt asks to apply the IRAC method but there is no explicit testing of whether it was applied. Testing is indirect, with as proxy how often ChatGPT mentions a law also mentioned in the case. Systematic quantitative evaluation but no comparison with human performance.</p><p>Servantez et al. <ref type="bibr" target="#b35">[36]</ref> propose an IRAC-inspired prompting method called 'Chain of logic'. Each prompt contains an example of a rule, a fact pattern and an issue, the rule's decomposition into elements (the conditions and the conclusion) and a formalisation of the rule in propositional logic. Then the example answers each rule element separately, gives the logical expression over the conditions yielded by the answers, and resolves it to give the final answer. Thus the model should learn in one shot from the example to apply this IRAC-style process. The authors apply this to several rule-based tasks from the LegalBench legal reasoning benchmark <ref type="bibr" target="#b34">[35]</ref>. They apply five large language models, including GPT-4, and compare the accuracy of their prompting method to several zero- or few-shot prompting methods. Their method outperforms all other methods for all LLMs, although not by wide margins. With GPT-4 they obtain 92.3% accuracy, while the worst-performing method scores 86.3%. The authors conclude that, compared to the literature, their method is the only few-shot method that consistently outperforms zero-shot prompting. Limitations of this study are that the rules in LegalBench are simpler than in reality and that the method only seems to work for single-step rule application.</p><p>In sum, explicit testing of reasoning capability, namely, IRAC-style deductive rule application without chaining. Testing is direct. The prompts give a detailed example of the expected reasoning. 
Systematic quantitative evaluation and comparisons between prompting methods in terms of accuracy but no comparisons with human performance.</p></div>
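The 'chain of logic' resolution step described above can be sketched in code: each rule element is answered separately, and the rule's propositional expression over those answers is then resolved into a final answer. The rule, element names and helper function below are hypothetical illustrations, not code from the paper.

```python
# Minimal sketch (assumed, not from the 'Chain of logic' paper): resolving a
# rule's propositional expression over separately obtained per-element answers.

def resolve_rule(element_answers, expression):
    """Evaluate a propositional expression (using and/or/not) whose variables
    are rule elements, each bound to its separately obtained boolean answer."""
    # Restrict builtins so only the boolean expression itself is evaluated.
    return bool(eval(expression, {"__builtins__": {}}, dict(element_answers)))

# Hypothetical rule: liability requires (duty and breach) or strict_liability.
answers = {"duty": True, "breach": False, "strict_liability": True}
print(resolve_rule(answers, "(duty and breach) or strict_liability"))  # True
```

The point of the decomposition is that the model commits to an answer per element before the logical combination, rather than judging the whole rule at once.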
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Discussion</head><p>In this section we discuss what can be learned from the preceding overview. Here it should be taken into account that studies that are only published on arXiv have presumably not been peer reviewed.</p><p>The studies involving exams and document-generation tasks do not explicitly test any specific reasoning capability, which makes it hard to draw firm conclusions from them on such capabilities, since they do not distinguish between the possession of legal knowledge and the ability to apply it. The other studies do explicitly test reasoning capabilities, namely some form of deductive reasoning with legal rules, often structured in terms of the IRAC model. Some studies do not explain the reasoning method they ask for, while others explain it with examples. A commonplace in both legal philosophy and AI &amp; law is that deductive rule application is far too simplistic as a full model of legal reasoning. The exam and document-generation studies could implicitly test full-fledged argumentation capabilities, including the use of case- or value-based reasoning and the consideration of conflicting arguments. However, whether they do is hard to tell from the publications. This is a point on which computational models of legal argument could be useful, namely, as standards for the argumentative outputs of legal generative AI.</p><p>Most studies that make comparisons do so between several prompting methods or several LLMs. Two studies compare AI with human performance, namely, <ref type="bibr" target="#b27">[28]</ref> and <ref type="bibr" target="#b26">[27]</ref>, which both conclude that the model can pass American bar or law school exams and thus imply that the models can take these exams at the level of human law trainees or law-school students. However, passing such exams is only a rough proxy for having legal reasoning and argumentation abilities. 
Whether the various reported scores on rule-application tasks should count as positive is hard to tell. In any case, knowledge-based legal AI would score perfectly on formalised versions of these tasks, while such systems naturally allow for two further forms of evaluation besides experimental evaluation of outputs: evaluation of the explicitly represented knowledge and of the explicitly programmed reasoning model. Therefore, given that so much legal knowledge is explicitly available, I believe that symbolic AI &amp; law applications can still be practically useful, either stand-alone, or combined with generative AI as 'conversational interfaces' between the users' natural language and the system's formal language.</p><p>Some studies include reasoning instructions of varying levels of detail in the prompt and/or verify to what extent the model's output obeys these instructions. A general trend in the results is that such prompting methods improve performance, but not consistently. Moreover, there are some methodological pitfalls here. The first is memorisation. Questions (for instance, bar exam questions) may be in the training data, so the model may have seen them before, or the model may in other ways have applied 'shortcuts' included in its statistical language model. Some of the discussed studies show awareness of these issues <ref type="bibr" target="#b33">[34,</ref><ref type="bibr" target="#b23">24]</ref>.</p><p>Next, even if a model structures its output according to some reasoning method, it may be that the model has not followed that method. Striking examples are reported by <ref type="bibr" target="#b36">[37]</ref>, who found that GPT-3.5, when used with CoT prompting, does not always behave according to the reason it says it applied. A simple example involves multiple-choice questions with two options A and B. When GPT-3.5 is only shown examples with A as the correct answer, it tends to prefer answer A and gives a reason for A even if B is the correct answer. 
Thus the reason GPT-3.5 gives for its answer is not the reason it applied. More worrying examples involve racial and gender biases.</p><p>It might be argued that in legal applications this is not a serious problem, since in the law all that matters is the justification as it is given, for that is what the parties, appeal courts and the general public use to assess the quality and acceptability of a decision. In philosophical terms, it is not the context of discovery but the context of justification that matters. However, against this it can be argued that when alternative decisions are legally acceptable, it is still undesirable that the choice of a particular decision, and of which arguments and evidence to include in it, is influenced by bias. This arguably holds all the more for texts that do not contain decisions but standpoints of the parties, such as summonses, complaints or briefs.</p><p>Regardless of this discussion, another way in which symbolic computational models of legal argument could be useful is in formulating reasoning instructions in the prompt. A natural idea is to formulate few-shot or CoT prompts in terms of some theory of rational reasoning or decision-making. It might be said that (legal) prompt engineering is applied (legal) philosophy.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>Research on legal-reasoning capabilities of generative AI is rapidly emerging but still inconclusive as regards quality or practical usefulness. If reasoning models are made explicit, then they are (almost?) always some simple deductive form of rule application, which is generally regarded as too simplistic as a full-fledged model of legal argument. The possible roles of symbolic computational models of legal argument are threefold: as guidance for prompt engineering, as standards for evaluating outputs of legal generative AI, and as symbolic alternatives to legal generative AI, possibly combined with the latter as conversational interfaces. In any case, traditional symbolic AI research on legal reasoning and argumentation is not yet obsolete.</p></div>		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Fundamentals of Critical Argumentation</title>
		<author>
			<persName><forename type="first">D</forename><surname>Walton</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2006">2006</date>
			<publisher>Cambridge University Press</publisher>
			<pubPlace>Cambridge</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Reflections on TAXMAN: An experiment in artificial intelligence and legal reasoning</title>
		<author>
		<persName><forename type="first">L</forename><surname>McCarty</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Harvard Law Review</title>
		<imprint>
			<biblScope unit="volume">90</biblScope>
			<biblScope unit="page" from="89" to="116" />
			<date type="published" when="1977">1977</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Law and logic: A review from an argumentation perspective</title>
		<author>
			<persName><forename type="first">H</forename><surname>Prakken</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sartor</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Artificial Intelligence</title>
		<imprint>
			<biblScope unit="volume">227</biblScope>
			<biblScope unit="page" from="214" to="225" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">HYPO&apos;s legacy: introduction to the virtual special issue</title>
		<author>
			<persName><forename type="first">T</forename><surname>Bench-Capon</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Artificial Intelligence and Law</title>
		<imprint>
			<biblScope unit="volume">25</biblScope>
			<biblScope unit="page" from="205" to="250" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Logical models of legal argumentation</title>
		<author>
			<persName><forename type="first">H</forename><surname>Prakken</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The Handbook of Rationality</title>
				<editor>
			<persName><forename type="first">M</forename><surname>Knauff</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">W</forename><surname>Spohn</surname></persName>
		</editor>
		<meeting><address><addrLine>Cambridge, MA</addrLine></address></meeting>
		<imprint>
			<publisher>MIT Press</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="669" to="677" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">A logical analysis of burdens of proof</title>
		<author>
			<persName><forename type="first">H</forename><surname>Prakken</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sartor</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Legal Evidence and Proof: Statistics, Stories, Logic</title>
				<editor>
			<persName><forename type="first">H</forename><surname>Kaptein</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Prakken</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">B</forename><surname>Verheij</surname></persName>
		</editor>
		<meeting><address><addrLine>Farnham</addrLine></address></meeting>
		<imprint>
			<publisher>Ashgate Publishing</publisher>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="223" to="253" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">The British Nationality Act as a logic program</title>
		<author>
			<persName><forename type="first">M</forename><surname>Sergot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Sadri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kowalski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Kriwaczek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Hammond</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Cory</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Communications of the ACM</title>
		<imprint>
			<biblScope unit="volume">29</biblScope>
			<biblScope unit="page" from="370" to="386" />
			<date type="published" when="1986">1986</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Logic programming for large scale applications in law: a formalisation of supplementary benefit legislation</title>
		<author>
			<persName><forename type="first">T</forename><surname>Bench-Capon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Robinson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Routen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sergot</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the First International Conference on Artificial Intelligence and Law</title>
				<meeting>the First International Conference on Artificial Intelligence and Law<address><addrLine>New York</addrLine></address></meeting>
		<imprint>
			<publisher>ACM Press</publisher>
			<date type="published" when="1987">1987</date>
			<biblScope unit="page" from="190" to="198" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">The Pleadings Game: an exercise in computational dialectics</title>
		<author>
			<persName><forename type="first">T</forename><surname>Gordon</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Artificial Intelligence and Law</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="239" to="292" />
			<date type="published" when="1993">1993</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Argument-based extended logic programming with defeasible priorities</title>
		<author>
			<persName><forename type="first">H</forename><surname>Prakken</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sartor</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Applied Non-classical Logics</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="page" from="25" to="75" />
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">A methodology for designing systems to reason with legal cases using abstract dialectical frameworks</title>
		<author>
			<persName><forename type="first">L</forename><surname>Al-Abdulkarim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Atkinson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Bench-Capon</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Artificial Intelligence and Law</title>
		<imprint>
			<biblScope unit="volume">24</biblScope>
			<biblScope unit="page" from="1" to="50" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Implementing temporal defeasible logic for modeling legal reasoning</title>
		<author>
			<persName><forename type="first">G</forename><surname>Governatori</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rotolo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Rubino</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">JSAI-isAI 2009 Workshops, LENLS, JURISIN</title>
		<title level="s">Springer Lecture Notes in AI</title>
		<meeting><address><addrLine>KCSD, LLLL, Tokyo, Japan; Berlin</addrLine></address></meeting>
		<imprint>
			<publisher>Springer Verlag</publisher>
			<date type="published" when="2009">November 19-20, 2009. 2010</date>
			<biblScope unit="volume">6284</biblScope>
			<biblScope unit="page" from="45" to="58" />
		</imprint>
	</monogr>
	<note>Revised Selected Papers</note>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">A case-based system for trade secrets law</title>
		<author>
			<persName><forename type="first">E</forename><surname>Rissland</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Ashley</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the First International Conference on Artificial Intelligence and Law</title>
				<meeting>the First International Conference on Artificial Intelligence and Law<address><addrLine>New York</addrLine></address></meeting>
		<imprint>
			<publisher>ACM Press</publisher>
			<date type="published" when="1987">1987</date>
			<biblScope unit="page" from="60" to="66" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Modeling Legal Argument: Reasoning with Cases and Hypotheticals</title>
		<author>
			<persName><forename type="first">K</forename><surname>Ashley</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1990">1990</date>
			<publisher>MIT Press</publisher>
			<pubPlace>Cambridge, MA</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Using background knowledge in case-based legal reasoning: a computational model and an intelligent learning environment</title>
		<author>
			<persName><forename type="first">V</forename><surname>Aleven</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Artificial Intelligence</title>
		<imprint>
			<biblScope unit="volume">150</biblScope>
			<biblScope unit="page" from="183" to="237" />
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Representing teleological structure in case-based legal reasoning: the missing link</title>
		<author>
			<persName><forename type="first">D</forename><surname>Berman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hafner</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Fourth International Conference on Artificial Intelligence and Law</title>
				<meeting>the Fourth International Conference on Artificial Intelligence and Law<address><addrLine>New York</addrLine></address></meeting>
		<imprint>
			<publisher>ACM Press</publisher>
			<date type="published" when="1993">1993</date>
			<biblScope unit="page" from="50" to="59" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Predicting trade secret case outcomes using argument schemes and learned quantitative value effect tradeoffs</title>
		<author>
			<persName><forename type="first">M</forename><surname>Grabmair</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 16th International Conference on Artificial Intelligence and Law</title>
				<meeting>the 16th International Conference on Artificial Intelligence and Law<address><addrLine>New York</addrLine></address></meeting>
		<imprint>
			<publisher>ACM Press</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="89" to="98" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">An implementation of Eisner v. Macomber</title>
		<author>
			<persName><forename type="first">L</forename><surname>McCarty</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Fifth International Conference on Artificial Intelligence and Law</title>
				<meeting>the Fifth International Conference on Artificial Intelligence and Law<address><addrLine>New York</addrLine></address></meeting>
		<imprint>
			<publisher>ACM Press</publisher>
			<date type="published" when="1995">1995</date>
			<biblScope unit="page" from="276" to="286" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">A model of legal reasoning with cases incorporating theories and values</title>
		<author>
			<persName><forename type="first">T</forename><surname>Bench-Capon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sartor</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Artificial Intelligence</title>
		<imprint>
			<biblScope unit="volume">150</biblScope>
			<biblScope unit="page" from="97" to="143" />
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">CABARET: statutory interpretation in a hybrid architecture</title>
		<author>
			<persName><forename type="first">E</forename><surname>Rissland</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Skalak</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Man-Machine Studies</title>
		<imprint>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page" from="839" to="887" />
			<date type="published" when="1991">1991</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Legal reasoning with argumentation schemes</title>
		<author>
			<persName><forename type="first">T</forename><surname>Gordon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Walton</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Twelfth International Conference on Artificial Intelligence and Law</title>
				<meeting>the Twelfth International Conference on Artificial Intelligence and Law<address><addrLine>New York</addrLine></address></meeting>
		<imprint>
			<publisher>ACM Press</publisher>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="137" to="146" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Perlman</surname></persName>
		</author>
		<ptr target="http://ssrn.com/abstract=4294197" />
		<title level="m">The implications of ChatGPT for legal services and society</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<title level="m" type="main">ChatGPT by OpenAI: the end of litigation lawyers?</title>
		<author>
			<persName><forename type="first">K</forename><surname>Iu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V.-Y</forename><surname>Wong</surname></persName>
		</author>
		<ptr target="https://ssrn.com/abstract=4339839" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Large language models in cryptocurrency securities cases: can a GPT model meaningfully assist lawyers?</title>
		<author>
			<persName><forename type="first">A</forename><surname>Trozze</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Davies</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Kleinberg</surname></persName>
		</author>
		<idno type="DOI">10.1007/s10506-024-09399-6</idno>
		<ptr target="https://doi.org/10.1007/s10506-024-09399-6" />
	</analytic>
	<monogr>
		<title level="j">Artificial Intelligence and Law</title>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<title level="m" type="main">Legal prompting: teaching a language model to think like a lawyer</title>
		<author>
			<persName><forename type="first">F</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Quartey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Schilder</surname></persName>
		</author>
		<idno>ArXiv:2212.01326</idno>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Overview and discussion of the competition on legal information extraction/entailment (COLIEE)</title>
		<author>
			<persName><forename type="first">J</forename><surname>Rabelo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Goebel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-Y</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Kano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Yoshioka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Satoh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">The Review of Socionetwork Strategies</title>
		<imprint>
			<biblScope unit="volume">16</biblScope>
			<biblScope unit="page" from="111" to="133" />
			<date type="published" when="2021">2021. 2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<monogr>
		<title level="m" type="main">ChatGPT goes to law school</title>
		<author>
			<persName><forename type="first">J</forename><surname>Choi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Hickman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Monahan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Schwarcz</surname></persName>
		</author>
		<idno type="DOI">10.2139/ssrn.4335905</idno>
		<ptr target="https://doi.org/10.2139/ssrn.4335905" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<monogr>
		<title level="m" type="main">GPT-4 passes the bar exam</title>
		<author>
			<persName><forename type="first">D</forename><surname>Katz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bommarito</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Arredondo</surname></persName>
		</author>
		<ptr target="https://ssrn.com/abstract=4389233" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Re-evaluating GPT-4&apos;s bar exam performance</title>
		<author>
			<persName><forename type="first">E</forename><surname>Martínez</surname></persName>
		</author>
		<idno type="DOI">10.1007/s10506-024-09396-9</idno>
		<ptr target="https://doi.org/10.1007/s10506-024-09396-9" />
	</analytic>
	<monogr>
		<title level="j">Artificial Intelligence and Law</title>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Nay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Karamardian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Lawsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Tao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bhat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Jain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">T</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Choi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kasai</surname></persName>
		</author>
		<idno>ArXiv:2306.07075</idno>
		<title level="m">Large language models as tax attorneys: a case study in legal capabilities emergence</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">Legal syllogism prompting: teaching large language models for legal judgment prediction</title>
		<author>
			<persName><forename type="first">C</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Yang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Nineteenth International Conference on Artificial Intelligence and Law</title>
				<meeting>the Nineteenth International Conference on Artificial Intelligence and Law<address><addrLine>New York</addrLine></address></meeting>
		<imprint>
			<publisher>ACM Press</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="417" to="421" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">Syllogistic reasoning for legal judgment analysis</title>
		<author>
			<persName><forename type="first">W</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Kong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Ren</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2023 on Empirical Methods in Natural Language Processing</title>
				<meeting>the 2023 on Empirical Methods in Natural Language Processing</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="13997" to="14009" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<monogr>
		<title level="m" type="main">Can ChatGPT perform reasoning using the IRAC method in analyzing legal scenarios like a lawyer?</title>
		<author>
			<persName><forename type="first">X</forename><surname>Kang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Qu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L.-K</forename><surname>Soon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Trakic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Zhuo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Emerton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Grant</surname></persName>
		</author>
		<idno>ArXiv:2310.14880</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">Can GPT-3 perform statutory reasoning?</title>
		<author>
			<persName><forename type="first">A</forename><surname>Blair-Stanek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Holzenberger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Van Durme</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Nineteenth International Conference on Artificial Intelligence and Law</title>
				<meeting>the Nineteenth International Conference on Artificial Intelligence and Law<address><addrLine>New York</addrLine></address></meeting>
		<imprint>
			<publisher>ACM Press</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="22" to="31" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<monogr>
		<title level="m" type="main">LEGALBENCH: a collaboratively built benchmark for measuring legal reasoning in large language models</title>
		<author>
			<persName><forename type="first">N</forename><surname>Guha</surname></persName>
		</author>
		<idno>ArXiv:2308.11462</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<monogr>
		<title level="m" type="main">Chain of logic: rule-based reasoning with large language models</title>
		<author>
			<persName><forename type="first">S</forename><surname>Servantez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Barrow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Hammond</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Jain</surname></persName>
		</author>
		<idno>ArXiv:2402.10400</idno>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<analytic>
		<title level="a" type="main">Language models don&apos;t always say what they think: unfaithful explanations in chain-of-thought prompting</title>
		<author>
			<persName><forename type="first">M</forename><surname>Turpin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Michael</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Perez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bowman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
