<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>On Evaluating Legal-Reasoning Capabilities of Generative AI</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Henry Prakken</string-name>
          <email>h.prakken@uu.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information and Computing Sciences, Faculty of Science, Utrecht University</institution>
          ,
          <addr-line>Utrecht</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper critically examines some recent studies of the legal-reasoning capabilities of generative AI. It also discusses which roles traditional symbolic approaches can have in the era of generative AI. The introduction of ChatGPT by OpenAI in November 2022 was a 'big bang' in AI. Never before was an AI tool available to so many people and so easy to use for so many different tasks. The ease with which it generates flawless natural language for a wide variety of tasks such as summarising documents, writing essays about any given topic, writing poems, drafting travel plans, outlining presentations, and even solving computer-programming exercises is amazing. And all this essentially with the simple technique of predicting the most likely next word in a sequence of words. It is therefore easy to think that traditional symbolic AI research on reasoning and argumentation is now obsolete and that the right way to let the computer engage in reasoning and argumentation is by using generative AI founded on large language models.</p>
      </abstract>
      <kwd-group>
        <kwd>Legal argumentation</kwd>
        <kwd>Large language models</kwd>
        <kwd>Evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>We first give a brief overview of AI &amp; law research on modelling legal argument in Section 2. Then we make some methodological observations in Section 3 and review recent experiments in applying LLMs to legal reasoning in Section 4. We then discuss what the field of computational argumentation can learn from these studies in Section 5, after which we conclude.</p>
      <p>© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
    </sec>
    <sec id="sec-3">
      <title>2. Brief overview of AI &amp; law research on modelling legal argument</title>
      <p>
        Argumentation is “…the giving of reasons to support or criticize a claim that is questionable,
or open to doubt” [1, p. 285]. The field of AI &amp; Law has developed formal and computational
models of legal argumentation since [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. For overviews see [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
        ]. Both rule- and case-based
approaches have been applied, initially as alternatives but more recently as complementing
each other, since case-based reasoning often is about whether a rule’s conditions are satisfied.
Rule-based approaches have to account for the defeasibility of legal rules. Since rule-makers
cannot foresee everything, rule-appliers sometimes have to make exceptions in unforeseen
circumstances. Defeasibility also arises because of presumptions and allocations of burdens
of proof [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Several early rule-based accounts of legal reasoning used some form of
logic programming [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. Later, explicitly argument-based formalisms were applied or developed
[
        <xref ref-type="bibr" rid="ref10 ref11 ref9">9, 10, 11</xref>
        ], as well as formalisms with an argumentative flavour such as Defeasible Logic [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]
and abstract dialectical frameworks [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        Case-based approaches were initially developed for Anglo-American jurisdictions, where case
law rather than legislation is traditionally the main source of law and courts have to decide new
cases by drawing analogies to decided cases. Case-based approaches must account for the fact
that cases often have not just similarities but also differences. Seminal work
was on HYPO [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ], in which cases were modelled with sets of features called dimensions,
which have partially ordered values that make cases better or worse for a particular outcome.
HYPO generated three-ply arguments between a plaintiff and a defendant in a civil dispute,
drawing analogies between cases or distinguishing them from the parties' respective points of view. In later
work this was refined in many ways, for instance, by distinguishing features of various levels of
abstraction [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] or by comparing cases in terms of how case decisions promote or demote legal
or social values [
        <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
        ]. Perhaps the most ambitious approaches are coherence-based accounts,
which model the construction of legal theories of some kind that explain a set of cases and
where the most coherent theory that does so should be adopted [
        <xref ref-type="bibr" rid="ref18 ref19 ref2">2, 18, 19</xref>
        ].
      </p>
      <p>
        While rule- and case-based approaches were initially presented as alternatives, it later
became clear that they complement each other, since case-based reasoning often concerns
whether a rule's conditions are satisfied. A challenge for rule-based accounts is that the
conditions of legal rules are often vague and general, and no clear rules can be given for when
they are satisfied. Here case-based approaches can complement rule-based approaches by
providing forms of argumentation for interpreting legal concepts [
        <xref ref-type="bibr" rid="ref20 ref21 ref4">20, 21, 4</xref>
        ].
      </p>
      <p>In sum, AI &amp; law has developed rich models of various forms of legal argument, including
rule-based, case-based and value-based accounts, which draw on various sources, including
legislation, case law and social and moral value considerations. Moreover, all this work takes
a knowledge-based approach: the required knowledge is encoded in a symbolic form that is
understandable for the machine and the computer reasons with it in a formally defined way,
ideally based on the laws of logic and rational reasoning. The advantages of this approach
are transparency and explainability: humans can see which knowledge the machine uses and
the machine can explain its outcomes by showing how it reasoned with this knowledge. A
big disadvantage of this approach is that it is often hard to acquire and represent a sufficient
amount of knowledge in a form that can be manipulated by the machine. This is the notorious
knowledge acquisition bottleneck. Hence the attractiveness of large language models as a means
to generate legal argument.</p>
    </sec>
    <sec id="sec-4">
      <title>3. Methodological remarks</title>
      <p>Evaluation of knowledge-based AI applications can be done at three levels: evaluating the
inputted knowledge, evaluating the reasoning mechanism (for instance, on whether it implements
some philosophically acceptable model of rational reasoning, or on soundness and
completeness properties with respect to such a model) and evaluating the output. When evaluating
applications of generative AI, evaluation at the first two levels becomes hard, so often only the
system’s output is evaluated. Moreover, this output is natural language instead of some formal
language, with all the ambiguity and vagueness that comes with it, so interpreting the output is
not always an easy task. In consequence, evaluation studies of generative AI are inherently
experimental, often statistical and can involve subjective elements.</p>
      <sec id="sec-4-1">
        <title>3.1. Terminology: prompt engineering</title>
        <p>A well-known drawback of LLMs is that there is no connection between the large statistical
language model learned by the LLM and reality. All that an LLM ‘knows’ is how often words go
together in similar contexts. This often causes an LLM to 'hallucinate' facts. While this may not
be a problem for creative applications like writing prose or poetry, it is a serious problem when
an LLM is asked to produce high-quality information or arguments in high-stakes contexts; and
legal contexts are often high-stakes.</p>
        <p>There is much research on addressing this problem. Much of it involves prompt engineering,
that is, ingenious ways of writing the prompts that form the user input of LLM applications.
Zero-shot prompts do not contain any examples of desired output but directly ask a question or
specify a task. Few-shot prompts do provide such examples. Chain-of-thought (CoT) prompting
consists of ways to ask the model to ‘think’ step-by-step rather than solving a complex problem
at once. Zero-Shot CoT does just that while few-shot CoT methods combine it with examples
of desired output. Such prompts are often formulated as problems of pattern completion, which
consist of showing the model a pattern of expected answers: this increases the probability
that the model will indeed give an answer in terms of this pattern. Yet another way in which
prompts can be engineered is to include one or more documents that have to be taken into
account in the model’s answers. When these documents are retrieved from other sources after
entering a prompt, this is called retrieval-augmented generation. In legal applications it makes
sense to include or retrieve legislation or case law.</p>
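        <p>To make the distinctions above concrete, the following minimal sketch (our illustration, not taken from any reviewed study; the question and example texts are invented placeholders) shows how zero-shot, few-shot and zero-shot CoT prompts differ when built as plain strings:</p>
        <preformat>
```python
# Illustrative prompt builders for the three prompting styles described above.
# The question and example texts are invented placeholders.

def zero_shot(question):
    """Zero-shot: state the task directly, with no examples."""
    return question

def few_shot(question, examples):
    """Few-shot: prefix the task with worked examples of the desired output."""
    shots = "\n\n".join("Q: " + q + "\nA: " + a for q, a in examples)
    return shots + "\n\nQ: " + question + "\nA:"

def zero_shot_cot(question):
    """Zero-shot chain-of-thought: ask the model to reason step by step."""
    return question + "\nLet's think step by step."

examples = [("Given Article X, is hypothesis H1 true or false?", "False")]
prompt = few_shot("Given Article Y, is hypothesis H2 true or false?", examples)
```
        </preformat>
        <p>Retrieval-augmented generation fits the same mould: the retrieved documents are simply prepended to the question before the prompt is sent.</p>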
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Questions asked about the studies</title>
        <p>In this paper we will review all serious recent studies of the legal-reasoning capabilities of
generative AI that we know of, paying attention, among other things, to their prompt-engineering
methods. We will ask the following questions about the reviewed studies.</p>
        <p>• Which reasoning capability is tested and according to which reasoning model?
• How direct was the testing? Were proxies for reasoning abilities used?
• Which method of prompt engineering is used?
• How systematic is the evaluation? Is it subjective or objective, qualitative or quantitative?
• What is compared? LLMs or prompting methods against each other or also against human
performance?</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Recent experiments on legal reasoning by LLMs</title>
      <p>All reviewed studies concern tasks that involve legal reasoning, though in different ways. Some
studies are about answering exam questions, other studies involve specific reasoning tasks (mostly
rule application) and some studies are about the generation of legal documents that typically
contain argumentation. Many studies apply or refer to the IRAC method of legal reasoning,
popular in Anglo-American legal education. IRAC stands for Issue-Rule-Application-Conclusion.
Here, Issue is the task of determining the legal issue of a case, Rule is the task of identifying the
relevant legal rules (which can also be precedents), Application is the task of determining how
the rules should be applied to the facts, and Conclusion is the task of drawing a legal conclusion
from the rule application. While in reality issue spotting can be far from trivial, in all studies
reported below the issue is in fact given. Note also that the IRAC model abstracts
from all the AI &amp; law models of legal reasoning discussed above in Section 2.</p>
      <sec id="sec-5-1">
        <title>4.1. Studies on document generation</title>
        <p>There are a few studies on legal document generation.</p>
        <p>
          Perlman [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ], in an informal experiment with zero-shot prompts, asks ChatGPT, among
other things, to suggest arguments to make in a brief about a particular legal issue, to draft
a legal complaint and to perform an initial legal analysis of a brief factual scenario. Perlman
then gives his own informal qualitative opinion, observing among other things that ChatGPT’s
output is “surprisingly sophisticated” though “incomplete and problematic in numerous ways”.
The outputs “would not be sufficiently helpful in their current forms for most people”.
        </p>
        <p>
          Iu &amp; Wong [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] conduct a similar informal experiment on the basis of a simplified description
of the facts in a well-known American case, asking ChatGPT with zero-shot prompts to perform
various writing tasks. Some of these tasks involve the production of legal arguments, such as
drafting a pleading claim, drafting a skeleton argument with the support of case law and drafting
a judgement considering both sides. The authors then subjectively evaluate the documents,
observing among other things that ChatGPT “demonstrated its ability to understand simple
facts and articulate the legal basis of a claim”, “was able to …summarise the key facts of relevant
case law to support the plaintiff’s case”, and “was able to apply the reasoning of case law to
the simple facts of the case, thus demonstrating an ability to follow the IRAC approach in
writing the skeleton argument”. When applied to a second, more complicated case, ChatGPT
“performed excellently” in drafting skeleton arguments and “was able to draft the judgment by
considering the arguments of both sides with logical reasoning”.
        </p>
        <p>In sum, neither of these studies explicitly tests reasoning capability, so testing is indirect;
the prompts were zero-shot, evaluation is unsystematic and subjective, and no comparisons are
made. Because of the lack of explicit and objective evaluation standards, these studies cannot
provide valid and reliable results, though they can have heuristic value.</p>
        <p>
          Trozze et al. [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ], in the domain of cryptocurrency security cases, tested ChatGPT on
writing a complaint for a class action lawsuit. The complaint was compared to one written by a
lawyer by letting a mock jury decide on the basis of both. The prompt only asked ChatGPT
to write the various parts of a complaint and did not give reasoning instructions. ChatGPT
was then evaluated in terms of how often the jury gave the same decision on the basis of
both complaints. In 88% of the cases with the lawyer-drafted complaint, the jurors were
convinced that the allegations were proven; for the AI-drafted complaints this figure
was 80%. The authors conclude from this that “Overwhelmingly, ChatGPT drafted convincing
complaints, which performed only slightly worse than the lawyer-drafted ones”. More generally,
the authors conclude that ChatGPT is better in drafting legal documents than in statutory
reasoning (citing others for the same conclusion).
        </p>
        <p>In sum, the prompt gave no reasoning instructions and there was no explicit testing of reasoning
capability. Testing is thus indirect, with how often ChatGPT's complaint convinces the jury
compared to the human lawyer's as a proxy. There is systematic quantitative evaluation but no real
comparison with human performance, since the human lawyer's performance is used as the evaluation standard.</p>
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Studies on exam performance</title>
        <p>Several studies let the models take legal exams or answer exam questions. These studies mostly
indirectly evaluate legal-reasoning capacities, since many exam questions do not directly test the
student’s reasoning or argumentation skills. These studies can only be regarded as evaluating
such skills on the assumption that successfully taking legal exams requires such skills. This
may not always be the case since questions can also test for the possession of legal knowledge.</p>
        <p>
          Yu et al. [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] let GPT-3 answer a type of question from the Japanese Bar exam, modelled as
an entailment task from the COLIEE competition [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ]. Given a legal rule and a legal question
(hypothesis), GPT-3 has to answer whether the hypothesis is true or false, with a brief
explanation. This looks like rule application without chaining of rules. The authors test several
prompting methods: zero-shot (simply asking whether the hypothesis is true given the rule),
few-shot (giving 1, 3 or 8 examples of desired output) and a two-stage form of CoT prompting,
first asking ‘let’s think step-by-step’ and using the output as the input for the prompt ‘therefore,
the hypothesis is (true or false)’. Answers are quantitatively evaluated in terms of the known
correct answer. The accuracy is between 61 and 75%. Then the authors fine-tune GPT-3 with
the COLIEE data set. Accuracy is between 61 and 77%. Finally, the authors ask GPT-3 in the
prompt to apply a particular reasoning method, all of which are variants of the IRAC method. The
prompts just mention the required approach but do not explain it. Accuracy is between 66 and
81%. The authors observe qualitatively that the models appear to apply the indicated
reasoning method. The authors observe that the few-shot approaches with example-and-reasoning
prompts outperform previous winners of the COLIEE competition but they do not compare
with human performance.
        </p>
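        <p>The two-stage CoT method just described can be sketched as follows (our illustration; ask_llm is a hypothetical stand-in for a real LLM API call and is stubbed here):</p>
        <preformat>
```python
# Hypothetical sketch of the two-stage chain-of-thought prompting described
# above. ask_llm stands in for a real LLM API call and is stubbed here.

def ask_llm(prompt):
    # Placeholder: a real implementation would call an LLM API.
    return "Step 1: ... Step 2: ... Hence the hypothesis holds."

def build_stage1(rule, hypothesis):
    return ("Rule: " + rule + "\nHypothesis: " + hypothesis +
            "\nLet's think step-by-step.")

def build_stage2(stage1_prompt, reasoning):
    return (stage1_prompt + "\n" + reasoning +
            "\nTherefore, the hypothesis is (true or false):")

def two_stage_cot(rule, hypothesis):
    stage1 = build_stage1(rule, hypothesis)
    reasoning = ask_llm(stage1)  # first call: elicit the reasoning
    return ask_llm(build_stage2(stage1, reasoning))  # second call: the verdict
```
        </preformat>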
        <p>In sum, explicit testing of reasoning abilities, so testing is direct; zero-shot and few-shot
prompting, some prompts ask to apply a particular mentioned but undefined reasoning method,
systematic quantitative evaluation, comparisons between prompting methods and with other
NLP methods but not with human performance.</p>
        <p>
          Choi et al. [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ] tested ChatGPT on four law school exams, each consisting of an essay
part and a multiple-choice part. They used zero-shot prompts consisting of the exam question
and for the multiple-choice part they alternatively tested CoT prompting, asking to provide
a chain of reasoning as well as giving a letter answer to the question. Three of the authors
blindly graded exams made both by ChatGPT and by students. The authors found that ChatGPT
passed all exams but that compared to the human students it “generally scored at or near the
bottom of each class”. Also, ChatGPT scored better on the multiple-choice questions than on
the essay questions, and CoT prompting performed worse than the zero-shot prompts, although
the difference was not statistically significant. As regards the essay questions, the authors
qualitatively observed that ChatGPT was poor in arguing why a rule applied to given facts and
that it did not systematically answer in terms of IRAC or some other reasoning model.
        </p>
        <p>In sum, no explicit testing of reasoning capability/model, except for a basic form of CoT
prompting for the multiple-choice questions. Testing is indirect, with the exam scores as
proxy for legal reasoning abilities. Systematic quantitative evaluation in terms of exam scores,
comparisons between prompting methods and with human performance.</p>
        <p>
          Katz et al. [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ] tested the performance of GPT-4 on a simulated version of the American
bar exam. The exam consists of a part with essay questions and a part with multiple-choice
questions. The answers on essay questions were evaluated by two academic legal experts on
the basis of a collection of “representative” good answers available online. [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ] make various
claims about GPT-4’s performance, the most important one being that it has passed the exam.
Although [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ] casts doubt on some of [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ]’s claims, he agrees that their main claim is justified.
This implies that GPT-4 performs comparably to human legal experts on the bar exam.
        </p>
        <p>In sum, no explicit testing of reasoning capability. Testing is indirect, with exam score as a
proxy for legal reasoning abilities. The zero-shot prompts correspond to the exam questions.
Systematic quantitative evaluation and comparisons with human performance.</p>
        <p>
          Nay [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ] made their own selection of multiple-choice questions (four options, of which one
correct) for American tax law, with randomly generated facts, names and numbers to ensure that
the questions cannot be in a model’s training set. They compare several prompting methods that
inject legal information to the prompt to a zero-shot prompt that simply asks the question.
One method injects potentially relevant statutes resulting from a similarity search into the
prompt. Another method directly provides as context the relevant part of the law. A final
method provides context in the form of a lecture note relevant to the question type, written
by a law professor (one of the authors). Then various LLMs, including GPT-4, are compared on
accuracy, where in some experiments the prompting method is combined with CoT prompting.
Generally, GPT-4 performs the best, while CoT improves performance but not consistently.
        </p>
        <p>In sum, implicit testing of reasoning capability (deductive rule application without chaining).
Testing is indirect. The prompts provide relevant legal information but give no information
about or explicit examples of the expected reasoning. Systematic quantitative evaluation and
comparisons between prompting methods but no comparisons with human performance.</p>
      </sec>
      <sec id="sec-5-3">
        <title>4.3. Studies on specific reasoning tasks</title>
        <p>
          Jiang &amp; Yang [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ] study how GPT-3 classifies brief factual scenarios as a criminal offence
(choice of one from eight). They include a brief explanation of the legal syllogism (basically
modus ponens with legal rules) in the prompt, without examples: ‘In the legal syllogism, the
major premise is the law article, the minor premise is the facts of the case and the conclusion is
the outcome of the judgment’. Then the prompt gives a brief factual scenario and asks GPT-3 to
‘use the legal syllogism to think and output the judgment’. The output gives a major and a minor
premise and a conclusion. The evaluation standard is the given ‘correct’ classification. GPT-3
has a higher accuracy with this method (68.5%) than with simply giving the case and asking for
the judgment (64.5%) and with zero-shot CoT prompting with ‘let’s think step-by-step’ (58.8%).
        </p>
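        <p>The prompting method just described can be sketched as follows (the instruction wording follows the quoted prompt; the scenario is an invented placeholder):</p>
        <preformat>
```python
# Sketch of the legal-syllogism prompt of Jiang and Yang as described above.
# The instruction wording follows the quoted prompt; the scenario is invented.

SYLLOGISM_INSTRUCTION = (
    "In the legal syllogism, the major premise is the law article, "
    "the minor premise is the facts of the case and the conclusion is "
    "the outcome of the judgment."
)

def syllogism_prompt(scenario):
    return (SYLLOGISM_INSTRUCTION + "\n\nFacts: " + scenario +
            "\n\nUse the legal syllogism to think and output the judgment.")
```
        </preformat>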
        <p>In sum, explicit testing of reasoning capability, namely, the legal syllogism (deductive rule
application without chaining). Testing is direct. The prompts contain an explanation of the
reasoning method and ask to apply it. Systematic quantitative evaluation and comparisons
between prompting methods but no comparisons with human performance.</p>
        <p>
          Similar work is Deng et al. 2023 [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ], who use four successive prompts corresponding
to the stages of an IRAC-like process (article retrieval, recognising criminal elements in facts,
applying articles, providing judgment) as part of the overall task to predict judgments and
penalties. The four-stage process is compared on predictive performance with a ‘plain-text’
method and is found to generally but not always outperform the latter.
        </p>
        <p>In sum, explicit testing of IRAC-like reasoning capability. Testing is indirect in terms of
predictive performance. The breakdown into four prompts corresponds to IRAC-style
reasoning. Systematic quantitative evaluation and comparisons between prompting methods but no
comparisons with human performance.</p>
        <p>
          A limitation of both [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ] and [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ] is that the test data apparently only contain convictions, so
that the models cannot reason about whether a suspect is guilty.
        </p>
        <p>
          Kang et al. [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ] let ChatGPT evaluate scenarios whose correct analysis is formulated
with the IRAC method in a semi-structured logical language, where the issues are given. This
thus tests how well ChatGPT identifies the rules, the conclusion and the reasoning steps from
the facts to the conclusion. It seems that the scenarios are all chains of if-then rules, but this
is not fully clear from the appendix. ChatGPT’s outputs are evaluated by humans in terms of
“the marking rubrics used by law schools”. Then the quantitative measures precision, recall
and F1 are calculated. The scores vary but are never very high. The authors first give zero-shot
prompts without knowledge or examples and no request to use IRAC. When only the conclusion
should be provided (yes/no), ChatGPT performs rather well but especially the reasoning is
poor. Next they add, respectively, 20, 40 and 80% of the reasoning paths and observe improved
scores. The same happens when examples are given in the prompt and when the problems are
decomposed into subquestions (a kind of CoT prompting). This in fact codes the rules used in
the reasoning paths in the subquestions.
        </p>
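        <p>For reference, the precision, recall and F1 measures used in this and other evaluations can be computed as follows (the counts are illustrative and not taken from any reviewed study):</p>
        <preformat>
```python
# Reference sketch of the precision/recall/F1 measures used in the evaluation
# above. The counts are illustrative and not taken from any reviewed study.

def precision_recall_f1(true_pos, false_pos, false_neg):
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

p, r, f1 = precision_recall_f1(true_pos=6, false_pos=2, false_neg=4)
# p = 0.75, r = 0.6, f1 = 0.9 / 1.35 = 0.666...
```
        </preformat>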
        <p>In sum, explicit testing of reasoning capability, namely, IRAC. Testing is direct, since ChatGPT
is evaluated on how well it can reproduce pre-encoded IRAC structures. Various zero- and
few-shot prompting methods are used, giving less or more of the desired solution.
Systematic quantitative evaluation, comparisons between prompting methods but not with human
performance. An important thing to note is that a considerable amount of structure is added to the prompts.</p>
        <p>
          Blair-Stanek et al. [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ] test how well GPT-3 can perform “statutory reasoning”, which they
essentially see as deductive rule application including chaining. They use a data set containing
non-ambiguous tax laws and test cases with unique correct answers. The questions GPT-3 has
to answer are of the form ‘Premise - Hypothesis’ and GPT-3 has to answer whether the relation
between them is ‘entailment’ or ‘contradiction’. Several zero- and few-shot prompting methods
are used, with and without including a relevant statute or examples, and some also including
‘let’s think step-by-step’. The prompts ask to do ‘Entailment/Contradiction reasoning’ but do
not explain what it is. GPT-3 is numerically evaluated in terms of accuracy and scores between
38 and 74%, which the authors regard as disappointing. Interestingly, the authors also tested GPT-3
on a set of simple ‘synthetic’ statutes with meaningless terms (rules with 2 or 3 conditions,
chains with 2 or 3 rules), to test to what extent GPT-3 uses implicit knowledge. Here GPT-3
performed even worse. The issue of implicit knowledge is also discussed more generally by
the authors, as well as the possibility that GPT-3 may have ‘seen’ the data set (which is public).
The authors conclude that their experiments raise “doubts about GPT-3’s ability to handle basic
legal work”. Here it should be noted that GPT-3 is no longer state of the art and
that its successor GPT-4 generally performs much better on many tasks.
        </p>
        <p>In sum, explicit testing of reasoning capability, namely, deductive rule application with
chaining, with awareness that the model might apply implicit knowledge of the statutes. Testing
is direct. The prompts ask to do ‘Entailment/Contradiction reasoning’ but do not explain what
it is. Systematic quantitative evaluation and comparisons between prompting methods but no
comparisons with human performance.</p>
        <p>Guha et al. [35] present the LegalBench legal reasoning benchmark for six legal tasks
corresponding to the stages of the IRAC model plus two related tasks. The datasets for the six
tasks are restricted to clear cases with objectively correct answers. The authors then apply
various LLMs to these tasks, where the prompts contain between zero and eight example
answers and an instruction to the LLM to explain its reasoning. For all tasks, GPT-4 performed
the best, with accuracies between 59.2 and 89.9%. The authors note that their experiments
should be seen as providing lower bounds on performance since they see considerable scope
for improvements.</p>
        <p>In sum, explicit testing of reasoning capability, namely, IRAC with chaining. Testing is thus
direct. The prompts can contain examples of the expected reasoning and give the instruction to
explain the reasoning. Systematic quantitative evaluation and comparisons between LLMs in
terms of accuracy but no comparisons between prompting methods or with human performance.</p>
        <p>
          Trozze et al. [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ] also tested ChatGPT with GPT-3.5 on the task of identifying laws that
are potentially being violated in a brief factual scenario. The evaluation was in terms of the
laws that were actually mentioned in the case. The prompt asked ChatGPT to apply the IRAC
method, but it only mentioned IRAC and did not explain it. Moreover, its application was not
explicitly tested. Instead, the quantitative measures precision (0.658), recall (0.252) and F1 (0.324)
were calculated. The authors concluded from these scores that ChatGPT’s performance was
overall poor.
        </p>
        <p>In sum, the prompt asks to apply the IRAC method but no explicit testing of whether it was
applied. Testing is indirect, with as proxy how often ChatGPT mentions a law also mentioned
in the case. Systematic quantitative evaluation but no comparison with human performance.</p>
        <p>Servantez et al. [36] propose an IRAC-inspired prompting method called ‘Chain of logic’.
Each prompt contains an example of a rule, a fact pattern and an issue, the rule’s decomposition
in elements (the conditions and the conclusion) and a formalisation of the rule in propositional
logic. Then the example answers each rule element separately, gives the logical expression for
the conditions yielded by the answer, and resolves it to give the final answer. Thus the model
should in one shot learn to apply this IRAC-style process from the example. The authors apply
this to several rule-based tasks from the LegalBench legal reasoning benchmark [35]. They
apply five large language models, including GPT-4, and compare the accuracy of their prompting
method to several zero- or few-shot prompting methods. Their method outperforms all other
methods for all LLMs, although not by wide margins. With GPT-4 they obtain 92.3% accuracy,
while the worst-performing method scores 86.3%. The authors conclude that, compared to the
literature, their method is the only few-shot method that consistently outperforms zero-shot
prompting. Limitations of this study are that the rules in LegalBench are simpler than in reality
and that the method only seems to work for single-step rule-application.</p>
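        <p>The chain-of-logic recipe described above can be sketched as prompt construction. The following is a hypothetical, minimal sketch; the example rule, the wording and the prompt layout are my own illustration, not the actual prompts of Servantez et al. [36]:</p>

```python
# Hypothetical sketch of a one-shot 'chain of logic'-style prompt, loosely
# following the description of Servantez et al. [36]: a worked example that
# decomposes a rule into elements, formalises it, answers each element, and
# resolves the logical expression; then the new case is appended.

EXAMPLE = """Rule: A person commits burglary if they (A) enter a building,
(B) without consent, and (C) with intent to commit a crime inside.
Formalisation: A and B and C -> burglary
Facts: Sam walked into an open shop during business hours to buy bread.
Issue: Did Sam commit burglary?
Element A (entered a building): True
Element B (without consent): False
Element C (intent to commit a crime): False
Resolution: True and False and False = False
Answer: No"""

def chain_of_logic_prompt(rule: str, facts: str, issue: str) -> str:
    """Compose the one-shot prompt: worked example first, then the new case."""
    return (
        "Apply the rule to the facts element by element, as in the example.\n\n"
        f"{EXAMPLE}\n\n"
        f"Rule: {rule}\nFacts: {facts}\nIssue: {issue}\n"
        "Element-by-element analysis:"
    )

prompt = chain_of_logic_prompt(
    rule="A tenant may withhold rent if (A) the defect is serious and "
         "(B) the landlord was notified.",
    facts="The heating failed in winter; the tenant emailed the landlord.",
    issue="May the tenant withhold rent?",
)
```

The single worked example is what makes this a one-shot method: the model is expected to reproduce the decomposition, per-element answers and resolution for the new case.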
        <p>In sum, explicit testing of reasoning capability, namely, IRAC-style deductive rule application
without chaining. Testing is direct. The prompts give a detailed example of the expected
reasoning. Systematic quantitative evaluation and comparisons between prompting methods in
terms of accuracy but no comparisons with human performance.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Discussion</title>
      <p>In this section we discuss what can be learned from the preceding overview. Here it should
be taken into account that studies that are only published on ArXiv have presumably not been
peer-reviewed.</p>
      <p>The studies involving exams and document-generation tasks do not explicitly test some
reasoning capability, which makes it hard to draw firm conclusions from them on such capabilities,
since they do not distinguish between the possession of legal knowledge and the ability to apply
it. The other studies do explicitly test on reasoning capabilities and test some form of deductive
reasoning with legal rules, often structured in terms of the IRAC model. Some studies do not
explain the reasoning method they ask for, while other studies explain it with examples. A
commonplace in both legal philosophy and AI &amp; law is that deductive rule application is far too
simplistic as a full model of legal reasoning. The exam and document-generation studies could
implicitly test full-fledged argumentation capabilities, including the use of case- or value-based
reasoning and the consideration of conflicting arguments. However, whether they do is hard to
tell from the publications. This is a point on which computational models of legal argument
could be useful, namely, as standards for the argumentative outputs of legal generative AI.</p>
      <p>
        Most studies that make comparisons do so between several prompting methods or several
LLMs. Two studies compare between AI and human performance, namely, [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] and [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ], which
both conclude that the models can pass American bar or law school exams and thus imply that the
models can take these exams at the level of human law trainees or law-school students. However,
passing such exams is only a rough proxy for having legal reasoning and argumentation abilities.
Whether the various reported scores on rule-application tasks should count as good is hard to tell. In
any case, knowledge-based legal AI would score perfectly on formalised versions of these
tasks, while such systems naturally allow for two further forms of evaluation besides experimental
evaluation of outputs: evaluation of the explicitly represented knowledge and of the explicitly
programmed reasoning model. Therefore, given that so much legal knowledge is explicitly
available, I believe that symbolic AI &amp; law applications can still be practically useful, either
stand-alone, or combined with generative AI as ‘conversational interfaces’ between the human
users’ natural language and the system’s formal language.
      </p>
      <p>
        Some studies include reasoning instructions of varying levels of detail in the prompt and/or
verify to what extent the model’s output obeys these instructions. A general trend in the results
is that such prompting methods improve performance but not consistently. Moreover, there are
some methodological pitfalls here. The first is memorisation. Questions (for instance, bar exam
questions) may be in the training data, so the model may have seen them before, or the model
may in other ways have applied ‘shortcuts’ included in its statistical language model. Some of
the discussed studies show awareness of these issues [
        <xref ref-type="bibr" rid="ref24 ref34">34, 24</xref>
        ].
      </p>
      <p>Next, even if a model structures output according to some reasoning method, it may be that
the model has not followed the method. Striking examples are reported by [37], who found that
GPT-3.5, when used with CoT prompting, does not always behave according to the reasons it says
it applied. A simple example involves multiple-choice questions with two options, A and B. When
GPT-3.5 is only shown examples with A as the correct answer, it tends to prefer answer A
and gives a reason for A even if B is the correct answer. Thus the reason GPT-3.5 gives for its
answer is not the reason it applied. More worrying examples involve racial and gender biases.</p>
      <p>It might be argued that in legal applications this is not a serious problem, since in the law all
that matters is the justification as it is given: that is what enables the parties, appeal courts
and the general public to assess the quality and acceptability of a decision. In philosophical
terms, it is not the context of discovery but the context of justification that matters. However,
against this it can be argued that when alternative decisions are legally acceptable, it is still
undesirable that the choice for a particular decision and for which arguments and evidence to
include in a decision is influenced by bias. This arguably holds all the more for texts that do not
contain decisions but standpoints of the parties, such as summons, complaints or briefs.</p>
      <p>Regardless of this discussion, another way in which symbolic computational models of legal
argument could be useful is in formulating reasoning instructions in the prompt. A natural
idea is to formulate few-shot or CoT prompts in terms of some theory of rational reasoning or
decision-making. It might be said that (legal) prompt engineering is applied (legal) philosophy.</p>
    </sec>
    <sec id="sec-7">
      <title>6. Conclusion</title>
      <p>Research on legal-reasoning capabilities of generative AI is rapidly emerging but still
inconclusive as regards quality or practical usefulness. If reasoning models are made explicit, then
they are (almost?) always some simple deductive form of rule application, which is generally
regarded as too simplistic as a full-fledged model of legal argument. The possible roles of
symbolic computational models of legal argument are threefold: as guidance for prompt
engineering, as standards for evaluating outputs of legal-generative AI, and as symbolic alternatives
to legal-generative AI, possibly combined with the latter as conversational interfaces. In any
case, traditional symbolic AI research on legal reasoning and argumentation is not yet obsolete.</p>
      <p>[35] N. Guha et al., LEGALBENCH: a collaboratively built benchmark for measuring legal
reasoning in large language models, 2023. ArXiv:2308.11462.</p>
      <p>[36] S. Servantez, J. Barrow, K. Hammond, R. Jain, Chain of logic: rule-based reasoning with
large language models, 2024. ArXiv:2402.10400.</p>
      <p>[37] M. Turpin, J. Michael, E. Perez, S. Bowman, Language models don’t always say what they
think: unfaithful explanations in chain-of-thought prompting, in: Advances in Neural
Information Processing Systems, volume 36, 2024.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Walton</surname>
          </string-name>
          , Fundamentals of Critical Argumentation, Cambridge University Press, Cambridge,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>McCarty</surname>
          </string-name>
          ,
          <article-title>Reflections on TAXMAN: An experiment in artificial intelligence and legal reasoning</article-title>
          ,
          <source>Harvard Law Review</source>
          <volume>90</volume>
          (
          <year>1977</year>
          )
          <fpage>89</fpage>
          -
          <lpage>116</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Prakken</surname>
          </string-name>
          , G. Sartor,
          <article-title>Law and logic: A review from an argumentation perspective</article-title>
          ,
          <source>Artificial Intelligence</source>
          <volume>227</volume>
          (
          <year>2015</year>
          )
          <fpage>214</fpage>
          -
          <lpage>225</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Bench-Capon</surname>
          </string-name>
          ,
          <article-title>HYPO's legacy: introduction to the virtual special issue</article-title>
          ,
          <source>Artificial Intelligence and Law</source>
          <volume>25</volume>
          (
          <year>2017</year>
          )
          <fpage>205</fpage>
          -
          <lpage>250</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Prakken</surname>
          </string-name>
          ,
          <article-title>Logical models of legal argumentation</article-title>
          , in: M.
          <string-name>
            <surname>Knauf</surname>
          </string-name>
          , W. Spohn (Eds.),
          <source>The Handbook of Rationality</source>
          , MIT Press, Cambridge, MA,
          <year>2021</year>
          , pp.
          <fpage>669</fpage>
          -
          <lpage>677</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Prakken</surname>
          </string-name>
          , G. Sartor,
          <article-title>A logical analysis of burdens of proof</article-title>
          , in: H.
          <string-name>
            <surname>Kaptein</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Prakken</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          Verheij (Eds.), Legal Evidence and Proof: Statistics, Stories, Logic, Ashgate Publishing, Farnham,
          <year>2009</year>
          , pp.
          <fpage>223</fpage>
          -
          <lpage>253</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sergot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Sadri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kowalski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Kriwaczek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hammond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Cory</surname>
          </string-name>
          ,
          <article-title>The British Nationality Act as a logic program</article-title>
          ,
          <source>Communications of the ACM</source>
          <volume>29</volume>
          (
          <year>1986</year>
          )
          <fpage>370</fpage>
          -
          <lpage>386</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Bench-Capon</surname>
          </string-name>
          , G. Robinson,
          <string-name>
            <given-names>T.</given-names>
            <surname>Routen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sergot</surname>
          </string-name>
          ,
          <article-title>Logic programming for large scale applications in law: a formalisation of supplementary benefit legislation</article-title>
          ,
          <source>in: Proceedings of the First International Conference on Artificial Intelligence and Law</source>
          , ACM Press, New York,
          <year>1987</year>
          , pp.
          <fpage>190</fpage>
          -
          <lpage>198</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Gordon</surname>
          </string-name>
          ,
          <article-title>The Pleadings Game: an exercise in computational dialectics</article-title>
          ,
          <source>Artificial Intelligence and Law</source>
          <volume>2</volume>
          (
          <year>1993</year>
          )
          <fpage>239</fpage>
          -
          <lpage>292</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H.</given-names>
            <surname>Prakken</surname>
          </string-name>
          , G. Sartor,
          <article-title>Argument-based extended logic programming with defeasible priorities</article-title>
          ,
          <source>Journal of Applied Non-classical Logics 7</source>
          (
          <year>1997</year>
          )
          <fpage>25</fpage>
          -
          <lpage>75</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>L.</given-names>
            <surname>Al-Abdulkarim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Atkinson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bench-Capon</surname>
          </string-name>
          ,
          <article-title>A methodology for designing systems to reason with legal cases using abstract dialectical frameworks</article-title>
          ,
          <source>Artificial Intelligence and Law</source>
          <volume>24</volume>
          (
          <year>2016</year>
          )
          <fpage>1</fpage>
          -
          <lpage>50</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>G.</given-names>
            <surname>Governatori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rotolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rubino</surname>
          </string-name>
          ,
          <article-title>Implementing temporal defeasible logic for modeling legal reasoning</article-title>
          , in: JSAI-isAI 2009 Workshops, LENLS, JURISIN, KCSD, LLLL, Tokyo, Japan, November 19-20,
          <year>2009</year>
          , Revised Selected Papers,
          <source>number 6284 in Springer Lecture Notes in AI</source>
          , Springer Verlag, Berlin,
          <year>2010</year>
          , pp.
          <fpage>45</fpage>
          -
          <lpage>58</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>E.</given-names>
            <surname>Rissland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ashley</surname>
          </string-name>
          ,
          <article-title>A case-based system for trade secrets law</article-title>
          ,
          <source>in: Proceedings of the First International Conference on Artificial Intelligence and Law</source>
          , ACM Press, New York,
          <year>1987</year>
          , pp.
          <fpage>60</fpage>
          -
          <lpage>66</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>K.</given-names>
            <surname>Ashley</surname>
          </string-name>
          ,
          <source>Modeling Legal Argument: Reasoning with Cases and Hypotheticals</source>
          , MIT Press, Cambridge, MA,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>V.</given-names>
            <surname>Aleven</surname>
          </string-name>
          ,
          <article-title>Using background knowledge in case-based legal reasoning: a computational model and an intelligent learning environment</article-title>
          ,
          <source>Artificial Intelligence</source>
          <volume>150</volume>
          (
          <year>2003</year>
          )
          <fpage>183</fpage>
          -
          <lpage>237</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>D.</given-names>
            <surname>Berman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hafner</surname>
          </string-name>
          ,
          <article-title>Representing teleological structure in case-based legal reasoning: the missing link</article-title>
          ,
          <source>in: Proceedings of the Fourth International Conference on Artificial Intelligence and Law</source>
          , ACM Press, New York,
          <year>1993</year>
          , pp.
          <fpage>50</fpage>
          -
          <lpage>59</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M.</given-names>
            <surname>Grabmair</surname>
          </string-name>
          ,
          <article-title>Predicting trade secret case outcomes using argument schemes and learned quantitative value effect tradeoffs</article-title>
          ,
          <source>in: Proceedings of the 16th International Conference on Artificial Intelligence and Law</source>
          , ACM Press, New York,
          <year>2017</year>
          , pp.
          <fpage>89</fpage>
          -
          <lpage>98</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>L.</given-names>
            <surname>McCarty</surname>
          </string-name>
          ,
          <article-title>An implementation of Eisner v. Macomber</article-title>
          , in:
          <source>Proceedings of the Fifth International Conference on Artificial Intelligence and Law</source>
          , ACM Press, New York,
          <year>1995</year>
          , pp.
          <fpage>276</fpage>
          -
          <lpage>286</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>T.</given-names>
            <surname>Bench-Capon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sartor</surname>
          </string-name>
          ,
          <article-title>A model of legal reasoning with cases incorporating theories and values</article-title>
          ,
          <source>Artificial Intelligence</source>
          <volume>150</volume>
          (
          <year>2003</year>
          )
          <fpage>97</fpage>
          -
          <lpage>143</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>E.</given-names>
            <surname>Rissland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Skalak</surname>
          </string-name>
          ,
          <article-title>CABARET: statutory interpretation in a hybrid architecture</article-title>
          ,
          <source>International Journal of Man-Machine Studies</source>
          <volume>34</volume>
          (
          <year>1991</year>
          )
          <fpage>839</fpage>
          -
          <lpage>887</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>T.</given-names>
            <surname>Gordon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Walton</surname>
          </string-name>
          ,
          <article-title>Legal reasoning with argumentation schemes</article-title>
          ,
          <source>in: Proceedings of the Twelfth International Conference on Artificial Intelligence and Law</source>
          , ACM Press, New York,
          <year>2009</year>
          , pp.
          <fpage>137</fpage>
          -
          <lpage>146</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Perlman</surname>
          </string-name>
          ,
          <article-title>The implications of ChatGPT for legal services and society</article-title>
          ,
          <year>2022</year>
          . http://ssrn.com/abstract=4294197.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>K.</given-names>
            <surname>Iu</surname>
          </string-name>
          , V.-Y. Wong,
          <article-title>ChatGPT by OpenAI: the end of litigation lawyers?</article-title>
          ,
          <year>2023</year>
          . https://ssrn.com/abstract=4339839.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>A.</given-names>
            <surname>Trozze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Davies</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kleinberg</surname>
          </string-name>
          ,
          <article-title>Large language models in cryptocurrency securities cases: can a GPT model meaningfully assist lawyers?</article-title>
          ,
          <source>Artificial Intelligence and Law</source>
          (
          <year>2024</year>
          ). https://doi.org/10.1007/s10506-024-09399-6.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>F.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Quartey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Schilder</surname>
          </string-name>
          ,
          <article-title>Legal prompting: teaching a language model to think like a lawyer</article-title>
          ,
          <year>2022</year>
          . ArXiv:2212.01326.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>J.</given-names>
            <surname>Rabelo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Goebel</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yoshioka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Satoh</surname>
          </string-name>
          ,
          <article-title>Overview and discussion of the competition on legal information extraction/entailment (COLIEE) 2021</article-title>
          ,
          <source>The Review of Socionetwork Strategies</source>
          <volume>16</volume>
          (
          <year>2022</year>
          )
          <fpage>111</fpage>
          -
          <lpage>133</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>J.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hickman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Monahan</surname>
          </string-name>
          , D. Schwarcz, ChatGPT goes to law school,
          <year>2023</year>
          . https://doi.org/10.2139/ssrn.4335905.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>D.</given-names>
            <surname>Katz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bommarito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gao</surname>
          </string-name>
          , P. Arredondo,
          <article-title>GPT-4 passes the bar exam</article-title>
          ,
          <year>2023</year>
          . https://ssrn.com/abstract=4389233.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>E.</given-names>
            <surname>Martínez</surname>
          </string-name>
          ,
          <article-title>Re-evaluating GPT-4's bar exam performance</article-title>
          ,
          <source>Artificial Intelligence and Law</source>
          (
          <year>2024</year>
          ). https://doi.org/10.1007/s10506-024-09396-9.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>J.</given-names>
            <surname>Nay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Karamardian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lawsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bhat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. T.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kasai</surname>
          </string-name>
          ,
          <article-title>Large language models as tax attorneys: a case study in legal capabilities emergence</article-title>
          ,
          <year>2023</year>
          . ArXiv:2306.07075.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>C.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Legal syllogism prompting: teaching large language models for legal judgment prediction</article-title>
          ,
          <source>in: Proceedings of the Nineteenth International Conference on Artificial Intelligence and Law</source>
          , ACM Press, New York,
          <year>2023</year>
          , pp.
          <fpage>417</fpage>
          -
          <lpage>421</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>W.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <article-title>Syllogistic reasoning for legal judgment analysis</article-title>
          ,
          <source>in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>13997</fpage>
          -
          <lpage>14009</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>X.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-K.</given-names>
            <surname>Soon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Trakic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhuo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Emerton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Grant</surname>
          </string-name>
          ,
          <article-title>Can ChatGPT perform reasoning using the IRAC method in analyzing legal scenarios like a lawyer?</article-title>
          ,
          <year>2023</year>
          . ArXiv:
          <volume>2310</volume>
          .
          <fpage>14880</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>A.</given-names>
            <surname>Blair-Stanek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Holzenberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Van Durme</surname>
          </string-name>
          ,
          <article-title>Can GPT-3 perform statutory reasoning?</article-title>
          ,
          <source>in: Proceedings of the Nineteenth International Conference on Artificial Intelligence and Law</source>
          , ACM Press, New York,
          <year>2023</year>
          , pp.
          <fpage>22</fpage>
          -
          <lpage>31</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>