<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Measuring Faithfulness and Abstention: An Automated Pipeline for Evaluating LLM-Generated 3-ply Case-Based Legal Arguments</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Li Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Morgan Gray</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jaromir Savelka</string-name>
          <email>jsavelka@andrew.cmu.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kevin D. Ashley</string-name>
          <email>ashley@pitt.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Intelligent Systems Program, University of Pittsburgh</institution>
          ,
          <addr-line>Pittsburgh, Pennsylvania</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Computer Science, Carnegie Mellon University</institution>
          ,
          <addr-line>Pittsburgh, Pennsylvania</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
<p>Large Language Models (LLMs) demonstrate potential in complex legal tasks like argument generation, yet their reliability remains a concern. Building upon pilot work assessing LLM generation of 3-ply legal arguments using human evaluation, this paper introduces an automated pipeline to evaluate LLM performance on this task, specifically focusing on faithfulness (absence of hallucination), factor utilization, and appropriate abstention. We define hallucination as the generation of factors not present in the input case materials and abstention as the model's ability to refrain from generating arguments when instructed and no factual basis exists. Our automated method employs an external LLM to extract factors from generated arguments and compares them against the ground-truth factors provided in the input case triples (current case and two precedent cases). We evaluated eight distinct LLMs on three tests of increasing difficulty: 1) generating a standard 3-ply argument, 2) generating an argument with swapped precedent roles, and 3) recognizing the impossibility of argument generation due to lack of shared factors and abstaining. Our findings indicate that while current LLMs achieve high accuracy (over 90%) in avoiding hallucination on viable argument generation tests (Tests 1 &amp; 2), they often fail to utilize the full set of relevant factors present in the cases. Critically, on the abstention test (Test 3), most models failed to follow instructions to stop, instead generating spurious arguments despite the lack of common factors. This automated pipeline provides a scalable method for assessing these crucial LLM behaviors, highlighting the need for improvements in factor utilization and robust abstention capabilities before reliable deployment in legal settings. Project page: Link.</p>
      </abstract>
      <kwd-group>
        <kwd>LLM Evaluation</kwd>
        <kwd>Legal Argument Generation</kwd>
        <kwd>Hallucination Measurement</kwd>
        <kwd>Abstention</kwd>
        <kwd>Trustworthy AI</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Large language models (LLMs) have demonstrated remarkable capabilities across various domains,
including legal analysis and argumentation [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]. Their potential to streamline legal research, draft
documents, and even generate arguments offers significant efficiency gains. However, their tendency
to hallucinate facts or generate plausible but unsupported statements poses significant risks in legal
applications, where accuracy and reliability are of utmost importance [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. Misguided decisions,
ethical concerns, and even professional sanctions can result from relying on inaccurate AI-generated
legal content [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ].
      </p>
      <p>
        A critical challenge lies in ensuring the factual accuracy and appropriate reasoning behavior of LLMs
when tasked with generating case-based legal arguments. Pilot work involving human evaluation of
LLM-generated 3-ply arguments (plaintiff’s argument citing precedent 1, defendant’s counterargument
distinguishing precedent 1 and citing precedent 2, plaintiff’s rebuttal distinguishing precedent 2)
indicated that while LLMs can produce structurally coherent arguments, their factual grounding and
adherence to constraints can be problematic [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Specifically, LLMs may hallucinate, i.e., introduce
factual elements (represented as ‘factors’ in case-based reasoning) not present in the source materials.
Furthermore, they may fail to follow instructions appropriately, particularly negative constraints
such as abstaining from generating an argument when the provided cases lack a sufficient factual
basis for comparison. Existing evaluation methods often focus on general capabilities [
        <xref ref-type="bibr" rid="ref10 ref8 ref9">8, 9, 10</xref>
        ] but
lack fine-grained metrics to assess these specific failure modes in the context of factor-based legal
argumentation.
      </p>
      <p>To address this gap, we introduce an automated pipeline for evaluating LLM performance in generating
3-ply, factor-based legal arguments. This pipeline specifically targets the assessment of hallucination,
factor utilization (the extent to which relevant, available factors are used), and appropriate abstention.
The core of our approach involves using an external LLM to analyze the arguments generated by the
models under test, extracting the factors cited within them. These extracted factors are then compared
against the ground-truth factors present in the input case materials to compute quantitative metrics for
faithfulness and completeness.</p>
      <p>The development of this automated evaluation pipeline enables a targeted assessment of LLM
behavior in generating factor-based arguments. To guide this assessment, we pose the following
research questions (RQs):
• RQ1: To what extent do LLMs exhibit measurable errors, specifically hallucination (citing
nonexistent factors) and incomplete factor utilization (omitting relevant available factors), when
tasked with generating 3-ply case-based arguments from factor-represented inputs?
• RQ2: How effectively do LLMs adhere to instructions to abstain from argument generation when
presented with input cases lacking common factors, and what is their propensity to generate
spurious arguments under such conditions?
• RQ3: Can the proposed automated evaluation metrics effectively quantify distinct error types
(hallucination, incomplete utilization, spurious generation) and successfully reveal performance
variations across different LLMs and varying levels of task complexity?
The main contributions of this paper are: an automated evaluation pipeline specifically designed for
assessing LLM-generated, factor-based legal arguments; novel metrics targeting hallucination, factor
utilization, and abstention behavior in this context; an empirical evaluation of eight distinct LLMs
(including open-source and proprietary models of varying sizes) on three argumentation tasks with
increasing difficulty; and insights into the specific weaknesses of current LLMs regarding factual
grounding and instruction following in legal argument generation.</p>
      <p>The remainder of this paper is structured as follows: Section 2 discusses related work. Section 3
details the methodology of our automated evaluation pipeline. Section 4 describes the experimental
setup, including the dataset, tasks, and models. Section 5 presents the results of our evaluation. Section
6 provides a qualitative error analysis. Section 7 concludes the paper. Finally, Section 8 acknowledges
the limitations and suggests future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. LLMs in the Legal Domain</title>
        <p>
          Recent advances in LLMs, from open-source models such as Llama models [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] to proprietary systems
such as GPT models [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], have demonstrated remarkable capabilities in natural language understanding
and generation. This has spurred significant interest in their application within the legal domain,
ranging from legal research assistance and contract analysis to case outcome prediction and automated
legal document drafting [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. While promising, the reliable deployment of these models requires careful
consideration of their limitations, particularly concerning factual accuracy.
        </p>
      </sec>
      <sec id="sec-2-1b">
        <title>2.2. Computational Argumentation and Case-Based Reasoning</title>
        <p>
Computational models of legal argument, particularly those employing case-based reasoning (CBR),
provide a foundation for analyzing and generating legal arguments. Early work pioneered the use
of ‘factors’—stereotypical fact patterns relevant to legal claims—in US trade secrets law, developing
systems like HYPO that analyze and compare cases based on shared and distinguishing factors [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
Subsequent systems like CATO introduced factor hierarchies [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], while others integrated rule-based
and case-based approaches [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] or focused on predicting outcomes [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] and incorporating legal values
[
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. Formal models of precedential constraint based on factors have also been developed [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. Factors
provide a structured representation suitable for evaluating the factual basis of arguments, as employed
in this study.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.3. Argument Generation with LLMs</title>
        <p>
          Beyond general legal tasks, researchers are exploring the specific capability of LLMs to generate
arguments. Some work has shown LLMs can assist humans in identifying legal factors [
          <xref ref-type="bibr" rid="ref19">19, 20</xref>
          ] or
generate factor-based arguments in a structured manner [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. However, the practical utility of such
generated arguments hinges on their factual accuracy and logical coherence. This paper focuses on
evaluating these aspects rather than proposing new generation methods.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.4. Hallucination in LLMs</title>
        <p>
          A significant challenge for LLMs is hallucination—the generation of content that is factually incorrect
or unsupported by the provided input or established knowledge [
          <xref ref-type="bibr" rid="ref5">5, 21</xref>
          ]. Hallucinations can manifest
as contradictions with the input prompt, conflicts with provided context, or deviations from
real-world facts [22]. In high-stakes domains like law, where precision and truthfulness are critical [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ],
hallucination represents a major barrier to adoption. Mitigation strategies often involve techniques
like chain-of-thought prompting [23] or retrieval-augmented generation (RAG) to ground responses in
external sources [24]. Our work focuses on reliably detecting hallucination in the specific context of
factor-based legal arguments.
        </p>
      </sec>
      <sec id="sec-2-4">
        <title>2.5. Evaluation Metrics for Generated Text</title>
        <p>
          Standard metrics for evaluating generated text, such as ROUGE [25], BLEU [26], and BERTScore [27],
primarily measure surface-level similarity or semantic overlap with reference texts. While useful for
tasks like summarization or translation, they are often insufficient for assessing factual accuracy, logical
consistency, or adherence to constraints in complex generation tasks like legal argumentation [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Some
legal benchmarks exist [
          <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
          ], but metrics specifically tailored to evaluate faithfulness and abstention in
factor-based reasoning remain underdeveloped. Our work aims to fill this gap by proposing automated
metrics focused on these critical aspects.
        </p>
      </sec>
      <sec id="sec-2-5">
        <title>2.6. Instruction Following in LLMs</title>
        <p>The ability of LLMs to accurately follow complex instructions is crucial for their reliable use. Research
has shown that while LLMs are increasingly capable of adhering to instructions, they can struggle with
nuanced or complex constraints, particularly negative constraints (e.g., “do not generate X if condition
Y is met”) or situations requiring implicit recognition of task impossibility [28, 29]. Failure to follow
instructions, such as the requirement to abstain from generating an argument when no factual basis
exists, is one of the key failure modes investigated in this study.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>Our work employs a pipeline designed to automatically generate, assess, and score LLM performance
on a structured legal argument generation task. This pipeline consists of several stages: scenario
generation, argument generation by the models under test, automated factor extraction from the
generated arguments, and quantitative scoring based on comparison with ground-truth inputs.</p>
      <sec id="sec-3-1">
        <title>3.1. Task Definition: 3-Ply Argument Generation</title>
        <p>The models under test are asked to produce a 3-ply, factor-based argument for a factor-represented
current case given two precedent cases: (1) the plaintiff’s argument cites the precedent with a
plaintiff-favorable outcome (TSC1) and highlights the factors it shares with the current case; (2) the
defendant’s counterargument distinguishes TSC1 on differing factors and cites the defendant-favorable
precedent (TSC2) as a counterexample; and (3) the plaintiff’s rebuttal distinguishes TSC2, reinforcing
the plaintiff’s original position. The argument scheme follows the structure shown in Figure 1 and the
prompt in Appendix A.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Factor-Based Case Representation</title>
        <p>
          Cases are represented using a standardized set of 26 legal factors pertinent to U.S. trade secret
misappropriation law, derived from foundational work in legal AI [
          <xref ref-type="bibr" rid="ref13">13, 30</xref>
          ]. These factors encapsulate key factual
aspects, such as circumstances surrounding disclosure, security measures implemented, characteristics
of the information, and relevant employee conduct. Each factor is designated as typically favoring
either the plaintiff (P) or the defendant (D). For instance, a case might be represented textually as:
[Case Name] [Outcome] [Factors: F1 Disclosure-in-negotiations (D), F4
Agreed-not-to-disclose (P), F6 Security-measures (P)]
This structured factor representation facilitates objective comparison between cases based on shared and
distinguishing factors, providing the essential ground truth for our subsequent automated evaluation
metrics.
        </p>
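        <p>For illustration, the following minimal sketch shows one way such a factor-encoded triple could be
represented programmatically; the class, field names, and factor assignments below are assumptions made for
exposition only, not the schema of our generation tool.</p>
<preformat>
# Minimal sketch of a factor-encoded case triple (illustrative names and values only).
from dataclasses import dataclass, field

@dataclass
class Case:
    name: str
    outcome: str                                 # "Plaintiff", "Defendant", or "Undecided"
    factors: set = field(default_factory=set)    # e.g., {"F4", "F6"}

# Hypothetical 'Arguable' triple: the current case shares factors with both precedents.
current_case = Case("Current Case", "Undecided", {"F1", "F4", "F6"})
tsc1 = Case("TSC1", "Plaintiff", {"F4", "F6", "F21"})
tsc2 = Case("TSC2", "Defendant", {"F1", "F25"})

# Shared factors drive the analogy in each ply of the 3-ply argument.
print(current_case.factors.intersection(tsc1.factors))  # {'F4', 'F6'}
print(current_case.factors.intersection(tsc2.factors))  # {'F1'}
</preformat>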
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Argument Generation and Evaluation Pipeline</title>
        <p>The evaluation process follows a defined pipeline. First, legal scenarios (case triples) are generated
according to specific criteria (detailed in Section 4.1). Second, each model under test receives these
scenarios within a structured prompt (Section 4.2) and generates the 3-ply argument. Third, an automated
process extracts the factors cited within the generated argument text (Section 3.4.1). Finally, these
extracted factors are compared against the ground-truth factors from the input scenario to calculate
performance metrics (Sections 3.4.2-3.4.4). The overall process is depicted in Figure 2.</p>
        <p>[Figure 2: Overview of the evaluation pipeline: (1) Automated Scenario Synthesis, driven by the experimental
configuration parameters and producing factor-encoded case triples (ground truth, $F_{GT}$); (2) LLM Argument
Generation by the model under test from a standardized task prompt, producing the generated argument output;
(3) Automated Factor Extraction via the evaluator LLM using an extraction task prompt, producing the extracted
factor sets ($F_{Ext}$); (4) Quantitative Performance Scoring, producing the computed evaluation metrics (e.g.,
$Acc_H$, $Rec_U$, $Ratio_{Abstain}$).] The generated output, along with metadata, is logged for subsequent
analysis.</p>
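        <p>The following sketch illustrates the control flow of the four stages, with simplified stand-in helpers in
place of the actual model and evaluator calls; none of the function names or placeholder outputs correspond to
our implementation.</p>
<preformat>
# Illustrative control flow for the four pipeline stages; every helper below is a
# simplified stand-in (not our implementation) so the sketch runs end to end.

def build_argument_prompt(triple):
    # Stage 1 output consumed here: render the factor-encoded triple as a task prompt.
    return "Current case: {0}; TSC1: {1}; TSC2: {2}".format(
        sorted(triple["CC"]), sorted(triple["TSC1"]), sorted(triple["TSC2"]))

def model_under_test(prompt):
    # Stage 2 placeholder: the real pipeline calls each evaluated LLM's API here.
    return "Plaintiff's Argument: Factors F4 and F6 were present in both the current case and TSC1 ..."

def evaluator_extract(argument_text):
    # Stage 3 placeholder: the real pipeline asks the evaluator LLM (GPT-4.1) to extract
    # the per-case factor sets asserted in the argument (see Appendix B).
    return {"CC": {"F4", "F6"}, "TSC1": {"F4", "F6"}, "TSC2": set()}

def run_pipeline(triples):
    results = []
    for triple in triples:
        argument = model_under_test(build_argument_prompt(triple))
        extracted = evaluator_extract(argument)
        # Stage 4: metric computation is deferred to Section 3.4; everything is logged for analysis.
        results.append({"ground_truth": triple, "argument": argument, "extracted": extracted})
    return results

demo_triple = {"CC": {"F1", "F4", "F6"}, "TSC1": {"F4", "F6", "F21"}, "TSC2": {"F1", "F25"}}
print(run_pipeline([demo_triple])[0]["extracted"])
</preformat>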
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Automated Metrics Definition</title>
        <p>To evaluate the generated 3-ply arguments quantitatively, we developed automated metrics focused on
faithfulness (absence of hallucination), factor utilization (completeness), and adherence to abstention
instructions. These metrics rely on comparing the factors cited within the generated argument against
the ground-truth factors present in the input cases.</p>
        <p>3.4.1 Factor Extraction: An external, high-capability LLM (GPT-4.1) serves as an automated
evaluator. For each generated 3-ply argument produced by a model under test, this evaluator LLM is
prompted to analyze the argument text. Its task is to identify and extract the specific sets of factors that
the model under test asserted exist in each case of the triple (Current Case - CC, TSC1, TSC2). Let
$F_{GT,c}$ be the set of actual ground-truth factors present in the input for case $c \in \{CC, TSC1, TSC2\}$.
Similarly, let $F_{Ext,c}$ be the set of factors extracted by the evaluator for case $c$.</p>
        <p>3.4.2 Hallucination Metric: Hallucination is operationally defined as the assertion by the model
under test that a specific factor exists in a specific case when, according to the ground-truth input for
that case, it is not present. We quantify this by summing the hallucinations across all three cases and
normalizing by the total number of ground-truth factors in the input triple. The Hallucination Accuracy
($Acc_H$) is calculated as:
$$Acc_H = 1 - \frac{N_{Hall}}{N_{GT}}$$
Here, $N_{Hall}$ is the total count of hallucinated factors across the three cases:
$$N_{Hall} = \sum_{c \in \{CC, TSC1, TSC2\}} \big|\{ f \in F_{Ext,c} \mid f \notin F_{GT,c} \}\big|$$
And $N_{GT}$ represents the total count of factors across the three ground-truth input cases (sum of factors
per case, not unique factors):
$$N_{GT} = \sum_{c \in \{CC, TSC1, TSC2\}} |F_{GT,c}|$$
A higher $Acc_H$ indicates greater faithfulness, meaning the argument relies less on unsupported factual
assertions specific to each case.</p>
        <p>3.4.3 Factor Utilization Metric: Factor utilization assesses how comprehensively the model under
test mentions the available ground-truth factors for the specific cases they belong to. We compute Factor
Utilization Recall ($Rec_U$) by summing the correctly identified factors for each case and normalizing by
the total number of ground-truth factors:
$$Rec_U = \frac{N_{Util}}{N_{GT}}$$
where $N_{Util}$ is the total count of utilized ground-truth factors correctly mentioned for their respective
cases across the triple:
$$N_{Util} = \sum_{c \in \{CC, TSC1, TSC2\}} |F_{Ext,c} \cap F_{GT,c}|$$
$N_{GT}$ is defined as above. A higher $Rec_U$ indicates that the generated argument incorporates more of
the factual elements provided in the input, correctly associating them with their respective cases.</p>
        <p>3.4.4 Abstention Metric: Test 3 specifically tests the model’s ability to abstain when argument
generation is impossible due to a lack of shared factors. The primary measure for this test is the
Abstention Ratio ($Ratio_{Abstain}$). Let $N_{Abstain}$ be the number of successfully executed abstentions
and $N_{Total}$ be the total number of test triples requiring abstention. The ratio is calculated as:
$$Ratio_{Abstain} = \frac{N_{Abstain}}{N_{Total}}$$
A higher $Ratio_{Abstain}$ indicates better adherence to instructions to abstain. For this test, we also report
Hallucination Accuracy ($Acc_H$) to characterize the nature of the arguments generated when models
failed to abstain (Section 5.3).</p>
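        <p>Because all three metrics reduce to set arithmetic over the per-case factor sets, they can be computed
directly once $F_{GT,c}$ and $F_{Ext,c}$ are available. The following minimal sketch uses hypothetical inputs and
is illustrative rather than our exact implementation.</p>
<preformat>
# Compute Acc_H, Rec_U, and Ratio_Abstain from ground-truth and extracted factor sets.
CASES = ("CC", "TSC1", "TSC2")

def hallucination_accuracy(gt, ext):
    n_gt = sum(len(gt[c]) for c in CASES)
    n_hall = sum(len(ext[c].difference(gt[c])) for c in CASES)   # factors asserted but absent
    return 1 - n_hall / n_gt

def utilization_recall(gt, ext):
    n_gt = sum(len(gt[c]) for c in CASES)
    n_util = sum(len(ext[c].intersection(gt[c])) for c in CASES)  # factors correctly reused
    return n_util / n_gt

def abstention_ratio(n_abstained, n_total):
    return n_abstained / n_total

# Hypothetical example: one extracted factor (F9 for TSC1) is not in the ground truth.
gt  = {"CC": {"F1", "F4", "F6"}, "TSC1": {"F4", "F6", "F21"}, "TSC2": {"F1", "F25"}}
ext = {"CC": {"F4", "F6"},       "TSC1": {"F4", "F9"},        "TSC2": {"F1"}}
print(hallucination_accuracy(gt, ext))  # 1 - 1/8 = 0.875
print(utilization_recall(gt, ext))      # 4/8 = 0.5
print(abstention_ratio(26, 30))         # 0.866...
</preformat>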
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Rationale for External LLM-based Evaluation</title>
        <p>Employing a highly capable LLM (GPT-4.1) for the factor extraction step (Section 3.4.1) provides
substantial advantages in scalability and consistency compared to manual annotation across potentially
hundreds or thousands of generated arguments. Although the evaluator LLM is not infallible, our spot
checks indicated high accuracy in identifying factor mentions within the generated text structures.
This automated approach enables large-scale, reproducible evaluation across numerous models and
experimental conditions. Potential limitations associated with this method are acknowledged in Section
8.</p>
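        <p>For concreteness, the following sketch shows how such an extraction call could be issued through an
OpenAI-style chat API. The system message and the expected JSON shape below are simplified assumptions; the
actual extraction prompt and output format we used are shown in Appendix B.</p>
<preformat>
# Sketch of the factor-extraction call to the evaluator LLM (prompt and schema simplified).
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def extract_factors(argument_text):
    response = client.chat.completions.create(
        model="gpt-4.1",
        temperature=0,  # deterministic settings for the evaluator
        messages=[
            {"role": "system",
             "content": "You are tasked with extracting factors from the 3-ply argument. "
                        "Return JSON mapping Current Case, TSC1, and TSC2 to lists of "
                        "factor IDs (e.g., F4)."},
            {"role": "user", "content": argument_text},
        ],
    )
    # Assumed response shape: {"Current Case": ["F4"], "TSC1": ["F4"], "TSC2": []}
    return json.loads(response.choices[0].message.content)
</preformat>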
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Design</title>
      <sec id="sec-4-1">
        <title>4.1. Dataset Generation and Structure</title>
        <p>The dataset used in this study was synthetically generated using a custom tool designed to create
controlled case triples for evaluating specific argumentation phenomena within the U.S. trade secret
domain. Each generated triple includes a factor-represented current case, a potential plaintiff precedent
(TSC1), and a potential defendant counter-precedent (TSC2).</p>
        <p>The generation process allows for specifying several parameters, including the number of cases, the
complexity level which controlls the number of factors per case, typically ranging from complexity-1 to
complexity+1, and, crucially, the scenario ‘mode’. We generated data across three distinct modes, each
designed to test diferent facets of LLM reasoning and instruction following:
• Arguable Sets: These triples contain suficient overlapping factors between the current case
and the respective precedents (TSC1 for plaintif, TSC2 for defendant), with aligned outcomes,
facilitating standard 3-ply argument generation. These are used for Test 1.
• Reordered Sets: In these triples, common factors exist, but the typical roles based on outcomes
are reversed (TSC1 favors Defendant, TSC2 favors Plaintiff). These are used for Test 2, primarily
testing robustness to the reordered precedent cases compared to Test 1.
• Non-arguable Sets: These triples are constructed such that there are no common factors between
the current case and either TSC1 or TSC2. These are specifically designed for Test 3 to evaluate the
model’s ability to recognize the impossibility of argument generation and abstain as instructed.
For this study, we generated a set of 30 triples with a complexity of 12 for each of the ‘Arguable’,
‘Reordered’, and ‘Non-arguable’ modes, resulting in a total dataset of 90 triples used across the three
experimental tests. Example case structures for each mode are illustrated in Table 1. This structured
dataset allows for targeted testing of baseline argument generation (Arguable), adherence to specific
instructions under potentially confusing conditions (Reordered/Swapped Roles), and the crucial ability
to recognize factual impossibility and follow abstention instructions (Non-arguable).</p>
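        <p>The following sketch illustrates the constraint that defines the ‘Non-arguable’ mode (no factor shared
between the current case and either precedent); the factor pool and sampling details are simplified assumptions
for illustration, not the logic of our generation tool.</p>
<preformat>
# Sketch of the constraint a 'Non-arguable' triple must satisfy (simplified).
import random

FACTOR_POOL = [f"F{i}" for i in range(1, 27)]  # 26 trade-secret factors (see Section 3.2)

def sample_case(n_factors, exclude=frozenset()):
    available = [f for f in FACTOR_POOL if f not in exclude]
    return set(random.sample(available, n_factors))

def make_non_arguable_triple(complexity=12):
    # Non-arguable mode: the current case shares no factor with TSC1 or TSC2,
    # so no analogy is possible and the model should abstain (Test 3).
    cc = sample_case(complexity)
    tsc1 = sample_case(complexity, exclude=cc)
    tsc2 = sample_case(complexity, exclude=cc)
    assert not cc.intersection(tsc1) and not cc.intersection(tsc2)
    return {"CC": cc, "TSC1": tsc1, "TSC2": tsc2}

print(make_non_arguable_triple())
</preformat>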
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Core Prompt Structure</title>
        <p>For each test, the LLMs under evaluation were provided with a structured prompt designed to give
sufficient context and clear instructions. The prompt included a description of the 3-ply argument task
(following the scheme shown in Figure 1), relevant background on trade secret misappropriation law, and the
factor representations for the specific input cases (current case, TSC1, TSC2). Crucially, the prompt
also contained explicit instructions regarding the desired output format and an abstention condition: it
stated that if no common factors could be found to support an analogy for a given ply, the model should
output a specific phrase (e.g., “Cannot generate argument due to lack of common factors”) and stop
processing that ply, rather than fabricating an argument. This abstention instruction was particularly
relevant for evaluating performance on Test 3. An example of the full prompt structure is provided in
Appendix A and B.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Curriculum for Testing</title>
        <p>We evaluated the selected LLMs on three distinct tests, leveraging the different modes of our generated
dataset:
• Test 1 on Arguable Sets: Standard Argument Generation. Using the ‘Arguable’ case triples,
models generated the standard 3-ply argument (Plaintiff cites TSC1, Defendant cites TSC2,
Plaintiff rebuts TSC2). This test assesses baseline performance regarding hallucination and factor
utilization when arguments are factually supported.
• Test 2 on Reordered Sets: Swapped Precedent Roles. Using the ‘Reordered’ triples, this
test required models to perform the 3-ply argument but with the order of TSC1 and TSC2 swapped
(TSC1 favoring the Defendant, TSC2 favoring the Plaintiff). Critically, models were not given the
precedent name to cite; instead, they had to select the appropriate precedent (TSC1 or TSC2) by
analyzing which case’s outcome supported their argumentative goal.
• Test 3 on Non-arguable Sets: Abstention Test. Utilizing the ‘Non-arguable’ case triples,
models were prompted to generate the standard 3-ply argument. The expected correct behavior,
however, was for the model to recognize the lack of common factors required for analogical
reasoning and follow the explicit instruction to abstain from generating spurious arguments for the
triple. This test directly probes the ability to identify task impossibility and adhere to negative
constraints.</p>
        <p>These tests present progressively challenging scenarios designed to probe different aspects of LLM
reliability in legal argument generation.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Models Evaluated</title>
        <p>We selected eight distinct LLMs, representing a range of sizes, architectures, and access types
(open-source and commercial). The models evaluated were:
• GPT-4o (OpenAI)
• GPT-4o-mini (OpenAI)
• Llama-3-70B-8192 (Meta)
• Llama-3-8B-8192 (Meta)
• Llama-4-Maverick-17B-128e-instruct (Meta)
• Llama-4-Scout-17B-16e-instruct (Meta)
• DeepSeek-R1-Distill-Llama-70B (DeepSeek)
• Qwen-QWQ-32B (Alibaba)
This selection aimed to cover a spectrum of capabilities, including models ranging from comparatively
small (Llama-3-8B) to large (GPT-4o, Llama-3-70B), proprietary and open-source options,
Mixture-of-Experts architectures (Llama-4 variants), and models optimized for reasoning tasks (Qwen-QWQ-32B,
DeepSeek-R1-Distill-Llama-70B). These models were accessed via their respective APIs at the time of
experimentation (May 2025).</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Implementation Details</title>
        <p>To ensure comparability across models and tests, consistent generation parameters were employed
for all LLM invocations within the argument generation pipeline. We used a temperature setting
of 0, since the task targets a single expected correct output. A max_tokens limit of 500 was set, which
proved sufficient for the typical length of a 3-ply argument structure (for reasoning models, the
limit was set to 5,000). Other standard parameters included top_p=1, frequency_penalty=0, and
presence_penalty=0. The factor extraction step performed by the external evaluator LLM (GPT-4.1)
also utilized fixed deterministic settings to ensure consistency in the evaluation process itself.</p>
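        <p>For concreteness, the following sketch shows an invocation with these settings through an OpenAI-style
chat API; other providers expose analogous parameters, and the helper shown here is illustrative only.</p>
<preformat>
# Sketch of a generation call using the fixed parameters described above (illustrative).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_argument(prompt, model="gpt-4o", reasoning_model=False):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,                                 # deterministic output for comparability
        max_tokens=5000 if reasoning_model else 500,   # larger budget for reasoning models
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
    )
    return response.choices[0].message.content
</preformat>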
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>We analyze the performance of the eight evaluated LLMs across the three distinct tests using the
automated metrics for Hallucination Accuracy ($Acc_H$, Section 3.4.2), Factor Utilization Recall ($Rec_U$,
Section 3.4.3), and Abstention Ratio ($Ratio_{Abstain}$, Section 3.4.4), as defined previously. The results are
presented separately for each test type: Test 1 (Arguable), Test 2 (Reordered), and Test 3 (Non-arguable).</p>
      <sec id="sec-5-1">
        <title>5.1. Hallucination Accuracy Results</title>
        <p>As shown in Table 2, the Hallucination Accuracy ($Acc_H$) is generally very high for Tests 1 and 2 across
most models. These tests involve generating arguments based on provided ‘Arguable’ scenarios, either
in a standard format (Test 1) or with swapped precedent roles (Test 2). GPT-4o achieves near-perfect
accuracy (&gt;99% for Test 1, &gt;97% for Test 2), indicating exceptional faithfulness in citing only those factors
genuinely shared between the relevant cases in these scenarios. Other models, including GPT-4o-mini,
Llama-3-70B, and the Llama-4 variants, also demonstrate high accuracy, typically above 96% on these
tests. This suggests that when instructed to generate arguments based on factor representations where
supporting shared factors exist, current leading LLMs are capable of doing so with minimal hallucination
according to our metric. Smaller or specialized models like DeepSeek, Qwen, and Llama-3-8B show
slightly lower but still respectable accuracy, generally above 90%.</p>
        <p>For Test 3 (Non-arguable), where no shared factors exist and models were instructed to abstain, the
$Acc_H$ remains surprisingly high for several models, particularly GPT-4o (99.16%) and Llama-4-Maverick
(94.35%). This high accuracy, however, must be interpreted cautiously alongside the primary goal of
abstention and the recall results (Section 5.3).</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Factor Utilization Recall Results</title>
        <p>Factor Utilization Recall ($Rec_U$), presented in Table 3, measures the completeness of the arguments by
quantifying the proportion of available supporting factors that were correctly identified and used by
the LLM for Tests 1 and 2. The results reveal a more varied picture than accuracy. For Tests 1 and 2,
GPT-4o again leads, utilizing over 85% of available factors in the standard test (Test 1) and over 76% in
the swapped-role test (Test 2). Llama-3-70B also performs strongly, achieving recall scores above 74%
for both tests. Other models exhibit a wider range of performance. For instance, GPT-4o-mini shows
significantly lower recall (around 42-50%), suggesting it generates arguments that are factually accurate
(high $Acc_H$) but lack comprehensiveness. The Llama-4 variants, DeepSeek, Qwen, and Llama-3-8B fall
between these extremes, with recall generally ranging from 50% to 70% on the arguable tests. This
indicates that while models can avoid making up facts, they often fail to incorporate the full set of
relevant facts available in the input materials into their generated arguments.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Abstention Test Performance</title>
        <p>Test 3 was designed specifically to test the models’ ability to follow the instruction to abstain when
faced with ‘Non-arguable’ scenarios lacking shared factors. The primary metric for this test is the
Abstention Ratio ($Ratio_{Abstain}$), presented in Table 4. We also consider Hallucination Accuracy ($Acc_H$)
from Table 2 for Test 3 to understand the nature of arguments generated when models failed to abstain.</p>
        <p>As shown in Table 4, the ability to correctly abstain varies significantly across models. GPT-4o
achieved the highest abstention ratio (86.67%), successfully following the instruction in the majority
of non-arguable cases. Qwen-QWQ-32B (56.67%) and Llama-4-Maverick (50.00%) also demonstrated
some capability to abstain. However, several models, including Llama-3-8B, Llama-4-Scout, and
GPT-4o-mini, had an abstention ratio of 0.00%, indicating they failed to abstain in any of the test instances.
Llama-3-70B also performed poorly with a very low abstention ratio (3.33%).</p>
        <p>For models that failed to abstain and instead generated spurious arguments, their Hallucination
Accuracy ($Acc_H$) on Test 3 (Table 2) is informative. For example, GPT-4o, in the rare instances where it failed
to abstain, maintained very high $Acc_H$ (99.16%), meaning its spurious arguments were largely free of
hallucinated factors not in the input cases. Llama-4-Maverick also showed high $Acc_H$ (94.35%) in such
instances. This suggests that their failure was primarily in not following the abstention instruction,
rather than fabricating factors. Other models that failed to abstain also generally maintained relatively
high $Acc_H$ (mostly above 84%), indicating that the spurious arguments, while incorrect, were mostly
based on factors present in the input cases rather than completely fabricated information.</p>
        <p>Overall, this critical test of instruction following reveals a significant weakness in most LLMs. The
inability to reliably recognize task impossibility and adhere to negative constraints is a major concern
for their deployment in sensitive applications. Even models that performed well on argument generation
(Tests 1 &amp; 2) struggled significantly with abstention.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Comparative Analysis</title>
        <p>Across Tests 1 and 2 (argument generation), GPT-4o demonstrates the strongest performance, achieving
the highest Hallucination Accuracy and leading significantly in Factor Utilization Recall, suggesting its
arguments are both faithful and comprehensive. Llama-3-70B generally ranks second, showing strong
accuracy and good recall. Llama-4-Maverick also performs well in terms of accuracy on these tests,
though its recall is moderate.</p>
        <p>The performance gap between the top models (GPT-4o, Llama-3-70B) and others is more pronounced
in Factor Utilization Recall than in Hallucination Accuracy for Tests 1 and 2. Models like GPT-4o-mini,
Llama-3-8B, and Qwen often achieve reasonable accuracy but struggle with recall, producing less
complete arguments. The Llama-4 variants and DeepSeek fall in the mid-range for both metrics on
these arguable tests.</p>
        <p>Test 3 (Abstention) reveals a critical dimension of model capability. GPT-4o stands out with the highest
Abstention Ratio (Table 4), indicating a superior ability to follow instructions to abstain.
Qwen-QWQ-32B and Llama-4-Maverick show moderate success in abstention, while several models, including some
high performers on Tests 1 &amp; 2 like Llama-3-70B, almost completely failed to abstain. This highlights that
strong performance on generative tasks does not necessarily translate to robust instruction following
for negative constraints. The failure to abstain, even when explicitly instructed, is a significant concern.
The failure modes on Test 3 varied, but very few models performed the task as intended by consistently
and correctly abstaining.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Error Analysis</title>
      <p>To gain deeper insights into the quantitative results and understand the nature of the errors identified
by our automated metrics, we conducted an error analysis. We selected LLM outputs for manual review
primarily based on instances where the automated metrics indicated significant deviations from desired
performance. This included cases with lower Hallucination Accuracy ($Acc_H$ &lt; 95%), notably low
Factor Utilization Recall ($Rec_U$), outputs from models exhibiting generally weaker performance on a
specific test, and, critically, all instances where models failed to produce the correct abstention output
on Test 3. The analysis involved a manual review of the selected LLM-generated argument texts. Each
generated argument was compared against the ground-truth factors provided in the corresponding
input case triple (Current Case, TSC1, TSC2) and the specific instructions given in the core prompt
(including the 3-ply structure requirements and the abstention rule). During the review, observed errors
were categorized into distinct types related to hallucination, incomplete factor utilization, and failures
in following instructions, particularly regarding the abstention task.</p>
      <p>Hallucination Errors (Primarily Tests 1 &amp; 2): Although quantitative results showed high $Acc_H$,
qualitative analysis identified infrequent instances, primarily in lower-performing models.
• Factor Misattribution: Citing a factor as present in one case (e.g., the Current Case) when it
actually belonged to a different case in the input triple (e.g., TSC1).</p>
      <sec id="sec-6-1">
        <title>Factor Misattribution: GPT-4o</title>
        <p>“Plaintif’s Argument: ... Factors F3 ... F21 Knew-info-confidential (P), F23 Waiver-of-confidentiality
(D), and F25 Info-reverse-engineered (D) were present in both the input case and TSC1 ...”</p>
        <p>For example, in one instance of a GPT-4o output (Box 1), factor F21 was incorrectly attributed to
TSC1 by the model, as it was not present in the ground-truth factors for TSC1.</p>
      </sec>
      <sec id="sec-6-2">
        <title>Incomplete Factor Utilization Errors (Tests 1 &amp; 2)</title>
        <p>This was a more common issue across models, reflected in the $Rec_U$ scores.</p>
        <p>• Omission of Shared Factors: Failing to identify or mention relevant factors that were shared
between the Current Case and the precedent being cited (TSC1 for Plaintiff’s first ply, TSC2 for
Defendant’s ply).
• Omission of Distinguishing Factors: Failing to identify or mention factors that differentiate the
Current Case from the precedent being discussed, particularly when the task required distinguishing
(Defendant’s ply distinguishing TSC1, Plaintiff’s rebuttal distinguishing TSC2).</p>
      </sec>
      <sec id="sec-6-3">
        <title>Omission of Shared Factors &amp; Omission of Distinguishing Factors: Llama-3-8B-8192</title>
        <p>“Plaintiff’s Argument: ... Factors F4 Agreed-not-to-disclose (P) and F6 Security-measures (P) were
present in both the input case and TSC1 ...”
“Plaintiff’s Rebuttal: ... In TSC2, the additional factors ... F25 Info-reverse-engineered (D), F27
Disclosure-in-public-forum (D) were present and are not present in input case.”</p>
        <p>For instance, with Llama-3-8B-8192 (Box 2), the model failed to mention that F7 Brought-tools (P)
was a shared pro-plaintiff factor in both the current case and TSC1 during the Plaintiff’s Argument.
Additionally, in the Plaintiff’s Rebuttal, the model did not point out that F12
Outsider-disclosures-restricted (P) was a distinguishing factor present in the current case but not in TSC2.</p>
        <p>Instruction Following / Abstention Errors (Test 3): This was the most significant failure mode
observed across nearly all models.</p>
        <p>• Failure to Abstain: The most common error was generating a spurious 3-ply argument structure
despite the lack of common factors and the explicit instruction to output a specific abstention
phrase. We observe that even in instances where the model does not successfully abstain, the
resulting legal arguments do not exhibit factual misattribution or hallucination.
• Incorrect Abstention Phrase: In rare cases where a model attempted to signal impossibility, it
failed to use the exact required phrase (“No common factor between the current case and the
TSC1/TSC2”), using a variation instead.
• Spurious Reasoning within Failed Abstention: When models incorrectly generated arguments, the
reasoning was inherently flawed. This often involved discussing factors from the cases as if
they provided a basis for comparison, even though no common factors existed, or fabricating
connections.</p>
      </sec>
      <sec id="sec-6-4">
        <title>Failure to Abstain: GPT-4o</title>
        <p>“Plaintif’s Argument: ... F22 Invasive-techniques (P), F26 Deception (P) were present in the input
case and support the Plaintif’s position ...”</p>
        <p>This output (Box 3) illustrates GPT-4o failing to follow the instruction to abstain when presented
with a non-arguable scenario in Test 3. Instead of outputting the specified abstention phrase, the model
proceeded to generate an argument. Notably, while the generation of any argument was incorrect in
this context, the factors cited in the output were all present in the input case material, indicating a
failure in instruction adherence rather than hallucination errors.</p>
      </sec>
      <sec id="sec-6-5">
        <title>Spurious Reasoning within Failed Abstention: Deepseek-R1-Distill-Llama-70B</title>
        <p>“&lt;think&gt; ... The first step is to check if there are common factors between the input current case and
either TSC1 or TSC2. If there are none, I should stop ... &lt;/think&gt;”
“Plaintiff’s Argument: Factors F6 Security-measures (P) and F21 Knew-info-confidential (P) were present
in both the input case and TSC1 ...”</p>
        <p>As shown in the example (Box 4), even when a model’s reasoning trace explicitly mentioned the
condition for stopping, it might still proceed to generate a spurious argument, failing to follow the
critical abstention instruction. Additionally, the model claimed that there were common factors between
the input case and TSC1.</p>
        <p>This qualitative analysis complements the quantitative metrics, highlighting that even with high
accuracy, completeness remains a challenge, and adherence to negative constraints like abstention is a
critical weakness for current LLMs in this legal argument generation context.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>This paper introduced and applied an automated pipeline to evaluate the performance of eight LLMs
on generating 3-ply, factor-based legal arguments, focusing specifically on faithfulness, completeness,
and the ability to follow abstention instructions. Our evaluation, guided by three research questions,
yielded the following conclusions:</p>
      <p>Regarding RQ1 (hallucination and incomplete factor utilization), our results show that while most
evaluated LLMs exhibit high Hallucination Accuracy ($Acc_H$ &gt; 90 − 95% in Tests 1 &amp; 2), indicating
they generally avoid citing non-existent factors when generating arguments in viable scenarios, they
struggle with completeness. Factor Utilization Recall ($Rec_U$) varied significantly (from ≈ 40% to ≈ 85%
in Tests 1 &amp; 2), demonstrating that LLMs often omit relevant, available factors from the input cases,
leading to potentially superficial arguments.</p>
      <p>Concerning RQ2 (adherence to abstention instructions), the evaluation revealed a critical weakness
across almost all models. When presented with non-arguable scenarios (Test 3) and explicitly instructed
to abstain, most models failed to follow this directive, as measured by our Abstention Ratio (Table 4).
Instead of abstaining, they generated spurious arguments. Only a few models showed a significant
ability to abstain, with GPT-4o performing best, yet still not perfectly. This highlights a fundamental
inability in most current LLMs to reliably recognize task impossibility and follow negative constraints.</p>
      <p>Addressing RQ3 (effectiveness of automated metrics), the proposed metrics ($Acc_H$, $Rec_U$, and
$Ratio_{Abstain}$) successfully quantified distinct error types. $Acc_H$ effectively measured faithfulness
(absence of hallucination). $Rec_U$ captured incomplete factor utilization in argument generation tasks (Tests
1 &amp; 2). $Ratio_{Abstain}$ directly measured the critical capability of adherence to abstention instructions in
Test 3. Together, these metrics clearly revealed performance variations across different LLMs and task
complexities, demonstrating the pipeline’s utility in diagnosing specific weaknesses.</p>
      <p>In summary, while LLMs show promise in generating factually grounded components of legal
arguments based on structured inputs, significant improvements are needed in ensuring comprehensive
reasoning (completeness) and, most crucially, in robust instruction following, particularly regarding
negative constraints and the ability to abstain appropriately. These deficiencies must be addressed
before LLMs can be reliably deployed for substantive legal argumentation tasks.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Limitations and Future Work</title>
      <p>This study is subject to several limitations. The evaluation utilizes synthetic, factor-represented cases,
simplifying the nuances of real-world legal texts and reasoning. The accuracy of our automated metrics
inherently depends on the performance of the external LLM used for factor extraction, introducing a
potential layer of error. The specific operationalization of our metrics, particularly for the abstention
test, could be further refined. Furthermore, the findings are based on a specific dataset size, prompt
structure, and set of LLMs, potentially limiting the generalizability of the precise quantitative results,
although the qualitative trends are likely indicative.</p>
      <p>Future research should aim to address these limitations. Evaluating performance on larger, more
diverse datasets, including those derived from real-world legal documents (which would necessitate
robust factor extraction from text as a preliminary step [31]), is a crucial next step. Further validation and
refinement of the automated metrics, potentially including comparisons with human expert judgments
on argument quality beyond factor usage, would strengthen the evaluation pipeline. Investigating the
underlying reasons for the observed deficiencies in recall and abstention through model interpretability
or targeted probing could inform the development of more reliable models. Finally, exploring novel
prompting strategies, fine-tuning approaches, or architectural modifications specifically designed to
enhance completeness and robust instruction adherence in legal argument generation remains a vital
avenue for future work.</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
      <p>[20] J. Savelka, K. D. Ashley, The unreasonable effectiveness of large language models in zero-shot
semantic annotation of legal texts, Frontiers in Artificial Intelligence 6 (2023) 1279794.
[21] Y. A. Yadkori, I. Kuzborskij, A. György, C. Szepesvári, To believe or not to believe your llm, arXiv
preprint arXiv:2406.02543 (2024).
[22] Y. Zhang, Y. Li, L. Cui, D. Cai, L. Liu, T. Fu, X. Huang, E. Zhao, Y. Zhang, Y. Chen, et al., Siren’s song
in the ai ocean: a survey on hallucination in large language models, arXiv preprint arXiv:2309.01219
(2023).
[23] S. Dhuliawala, M. Komeili, J. Xu, R. Raileanu, X. Li, A. Celikyilmaz, J. Weston, Chain-of-verification
reduces hallucination in large language models, arXiv preprint arXiv:2309.11495 (2023).
[24] T. Vu, M. Iyyer, X. Wang, N. Constant, J. Wei, J. Wei, C. Tar, Y.-H. Sung, D. Zhou, Q. Le, et al.,
Freshllms: Refreshing large language models with search engine augmentation, arXiv preprint
arXiv:2310.03214 (2023).
[25] C.-Y. Lin, Rouge: A package for automatic evaluation of summaries, in: Text summarization
branches out, 2004, pp. 74–81.
[26] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine
translation, in: Proceedings of the 40th annual meeting of the Association for Computational
Linguistics, 2002, pp. 311–318.
[27] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, Bertscore: Evaluating text generation with
bert, arXiv preprint arXiv:1904.09675 (2019).
[28] B. Wen, J. Yao, S. Feng, C. Xu, Y. Tsvetkov, B. Howe, L. L. Wang, Know your limits: A survey of
abstention in large language models, arXiv preprint arXiv:2407.18418 (2024).
[29] S. Feng, W. Shi, Y. Wang, W. Ding, V. Balachandran, Y. Tsvetkov, Don’t hallucinate, abstain:
Identifying llm knowledge gaps via multi-llm collaboration, arXiv preprint arXiv:2402.00367
(2024).
[30] K. D. Ashley, Artificial intelligence and legal analytics: new tools for law practice in the digital
age, Cambridge University Press, 2017.
[31] M. Gray, J. Savelka, W. Oliver, K. Ashley, Automatic identification and empirical analysis of
legally relevant factors, in: Proceedings of the Nineteenth International Conference on Artificial
Intelligence and Law, 2023, pp. 101–110.</p>
    </sec>
    <sec id="sec-appendix-a">
      <title>A. Prompt Structure for 3-Ply Argument Generation</title>
      <p>The following shows the structure of the prompt provided to the LLMs for the 3-ply argument generation
task.</p>
      <sec id="sec-9-1">
        <title>TASK</title>
        <p>In this task, we will formulate legal arguments based on trade secret misappropriation claims using a
structured approach. Follow the steps outlined below for consistency and clarity.</p>
      </sec>
      <sec id="sec-9-2">
        <title>Legal Problem Context</title>
        <p>In this problem, we aim to develop arguments using factors critical to trade secret misappropriation
claims. Typically, the Plaintiff alleges that the Defendant has misappropriated their trade secret. For
instance, Kentucky Fried Chicken (KFC) could claim misappropriation if an employee disclosed their
secret recipe, which is a blend of herbs and spices, by publishing it in a cookbook.</p>
        <p>Factors may support either the Plaintif (P) or the Defendant (D). The Plaintif might emphasize
measures they took to protect the recipe, while the Defendant could argue that the recipe was already
disclosed to outsiders. Based on the factors provided, construct a three-part argument as detailed below.</p>
        <p>Instructions
1. IMPORTANT: If there is no common factor between the current case and the TSC1/TSC2, you
need to say "No common factor between the input current case and the TSC1/TSC2" and stop
generating any argument.
2. Construct a 3-Ply Argument:
a) Plaintiff’s Argument: Present an argument in favor of the Plaintiff’s position by i. citing
a relevant Trade Secret Case (TSC1/TSC2) with a similar favorable outcome; ii. Highlighting
shared factors between the input current case and the TSC1/TSC2.
b) Defendant’s Counterargument: Refute the Plaintiff’s position by i. Distinguishing the
cited TSC1/TSC2 based on differing factors; ii. Citing a counterexample (a TSC1/TSC2 with
a Defendant-favorable outcome) and drawing an analogy to the input current case.
c) Plaintiff’s Rebuttal: Address and distinguish the counterexample, reinforcing the
Plaintiff’s original argument.
3. Use Provided Factors: Base your arguments on the factors outlined, ensuring logical consistency.</p>
      </sec>
      <sec id="sec-9-3">
        <title>Example Input and Output</title>
        <p>Example Current Case
• F4 Agreed-not-to-disclose (P)
• F6 Security-measures (P)
Example TSC1 (outcome: Plaintiff)
• F4 Agreed-not-to-disclose (P)
• F21 Knew-info-confidential (P)</p>
      </sec>
      <sec id="sec-9-4">
        <title>Example Output (json format)</title>
        <p>Plaintiff’s Argument: Factors F4 Agreed-not-to-disclose (P) and F6 Security-measures (P) were
present in both the current case and TSC1, where the court found in favor of the Plaintiff.
In addition, Factors F12 Outsider-disclosures-restricted (P), F14 Restricted-materials-used (P),
F21 Knew-info-confidential (P) are present in the current case and favor the Plaintiff.</p>
      </sec>
      <sec id="sec-9-5">
        <title>Current Case, TSC1, and TSC2 ...</title>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>B. Prompt Structure for Factor Extraction</title>
      <p>The following shows the structure of the prompt provided to the LLM (GPT-4.1) for the factor extraction
task.</p>
      <sec id="sec-10-1">
        <title>Example Prompt</title>
      </sec>
      <sec id="sec-10-2">
        <title>TASK</title>
        <p>You are tasked with extracting factors from the 3-ply argument.</p>
      </sec>
      <sec id="sec-10-3">
        <title>Example Input (json format)</title>
        <p>Plaintiff’s Argument: Factors F4 Agreed-not-to-disclose (P) and F6 Security-measures (P) were
present in both the current case and TSC1, where the court found in favor of the Plaintiff.</p>
      </sec>
      <sec id="sec-10-4">
        <title>Example Output (json format)</title>
        <p>Current Case
TSC1
TSC2</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Ariai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Demartini</surname>
          </string-name>
          ,
          <article-title>Natural language processing for the legal domain: A survey of tasks, datasets, models, and challenges</article-title>
          ,
          <source>arXiv preprint arXiv:2410.21306</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Katz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hartung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gerlach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jana</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. J. Bommarito</surname>
            <given-names>II</given-names>
          </string-name>
          ,
          <article-title>Natural language processing in the legal domain</article-title>
          ,
          <source>arXiv preprint arXiv:2302.12039</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Gray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , K. D. Ashley,
          <article-title>Generating case-based legal arguments with llms</article-title>
          ,
          <source>in: Proceedings of the 4th ACM Computers and Law Symposium</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Bommasani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Hudson</surname>
          </string-name>
          , E. Adeli,
          <string-name>
            <given-names>R.</given-names>
            <surname>Altman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Arora</surname>
          </string-name>
          , S. von Arx,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bohg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bosselut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Brunskill</surname>
          </string-name>
          , et al.,
          <article-title>On the opportunities and risks of foundation models</article-title>
          ,
          <source>arXiv preprint arXiv:2108.07258</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Frieske</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ishii</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Bang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Madotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fung</surname>
          </string-name>
          ,
          <article-title>Survey of hallucination in natural language generation</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>55</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D. U. S.</given-names>
            <surname>de la Osa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Remolina</surname>
          </string-name>
          ,
          <article-title>Artificial intelligence at the bench: Legal and ethical challenges of informing-or misinforming-judicial decision-making through generative ai</article-title>
          ,
          <source>Data &amp; Policy</source>
          <volume>6</volume>
          (
          <year>2024</year>
          )
          <fpage>e59</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Avery</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Abril</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>del Riego</surname>
          </string-name>
          ,
          <article-title>Chatgpt, esq.: Recasting unauthorized practice of law in the era of generative ai</article-title>
          ,
          <source>Yale JL &amp; Tech.</source>
          <volume>26</volume>
          (
          <year>2023</year>
          )
          <fpage>64</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>N.</given-names>
            <surname>Guha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nyarko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ré</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chilton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chohlas-Wood</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Peters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Waldon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Rockmore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zambrano</surname>
          </string-name>
          , et al.,
          <article-title>Legalbench: A collaboratively built benchmark for measuring legal reasoning in large language models</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Lexeval: A comprehensive chinese legal benchmark for evaluating large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2409.20288</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Fei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <article-title>Lawbench: Benchmarking legal knowledge of large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2309.16289</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lavril</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Izacard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Martinet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-A.</given-names>
            <surname>Lachaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lacroix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Rozière</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hambro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Azhar</surname>
          </string-name>
          , et al.,
          <article-title>Llama: Open and efficient foundation language models</article-title>
          ,
          <source>arXiv preprint arXiv:2302.13971</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Achiam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Adler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Akkaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. L.</given-names>
            <surname>Aleman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Altenschmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Altman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Anadkat</surname>
          </string-name>
          , et al.,
          <article-title>Gpt-4 technical report</article-title>
          ,
          <source>arXiv preprint arXiv:2303.08774</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>K. D.</given-names>
            <surname>Ashley</surname>
          </string-name>
          ,
          <source>Modeling Legal Argument: Reasoning with Cases and Hypotheticals</source>
          , The MIT Press, Cambridge, MA,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>V.</given-names>
            <surname>Aleven</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. D.</given-names>
            <surname>Ashley</surname>
          </string-name>
          ,
          <article-title>Teaching case-based argumentation through a model and examples: Empirical evaluation of an intelligent learning environment</article-title>
          ,
          <source>in: Artificial intelligence in education</source>
          , volume
          <volume>39</volume>
          ,
          Citeseer,
          <year>1997</year>
          , pp.
          <fpage>87</fpage>
          -
          <lpage>94</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>E. L.</given-names>
            <surname>Rissland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. B.</given-names>
            <surname>Skalak</surname>
          </string-name>
          ,
          <article-title>Cabaret: rule interpretation in a hybrid architecture</article-title>
          ,
          <source>International Journal of Man-machine Studies</source>
          <volume>34</volume>
          (
          <year>1991</year>
          )
          <fpage>839</fpage>
          -
          <lpage>887</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Brüninghaus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. D.</given-names>
            <surname>Ashley</surname>
          </string-name>
          ,
          <article-title>Predicting outcomes of case based legal arguments</article-title>
          ,
          <source>in: Proceedings of the 9th International conference on Artificial Intelligence and Law</source>
          , ACM,
          <year>2003</year>
          , pp.
          <fpage>233</fpage>
          -
          <lpage>242</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M.</given-names>
            <surname>Grabmair</surname>
          </string-name>
          ,
          <article-title>Predicting trade secret case outcomes using argument schemes and learned quantitative value effect tradeoffs</article-title>
          ,
          <source>in: Proceedings of the 16th ICAIL</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>89</fpage>
          -
          <lpage>98</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Horty</surname>
          </string-name>
          ,
          <article-title>Modifying precedential constraint</article-title>
          ,
          <source>Artificial Intelligence and Law</source>
          <volume>30</volume>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>24</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Gray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Savelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. M.</given-names>
            <surname>Oliver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. D.</given-names>
            <surname>Ashley</surname>
          </string-name>
          ,
          <article-title>Empirical legal analysis simplified: reducing complexity through automatic identification and evaluation of legally relevant factors</article-title>
          ,
          <source>Philosophical Transactions of the Royal Society A</source>
          <volume>382</volume>
          (
          <year>2024</year>
          )
          <fpage>20230155</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>In</given-names>
            <surname>Addition</surname>
          </string-name>
          ,
          <article-title>Factors F12 Outsider-disclosures-restricted (P), F14 Restricted-materials-used (P), F21 Knew-info-confidential (P) are present in the current case and favor the Plaintif</article-title>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>