<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>June</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Legal Argument Mining: Recent Trends and Open Challenges</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rūta Liepiņa</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesca Galloni</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesca Lagioia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Lippi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mariaceleste Musicco</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Burcu Sayin</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Passerini</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Sartor</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ALMA-AI, CIRSFID, Department of Law, University of Bologna</institution>
          ,
          <addr-line>Bologna</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Information Engineering (DINFO), University of Florence</institution>
          ,
          <addr-line>Florence</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Information Engineering and Computer Science, University of Trento</institution>
          ,
          <addr-line>Trento</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Law Department, European University Institute</institution>
          ,
          <addr-line>Florence</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>20</volume>
      <issue>2025</issue>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>This paper presents a brief survey of recent trends in legal argument mining, focusing on the early use of large language models in this subfield. As legal texts, especially judicial decisions, increase in volume and complexity, the need for effective tools to extract and analyse legal arguments becomes more pressing. The paper outlines key datasets and tasks in legal argument mining, and identifies challenges and open issues to guide future research.</p>
      </abstract>
      <kwd-group>
        <kwd>Legal argument mining</kwd>
        <kwd>large language models</kwd>
        <kwd>argumentation datasets</kwd>
        <kwd>survey</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The growing complexity and volume of legal texts – particularly judicial decisions – have spurred interest in computational tools capable of capturing and organising legal reasoning [1]. Argument mining (AM) has emerged as a crucial task for the legal domain, aimed at extracting and analysing argumentative structures from textual documents [2][3]. Indeed, AM can be highly beneficial to both legal research and legal decision making, improving judicial transparency and contributing to the advancement of AI-assisted reasoning. Judicial decisions are inherently argumentative, often justifying legal conclusions through layered reasoning, with references to precedents and legislation. Structured access to this information can empower legal professionals and scholars to better understand, critique, and apply legal rulings. The information acquisition bottleneck has significantly affected progress in legal argument mining (LAM). Traditional approaches to information extraction have faced persistent challenges in addressing LAM due to aspects such as the complexity of legal language, its context dependency, the need for domain expertise, and variation across jurisdictions [4][5]. The emerging capabilities of large language models (LLMs) [6] hold promise for assisting with LAM. Their capacity for zero-shot and few-shot learning and for contextual understanding can be deployed in the identification and/or classification of legal content. Their proficiency in reasoning tasks makes them especially promising for AM in law, where subtle distinctions and interpretive nuances have to be taken into account to detect and distinguish patterns of reasoning. This study aims to map the state-of-the-art research on the use of LLMs for argument mining in judicial decisions and to identify research gaps and open research questions. By focusing on LAM and LLMs, our work aims to serve as a reference for researchers and legal professionals seeking to understand and advance the use of AI in the analysis of judicial reasoning.</p>
      <p>The paper is structured as follows. Section 2 describes the criteria for selecting the relevant studies. Sections 3 and 4 respectively present an overview of the employed datasets and the argument mining tasks. Section 5 discusses challenges, open questions, and outlines future research directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Mapping the latest advancements: selection criteria</title>
      <p>We conducted a targeted literature review to map recent advancements in the use of classical transformer
models (e.g., LegalBERT) and LLMs for LAM. The search covers leading databases and venues in AI,
computational linguistics, and legal informatics, including the ACM Digital Library, ACL Anthology,
JURIX, ICAIL, the Argument Mining Workshop, and the Artificial Intelligence and Law Journal. We used
focused keywords – “argument mining”, “legal argument mining”, “large language models”, and “legal
argumentation” – to identify studies at the intersection of advanced NLP and legal reasoning. Papers
are selected for their methodological, experimental, and dataset-focused relevance to judicial contexts.
Three inclusion criteria are applied. Topical Relevance requires a substantive connection to the methodological,
experimental, and dataset-oriented aspects of LAM. Venue Quality restricts the selection to reputable,
peer-reviewed conferences, journals, or workshops within AI and law or closely related fields. Temporal
Relevance considers works published from 2020 onward, a period marked by the rapid evolution and
widespread adoption of most advanced transformer models and LLMs. Exclusion criteria filter out studies
with limited pertinence to LAM, those lacking data analysis, and grey literature (e.g., non-peer-reviewed
or self-published works). The final selection includes 21 papers (7 of which use LLMs, and 14 use classical
transformer models). We collected citation metadata and examined datasets, distinguishing between
new corpora and reused datasets, as well as jurisdictions, domains, and languages. We categorised
the AM tasks, and documented the employed models. Contributions are assessed in terms of classical
metrics and expert-based legal evaluations. Finally, we highlight recurring and new challenges, which
include data limitations, annotation subjectivity, and cross-jurisdictional generalisability.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Datasets</title>
      <p>Despite growing interest in computational legal reasoning, open-access and high-quality datasets for
LAM remain limited, especially when compared to general-domain or even other specialised NLP areas.
Nonetheless, several recent contributions have introduced or repurposed datasets for tasks involving
legal argumentation, enabling experimentation with classical transformer-based models and LLMs.
This section reviews the most relevant corpora in the selected literature, outlining jurisdictions, legal
domain focus, language and scope. A comparative table summarising dataset properties and usage is
given in Appendix A.</p>
      <p>
        European courts. The supranational nature of European courts, particularly the European Court
of Human Rights (ECHR) and the Court of Justice of the European Union (CJEU), has motivated the
creation of annotated datasets for LAM. The ECHR in particular has been a major focus. One of the
earliest contributions comes from Poudyal et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], who created a corpus of 42 ECHR judgments in
English, annotated at the clause level as premise, conclusion, or non-argumentative, along with mapped
relations between premises and conclusions. Clauses may serve multiple argumentative roles, reflecting
the layered nature of legal reasoning. Expanding this foundation, the LAM:ECHR corpus [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], offers
fine-grained annotations over 373 ECHR judgments in English, focusing on Articles 3, 7, and 8 of the EU
Charter of Fundamental Rights. Argument spans are labelled using a Toulmin-inspired model, capturing
both argument types (e.g., different methods of interpretation, tests of the principle of proportionality,
precedents, and others) and actors (e.g., the ECHR itself, the State, applicants).
      </p>
      <p>
        Further extending ECHR coverage, Chlapanis et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] exploit the legal reasoning abilities of LLMs.
Given a set of arguments, the goal is to predict the next correct statement. To this end, they produced
LAR-ECHR, by combining three pre-existing corpora [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ][
        <xref ref-type="bibr" rid="ref8">8</xref>
        ][
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The final corpus contains 403 samples
from 191 cases in English, each sample including a target argument (i.e., the correct next legal statement)
and several distractors (i.e., incorrect next statements). All targets are extracted from legal reasoning
authored by judges, and pertain to the court’s application of law to facts and follow parties’ submissions.
Distractors are selected from the same corpus to match target arguments in style and vocabulary,
avoiding paraphrases. The LaCour! corpus [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] introduces oral legal discourse in English and French.
It includes 154 transcribed ECHR courtroom dialogues held between 2012 and 2021. Transcripts are
annotated with sentence-level labels, identifying questions, opinions (e.g., dissenting, concurring),
speaker roles, language and timestamps. Each hearing is linked to the corresponding final judgment,
offering a multimodal perspective on case deliberation.
      </p>
      <p>
        Finally, Demosthenes [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] includes 40 English judgments by the Court of Justice of the European Union
(CJEU) on fiscal State Aid, ranging from 2000 to 2018. It focuses on the “Findings of the Court” section.
The annotation follows a three-level hierarchical scheme: (1) argumentative components (premises,
conclusions), (2) premise types (legal or factual), and (3) argumentation schemes (e.g., Rule, Precedent,
Authority). An extension to the dataset [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] introduced inferential relations between components, such
as support, rebuttal, and undercut, allowing for rich modelling of argumentative structures.
      </p>
      <p>National courts - common law. Academic researchers have created several datasets that highlight
the evidential reasoning in common law jurisdictions. In the U.S., the BVA dataset [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] contains 30
Veterans’ Appeals decisions on PTSD claims in English, from 2013 to 2016. The corpus is annotated with
three rhetorical roles, relevant to evidentiary reasoning: evidence, reasoning, and findings - mapping
directly to how tribunals evaluate and adjudicate claims.
      </p>
      <p>In Canada, a legal argumentation corpus was created from 28,733 case-summary pairs sourced
from CanLII [16]. Initial annotation involved 574 randomly selected summaries [17], later expanded to
1,049. Each case is annotated with the components of an argument triple: issue, conclusion, and reason.
The corpus contains statements extracted from full-text decisions and their corresponding summaries.
The total number of statements from full texts is significantly higher than the number extracted from
the summaries.</p>
      <p>One of the largest-scale resources to date is the Indian Supreme Court corpus by Ali et al. [18],
covering 30,034 English decisions from 1952 to 2012. Through rule-based methods, relational sentence
pairs are automatically identified and labelled as Support or Attack, enabling large-scale analysis of
argumentative dynamics. The authors later expand the dataset [19], specifically focusing on industrial
disputes. Each argument is represented by its start and end sentence, with every sentence in between
considered part of the argument. For each argument, annotators were also required to identify
the sentence containing the major claim.</p>
      <p>Building on previous work [20], Bambroo et al. [21] created two additional English corpora focused
on rhetorical role classification: the DIN dataset, based on 150 Indian Supreme Court decisions (including
100 newly annotated cases), and the DUK dataset, covering 50 judgments by the UK Supreme Court.
Each sentence is labelled using a seven-role schema – Facts, Lower Court Ruling, Argument, Ratio,
Statute, Precedent, Present Court Ruling – reflecting the internal structure of legal decisions.</p>
      <p>National courts - civil law. In civil law jurisdictions, legal judgments often follow a codified structure,
and argumentation is more tightly bound to statutory interpretation. New datasets from Germany, Italy,
and Spain mark a turning point in making these systems accessible for empirical analysis.</p>
      <p>The German dataset [22] focuses on proportionality arguments in constitutional law. It includes 300
randomly selected decisions by the German Federal Constitutional Court (GFCC), issued between 1951
and 2021. Annotations, limited to the “merit” section, are based on the GFCC’s four-step proportionality
test – legitimate aim, suitability, necessity, balancing – and allow multiple labels per sentence.</p>
      <p>
        The Italian dataset [23] consists of 225 VAT rulings in Italian by Regional Tax Commissions from 2010
to 2022, sourced from the Giustizia Tributaria database. Annotations follow a three-level hierarchy –
i.e., argument components, premise type, and argument scheme – defined by [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] for the Demosthenes
corpus.
      </p>
      <p>The Spanish dataset [24] focuses on family law issues (e.g., child custody, alimony, house allocation).
It includes 3,047 decisions from provincial and higher courts, issued between 2015 and 2020, and sourced
from the Spanish Centre for Judicial Documentation (CENDOJ). Annotations cover the following
elements: request types (e.g., custody, alimony), judicial principles, factual or legal justifications and
decisions. To optimise training, annotated segments were selected based on two criteria: their clarity in
representing the target category and their self-contained interpretability.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Argument Mining Tasks</title>
      <p>
        LAM encompasses a constellation of computational tasks aimed at dissecting and reconstructing the
argumentative elements of judicial decisions. At its core, argument mining seeks to render the complex
reasoning processes embedded in legal texts into structured, machine-interpretable representations.
This is typically approached through a multi-stage pipeline, aimed at extracting natural language
arguments and their relations from textual documents [25][
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Each stage addresses a sub-task of
the problem. First, argumentative sentences (i.e., those containing an argument or part
thereof) are typically detected. Then the boundaries of the various argument components are identified and their
characteristics are specified (e.g., distinguishing between premises and conclusions, argument schemes,
actors) [26][
        <xref ref-type="bibr" rid="ref5">5</xref>
        ][27]. Finally, relationships are predicted between these components and/or between the
arguments they are part of [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ][
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. We structure our review by grouping these tasks into four general
categories: argument component detection and classification, structure and relation modelling, legal
reasoning, and multi-task and hybrid setups. Additional details on the tasks and employed models can
be found in Appendix B.
      </p>
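The stages of this pipeline can be sketched as a cascade of classifiers. The toy implementation below uses keyword heuristics as placeholders for trained models; all cue lists and function names are our own illustrative assumptions, not drawn from any surveyed system.

```python
# Toy sketch of a three-stage argument mining pipeline:
# (1) detect argumentative sentences, (2) classify components,
# (3) link premises to conclusions. The keyword heuristics are
# illustrative placeholders for trained classifiers.

CONCLUSION_CUES = ("therefore", "accordingly", "the court finds")
ARGUMENT_CUES = CONCLUSION_CUES + ("because", "since", "pursuant to")

def is_argumentative(sentence: str) -> bool:
    """Stage 1: keep only sentences that look like part of an argument."""
    s = sentence.lower()
    return any(cue in s for cue in ARGUMENT_CUES)

def classify_component(sentence: str) -> str:
    """Stage 2: label an argumentative sentence as premise or conclusion."""
    s = sentence.lower()
    return "conclusion" if any(cue in s for cue in CONCLUSION_CUES) else "premise"

def link_components(components: list[tuple[str, str]]) -> list[tuple[int, int]]:
    """Stage 3: naively link each premise to the nearest following conclusion."""
    links = []
    for i, (_, label_i) in enumerate(components):
        if label_i != "premise":
            continue
        for j in range(i + 1, len(components)):
            if components[j][1] == "conclusion":
                links.append((i, j))  # premise i supports conclusion j
                break
    return links

def mine_arguments(sentences: list[str]):
    """Run the full cascade over a list of sentences."""
    components = [(s, classify_component(s)) for s in sentences if is_argumentative(s)]
    return components, link_components(components)
```

In a real system, each stage would be a trained model (e.g., a fine-tuned transformer); the cascade structure, however, mirrors the sub-task decomposition described above.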
      <p>Detection and classification: At the foundational level of AM pipelines lies the task of identifying
argumentative content – distinguishing those statements that contain claims, premises or other
reasoning elements from non-argumentative texts. Once identified, these components are further categorised
according to their argumentative function (e.g., premise, conclusion) and attributes (e.g., factual vs.
legal, actor attribution and rhetorical role). To extract non-overlapping, contiguous sentence spans
representing complete legal arguments, Ali et al. [19] identify a set of argument markers, including
claim sentences and their supporting premises. They model the task as a text segmentation problem
over entire court judgments, integrating local classifications using Integer Linear Programming to
produce an optimal document-level segmentation of legal arguments.</p>
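Ali et al. [19] combine local sentence-level classifications into a globally optimal segmentation via Integer Linear Programming. The sketch below illustrates the underlying idea with a dependency-free dynamic program over illustrative per-sentence scores; it is a simplified stand-in for exposition, not the authors' formulation.

```python
# Sketch of document-level argument segmentation: local per-sentence scores
# (e.g., classifier confidence that a sentence is inside an argument, shifted so
# that non-argumentative sentences score below zero) are combined into an
# optimal set of non-overlapping, contiguous segments. The global optimisation
# is solved here by dynamic programming rather than an ILP solver; the scoring
# scheme is an illustrative assumption.

def best_segments(scores: list[float]) -> list[tuple[int, int]]:
    """Return segments [start, end) maximising the summed sentence scores."""
    n = len(scores)
    prefix = [0.0]
    for s in scores:
        prefix.append(prefix[-1] + s)
    # best[i] = (max total score over sentences[:i], chosen segments)
    best: list[tuple[float, list[tuple[int, int]]]] = [(0.0, [])]
    for i in range(1, n + 1):
        # option 1: sentence i-1 lies outside any segment
        value, segs = best[i - 1]
        # option 2: a segment ends at i; try every possible start j
        for j in range(i):
            cand = best[j][0] + (prefix[i] - prefix[j])
            if cand > value:
                value, segs = cand, best[j][1] + [(j, i)]
        best.append((value, segs))
    return best[n][1]
```

A negatively scored sentence is only absorbed into a segment when it glues together two high-scoring neighbours whose combined score still makes the segment worth keeping, which matches the intuition of contiguous argument spans.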
      <p>
        Al Zubaer and colleagues [28] focus on the binary classification of argumentative components as
premise(s) or conclusion(s). Grundler et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], in turn, exemplify the layered AM pipeline through a
four-step framework: (i) detection, i.e., classifying sentences as argumentative or not; (ii) classification
as premise(s) or conclusion(s); (iii) type classification of premises as legal, factual, or both; and (iv)
scheme classification. The last two are multi-label tasks assigning types and argument schemes to
legal premises. Muñoz-Soro et al. [24] identify relevant argumentative components in family law
through binary classification (argumentative/non-argumentative) – with a particular focus on child
custody cases – and then introduce a tailored multi-label classification of such sentences, categorised as
types of plaintiff’s requests, legal justifications (the main arguments used by the court in custody
proceedings), and the court’s decisions.
      </p>
      <p>
        Rhetorical roles are a relevant point of focus too. Walker et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], for instance, distinguish among (i)
evidence (e.g., describing medical records and lay testimony); (ii) evidence reasoning (explaining how
the tribunal interprets such evidence); and (iii) findings (stating formal factual conclusions reached by
the decision-maker). Several studies have focused on the Issue–Reason–Conclusion (IRC) framework,
to model argumentative structure in legal texts. Xu et al. [17] rely on the IRC taxonomy to classify
statements according to their argumentative roles. They later refine the IRC approach [16] by incorporating
a token-level and subsequent sentence-level classification of (i) court issues; (ii) conclusions on such
issues (the court’s decision); and (iii) the court’s reasons for so concluding, coupling the binary classification
task with abstractive summarisation (all unannotated sentences are treated as non-IRC sentences).
Similarly, with the ultimate goal of improving the summarisation task, Elaraby and Litman [29] adopt a
multi-class approach for a sentence-level classification according to one of three legal argument roles
under the IRC framework. An additional class is assigned to non-argumentative sentences.
      </p>
      <p>
        Bambroo et al. [21] provide a multi-class classification of statements into seven predefined rhetorical
roles (Facts, Ratio, Precedent, etc.). Lüders and Stohlmann [22] focus on determining whether or not
a Court’s statement invokes the proportionality argument. The task is framed as a binary
sentence-level classification. Finally, Habernal et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] classify spans of legal text with argument types (e.g.,
Textual interpretation, Application to the concrete case) and identify the actors responsible for each
argumentative content (e.g., ECHR, Applicant, State).
      </p>
      <p>Structures and relations. Beyond identifying discrete components, AM aims to reconstruct the
logical architecture that connects such elements. This entails modelling inferential and rhetorical
relationships – such as support, rebuttal, or contradiction – between sentences or propositions.</p>
      <p>
        Ali et al. [18] develop methods for support and attack relation classification, combining linguistic
cues (e.g., discourse connectors), semantic similarity, and weak supervision strategies. They also explore
automated dataset construction via weakly supervised methods. Santin et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] expand this space by
identifying and classifying five relation types – Support from Premise(s), Support from Failure, Rebuttal,
Undercut, and Rephrase – offering a fine-grained typology of argumentative dynamics within judicial
texts. This layer of analysis aims to capture the holistic reasoning process of judicial decisions, allowing
the reconstruction of structured argumentative graphs.
      </p>
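The use of linguistic cues such as discourse connectors for relation classification, as in Ali et al. [18], can be illustrated with a minimal rule-based sketch; the cue lists below are generic examples of such connectors, not the actual rules from the surveyed work.

```python
# Minimal rule-based sketch of support/attack relation labelling between two
# text spans, in the spirit of connector-based linguistic cues. The cue lists
# are illustrative assumptions, not the rules used by any surveyed system.

SUPPORT_CUES = ("therefore", "hence", "consequently", "in light of")
ATTACK_CUES = ("however", "nevertheless", "on the contrary", "rejects")

def label_relation(source: str, target: str) -> str:
    """Label the relation of `target` towards `source`: support, attack, or none."""
    t = target.lower()
    if any(cue in t for cue in ATTACK_CUES):
        return "attack"
    if any(cue in t for cue in SUPPORT_CUES):
        return "support"
    return "none"
```

A learned relation classifier would additionally condition on the source span and on semantic similarity; the sketch only shows how surface cues alone can seed weak supervision.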
      <p>
        Legal reasoning: A more recent evolution in LAM centers on emulating legal reasoning. This shift
reflects a broader ambition to go beyond surface-level pattern recognition toward interpreting and
replicating the decision-making logic of courts. Chlapanis et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] introduce the Legal Argument
Reasoning (LAR) task. Given the case facts, LLMs predict the next logical statement in a legal argument
chain from among multiple-choice options.
      </p>
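A task of this kind is typically posed to an LLM as a multiple-choice prompt. The helper below sketches one plausible prompt format; the template wording and option labelling are our own assumptions, not the actual LAR-ECHR prompt.

```python
# Sketch of building a multiple-choice prompt for a next-statement legal
# argument reasoning task. The template wording is an illustrative assumption,
# not the prompt used for LAR-ECHR.
import string

def build_lar_prompt(case_facts: str, argument_chain: list[str], options: list[str]) -> str:
    """Assemble case facts, the argument so far, and lettered candidate statements."""
    chain = "\n".join(f"{i + 1}. {step}" for i, step in enumerate(argument_chain))
    choices = "\n".join(f"({string.ascii_uppercase[i]}) {opt}" for i, opt in enumerate(options))
    return (
        "Case facts:\n" + case_facts + "\n\n"
        "Argument so far:\n" + chain + "\n\n"
        "Which statement correctly continues the court's argument?\n"
        + choices + "\nAnswer with a single letter."
    )
```

The resulting string would be sent to an LLM, whose answer letter is compared against the target argument; distractors occupy the remaining option slots.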
      <p>
        Similarly, Held and Habernal [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] examine whether judges’ questions during oral hearings can
serve as predictors of subsequent dissenting or concurring opinions. Thus, they frame reasoning as
a process of inferring latent intent. The authors examine various tasks, including the classification
of what kind of opinion is expressed after the questioning part (i.e., “dissenting”, “concurring”,
“partly”, “opinion”). Furthermore, reasoning tasks are explored in combination with summarisation and
Q&amp;A.
      </p>
      <p>Reasoning is also integrated with other NLP tasks. To evaluate the quality of summaries, Xu et al.
[30] generate structured question-answer pairs, grounded in the IRC schema. Similarly, Smywinski
[31] assesses the capacity of LLMs to reason through question-answer pairs generated from legal texts.
Their emphasis is on understanding, interpreting, and reasoning over legal arguments, rather than
merely detecting or classifying argument components or their relationships. Lastly, Lu [32] explores
prompt engineering as a means of simulating professional legal reasoning, testing whether structured
prompts like IRAC or TREACC can elicit logical coherence and improve LLMs’ ability to assess legal
arguments in a zero-shot setting.</p>
      <p>
        Multi-task and hybrid setups: Some studies approach argument mining through a multi-stage
pipeline, combining several sub-tasks to capture argumentative structures more comprehensively.
Zhang and others [33] begin with argument clause recognition, classifying sentences from judicial
opinions as either argumentative or non-argumentative. The second stage, argument relation mining,
identifies inferential links between components within the same argument. Finally, argument component
classification assigns each argumentative clause a role – either premise or conclusion – using two
binary classifiers. Similarly, Poudyal et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] define a three-part pipeline tailored to legal texts: (A)
Argument Clause Recognition, which determines whether a clause forms part of an argument; (B)
Argument Relation Mining, which assesses whether two clauses are part of the same argument; and (C)
Premise/Conclusion Recognition, which classifies the argumentative role of each clause. This integrated
approach enables a more structured analysis of legal reasoning across multiple layers.
      </p>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>
        Legal argument mining poses several unique challenges for LLMs, rooted both in the nature of legal
texts and in the current limitations of computational approaches. A fundamental obstacle is the scarcity
of annotated legal data, as highlighted by Zhang et al. [33]. Data scarcity has notoriously been a crucial
issue for the argument mining community [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ][
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In the legal domain, this issue is aggravated by the complexity
and linguistic specificities of legal texts, whose annotation requires costly legal expertise. This aspect
makes the task of legal argument mining particularly suitable for experimentation in a zero-shot or
few-shot learning setting.
      </p>
      <p>
        In fact, although domain-specific pre-training of models offers promising solutions to this “data
poverty”, more research is needed to adapt NLP tools to high-complexity tasks. The issue of label
ambiguity, where a clause can function as both a premise and a conclusion, complicates binary classification
approaches, as observed by Al Zubaer et al. [28] in experiments on an ECHR dataset. They also note the
brittleness of in-context learning, with model performance being sensitive to prompt design. Santin et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]
emphasise the difficulty of reconstructing complex argumentative structures due to low annotator
agreement and the challenges in long-distance link prediction – especially when argument pairs span
large portions of text, increasing class imbalance. Moreover, Habernal et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] argue that there still
exists a gap between how arguments are represented in computational argumentation and how legal
experts interpret them, according to legal reasoning. In this direction, Bambroo et al. [21] draw attention
to the long, unstructured nature of legal documents and the crucial need for explainability in legal
AI, where trust hinges on the reference of authoritative legal sources, and the heavy dependence of
performance on embedding quality.
      </p>
      <p>Taken together, these challenges underscore the need for more robust, interpretable, and domain-aware
models tailored to the legal domain. So far, the performance of LLMs for legal argument mining
tasks is promising, but there is a large margin for improvement. For example, Al Zubaer et
al. [28] found that law-specific models like Legal-BERT outperformed general-purpose models such
as GPT-3.5 and GPT-4, particularly in identifying conclusions, and local embedding models proved
to be competitive alternatives. In other tasks, limitations in LLMs’ reasoning have emerged. Studies
reveal that even advanced prompting techniques, like Chain-of-Thought (CoT), fail to ensure accurate
application of legal standards, with LLMs often reverting to outdated doctrines in U.S. federal jurisdiction
law. Moreover, models appear to be strongly influenced by the context provided in the introductory
sections of datasets [32]. Evaluation methods themselves also warrant scrutiny. For instance,
retrieval-based metrics may not adequately capture the qualitative dimensions of legal argument relevance,
and the generalizability of findings is limited due to the use of jurisdiction-specific datasets and legal
traditions [31].</p>
      <p>
        Finally, efforts to use LLMs for structured reasoning tasks, such as predicting the next step in legal
arguments based on ECHR cases, highlight additional limitations [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Results are shown to be affected
by factors such as dataset construction methods, summarisation of case facts, and the artificial nature
of the prediction task itself.
      </p>
      <p>In future work, we plan to extend this survey study by examining in greater detail the methods and
models used for legal argument mining tasks, and by broadening the selection of papers to also include
those addressing legal argument generation tasks.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was partially supported by the following projects: CompuLaw - Computable Law - funded
by the ERC under the Horizon 2020 (Grant Agreement N. 833647); PRIN2022 PRIMA - PRivacy
Infringements Machine-Advice (Ref. Prot. n.: 20224TPEYC - CUP J53D23005130001); PRIN2022 EQUAL –
EQUitableALgorithms (Ref. Prot n. 2022KFLF3E_001 - CUP J53D23005560001); “FAIR - Future Artificial
Intelligence Research” – Spoke 8 “Pervasive AI”, under the European Commission’s NextGeneration
EU programme, PNRR – M4C2 – Investimento 1.3, Partenariato Esteso (PE00000013); TANGO - Grant
Agreement no. 101120763. Funded by the European Union. Views and opinions expressed are however
those of the author(s) only and do not necessarily reflect those of the European Union or the European
Health and Digital Executive Agency (HaDEA). Neither the European Union nor the granting authority
can be held responsible for them.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Grammarly in order to check grammar and
spelling. After using this tool/service, the author(s) reviewed and edited the content as needed and
take(s) full responsibility for the publication’s content.</p>
      <p>[16] H. Xu, K. Ashley, Multi-granularity argument mining in legal texts, in: Legal Knowledge and
Information Systems, IOS Press, 2022, pp. 261–266.
[17] H. Xu, J. Savelka, K. D. Ashley, Toward summarizing case decisions via extracting argument issues,
reasons, and conclusions, in: Proceedings of the eighteenth international conference on artificial
intelligence and law, 2021, pp. 250–254.
[18] B. Ali, S. Pawar, G. Palshikar, R. Singh, Constructing a dataset of support and attack relations
in legal arguments in court judgements using linguistic rules, in: Proceedings of the Thirteenth
Language Resources and Evaluation Conference, 2022, pp. 491–500.
[19] B. Ali, S. Pawar, G. Palshikar, A. S. Banerjee, D. Singh, Legal argument extraction from court
judgements using integer linear programming, in: Proceedings of the 10th Workshop on Argument
Mining, 2023, pp. 52–63.
[20] P. Bhattacharya, S. Paul, K. Ghosh, S. Ghosh, A. Wyner, Identification of rhetorical roles of
sentences in Indian legal judgments, in: Legal knowledge and information systems, IOS Press,
2019, pp. 3–12.
[21] P. Bambroo, S. Adhikary, P. Bhattacharya, A. Chakraborty, S. Ghosh, K. Ghosh, MARRO:
multiheaded attention for rhetorical role labeling in legal documents, Artificial Intelligence and Law
(2025) 1–30.
[22] K. Lüders, B. Stohlmann, Classifying proportionality-identification of a legal argument, Artificial</p>
      <p>Intelligence and Law (2024) 1–28.
[23] G. Grundler, A. Galassi, P. Santin, A. Fidelangeli, F. Galli, E. Palmieri, F. Lagioia, G. Sartor, P. Torroni,
et al., AMELIA-Argument Mining Evaluation on Legal documents in ItAlian: A CALAMITA
challenge, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it
2024), Pisa, Italy, 2024.
[24] J. F. Muñoz-Soro, R. del Hoyo Alonso, R. Montañes, F. Lacueva, A neural network to identify
requests, decisions, and arguments in court rulings on custody, Artificial Intelligence and Law 33
(2025) 101–135.
[25] E. Cabrio, S. Villata, Five years of argument mining: a data-driven analysis, in: IJCAI, ijcai.org,
2018, pp. 5427–5433.
[26] R. Bar-Haim, I. Bhattacharya, F. Dinuzzo, A. Saha, N. Slonim, Stance classification of
contextdependent claims, in: EACL (1), Association for Computational Linguistics, 2017, pp. 251–261.
[27] V. Niculae, J. Park, C. Cardie, Argument mining with structured SVMs and RNNs, in: Proceedings
of the 55th Annual Meeting of the Association for Computational Linguistics, Association for
Computational Linguistics, Vancouver, Canada, 2017, pp. 985–995.
[28] A. Al Zubaer, M. Granitzer, J. Mitrović, Performance analysis of large language models in the
domain of legal argument mining, Frontiers in artificial intelligence 6 (2023) 1278796.
[29] M. Elaraby, D. Litman, ArgLegalSumm: Improving abstractive summarization of legal documents
with argument mining, in: Proceedings of the 29th International Conference on Computational
Linguistics, 2022.
[30] H. Xu, K. Ashley, A question-answering approach to evaluating legal summaries, in: Legal</p>
      <p>Knowledge and Information Systems, IOS Press, 2023, pp. 293–298.
[31] A. Smywiński-Pohl, T. Libal, Enhancing legal argument retrieval with optimized language model
techniques, in: JSAI International Symposium on Artificial Intelligence, Springer, 2024, pp. 93–108.
[32] Y.-A. Lu, H.-Y. Kao, 0x. yuan at semeval-2024 task 5: Enhancing legal argument reasoning with
structured prompts, in: Proceedings of the 18th International Workshop on Semantic Evaluation
(SemEval-2024), 2024, pp. 385–390.
[33] G. Zhang, D. Lillis, P. Nulty, Can domain pre-training help interdisciplinary researchers from data
annotation poverty? A case study of legal argument mining with bert-based transformers, in:
Proceedings of the Workshop on Natural Language Processing for Digital Humanities, 2021, pp.
121–130.
[34] O. Shulayeva, A. Siddharthan, A. Wyner, Recognizing cited facts and principles in legal judgements,</p>
      <p>Artificial Intelligence and Law 25 (2017) 107–126.</p>
      <sec id="sec-7-1">
        <title>Appendix</title>
        <p>A. Overview of legal argument mining datasets</p>
      </sec>
      <sec id="sec-7-2">
        <title>Overview of argument mining tasks and models</title>
        <p>Studies covered include Habernal et al. [8], Ali et al. [18], Santin et al. [14], Chlapanis
et al. [9], Held and Habernal [12], Lu et al. [32], Xu et al. [30], Smywiński-Pohl et al. [31],
Poudyal et al.
          [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]
          , and Zhang et al. [33]. Datasets used include Demosthenes, LAR-ECHR, LaCour!, the
SemEval-2024 Task 5 dataset (domain of U.S. civil procedure), the Canada dataset (reuse), the
Indian Supreme Court corpus, the ECHR corpus of Poudyal et al., the ECHR corpus reused by Zhang
et al., and datasets not specified (N/S).</p>
        <p>Component identification. (1) Multi-class sentence classification following the IRC
framework, i.e. issue, reason, conclusion, non-IRC. (2) Multi-class sequence labeling: spans
(token-level sequences) are classified along two dimensions, argument type and argument actor,
as two single-label classification tasks. Each token within a paragraph is labelled to indicate
whether it (i) begins, (ii) is inside, or (iii) is outside an argument span of a certain type.
Argument actor classification is a multi-class, single-label task in which tokens are labelled
for their rhetorical or legal role and the corresponding actor.</p>
        <p>Structures and relations. (1) Multi-class classification: predict the correct relation
label for any given sentence pair, i.e. support, attack, or no relation. (2) Binary
classification and inferential relation prediction, i.e. support, rebuttal, undercut, rephrase:
link prediction between argumentative components within judicial decisions.</p>
        <p>Legal reasoning. (1) Multiple-choice next-statement selection, a forced-choice
classification task. Legal Argument Reasoning (LAR): given a sentence, predict the next logical
argument statement among multiple possible choices. (2) Multi-class and binary classification to
investigate whether the questions posed by judges during ECHR oral hearings correlate with the
type of opinion those judges later issue in the final judgment, i.e. dissenting, concurring, or
none. (3) Binary classification to evaluate whether structured legal reasoning prompts (e.g.
IRAC, TREACC) can guide LLMs to determine whether argumentative legal answers are correct or
incorrect, based on the context of U.S. civil procedure cases, including the legal question and
explanation, in a zero-shot learning setting. (4) Evaluation of the presence of argumentative
structure following the IRC framework (Issue-Reason-Conclusion) within legal summaries, through
a question-answering framework. (5) Retrieval of legally relevant arguments: extracting and
ranking legal arguments from case law in response to a legal query.</p>
        <p>Poudyal et al.
          [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]
          address three subtasks on the ECHR corpus: (1) argument clause recognition, a binary
classification task determining whether a clause is argumentative or non-argumentative; (2)
premise and conclusion recognition, a binary classification task assigning each previously
identified argumentative clause either a premise or a conclusion label; and (3) argument
relation mining, a binary classification task deciding, for a given pair of argumentative
clauses, whether they are part of the same argument structure.</p>
        <p>Zhang et al. [33] reuse the ECHR corpus for three related subtasks: (1) argument clause
recognition, a binary sentence classification task (argument clause vs. non-argument clause);
(2) argument component classification, two separate binary classification tasks deciding (i)
whether an argument clause is a premise and (ii) whether it is a conclusion, since a clause can
be both a premise and a conclusion in different arguments; and (3) argument relation mining, a
binary classification task in which each pair of clauses is classified as either related (i.e.
belonging to the same argument structure) or not related.</p>
        <p>Models evaluated across these studies include: RoBERTa-Large, Legal-BERT, SVM; BERT;
ResAttArg, DistilRoBERTa; BERT, Legal-BERT, Legal-RoBERTa, RoBERTa-Large, Llama-3 8B;
Mixtral-8x7B; GPT-4, Longformer Encoder-Decoder (LED), BART; RoBERTa; RoBERTa, one-layer
BiLSTM, Legal-BERT, C-Legal-BERT; GPT-4o (L), GPT-4o-mini (S), Mistral-8x22B (L), Mistral-8x7B
(M), Mistral-7B (S), Llama-3.1-70B (L), and Llama-3.1-8B (S), where L, M, and S denote the
largest, medium, and smallest models per family, respectively; and DeBERTa v3.</p>
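        <p>As an illustration of the token-level labelling scheme used in the sequence-labeling
tasks above (each token marked as beginning, inside, or outside an argument span of a given
type), the following minimal Python sketch decodes a BIO tag sequence into spans. The
"precedent" span type is a hypothetical example, not taken from any of the surveyed datasets.</p>

```python
def bio_to_spans(tags):
    """Decode a BIO tag sequence into (start, end_exclusive, span_type) triples."""
    spans, start, current = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):  # a new span begins here
            if start is not None:
                spans.append((start, i, current))
            start, current = i, tag[2:]
        elif tag.startswith("I-") and current == tag[2:]:
            continue  # still inside the currently open span
        else:  # "O", or an I- tag inconsistent with the open span
            if start is not None:
                spans.append((start, i, current))
            start, current = None, None
    if start is not None:  # close a span that runs to the end of the sequence
        spans.append((start, len(tags), current))
    return spans

# One "precedent" argument span covering tokens 1-3 (end index exclusive).
tags = ["O", "B-precedent", "I-precedent", "I-precedent", "O"]
print(bio_to_spans(tags))  # [(1, 4, 'precedent')]
```
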
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K. D.</given-names>
            <surname>Ashley</surname>
          </string-name>
          ,
          <article-title>Artificial intelligence and legal analytics: new tools for law practice in the digital age</article-title>
          , Cambridge University Press,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lawrence</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Reed</surname>
          </string-name>
          ,
          <article-title>Argument mining: A survey</article-title>
          ,
          <source>Comput. Linguistics</source>
          <volume>45</volume>
          (
          <year>2019</year>
          )
          <fpage>765</fpage>
          -
          <lpage>818</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lippi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Torroni</surname>
          </string-name>
          ,
          <article-title>Argumentation mining: State of the art and emerging trends</article-title>
          ,
          <source>ACM Trans. Internet Techn</source>
          .
          <volume>16</volume>
          (
          <year>2016</year>
          )
          <fpage>10:1</fpage>
          -
          <lpage>10:25</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>K. D.</given-names>
            <surname>Ashley</surname>
          </string-name>
          ,
          <article-title>Automatically extracting meaning from legal texts: opportunities and challenges</article-title>
          ,
          <source>Ga. St. UL Rev</source>
          .
          <volume>35</volume>
          (
          <year>2018</year>
          )
          <fpage>1117</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Mochales</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-F.</given-names>
            <surname>Moens</surname>
          </string-name>
          ,
          <article-title>Argumentation mining</article-title>
          ,
          <source>Artificial Intelligence and Law</source>
          <volume>19</volume>
          (
          <year>2011</year>
          )
          <fpage>1</fpage>
          -
          <lpage>22</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bommasani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Borgeaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yogatama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Metzler</surname>
          </string-name>
          , et al.,
          <article-title>Emergent abilities of large language models</article-title>
          ,
          <source>Transactions on Machine Learning Research</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.</given-names>
            <surname>Poudyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Šavelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ieven</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Moens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Goncalves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Quaresma</surname>
          </string-name>
          , ECHR:
          <article-title>Legal corpus for argument mining</article-title>
          ,
          <source>in: Proceedings of the 7th Workshop on Argument Mining</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>67</fpage>
          -
          <lpage>75</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>I.</given-names>
            <surname>Habernal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Faber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Recchia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bretthauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Spiecker genannt Döhmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Burchard</surname>
          </string-name>
          ,
          <article-title>Mining legal arguments in court decisions</article-title>
          ,
          <source>Artificial Intelligence and Law</source>
          <volume>32</volume>
          (
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>O. S.</given-names>
            <surname>Chlapanis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Galanis</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Androutsopoulos</surname>
          </string-name>
          , LAR-ECHR:
          <article-title>A new legal argument reasoning task and dataset for cases of the european court of human rights</article-title>
          ,
          <source>in: Proceedings of the Natural Legal Language Processing Workshop</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>267</fpage>
          -
          <lpage>279</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>I.</given-names>
            <surname>Chalkidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fergadiotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tsarapatsanis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Aletras</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Androutsopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Malakasiotis</surname>
          </string-name>
          ,
          <article-title>Paragraph-level rationale extraction through regularization: A case study on European court of human rights cases</article-title>
          ,
          <source>in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>226</fpage>
          -
          <lpage>241</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Santosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. G.</given-names>
            <surname>Haddad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Grabmair</surname>
          </string-name>
          , ECtHR-PCR:
          <article-title>A dataset for precedent understanding and prior case retrieval in the european court of human rights</article-title>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>L.</given-names>
            <surname>Held</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Habernal</surname>
          </string-name>
          , LaCour!:
          <article-title>enabling research on argumentation in hearings of the european court of human rights</article-title>
          ,
          <source>Artificial Intelligence and Law</source>
          (
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>24</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>G.</given-names>
            <surname>Grundler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Santin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Galassi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Galli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Godano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Lagioia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Palmieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ruggeri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sartor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Torroni</surname>
          </string-name>
          ,
          <article-title>Detecting arguments in CJEU decisions on fiscal state aid</article-title>
          ,
          <source>in: Proceedings of the 9th Workshop on Argument Mining</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>143</fpage>
          -
          <lpage>157</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>P.</given-names>
            <surname>Santin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Grundler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Galassi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Galli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Lagioia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Palmieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ruggeri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sartor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Torroni</surname>
          </string-name>
          ,
          <article-title>Argumentation structure prediction in CJEU decisions on fiscal state aid</article-title>
          ,
          <source>in: Proceedings of the Nineteenth International Conference on Artificial Intelligence and Law</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>247</fpage>
          -
          <lpage>256</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>V.</given-names>
            <surname>Walker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Foerster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Ponce</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rosen</surname>
          </string-name>
          ,
          <article-title>Evidence types, credibility factors, and patterns or soft rules for weighing conflicting evidence: Argument mining in the context of legal rules governing evidence assessment</article-title>
          ,
          <source>in: Proceedings of the 5th Workshop on Argument Mining</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>68</fpage>
          -
          <lpage>78</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>