<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>BioLaw Journal</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1007/978-3-030-89811-3_3</article-id>
      <title-group>
        <article-title>Topic Similarity of Heterogeneous Legal Sources Supporting the Legislative Process</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michele Corazza</string-name>
          <email>michele.corazza@unibo.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leonardo Zilli</string-name>
          <email>leonardo.zilli@studio.unibo.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Monica Palmirani</string-name>
          <email>monica.palmirani@unibo.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Bologna</institution>
          ,
          <addr-line>ALMA-AI, via Galliera 3, Bologna</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Unsupervised learning</institution>
          ,
          <addr-line>Sentence Transformers, Hybrid AI, Legal NLP</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <volume>13048</volume>
      <fpage>79</fpage>
      <lpage>84</lpage>
      <abstract>
        <p>The legislative process starts with a deep analysis of the existing regulations at European and national levels, in order to avoid conflicts with the norms in force and to foster them. The decisions of the Constitutional Court also play a fundamental role in this analysis, both for checking compliance with the constitutional framework and for including the inputs coming from this court in the law-making process. Finally, it is important to compare the forthcoming proposal with the bills already presented on the same topic. This comparison is crucial to avoid overlaps and to coordinate the democratic dialogue among the different parties. In this light, this paper presents an unsupervised approach for calculating the similarity between heterogeneous documents annotated in Akoma Ntoso XML, with the aim of supporting the retrieval of similar documents using the thematic taxonomies used in the legal domain. The prototype was developed in response to a call for expressions of interest launched by the Italian Chamber of Deputies in order to adopt hybrid AI in the legislative process. It uses a completely unsupervised approach based on Sentence Transformers, meaning that neither annotated data nor any fine-tuning process is required.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The legislative process inside parliaments and official assemblies includes an initial phase of preliminary discovery of the existing regulations and rules in the same domain as the proposal, in order to synchronize the proposal with them and avoid conflicting norms.</p>
      <p>Secondly, a preliminary legal study must be conducted in order to apply legislative drafting techniques that aim at creating transparent and evidence-based legislation (e.g., Better Regulation, planning-and-proposing-law/better-regulation_en). On the other hand, the fragmentation of the legal system imposes on the legislative department the task of an accurate preliminary legal analysis and research at different levels of legislation: at the European level, in order to discover the norms in Regulations and Directives; at the national level, to avoid overlapping with other existing acts; at the ministerial level, to synchronize the technical and operative rules. Notably, it is crucial to check the decisions of the Constitutional Court to avoid producing norms that are unconstitutional. Moreover, the legal sources, considering their heterogeneity, are represented in Akoma Ntoso XML [<xref ref-type="bibr" rid="ref1">1</xref>], which provides a common framework for their representation that is capable of capturing legal knowledge and metadata (e.g., jurisdiction, hierarchy, temporal model).</p>
      <p>Additionally, we provide an unsupervised approach for classifying legal documents according to their topic, which is used to retrieve, from a user input, the relevant legal documents concerning some main legal topics (e.g., the subjects of the Chamber of Deputies Committees defined by law<sup>1</sup>, or the EuroVoc top-level thematic classes). This work was conducted on the use case of the needs and documents of the Italian Chamber of Deputies, answering the call for expressions of interest launched in February 2024 concerning the use of AI in Parliament<fn id="fn2"><label>2</label><p>https://comunicazione.camera.it/archivio-prima-pagina/19-37666</p></fn>.</p>
      <p>Legislative language is a peculiar language that includes qualified parts of the text such as the preamble, the normative part, definitions, normative references, exceptions, transitional norms, etc. For this reason, the task is not trivial and should take these peculiarities into consideration.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>The creation of models and methods for the legal domain is a challenging endeavour, as this field is characterized by some peculiar aspects that might lead general-purpose approaches to be inaccurate. Nevertheless, a multitude of different models and strategies have been proposed in this field, including models that have been trained specifically on this domain, like LEGAL-BERT [<xref ref-type="bibr" rid="ref2">2</xref>], which was fine-tuned from BERT [<xref ref-type="bibr" rid="ref3">3</xref>] on legislative documents from the UK, US and EU, as well as court documents from the European Court of Justice. Another model, called custom LEGAL-BERT [<xref ref-type="bibr" rid="ref4">4</xref>], was instead trained on a corpus comprised entirely of case law from the Harvard Law Library. Another prominent example of ad-hoc models for the legal domain is Pile-of-Law (PoL), named after the dataset that was used to fine-tune it, which comprises data from 35 different sources in English [<xref ref-type="bibr" rid="ref5">5</xref>].</p>
      <p>Interestingly, in terms of natural language processing applications for the legal domain, most approaches appear to be targeted at the judiciary rather than the legislative branch. Additionally, some approaches include common-law corpora (UK/US) that, for our purpose (EU), could create relevant distortions in the dataset. In particular, a common task is the prediction of a judgment for a given case. This task has been attempted using multiple methods, including the use of a consistency graph and a transformer model to determine which articles have been violated in a given case [<xref ref-type="bibr" rid="ref6">6</xref>]. The research is not limited to the English language, as there are contributions for Chinese court judgments [7] and rulings from the Indian Supreme Court [8].</p>
      <p>Another crucial aspect of research in the wider field of legal informatics is the creation of formats, ontologies and tools that support the machine-readable representation of legal documents, from both the legislative and judiciary branches. Among these, one of the founding elements of our approach is the usage of the Akoma Ntoso XML standard [<xref ref-type="bibr" rid="ref1">1</xref>, 9], which has been adopted by many international institutions [10, 11, 12, 13, 14] to represent legal documents. This standard allows the annotation of legal definitions, references, the hierarchical structure of legal documents, as well as the temporal aspects of legal documents.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Datasets and resources</title>
      <p>The documents used for the project have been collected from different sources, resulting in four distinct datasets:</p>
      <list list-type="bullet">
        <list-item><p>Corte Costituzionale: Contains the orders and judgments of the Italian Constitutional Court, spanning from 1956 to 2018 (10725 documents), which have been downloaded and converted to Akoma Ntoso using an ad-hoc tool<fn id="fn3"><label>3</label><p>https://gitlab.com/CIRSFID/cortecostituzionale-py</p></fn>;</p></list-item>
        <list-item><p>Progetti di Legge (PDL): A collection of Italian legislative bills from legislatures XVIII and XIX (March 2018 to May 2024, 3615 documents), extracted from the official website of the Chamber of Deputies of the Italian Parliament<fn id="fn4"><label>4</label><p>https://www.camera.it/</p></fn> in HTML format and converted to Akoma Ntoso using a batch Python parser<fn id="fn5"><label>5</label><p>https://gitlab.com/CIRSFID/html2aknPDL</p></fn>;</p></list-item>
        <list-item><p>EUR-Lex: A collection of Regulations and Directives from the European Union, spanning from 2010 to 2021, extracted from the EUR-Lex website<fn id="fn6"><label>6</label><p>https://eur-lex.europa.eu</p></fn> and converted from Formex to the Akoma Ntoso format using our conversion tool<fn id="fn7"><label>7</label><p>http://u2.cirsfid.unibo.it/formexplus2akn/frontend/</p></fn>;</p></list-item>
        <list-item><p>Normattiva: A collection of Italian legislative acts extracted from the Normattiva portal<fn id="fn8"><label>8</label><p>https://www.normattiva.it/</p></fn>, which contains all legislative documents from the Italian parliament in Akoma Ntoso format. The documents from 2010 to May 2024 were selected, including Primary and Secondary Law.</p></list-item>
      </list>
      <p>When not already in the Akoma Ntoso XML format, as is the case for the PDL and EUR-Lex datasets, the documents have been converted to this format. Through this conversion, it is possible for us to extract portions of a document according to its hierarchical structure (articles, commas, lists, etc.). This structural information is very important for the legal domain, as it allows documents to be chunked while considering their structure (e.g., legal definitions, articles, lists of points). Furthermore, normative references are also annotated as such, and a unique URI is used to indicate them. The Akoma Ntoso standard also follows the FRBR conceptual model, which is used to distinguish between works (i.e., a specific law), expressions (the various consolidated versions of each law that have been amended over time) and manifestations (the physical embodiment of an expression or work). Through the annotation of the hierarchical structure of documents, the references and the URI naming convention based on FRBR, it is possible to resolve normative references, even when they refer to a part of a document, like a single article or paragraph. Furthermore, the FRBR model allows us to retrieve the consolidated version of a document which is temporally relevant for a given reference. Akoma Ntoso also includes legal metadata (e.g., jurisdiction, temporal information, modifications, definitions, law-making process, life-cycle of the document, classification), which improves the expressiveness of legal knowledge in the XML representation.</p>
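      <p>As a minimal sketch of how such structural extraction might look, the following Python fragment parses a small, hypothetical Akoma Ntoso document with the standard library and collects the document title, the text of each article and the URIs of its annotated references. The sample document, its reference URI and the helper name extract_parts are illustrative assumptions and not the parser used in this work.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import xml.etree.ElementTree as ET

# Akoma Ntoso 3.0 namespace (OASIS LegalDocML).
AKN = {"akn": "http://docs.oasis-open.org/legaldocml/ns/akn/3.0"}

# Hypothetical, heavily simplified document used only for illustration.
SAMPLE = """<akomaNtoso xmlns="http://docs.oasis-open.org/legaldocml/ns/akn/3.0">
  <act>
    <preface><p><docTitle>Sample act on transport</docTitle></p></preface>
    <body>
      <article eId="art_1">
        <paragraph eId="art_1__para_1">
          <content><p>See <ref href="/akn/eu/act/regulation/2016/679/!main#art_3">Article 3</ref> of the Regulation.</p></content>
        </paragraph>
      </article>
    </body>
  </act>
</akomaNtoso>"""


def extract_parts(xml_text):
    """Return the title plus, per article, its text and reference URIs."""
    root = ET.fromstring(xml_text)
    title_el = root.find(".//akn:docTitle", AKN)
    title = "".join(title_el.itertext()) if title_el is not None else ""
    articles = {}
    for art in root.iterfind(".//akn:article", AKN):
        # itertext() also walks inline children, so each sentence stays intact;
        # split/join collapses the indentation whitespace of the XML.
        text = " ".join("".join(art.itertext()).split())
        refs = [ref.get("href") for ref in art.iterfind(".//akn:ref", AKN)]
        articles[art.get("eId")] = {"text": text, "refs": refs}
    return title, articles


title, articles = extract_parts(SAMPLE)
print(title)
print(articles["art_1"])
```
]]></preformat>
      <p>Keeping the reference URIs next to the text of each article is what later makes it possible to resolve a reference to the portion of the document it points to.</p>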
      <p>Each dataset follows semantically descriptive naming
conventions for the documents, which facilitate
subsequent data handling and processing steps in the pipeline
of the project. Table 1 summarizes the number of
documents contained in each dataset.</p>
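      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption><p>Number of documents contained in each dataset.</p></caption>
        <table>
          <thead>
            <tr><th>Dataset</th><th>N. of Documents</th></tr>
          </thead>
          <tbody>
            <tr><td>Corte Costituzionale</td><td>10725</td></tr>
            <tr><td>PDL</td><td>3615</td></tr>
            <tr><td>EUR-Lex</td><td>14305</td></tr>
            <tr><td>Normattiva</td><td>3195</td></tr>
          </tbody>
        </table>
      </table-wrap>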
      <p>In order to deal with the highly heterogeneous nature of the datasets, labels describing a number of topics have been used for categorizing the documents. The documents concerning Italy have been classified according to the labels of the Committees of the Chamber of Deputies. Each Committee is represented as a string describing it, which contains its title (shown in Table 2), as well as its description as presented in the Circolare del Presidente della Camera (16 ottobre 1996, n. 3), the official document that regulates the matters of competence of each Committee. For the Constitutional Court dataset only, the “Giustizia” (Justice) and “Affari costituzionali, della Presidenza del consiglio e interni della Camera dei deputati” (Constitutional Affairs, Presidency of the Council and Internal Affairs of the Chamber of Deputies) commissions were excluded, as they apply to the vast majority of Constitutional Court documents.</p>
      <p>Concerning the EUR-Lex dataset, the classification leveraged the European multilingual thesaurus, EuroVoc, using the top-level terms (shown in Table 3) and their immediate subcategories separated by semicolons. As with the Constitutional Court dataset, one label was excluded: the term “Unione Europea” (European Union) is too general and relevant to all documents in the dataset.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Document Classification</title>
      <p>In order to classify documents according to their content, we used an approach based on the SentenceTransformers library [15], and selected the multilingual model “paraphrase-multilingual-mpnet-base-v2” [16]. This model is a multilingual version of the monolingual Sentence Transformer model “paraphrase-mpnet-base-v2”, in turn based on MPNet [17], which was trained using a contrastive loss and an approach similar to siamese networks to allow the direct application of a metric (cosine similarity) to its output vectors in order to measure the semantic proximity of sentences. The monolingual model is then used as a teacher in a teacher-student configuration to train the multilingual one, so that both the original and translated versions of a sentence have the same vector representation in the new model. The chosen model, in particular, was trained on parallel data and supports 50+ languages, including Italian and English. Crucially, the usage of a sentence transformer allows us to operate in a completely unsupervised way, without the need to use annotated data or to fine-tune the model for the classification task, since we can directly apply cosine similarity to measure semantic relatedness.</p>
      <table-wrap id="tab2">
        <label>Table 2</label>
        <caption><p>Titles of the Committees of the Chamber of Deputies.</p></caption>
        <table>
          <tbody>
            <tr><td>Affari esteri e comunitari</td></tr>
            <tr><td>Difesa</td></tr>
            <tr><td>Bilancio, tesoro e programmazione</td></tr>
            <tr><td>Finanze</td></tr>
            <tr><td>Cultura, scienza ed istruzione</td></tr>
            <tr><td>Ambiente, territorio e lavori pubblici</td></tr>
            <tr><td>Trasporti, poste e telecomunicazioni</td></tr>
            <tr><td>Attività produttive, commercio e turismo</td></tr>
            <tr><td>Lavoro pubblico e privato</td></tr>
            <tr><td>Affari sociali</td></tr>
            <tr><td>Agricoltura</td></tr>
            <tr><td>Politiche dell’Unione Europea</td></tr>
          </tbody>
        </table>
      </table-wrap>
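      <p>A minimal sketch of this label-free classification step is given below. To stay self-contained, it replaces the output of the sentence transformer with small hand-written vectors, which are purely illustrative; in practice each vector would be obtained by encoding a text with the chosen model. The function names cosine and nearest_topic are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_topic(doc_vec, topic_vecs):
    """Return the (topic, similarity) pair with the highest cosine similarity."""
    scores = {name: cosine(doc_vec, vec) for name, vec in topic_vecs.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy 3-dimensional stand-ins for label embeddings (assumed values).
topics = {
    "Difesa": [0.9, 0.1, 0.0],
    "Finanze": [0.1, 0.9, 0.1],
    "Agricoltura": [0.0, 0.2, 0.9],
}
document = [0.2, 0.8, 0.1]  # stand-in for the embedding of a document text

label, score = nearest_topic(document, topics)
print(label, round(score, 3))
```
]]></preformat>
      <p>No training step appears anywhere in the sketch: adding a topic only requires one more label vector. With the SentenceTransformers library, the vectors would instead be produced by the encode method of the model applied to the document and label texts.</p>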
      <p>In order to produce a classification of the documents, we selected two components of the normative documents (EUR-Lex, Normattiva, PDL), namely their titles and articles. For the Corte Costituzionale dataset, we selected the introduction as a substitute for the title, while instead of the articles we used the decision portion of the documents, in addition to all textual content between parentheses, which contains brief descriptions of referenced documents. The text between parentheses is fed to the model and the results are averaged to produce a single vector. In the following sections, we use “titles” and “articles” for brevity, but these correspond to introduction and decision + parentheses for the Corte Costituzionale dataset.</p>
      <p>These components were extracted by applying the appropriate XPath query to the Akoma Ntoso XML tree representing each document. The first step is to compute embeddings representing each title of the document. Then, we proceed to compute the article vectors. While in the case of titles we can just apply the sentence transformer directly to the text, the length of articles might prevent the model from producing accurate results, or even exceed the maximum number of tokens allowed by a given model. For this reason, our approach leverages the structure of articles, represented using Akoma Ntoso, to produce one embedding for each article. In particular, we traverse the XML tree recursively until we reach the XML elements that are leaves of the tree. We exclude the elements that appear inline in the text (e.g., dates, references, etc.) in order to keep the textual content of each leaf node (e.g., paragraph, item of a list, etc.) intact. A visualization of the procedure is shown in Figure 1. In addition to its own textual content, each leaf node is associated with a list of the references in its text, which are resolved as follows:</p>
      <list list-type="bullet">
        <list-item><p>For punctual references (e.g., Article 3 of Regulation xx/yyyy/EU) we obtain the specific referenced portion of the document as an XML element;</p></list-item>
        <list-item><p>For generic references to an entire document (e.g., Regulation xx/yyyy/EU) we use the title and first article of the document to represent it.</p></list-item>
      </list>
      <p>Formally, then, an article <inline-formula><tex-math>a</tex-math></inline-formula> having children and references is represented by an embedding obtained from the model <inline-formula><tex-math>M</tex-math></inline-formula> using the following recursive procedure:</p>
      <disp-formula id="eq1"><label>(1)</label><tex-math><![CDATA[E(a) = \frac{1}{2 + |C(a)|} \left( M(T(a)) + \sum_{i} E(C_i(a)) + \frac{1}{|R(a)|} \sum_{j} f(R_j(a)) \right)]]></tex-math></disp-formula>
      <p>Where:</p>
      <list list-type="bullet">
        <list-item><p><inline-formula><tex-math>T(a)</tex-math></inline-formula> is the textual content of the article which is not included in any of its non-inline children;</p></list-item>
        <list-item><p><inline-formula><tex-math>C(a)</tex-math></inline-formula> and <inline-formula><tex-math>C_i(a)</tex-math></inline-formula> represent the set of all non-inline children of <inline-formula><tex-math>a</tex-math></inline-formula> and the i-th child element of <inline-formula><tex-math>a</tex-math></inline-formula>, respectively;</p></list-item>
        <list-item><p><inline-formula><tex-math>R(a)</tex-math></inline-formula> and <inline-formula><tex-math>R_j(a)</tex-math></inline-formula> represent all the references in the text of the article and the j-th reference in the text, respectively.</p></list-item>
      </list>
      <p>In order to represent references, then, we can define a function <inline-formula><tex-math>f</tex-math></inline-formula> that works as follows:</p>
      <disp-formula id="eq2"><label>(2)</label><tex-math><![CDATA[f(r) = \begin{cases} E(r) & \text{if } r \text{ is a punctual reference} \\ \frac{1}{2} \left( M(t(r)) + E(a_1(r)) \right) & \text{otherwise} \end{cases}]]></tex-math></disp-formula>
      <p>Where <inline-formula><tex-math>t(r)</tex-math></inline-formula> represents the title of the referenced document, while <inline-formula><tex-math>a_1(r)</tex-math></inline-formula> is the first article of the document. Overall, the function <inline-formula><tex-math>E(a)</tex-math></inline-formula> as defined previously computes an average vector representation for each article, which aggregates the embeddings of all its children but also considers the normative references contained in the text.</p>
      <p>Once we have obtained the vector representation of each article of each document and its title embeddings, we can compare them with the vector representations of our topics: the EuroVoc terms for the European legislation and the Chamber commissions for the Italian documents.</p>
      <p>Then, the similarity between a document and each topic is derived as the sum of two terms: the cosine similarity between the document title and the topic, and the average cosine similarity between the topic and each article. Finally, the maximum similarity value obtained by this procedure is used to classify each document using one of the topics.</p>
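      <p>The recursive averaging of text, children and references, together with the document score, can be sketched as follows. This is only an illustration under stated assumptions: toy_model is a hypothetical word-counting stand-in for the sentence transformer, articles are plain dictionaries rather than Akoma Ntoso elements, the reference term is skipped (and not counted in the average) when an element has no references, and cyclic references are not handled.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math

VOCAB = ("tax", "army")  # toy vocabulary, stands in for a real encoder


def toy_model(text):
    """Hypothetical stand-in for the model: counts vocabulary words."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def scale(u, k):
    return [a * k for a in u]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def embed_ref(ref, model):
    """Punctual references embed their target; generic ones average
    the title embedding and the embedding of the first article."""
    if ref["punctual"]:
        return embed(ref["target"], model)
    pair = add(model(ref["title"]), embed(ref["first_article"], model))
    return scale(pair, 0.5)


def embed(node, model):
    """Average of own text, child embeddings and mean reference embedding."""
    total = model(node.get("text", ""))
    terms = 1
    for child in node.get("children", []):
        total = add(total, embed(child, model))
        terms += 1
    refs = node.get("refs", [])
    if refs:  # assumption: the reference term is skipped when there are none
        ref_sum = [0.0] * len(total)
        for ref in refs:
            ref_sum = add(ref_sum, embed_ref(ref, model))
        total = add(total, scale(ref_sum, 1.0 / len(refs)))
        terms += 1
    return scale(total, 1.0 / terms)


def document_scores(title_vec, article_vecs, topic_vecs):
    """Title similarity plus mean article similarity, for every topic."""
    scores = {}
    for name, topic in topic_vecs.items():
        mean_sim = sum(cosine(a, topic) for a in article_vecs) / len(article_vecs)
        scores[name] = cosine(title_vec, topic) + mean_sim
    return scores


article = {
    "text": "tax",
    "children": [{"text": "tax tax"}, {"text": "army"}],
    "refs": [{"punctual": True, "target": {"text": "tax"}}],
}
vec = embed(article, toy_model)
topics = {"Finanze": toy_model("tax"), "Difesa": toy_model("army")}
scores = document_scores(toy_model("tax law"), [vec], topics)
best = max(scores, key=scores.get)
print(vec, best)
```
]]></preformat>
      <p>Because every element is reduced to a single vector before it is averaged into its parent, no text longer than one leaf element is ever passed to the model, which is the reason for traversing the structure instead of encoding whole articles.</p>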
    </sec>
    <sec id="sec-5">
      <title>5. Searching by topic</title>
      <p>In order to provide a topic-based search that can be used in the Italian legislative process, the final step is to provide an interface<fn id="fn9"><label>9</label><p>http://u2.cirsfid.unibo.it/portale-camera</p></fn> to query each of the four datasets, providing information about the most relevant topics.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Evaluation and Results</title>
      <p>In order to evaluate the performance of our subject-based classification, we asked three experts of the legal domain to annotate, between them, 100 random documents for each dataset, and proceeded to measure the accuracy of our classification when compared to the annotated ground truth (Table 4). The fact that experts were involved in the annotation of the results is crucial for the legal domain, since this allows the legal interpretation of the results, which can only be accomplished through an evaluation by legal experts [18].</p>
      <p>While this is just a preliminary assessment of the classification performance of our unsupervised model, it is possible to derive that the label applied to the documents is correct in at least 39% of the cases, meaning that the approach is indeed able to link a document with its most relevant anchor with a good level of approximation.</p>
      <p>When comparing the results, it is interesting to note that among the Italian datasets, which use the same categories, the Normattiva and Corte Costituzionale accuracy seems higher, while the PDL dataset shows a lower performance. This suggests that the finalized versions of documents issued by the parliament and the Constitutional Court might be simpler to classify in an unsupervised way, while the more draft-like qualities of the PDL dataset hinder the classification efforts.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions and Future Work</title>
      <p>In this article, we present an unsupervised approach that aims to support the Italian legislative process by providing useful insights into documents from the relevant European and Italian institutions (European Union, Constitutional Court, Italian Parliament). The system does not only provide a ranking of relevant documents, but it also returns the two most relevant EuroVoc terms (for EU documents) and Chamber commissions (for Italian documents). This allows the user a more thorough exploration of the relevant subjects, while also supplying suggestions in terms of specific documents.</p>
      <p>Our approach is completely unsupervised and does not rely on any form of annotation, meaning that scaling up the approach to more documents, or even using more performant models, does not require any fine-tuning, with the procedure consisting in obtaining the article and title vectors for all documents. Furthermore, the adopted approach leverages the hierarchical nature of legislative documents, as represented in Akoma Ntoso XML, in order to produce embeddings that are based on the structure of the document. Moreover, using a structured format as our input allows us to resolve normative references, without which some parts of a document would be impossible for an automatic system to understand.</p>
      <p>The evaluation performed on the classification system showed a promising level of performance for an unsupervised model, which does not rely on any information about the specific task. Additionally, the multilingual model used in our method allows users to work in both English and Italian, in terms of queries as well as results, with satisfying results. Nevertheless, it would be possible to improve the quality of the results by testing other models, which might yield better performance.</p>
      <p>The search-by-topic task has been validated by two senior legal researchers in the team; however, it is advisable to organize a session with relevant end-users, with some concrete scenarios for returning relevant documents and categories given a user query. For this task, it would be necessary to involve the relevant stakeholders, meaning experts involved in the drafting of legislative documents in Italy. Nevertheless, the project has been evaluated by scientific experts<fn id="fn10"><label>10</label><p>https://comunicazione.camera.it/archivio-prima-pagina/19-41329</p></fn> appointed by the Italian Chamber of Deputies in the context of its call for expressions of interest, and it was included as part of the work of one of the two winning consortiums.</p>
      <p>The experimental results obtained in this paper constitute a study of the application of pre-existing Sentence Transformer models, in an unsupervised way, to the classification and search of Italian legal documents. While we achieved satisfactory results, our approach could still be improved by refining the base methodology and conducting a more thorough exploration of other multilingual models. Furthermore, a formal evaluation by the stakeholders would also improve our understanding of further specific parameters that arise during the legislative process.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This project is funded by the European Union - NextGenerationEU under the National Recovery and Resilience Plan (PNRR) - Mission 4 Education and research - Component 2 From research to business - Investment 1.1 Notice PRIN 2022 (DD N. 104 del 02/02/2022), title “Smart Legal Order in DigiTal Society (SLOTS)”, proposal code 2022LRL2C2, CUP J53D23005610006. We also thank Salvatore Sapienza, Chantal Bomprezzi and Pier Francesco Bresciani for validating the results.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Palmirani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sperberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Vergottini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vitali</surname>
          </string-name>
          ,
          <article-title>Akoma Ntoso Version 1.0 Part 1: XML Vocabulary</article-title>
          ,
          <source>Technical Report, OASIS Standard</source>
          ,
          <year>2018</year>
          . URL: http://docs.oasis-open.org/legaldocml/akn-core/v1.0/akn-core-v1.0-part1-vocabulary.html.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>I.</given-names>
            <surname>Chalkidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fergadiotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Malakasiotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Aletras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Androutsopoulos</surname>
          </string-name>
          ,
          <article-title>LEGAL-BERT: The muppets straight out of law school</article-title>
          , in: T. Cohn, Y. He, Y. Liu (Eds.),
          <source>Findings of the Association for Computational Linguistics: EMNLP 2020</source>
          ,
          <publisher-name>Association for Computational Linguistics</publisher-name>
          , Online,
          <year>2020</year>
          , pp.
          <fpage>2898</fpage>
          -
          <lpage>2904</lpage>
          . URL: https://aclanthology.org/2020.findings-emnlp.261. doi:10.18653/v1/2020.findings-emnlp.261.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          , arXiv preprint arXiv:1810.04805 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Guha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Anderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Henderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Ho</surname>
          </string-name>
          ,
          <article-title>When does pretraining help? assessing self-supervised learning for law and the casehold dataset of 53,000+ legal holdings</article-title>
          ,
          in:
          <source>Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law</source>
          , ICAIL '21,
          <publisher-name>Association for Computing Machinery</publisher-name>
          ,
          <publisher-loc>New York, NY, USA</publisher-loc>
          ,
          <year>2021</year>
          , pp.
          <fpage>159</fpage>
          -
          <lpage>168</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Henderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krass</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Guha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ho</surname>
          </string-name>
          ,
          <article-title>Pile of law: Learning responsible data filtering from the law and a 256gb open-source legal dataset</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>29217</fpage>
          -
          <lpage>29234</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Niu</surname>
          </string-name>
          ,
          <article-title>Legal judgment prediction via relational learning</article-title>
          ,
          in:
          <source>Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , SIGIR '21,
          <publisher-name>Association for Computing Machinery</publisher-name>
          ,
          <publisher-loc>New York, NY, USA</publisher-loc>
          ,
          <year>2021</year>
          , pp.
          <fpage>983</fpage>
          -
          <lpage>992</lpage>
          . URL: https://doi.org/10.1145/3404835.3462931. doi:10.1145/3404835.3462931.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>