<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Generator for Legal Documents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Benjamin Cérat</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olivier Salaün</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Noreddine Ben Jillali</string-name>
          <email>jillalin@lexum.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marc-André Morissette</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Isabela Pocovnicu</string-name>
          <email>pocovnicui@lexum.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emma Elliott</string-name>
          <email>elliotte@lexum.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>François Harvey</string-name>
          <email>harveyf@lexum.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Lexum Inc.</institution>
          ,
          <country country="CA">Canada</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>RALI, DIRO, Université de Montréal</institution>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>of the output.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Recent years have seen the advent of transformer-based
language models [
        <xref ref-type="bibr" rid="ref20 ref28">1, 2</xref>
        ] that have been applied to a wide
variety of tasks across the field of natural language
processing (NLP). The legal domain has not been exempt
from this trend: automatic legal document classification,
information retrieval and even summarization tasks have
all become attainable goals. This paper outlines the
step-by-step efforts required to accomplish one of these tasks,
namely keyword generation, in a commercial setting.
      </p>
      <p>Proceedings of the Sixth Workshop on Automated Semantic Analysis of</p>
      <p>Lexum is a small software company owned by the Canadian Legal Information Institute and focused on providing open access to online legal information through the canlii.org website and other related products.</p>
      <p>Our objective was to generate, for each decision, short instructive keywords complying with the style found in Canadian legal reports [4] (some examples are shown in Table 1). Our proof of concept used a pre-trained model stored in Huggingface [5], namely BigBirdPegasus [6] (BBP, checkpoint bigbird-pegasus-large-bigpatent) trained on the BIGPATENT dataset [7], which we fine-tuned to generate keywords harvested from decisions found in the Ontario Reports, the Law Society of Saskatchewan Libraries databases and the Supreme Court of Canada Reports. The ability to handle fairly long documents was paramount, as Canadian decisions can vary from several sentences to novel length, averaging around 6 thousand words.</p>
      <p>While the preliminary results from BBP were promising, we outlined several issues regarding the generated keywords. First, pre-trained language models capable of handling long documents were not available for the French language (one of the official languages of Canada), and the rare French legal models were either not accessible [8] or not suited for Canadian common law [9]. Second, the model fails to generalize to decisions from unseen courts: inference on decisions from courts unseen in the training set was strongly biased toward topics found in reports (books that collect and publish notable case law), thus confusing jurisdictions and generally being of much lower quality. Finally, the keyword format is quite inconsistent across collections (e.g. length and style differences among the examples in Table 1) and the model generates keywords in a style that matches only one of those formats. Since the output is meant to be displayed inline on search results, a more uniform result is desirable.</p>
      <p>The LexKey project was meant to address these issues with the following contributions: first, we further pre-trained a multilingual model [10, 11] using a denoising task [12] on CanLII’s Canadian legal decisions, doctrines and legislation. We then modified the model’s attention to allow longer inputs and then fine-tuned it on a supervised keyword generation task over a large, carefully curated and normalized set of decisions containing keywords.</p>
      <p>The model has been deployed live on CanLII in February 2023 after a lengthy quality analysis consisting of both automatic and manual evaluations performed by experts. The underlying pre-trained language model is also shaping up to be a crucial part of other related projects.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption><p>Examples of keywords found in our sources</p></caption>
        <table>
          <thead><tr><th>Keywords</th></tr></thead>
          <tbody>
            <tr><td>Constitutional law – Charter of Rights – Life, liberty and security of the person – Fundamental justice – Abortion – Criminal Code prohibiting abortion except where life or health of woman endangered – Whether or not abortion provisions infringe right to life, liberty and security of the person – If so, whether or not such infringement in accord with fundamental justice – Whether or not impugned legislation reasonable and demonstrably justified in a free and democratic society – Canadian Charter of Rights and Freedoms, ss. 1, 7 – Criminal Code, R.S.C. 1970, c. C-34, s. 251.</td></tr>
            <tr><td>Criminal Law - Sentence - Robbery and Extortion</td></tr>
            <tr><td>Citizenship and Immigration — Status in Canada — Convention Refugees and Persons in Need of Protection — Immigration practice — Refugee Appeal Division jurisdiction — Judicial review of Immigration and Refugee Board, Refugee Appeal Division (RAD) decision dismissing applicants’ appeal from Refugee Protection Division (RPD) decision refusing to recognize applicants’ claim to being refugees or persons in need of protection within meaning of Immigration and Refugee Protection Act, ss. 96, 97</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Table 2 (Decisions with Keywords per Source) covers the following sources: Ontario Reports, Law Society of Saskatchewan, Supreme Court of Canada Reports, Canadian Federal Courts Reports, and other Canadian sources.</p>
      <sec id="sec-1-5">
        <title>2. Related Works</title>
        <sec id="sec-1-5-1">
          <title>2.1. Extractive Methods</title>
          <p>Automatically generating structured keywords from textual documents is a task that is closely related to summarizing documents, a common task in natural language processing. Extractive statistical systems have long been the standard approach to perform this task. In fact, LexKey is aiming at replacing an in-house modified TF-IDF extraction system [13] that works by ranking sentences or n-grams in the text and picking a few salient expressions to represent the document. This extractive approach is similar to those used by [14, 15, 16]. Doing so prevents the model from inventing falsehoods about the document’s content by constraining its output to segments from the input text. The traditional TF-IDF approach compares n-gram frequencies inside the document to those of the corpus at large. This selects groups of words that are rarely used but common in the document text. In our experience, our existing TF-IDF-based system applied to legal decisions yields keywords that are generally related to the facts of a case, but not to the legal principles discussed.</p>
          <p>However, human-written summaries and keywords found in law reports (books that collect and publish notable case law) and legal databases put an emphasis on broader legal doctrines, the pertinent case law and the statutes considered. The LexKey project aims at highlighting these concepts with new keywords.</p>
          <p>There have been some efforts to leverage transformer-based models such as BERT to extract the most relevant sentences as a summary [17], but they struggle to perform as well as abstractive models of similar complexity.</p>
        </sec>
        <sec id="sec-1-5-2">
          <title>2.2. Abstractive Methods</title>
          <p>Since the advent of BERT [2], abstractive models using the transformer architecture [<xref ref-type="bibr" rid="ref20 ref28">1</xref>] dominate the landscape of summarization tasks. Overall, most of the best performing models such as T5 [18], BART [12], MBART [10] and Pegasus [19] use an encoder-decoder architecture. Several elements drew us towards encoder-decoder models: the flexibility provided by separately designed encoder and decoder layers, the flexibility in the implementation of the pre-training objective (e.g. masked language modelling [2], denoising [12]) and of the attention architecture [<xref ref-type="bibr" rid="ref20 ref28">1, 6, 20</xref>], along with the ease of managing multilingual models by simply using source and target language prompts [10].</p>
          <p>More recently, at the time the LexKey project was already nearing its completion, large decoder-based models such as GPT [21] have also been shown to perform well in text generation tasks based on prompting schemes. The large-scale application of the most recently released models (i.e. ChatGPT [22], GPT-4 [23]) to our corpus is left for future work.</p>
          <p>All in all, most of the architectures we surveyed were performing very similarly on summarizing tasks, so the ease of adapting the model to our needs (training and inference cost over millions of documents in particular) became the primary differentiator.</p>
        </sec>
      </sec>
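      <p>The TF-IDF ranking described above can be sketched as follows. This is a minimal illustration of the general technique, not Lexum’s actual system [13]; the toy corpus, whitespace tokenization and smoothing are all assumptions made for the example.</p>
      <preformat>
```python
import math
from collections import Counter

def tf_idf_keywords(doc_tokens, corpus, top_k=3):
    """Rank a document's terms by TF-IDF against a background corpus."""
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for term, count in tf.items():
        # document frequency: number of corpus documents containing the term
        df = sum(1 for d in corpus if term in d)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed idf
        scores[term] = (count / len(doc_tokens)) * idf
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]

# hypothetical mini-corpus of tokenized decisions
corpus = [
    "the court found the appeal moot".split(),
    "the tribunal dismissed the claim".split(),
    "the court granted the injunction sought".split(),
]
doc = "the court considered whether the injunction was moot".split()
print(tf_idf_keywords(doc, corpus))
```
      </preformat>
      <p>Terms that are frequent in the document but rare in the corpus score highest, which is why such a system tends to surface fact-specific vocabulary rather than legal principles.</p>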
      <sec id="sec-1-6">
        <title>2.3. Handling Long Documents</title>
        <p>Early transformer models like BERT [2] were limited to fairly short sequences (512 tokens) due to their quadratic memory usage as a function of the input length. Research into more efficient encoder attention mechanisms or implementations is very active: Flash Attention [24], Big Bird [6], Longformer-Encoder-Decoder [25], LSG [20] and LongT5 [26] are all fairly recent models implementing such techniques. Most of them allow extending the input length up to around 16k tokens by making the memory usage scale more linearly in relation to the input length.</p>
        <p>Hierarchical Attention architectures [27] have also been tried but seem to be limited to around 4k-token inputs and require much more extensive architectural changes relative to the basic transformer implementation.</p>
        <p>In the context of Multi-LexSum, a summarization task of civil rights lawsuits, [28] also emphasized the long-range input issues faced by transformer models when dealing with a multi-document case.</p>
        <sec id="sec-1-6-1">
          <title>3. Datasets</title>
          <sec id="sec-1-6-1-1">
            <title>3.1. Sources</title>
            <p>One of the most important objectives of the LexKey project was to collate a representative set of annotated decisions that was large enough to train a highly capable keyword generation model whose output would meet the quality expectations of our users. Annotating tens of thousands of decisions by hand was unfeasible, so we decided to gather case law databases that already had keywords and categories added by experts.</p>
            <p>We had a few criteria for the selection of appropriate training material. The most obvious one was that the legal framework that created them should be fairly similar to Canadian Law. This meant limiting ourselves to countries that are under common law (mostly the Commonwealth member countries and to some degree the United States). The decisions also had to be either in French or English and have a keyword format somewhat similar to the style we were aiming at, in order to ease format normalization.</p>
            <p>Together with an experienced legal archivist, we set out to identify the decisions on CanLII [3] that featured keywords, often from Law Reports that had licensed their content to CanLII. We also contacted organizations that offered such information to members as an added value to see if they would be interested in collaborating. In the end, we gathered data from several Canadian sources, totalling 157 697 decisions with keywords (see Table 2).</p>
            <p>Gathering decisions from Law Reports introduced a fairly strong bias toward decisions that were of interest to legal practitioners, which feature appeal cases, decisions that are considered authorities on broad constitutional questions, or decisions that clarify a legal controversy. These differ from day-to-day decisions without precedential value.</p>
          </sec>
          <sec id="sec-1-6-1-2">
            <title>3.2. Preprocessing</title>
            <p>In order to pre-train our language model, we used the raw text (Table 3) extracted from the HTML of the 3.1M decisions, 100k commentaries, and 85k statutes and regulations in French and English available on CanLII. We also gathered a large collection of English-language decisions with appropriate keywords, but we could not get a suitable amount of French-language decisions with keywords in time for the first release. The text of every decision with keywords that we did not already have was also added to the pre-training dataset.</p>
            <p>The extraction process separates the keywords from the rest of the text, removes summaries and cleans up the resulting text by normalizing separators and whitespace, keeping the document structure intact. The following is an example of extracted raw text (Table 3): “On May 15, 2017 the Respondent, the Vancouver Park Board, passed a bylaw amendment applicable to parks within its jurisdiction, prohibiting the movement of whales to parks, the keeping of whales at parks (excluding whales which were already in a park on May 15, 2017) and the production or presentation in a park, of a show, performance or other form of entertainment involving whales. The only park within the jurisdiction of the Vancouver Park Board where whales are kept is Stanley Park, [...]”</p>
            <p>For the pre-training denoising task, the documents were further split into individual training samples in chunks of roughly 1024 tokens. These chunks were made by cutting the text along sentence boundaries using NLTK [29]. For the fine-tuning task (keyword generation), the preprocessed text is left in a single chunk. In this step, we also normalize the keywords that were either removed from the text or supplied by an external partner, to use as output targets.</p>
          </sec>
        </sec>
      </sec>
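      <p>The sentence-boundary chunking step can be sketched as follows. For illustration, a naive period split stands in for NLTK’s sentence tokenizer and whitespace words stand in for subword tokens; both are assumptions for the example, not the actual pipeline.</p>
      <preformat>
```python
def chunk_by_sentences(text, max_tokens=1024):
    """Greedily pack whole sentences into chunks of roughly max_tokens words."""
    # naive stand-in for nltk.sent_tokenize: split on sentence-final periods
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        # start a new chunk when adding this sentence would exceed the budget
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks

doc = "First sentence here. Second sentence follows. A third one ends it."
print(chunk_by_sentences(doc, max_tokens=6))
```
      </preformat>
      <p>Cutting along sentence boundaries keeps each training sample made of complete sentences, which matters for a denoising objective that includes sentence permutation.</p>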
      <sec id="sec-1-7">
        <title>3.3. Normalization</title>
        <p>The documents are then randomly shuffled and split into training (90%), validation (5%) and test (5%) sets. We also created a separate fine-tuning dataset containing the subset of decisions that had keywords. It is also split into training (90%), validation (5%) and test (5%) sets.</p>
        <p>When pre-training multilingual models based on MBART checkpoints, we used decisions in both languages, whereas pre-training and fine-tuning for non-MBART-based models are done only on English documents (we did not have enough French documents with keywords at that time). The test sets are English-only for all models.</p>
        <p>To avoid any information leakage, we use a bucket hashing strategy on an immutable unique index to ensure documents always end up in the same set across dataset versions. We first calculate the md5 hash of an id that is shared by every part or version of a document, then convert it to an integer. We then put this id in the correct bucket by taking the modulo of the number of buckets desired (20 in our case). This ensures, for example, that every part of a multipart document and any translation will always end up in the training set, or that a decision with keywords in the fine-tuning validation set will likewise be in the validation set of the pre-training dataset, even if we regenerate several iterations of the datasets with new material or preprocessing.</p>
        <table-wrap id="tab4">
          <label>Table 4</label>
          <caption><p>Documents in Pre-training Dataset</p></caption>
          <table>
            <thead><tr><th/><th>Avg. Length</th><th>Chunks</th><th>Count</th><th>%</th></tr></thead>
            <tbody>
              <tr><td>Train</td><td>6529.3</td><td>4.48M</td><td>2.92M</td><td>90%</td></tr>
              <tr><td>Valid</td><td>5634.8</td><td>249K</td><td>162K</td><td>5%</td></tr>
              <tr><td>Test</td><td>5655.4</td><td>249K</td><td>162K</td><td>5%</td></tr>
            </tbody>
          </table>
          <table-wrap-foot><p>Average length is the number of words per decision on the basis of NLTK.</p></table-wrap-foot>
        </table-wrap>
        <table-wrap id="tab5">
          <label>Table 5</label>
          <caption><p>Documents in Fine-tuning Dataset</p></caption>
          <table>
            <thead><tr><th/><th>Avg. Length</th><th>Count</th><th>%</th></tr></thead>
            <tbody>
              <tr><td>Train</td><td>4174.1</td><td>135K</td><td>90%</td></tr>
              <tr><td>Valid</td><td>4232.9</td><td>7K</td><td>5%</td></tr>
              <tr><td>Test</td><td>4172.6</td><td>7K</td><td>5%</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>The keywords we had available for our training data came from very different sources that rely on very different formats. To align them more closely to the format we wanted to display to our users, we performed extensive normalization.</p>
        <p>An issue that was immediately apparent is that some sources of keywords tend to be very terse, only two or three words, while others like the Supreme Court of Canada tend to be very long and mostly composed of descriptive sentences (Table 1). To be able to control the keyword format generated by our model and avoid having it copy whichever source format the document resembles, we introduced a keyword normalization step to our preprocessing. The complete keyword sequences were separated into a list of short keyword groups, descriptive sentences, and discussed jurisprudence and legislation. The sequence is split along the dashes into individual parts. We then use a regular expression to extract the discussed jurisprudence and legislation. Next, starting from the beginning of the sequence, we select parts of 4 words or less that do not contain a helping verb or pronoun as keywords. The rest is marked as descriptive sentences.</p>
        <p>One major problem that was identified when trying to generate descriptive sentences, and that led us to abandon that idea, was that the models were prone to hallucinating facts or subtly inverting logical propositions found in the text, thus creating believable-sounding keywords that misrepresented the decision. Removing these descriptive sentences did have some negative impact, as they allow more nuanced keywords that can refer to specific facts or arguments. It also meant that we had to remove some decisions with a set of keywords that had only a single short keyword (e.g. a broad domain) followed by descriptive sentences; once normalized, these keywords did not fit our preferred format.</p>
        <p>In the format we settled on, the descriptive sentences in the keywords are discarded and the short keywords are limited to 6 per keyword set (a document can have several groups of keywords if it discusses various issues). The short keywords are then consolidated using a handmade mapping file of equivalent subjects to avoid having different naming conventions.</p>
        <p>After extracting the keywords, further preprocessing is done to ensure good-quality training samples. We remove headers and get rid of very short decisions, since those are generally assigned the same keywords (e.g. the same related lower instance decision). The various extracted keywords are also categorized into short, medium and long formats, should we later decide that we do not want to use the medium-length keywords (the format chosen for the LexKey project) everywhere.</p>
        <sec id="sec-1-7-1">
          <title>3.4. Truecasing</title>
          <p>One major issue we faced while normalizing the keywords to create the fine-tuning dataset was that the capitalization was inconsistent: all caps for some words, title case for every word in some examples, and finally sentence case. We decided to settle on the Supreme Court of Canada’s (SCC) convention of sentence case, both because it was preferred by our legal experts and because the SCC is a large, consistent and well-curated source of examples.</p>
          <p>After trying standard tools such as NLTK’s part-of-speech tagger, which yielded poor results, we decided to fine-tune a separate version of our pre-trained lexBART model to output the proper casing on the keywords from the Supreme Court of Canada.</p>
          <p>To avoid potential miscopying issues, we trained the model to output a token representing the proper case (lowercase, capitalized or all caps) for each word in our training and validation data. This approach yielded very good casing in the appropriate format in most cases.</p>
          <p>Initially, we only applied truecasing to keywords from collections that were known to be badly formatted, but this preprocessing step was eventually applied to every keyword, as we kept tracking down casing issues to oddities in other collections that should have been properly cased.</p>
        </sec>
      </sec>
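      <p>The keyword normalization rules above (split on dashes, extract citations by regular expression, keep parts of 4 words or less that contain no helping verb or pronoun) can be sketched as follows. The word lists and the citation pattern are illustrative assumptions; the lists actually used in LexKey are not published.</p>
      <preformat>
```python
import re

# illustrative stop sets and citation cue; not the project's actual lists
HELPING_VERBS = {"is", "are", "was", "were", "be", "been", "being", "has", "have", "had"}
PRONOUNS = {"he", "she", "it", "they", "we", "who", "whom", "which"}
CITATION = re.compile(r"(R\.S\.C\.|S\.C\.R\.|ss?\.\s*\d)")  # crude statute cue

def normalize_keywords(sequence, max_words=4, max_keywords=6):
    """Split a dash-separated keyword sequence into short keywords,
    descriptive sentences and cited legislation/jurisprudence."""
    keywords, sentences, citations = [], [], []
    for part in re.split(r"\s+[-–—]\s+", sequence):
        words = part.split()
        lowered = set(w.lower().strip(",.") for w in words)
        if CITATION.search(part):
            citations.append(part)
        elif len(words) > max_words or lowered.intersection(HELPING_VERBS) or lowered.intersection(PRONOUNS):
            # too long or contains a helping verb/pronoun: descriptive sentence
            sentences.append(part)
        else:
            keywords.append(part)
    return keywords[:max_keywords], sentences, citations

seq = ("Constitutional law – Charter of Rights – "
       "Whether the legislation is justified – "
       "Criminal Code, R.S.C. 1970, c. C-34, s. 251")
print(normalize_keywords(seq))
```
      </preformat>
      <p>On the sample sequence, the two short parts become keywords, the five-word clause is classed as a descriptive sentence, and the Criminal Code reference is routed to the citations list.</p>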
      <sec id="sec-1-8">
        <title>3.5. Hand-Curated Test Set</title>
        <p>To validate that the models generate keywords of good quality on data from out-of-sample sources, we also had editors create a hand-labelled set of 500 documents distinct from the fine-tuning test set. They are sampled from courts and tribunals not found in any of our keyword sources. As such, none of these decisions were included in the fine-tuning dataset of the keyword generation task, and we ensured that they covered topics not common in legal reports.</p>
        <p>Editors selected them in proportions reflecting the true distribution of our complete corpus (see Table 6). To keep low-level tribunal decisions from dominating this test set, we deliberately sampled 60% of the documents from higher-level courts. The keywords assigned to these documents were also normalized, following the same steps shown earlier.</p>
        <sec id="sec-1-8-1">
          <title>4. Models</title>
          <p>Given that our proof of concept used BigBirdPegasus (BBP), our first idea was to try to modify this model into a multilingual variant. BBP is warm-started from English-only RoBERTa parameters [30], then pre-trained on an unsupervised Gap Sentence Generation task [19]. Therefore, replacing the initial parameters and the tokenizer with one of the many multilingual RoBERTa-based models [31] looked feasible at first. However, after some discussion with the authors on training cost estimates, it became apparent that this was not feasible on our hardware (a small server with two A6000 GPUs) and would require renting TPUs during most of the development process.</p>
          <p>To stay within our hardware capability, we decided to start with a pre-trained multilingual encoder-decoder model, MBART50 [10, 11], further pre-train it on our data (denoising task) and then modify the encoder attention to handle longer input sequences. To speed up the development process, we did most of the experiments on a smaller English-only BART-base model first (referred to as lexBART later). This allowed us to quickly identify the effect of various preprocessing steps and find suitable hyperparameters that could then be reused on the larger MBART50 model (lexMBART).</p>
        </sec>
      </sec>
      <sec id="sec-1-9">
        <title>4.1. Unsupervised Pre-training</title>
        <p>Starting from a pre-trained model checkpoint, we further pre-trained it using a denoising objective [12] on our dataset of around 3.2M legal documents (Table 4). This objective consisted of mask filling and fixing sentence permutation noise on chunks of the input documents. 15% of the tokens in each sample were masked, the sentences were randomly shuffled, and the model was tasked with correctly generating the initial sample.</p>
        <p>This pre-training was done using the standard cross-entropy loss [32] between the generated reconstituted text and the actual text before noise was applied, using gradient accumulation. After more than 93k steps, the loss on the evaluation set decreased from 1.02 to 0.43. In downstream tasks, the pre-training step turned out to yield a marginal gain of up to 0.8 points for ROUGE scores, which is consistent with the findings of [33, 8] that domain adaptation is beneficial.</p>
        <sec id="sec-1-9-1">
          <title>4.2. Handling Long Documents</title>
          <p>To enable this model to handle long inputs, we converted its full attention layers to LSG attention [20]. The conversion from full to sparse attention makes the model less memory-greedy. LSG uses, as the name suggests, a mix of local, sparse and global attention similar to Longformer [25], along with a pooled representation of the rest of the input sequence. This allows the memory usage of attention computations to scale linearly with the input sequence length, and not quadratically like traditional transformer models.</p>
          <p>Our pre-training objective requires that the output length is at least as long as the input length, so it is ill-suited to LSG-based models with different input and output lengths. Since the denoising task can be done with shorter document chunks, and since the size mismatch does not cause problems during fine-tuning as the keywords are always much shorter than the input text, we only modified the attention after the pre-training was done.</p>
        </sec>
      </sec>
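      <p>The construction of a denoising training pair (shuffle the sentences, mask about 15% of the tokens, keep the original text as the target) can be sketched as follows. This toy version works on whitespace words rather than subword tokens, and the mask token name is an assumption for the example.</p>
      <preformat>
```python
import random

def make_denoising_sample(sentences, mask_token="[MASK]", mask_rate=0.15, seed=0):
    """Build a (noisy input, target) pair: shuffle sentences, then mask tokens.
    Word-level toy version of the subword noising used with BART-style models."""
    rng = random.Random(seed)
    target = " ".join(sentences)          # the model must reconstruct this
    shuffled = sentences[:]
    rng.shuffle(shuffled)                 # sentence permutation noise
    tokens = " ".join(shuffled).split()
    n_mask = max(1, round(mask_rate * len(tokens)))
    for i in rng.sample(range(len(tokens)), n_mask):
        tokens[i] = mask_token            # mask-filling noise
    return " ".join(tokens), target

noisy, target = make_denoising_sample(
    ["The court allowed the appeal.", "Costs were awarded to the appellant."]
)
print(noisy)
print(target)
```
      </preformat>
      <p>The model is then trained with cross-entropy to map the noisy sequence back to the target, which is why the output must be at least as long as the input for this objective.</p>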
      <sec id="sec-1-10">
        <title>4.3. Supervised Fine-tuning for Keyword Generation</title>
        <p>After pre-training, the model architecture is converted to LSG attention to allow a larger sequence input length of 4096 or 8192 tokens. It is then fine-tuned to generate the correct keywords on our labelled dataset using a cross-entropy loss. Documents exceeding the maximum input length are simply truncated by the tokenizer.</p>
        <p>While the LSG attention can be modified to accept input lengths longer than 8k tokens, doing so proved to stretch fine-tuning time to an impractical degree. On our in-house hardware, fine-tuning an MBART50-large model extended to 4k input tokens took roughly 6 days, and was proportionally longer with longer inputs (taking 12 days for 8k); a prototype trained on a single A6000 GPU would likely take around half the training time on 2 units. In addition, 20% of our dataset is longer than 4k, and only 6% is longer than 8k. For monolingual models, French documents are removed from the training and validation sets, and the test set is English-only for all models.</p>
        <p>After some experimentation, we settled on beam search as the generation strategy. We found that using 10 beams gives the best balance between generation speed and quality. We also added a rule-based post-generation processing step to remove any leftover repetitions, as beam search is prone to this issue.</p>
      </sec>
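      <p>A rule-based repetition cleanup of the kind described can be sketched as follows; this is our own minimal illustration, assuming keywords are joined by an en-dash separator as in the examples of Table 1, not the exact post-processing rules used in production.</p>
      <preformat>
```python
def remove_repetitions(keyword_sequence, separator=" – "):
    """Drop duplicate keywords from a generated sequence, keeping first occurrences."""
    seen, cleaned = set(), []
    for part in keyword_sequence.split(separator):
        key = part.strip().lower()          # case-insensitive duplicate check
        if key and key not in seen:
            seen.add(key)
            cleaned.append(part.strip())
    return separator.join(cleaned)

print(remove_repetitions("Criminal law – Sentencing – Criminal law – Appeal"))
```
      </preformat>
      <p>Keeping only the first occurrence preserves the general-to-specific ordering of the keyword sequence while removing the loops that beam search occasionally produces.</p>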
    </sec>
    <sec id="sec-2">
      <title>5. Results</title>
      <sec id="sec-2-1">
        <title>5.1. Quantitative Analysis</title>
        <p>The order within keyword sequences matters, as terms range from the most general to the most specific legal principles. Therefore, we settled on using ROUGE [34] for assessing the models’ performance. As can be seen from Table 9, all the models trained on the fine-tuning dataset outperformed both our initial BigBirdPegasus prototype and the legacy TF-IDF system on ROUGE scores by large margins of at least 10 points. The models with larger input lengths have slightly better scores, with the largest model, lexMBART LSG-8k, performing best. The 8k model scored the same as the 4k model on decisions of more than 4096 but fewer than 8192 tokens; for longer decisions, it performed significantly better (by around 1.2 ROUGE) than the 4k model.</p>
        <p>These preliminary results are however contradicted by the Hand-Curated Test Set (Table 10), where the much smaller lexBART model produced the best scores, followed by the lexMBART LSG-4k model. This suggests that the models’ ability to generalize to documents from unseen courts is not guaranteed to improve as the sequence input length increases. It is also possible that longer input may dilute the relevant information in cases where the model is unsure.</p>
        <table-wrap id="tab12">
          <label>Table 12</label>
          <caption><p>Impact of Preprocessing Steps on lexBART. The top scores are in bold font and the second best are underlined.</p></caption>
          <table>
            <thead><tr><th>Steps</th><th>ROUGE1</th><th>ROUGE2</th><th>ROUGEL</th></tr></thead>
            <tbody>
              <tr><td>Baseline</td><td>49.0</td><td>30.7</td><td>41.2</td></tr>
              <tr><td>+Pre-training</td><td>49.8</td><td>30.9</td><td>41.3</td></tr>
              <tr><td>-Summaries</td><td>49.4</td><td>34.5</td><td>46.9</td></tr>
              <tr><td>+Normalisation</td><td>50.2</td><td>35.6</td><td>47.8</td></tr>
              <tr><td>+Truecasing</td><td>50.5</td><td>35.7</td><td>48.1</td></tr>
              <tr><td>+Additional Sources</td><td>55.7</td><td>43.6</td><td>54.4</td></tr>
            </tbody>
          </table>
          <table-wrap-foot><p>Baseline: only basic preprocessing and starting from BART-base.</p></table-wrap-foot>
        </table-wrap>
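        <p>For reference, a ROUGE-N F1 score of the kind used here can be sketched as follows; evaluations normally rely on a dedicated ROUGE package, so this whitespace-tokenized version is only an illustration of the metric.</p>
        <preformat>
```python
from collections import Counter

def rouge_n_f1(candidate, reference, n=1):
    """Compute a simple ROUGE-N F1 between whitespace-tokenized strings."""
    def ngrams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    # clipped n-gram overlap between candidate and reference
    overlap = sum(min(count, ref[g]) for g, count in cand.items())
    if not overlap:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge_n_f1("criminal law sentencing", "criminal law sentencing appeal"))
```
        </preformat>
        <p>Because the metric rewards n-gram overlap in order, it is sensitive to the general-to-specific ordering of keyword sequences, which motivated its use here.</p>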
        <p>From Table 12, we can see the efect of each change to model output during qualitative evaluation. Some may
the model or dataset on the lexBART model that scored have been translated from French.
the best on the Hand-Curated Test Set, starting from the The first example shows a marked improvement over
BART-base model. This is the same model that was used the legacy model. The new keywords are on-topic, and
to figure out hyperparameters for the bilingual MBART models. Every step of the normalization process helped the model improve. Even removing the decision summaries from the input text only hurt the ROUGE1 score slightly. We were surprised by this low impact since we can expect summaries to contain the important information that would be found in keywords. However, all the steps related to data quality improved the ROUGE score far less than simply adding more sources of annotated decisions.</p>
        <p>While those steps did not individually result in markedly improved metrics, their combination had a noticeable impact on the measured quality of the generation. In particular, while pre-training only improved ROUGE by 0.1 to 0.8 points, after this step, the model-generated keywords looked much more on-topic for decisions discussing subjects not found within the fine-tuning dataset (see Table 13 for an example).</p>
        <p>5.2. Manual Qualitative Analysis</p>
        <p>While ROUGE scores are useful when comparing models with each other, they cannot determine whether the model’s outputs meet our users’ quality standards. To do so, we had the final model generate keywords for the 500-document hand-curated test set and had experts compare them to the manually generated keywords. To help them, we automatically verified if the discussed legislation could be found in the text input.</p>
        <p>cover the same subjects as the gold standard without missing important aspects. Meanwhile, the TF-IDF picks words like “recommended”, “death” and “friend” that have no bearing on the case. Likewise, the second example shows another criminal case from a provincial court where the model’s keywords are far better than the legacy ones.</p>
        <p>In the third, we can see one of the problematic cases where the model over-generalizes from its training data. Some tribunals that are only found in our data when their decision is appealed all have the “Judicial review” keyword even if the decision is not being reviewed. In this case, while the model correctly identifies the topic, it incorrectly generates the “Judicial review” keyword.</p>
        <p>The fourth, a case about compensation for a work injury, had the model pick up on the description of the accident and identify it as a murder case. The model hallucinated murders in this fashion often enough that we had to keep the legacy keywords on these tribunals. The next two are examples of very short decisions. We found the model prone to inventing nonsense when processing decisions with limited context, hence our decision to not generate new keywords on decisions with less than a few paragraphs. The first of the two is only four short paragraphs, but the model does a good job of identifying the important idea. The second is a single sentence and causes the model to generate a keyword that is nearly as long and unhelpful.</p>
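The legacy extractor’s failure mode discussed above (surfacing case-specific but legally irrelevant words such as “recommended” or “friend”) follows directly from how plain TF-IDF scores terms: words shared by most decisions get low IDF, while incidental words unique to one case score highest. A minimal sketch, not Lexum’s actual implementation:

```python
import math
from collections import Counter

def tfidf_keywords(doc_tokens, corpus, k=5):
    """Rank a document's tokens by TF-IDF against a background corpus."""
    n_docs = len(corpus)
    # Document frequency: in how many corpus documents each term appears.
    df = Counter()
    for other in corpus:
        df.update(set(other))
    tf = Counter(doc_tokens)
    scores = {
        term: (count / len(doc_tokens)) * math.log(n_docs / (1 + df[term]))
        for term, count in tf.items()
    }
    return [t for t, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]

# Toy corpus of "decisions": terms common across cases (court, law) get low
# IDF, while incidental words unique to one case (friend, recommended) rank
# highest even though they say nothing about the legal issue.
corpus = [
    ["court", "law", "murder", "sentencing", "parole"],
    ["court", "law", "contract", "damages"],
    ["court", "law", "assault", "evidence"],
]
doc = ["court", "law", "murder", "friend", "recommended", "friend"]
print(tfidf_keywords(doc, corpus + [doc], k=3))
```

On this toy input the top-ranked terms are the off-topic “friend” and “recommended”, mirroring the behaviour the qualitative review observed in the legacy system.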
        <p>We must first emphasize that the model performed poorly on some subsets of the decisions. For example, it generates mostly random keywords when the decision is very short, but other recurring issues have been identified (see examples in Table 14). We have fixed these.</p>
        <p>Two other fairly common problem cases found during the qualitative analysis were missing topics and erroneous facts. In the missing topic example, we can see that the keywords are on-topic but that an important notion, “Evidence”, is not mentioned. This was considered acceptable by reviewers. On the other hand, sometimes the model would create keywords that misrepresented the decision (in the last case, lexMBART wrongfully refers to “Aboriginal persons”). This kind of error can be hard to find without carefully reading the whole decision and is quite problematic to users. Thankfully, it appears to be rare enough to be acceptable in keywords labelled as automatically generated.</p>
        <p>7. Only basic preprocessing and starting from BART-base.</p>
        <p>started from MBART checkpoint, further pre-trained on a large corpus of legal documents, and fine-tuned to produce structured keywords similar to those produced by legal publishers. We believe that both this model and, especially, this large multi-source corpus will allow us to continue to leverage the current NLP advances into more useful automation that previously had to be performed by hand by experts.</p>
        <p>6. Discussion</p>
        <p>Overall in these experiments, lexBART scored well despite its much lower parameter count. If bilingual French and English support was not a requirement down the line, its good performance would be a strong argument for picking the smaller model.</p>
        <p>Both LSG models outperformed lexMBART but, all in all, the qualitative analysis showed only a limited difference in output quality between the lexMBART-LSG-4k and 8k models. Since the latter takes twice as long to run in both training and inference (and thus costs twice as much), we eventually decided to deploy the lexMBART-LSG-4k model to production. This necessitated adding an editorial override and keyword blacklisting feature to the publication pipeline and deploying the model using torch-serve to our AWS cloud environment, where it replaced the legacy TF-IDF system on 1.3M English language decisions from the selected courts and tribunals.</p>
        <p>Processing those decisions took 3 days on a G5.12xlarge machine from AWS (using 4 workers, 16 threads and a batch size of 2 per GPU). The current day-to-day intake of around 600 decisions per day is handled by a G4dn.xlarge running a single worker with a batch size of one.</p>
        <p>Despite the limitations we outlined, our custom-made keyword generator was found to perform well, generating useful keywords for a large subset of our documents. Most of the undesirable behaviours could be curbed through some post-processing and a carefully chosen keyword format.</p>
        <p>The LexKey project started in May 2021 and in February 2023, the fine-tuned model was deployed live on CanLII on a large part of the English corpus. Both the language model and the dataset will also likely be reused in other upcoming projects like document classification. Although we cannot release the dataset because of editors’ policy, we will make our pre-training and fine-tuning scripts available8 along with the pre-trained model itself. By doing so, we also intend to showcase what is technically feasible for a small legal tech company of our size when it comes to keyword generation.</p>
        <p>In the next development phase, we plan to source more French language decisions with keywords to provide the same feature on CanLII for both official languages (French documents were too scarce to be included in the fine-tuning dataset). We have also acquired 144K decisions from the Harvard Caselaw Access Project and 64K decisions from the Australian Federal Courts and will use them to validate whether adding data from other common law countries can help improve our model. We will also be experimenting with other legal document types like briefs or doctrines to see if our model can provide useful keywords. Finally, we intend to test the most recent large language models such as GPT-4 [23] for keyword generation.</p>
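The guardrails described in this section (skipping very short decisions so the legacy keywords are kept, dropping blacklisted over-generalized keywords, and letting an editorial override win) can be sketched as below. The threshold, names and blacklist contents are illustrative assumptions, not the production values:

```python
# Hypothetical sketch of the deployment guardrails; all names are ours.
MIN_PARAGRAPHS = 4                 # assumed threshold; the paper says "a few"
BLACKLIST = {"Judicial review"}    # e.g. keywords over-generated on some tribunals

def postprocess(paragraphs, generated, override=None):
    """Return the keyword list to publish, or None to keep legacy keywords."""
    if override is not None:
        # An editorial override from the publication pipeline always wins.
        return override
    if len(paragraphs) >= MIN_PARAGRAPHS:
        # Drop any generated keyword containing a blacklisted hierarchy level.
        kept = [kw for kw in generated
                if not any(part in BLACKLIST for part in kw.split(" — "))]
        return kept or None
    # Too little context for the generator to be trusted: fall back to legacy.
    return None
```

A `None` return models falling back to the legacy TF-IDF keywords, matching the decision not to generate new keywords on decisions shorter than a few paragraphs.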
      </sec>
      <sec id="sec-2-2">
        <title>7. Conclusion</title>
        <p>In this paper, we present the work done to leverage the recent advances in language modelling and generation to produce useful keywords to augment search results on the CanLII website. These efforts yielded an encoder-decoder language model named lexMBART-LSG-4k (nicknamed LexKey for the sake of simplicity) that was warm-started from a MBART checkpoint.</p>
        <p>8 https://github.com/Lexum/lexkey-public</p>
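The structured keywords produced by the model are hierarchical paths whose levels run from the broad area of law down to the specific issue, separated by em dashes. A minimal sketch of how such a keyword could be split into its levels (the function name is ours, not from the paper):

```python
def parse_keyword(keyword):
    """Split a publisher-style keyword into its hierarchy levels."""
    return [level.strip() for level in keyword.split("—")]

path = parse_keyword("Criminal law — Murder — Second degree murder — Sentencing")
# The first level is the broad area of law; later levels narrow the issue.
area, issue = path[0], path[-1]
print(area, "/", issue)
```

Keeping the format this regular is what makes simple post-processing, such as matching individual levels against a blacklist, feasible.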
        <sec id="sec-2-2-1">
          <title>Keyword examples</title>
          <p>Gold: Criminal law — Offences — Murder — Second degree murder — Sentencing
Gold: Criminal law — Prisons — Prisoners — Releases — Parole
Legacy TF-IDF: period of parole ineligibility — offender — recommended — friends — death
lexMBART: Criminal law — Murder — Second degree murder — Sentencing — Parole ineligibility
Assessment: Good</p>
          <p>Gold: Criminal law — Sexual offences, public morals and disorderly conduct — Sexual exploitation — Evidence
Gold: Criminal law — Offences against person and reputation — Sexual assault — General offence — Evidence
Legacy TF-IDF: identification — omitted for publication — witness — sexually assaulted — don t know
lexMBART: Criminal law — Sexual offences — Sexual assault — Sexual touching — Evidence
Assessment: Good</p>
          <p>Gold: Labour and employment law — Labour law — Collective agreement — Management rights — Surveillance of employees
Gold: Privacy and freedom of information — Provincial privacy legislation — Collection of personal information — Purpose and use
Legacy TF-IDF: screen captures — recording of incoming calls — requirement to record time codes — reasonableness — analysis
lexMBART: Labour law — Arbitration — Judicial review
Assessment: Wrong - “Labour law — Arbitration” is good but “Judicial review” is wrong.</p>
          <p>Legacy TF-IDF: knee — worker — pre-existing degenerative changes — pre-existing condition — aggravation
lexMBART: Criminal law — Murder — Second degree murder — Evidence — Identification
Assessment: Wrong - about workplace injury, not murder</p>
          <p>Legacy TF-IDF: complied — fine — imprisoned — merits — varying
lexMBART: Practice and procedure — Fine — Compliance with judgment
Assessment: Good</p>
          <p>Decision text: The Reasons for Judgment rendered in file T-1291-97 apply to the appellant in this file.
Legacy TF-IDF: file — rendered — apply
lexMBART: Practice — Judgments and orders — Reasons for judgment — Application to vary</p>
          <p>Gold: Labour and employment law — Labour law — Unfair labour practices — Employer practices — Interference with union activities
Gold: Labour and employment law — Labour law — Unfair labour practices — Remedies — Miscellaneous
Legacy TF-IDF: vote — scheduled — unfair labour practice — shift — hours
lexMBART: Labour relations — Certification — Wishes and preferences — Employee vote
Assessment: Wrong - “Wishes and preferences” is wrong. Also incomplete.</p>
          <p>Gold: Criminal law — Offences against person and reputation — Sexual assault — General offence — Evidence
Gold: Evidence — Hearsay — Traditional exceptions to rule against admission — Spontaneous statements
Legacy TF-IDF: audio recordings — testimony — evidence — sexual assault — witness
lexMBART: Criminal law — Sexual offences — Sexual assault — Offences against persons — Unlawful confinement
Assessment: Not Bad - No mention of Evidence</p>
          <p>Gold: Criminal law — Offences — Robbery — Sentencing — Adult offenders
Legacy TF-IDF: sentence — pre-sentence report — robberies — community — offences
lexMBART: Criminal law — Property offences — Robbery — Sentencing — Aboriginal persons</p>
          <p>Assessment: Wrong - par. 40 “This offender is not aboriginal.”</p>
          <p>[21] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems 33 (2020) 1877–1901.
[22] OpenAI, Introducing ChatGPT, https://openai.com/blog/chatgpt, 2022. Accessed: 2023-02-05.
[23] OpenAI, GPT-4 technical report, 2023. arXiv:2303.08774.
[24] T. Dao, D. Fu, S. Ermon, A. Rudra, C. Ré, FlashAttention: Fast and memory-efficient exact attention with IO-awareness, Advances in Neural Information Processing Systems 35 (2022) 16344–16359.
[25] I. Beltagy, M. E. Peters, A. Cohan, Longformer: The long-document transformer, 2020. URL: https://arxiv.org/abs/2004.05150. doi:10.48550/ARXIV.2004.05150.
[26] M. Guo, J. Ainslie, D. Uthus, S. Ontanon, J. Ni, Y.-H. Sung, Y. Yang, LongT5: Efficient text-to-text transformer for long sequences, in: Findings of the Association for Computational Linguistics: NAACL 2022, Association for Computational Linguistics, Seattle, United States, 2022, pp. 724–736. URL: https://aclanthology.org/2022.findings-naacl.55. doi:10.18653/v1/2022.findings-naacl.55.
[27] I. Chalkidis, X. Dai, M. Fergadiotis, P. Malakasiotis, D. Elliott, An exploration of hierarchical attention transformers for efficient long document classification, 2022. URL: https://arxiv.org/abs/2210.05529. doi:10.48550/ARXIV.2210.05529.
[28] Z. Shen, K. Lo, L. Yu, N. Dahlberg, M. Schlanger, D. Downey, Multi-LexSum: Real-world summaries of civil rights lawsuits at multiple granularities, in: Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022. URL: https://openreview.net/forum?id=z1d8fUiS8Cr.
[29] S. Bird, E. Loper, NLTK: The natural language toolkit, in: Proceedings of the ACL Interactive Poster and Demonstration Sessions, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 214–217. URL: https://aclanthology.org/P04-3031.
[30] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, 2019. URL: https://arxiv.org/abs/1907.11692. doi:10.48550/ARXIV.1907.11692.
[31] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, É. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 8440–8451.
[32] I. J. Good, Rational decisions, Journal of the Royal Statistical Society. Series B (Methodological) 14 (1952) 107–114. URL: http://www.jstor.org/stable/2984087.
[33] L. Zheng, N. Guha, B. R. Anderson, P. Henderson, D. E. Ho, When does pretraining help? Assessing self-supervised learning for law and the CaseHOLD dataset of 53,000+ legal holdings, in: Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law, 2021, pp. 159–168.
[34] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 74–81. URL: https://aclanthology.org/W04-1013.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>Canadian Legal Information Institute, https://www.canlii.org/en/, 2001. Accessed: 2023-02-05.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>Supreme Court of Canada | Cour suprême du Canada, Style Manual | Guide de rédaction, 1987.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, A. Rush, Transformers: State-of-the-art natural language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 38–45. URL: https://aclanthology.org/2020.emnlp-demos.6. doi:10.18653/v1/2020.emnlp-demos.6.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, L. Yang, et al., Big Bird: Transformers for longer sequences, in: NeurIPS, 2020.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>E. Sharma, C. Li, L. Wang, BIGPATENT: A large-scale dataset for abstractive and coherent summarization, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 2019, pp. 2204–2213. URL: https://aclanthology.org/P19-1212. doi:10.18653/v1/P19-1212.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>N. Garneau, E. Gaumond, L. Lamontagne, P.-L. Déziel, CriminelBART: a French Canadian legal language model specialized in criminal law, in: Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law, 2021, pp. 256–257.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>S. Douka, H. Abdine, M. Vazirgiannis, R. El Hamdani, D. Restrepo Amariles, JuriBERT: A masked-language model adaptation for French legal text, in: Proceedings of the Natural Legal Language Processing Workshop 2021, 2021, pp. 95–101.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>Y. Liu, J. Gu, N. Goyal, X. Li, S. Edunov, M. Ghazvininejad, M. Lewis, L. Zettlemoyer, Multilingual denoising pre-training for neural machine translation, Transactions of the Association for Computational Linguistics 8 (2020) 726–742.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>Y. Tang, C. Tran, X. Li, P.-J. Chen, N. Goyal, V. Chaudhary, J. Gu, A. Fan, Multilingual translation with extensible multilingual pretraining and finetuning, 2020. arXiv:2008.00401.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 7871–7880. URL: https://aclanthology.org/2020.acl-main.703. doi:10.18653/v1/2020.acl-main.703.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>J. Ramos, Using TF-IDF to determine word relevance in document queries, 1999.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>A. Farzindar, G. Lapalme, LetSum, an automatic legal text summarizing system, in: Legal Knowledge and Information Systems: JURIX 2004, the Seventeenth Annual Conference, volume 120, IOS Press, 2004, p. 11.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>B. Hachey, C. Grover, Extractive summarisation of legal texts, Artificial Intelligence and Law 14 (2006) 305–345.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>S. Polsley, P. Jhunjhunwala, R. Huang, CaseSummarizer: A system for automated summarization of legal texts, in: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, The COLING 2016 Organizing Committee, Osaka, Japan, 2016, pp. 258–262. URL: https://aclanthology.org/C16-2054.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>D. Miller, Leveraging BERT for extractive text summarization on lectures, 2019. URL: https://arxiv.org/abs/1906.04165. doi:10.48550/ARXIV.1906.04165.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, et al., Exploring the limits of transfer learning with a unified text-to-text transformer, The Journal of Machine Learning Research 21 (2020) 5485–5551.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>J. Zhang, Y. Zhao, M. Saleh, P. Liu, PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization, in: International Conference on Machine Learning, PMLR, 2020, pp. 11328–11339.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>C. Condevaux, S. Harispe, LSG attention: Extrapolation of pretrained transformers to long sequences, in: Advances in Knowledge Discovery and Data Mining: 27th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2023, Osaka, Japan, May 25–28, 2023, Proceedings, Part I, Springer, 2023, pp. 443–454.</mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <article-title>Language models are few-shot learners</article-title>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>