<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Generator for Legal Documents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Benjamin Cérat</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olivier Salaün</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Noreddine Ben Jillali</string-name>
          <email>jillalin@lexum.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marc-André Morissette</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Isabela Pocovnicu</string-name>
          <email>pocovnicui@lexum.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emma Elliott</string-name>
          <email>elliotte@lexum.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>François Harvey</string-name>
          <email>harveyf@lexum.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Lexum Inc.</institution>
          ,
          <country country="CA">Canada</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>RALI, DIRO, Université de Montréal</institution>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>of the output.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Recent years have seen the advent of transformer-based
language models [
        <xref ref-type="bibr" rid="ref20 ref28">1, 2</xref>
        ] that have been applied to a wide
variety of tasks across the field of natural language
processing (NLP). The legal domain has not been exempt
from this trend: automatic legal document classification,
information retrieval and even summarization tasks have
all become attainable goals. This paper outlines the
step-by-step efforts required to accomplish one of these tasks,
namely keyword generation, in a commercial setting.
      </p>
      <p>Proceedings of the Sixth Workshop on Automated Semantic Analysis of</p>
      <p>Lexum is a small software company owned by the Canadian Legal Information Institute and focused on providing open access to online legal information through the canlii.org website and other related products.</p>
      <p>Our objective was to generate, for each decision, short instructive keywords complying with the style found in Canadian legal reports [4] (some examples are shown in Table 1). Our proof of concept used a pre-trained model stored in Huggingface [5], namely BigBirdPegasus [6] (BBP, checkpoint bigbird-pegasus-large-bigpatent) trained on the BIGPATENT dataset [7], which we fine-tuned to generate keywords harvested from decisions found in the Ontario Reports, the Law Society of Saskatchewan Libraries databases and the Supreme Court of Canada Reports. The ability to handle fairly long documents was paramount, as Canadian decisions can vary from several sentences to novel length, averaging around 6 thousand words.</p>
      <p>While the preliminary results from BBP were promising, we outlined several issues regarding the generated keywords. First, pre-trained language models capable of handling long documents were not available for the French language (one of the official languages of Canada), and the rare French legal models were either not accessible [8] or not suited for Canadian common law [9]. Second, the model fails to generalize to decisions from unseen courts: inference on decisions from courts unseen in the training set was strongly biased toward topics found in reports (books that collect and publish notable case law), thus confusing jurisdictions and generally being of much lower quality. Finally, the keyword format is quite inconsistent across collections (e.g. length and style differences among the examples in Table 1) and the model generates keywords in a style that matches only one of those formats. Since the output is meant to be displayed inline on search results, a more uniform result is desirable.</p>
      <p>The LexKey project was meant to address these issues with the following contributions: first, we further pre-trained a multilingual model [10, 11] using a denoising task [12] on CanLII’s Canadian legal decisions, doctrines and legislation. We then modified the model’s attention to allow longer inputs and then fine-tuned it on a supervised keyword generation task over a large, carefully curated and normalized set of decisions containing keywords.</p>
      <p>The model has been deployed live on CanLII in February 2023 after a lengthy quality analysis consisting of both automatic and manual evaluations performed by experts. The underlying pre-trained language model is also shaping up to be a crucial part of other related projects.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption><p>Examples of keywords found in our sources</p></caption>
        <table>
          <thead><tr><th>Keywords</th></tr></thead>
          <tbody>
            <tr><td>Constitutional law – Charter of Rights – Life, liberty and security of the person – Fundamental justice – Abortion – Criminal Code prohibiting abortion except where life or health of woman endangered – Whether or not abortion provisions infringe right to life, liberty and security of the person – If so, whether or not such infringement in accord with fundamental justice – Whether or not impugned legislation reasonable and demonstrably justified in a free and democratic society – Canadian Charter of Rights and Freedoms, ss. 1, 7 – Criminal Code, R.S.C. 1970, c. C-34, s. 251.</td></tr>
            <tr><td>Criminal Law - Sentence - Robbery and Extortion</td></tr>
            <tr><td>Citizenship and Immigration — Status in Canada — Convention Refugees and Persons in Need of Protection — Immigration practice — Refugee Appeal Division jurisdiction — Judicial review of Immigration and Refugee Board, Refugee Appeal Division (RAD) decision dismissing applicants’ appeal from Refugee Protection Division (RPD) decision refusing to recognize applicants’ claim to being refugees or persons in need of protection within meaning of Immigration and Refugee Protection Act, ss. 96, 97</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Table 2 (Decisions with Keywords per Source) covers the following sources: Ontario Reports, Law Society of Saskatchewan, Supreme Court of Canada Reports, Canadian Federal Courts Reports, and other Canadian sources.</p>
      <sec id="sec-1-5">
        <title>2. Related Works</title>
        <sec id="sec-1-5-1">
          <title>2.1. Extractive Methods</title>
          <p>Automatically generating structured keywords from textual documents is a task that is closely related to summarizing documents, a common task in natural language processing. Extractive statistical systems have long been the standard approach to perform this task. In fact, LexKey is aiming at replacing an in-house modified TF-IDF extraction system [13] that works by ranking sentences or n-grams in the text and picking a few salient expressions to represent the document. This extractive approach is similar to those used by [14, 15, 16]. Doing so prevents the model from inventing falsehoods about the document’s content by constraining its output to segments from the input text. The traditional TF-IDF approach compares n-gram frequencies inside the document to those of the corpus at large. This selects groups of words that are rarely used but common in the document text. In our experience, our existing TF-IDF-based system applied to legal decisions yields keywords that are generally related to the facts of a case, but not to the legal principles discussed.</p>
          <p>However, human-written summaries and keywords found in law reports (books that collect and publish notable case law) and legal databases put an emphasis on broader legal doctrines, the pertinent case law and the statutes considered. The LexKey project aims at highlighting these concepts with new keywords.</p>
          <p>There have been some efforts to leverage transformer-based models such as BERT to extract the most relevant sentences as a summary [17], but they struggle to perform as well as abstractive models of similar complexity.</p>
        </sec>
        <sec id="sec-1-5-2">
          <title>2.2. Abstractive Methods</title>
          <p>Since the advent of BERT [2], abstractive models using the transformer architecture [<xref ref-type="bibr" rid="ref20 ref28">1</xref>] dominate the landscape of summarization tasks. Overall, most of the best performing models such as T5 [18], BART [12], MBART [10] and Pegasus [19] use an encoder-decoder architecture. Several elements drew us towards encoder-decoder models: the flexibility provided by separately designed encoder and decoder layers, the flexibility in the implementation of the pre-training objective (e.g. masked language modelling [2], denoising [12]) and of the attention architecture [<xref ref-type="bibr" rid="ref20 ref28">1, 6, 20</xref>], along with the ease of managing multilingual models by simply using source and target language prompts [10].</p>
          <p>More recently, at the time the LexKey project was already nearing its completion, large decoder-based models such as GPT [21] have also been shown to perform well in text generation tasks based on prompting schemes. The large-scale application of the most recently released models (i.e. ChatGPT [22], GPT-4 [23]) to our corpus is left for future work.</p>
          <p>All in all, most of the architectures we surveyed were performing very similarly on summarizing tasks, so the ease of adapting the model to our needs (training and inference cost over millions of documents in particular) became the primary differentiator.</p>
        </sec>
      </sec>
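      <p>The TF-IDF ranking described above can be sketched as follows. This is a minimal illustration of the general technique, not Lexum’s actual system [13]; the toy corpus, whitespace tokenization and smoothing are all assumptions made for the example.</p>
      <preformat>
```python
import math
from collections import Counter

def tf_idf_keywords(doc_tokens, corpus, top_k=3):
    """Rank a document's terms by TF-IDF against a background corpus."""
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for term, count in tf.items():
        # document frequency: number of corpus documents containing the term
        df = sum(1 for d in corpus if term in d)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed idf
        scores[term] = (count / len(doc_tokens)) * idf
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]

# hypothetical mini-corpus of tokenized decisions
corpus = [
    "the court found the appeal moot".split(),
    "the tribunal dismissed the claim".split(),
    "the court granted the injunction sought".split(),
]
doc = "the court considered whether the injunction was moot".split()
print(tf_idf_keywords(doc, corpus))
```
      </preformat>
      <p>Terms that are frequent in the document but rare in the corpus score highest, which is why such a system tends to surface fact-specific vocabulary rather than legal principles.</p>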
      <sec id="sec-1-6">
        <title>2.3. Handling Long Documents</title>
        <p>Early transformer models like BERT [2] were limited to fairly short sequences (512 tokens) due to their quadratic memory usage as a function of the input length. Research into more efficient encoder attention mechanisms or implementations is very active: Flash Attention [24], Big Bird [6], Longformer-Encoder-Decoder [25], LSG [20] and LongT5 [26] are all fairly recent models implementing such techniques. Most of them allow extending the input length up to around 16k tokens by making the memory usage scale more linearly in relation to the input length.</p>
        <p>Hierarchical Attention architectures [27] have also been tried but seem to be limited to around 4k-token inputs and require much more extensive architectural changes relative to the basic transformer implementation.</p>
        <p>In the context of Multi-LexSum, a summarization task of civil rights lawsuits, [28] also emphasized the long-range input issues faced by transformer models when dealing with a multi-document case.</p>
        <sec id="sec-1-6-1">
          <title>3. Datasets</title>
          <sec id="sec-1-6-1-1">
            <title>3.1. Sources</title>
            <p>One of the most important objectives of the LexKey project was to collate a representative set of annotated decisions that was large enough to train a highly capable keyword generation model whose output would meet the quality expectations of our users. Annotating tens of thousands of decisions by hand was unfeasible, so we decided to gather case law databases that already had keywords and categories added by experts.</p>
            <p>We had a few criteria for the selection of appropriate training material. The most obvious one was that the legal framework that created them should be fairly similar to Canadian Law. This meant limiting ourselves to countries that are under common law (mostly the Commonwealth member countries and to some degree the United States). The decisions also had to be either in French or English and have a keyword format somewhat similar to the style we were aiming at, in order to ease format normalization.</p>
            <p>Together with an experienced legal archivist, we set out to identify the decisions on CanLII [3] that featured keywords, often from Law Reports that had licensed their content to CanLII. We also contacted organizations that offered such information to members as an added value to see if they would be interested in collaborating. In the end, we gathered data from several Canadian sources, totalling 157 697 decisions with keywords (see Table 2).</p>
            <p>Gathering decisions from Law Reports introduced a fairly strong bias toward decisions that were of interest to legal practitioners, which feature appeal cases, decisions that are considered authorities on broad constitutional questions, or decisions that clarify a legal controversy. These differ from day-to-day decisions without precedential value.</p>
          </sec>
          <sec id="sec-1-6-1-2">
            <title>3.2. Preprocessing</title>
            <p>In order to pre-train our language model, we used the raw text (Table 3) extracted from the HTML of the 3.1M decisions, 100k commentaries, and 85k statutes and regulations in French and English available on CanLII. We also gathered a large collection of English-language decisions with appropriate keywords, but we could not get a suitable amount of French-language decisions with keywords in time for the first release. The text of every decision with keywords that we did not already have was also added to the pre-training dataset.</p>
            <p>The extraction process separates the keywords from the rest of the text, removes summaries and cleans up the resulting text by normalizing separators and whitespace, keeping the document structure intact. The following is an example of extracted raw text (Table 3): “On May 15, 2017 the Respondent, the Vancouver Park Board, passed a bylaw amendment applicable to parks within its jurisdiction, prohibiting the movement of whales to parks, the keeping of whales at parks (excluding whales which were already in a park on May 15, 2017) and the production or presentation in a park, of a show, performance or other form of entertainment involving whales. The only park within the jurisdiction of the Vancouver Park Board where whales are kept is Stanley Park, [...]”</p>
            <p>For the pre-training denoising task, the documents were further split into individual training samples in chunks of roughly 1024 tokens. These chunks were made by cutting the text along sentence boundaries using NLTK [29]. For the fine-tuning task (keyword generation), the preprocessed text is left in a single chunk. In this step, we also normalize the keywords that were either removed from the text or supplied by an external partner, to use as output targets.</p>
          </sec>
        </sec>
      </sec>
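      <p>The sentence-boundary chunking step can be sketched as follows. For illustration, a naive period split stands in for NLTK’s sentence tokenizer and whitespace words stand in for subword tokens; both are assumptions for the example, not the actual pipeline.</p>
      <preformat>
```python
def chunk_by_sentences(text, max_tokens=1024):
    """Greedily pack whole sentences into chunks of roughly max_tokens words."""
    # naive stand-in for nltk.sent_tokenize: split on sentence-final periods
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        # start a new chunk when adding this sentence would exceed the budget
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks

doc = "First sentence here. Second sentence follows. A third one ends it."
print(chunk_by_sentences(doc, max_tokens=6))
```
      </preformat>
      <p>Cutting along sentence boundaries keeps each training sample made of complete sentences, which matters for a denoising objective that includes sentence permutation.</p>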
      <sec id="sec-1-7">
        <title>3.3. Normalization</title>
        <p>The documents are then randomly shuffled and split into training (90%), validation (5%) and test (5%) sets. We also created a separate fine-tuning dataset containing the subset of decisions that had keywords. It is also split into training (90%), validation (5%) and test (5%) sets.</p>
        <p>When pre-training multilingual models based on MBART checkpoints, we used decisions in both languages, whereas pre-training and fine-tuning for non-MBART-based models are done only on English documents (we did not have enough French documents with keywords at that time). The test sets are English-only for all models.</p>
        <p>To avoid any information leakage, we use a bucket hashing strategy on an immutable unique index to ensure documents always end up in the same set across dataset versions. We first calculate the md5 hash of an id that is shared by every part or version of a document, then convert it to an integer. We then put this id in the correct bucket by taking the modulo of the number of buckets desired (20 in our case). This ensures, for example, that every part of a multipart document and any translation will always end up in the training set, or that a decision with keywords in the fine-tuning validation set will likewise be in the validation set of the pre-training dataset, even if we regenerate several iterations of the datasets with new material or preprocessing.</p>
        <table-wrap id="tab4">
          <label>Table 4</label>
          <caption><p>Documents in Pre-training Dataset</p></caption>
          <table>
            <thead><tr><th/><th>Avg. Length</th><th>Chunks</th><th>Count</th><th>%</th></tr></thead>
            <tbody>
              <tr><td>Train</td><td>6529.3</td><td>4.48M</td><td>2.92M</td><td>90%</td></tr>
              <tr><td>Valid</td><td>5634.8</td><td>249K</td><td>162K</td><td>5%</td></tr>
              <tr><td>Test</td><td>5655.4</td><td>249K</td><td>162K</td><td>5%</td></tr>
            </tbody>
          </table>
          <table-wrap-foot><p>Average length is the number of words per decision on the basis of NLTK.</p></table-wrap-foot>
        </table-wrap>
        <table-wrap id="tab5">
          <label>Table 5</label>
          <caption><p>Documents in Fine-tuning Dataset</p></caption>
          <table>
            <thead><tr><th/><th>Avg. Length</th><th>Count</th><th>%</th></tr></thead>
            <tbody>
              <tr><td>Train</td><td>4174.1</td><td>135K</td><td>90%</td></tr>
              <tr><td>Valid</td><td>4232.9</td><td>7K</td><td>5%</td></tr>
              <tr><td>Test</td><td>4172.6</td><td>7K</td><td>5%</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>The keywords we had available for our training data came from very different sources that rely on very different formats. To align them more closely to the format we wanted to display to our users, we performed extensive normalization.</p>
        <p>An issue that was immediately apparent is that some sources of keywords tend to be very terse, only two or three words, while others like the Supreme Court of Canada tend to be very long and mostly composed of descriptive sentences (Table 1). To be able to control the keyword format generated by our model and avoid having it copy whichever source format the document resembles, we introduced a keyword normalization step to our preprocessing. The complete keyword sequences were separated into a list of short keyword groups, descriptive sentences, and discussed jurisprudence and legislation. The sequence is split along the dashes into individual parts. We then use a regular expression to extract the discussed jurisprudence and legislation. Next, starting from the beginning of the sequence, we select parts of 4 words or less that do not contain a helping verb or pronoun as keywords. The rest is marked as descriptive sentences.</p>
        <p>One major problem that was identified when trying to generate descriptive sentences, and that led us to abandon that idea, was that the models were prone to hallucinating facts or subtly inverting logical propositions found in the text, thus creating believable-sounding keywords that misrepresented the decision. Removing these descriptive sentences did have some negative impact, as they allow more nuanced keywords that can refer to specific facts or arguments. It also meant that we had to remove some decisions with a set of keywords that had only a single short keyword (e.g. a broad domain) followed by descriptive sentences; once normalized, these keywords did not fit our preferred format.</p>
        <p>In the format we settled on, the descriptive sentences in the keywords are discarded and the short keywords are limited to 6 per keyword set (a document can have several groups of keywords if it discusses various issues). The short keywords are then consolidated using a handmade mapping file of equivalent subjects to avoid having different naming conventions.</p>
        <p>After extracting the keywords, further preprocessing is done to ensure good-quality training samples. We remove headers and get rid of very short decisions, since those are generally assigned the same keywords (e.g. the same related lower instance decision). The various extracted keywords are also categorized into short, medium and long formats, should we later decide that we do not want to use the medium-length keywords (the format chosen for the LexKey project) everywhere.</p>
        <sec id="sec-1-7-1">
          <title>3.4. Truecasing</title>
          <p>One major issue we faced while normalizing the keywords to create the fine-tuning dataset was that the capitalization was inconsistent: all caps for some words, title case for every word in some examples, and finally sentence case. We decided to settle on the Supreme Court of Canada’s (SCC) convention of sentence case, both because it was preferred by our legal experts and because the SCC is a large, consistent and well-curated source of examples.</p>
          <p>After trying standard tools such as NLTK’s part-of-speech tagger, which yielded poor results, we decided to fine-tune a separate version of our pre-trained lexBART model to output the proper casing on the keywords from the Supreme Court of Canada.</p>
          <p>To avoid potential miscopying issues, we trained the model to output a token representing the proper case (lowercase, capitalized or all caps) for each word in our training and validation data. This approach yielded very good casing in the appropriate format in most cases.</p>
          <p>Initially, we only applied truecasing to keywords from collections that were known to be badly formatted, but this preprocessing step was eventually applied to every keyword, as we kept tracking down casing issues to oddities in other collections that should have been properly cased.</p>
        </sec>
      </sec>
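      <p>The keyword normalization rules above (split on dashes, extract citations by regular expression, keep parts of 4 words or less that contain no helping verb or pronoun) can be sketched as follows. The word lists and the citation pattern are illustrative assumptions; the lists actually used in LexKey are not published.</p>
      <preformat>
```python
import re

# illustrative stop sets and citation cue; not the project's actual lists
HELPING_VERBS = {"is", "are", "was", "were", "be", "been", "being", "has", "have", "had"}
PRONOUNS = {"he", "she", "it", "they", "we", "who", "whom", "which"}
CITATION = re.compile(r"(R\.S\.C\.|S\.C\.R\.|ss?\.\s*\d)")  # crude statute cue

def normalize_keywords(sequence, max_words=4, max_keywords=6):
    """Split a dash-separated keyword sequence into short keywords,
    descriptive sentences and cited legislation/jurisprudence."""
    keywords, sentences, citations = [], [], []
    for part in re.split(r"\s+[-–—]\s+", sequence):
        words = part.split()
        lowered = set(w.lower().strip(",.") for w in words)
        if CITATION.search(part):
            citations.append(part)
        elif len(words) > max_words or lowered.intersection(HELPING_VERBS) or lowered.intersection(PRONOUNS):
            # too long or contains a helping verb/pronoun: descriptive sentence
            sentences.append(part)
        else:
            keywords.append(part)
    return keywords[:max_keywords], sentences, citations

seq = ("Constitutional law – Charter of Rights – "
       "Whether the legislation is justified – "
       "Criminal Code, R.S.C. 1970, c. C-34, s. 251")
print(normalize_keywords(seq))
```
      </preformat>
      <p>On the sample sequence, the two short parts become keywords, the five-word clause is classed as a descriptive sentence, and the Criminal Code reference is routed to the citations list.</p>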
      <sec id="sec-1-8">
        <title>3.5. Hand-Curated Test Set</title>
        <p>To validate that the models generate keywords of good quality on data from out-of-sample sources, we also had editors create a hand-labelled set of 500 documents distinct from the fine-tuning test set. They are sampled from courts and tribunals not found in any of our keyword sources. As such, none of these decisions were included in the fine-tuning dataset of the keyword generation task, and we ensured that they covered topics not common in legal reports.</p>
        <p>Editors selected them in proportions reflecting the true distribution of our complete corpus (see Table 6). To keep low-level tribunal decisions from dominating this test set, we deliberately sampled 60% of the documents from higher-level courts. The keywords assigned to these documents were also normalized, following the same steps shown earlier.</p>
        <sec id="sec-1-8-1">
          <title>4. Models</title>
          <p>Given that our proof of concept used BigBirdPegasus (BBP), our first idea was to try to modify this model into a multilingual variant. BBP is warm-started from English-only RoBERTa parameters [30], then pre-trained on an unsupervised Gap Sentence Generation task [19]. Therefore, replacing the initial parameters and the tokenizer with one of the many multilingual RoBERTa-based models [31] looked feasible at first. However, after some discussion with the authors on training cost estimates, it became apparent that this was not feasible on our hardware (a small server with two A6000 GPUs) and would require renting TPUs during most of the development process.</p>
          <p>To stay within our hardware capability, we decided to start with a pre-trained multilingual encoder-decoder model, MBART50 [10, 11], further pre-train it on our data (denoising task) and then modify the encoder attention to handle longer input sequences. To speed up the development process, we did most of the experiments on a smaller English-only BART-base model first (referred to as lexBART later). This allowed us to quickly identify the effect of various preprocessing steps and find suitable hyperparameters that could then be reused on the larger MBART50 model (lexMBART).</p>
        </sec>
      </sec>
      <sec id="sec-1-9">
        <title>4.1. Unsupervised Pre-training</title>
        <p>Starting from a pre-trained model checkpoint, we further pre-trained it using a denoising objective [12] on our dataset of around 3.2M legal documents (Table 4). This objective consisted of mask filling and fixing sentence permutation noise on chunks of the input documents. 15% of the tokens in each sample were masked, the sentences were randomly shuffled, and the model was tasked with correctly generating the initial sample.</p>
        <p>This pre-training was done using the standard cross-entropy loss [32] between the generated reconstituted text and the actual text before noise was applied, using gradient accumulation. After more than 93k steps, the loss on the evaluation set decreased from 1.02 to 0.43. In downstream tasks, the pre-training step turned out to yield a marginal gain of up to 0.8 points for ROUGE scores, which is consistent with the findings of [33, 8] that domain adaptation is beneficial.</p>
        <sec id="sec-1-9-1">
          <title>4.2. Handling Long Documents</title>
          <p>To enable this model to handle long inputs, we converted its full attention layers to LSG attention [20]. The conversion from full to sparse attention makes the model less memory-greedy. LSG uses, as the name suggests, a mix of local, sparse and global attention similar to Longformer [25], along with a pooled representation of the rest of the input sequence. This allows the memory usage of attention computations to scale linearly with the input sequence length, and not quadratically like traditional transformer models.</p>
          <p>Our pre-training objective requires that the output length is at least as long as the input length, so it is ill-suited to LSG-based models with different input and output lengths. Since the denoising task can be done with shorter document chunks, and since the size mismatch does not cause problems during fine-tuning as the keywords are always much shorter than the input text, we only modified the attention after the pre-training was done.</p>
        </sec>
      </sec>
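      <p>The construction of a denoising training pair (shuffle the sentences, mask about 15% of the tokens, keep the original text as the target) can be sketched as follows. This toy version works on whitespace words rather than subword tokens, and the mask token name is an assumption for the example.</p>
      <preformat>
```python
import random

def make_denoising_sample(sentences, mask_token="[MASK]", mask_rate=0.15, seed=0):
    """Build a (noisy input, target) pair: shuffle sentences, then mask tokens.
    Word-level toy version of the subword noising used with BART-style models."""
    rng = random.Random(seed)
    target = " ".join(sentences)          # the model must reconstruct this
    shuffled = sentences[:]
    rng.shuffle(shuffled)                 # sentence permutation noise
    tokens = " ".join(shuffled).split()
    n_mask = max(1, round(mask_rate * len(tokens)))
    for i in rng.sample(range(len(tokens)), n_mask):
        tokens[i] = mask_token            # mask-filling noise
    return " ".join(tokens), target

noisy, target = make_denoising_sample(
    ["The court allowed the appeal.", "Costs were awarded to the appellant."]
)
print(noisy)
print(target)
```
      </preformat>
      <p>The model is then trained with cross-entropy to map the noisy sequence back to the target, which is why the output must be at least as long as the input for this objective.</p>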
      <sec id="sec-1-10">
        <title>4.3. Supervised Fine-tuning for Keyword Generation</title>
        <p>After pre-training, the model architecture is converted to LSG attention to allow a larger sequence input length of 4096 or 8192 tokens. It is then fine-tuned to generate the correct keywords on our labelled dataset using a cross-entropy loss. Documents exceeding the maximum input length are simply truncated by the tokenizer.</p>
        <p>While the LSG attention can be modified to accept input lengths longer than 8k tokens, doing so proved to stretch fine-tuning time to an impractical degree. On our in-house hardware, fine-tuning an MBART50-large model extended to 4k input tokens took roughly 6 days, and was proportionally longer with longer inputs (taking 12 days for 8k); a prototype trained on a single A6000 GPU would likely take around half the training time on 2 units. In addition, 20% of our dataset is longer than 4k, and only 6% is longer than 8k. For monolingual models, French documents are removed from the training and validation sets, and the test set is English-only for all models.</p>
        <p>After some experimentation, we settled on beam search as the generation strategy. We found that using 10 beams gives the best balance between generation speed and quality. We also added a rule-based post-generation processing step to remove any leftover repetitions, as beam search is prone to this issue.</p>
      </sec>
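      <p>A rule-based repetition cleanup of the kind described can be sketched as follows; this is our own minimal illustration, assuming keywords are joined by an en-dash separator as in the examples of Table 1, not the exact post-processing rules used in production.</p>
      <preformat>
```python
def remove_repetitions(keyword_sequence, separator=" – "):
    """Drop duplicate keywords from a generated sequence, keeping first occurrences."""
    seen, cleaned = set(), []
    for part in keyword_sequence.split(separator):
        key = part.strip().lower()          # case-insensitive duplicate check
        if key and key not in seen:
            seen.add(key)
            cleaned.append(part.strip())
    return separator.join(cleaned)

print(remove_repetitions("Criminal law – Sentencing – Criminal law – Appeal"))
```
      </preformat>
      <p>Keeping only the first occurrence preserves the general-to-specific ordering of the keyword sequence while removing the loops that beam search occasionally produces.</p>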
    </sec>
    <sec id="sec-2">
      <title>5. Results</title>
      <sec id="sec-2-1">
        <title>5.1. Quantitative Analysis</title>
        <p>The order within keyword sequences matters, as terms range from the most general to the most specific legal principles. Therefore, we settled on using ROUGE [34] for assessing the models’ performance. As can be seen from Table 9, all the models trained on the fine-tuning dataset outperformed both our initial BigBirdPegasus prototype and the legacy TF-IDF system on ROUGE scores by large margins of at least 10 points. The models with larger input lengths have slightly better scores, with the largest model, lexMBART LSG-8k, performing best. The 8k model scored the same as the 4k model on decisions of more than 4096 but fewer than 8192 tokens; for longer decisions, it performed significantly better (by around 1.2 ROUGE) than the 4k model.</p>
        <p>These preliminary results are however contradicted by the Hand-Curated Test Set (Table 10), where the much smaller lexBART model produced the best scores, followed by the lexMBART LSG-4k model. This suggests that the models’ ability to generalize to documents from unseen courts is not guaranteed to improve as the sequence input length increases. It is also possible that longer input may dilute the relevant information in cases where the model is unsure.</p>
        <table-wrap id="tab12">
          <label>Table 12</label>
          <caption><p>Impact of Preprocessing Steps on lexBART. The top scores are in bold font and the second best are underlined.</p></caption>
          <table>
            <thead><tr><th>Steps</th><th>ROUGE1</th><th>ROUGE2</th><th>ROUGEL</th></tr></thead>
            <tbody>
              <tr><td>Baseline</td><td>49.0</td><td>30.7</td><td>41.2</td></tr>
              <tr><td>+Pre-training</td><td>49.8</td><td>30.9</td><td>41.3</td></tr>
              <tr><td>-Summaries</td><td>49.4</td><td>34.5</td><td>46.9</td></tr>
              <tr><td>+Normalisation</td><td>50.2</td><td>35.6</td><td>47.8</td></tr>
              <tr><td>+Truecasing</td><td>50.5</td><td>35.7</td><td>48.1</td></tr>
              <tr><td>+Additional Sources</td><td>55.7</td><td>43.6</td><td>54.4</td></tr>
            </tbody>
          </table>
          <table-wrap-foot><p>Baseline: only basic preprocessing and starting from BART-base.</p></table-wrap-foot>
        </table-wrap>
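        <p>For reference, a ROUGE-N F1 score of the kind used here can be sketched as follows; evaluations normally rely on a dedicated ROUGE package, so this whitespace-tokenized version is only an illustration of the metric.</p>
        <preformat>
```python
from collections import Counter

def rouge_n_f1(candidate, reference, n=1):
    """Compute a simple ROUGE-N F1 between whitespace-tokenized strings."""
    def ngrams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    # clipped n-gram overlap between candidate and reference
    overlap = sum(min(count, ref[g]) for g, count in cand.items())
    if not overlap:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge_n_f1("criminal law sentencing", "criminal law sentencing appeal"))
```
        </preformat>
        <p>Because the metric rewards n-gram overlap in order, it is sensitive to the general-to-specific ordering of keyword sequences, which motivated its use here.</p>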
        <p>From Table 12, we can see the efect of each change to model output during qualitative evaluation. Some may
the model or dataset on the lexBART model that scored have been translated from French.
the best on the Hand-Curated Test Set, starting from the The first example shows a marked improvement over
BART-base model. This is the same model that was used the legacy model. The new keywords are on-topic, and
to figure out hyperparameters for the bilingual MBART models. Every step of the normalization process helped the model improve. Even removing the decision summaries from the input text only hurt the ROUGE1 score slightly. We were surprised by this low impact since we can expect summaries to contain the important information that would be found in keywords. However, all the steps related to data quality improved the ROUGE score far less than simply adding more sources of annotated decisions.</p>
        <p>While those steps did not individually result in markedly improved metrics, their combination had a noticeable impact on the measured quality of the generation. In particular, while pre-training only improved ROUGE by 0.1 to 0.8 points, after this step, the model-generated keywords looked much more on-topic for decisions discussing subjects not found within the fine-tuning dataset (see Table 13 for an example).</p>
        <p>5.2. Manual Qualitative Analysis</p>
        <p>While ROUGE scores are useful when comparing models with each other, they cannot determine whether the model’s outputs meet our users’ quality standards. To do so, we had the final model generate keywords for the 500-document hand-curated test set and had experts compare them to the manually generated keywords. To help them, we automatically verified if the discussed legislation could be found in the text input.</p>
        <p>cover the same subjects as the gold standard without missing important aspects. Meanwhile, the TF-IDF picks words like “recommended”, “death” and “friend” that have no bearing on the case. Likewise, the second example shows another criminal case from a provincial court where the model’s keywords are far better than the legacy ones.</p>
        <p>In the third, we can see one of the problematic cases where the model over-generalizes from its training data. Some tribunals that are only found in our data when their decision is appealed all have the “Judicial review” keyword even if the decision is not being reviewed. In this case, while the model correctly identifies the topic, it incorrectly generates the “Judicial review” keyword.</p>
        <p>The fourth, a case about compensation for a work injury, had the model pick up on the description of the accident and identify it as a murder case. The model hallucinated murders in this fashion often enough that we had to keep the legacy keywords on these tribunals. The next two are examples of very short decisions. We found the model prone to inventing nonsense when processing decisions with limited context, hence our decision to not generate new keywords on decisions with less than a few paragraphs. The first of the two is only four short paragraphs, but the model does a good job of identifying the important idea. The second is a single sentence and causes the model to generate a keyword that is nearly as long and unhelpful.</p>
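The legacy extractor’s failure mode discussed above (surfacing case-specific but legally irrelevant words such as “recommended” or “friend”) follows directly from how plain TF-IDF scores terms: words shared by most decisions get low IDF, while incidental words unique to one case score highest. A minimal sketch, not Lexum’s actual implementation:

```python
import math
from collections import Counter

def tfidf_keywords(doc_tokens, corpus, k=5):
    """Rank a document's tokens by TF-IDF against a background corpus."""
    n_docs = len(corpus)
    # Document frequency: in how many corpus documents each term appears.
    df = Counter()
    for other in corpus:
        df.update(set(other))
    tf = Counter(doc_tokens)
    scores = {
        term: (count / len(doc_tokens)) * math.log(n_docs / (1 + df[term]))
        for term, count in tf.items()
    }
    return [t for t, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]

# Toy corpus of "decisions": terms common across cases (court, law) get low
# IDF, while incidental words unique to one case (friend, recommended) rank
# highest even though they say nothing about the legal issue.
corpus = [
    ["court", "law", "murder", "sentencing", "parole"],
    ["court", "law", "contract", "damages"],
    ["court", "law", "assault", "evidence"],
]
doc = ["court", "law", "murder", "friend", "recommended", "friend"]
print(tfidf_keywords(doc, corpus + [doc], k=3))
```

On this toy input the top-ranked terms are the off-topic “friend” and “recommended”, mirroring the behaviour the qualitative review observed in the legacy system.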
        <p>We must first emphasize that the model performed poorly on some subsets of the decisions. For example, it generates mostly random keywords when the decision is very short, but other recurring issues have been identified (see examples in Table 14). We have fixed these.</p>
        <p>Two other fairly common problem cases found during the qualitative analysis were missing topics and erroneous facts. In the missing topic example, we can see that the keywords are on-topic but that an important notion, “Evidence”, is not mentioned. This was considered acceptable by reviewers. On the other hand, sometimes the model would create keywords that misrepresented the decision (in the last case, lexMBART wrongfully refers to “Aboriginal persons”). This kind of error can be hard to find without carefully reading the whole decision and is quite problematic to users. Thankfully, it appears to be rare enough to be acceptable in keywords labelled as automatically generated.</p>
        <p>7. Only basic preprocessing and starting from BART-base.</p>
        <p>started from MBART checkpoint, further pre-trained on a large corpus of legal documents, and fine-tuned to produce structured keywords similar to those produced by legal publishers. We believe that both this model and, especially, this large multi-source corpus will allow us to continue to leverage the current NLP advances into more useful automation that previously had to be performed by hand by experts.</p>
        <p>6. Discussion</p>
        <p>Overall in these experiments, lexBART scored well despite its much lower parameter count. If bilingual French and English support was not a requirement down the line, its good performance would be a strong argument for picking the smaller model.</p>
        <p>Both LSG models outperformed lexMBART but, all in all, the qualitative analysis showed only a limited difference in output quality between the lexMBART-LSG-4k and 8k models. Since the latter takes twice as long to run in both training and inference (and thus costs twice as much), we eventually decided to deploy the lexMBART-LSG-4k model to production. This necessitated adding an editorial override and keyword blacklisting feature to the publication pipeline and deploying the model using torch-serve to our AWS cloud environment, where it replaced the legacy TF-IDF system on 1.3M English language decisions from the selected courts and tribunals.</p>
        <p>Processing those decisions took 3 days on a G5.12xlarge machine from AWS (using 4 workers, 16 threads and a batch size of 2 per GPU). The current day-to-day intake of around 600 decisions per day is handled by a G4dn.xlarge running a single worker with a batch size of one.</p>
        <p>Despite the limitations we outlined, our custom-made keyword generator was found to perform well, generating useful keywords for a large subset of our documents. Most of the undesirable behaviours could be curbed through some post-processing and a carefully chosen keyword format.</p>
        <p>The LexKey project started in May 2021 and in February 2023, the fine-tuned model was deployed live on CanLII on a large part of the English corpus. Both the language model and the dataset will also likely be reused in other upcoming projects like document classification. Although we cannot release the dataset because of editors’ policy, we will make our pre-training and fine-tuning scripts available8 along with the pre-trained model itself. By doing so, we also intend to showcase what is technically feasible for a small legal tech company of our size when it comes to keyword generation.</p>
        <p>In the next development phase, we plan to source more French language decisions with keywords to provide the same feature on CanLII for both official languages (French documents were too scarce to be included in the fine-tuning dataset). We have also acquired 144K decisions from the Harvard Caselaw Access Project and 64K decisions from the Australian Federal Courts and will use them to validate whether adding data from other common law countries can help improve our model. We will also be experimenting with other legal document types like briefs or doctrines to see if our model can provide useful keywords. Finally, we intend to test the most recent large language models such as GPT-4 [23] for keyword generation.</p>
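The guardrails described in this section (skipping very short decisions so the legacy keywords are kept, dropping blacklisted over-generalized keywords, and letting an editorial override win) can be sketched as below. The threshold, names and blacklist contents are illustrative assumptions, not the production values:

```python
# Hypothetical sketch of the deployment guardrails; all names are ours.
MIN_PARAGRAPHS = 4                 # assumed threshold; the paper says "a few"
BLACKLIST = {"Judicial review"}    # e.g. keywords over-generated on some tribunals

def postprocess(paragraphs, generated, override=None):
    """Return the keyword list to publish, or None to keep legacy keywords."""
    if override is not None:
        # An editorial override from the publication pipeline always wins.
        return override
    if len(paragraphs) >= MIN_PARAGRAPHS:
        # Drop any generated keyword containing a blacklisted hierarchy level.
        kept = [kw for kw in generated
                if not any(part in BLACKLIST for part in kw.split(" — "))]
        return kept or None
    # Too little context for the generator to be trusted: fall back to legacy.
    return None
```

A `None` return models falling back to the legacy TF-IDF keywords, matching the decision not to generate new keywords on decisions shorter than a few paragraphs.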
      </sec>
      <sec id="sec-2-2">
        <title>7. Conclusion</title>
        <p>In this paper, we present the work done to leverage the recent advances in language modelling and generation to produce useful keywords to augment search results on the CanLII website. These efforts yielded an encoder-decoder language model named lexMBART-LSG-4k (nicknamed LexKey for the sake of simplicity) that was warm-started from a MBART checkpoint.</p>
        <p>8 https://github.com/Lexum/lexkey-public</p>
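The structured keywords produced by the model are hierarchical paths whose levels run from the broad area of law down to the specific issue, separated by em dashes. A minimal sketch of how such a keyword could be split into its levels (the function name is ours, not from the paper):

```python
def parse_keyword(keyword):
    """Split a publisher-style keyword into its hierarchy levels."""
    return [level.strip() for level in keyword.split("—")]

path = parse_keyword("Criminal law — Murder — Second degree murder — Sentencing")
# The first level is the broad area of law; later levels narrow the issue.
area, issue = path[0], path[-1]
print(area, "/", issue)
```

Keeping the format this regular is what makes simple post-processing, such as matching individual levels against a blacklist, feasible.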
        <sec id="sec-2-2-1">
          <title>Keyword examples</title>
          <p>Gold: Criminal law — Offences — Murder — Second degree murder — Sentencing
Gold: Criminal law — Prisons — Prisoners — Releases — Parole
Legacy TF-IDF: period of parole ineligibility — offender — recommended — friends — death
lexMBART: Criminal law — Murder — Second degree murder — Sentencing — Parole ineligibility
Assessment: Good</p>
          <p>Gold: Criminal law — Sexual offences, public morals and disorderly conduct — Sexual exploitation — Evidence
Gold: Criminal law — Offences against person and reputation — Sexual assault — General offence — Evidence
Legacy TF-IDF: identification — omitted for publication — witness — sexually assaulted — don t know
lexMBART: Criminal law — Sexual offences — Sexual assault — Sexual touching — Evidence
Assessment: Good</p>
          <p>Gold: Labour and employment law — Labour law — Collective agreement — Management rights — Surveillance of employees
Gold: Privacy and freedom of information — Provincial privacy legislation — Collection of personal information — Purpose and use
Legacy TF-IDF: screen captures — recording of incoming calls — requirement to record time codes — reasonableness — analysis
lexMBART: Labour law — Arbitration — Judicial review
Assessment: Wrong - “Labour law — Arbitration” is good but “Judicial review” is wrong.</p>
          <p>Legacy TF-IDF: knee — worker — pre-existing degenerative changes — pre-existing condition — aggravation
lexMBART: Criminal law — Murder — Second degree murder — Evidence — Identification
Assessment: Wrong - about workplace injury, not murder</p>
          <p>Legacy TF-IDF: complied — fine — imprisoned — merits — varying
lexMBART: Practice and procedure — Fine — Compliance with judgment
Assessment: Good</p>
          <p>Decision text: The Reasons for Judgment rendered in file T-1291-97 apply to the appellant in this file.
Legacy TF-IDF: file — rendered — apply
lexMBART: Practice — Judgments and orders — Reasons for judgment — Application to vary</p>
          <p>Gold: Labour and employment law — Labour law — Unfair labour practices — Employer practices — Interference with union activities
Gold: Labour and employment law — Labour law — Unfair labour practices — Remedies — Miscellaneous
Legacy TF-IDF: vote — scheduled — unfair labour practice — shift — hours
lexMBART: Labour relations — Certification — Wishes and preferences — Employee vote
Assessment: Wrong - “Wishes and preferences” is wrong. Also incomplete.</p>
          <p>Gold: Criminal law — Offences against person and reputation — Sexual assault — General offence — Evidence
Gold: Evidence — Hearsay — Traditional exceptions to rule against admission — Spontaneous statements
Legacy TF-IDF: audio recordings — testimony — evidence — sexual assault — witness
lexMBART: Criminal law — Sexual offences — Sexual assault — Offences against persons — Unlawful confinement
Assessment: Not Bad - No mention of Evidence</p>
          <p>Gold: Criminal law — Offences — Robbery — Sentencing — Adult offenders
Legacy TF-IDF: sentence — pre-sentence report — robberies — community — offences
lexMBART: Criminal law — Property offences — Robbery — Sentencing — Aboriginal persons</p>
          <p>Assessment: Wrong - par. 40 “This offender is not aboriginal.”</p>
          <p>[21] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems 33 (2020) 1877–1901.
[22] OpenAI, Introducing ChatGPT, https://openai.com/blog/chatgpt, 2022. Accessed: 2023-02-05.
[23] OpenAI, GPT-4 technical report, 2023. arXiv:2303.08774.
[24] T. Dao, D. Fu, S. Ermon, A. Rudra, C. Ré, FlashAttention: Fast and memory-efficient exact attention with IO-awareness, Advances in Neural Information Processing Systems 35 (2022) 16344–16359.
[25] I. Beltagy, M. E. Peters, A. Cohan, Longformer: The long-document transformer, 2020. URL: https://arxiv.org/abs/2004.05150. doi:10.48550/ARXIV.2004.05150.
[26] M. Guo, J. Ainslie, D. Uthus, S. Ontanon, J. Ni, Y.-H. Sung, Y. Yang, LongT5: Efficient text-to-text transformer for long sequences, in: Findings of the Association for Computational Linguistics: NAACL 2022, Association for Computational Linguistics, Seattle, United States, 2022, pp. 724–736. URL: https://aclanthology.org/2022.findings-naacl.55. doi:10.18653/v1/2022.findings-naacl.55.
[27] I. Chalkidis, X. Dai, M. Fergadiotis, P. Malakasiotis, D. Elliott, An exploration of hierarchical attention transformers for efficient long document classification, 2022. URL: https://arxiv.org/abs/2210.05529. doi:10.48550/ARXIV.2210.05529.
[28] Z. Shen, K. Lo, L. Yu, N. Dahlberg, M. Schlanger, D. Downey, Multi-LexSum: Real-world summaries of civil rights lawsuits at multiple granularities, in: Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022. URL: https://openreview.net/forum?id=z1d8fUiS8Cr.
[29] S. Bird, E. Loper, NLTK: The natural language toolkit, in: Proceedings of the ACL Interactive Poster and Demonstration Sessions, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 214–217. URL: https://aclanthology.org/P04-3031.
[30] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, 2019. URL: https://arxiv.org/abs/1907.11692. doi:10.48550/ARXIV.1907.11692.
[31] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, É. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 8440–8451.
[32] I. J. Good, Rational decisions, Journal of the Royal Statistical Society. Series B (Methodological) 14 (1952) 107–114. URL: http://www.jstor.org/stable/2984087.
[33] L. Zheng, N. Guha, B. R. Anderson, P. Henderson, D. E. Ho, When does pretraining help? Assessing self-supervised learning for law and the CaseHOLD dataset of 53,000+ legal holdings, in: Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law, 2021, pp. 159–168.
[34] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 74–81. URL: https://aclanthology.org/W04-1013.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>Canadian Legal Information Institute, https://www.canlii.org/en/, 2001. Accessed: 2023-02-05.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>Supreme Court of Canada | Cour suprême du Canada, Style Manual | Guide de rédaction, 1987.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, A. Rush, Transformers: State-of-the-art natural language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 38–45. URL: https://aclanthology.org/2020.emnlp-demos.6. doi:10.18653/v1/2020.emnlp-demos.6.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, L. Yang, et al., Big Bird: Transformers for longer sequences, in: NeurIPS, 2020.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>E. Sharma, C. Li, L. Wang, BIGPATENT: A large-scale dataset for abstractive and coherent summarization, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 2019, pp. 2204–2213. URL: https://aclanthology.org/P19-1212. doi:10.18653/v1/P19-1212.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>N. Garneau, E. Gaumond, L. Lamontagne, P.-L. Déziel, CriminelBART: a French Canadian legal language model specialized in criminal law, in: Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law, 2021, pp. 256–257.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>S. Douka, H. Abdine, M. Vazirgiannis, R. El Hamdani, D. Restrepo Amariles, JuriBERT: A masked-language model adaptation for French legal text, in: Proceedings of the Natural Legal Language Processing Workshop 2021, 2021, pp. 95–101.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>Y. Liu, J. Gu, N. Goyal, X. Li, S. Edunov, M. Ghazvininejad, M. Lewis, L. Zettlemoyer, Multilingual denoising pre-training for neural machine translation, Transactions of the Association for Computational Linguistics 8 (2020) 726–742.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>Y. Tang, C. Tran, X. Li, P.-J. Chen, N. Goyal, V. Chaudhary, J. Gu, A. Fan, Multilingual translation with extensible multilingual pretraining and finetuning, 2020. arXiv:2008.00401.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 7871–7880. URL: https://aclanthology.org/2020.acl-main.703. doi:10.18653/v1/2020.acl-main.703.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>J. Ramos, Using TF-IDF to determine word relevance in document queries, 1999.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>A. Farzindar, G. Lapalme, LetSum, an automatic legal text summarizing system, in: Legal Knowledge and Information Systems: JURIX 2004, the Seventeenth Annual Conference, volume 120, IOS Press, 2004, p. 11.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>B. Hachey, C. Grover, Extractive summarisation of legal texts, Artificial Intelligence and Law 14 (2006) 305–345.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>S. Polsley, P. Jhunjhunwala, R. Huang, CaseSummarizer: A system for automated summarization of legal texts, in: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, The COLING 2016 Organizing Committee, Osaka, Japan, 2016, pp. 258–262. URL: https://aclanthology.org/C16-2054.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>D. Miller, Leveraging BERT for extractive text summarization on lectures, 2019. URL: https://arxiv.org/abs/1906.04165. doi:10.48550/ARXIV.1906.04165.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, et al., Exploring the limits of transfer learning with a unified text-to-text transformer, The Journal of Machine Learning Research 21 (2020) 5485–5551.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>J. Zhang, Y. Zhao, M. Saleh, P. Liu, PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization, in: International Conference on Machine Learning, PMLR, 2020, pp. 11328–11339.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>C. Condevaux, S. Harispe, LSG attention: Extrapolation of pretrained transformers to long sequences, in: Advances in Knowledge Discovery and Data Mining: 27th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2023, Osaka, Japan, May 25–28, 2023, Proceedings, Part I, Springer, 2023, pp. 443–454.</mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <article-title>Language models are few-shot learners</article-title>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>