The Gap between Deep Learning and Law: Predicting Employment Notice

Jason T. Lam (jason.lam@queensu.com), Queen's University, Kingston, Ontario
David Liang (david.liang@queensu.com), Faculty of Law, Queen's University, Kingston, Ontario
Samuel Dahan (samuel.dahan@queensu.ca), Faculty of Law, Queen's University, Kingston, Ontario; Cornell Law School, Cornell University, Ithaca, New York
Farhana Zulkernine (farhana@cs.queensu.com), School of Computing, Queen's University, Kingston, Ontario

ABSTRACT

This study aims to determine whether Natural Language Processing with deep learning models can shed new light on the Canadian calculation system for employment notice. In particular, we investigate whether deep learning can enhance the predictability of the notice period, that is, whether it is possible to predict the notice period with high accuracy. A major challenge with the classification of reasonable notice is the inconsistency of the case law. As argued by the Ontario Court of Appeal, the process of determining reasonable notice is "more art than science". In a previous study, we assessed the predictability of reasonable notice periods by applying statistical machine learning to a hand-annotated dataset of 850 cases. Building on that study, this paper applies state-of-the-art deep learning models to free-text summaries of cases. We further experiment with a variety of domain adaptations of state-of-the-art pretrained BERT-esque models. Our results appear to show that the domain adaptations of BERT-esque models negatively affected performance. Our best performing model was an out-of-the-box RoBERTa base model, which achieved 69% accuracy using a +/-2 month prediction window.

CCS CONCEPTS

• Applied computing → Law; • Computing methodologies → Artificial intelligence; Natural language processing.

KEYWORDS

Deep Learning, Employment Law, Reasonable Notice, Employment Termination, Legal Analytics, Predictive Analytics, Consistency

ACM Reference Format:
Jason T. Lam, David Liang, Samuel Dahan, and Farhana Zulkernine. 2020. The Gap between Deep Learning and Law: Predicting Employment Notice. In Proceedings of the 2020 Natural Legal Language Processing (NLLP) Workshop, 24 August 2020, San Diego, US. ACM, New York, NY, USA, 5 pages.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). NLLP @ KDD 2020, August 24th, San Diego, US. © 2020 Copyright held by the owner/author(s).

1 INTRODUCTION

The system of law that governs work in Canada (outside Quebec) consists of three overlapping regimes: the common law regime, the regulatory regime and the collective bargaining regime (also called labour law or the law of unionized workers). This paper focuses on the common law regime and in particular the employment law principles that apply to notice of termination, one of the most litigated issues in Canada. One peculiarity of the Canadian system is that while each province has specific regulatory standards, the common law of employment, in principle, applies in the western provinces of Canada [10]. Common law is usually defined as a system of judge-made rules that uses a precedent-based approach to case law. Earlier decisions pertaining to similar facts or legal issues guide later decisions in an attempt to create legal predictability. In an effort to ensure legal certainty and predictability, judges must follow the reasoning in earlier cases that address the same legal issues and similar facts. That said, common law rules and their interpretations evolve with societal values, and therefore the interpretation and application of common law principles can sometimes be inconsistent and unpredictable, including when it comes to termination notice [16].

Upon termination, if the employment relationship is governed by an indefinite contract and there is no termination provision limiting the employee's rights, the employer has the obligation to provide either notice or pay in lieu of notice¹. Should the employer fail to comply with this obligation, courts attempt to determine what compensation the employee would have received during that period if adequate notice had been provided, as well as damages for that loss, less any mitigation income. Courts typically begin their analysis of what constitutes "reasonable notice" by looking at the so-called "Bardal factors", described in the landmark case Bardal v. Globe & Mail Ltd: 1) age of the employee, 2) length of service, 3) character of employment and 4) availability of similar employment² [1].

While the Bardal test has been designed as an objective calculation system, there is no clear indication of how much weight should be given to each factor, nor of how the factors should be utilized [1]. Accordingly, the case law on employment notice has been noted to be inherently inconsistent and subjective, and there does not seem to be a "right" figure for reasonable notice. As Justice Dunphy wrote of the calculation of reasonable notice periods, "[it] is more art than science but must be one that is fair in all of the circumstances" [2]. There have also been additional layers of complication as judges have considered factors beyond those explicitly mentioned in Bardal, such as inducement, in which an employer's act in bad faith results in aggravated damages (Wallace damages) [3].

In this paper we investigate whether deep learning models can enhance the predictability of termination notice. Using advances in pretrained models such as BERT [9], we investigate the effectiveness of domain adaptations and benchmark our results against deep learning models that have shown success across multiple domains, including law. We build on previous work in Dahan et al. [8], in which we analyzed the prediction of reasonable notice using statistical machine learning on a hand-annotated tabular dataset and demonstrated a lack of consistency amongst the judgements.

In the balance of this paper we present related deep learning research for legal text analytics and a description of the problem, followed by a discussion of the data, models and methods utilized in this research. We end with our results and discussion, followed by a conclusion and recommendations for future work.

¹ There is no obligation to provide reasonable notice when there is a lawful termination provision (see Machtinger v. HOJ Industries, SCC) or where there is a fixed-term contract [7] [10].
² Most employment contracts are of indefinite length, and the law implies a term that employers must provide employees with reasonable notice that the relationship is ending. See, e.g., Machtinger v. HOJ Industries Ltd, [1992] 1 SCR 986, 7 OR (3d) 480.

2 RELATED WORK

In the literature, the application of NLP to legal analytics is still new, and there exist very few implementations predicting court decisions or classifying legal data [8]. Soh et al. [13] explored multiple statistical machine learning approaches, out-of-the-box deep learning models such as BERT, and a shallow convolutional neural network to classify 6,277 Singapore Supreme Court judgements into 31 different legal areas. Their results showed that the statistical model performed best, with a micro-F1 of 63.2 and a macro-F1 of 73.3 [13]. Our problem differs from Soh et al. [13] in that they classified the area of law using the entirety of a case, while we wish to predict the outcome.

While in machine learning 3-4 instances of a sample is considered too few, in law, 3-4 samples of precedent are considered plenty. Few-shot predictors attempt to generalize classes with few training samples. Luo et al. [20] proposed a model for predicting criminal charges by leveraging related law articles. They used a hierarchical attention mechanism to create a document representation, and a different stack of attention components was trained to select the best supporting legislative statutes for a given case. Their model was trained on 50,000 case documents extracted from China Judgments Online³, and only predicted criminal charges that had at least 80 cases. For the sake of simplicity, the authors only considered cases in which there was a single defendant. Luo et al. [20] reported an F1 micro/macro of 90.21/80.48 and compared the performance of their model to other baselines they had built. Hu et al. [14] performed three experiments that trained multiple baselines, including the model presented by Luo et al. [20], on three datasets comprising 61,589/153,521/306,900 factual case summaries generated from China Judgments Online. They reported better results than Luo et al. [20], with macro-F1 scores of 64.0/67.1/73.1 on their small/medium/large datasets, respectively. We note that Hu et al. [14] truncate documents to a maximum length of 500, and that China Judgments Online appears to be a database of fact descriptions, not full cases.

Chalkidis et al. [6] used a dataset of European Court of Human Rights (ECHR) cases totalling 11,500. They reported the results of a variety of deep learning architectures on three tasks: binary classification, multi-class classification and case importance prediction. They further introduced a Hierarchical-BERT, which first produced fact embeddings that were then used with a self-attention mechanism to formulate the document embedding. Their Hierarchical-BERT was their best performing model in both the binary and multi-label classification tasks, with F1 scores of 82 and 60.8, respectively. In their case importance task, their majority-class classifier achieved the lowest mean squared error of 0.369, with Hierarchical-BERT achieving the best Spearman's ρ of .527. Although we implement similar models to Chalkidis et al. [6], we note that our task of predicting an outcome differs from the classification of an entire document.

We note that as of this writing, there does not appear to be existing literature on using deep learning for the prediction of reasonable notice from free text.

³ http://wenshu.court.gov.cn/

3 PROBLEM DESCRIPTION

The aim of this project is to predict how judges determine "reasonable notice" in employment termination cases. As argued earlier, if the employment relationship is governed by an indefinite contract and the employer wishes to terminate the employment relationship for any reason, the employer has the obligation to provide notice — usually denoted in months — or pay in lieu of notice, calculated according to the Bardal factors⁴.

While the primary goal of this research is to assess the predictive power of deep learning models when it comes to notice of termination, it is worth noting that this research is drawn from a larger project aimed at developing an open-source system for small-claims disputes, including employment disputes: MyOpenCourt. This system aims to promote access to small-claims justice and provide legal help to self-represented litigants by democratizing legal analytics technology. Like many AI legal tools, MyOpenCourt provides legal information by requiring users to fill out a multiple-choice questionnaire and outputting a single numerical prediction along with a list of relevant precedents.

Instead, in this research we propose a system that outputs a prediction of reasonable notice based on a free-text summary of the case law. We used deep learning in conjunction with Natural Language Processing (NLP) to calculate the period of reasonable notice from manually typed summaries of adjudicated cases. The summaries are unstructured text data written in plain English (i.e. not "legalese"), collected from WestLaw's Quantum service⁵. These summaries contain sufficient information for a trained legal professional to approximately determine the notice award without looking at the outcome of the case. We decided to use summaries instead of entire cases because of the "fussy" nature of common law text [11]. The lack of consistent styling in Canadian cases made it difficult to automatically parse and extract the legal facts. As our goal is to predict the outcome of a case, we require the inputs to contain no mention of the legal analysis or the outcome. The full legal cases are long, often more than ten pages and averaging 5,000 words, and introduce ambiguity through an abundance of information that must be carefully filtered to extract what is useful. Furthermore, legal cases are decided by a multitude of judges, which leads to many idiosyncrasies in writing style and case structure.

⁴ "Payment in lieu of notice" is immediate compensation at an amount equal to what an employee would have earned as salary or wages by working through the whole notice period.
⁵ https://www.westlawnextcanada.com/quantums/

4 DATA

The WestLaw Quantum service⁵ provides a brief synopsis of each case, often including the judgement on the reasonable notice period. Since the goal of this research is to predict the reasonable notice period, any mention of the judgement regarding the notice period was manually removed, leaving only factual descriptions of the plaintiff and the characteristics of the case. We prepended the input with the year of the judgement and the occupation category, age, salary, job title and duration of employment of the plaintiff, extracted from the summaries. If the information could not be found, nothing was prepended to the summary. The summaries appeared to be written in plain English by human writers. Each case outcome is considered to be its own class, with outcomes of 25 months or greater grouped together. Our classification task had a total of 25 classes.

5 MODELS AND METHODS

Our dataset totaled 1,695 cases for training and 409 for testing. We did not utilize a development set and evaluated the final models on the testing set. All experiments were completed on an IBM Power8 server with 512 GB of memory, 64 cores (hyper-threaded to 128), four Nvidia K80 GPUs, Red Hat Enterprise Linux Server release 7.6 and the ppc64le architecture.

5.1 Hierarchical Attention Network

Extracting deep semantic and contextual understanding from text data is essential for every NLP task. Bahdanau et al. [4] first proposed attention mechanisms for machine translation, learning how to align the original text with the translated words. Rather than the traditional approach of attempting to distill an entire document into a single vector, Yang et al. [23] introduced the Hierarchical Attention Network (HAN) to encode smaller chunks of text that are then used to inform subsequent encodings. The HAN first learns the importance of each word, which informs its sentence embedding, and a separate attention mechanism learns the importance of each sentence to inform the final document representation.

In our HAN we used SpaCy sentence boundary detection and tokenization [12]. We utilized pretrained 200-dimension GloVe vectors with an LSTM containing a hidden dimension of 75 and an attention dimension of 50. We optimized with a stochastic gradient descent optimizer with a learning rate of 0.06, a batch size of 32, a momentum of 0.9 and dropout of 0.5. The learning rate was reduced by a factor of 0.95 when performance stopped improving. Each epoch took approximately six minutes to execute.

5.2 Few-shot Learning

While in machine learning 3-4 instances of a sample is considered too few, in law, 3-4 samples are considered plenty. Few-shot models often utilize methods that require only a few instances of a training sample in order to generalize. Given our sparse dataset and the success Hu et al. [14] demonstrated in criminal law, we implemented a modified version of their model for the prediction of reasonable notice. We followed the implementation details presented in Hu et al. [14], except that we adopted the sentence embeddings presented by Lin et al. [18] and generated r attention vectors for each attribute, as the original model yielded poor results.

The architecture presented by Hu et al. [14] creates a document representation by combining an attribute-aware embedding with one that is attribute-free. The attribute-aware embedding results from an average pooling of the sentence embeddings from four stacked self-attention mechanisms. Each mechanism predicts an attribute (e.g. a person's age or employment duration) by self-attending to multiple parts of the text simultaneously. Four labels were used to train the attribute-aware mechanisms to predict the length of employment, age of employee, character of employment, and availability of similar employment. These labels were hand-annotated by a team of Queen's Law students. The attribute-free embedding comprises a max-pooling of the hidden states generated by the encoder.

We utilized pretrained 300-dimension GloVe vectors that were fine-tuned, a hidden dimension of 300, and dropout of 0.5. An attention r of 30 was used. We used an Adam optimizer with a learning rate of 0.001, reducing the learning rate by a factor of 0.95 when a metric stopped improving. Alpha, the scaling on the attribute-aware loss, had a value of 0.3, and each epoch took approximately 8 minutes to execute.

5.3 BERT-esque Models and Domain Adaptation

Bidirectional Encoder Representations from Transformers (BERT) from Devlin et al. [9] has recently laid the foundation for pretrained models by leveraging language modelling, transfer learning, and fine-tuning on downstream tasks. As one of the main strengths of BERT is its generalized understanding of the English language, and pretrained BERT models have been publicly released⁶, we leveraged BERT for predicting reasonable notice. We further experimented with the Robustly Optimized BERT Pretraining Approach (RoBERTa), which held the top spot on the GLUE benchmark [19] during the course of our research. Architecturally, RoBERTa does not differ from the original BERT model; instead, Liu et al. [19] utilized an additional 160 GB of data and further tuned the hyperparameters. Through experimentation with learning rates and batch sizes, the authors determined that next sentence prediction offered only marginal to no performance improvement. Using only hyperparameter tuning and additional data, RoBERTa achieved an almost 7% improvement over the original BERT model on the GLUE benchmark.

⁶ https://github.com/google-research/bert

5.3.1 Domain Adaptations. Following common practice for improving pretrained model performance, we experimented with domain-adapting our BERT-esque models on full reasonable notice case texts as well as the Harvard case law dataset [21]. In our BERT implementations, we utilized the BERT_base model from HuggingFace⁷. We further experimented with domain-adapting a RoBERTa_base model using Facebook AI's implementation⁸. In addition, we further pretrained both models using only the masked language modelling (MLM) criterion, consistent with the results from Liu et al. [19]. Five epochs were used for all MLM pretraining. For our end results, the pretrained language models were further trained for text classification with ten epochs and fine-tuned on the 409 remaining cases. Classification training was performed with a batch size of 16, using the default losses and optimizers. The classification head of BERT took 20 minutes to train, while MLM pretraining took 2.5 hours. For BERT, we domain-adapted on the full case texts corresponding to each of the 1,695 reasonable notice cases in our training set.

In our RoBERTa implementation, we domain-adapted on approximately four million cases from Harvard's case law project. We determined cases before 1960 to be linguistically different from present-day legal documents, and these cases were thus removed. We note that the Harvard case law project only includes cases from the United States. To create an accurate comparison with our BERT implementation, we also domain-adapted RoBERTa_base on the same set of full case texts. The classification head of RoBERTa took 30 minutes to fine-tune, while MLM pretraining took 3 hours. Our batch size was 256 and our peak learning rate was 0.0001, in accordance with Liu et al. [19].

⁷ https://huggingface.co
⁸ https://github.com/pytorch/fairseq

6 RESULTS AND DISCUSSION

The output of our system was classified as correct if it was within +/-2 months of the ground truth label, to account for situational variability (i.e. Eq. 1). We refer to this as the output window.

    groundtruth − 2 ≤ prediction ≤ groundtruth + 2    (1)

Our RoBERTa_base model had the highest accuracy, at 69%. A summary of the results can be seen in Table 1.

    Approach                       Acc. (+/-2)
    HAN                            67%
    Few-shot w/ Self-attention     51%
    BERT+base                      61%
    BERT+full cases                49%
    RoBERTa+full cases             63%
    RoBERTa+Harvard                65%
    RoBERTa+base                   69%

Table 1: Summary of results for predicting the number of months awarded for reasonable notice using case summaries.

Interestingly, our best-performing model was RoBERTa_base out-of-the-box, and not a domain-adapted version. Our HAN outperformed the majority of our pretrained models and was only marginally worse than RoBERTa_base. The performance of the HAN may be attributable to its architectural structure: each sentence in our case summaries roughly contains one statement of fact, allowing the HAN to learn the importance of each sentence (fact) and weigh it accordingly prior to creating a document representation. This may replicate the thought process of the judiciary. Furthermore, we believe the mixed results from our few-shot model may be attributable to the size of our dataset. In Hu et al. [14], the smallest training dataset contained over 61,000 training samples, compared to our 1,695. That our few-shot model was limited by data appears to be supported by the performance of our BERT-esque models, the majority of which performed better. BERT-esque models are pretrained to have a generalized understanding of language out of the box, making it easier to fine-tune them on a specific classification task. Furthermore, the task of predicting reasonable notice requires knowledge beyond an understanding of natural language. In fact, it requires knowledge of judicial bias and dispute settlement, something that seasoned employment lawyers build through their interactions with judges and colleagues and that cannot be learned from case law. This may partially explain our mixed results, insofar as the majority of employment disputes are resolved through negotiation. Thus, considering that the case law constitutes only a small piece of the data, it may be argued that our model could have performed better had it been trained on settlement agreements.

In our experiments we utilized a classification approach over regression, as we believe this best replicates the decision-making process of a judge. In deciding the amount of reasonable notice a plaintiff should receive, one could presume a judge would locate similar past cases and adjust their ruling based on differences of fact. We utilize classification to anchor the number of months of reasonable notice, and broaden the output window in our metric to account for those differences of fact. In addition, our findings in Dahan et al. [8], in which we used regression along with a hand-labeled tabular dataset, indicated regression to be a poor predictor of reasonable notice.

A key finding of this research is that our domain adaptations did not yield the significant improvements reported in Rietzler et al. [21] and commonly noted as a promising avenue in other domains (e.g. SciBERT [5] and BioBERT [17]). Instead, domain adaptation appeared to negatively affect the performance of both our RoBERTa and BERT models. Despite the language of the Canadian reasonable notice cases being more similar to our case summaries than the Harvard case law dataset, RoBERTa_base domain-adapted on the Harvard dataset performed slightly better than the version trained on full Canadian cases. This finding is in line with one of the conclusions set out by Liu et al. [19], which emphasized the importance of dataset volume. Furthermore, as Liu et al. [19] performed extensive hyperparameter experimentation, we opted to utilize their recommended learning rates and batch sizes.

It appears current deep learning solutions may not be able to accurately grasp the many unique characteristics of each legal dispute. Unfortunately, our performance suggests that in spite of recent advances, deep learning may not be able to accurately predict reasonable notice from free text. That said, our mixed results may also be explained by the fact that judges are inherently unpredictable when it comes to the application of the Bardal test, and thus predicting notice is an almost impossible task not only for deep learning models but also for experienced lawyers.
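The evaluation just described, 25 outcome classes with a +/-2 month output window, can be sketched in a few lines. This is an illustrative sketch only, not the authors' released code; in particular, the exact mapping of month values to class indices is our assumption, since the paper states only that each outcome is its own class and that awards of 25 months or more are grouped together.

```python
def month_to_class(months: int) -> int:
    """Map a notice award in months to one of 25 class indices.

    Assumed mapping: 1 month -> class 0, ..., 24 months -> class 23,
    and any award of 25 months or greater -> class 24 (grouped).
    """
    return min(months, 25) - 1


def windowed_accuracy(predictions, ground_truths, window=2):
    """Fraction of predictions satisfying Eq. 1:
    groundtruth - window <= prediction <= groundtruth + window.
    """
    hits = sum(abs(p - g) <= window for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)


# Toy example with hypothetical awards (months): four of the five
# predictions fall within two classes of the ground truth.
preds = [month_to_class(m) for m in (10, 3, 30, 24, 7)]
truth = [month_to_class(m) for m in (12, 6, 27, 23, 7)]
print(windowed_accuracy(preds, truth))  # -> 0.8
```

Widening the window trades precision for tolerance of the "differences of fact" discussed above; a window of 0 would reduce the metric to exact-match accuracy.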
7 CONCLUSION AND FUTURE WORK

In this paper we applied multiple deep learning solutions to human-written case summaries to classify the reasonable notice period a plaintiff should be awarded. Our best performing model was RoBERTa_base, which achieved 69% accuracy with a +/-2 month window. Domain adaptation negatively affected performance in the case of RoBERTa, and marginally improved performance in the case of BERT. Given the significant successes of deep learning in various domains, the relatively poor performance in predicting notice periods may come as a surprise. In fact, the results from this paper appear to be consistent with our previous findings in Dahan et al. [8], suggesting that the inconsistencies in the case law make it difficult to predict reasonable notice accurately. Thus, it can be argued that our results have very little to do with the quality of the models, and that it may not be possible to reduce our prediction error, mainly because of the inherently inconsistent nature of the dataset.

At the time of our research, RoBERTa was the top-performing model on the GLUE benchmark [22], where it ranks 10th as of this writing. Utilizing the current top-performing model from the benchmark may yield better results. Given that our domain adaptation used only 1,695 full Canadian cases, and given the superior performance of domain adaptation on the adjacent domain of American law, collecting a larger dataset of Canadian cases may be an interesting avenue to explore in further experimentation. While deep learning and BERT-esque models have proven successful in numerous domains, it appears a gap still exists for deep learning applications in the legal field. That said, other areas — such as art, which has been considered too artistic to be impacted by AI — have shown some recent successes. In particular, deep learning models have successfully been trained to paint like humans [15], which leads us to be optimistic about future applications in the legal field. We believe that further advances in the field of NLP and deep learning will make it possible to perform well on this task in the future.

ACKNOWLEDGMENTS

We would like to thank the team at the Conflict Analytics Lab, the Scotiabank Centre for Customer Analytics and the Center for Law in the Contemporary Workplace at Queen's University (Jonathan Touboul, Maxime Cohen, Kevin Banks, Simon Townsend, William Quiglietta, Zach Berg, Mackenzie Anderson, Brandon Loehle, Max Saunders, Sean Gulrajani, Shane Liquornik, Yuri Levin, Mikhail Nediak, Stephen Thomas), as well as our colleagues Joshua Karton and Bill Flanagan for supporting the project and contributing to the development of the idea.

REFERENCES
[1] Bardal v. Globe & Mail Ltd. (1960), 24 D.L.R. (2d).
[2] Fraser v. Canerector Inc., 2015 CarswellOnt 4796, 2015 ONSC 2138 at para 32.
[3] Wallace v. United Grain Growers Ltd., 1997 CanLII 332 (SCC), [1997] 3 S.C.R. 701.
[4] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
[5] Iz Beltagy, Arman Cohan, and Kyle Lo. 2019. SciBERT: Pretrained contextualized embeddings for scientific text. arXiv preprint arXiv:1903.10676.
[6] Ilias Chalkidis and Dimitrios Kampas. 2019. Deep learning in law: early adaptation and legal word embeddings trained on large corpora. Artificial Intelligence and Law 27, 2 (2019), 171–198.
[7] Bruce Curran and Sara Slinn. 2016. Just Notice Reform: Enhanced Statutory Termination Provisions for the 99%. Osgoode Legal Studies Research Paper 61 (2016).
[8] Samuel Dahan, Jonathan Touboul, Jason Lam, and Dan Sfedj. 2020. Predicting Employment Notice with Machine Learning: Promises and Limitations.
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[10] David J Doorey. 2020. The Law of Work: Common Law and the Regulation of Work. Regulation 285 (2020).
[11] James Holland and Julian Webb. 2013. Learning Legal Rules: A Students' Guide to Legal Method and Reasoning. Oxford University Press.
[12] Matthew Honnibal and Mark Johnson. 2015. An Improved Non-monotonic Transition System for Dependency Parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Lisbon, Portugal, 1373–1378. https://aclweb.org/anthology/D/D15/D15-1162
[13] Jerrold Soh Tsin Howe, Lim How Khang, and Ian Ernst Chai. 2019. Legal Area Classification: A Comparative Study of Text Classifiers on Singapore Supreme Court Judgments. arXiv preprint arXiv:1904.06470.
[14] Zikun Hu, Xiang Li, Cunchao Tu, Zhiyuan Liu, and Maosong Sun. 2018. Few-shot charge prediction with discriminative legal attributes. In Proceedings of the 27th International Conference on Computational Linguistics. 487–498.
[15] Zhewei Huang, Wen Heng, and Shuchang Zhou. 2019. Learning to paint with model-based deep reinforcement learning. In Proceedings of the IEEE International Conference on Computer Vision. 8709–8718.
[16] Grant Lamond. 2006. Precedent and analogy in legal reasoning. (2006).
[17] Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 4 (2020), 1234–1240.
[18] Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130.
[19] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
[20] Bingfeng Luo, Yansong Feng, Jianbo Xu, Xiang Zhang, and Dongyan Zhao. 2017. Learning to Predict Charges for Criminal Cases with Legal Basis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2727–2736.
[21] Alexander Rietzler, Sebastian Stabinger, Paul Opitz, and Stefan Engl. 2019. Adapt or get left behind: Domain adaptation through BERT language model finetuning for aspect-target sentiment classification. arXiv preprint arXiv:1908.11860.
[22] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
[23] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 1480–1489.