The Gap between Deep Learning and Law: Predicting Employment Notice

Jason T. Lam (jason.lam@queensu.com), Queen's University, Kingston, Ontario
David Liang (david.liang@queensu.com), Faculty of Law, Queen's University, Kingston, Ontario
Samuel Dahan (samuel.dahan@queensu.ca), Faculty of Law, Queen's University, Kingston, Ontario; Cornell Law School, Cornell University, Ithaca, New York
Farhana Zulkernine (farhana@cs.queensu.com), School of Computing, Queen's University, Kingston, Ontario

ABSTRACT

This study aims to determine whether Natural Language Processing with deep learning models can shed new light on the Canadian calculation system for employment notice. In particular, we investigate whether deep learning can enhance the predictability of the notice period, that is, whether it is possible to predict the notice period with high accuracy. A major challenge with the classification of reasonable notice is the inconsistency of the case law. As argued by the Ontario Court of Appeal, the process of determining reasonable notice is "more art than science". In a previous study, we assessed the predictability of reasonable notice periods by applying statistical machine learning to a hand-annotated dataset of 850 cases. Building on that study, this paper applies state-of-the-art deep learning models to free-text summaries of cases. We further experiment with a variety of domain adaptations of state-of-the-art pretrained BERT-esque models. Our results appear to show that the domain adaptations of BERT-esque models negatively affected performance. Our best performing model was an out-of-the-box RoBERTa base model, which achieved 69% accuracy using a +/-2 month prediction window.

CCS CONCEPTS

• Applied computing → Law; • Computing methodologies → Artificial intelligence; Natural language processing.

KEYWORDS

Deep Learning, Employment Law, Reasonable Notice, Employment Termination, Legal Analytics, Predictive Analytics, Consistency

ACM Reference Format:
Jason T. Lam, David Liang, Samuel Dahan, and Farhana Zulkernine. 2020. The Gap between Deep Learning and Law: Predicting Employment Notice. In Proceedings of the 2020 Natural Legal Language Processing (NLLP) Workshop, 24 August 2020, San Diego, US. ACM, New York, NY, USA, 5 pages.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). NLLP @ KDD 2020, August 24th, San Diego, US. © 2020 Copyright held by the owner/author(s).

1 INTRODUCTION

The system of law that governs work in Canada (outside Quebec) consists of three overlapping regimes: the common law regime, the regulatory regime and the collective bargaining regime (also called labour law or the law of unionized workers). This paper focuses on the common law regime and in particular the employment law principles that apply to notice of termination, one of the most litigated issues in Canada. One peculiarity of the Canadian system is that while each province has specific regulatory standards, the common law of employment, in principle, applies in the western provinces of Canada [10]. Common law is usually defined as a system of judge-made rules that uses a precedent-based approach to case law. Earlier decisions pertaining to similar facts or legal issues guide later decisions in an attempt to create legal predictability. In an effort to ensure legal certainty and predictability, judges must follow the reasoning in earlier cases that address the same legal issues and similar facts. That said, common law rules and their interpretations evolve with societal values, and therefore the interpretation and application of common law principles can sometimes be inconsistent and unpredictable, including when it comes to termination notice [16].

Upon termination, if the employment relationship is governed by an indefinite contract and there is no termination provision limiting the employee's rights, the employer has the obligation to provide either notice or pay in lieu of notice¹. Should the employer fail to comply with this obligation, courts attempt to determine what compensation the employee would have received during that period if adequate notice had been provided, as well as damages for that loss, less any mitigation income. Courts typically begin their analysis of what constitutes "reasonable notice" by looking at the so-called "Bardal factors", described in the landmark case Bardal v. Globe & Mail Ltd: 1) age of the employee, 2) length of service, 3) character of employment and 4) availability of similar employment² [1].

While the Bardal test has been designed as an objective calculation system, there is no clear indication of how much weight should be given to each factor, nor of how the factors should be utilized [1]. Accordingly, the case law on employment notice has been noted to be inherently inconsistent and subjective, and there does not seem to be a "right" figure for reasonable notice. As Justice Dunphy wrote of the calculation of reasonable notice periods, "[it] is more art than science but must be one that is fair in all of the circumstances" [2]. There have also been additional layers of complication as judges have considered factors beyond those explicitly mentioned in Bardal, such as inducement, in which an employer's act in bad faith results in aggravated damages (Wallace damages) [3].

In this paper we investigate whether deep learning models can enhance the predictability of termination notice. Using advances in pretrained models such as BERT [9], we investigate the effectiveness of domain adaptations and benchmark our results against deep learning models that have shown success across multiple domains, including law. We build on previous work in Dahan et al. [8], in which we analyzed the prediction of reasonable notice using statistical machine learning on a hand-annotated tabular dataset and demonstrated a lack of consistency amongst the judgements.

In the balance of this paper we present related deep learning research for legal text analytics and a description of the problem, followed by a discussion of the data, models and methods utilized in this research. We end with our results and discussion, followed by a conclusion and recommendations for future work.

¹ There is no obligation to provide reasonable notice when there is a lawful termination provision (see Machtinger v. HOJ Industries, SCC) or where there is a fixed-term contract [7] [10].
² Most employment contracts are of indefinite length, and the law implies a term that employers must provide employees with reasonable notice that the relationship is ending. See, e.g., Machtinger v. HOJ Industries Ltd, [1992] 1 SCR 986, 7 OR (3d) 480.

2 RELATED WORK

In the literature, the application of NLP to legal analytics is still new, and there exist very few implementations predicting court decisions or classifying legal data [8]. Soh et al. [13] explored multiple statistical machine learning approaches, out-of-the-box deep learning models such as BERT, and a shallow convolutional neural network to classify 6,277 Singapore Supreme Court judgements into 31 different legal areas. Their results showed that the statistical model performed best, with a micro-F1 of 63.2 and a macro-F1 of 73.3 [13]. Our problem differs from Soh et al. [13] in that they classified the area of law using the entirety of a case, while we wish to predict the outcome.

While in machine learning 3-4 instances of a sample is considered too few, in law, 3-4 samples of precedent are considered plenty. Few-shot predictors attempt to generalize classes with few training samples. Luo et al. [20] proposed a model for predicting criminal charges by leveraging related law articles. They used a hierarchical attention mechanism to create a document representation, and a different stack of attention components was trained to select the best supporting legislative statutes for a given case. Their model was trained on 50,000 case documents extracted from China Judgments Online³, and only predicted criminal charges that had at least 80 cases. For the sake of simplicity, the authors only considered cases in which there was a single defendant. Luo et al. [20] reported an F1 micro/macro of 90.21/80.48 and compared the performance of their model to other baselines they had built. Hu et al. [14] performed three experiments that trained multiple baselines, including the model presented by Luo et al. [20], on three datasets comprising 61,589/153,521/306,900 factual case summaries generated from China Judgments Online. They reported better results than Luo et al. [20], with macro-F1 scores of 64.0/67.1/73.1 on their small/medium/large datasets, respectively. We note that Hu et al. [14] truncate documents to a maximum length of 500, and that China Judgments Online appears to be a database of fact descriptions, not full cases.

Chalkidis et al. [6] used a dataset of European Court of Human Rights (ECHR) cases totalling 11,500. They reported the results of a variety of deep learning architectures on three tasks: binary classification, multi-class classification and case importance prediction. They further introduced a Hierarchical-BERT, which first produced fact embeddings that were then used with a self-attention mechanism to formulate the document embedding. Their Hierarchical-BERT was their best performing model in both the binary and multi-label classification tasks, with F1 scores of 82 and 60.8, respectively. In their case importance task, their majority-class classifier achieved the lowest mean squared error of 0.369, with Hierarchical-BERT achieving the best Spearman's ρ of .527. Although we implement similar models to Chalkidis et al. [6], we note that our task of predicting an outcome differs from the classification of an entire document.

We note that as of this writing, there does not appear to be existing literature on using deep learning for the prediction of reasonable notice from free text.

³ http://wenshu.court.gov.cn/

3 PROBLEM DESCRIPTION

The aim of this project is to predict how judges determine "reasonable notice" in employment termination cases. As argued earlier, if the employment relationship is governed by an indefinite contract and the employer wishes to terminate the employment relationship for any reason, the employer has the obligation to provide notice — usually denoted in months — or pay in lieu of notice, calculated according to the Bardal factors⁴.

While the primary goal of this research is to assess the predictive power of deep learning models when it comes to notice of termination, it is worth noting that this research is drawn from a larger project aimed at developing an open-source system for small-claims disputes, including employment disputes: MyOpenCourt. This system aims to promote access to small-claims justice and provide legal help to self-represented litigants by democratizing legal analytics technology. Like many AI legal tools, MyOpenCourt provides legal information by requiring users to fill out a multiple-choice questionnaire and outputting a single numerical prediction along with a list of relevant precedents.

Instead, in this research we propose a system that outputs a prediction of reasonable notice based on a free-text summary of the case law. We used deep learning in conjunction with Natural Language Processing (NLP) to calculate the period of reasonable notice from manually typed summaries of adjudicated cases. The summaries are unstructured text data written in plain English (i.e. not "legalese"), collected from WestLaw's Quantum service⁵. These summaries contain sufficient information for a trained legal professional to approximately determine the notice award without looking at the outcome of the case. We decided to use summaries instead of entire cases because of the "fussy" nature of common law text [11]. The lack of consistent styling in Canadian cases made it difficult to automatically parse and extract the legal facts. As our goal is to predict the outcome of a case, we require the inputs to contain no mention of the legal analysis or the outcome. The full legal cases are long, often more than ten pages and averaging 5,000 words, and introduce ambiguity through an abundance of information that must be carefully filtered to extract what is useful. Furthermore, legal cases are decided by a multitude of judges, which leads to many idiosyncrasies in writing style and case structure.

⁴ "Payment in lieu of notice" is immediate compensation at an amount equal to what an employee would have earned as salary or wages by working through the whole notice period.
⁵ https://www.westlawnextcanada.com/quantums/

4 DATA

The WestLaw Quantum service⁵ provides a brief synopsis of each case, often including the judgement on the reasonable notice period. Since the goal of this research is to predict the reasonable notice period, any mention of the judgement regarding the notice period was manually removed, leaving only factual descriptions of the plaintiff and the characteristics of the case. We prepended the input with the year of the judgement and the occupation category, age, salary, job title and duration of employment of the plaintiff, extracted from the summaries. If the information could not be found, nothing was prepended to the summary. The summaries appeared to be written in plain English by human writers. Each case outcome is considered to be its own class, with outcomes of 25 months or greater grouped together. Our classification task had a total of 25 classes.

5 MODELS AND METHODS

Our dataset totaled 1,695 cases for training and 409 for testing. We did not utilize a development set and evaluated the final models on the testing set. All experiments were completed on an IBM Power8 server with 512 GB of memory, 64 cores (hyper-threaded to 128), four Nvidia K80 GPUs, Red Hat Enterprise Linux Server release 7.6 and the ppc64le architecture.

5.1 Hierarchical Attention Network

Extracting deep semantic and contextual understanding from text data is essential for every NLP task. Bahdanau et al. [4] first proposed attention mechanisms for machine translation, learning how to align the original text with the translated words. Rather than the traditional approach of attempting to distill an entire document into a single vector, Yang et al. [23] introduced the Hierarchical Attention Network (HAN) to encode smaller chunks of text that are then used to inform subsequent encodings. The HAN first learns the importance of each word, which informs its sentence embedding, and a separate attention mechanism learns the importance of each sentence to inform the final document representation.

In our HAN we used SpaCy sentence boundary detection and tokenization [12]. We utilized pretrained 200-dimension GloVe vectors with an LSTM containing a hidden dimension of 75 and an attention dimension of 50. We optimized with a stochastic gradient descent optimizer with a learning rate of 0.06, a batch size of 32, a momentum of 0.9 and dropout of 0.5. The learning rate was reduced by a factor of 0.95 when performance stopped improving. Each epoch took approximately six minutes to execute.

5.2 Few-shot Learning

While in machine learning 3-4 instances of a sample is considered too few, in law, 3-4 samples are considered plenty. Few-shot models often utilize methods that require only a few instances of a training sample in order to generalize. Given our sparse dataset and the success Hu et al. [14] demonstrated in criminal law, we implemented a modified version of their model for the prediction of reasonable notice. We followed the implementation details presented in Hu et al. [14], except that we adopted the sentence embeddings presented by Lin et al. [18] and generated r attention vectors for each attribute, as the original model yielded poor results.

The architecture presented by Hu et al. [14] creates a document representation by combining an attribute-aware embedding with one that is attribute-free. The attribute-aware embedding results from an average pooling of the sentence embeddings from four stacked self-attention mechanisms. Each mechanism predicts an attribute (e.g. a person's age or employment duration) by self-attending to multiple parts of the text simultaneously. Four labels were used to train the attribute-aware mechanisms to predict the length of employment, age of employee, character of employment, and availability of similar employment. These labels were hand-annotated by a team of Queen's Law students. The attribute-free embedding comprises a max-pooling of the hidden states generated by the encoder.

We utilized pretrained 300-dimension GloVe vectors that were fine-tuned, a hidden dimension of 300, and dropout of 0.5. An attention r of 30 was used. We used an Adam optimizer with a learning rate of 0.001, reducing the learning rate by a factor of 0.95 when a metric stopped improving. Alpha, the scaling on the attribute-aware loss, had a value of 0.3, and each epoch took approximately 8 minutes to execute.

5.3 BERT-esque Models and Domain Adaptation

Bidirectional Encoder Representations from Transformers (BERT) from Devlin et al. [9] has recently laid the foundation for pretrained models by leveraging language modelling, transfer learning, and fine-tuning on downstream tasks. As one of the main strengths of BERT is its generalized understanding of the English language, and pretrained BERT models have been publicly released⁶, we leveraged BERT for predicting reasonable notice. We further experimented with the Robustly Optimized BERT Pretraining Approach (RoBERTa), which held the top spot on the GLUE benchmark [19] during the course of our research. Architecturally, RoBERTa does not differ from the original BERT model; instead, Liu et al. [19] utilized an additional 160 GB of data and further tuned the hyperparameters. Through experimentation with learning rates and batch sizes, the authors determined that next sentence prediction offered only marginal to no performance improvement. Using only hyperparameter tuning and additional data, RoBERTa achieved an almost 7% improvement over the original BERT model on the GLUE benchmark.

⁶ https://github.com/google-research/bert

5.3.1 Domain Adaptations. Following common practice for improving pretrained model performance, we experimented with domain-adapting our BERT-esque models on full reasonable notice case texts as well as the Harvard case law dataset [21]. In our BERT implementations, we utilized the BERT_base model from HuggingFace⁷. We further experimented with domain-adapting a RoBERTa_base model using Facebook AI's implementation⁸. In addition, we further pretrained both models using only the masked language modelling (MLM) criterion, consistent with the results from Liu et al. [19]. Five epochs were used for all MLM pretraining. For our end results, the pretrained language models were further trained for text classification with ten epochs and fine-tuned on the 409 remaining cases. Classification training was performed with a batch size of 16, using the default losses and optimizers. The classification head of BERT took 20 minutes to train, while MLM pretraining took 2.5 hours. For BERT, we domain-adapted on the full case texts corresponding to each of the 1,695 reasonable notice cases in our training set.

In our RoBERTa implementation, we domain-adapted on approximately four million cases from Harvard's case law project. We determined cases before 1960 to be linguistically different from present-day legal documents, and these cases were thus removed. We note that the Harvard case law project only includes cases from the United States. To create an accurate comparison with our BERT implementation, we also domain-adapted RoBERTa_base on the same set of full case texts. The classification head of RoBERTa took 30 minutes to fine-tune, while MLM pretraining took 3 hours. Our batch size was 256 and our peak learning rate was 0.0001, in accordance with Liu et al. [19].

⁷ https://huggingface.co
⁸ https://github.com/pytorch/fairseq

6 RESULTS AND DISCUSSION

The output of our system was classified as correct if it was within +/-2 months of the ground truth label, to account for situational variability (i.e. Eq. 1). We refer to this as the output window.

    groundtruth − 2 ≤ prediction ≤ groundtruth + 2    (1)

Our RoBERTa_base model had the highest accuracy, at 69%. A summary of the results can be seen in Table 1.

    Approach                       Acc. (+/-2)
    HAN                            67%
    Few-shot w/ Self-attention     51%
    BERT+base                      61%
    BERT+full cases                49%
    RoBERTa+full cases             63%
    RoBERTa+Harvard                65%
    RoBERTa+base                   69%

Table 1: Summary of results for predicting the number of months awarded for reasonable notice using case summaries.

Interestingly, our best-performing model was RoBERTa_base out-of-the-box, and not a domain-adapted version. Our HAN outperformed the majority of our pretrained models and was only marginally worse than RoBERTa_base. The performance of the HAN may be attributable to its architectural structure: each sentence in our case summaries roughly contains one statement of fact, allowing the HAN to learn the importance of each sentence (fact) and weigh it accordingly prior to creating a document representation. This may replicate the thought process of the judiciary. Furthermore, we believe the mixed results from our few-shot model may be attributable to the size of our dataset. In Hu et al. [14], the smallest training dataset contained over 61,000 training samples, compared to our 1,695. That our few-shot model was limited by data appears to be supported by the performance of our BERT-esque models, the majority of which performed better. BERT-esque models are pretrained to have a generalized understanding of language out of the box, making it easier to fine-tune them on a specific classification task. Furthermore, the task of predicting reasonable notice requires knowledge beyond an understanding of natural language. In fact, it requires knowledge of judicial bias and dispute settlement, something that seasoned employment lawyers build through their interactions with judges and colleagues and that cannot be learned from case law. This may partially explain our mixed results, insofar as the majority of employment disputes are resolved through negotiation. Thus, considering that the case law constitutes only a small piece of the data, it may be argued that our model could have performed better had it been trained on settlement agreements.

In our experiments we utilized a classification approach over regression, as we believe this best replicates the decision-making process of a judge. In deciding the amount of reasonable notice a plaintiff should receive, one could presume a judge would locate similar past cases and adjust their ruling based on differences of fact. We utilize classification to anchor the number of months of reasonable notice, and broaden the output window in our metric to account for those differences of fact. In addition, our findings in Dahan et al. [8], in which we used regression along with a hand-labeled tabular dataset, indicated regression to be a poor predictor of reasonable notice.

A key finding of this research is that our domain adaptations did not yield the significant improvements reported in Rietzler et al. [21] and commonly noted as a promising avenue in other domains (e.g. SciBERT [5] and BioBERT [17]). Instead, domain adaptation appeared to negatively affect the performance of both our RoBERTa and BERT models. Despite the language of the Canadian reasonable notice cases being more similar to our case summaries than the Harvard case law dataset, RoBERTa_base domain-adapted on the Harvard dataset performed slightly better than the version trained on full Canadian cases. This finding is in line with one of the conclusions set out by Liu et al. [19], which emphasized the importance of dataset volume. Furthermore, as Liu et al. [19] performed extensive hyperparameter experimentation, we opted to utilize their recommended learning rates and batch sizes.

It appears current deep learning solutions may not be able to accurately grasp the many unique characteristics of each legal dispute. Unfortunately, our performance suggests that in spite of recent advances, deep learning may not be able to accurately predict reasonable notice from free text. That said, our mixed results may also be explained by the fact that judges are inherently unpredictable when it comes to the application of the Bardal test, and thus predicting notice is an almost impossible task not only for deep learning models but also for experienced lawyers.
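The evaluation just described, 25 outcome classes with a +/-2 month output window, can be sketched in a few lines. This is an illustrative sketch only, not the authors' released code; in particular, the exact mapping of month values to class indices is our assumption, since the paper states only that each outcome is its own class and that awards of 25 months or more are grouped together.

```python
def month_to_class(months: int) -> int:
    """Map a notice award in months to one of 25 class indices.

    Assumed mapping: 1 month -> class 0, ..., 24 months -> class 23,
    and any award of 25 months or greater -> class 24 (grouped).
    """
    return min(months, 25) - 1


def windowed_accuracy(predictions, ground_truths, window=2):
    """Fraction of predictions satisfying Eq. 1:
    groundtruth - window <= prediction <= groundtruth + window.
    """
    hits = sum(abs(p - g) <= window for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)


# Toy example with hypothetical awards (months): four of the five
# predictions fall within two classes of the ground truth.
preds = [month_to_class(m) for m in (10, 3, 30, 24, 7)]
truth = [month_to_class(m) for m in (12, 6, 27, 23, 7)]
print(windowed_accuracy(preds, truth))  # -> 0.8
```

Widening the window trades precision for tolerance of the "differences of fact" discussed above; a window of 0 would reduce the metric to exact-match accuracy.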
7 CONCLUSION AND FUTURE WORK

In this paper we applied multiple deep learning solutions to human-written case summaries to classify the reasonable notice period a plaintiff should be awarded. Our best performing model was RoBERTa_base, which achieved 69% accuracy with a +/-2 month window. Domain adaptation negatively affected performance in the case of RoBERTa, and marginally improved performance in the case of BERT. Given the significant successes of deep learning in various domains, the relatively poor performance in predicting notice periods may come as a surprise. In fact, the results from this paper appear to be consistent with our previous findings in Dahan et al. [8], suggesting that the inconsistencies in the case law make it difficult to predict reasonable notice accurately. Thus, it can be argued that our results have very little to do with the quality of the models, and that it may not be possible to reduce our prediction error, mainly because of the inherently inconsistent nature of the dataset.

At the time of our research, RoBERTa was the top-performing model on the GLUE benchmark [22], where it ranks 10th as of this writing. Utilizing the current top-performing model from the benchmark may yield better results. Given that our domain adaptation used only 1,695 full Canadian cases, and given the superior performance of domain adaptation on the adjacent domain of American law, collecting a larger dataset of Canadian cases may be an interesting avenue to explore in further experimentation. While deep learning and BERT-esque models have proven successful in numerous domains, it appears a gap still exists for deep learning applications in the legal field. That said, other areas — such as art, which has been considered too artistic to be impacted by AI — have shown some recent successes. In particular, deep learning models have successfully been trained to paint like humans [15], which leads us to be optimistic about future applications in the legal field. We believe that further advances in the field of NLP and deep learning will make it possible to perform well on this task in the future.

ACKNOWLEDGMENTS

We would like to thank the team at the Conflict Analytics Lab, the Scotiabank Centre for Customer Analytics and the Center for Law in the Contemporary Workplace at Queen's University (Jonathan Touboul, Maxime Cohen, Kevin Banks, Simon Townsend, William Quiglietta, Zach Berg, Mackenzie Anderson, Brandon Loehle, Max Saunders, Sean Gulrajani, Shane Liquornik, Yuri Levin, Mikhail Nediak, Stephen Thomas), as well as our colleagues Joshua Karton and Bill Flanagan for supporting the project and contributing to the development of the idea.

REFERENCES
[1] Bardal v. Globe & Mail Ltd. (1960), 24 D.L.R. (2d).
[2] Fraser v. Canerector Inc., 2015 CarswellOnt 4796, 2015 ONSC 2138 at para 32.
[3] Wallace v. United Grain Growers Ltd., 1997 CanLII 332 (SCC), [1997] 3 S.C.R. 701.
[4] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
[5] Iz Beltagy, Arman Cohan, and Kyle Lo. 2019. SciBERT: Pretrained contextualized embeddings for scientific text. arXiv preprint arXiv:1903.10676.
[6] Ilias Chalkidis and Dimitrios Kampas. 2019. Deep learning in law: early adaptation and legal word embeddings trained on large corpora. Artificial Intelligence and Law 27, 2 (2019), 171–198.
[7] Bruce Curran and Sara Slinn. 2016. Just Notice Reform: Enhanced Statutory Termination Provisions for the 99%. Osgoode Legal Studies Research Paper 61 (2016).
[8] Samuel Dahan, Jonathan Touboul, Jason Lam, and Dan Sfedj. 2020. Predicting Employment Notice with Machine Learning: Promises and Limitations.
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[10] David J Doorey. 2020. The Law of Work: Common Law and the Regulation of Work. Regulation 285 (2020).
[11] James Holland and Julian Webb. 2013. Learning Legal Rules: A Students' Guide to Legal Method and Reasoning. Oxford University Press.
[12] Matthew Honnibal and Mark Johnson. 2015. An Improved Non-monotonic Transition System for Dependency Parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Lisbon, Portugal, 1373–1378. https://aclweb.org/anthology/D/D15/D15-1162
[13] Jerrold Soh Tsin Howe, Lim How Khang, and Ian Ernst Chai. 2019. Legal Area Classification: A Comparative Study of Text Classifiers on Singapore Supreme Court Judgments. arXiv preprint arXiv:1904.06470.
[14] Zikun Hu, Xiang Li, Cunchao Tu, Zhiyuan Liu, and Maosong Sun. 2018. Few-shot charge prediction with discriminative legal attributes. In Proceedings of the 27th International Conference on Computational Linguistics. 487–498.
[15] Zhewei Huang, Wen Heng, and Shuchang Zhou. 2019. Learning to paint with model-based deep reinforcement learning. In Proceedings of the IEEE International Conference on Computer Vision. 8709–8718.
[16] Grant Lamond. 2006. Precedent and analogy in legal reasoning. (2006).
[17] Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 4 (2020), 1234–1240.
[18] Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130.
[19] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
[20] Bingfeng Luo, Yansong Feng, Jianbo Xu, Xiang Zhang, and Dongyan Zhao. 2017. Learning to Predict Charges for Criminal Cases with Legal Basis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2727–2736.
[21] Alexander Rietzler, Sebastian Stabinger, Paul Opitz, and Stefan Engl. 2019. Adapt or get left behind: Domain adaptation through BERT language model finetuning for aspect-target sentiment classification. arXiv preprint arXiv:1908.11860.
[22] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
[23] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 1480–1489.