Pre-trained Contextual Embeddings for Litigation Code Classification

Max Bartolo (UCL) m.bartolo@cs.ucl.ac.uk
Kamil Tylinski (Mishcon de Reya LLP) kamil.tylinski@mishcon.com
Alastair Moore (Mishcon de Reya LLP) alastair.moore@mishcon.com

In: Proceedings of the First International Workshop on AI and Intelligent Assistance for Legal Professionals in the Digital Workplace (LegalAIIA 2019), held in conjunction with ICAIL 2019. June 17, 2019. Montréal, QC, Canada. Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Published at http://ceur-ws.org.

Abstract

Models for a variety of natural language processing tasks, such as question answering or text classification, are potentially important components for a wide range of legal machine learning systems. These tasks may include examining whole legal corpora, but may also include a broad range of tasks that can support automation in the digital workplace. Importantly, recent advances in pre-trained contextual embeddings have substantially improved the performance of text classification across a wide range of tasks. In this paper, we investigate the application of these recent approaches to a legal time-recording task. We demonstrate improved performance on a 40-class J-code classification task over a variety of baseline techniques. The best performing single model achieves performance gains of 2.23 micro-averaged accuracy points and 9.39 macro-averaged accuracy points over the next best classifier on the test set. This result suggests these techniques will find broad utility in the development of legal language models for a range of automation tasks.

Narrative text                                     J-code
Working on response from and statutory review.     JE20
Preparing documents for meeting with               JH30
Attendance on client, email exchange               JJ70

Table 1: Example narrative text for the classification task. Given a sentence of text describing the actions completed by the lawyer, assign a label based on a discrete J-code label set. J-codes are time-recording codes introduced to comply with requirements under the UK Civil Procedure Rules. The redacted spans (highlighted in the original) are discussed in Section 3.3.

1 Introduction

Legal data comes in a variety of different forms, from contracts and legal documents containing technical language, to the variety of correspondence between client and solicitor (from email to transcripts), to billing and enterprise performance management (EPM) systems used to support the business of law.

Developing systems that can support the automation of a variety of tasks across the digital workplace involves working with heterogeneous data, with different quantities of labelled data (for the purposes of supervised learning) of variable quality. For this reason, practitioners are increasingly turning to more indirect ways of injecting weak supervision signals into their models (Ratner et al., 2017). Recent work on multitask learning (Ratner et al., 2019) has developed an approach to deep learning architectures that learns massive multitask models with different heads adapted for different tasks.

A traditional approach to text classification tasks is to create a linear classifier (logistic regression or Support Vector Machine) on sentences presented as a bag of words. The main disadvantage of this method is its inability to share parameters among classes and features (Joulin et al., 2017). Alternatively, the problem can be approached by means of neural networks (Zhang et al., 2015), where transformer architectures have proven to be more appropriate for a wide variety of tasks, not only text classification (Vaswani et al., 2017; Dai et al., 2019).
Importantly, incorporating pre-trained contextual embeddings (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2018) has led to impressive performance gains across many natural language processing tasks such as question answering, natural language inference, sequence labelling and text classification. Models with access to pre-trained language knowledge currently provide state-of-the-art results on the GLUE benchmark[1] tasks and also outperform human baselines in some cases. The GLUE benchmark consists of nine natural language understanding tasks (e.g., natural language inference, sentence similarity, etc.). Each comes with its own unique set of examples and labels, ranging in size from 635 training examples (WNLI) to 393k (MNLI) (Wang et al., 2018).

[1] https://gluebenchmark.com/leaderboard

However, legal text (whether it contains technical language or simple correspondence) tends to differ from the text corpora on which these state-of-the-art language models are trained, such as Wikipedia and BookCorpus. In this paper, towards the goal of developing large multitask models for different legal applications, we first demonstrate the successful use of pre-trained language models transferred to a legal domain task.

We focus on the task of litigation code classification, illustrated in Table 1, which is an important sub-task in legal time-recording and in preparing bills of costs for assessment by the courts. We base our approach on fine-tuning BERT (Bidirectional Encoder Representations from Transformers), a transformer-based language representation model (Devlin et al., 2018), and our evaluation shows that a single pre-trained model achieves significant performance gains over the next best classifier on the test set.

2 Related Work

Text classification is a category of Natural Language Processing (NLP) tasks with real-world applications such as spam detection, fraud identification (Ngai et al., 2011), and legal discovery (Roitblat et al., 2010). Formally, the task is to assign a Boolean value to each pair $\langle d_j, c_i \rangle \in D \times C$ (Sebastiani, 2002), where $D$ in our example is a domain of narrative documents and $C = \{c_1, \ldots, c_{|C|}\}$ is a set of J-codes, such that we obtain a decision value for each narrative document $d_j$ being classed as $c_i$.

Classification tasks require large quantities of training data, but in many domain-specific applications the construction of a large training set is very costly and requires the use of experts to label data. The use of pre-trained embeddings allows models to obtain linguistic knowledge from very large auxiliary corpora, often reducing the amount of task-specific training data required for good performance.

Recent approaches to natural language processing have revolved around neural methods for inferring probability distributions over sequences of words, referred to as language modelling (LM), using deep learning architectures. Recurrent Neural Network (RNN) based language models, owing largely to their capacity for learning sequential context, have been extensively researched (Mikolov et al., 2010; Chelba et al., 2013; Zaremba et al., 2014; Wang and Cho, 2015; Jozefowicz et al., 2016) despite various challenges (Merity et al., 2017; Yang et al., 2017). The sequential nature of RNN-based models precludes parallelization within training examples, which makes scaling to long sequence lengths and large corpora challenging. The Transformer architecture, relying on stacked self-attention and point-wise, fully-connected layers, allows for significantly more parallelization (Vaswani et al., 2017).

One approach to developing deep architectures for specific language tasks has been to exploit feature representations learned from large datasets of general-purpose data such as Wikipedia.
These pre-trained approaches are now key components in many natural language applications (Mikolov et al., 2013). These concepts have also been extended to the legal domain, including the creation of the Law2Vec legal word embeddings, which are likely to accelerate progress in this research area (Chalkidis and Kampas, 2019).

There are generally two strategies for applying pre-trained language models to downstream tasks: feature-based and fine-tuning. The feature-based approach, as used in ELMo (Peters et al., 2018), learns a fixed representation, or feature space, on a large text corpus. More specifically, ELMo develops a coupled forward-LM and backward-LM approach, as well as a linear combination of the hidden representations stacked above each input word for each end task, and markedly improves performance over using only the top LSTM layer representation.

The fine-tuning approach, as demonstrated in ULMFiT (Howard and Ruder, 2018) and GPT (Radford et al., 2018), introduces minimal task-specific parameters and adapts to downstream tasks simply by re-learning the weights in one or more layers of the deep architecture.

In this paper, we build upon the recent release of BERT (Devlin et al., 2018), which makes use of a masked language model as its pre-training objective to learn a deep bidirectional language model. We develop our approach by fine-tuning the pre-trained parameters for the downstream legal time-recording classification task.

Text classification in the legal space has included research on court ruling prediction (Sulea et al., 2017) and legal deontic modality classification (O'Neill et al., 2017), but the incorporation of pre-trained contextual embeddings remains relatively unexplored.

3 Litigation Code Classification

3.1 Overview

The task is a 40-class classification problem where the labels are litigation J-codes. The J-code set is one set of the Uniform Task Based Management System (UTBMS) codes used to classify legal services performed by a legal vendor in an electronic invoice submission[2].

[2] A similar set of codes has previously been developed in the United States. There, the codes were developed to provide a common language for e-billing, under which both the law firm and the client have systems using a common code set for, respectively, the delivery and analysis of bills, commonly referred to as L-codes.

The J-code set originates from the Review of Civil Litigation Costs in England and Wales (Nelson and Jackson, 2014). A key recommendation of the review was that a new format for bills of costs be standardized to increase both the transparency of costs assessed by the courts and the consistency in the way costs are presented to judges.

The new format, designed to be produced and analyzed in digital workflows, resulted in a set of discrete J-codes that are used to categorize the work undertaken. There are three hierarchical levels of granularity. The highest level is the Phase. Examples include Pre-Action work and Disclosure, corresponding to J-codes JC00 and JF00 respectively. The intermediate level of generality is the Task. Each Phase has a finite and limited number of Tasks assigned to it. For example, the Issue / Statements of Case phase (JE00) includes the lower-tier tasks Review of Other Party/Opponents' Statement of Case (JE20) and Amendment of Statement of Case (JE40). The distribution of J-codes used in the evaluation can be seen in Figure 1. The lowest tier is the Action, but we do not use this granularity in this study. Actions specify how the work is done, Tasks describe what is being done, and Tasks are further grouped by Phases. A detailed explanation of the J-code structure can be found in Nelson and Jackson (2014).
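To make the three-tier structure concrete, the snippet below encodes the Phase and Task examples named above as a small Python mapping. The code names and descriptions come from the text; the prefix-based helper is our own illustrative assumption about how Task codes roll up to Phases (JE20 and JE40 both belong to JE00), not part of the paper's system.

```python
# Phase and Task examples from the J-code hierarchy described above
# (Nelson and Jackson, 2014); the full set contains 40 Task-level codes.
PHASES = {
    "JC00": "Pre-Action",
    "JE00": "Issue / Statements of Case",
    "JF00": "Disclosure",
}

TASKS = {
    "JE20": "Review of Other Party/Opponents' Statement of Case",
    "JE40": "Amendment of Statement of Case",
}

def phase_of(task_code: str) -> str:
    """Illustrative assumption: a Task code shares its two-letter prefix
    with its parent Phase, so JE20 rolls up to JE00."""
    return task_code[:2] + "00"

assert phase_of("JE20") == "JE00"
assert phase_of("JE40") in PHASES
```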
3.2 Motivation

This classification task is important in the context of legal digital workflows because it allows law firms to extract value from billing data. Organizing work by Phase and Task facilitates more effective budgeting, particularly as alternative fee arrangements become more prevalent, and increases transparency across different clients and matters.

Automating Phase-Task code classification also reduces the administrative burden on lawyers, who may each record thousands of time entries involving these codes annually. Furthermore, the adoption of UTBMS codes can be inconsistent within industries or even within a given firm, with some lawyers delegating their task-based coding or assigning blocks of time entries to the same code. In these cases, automation is likely to improve the quality of the data collected and to allow for inter-department comparative analyses.

Moreover, it is possible for time entries to be entered just once into a solicitor's system (including Task and Activity codes) and then used in a variety of different reporting applications, from the client, to the court, to the normal administrative functions of finance and tax.

Lastly, the nature of billing data in an industry characterized by time-based charging means that it is likely to be a key source of data in any multi-modal, multitask system supporting task automation in the digital workplace.

All of the above emphasize the importance of accuracy when assigning the codes. There are also financial incentives, as any incorrect entries may be impossible to recover from the other side or may not be approved by the court. Additionally, the time fee earners spend amending and checking the codes has to be written off and does not provide any benefit to the law firm. Thus, automated code assignment can lead to significant improvements in productivity, even if the output requires review by a legal professional.

[Figure 1: Histogram showing the distribution of J-codes, ordered from most frequent (JC10) to least frequent (JM30). The long tail demonstrates the class imbalance in this dataset. This is to be expected: because time entries are aggregated by type of work performed, multiple time entries can result from the services performed in a single day on a single matter.]
3.3 Data

The data is a collection of narratives from a legal firm's proprietary set spanning more than 1,500 matters and 300 timekeepers. Due to its sensitive nature, the data has been anonymized using a Named Entity Recognition (NER) algorithm that identifies and redacts the names of people, organizations, and locations, among other entity types, in the form of a word mask. This algorithm combines machine learning based on linguistic features with stricter pattern-based exclusions. Another effect of preprocessing the data with the NER algorithm is to ensure a higher degree of model generalisability, since the model is not trained on specific proper nouns which may be present in the vocabulary at training time but not at test time. This can be seen in Figure 2, which shows high mask counts for MASK_PERSON and MASK_ORG.

[Figure 2: Histogram showing the distribution of vocabulary, including word masks. The person (MASK_PERSON) and organization (MASK_ORG) masks are among the more frequent tokens.]

The data has been cleaned by a heuristic whereby blocks of time entries from the same timekeeper assigned almost exclusively to the same phase-task code combination were excluded. Despite this process, the classes in the data set remain relatively imbalanced, with about one third of entries assigned to the most common phase code and one fifth of entries assigned to the most common task code.

The data set consists of 51,948 examples split into training, development, and testing sets using 80%/10%/10% split ratios respectively.

3.4 Evaluation Metrics

This is a multi-class classification problem with significant class imbalance, so we evaluate on both micro-averaged accuracy and macro-averaged accuracy in a one-vs-all setting.

The micro-averaged accuracy is computed by aggregating the contributions of all the classes: the number of correct predictions divided by the total number of examples.

The macro-averaged accuracy involves computing the accuracy for each individual class independently (the class average), followed by taking the average across classes (hence treating all classes equally). This is useful for understanding how the system performs on each class despite the limited data points for particular classes.
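A minimal sketch of the two metrics as defined above, reading the per-class accuracy in the one-vs-all setting as the fraction of each class's examples that are predicted correctly; the function names are our own.

```python
import numpy as np

def micro_accuracy(y_true, y_pred):
    """Aggregate over all classes: correct predictions / total examples."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def macro_accuracy(y_true, y_pred):
    """Compute accuracy for each class independently, then average across
    classes so that rare J-codes count as much as common ones."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(per_class))
```

Under these definitions, a majority-class predictor scores the majority-class frequency on the micro metric but roughly 1/40 on the macro metric, which matches the gap between the two baseline rows in Table 2.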
4 Models

To demonstrate any improved performance from the use of pre-trained contextual embeddings on this domain-specific task, we benchmark performance against a variety of different baseline models.

4.1 Random Baseline

The random baseline simply predicts a random class for any given data point. As such, we expect the micro-averaged accuracy to be roughly 1/num_classes.

4.2 Majority Baseline

We present a majority baseline which predicts the most common class (JC10) for any given data point.

4.3 Surface Logistic Regression

We featurise the narratives for the surface models by normalising the input narratives and converting them to a Bag-of-Words (BOW) sparse representation. In addition, we also experiment with character and word tokenisation, removal of stopwords, and TF-IDF feature reweighting, but observe best performance with bigram-enhanced BOW features tokenised at the word level while retaining stopwords. A logistic regression model is applied to the featurised input in a one-versus-rest multi-class scheme with an L2 weight regularisation penalty.

4.4 XGBoost Baseline

As a final baseline, we use the scalable gradient-boosting implementation XGBoost (Chen and Guestrin, 2016), which has been used on various text classification tasks with strong performance results based on additive tree-based optimisation. As with the logistic regression baseline, we performed pre-processing based on stopword removal, TF-IDF weighting, and n-gram selection. We also experimented with lemmatisation and case standardisation to achieve the highest model performance.
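A minimal sketch of the Section 4.3 pipeline, assuming scikit-learn: word-level unigram plus bigram bag-of-words features with stopwords retained, fed to a one-versus-rest logistic regression with an L2 penalty. The regularisation strength, normalisation step, and variable names are illustrative assumptions; the XGBoost baseline of Section 4.4 would consume similar sparse features.

```python
# A sketch of the bigram-enhanced bag-of-words logistic regression baseline
# (Section 4.3), assuming scikit-learn. Hyperparameters not stated in the
# paper (e.g. C, max_iter) are left at illustrative values.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

surface_lr = Pipeline([
    # Word-level unigrams + bigrams; stopwords are retained, the
    # configuration found to perform best above.
    ("bow", CountVectorizer(analyzer="word", ngram_range=(1, 2))),
    # One-versus-rest logistic regression with an L2 weight penalty.
    ("clf", OneVsRestClassifier(LogisticRegression(penalty="l2", max_iter=1000))),
])

# Hypothetical usage with lists of narrative strings and J-code labels:
# surface_lr.fit(train_narratives, train_codes)
# dev_predictions = surface_lr.predict(dev_narratives)
```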
4.5 BERT Models

We work with the HuggingFace[3] PyTorch implementation of the BERT (Bidirectional Encoder Representations from Transformers) model and run various fine-tuning experiments. BERT is designed to learn deep bidirectional representations by jointly conditioning on both left and right context in all layers through a masked language model objective. Pre-trained BERT representations are publicly available for download and can be fine-tuned with just one task-specific output layer to create state-of-the-art models for a wide range of tasks (Devlin et al., 2018). We experiment with the uncased and cased versions of pre-trained BERT-Base, a 12-layer transformer architecture with a hidden size of 768 and 12 self-attention heads adding up to 110 million parameters, and with the uncased version of BERT-Large, a 24-layer transformer architecture with a hidden size of 1024 and 16 self-attention heads adding up to 340 million parameters. Both were trained on a combined BookCorpus and Wikipedia corpus of 3.3 billion words, on 4 × 4 and 8 × 8 TPU slices respectively, for 4 days.

[3] https://github.com/huggingface/pytorch-pretrained-BERT

We fine-tune the models on an AWS p2.xlarge instance running a single NVIDIA K80 GPU. We adapt the BERT fine-tuning mechanism for single-sentence classification tasks to the matter classification task.
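A sketch of how a single narrative flows through this set-up using the pytorch-pretrained-BERT package linked above. The 40-way output head follows from the task; the sequence length, example narrative, and helper name are illustrative assumptions rather than settings reported in the paper.

```python
# Single-sentence classification with pytorch-pretrained-BERT (footnote [3]).
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForSequenceClassification

NUM_CLASSES = 40  # one logit per J-code

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=NUM_CLASSES)
model.eval()

def encode(narrative, max_len=64):
    """Tokenise one narrative into fixed-length input ids plus an attention
    mask, padding with the [PAD] id (0); max_len is an illustrative choice."""
    tokens = ["[CLS]"] + tokenizer.tokenize(narrative)[: max_len - 2] + ["[SEP]"]
    ids = tokenizer.convert_tokens_to_ids(tokens)
    mask = [1] * len(ids) + [0] * (max_len - len(ids))
    ids = ids + [0] * (max_len - len(ids))
    return torch.tensor([ids]), torch.tensor([mask])

input_ids, attention_mask = encode("Attendance on client, email exchange")
with torch.no_grad():
    logits = model(input_ids, attention_mask=attention_mask)  # shape (1, 40)
predicted_class = int(logits.argmax(dim=-1))  # index into the J-code label set
```

During fine-tuning, passing a labels tensor to the same call returns a cross-entropy loss to optimise instead of the logits.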
4.6 Chronology-enhanced Models

In principle, any production system for time-recording can take account of additional information to support the classification task. The J-code set has ordinal structure resulting from the progression of Phases and Tasks during the case, and any specific time entries also have temporal structure that can be exploited. As a result, we can significantly improve model performance by incorporating features based on the set of codes typically associated with a user or matter. We therefore include a chronology-enhanced XGBoost model in our analysis to set any performance improvements in context.

Care is taken to verify that the model's behavior is not simply to repeat the last code on a given matter: we set the chronology-based features to zero, obtain predictions from the chronology-enhanced model, and confirm that the difference in micro-accuracy is not greater than five percent relative to the purely text-based model.
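The paper does not publish the exact chronology features, so the following is a hypothetical sketch of one natural construction: for each time entry, the relative frequency of the codes previously recorded on the same matter, updated only after the entry is featurised so that the current label never leaks into its own features. The function name and feature layout are our own.

```python
from collections import Counter, defaultdict

def chronology_features(entries, code_set):
    """entries: chronologically ordered (matter_id, j_code) pairs.
    Yields one feature vector per entry: the relative frequency of each
    J-code previously recorded on the same matter."""
    history = defaultdict(Counter)       # matter_id -> Counter of past codes
    for matter_id, j_code in entries:
        seen = history[matter_id]
        total = sum(seen.values()) or 1  # avoid division by zero on first entry
        yield [seen[c] / total for c in code_set]
        seen[j_code] += 1                # update after featurising (no leakage)
```

Such vectors would be concatenated with the text features before training; zeroing them out recovers the ablation described above.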
[Figure 3: Confusion matrices for (a) BERT-Base (Uncased), (b) BERT-Large (Uncased), and (c) XGBoost. BERT-Large is better at classifying class JC10, particularly against JM30.]

5 Results and Discussion

Results for the different models are presented in Table 2. We observe substantial performance improvements of the BERT models over the text-based baselines as well as over the XGBoost text-based model, particularly with regard to macro-accuracy.

Model                                                        Micro Acc. (%)   Macro Acc. (%)
Random Baseline                                               2.02             2.26
Majority Baseline                                            19.96             2.50
Surface Random Forest                                        42.66            28.49
Surface Logistic Regression                                  45.87            32.30
Surface Logistic Regression (enhanced with bigram features)  51.78            39.30
XGBoost                                                      53.15            36.65
BERT Base (Uncased)                                          55.17            44.28
BERT Base (Cased)                                            55.38            46.04
BERT Large (Uncased)                                         54.17            45.25
XGBoost (Chronological features)                             77.11            61.51

Table 2: Results of the models on the test set. We can see increased performance over the baseline models.

The best performing BERT single model achieves performance gains of 2.23 micro-averaged accuracy points and 9.39 macro-averaged accuracy points over the XGBoost text-only classifier on the test set. This is likely to have a strong effect on the user experience of a production system, as it indicates substantially better performance on less common classes. It also demonstrates the effectiveness of pre-trained methods at incorporating prior knowledge and learning on low-resource data, despite the linguistic differences between the pre-training and legal domains.

We also perform an in-depth error analysis, including visual inspection of different model predictions and confusion matrices (see Figure 3), to understand which classes the models commonly mistake for others. We find that both the XGBoost text-based model and the BERT-Base model commonly predict the most common class JC10 (Factual Investigation: work required to understand the facts of the case, including instructions from the client and the identification of potential witnesses) when the ground truth is JM30 (Hearings: includes preparation for and attendance at hearings for directions and interim certificate applications as well as the detailed assessment itself). We also observe that all text-based models have difficulty distinguishing between JG10 (Taking, preparing and finalising witness statement(s)) and JG20 (Reviewing Other Party(s)' witness statement(s)). It is likely that this can be explained to some extent by the text anonymisation.

We can also see that there are different error patterns between the BERT and XGBoost models, and therefore we are likely to be able to improve performance in a production system using an ensemble approach. Furthermore, in addition to the Task-level results above, results at the Phase level are encouraging for use in production, with a micro-accuracy rate of 90.40 percent for the chronology-enhanced XGBoost model. In some cases, such data is already sufficiently granular to derive actionable firm budgeting insights and an improvement over existing manual methods.

6 Conclusion and Future Work

Recent empirical improvements due to transfer learning with language models have demonstrated that rich, unsupervised pre-training is an integral part of many language understanding systems. Here we present experiments and analysis of state-of-the-art models based on deep pre-trained contextual embeddings applied to the task of litigation code classification. We show that BERT fine-tuned to the 40-class matter classification task provides substantial performance gains over our best-performing baseline.

One area to explore further is to incorporate the chronology-based features into a BERT-centric approach. For example, one approach could be to learn contextual embeddings for text over the temporal set of J-codes. Another could be to ensemble the predictions of a purely chronology-based model with the BERT output.

We achieve our primary goal of demonstrating the capability to transfer pre-trained language knowledge from a general corpus to a legal domain task, with improved performance. Notwithstanding this fine-tuning result, in future work we intend to extend this by learning contextualised representations from legal corpora, a direction that has achieved some success in other domains (Lee et al., 2019) and which could be applied across a wide variety of tasks in the legal domain.

Moreover, although we have explored the use of a multitask learning framework, we have only demonstrated performance on a single legal task. Future work will likely include extending this analysis to a set of legal benchmark tasks that include natural language inference tasks (similar to GLUE) on publicly available legal datasets.

Given the relatively high degree of class imbalance present in Phase and Task codes, as well as the level of legal expertise involved in distinguishing closely related or rarer options, this classification problem lends itself well to human-in-the-loop machine learning. Such an active learning platform would involve feeding timekeeper-validated data back into the model for near-real-time retraining. This method of data collection may also achieve scale conducive to learning the contextualised legal-corpora representations mentioned above.

Acknowledgements

We thank Edwin Zhang and Brandon Hill at Ping Inc. for their assistance in the data preparation, baseline modeling, and chronology enhancements.

References

Ilias Chalkidis and Dimitrios Kampas. 2019. Deep learning in law: early adaptation and legal word embeddings trained on large corpora. Artificial Intelligence and Law, 27(2):171–198.

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, and Phillipp Koehn. 2013. One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling. arXiv preprint arXiv:1312.3005.

Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. arXiv preprint arXiv:1603.02754.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. arXiv preprint arXiv:1901.02860.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of Tricks for Efficient Text Classification. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431.

Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. 2016. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2019. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics.

Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2017. Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182.

Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. INTERSPEECH, 2:1045–1048.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. NIPS.

David Nelson and Jackson. 2014. EW-UTBMS Civil Litigation J-Code Set Overview and Guidelines.

E.W.T. Ngai, Yong Hu, Y.H. Wong, Yijun Chen, and Xin Sun. 2011. The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems, 50(3):559–569.

James O'Neill, Paul Buitelaar, Cecile Robin, and Leona O'Brien. 2017. Classifying Sentential Modality in Legal Language: A Use Case in Financial Regulations, Acts and Directives. Proceedings of the 16th Edition of the International Conference on Artificial Intelligence and Law (ICAIL '17).

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/languageunsupervised/language understanding paper.pdf.

Alexander Ratner, Stephen H. Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Ré. 2017. Snorkel: Rapid training data creation with weak supervision. Proceedings of the VLDB Endowment, 11(3):269–282.

Alexander Ratner, Braden Hancock, Jared Dunnmon, Frederic Sala, Shreyash Pandey, and Christopher Ré. 2019. Training Complex Models with Multi-Task Weak Supervision. AAAI.

Herbert Roitblat, Anne Kershaw, and Patrick Oot. 2010. Document categorization in legal electronic discovery: computer classification vs. manual review. Journal of the Association for Information Science and Technology, 61(1):70–80.

Fabrizio Sebastiani. 2002. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1):1–47.

Octavia-Maria Sulea, Marcos Zampieri, Shervin Malmasi, Mihaela Vela, Liviu P. Dinu, and Josef van Genabith. 2017. Exploring the use of text classification in the legal domain. Proceedings of the 2nd Workshop on Automated Semantic Analysis of Information in Legal Texts.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. 31st Conference on Neural Information Processing Systems (NIPS 2017).

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.

Tian Wang and Kyunghyun Cho. 2015. Larger-context language modelling. arXiv preprint arXiv:1511.03729.

Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W. Cohen. 2017. Breaking the softmax bottleneck: A high-rank RNN language model. arXiv preprint arXiv:1711.03953.

Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent Neural Network Regularization. arXiv preprint arXiv:1409.2329.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015.
Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28, pages 649–657.