Pre-trained Contextual Embeddings for Litigation Code Classification
Max Bartolo Kamil Tylinski Alastair Moore
UCL Mishcon de Reya LLP Mishcon de Reya LLP
m.bartolo@cs.ucl.ac.uk kamil.tylinski@mishcon.com alastair.moore@mishcon.com
In: Proceedings of the First International Workshop on AI and Intelligent Assistance for Legal Professionals in the Digital Workplace (LegalAIIA 2019), held in conjunction with ICAIL 2019. June 17, 2019. Montréal, QC, Canada.
Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Published at http://ceur-ws.org.

Abstract

Models for a variety of natural language processing tasks, such as question answering or text classification, are potentially important components for a wide range of legal machine learning systems. These tasks may include examining whole legal corpora, but may also include a broad range of tasks that can support automation in the digital workplace. Importantly, recent advances in pre-trained contextual embeddings have substantially improved the performance of text classification across a wide range of tasks. In this paper, we investigate the application of these recent approaches to a legal time-recording task. We demonstrate improved performance on a 40-class J-code classification task over a variety of baseline techniques. The best performing single model achieves performance gains of 2.23 micro-averaged accuracy points and 9.39 macro-averaged accuracy points over the next best classifier on the test set. This result suggests these techniques will find broad utility in the development of legal language models for a range of automation tasks.

1 Introduction

Legal data comes in a variety of different forms, from contracts and legal documents containing technical language, to the variety of correspondence between client and solicitor (from email to transcripts), to billing and enterprise performance management (EPM) systems used to support the business of law.

Developing systems that can support the automation of a variety of tasks across the digital workplace involves working with heterogeneous data, with different quantities of labelled data (for the purposes of supervised learning) of variable quality. For this reason, practitioners are increasingly turning to more indirect ways of injecting weak supervision signals into their models (Ratner et al., 2017). Recent work on multitask learning (Ratner et al., 2019) has developed an approach to deep learning architectures that learn massive multitask models with different heads adapted for different tasks.

A traditional approach to text classification tasks is to create a linear classifier (Logistic Regression or Support Vector Machine) on sentences represented as a bag of words. The main disadvantage of this method is its inability to share parameters among classes and features (Joulin et al., 2017). Alternatively, the problem can be approached by means of neural networks (Zhang et al., 2015), where transformer architectures have proven to be more appropriate for a wide variety of tasks, not only text classification (Vaswani et al., 2017; Dai et al., 2019).

Importantly, incorporating pre-trained contextual embeddings (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2018) has led to impressive performance gains across many natural language processing tasks such as question answering, natural language inference, sequence labelling and text classification. Models with access to pre-trained language knowledge currently provide state-of-the-art results on the GLUE benchmark¹ tasks and also outperform human baselines in some cases. The GLUE benchmark consists of nine natural language understanding tasks (e.g., natural language inference, sentence similarity, etc.). Each comes with its own unique set of examples and labels, ranging in size from 635 training examples (WNLI) to 393k (MNLI) (Wang et al., 2018).

¹ https://gluebenchmark.com/leaderboard

However, legal text (whether it contains technical language or simple correspondence) tends to differ from the text corpora on which these state-of-the-art language models are trained, such as Wikipedia and BookCorpus. In this paper, towards the goal of developing large multitask models for different legal applications, we first demonstrate the successful use of pre-trained language models transferred to a legal domain task.

We focus on the task of litigation code classification, illustrated in Table 1, which is an important sub-task in legal time-recording and for preparing bills of costs for assessment by the courts. We base our approach on fine-tuning BERT (Bidirectional Encoder Representations from Transformers), a transformer-based language representation model (Devlin et al., 2018), and our evaluation shows that a single pre-trained model achieves significant performance gains over the next best classifier on the test set.

Narrative text                                               J-code
Working on response from [redacted] and statutory review.    JE20
Preparing documents for meeting with [redacted]              JH30
Attendance on client, email exchange                         JJ70

Table 1: Example narrative text for the classification task. Given a sentence of text describing the actions completed by the lawyer, assign a label based on a discrete J-codes label set. J-codes are time-recording codes introduced to comply with requirements under the UK Civil Procedure Rules. The process of redaction, highlighted, is discussed in Section 3.3.

2 Related Work

Text classification is a category of Natural Language Processing (NLP) tasks with real-world applications such as spam detection, fraud identification (Ngai et al., 2011), and legal discovery (Roitblat et al., 2010). Formally, it consists of assigning a Boolean value to each pair $\langle d_j, c_i \rangle \in D \times C$ (Sebastiani, 2002), where $D$ in our example is a domain of narrative documents and $C = \{c_1, \ldots, c_{|C|}\}$ is a set of J-Codes, such that we obtain a decision value for each narrative document $d_j$ being classed as $c_i$.

Classification tasks require large quantities of training data, but in many domain-specific applications the construction of a large training set is very costly and requires the use of experts to label data. The use of pretrained embeddings allows models to obtain linguistic knowledge from very large auxiliary corpora, often reducing the amount of task-specific training data required for good performance.

Recent approaches to natural language processing have revolved around neural methods for inferring probability distributions over sequences of words, referred to as language modelling (LM), using deep learning architectures. Recurrent Neural Network (RNN) based language models, owing largely to their capacity for learning sequential context, have been extensively researched (Mikolov et al., 2019; Chelba et al., 2013; Zaremba et al., 2014; Wang and Cho, 2015; Jozefowicz et al., 2016) despite various challenges (Merity et al., 2017; Yang et al., 2017). The sequential nature of RNN-based models precludes parallelization within training examples, which makes scaling to long sequence lengths and large corpora challenging. The Transformer architecture, relying on stacked self-attention and point-wise, fully-connected layers, allows for significantly more parallelization (Vaswani et al., 2017).

One approach to developing deep architectures for specific language tasks has been to exploit feature representations learned from large datasets of general purpose data such as Wikipedia. These pre-trained approaches are now key components in many natural language applications (Mikolov et al., 2013). These concepts have also been extended to the legal domain, including the creation of the Law2Vec legal word embeddings, which are likely to accelerate progress in this research area (Chalkidis and Kampas, 2019).

There are generally two strategies for applying pre-trained language models to downstream tasks: feature-based and fine-tuning. The feature-based approach, as used in ELMo (Peters et al., 2018), learns a fixed representation, or feature space, on a large text corpus. More specifically, ELMo develops a coupled forward LM and backward LM approach as well as a linear combination of the hidden representations stacked above each input word for each end task, and markedly improves performance over just using the top LSTM layer representation.
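The linear combination of hidden representations mentioned above for ELMo can be illustrated with a small sketch. It assumes the softmax-normalised layer weights and a single scaling factor described by Peters et al. (2018); the function names, the toy vectors and the weight values are hypothetical and chosen only for illustration, not taken from this paper's experiments.

```python
import math


def softmax(weights):
    """Normalise raw layer weights into positive values that sum to 1."""
    shift = max(weights)  # subtract the maximum for numerical stability
    exps = [math.exp(w - shift) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def combine_layers(layer_vectors, raw_weights, gamma=1.0):
    """Return gamma * sum_j softmax(raw_weights)[j] * layer_vectors[j] for one token."""
    if len(layer_vectors) != len(raw_weights):
        raise ValueError("need exactly one weight per layer")
    mix = softmax(raw_weights)
    dim = len(layer_vectors[0])
    return [
        gamma * sum(s * layer[i] for s, layer in zip(mix, layer_vectors))
        for i in range(dim)
    ]


# Toy hidden states for one token (assumed values): an embedding layer
# followed by two LSTM layers, each of dimension 2.
token_layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Equal raw weights reduce the mix to a plain average of the layers.
mixed = combine_layers(token_layers, raw_weights=[0.0, 0.0, 0.0])

# A very large weight on the last layer approximates using the top layer only.
top_only = combine_layers(token_layers, raw_weights=[0.0, 0.0, 50.0])
```

In a feature-based setup the raw weights and `gamma` would be learned together with the end-task model while the layer representations themselves stay fixed.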
The fine-tuning approach, as demonstrated in ULMFiT (Howard and Ruder, 2018) and GPT (Radford et al., 2018), introduces minimal task-specific parameters and adapts to downstream tasks simply by re-learning the weights in one or more layers of the deep architecture.

In this paper, we build upon the recent release of BERT (Devlin et al., 2018), which makes use of a masked language model for its pre-training objective to learn a deep bidirectional language model. We develop our approach by fine-tuning the pre-trained parameters for the downstream legal time-recording classification task.

Text classification in the legal space has included research in court ruling predictions (Sulea et al., 2017) and legal deontic modality classification (O’Neill et al., 2017), but the incorporation of pre-trained contextual embeddings remains relatively unexplored.

3 Litigation Code Classification

3.1 Overview

The task is a 40-class classification problem where the labels are litigation J-Codes. The J-code set is one set of the Uniform Task Based Management System (UTBMS) codes used to classify legal services performed by a legal vendor in an electronic invoice submission².

² A similar set of codes has previously been developed in the United States. Here the codes have been developed to provide a common language for e-billing, under which both the law firm and the client have systems using a common code set for, respectively, the delivery and analysis of bills. These are commonly referred to as L-codes.

The background of the J-code set originates from the Review of Civil Litigation Costs in England and Wales (Nelson and Jackson, 2014). A key recommendation of the review was that a new format for bills of costs be standardized to increase both the transparency of costs assessed by the courts and the consistency in the way costs are presented to judges.

The new format, designed to be produced and analyzed in digital workflows, resulted in a set of discrete J-Codes that are used to categorize work undertaken. There are three hierarchical levels of granularity. The highest level is the Phase. Examples include Pre-Action work and Disclosure, corresponding to J-codes JC00 and JF00 respectively. The intermediate level of generality is the Task. Each Phase has a finite and limited number of Tasks assigned to it. For example, the Issue / Statements of Case phase (JE00) includes the lower tier tasks of Review of Other Party/Opponents’ Statement of Case (JE20) and Amendment of Statement of Case (JE40). An example of the distribution of J-codes used in the evaluation can be seen in Figure 1. The lowest tier is Action, but we do not use this granularity in this study. Actions specify how the work is done, Tasks describe what is being done and are further grouped by Phases. A detailed explanation of the J-code structure can be found in (Nelson and Jackson, 2014).

[Histogram "J-Code distribution". Vertical axis: number of time entries, 0 to 10000. Horizontal axis, left to right: JC10, JC20, JC30, JE10, JD20, JJ70, JJ20, JI10, JF20, JK20, JD10, JF10, JG10, JL30, JJ60, JB10, JH10, JJ10, JJ50, JJ30, JE20, JA10, JF40, JE40, JE30, JM10, JK10, JH30, JF30, JG20, JI30, JH20, JB20, JM20, JI20, JB30, JJ40, JM40, JL20, JM30.]

Figure 1: Histogram to show the distribution of J-codes. The long tail demonstrates the class imbalance in this dataset. This is to be expected as time entries, aggregated by type of work performed, mean that multiple time entries could result from the services performed in a single day on a single matter.

3.2 Motivation

This classification task is important in the context of legal digital workflows because it allows law firms to extract value from billing data. Organizing work by Phase and Task facilitates more effective budgeting, particularly as alternative fee arrangements become more prevalent, and increases transparency across different clients and matters.

Automating Phase-Task code classification also reduces the administrative burden upon lawyers, who may each record thousands of time entries involving these codes annually. Furthermore, the adoption of UTBMS codes can be inconsistent within industries or even a given firm, with some lawyers delegating their task-based coding or assigning blocks of time entries to the same code. In these cases, automation is likely to improve the quality of data collected and allow for inter-department comparative analyses.

Moreover, it is possible for time entries to be entered just once into a solicitor’s system (including Task and Activity codes) and then used in a variety of different reporting applications, from the client, to the court, to the normal administrative functions of finance and tax.

Lastly, the nature of billing data in an industry characterized by time-based charging means it is likely to be a key source of data in any multi-modal multitask system supporting task automation in the digital workplace.

All of the above emphasize the importance of accuracy when assigning the codes. There are also financial incentives, as any incorrect entries may be impossible to recover from the other side or not approved by the court. Additionally, the time fee earners spend amending and checking the codes has to be written off and does not provide any benefit to the law firm. Thus, automated code assignment can lead to significant improvements in productivity, even if the output still has to be reviewed by a legal professional.

3.3 Data

The data is a collection of narratives from a legal firm’s proprietary set spanning more than 1500 matters and 300 timekeepers. Due to its sensitive nature, the data has been anonymized using a Named Entity Recognition (NER) algorithm that identifies and redacts the names of people, organizations, and locations, among other entity types, in the form of a word mask. This algorithm combines machine learning based on linguistic features with stricter pattern-based exclusions. Another effect of preprocessing data with the NER algorithm is to ensure a higher degree of model generalisability, since the model is not trained on specific proper nouns which may be present in the vocabulary at training time but not at test time. This can be seen in Figure 2, where we can see high mask counts for MASK_PERSON and MASK_ORG.

The data has been cleaned by a heuristic whereby blocks of time entries from the same timekeeper assigned almost exclusively to the same phase-task code combination were excluded. Despite this process, classes in the data set remain relatively imbalanced, with about one third of entries assigned to the most common phase code and one fifth of entries assigned to the most common task code.

The data set consists of 51,948 examples split into training, development, and testing sets using 80%/10%/10% split ratios respectively.

3.4 Evaluation Metrics

This is a multi-class classification problem with significant class imbalance, so we evaluate on both micro-averaged accuracy and macro-averaged accuracy in a one-vs-all setting.

The micro-averaged accuracy is computed by aggregating the contributions of all the classes: the number of correct predictions is divided by the total number of examples.

The macro-averaged accuracy considers the computation of the accuracy for each individual class independently (class average), followed by taking the average across classes (hence treating all classes equally). This is useful for understanding how the system performs across each class despite the limited data points for particular classes.

4 Models

To demonstrate any improved performance from the use of pre-trained contextual embeddings on this domain-specific task, we benchmark performance against a variety of different baseline models.

4.1 Random Baseline

The random baseline simply predicts a random class for any given data point. As such, we expect the micro-averaged accuracy to be roughly $1/\text{num\_classes}$, i.e. 2.5% for the 40 classes here.

4.2 Majority Baseline

We present a majority baseline which predicts the most common class (JC10) for any given data
point.

Figure 2: Histogram to show the distribution of vocabulary, including word masks. We can see that the person (MASK_PERSON) and organization (MASK_ORG) masks are more frequent.

4.3 Surface Logistic Regression

We featurise the narratives for the surface models by normalising the input narratives and converting them to a sparse Bag-of-Words (BOW) representation. In addition, we also experiment with character and word tokenisation, removal of stopwords and TF-IDF feature reweighting, but observe best performance on bigram-enhanced BOW features tokenised at word level while retaining stopwords. A logistic regression model is applied to the featurised input in a one-versus-rest multi-class scheme with an L2 weight regularisation penalty.

4.4 XGBoost Baseline

As a final baseline, we use the scalable gradient-boosting implementation XGBoost (Chen and Guestrin, 2016), which has been used on various text classification tasks with strong performance results based on additive tree-based optimisation. As with the logistic regression baseline, we performed pre-processing based on stopword removal, TF-IDF weighting, and n-gram selection. We also experimented with lemmatisation and case standardisation to achieve the highest model performance.

4.5 BERT Models

We work with the HuggingFace³ PyTorch implementation of the BERT (Bidirectional Encoder Representations from Transformers) model and run various fine-tuning experiments. BERT is designed to learn deep bidirectional representations by jointly conditioning on both left and right context in all layers through a masked language model objective. Pre-trained BERT representations are publicly available for download and can be fine-tuned with just one task-specific output layer to create state-of-the-art models for a wide range of tasks (Devlin et al., 2018). We experiment with the uncased and cased versions of pre-trained BERT Base, which is a 12-layer transformer architecture with a hidden size of 768 and 12 self-attention heads adding up to 110 million parameters, and the uncased version of BERT Large, which is a 24-layer transformer architecture with a hidden size of 1024 and 16 self-attention heads adding up to 340 million parameters. Both were trained on a combined BookCorpus and Wikipedia corpus of 3.3 billion words, on 4 × 4 and 8 × 8 TPU slices respectively, for 4 days.

³ https://github.com/huggingface/pytorch-pretrained-BERT

We fine-tune the models on an AWS p2.xlarge instance running a single NVIDIA K80 GPU. We adapt the BERT fine-tuning mechanism for single sentence classification tasks to the matter classification task.

4.6 Chronology-enhanced models

In principle, any production system for time-recording can take account of additional information to support the classification task. The J-code set has an ordinal structure resulting from the progression of Phases and Tasks during the case, and any specific time entries also have temporal structure that can be exploited.

As a result of this, we can significantly improve model performance by incorporating features based on the set of codes typically associated with a user or matter. Therefore, we include a chronology-enhanced XGBoost model in our analysis to set any performance improvements in context.

Care is taken to verify that the model behavior is not to simply repeat the last code on a given matter by setting chronology-based features to zero, obtaining predictions from the chronology-enhanced model, and confirming that the difference in micro-accuracy is not greater than five percent relative to the purely text-based model.

5 Results and Discussion

Results for the different models are presented in Table 2. We observe substantial performance improvements of BERT models over the text-based baselines as well as the XGBoost text-based model, particularly with regard to macro-accuracy.

Model                                                         Micro Acc. (%)   Macro Acc. (%)
Random Baseline                                                2.02             2.26
Majority Baseline                                             19.96             2.50
Surface Random Forest                                         42.66            28.49
Surface Logistic Regression                                   45.87            32.30
Surface Logistic Regression (enhanced with bigram features)   51.78            39.30
XGBoost                                                       53.15            36.65
BERT Base (Uncased)                                           55.17            44.28
BERT Base (Cased)                                             55.38            46.04
BERT Large (Uncased)                                          54.17            45.25
XGBoost (Chronological features)                              77.11            61.51

Table 2: Results of the models on the test set. We can see increased performance over baseline models.

The best performing BERT single model achieves performance gains of 2.23 micro-averaged accuracy points and 9.39 macro-averaged accuracy points over the XGBoost text-only classifier on the test set. This is likely to have a strong effect on user experience of a production system as it indicates substantially better performance on less common classes. It also demonstrates the effectiveness of pre-trained methods to incorporate prior knowledge and learn on low-resource data, despite the linguistic differences between the pre-trained and legal domains.

We also perform an in-depth error analysis, including visual inspection of different model predictions and confusion matrices (see Figure 3), to understand which classes the models commonly mistake for others. We find that both the XGBoost text-based model and the BERT Base model commonly predict the most common class JC10 (Factual Investigation: Work required to understand the facts of the case including instructions from the client and the identification of potential witnesses) when the ground truth is JM30 (Hearings: Includes preparation for and attendance at hearings for directions and interim certificate applications as well as the detailed assessment itself). We also observe that all text-based models have difficulty distinguishing between JG10 (Taking, preparing and finalising witness statement(s)) and JG20 (Reviewing Other Party(s)’ witness statement(s)). It is likely that this can be explained to some extent by the text anonymisation.

Figure 3: Confusion matrices. a) BERT Base (Uncased) b) BERT Large (Uncased) c) XGBoost. We can see that BERT Large is better at classifying class JC10, particularly against JM30.

We can also see that there are different error patterns between the BERT and XGBoost models, and therefore we are likely to be able to improve performance in a production system using an ensemble approach. Furthermore, in addition to the Task level results above, results on the Phase level are encouraging for use in production, with a micro-accuracy rate of 90.40 percent for the chronology-enhanced XGBoost model. In some cases, such data is already sufficiently granular to derive actionable firm budgeting insights and offers an improvement over existing manual methods.

6 Conclusion and Future Work

Recent empirical improvements due to transfer learning with language models have demonstrated that rich, unsupervised pre-training is an integral part of many language understanding systems. Here we present experiments and analysis of state-of-the-art models based on deep pre-trained contextual embeddings applied to the task of litigation code classification. We show that BERT fine-tuned to the 40-class matter classification task provides substantial performance gains over our best-performing baseline.

One area to explore further is to incorporate these chronology-based features into a BERT-centric approach. For example, one approach could be to learn contextual embeddings for text over a temporal set of J-codes. Another could be to ensemble the predictions of a purely chronology-based model with the BERT output.

We achieve our primary goal of demonstrating that there is the capability to transfer pre-trained language knowledge from a general corpus to the legal domain task, with improved performance.

Notwithstanding this fine-tuning result, in future work we intend to extend this by learning contextualised representations from legal corpora, a direction that has achieved some success in other domains (Lee et al., 2019) and which could be applied across a wide variety of tasks in the legal domain.

Moreover, although we have explored the use of a multi-task learning framework, we have only demonstrated performance on a single legal task. Future work will likely include extending this analysis to a set of legal benchmark tasks that include natural language inference tasks (similar to GLUE) on publicly available legal datasets.

Given the relatively high degree of class imbalance present in Phase and Task codes, as well as the level of legal expertise involved in distinguishing closely related or rarer options, this classification problem lends itself well to human-in-the-loop machine learning. Such an active learning platform would involve feeding timekeeper-validated data back into the model for near-real-time retraining. This method of data collection may also achieve scale conducive to learning the contextualised legal-corpora representations mentioned above.

Acknowledgements

We thank Edwin Zhang and Brandon Hill at Ping Inc. for their assistance in the data preparation, baseline modeling, and chronology enhancements.

References

Ilias Chalkidis and Dimitrios Kampas. 2019. Deep learning in law: early adaptation and legal word embeddings trained on large corpora. Artificial Intelligence and Law, 27(2):171–198.

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, and Phillipp Koehn. 2013. One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling. arXiv preprint arXiv:1312.3005.

Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. https://arxiv.org/abs/1603.02754.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. arXiv preprint arXiv:1901.02860.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of Tricks for Efficient Text Classification. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431.
Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. 2016. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2019. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 1.

Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2017. Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182.

Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. 2019. Recurrent neural network based language model. INTERSPEECH, 2:1045–1048.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. NIPS.

James O’Neill, Paul Buitelaar, Cecile Robin, and Leona O’Brien. 2017. Classifying Sentential Modality in Legal Language: A Use Case in Financial Regulations, Acts and Directives. Proceedings of the 16th Edition of the International Conference on Artificial Intelligence and Law (ICAIL ’17).

David Nelson and Jackson. 2014. EW-UTBMS Civil Litigation J-Code Set Overview and Guidelines.

EWT Ngai, Yong Hu, YH Wong, Yijun Chen, and Xin Sun. 2011. The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems, 50(3):559–569.

Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.

Alexander Ratner, Stephen H. Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Ré. 2017. Snorkel: Rapid training data creation with weak supervision. Proceedings of the VLDB Endowment, 11(3):269–282.

Alexander Ratner, Braden Hancock, Jared Dunnmon, Frederic Sala, Shreyash Pandey, and Christopher Ré. 2019. Training Complex Models with Multi-Task Weak Supervision. AAAI.

Herbert Roitblat, Anne Kershaw, and Patrick Oot. 2010. Document categorization in legal electronic discovery: computer classification vs. manual review. Journal of the Association for Information Science and Technology, 61(1):70–80.

Fabrizio Sebastiani. 2002. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1):1–47.

Octavia-Maria Sulea, Marcos Zampieri, Shervin Malmasi, Mihaela Vela, Liviu P. Dinu, and Josef van Genabith. 2017. Exploring the use of text classification in the legal domain. Proceedings of 2nd Workshop on Automated Semantic Analysis of Information in Legal Texts.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. 31st Conference on Neural Information Processing Systems (NIPS 2017).

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.

Tian Wang and Kyunghyun Cho. 2015. Larger-context language modelling. arXiv preprint arXiv:1511.03729.

Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W. Cohen. 2017. Breaking the softmax bottleneck: A high-rank RNN language model. arXiv preprint arXiv:1711.03953.

Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent Neural Network Regularization. arXiv preprint arXiv:1409.2329.

Xiang Zhang, Junbo Zhao, and Yann Lecun. 2015. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28, pages 649–657.
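
Supplementary sketch: evaluation metrics. The micro-averaged and macro-averaged accuracy described in Section 3.4 can be illustrated as follows. This is a minimal sketch under one assumption: per-class accuracy is read as the fraction of a class's examples that are predicted correctly, a reading under which a majority-class predictor scores 1/40 = 2.5% macro accuracy on 40 classes, in line with the Majority Baseline row of Table 2. The function names and the toy labels are hypothetical and are not taken from the paper's data or code.

```python
from collections import defaultdict


def micro_accuracy(y_true, y_pred):
    """Correct predictions divided by the total number of examples."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def macro_accuracy(y_true, y_pred):
    """Unweighted mean of per-class accuracy over the classes present in y_true."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy, made-up labels: an imbalanced three-class sample.
y_true = ["JC10"] * 6 + ["JE20"] * 3 + ["JM30"]
# A majority-style predictor that always outputs the most common class.
y_majority = ["JC10"] * len(y_true)

micro = micro_accuracy(y_true, y_majority)  # 6 of 10 examples are correct
macro = macro_accuracy(y_true, y_majority)  # one class fully right, two fully wrong
```

On this toy sample the majority predictor looks reasonable under micro accuracy but poor under macro accuracy, which is why the two metrics are reported side by side for an imbalanced label set.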