=Paper= {{Paper |id=Vol-2484/paper6 |storemode=property |title=Pre-trained Contextual Embeddings for Litigation Code Classification |pdfUrl=https://ceur-ws.org/Vol-2484/paper6.pdf |volume=Vol-2484 |authors=Max Bartolo,Kamil Tylinski,Alastair Moore |dblpUrl=https://dblp.org/rec/conf/icail/BartoloTM19 }} ==Pre-trained Contextual Embeddings for Litigation Code Classification== https://ceur-ws.org/Vol-2484/paper6.pdf
    Pre-trained Contextual Embeddings for Litigation Code Classification


            Max Bartolo                         Kamil Tylinski                               Alastair Moore
              UCL                             Mishcon de Reya LLP                          Mishcon de Reya LLP
     m.bartolo@cs.ucl.ac.uk               kamil.tylinski@mishcon.com                 alastair.moore@mishcon.com




In: Proceedings of the First International Workshop on AI and Intelligent Assistance for Legal
Professionals in the Digital Workplace (LegalAIIA 2019), held in conjunction with ICAIL 2019.
June 17, 2019. Montréal, QC, Canada.
Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0). Published at http://ceur-ws.org.

                       Abstract

    Models for a variety of natural language processing tasks, such as question answering or
    text classification, are potentially important components for a wide range of legal machine
    learning systems. These tasks may include examining whole legal corpora, but may also
    include a broad range of tasks that can support automation in the digital workplace.
    Importantly, recent advances in pre-trained contextual embeddings have substantially
    improved the performance of text classification across a wide range of tasks. In this paper,
    we investigate the application of these recent approaches on a legal time-recording task. We
    demonstrate improved performance on a 40-class J-code classification task over a variety
    of baseline techniques. The best performing single model achieves performance gains of
    2.23 micro-averaged accuracy points and 9.39 macro-averaged accuracy points over the next
    best classifier on the test set. This result suggests these techniques will find broad utility in
    the development of legal language models for a range of automation tasks.

1   Introduction

Legal data comes in a variety of different forms, from contracts and legal documents containing
technical language, to the variety of correspondence between client and solicitor (from email to
transcripts), to billing and enterprise performance management (EPM) systems used to support the
business of law.

    Narrative text                                       J-code
    Working on response from  and statutory review.      JE20
    Preparing documents for meeting with                 JH30
    Attendance on client, email exchange                 JJ70

Table 1: Example narrative text for the classification task. Given a sentence of text describing the
actions completed by the lawyer, assign a label based on a discrete J-codes label set. J-codes are
time-recording codes introduced to comply with requirements under the UK Civil Procedure Rules.
The process of redaction, highlighted, is discussed in Section 3.3.

   Developing systems that can support the automation of a variety of tasks across the digital
workplace involves working with heterogeneous data, with different quantities of labelled data (for
the purposes of supervised learning) of variable quality. For this reason, practitioners are
increasingly turning to more indirect ways of injecting weak supervision signals into their models
(Ratner et al., 2017). Recent work on multitask learning (Ratner et al., 2019) has developed an
approach to deep learning architectures that learn massive multitask models with different heads
adapted for different tasks.
   A traditional approach to text classification tasks is to create a linear classifier (Logistic
regression or Support Vector Machine) on sentences presented as bag of words. The main
disadvantage of this method is its inability to share parameters among classes and features (Joulin
et al., 2017). Alternatively, the problem can be approached by means of neural networks (Zhang et
al., 2015), where transformer architectures have proven to be more appropriate for a wide variety
of tasks, not only text classification (Vaswani et al., 2017; Dai et al., 2019).
   Importantly, incorporating pre-trained contex-
LegalAIIA Workshop, ICAIL '19, June 17, 2019, Montreal, Quebec, Canada                                    Bartolo, Tylinski and Moore



tual embeddings (Peters et al., 2018; Radford                               Classification tasks require large quantities of
et al., 2018; Devlin et al., 2018) has led to im-                        training data, but in many domain-specific appli-
pressive performance gains across many natural                           cations the construction of a large training set is
language processing tasks such as question an-                           very costly and requires the use of experts to la-
swering, natural language inference, sequence la-                        bel data. The use of pretrained embeddings allows
belling and text classification. Models with access                      models to obtain linguistic knowledge from very
to pre-trained language knowledge currently pro-                         large auxiliary corpora, often reducing the amount of
vide state-of-the-art results on the GLUE bench-                         task-specific training data required for good per-
mark¹ tasks and also outperform human base-                              formance.
lines in some cases. The GLUE benchmark con-                                Recent approaches to natural language process-
sists of nine natural language understanding tasks                       ing have revolved around neural methods for in-
(e.g., natural language inference, sentence simi-                        ferring probability distributions over sequences
larity, etc.). Each comes with its own unique set                        of words, referred to as language modelling
of examples and labels, ranging in size from 635                         (LM), using deep learning architectures. Recur-
training examples (WNLI) to 393k (MNLI) (Wang                            rent Neural Network (RNN) based language mod-
et al., 2018).                                                           els, owing largely to their capacity for learn-
   However, legal text (whether it contains techni-                      ing sequential context, have been extensively re-
cal language or simple correspondence) tends to                          searched (Mikolov et al., 2019; Chelba et al.,
differ from the text corpora on which these state-                       2013; Zaremba et al., 2014; Wang and Cho, 2015;
of-the-art language models are trained, such as                          Jozefowicz et al., 2016) despite various challenges
Wikipedia and BookCorpus. In this paper, towards                         (Merity et al., 2017; Yang et al., 2017). The se-
the goal of developing large multitask models for                        quential nature of RNN-based models precludes
different legal applications, we first demonstrate                       parallelization within training examples which
the successful use of pre-trained language models                        makes scaling to long sequence lengths and large
transferred to a legal domain task.                                      corpora challenging. The Transformer architec-
   We focus on the task of litigation code classifi-                     ture, relying on stacked self-attention and point-
cation, illustrated in Table 1, which is an important                    wise, fully-connected layers, allows for signifi-
sub-task in legal time-recording and for preparing                       cantly more parallelization (Vaswani et al., 2017).
bills of costs for assessment by the courts. We
                                                                            One approach to developing deep architectures
base our approach on fine-tuning BERT (Bidirec-
                                                                         for specific language tasks has been to exploit fea-
tional Encoder Representations from Transform-
                                                                         ture representations learned from large datasets of
ers), a transformer-based language representation
                                                                         general purpose data such as Wikipedia. These
model, (Devlin et al., 2018) and our evaluation
                                                                         pre-trained approaches are now key components
shows that a single pre-trained model achieves sig-
                                                                         in many natural language applications (Mikolov
nificant performance gains over the next best clas-
                                                                         et al., 2013). These concepts have also been ex-
sifier on the test set.
                                                                         tended to the legal domain, including the creation
                                                                         of the Law2Vec legal word embeddings, which
2     Related Work                                                       is likely to accelerate the progress in this research
Text classification is a category of Natural Lan-                        area (Chalkidis and Kampas, 2019).
guage Processing (NLP) tasks with real-world ap-                            There are generally two strategies for applying
plications such as spam detection, fraud identi-                         pre-trained language models to downstream tasks:
fication (Ngai et al., 2011), and legal discovery                        feature-based and fine-tuning. The feature-based
(Roitblat et al., 2010). Formally, it is about as-                       approach, as was used in ELMo (Peters et al.,
signing a Boolean value to each pair                                     2018), learns a fixed representation, or feature
$\langle d_j, c_i \rangle \in D \times C$ (Sebastiani, 2002),            space, on a large text corpus. More specifically,
where $D$ in our example is a domain of narrative                        ELMo develops a coupled forward LM and back-
documents and $C = \{c_1, \dots, c_{|C|}\}$ is a set of                  ward LM approach as well as a linear combination
J-Codes such that we obtain a decision value for each                    of the hidden representations stacked above each
narrative document $d_j$ being classed as $c_i$.                         input word for each end task, and markedly im-
                                                                         proves performance over just using the top LSTM
1
    https://gluebenchmark.com/leaderboard                                layer representation.
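
The feature-based strategy described above for ELMo, a task-specific linear combination of the hidden representations stacked above each input word, can be sketched as follows. This is a minimal illustration rather than the ELMo implementation: the function names, the toy two-dimensional layer vectors and the weight values are assumptions made for the example, and the softmax normalisation of the layer weights with a scale factor gamma follows the formulation in Peters et al. (2018).

```python
import math


def softmax(weights):
    """Normalise raw per-layer weights so they are positive and sum to one."""
    peak = max(weights)  # subtract the maximum for numerical stability
    exps = [math.exp(w - peak) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def mix_layers(layer_vectors, raw_weights, gamma=1.0):
    """Combine the per-layer vectors of ONE token into a single vector.

    layer_vectors: list of equal-length lists, one per language-model layer.
    raw_weights:   one learnable scalar per layer (hypothetical values here).
    gamma:         task-specific scale applied to the mixed vector.
    """
    s = softmax(raw_weights)
    dim = len(layer_vectors[0])
    return [
        gamma * sum(s[j] * layer_vectors[j][d] for j in range(len(layer_vectors)))
        for d in range(dim)
    ]


# Toy hidden states for a single token across three layers (assumed values).
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Equal raw weights average the layers.
mixed_equal = mix_layers(layers, raw_weights=[0.0, 0.0, 0.0])

# A dominant weight on the last layer approximates "top layer only".
mixed_top = mix_layers(layers, raw_weights=[-50.0, -50.0, 50.0])

print(mixed_equal, mixed_top)
```

In a real system the raw weights and gamma would be learned jointly with the end task, while the layer vectors stay fixed.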



   The fine-tuning approach, as demonstrated in                        ple, the Issue / Statements of Case phase (JE00)
ULMFiT (Howard and Ruder, 2018) and GPT                                includes the lower tier tasks of Review of Other
(Radford et al., 2018), introduces minimal task-                       Party/Opponents’ Statement of Case (JE20) and
specific parameters and is adapted for down-                           Amendment of Statement of Case (JE40). An ex-
stream tasks simply by re-learning the weights in                      ample of the distribution of J-codes used in the
one or more layers of the deep architecture.                           evaluation can be seen in Figure 1. The lowest
   In this paper, we build upon the recent release of                  tier is Action, but we do not use this granular-
BERT (Devlin et al., 2018), which makes use of a                       ity in this study. Actions specify how the work
masked language model for its pre-training objec-                      is done, Tasks inform of what is being done and
tive to learn a deep bidirectional language model.                     are further grouped by Phases. The detailed ex-
We develop our approach by fine-tuning the pre-                        planation of the J-codes structure can be found in
trained parameters for the downstream legal time-                      (Nelson and Jackson, 2014).
recording classification task.
   Text classification in the legal space has in-                      3.2     Motivation
cluded research in court ruling predictions (Sulea
                                                                       This classification task is important in the context
et al., 2017) and legal deontic modality classifi-
                                                                       of legal digital workflows because it allows law
cation (Neill et al., 2017), but the incorporation
                                                                       firms to extract value from billing data. Organiz-
of pre-trained contextual embeddings remains rel-
                                                                       ing work by Phase and Task facilitates more ef-
atively unexplored.
                                                                       fective budgeting, particularly as alternative fee ar-
3       Litigation Code Classification                                 rangements become more prevalent, and increases
                                                                       transparency across different clients and matters.
3.1       Overview                                                         Automating Phase-Task code classification
The task is a 40-class classification problem where                    also reduces administrative burden upon lawyers,
the labels are litigation J-Codes. The J-code set                      who may each record thousands of time entries in-
is one set of the Uniform Task Based Manage-                           volving these codes annually. Furthermore, the
ment System (UTBMS) codes used to classify le-                         adoption of UTBMS codes can be inconsistent
gal services performed by a legal vendor in an                         within industries or even a given firm, with some
electronic invoice submission².                                        lawyers delegating their task-based coding or as-
   The background of the J-code-set originates                         signing blocks of time entries to the same code.
from the Review of Civil Litigation Costs in Eng-                      In these cases, automation is likely to improve
land and Wales (Nelson and Jackson, 2014). A                           the quality of data collected and allow for inter-
key recommendation of the review was that a new                        department comparative analyses.
format for bills of costs be standardized to in-                           Moreover, it is possible for time entries to be
crease both the transparency of costs assessed by                      entered just once into a solicitor’s system (includ-
the courts, and the consistency in the way costs are                   ing Task and Activity codes) and then used in
presented to judges.                                                   a variety of different reporting applications, from
   The new format, designed to be produced and                         the client, to the court to the normal administrative
analyzed in digital workflows, resulted in a set of                    functions of finance and tax.
discrete J-Codes that are used to categorize work                          Lastly, the nature of billing data in an indus-
undertaken. There are three hierarchical levels                        try characterized by time-based charging means
of granularity. The highest level is the Phase.                        it is likely to be a key source of data in any multi-
Examples include Pre-Action work and Disclo-                           modal multitask system supporting task automa-
sure corresponding to J-code JC00 and JF00 re-                         tion in the digital workplace.
spectively. The intermediate level of generality                           All of the above emphasize the importance of
is the Task. Each Phase has a finite and lim-                          accuracy when assigning the codes. There are
ited number of Tasks assigned to it. For exam-                         also financial incentives, as any incorrect entries
2
    A similar set of codes has previously been developed in            may be impossible to recover from the other side
    the United States. Here the codes have been developed to           or not approved by the court. Additionally, the
    provide a common language for e-billing, under which both          time fee earners spend amending and checking the
    the law firm and the client have systems using a common
    code set for respectively the delivery and analysis of bills -     codes has to be written off and does not provide
    commonly referred to as L-codes.                                   any benefit to the law firm. Thus, automated code




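[Figure 1 plot: bar chart titled "J-Code distribution". The y-axis ticks run from 0 to 10000. The
x-axis lists the 40 J-codes, in the order extracted: JC10, JC20, JC30, JE10, JD20, JJ70, JJ20, JI10,
JF20, JK20, JD10, JF10, JG10, JL30, JJ60, JB10, JH10, JJ10, JJ50, JJ30, JE20, JA10, JF40, JE40, JE30,
JM10, JK10, JH30, JF30, JG20, JI30, JH20, JB20, JM20, JI20, JB30, JJ40, JM40, JL20, JM30.]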
Figure 1: Histogram to show the distribution of J-codes. The long tail demonstrates the class imbalance in this
dataset. This is to be expected as time entries, aggregated by type of work performed, mean that multiple time
entries could result from the services performed in a single day on a single matter.
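
The class imbalance shown in Figure 1 is the reason Section 3.4 reports both micro-averaged and macro-averaged accuracy. The sketch below illustrates the difference on toy labels, using the majority baseline of Section 4.2 as the predictor. It rests on one assumption: the per-class accuracy is read as the fraction of a class's examples that are predicted correctly. That reading is consistent with the majority baseline scoring 2.50% macro accuracy on 40 classes in Table 2, but the paper does not spell it out. The function names and toy labels are illustrative only.

```python
from collections import Counter


def micro_accuracy(y_true, y_pred):
    """Correct predictions divided by the total number of examples."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_accuracy(y_true, y_pred, classes):
    """Average of per-class accuracies, so every class counts equally.

    Assumption: per-class accuracy is the share of that class's examples
    that were predicted correctly.
    """
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    per_class = [hits[c] / totals[c] for c in classes if totals[c] > 0]
    return sum(per_class) / len(per_class)


# Toy, deliberately imbalanced label set (not the paper's data).
y_true = ["JC10"] * 6 + ["JE20"] * 3 + ["JM30"]
classes = sorted(set(y_true))

# Majority baseline: always predict the most common class.
majority_class = Counter(y_true).most_common(1)[0][0]
y_pred = [majority_class] * len(y_true)

micro = micro_accuracy(y_true, y_pred)
macro = macro_accuracy(y_true, y_pred, classes)
print(f"micro={micro:.3f} macro={macro:.3f}")
```

On these toy labels the majority baseline looks respectable on micro accuracy while its macro accuracy collapses to one over the number of classes, which is the pattern the two metrics are meant to expose.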


assignment can lead to significant improvements                              3.4         Evaluation Metrics
in productivity, even if the output still has to be                          This is a multi-class classification problem, with
reviewed by the legal professional.                                          significant class imbalance so we evaluate on both
                                                                             micro-averaged accuracy and macro-averaged ac-
3.3    Data
                                                                             curacy in a one-vs-all setting.
The data is a collection of narratives from a le-                               The micro-averaged accuracy is computed by
gal firm’s proprietary set spanning more than 1500                           aggregating the contributions of all the classes to
matters and 300 timekeepers. Due to its sensi-                               compute the average by taking the number of cor-
tive nature, the data has been anonymized using a                            rect predictions divided by the total number of ex-
Named Entity Recognition (NER) algorithm that                                amples.
identifies and redacts the names of people, organi-                             The macro-averaged accuracy considers the
zations, and locations, among other entity types in                          computation of the accuracy for each individual
the form of a word mask. This algorithm combines                             class independently (class average), followed by
machine learning based on linguistic features with                           taking the average across classes (hence treating
stricter pattern-based exclusions. Another effect                            all classes equally). This is useful for understand-
of preprocessing data with the NER algorithm is                              ing how the system performs across each class de-
to ensure a higher degree of model generalisabil-                            spite the limited data points for particular classes.
ity, since it is not trained based on specific proper
nouns which may be present in the vocabulary at                              4       Models
training time but not at test time. This can be seen                         To demonstrate any improved performance from
in Figure 2 where we can see high mask counts for                            the use of pre-trained contextual embeddings on
MASK_PERSON and MASK_ORG.                                                    this domain specific task we benchmark perfor-
   The data has been cleaned by a heuristic                                  mance against a variety of different baseline mod-
whereby blocks of time entries from the same                                 els.
timekeeper assigned almost exclusively to the
same phase-task code combination were excluded.                              4.1         Random Baseline
Despite this process, classes in the data set remain                         The random baseline simply predicts a random
relatively imbalanced, with about one third of en-                           class for any given data point. As such, we ex-
tries assigned to the most common phase code and                             pect the micro-averaged accuracy to be roughly
one fifth of entries assigned to the most common                             $1/\text{num\_classes}$.
task code.
   The data set consists of 51,948 examples split                            4.2         Majority Baseline
into training, development, and testing sets using                           We present a majority baseline which predicts the
80%/10%/10% split ratios respectively.                                       most common class (JC10) for any given data




Figure 2: Histogram to show the distribution of vocabulary, including word masks. We can see that the person
(MASK_PERSON) and organization (MASK_ORG) masks are more frequent.
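
The word-mask redaction of Section 3.3 and the vocabulary counts summarised in Figure 2 can be illustrated with a small sketch. The paper's anonymisation combines a machine-learning NER model with pattern-based exclusions; the lookup table below is only a hypothetical stand-in for that model, and the names, narratives and `redact` helper are invented for the example. Only the mask tokens MASK_PERSON and MASK_ORG come from the paper.

```python
import re
from collections import Counter

# Hypothetical lexicon standing in for a trained NER model (assumption).
ENTITY_LEXICON = {
    "alice smith": "MASK_PERSON",
    "acme ltd": "MASK_ORG",
}


def redact(narrative, lexicon=ENTITY_LEXICON):
    """Replace every known entity mention with its word mask."""
    redacted = narrative
    for surface, mask in lexicon.items():
        pattern = re.compile(r"\b" + re.escape(surface) + r"\b", flags=re.IGNORECASE)
        redacted = pattern.sub(mask, redacted)
    return redacted


# Invented narratives, loosely modelled on the style of Table 1.
narratives = [
    "Preparing documents for meeting with Alice Smith",
    "Email exchange with Acme Ltd and Alice Smith",
]

redacted = [redact(n) for n in narratives]

# Vocabulary counts over the redacted text, masks included.
vocab = Counter(token for text in redacted for token in text.split())
print(redacted)
print(vocab.most_common(3))
```

Because every person or organisation collapses to a single token, the masks accumulate high counts, and the classifier cannot latch onto proper nouns seen only at training time.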


point.                                                                 various fine-tuning experiments. BERT is de-
                                                                       signed to learn deep bidirectional representations
4.3     Surface Logistic Regression                                    by jointly conditioning on both left and right con-
We featurise the narratives for the surface mod-                       text in all layers through a masked language model
els by normalising the input narratives and con-                       objective. Pre-trained BERT representations are
verting to a Bag-of-Words (BOW) sparse repre-                          publicly available for download and can be fine-
sentation. In addition, we also experiment with                        tuned with just one task-specific output layer to
character and word tokenisation, removal of stop-                      create state-of-the-art models for a wide range
words and TF-IDF feature reweighting but ob-                           of tasks (Devlin et al., 2018). We experiment
serve best performance on bigram-enhanced BOW                          with the uncased and cased versions of pre-trained
features tokenised at word level while retaining                       BERTBASE which is a 12-layer transformer ar-
stopwords. A logistic regression model is ap-                          chitecture with a hidden size of 768 and 12 self-
plied to the featurised input in a one-versus-rest                     attention heads adding up to 110 million param-
multi-class scheme with an L2 weight regularisa-                       eters, and the uncased version of BERTLARGE
tion penalty.                                                          which is a 24-layer transformer architecture with
                                                                       a hidden size of 1024 and 16 self-attention heads
4.4     XGBoost Baseline                                               adding up to 340 million parameters, both trained
As a final baseline, we use the scalable gradient-                     on a combined BookCorpus and Wikipedia corpus
boosting implementation XGBoost (Chen and                              of 3.3 billion words on 4 × 4 and 8 × 8 TPU slices
Guestrin, 2016), which has been used on vari-                          respectively for 4 days.
ous text classification tasks with strong perfor-                         We fine-tune the models on an AWS
mance results based on additive tree-based opti-                       p2.xlarge instance running a single NVIDIA
misation. As with the logistic regression baseline,                    K80 GPU. We adapt the BERT fine-tuning mech-
we performed pre-processing based on stopword-                         anism for single sentence classification tasks to
removal, TF-IDF weighting, and n-gram selec-                           the matter classification task.
tion. We also experimented with lemmatisation
and case standardisation to achieve highest model                      4.6     Chronology-enhanced models
performance.                                                           In principle, any production system for time-
                                                                       recording can take account of additional informa-
4.5     BERT Models
                                                                       tion to support the classification task. The J-codes
We work with the HuggingFace3 PyTorch imple-                           set has ordinal structure resulting from the pro-
mentation of BERT (Bidirectional Encoder Rep-                          gression of Phases and Tasks during the case,
resentations from Transformers) model and run                          and any specific time-entries also have temporal
3
    https://github.com/huggingface/                                    structure that can be exploited.
    pytorch-pretrained-BERT                                               As a result of this, we can significantly im-




Figure 3: Confusion matrices. a) BERTBASE (Uncased) b) BERTLARGE (Uncased) c) XGBoost. We can see
that BERTLARGE is better at classifying class JC10, particularly against JM30.
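
The confusion matrices in Figure 3 support the error analysis in Section 5, which looks for class pairs that a model commonly mistakes for one another. A minimal sketch of that kind of inspection is given below. The labels and predictions are invented toy values chosen to mimic the JM30 versus JC10 and JG10 versus JG20 confusions the paper reports, and the helper names are illustrative rather than taken from the authors' code.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))


def top_confusions(y_true, y_pred, k=3):
    """Return the k most frequent misclassified (true, predicted) pairs."""
    errors = Counter(
        {pair: n for pair, n in confusion_counts(y_true, y_pred).items() if pair[0] != pair[1]}
    )
    return errors.most_common(k)


# Toy ground truth and predictions (assumed values, not the paper's results).
y_true = ["JM30", "JM30", "JC10", "JG10", "JG20", "JM30"]
y_pred = ["JC10", "JC10", "JC10", "JG20", "JG20", "JM30"]

matrix = confusion_counts(y_true, y_pred)
worst = top_confusions(y_true, y_pred, k=2)
print(worst)
```

Comparing the top confused pairs of two models is one simple way to check whether their error patterns differ enough to justify an ensemble.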


prove model performance by incorporating fea-                            mistake for others. We find that both the XGBoost
tures based on the set of codes typically associ-                        text-based model and the BERTBASE model
ated with a user or matter. Therefore, we include a                      commonly predict the most common class JC10
chronology-enhanced XGBoost model in our anal-                           (Factual Investigation: Work required to under-
ysis to set any performance improvements in con-                         stand the facts of the case including instructions
text.                                                                    from the client and the identification of potential
   Care is taken to verify that the model behavior                       witnesses) when the ground truth is JM30 (Hear-
is not to simply repeat the last code on a given                         ings: Includes preparation for and attendance at
matter by setting chronology-based features to                           hearings for directions and interim certificate ap-
zero, obtaining predictions from the chronology-                         plications as well as the detailed assessment it-
enhanced model, and confirming that the differ-                          self ). We also observe that all text-based mod-
ence in micro-accuracy is not greater than five per-                     els have difficulty distinguishing between JG10
cent relative to the purely text-based model.                            (Taking, preparing and finalising witness state-
                                                                         ment(s)) and JG20 (Reviewing Other Party(s)’
5    Results and Discussion                                              witness statement(s)). It is likely that this can be
Results for the different models are presented in                        explained to some extent by the text anonymisa-
Table 2. We observe substantial performance                              tion.
improvements of BERT models over the text-                                  We can also see that there are different error
based baselines as well as the XGBoost text-                             patterns between the BERT and XGBoost mod-
based model, particularly with regard to macro-                          els and therefore we are likely to be able to im-
accuracy.                                                                prove performance in a production system using
   The best performing BERT single model                                 an ensemble approach. Furthermore, in addition
achieves performance gains of 2.23 micro-                                to the Task level results above, results on the
averaged accuracy points and 9.39 macro-                                 Phase level are encouraging for use in produc-
averaged accuracy points over the XGBoost text-                          tion, with a micro-accuracy rate of 90.40 percent
only classifier on the test set. This is likely to have                  for the chronology-enhanced XGBoost model. In
a strong effect on user experience of a production                       some cases, such data is already sufficiently granu-
system as it indicates substantially better perfor-                      lar to derive actionable firm budgeting insights and
mance on less common classes. It also demon-                             an improvement over existing manual methods.
strates the effectiveness of pre-trained methods to
                                                                         6   Conclusion and Future Work
incorporate prior knowledge and learn on low-
resource data, despite the linguistic differences be-                    Recent empirical improvements due to transfer
tween the pre-trained and legal domains.                                 learning with language models have demonstrated
   We also perform an in-depth error analysis, in-                       that rich, unsupervised pre-training is an inte-
cluding visual inspection of different model pre-                        gral part of many language understanding systems.
dictions and confusion matrices (see Figure 3) to                        Here we present experiments and analysis of state-
understand which classes the models commonly                             of-the-art models based on deep pre-trained con-
Pre-trained Contextual Embeddings for Litigation Code Classification   LegalAIIA Workshop, ICAIL '19, June 17, 2019, Montreal, Quebec, Canada



                Model                                                         Micro Acc. (%)    Macro Acc. (%)
                Random Baseline                                                    2.02              2.26
                Majority Baseline                                                 19.96              2.50
                Surface Random Forest                                             42.66             28.49
                Surface Logistic Regression                                       45.87             32.30
                Surface Logistic Regression (enhanced with bigram features)       51.78             39.30
                XGBoost                                                           53.15             36.65
                BERT Base (Uncased)                                               55.17             44.28
                BERT Base (Cased)                                                 55.38             46.04
                BERT Large (Uncased)                                              54.17             45.25
                XGBoost (Chronological features)                                  77.11             61.51

       Table 2: Results of the models on the test set. We can see increased performance over baseline models.
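
To illustrate the two metrics reported in Table 2, the following minimal sketch computes micro- and macro-averaged accuracy. It assumes that macro accuracy is the unweighted mean of per-class accuracy, an interpretation that is consistent with the Majority Baseline scoring 2.50 percent (1/40) on the 40-class task but is not confirmed here. The function names and the toy labels are hypothetical and are not taken from the experimental code.

```python
from collections import defaultdict


def micro_accuracy(y_true, y_pred):
    """Fraction of all examples that are labelled correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_accuracy(y_true, y_pred):
    """Unweighted mean of per-class accuracy (assumed definition)."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    # Every true class contributes equally, regardless of its frequency.
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy, imbalanced example: a majority-class predictor (hypothetical data).
y_true = ["JE20"] * 8 + ["JH30", "JJ70"]
y_pred = ["JE20"] * 10

print(f"micro={micro_accuracy(y_true, y_pred):.2f}")  # micro=0.80
print(f"macro={macro_accuracy(y_true, y_pred):.2f}")  # macro=0.33
```

On imbalanced data the two metrics diverge sharply: the majority-class predictor above looks strong on micro accuracy yet scores poorly on macro accuracy, which is why gains in macro accuracy indicate better performance on less common classes.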


textual embeddings applied to the task of litigation code classification. We
show that BERT fine-tuned to the 40-class matter classification task provides
substantial performance gains over our best-performing baseline.
   One area to explore further is to incorporate the chronology-based features
into a BERT-centric approach. For example, one approach could be to learn
contextual embeddings for text over a temporal set of J-codes. Another could
be to ensemble the predictions of a purely chronology-based model with the
BERT output.
   We achieve our primary goal of demonstrating that pre-trained language
knowledge can be transferred from a general corpus to a legal-domain task,
with improved performance.
   Notwithstanding this fine-tuning result, in future work we intend to extend
this by learning contextualised representations from legal corpora, a
direction that has achieved some success in other domains (Lee et al., 2019)
and which could be applied across a wide variety of tasks in the legal domain.
   Moreover, although we have explored the use of a multi-task learning
framework, we have only demonstrated performance on a single legal task.
Future work will likely include extending this analysis to a set of legal
benchmark tasks that include natural language inference tasks (similar to
GLUE) on publicly available legal datasets.
   Given the relatively high degree of class imbalance present in Phase and
Task codes, as well as the level of legal expertise involved in distinguishing
closely related or rarer options, this classification problem lends itself
well to human-in-the-loop machine learning. Such an active learning platform
would involve feeding timekeeper-validated data back into the model for
near-real-time retraining. This method of data collection may also achieve a
scale conducive to learning the contextualised legal-corpora representations
mentioned above.

Acknowledgements
We thank Edwin Zhang and Brandon Hill at Ping Inc. for their assistance in the
data preparation, baseline modeling, and chronology enhancements.

References
Ilias Chalkidis and Dimitrios Kampas. 2019. Deep learning in law: early
   adaptation and legal word embeddings trained on large corpora. Artificial
   Intelligence and Law, 27(2):171–198.

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, and
   Phillipp Koehn. 2013. One Billion Word Benchmark for Measuring Progress in
   Statistical Language Modeling. arXiv preprint arXiv:1312.3005.

Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting
   System. https://arxiv.org/abs/1603.02754.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc V. Le, and
   Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive Language Models
   Beyond a Fixed-Length Context. arXiv preprint arXiv:1901.02860.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT:
   Pre-training of deep bidirectional transformers for language understanding.
   arXiv preprint arXiv:1810.04805.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning
   for text classification. arXiv preprint arXiv:1801.06146.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag
   of Tricks for Efficient Text Classification. Proceedings of the 15th
   Conference of the European Chapter of the Association for Computational
   Linguistics: Volume 2, Short Papers, pages 427–431.

Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu.
   2016. Exploring the limits of language modeling. arXiv preprint
   arXiv:1602.02410.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So,
   and Jaewoo Kang. 2019. BioBERT: a pre-trained biomedical language
   representation model for biomedical text mining. Bioinformatics, 1.

Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2017. Regularizing
   and optimizing LSTM language models. arXiv preprint arXiv:1708.02182.

Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev
   Khudanpur. 2019. Recurrent neural network based language model.
   INTERSPEECH, 2:1045–1048.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013.
   Distributed representations of words and phrases and their
   compositionality. NIPS.

James O’Neill, Paul Buitelaar, Cecile Robin, and Leona O’Brien. 2017.
   Classifying Sentential Modality in Legal Language: A Use Case in Financial
   Regulations, Acts and Directives. Proceedings of the 16th Edition of the
   International Conference on Artificial Intelligence and Law (ICAIL ’17).

David Nelson and Jackson. 2014. EW-UTBMS Civil Litigation J-Code Set Overview
   and Guidelines.

EWT Ngai, Yong Hu, YH Wong, Yijun Chen, and Xin Sun. 2011. The application of
   data mining techniques in financial fraud detection: A classification
   framework and an academic review of literature. Decision Support Systems,
   50(3):559–569.

Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark,
   Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word
   representations. arXiv preprint arXiv:1802.05365.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018.
   Improving language understanding by generative pre-training. URL
   https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.

Alexander Ratner, Stephen H. Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and
   Christopher Ré. 2017. Snorkel: Rapid training data creation with weak
   supervision. Proceedings of the VLDB Endowment, 11(3):269–282.

Alexander Ratner, Braden Hancock, Jared Dunnmon, Frederic Sala, Shreyash
   Pandey, and Christopher Ré. 2019. Training Complex Models with Multi-Task
   Weak Supervision. AAAI.

Herbert Roitblat, Anne Kershaw, and Patrick Oot. 2010. Document categorization
   in legal electronic discovery: computer classification vs. manual review.
   Journal of the Association for Information Science and Technology,
   61(1):70–80.

Fabrizio Sebastiani. 2002. Machine Learning in Automated Text Categorization.
   ACM Computing Surveys, 34(1):1–47.

Octavia-Maria Sulea, Marcos Zampieri, Shervin Malmasi, Mihaela Vela, Liviu P.
   Dinu, and Josef van Genabith. 2017. Exploring the use of text
   classification in the legal domain. Proceedings of 2nd Workshop on
   Automated Semantic Analysis of Information in Legal Texts.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan
   N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You
   Need. 31st Conference on Neural Information Processing Systems (NIPS 2017).

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel
   R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for
   natural language understanding. arXiv preprint arXiv:1804.07461.

Tian Wang and Kyunghyun Cho. 2015. Larger-context language modelling. arXiv
   preprint arXiv:1511.03729.

Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W. Cohen. 2017.
   Breaking the softmax bottleneck: A high-rank RNN language model. arXiv
   preprint arXiv:1711.03953.

Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent Neural
   Network Regularization. arXiv preprint arXiv:1409.2329.

Xiang Zhang, Junbo Zhao, and Yann Lecun. 2015. Character-level Convolutional
   Networks for Text Classification. Advances in Neural Information Processing
   Systems 28, pages 649–657.