



       Cross-Domain Generalization and
      Knowledge Transfer in Transformers
            Trained on Legal Data
    Jaromı́r ŠAVELKA a,1 , Hannes WESTERMANN b , and Karim BENYEKHLEF b
                a School of Computer Science, Carnegie Mellon University
           b Cyberjustice Laboratory, Faculté de droit, Université de Montréal



             Abstract. In this paper we examine the ability of pre-trained language models to
             generalize beyond the legal domain and dataset they were trained on. We transform
             three separate datasets of annotated legal cases with different type systems to use
             a single type system (namely, determining whether a sentence describes facts or
             not). Then, we train SVM and RoBERTa models on the different source datasets and
             measure their performance on the remaining datasets. We also create models trained
             on combined datasets and measure their performance. Overall, we find that the
             performance of the models is surprisingly high, despite being trained and evaluated
             across datasets from different domains and jurisdictions. The RoBERTa model seems
             to be able to benefit from its pre-trained language representation to better capture
             the abstract idea behind the new type system. Further, the models benefit from
             training on multiple datasets. This potentially opens the door to models trained on
             separate datasets that perform better in previously unseen domains, and to overcoming
             cold-start problems when using active learning methods.
             Keywords. Transformers, cross-domain generalization, case-law, rhetorical roles,
             sentence classification




1. Introduction

In this paper we examine the ability of pre-trained language models to generalize beyond
the legal domain and dataset they were trained on. We explore if and how different
datasets and type systems can be re-cast in such a way that a language model trained on
one dataset can be successfully utilized on other datasets. As high-quality legal datasets
are scarce, the efficient use of those that are available is of utmost importance in
AI & Law research.
     Written legal cases typically follow a certain pattern in their reasoning. For example,
legal cases often contain the following sections:
     1. Claims of the plaintiff and responses/counter-claims of the defendant

  1 E-mail: jsavelka@andrew.cmu.edu



 Proceedings of the 2020 Workshop on Automated Semantic Analysis of Information in Legal Text (ASAIL)
                             Copyright © 2020 for this paper by its authors.
        Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


    2. A description of factual circumstances of the case as seen by the parties and as
       determined by the court
    3. Legal rules applicable to the factual circumstances
    4. Application of the rules to the factual circumstances
    5. Court’s conclusions and outcomes of the case
The sections play different roles in a decision, and could carry different meaning depen-
dent on the context in which they are read. A judge, a law student, a party to a dispute, or
a legal professional could all read the different sections with different focus depending
on what they need to learn. It is an indispensable skill for any professional lawyer to
quickly identify and understand the different sections of a case.
     Due to its prominence, the task has attracted abundant research in AI & Law (see
e.g. [14,17,1]). Researchers have worked on automating the identification of the different
sections at the sentence, paragraph, or other level, using sophisticated ML models. The
automatic segmentation of cases could serve as a reading aid for actors in the legal world,
as a way to improve legal search results, and as an important prerequisite for future
research in AI & Law focused on specific sections of the cases.
     Typically, research in prediction of rhetorical roles includes dataset creation and
training of ML models on that dataset. The dataset usually focuses on one or several
domains. The model is then tested against a test set, typically sampled randomly from
the dataset. This shows the predictive capability of the model on test data from the same
distribution as the training data.
     However, it can be difficult to evaluate how well a model would perform if applied to
other, previously unseen datasets that might have different distributions, i.e. whether the
model is able to generalize beyond the annotated dataset. For example, it is possible that
a model has only learned the specific way a single domain or jurisdiction describes factual
circumstances, rather than the abstract idea of what a sentence describing factual
circumstances looks like. If the goal is to apply the model in a general fashion, on cases
from different jurisdictions, domains and countries, it is therefore important to evaluate
the performance of the system on datasets it has not seen at all during training.
     Evaluating the models on multiple datasets, however, is not trivial. Firstly, publicly
available datasets are scarce in AI & Law. Further, the datasets that are available typically
use custom type systems, making cross-dataset evaluation challenging. In this paper, we
identify an example meta type system that generalizes across three datasets. We use the
meta type system to train models on data from different domains and jurisdictions and
assess the performance of the models across the domains.
     We compare the ability of different types of ML models to generalize from one
dataset to another. First, as a baseline, we use a Support Vector Machine classifier which
learns correlations between word and/or n-gram occurrences and a certain label. Then,
we evaluate the more recently developed BERT model (Bidirectional Encoder Representations
from Transformers). This model has been pre-trained on massive corpora of text to
learn a language model. It can then be further fine-tuned on a down-stream task, such as
text classification in a specific domain, achieving strong results with little training data
by leveraging the previously learned language representation.


2. Hypotheses

By conducting the experiments described in Section 4, we investigate the following hy-
potheses:
     • H1 - Machine learning models are able to generalize the knowledge learned from
       one dataset and apply it successfully on another dataset.
     • H2 - The BERT model is able to leverage its understanding of language to abstract
       beyond the specific domain vocabulary, thereby showing better performance in
       generalizing across the domains than the SVM model.
     • H3 - Training the models on pooled data from multiple datasets improves perfor-
       mance and results in more robust models.
Showing that the models generalize well among different domains could have impor-
tant implications for the real-world applicability of the methods. Further, they could be
employed in active or interactive learning settings, where annotators are presented with
predictions by the model and confirm/correct these to create accurate models. This could
drastically lower the number of expensive annotations that need to be performed to start
the annotation of novel datasets.


3. Related Work

The identification of the rhetorical roles of sentences in legal cases has been investigated by
several researchers. The authors of [14] used Conditional Random Field models with custom
features to predict the rhetorical role of sentences in three distinct domains. Walker
et al. [17] trained a number of machine learning models to predict the rhetorical role of
sentences in decisions from the U.S. Board of Veterans' Appeals and compared them to rule-
based approaches. In [22] the researchers created an interface to easily create Boolean
search rules for sentence classification. Bhattacharya et al. [1] created a dataset of decisions
from the Indian Supreme Court covering five domains, and trained Hierarchical BiLSTM
CRF models, achieving good results across domains. These papers trained and evaluated
the models on the same datasets. The contribution in this paper is that we investigate the
capability of models trained on one dataset to generalize to other datasets. Therefore,
we re-cast three datasets (including the publicly available datasets from [17] and [1])
into a meta type system. Then we investigate the ability and respective performance of
SVM and RoBERTa models trained on a single or multiple datasets to generalize to the
other datasets. To our knowledge, training models for the classification of legal texts on
one dataset and evaluating the performance on other datasets has not been extensively
investigated previously, and is thus a novel contribution.
     We further investigate whether BERT-based models are able to exploit their pre-
trained language model to achieve higher performance in generalizing across datasets.
There are numerous examples of the use of BERT on legal texts. In [3] BERT is evaluated on
classification of claim acceptance given judges' arguments. The task of retrieving case law
similar to a decision provided by a user is tackled in [11]. In [2] BERT is evaluated as one
of the approaches to predict the outcome of a court decision given the facts of a case.
BERT has been successfully used for classification of legal areas of Supreme Court
judgments [8]. The authors of [10] combine BERT with a simple similarity measure to
tackle the challenging task of case law entailment. BERT was also used in learning-to-
rank settings for retrieval of legal news [12] and case-law sentences interpreting statutory
concepts [13]. In [20], the researchers investigate the additional fine-tuning of language
models on related tasks to improve performance in the analysis of legal entailment.
     We show that models trained on one dataset can, to some extent, perform predictions
on other datasets. This could be used to bootstrap new datasets in other domains,
decreasing the annotation effort. This follows a steady line of work in AI & Law
on making annotation more effective. Westermann et al. [22] proposed and assessed a
method for building strong, explainable classifiers in the form of Boolean search rules.
Employing an intuitive interface, the user develops Boolean rules for matching instead
of annotating the individual sentences. In [21] a method for using pre-trained language
models to identify semantically similar sentences suitable for annotation is proposed.
Šavelka and Ashley [16] evaluated the effectiveness of an approach where a user labels
the documents by confirming (or correcting) the prediction of a ML algorithm (interac-
tive approach). The application of active learning has further been explored in the context
of classification of statutory provisions [19] and eDiscovery [4,5,7].


4. Experimental Design

In order to evaluate the ability of models to generalize beyond a single domain, we employ
an experimental design consisting of several steps. First, we identify three datasets
containing a categorization of sentences by rhetorical role (Section 4.1). Then, we identify
a meta type system into which all the source type systems can be transformed, creating a
task shared among the datasets (Section 4.2). We then fine-tune and evaluate a pre-trained
language model and a Support Vector Machine model (Section 4.4) on different combinations
of these datasets (Sections 4.4 and 4.5).

4.1. Data

In this work we utilize three datasets. The first one comes from [17]. The authors ana-
lyzed 50 fact-finding decisions issued by the U.S. Board of Veterans’ Appeals (“BVA”)
from 2013 through 2017, all arbitrarily selected cases dealing with claims by veterans
for service-related post-traumatic stress disorder (PTSD). For each of the 50 BVA deci-
sions in the PTSD dataset, the researchers extracted all sentences addressing the factual
issues related to the claim for PTSD, or for a closely-related psychiatric disorder. The
sentences were tagged with the rhetorical roles [18] they play in the decision: Finding,
Reasoning, Evidence, Legal Rule, and Citation.2
     The second dataset also focuses on the rhetorical roles of sentences. Bhattacharya
et al. [1] analyzed 50 opinions of the Supreme Court of India (“ISC”). The cases were
sampled from five different domains in proportion to their frequencies (criminal, land and
    2 Dataset available at https://github.com/LLTLab/VetClaims-JSON


                       Table 1. Label distributions of the datasets used in this work.
                                             BVA      CB         ISC     Total
                             Facts          2,420   4,182      2,219     8,821
                             Non-Facts      3,733   5,783      9,380    18,896
                             Total          6,153   9,965     11,599    27,717

property, constitutional, labor and industrial, and intellectual property). The decisions
were split into 9,380 sentences and manually classified into one of seven categories
according to the rhetorical roles they play in a decision: Facts, Ruling (lower court),
Argument, Ratio, Statute, Precedent, and Ruling (present court).3
     We also created a brand-new dataset by scraping the case briefs from a publicly
available Case Brief Summary database (“CB”).4 The case briefs were categorized in
terms of the areas of regulation, such as administrative law, business law, or criminal
law (11 categories in total). In total, we were able to obtain 715 unique case briefs.
The case briefs are structured into a number of sections with headings. We used an
extensive battery of regular expressions to segment the retrieved briefs into individual
sections. While there were over 100 unique section heading names, we were able to
identify six main types to which we could map many of the different variations (e.g., all
of Legal Issue, Issues, and Issue map to a single category). We applied a specialized
legal case sentence boundary detection system [15] to segment the sections into sentences.
This resulted in a dataset comprising 9,965 sentences, each with one of the six labels
corresponding to the section of the brief where the sentence occurred. The possible
sections are Facts, Issue, Conclusion, Procedural History, Reasoning, and Rule.
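
To illustrate the heading normalization step, the sketch below shows, in Python, the kind of regular expressions involved. The specific patterns and heading variants shown here are illustrative assumptions; the actual battery of expressions used for the CB dataset was considerably more extensive.

import re

# Hypothetical examples of heading-normalization patterns; the real battery
# of regular expressions used to segment the case briefs was larger.
HEADING_PATTERNS = {
    "Facts": re.compile(r"^\s*(facts|factual background)\s*:?\s*$", re.I),
    "Issue": re.compile(r"^\s*(legal issues?|issues?)\s*:?\s*$", re.I),
    "Conclusion": re.compile(r"^\s*(conclusions?|holding)\s*:?\s*$", re.I),
    "Procedural History": re.compile(r"^\s*procedural history\s*:?\s*$", re.I),
    "Reasoning": re.compile(r"^\s*(reasoning|analysis|discussion)\s*:?\s*$", re.I),
    "Rule": re.compile(r"^\s*(rules?|rule of law)\s*:?\s*$", re.I),
}

def normalize_heading(heading):
    # Map a raw section heading to one of the six main types, or None if the
    # heading variant cannot be mapped.
    for label, pattern in HEADING_PATTERNS.items():
        if pattern.match(heading.strip()):
            return label
    return None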

4.2. Task

The datasets described in Section 4.1 use different annotation type systems and stem from
different domains and decision makers. However, parts of the type systems overlap, making
it possible to map certain types to a new meta type system that allows models to be
trained and evaluated across datasets.
     In order to map the source type systems into a single type system, we identify for
each source dataset the label that most closely corresponds to a description of the factual
circumstances of a case. These sentences are tagged as “Facts”. All other sentences
are tagged as “Non-Facts”. The labels that we chose to represent facts in the different
datasets are as follows:
     1. BVA - “Evidence” - defined as a sentence that primarily states the content of the
        testimony of a witness, states the content of documents introduced into evidence,
        or describes other evidence.
     2. CB - “Facts” - Sentences stemming from a section with the title “Facts”, typically
        describing the factual background to a case.
     3. ISC - “Facts” - refers to the chronology of events that led to filing the case, and
        how the case evolved over time in the legal system (e.g., First Information Report
        at a police station, filing an appeal to the Magistrate).
  3 Dataset available at github.com/Law-AI/semantic-segmentation
  4 http://www.casebriefsummary.com/; Note that as of 2020-11-29 the site is no longer available.


After transforming the datasets into this schema, the distributions of Facts vs Non-Facts
are as described in Table 1 for each dataset. While the source type systems are different,
and the definitions of the corresponding types are not the same, for the purposes of the
training and evaluation we assume that the new type system describes the same task when
applied across the three datasets (i.e., we assume that we have transformed three related
tasks into a unified one).
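
To make the transformation concrete, a minimal Python sketch of the mapping is shown below. The label strings follow the descriptions above; the exact label strings used in the released datasets are an assumption.

# Source labels that are collapsed into the "Facts" meta label; everything
# else becomes "Non-Facts".
FACTS_LABELS = {
    "BVA": {"Evidence"},
    "CB": {"Facts"},
    "ISC": {"Facts"},
}

def to_meta_label(dataset, source_label):
    # Collapse a dataset-specific rhetorical role into the meta type system.
    return "Facts" if source_label in FACTS_LABELS[dataset] else "Non-Facts"

# Example: to_meta_label("BVA", "Citation") returns "Non-Facts".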

4.3. Dataset splits

We randomly split all the datasets into training, validation, and test sets using ratios of
50-25-25. The splitting for all three datasets is performed at the level of documents, i.e.,
all the sentences from a single document are in the same fold. We did not take the size
of a document into account, meaning that the sentence-level proportions of the splits could
deviate slightly from these ratios. The training split of each dataset is used for training
the models, the validation split is used for model selection and hyperparameter
optimization, and the test split is used for the final evaluation.
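
A minimal sketch of a document-level split consistent with this description is shown below; the exact splitting code and random seed used in the experiments are not specified here, so both are assumptions.

import random

def document_level_split(documents, seed=0, ratios=(0.5, 0.25, 0.25)):
    # `documents` is a list of documents, each a list of (sentence, label)
    # pairs. All sentences of a document end up in the same fold. The ratios
    # apply to documents, so sentence counts per fold can deviate slightly
    # from 50-25-25.
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    n_train = int(ratios[0] * len(docs))
    n_val = int(ratios[1] * len(docs))
    train = docs[:n_train]
    val = docs[n_train:n_train + n_val]
    test = docs[n_train + n_val:]
    return train, val, test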

4.4. Models

In this work, we use RoBERTa (a Robustly Optimized BERT Pretraining Approach), described
in [9], as the starting point for our experiments.5 Out of the available models we chose to
work with the smaller roberta.base model, which has 125 million parameters, in order to be
able to iterate the experiments faster than with the larger models. RoBERTa uses the same
architecture as BERT [6]. However, the authors of [9] conducted a replication study of
BERT pre-training and found that BERT was significantly undertrained. They used the
insights thus gained to propose a better pre-training procedure.
     As a baseline, we use a Support Vector Machine (SVM) classifier. SVM constructs a
hyper-plane in a high-dimensional space, which is used to separate the classes from each
other. As the implementation of SVM we use scikit-learn's Support Vector Classification
module.6 As features we use a bag-of-words representation of (1-3)-grams weighted by TF-IDF.

4.5. Experiments

We train the models on all the different possible pools of training data. The possible pools
are BVA, CB, ISC, BVA+CB, BVA+ISC, CB+ISC, and BVA+CB+ISC. Both RoBERTa
and SVM are trained on each of these pools separately, and the results are presented for
prediction on the test split of each dataset (Section 5).
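
The seven pools correspond to all non-empty combinations of the three training sets, which can be enumerated as in the short sketch below.

from itertools import combinations

DATASETS = ["BVA", "CB", "ISC"]

# All non-empty combinations of the three training sets.
POOLS = [
    "+".join(pool)
    for size in range(1, len(DATASETS) + 1)
    for pool in combinations(DATASETS, size)
]
# -> ['BVA', 'CB', 'ISC', 'BVA+CB', 'BVA+ISC', 'CB+ISC', 'BVA+CB+ISC']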
     In all the experiments, we fine-tune the base RoBERTa model, with a linear
classification layer on top of the pooled output, for 10 epochs on the training splits of the
selected datasets. We use a batch size of 8, which is the maximum allowed by our hardware
setup (a 1080Ti GPU with 11 GB of memory) given that we set the maximum sequence length to
512. As the optimizer we use Adam with the initial learning rate set to 4e−5. We store a
model's checkpoint at the end of each training epoch. The checkpoints are evaluated on the
    5 github.com/pytorch/fairseq/tree/master/examples/roberta
    6 scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html


corresponding validation set. The model with the highest F1 on the validation set is then
selected as the one to make predictions on the test sets. The performance on the test sets
is what we report in the results section.
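
The sketch below outlines one fine-tuning epoch under this setup. It uses the Hugging Face transformers implementation of RoBERTa as a stand-in for the fairseq checkpoint actually used, which is an assumption; checkpoint saving and selection by validation F1 are omitted for brevity.

import torch
from transformers import RobertaForSequenceClassification, RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
optimizer = torch.optim.Adam(model.parameters(), lr=4e-5)

def train_epoch(model, batches):
    # `batches` yields (sentences, labels) with 8 sentences per batch,
    # mirroring the batch size described above.
    model.train()
    for sentences, labels in batches:
        inputs = tokenizer(sentences, truncation=True, max_length=512,
                           padding="max_length", return_tensors="pt")
        outputs = model(**inputs, labels=torch.tensor(labels))
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()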
     The SVM is optimized via grid search over the most important hyperparameters
(C, class weight, number of iterations). The various hyperparameter combinations are
evaluated against the validation dataset. The model with the highest F1 on the validation
set is then selected as the one to make predictions on the test sets. The performance on
the test sets is what we report in the results section.
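
A sketch of this validation-based grid search is shown below. The grid values and the make_model factory (returning a fresh copy of the pipeline sketched in Section 4.4) are illustrative assumptions.

from sklearn.metrics import f1_score
from sklearn.model_selection import ParameterGrid

param_grid = ParameterGrid({
    "linearsvc__C": [0.01, 0.1, 1, 10],
    "linearsvc__class_weight": [None, "balanced"],
    "linearsvc__max_iter": [1000, 5000],
})

def grid_search(make_model, train_X, train_y, val_X, val_y):
    # Fit one model per hyperparameter combination and keep the one with the
    # highest F1 on the validation split.
    best_f1, best_model = -1.0, None
    for params in param_grid:
        model = make_model().set_params(**params)
        model.fit(train_X, train_y)
        f1 = f1_score(val_y, model.predict(val_X), pos_label="Facts")
        if f1 > best_f1:
            best_f1, best_model = f1, model
    return best_model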
     For evaluation we report the F1-measure (F1), i.e., the traditional information
retrieval measure, to evaluate the performance of the trained models. Since the task is
binary, the application of the measure is straightforward:

     P = \sum_{i=1}^{|S|} \frac{TP}{TP+FP} \qquad R = \sum_{i=1}^{|S|} \frac{TP}{TP+FN} \qquad F_1 = \frac{2PR}{P+R}

TP stands for true positives, FP for false positives, and FN for false negatives.


5. Results

5.1. H1 - Machine learning models are able to generalize the knowledge learned from
     one dataset and apply it successfully on another dataset.

In order to evaluate this hypothesis, we investigate the scores in Table 2. Of course, the
models with the highest scores are the models where the training data stems from the
same datasets as the testing data (cells colored in gray). Further, the RoBERTa model
performs better than the SVM model across the board, which is expected as RoBERTa is a
much more complex model (compared to SVM) pre-trained for language understanding.
     To investigate H1 we assess the scores for models trained on one dataset when applied
to another dataset. For the RoBERTa model, the maximum drop in performance is 0.27 in
F1-score, when trained on BVA and evaluated against ISC. In general, the drop is smaller,
with strong performance even when the models are applied to different datasets. The fact
that the RoBERTa model trained on the CB dataset achieves an F1-score of .84 on the BVA
dataset, despite the different characteristics of the datasets, is quite impressive.
     The SVM model seems to have problems learning the distributions of the ISC dataset,
resulting in low performance when trained on this dataset. This might be due to the
transformed ISC dataset being more imbalanced than the other two (see Table 1). Overall,
the RoBERTa model trained on the CB dataset performs the strongest on average. This might
be due to the wide range of domains contained in this dataset.

5.2. H2 - The BERT model is able to leverage its understanding of language to abstract
     beyond the specific domain vocabulary, thereby generalizing better across the
     domains than the SVM model.

The RoBERTa model appears to transfer the knowledge learned on any one dataset to
prediction on the other two much better than the baseline SVM model (Table 2). For

Table 2. F1-scores for models trained on specific dataset training pools (rows) predicting on testing datasets
(columns), compared between SVM and RoBERTa. Grey cells indicate that the training data includes the target
dataset. Values in parentheses indicate the difference to the same model type trained on the training data
matching the test data.
                       BVA (test)                   CB (test)                   ISC (test)                Avg (test)
                      SVM RoBERTa                 SVM RoBERTa                  SVM RoBERTa              SVM RoBERTa
    BVA (train)          .92          .94    .67 (-.11)    .71 (-.12)   .44 (+.03)        .44 (-.15)     .68           .70
    CB (train)    .60 (-.22)   .84 (-.10)           .78           .83   .50 (+.09)       .57 (-.02)      .63           .75
    ISC (train)   .11 (-.81)   .67 (-.27)    .19 (-.59)    .65 (-.18)          .41               .59     .24           .64


Table 3. F1-scores for models trained on specific dataset training pools (rows) predicting on testing datasets
(columns), compared between SVM and RoBERTa. The first row (“Matching train”) displays the performance of
models trained and evaluated on the same dataset. Values in parentheses show the difference in F1-score to the
model trained and evaluated on the same dataset.
                                 BVA (test)                       CB (test)                       ISC (test)
                               SVM          RoBERTa            SVM      RoBERTa                SVM       RoBERTa
        Matching train           .92              .94            .78              .83             .41           .59
        BVA+CB             .91 (-.01)       .93 (-.01)    .78 (+.00)    .84 (+.01)        .41 (+.00)      .55 (-.04)
        BVA+ISC            .91 (-.01)       .93 (-.01)     .46 (-.32)    .75 (-.09)       .46 (+.05)     .59 (+.00)
        CB+ISC             .55 (-.37)       .85 (-.09)     .71 (-.07)    .82 (-.01)       .53 (+.12)     .60 (+.01)
        BVA+CB+ISC         .90 (-.02)       .93 (-.01)    .68 (-.10)     .82 (-.01)       .50 (+.09)     .59 (+.00)


example, RoBERTa trained on CB performs at F1 = 0.84 on BVA. This is a sizeable
performance hit when compared to RoBERTa trained on BVA (F1 = 0.94). However,
the SVM model goes from F1 = 0.92 to F1 = 0.60. While it is customary for BERT
models to perform better than traditional ML models, the sizeable difference of .24 in
this instance might indicate that there is a qualitatively different understanding going on,
as RoBERTa can understand the semantic meaning rather than the specific vocabulary
used. The difference does not hold up for all datasets, however.

5.3. H3 - Training the models on pooled data from multiple datasets improves
     performance and results in more robust models.

Table 3 shows the results for models trained on the matching training data (e.g. the model
trained on BVA evaluated on BVA) in the top row, and models trained on multiple pooled
datasets and evaluated on a single test dataset below. Overall, there does not seem to be
a strong improvement when adding another dataset to the training data.
     Concerning robustness, the results suggest that RoBERTa is much better at handling
data coming from different datasets, achieving strong performance on each of the datasets
included in its training pool. This is not the case for the SVM baseline. The RoBERTa
model trained on all three datasets is especially impressive, as it simultaneously achieves
an F1-score within 0.01 of the matching-train model on each of the three datasets. Such a
model would likely generalize better to other domains (not considered here) as compared to
a model trained exclusively on one of the three datasets.
     This robustness can further be illustrated by the fact that combining data from the
two non-target datasets seems to yield higher performance on the target dataset than
training on just either of the non-target datasets alone. For example, training the RoBERTa
model on both BVA and ISC yields higher performance when predicting CB (.75) than
using either BVA (.71) or ISC (.65) for training on their own.


6. Discussion

The results reveal several interesting properties of the evaluated models. First of all, they
show that it is indeed possible to train a model on one dataset, and use it for prediction
on another dataset with decent performance. This holds especially true for the powerful
RoBERTa model. It should be noted that the models in Table 2 have never seen any
sample from the target dataset, except when the source dataset and the target dataset
are the same, and still achieve strong results. This shows a clear path towards using a
model trained on one dataset to bootstrap a new dataset for active or interactive learning
purposes. It also shows that the tasks defined in terms of their respective type systems
are related, despite being created in different contexts and jurisdictions. The relatedness
enables ML models to be successfully applied across domains.
     Further, several interesting observations can be made when comparing the perfor-
mance of the SVM and RoBERTa model. In several instances, the drop between the
model trained and evaluated on the same dataset and a model trained on another dataset
is larger for the SVM model. This could indicate that RoBERTa leverages its language
understanding to abstract beyond the specific vocabulary used.
     Finally, the RoBERTa models trained on multiple datasets seem to retain high perfor-
mance on the specific datasets they are trained on, while also improving performance on
datasets they were not trained on. This is another promising result showing that the mod-
els can learn patterns even from datasets created in different contexts and jurisdictions,
and use the additional data to create robust and strong classifiers.


7. Conclusion and Future Work

In this paper, we have provided an example of how datasets created in different contexts
can be re-cast to support the same task, and utilized towards a common goal. The pre-
trained language models seem to be able to understand the underlying idea behind a
task to a larger extent than SVM models, resulting in higher performance when applied
to datasets different from the ones they were trained on. Further, training the models on several
domains improved robustness and to some extent performance. Overall, these results
suggest that pre-trained language models have several desirable properties when training
on multiple datasets, making them an ideal tool for efficient utilization of the existing AI
& Law datasets to support future research in novel tasks.
     We observed that it is possible to train a model on one dataset, and use it for predic-
tion on another one with decent performance. We plan to investigate whether this could be
used to overcome the cold start problem to improve efficacy in an active learning context.
Another angle of investigation is to experiment with the transformation of the datasets,
and identifying other pertinent documents, potentially even across languages. Finally,
using larger, state-of-the-art models is warranted to investigate additional improvements in
predictive performance.


References

 [1]   Bhattacharya, P., S. Paul, K. Ghosh, S. Ghosh, and A. Wyner. “Identification of Rhetorical Roles of
       Sentences in Indian Legal Judgments.” JURIX 2019, IOS Press.
 [2]   Chalkidis, I., I. Androutsopoulos, and N. Aletras. “Neural legal judgment prediction in english.” Pro-
       ceedings of the 57th Annual Meeting of the ACL. 2019.
 [3]   Condevaux, C., S. Harispe, S. Mussard, and G. Zambrano. “Weakly Supervised One-Shot Classification
       Using Recurrent Neural Networks with Attention: Application to Claim Acceptance Detection.” JURIX.
       2019.
 [4]   Cormack, G., and M. Grossman. “Scalability of continuous active learning for reliable high-recall text
       classification.” In Proc. 25th ACM Int’l Conf. on Info. & Knowledge Management. 2016.
 [5]   Cormack, G., and M. Grossman. “Autonomy and reliability of continuous active learning for technology-
       assisted review.” arXiv preprint arXiv:1504.06868 (2015).
 [6]   Devlin, J., M. Chang, K. Lee, and K. Toutanova. “Bert: Pre-training of deep bidirectional transformers
       for language understanding.” NAACL, 2019.
 [7]   Hogan, C., R. Bauer, and D. Brassil. “Human-aided computer cognition for e-discovery.” In ICAIL. 2009.
 [8]   Soh, J., H. K. Lim, and I. E. Chai. “Legal Area Classification: A Comparative Study of Text Classifiers
       on Singapore Supreme Court Judgments.” NLLP Workshop, 2019.
 [9]   Liu, Y., M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov.
       “Roberta: A robustly optimized bert pretraining approach.” arXiv preprint arXiv:1907.11692 (2019).
[10]   Rabelo, J., M. Kim, and R. Goebel. “Combining similarity and transformer methods for case law entail-
       ment.” ICAIL. 2019.
[11]   Rossi, J., and E. Kanoulas. “Legal Search in Case Law and Statute Law.” JURIX. 2019.
[12]   Sanchez, L., J. He, J. Manotumruksa, D. Albakour, M. Martinez, and A. Lipani. “Easing Legal News
       Monitoring with Learning to Rank and BERT.” European Conference on Information Retrieval. Springer,
       Cham, 2020.
[13]   Savelka, J. Discovering sentences for argumentation about the meaning of statutory terms. Diss. Univer-
       sity of Pittsburgh, 2020.
[14]   Savelka, J., and K. D. Ashley. “Using CRF to detect different functional types of content in decisions of
       united states courts with example application to sentence boundary detection.” ASAIL 2017.
[15]   Savelka, J., Walker, V. R., Grabmair, M., & Ashley, K. D. (2017). Sentence boundary detection in adju-
       dicatory decisions in the united states. Traitement automatique des langues, 58, 21.
[16]   Savelka, J., G. Trivedi, and K. Ashley. “Applying an interactive machine learning approach to statutory
       analysis.” JURIX. 2015.
[17]   Walker, V. R., K. Pillaipakkamnatt, A. M. Davidson, M. Linares, and D. J. Pesce. “Automatic Classifica-
       tion of Rhetorical Roles for Sentences: Comparing Rule-Based Scripts with Machine Learning.” ASAIL
       2019.
[18]   Walker, V. R., J. H. Han, X. Ni, and K. Yoseda. “Semantic types for computational legal reasoning:
       propositional connectives and sentence roles in the veterans’ claims dataset.” ICAIL. 2017.
[19]   Waltl, B., J. Muhr, I. Glaser, G. Bonczek, E. Scepankova, and F. Matthes. “Classifying Legal Norms with
       Active Machine Learning.” JURIX. 2017.
[20]   Westermann, H., J. Savelka, and K. Benyekhlef. “Paragraph Similarity Scoring and Fine-Tuned BERT
       for Legal Information Retrieval and Entailment.” COLIEE. 2020.
[21]   Westermann, H., J. Savelka, V. Walker, K. Ashley, and K. Benyekhlef. “Sentence Embeddings and High-
       speed Similarity Search for Fast Computer Assisted Annotation of Legal Documents.” JURIX. 2020.
[22]   Westermann, H., J. Savelka, V. Walker, K. Ashley, and K. Benyekhlef. “Computer-Assisted Creation of
       Boolean Search Rules for Text Classification in the Legal Domain.” JURIX. 2019.