=Paper=
{{Paper
|id=Vol-2484/paper4
|storemode=property
|title=Evaluation of Seed Set Selection Approaches and Active Learning Strategies in Predictive Coding
|pdfUrl=https://ceur-ws.org/Vol-2484/paper4.pdf
|volume=Vol-2484
|authors=Christian J. Mahoney,Nathaniel Huber-Fliflet,Haozhen Zhao,Jianping Zhang,Peter Gronvall,Shi Ye
|dblpUrl=https://dblp.org/rec/conf/icail/MahoneyHZZGY19
}}
== Evaluation of Seed Set Selection Approaches and Active Learning Strategies in Predictive Coding ==
Evaluation of Seed Set Selection Approaches and Active Learning Strategies in Predictive Coding

Christian J. Mahoney, e-Discovery, Cleary Gottlieb Steen & Hamilton LLP, Washington, D.C., USA, cmahoney@cgsh.com
Nathaniel Huber-Fliflet, Data & Technology, Ankura Consulting Group, LLC, Washington, D.C., USA, nathaniel.huber-fliflet@ankura.com
Haozhen Zhao, Data & Technology, Ankura Consulting Group, LLC, Washington, D.C., USA, haozhen.zhao@ankura.com
Jianping Zhang, Data & Technology, Ankura Consulting Group, LLC, Washington, D.C., USA, jianping.zhang@ankura.com
Peter Gronvall, Data & Technology, Ankura Consulting Group, LLC, Washington, D.C., USA, peter.gronvall@ankura.com
Shi Ye, Data & Technology, Ankura Consulting Group, LLC, Washington, D.C., USA, shi.ye@ankura.com
ABSTRACT

Active learning is a popular methodology in text classification – known in the legal domain as 'predictive coding', 'Technology Assisted Review', or 'TAR' – due to its potential to minimize the required review effort to build effective classifiers. It is generally assumed that when building a classifier of data for legal purposes (such as production to an opposing party or identification of attorney-client privileged data), the seed set matters less as additional learning rounds are performed; thus, in most existing seed set studies the seed set is either built from a random document set or from synthetic documents. However, our recent empirical evaluation of a range of seed set selection strategies demonstrates that the seed set selection strategy can significantly impact predictive coding performance. It is unclear whether that conclusion applies to active learning for predictive coding. In this study, we try to answer that question through extensive experimentation that examines the impact of popular seed set selection strategies in active learning, within a predictive coding exercise. Additionally, significant research has been devoted to achieving high levels of recall efficiently through continuous active learning strategies under the assumption that human review will continue until a certain recall is achieved. However, for reasons such as monetary costs, sensitivity of data (or lack thereof), or time to classify a population, this heavy human lift is often less than ideal for lawyers who are classifying a population for production to an opposing party or for attorney-client privilege. Often the strategy is, instead, to minimize the human review effort and to classify a population efficiently with minimal human intervention. In these instances, the best selection strategy may differ from what prior research suggests. In this study, we evaluate different active learning strategies against well-researched continuous active learning strategies for the purpose of determining efficient training methods for classifying large populations quickly and precisely. We study how random sampling, keyword model, and clustering based seed set selection strategies, combined with top-ranked, uncertain, random, recall-inspired, and hybrid active learning document selection strategies, affect the performance of active learning for predictive coding. For the purpose of this study, we use the percentage of documents requiring review to reach 75% recall as the 'benchmark' metric to evaluate and compare our approaches. 75% is a commonly used recall threshold in the legal domain when using classifiers to designate documents for production. In most cases we find that seed set selection methods have a minor impact, though they do show significant impact in lower richness data sets or when choosing a top-ranked active learning selection strategy. Our results also show that active learning selection strategies implementing uncertainty, random, or 75% recall selection have the potential to reach the optimum active learning round much earlier than the popular continuous active learning approach (top-ranked selection). The results of our research shed light on the impact of active learning seed set selection strategies and also on the effectiveness of the selection strategies for the subsequent learning rounds. Legal practitioners can use the results of this study to enhance the efficiency, precision, and simplicity of their predictive coding process.

KEYWORDS

text classification, predictive coding, technology assisted review, TAR, electronic discovery, eDiscovery, e-discovery, Continuous Active Learning, CAL, SAL, Machine Learning, seed set

In: Proceedings of the First International Workshop on AI and Intelligent Assistance for Legal Professionals in the Digital Workplace (LegalAIIA 2019), held in conjunction with ICAIL 2019. June 17, 2019. Montreal, QC, Canada. Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Published at http://ceur-ws.org.

1 Introduction

The exponential growth of electronically stored information (ESI) falling within the scope of today's large legal cases creates unique challenges for all parties involved, including clients, lawyers, and courts/tribunals/enforcement agencies. Given the volumes and complexities of ESI, litigators struggle to identify documents relevant to a case (with data populations doubling about every two years) [10], while maintaining the quality and affordability of legal document review. Companies regularly spend millions of dollars producing responsive ESI for matters in litigation, and research shows that the majority of those costs are often incurred by the review process [12]. The traditional manual review approach is often neither economically feasible nor timely enough to meet courts' or regulators' requirements. To confront these challenges, predictive coding is increasingly embraced by legal practitioners to cull through massive volumes of data for relevant information. Predictive coding, or text classification as it is referred to in the machine learning domain, uses a machine learning algorithm to train a model from a sample set, then uses the model to identify documents that are potentially relevant, which can then be isolated for legal document production or prioritized for review.
A common protocol in applying predictive coding in legal document review is, instead of relying on a single model trained from a single seed set, to train predictive coding models using an iterative approach. Following the coding of a first round of training documents, commonly referred to as a seed set, an initial predictive model is created – this model is used to score all the unlabeled documents. Then, a training document selection strategy is used to choose new training documents from the scored population. These documents are reviewed and then added to the training set to train a new version of the model. This process is repeated until the goal of manually finding enough relevant documents during an active learning review is met (a strategy called Continuous Active Learning, or "CAL") or until the performance of the latest model meets an acceptable recall threshold with an acceptable amount of precision. Once this level is met, the document-level scoring from the classifier is used to make a relevance determination on the remaining unreviewed documents in the population (a strategy called Simple Active Learning, or "SAL"). Existing studies show that active learning approaches provide an advantage by finding as many relevant documents as possible while spending minimal review effort [3]. However, these studies assume human review of all documents identified as relevant by the predictive model and focus on how best to expedite this process through continuous prioritization of relevant documents until target recall thresholds are achieved [4]. This group of collaborators finds that though there are real-world legal matters where such human review is excessively costly or time consuming, there is a lack of studies that focus on SAL and how to most efficiently train an active learning model that achieves a high level of recall with minimal human review of training documents.

In certain situations, particularly where minimizing either the time or cost to classify a data set is paramount, this can be a more desirable approach than a Continuous Active Learning protocol that reprioritizes documents round after round until a desired recall is achieved through human review. There are two critical aspects of this kind of protocol. One aspect concerns the initial seed set used to train the first-round model – whether seed sets selected using different approaches ultimately have a significant impact on an active learning model. The other aspect concerns the impact of how additional training documents are chosen and added to improve model performance. A more thorough understanding of these two aspects will provide guidance to legal practitioners making decisions in managing the predictive coding process that will help to minimize the amount of time and cost needed to develop a highly effective model.

In this paper, we report our empirical studies on the impact of seed set selection and active learning document selection strategies on predictive coding for legal document review. We use four fully coded or labeled data sets prepared in response to production requests in actual legal matters spanning different industries. For each of these data sets, we utilize their keyword search terms for testing seeding and iterative training strategies. We conducted roughly 115,000 rounds of predictive coding experiments to study how different seed set selection and active learning document selection strategies affect the performance of predictive coding.

Our paper is organized as follows. (i) We first review existing research related to seed set selection and active learning document selection strategies. (ii) We then lay out our methodology, including the seed set selection and active learning document selection strategies, as well as our research questions. (iii) Next, we introduce the data sets used in our experiments, our experimental procedure, and our evaluation metrics. (iv) Finally, we discuss our experimental results and conclude the paper with key insights from our study and a description of future work.

2 Related Work

The seed set, as the initial training set for predictive coding, has created significant debate in the legal domain. One of these debates centers around how the seed set, or initial training set of documents, should be generated. In our research, we focus on the best strategies to generate seed sets.

There is no established consensus on seed set selection sampling methods. Two major seed set selection methods are random sampling and judgmental sampling. Schieneman et al. argue that the seed set should be "representative of the collection" and thus based on random sampling, such that the predictive coding process would result in adequate recall, and that judgmental sampling could potentially "bias" results [14]. In contrast, Cormack et al. [4] propose the use of a synthetic seed document, e.g. constructed from topic descriptions, in their AutoTAR protocol. Pickens et al. [13] studied manual seeding in the TREC Total Recall Track and found that initial seeding conditions had an impact on task outcomes. In our previous work [11], we studied the effect of different seed set selection strategies in predictive coding, and empirically demonstrated that complex seed set selection techniques aimed at ensuring the diversity of the seed set or increasing its richness provide only modest improvement compared to random sampling.

In active learning protocols, a key component is the method of selecting additional training documents after each round. The seminal work by Lewis et al. [8, 9] showed that choosing additional training documents closest to a score of 0.5 (on a scale of 0 to 1), the region Lewis describes as most uncertain to the classifier, produces an effective classifier more quickly than other selection strategies. In their original paper on the Continuous Active Learning protocol, Cormack et al. [3] compared three active learning document selection strategies: (i) select top-scored documents (most commonly associated with CAL); (ii) select documents about which the learning algorithm is most uncertain in making a relevance call (most commonly associated with SAL); and (iii) select documents randomly (most commonly associated with Simple Passive Learning, or SPL). Their paper demonstrated that the CAL training selection strategy consistently outperformed the other approaches in finding the most relevant documents with minimal review effort.
Chhatwal et al. [2] also studied the same three active learning document selection strategies, Top-Ranked, Uncertain, and Random, applied to real legal matter data sets. That study revealed that always selecting the highest-scoring documents as additional training documents may not be the most efficient approach, because round by round the model's performance may not improve. Both conclusions are understandable if we appreciate the dual purpose inherent in active learning: (i) quickly find as many relevant documents as possible; and (ii) train an effective final model using as few rounds as possible. The conflicting conclusions of the two studies are due to evaluating the selection strategies differently. In Cormack's work, the performance was evaluated using only the training set, namely the documents that were selected. In our work, the performance was evaluated on both the documents selected and the documents classified by the model. Recently, there have been new efforts in experimenting with retraining strategies in CAL. Ghelani et al. [5] compared retraining with exponentially increasing or static numbers of top-scored documents, as well as partial retraining, precision-based, and recency-weighted retraining strategies, and showed that CAL can achieve higher recall when retraining more frequently.
3 Training Document Selection

In this section, we introduce both seed set document selection and active learning document selection strategies.

3.1 Seed Set Selection Methods

In our previous paper [11], we studied the predictive coding performance of the following seed set selection strategies.
• Random Sampling (random): generate a random sample of documents from the corpus of all documents.
• Stratified Keyword Sampling (keyword_method1): select an equal number of documents from the document hits of each keyword developed by counsel for the purpose of identifying responsive information.
• Weighted Stratified Keyword Sampling (keyword_method2): select a number of documents from the document hits of each keyword proportional to the hit count.
• Clustering Sampling (cluster_method1): select an equal number of documents from each cluster. We use a variant of the K-Means clustering algorithm to create a cluster set of three branches to a depth of five layers for each data set.
• Weighted Clustering Sampling (cluster_method2): select a number of documents from each cluster proportional to the cluster size.

More detailed descriptions of these seed set selection methods can be found in [11]; a sketch of the sampling logic follows below.
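To make the sampling logic concrete, here is a minimal Python sketch of the five strategies. Everything in it is our illustration rather than the paper's code: the function names are invented, TF-IDF features stand in for whatever representation the study used, and a single flat KMeans pass replaces the paper's K-Means variant with three branches to a depth of five layers.

```python
import random

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def sample_random(doc_ids, n):
    """Random Sampling (random): draw n documents uniformly from the corpus."""
    return random.sample(doc_ids, n)


def sample_keywords(keyword_hits, n, weighted=False):
    """Stratified Keyword Sampling: keyword_hits maps a keyword to the list of
    documents it hits. weighted=False draws an equal number per keyword
    (keyword_method1); weighted=True draws proportionally to each keyword's
    hit count (keyword_method2)."""
    total_hits = sum(len(hits) for hits in keyword_hits.values())
    seed = set()
    for hits in keyword_hits.values():
        quota = (round(n * len(hits) / total_hits) if weighted
                 else n // len(keyword_hits))
        seed.update(random.sample(hits, min(quota, len(hits))))
    return list(seed)


def sample_clusters(doc_ids, texts, n, n_clusters=10, weighted=False):
    """Clustering Sampling: cluster the documents, then draw an equal number
    per cluster (cluster_method1) or proportionally to cluster size
    (cluster_method2). A flat KMeans is a simplification of the paper's
    hierarchical three-branch, five-layer variant."""
    vectors = TfidfVectorizer(max_features=20000).fit_transform(texts)
    assignments = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(vectors)
    clusters = {}
    for doc_id, label in zip(doc_ids, assignments):
        clusters.setdefault(label, []).append(doc_id)
    seed = set()
    for members in clusters.values():
        quota = (round(n * len(members) / len(doc_ids)) if weighted
                 else n // len(clusters))
        seed.update(random.sample(members, min(quota, len(members))))
    return list(seed)
```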
3.2 Active Learning Selection Strategies

Six active learning selection strategies were studied in this research (a code sketch follows the list).
• Top-Ranked (TOP): select documents with the highest scores assigned by the model.
• Uncertain (MID-50): select documents nearest to the score of 0.5 (in either direction from 0.5), which is the score indicating the highest uncertainty prescribed by our model.
• MID at 75% recall (MID_75RC): select documents nearest the cut-off score (in either direction from the cut-off score) resulting in a recall of 75% of all responsive documents.
• Random (RAND): select documents randomly from all the documents scored by the model.
• 80% Top scored + 20% random (80TOP20RD): select 80% of the documents with the highest scores assigned by the model and 20% of the documents randomly from the rest.
• 20% Top scored + 80% random (20TOP80RD): select 20% of the documents with the highest scores assigned by the model and 80% of the documents randomly from the rest.
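The sketch below shows one way these six strategies can be realized, assuming each unreviewed document has a model score in [0, 1]. The function name and signature are ours; `cutoff_75` refers to the 75%-recall cut-off score discussed next.

```python
import random


def select_training_batch(scores, n, strategy, cutoff_75=None):
    """Pick the next n training documents from a {doc_id: model_score} map."""
    ids = list(scores)
    by_score = sorted(ids, key=lambda d: scores[d], reverse=True)
    if strategy == "TOP":          # highest-scored documents
        return by_score[:n]
    if strategy == "MID-50":       # nearest to 0.5, in either direction
        return sorted(ids, key=lambda d: abs(scores[d] - 0.5))[:n]
    if strategy == "MID_75RC":     # nearest to the 75%-recall cut-off score
        return sorted(ids, key=lambda d: abs(scores[d] - cutoff_75))[:n]
    if strategy == "RAND":         # uniform random from the scored pool
        return random.sample(ids, n)
    if strategy in ("80TOP20RD", "20TOP80RD"):   # top/random hybrids
        top_share = 0.8 if strategy == "80TOP20RD" else 0.2
        top = by_score[:int(n * top_share)]
        rest = by_score[len(top):]
        return top + random.sample(rest, n - len(top))
    raise ValueError(f"unknown strategy: {strategy}")
```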
It should be noted that the MID_75RC strategy is a novel strategy that we have not seen in any prior literature. The reason we used 75% recall is that in real-world legal document reviews, a recall of 75% is a commonly used minimum performance metric. In practice, this strategy can be implemented by selecting documents with scores nearest to the cut-off score for 75% recall derived from a statistically representative sample set – essentially, an initial validation set (or control set) is required to implement this strategy in a real-world scenario. As an example, suppose a control set of 2,000 documents is isolated and coded by human reviewers and has a richness of 20%, resulting in 400 relevant documents within the random sample. The classifier would achieve an estimated 75% recall at the cut-off score at which 300 of the 400 relevant documents are identified by the classifier. For purposes of this study, we have used fully coded document populations in order to eliminate the uncertainty involved with this type of recall estimate. A sketch of the cut-off estimation follows.
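Assuming a coded control set and per-document model scores are available, the cut-off estimation from the worked example above could be sketched as follows (the function name is ours, not the paper's):

```python
import numpy as np


def estimate_recall_cutoff(control_scores, control_labels, target_recall=0.75):
    """Estimate the cut-off score for target_recall from a coded control set.

    In the worked example: a 2,000-document control set with 20% richness has
    400 relevant documents, and the returned score is the one at which the
    300 highest-scoring relevant documents (75% of 400) sit at or above the
    cut-off.
    """
    scores = np.asarray(control_scores)
    labels = np.asarray(control_labels)
    relevant_scores = np.sort(scores[labels == 1])[::-1]    # descending order
    k = int(np.ceil(target_recall * relevant_scores.size))  # e.g. 300 of 400
    return relevant_scores[k - 1]
```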
Our research empirically compared different seed set selection strategies combined with different active learning document selection strategies. Specifically, we address the following questions:
1. What effect do different seed set selection strategies have on the active learning process?
2. What effect do different active learning document selection strategies have on the predictive coding process?
3. How do seed set selection strategies impact the effectiveness of active learning selection strategies?
4. Are there combinations of seed set and active learning strategies that consistently outperform other strategies when an emphasis is placed on objectives most commonly associated with a SAL approach (namely minimizing the amount of human review, time, and costs in isolating a precise population that achieves a certain recall threshold)?
4 Experiments

In this section, we first introduce the data sets we used in the empirical study, and then we discuss the experimental procedure and evaluation metrics. We report the experimental results in the next section.

4.1 Data Sets

We conducted experiments on four data sets from confidential, non-public, real legal matters across various industries such as social media, communications, construction, and security. We chose matters with data sets that ranged from around 300,000 to 500,000 documents in order to execute our experiments within a reasonable time period. The richness, or positive class rate, of the four data sets ranged from approximately 4% to 39%. Attorneys reviewed all documents in the four data sets over the course of the legal matters, and their coding (labels) provided the ability to fully evaluate the performance of the models. Tables 1A, 1B, 1C, and 1D provide the details for the four data sets, including descriptions, sizes, attorney coding statistics, and statistics about keyword terms. The predictive coding objective for Data Sets A, B, and C was to identify privileged communications between attorneys and clients. The objective for Data Set D was to identify documents responsive to production requests from the opposing party in the matter.

The recall of the keyword hits is around 93% for the privileged data sets and 34% for the responsive data set. As comparing keyword-based and predictive coding approaches for legal document review is beyond the scope of this paper, readers interested in this subject can read our previous related papers [6, 7].
Table 1A: Privilege Data Set Statistics

| Data Sets | Total Documents | Privileged Documents | Not Privileged Documents | Richness |
|---|---|---|---|---|
| Project A | 308,621 | 46,730 | 261,891 | 15.14% |
| Project B | 393,745 | 14,307 | 379,438 | 3.63% |
| Project C | 277,412 | 38,834 | 238,578 | 14.00% |

Table 1B: Responsive Data Set Statistics

| Data Sets | Total Documents | Responsive Documents | Not Responsive Documents | Richness |
|---|---|---|---|---|
| Project D | 412,880 | 159,304 | 253,576 | 38.58% |

Table 1C: Privilege Keyword Statistics

| Data Sets | Total Documents | Keywords | Documents Hit by Keywords | Privileged Documents Hit by Keywords | Keyword Hit Percentage |
|---|---|---|---|---|---|
| Project A | 308,621 | 808 | 193,017 | 43,847 | 62.54% |
| Project B | 393,745 | 4,211 | 368,506 | 13,571 | 93.59% |
| Project C | 277,745 | 509 | 159,900 | 36,234 | 57.57% |

Table 1D: Responsive Keyword Statistics

| Data Sets | Total Documents | Keywords | Documents Hit by Keywords | Responsive Documents Hit by Keywords | Keyword Hit Percentage |
|---|---|---|---|---|---|
| Project D | 412,880 | 23 | 81,362 | 53,611 | 19.71% |

4.2 Experiment Procedure

We conducted an empirical study of the effect that seed set and active learning document selection strategies have on the performance of a predictive coding process.

The same set of experiments was performed on each of the four data sets. For each data set, all five seed set selection strategies and all six active learning document selection strategies were tested, for a total of 30 combinations of seed set selection and active learning document selection strategies per data set. In all experiments, the seed set included 500 training documents, and an additional 250 training documents were selected in each round of active learning. Table 2 shows the richness of the seed sets for the four data sets. From the table, we can see that the Random seed set selection method generally produces a seed set with richness similar to that of the overall data set, while seed sets derived from keyword search have higher richness than the overall data set.

Table 2: Richness of seed sets (%)

| Data Sets | Random | Keyword Method 1 | Keyword Method 2 | Cluster Method 1 | Cluster Method 2 |
|---|---|---|---|---|---|
| Project A | 14.8 | 40.2 | 43.4 | 15.0 | 15.2 |
| Project B | 3.6 | 6.6 | 6.8 | 3.8 | 3.2 |
| Project C | 11.8 | 36.2 | 34.4 | 14.2 | 12.8 |
| Project D | 40.4 | 70.2 | 73.8 | 40.2 | 38.6 |
Our experimental procedure was:
1. First, use the selected seed set sampling method to determine an initial training set of 500 documents.
2. Train a model with the selected seed set using the same underlying machine learning algorithm (logistic regression) and text processing parameters.
3. Then, score the entire data set, excluding any document used in training.
4. Next, select an additional 250 documents using one of the active learning document selection strategies.
5. Finally, add these new training documents, train a new model, and repeat steps 3, 4, and 5 until there are no more documents left to be scored or the minimum performance of the model is achieved.

A condensed sketch of this loop appears below.
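The following is a sketch under our own assumptions: scikit-learn stands in for the paper's actual tooling (the study used Apache Lucene indices for speed), the names are invented, `select_training_batch` is the selector sketched in Section 3.2, and for MID_75RC the cut-off score would additionally need to be recomputed and passed to the selector each round.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def run_rounds(X, labels, seed_ids, strategy, round_size=250, max_rounds=1000):
    """Iterate the train/score/select loop over a fully coded data set.

    X is a document-term matrix, labels the attorney coding (known up front
    in this study), seed_ids the 500-document seed set, and strategy one of
    the six active learning selection strategies.
    """
    labels = np.asarray(labels)
    train = list(seed_ids)                               # step 1: seed set
    for _ in range(max_rounds):
        model = LogisticRegression(max_iter=1000)        # steps 2/5: (re)train
        model.fit(X[train], labels[train])
        pool = np.setdiff1d(np.arange(X.shape[0]), train)
        if pool.size == 0:                               # nothing left to score
            break
        probs = model.predict_proba(X[pool])[:, 1]       # step 3: score pool
        scores = dict(zip(pool.tolist(), probs.tolist()))
        batch = select_training_batch(                   # step 4: pick 250 more
            scores, min(round_size, pool.size), strategy)
        train.extend(batch)
        yield model, list(train), scores                 # evaluate each round
```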
We used Logistic Regression as the machine learning algorithm due to its consistently high performance across different settings and various data sets, as demonstrated in previous studies [1, 2]. The other text processing parameters we used for modeling were bag of words with 1-grams, normalized frequency, and 20,000 tokens used as features; one plausible rendering of this pipeline follows.
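The exact tokenization and normalization used in the study are not specified, so treat this scikit-learn pipeline as an approximation of the stated parameters rather than the authors' implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# use_idf=False with an l2 norm yields normalized raw term frequencies,
# one reading of "bag of words with 1-gram, normalized frequency".
model = Pipeline([
    ("features", TfidfVectorizer(ngram_range=(1, 1), max_features=20000,
                                 use_idf=False, norm="l2")),
    ("classifier", LogisticRegression(max_iter=1000)),
])
# model.fit(train_texts, train_labels)
# pool_scores = model.predict_proba(pool_texts)[:, 1]
```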
In each round of our experiments, the entire data set was used either in training or scoring, which means on average more than 300,000 documents were used in training or scoring per round. The total number of models trained in our experiments was 114,933. We leveraged the Apache Lucene search engine library to build full-text indices of the data sets to speed up the training and scoring processes.
4.3 Evaluation Metrics experiments are shown. Figure 2 details the percentage of
Our performance metric measured the percentage of documents requiring review to achieve the targeted recall level. In the common passive learning scenario, this metric can be calculated on a validation set and does not consider documents that are reviewed for training, because those documents typically have a negligible impact when attempting to achieve the desired recall performance. In an active learning scenario, as rounds increase, the number of documents reviewed for training and used to develop the model can constitute a considerable portion of the population requiring review. Therefore, performance metrics in our experiments were computed after each active learning round using two sets of documents: (i) the first set contained the documents that were selected and reviewed during training; (ii) the second set contained the documents categorized as Responsive or Privileged by the predictive model after each round, namely the documents with probability scores greater than or equal to the predictive model's cut-off score. The documents with scores greater than or equal to the cut-off score are the documents that attorneys would consider producing to an opposing party, for assertions of privilege, or in some instances for review because they are likely responsive or, in the case of privilege, may contain content that would allow for the assertion of claims of privilege.
We can use an example to illustrate the calculation of these measures (a code rendering follows). Project A has 308,621 documents, of which 46,730 are positive. Using the random seed selection strategy, we would select a seed set with 74 positive documents and 426 negative documents. Now suppose we use the TOP active learning document selection strategy, which selects the 250 documents with the highest scores to add to the training set. Lastly, assume that after ten rounds the training set contains 2,491 positive documents and 509 negative documents, 3,000 documents in total. Examining the document scores after ten rounds in this example, we find that if we choose 24.7 as the cut-off score, there are 32,558 positive documents above this cut-off score. 32,558 + 2,491 = 35,049, which represents 75% of all the responsive documents (46,730) in this data set. The total population of documents requiring review is then established by adding all the documents with scores above 24.7 (82,206) to all the training documents (3,000) and dividing by the total population size (308,621). This equals 27.6%.
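Rendered as code under our assumptions (the names are ours, and the cut-off is taken as the tightest score that yields the needed number of pool positives), the calculation looks like this:

```python
import numpy as np


def review_percentage(pool_scores, pool_labels, n_train, n_train_pos,
                      total_pos, total_docs, target_recall=0.75):
    """Percentage of the population requiring review to reach target recall."""
    order = np.argsort(pool_scores)[::-1]          # unreviewed docs, best first
    needed = int(np.ceil(target_recall * total_pos)) - n_train_pos
    cumulative_pos = np.cumsum(np.asarray(pool_labels)[order])
    k = int(np.searchsorted(cumulative_pos, needed)) + 1  # docs at/above cut-off
    return (k + n_train) / total_docs


# With the paper's round-ten figures for Project A:
# review_percentage(scores, labels, n_train=3000, n_train_pos=2491,
#                   total_pos=46730, total_docs=308621) ≈ 0.276, i.e. 27.6%.
```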
5 Results and Discussion

The total number of our experimental parameter combinations was 120. These parameters include the data set, the seed set selection method, and the active learning document selection strategy. On average, roughly 1,000 rounds of experiments were generated for each combination. To save space in this paper, we present only the most interesting results.

5.1 The Impact of Seed Set Selection Approaches

Figure 1 displays the percentage of documents requiring review to achieve 75% recall for the different seed set selection strategies, with the active learning strategies fixed to TOP and MID_75RC on Projects C and D; the first 100 rounds of experiments are shown. Figure 2 details the percentage of documents requiring review to achieve 75% recall for the different seed set selection strategies, with the RAND active learning document selection strategy fixed on Projects B and C; the first 100 rounds of experiments are shown. In general, these figures show that seed set selection strategies have a very modest impact on the performance of the active learning strategies, especially after many rounds of active learning. These results were expected when using seed sets with a small number of documents (e.g., 500), because the initial impact of the seed set selection strategy likely degrades over training rounds. Therefore, it may be worthwhile to experiment with seed sets of larger sizes in the future. However, from these results, we do find two salient aspects of the impact of the seed set selection approach. First, among the different active learning strategies, the TOP strategy is the most sensitive to the seed selection strategy; we can see a more apparent performance difference across the different seed set selection strategies (Figure 1). This implies that in the very popular Continuous Active Learning protocol, the seed set selection strategy has an impactful role and should be considered carefully. Second, the curves for Project B, a matter with 3.6% richness, show that the seed set selection strategy had a greater impact on a low richness population and that judgmental seed set selection strategies using keywords or clustering outperform randomly selected seed set documents in the early rounds (Figure 2).

Figure 1: Required Review at 75% Recall for the five Seed Set Methods with TOP, MID_75RC Active Learning Strategies on Project C and D (First 100 Rounds)

Figure 2: Required Review at 75% Recall for the five Seed Set Methods with RAND Active Learning Strategy on Project B and C (First 100 Rounds)

5.2 The Impact of Active Learning Strategies

Figure 3 shows the performance differences among TOP, MID-50, MID_75RC, and RAND with the seed set selection method fixed to random, over learning rounds until the optimum round is reached. These results confirm the findings of our previous research [1], i.e. active learning selection strategies such as uncertainty sampling (MID-50) and random selection (RAND) can generate an effective model within fewer rounds than the popular TOP strategy. Moreover, we find that the MID_75RC strategy, a novel active learning strategy proposed for the first time in this paper, performs the best in almost all scenarios. This indicates that selecting documents nearest to the cut-off score for 75 percent recall would be the most effective active learning strategy when attempting to achieve 75 percent recall.

Figure 3: Required Review at 75% Recall for TOP, MID-50, MID_75RC and RAND Active Learning Strategies with random Seed Set Selection Method

The performance difference between the TOP strategy and the MID_75RC strategy is even clearer when we look closely at the plots of the first 100 rounds. Table 3 shows that in the first 50 rounds the MID_75RC strategy consistently requires less review than the TOP strategy across all projects. The maximum saving is close to 20 percent in Project C. In practice, this has a significant impact on the predictive coding process and should be considered by legal teams to help reduce review costs.
Table 3: Required Review at 75% Recall for TOP and MID_75RC Active Learning Strategies (First 50 Rounds, Every 10 Rounds)

| Data Set | Round | TOP | MID_75RC | Difference |
|---|---|---|---|---|
| Project A | 10 | 28% | 18% | 9% |
| Project A | 20 | 25% | 17% | 8% |
| Project A | 30 | 26% | 17% | 9% |
| Project A | 40 | 24% | 17% | 7% |
| Project A | 50 | 23% | 17% | 6% |
| Project B | 10 | 47% | 35% | 12% |
| Project B | 20 | 44% | 32% | 11% |
| Project B | 30 | 43% | 31% | 12% |
| Project B | 40 | 40% | 29% | 11% |
| Project B | 50 | 39% | 28% | 11% |
| Project C | 10 | 24% | 14% | 10% |
| Project C | 20 | 26% | 14% | 13% |
| Project C | 30 | 32% | 14% | 18% |
| Project C | 40 | 33% | 14% | 19% |
| Project C | 50 | 29% | 14% | 15% |
| Project D | 10 | 33% | 31% | 2% |
| Project D | 20 | 33% | 31% | 3% |
| Project D | 30 | 34% | 31% | 3% |
| Project D | 40 | 34% | 30% | 3% |
| Project D | 50 | 34% | 30% | 4% |
5.3 Optimum Performance Round Analysis

We define the optimum performance round as the earliest round in which the amount of review required to reach 75 percent recall is at its minimum. After some analysis, we found that the dominant factor in reaching the optimum performance round is the active learning strategy and not the seed set selection strategy. In Tables 4A through 4D, we compiled the optimum performance round of each active learning strategy for the four data sets. We can see that strategies such as RAND, MID-50, or MID_75RC consistently take fewer rounds to reach the optimum performance round. Moreover, if a satisficing goal is set to a review percentage within 5%, 10%, or 15% of the optimum performance, we can see that those strategies require fewer rounds to reach the goal. A sketch of this analysis appears below.
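A small sketch of how the numbers in Tables 4A through 4D can be derived from one experiment's per-round metric. One judgment call on our part: we read "within 5%" as relative to the optimum review percentage, since the paper does not spell out whether the margin is relative or absolute.

```python
def optimum_rounds(review_pcts, tolerances=(0.05, 0.10, 0.15)):
    """Given per-round review percentages for one experiment, return the
    optimum review percentage, the optimum round, and the first round within
    each tolerance of the optimum (round 0 is the initial round)."""
    best = min(review_pcts)
    optimum_round = review_pcts.index(best)
    first_within = {
        tol: next(r for r, pct in enumerate(review_pcts)
                  if pct <= best * (1 + tol))
        for tol in tolerances
    }
    return best, optimum_round, first_within
```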
Table 4A: Project A Optimum Performance Rounds

| Active Learning Strategy | Review Percentage | Optimum Round | 1st Round within 5% of Op. Perf. | 1st Round within 10% of Op. Perf. | 1st Round within 15% of Op. Perf. |
|---|---|---|---|---|---|
| TOP | 15.90 | 192 | 167 | 145 | 113 |
| MID-50 | 15.71 | 60 | 33 | 21 | 13 |
| MID_75RC | 16.24 | 74 | 19 | 12 | 8 |
| RAND | 18.77 | 21 | 4 | 2 | 2 |
| 80TOP20RD | 18.36 | 213 | 80 | 27 | 10 |
| 20TOP80RD | 19.09 | 50 | 7 | 3 | 2 |
Table 4B: Project B Optimum Performance Rounds

| Active Learning Strategy | Review Percentage | Optimum Round | 1st Round within 5% of Op. Perf. | 1st Round within 10% of Op. Perf. | 1st Round within 15% of Op. Perf. |
|---|---|---|---|---|---|
| TOP | 18.58 | 279 | 263 | 235 | 206 |
| MID-50 | 18.59 | 290 | 258 | 236 | 218 |
| MID_75RC | 20.88 | 263 | 168 | 117 | 92 |
| RAND | 27.79 | 123 | 85 | 41 | 22 |
| 80TOP20RD | 21.20 | 321 | 272 | 225 | 194 |
| 20TOP80RD | 27.50 | 181 | 106 | 70 | 53 |

Table 4C: Project C Optimum Performance Rounds

| Active Learning Strategy | Review Percentage | Optimum Round | 1st Round within 5% of Op. Perf. | 1st Round within 10% of Op. Perf. | 1st Round within 15% of Op. Perf. |
|---|---|---|---|---|---|
| TOP | 13.56 | 148 | 130 | 118 | 107 |
| MID-50 | 13.22 | 25 | 11 | 7 | 5 |
| MID_75RC | 13.33 | 37 | 12 | 6 | 4 |
| RAND | 15.73 | 18 | 6 | 4 | 2 |
| 80TOP20RD | 15.30 | 156 | 108 | 50 | 21 |
| 20TOP80RD | 16.05 | 29 | 6 | 3 | 2 |

Table 4D: Project D Optimum Performance Rounds

| Active Learning Strategy | Review Percentage | Optimum Round | 1st Round within 5% of Op. Perf. | 1st Round within 10% of Op. Perf. | 1st Round within 15% of Op. Perf. |
|---|---|---|---|---|---|
| TOP | 30.47 | 494 | 350 | 0 | 0 |
| MID-50 | 31.02 | 13 | 0 | 0 | 0 |
| MID_75RC | 30.36 | 332 | 1 | 0 | 0 |
| RAND | 31.65 | 8 | 0 | 0 | 0 |
| 80TOP20RD | 31.90 | 5 | 0 | 0 | 0 |
| 20TOP80RD | 31.58 | 6 | 0 | 0 | 0 |

* Round 0 means the initial round.

6 Conclusions

Our experimental results show that seed set selection strategies have little impact on the active learning process. However, for low richness projects, keyword-based seed set selection strategies have a more apparent effect. Also, the popular TOP active learning strategy is the most sensitive strategy to different seed selection methodologies.

Our results also show that choosing documents nearest to the cut-off score determined by reaching a 75 percent document recall can potentially result in a high performing model quickly. Excluding data sets with extremely low richness (such as Project B), this training methodology results in significantly higher performing models in early training rounds, such as round 10 or round 20, rounds that are often associated with stopping points for Simple Active Learning models. In fact, in all three of our data sets that had richness above 10 percent, using the MID_75RC active learning strategy achieved performance within roughly 10 percent of the optimum model performance within 10 rounds of active learning. In theory, focusing training around the dynamic cut-off score from round to round makes sense. Documents just above the cut-off score should be the documents included as positives by the model with the least amount of certainty, so there should be the most opportunity to improve precision by better classifying the features within these documents. Documents just below the cut-off score should be the documents excluded as negatives by the model that have the highest amount of richness in the excluded population, so there should be the most opportunity to improve recall by classifying the features within these documents. It will be interesting to continue to test these assumptions and to study this strategy both in data sets with low richness and with other cut-off scores meeting different recall objectives or thresholds, such as those prescribing 50 percent or 90 percent recall. It should be noted that in our current study we fixed the seed set size at 500 and the number of additional training documents in each round at 250. In future studies, we intend to examine seed sets of larger sizes and various sizes of additional active learning training documents.
The results provide practical techniques that legal practitioners can use to enhance their active learning predictive coding processes, as well as to inform their training document selection strategies for passive learning approaches.

REFERENCES
[1] R. Chhatwal, N. Huber-Fliflet, R. Keeling, J. Zhang, and H. Zhao, "Empirical evaluations of active learning strategies in legal document review," in 2017 IEEE International Conference on Big Data (Big Data), 2017, pp. 1428–1437.
[2] R. Chhatwal, N. Huber-Fliflet, R. Keeling, J. Zhang, and H. Zhao, "Empirical evaluations of preprocessing parameters' impact on predictive coding's effectiveness," in 2016 IEEE International Conference on Big Data (Big Data), 2016, pp. 1394–1401.
[3] G. V. Cormack and M. R. Grossman, "Evaluation of Machine-learning Protocols for Technology-assisted Review in Electronic Discovery," in Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, New York, NY, USA, 2014, pp. 153–162.
[4] G. V. Cormack and M. R. Grossman, "Autonomy and Reliability of Continuous Active Learning for Technology-Assisted Review," arXiv preprint arXiv:1504.06868, 2015.
[5] N. Ghelani, G. V. Cormack, and M. D. Smucker, "Refresh Strategies in Continuous Active Learning," in ProfS2018: First International Workshop on Professional Search, 2018.
[6] P. Gronvall, N. Huber-Fliflet, J. Zhang, R. Keeling, R. Neary, and H. Zhao, "An Empirical Study of the Application of Machine Learning and Keyword Terms Methodologies to Privilege-Document Review Projects in Legal Matters," in 2018 IEEE International Conference on Big Data (Big Data), 2018, pp. 3282–3291.
[7] R. Keeling, N. Huber-Fliflet, J. Zhang, and R. P. Chhatwal, "Separating the Privileged Wheat from the Chaff – Using Text Analytics and Machine Learning to Protect Attorney-Client Privilege," Richmond Journal of Law and Technology, 2019.
[8] D. D. Lewis, "A Sequential Algorithm for Training Text Classifiers: Corrigendum and Additional Data," SIGIR Forum, vol. 29, no. 2, pp. 13–19, Sep. 1995.
[9] D. D. Lewis and W. A. Gale, "A Sequential Algorithm for Training Text Classifiers," in Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, 1994, pp. 3–12.
[10] S. Lohr, "The Age of Big Data," New York Times, Feb. 11, 2012.
[11] C. J. Mahoney, N. Huber-Fliflet, K. Jensen, H. Zhao, R. Neary, and S. Ye, "Empirical Evaluations of Seed Set Selection Strategies for Predictive Coding," in 2018 IEEE International Conference on Big Data (Big Data), 2018, pp. 3292–3301.
[12] N. M. Pace and L. Zakaras, Where the Money Goes: Understanding Litigant Expenditures for Producing Electronic Discovery. RAND Corporation, 2012.
[13] J. Pickens, T. Gricks, B. Hardi, M. Noel, and J. Tredennick, "An Exploration of Total Recall with Multiple Manual Seedings," in Proceedings of TREC 2016, 2016.
[14] K. Schieneman and T. C. Gricks III, "Implications of Rule 26(g) on the Use of Technology-Assisted Review," Fed. Cts. L. Rev., vol. 7, p. 247, 2014.