  Evaluation of Seed Set Selection Approaches and Active Learning
                   Strategies in Predictive Coding
Christian J. Mahoney, e-Discovery, Cleary Gottlieb Steen & Hamilton LLP, Washington, D.C., USA (cmahoney@cgsh.com)
Nathaniel Huber-Fliflet, Data & Technology, Ankura Consulting Group, LLC, Washington, D.C., USA (nathaniel.huber-fliflet@ankura.com)
Haozhen Zhao, Data & Technology, Ankura Consulting Group, LLC, Washington, D.C., USA (haozhen.zhao@ankura.com)
Jianping Zhang, Data & Technology, Ankura Consulting Group, LLC, Washington, D.C., USA (jianping.zhang@ankura.com)
Peter Gronvall, Data & Technology, Ankura Consulting Group, LLC, Washington, D.C., USA (peter.gronvall@ankura.com)
Shi Ye, Data & Technology, Ankura Consulting Group, LLC, Washington, D.C., USA (shi.ye@ankura.com)

ABSTRACT

Active learning is a popular methodology in text classification, known in the legal domain as "predictive coding" or "Technology Assisted Review" ("TAR"), due to its potential to minimize the review effort required to build effective classifiers. It is generally assumed that when building a classifier of data for legal purposes (such as production to an opposing party or identification of attorney-client privileged data), the seed set matters less as additional learning rounds are performed; thus, in most existing seed set studies, the seed set is built either from a random document set or from synthetic documents. However, our recent empirical evaluation of a range of seed set selection strategies demonstrates that the seed set selection strategy can significantly impact predictive coding performance. It is unclear whether that conclusion applies to active learning for predictive coding. In this study, we try to answer that question through extensive experimentation that examines the impact of popular seed set selection strategies on active learning within a predictive coding exercise. Additionally, significant research has been devoted to achieving high levels of recall efficiently through continuous active learning strategies under the assumption that human review will continue until a certain recall is achieved. However, for reasons such as monetary costs, sensitivity of data (or lack thereof), or time to classify a population, this heavy human lift is often less than ideal for lawyers who are classifying a population for production to an opposing party or for attorney-client privilege. Often the strategy is instead to minimize the human review effort and to classify a population efficiently with minimal human intervention. In these instances, the best selection strategy may differ from what prior research suggests. In this study, we evaluate different active learning strategies against well-researched continuous active learning strategies for the purpose of determining efficient training methods for classifying large populations quickly and precisely. We study how random sampling, keyword model, and clustering based seed set selection strategies, combined with top-ranked, uncertain, random, recall inspired, and hybrid active learning document selection strategies, affect the performance of active learning for predictive coding. For the purpose of this study, we use the percentage of documents requiring review to reach 75% recall as the benchmark metric to evaluate and compare our approaches. 75% is a commonly used recall threshold in the legal domain when using classifiers to designate documents for production. In most cases we find that seed set selection methods have a minor impact, though they do show significant impact on lower richness data sets or when choosing a top-ranked active learning selection strategy. Our results also show that active learning strategies implementing uncertainty, random, or 75% recall selection have the potential to reach the optimum active learning round much earlier than the popular continuous active learning approach (top-ranked selection). The results of our research shed light on the impact of active learning seed set selection strategies and on the effectiveness of the selection strategies for the following learning rounds. Legal practitioners can use the results of this study to enhance the efficiency, precision, and simplicity of their predictive coding process.

In: Proceedings of the First International Workshop on AI and Intelligent Assistance for Legal Professionals in the Digital Workplace (LegalAIIA 2019), held in conjunction with ICAIL 2019. June 17, 2019. Montreal, QC, Canada. Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Published at http://ceur-ws.org.

KEYWORDS

text classification, predictive coding, technology assisted review, TAR, electronic discovery, eDiscovery, e-discovery, Continuous Active Learning, CAL, SAL, Machine Learning, seed set

1    Introduction

The exponential growth of electronically stored information (ESI) falling within the scope of today's large legal cases creates unique challenges for all parties involved, including clients, lawyers, and courts/tribunals/enforcement agencies. Given the volumes and complexities of ESI, litigators struggle to identify documents relevant to a case (with data populations doubling about every two years) [10], while maintaining the quality and affordability of legal document review. Companies regularly spend millions of dollars producing responsive ESI for matters in litigation, and research shows that the majority of the costs are often incurred by the review process [12]. The traditional manual review approach is often neither economically feasible nor timely enough to meet courts' or regulators' requirements. To confront these challenges, predictive coding is increasingly embraced by legal practitioners to cull through massive volumes of data for relevant information. Predictive coding, or text classification as it is referred to in the machine learning domain, uses a machine learning algorithm to train a model from a sample set, then uses the model to identify documents that are potentially relevant, which can then be isolated for legal document production or prioritized for review.

A common protocol in applying predictive coding in legal document review is, instead of relying on a single model trained from a single seed set, to train predictive coding models using an iterative approach. Following the coding of a first round of training documents, commonly referred to as a seed set, an initial predictive model is created; this model is used to score all the unlabeled documents. Then, a training document selection strategy is used to choose new training documents from the scored population. These documents are reviewed and added to the training set to train a new version of the model. This process is repeated until the goal of manually finding enough relevant documents during an active learning review is met (a strategy called Continuous Active Learning, or "CAL") or until the performance of the latest model meets an acceptable recall threshold with an acceptable amount of precision; once this level is met, the document-level scoring from the classifier is used to make a relevance determination on the remaining unreviewed documents in the population (a strategy called Simple Active Learning, or "SAL"). Existing studies show that active learning approaches provide an advantage by finding as many relevant documents as possible while spending minimal review effort [3]. However, these studies assume human review of all documents identified as relevant by the predictive model and focus on how best to expedite this process through continuous prioritization of relevant documents until target recall thresholds are achieved [4]. Though there are real-world legal matters where such human review is excessively costly or time consuming, there is a lack of studies that focus on SAL and on how to most efficiently train an active learning model that achieves a high level of recall with minimal human review of training documents.
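A minimal Python sketch of this iterative protocol is shown below. It is an illustration only, not the authors' implementation: the caller supplies the training, scoring, batch selection, and stopping logic, and the stopping rule is what distinguishes CAL (stop when human review has found enough relevant documents) from SAL (stop when the model itself reaches an acceptable recall threshold).

```python
from typing import Callable, List, Tuple

# Illustrative skeleton of the iterative predictive coding protocol described
# above; all names and signatures are hypothetical, not the authors' code.
def iterative_review(
    seed: List[str],                                    # doc ids coded in the seed round
    unlabeled: List[str],                               # doc ids not yet reviewed
    train: Callable[[List[str]], object],               # fit a model on the coded docs
    score: Callable[[object, List[str]], List[float]],  # score the unreviewed docs
    select: Callable[[List[str], List[float], int], List[str]],  # pick the next batch
    stop: Callable[[object, List[str]], bool],          # CAL: enough relevant found;
                                                        # SAL: model recall acceptable
    batch_size: int = 250,
) -> Tuple[object, List[str]]:
    training = list(seed)
    model = train(training)                             # initial model from the seed set
    while unlabeled and not stop(model, training):
        scores = score(model, unlabeled)
        batch = select(unlabeled, scores, batch_size)   # training document selection
        training += batch                               # reviewers code the new batch
        batch_set = set(batch)
        unlabeled = [d for d in unlabeled if d not in batch_set]
        model = train(training)                         # retrain with the added round
    return model, training
```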
In certain situations, particularly where minimizing either the time or cost to classify a data set is paramount, this can be a more desirable approach than a Continuous Active Learning protocol that reprioritizes documents round after round until a desired recall is achieved through human review. There are two critical aspects of this kind of protocol. One aspect concerns the initial seed set used to train the first-round model: whether seed sets selected using different approaches ultimately have a significant impact on an active learning model. The other aspect concerns the impact of how additional training documents are chosen and added to improve model performance. A more thorough understanding of these two aspects will provide guidance to legal practitioners in managing the predictive coding process so as to minimize the time and cost needed to develop a highly effective model.

In this paper, we report our empirical studies on the impact of seed set selection and active learning document selection strategies on predictive coding for legal document review. We use four fully coded (labeled) data sets prepared in response to production requests in actual legal matters spanning different industries. For each of these data sets, we utilize their keyword search terms for testing seeding and iterative training strategies. We conducted roughly 115,000 rounds of predictive coding experiments to study how different seed set selection and active learning document selection strategies affect the performance of predictive coding. Our paper is organized as follows. (i) We first review existing research related to seed set selection and active learning document selection strategies. (ii) We then lay out our methodology, including the seed set selection and active learning document selection strategies, as well as our research questions. (iii) Next, we introduce the data sets used in our experiments, our experimental procedure, and our evaluation metrics. (iv) Finally, we discuss our experimental results, conclude the paper with key insights from our study, and describe future work.

2    Related Work

The seed set, as the initial training set for predictive coding, has generated significant debate in the legal domain. One of these debates centers on how the seed set, or initial training set of documents, should be generated. In our research, we focus on the best strategies to generate seed sets.

There is no established consensus on seed set selection sampling methods. Two major seed set selection methods are random sampling and judgmental sampling. Schieneman et al. argue that the seed set should be "representative of the collection" and thus based on random sampling, such that the predictive coding process would result in adequate recall, and that judgmental sampling could potentially "bias" results [14]. In contrast, Cormack et al. [4] propose the use of a synthetic seed document, e.g. constructed from topic descriptions, in their AutoTAR protocol. Pickens et al. [13] studied manual seeding in the TREC Total Recall Track and found that initial seeding conditions had an impact on task outcomes. In our previous work [11], we studied the effect of different seed set selection strategies in predictive coding, and empirically demonstrated that complex seed set selection techniques intended to ensure the diversity of the seed set or to increase its richness provide only modest improvement when compared to random sampling.

In active learning protocols, a key component is the method of selecting additional training documents after each round. The seminal work by Lewis et al. [8, 9] showed that choosing additional training documents closest to a score of 0.5 (on a scale of 0 to 1), the region Lewis describes as most uncertain to the classifier, produces an effective classifier more quickly than other selection strategies. In their original paper on the Continuous Active Learning protocol, Cormack et al. [3] compared three active learning document selection strategies: (i) select top-scored documents (most commonly associated with CAL); (ii) select documents about which the learning algorithm is most uncertain in making a relevance call (most commonly associated with SAL); and (iii) select documents randomly (most commonly associated with Simple Passive Learning, or SPL). Their paper demonstrated that the CAL training selection strategy consistently outperformed the other approaches in finding the most relevant documents with minimal review effort. Chhatwal et al. [2] also studied the same three active learning document selection strategies, Top-Ranked, Uncertain, and Random, applied to real legal matter data sets. This study revealed that always selecting the highest-scoring documents as additional training documents may not be the most efficient approach, because the model's performance may not improve round by round. Both conclusions are understandable if we appreciate the dual purpose inherent in active learning: (i) quickly find as many relevant documents as possible; and (ii) train an effective final model using as few rounds as possible. The conflicting conclusions of the two studies are due to evaluating the selection strategies differently. In Cormack's work, the performance was evaluated using only the training set, namely the documents that were selected. In our work, the performance was evaluated on both the documents selected and the documents classified by the model. Recently, there have been new efforts in experimenting with retraining strategies in CAL. Ghelani et al. [5] compared retraining with exponentially increased or static top-scored documents, as well as partial retraining, precision-based, and recency-weighted retraining strategies, and showed that CAL can achieve higher recall when retraining more frequently.

3    Training Document Selection

In this section, we introduce both seed set document selection and active learning document selection strategies.

3.1    Seed Set Selection Methods

In our previous paper [11], we studied the predictive coding performance of the following seed set selection strategies; a code sketch follows the list.
     •   Random Sampling (random): generate a random sample of documents from the corpus of all documents.
     •   Stratified Keyword Sampling (keyword_method1): select an equal number of documents from the document hits of each keyword developed by counsel for the purpose of identifying responsive information.
     •   Weighted Stratified Keyword Sampling (keyword_method2): select a number of documents from the document hits of each keyword proportional to the hit count.
     •   Clustering Sampling (cluster_method1): select an equal number of documents from each cluster. We use a variant of the K-Means clustering algorithm to create a cluster set of three branches to a depth of five layers for each data set.
     •   Weighted Clustering Sampling (cluster_method2): select a number of documents from each cluster proportional to the cluster size.

A more detailed description of these seed set selection methods can be found in [11].
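The Python sketch below illustrates the random and keyword-based samplers, assuming a mapping from each keyword to the list of documents it hits; the clustering variants are analogous, with cluster memberships in place of keyword hit lists. The function names and the deduplication behavior are illustrative assumptions, not the authors' implementation.

```python
import random
from typing import Dict, List

def random_seed(corpus: List[str], n: int = 500) -> List[str]:
    # random: a simple random sample from the whole corpus
    return random.sample(corpus, n)

def stratified_keyword_seed(hits: Dict[str, List[str]], n: int = 500) -> List[str]:
    # keyword_method1: an equal number of documents from each keyword's hits
    per_keyword = max(1, n // len(hits))
    seed: List[str] = []
    for docs in hits.values():
        seed += random.sample(docs, min(per_keyword, len(docs)))
    return list(dict.fromkeys(seed))[:n]   # drop docs hit by several keywords

def weighted_keyword_seed(hits: Dict[str, List[str]], n: int = 500) -> List[str]:
    # keyword_method2: documents per keyword proportional to its hit count
    total_hits = sum(len(docs) for docs in hits.values())
    seed: List[str] = []
    for docs in hits.values():
        quota = round(n * len(docs) / total_hits)
        seed += random.sample(docs, min(quota, len(docs)))
    return list(dict.fromkeys(seed))[:n]
```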

3.2    Active Learning Selection Strategies

Six active learning selection strategies were studied in this research; a code sketch follows the list.
     •   Top-Ranked (TOP): select documents with the highest scores assigned by the model.
     •   Uncertain (MID-50): select documents nearest to the score of 0.5 (in either direction from 0.5), which is the score indicating the highest uncertainty prescribed by our model.
     •   MID at 75% recall (MID_75RC): select documents nearest the cut-off score (in either direction from the cut-off score) resulting in a recall of 75% of all responsive documents.
     •   Random (RAND): select documents randomly from all the documents scored by the model.
     •   80% Top scored + 20% random (80TOP20RD): select 80% of the documents with the highest scores assigned by the model and 20% of the documents randomly from the rest.
     •   20% Top scored + 80% random (20TOP80RD): select 20% of the documents with the highest scores assigned by the model and 80% of the documents randomly from the rest.
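The sketch below expresses the six strategies as functions of the current model scores (a mapping from document id to probability score); for MID_75RC, the cut-off score for 75% recall is supplied by the caller. The names and signatures are again illustrative, not the authors' code.

```python
import random
from typing import Dict, List

def top_ranked(scores: Dict[str, float], n: int = 250) -> List[str]:
    # TOP: the n highest-scoring documents
    return sorted(scores, key=scores.get, reverse=True)[:n]

def mid_50(scores: Dict[str, float], n: int = 250) -> List[str]:
    # MID-50: the n documents nearest a score of 0.5, in either direction
    return sorted(scores, key=lambda d: abs(scores[d] - 0.5))[:n]

def mid_75rc(scores: Dict[str, float], cutoff: float, n: int = 250) -> List[str]:
    # MID_75RC: the n documents nearest the 75% recall cut-off score
    return sorted(scores, key=lambda d: abs(scores[d] - cutoff))[:n]

def rand(scores: Dict[str, float], n: int = 250) -> List[str]:
    # RAND: n documents drawn uniformly from the scored population
    return random.sample(list(scores), n)

def top_plus_random(scores: Dict[str, float], n: int = 250,
                    top_fraction: float = 0.8) -> List[str]:
    # 80TOP20RD with top_fraction=0.8; 20TOP80RD with top_fraction=0.2
    k = int(n * top_fraction)
    top = top_ranked(scores, k)
    rest = list(set(scores) - set(top))
    return top + random.sample(rest, n - k)
```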
It should be noted that the MID_75RC strategy is a novel strategy that we have not seen in any literature. The reason we used 75% recall is that in real-world legal document reviews, a recall of 75% is a commonly used minimum performance metric. In practice, this strategy can be implemented by selecting documents with scores nearest to the cut-off score for 75% recall derived from a statistically representative sample set; essentially, an initial validation set (or control set) is required to implement this strategy in a real-world scenario. As an example, suppose a control set of 2,000 documents is isolated and coded by human reviewers and has a richness of 20%, resulting in 400 relevant documents within the random sample. The classifier would achieve an estimated 75% recall at the cut-off score at which 300 of the 400 relevant documents are identified by the classifier. For the purposes of this study, we used fully coded document populations in order to eliminate the uncertainty involved with this type of recall estimate.
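A worked sketch of the control set example above: with 2,000 coded documents at 20% richness (400 relevant), the 75% recall cut-off is the score at or above which 300 of the 400 relevant documents fall. The scores below are simulated purely for illustration.

```python
import random

random.seed(42)
# Simulated control set: (model score, is_relevant) for 2,000 coded documents,
# 400 of them relevant (20% richness); relevant docs skew toward higher scores.
control = [(random.betavariate(4, 2), True) for _ in range(400)] + \
          [(random.betavariate(2, 4), False) for _ in range(1600)]

relevant = sorted((s for s, rel in control if rel), reverse=True)
target = int(0.75 * len(relevant))        # 300 of the 400 relevant documents
cutoff = relevant[target - 1]             # lowest score still within the top 75%

recall = sum(s >= cutoff for s in relevant) / len(relevant)
print(f"estimated cut-off: {cutoff:.3f} -> estimated recall: {recall:.1%}")
```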
Our research empirically compared different seed set selection strategies combined with different active learning document selection strategies. Specifically, we address the following questions:
     1.  What effect do different seed set selection strategies have on the active learning process?
     2.  What effect do different active learning document selection strategies have on the predictive coding process?
     3.  How do seed set selection strategies impact the effectiveness of active learning selection strategies?
     4.  Are there combinations of seed set and active learning strategies that consistently outperform other strategies when an emphasis is placed on objectives most commonly associated with a SAL approach (namely, minimizing the amount of human review, time, and cost in isolating a precise population that achieves a certain recall threshold)?

4    Experiments

In this section, we first introduce the data sets we used in the empirical study, and then we discuss the experimental procedure and evaluation metrics. We report the experimental results in the next section.

4.1    Data Sets

We conducted experiments on four data sets from confidential, non-public, real legal matters across various industries such as social media, communications, construction, and security. We chose matters with data sets that ranged from around 300,000 to 500,000 documents in order to execute our experiments within a reasonable time period. The richness, or positive class rate, of the four data sets ranged from approximately 4% to 39%. Attorneys reviewed all documents in the four data sets over the course of the legal matter, and their coding (labels) provided the ability to fully evaluate the performance of the models. Tables 1A, 1B, 1C, and 1D provide the details for the four data sets, including sizes, attorney coding statistics, and statistics about the keyword terms for each data set. The predictive coding objective for Data Sets A, B, and C was to identify privileged communications between attorneys and clients. The objective for Data Set D was to identify documents responsive to production requests from the opposing party in the matter.

The recall of the keyword hits is around 93% for the privileged data sets and 34% for the responsive data set. As comparing keyword-based and predictive coding approaches for legal document review is beyond the scope of this paper, readers interested in this subject can read our previous related papers [6, 7].

Table 1A: Privilege Data Set Statistics

Data Set  | Total Documents | Privileged Documents | Not Privileged Documents | Richness
Project A | 308,621 | 46,730  | 261,891 | 15.14%
Project B | 393,745 | 14,307  | 379,438 | 3.63%
Project C | 277,412 | 38,834  | 238,578 | 14.00%

Table 1B: Responsive Data Set Statistics

Data Set  | Total Documents | Responsive Documents | Not Responsive Documents | Richness
Project D | 412,880 | 159,304 | 253,576 | 38.58%

Table 1C: Privilege Keyword Statistics

Data Set  | Total Documents | Keywords | Documents Hit by Keywords | Privileged Documents Hit by Keywords | Keyword Hit Percentage
Project A | 308,621 | 808   | 193,017 | 43,847 | 62.54%
Project B | 393,745 | 4,211 | 368,506 | 13,571 | 93.59%
Project C | 277,745 | 509   | 159,900 | 36,234 | 57.57%

Table 1D: Responsive Keyword Statistics

Data Set  | Total Documents | Keywords | Documents Hit by Keywords | Responsive Documents Hit by Keywords | Keyword Hit Percentage
Project D | 412,880 | 23 | 81,362 | 53,611 | 19.71%

4.2    Experiment Procedure

We conducted an empirical study on the effect that seed set and active learning document selection strategies have on the performance of a predictive coding process.

The same set of experiments was performed on each of the four data sets. For each data set, all five seed set selection strategies and all six active learning document selection strategies were tested, for a total of 30 combinations of seed set selection and active learning document selection strategies per data set. In all experiments, the seed set included 500 training documents, and an additional 250 training documents were selected in each round of active learning. Table 2 shows the richness of the seed sets for the four data sets. From the table, we can see that the Random seed set selection method generally yields richness similar to that of the overall data set, while seed sets derived from keyword search have higher richness than the overall data set.

Table 2: Richness of seed sets (%)

Data Set  | Random | Keyword Method 1 | Keyword Method 2 | Cluster Method 1 | Cluster Method 2
Project A | 14.8 | 40.2 | 43.4 | 15.0 | 15.2
Project B | 3.6  | 6.6  | 6.8  | 3.8  | 3.2
Project C | 11.8 | 36.2 | 34.4 | 14.2 | 12.8
Project D | 40.4 | 70.2 | 73.8 | 40.2 | 38.6

Our experimental procedure was as follows (the modeling configuration is sketched in code after the list):
     1.  First, use the selected seed set sampling method to determine an initial training set of 500 documents.
     2.  Train a model with the selected seed set using the same underlying machine learning algorithm (logistic regression) and text processing parameters.
     3.  Then, score the entire data set, excluding any document used in training.
     4.  Next, select an additional 250 documents using one of the active learning document selection strategies.
     5.  Finally, add these new training documents, train a new model, and repeat steps 3, 4, and 5 until there are no more documents left to be scored or the minimum performance of the model is achieved.
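The paper does not name a specific library, so the following scikit-learn pipeline is only one plausible rendering of the stated configuration (a bag of words with 1-grams, normalized term frequency, 20,000 token features, and logistic regression), not the authors' actual implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# One plausible rendering of the modeling configuration described above.
model = make_pipeline(
    TfidfVectorizer(
        ngram_range=(1, 1),   # bag of words with 1-grams
        max_features=20_000,  # 20,000 tokens as features
        use_idf=False,        # normalized term frequency, without idf weighting
        norm="l2",
    ),
    LogisticRegression(max_iter=1000),
)

# Each round, model.fit(coded_texts, coded_labels) retrains on all coded
# documents, and model.predict_proba(texts)[:, 1] produces the scores that
# the selection strategies consume.
```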
We used Logistic Regression as the machine learning algorithm due to its consistently high performance across different settings over various data sets, as demonstrated in previous studies [1, 2]. For text processing, we used a bag-of-words representation with 1-grams, normalized term frequency, and 20,000 tokens as features.

In each round of our experiments, the entire data set was used either in training or in scoring, which means on average more than 300,000 documents were used in training or scoring. The total number of models trained in our experiments was 114,933. We leveraged the Apache Lucene search engine library to build full-text indices of the data sets to speed up the training and scoring processes.

4.3    Evaluation Metrics

Our performance metric measured the percentage of documents requiring review to achieve the targeted recall level. In the common passive learning scenario, this metric can be calculated on a validation set without considering documents that are reviewed for training, because those documents typically have a negligible impact when attempting to achieve the desired recall performance. In an active learning scenario, as rounds increase, the number of documents reviewed for training and used to develop the model can constitute a considerable portion of the population requiring review. Therefore, performance metrics in our experiments were computed after each active learning round using two sets of documents. The first set contained the documents that were selected and reviewed during training. The second set contained the documents categorized as Responsive or Privileged by the predictive model after each round, namely the documents with probability scores greater than or equal to the predictive model's cut-off score. The documents with scores at or above the cut-off score are the documents that attorneys would consider producing to an opposing party, or using for assertions of privilege, or in some instances reviewing because they are likely responsive or, in the case of privilege, may contain content that would allow for the assertion of claims of privilege.

We can use an example to illustrate the calculation of these measures. Project A has 308,621 documents, of which 46,730 are positive. Using the random seed selection strategy, we would select a seed set with 74 positive documents and 426 negative documents. Now suppose we use the TOP active learning document selection strategy, which selects the 250 documents with the highest scores to add to the training set. Lastly, assume that after ten rounds the training set contains 2,491 positive documents and 509 negative documents, 3,000 documents in total. Examining the document scores after ten rounds in this example, we find that if we choose 24.7 as the cut-off score, there are 32,558 positive documents above this cut-off score. 32,558 + 2,491 = 35,049, which represents 75% of all the responsive documents (46,730) in this data set. The percentage of the population requiring review is then established by adding all the documents with scores above 24.7 (82,206) to all the training documents (3,000) and dividing by the total population size (308,621). This equals 27.6%.
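The arithmetic of this example, checked in a few lines; all of the numbers come from the example above.

```python
# Review-percentage metric for the Project A example above.
total_population   = 308_621   # all documents in Project A
total_positives    = 46_730    # all positive documents in Project A
training_reviewed  = 3_000     # 500 seed docs + 10 rounds x 250 docs
positives_training = 2_491     # positives found during training review
above_cutoff       = 82_206    # documents scoring above the 24.7 cut-off
positives_above    = 32_558    # positives among the documents above the cut-off

recall = (positives_above + positives_training) / total_positives
review_percentage = (above_cutoff + training_reviewed) / total_population
print(f"recall: {recall:.1%}, documents requiring review: {review_percentage:.1%}")
# -> recall: 75.0%, documents requiring review: 27.6%
```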
5    Results and Discussion

The total number of our experimental parameter combinations was 120. These parameters include the data set, the seed set selection method, and the active learning document selection strategy. On average, roughly 1,000 rounds of experiments were generated for each combination. To save space in this paper, we present only the most interesting results.

5.1    The Impact of Seed Set Selection Approaches

Figure 1 displays the percentage of documents requiring review to achieve 75% recall for the different seed set selection strategies, with the active learning strategies fixed to TOP and MID_75RC on Projects C and D; the first 100 rounds of experiments are shown. Figure 2 details the percentage of documents requiring review to achieve 75% recall for the different seed set selection strategies, with the RAND active learning document selection strategy fixed on Projects B and C; the first 100 rounds of experiments are shown. In general, these figures show that seed set selection strategies have a very modest impact on the performance of the active learning strategies, especially after many rounds of active learning. These results were expected when using seed sets with a small number of documents (e.g., 500), because the initial impact of the seed set selection strategy likely degrades over training rounds. Therefore, it may be worthwhile to experiment with seed sets of larger sizes in the future. However, from these results, we do find two salient aspects of the impact of the seed set selection approach. First, among the different active learning strategies, the TOP strategy is the most sensitive to the seed selection strategy; we can see a more apparent performance difference across the different seed set selection strategies (Figure 1). This implies that in the very popular Continuous Active Learning protocol, the seed set selection strategy has an impactful role and should be considered carefully. Second, the curves in Project B, a matter with 3.6% richness, show that the seed set selection strategy had a greater impact on a low richness population and that judgmental seed set selection strategies using keywords or clustering outperform randomly selected seed set documents in the early rounds (Figure 2).

Figure 1: Required Review at 75% Recall for the five Seed Set Methods with TOP, MID_75RC Active Learning Strategies on Project C and D (First 100 Rounds)

Figure 2: Required Review at 75% Recall for the five Seed Set Methods with RAND Active Learning Strategy on Project B and C (First 100 Rounds)

5.2    The Impact of Active Learning Strategies

Figure 3 shows the performance differences among TOP, MID-50, MID_75RC, and RAND with the seed set selection method fixed to random, over learning rounds of experiments until the optimum round is reached. These results confirm the findings of our previous research [1], i.e., active learning selection strategies such as uncertain sampling (MID-50) and random selection (RAND) can generate an effective model within fewer rounds than the popular TOP strategy. Moreover, we find that the MID_75RC strategy, a novel active learning strategy proposed for the first time in this paper, performs the best in almost all scenarios. This indicates that selecting documents nearest to the cut-off score for 75 percent recall is the most effective active learning strategy when attempting to achieve 75 percent recall.

Figure 3: Required Review at 75% Recall for TOP, MID-50, MID_75RC and RAND Active Learning Strategies with random Seed Set Selection Method

The performance difference between the TOP strategy and the MID_75RC strategy is even clearer when we look closely at the plots of the first 100 rounds. Table 3 shows that in the first 50 rounds the MID_75RC strategy consistently requires less review than the TOP strategy across all projects. The maximum saving is close to 20 percent, in Project C. In practice, this has a significant impact on the predictive coding process and should be considered by legal teams to help reduce review costs.

Table 3: Required Review at 75% Recall for TOP and MID_75RC Active Learning Strategies (First 50 Rounds, Every 10 Rounds)

Data Set  | Round | TOP | MID_75RC | Difference
Project A | 10 | 28% | 18% | 9%
Project A | 20 | 25% | 17% | 8%
Project A | 30 | 26% | 17% | 9%
Project A | 40 | 24% | 17% | 7%
Project A | 50 | 23% | 17% | 6%
Project B | 10 | 47% | 35% | 12%
Project B | 20 | 44% | 32% | 11%
Project B | 30 | 43% | 31% | 12%
Project B | 40 | 40% | 29% | 11%
Project B | 50 | 39% | 28% | 11%
Project C | 10 | 24% | 14% | 10%
Project C | 20 | 26% | 14% | 13%
Project C | 30 | 32% | 14% | 18%
Project C | 40 | 33% | 14% | 19%
Project C | 50 | 29% | 14% | 15%
Project D | 10 | 33% | 31% | 2%
Project D | 20 | 33% | 31% | 3%
Project D | 30 | 34% | 31% | 3%
Project D | 40 | 34% | 30% | 3%
Project D | 50 | 34% | 30% | 4%

5.3    Optimum Performance Round Analysis

We define the optimum performance round as the earliest round in which the amount of review required to reach 75 percent recall is at its lowest. After some analysis, we found that the dominant factor in reaching the optimum performance round is the active learning strategy and not the seed set selection strategy. In Tables 4A through 4D, we compiled the optimum performance round of each active learning strategy for the four data sets. We can see that strategies such as RAND, MID-50, or MID_75RC consistently take fewer rounds to reach the optimum performance round. Moreover, if a satisficing goal is set to a review percentage within 5%, 10%, or 15% of the optimum performance, we can see that those strategies require fewer rounds to reach the goal.
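A small sketch of this bookkeeping: given the per-round review percentages for one strategy, it finds the optimum round and the first round whose review percentage comes within a given tolerance of the optimum. Whether "within 5%" is relative to the optimum value (as assumed here) or an absolute difference is our reading, not stated explicitly in the paper.

```python
from typing import List, Tuple

def optimum_round(review_pct: List[float]) -> Tuple[int, float]:
    # Round (0 = the initial round) with the lowest review percentage
    best = min(range(len(review_pct)), key=review_pct.__getitem__)
    return best, review_pct[best]

def first_round_within(review_pct: List[float], tolerance: float) -> int:
    # First round whose review percentage is within `tolerance` (e.g. 0.05
    # for 5%) of the optimum, interpreted relative to the optimum value.
    _, best_pct = optimum_round(review_pct)
    for rnd, pct in enumerate(review_pct):
        if pct <= best_pct * (1 + tolerance):
            return rnd
    return len(review_pct) - 1  # unreachable: the optimum round always qualifies
```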
Table 4A: Project A Optimum Performance Rounds                                                                                                   Table 4D: Project D Optimum Performance Rounds




                                                                                                                                                                                                                             1st Round within 10%



                                                                                                                                                                                                                                                        1st Round within 15%
                                                                                                                                                                                                       1st Round within 5%
                                                                                                                                                                   Review Percentage
                                                               1st Round within 5% of



                                                                                        1st Round within 10%



                                                                                                                  1st Round within 15%




                                                                                                                                                                                       Optimum Round
                                                                                                                                                 Active Learning
                  Review Percentage


                                           Optimum Round




                                                                                                                                                                                                            of Op. Perf.



                                                                                                                                                                                                                                  of Op. Perf.



                                                                                                                                                                                                                                                             of Op. Perf.
Active Learning




                                                                                                                                                     Strategy
                                                                                             of Op. Perf.



                                                                                                                       of Op. Perf.
                                                                      Op. Perf.
    Strategy




                                                                                                                                                 TOP             30.47      494                                    350                              0                          0
TOP                     15.90                  192                           167                     145                       113               MID-50          31.02        13                                     0                              0                          0
MID-50                  15.71                   60                            33                      21                        13               MID_75RC        30.36      332                                      1                              0                          0
MID_75RC                16.24                   74                            19                      12                         8               RAND            31.65         8                                     0                              0                          0
RAND                    18.77                   21                             4                       2                         2               80TOP20RD       31.90         5                                     0                              0                          0
80TOP20RD               18.36                  213                            80                      27                        10               20TOP80RD       31.58         6                                     0                              0                          0
                                                                                                                                                 * Round 0 means the initial round.
20TOP80RD               19.09                   50                             7                       3                         2

Table 4B: Project B Optimum Performance Rounds
Active Learning   Review       Optimum   1st Round within   1st Round within    1st Round within
Strategy          Percentage   Round     5% of Op. Perf.    10% of Op. Perf.    15% of Op. Perf.
TOP               18.58        279       263                235                 206
MID-50            18.59        290       258                236                 218
MID_75RC          20.88        263       168                117                  92
RAND              27.79        123        85                 41                  22
80TOP20RD         21.20        321       272                225                 194
20TOP80RD         27.50        181       106                 70                  53

Table 4C: Project C Optimum Performance Rounds

Active Learning   Review       Optimum   1st Round within   1st Round within    1st Round within
Strategy          Percentage   Round     5% of Op. Perf.    10% of Op. Perf.    15% of Op. Perf.
TOP               13.56        148       130                118                 107
MID-50            13.22         25        11                  7                   5
MID_75RC          13.33         37        12                  6                   4
RAND              15.73         18         6                  4                   2
80TOP20RD         15.30        156       108                 50                  21
20TOP80RD         16.05         29         6                  3                   2

Table 4D: Project D Optimum Performance Rounds

Active Learning   Review       Optimum   1st Round within   1st Round within    1st Round within
Strategy          Percentage   Round     5% of Op. Perf.    10% of Op. Perf.    15% of Op. Perf.
TOP               30.47        494       350                  0                   0
MID-50            31.02         13         0                  0                   0
MID_75RC          30.36        332         1                  0                   0
RAND              31.65          8         0                  0                   0
80TOP20RD         31.90          5         0                  0                   0
20TOP80RD         31.58          6         0                  0                   0
* Round 0 means the initial round.
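    The round statistics reported in the tables above can be derived mechanically from a per-round effectiveness curve. The sketch below is a minimal illustration, not the code used in this study: the function names and the example curve are invented, and the per-round score could be any evaluation measure (for example, recall at the production cut-off).

    def optimum_round(scores):
        # Round at which the model attains its best score; scores[r] is
        # the model's evaluation score after training round r (round 0
        # is the initial seed set model).
        return max(range(len(scores)), key=lambda r: scores[r])

    def first_round_within(scores, tolerance):
        # First round whose score falls within `tolerance` of the
        # optimum score, e.g. tolerance=0.10 for the "1st Round within
        # 10% of Op. Perf." columns.
        best = max(scores)
        for r, s in enumerate(scores):
            if s >= best * (1.0 - tolerance):
                return r
        return None

    curve = [0.42, 0.55, 0.63, 0.71, 0.74, 0.75]   # invented example
    print(optimum_round(curve))                    # 5
    print(first_round_within(curve, 0.10))         # 3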
6   Conclusions

    Our experimental results show that seed set selection strategies have little impact on the active learning process. However, for low-richness projects, keyword-based seed set selection strategies have a more apparent effect. Also, the popular TOP active learning strategy is the most sensitive to different seed selection methodologies.
    Our results also show that choosing documents nearest to the cut-off score determined by reaching a 75 percent document recall has the potential to produce a high-performing model quickly. When excluding data sets with extremely low richness (such as Project B), this training methodology results in significantly higher-performing models in early training rounds, such as round 10 or round 20, rounds that are often associated with stopping points for Simple Active Learning models. In fact, in all three of our data sets with richness above 10 percent, using the MID_75RC active learning strategy achieved performance within roughly 10 percent of the optimum model performance within 10 rounds of active learning. In theory, focusing training around the dynamic cut-off score from round to round makes sense. Documents just above the cut-off score should be the documents the model includes as positives with the least certainty, so there should be the most opportunity to improve precision by training on the features within these documents. Documents just below the cut-off score should be the excluded negatives with the highest richness in the excluded population, so there should be the most opportunity to improve recall by training on the features within these documents.
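    As a concrete illustration of this idea, the sketch below selects the next training batch from the documents nearest an estimated 75 percent recall cut-off. It is a minimal sketch under assumed names, not the implementation used in this study; in particular, treating normalized model scores as a stand-in for relevance probabilities when estimating recall is an assumption made here for exposition.

    import numpy as np

    def mid_recall_batch(scores, batch_size=250, target_recall=0.75):
        # scores: model scores for the unreviewed documents (higher =
        # more likely relevant). Returns indices of the next batch.
        scores = np.asarray(scores, dtype=float)
        ranked = np.sort(scores)[::-1]                 # descending
        # Estimated recall after reviewing the top-k documents, using
        # the scores as a proxy for relevance probabilities (assumed).
        cum_recall = np.cumsum(ranked) / ranked.sum()
        k = int(np.searchsorted(cum_recall, target_recall))
        cutoff = ranked[min(k, len(ranked) - 1)]
        # Pick the documents whose scores are closest to the cut-off,
        # drawing from just above and just below it.
        nearest = np.argsort(np.abs(scores - cutoff))
        return nearest[:batch_size]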
    It will be interesting to continue testing these assumptions and to study this strategy both in data sets with low richness and with other cut-off scores that target different recall objectives or thresholds, such as 50 percent or 90 percent recall. It should be noted that in our current study we fixed the seed set size at 500 and the number of additional training documents in each round at 250. In future studies, we intend to examine seed sets of larger sizes and various numbers of additional active learning training documents.
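    For reference, the outline below shows how such an experiment can be replayed against fully labeled data with the fixed sizes used here (a 500-document seed set and 250 documents per round). It is an illustrative sketch, not the authors' implementation: the random seed set and the logistic regression classifier are assumptions made for exposition, and the paper also studies keyword-based seed set selection.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    SEED_SIZE = 500    # fixed seed set size used in this study
    BATCH_SIZE = 250   # additional training documents per round

    def simulate(X, y, select_batch, n_rounds=20, seed=0):
        # X: document feature matrix; y: 0/1 oracle relevance labels;
        # select_batch: a strategy such as mid_recall_batch above.
        rng = np.random.default_rng(seed)
        pool = np.arange(len(y))
        train = rng.choice(pool, SEED_SIZE, replace=False)  # random seed set
        unreviewed = np.setdiff1d(pool, train)
        model = None
        for _ in range(n_rounds):
            model = LogisticRegression(max_iter=1000).fit(X[train], y[train])
            scores = model.predict_proba(X[unreviewed])[:, 1]
            picked = unreviewed[select_batch(scores, BATCH_SIZE)]
            train = np.concatenate([train, picked])
            unreviewed = np.setdiff1d(unreviewed, picked)
        return model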
    The results provide practical techniques that legal practitioners can use to enhance their active learning predictive coding processes and to inform their training document selection strategies for passive learning approaches.

REFERENCES
[1] R. Chhatwal, N. Huber-Fliflet, R. Keeling, J. Zhang, and H. Zhao, “Empirical evaluations of active learning strategies in legal document review,” in 2017 IEEE International Conference on Big Data (Big Data), 2017, pp. 1428–1437.
[2] R. Chhatwal, N. Huber-Fliflet, R. Keeling, J. Zhang, and H. Zhao, “Empirical evaluations of preprocessing parameters’ impact on predictive coding’s effectiveness,” in 2016 IEEE International Conference on Big Data (Big Data), 2016, pp. 1394–1401.
[3] G. V. Cormack and M. R. Grossman, “Evaluation of Machine-learning Protocols for Technology-assisted Review in Electronic Discovery,” in Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, New York, NY, USA, 2014, pp. 153–162.
[4] G. V. Cormack and M. R. Grossman, “Autonomy and Reliability of Continuous Active Learning for Technology-Assisted Review,” arXiv preprint arXiv:1504.06868, 2015.
[5] N. Ghelani, G. V. Cormack, and M. D. Smucker, “Refresh Strategies in Continuous Active Learning,” in ProfS2018: First International Workshop on Professional Search, 2018.
[6] P. Gronvall, N. Huber-Fliflet, J. Zhang, R. Keeling, R. Neary, and H. Zhao, “An Empirical Study of the Application of Machine Learning and Keyword Terms Methodologies to Privilege-Document Review Projects in Legal Matters,” in 2018 IEEE International Conference on Big Data (Big Data), 2018, pp. 3282–3291.
[7] R. Keeling, N. Huber-Fliflet, J. Zhang, and R. P. Chhatwal, “Separating the Privileged Wheat from the Chaff – Using Text Analytics and Machine Learning to Protect Attorney-Client Privilege,” Richmond Journal of Law and Technology, 2019.
[8] D. D. Lewis, “A Sequential Algorithm for Training Text Classifiers: Corrigendum and Additional Data,” SIGIR Forum, vol. 29, no. 2, pp. 13–19, Sep. 1995.
[9] D. D. Lewis and W. A. Gale, “A Sequential Algorithm for Training Text Classifiers,” in Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, 1994, pp. 3–12.
[10] S. Lohr, “The Age of Big Data,” New York Times, vol. 11, 2012.
[11] C. J. Mahoney, N. Huber-Fliflet, K. Jensen, H. Zhao, R. Neary, and S. Ye, “Empirical Evaluations of Seed Set Selection Strategies for Predictive Coding,” in 2018 IEEE International Conference on Big Data (Big Data), 2018, pp. 3292–3301.
[12] N. M. Pace and L. Zakaras, Where the Money Goes: Understanding Litigant Expenditures for Producing Electronic Discovery. RAND Corporation, 2012.
[13] J. Pickens, T. Gricks, B. Hardi, M. Noel, and J. Tredennick, “An Exploration of Total Recall with Multiple Manual Seedings,” in Proceedings of TREC 2016, 2016.
[14] K. Schieneman and T. C. Gricks III, “Implications of Rule 26(g) on the Use of Technology-Assisted Review,” Fed. Cts. L. Rev., vol. 7, p. 247, 2014.