Refresh Strategies in Continuous Active Learning

Nimesh Ghelani, Gordon V. Cormack, and Mark D. Smucker
University of Waterloo, Canada
{nghelani,gvcormac,mark.smucker}@uwaterloo.ca

Copyright © by the paper's authors. Copying permitted for private and academic purposes. In: Joint Proceedings of the First International Workshop on Professional Search (ProfS2018); the Second Workshop on Knowledge Graphs and Semantics for Text Retrieval, Analysis, and Understanding (KG4IR); and the International Workshop on Data Search (DATA:SEARCH18). Co-located with SIGIR 2018, Ann Arbor, Michigan, USA, 12 July 2018, published at http://ceur-ws.org

Abstract

High recall information retrieval is crucial to tasks such as electronic discovery and systematic review. Continuous Active Learning (CAL) is a technique in which a human reviewer works in a loop with a machine learning model: the model presents a set of documents likely to be relevant, and the reviewer provides relevance feedback. Our focus in this paper is on one particular aspect of CAL: refreshing, a crucial and recurring event in the CAL process. During a refresh, the machine learning model is trained with the relevance judgments and a new list of likely-to-be-relevant documents is produced for the reviewer to judge. It is also computationally the most expensive step in CAL. In this paper, we investigate the effects of the default and alternative refresh strategies on the effectiveness and efficiency of CAL. We find that more frequent refreshes can significantly reduce the human effort required to achieve a given level of recall. For moderately sized datasets, the high computational cost of frequent refreshes can be reduced through a careful implementation. For dealing with resource constraints and large datasets, we propose alternative refresh strategies which provide the benefits of frequent refreshes at a lower computational cost.

1 Introduction

High recall information retrieval is crucial to tasks such as electronic discovery and systematic review, where the goal is to find all or nearly all relevant documents using minimal human effort. Technology-Assisted Review (TAR) methods outperform manual review in legal eDiscovery by reducing the cost spent on human reviewers [4, 7]. In a TAR process, a computer system uses judgments made by human reviewers to classify documents as either relevant or non-relevant. Continuous active learning (CAL) [2] is a TAR protocol in which a machine learning algorithm suggests the most likely relevant documents for review and continuously incorporates relevance feedback to improve its understanding of the search task. In a previous study, Cormack and Grossman [2] showed that CAL outperforms other TAR protocols on review tasks from actual legal matters and the TREC 2009 Legal Track. The Total Recall track in TREC 2015 and 2016 evaluated different systems under a simulated TAR setting [5, 6]. The Baseline Model Implementation (BMI), based on CAL, was used as the baseline in these tracks. BMI implements the AutoTAR algorithm. None of the participating systems were able to consistently outperform the baseline. In this paper, we modify and extend the AutoTAR algorithm.

Refreshing is an important step in the CAL process. During a refresh, the relevance judgments from the reviewer are used to train a new classifier. This classifier generates an ordered list of documents most likely to be relevant, which is later processed by the reviewer. This step has a high computational cost because it involves training a classifier and computing relevance likelihood scores for potentially all the documents in the corpus.
A refresh strategy determines when and how to perform the refresh. By studying various refresh strategies, we aim to:

• Improve the effectiveness of CAL, specifically its ability to achieve higher recall with less effort
• Improve the computational efficiency of CAL, so that it is responsive and feasible in a production environment

In this paper, we propose different refresh strategies and compare their effect on the behaviour of CAL. By default, refreshing in AutoTAR is done after a certain number of judgments is received, and this number increases exponentially over time. We find that we can make CAL more effective by refreshing after every judgment. However, this comes with significant computational overhead, which, with careful implementation and modern hardware, can be minimized for reasonably sized datasets. We also discuss other refresh strategies which have a lower computational cost and achieve similar effectiveness. These strategies can be considered when dealing with resource constraints or large datasets.

2 Continuous Active Learning

A general version of the AutoTAR CAL algorithm is described in Algorithm 1. The choice of refresh strategy can control when to perform a refresh (step 10), as well as the behaviour of training (step 4) and scoring (step 7). In this paper, we simulate human assessors using a set of existing relevance judgments (step 8). Unlabelled documents are considered non-relevant during the simulation.

Algorithm 1: AutoTAR CAL algorithm (assuming an arbitrary refresh strategy). A refresh strategy can control the behaviour of steps 4, 7 and 10.
1. Construct a seed document whose content is the topic description.
2. Label the seed document as relevant and add it to the training set.
3. Add 100 random documents from the collection to the training set, temporarily labeled as "not relevant".
4. Train a Logistic Regression classifier using the training set.
5. Remove from the training set the random documents added in step 3.
6. Flush the review queue.
7. Using the classifier, order documents by their relevance scores and put them into a review queue.
8. Review a document from the review queue, coding it as "relevant" or "not relevant".
9. Add the document to the training set.
10. Repeat steps 8-9 until a refresh is needed (defined by the refresh strategy).
11. Repeat steps 3-10 until some stopping condition is met.

Documents are represented as vectors of unigram tf-idf features, which are used for training the classifier and calculating relevance likelihood scores. BMI AutoTAR uses sofia-ml∗ [8] to train a logistic regression classifier. It uses the logreg-pegasos learner with 100000 iterations of roc sampling. A training iteration involves randomly sampling a relevant and a non-relevant document from the training set, computing the loss, and adjusting the classifier weights accordingly. The classifier weights are used to calculate a relevance likelihood score for any document.

The BMI tool provided by the TREC Total Recall organizers is written in BASH. For efficiency and extensibility, we implemented Algorithm 1 in C++† [1]. Our implementation keeps all the document feature vectors in memory to eliminate disk accesses and enable faster operations. New refresh strategies can be easily implemented and tested using this tool. The tool provides a command line interface to run simulations and an HTTP API for use in real-world applications.

∗ https://code.google.com/archive/p/sofia-ml/
† https://nims11.github.io/tarcal.html
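To make the control flow of Algorithm 1 concrete, the following is a minimal C++ sketch of the outer loop, with the classifier, the reviewer and the refresh decision abstracted behind callables. The names used here (run_cal, Scorer, RefreshStrategy, and so on) are illustrative placeholders rather than the actual interfaces of our tool.

// A minimal, self-contained sketch of the Algorithm 1 control flow.
// The classifier, reviewer and refresh decision are passed in as callables;
// the types and names here are illustrative, not the tool's real API.
#include <algorithm>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

struct Judgment { std::size_t doc_id; bool relevant; };

// Step 10: decides whether a refresh is needed, given the number of
// judgments made since the last refresh.
using RefreshStrategy = std::function<bool(std::size_t)>;
// Step 8: the (simulated) reviewer maps a document id to a judgment.
using Reviewer = std::function<bool(std::size_t)>;
// Steps 4 and 7: a trained model, reduced to a scoring function.
using Scorer = std::function<double(std::size_t)>;
// Steps 3-5: training produces a new Scorer from the current judgments.
using Trainer = std::function<Scorer(const std::vector<Judgment>&)>;

std::vector<Judgment> run_cal(std::size_t corpus_size, const Trainer& train,
                              const Reviewer& review,
                              const RefreshStrategy& need_refresh,
                              std::size_t max_effort) {
    std::vector<Judgment> judgments = {{0, true}};   // steps 1-2: seed document
    std::vector<bool> judged(corpus_size, false);
    judged[0] = true;

    while (judgments.size() < max_effort) {
        Scorer score = train(judgments);             // steps 3-5: refresh (retrain)
        std::vector<std::size_t> queue(corpus_size); // steps 6-7: rebuild the queue
        std::iota(queue.begin(), queue.end(), std::size_t{0});
        std::sort(queue.begin(), queue.end(),
                  [&](std::size_t a, std::size_t b) { return score(a) > score(b); });

        std::size_t since_refresh = 0;
        for (std::size_t doc_id : queue) {           // steps 8-10: review loop
            if (judged[doc_id]) continue;
            judged[doc_id] = true;
            judgments.push_back({doc_id, review(doc_id)});
            ++since_refresh;
            if (judgments.size() >= max_effort || need_refresh(since_refresh))
                break;
        }
    }
    return judgments;
}

With this structure, a static batch strategy with batch size k is simply a predicate that returns true once k judgments have been made since the last refresh, and the BMI schedule replaces the fixed k with the growing batch size described in Section 4.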
3 Experiment Setup

We used the Athome1 test collection from the TREC 2015 Total Recall track [6]. The collection has 290,099 documents and 10 topics. Each topic has an average of 4398 documents labelled as relevant.

To measure effectiveness, we ran a CAL simulation for each topic and refresh strategy until Emax judgments were made. Emax is also known as the maximum review effort. In our experiments, Emax was equal to 2 × R, where R is the total number of relevant documents for that topic. A gain curve for a topic is a plot of recall (y-axis) against the normalized review effort Enorm = E/R, where E is the number of judgments made since the beginning of the simulation. We obtain the average gain curve by averaging the recall over the 10 topics. For the sake of readability and space, we excluded those plots from this paper. Instead, we report certain points of interest from the plots in a table. Specifically, we compare different refresh strategies based on their recall values when Enorm ∈ {1, 1.5, 2}. We also report the effort required to reach 75% recall for each refresh strategy.

To measure computational efficiency, we fixed Emax = 10000 and ran the simulation across four 2.20GHz CPUs (step 7 of Algorithm 1 is the only parallelized step) for each topic and refresh strategy. Different refresh strategies are compared based on the running time of the simulation averaged over the 10 topics.

4 Refresh Strategies

In this section, we describe the refresh strategies we investigated. We use the term full refresh to denote a refresh in which all available judgments are used in training and relevance likelihood scores are calculated for all the documents. A full refresh runs in O(n log n)‡ time, where n is the number of documents in the corpus: training takes O(1) time (the number of iterations is fixed), computing the relevance likelihood scores takes O(n) time (assuming the length of the documents to be constant), and sorting the scores takes O(n log n) time. However, the constant costs associated with training and scoring are significantly higher than those of sorting.

‡ In most cases, only the top k documents (k << n) are needed, and the refresh can be performed in O(n log k) time.

BMI

In BMI AutoTAR, a full refresh is performed after every k judgments. Initially k = 1, and after every refresh k is updated using k ← k + ⌊(k + 9)/10⌋. Therefore, k increases exponentially over time. This strategy scales well with the number of judgments (E) made during the CAL process, since only O(log E) refreshes are done. According to the authors of BMI, the motivation behind this strategy was to "reap the benefits of early precision, while avoiding downside risk and excessive running time, by using exponentially increasing batch size" [3].
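To illustrate how quickly the BMI batch size grows, the short program below (an illustration, not part of BMI itself) prints the batch size and the cumulative number of judgments after each refresh.

// Prints the BMI batch-size schedule: k starts at 1 and grows by
// floor((k + 9) / 10) after every refresh, i.e. by roughly 10% once k >= 10.
#include <cstdio>

int main() {
    long long k = 1, total_judgments = 0;
    for (int refresh = 1; refresh <= 40; ++refresh) {
        total_judgments += k;             // k documents are reviewed per batch
        std::printf("refresh %2d: batch size %lld, judgments so far %lld\n",
                    refresh, k, total_judgments);
        k += (k + 9) / 10;                // the AutoTAR update rule
    }
    return 0;
}

The first ten batches grow by one document each; after that the batch size grows by roughly a factor of 1.1 per refresh, which is why only O(log E) refreshes are needed to process E judgments.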
Static Batch

In the static batch refresh strategy, a full refresh is performed after every k judgments, where k is fixed. This strategy incurs a high computational cost and introduces scalability issues, since it requires O(E) refreshes and each refresh takes Ω(n) time, where E is the number of documents judged during the CAL process and n is the number of documents in the dataset. For very small values of k (such as 1) and large values of n, pauses as small as half a second due to frequent refreshes can disrupt the user experience. One way to work around this problem is to perform asynchronous refreshes and immediately show the user documents from the old review queue [1]. This delays the effect of user feedback on the review queue by ⌈tr/tu⌉ documents, where tr is the time it takes to complete a refresh and tu is the time a user takes to review one document. While this delay is usually tiny, additional experiments need to be performed to measure the impact of these delays on effectiveness.

Partial Refresh

In this strategy, a full refresh is performed after every k judgments. At the end of each full refresh, the s documents with the highest relevance likelihood scores are stored in a partial refresh set. After every judgment, a partial refresh is performed: all available judgments are used in training, but relevance likelihood scores are only calculated for the documents in the partial refresh set. A single partial refresh runs in O(s log s) time; a sketch of the bookkeeping involved appears at the end of this subsection.

With some enhancements, this strategy can also help reduce the high memory costs when working with low physical memory or very large datasets (such as ClueWeb). As mentioned in Section 2, the documents are loaded into memory to enable faster operations and improve the responsiveness of the system. Partial refreshes are fast and are performed on a small set of data which can be stored in memory. The full refresh can be performed in the background, and can thus afford reads from the disk without sacrificing the user experience or the effectiveness of this strategy.
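The following is a rough C++ sketch of that bookkeeping, with the classifier again reduced to hypothetical train() and score() callables; it is meant to show where the full refresh and the O(s log s) partial refresh occur, not to reproduce our implementation.

// Sketch of the partial refresh strategy: every k-th judgment triggers a
// full refresh, which rescores the whole corpus and keeps the top-s document
// ids as the partial refresh set; every other judgment triggers a partial
// refresh, which retrains but rescores only that small set.
#include <algorithm>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

struct PartialRefresher {
    std::size_t k;                          // judgments between full refreshes
    std::size_t s;                          // size of the partial refresh set
    std::size_t corpus_size;
    std::size_t since_full = 0;
    std::vector<std::size_t> partial_set;   // doc ids kept from the last full refresh

    // Called once per judgment, after the new judgment is added to the training set.
    std::vector<std::size_t> on_judgment(const std::function<void()>& train,
                                         const std::function<double(std::size_t)>& score) {
        train();                            // both refresh types retrain on all judgments
        ++since_full;
        if (since_full >= k || partial_set.empty()) {
            since_full = 0;                 // full refresh: rescore the whole corpus
            std::vector<std::size_t> ids(corpus_size);
            std::iota(ids.begin(), ids.end(), std::size_t{0});
            std::size_t keep = std::min(s, corpus_size);
            std::partial_sort(ids.begin(), ids.begin() + keep, ids.end(),
                              [&](std::size_t a, std::size_t b) { return score(a) > score(b); });
            ids.resize(keep);
            partial_set = ids;
        } else {                            // partial refresh: rescore only the stored set
            std::sort(partial_set.begin(), partial_set.end(),
                      [&](std::size_t a, std::size_t b) { return score(a) > score(b); });
        }
        return partial_set;                 // new review queue, best first
    }
};

Documents that have already been judged would be skipped when the returned queue is consumed, as in Algorithm 1.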
Table 1: Summary of the refresh strategies and their parameters.

  Strategy             Parameters
  bmi refresh          None
  static batch         k = no. of judgments between full refreshes
  partial refresh      k = no. of judgments between full refreshes
                       s = no. of documents in the partial refresh set
  precision strategy   m = no. of recent judgments to compute precision on
                       p = a full refresh is triggered if the precision of the last m judgments falls below this value
  recency weighting    w = factor by which the latest judged document is more likely to be sampled than the oldest judged document
  (all strategies)     it = no. of training iterations (global parameter, 100000 unless specified)

Precision Based Refreshing

In this strategy, a full refresh is performed when the "output quality" of the CAL system falls below some threshold. We define "output quality" as the fraction of relevant judgments in the last m judgments made by the reviewer. A full refresh is performed whenever this fraction falls below some threshold p. Unlike the previous strategies, we do not define our refresh criterion in terms of the elapsed number of judgments. Our aim is to find more meaningful factors which can help us better understand the effectiveness of various refresh strategies and, as a result, help us design better refresh strategies.

Recency Weighting

This strategy modifies the training step (step 4 in Algorithm 1) of the CAL process by favoring documents which were judged recently. As described in Section 2, training is done over several iterations. In each iteration of the original training, a relevant and a non-relevant document are randomly sampled from the training set, and the loss computed using them is used to update the classifier weights. To incorporate recency weighting, we modified the uniform random sampling such that the probability of selecting a document increases if it was judged recently. Given a list of documents [D1, D2, ..., DN] ordered by the time they were judged, our modified random sampling selects document Dx with probability

  P(Dx) = P(D1) + P(D1)(x − 1)(w − 1) / (N − 1)

Therefore, P(Dx) is a linear function of x such that the latest judged document DN is w times more likely to be selected than the oldest judged document D1. A full refresh (with modified training) is performed after every judgment.
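As a sketch of this modified sampling step (the function name and the use of std::discrete_distribution are our own illustration, not the sofia-ml code path), the weights below grow linearly from 1 for the oldest judgment to w for the newest, which is proportional to P(Dx) above.

// Samples the index (0-based) of one judged document, where document x
// (1-based, in judgment order) has weight 1 + (x - 1)(w - 1)/(N - 1):
// the newest judgment is w times as likely to be drawn as the oldest.
#include <cstddef>
#include <random>
#include <vector>

std::size_t sample_recency_weighted(std::size_t n_judged, double w, std::mt19937& rng) {
    std::vector<double> weights(n_judged, 1.0);
    for (std::size_t x = 2; x <= n_judged; ++x) {
        weights[x - 1] = 1.0 + static_cast<double>(x - 1) * (w - 1.0) /
                                   static_cast<double>(n_judged - 1);
    }
    // discrete_distribution normalizes the weights into probabilities.
    std::discrete_distribution<std::size_t> dist(weights.begin(), weights.end());
    return dist(rng);
}

In the training loop, one relevant and one non-relevant document are drawn in each iteration, so this weighted sampling would be applied separately to the relevant and the non-relevant portions of the training set.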
5 Results and Discussion

For the sake of space and readability, we encode each strategy and its parameter settings as strategy name(param1 = x, ...). Table 1 lists all the strategies and their parameters. Table 2 lists the results for different parameter settings of bmi refresh, static batch, partial refresh and precision strategy. We report the recall achieved at different values of effort, the effort required to achieve 75% recall, and the average running time. Instead of absolute effort, we use the normalized effort Enorm as defined in Section 3. For example, "Avg. Recall @(Enorm = 1.5)" denotes the average recall achieved across all the topics when 1.5 × R documents have been judged, where R is the total number of relevant documents for a topic.

Table 2: Summary of results for bmi refresh, static batch, partial refresh and precision strategy.

  Strategy                              Avg. Recall    Avg. Recall      Avg. Recall    Enorm for     Running Time
                                        @(Enorm = 1)   @(Enorm = 1.5)   @(Enorm = 2)   75% recall    (in min)
  bmi refresh                           0.715          0.827            0.905          1.128         0.22
  static batch(k = 1)                   0.750          0.865            0.926          1.021         49.29
  static batch(k = 100)                 0.704          0.807            0.887          1.167         0.47
  partial refresh(k = 10, s = 1000)     0.753          0.862            0.926          1.008         40.92
  partial refresh(k = 100, s = 500)     0.753          0.855            0.923          1.022         38.28
  partial refresh(k = 100, s = 1000)    0.754          0.856            0.922          1.013         39.57
  partial refresh(k = 100, s = 5000)    0.756          0.855            0.921          1.016         40.70
  partial refresh(k = 500, s = 1000)    0.700          0.785            0.815          1.324         38.63
  precision strategy(m = 25, p = 0.4)   0.698          0.849            0.915          1.129         35.68
  precision strategy(m = 25, p = 0.6)   0.735          0.859            0.923          1.059         40.20
  precision strategy(m = 25, p = 0.8)   0.750          0.862            0.926          1.024         44.64
  precision strategy(m = 25, p = 1.0)   0.752          0.865            0.926          1.014         47.41

With static batch(k = 1), CAL achieves a significantly higher recall of 75% at Enorm = 1 than bmi refresh, which achieves 71.5% recall. static batch(k = 100) performs worse than bmi refresh, managing to achieve 70.4% recall at the same effort. These results establish that frequent refreshing helps to achieve higher recall. Although the batch sizes in BMI increase exponentially with time, it still performs frequent refreshes during the early stages of the CAL process, thus performing better than static batch(k = 100). bmi refresh is also extremely cheap in terms of computational cost, since it only performs a logarithmic number of refreshes relative to the static batch strategies. The bmi refresh simulation finished in less than a minute, while static batch(k = 1) took 49 minutes.

We evaluate the rest of the refresh strategies by comparing them to static batch(k = 1). By fixing s = 1000 and varying k in the partial refresh strategy, we observe that for k = 10 and k = 100, the difference in recall remains insignificant throughout the CAL process. Their recall scores are also very similar to static batch(k = 1); they achieve 86.2% and 85.5% recall at Enorm = 1.5, respectively. For k = 500, we observe 78.6% recall at the same effort, which is worse than bmi refresh (82.6%). This is in agreement with our previous observation that more frequent full refreshes increase CAL's effectiveness.

static batch(k = 100) consistently achieved lower recall than static batch(k = 1), while partial refresh(k = 100, s = 1000) is as effective as the latter. Based on this, it can be established that partial refreshing contributes significant improvements to recall. On fixing k = 100 and varying s, we observe no changes to the recall values. Partial refresh strategies also improve the running time of simulations by 20% when compared to static batch(k = 1).

In the precision based refresh strategy, we fix m = 25 and vary p. For p = 0.8 and p = 1.0, precision strategy achieves 75% recall at Enorm = 1, which is similar to static batch(k = 1). This similarity of recall is also seen at Enorm = 1.5 and Enorm = 2. When p = 1, precision strategy refreshes whenever a non-relevant judgment is made, thus behaving very similarly to static batch(k = 1). For lower values of p, we observe lower recall values during the initial stages (73.5% recall at Enorm = 1 for p = 0.6). However, they catch up to static batch(k = 1) at higher Enorm, as relevant documents become rarer (85.9% recall at Enorm = 1.5 for p = 0.6). precision strategy improved the running time of simulations by 15% on average when compared to static batch(k = 1). precision strategy triggers fewer refreshes during the beginning of the CAL process, when relevant documents are easier to find. During the later stages, when relevant documents are harder to find, precision strategy tends to keep refreshing after every judgment.

During our initial experiments, recency weighting seemed to have no impact on the recall values. In further experiments, we found that the default number of training iterations (it = 100000) was too high for recency weighting to make any difference. By reducing the number of training iterations to 1000, we introduced a significant degradation in the system's effectiveness. Reducing the number of training iterations also reduced the running time of the simulation by 76% when compared to static batch(k = 1). We used recency weighting to see whether it could recover the lost effectiveness. recency weighting(w = 1, it = 1000) is equivalent to static batch(k = 1, it = 1000), and it achieves 70.4% and 79.8% recall when Enorm is equal to 1 and 1.5, respectively. By increasing w, we observe an increase in recall for Enorm ∈ {1.5, 2}. However, the recall is consistently and significantly lower when compared to static batch(k = 1). For example, recency weighting(w = 10, it = 1000) is only able to achieve 82.4% recall at Enorm = 1.5.

6 Future Work

There are many large scale datasets (e.g., ClueWeb, Twitter) which far exceed the scale of the dataset used in this paper. It is desirable to run CAL efficiently on these datasets. We described an enhancement to the partial refresh strategy which could potentially achieve this. Additional strategies which deal with large amounts of data could also be designed. In addition to runtime efficiency, these strategies will also need to optimize for memory efficiency.

In this paper, we measured the computational efficiency of strategies using the running time of the simulations. In a more practical setting where an actual human is judging documents, the responsiveness of the CAL system also becomes important. Under the static batch refresh strategy, we mentioned the idea of introducing a delay to improve the responsiveness of CAL systems. Additional experiments which simulate and measure the impact of those delays could be designed.
7 Conclusion

We investigated a crucial part of the Continuous Active Learning (CAL) process called refreshing, and proposed and compared various refresh strategies. Our results show that by refreshing more often, CAL can be used to achieve higher recall. Refreshing after every judgment (static batch(k = 1)) results in consistently better performance than the original BMI AutoTAR in terms of recall. Frequent refreshing is computationally more expensive than the BMI refresh strategy, but with an efficient implementation on modern computers, it can be practically used in real world CAL systems.

We also defined and analysed alternative refresh strategies which are as effective as refreshing after every judgment. In our experiments, various settings of the partial refresh strategy and the precision strategy achieved recall scores similar to static batch(k = 1) at a lower computational cost. For situations where computational resources are limited or the dataset is very large, the partial refresh strategy can be used. Precision based refreshing is efficient when the relevant documents are abundant and easier to find.

We also provide an efficient C++ implementation of CAL and all the refresh strategies mentioned in this paper as an open source tool. The tool is designed to be used both as a research tool and in real world applications.

Acknowledgments

This work was supported in part by the Natural Sciences and Engineering Research Council of Canada (Grants CRDPJ 468812-14 and RGPIN-2014-03642), in part by a Google Founders Award, and in part by the University of Waterloo.

References

[1] Abualsaud, M., Ghelani, N., Zhang, H., Smucker, M.D., Cormack, G.V., Grossman, M.R.: A system for efficient high-recall retrieval. In: Proceedings of the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM (2018)
[2] Cormack, G.V., Grossman, M.R.: Evaluation of machine-learning protocols for technology-assisted review in electronic discovery. In: Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM (2014) 153-162
[3] Cormack, G.V., Grossman, M.R.: Autonomy and reliability of continuous active learning for technology-assisted review. arXiv preprint arXiv:1504.06868 (2015)
[4] Grossman, M.R., Cormack, G.V.: Technology-assisted review in e-discovery can be more effective and more efficient than exhaustive manual review. Rich. JL & Tech. 17 (2010) 1
[5] Grossman, M.R., Cormack, G.V., Roegiest, A.: TREC 2016 Total Recall track overview. In: TREC (2016)
[6] Roegiest, A., Cormack, G.V., Grossman, M.R., Clarke, C.: TREC 2015 Total Recall track overview. Proc. TREC-2015 (2015)
[7] Roitblat, H.L., Kershaw, A., Oot, P.: Document categorization in legal electronic discovery: computer classification vs. manual review. Journal of the Association for Information Science and Technology 61(1) (2010) 70-80
[8] Sculley, D.: Combined regression and ranking. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM (2010) 979-988