<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Refresh Strategies in Continuous Active Learning</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Nimesh Ghelani, Gordon V. Cormack, and Mark D. Smucker University of Waterloo</institution>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <fpage>18</fpage>
      <lpage>23</lpage>
      <abstract>
        <p>High recall information retrieval is crucial to tasks such as electronic discovery and systematic review. Continuous Active Learning (CAL) is a technique where a human reviewer works in loop with a machine learning model; the model presents a set of documents likely to be relevant and the reviewer provides relevance feedback. Our focus in this paper is on one particular aspect of CAL: refreshing, which is a crucial and recurring event in the CAL process. During a refresh, the machine learning model is trained with the relevance judgments and a new list of likely-to-be-relevant documents is produced for the reviewer to judge. It is also computationally the most expensive step in CAL. In this paper, we investigate the e ects of the default and alternative refresh strategies on the e ectiveness and e ciency of CAL. We nd that more frequent refreshes can signi cantly reduce the human e ort required to achieve certain recall. For moderately sized datasets, the high computation cost of frequent refreshes can be reduced through a careful implementation. For dealing with resource constraints and large datasets, we propose alternative refresh strategies which provide the bene ts of frequent refreshes at a lower computation cost.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Copyright c by the paper's authors. Copying permitted for private and academic purposes.
training a classi er and computing relevance likelihood scores for potentially all the documents in the corpus. A
refresh strategy determines when and how to perform the refresh. By studying various refresh strategies, we aim
to:</p>
      <p>Improve e ectiveness of CAL; speci cally its capability to achieve higher recall with lesser e ort
Improve computational e ciency of CAL; so that it is responsive and feasible in a production environment.</p>
      <p>In this paper, we propose di erent refresh strategies and compare their e ect on the behaviour of CAL. By
default, refreshing in AutoTAR is done after a certain number of judgments is received. This number increases
exponentially over time. We nd that we can make CAL more e ective by refreshing after every judgment.
However, this comes with signi cant computation overhead, which with careful implementation and modern
hardware, can be minimized for reasonably sized datasets. We also discuss other refresh strategies which have
lower computation cost and achieve similar e ectiveness. These strategies can be considered when dealing with
resource constraints or large datasets.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Continuous Active Learning</title>
      <p>A general version of the AutoTAR CAL algorithm is described in Algorithm 1. The choice of refresh strategy
can control when to perform a refresh (step 10), as well as the behaviour of training (step 4) and scoring (step
5). In this paper, we simulate human assessors using a set of existing relevance judgments (Step 8). Unlabelled
documents are considered non-relevant during the simulation.</p>
      <p>Algorithm 1: AutoTAR CAL Algorithm (assuming an arbitrary refresh strategy). A refresh strategy
can control behaviour of steps 4, 7 and 10
1 Construct a seed document whose content is the topic description
2 Label the seed document as relevant and add it to the training set
3 Add 100 random documents from the collection, temporarily labeled as \not relevant"
4 Train a Logistic Regression classi er using the training set
5 Remove the random documents from the training set added in step 3
6 Flush the review queue
7 Using the classi er, order documents by their relevance scores and put them into a review queue
8 Review a document from the review queue, coding it as \relevant" or \not relevant"
9 Add the document to the training set
10 Repeat steps 8-9 until a refresh is needed (de ned by the refresh strategy)
11 Repeat steps 3-10 until some stopping condition is met.</p>
      <p>
        Documents are represented as a vector of unigram tf-idf features which are used for training the classi er and
calculating relevance likelihood scores. BMI AutoTAR uses so a-ml [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] to train a logistic regression classi er.
It uses the logreg-pegasos learner with 100000 iterations of roc sampling. A training iteration involves randomly
sampling a relevant and a non-relevant document from the training set, computing the loss and adjusting the
classi er weights accordingly. The classi er weights are used to calculate relevance likelihood score for any
document.
      </p>
      <p>
        The BMI tool provided by the TREC Total Recall organizers is written in BASH. For e ciency and
extensibility, we implemented Algorithm 1 in C++y [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Our implementation organizes all the document feature vectors
in the memory to eliminate disk accesses and enable faster operations. New refresh strategies can be easily
implemented and tested using this tool. The tool provides a command line interface to run simulations and a
HTTP API for use in real world applications.
3
      </p>
    </sec>
    <sec id="sec-3">
      <title>Experiment Setup</title>
      <p>
        We used the Athome1 test collection from the TREC 2015 Total Recall track [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The collection has 290,099
documents and 10 topics. Each topic has an average of 4398 documents labelled as relevant.
https://code.google.com/archive/p/so a-ml/
yhttps://nims11.github.io/tarcal.html
      </p>
      <p>To measure the e ectiveness, we ran a CAL simulation for each topic and refresh strategy until Emax number
of judgments were made. Emax is also known as the max review e ort. In our experiments, Emax was equal
to 2 R where R is the total number of relevant documents for that topic. A gain curve for a topic is a plot
of recall (y-axis) against the normalized review e ort (Enorm = ER ), where E is the number of judgments made
since the beginning of the simulation. We get the average gain curve by averaging the recall over the 10 topics.
For the sake of readability and space, we excluded those plots from this paper. Instead, we report certain points
of interest from the plots in a table. Speci cally, we compare di erent refresh strategies based on their recall
values when Enorm 2 f1; 1:5; 2g. We also report the e ort required to reach 75% recall for each refresh strategy.</p>
      <p>To measure the computational e ciency, we xed Emax = 10000 and ran the simulation across four 2.20GHz
CPUs (Step 7 of Algorithm 1 is the only parallelized step) for each topic and refresh strategy. Di erent refresh
strategies are compared based on the running time of the simulation averaged over the 10 topics.
4</p>
    </sec>
    <sec id="sec-4">
      <title>Refresh Strategies</title>
      <p>In this section, we describe the refresh strategies we investigated. We use the term full refresh to denote a
refresh in which all available judgments are used in training and relevance likelihood scores for all the documents
are calculated. A full refresh runs in O(n log n) z time where n is the number of documents in the corpus.
Training takes O(1) time (number of iterations is xed), computing the relevance likelihood scores take O(n)
time (assuming length of the documents to be constant) and sorting the scores take O(n log n) time. However,
the constant costs associated with training and scoring are signi cantly higher than sorting.
BMI</p>
      <sec id="sec-4-1">
        <title>Static Batch</title>
        <p>In BMI AutoTAR, a full refresh is performed after every k judgments. Initially k = 1, and after every refresh,
is updated using k k + b k1+09 c. Therefore, k increases exponentially over time.</p>
        <p>
          This strategy scales well with the number of judgments (E) made during the CAL process since only O(log E)
refreshes are done. According to the authors of BMI, the motivation behind this strategy was to \reap the bene ts
of early precision, while avoiding downside risk and excessive running time, by using exponentially increasing
batch size" [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
        <p>In static batch refresh strategy, a full refresh is performed after every k judgments (k is xed).</p>
        <p>
          This strategy incurs a high computation cost and introduces scalability issues since it requires O(E) number
of refreshes and each refresh takes (n) time, where E is the number of documents judged during the CAL
process and n is the number of documents in the dataset. For very small values of k (such as 1) and large values
of n, pauses as small as half a second due to frequent refreshes can disrupt the user experience. One way to
work around this problem is to perform asynchronous refreshes and immediately show the users documents from
tr
the old review queue [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. This delays the e ect of user feedback on the review queue by d tu e documents, where
tr is the time it takes to complete a refresh, and tu is the time a user takes to review one document. While
this delay is usually tiny, additional experiments need to be performed to measure the impact of these delays on
e ectiveness.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>Partial Refresh</title>
        <p>In this strategy, a full refresh is performed after every k judgments. At the end of each full refresh, s documents
with the highest relevance likelihood scores are stored in a partial refresh set. After every judgment, a partial
refresh is performed. In a partial refresh, all available judgments are used in training but relevance likelihood
scores are only calculated for the documents in the partial refresh set. A single partial refresh runs in O(s log s)
time.</p>
        <p>With some enhancements, this strategy can also help reduce the high memory costs when working with low
physical memory or very large datasets (such as ClueWeb). As mentioned in Section 3, the documents are loaded
in memory to enable faster operations and improve the responsiveness of the system. Partial refreshes are fast
and performed on a small set of data which can be stored in the memory. Full refresh can be performed in the
background, and can thus a ord reads from the disk without sacri cing the user experience or e ectiveness of
this strategy.</p>
        <p>zIn most cases, only top k documents (k &lt;&lt; n) are needed, and the refresh can be performed in O(n log k) time
In this strategy, a full refresh is performed when the \output quality" of the CAL system falls below some
threshold. We de ne \output quality" as the fraction of relevant judgments in the last m judgments made by
the reviewer. A full refresh is performed whenever this fraction falls below some threshold p.</p>
        <p>Unlike previous strategies, we don't de ne our refresh criteria in terms of elapsed number of judgments. Our
aim is to nd more meaningful factors which can help us better understand the e ectiveness of various refresh
strategies, and as a result, help us design better refresh strategies.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Recency Weighting</title>
        <p>This strategy modi es the training step (step 4 in Algorithm 1) in the CAL process by favoring documents which
were recently judged. As described in Section 2, training is done over several iterations. In each iteration of the
original training, a relevant and a non-relevant document is randomly sampled from the training set. The loss
computed using them is used to update the classi er weights. To incorporate recency weighting, we modi ed the
uniform random sampling such that the probability of selecting a document increases if it was judged recently.</p>
        <p>Given a list of documents [D1; D2; :::; Dn] ordered by the time they were judged, our modi ed random sampling
will select a document Dx with probability P (Dx) where</p>
        <p>P (Dx) = P (D1) +</p>
        <p>P (D1)(x</p>
        <p>N
1)(w
1
1)</p>
        <p>Therefore, P (Dx) is a linear function such that the latest judged document Dn is w times more likely to be
selected than the oldest judged document D1. A full refresh (with modi ed training) is performed after every
judgment.
5</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Results and Discussion</title>
      <p>For sake of space and readability, we encode each strategy with their parameter settings as
strategy name(param1 = x; : : :). Table 1 lists all the strategies and their parameters. Table 2 lists the results
for di erent parameter settings of bmi refresh, static batch, partial refresh and precision strategy. We report
the recall achieved at di erent values of e ort, e ort required to achieve 75% recall, and the average running
time. Instead of absolute e ort, we use normalized e ort Enorm as de ned in Section 3. For example, \Avg.
recall@(Enorm = 1:5)" denotes the average recall achieved across all the topics when 1:5 R documents haven
been judged, where R is the total number of relevant documents for a topic.</p>
      <p>With static batch(k = 1), CAL achieves signi cantly higher recall of 75% at Enorm = 1 than bmi refresh which
achieves 71:5% recall. static batch(k = 100) performs worse than bmi refresh, managing to achieve 70:4% recall
at the same e ort. These results establish that frequent refreshing helps to achieve higher recall. Although the
batch sizes in BMI increases exponentially with time, it still does frequent refreshes during the early stages of the
CAL process, thus performing better than static batch(k = 100). bmi refresh is also extremely cheap in terms
of computation cost since it only performs a logarithmic number of refreshes relative to static batch strategies.
bmi refresh simulation nished in less than a minute while static batch(k = 1) took 49 minutes.</p>
      <p>We evaluate rest of the refresh strategies by comparing them to static batch(k = 1).</p>
      <p>By xing s = 1000 and varying k in partial refresh strategy, we observe that for k = 10 and k = 100, the
di erence in recall remains insigni cant throughout the CAL process. Their recall scores are also very similar to
static batch(k = 1). They achieve 86:2% and 85:5% recall at Enorm = 1:5, respectively. For k = 500, we observe
78:6% recall at the same e ort, which is worse than bmi refresh (82:6%). This is in agreement with our previous
observation that more frequent full refreshes increases CAL's e ectiveness. static batch(k = 100) consistently
achieved lower recall when compared to static batch(k = 1) while partial refresh(k = 100; s = 1000) is as e ective
as the latter. Based on this, it can be established that partial refreshing contributes signi cant improvements
to recall. On xing k = 100 and varying s we observe no changes to the recall values. Partial refresh strategies
also improve the running time of simulations by 20% when compared to static batch(k = 1).</p>
      <p>In precision based refresh strategy, we x m = 25 and vary p. For p = 0:8 and p = 1:0, precision strategy
achieves 75% recall at Enorm = 1 which is similar to static batch(k = 1). This similarity of recall is also
seen at Enorm = 1:5 and Enorm = 2. When p = 1, precision strategy refreshes whenever a non-relevant
judgment is made, thus behaving very similar to static batch(k = 1). For lower values of p, we observe lower
recall values during the initial stages (73:5% recall when Enorm = 1 for p = 0:6). However, they catch up
to static batch(k = 1) at higher Enorm, as relevant documents become rarer (85:9% recall when Enorm = 1:5
for p = 0:6). precision strategy improved the running time of simulations by 15% on average when compared
to static batch(k = 1). precision strategy triggers lower number of refreshes during the beginning of the CAL
process when relevant documents are easier to nd. During the later stages when the relevant documents are
harder to nd, precision strategy tends to keep refreshing after every judgment.</p>
      <p>During our initial experiments, recency weighting seemed to have no impact on the recall values. On further
experiments, we found that the default number of training iterations (it = 100000) was very high for recency
weighting to cause any di erence. By reducing the number of training iterations to 1000, we introduced signi cant
degradation in the system's e ectiveness. Reducing the number of training iterations also reduced the running
time of simulation by 76% when compared to static batch(k = 1). We used recency weighting to see whether it
could recover the lost e ectiveness. recency weighting (w = 1; it = 1000) is equivalent to static batch(k = 1; it =
1000) and it achieves 70:4% and 79:8% recall when Enorm is equal to 1 and 1:5 respectively. By increasing w,
we observe an increase in recall for Enorm 2 f1:5; 2g. However, the recall is consistently and signi cantly lower
when compared to static batch(k = 1). For example, recency weighting (w = 10; it = 1000) is only able to achieve
82:4% recall at Enorm = 1:5.
6</p>
    </sec>
    <sec id="sec-6">
      <title>Future Work</title>
      <p>There are many large scale datasets (ClueWeb, Twitter, etc) which far exceed the scale of dataset used in this
paper. It is desirable to run CAL e ciently on these datasets. We described an enhancement to the partial
refresh strategy which could potentially achieve this. Additional strategies which deal with large amount of data
could also be designed. In addition to runtime e ciency, these strategies will also need to optimize for memory
e ciency.</p>
      <p>In this paper, we measure computational e ciency of strategies using the running time of the simulations. In
a more practical setting where an actual human is judging documents, responsiveness of the CAL system also
becomes important. Under static batch refresh strategy, we mentioned the idea of introducing a delay to improve
responsiveness of CAL systems. Additional experiments which simulate and measure the impact of those delays
could be designed.</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
      <p>We investigated a crucial part of the Continuous Active Learning (CAL) process called refreshing. We proposed
and compared various refresh strategies. Our results show that by refreshing more often, CAL can be used
to achieve higher recall. Refreshing after every judgment (static batch(k = 1)) results in consistently better
performance than the original BMI AutoTAR in terms of recall. However, frequent refreshing is computationally
more expensive than the BMI refresh strategy. But with an e cient implementation on modern computers,
frequent refreshing can be practically used in real world CAL systems. We also de ned and analysed alternative
refresh strategies which are as e ective as refreshing after every judgments. In our experiments, various settings
of partial refresh strategy and precision strategy achieved recall scores similar to static batch(k = 1) at a lower
computation cost. For situations where computational resources are limited or dataset is very large, partial
refresh strategy can be used. Precision based refreshing is e cient when the relevant documents are abundant
and easier to nd.</p>
      <p>We also provide an e cient C++ implementation of CAL and all the refresh strategies mentioned in this paper
as an open source tool. The tool is designed to be used both as a research tool and in real world applications.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This work was supported in part by the Natural Sciences and Engineering Research Council of Canada (Grants
CRDPJ 468812-14 and RGPIN-2014-03642), in part by a Google Founders Award, and in part by the University
of Waterloo.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Abualsaud</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghelani</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
          </string-name>
          , H.,
          <string-name>
            <surname>Smucker</surname>
            ,
            <given-names>M.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cormack</surname>
            ,
            <given-names>G.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grossman</surname>
            ,
            <given-names>M.R.:</given-names>
          </string-name>
          <article-title>A system for e cient high-recall retrieval</article-title>
          .
          <source>In: Proceedings of the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , ACM (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Cormack</surname>
            ,
            <given-names>G.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grossman</surname>
            ,
            <given-names>M.R.</given-names>
          </string-name>
          :
          <article-title>Evaluation of machine-learning protocols for technology-assisted review in electronic discovery</article-title>
          .
          <source>In: Proceedings of the 37th international ACM SIGIR conference on Research &amp; development in information retrieval</source>
          ,
          <source>ACM</source>
          (
          <year>2014</year>
          )
          <volume>153</volume>
          {
          <fpage>162</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Cormack</surname>
            ,
            <given-names>G.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grossman</surname>
            ,
            <given-names>M.R.</given-names>
          </string-name>
          :
          <article-title>Autonomy and reliability of continuous active learning for technologyassisted review</article-title>
          .
          <source>arXiv preprint arXiv:1504.06868</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Grossman</surname>
            ,
            <given-names>M.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cormack</surname>
            ,
            <given-names>G.V.</given-names>
          </string-name>
          :
          <article-title>Technology-assisted review in e-discovery can be more e ective and more e cient than exhaustive manual review</article-title>
          .
          <source>Rich. JL &amp; Tech. 17</source>
          (
          <year>2010</year>
          )
          <fpage>1</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Grossman</surname>
            ,
            <given-names>M.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cormack</surname>
            ,
            <given-names>G.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roegiest</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Trec 2016 total recall track overview</article-title>
          . In: TREC. (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Roegiest</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cormack</surname>
            ,
            <given-names>G.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grossman</surname>
            ,
            <given-names>M.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clarke</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Trec 2015 total recall track overview</article-title>
          .
          <source>Proc. TREC-2015</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Roitblat</surname>
            ,
            <given-names>H.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kershaw</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oot</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Document categorization in legal electronic discovery: computer classi cation vs. manual review</article-title>
          .
          <source>Journal of the Association for Information Science and Technology</source>
          <volume>61</volume>
          (
          <issue>1</issue>
          ) (
          <year>2010</year>
          )
          <volume>70</volume>
          {
          <fpage>80</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Sculley</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Combined regression and ranking</article-title>
          .
          <source>In: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          ,
          <source>ACM</source>
          (
          <year>2010</year>
          )
          <volume>979</volume>
          {
          <fpage>988</fpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>