Introduction

Refresh Strategies in Continuous Active Learning

0 Nimesh Ghelani, Gordon V. Cormack, and Mark D. Smucker University of Waterloo , Canada

18 23

High recall information retrieval is crucial to tasks such as electronic discovery and systematic review. Continuous Active Learning (CAL) is a technique where a human reviewer works in loop with a machine learning model; the model presents a set of documents likely to be relevant and the reviewer provides relevance feedback. Our focus in this paper is on one particular aspect of CAL: refreshing, which is a crucial and recurring event in the CAL process. During a refresh, the machine learning model is trained with the relevance judgments and a new list of likely-to-be-relevant documents is produced for the reviewer to judge. It is also computationally the most expensive step in CAL. In this paper, we investigate the e ects of the default and alternative refresh strategies on the e ectiveness and e ciency of CAL. We nd that more frequent refreshes can signi cantly reduce the human e ort required to achieve certain recall. For moderately sized datasets, the high computation cost of frequent refreshes can be reduced through a careful implementation. For dealing with resource constraints and large datasets, we propose alternative refresh strategies which provide the bene ts of frequent refreshes at a lower computation cost.

Introduction

Copyright c by the paper's authors. Copying permitted for private and academic purposes. training a classi er and computing relevance likelihood scores for potentially all the documents in the corpus. A refresh strategy determines when and how to perform the refresh. By studying various refresh strategies, we aim to:

Improve e ectiveness of CAL; speci cally its capability to achieve higher recall with lesser e ort Improve computational e ciency of CAL; so that it is responsive and feasible in a production environment.

In this paper, we propose di erent refresh strategies and compare their e ect on the behaviour of CAL. By default, refreshing in AutoTAR is done after a certain number of judgments is received. This number increases exponentially over time. We nd that we can make CAL more e ective by refreshing after every judgment. However, this comes with signi cant computation overhead, which with careful implementation and modern hardware, can be minimized for reasonably sized datasets. We also discuss other refresh strategies which have lower computation cost and achieve similar e ectiveness. These strategies can be considered when dealing with resource constraints or large datasets. 2

Continuous Active Learning

A general version of the AutoTAR CAL algorithm is described in Algorithm 1. The choice of refresh strategy can control when to perform a refresh (step 10), as well as the behaviour of training (step 4) and scoring (step 5). In this paper, we simulate human assessors using a set of existing relevance judgments (Step 8). Unlabelled documents are considered non-relevant during the simulation.

Algorithm 1: AutoTAR CAL Algorithm (assuming an arbitrary refresh strategy). A refresh strategy can control behaviour of steps 4, 7 and 10 1 Construct a seed document whose content is the topic description 2 Label the seed document as relevant and add it to the training set 3 Add 100 random documents from the collection, temporarily labeled as \not relevant" 4 Train a Logistic Regression classi er using the training set 5 Remove the random documents from the training set added in step 3 6 Flush the review queue 7 Using the classi er, order documents by their relevance scores and put them into a review queue 8 Review a document from the review queue, coding it as \relevant" or \not relevant" 9 Add the document to the training set 10 Repeat steps 8-9 until a refresh is needed (de ned by the refresh strategy) 11 Repeat steps 3-10 until some stopping condition is met.

Documents are represented as a vector of unigram tf-idf features which are used for training the classi er and calculating relevance likelihood scores. BMI AutoTAR uses so a-ml [ 8 ] to train a logistic regression classi er. It uses the logreg-pegasos learner with 100000 iterations of roc sampling. A training iteration involves randomly sampling a relevant and a non-relevant document from the training set, computing the loss and adjusting the classi er weights accordingly. The classi er weights are used to calculate relevance likelihood score for any document.

The BMI tool provided by the TREC Total Recall organizers is written in BASH. For e ciency and extensibility, we implemented Algorithm 1 in C++y [ 1 ]. Our implementation organizes all the document feature vectors in the memory to eliminate disk accesses and enable faster operations. New refresh strategies can be easily implemented and tested using this tool. The tool provides a command line interface to run simulations and a HTTP API for use in real world applications. 3

Experiment Setup

We used the Athome1 test collection from the TREC 2015 Total Recall track [ 6 ]. The collection has 290,099 documents and 10 topics. Each topic has an average of 4398 documents labelled as relevant. https://code.google.com/archive/p/so a-ml/ yhttps://nims11.github.io/tarcal.html

To measure the e ectiveness, we ran a CAL simulation for each topic and refresh strategy until Emax number of judgments were made. Emax is also known as the max review e ort. In our experiments, Emax was equal to 2 R where R is the total number of relevant documents for that topic. A gain curve for a topic is a plot of recall (y-axis) against the normalized review e ort (Enorm = ER ), where E is the number of judgments made since the beginning of the simulation. We get the average gain curve by averaging the recall over the 10 topics. For the sake of readability and space, we excluded those plots from this paper. Instead, we report certain points of interest from the plots in a table. Speci cally, we compare di erent refresh strategies based on their recall values when Enorm 2 f1; 1:5; 2g. We also report the e ort required to reach 75% recall for each refresh strategy.

To measure the computational e ciency, we xed Emax = 10000 and ran the simulation across four 2.20GHz CPUs (Step 7 of Algorithm 1 is the only parallelized step) for each topic and refresh strategy. Di erent refresh strategies are compared based on the running time of the simulation averaged over the 10 topics. 4

Refresh Strategies

In this section, we describe the refresh strategies we investigated. We use the term full refresh to denote a refresh in which all available judgments are used in training and relevance likelihood scores for all the documents are calculated. A full refresh runs in O(n log n) z time where n is the number of documents in the corpus. Training takes O(1) time (number of iterations is xed), computing the relevance likelihood scores take O(n) time (assuming length of the documents to be constant) and sorting the scores take O(n log n) time. However, the constant costs associated with training and scoring are signi cantly higher than sorting. BMI

Static Batch

In BMI AutoTAR, a full refresh is performed after every k judgments. Initially k = 1, and after every refresh, is updated using k k + b k1+09 c. Therefore, k increases exponentially over time.

This strategy scales well with the number of judgments (E) made during the CAL process since only O(log E) refreshes are done. According to the authors of BMI, the motivation behind this strategy was to \reap the bene ts of early precision, while avoiding downside risk and excessive running time, by using exponentially increasing batch size" [ 3 ].

In static batch refresh strategy, a full refresh is performed after every k judgments (k is xed).

This strategy incurs a high computation cost and introduces scalability issues since it requires O(E) number of refreshes and each refresh takes (n) time, where E is the number of documents judged during the CAL process and n is the number of documents in the dataset. For very small values of k (such as 1) and large values of n, pauses as small as half a second due to frequent refreshes can disrupt the user experience. One way to work around this problem is to perform asynchronous refreshes and immediately show the users documents from tr the old review queue [ 1 ]. This delays the e ect of user feedback on the review queue by d tu e documents, where tr is the time it takes to complete a refresh, and tu is the time a user takes to review one document. While this delay is usually tiny, additional experiments need to be performed to measure the impact of these delays on e ectiveness.

Partial Refresh

In this strategy, a full refresh is performed after every k judgments. At the end of each full refresh, s documents with the highest relevance likelihood scores are stored in a partial refresh set. After every judgment, a partial refresh is performed. In a partial refresh, all available judgments are used in training but relevance likelihood scores are only calculated for the documents in the partial refresh set. A single partial refresh runs in O(s log s) time.

With some enhancements, this strategy can also help reduce the high memory costs when working with low physical memory or very large datasets (such as ClueWeb). As mentioned in Section 3, the documents are loaded in memory to enable faster operations and improve the responsiveness of the system. Partial refreshes are fast and performed on a small set of data which can be stored in the memory. Full refresh can be performed in the background, and can thus a ord reads from the disk without sacri cing the user experience or e ectiveness of this strategy.

zIn most cases, only top k documents (k << n) are needed, and the refresh can be performed in O(n log k) time In this strategy, a full refresh is performed when the \output quality" of the CAL system falls below some threshold. We de ne \output quality" as the fraction of relevant judgments in the last m judgments made by the reviewer. A full refresh is performed whenever this fraction falls below some threshold p.

Unlike previous strategies, we don't de ne our refresh criteria in terms of elapsed number of judgments. Our aim is to nd more meaningful factors which can help us better understand the e ectiveness of various refresh strategies, and as a result, help us design better refresh strategies.

Recency Weighting

This strategy modi es the training step (step 4 in Algorithm 1) in the CAL process by favoring documents which were recently judged. As described in Section 2, training is done over several iterations. In each iteration of the original training, a relevant and a non-relevant document is randomly sampled from the training set. The loss computed using them is used to update the classi er weights. To incorporate recency weighting, we modi ed the uniform random sampling such that the probability of selecting a document increases if it was judged recently.

Given a list of documents [D1; D2; :::; Dn] ordered by the time they were judged, our modi ed random sampling will select a document Dx with probability P (Dx) where

P (Dx) = P (D1) +

P (D1)(x

N 1)(w 1 1)

Therefore, P (Dx) is a linear function such that the latest judged document Dn is w times more likely to be selected than the oldest judged document D1. A full refresh (with modi ed training) is performed after every judgment. 5

Results and Discussion

For sake of space and readability, we encode each strategy with their parameter settings as strategy name(param1 = x; : : :). Table 1 lists all the strategies and their parameters. Table 2 lists the results for di erent parameter settings of bmi refresh, static batch, partial refresh and precision strategy. We report the recall achieved at di erent values of e ort, e ort required to achieve 75% recall, and the average running time. Instead of absolute e ort, we use normalized e ort Enorm as de ned in Section 3. For example, \Avg. recall@(Enorm = 1:5)" denotes the average recall achieved across all the topics when 1:5 R documents haven been judged, where R is the total number of relevant documents for a topic.

With static batch(k = 1), CAL achieves signi cantly higher recall of 75% at Enorm = 1 than bmi refresh which achieves 71:5% recall. static batch(k = 100) performs worse than bmi refresh, managing to achieve 70:4% recall at the same e ort. These results establish that frequent refreshing helps to achieve higher recall. Although the batch sizes in BMI increases exponentially with time, it still does frequent refreshes during the early stages of the CAL process, thus performing better than static batch(k = 100). bmi refresh is also extremely cheap in terms of computation cost since it only performs a logarithmic number of refreshes relative to static batch strategies. bmi refresh simulation nished in less than a minute while static batch(k = 1) took 49 minutes.

We evaluate rest of the refresh strategies by comparing them to static batch(k = 1).

By xing s = 1000 and varying k in partial refresh strategy, we observe that for k = 10 and k = 100, the di erence in recall remains insigni cant throughout the CAL process. Their recall scores are also very similar to static batch(k = 1). They achieve 86:2% and 85:5% recall at Enorm = 1:5, respectively. For k = 500, we observe 78:6% recall at the same e ort, which is worse than bmi refresh (82:6%). This is in agreement with our previous observation that more frequent full refreshes increases CAL's e ectiveness. static batch(k = 100) consistently achieved lower recall when compared to static batch(k = 1) while partial refresh(k = 100; s = 1000) is as e ective as the latter. Based on this, it can be established that partial refreshing contributes signi cant improvements to recall. On xing k = 100 and varying s we observe no changes to the recall values. Partial refresh strategies also improve the running time of simulations by 20% when compared to static batch(k = 1).

In precision based refresh strategy, we x m = 25 and vary p. For p = 0:8 and p = 1:0, precision strategy achieves 75% recall at Enorm = 1 which is similar to static batch(k = 1). This similarity of recall is also seen at Enorm = 1:5 and Enorm = 2. When p = 1, precision strategy refreshes whenever a non-relevant judgment is made, thus behaving very similar to static batch(k = 1). For lower values of p, we observe lower recall values during the initial stages (73:5% recall when Enorm = 1 for p = 0:6). However, they catch up to static batch(k = 1) at higher Enorm, as relevant documents become rarer (85:9% recall when Enorm = 1:5 for p = 0:6). precision strategy improved the running time of simulations by 15% on average when compared to static batch(k = 1). precision strategy triggers lower number of refreshes during the beginning of the CAL process when relevant documents are easier to nd. During the later stages when the relevant documents are harder to nd, precision strategy tends to keep refreshing after every judgment.

During our initial experiments, recency weighting seemed to have no impact on the recall values. On further experiments, we found that the default number of training iterations (it = 100000) was very high for recency weighting to cause any di erence. By reducing the number of training iterations to 1000, we introduced signi cant degradation in the system's e ectiveness. Reducing the number of training iterations also reduced the running time of simulation by 76% when compared to static batch(k = 1). We used recency weighting to see whether it could recover the lost e ectiveness. recency weighting (w = 1; it = 1000) is equivalent to static batch(k = 1; it = 1000) and it achieves 70:4% and 79:8% recall when Enorm is equal to 1 and 1:5 respectively. By increasing w, we observe an increase in recall for Enorm 2 f1:5; 2g. However, the recall is consistently and signi cantly lower when compared to static batch(k = 1). For example, recency weighting (w = 10; it = 1000) is only able to achieve 82:4% recall at Enorm = 1:5. 6

Future Work

There are many large scale datasets (ClueWeb, Twitter, etc) which far exceed the scale of dataset used in this paper. It is desirable to run CAL e ciently on these datasets. We described an enhancement to the partial refresh strategy which could potentially achieve this. Additional strategies which deal with large amount of data could also be designed. In addition to runtime e ciency, these strategies will also need to optimize for memory e ciency.

In this paper, we measure computational e ciency of strategies using the running time of the simulations. In a more practical setting where an actual human is judging documents, responsiveness of the CAL system also becomes important. Under static batch refresh strategy, we mentioned the idea of introducing a delay to improve responsiveness of CAL systems. Additional experiments which simulate and measure the impact of those delays could be designed.

Conclusion

We investigated a crucial part of the Continuous Active Learning (CAL) process called refreshing. We proposed and compared various refresh strategies. Our results show that by refreshing more often, CAL can be used to achieve higher recall. Refreshing after every judgment (static batch(k = 1)) results in consistently better performance than the original BMI AutoTAR in terms of recall. However, frequent refreshing is computationally more expensive than the BMI refresh strategy. But with an e cient implementation on modern computers, frequent refreshing can be practically used in real world CAL systems. We also de ned and analysed alternative refresh strategies which are as e ective as refreshing after every judgments. In our experiments, various settings of partial refresh strategy and precision strategy achieved recall scores similar to static batch(k = 1) at a lower computation cost. For situations where computational resources are limited or dataset is very large, partial refresh strategy can be used. Precision based refreshing is e cient when the relevant documents are abundant and easier to nd.

We also provide an e cient C++ implementation of CAL and all the refresh strategies mentioned in this paper as an open source tool. The tool is designed to be used both as a research tool and in real world applications.

Acknowledgments

This work was supported in part by the Natural Sciences and Engineering Research Council of Canada (Grants CRDPJ 468812-14 and RGPIN-2014-03642), in part by a Google Founders Award, and in part by the University of Waterloo.

[1] Abualsaud , M. , Ghelani , N. , Zhang , H., Smucker , M.D. , Cormack , G.V. , Grossman , M.R.: A system for e cient high-recall retrieval . In: Proceedings of the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval , ACM ( 2018 )

[2] Cormack , G.V. , Grossman , M.R. : Evaluation of machine-learning protocols for technology-assisted review in electronic discovery . In: Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval , ACM ( 2014 ) 153 { 162

[3] Cormack , G.V. , Grossman , M.R. : Autonomy and reliability of continuous active learning for technologyassisted review . arXiv preprint arXiv:1504.06868 ( 2015 )

[4] Grossman , M.R. , Cormack , G.V. : Technology-assisted review in e-discovery can be more e ective and more e cient than exhaustive manual review . Rich. JL & Tech. 17 ( 2010 ) 1

[5] Grossman , M.R. , Cormack , G.V. , Roegiest , A. : Trec 2016 total recall track overview . In: TREC. ( 2016 )

[6] Roegiest , A. , Cormack , G.V. , Grossman , M.R. , Clarke , C. : Trec 2015 total recall track overview . Proc. TREC-2015 ( 2015 )

[7] Roitblat , H.L. , Kershaw , A. , Oot , P. : Document categorization in legal electronic discovery: computer classi cation vs. manual review . Journal of the Association for Information Science and Technology 61 ( 1 ) ( 2010 ) 70 { 80

[8] Sculley , D. : Combined regression and ranking . In: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining , ACM ( 2010 ) 979 { 988