Automatic Thresholding by Sampling Documents and Estimating Recall: ILPS@UVA at TAR Task 2.2

Dan Li and Evangelos Kanoulas
University of Amsterdam, 1098 XH Amsterdam, Netherlands
{D.Li, E.Kanoulas}@uva.nl

Abstract. In this paper, we describe the participation of the Information and Language Processing System (ILPS) group at CLEF eHealth 2019 Task 2.2: Technologically Assisted Reviews in Empirical Medicine. This task aims to produce an efficient ordering of the documents and to identify a subset of the documents that contains as many of the relevant abstracts as possible for the least effort. Participants are provided with systematic review topics, each including a review title, a Boolean query constructed by Cochrane experts, and a set of PubMed Document Identifiers (PIDs) returned by running the Boolean query in MEDLINE. We handle the problem under the Continuous Active Learning framework by jointly training a ranking model to rank documents and conducting a "greedy" sampling to estimate the real number of relevant documents in the collection. We submitted four runs.

Keywords: Continuous active learning · Active sampling · R estimation.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019, Lugano, Switzerland.

1 Introduction

Systematic reviews are a type of literature review that uses systematic methods to reliably bring together the findings from multiple studies that address a question. They are often used to inform policy and practice, e.g. the development of medical guidelines in evidence-based medicine [6]. In order to write a systematic review, researchers have to come up with a Boolean query and conduct a search that will retrieve all the documents that are relevant. This is a difficult task, known in the Information Retrieval (IR) domain as the total recall problem.

The CLEF eHealth Task 2 "Technology Assisted Reviews in Empirical Medicine" aims to automate this process [3,4]. It consists of two subtasks. Task 2.1 focuses on the construction of the Boolean query, and Task 2.2 focuses on producing an efficient ordering of the documents, such that all of the relevant abstracts are retrieved as early as possible, and on identifying a subset which contains all or as many of the relevant abstracts as possible for the least effort.

We participated in Task 2.2 and submitted four runs: abs-th-ratio-ilps@uva, abs-hh-ratio-ilps@uva, doc-th-ratio-ilps@uva, doc-hh-ratio-ilps@uva.

2 Task Description

Task 2.2 is Abstract and Title Screening. Participants are given the document collection retrieved through the Boolean search in Task 2.1 and are asked to produce an efficient ordering of the documents, such that all of the relevant abstracts are retrieved as early as possible, and at the same time to identify a subset which contains all or as many of the relevant abstracts as possible for the least effort (i.e. the total number of abstracts to be assessed).

3 Method

3.1 The model

In this paper, we propose a novel model for the TAR process inspired by [1] and [5]. The model mainly consists of a ranking module, a sampling module, an assessment module, an estimation module and a stopping module. Given a topic and a document collection C the reviewer is interested in, together with a target recall level $r_{target}$ that the reviewer wants to achieve, the model iteratively outputs a set of documents until the estimated recall exceeds the target recall.
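Before presenting the full algorithm, the following minimal Python sketch shows the shape of this loop. Everything in it (uniform sampling, the naive prevalence-based estimate, the synthetic labels) is a toy stand-in for the modules described in the remainder of this section, not our actual implementation.

import numpy as np

rng = np.random.default_rng(0)
N, b, r_target = 1000, 100, 0.8
relevance = rng.integers(0, 2, size=N)   # hidden labels, revealed only on assessment

unassessed = set(range(N))
found_relevant = 0
while True:
    # Sampling module (toy: uniform with replacement; ours is ranker-driven).
    batch = rng.choice(N, size=b, replace=True)
    # Assessment module: reviewers judge only the unique, not-yet-seen documents.
    for i in set(batch.tolist()) & unassessed:
        found_relevant += int(relevance[i])
        unassessed.discard(i)
    assessed = N - len(unassessed)
    # Estimation module (toy): scale the observed prevalence up to the collection.
    R_hat = found_relevant * N / assessed
    # Stopping module: stop once the relevant documents found exceed the
    # estimated target number R_hat * r_target.
    if found_relevant > R_hat * r_target:
        break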
We elaborate the model in Algorithm 1. Let $S$ denote the set of sampled documents and $n$ the number of documents in $S$; let $L_t$ denote the labelled documents (the training set) at batch $t$ and $U_t$ the unlabelled documents at batch $t$.

Algorithm 1: Automatic thresholding algorithm
Input: Target topic; document collection $C$; target recall $r_{target}$.
Output: A list of retrieved documents with a stopping threshold.
1   $L_t$ = {pseudo document $d_0$, labelled relevant}
2   while not stop do
        // Sample
3       Temporarily augment $L_t$ by uniformly sampling $k$ documents from $U_t$, labelled non-relevant.
4       Train a ranking model on $L_t$.
5       Rank all the documents in $C$ by the ranker trained on $L_t$.
6       Construct the sampling distribution $P_t$ over the ranked documents.
7       Sample $b$ documents from the distribution $P_t$.
8       Render relevance assessments for the sampled documents that belong to $U_t$.
9       Remove the $k$ temporary documents from $L_t$. Place the $b$ labelled documents in $L_t$, and remove them from $U_t$.
        // Estimate
10      Calculate $\hat{R}_t$ and $\widehat{var}(\hat{R}_t)$ via (4).
        // Stopping condition
11      if $\hat{R}_t$ and $\widehat{var}(\hat{R}_t)$ satisfy the stopping strategy then
12          stop = True
13      else
14          stop = False
15      end
16  end
17  Produce the ordering of documents as sampled relevant, sampled non-relevant, un-sampled, with the stopping threshold at the first un-sampled document.

Initially, $L_t$ is empty and we fill it with a pseudo document $d_0$ made of the provided description of the topic. In line 3, $k$ documents are uniformly sampled from $U_t$, temporarily labelled non-relevant and added to $L_t$. In line 4, a ranking model is trained on $L_t$. In lines 5-7, a sampling distribution $P_t$ is constructed based on the ranked list of documents produced by the ranking model, and a fixed number $b$ of documents are sampled independently and with replacement from $P_t$. In line 8, reviewers assess the relevance of the sampled documents; note that the $b$ sampled documents may contain duplicates, so reviewers only need to assess the unique documents. In line 10, $\hat{R}_t$ and $\widehat{var}(\hat{R}_t)$ are calculated. In lines 11-15, the stopping module uses $\hat{R}_t$ and $\widehat{var}(\hat{R}_t)$ to decide whether to stop or not. In line 17, the ordering of documents is produced as sampled relevant, sampled non-relevant, un-sampled, with the stopping threshold at the first un-sampled document.

3.2 Ranking module

We use the TF-IDF vector of a document as its features. Considering effectiveness and efficiency, we employ Logistic Regression as the ranking model, using its implementation in scikit-learn (https://scikit-learn.org/). At each batch $t$, a new model is trained from scratch on the current training data $L_t$.

3.3 Sampling module

Sampling distribution. Note that in Algorithm 1 we need to sample documents from a distribution $P_t = \{p_i^t\}$ (for notational simplicity we write $P$ in this section). Ideally, the selection probability $p_i$ should be positively correlated with the relevance labels, which allows an estimator $\hat{R}$ with low variance (see Section 3.4). However, the relevance labels are not known until documents are assessed by the reviewers. What we have instead is a ranking model that can predict the probability of relevance and output a ranked list of documents, which we can use to construct $P$. We use a power-law distribution, which assumes the selection probability of a document is a power function of its position in the ranked list, defined as

$$p_i = \frac{1}{Z}\,\frac{1}{r_i^{\alpha}}, \qquad r_i \in [1, N],\ \alpha \in (0, +\infty) \tag{1}$$

where $N$ is the number of documents in $C$, $r_i$ is the rank position, and $Z = \sum_{i=1}^{N} \frac{1}{r_i^{\alpha}}$ is the normalization factor.
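The following sketch instantiates lines 4-7 of Algorithm 1, combining the ranking and sampling modules with scikit-learn and Eq. (1). The toy corpus, the training indices, and the tiny batch size are illustrative only; the actual runs operate on the PubMed abstracts with $b = 100$.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def power_law(n_docs, alpha=0.8):
    # Eq. (1): p_i = (1/Z) * 1 / r_i^alpha over rank positions r_i = 1..N.
    ranks = np.arange(1, n_docs + 1)
    weights = 1.0 / ranks**alpha
    return weights / weights.sum()

# Illustrative collection; in our runs these are PubMed titles/abstracts.
texts = ["heart failure trial", "diabetes screening", "asthma in children",
         "randomised trial of statins", "case report of rare disease"]
train_idx, train_labels = [0, 3], [1, 0]       # L_t: assessed (or temporary) labels

X = TfidfVectorizer().fit_transform(texts)     # TF-IDF features for the collection
ranker = LogisticRegression().fit(X[train_idx], train_labels)   # line 4

scores = ranker.predict_proba(X)[:, 1]         # line 5: score all documents in C
ranked = np.argsort(-scores)                   # doc ids in decreasing relevance
p_t = power_law(len(texts), alpha=0.8)         # line 6: P_t over the rank positions
rng = np.random.default_rng(0)
draws = rng.choice(ranked, size=3, replace=True, p=p_t)  # line 7: sample b docs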
Inclusion probability. We derive the first-order and second-order inclusion probabilities, which are needed to calculate $\hat{R}$. We adopt sampling with replacement as our sampling method: at each batch $t$ and for each draw, a document is sampled independently from one of the aforementioned distributions. Let the selection probability denote the probability that a document is sampled in a single draw, and the inclusion probability the probability that a document is included in the sample set considering all the draws. Under the sampling-with-replacement design, the first-order inclusion probability $\pi_i$ is given by

$$\pi_i = 1 - \prod_{t=1}^{T} \left(1 - p_i^t\right)^{n_t} \tag{2}$$

The second-order inclusion probability $\pi_{ij}$ – the probability that two different documents $d_i$ and $d_j$ are both included – is given by

$$\pi_{ij} = \pi_i + \pi_j - \left[ 1 - \prod_{t=1}^{T} \left(1 - p_i^t - p_j^t\right)^{n_t} \right] \tag{3}$$

3.4 Estimation module

We employ the Horvitz-Thompson estimator and the Hansen-Hurwitz estimator to estimate $R$ and $var(R)$. Both are designed for sampling with unequal probabilities; the Hansen-Hurwitz estimator is restricted to with-replacement sampling, while the Horvitz-Thompson estimator applies to any design. For the details of the derivations the reader can refer to Chapter 6 in [7].

Horvitz-Thompson estimator. The Horvitz-Thompson estimator provides an unbiased estimator of the population total under a general sampling design [2]. Let $\tau = \sum_{i=1}^{N} y_i$ denote the population total. With any sampling design, with or without replacement, the unbiased Horvitz-Thompson estimator of the population total is $\hat{\tau} = \sum_{i \in S'} \frac{y_i}{\pi_i}$, where $S'$ is the subset of $S$ containing only unique documents, and $\pi_i$ is the inclusion probability of document $i$. In our case, the population total is the total number of relevant documents for a target topic, denoted as $R = \sum_{i=1}^{N} y_i$. The Horvitz-Thompson estimator of $R$ is

$$\hat{R}_t^{HT} = \sum_{i \in \tilde{S}_t'} \frac{y_i^t}{\pi_i} \tag{4}$$

where $\tilde{S}_t = \cup_{k=1}^{t} S_k$ denotes the accumulated sample set up to batch $t$, and $y_i^t$ is the relevance of document $d_i^t$. We use the prime symbol ($'$) to denote the removal of duplicate documents, and the tilde ($\tilde{\ }$) to denote the accumulation of documents over all previous batches.
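The sketch below computes Eq. (2) and Eq. (4) for a two-batch example. The second batch distribution is a random stand-in for a ranker-induced $P_t$, and the labels are synthetic; only the two functions reflect the estimator itself.

import numpy as np

def first_order_inclusion(batch_probs, batch_sizes):
    # Eq. (2): pi_i = 1 - prod_t (1 - p_i^t)^{n_t} under with-replacement sampling.
    not_included = np.ones_like(batch_probs[0])
    for p_t, n_t in zip(batch_probs, batch_sizes):
        not_included *= (1.0 - p_t) ** n_t
    return 1.0 - not_included

def horvitz_thompson(y, pi, sampled_ids):
    # Eq. (4): R_hat = sum over *unique* sampled documents of y_i / pi_i.
    uniq = np.unique(sampled_ids)
    return float(np.sum(y[uniq] / pi[uniq]))

# Illustrative usage with two batches drawn from different distributions.
rng = np.random.default_rng(0)
N = 500
y = (rng.random(N) < 0.1).astype(float)     # stand-in relevance labels y_i
p1 = np.full(N, 1.0 / N)                    # batch 1: uniform
p2 = rng.dirichlet(np.ones(N))              # batch 2: stand-in for a ranker-driven P_t
s1 = rng.choice(N, size=100, replace=True, p=p1)
s2 = rng.choice(N, size=100, replace=True, p=p2)
pi = first_order_inclusion([p1, p2], [100, 100])
R_hat = horvitz_thompson(y, pi, np.concatenate([s1, s2]))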
Hansen-Hurwitz estimator. The Hansen-Hurwitz estimator provides an unbiased estimator of the population total under sampling with replacement from the same distribution [7]. In our case, the sampling distribution changes at each batch and converges to the ultimate distribution produced by the ranking model trained on the whole document set. The Hansen-Hurwitz estimator of $R$ on $\tilde{S}_t$ is

$$\hat{R}_t^{HH} = \frac{1}{\tilde{n}_t} \sum_{k=1}^{t} \sum_{i \in S_k} \frac{y_i}{p_i^k} \tag{5}$$

where $\tilde{n}_t$ is the total number of draws accumulated up to batch $t$.

3.5 Stopping module

We propose a stopping strategy based on $\hat{R}$. As sampling continues, the strategy repeatedly examines whether $\tilde{r}_t' > \hat{R} \cdot r_{target}$, where $\tilde{r}_t'$ is the number of relevant documents among the unique sampled documents, and if so stops the TAR process. The intuition is straightforward: if we have collected more relevant documents than the estimated target number, we can be confident enough to stop.
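Analogously, a sketch of Eq. (5) together with the stopping check, reusing the same synthetic two-batch setup as in the previous sketch; all data here is illustrative.

import numpy as np

def hansen_hurwitz(y, batch_probs, batch_samples):
    # Eq. (5): R_hat = (1 / n~_t) * sum over all draws of y_i / p_i^k,
    # where n~_t is the total number of draws across batches.
    n_total = sum(len(s) for s in batch_samples)
    total = sum(np.sum(y[s] / p[s]) for p, s in zip(batch_probs, batch_samples))
    return float(total / n_total)

def should_stop(y, sampled_ids, R_hat, r_target=0.8):
    # Stopping strategy (Sec. 3.5): stop once the number of relevant documents
    # among the unique sampled documents exceeds R_hat * r_target.
    uniq = np.unique(sampled_ids)
    return float(y[uniq].sum()) > R_hat * r_target

# Illustrative usage.
rng = np.random.default_rng(0)
N = 500
y = (rng.random(N) < 0.1).astype(float)
p1, p2 = np.full(N, 1.0 / N), rng.dirichlet(np.ones(N))
s1 = rng.choice(N, size=100, replace=True, p=p1)
s2 = rng.choice(N, size=100, replace=True, p=p2)
R_hat = hansen_hurwitz(y, [p1, p2], [s1, s2])
stop = should_stop(y, np.concatenate([s1, s2]), R_hat)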
4 Dataset

The dataset consists of 72 topics for training and 31 topics for testing. For each topic, participants are provided with:

1. the Topic-ID;
2. the title of the review, written by Cochrane experts;
3. the Boolean query manually constructed by Cochrane experts;
4. the set of PubMed Document Identifiers (PIDs) returned by running the query in MEDLINE.

5 Runs

The proposed method is topic-wise in the sense that it repeatedly trains a new ranker based on the currently assessed documents. It does not need extra training topics; our runs are produced directly on the test data.

We submitted four runs: abs-th-ratio-ilps@uva, abs-hh-ratio-ilps@uva, doc-th-ratio-ilps@uva, doc-hh-ratio-ilps@uva. Here abs and doc denote whether qrels at the abstract level or at the content level are used for the relevance feedback in the assessment module, and th and hh denote whether the Horvitz-Thompson estimator or the Hansen-Hurwitz estimator is used to estimate $R$. A description of each run is presented below.

1. abs-th-ratio-ilps@uva: abstract-level qrels and the Horvitz-Thompson estimator;
2. abs-hh-ratio-ilps@uva: abstract-level qrels and the Hansen-Hurwitz estimator;
3. doc-th-ratio-ilps@uva: content-level qrels and the Horvitz-Thompson estimator;
4. doc-hh-ratio-ilps@uva: content-level qrels and the Hansen-Hurwitz estimator.

For all four runs, we set $\alpha = 0.8$, $b = 100$, $k = 100$, and target recall $r_{target} = 0.8$.

6 Results

Our method targets finding a stopping threshold given a target recall. We re-rank all the sampled relevant documents to the top, followed by all the sampled non-relevant documents and all the un-sampled documents. The stopping threshold is set at the position of the first un-sampled document. As all the sampled documents come before the stopping threshold, the threshold also indicates the cost. Consequently, it is not valid to apply ranking metrics such as Average Precision, and we report threshold-based metrics instead.

Table 1 shows the thresholding results on the test set. First, at both the abstract and the content level, the Horvitz-Thompson estimator yields a recall threshold closer to the target recall of 0.8 than the Hansen-Hurwitz estimator, which indicates a more accurate estimation of $R$. Second, both estimators stop at an early stage, when the sampled documents are less than 50% of the collection. Figures 1 and 2 show the topic-wise recall threshold vs. norm threshold. The Horvitz-Thompson estimator stops at various recall levels for different topics, while the Hansen-Hurwitz estimator stops between 0.8 and 1.0 for most topics. This indicates that the estimation of $R$ can help to stop viewing documents, but the variance of the estimated $R$ is large across topics.

Table 1: Thresholding results on the test set

RUN                      norm threshold   recall threshold
abs-th-ratio-ilps@uva    0.423            0.838
abs-hh-ratio-ilps@uva    0.470            0.890
doc-th-ratio-ilps@uva    0.392            0.894
doc-hh-ratio-ilps@uva    0.426            0.950

7 Conclusion

In this paper, we presented the runs we submitted to the CLEF 2019 eHealth Task 2.2. We handle the problem under the Continuous Active Learning framework by jointly training a ranking model to rank documents and conducting a "greedy" sampling to estimate the real number of relevant documents in the collection. We submitted four runs. The results indicate that the method can retrieve most of the relevant documents (around 80% to 90%) at the cost of viewing less than 50% of the collection. The estimation of $R$ can help to stop viewing documents, but the variance of the estimated $R$ is large across topics. Further work needs to be done to reduce this variance.

References

1. Cormack, G.V., Grossman, M.R.: Autonomy and reliability of continuous active learning for technology-assisted review. CoRR abs/1504.06868 (2015), http://arxiv.org/abs/1504.06868
2. Horvitz, D.G., Thompson, D.J.: A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association 47(260), 663-685 (1952)
3. Kanoulas, E., Li, D., Azzopardi, L., Spijker, R.: CLEF 2019 technologically assisted reviews in empirical medicine overview. In: CLEF 2019 Evaluation Labs and Workshop: Online Working Notes. CEUR-WS (2019)
4. Kelly, L., Suominen, H., Goeuriot, L., Neves, M., Kanoulas, E., Li, D., Azzopardi, L., Spijker, R., Zuccon, G., Jimmy, Palotti, J.: Overview of the CLEF eHealth evaluation lab 2019. In: CLEF 2019 - 10th Conference and Labs of the Evaluation Forum, Lecture Notes in Computer Science (LNCS). Springer (2019)
5. Li, D., Kanoulas, E.: Active sampling for large-scale information retrieval evaluation. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. pp. 49-58. ACM (2017)
6. O'Mara-Eves, A., Thomas, J., McNaught, J., Miwa, M., Ananiadou, S.: Using text mining for study identification in systematic reviews: a systematic review of current approaches. Systematic Reviews 4(1), 5 (2015)
7. Thompson, S.K.: Sampling. John Wiley & Sons, Inc., Hoboken, New Jersey, 3 edn. (2012)

[Fig. 1: Topic-wise recall threshold vs. norm threshold at the abstract level; (a) abs-th-ratio-ilps@uva, (b) abs-hh-ratio-ilps@uva.]

[Fig. 2: Topic-wise recall threshold vs. norm threshold at the content level; (a) doc-th-ratio-ilps@uva, (b) doc-hh-ratio-ilps@uva.]