<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Data Balancing for Technologically Assisted Reviews: Undersampling or Reweighting</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Zhe</forename><surname>Yu</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">NC State University</orgName>
								<address>
									<postCode>27695</postCode>
									<settlement>Raleigh</settlement>
									<region>NC</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author role="corresp">
							<persName><forename type="first">Tim</forename><surname>Menzies</surname></persName>
							<email>tim.menzies@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="institution">NC State University</orgName>
								<address>
									<postCode>27695</postCode>
									<settlement>Raleigh</settlement>
									<region>NC</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Data Balancing for Technologically Assisted Reviews: Undersampling or Reweighting</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">1C76A84E910E12962FFF8649039C8727</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T20:31+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>technologically assisted reviews</term>
					<term>active learning</term>
					<term>data balancing</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper presents approaches for the automated support of citation screening in systematic reviews. Continuous active learning is chosen as our baseline approach, on top of which two data balancing techniques are applied to handle the class imbalance problem. These two techniques, aggressive undersampling and reweighting, are tested and compared on 20 data sets for Diagnostic Test Accuracy (DTA) reviews. Results are evaluated by last rel and suggest that reweighting outperforms undersampling: it not only balances the training data, but also emphasizes the "content relevant" examples over "abstract relevant" ones and thus helps to retrieve "content relevant" papers earlier.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>This paper is a participant working note for the task of technologically assisted reviews in empirical medicine <ref type="bibr" target="#b6">[7]</ref> in CLEF eHealth 2017 <ref type="bibr" target="#b5">[6]</ref>. This task is about applying machine learning techniques to help medical researchers conduct systematic reviews. More specifically, the task focuses on Diagnostic Test Accuracy (DTA) reviews since search in this area is generally considered the hardest, and a breakthrough in this field would likely be applicable to other areas as well <ref type="bibr" target="#b6">[7]</ref>. Twenty DTA reviews data sets are provided for training and thirty for testing. The problem statement of this task is:</p><p>Given the results of a Boolean Search, how to make Abstract and Title Screening more effective.</p><p>Here, in this paper, we further specify our problem to be:</p><p>Screen the fewest papers to retrieve most (or all) relevant ones.</p><p>This leads directly to the evaluation metric last rel <ref type="bibr" target="#b6">[7]</ref>, which measures the number of documents that need to be screened before all relevant documents are retrieved.</p><p>Previously, we analyzed the equivalent problem in software engineering (SE) and built a high-performing method, FASTREAD, that combines a wide range of techniques taken from electronic discovery and evidence-based medicine <ref type="bibr" target="#b11">[12]</ref>. Those results suggested that FASTREAD, which took aggressive undersampling from patient active learning <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b9">10]</ref> and the rest from continuous active learning <ref type="bibr" target="#b1">[2]</ref><ref type="bibr" target="#b2">[3]</ref><ref type="bibr" target="#b3">[4]</ref>, outperforms both of the original algorithms on SE reviews data <ref type="bibr" target="#b11">[12]</ref>. 
It indicated that, at least on SE reviews data, continuous active learning is an efficient approach, and data balancing can further improve its performance.</p><p>While the above results are promising, we advise against applying the conclusions directly to the empirical medicine task since the target corpora are very different (one from SE reviews and one from DTA reviews). In addition, the DTA reviews data have two levels of query results, one from title and abstract screening and the other from document screening, while the SE reviews data <ref type="bibr" target="#b11">[12]</ref> only have the query results from document screening. We therefore feel that, when properly exploited, these two-level labels make reweighting another way to balance the training data: more weight is put on papers identified as "content relevant" than on those identified as only "abstract relevant" or "not relevant". In this way, reweighting not only balances the two classes, but also favors "content relevant" examples when training the model.</p><p>Besides the two-level query results, the DTA reviews data also offer a brief description of the topic being screened, which could be a great source for "Auto-Syn" described in <ref type="bibr" target="#b12">[13]</ref> and <ref type="bibr" target="#b2">[3]</ref>. Utilizing the description as an initial seed training example provides a better chance to retrieve "relevant" papers earlier and reduces variance in the experiments (compared to a random start-up). Note that in order to train a classifier on just one "relevant" example (the description of the topic), presumptive non-relevant examples are generated <ref type="bibr" target="#b2">[3]</ref>. This technique randomly samples from the unlabeled examples and treats the sampled examples as "not relevant" in training. The low prevalence of "relevant" examples makes this technique reasonable.</p><p>The rest of the paper provides details about the different approaches tested on the training data and analyzes the results. 
Due to limited time, numerous engineering decisions were made without being fully tested. Conclusions and future work are presented at the end.</p></div>
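As a concrete illustration of the presumptive non-relevant technique described above, here is a minimal Python sketch. The function name, signature, and sample size are illustrative assumptions, not taken from the cited work:

```python
import random

def presumptive_non_relevant(unlabeled_ids, n_samples, seed=0):
    """Randomly sample unlabeled papers and presume them "not relevant".

    Because "relevant" papers have low prevalence, the sampled papers
    are very likely true negatives, so they can serve as negative
    training examples before any real screening has happened.
    """
    rng = random.Random(seed)
    pool = list(unlabeled_ids)
    n = min(n_samples, len(pool))
    return [(pid, "not relevant") for pid in rng.sample(pool, n)]
```

The presumed labels are used only for the current training round; once a paper is actually screened, its true label replaces the presumed one.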
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Method</head><p>In this section, we provide details on three approaches:</p><p>-CAL: the baseline approach from Cormack et al. <ref type="bibr" target="#b1">[2]</ref><ref type="bibr" target="#b2">[3]</ref><ref type="bibr" target="#b3">[4]</ref>.</p><p>-AU: the baseline approach CAL plus the data balancing method called aggressive undersampling <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b9">10]</ref>.</p><p>-RW: the baseline approach CAL plus a reweighting method ("content relevant" papers weigh more than other papers in training).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Baseline: CAL</head><p>Besides the overall framework of continuous active learning <ref type="bibr" target="#b1">[2]</ref><ref type="bibr" target="#b2">[3]</ref><ref type="bibr" target="#b3">[4]</ref>, the baseline approach applies several predefined engineering decisions, the same as in our previous work <ref type="bibr" target="#b11">[12]</ref>. The entire workflow can be described as follows:</p><p>1. Corpus collection: collect titles and abstracts of papers in the search results. 2. Auto-Syn: add the topic description into the corpus and label it as "abstract relevant". 3. Preprocessing: stemming, stop word removal, bag of words. 4. Featurization: term frequency, feature selection by tf-idf score (top 4000 terms), l2 normalization. 5. Training: train a binary classifier (linear SVM) on all the labeled papers; "content relevant" and "abstract relevant" papers are treated as one class ("relevant") while "not relevant" papers form the other class. Presumptive non-relevant examples are generated to enrich the "not relevant" class. 6. Certainty sampling: use the trained classifier to predict on the remaining unlabeled papers. Sample the N = 10 papers with the highest probability of being "relevant" according to the classifier. 7. Review<ref type="foot" target="#foot_0">1</ref>: ask reviewers to review the sampled papers by titles and abstracts, labeling each as "abstract relevant" or "not relevant". For papers labeled as "abstract relevant", reviewers are asked to further review the content and decide whether to label each as "content relevant". Go back to step 5 until the stop rule is satisfied (every "content relevant" paper has been retrieved). A threshold of M = 30 is applied to avoid training an SVM model on too few papers <ref type="bibr" target="#b11">[12]</ref>.</p></div>
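One certainty-sampling iteration of the loop above (steps 3-6) can be sketched in Python with scikit-learn. The parameter values follow the text (4000 terms, l2 normalization, N = 10); the helper name and the use of scikit-learn are illustrative assumptions, not the authors' exact implementation:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def certainty_sample(abstracts, labeled_idx, labels, n_query=10):
    """Train a linear SVM on the labeled papers and return the indices
    of the n_query unlabeled papers most likely to be "relevant"."""
    # Bag of words -> tf-idf features, top 4000 terms, l2-normalized
    tfidf = TfidfVectorizer(stop_words="english", max_features=4000, norm="l2")
    X = tfidf.fit_transform(abstracts)
    clf = LinearSVC()
    clf.fit(X[labeled_idx], labels)  # labels: 1 = relevant, 0 = not relevant
    unlabeled = np.setdiff1d(np.arange(len(abstracts)), labeled_idx)
    scores = clf.decision_function(X[unlabeled])
    # Certainty sampling: highest decision value = most confidently "relevant"
    return unlabeled[np.argsort(scores)[::-1][:n_query]]
```

The returned indices are the papers the simulated reviewer labels next, after which the classifier is retrained on the enlarged labeled set.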
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Aggressive Undersampling: AU</head><p>Aggressive undersampling (AU) adds the data balancing method from patient active learning <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b9">10]</ref> to the baseline approach CAL.</p></div>
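A minimal sketch of aggressive undersampling as commonly formulated in the cited work: after training an initial SVM, discard the "not relevant" training examples closest to the decision hyperplane until the two classes are balanced, then retrain. The helper name and the use of scikit-learn are illustrative assumptions; the exact variant used in this study may differ:

```python
import numpy as np
from sklearn.svm import LinearSVC

def aggressive_undersample(X, y):
    """Return indices of a balanced training subset of (X, y).

    Keeps all positives (y == 1) and only the negatives farthest
    from the initial SVM's hyperplane, i.e. the most confidently
    "not relevant" examples.
    """
    clf = LinearSVC().fit(X, y)
    pos = np.where(y == 1)[0]
    neg = np.where(y == 0)[0]
    dist = np.abs(clf.decision_function(X[neg]))
    keep_neg = neg[np.argsort(dist)[::-1][: len(pos)]]
    return np.concatenate([pos, keep_neg])
```

Discarding the ambiguous, near-boundary negatives pushes the retrained decision boundary toward the negative side, which tends to increase recall of the rare "relevant" class.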
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Reweighting: RW</head><p>Reweighting (RW) is a new approach that takes advantage of the two-level labels offered by the DTA reviews data. The only difference between RW and the baseline approach CAL is in the training step:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>5.</head><p>Training: train a binary classifier (linear SVM) on all the labeled papers; "content relevant" and "abstract relevant" papers are treated as one class ("relevant"), but "content relevant" papers have W = 10 times the weight of "abstract relevant" or "not relevant" ones. Presumptive non-relevant examples are generated to enrich the "not relevant" class.</p><p>The reweighting parameter W = 10 was chosen somewhat arbitrarily and has not been fully tested due to the limited time.</p></div>
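The reweighted training step can be sketched with scikit-learn's `sample_weight` mechanism. W = 10 follows the text; the helper name and the three-level label encoding are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_reweighted(X, levels, W=10):
    """Train the RW classifier.

    levels: per-paper label, one of "content", "abstract", "none".
    "content" and "abstract" papers form the positive ("relevant")
    class, but "content relevant" papers get W times the weight.
    """
    y = np.array([1 if lv in ("content", "abstract") else 0 for lv in levels])
    weights = np.array([W if lv == "content" else 1 for lv in levels])
    clf = LinearSVC()
    clf.fit(X, y, sample_weight=weights)
    return clf
```

Because the decision boundary is pulled toward the heavily weighted "content relevant" examples, papers resembling them receive higher decision scores and surface earlier in certainty sampling.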
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Experiment</head><p>Experiments are conducted in a "pseudo" way following the procedures in Section 2. When a paper is selected for review, its true label is queried instead of involving any real human review process. As a result, the experiments are repeatable and reproducible.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Data</head><p>Twenty data sets on DTA reviews are provided as training sets for the task of technologically assisted reviews in empirical medicine <ref type="bibr" target="#b6">[7]</ref>. These data sets provide two-level query results, one for title and abstract screening and one for content screening. As a result, we label each paper in the data sets as one of three classes:</p><p>-Not relevant: papers excluded by title and abstract screening. -Abstract relevant: papers included by title and abstract screening but excluded by content screening. -Content relevant: papers included by both title and abstract screening and content screening.</p><p>Statistics for the twenty data sets are presented in Table <ref type="table" target="#tab_1">1</ref>. The "Content" column displays the number of "content relevant" papers; the "Abstract" column displays the number of "content relevant" papers plus the number of "abstract relevant" papers; the "Total" column displays the total number of papers. Topics 1, 6, 19, 28, and 45 (colored in red ) are considered "not good" for last rel evaluation due to their lack of "content relevant" papers (fewer than 5). The reason is that pure "luck" might affect the result when the target is to retrieve the only 1 (or 2, or 3) "content relevant" papers.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Performance Metrics</head><p>Since the objective is to screen the fewest papers to retrieve most (or all) relevant ones, we choose last rel for evaluation. More specifically, we use the number of papers screened when every "content relevant" one has been retrieved as the performance score, taking advantage of the two-level labels offered by the DTA reviews data. This makes our last rel metric different from that used in <ref type="bibr" target="#b6">[7]</ref>.</p><p>The lower the last rel score, the fewer papers need to be manually screened, and thus the better the performance. To capture the possible variance, the experiment for each method on every data set (topic) is repeated 10 times with different random seeds (which affect the presumptive non-relevant examples generated and thus introduce variance). The last rel score for each repeat is collected, and medians and iqrs (75th-25th percentile) are calculated for comparison. Scott-Knott <ref type="bibr" target="#b8">[9]</ref> analyses are applied on each topic to rank the performance of each treatment. Since the last rel scores follow asymmetric and non-normal distributions, Cliff's Delta <ref type="bibr" target="#b0">[1]</ref> and bootstrapping <ref type="bibr" target="#b4">[5]</ref> are applied as non-parametric hypothesis tests; i.e., two treatments are ranked differently in the Scott-Knott analysis only if both bootstrapping and the effect size test agree that the division is statistically significant (99% confidence) and not a small effect (Cliff's Delta ≥ 0.147).</p></div>
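For reference, Cliff's Delta can be computed directly from the two samples being compared; |delta| &lt; 0.147 is the conventional "small effect" cutoff used above. A minimal sketch (illustrative, not the authors' implementation):

```python
def cliffs_delta(xs, ys):
    """Cliff's Delta in [-1, 1]: the fraction of (x, y) pairs where
    x > y minus the fraction where x < y. 0 means the two samples
    overlap completely; +/-1 means complete separation."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))
```

Because the statistic only compares orderings of values, it makes no normality assumption, which suits the skewed last rel distributions.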
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Results</head><p>Table <ref type="table" target="#tab_2">2</ref> shows the results on the 20 topics from the training set. The first thing we notice is that no treatment ranks highest (colored in green ) across every topic. One treatment may outperform the others on one topic but perform poorly on another. In addition, no domination can be found among the three treatments (we say treatment A dominates treatment B if A performs consistently better than B across all topics).</p><p>Therefore, when it comes to the question of which treatment is the best, it really depends on the data. However, we did summarize the results in Table <ref type="table" target="#tab_2">2</ref> and count the number of "wins" and "losses" of each treatment. As shown in Table <ref type="table" target="#tab_3">3</ref>, statistically, reweighting (RW) wins more often and loses less often than any other treatment. As a result, among these three treatments, we recommend reweighting (RW), which over-weights the "content relevant" examples to balance the training data as well as to emphasize "content relevant" examples. Another gain from these experiments is that data balancing techniques do improve performance. As indicated in Table <ref type="table" target="#tab_2">2</ref>, on 19 out of 20 (or 14 out of 15) topics, reweighting (RW) or aggressive undersampling (AU) ranks highest; on 13 out of 20 (or 9 out of 15) topics, RW or AU ranks higher than continuous active learning (CAL) without data balancing. This also suggests that an ensemble of RW and AU, leveraging the advantages of both data balancing techniques, might offer even better results. We plan to explore this in future work.</p><p>Variances are within an acceptably low range (except for some of the "not good" topics) thanks to the "Auto-Syn" technique. Therefore the results are considered to be stable and repeatable.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Conclusion</head><p>How to retrieve most (or all) relevant documents by screening the fewest candidate ones is a difficult problem, also known in the Information Retrieval (IR) domain as the total recall problem. Proposed by Cormack et al. in 2014, continuous active learning has been an excellent algorithm for solving this problem <ref type="bibr" target="#b1">[2]</ref><ref type="bibr" target="#b2">[3]</ref><ref type="bibr" target="#b3">[4]</ref>. It was also adopted as a baseline method in the total recall track of TREC 2015 <ref type="bibr" target="#b7">[8]</ref>. This work extended the continuous active learning method by testing two different data balancing techniques. Experimental results suggested that no single treatment outperforms all others across all topics. However, statistically, reweighting (RW) was considered the most powerful treatment for the total recall task. This treatment applied "Auto-Syn" with the topic description as seed training data, generated "presumptive non-relevant examples" before training to enrich the "not relevant" class, and over-weighted the "content relevant" examples for data balancing. With the reweighting treatment, the training examples were balanced (so the model does not over-fit on the "not relevant" class), and the model was trained to "favor" the "content relevant" examples, which had a positive effect on retrieving every "content relevant" paper earlier.</p><p>Due to the limited time, only one aspect (data balancing) has been explored in this study. This does not imply that other aspects of the total recall task are not worth exploring. Plans for future work include:</p><p>-Explore the ensemble of reweighting and aggressive undersampling and other possible data balancing techniques. -Many parameters in the tested treatments were chosen quite arbitrarily. 
Parameter tuning can be applied to see whether these parameters affect the conclusions and whether a better set of parameters can be found.</p><p>-Different featurization techniques can be applied to extract "richer" features than bag-of-words or term frequencies; e.g. word vectors and citation link features might be useful for measuring relevance. -Human errors can be injected to test how robust the active learning methods are and at what level of error rate the system can still perform normally.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1 .</head><label>1</label><figDesc>Descriptive statistics for experimental data sets.</figDesc><table><row><cell></cell><cell>Content</cell><cell>Abstract</cell><cell>Total</cell><cell></cell><cell>Content</cell><cell>Abstract</cell><cell>Total</cell></row><row><cell>Topic1</cell><cell>2</cell><cell>30</cell><cell>3241</cell><cell>Topic35</cell><cell>9</cell><cell>98</cell><cell>3857</cell></row><row><cell>Topic4</cell><cell>28</cell><cell>442</cell><cell>8180</cell><cell>Topic37</cell><cell>12</cell><cell>154</cell><cell>1576</cell></row><row><cell>Topic6</cell><cell>2</cell><cell>6</cell><cell>15078</cell><cell>Topic38</cell><cell>5</cell><cell>109</cell><cell>12704</cell></row><row><cell>Topic9</cell><cell>60</cell><cell>98</cell><cell>1162</cell><cell>Topic43</cell><cell>27</cell><cell>48</cell><cell>43335</cell></row><row><cell>Topic11</cell><cell>8</cell><cell>59</cell><cell>1457</cell><cell>Topic44</cell><cell>30</cell><cell>206</cell><cell>3149</cell></row><row><cell>Topic14</cell><cell>20</cell><cell>63</cell><cell>14907</cell><cell>Topic45</cell><cell>1</cell><cell>42</cell><cell>316</cell></row><row><cell>Topic19</cell><cell>1</cell><cell>1</cell><cell>12704</cell><cell>Topic50</cell><cell>41</cell><cell>143</cell><cell>7990</cell></row><row><cell>Topic23</cell><cell>48</cell><cell>200</cell><cell>1938</cell><cell>Topic53</cell><cell>19</cell><cell>67</cell><cell>1310</cell></row><row><cell>Topic28</cell><cell>3</cell><cell>5</cell><cell>3964</cell><cell>Topic54</cell><cell>14</cell><cell>27</cell><cell>1499</cell></row><row><cell>Topic33</cell><cell>60</cell><cell>604</cell><cell>8186</cell><cell>Topic55</cell><cell>45</cell><cell>92</cell><cell>2542</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2 .</head><label>2</label><figDesc>Experimental results, collected from 10 repeated runs on 20 topics. For both medians and iqrs, lower is better. For each topic, aggressive undersampling (AU) and reweighting (RW) are compared with the baseline method, continuous active learning without data balancing (CAL). Scott-Knott analyses (with Cliff's Delta and bootstrapping as non-parametric hypothesis tests) are applied to rank each treatment. The treatments with the highest rank are colored in green while the treatments with lower ranks than the baseline (CAL) are colored in gray .</figDesc><table><row><cell></cell><cell cols="3">MEDIAN</cell><cell cols="3">IQR</cell></row><row><cell></cell><cell>RW</cell><cell>AU</cell><cell>CAL</cell><cell>RW</cell><cell>AU</cell><cell>CAL</cell></row><row><cell>Topic1</cell><cell>510</cell><cell>890</cell><cell>885</cell><cell>205</cell><cell>7</cell><cell>10</cell></row><row><cell>Topic4</cell><cell>260</cell><cell>410</cell><cell>385</cell><cell>70</cell><cell>62</cell><cell>122</cell></row><row><cell>Topic6</cell><cell>5475</cell><cell>12270</cell><cell>6055</cell><cell>327</cell><cell>7</cell><cell>440</cell></row><row><cell>Topic9</cell><cell>690</cell><cell>750</cell><cell>690</cell><cell>0</cell><cell>0</cell><cell>0</cell></row><row><cell>Topic11</cell><cell>75</cell><cell>90</cell><cell>80</cell><cell>17</cell><cell>10</cell><cell>7</cell></row><row><cell>Topic14</cell><cell>110</cell><cell>115</cell><cell>110</cell><cell>10</cell><cell>25</cell><cell>17</cell></row><row><cell>Topic19</cell><cell>8320</cell><cell>6160</cell><cell>8320</cell><cell>0</cell><cell>0</cell><cell>7</cell></row><row><cell>Topic23</cell><cell>920</cell><cell>840</cell><cell>1040</cell><cell>0</cell><cell>27</cell><cell>0</cell></row><row><cell>Topic28</cell><cell>1715</cell><cell>1525</cell><cell>1600</cell><cell>17</cell><cell>27</cell><cell>17</cell></row><row><cell>Topic33</cell><cell>4360</cell><cell>3780</cell><cell>4970</cell><cell>0</cell><cell>62</cell><cell>0</cell></row><row><cell>Topic35</cell><cell>210</cell><cell>260</cell><cell>405</cell><cell>27</cell><cell>10</cell><cell>115</cell></row><row><cell>Topic37</cell><cell>310</cell><cell>380</cell><cell>475</cell><cell>20</cell><cell>27</cell><cell>35</cell></row><row><cell>Topic38</cell><cell>490</cell><cell>960</cell><cell>980</cell><cell>15</cell><cell>447</cell><cell>97</cell></row><row><cell>Topic43</cell><cell>180</cell><cell>1140</cell><cell>230</cell><cell>37</cell><cell>210</cell><cell>17</cell></row><row><cell>Topic44</cell><cell>670</cell><cell>510</cell><cell>945</cell><cell>92</cell><cell>37</cell><cell>50</cell></row><row><cell>Topic45</cell><cell>20</cell><cell>10</cell><cell>10</cell><cell>15</cell><cell>7</cell><cell>7</cell></row><row><cell>Topic50</cell><cell>425</cell><cell>445</cell><cell>535</cell><cell>65</cell><cell>35</cell><cell>105</cell></row><row><cell>Topic53</cell><cell>340</cell><cell>620</cell><cell>280</cell><cell>0</cell><cell>60</cell><cell>0</cell></row><row><cell>Topic54</cell><cell>510</cell><cell>440</cell><cell>440</cell><cell>10</cell><cell>0</cell><cell>0</cell></row><row><cell>Topic55</cell><cell>740</cell><cell>850</cell><cell>610</cell><cell>0</cell><cell>17</cell><cell>0</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 3 .</head><label>3</label><figDesc>Summary of the experimental results. The "Top Rank" column displays the number of times a treatment ranks highest, while the "Lower Rank than Baseline" column displays the number of times a treatment ranks lower than the baseline treatment (CAL). The first two columns count all 20 topics while the last two columns only count "good" topics (excluding topics colored in red in Tables 1 and 2). One treatment is considered better than another if the number in "Top Rank" is larger while the number in "Lower Rank than Baseline" is smaller.</figDesc><table><row><cell></cell><cell cols="2">In all 20 topics</cell><cell cols="2">In 15 "good" topics</cell></row><row><cell></cell><cell>Top Rank</cell><cell>Lower Rank than Baseline</cell><cell>Top Rank</cell><cell>Lower Rank than Baseline</cell></row><row><cell>RW</cell><cell>14</cell><cell>3</cell><cell>11</cell><cell>2</cell></row><row><cell>AU</cell><cell>9</cell><cell>6</cell><cell>6</cell><cell>5</cell></row><row><cell>CAL</cell><cell>7</cell><cell>NA</cell><cell>6</cell><cell>NA</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">The actual experiments are carried out without real human reviewers. When asked for labels, the true labels in the data sets are queried instead of a human reviewer.</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Dominance statistics: Ordinal analyses to answer ordinal questions</title>
		<author>
			<persName><forename type="first">N</forename><surname>Cliff</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Psychological Bulletin</title>
		<imprint>
			<biblScope unit="volume">114</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page">494</biblScope>
			<date type="published" when="1993">1993</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Evaluation of machine-learning protocols for technology-assisted review in electronic discovery</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">V</forename><surname>Cormack</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">R</forename><surname>Grossman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 37th international ACM SIGIR conference on Research &amp; development in information retrieval</title>
				<meeting>the 37th international ACM SIGIR conference on Research &amp; development in information retrieval</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="153" to="162" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">Autonomy and reliability of continuous active learning for technology-assisted review</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">V</forename><surname>Cormack</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">R</forename><surname>Grossman</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1504.06868</idno>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Scalability of continuous active learning for reliable high-recall text classification</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">V</forename><surname>Cormack</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">R</forename><surname>Grossman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 25th ACM International on Conference on Information and Knowledge Management</title>
				<meeting>the 25th ACM International on Conference on Information and Knowledge Management</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="1039" to="1048" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">An introduction to the bootstrap</title>
		<author>
			<persName><forename type="first">B</forename><surname>Efron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">J</forename><surname>Tibshirani</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1994">1994</date>
			<publisher>CRC press</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Clef 2017 ehealth evaluation lab overview</title>
		<author>
			<persName><forename type="first">L</forename><surname>Goeuriot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Kelly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Suominen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Névéol</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Robert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kanoulas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Spijker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">R M</forename><surname>Palotti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Zuccon</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Experimental IR Meets Multilinguality, Multimodality, and Interaction -8th International Conference of the CLEF Association, CLEF 2017</title>
		<title level="s">Proceedings. Lecture Notes in Computer Science</title>
		<meeting><address><addrLine>Dublin, Ireland</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2017">September 11-14, 2017. 2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Overview of the CLEF technologically assisted reviews in empirical medicine</title>
		<author>
			<persName><forename type="first">E</forename><surname>Kanoulas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Azzopardi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Spijker</surname></persName>
		</author>
		<ptr target="CEUR-WS.org" />
	</analytic>
	<monogr>
		<title level="m">Working Notes of CLEF 2017 -Conference and Labs of the Evaluation forum</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<meeting><address><addrLine>Dublin, Ireland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">September 11-14, 2017. 2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Trec 2015 total recall track overview</title>
		<author>
			<persName><forename type="first">A</forename><surname>Roegiest</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">V</forename><surname>Cormack</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Grossman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Clarke</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. TREC-2015</title>
		<meeting>TREC-2015</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">A cluster analysis method for grouping means in the analysis of variance</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">J</forename><surname>Scott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Knott</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Biometrics</title>
		<imprint>
			<biblScope unit="page" from="507" to="512" />
			<date type="published" when="1974">1974</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Active learning for biomedical citation screening</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">C</forename><surname>Wallace</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Small</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">E</forename><surname>Brodley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">A</forename><surname>Trikalinos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining</title>
		<meeting>the 16th ACM SIGKDD international conference on Knowledge discovery and data mining</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="173" to="182" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Semiautomated screening of biomedical citations for systematic reviews</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">C</forename><surname>Wallace</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">A</forename><surname>Trikalinos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Brodley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">H</forename><surname>Schmid</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">BMC bioinformatics</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page">1</biblScope>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">How to read less: Better machine assisted reading methods for systematic literature reviews</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">A</forename><surname>Kraft</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Menzies</surname></persName>
		</author>
		<idno>CoRR abs/1612.03224</idno>
		<ptr target="http://arxiv.org/abs/1612.03224" />
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">WaterlooClarke: TREC 2015 total recall track</title>
		<author>
			<persName><forename type="first">H</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L</forename><surname>Clarke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">D</forename><surname>Smucker</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2015">2015</date>
			<publisher>TREC</publisher>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
