LIMSI@CLEF eHealth 2017 Task 2: Logistic Regression for Automatic Article Ranking

Christopher Norman

Mariska Leeflang

m.m.leeflang@uva.nl 0

Aurelie Neveol 1

0 Academic Medical Center, University of Amsterdam, Amsterdam, the Netherlands
1 LIMSI, CNRS, Université Paris Saclay, F-91405 Orsay

This paper describes the participation of the LIMSI-MIROR team at CLEF eHealth 2017, task 2. The task addresses the automatic ranking of articles in order to assist with the screening process of Diagnostic Test Accuracy (DTA) Systematic Reviews. We used a logistic regression classifier and handled class imbalance using a combination of class reweighting and undersampling. We also experimented with two strategies for relevance feedback. Our best run obtained an overall Average Precision of 0.179 and Work Saved over Sampling @95% Recall of 0.650. This run uses stochastic gradient descent for training but no feature selection or relevance feedback. We observe high performance variation across the queries in the test set. Nonetheless, our results suggest that automatic assistance is promising for ranking the DTA literature, as it could reduce the screening workload for review writers by 65% on average.

Keywords: Evidence Based Medicine, Information Storage and Retrieval, Review Literature as Topic, Supervised Machine Learning

Systematic reviews seek to gather all available published evidence for a given topic and provide an informed analysis of the results. This work constitutes some of the strongest forms of scientific evidence. Systematic reviews are an integral part of evidence based medicine in particular, and serve a key role in informing and guiding public and institutional decision-making. Systematic reviews for Diagnostic Test Accuracy (DTA) studies have been shown to be particularly challenging compared to other types of reviews because of the difficulty in defining search strategies offering adequate levels of sensitivity and specificity [8]. For this reason, there is a particular need to investigate automation strategies to assist DTA systematic review writers in the time-consuming screening process.

Methods for automating the screening process in systematic reviews have been actively researched over the years [6], with promising results obtained using a range of machine learning methods. However, previous work has not addressed DTA studies.

This paper describes the work underlying our participation in the CLEF 2017 eHealth Task 2 [10, 4]. This work is part of an ongoing effort to provide automatic assistance for the screening process in systematic reviews addressing a variety of topics, including DTA studies.

The remainder of this paper is organized as follows. Section 2 presents the datasets used for system development. Section 3 provides an overview of our system and describes each component. Finally, section 4 reports our results and section 5 provides an analysis of our methods and participation in the task.

2 Datasets

The task relied on a corpus comprising 50 DTA systematic review topics associated with the full list of articles retrieved by an expert query and assessed for inclusion based on title and abstract or full text. The corpus was split into a development dataset comprising 20 topics and a test set comprising the remaining 30 topics. Our classifier was trained on the development dataset and evaluated on the test dataset. We have also used a dataset of systematic reviews on drug class efficacy due to Cohen et al. [1] to develop the methods applied in this task. Several groups have used this dataset in the past [1, 5], which gives us a way to compare our results with previous work, although we can of course only do so by using the same evaluation metrics and training modes as previous work.

For both the CLEF and Cohen datasets we know the inclusion decisions based on the abstracts, as well as the inclusion decisions based on the full text. We thus have two definitions of positive examples, depending on whether we use the abstract decisions or full text decisions as the gold standard.

We use a tripartite labeling to reflect this:
– No (N) is the set of articles that were excluded based on the abstract
– Maybe (M) is the set of articles that were preliminarily included based on the abstract, but later excluded based on the full text
– Yes (Y) is the set of articles that were included based on both the abstract and the full text, and later used in the meta-analysis

We further distinguish two training modes:
– Intertopic training uses articles from a different topic (systematic review) for training
– Intratopic training uses articles from the current topic (systematic review) for training

3 Method

We first give an overview of our system, which relies on logistic regression, in section 3.1. Further details about the system are given in sections 3.2–3.5, including features and strategies to handle class imbalance and implement relevance feedback.

3.1 Overview

We have tried the following two classifiers:
– Classifier 1 uses logistic regression trained using stochastic gradient descent on all features
– Classifier 2 uses standard logistic regression trained using standard methods on a subset of the features, and with additional preprocessing to improve the throughput

We have tried three approaches to relevance feedback:
– no relevance feedback
– abrupt uses intertopic ranking until a sufficient number of relevant and nonrelevant articles have been identified, and then switches to using intratopic ranking based on the identified articles
– gradual initially uses intertopic ranking, and gradually improves the model using both Y and M identified through relevance feedback

In total, we have submitted the following four runs to the CLEF evaluation:
– no AF full uses classifier 1 with no relevance feedback
– no AF uses classifier 2 with no relevance feedback
– abrupt uses classifier 2 with abrupt relevance feedback
– gradual uses classifier 2 with gradual relevance feedback

3.2 Classification approach

We are currently using two classification systems. Both use logistic regression but differ in how the model is optimized and in the amounts and types of pre- and postprocessing that are performed. Both methods use implementations provided by sklearn [7].

Our first method, which is used in no AF full, tends to work well for intertopic classification on previous datasets (see table 3), presumably because it generalizes better. This system uses logistic regression trained using stochastic gradient descent. The only preprocessing done is the normalization of numerals.
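As a rough illustration of this first method, the sketch below trains a logistic regression model by stochastic gradient descent in plain Python. It is not the sklearn implementation used for the submitted runs: the function names, learning rate, number of epochs, and toy data are all assumptions made for the example, and the `pos_weight` parameter merely stands in for class reweighting.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_sgd_logreg(X, y, pos_weight=1.0, lr=0.1, epochs=200, seed=0):
    """Fit weights and bias by SGD on the class-weighted log loss.

    X is a list of feature vectors (lists of floats), y a list of 0/1 labels.
    All hyperparameters are illustrative, not the authors' settings.
    """
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            # Gradient of the log loss for one example, scaled by its class weight.
            g = (p - y[i]) * (pos_weight if y[i] == 1 else 1.0)
            w = [wj - lr * g * xj for wj, xj in zip(w, X[i])]
            b -= lr * g
    return w, b


def predict_proba(w, b, x):
    # Estimated probability that x belongs to the positive class.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Toy data: the first feature marks the relevant articles.
X = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
y = [1, 1, 0, 0, 0, 0]
w, b = train_sgd_logreg(X, y, pos_weight=2.0)
scores = [predict_proba(w, b, x) for x in X]
```

Sorting the candidates by `scores` in descending order gives the ranking; regularization, which a practical implementation would normally include, is left out to keep the sketch short.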

Our second method, which is used in no AF, abrupt, and gradual, uses standard methods for training (liblinear). This version tends to work well for intratopic classification on previous datasets (see table 3), but does not scale as well with data volume. We therefore need to do additional preprocessing to reduce the number of features and keep running times down. We thus remove features with variance less than a predefined threshold, we only consider n-grams with high mutual information with the target class in the training set, we normalize numerals, and we extract the principal components from the resulting data.
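The first two filtering steps can be sketched as follows, assuming binary n-gram indicator features. The threshold `min_variance`, the cut-off `top_k`, and the function names are hypothetical, the principal component step is omitted, and the code is a plain-Python illustration rather than the sklearn pipeline used for the runs.

```python
import math
from collections import Counter


def variance(column):
    # Population variance of one feature column.
    mean = sum(column) / len(column)
    return sum((v - mean) ** 2 for v in column) / len(column)


def mutual_information(column, labels):
    """Mutual information (in nats) between a discrete feature and the class."""
    n = len(labels)
    joint = Counter(zip(column, labels))
    px = Counter(column)
    py = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) )
        mi += (count / n) * math.log(count * n / (px[x] * py[y]))
    return mi


def select_features(X, y, min_variance=0.01, top_k=2):
    """Return the indices of the features kept by the two filters."""
    columns = list(zip(*X))
    # Step 1: drop near-constant features.
    kept = [j for j, col in enumerate(columns) if variance(col) >= min_variance]
    # Step 2: keep the top_k remaining features by mutual information with y.
    kept.sort(key=lambda j: mutual_information(columns[j], y), reverse=True)
    return sorted(kept[:top_k])


# Toy data: feature 0 is constant, feature 1 equals the label,
# feature 2 is independent of the label, feature 3 is weakly informative.
X = [[1, 1, 0, 1], [1, 1, 1, 1], [1, 0, 0, 1],
     [1, 0, 1, 0], [1, 0, 0, 0], [1, 0, 1, 0]]
y = [1, 1, 0, 0, 0, 0]
selected = select_features(X, y)
```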

Principal component analysis tends to reduce overfitting in our experiments, and it also drastically reduces the time it takes to train and apply the classifier, which is mostly important when we use relevance feedback.

3.3 Features

For all classifiers we extract n-grams (n ≤ 5) from the titles and abstracts. We also extract publication type, journal names, author assigned keywords, MeSH terms, and backward references, where these are available. The backward references are only available for references pointing to articles available in PubMed Central, and this feature set is therefore fairly sparse.
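A minimal sketch of the n-gram extraction, combined with numeral normalization, is given below. It assumes word-level n-grams, lowercasing, and a single `NUM` placeholder token; none of these details are specified above, and the function names are made up for the example.

```python
import re


def normalize_numerals(text, placeholder="NUM"):
    # Replace each number (digits, optionally with . or , separators) by one token.
    return re.sub(r"\d+(?:[.,]\d+)*", placeholder, text)


def word_ngrams(text, max_n=5):
    """Return all word n-grams with 1 <= n <= max_n as space-joined strings."""
    tokens = re.findall(r"\w+", normalize_numerals(text).lower())
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


grams = word_ngrams("Sensitivity was 95.2% in 120 patients")
```

With this normalization, "95.2" and "120" map to the same token, so that n-grams such as "num in num patients" are shared across abstracts that report different figures.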

Not all feature sets are useful for identifying DTA studies, but the current model has been constructed such that irrelevant features should not adversely affect the performance. All the feature sets have been shown to be useful in some domain. For instance, MeSH terms might not be useful for DTA studies, but we have previously found them to be useful in identifying topics related to drug efficacy.

3.4 Class imbalance

Class imbalance can be handled using undersampling or class reweighting. We are currently using a combination of both approaches.

Class weights. We set the weight for the positive class to 80 for the initial intertopic classifier. We determined experimentally, using the Cohen dataset, that this is a reasonable weight.
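One common way to realize class reweighting in logistic regression is to scale each example's contribution to the log loss by a weight that depends on its class. The following is a sketch of that objective, assuming that the negative class has weight 1 and leaving out the regularization term; the symbols $x_i$, $y_i$, $w$, $b$, and $c_{y_i}$ are introduced here for illustration only.

```latex
\mathcal{L}(w, b) = -\sum_{i} c_{y_i} \Bigl[ y_i \log \sigma(w^\top x_i + b)
  + (1 - y_i) \log\bigl(1 - \sigma(w^\top x_i + b)\bigr) \Bigr],
\qquad c_1 = 80, \quad c_0 = 1
```

Under this objective, a positive example contributes 80 times as much to the loss as a negative example with the same per-example log loss.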

For the gradual relevance feedback we also attached higher weights to the intratopic training examples identified through relevance feedback.

Undersampling. In order to reduce the effects of the class imbalance we undersample the training set to include an equal number of Y, M, and N. However, by doing so we end up with only around 1500 training samples. PCA yields at most as many principal components as there are input samples, and 1500 is generally too few principal components to build an accurate classifier. For the second model we therefore perform undersampling in two steps. We first select a maximum of 500 Y, 1000 M, and 1500 N that we feed into the feature extraction pipeline, which thus determines the number of features in our model. We then select a smaller undersample to use for training.
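The two-step procedure can be sketched as below. The per-class maxima follow the text; the data layout, the function names, and the choice to balance the second sample on the smallest class are assumptions of this sketch.

```python
import random


def undersample(pool, caps, rng):
    """Sample at most caps[label] articles per label from pool.

    pool maps a label ("Y", "M", "N") to a list of article ids.
    """
    return {
        label: rng.sample(ids, min(caps[label], len(ids)))
        for label, ids in pool.items()
    }


def two_step_undersample(pool, seed=0):
    rng = random.Random(seed)
    # Step 1: larger sample that is fed to the feature extraction pipeline.
    feature_sample = undersample(pool, {"Y": 500, "M": 1000, "N": 1500}, rng)
    # Step 2: smaller sample used to fit the classifier.
    # Balancing on the smallest class is an assumption of this sketch.
    size = min(len(ids) for ids in feature_sample.values())
    caps = {label: size for label in feature_sample}
    train_sample = undersample(feature_sample, caps, rng)
    return feature_sample, train_sample


# Toy pool of article ids per label.
pool = {
    "Y": list(range(600)),
    "M": list(range(1000, 3000)),
    "N": list(range(10000, 15000)),
}
feature_sample, train_sample = two_step_undersample(pool)
```

Calling `two_step_undersample` with a different seed in every iteration would correspond to drawing a fresh undersample each time the model is retrained.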

We take a new undersample in each iteration of relevance feedback.

3.5 Relevance Feedback

We use two schemes for relevance feedback. For both schemes we retrain the classifier each time we retrieve relevance feedback.

abrupt trains an initial intertopic classifier on the training dataset and ranks the test dataset in descending order of confidence. The system then iteratively asks for feedback for the top ranked results. When enough positive and negative examples have been identified, the system switches to using a classifier trained on the examples identified from relevance feedback. Additional examples are added to the intratopic classifier as they are discovered.

The idea behind this system is that on some topics in Cohen we can train highly performing intratopic classifiers using very small amounts of data, and we have observed that even when trained on small amounts of data these sometimes outperform intertopic classifiers by a large margin. In these cases it might make sense to switch to intratopic classification as soon as we can.
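The control flow of the abrupt scheme can be sketched as follows. The classifiers are abstracted as callables, one document is queried per iteration, and all names (`abrupt_ranking`, `inter_score`, `fit_intra`, `oracle`, `fit_centroid`) as well as the toy data are hypothetical; the default thresholds correspond to the minimum counts of 4 positive and 10 negative examples used in our runs.

```python
def abrupt_ranking(candidates, inter_score, fit_intra, oracle, min_pos=4, min_neg=10):
    """Screen candidates one at a time and return them in screening order.

    inter_score(doc) scores a document with the intertopic model.
    fit_intra(labeled) returns a scoring function trained on the
    (doc, label) pairs collected so far.
    oracle(doc) returns the screener's 0/1 decision for a document.
    """
    remaining = list(candidates)
    labeled = []
    score = inter_score
    while remaining:
        # Ask for feedback on the top-ranked unscreened document.
        best = max(remaining, key=score)
        remaining.remove(best)
        labeled.append((best, oracle(best)))
        n_pos = sum(label for _, label in labeled)
        n_neg = len(labeled) - n_pos
        # Once enough examples of both classes are known, switch to the
        # intratopic model and retrain it after every new label.
        if n_pos >= min_pos and n_neg >= min_neg:
            score = fit_intra(labeled)
    return [doc for doc, _ in labeled]


def fit_centroid(labeled):
    # Toy intratopic model: prefer documents close to the mean positive example.
    positives = [doc for doc, label in labeled if label == 1]
    center = sum(positives) / len(positives)
    return lambda doc: -abs(doc - center)


# Toy topic: documents are numbers, and those >= 30 are relevant.
docs = list(range(40))
order = abrupt_ranking(
    docs,
    inter_score=lambda doc: -abs(doc - 20),  # a poorly matched intertopic model
    fit_intra=fit_centroid,
    oracle=lambda doc: int(doc >= 30),
)
```

In this toy run the intertopic model alternates between relevant and non-relevant documents, whereas the intratopic model retrieves the remaining relevant documents consecutively after the switch.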

We set the minimum number of positive examples to 4, and the minimum number of negative examples to 10.

gradual trains an initial intertopic classifier using the training set and ranks the test set in descending order of confidence. The system then iteratively asks for feedback for the top ranked result. Articles queried for relevance feedback are then added to the model as they are queried, but with higher weights than the intertopic examples. The model thus starts out as an intertopic classifier, but gradually turns into an intratopic classifier as more targeted data is added to the model. Since the intratopic examples identified through relevance feedback are given higher weights, these will eventually drown out the original classifier, provided enough examples exist to be discovered.

Besides using Y and N, we also use intratopic M as positive examples, with lower weights than intratopic Y, but higher than intertopic Y. The reasoning behind this is that we often encounter M earlier than Y, and in greater numbers, in particular on topics with very few Y. We have observed on other datasets that we can sometimes improve performance by using both Y and M as positive examples, when the number of Y is very low.

After the number of Y found is larger than 40, we stop using M as positive examples.
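The weighting scheme of gradual can be sketched as a function that assembles the weighted training set before each retraining. The weight values below are placeholders that only respect the ordering described above (intertopic < intratopic M < intratopic Y); the function name, the data layout, and the choice to drop M entirely once more than 40 Y have been found are assumptions of this sketch.

```python
def gradual_training_set(inter_examples, feedback, m_cutoff=40,
                         w_inter=1.0, w_intra_m=2.0, w_intra=4.0):
    """Build (doc, target, weight) triples for retraining the gradual model.

    inter_examples: (doc, target) pairs from other topics, target in {0, 1}.
    feedback: (doc, label) pairs screened so far in the current topic,
    with label in {"Y", "M", "N"}.
    """
    n_yes = sum(1 for _, label in feedback if label == "Y")
    use_m = n_yes <= m_cutoff
    triples = [(doc, target, w_inter) for doc, target in inter_examples]
    for doc, label in feedback:
        if label == "Y":
            triples.append((doc, 1, w_intra))
        elif label == "N":
            triples.append((doc, 0, w_intra))
        elif use_m:
            # Intratopic M counts as a weaker positive while Y is scarce.
            triples.append((doc, 1, w_intra_m))
    return triples


inter = [("a", 1), ("b", 0)]
early = gradual_training_set(inter, [("c", "Y"), ("d", "M"), ("e", "N"), ("f", "Y")])
late_feedback = [("y%d" % i, "Y") for i in range(41)] + [("m", "M")]
late = gradual_training_set(inter, late_feedback)
```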

Reasonable parameter settings were identified experimentally on the Cohen dataset.

3.6 Use of the CLEF development dataset

We do not split the training data into separate training and validation splits, since we do not have the necessary number of Y to do this without hurting the performance of the classifier. We do however use a small set of samples that overlaps with the training set for validation. The performance we observe on this validation set suffers from severe overfitting, but we can observe when the model fails to build a classifier on the current undersample. In such cases we can observe an AUROC < 0.5 even on the training set. In these cases we simply discard the classifier and try again with a new undersample. We observe that this improves performance dramatically when we have a very small amount of training data (approximately four or fewer positive examples).

4 Results

We present a comparison with previous work on the Cohen dataset for WSS@95 in table 2 and for AUC in table 3. Results from previous literature are taken from Khabsa et al. [5] and Cohen et al. [2]. Exact intertopic AUC scores are not explicitly reported by Cohen et al. and have instead been extracted from Figure 1 in their paper. The majority of these results, with the exception of one result by Cohen et al. [2], use intratopic classification.

We present our results on the CLEF dataset for average precision in table 4, normalized average precision in table 5, WSS@95 in table 6, and in aggregate in table 7. The results in these tables correspond to those submitted as official runs. For comparison, we also calculate a baseline by evaluating each metric on the data ordered randomly. This has been repeated 1000 times and we report the average and standard deviation.
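For illustration, the sketch below computes average precision, WSS at a given recall level, and the random-order baseline for a single topic. It follows the usual definitions of these metrics; the official evaluation tool may differ in details such as rounding, and the function names and toy data are assumptions of the example.

```python
import math
import random
import statistics


def average_precision(ranked_labels):
    """Mean of precision@k over the ranks k that hold a relevant article."""
    hits, total = 0, 0.0
    for k, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0


def wss(ranked_labels, recall=0.95):
    """Work saved over sampling: the share of articles left unscreened when
    the target recall is reached, minus the share (1 - recall)."""
    n = len(ranked_labels)
    relevant = sum(ranked_labels)
    if relevant == 0:
        return 0.0
    needed = math.ceil(recall * relevant)
    hits = 0
    for k, label in enumerate(ranked_labels, start=1):
        hits += label
        if hits >= needed:
            return (n - k) / n - (1.0 - recall)
    return 0.0


def random_baseline(labels, metric, repeats=1000, seed=0):
    """Mean and standard deviation of a metric over random orderings."""
    rng = random.Random(seed)
    values = []
    for _ in range(repeats):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        values.append(metric(shuffled))
    return statistics.mean(values), statistics.stdev(values)


# Toy topic: 2 relevant articles among 20, ranked perfectly.
ranked = [1, 1] + [0] * 18
ap_perfect = average_precision(ranked)
wss_perfect = wss(ranked)
wss_mean, wss_std = random_baseline(ranked, wss)
ap_mean, ap_std = random_baseline(ranked, average_precision)
```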

We also report the mean, standard deviation, minimum and maximum WSS@95 and AUC over ten runs for a selection of topics in the CLEF dataset in table 8.

5 Discussion

5.1 Datasets

One of the topics in the CLEF dataset, CD010653, has no Y. While we can still calculate performance scores relative to M, this topic might arguably have been omitted from the test data. One of the topics, CD008803, similarly has no M. This also happens to be the topic with the largest number of Y.

As a general tendency, we can observe that the relative number of Y / M / N in the CLEF dataset varies dramatically across topics. At the one end we have one topic consisting of 14.06% Y (CD008760), and one topic consisting of 15.79% Y (CD010705). At the other end we have three topics with a mere 0.01% Y (CD011548, CD011549, and CD012019). Most topics in the CLEF dataset have a very small number of Y compared to Cohen, both in terms of relative and absolute numbers. Several topics have a large number of M however (CD007427, CD008054, CD009020, CD009323, CD009591, CD011134, CD011548, CD011975, CD011984, CD009925, CD010339, CD011145). Curiously, more topics in the training set have a large number of M than in the test set, despite the training set comprising a smaller number of topics.

[Table residue: two unlabeled columns of 31 values each; the row and column headers were lost in extraction.]
0.030 0.023 0.075 0.013 0.009 0.012 0.013 0.025 0.006 0.014 0.008 0.018 0.041 0.002 0.024 0.034 0.024 0.011 0.286 0.041 0.143 0.014 0.014 0.006 0.109 0.015 0.065 0.086 0.004 0.195 0.015
0.052 0.048 0.087 0.035 0.023 0.026 0.025 0.050 0.022 0.037 0.021 0.033 0.085 0.010 0.034 0.057 0.034 0.023 0.237 0.061 0.164 0.035 0.048 0.031 0.098 0.043 0.106 0.121 0.015 0.190 0.016

             w/o RF
Metric       no AF full   no AF
WSS@95       0.640        0.500
WSS@100      0.591        0.420
last rel     1678         2263
NCG@10       0.517        0.407
NCG@20       0.802        0.639
NCG@30       0.908        0.783
NCG@40       0.946        0.843
NCG@50       0.972        0.890
NCG@60       0.984        0.921
NCG@70       0.990        0.942
NCG@80       0.997        0.960
NCG@90       0.998        0.987
NCG@100      1.000        0.998
norm area    0.890        0.825
ap           0.133        0.100

The number of N also varies wildly, from 52 up to 43287. Compared to the Cohen dataset we also have a smaller minimum number of N, as well as much larger maximum number.

If we compare the training and test sets, the training set contains almost double the absolute number of M, many more N, but fewer Y.

While relevance feedback sometimes gives an improvement in performance, it often seems to only confuse the system (tables 4–7). This should be contrasted with our experiments on the Cohen dataset, where the same implementation reliably yields an improvement (table 3), and generally yields performance intermediate between intertopic and intratopic classification, as one would expect. There are perhaps better approaches to relevance feedback than ours, which can reliably improve upon the baseline, but it might also be that there is simply little to gain from relevance feedback on several of the topics. Of particular note, we should not expect any improvements by using RF on topics such as CD010386, CD010633, CD010860, CD010896, and CD012019, which have a low absolute number of Y and M. It is also worth pointing out that our abrupt scheme requires at least 4 Y before switching to the intratopic model, and any differences between no AF and abrupt on these topics can thus only be due to chance.

We can see an improvement on the topic CD010705 when using relevance feedback (tables 4–7). This topic is also the topic with the highest percentage of Y at 15.79%. We do not see any improvement for CD008760, the other topic with a high percentage of Y (14.06%), but this may be due to the initial classifier having much higher performance.

We can observe that gradual outperforms abrupt on topic CD008760, despite this topic having only 3 M, which is probably too low a number for gradual to have an advantage. The simplest explanation for this is likely random chance.

It is however easy to see that relevance feedback does not appear to lead to an improvement for our system. For instance, abrupt outperforms no AF 15 times out of 30, and gradual outperforms no AF only 10 times out of 30 (table 4).

Of course, it seems unlikely for relevance feedback to be useful for those topics where the number of positives is extremely low, even in theory. In particular, if there is only one relevant article, as is the case for CD012019 and CD010386, then relevance feedback cannot really add any value to the classification. Any successful use of relevance feedback on such topics would necessarily have to use the negative examples.

We get better performance for no AF full than for no AF. We have however generally observed that this difference is reversed for intratopic classification, which is what we should end up with after relevance feedback. It is nevertheless possible that we would get better performance if we were to use no AF full as a base for our relevance feedback experiments, since we would start with a much better initial classifier.

Ordinarily, screeners would be free to choose the order in which they screen each article, and may proceed for instance in alphabetical or chronological order. For the purposes of our baseline, we assume that any such order ordinarily available to screeners would be indistinguishable from random order on average.

5.3 Metrics

Average Precision has been selected as the main metric for this task as it was previously found to be particularly well suited to evaluating retrieval performance on highly imbalanced datasets [9, 3]. However, these studies rely on the common assumption that we value high precision at the top of the ranking, whereas for systematic review screening we value recall almost exclusively. Of particular note, average precision heavily penalizes rankings where the top few results are non-relevant, even if the ranking manages to place all relevant articles in the upper percentiles of the ranking.
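A small worked example, with numbers chosen purely for illustration, shows the effect. Suppose a topic has 100 candidates of which 2 are relevant, and a ranking places three non-relevant articles first, followed by the two relevant ones at ranks 4 and 5. Using the usual definitions of average precision and work saved over sampling:

```latex
\mathrm{AP} = \frac{1}{2}\left(\frac{1}{4} + \frac{2}{5}\right) = 0.325,
\qquad
\mathrm{WSS@100} = \frac{100 - 5}{100} - (1 - 1) = 0.95
```

A screener following this ranking reaches full recall after 5% of the candidates, yet the average precision is less than a third of the value 1.0 obtained by a perfect ranking.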

Furthermore, average precision is strongly correlated with the number of positives in the topic: most of the cases where we achieve ap > 0.2 are for topics with high prevalence. While this is to be expected, it means that average precision makes it difficult to compare performance across topics (tables 1, 4–7). Similarly, Mean Average Precision will likely be dominated by the results on the topics with many relevant articles and a small number of total candidates, i.e. arguably the topics which are the least representative systematic reviews of DTA studies, and where automated methods are likely the least useful.

5.4 Reliability of the Experiments

Our classification method is stochastic, and thus does not produce the same results every time we run it on the same input data. To gauge the reliability of the experiment we repeat it ten times for a subset of the topics and calculate the standard deviations, as well as examine the minimum and maximum values (table 8).

We can generally observe a fairly large variability for topics with a small total number of candidates, such as CD008760 and CD010705, and for topics with a comparably smaller proportion of Y, such as CD010339. When we consider topics with a large number of candidates we can observe a large variability for CD012019, but small variability for CD010386. We might speculate that small topic size and a small relative number of Y are correlated with larger variability, but it is clear that the variability for some topics is quite large, regardless of the underlying causes and mechanisms. The standard deviation can be as large as 0.139, which is large enough to cast doubt on the reliability of the results. Furthermore, the minimum and maximum values are much more extreme than we should expect from the standard deviations were the values normally distributed, suggesting that the distribution is heavy-tailed and prone to outliers.

Considering the above, we might suspect that the differences in performance in tables 4–7 are not significant. For instance, abrupt outperforms gradual 17 times out of 30, but we do not know whether this means that abrupt is a better method, or if this is simply due to random chance. We might speculate that our gradual implementation works better for the cases where we have a sufficient number of M, but the experiment is ultimately too low-powered to draw conclusions. Future iterations of the campaign could consider whether performance should be computed as an average over multiple runs, in order to get more precise results for stochastic systems such as ours.
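One simple way to put a number on such a comparison is a two-sided sign test. The sketch below was not part of our experiments; it assumes that the 30 topics are independent paired comparisons and treats the 13 remaining topics as losses, ignoring possible ties. The function name is hypothetical.

```python
from math import comb


def sign_test_p(wins, n):
    """Two-sided exact binomial p-value for `wins` successes out of `n`
    paired comparisons, under the null hypothesis of equal performance."""
    k = max(wins, n - wins)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_value = sign_test_p(17, 30)
```

For 17 wins out of 30 the p-value is roughly 0.58, which is far from any conventional significance threshold and consistent with the differences being due to chance.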

We can however see smaller variability in the mean performance across all topics, which might suggest that these are more reliable estimates. However, these give little indication as to how the performance depends on topic composition.

5.5 General Remarks on the Shared Task Model

The Shared Task Model is typically implemented in evaluation campaigns that seek to perform a community-wide technical evaluation of systems addressing a particular task. A Shared Task thus offers an evaluation paradigm that includes: 1) a specific definition of the task and evaluation metrics; 2) an implementation through the dissemination of datasets and evaluation tools; and 3) the execution of the evaluation in a controlled setting where participants have access to data at the same time and are evaluated blindly by an independent third party. As outlined below, this year the TAR task was not conducted according to the Shared Task Model.

In this iteration of the evaluation campaign, the final set of evaluation metrics was decided only shortly before participants were required to freeze their systems. One of the expected outcomes of evaluation campaigns such as this is indeed the discussion of the relative merits of the various metrics to be used. However, changing the target metric close to the submission deadline means that some participants may have optimized for different metrics than those ultimately used for evaluation.

The gold standard labeled test data was distributed directly to the participants at the beginning of the test phase. This is explained by the lack of an assessor through which participants could receive relevance feedback, as has been the case in e.g. TREC Total Recall. While common labeled test collections are routinely used for research, this procedure is unusual in a shared task setting, where participants are typically asked to process a test dataset while being blind to the gold standard associated with the dataset. Blind evaluation could alternatively have been accomplished in part by requiring the submission of runs without relevance feedback before the distribution of the gold standard labels.

Another feature of the shared task model is the computation of performance metrics for all participants by a common, independent party, which ensures that all participations are evaluated under exactly the same conditions. This gives greater confidence in the comparability and reproducibility of results. At the time of writing, while a common evaluation tool has been released, the performance reported by participants has been self-computed without validation from the task organizers. In addition to result validation, it would also have been useful to receive an indication of the overall performance of the participants prior to the deadline for the submission of the working notes. This would have enabled a discussion about the relative performance of the system, which is currently difficult without comparing with previous literature using external datasets.

6 Conclusions

Our best system is the one using logistic regression trained using stochastic gradient descent, using a minimum of preprocessing, and no relevance feedback. This system achieves a workload reduction of 64.0% on average, with a minimum workload reduction of 19.3%, and a maximum workload reduction of 92.0%. On average, we would have to screen 1678 articles per topic to retrieve all relevant articles. Overall there is a large variation in performance across topics however.

We do not generally see an improvement when using relevance feedback. For the topics where relevance feedback is hypothetically feasible we sometimes see an improvement, although the effect does not appear very reliable, and the low power of the experiment means that the results are unlikely to be significant.

Acknowledgments

This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 676207.

[1] Cohen, A.M., Hersh, W.R., Peterson, K., Yen, P.: Reducing Workload in Systematic Review Preparation Using Automated Citation Classification, pp. 206–219 (2006)

[2] Cohen, A.M.: Optimizing feature representation for automated systematic review work prioritization. AMIA Annual Symposium Proceedings, pp. 121–5 (2008)

[3] Davis, J., Goadrich, M.: The relationship between precision-recall and ROC curves. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 233–240. ACM (2006)

[4] Kanoulas, E., Li, D., Azzopardi, L., Spijker, R.: Overview of the CLEF technologically assisted reviews in empirical medicine

[5] Khabsa, M., Elmagarmid, A., Ilyas, I., Hammady, H., Ouzzani, M.: Learning to identify relevant studies for systematic reviews using random forest and external information. Machine Learning 102(3), 465–482 (2016)

[6] O'Mara-Eves, A., Thomas, J., McNaught, J., Miwa, M., Ananiadou, S.: Using text mining for study identification in systematic reviews: a systematic review of current approaches. Systematic Reviews 4(1), 5 (2015)

[7] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12(Oct), 2825–2830 (2011)

[8] Petersen, H., Poon, J., Poon, S.K., Loy, C.: Increased workload for systematic review literature searches of diagnostic tests compared with treatments: Challenges and opportunities. JMIR Medical Informatics 2(1), e11 (2014)

[9] Saito, T., Rehmsmeier, M.: The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE 10(3), e0118432 (2015)

[10] Suominen, H., Kelly, L., Goeuriot, L., Kanoulas, E., Spijker, R., Neveol, A., Zuccon, G., Palotti, J.R.M.: Overview of the CLEF eHealth evaluation lab 2017. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction - 8th International Conference of the CLEF Association, CLEF 2017, Dublin, Ireland, September 11-14, 2017, Proceedings. Lecture Notes in Computer Science, Springer (2017)