<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>B. Grun and K. Hornik. `topicmodels: An R package for tting topic mod-
els'. In: Journal of Statistical Software</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1532-4435</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Predicting Publication Inclusion for Diagnostic Accuracy Test Reviews Using Random Forests and Topic Modelling</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>A.J. van Altena</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>S.D. Olabarriaga</string-name>
          <email>s.d.olabarriaga@amc.uva.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Epidemiology, Biostatistics and Bioinformatics Academic Medical Center of the University of Amsterdam</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <volume>13</volume>
      <issue>2011</issue>
      <fpage>11</fpage>
      <lpage>14</lpage>
      <abstract>
        <p>Finding all relevant publications to perform a systematic review can be a time-consuming task, especially in the field of diagnostic test accuracy. Therefore, the CLEF eHealth lab `technologically assisted reviews in empirical medicine' was established to create a basis of comparison between various methods. In this paper we describe a method submitted to the lab. This method consists of a topic model used to extract features and a random forest to classify the relevant papers. Classifier performance shows an average decrease of 33.3% in workload (i.e., documents to read) when aiming for 95% recall, and of 24.9% for 100% recall. However, there is large variation in workload reduction (from 0.9% to 79.3%) across the diagnostic test accuracy reviews.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Finding the right publications to include in a systematic review can be a
time-consuming task in the medical research field, especially in Diagnostic Test
Accuracy (DTA) reviews. This type of research aims to summarise all evidence on
a specific topic by analysing primary research, for example to study the accuracy
of Lyme borreliosis tests [16]. Because systematic reviewers aim to retrieve all
relevant publications, their search queries have to be very inclusive (i.e., broad).
The number of results that these searches yield can range from a few hundred to
hundreds of thousands, while the sought publications (inclusions) account for
only a very small part (often less than 1%). Sometimes the search strategy can
be narrowed down by applying the filters that publication databases (such as
PubMed, Scopus, or Ovid) provide. DTA differs from other types of systematic
reviews because the search filters that could select the correct type of publication
are not consistent enough to deliver trustworthy output.</p>
      <p>Many methods have been proposed to lighten the burden on systematic
reviewers. With the increased popularity of machine learning for text mining,
applying such techniques seems a logical step. However, identifying
publications for inclusion is a difficult task because the available data is mostly
unstructured text.</p>
      <p>In 2015 a study identified 44 different text mining and machine learning
methods [20]. However, there are at least two issues that can make a researcher
who performs systematic reviews reluctant to apply these methods: (a) the
comparison between the different methods is difficult because there is no de facto
performance measure; and (b) even when the workload can be greatly reduced
(by up to 70%), there is no guarantee of perfect recall of all relevant publications.</p>
      <p>To work towards solving these issues, the `technologically assisted reviews in
empirical medicine' lab [15] was started as a subsidiary of the CLEF eHealth labs
[10]. In this lab a dataset of approximately 50 DTA studies with close to 270,000
publications was released. For 20 DTA studies the inclusion and exclusion labels
were known to enable method development. To compete in the lab, the labels of
the other 30 studies had to be predicted.</p>
      <p>In this paper we describe the method that we applied to this problem. To
extract features from the publications, the unsupervised text mining method
`Topic Modelling' was used. The features were then fed into a `Random Forest'
to classify the unknown publications.</p>
    </sec>
    <sec id="sec-2">
      <title>Methods</title>
      <p>
        In this section we describe feature extraction with topic modelling (TM),
classification through Random Forests (RF), and how the stability of results was assessed.
More details about TM, our approach, and implementation can be obtained from
our earlier work [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and from the code [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ].
      </p>
      <sec id="sec-2-1">
        <title>Feature extraction</title>
        <p>
          For extracting features from the corpus TM was applied [
          <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
          ]. TM constructs
topics (i.e., ordered lists of words) by considering each word in a document and
estimating two latent variables, namely topic-to-document (θ) and word-to-topic
(φ). When two words appear together in many documents, they have a higher
chance of appearing in the same topic (through the word-to-topic relationship).
Also, all documents with those words have a strong topic-to-document relation
to that specific topic. Note also that each document and word may have
relationships with multiple topics, which is useful in the case of (bio)medical research,
where publications may contain many concepts (e.g., research field, methods
applied, etc.).
        </p>
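        <p>The sketch below shows how these two distributions can be read from a fitted model in R; it is a minimal illustration assuming an LDA object (here called model) from the topicmodels package, not code taken from our released packages.</p>
        <preformat>
# A minimal sketch, assuming `model' is an LDA fit from the topicmodels package.
library(topicmodels)
theta = posterior(model)$topics  # topic-to-document weights (documents x topics)
phi   = posterior(model)$terms   # word-to-topic weights (topics x terms)
terms(model, 5)                  # the five highest-ranked words per topic
</preformat>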
        <p>
          The pre-processing, TM fitting, and post-processing steps are implemented
in two packages, respectively using the PHP [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] and R [
          <xref ref-type="bibr" rid="ref2">2, 21</xref>
          ] languages.
        </p>
        <p>
          Pre-processing consisted of preparing the documents for ingestion into the
R environment and cleaning the text. Preparing for ingestion was performed
using article miner [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. This PHP package retrieved articles from PubMed
through the public API using the provided PubMed IDs. The titles and abstracts
of all articles were parsed into a single CSV file, and the hyphens in hyphenated words
were replaced by underscores to assist in further cleaning steps. Corpus
cleaning was executed using the R tm package [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Processing consisted of removing
punctuation, numbers, whitespace, and stop words taken from the SMART list
[18, 22] (e.g., about, the, which)1.
        </p>
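        <p>The cleaning steps named above translate almost directly into tm calls. The sketch below is a minimal illustration; the construction of the input vector docs from the parsed CSV is our assumption rather than the released code.</p>
        <preformat>
# A minimal sketch of the corpus cleaning described above, using the tm package.
# `docs' (one title-plus-abstract string per article) is an assumed input.
library(tm)
corpus = VCorpus(VectorSource(docs))
corpus = tm_map(corpus, removePunctuation)                # remove punctuation
corpus = tm_map(corpus, removeNumbers)                    # remove numbers
corpus = tm_map(corpus, stripWhitespace)                  # collapse whitespace
corpus = tm_map(corpus, removeWords, stopwords("SMART"))  # SMART stop word list
</preformat>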
        <p>
          Fitting was performed using the same approach as in our previous work [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
Multiple topic models were fitted with input parameters that were based on
the literature and previous experience. The number of topics (T) has to be provided
as an input to the method, so a range of T ∈ {25, 50, 75} was chosen to generate
three models. Furthermore, the inputs α and β (which can be considered `smoothing'
parameters for the θ and φ distributions; for more details see [23]) were set at
α = 50/T and β = 0.01, and models were run for 500 iterations [
          <xref ref-type="bibr" rid="ref7">23, 7</xref>
          ]. TM
results were post-processed to determine θ, which is not calculated directly by
the applied TM implementation. Each of these steps was implemented in R using
the tm and topicmodels packages [
          <xref ref-type="bibr" rid="ref9">11, 9, 21</xref>
          ].
        </p>
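        <p>A minimal sketch of this fitting setup is shown below; note that the topicmodels package exposes the β smoothing parameter under the name delta, and the document-term matrix dtm is an assumed input built from the cleaned corpus.</p>
        <preformat>
# A minimal sketch of the fitting described above (Gibbs sampling, 500 iterations).
library(topicmodels)
models = lapply(c(25, 50, 75), function(T) {
  LDA(dtm, k = T, method = "Gibbs",
      control = list(alpha = 50 / T,  # smoothing for the theta distribution
                     delta = 0.01,    # smoothing for the phi distribution (beta)
                     iter  = 500))
})
</preformat>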
      </sec>
      <sec id="sec-2-2">
        <title>Classifier</title>
        <p>To determine whether documents should be considered inclusions or exclusions,
the features extracted with TM (i.e., the θ matrix) were used as input to a Random
Forest (RF). The RF method was chosen because of its suitability for binary
outcomes (i.e., inclusion or exclusion). Training and analysis of RF outcomes
were implemented using the caret R package [14]. The number of trees was set
at 800, determined by examining the error-by-number-of-trees graph on larger
test runs (i.e., 1500 trees). Choosing the optimal number of sampled parameters
per tree was done by the caret package using the tuneGrid setting. The search
grid was set in increments of 10 up to the size of the input TM (i.e., the number
of topics, T), and included T when T mod 10 ≠ 0. For example, when T = 75
the grid was {10, 20, 30, 40, 50, 60, 70, 75}. Performance was assessed using the ROC
curve and the F1-measure, where the latter is the harmonic mean of recall and
precision:</p>
        <p>F1 = 2 · (precision · recall) / (precision + recall)   (1)</p>
      </sec>
      <sec id="sec-2-3">
        <title>Resources</title>
        <p>All runs were performed on cloud servers with a varying number of cores and
amount of RAM. Test runs used a larger number of cores and more RAM because one model
had to be trained for each T (three in total). Our method benefits from more
cores as the applied packages allow parallelism, and each TM can be trained
individually. Furthermore, caret parallelises the cross-validation folds performed
inside its train function once a backend is registered with the registerDoMC
function (from the doMC package). Lastly,
titles and abstracts of documents were retrieved from PubMed using the Entrez
API.</p>
        <p>1 The full list can be found at [17].</p>
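        <p>A minimal sketch of this registration step; the core count is an assumption, as it varied per server.</p>
        <preformat>
# Register a parallel backend so caret's train() runs its
# cross-validation folds in parallel.
library(doMC)
registerDoMC(cores = 16)  # assumed core count; varied per cloud server
</preformat>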
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>In this section we describe the results of the training runs and the test run for
the CLEF eHealth lab. The purpose of the training runs was to fine-tune our
method, whereas the test run was submitted to compete in the lab.
Not all documents could be retrieved through the Entrez API. In the training set
38 documents are missing, and abstracts were missing for 17 included documents.
The test set had 7 documents missing; it is unknown how many abstracts were
missing from included documents.</p>
      <p>To achieve the optimal TM and RF settings, various training runs were
performed. Three different settings for T were tried to optimise the TM. For each
TM an RF was trained and tested. The resulting F1-measures are shown in
Figure 1. While the individual F1-measures are poor due to the class imbalance
in the input data, little difference is visible between the different values of T.
Furthermore, ROC curves for each RF are shown in Figure 2.</p>
      <p>Optimisation of the number of trees was done according to the reported error
rate (data not shown). A steep drop in error is visible between 1 and 200 trees,
and the error rate remains at a plateau from 200 until 1500 trees are reached.</p>
      <sec id="sec-3-1">
        <title>Testing</title>
        <p>Results of the test run are shown in Tables 1 and 2, and Figure 3, organised
by workload reduction (i.e., Work Saved over Sampling, WSS). Performance
outcomes are split into two groups based on whether WSS at 95% recall (WSS 95) is
greater than at 100% recall (WSS 100) or not, shown in Table 1 and Table 2 respectively.
This split was done to better represent the results. The group where WSS 100
is greater than WSS 95 has a smaller number of relevant documents, and therefore
the performance outcomes act more erratically (see Figure 3-left).</p>
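        <p>WSS is not formally defined in this paper; under its usual definition it measures the fraction of documents a reviewer is saved from reading, compared to screening a random sample that achieves the same recall. A minimal sketch:</p>
        <preformat>
# The usual definition of Work Saved over Sampling at a target recall level r:
#   WSS@r = (TN + FN) / N - (1 - r)
# where N is the total number of documents returned by the review's search.
wss = function(tn, fn, n, r) (tn + fn) / n - (1 - r)
wss(tn = 900, fn = 5, n = 1000, r = 0.95)  # toy example: 0.855
</preformat>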
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Discussion</title>
      <p>Little variation was shown in RF performance in Figures 1 and 2. However,
because fitting large TMs (i.e., many documents and topics) consumes a large
amount of RAM, our implementation was limited to approximately T = 75. Bigger
TMs failed with out-of-memory errors on the largest servers available. Other
implementations employ an online training method [13], which is implemented
in [12] and circumvents out-of-memory errors by loading only a subset
of documents into memory at a time. Therefore, while the performance of the RFs was
stable, further fine-tuning of the TMs would be necessary to find the optimal
features for classification.</p>
      <p>The test run performance shows that a considerable workload reduction
(WSS) can be achieved for both 100% and 95% recall of relevant documents.
When considering the WSS at 100% recall, our method has an acceptable
performance (&gt;10% decrease in workload) in 18 out of 30 reviews. At 95% recall
this number increases to 22 out of 30 reviews. The classifier has a good
performance (&gt;50% decrease in workload) for 6 and 8 reviews out of 30, respectively (at
100% and 95% recall).</p>
      <p>WSS varies widely among the various DTA studies, as shown in Tables 1
and 2. There can be multiple reasons, one of which is the similarity of
documents within a single DTA study. When the topics of documents are relatively
similar to each other, the classifier's score assigned to each document will be
less distinctive. This may result in relevant documents being far apart in the
ranking, thereby introducing more false positives. Another reason is that there
could be a large difference between the topics of the documents. When the
topics in relevant documents from a certain DTA study do not line up with the
topics found in the DTA studies used for training, the classifier cannot make the
distinction between relevant and non-relevant documents.</p>
      <p>TM was chosen in our method because it identifies topics that are shared
between documents. Therefore, it can be employed to find similarities between
documents. However, it may also assist in building better search queries. For
example, by finding the variable importance of the RF (using the varImp
function of the caret package), the most important topics can be identified that
distinguish between inclusion and exclusion in DTA reviews. Exploring and
interpreting these topics could further specify the search query by suggesting search
terms, either to include or exclude publications, as the sketch below illustrates.</p>
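      <p>A minimal sketch, assuming fit is the caret RF model and model the fitted topic model from the earlier sketches; the topic naming of the feature columns is also an assumption.</p>
      <preformat>
# Rank topics by their importance to the inclusion/exclusion decision,
# then inspect the top words of the most discriminative topic.
imp = varImp(fit)                    # variable importance from the caret model
print(imp)                           # topics ordered by importance
best = rownames(imp$importance)[which.max(imp$importance[, 1])]
terms(model, 10)[, as.integer(best)] # candidate search terms (assumes feature
                                     # columns were named by topic number)
</preformat>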
      <p>Finally, TM and RF can be employed in an unsupervised manner, which
relieves the reviewers from the task of providing training data to the method.
The future of automation will likely rely on a compound method consisting of
various classification techniques. We think the method proposed in this study
contributes to systematic review automation by making an initial ordering of
documents. While documents are being read and included or excluded, an online
method can further refine the reading order of documents.</p>
      </sec>
      <sec id="sec-2-5">
        <title>Related research</title>
        <p>
          Both Bekhuis et al. and Mo et al. [
          <xref ref-type="bibr" rid="ref4">4, 19</xref>
          ] report on the use of TM as a feature
in predicting systematic review inclusion. In both cases the systematic reviews
are not specifically DTA related.
        </p>
      <p>Bekhuis et al. report that classification performance outcomes for DTA
reviews are better when compared to non-DTA reviews. This is likely due to the
fact that DTA reviews focus on a very specific topic, which is easier to capture
in features. From the results of Bekhuis et al. it is apparent that, while recall is
relatively high for classifiers based on TM features, the precision is often lacking.
This observation can also be made for the F1-measure presented in this paper
(see Figure 1). Therefore, finding a feature which increases the precision of the
classification method would greatly improve performance measures such as F1 and
would also reduce the workload (i.e., documents to read).</p>
      <p>Mo et al. compare methods using either bag-of-words or TM features. They
report that TM yields a better recall, which is a highly important metric when
considering systematic reviews, where reviewers want to find all relevant
documents.</p>
      <p>It is difficult to compare the employed methods directly because the
experiment designs and reported performance measures vary. This is one of the
difficulties systematic reviewers encounter when they consider various
classification systems, as also reported in [20]. The performance measures reported
in this paper are standardised according to the CLEF eHealth lab, which should
contribute towards a better understanding of classification methods.</p>
      </sec>
      <sec id="sec-2-6">
        <title>Acknowledgements</title>
        <p>This work was carried out on the High Performance Computing Cloud resources
of the Dutch national e-infrastructure with the support of the SURF Foundation.
Furthermore, we would like to thank P.D. Moerland, A.H. Zwinderman, and
M.M.G. Leeflang for their contributions and advice.</p>
      </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A. J. van Altena. Article</given-names>
            <surname>Miner</surname>
          </string-name>
          .
          <year>2017</year>
          . url: https : / / github . com / AMCeScience/article-miner.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A. J. van Altena. R-</given-names>
            <surname>CLEF</surname>
          </string-name>
          .
          <year>2017</year>
          . url: https://github.com/Flythe/RCLEF.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>A. J. van Altena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. D.</given-names>
            <surname>Moerland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Zwinderman</surname>
          </string-name>
          and
          <string-name>
            <given-names>S. D.</given-names>
            <surname>Olabarriaga</surname>
          </string-name>
          . `
          <article-title>Understanding big data themes from scienti c biomedical literature through topic modeling'</article-title>
          .
          <source>In: Journal of Big Data 3.1</source>
          (
          <issue>2016</issue>
          ), p.
          <fpage>23</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Bekhuis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tseytlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. J.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          and
          <string-name>
            <surname>D.</surname>
          </string-name>
          Demner-Fushman.
          <article-title>`Feature engineering and a proposed decision-support system for systematic reviewers of medical evidence'</article-title>
          .
          <source>In: PloS one 9</source>
          .1 (
          <issue>2014</issue>
          ),
          <year>e86277</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          . `
          <article-title>Probabilistic topic models'</article-title>
          .
          <source>In: Communications of the ACM 55.4</source>
          (
          <issue>2012</issue>
          ), pp.
          <volume>77</volume>
          {
          <fpage>84</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          and
          <string-name>
            <surname>M. I. Jordan.</surname>
          </string-name>
          `
          <article-title>Latent dirichlet allocation'</article-title>
          .
          <source>In: the Journal of Machine Learning Research</source>
          <volume>3</volume>
          (
          <year>2003</year>
          ), pp.
          <volume>993</volume>
          {
          <fpage>1022</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Manning</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Heer</surname>
          </string-name>
          . `
          <article-title>Topic model diagnostics: Assessing domain relevance via topical alignment'</article-title>
          .
          <source>In: Proceedings of the 30th International Conference on Machine Learning (ICML-13)</source>
          .
          <year>2013</year>
          , pp.
          <volume>612</volume>
          {
          <fpage>620</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G. V.</given-names>
            <surname>Cormack</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Grossman</surname>
          </string-name>
          . `
          <article-title>Engineering quality and reliability in technology-assisted review'</article-title>
          .
          <source>In: Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval. ACM</source>
          .
          <year>2016</year>
          , pp.
          <volume>75</volume>
          {
          <fpage>84</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>I.</given-names>
            <surname>Feinerer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hornik</surname>
          </string-name>
          and D. Meyer. `
          <article-title>Text mining infrastructure in R'</article-title>
          .
          <source>In: Journal of Statistical Software 25.5</source>
          (
          <issue>2008</issue>
          ), pp.
          <volume>1</volume>
          {
          <issue>54</issue>
          . url: http://www. jstatsoft.org/v25/i05/.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>