<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>B. Grun and K. Hornik. `topicmodels: An R package for tting topic mod-
els'. In: Journal of Statistical Software</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1532-4435</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Predicting Publication Inclusion for Diagnostic Accuracy Test Reviews Using Random Forests and Topic Modelling</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>A.J. van Altena</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>S.D. Olabarriaga</string-name>
          <email>s.d.olabarriaga@amc.uva.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Epidemiology, Biostatistics and Bioinformatics Academic Medical Center of the University of Amsterdam</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <volume>13</volume>
      <issue>2011</issue>
      <fpage>11</fpage>
      <lpage>14</lpage>
      <abstract>
        <p>Finding all relevant publications to perform a systematic review can be a time-consuming task, especially in the field of diagnostic test accuracy. Therefore, the CLEF eHealth lab `technologically assisted reviews in empirical medicine' was established to create a basis of comparison between various methods. In this paper we describe a method submitted to the lab. This method consists of a topic model used to extract features and a random forest to classify the relevant papers. Classifier performance shows an average decrease of 33.3% in workload (i.e., documents to read) when aiming for 95% recall, and of 24.9% for 100% recall. However, there is large variation in workload reduction (from 0.9% to 79.3%) across the diagnostic test accuracy reviews.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Finding the right publications to include in a systematic review can be a
time-consuming task in the medical research field, especially in Diagnostic Test
Accuracy (DTA) reviews. This type of research aims to summarise all evidence on
a specific topic by analysing primary research, for example to study the accuracy
of Lyme borreliosis tests [16]. Because systematic reviewers aim to retrieve all
relevant publications, their search queries have to be very inclusive (i.e., broad).
The number of results that these searches yield can range from a few hundred to
hundreds of thousands, while the sought publications (inclusions) account for
only a very small part (often less than 1%). Sometimes the search strategy can
be narrowed down by applying the filters that publication databases (such as
PubMed, Scopus, or Ovid) provide. DTA differs from other types of systematic
reviews because the search filters that could select the correct type of publication
are not consistent enough to deliver trustworthy output.</p>
      <p>Many methods have been proposed to lighten the burden on systematic
reviewers. With the increased popularity of machine learning for text mining,
applying such techniques seems a logical step. However, identifying
publications for inclusion is a difficult task because the available data is mostly
unstructured text.</p>
      <p>In 2015 a study identified 44 different text mining and machine learning
methods [20]. However, there are at least two issues that can make a researcher
who performs systematic reviews reluctant to apply these methods: (a) the
comparison between the different methods is difficult because there is no de facto
performance measure; and (b) even when the workload can be greatly reduced
(by up to 70%), there is no guarantee of perfect recall of all relevant publications.</p>
      <p>To work towards solving these issues, the `technologically assisted reviews in
empirical medicine' lab [15] was started as a subsidiary of the CLEF eHealth labs
[10]. In this lab a dataset of approximately 50 DTA studies with close to 270,000
publications was released. For 20 DTA studies the inclusion and exclusion labels
were known to enable method development. To compete in the lab, the labels of
the other 30 studies had to be predicted.</p>
      <p>In this paper we describe the method that we applied to this problem. To
extract features from the publications, the unsupervised text mining method
`Topic Modelling' was used. The features were then fed into a `Random Forest'
to classify the unknown publications.</p>
    </sec>
    <sec id="sec-2">
      <title>Methods</title>
      <p>
        In this section we describe feature extraction with topic modelling (TM),
classification through Random Forests (RF), and how the stability of results was assessed.
More details about TM, our approach, and implementation can be obtained from
our earlier work [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and from the code [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ].
      </p>
      <sec id="sec-2-1">
        <title>Feature extraction</title>
        <p>
          For extracting features from the corpus TM was applied [
          <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
          ]. TM constructs
topics (i.e., ordered lists of words) by considering each word in a document and
estimating two latent variables, namely topic-to-document (θ) and word-to-topic
(φ). When two words appear together in many documents, they have a higher
chance of appearing in the same topic (through the word-to-topic relationship).
Also, all documents with those words have a strong topic-to-document relation
to that specific topic. Note also that each document and word may have
relationships with multiple topics, which is useful in the case of (bio)medical research,
where publications may contain many concepts (e.g., research field, methods
applied, etc.).
        </p>
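        <p>The sketch below shows how these two distributions can be read from a fitted model in R; it is a minimal illustration assuming an LDA object (here called model) from the topicmodels package, not code taken from our released packages.</p>
        <preformat>
# A minimal sketch, assuming `model' is an LDA fit from the topicmodels package.
library(topicmodels)
theta = posterior(model)$topics  # topic-to-document weights (documents x topics)
phi   = posterior(model)$terms   # word-to-topic weights (topics x terms)
terms(model, 5)                  # the five highest-ranked words per topic
</preformat>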
        <p>
          The pre-processing, TM fitting, and post-processing steps are implemented
in two packages, respectively using the PHP [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] and R [
          <xref ref-type="bibr" rid="ref2">2, 21</xref>
          ] languages.
        </p>
        <p>
          Pre-processing consisted of preparing the documents for ingestion into the
R environment and cleaning the text. Preparing for ingestion was performed
using article miner [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. This PHP package retrieved articles from PubMed
through the public API using the provided PubMed IDs. The titles and abstracts
of all articles were parsed into a single CSV file, and the hyphens in hyphenated words
were replaced by underscores to assist in further cleaning steps. Corpus
cleaning was executed using the R tm package [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Processing consisted of removing
punctuation, numbers, whitespace, and stop words taken from the SMART list
[18, 22] (e.g., about, the, which)1.
        </p>
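        <p>The cleaning steps named above translate almost directly into tm calls. The sketch below is a minimal illustration; the construction of the input vector docs from the parsed CSV is our assumption rather than the released code.</p>
        <preformat>
# A minimal sketch of the corpus cleaning described above, using the tm package.
# `docs' (one title-plus-abstract string per article) is an assumed input.
library(tm)
corpus = VCorpus(VectorSource(docs))
corpus = tm_map(corpus, removePunctuation)                # remove punctuation
corpus = tm_map(corpus, removeNumbers)                    # remove numbers
corpus = tm_map(corpus, stripWhitespace)                  # collapse whitespace
corpus = tm_map(corpus, removeWords, stopwords("SMART"))  # SMART stop word list
</preformat>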
        <p>
          Fitting was performed using the same approach as in our previous work [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
Multiple topic models were fitted with input parameters that were based on
the literature and previous experience. The number of topics (T) has to be provided
as an input to the method, so a range of T ∈ {25, 50, 75} was chosen to generate
three models. Furthermore, the inputs α and β (which can be considered `smoothing'
parameters for the θ and φ distributions; for more details see [23]) were set at
α = 50/T and β = 0.01, and models were run for 500 iterations [
          <xref ref-type="bibr" rid="ref7">23, 7</xref>
          ]. TM
results were post-processed to determine θ, which is not calculated directly by
the applied TM implementation. Each of these steps was implemented in R using
the tm and topicmodels packages [
          <xref ref-type="bibr" rid="ref9">11, 9, 21</xref>
          ].
        </p>
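        <p>A minimal sketch of this fitting setup is shown below; note that the topicmodels package exposes the β smoothing parameter under the name delta, and the document-term matrix dtm is an assumed input built from the cleaned corpus.</p>
        <preformat>
# A minimal sketch of the fitting described above (Gibbs sampling, 500 iterations).
library(topicmodels)
models = lapply(c(25, 50, 75), function(T) {
  LDA(dtm, k = T, method = "Gibbs",
      control = list(alpha = 50 / T,  # smoothing for the theta distribution
                     delta = 0.01,    # smoothing for the phi distribution (beta)
                     iter  = 500))
})
</preformat>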
      </sec>
      <sec id="sec-2-2">
        <title>Classifier</title>
        <p>To determine whether documents should be considered inclusions or exclusions,
the features extracted with TM (i.e., the θ matrix) were used as input to a Random
Forest (RF). The RF method was chosen because of its suitability for binary
outcomes (i.e., inclusion or exclusion). Training and analysis of RF outcomes
were implemented using the caret R package [14]. The number of trees was set
at 800, determined by examining the error-by-number-of-trees graph on larger
test runs (i.e., 1500 trees). Choosing the optimal number of sampled parameters
per tree was done by the caret package using the tuneGrid setting. The search
grid was set in increments of 10 up to the size of the input TM (i.e., the number
of topics, T), and included T when T mod 10 ≠ 0. For example, when T = 75
the grid was {10, 20, 30, 40, 50, 60, 70, 75}. Performance was assessed using the ROC
curve and the F1-measure, where the latter is the harmonic mean of recall and
precision:</p>
        <p>F1 = 2 · (precision · recall) / (precision + recall)   (1)</p>
      </sec>
      <sec id="sec-2-3">
        <title>Resources</title>
        <p>All runs were performed on cloud servers with a varying number of cores and
amount of RAM. Test runs used a larger number of cores and more RAM because one model
had to be trained for each T (three in total). Our method benefits from more
cores as the applied packages allow parallelism, and each TM can be trained
individually. Furthermore, caret parallelises the cross-validation folds performed
inside its train function once a backend is registered with the registerDoMC
function (from the doMC package). Lastly,
titles and abstracts of documents were retrieved from PubMed using the Entrez
API.</p>
        <p>1 The full list can be found at [17].</p>
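        <p>A minimal sketch of this registration step; the core count is an assumption, as it varied per server.</p>
        <preformat>
# Register a parallel backend so caret's train() runs its
# cross-validation folds in parallel.
library(doMC)
registerDoMC(cores = 16)  # assumed core count; varied per cloud server
</preformat>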
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>In this section we describe the results of the training runs and the test run for
the CLEF eHealth lab. The purpose of the training runs was to fine-tune our
method, whereas the test run was submitted to compete in the lab.
Not all documents could be retrieved through the Entrez API. In the training set
38 documents are missing, and abstracts were missing for 17 included documents.
The test set had 7 documents missing; it is unknown how many abstracts were
missing from included documents.</p>
      <p>To achieve the optimal TM and RF settings, various training runs were
performed. Three different settings for T were tried to optimise the TM. For each
TM an RF was trained and tested. The resulting F1-measures are shown in
Figure 1. While the individual F1-measures are poor due to the class imbalance
in the input data, little difference is visible between the different values of T.
Furthermore, ROC curves for each RF are shown in Figure 2.</p>
      <p>Optimisation of the number of trees was done according to the reported error
rate (data not shown). A steep drop in error is visible between 1 and 200 trees,
and the error rate remains at a plateau from 200 until 1500 trees are reached.</p>
      <sec id="sec-3-1">
        <title>Testing</title>
        <p>Results of the test run are shown in Tables 1 and 2, and Figure 3, organised
by workload reduction (i.e., Work Saved over Sampling, WSS). Performance
outcomes are split into two groups based on whether WSS at 95% recall (WSS 95) is
greater than at 100% recall (WSS 100) or not, shown in Table 1 and Table 2 respectively.
This split was done to better represent the results. The group where WSS 100
is greater than WSS 95 has a smaller number of relevant documents, and therefore
the performance outcomes act more erratically (see Figure 3-left).</p>
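        <p>WSS is not formally defined in this paper; under its usual definition it measures the fraction of documents a reviewer is saved from reading, compared to screening a random sample that achieves the same recall. A minimal sketch:</p>
        <preformat>
# The usual definition of Work Saved over Sampling at a target recall level r:
#   WSS@r = (TN + FN) / N - (1 - r)
# where N is the total number of documents returned by the review's search.
wss = function(tn, fn, n, r) (tn + fn) / n - (1 - r)
wss(tn = 900, fn = 5, n = 1000, r = 0.95)  # toy example: 0.855
</preformat>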
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Discussion</title>
      <p>Little variation was shown in RF performance in Figures 1 and 2. However,
because fitting large TMs (i.e., many documents and topics) consumes a large
amount of RAM, our implementation was limited to approximately T = 75. Bigger
TMs failed with out-of-memory errors on the largest servers available. Other
implementations employ an online training method [13], which is implemented
in [12] and circumvents out-of-memory errors by loading only a subset
of documents into memory at a time. Therefore, while the performance of the RFs was
stable, further fine-tuning of the TMs would be necessary to find the optimal
features for classification.</p>
      <p>The test run performance shows that a considerable workload reduction
(WSS) can be achieved for both 100% and 95% recall of relevant documents.
When considering the WSS at 100% recall, our method has an acceptable
performance (&gt;10% decrease in workload) in 18 out of 30 reviews. At 95% recall
this number increases to 22 out of 30 reviews. The classifier has a good
performance (&gt;50% decrease in workload) for 6 and 8 reviews out of 30, respectively (at
100% and 95% recall).</p>
      <p>WSS varies widely among the various DTA studies, as shown in Tables 1
and 2. There can be multiple reasons, one of which is the similarity of
documents within a single DTA study. When the topics of documents are relatively
similar to each other, the classifier's score assigned to each document will be
less distinctive. This may result in relevant documents being far apart in the
ranking, thereby introducing more false positives. Another reason is that there
could be a large difference between the topics of the documents. When the
topics in relevant documents from a certain DTA study do not line up with the
topics found in the DTA studies used for training, the classifier cannot make the
distinction between relevant and non-relevant documents.</p>
      <p>TM was chosen in our method because it identifies topics that are shared
between documents. Therefore, it can be employed to find similarities between
documents. However, it may also assist in building better search queries. For
example, by finding the variable importance of the RF (using the varImp
function of the caret package), the most important topics can be identified that
distinguish between inclusion and exclusion in DTA reviews. Exploring and
interpreting these topics could further specify the search query by suggesting search
terms, either to include or exclude publications, as the sketch below illustrates.</p>
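      <p>A minimal sketch, assuming fit is the caret RF model and model the fitted topic model from the earlier sketches; the topic naming of the feature columns is also an assumption.</p>
      <preformat>
# Rank topics by their importance to the inclusion/exclusion decision,
# then inspect the top words of the most discriminative topic.
imp = varImp(fit)                    # variable importance from the caret model
print(imp)                           # topics ordered by importance
best = rownames(imp$importance)[which.max(imp$importance[, 1])]
terms(model, 10)[, as.integer(best)] # candidate search terms (assumes feature
                                     # columns were named by topic number)
</preformat>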
      <p>Finally, TM and RF can be employed in an unsupervised manner, which
relieves the reviewers from the task of providing training data to the method.
The future of automation will likely rely on a compound method consisting of
various classification techniques. We think the method proposed in this study
contributes to systematic review automation by making an initial ordering of
documents. While documents are being read and included or excluded, an online
method can further refine the reading order of documents.</p>
      </sec>
      <sec id="sec-2-5">
        <title>Related research</title>
        <p>
          Both Bekhuis et al. and Mo et al. [
          <xref ref-type="bibr" rid="ref4">4, 19</xref>
          ] report on the use of TM as a feature
in predicting systematic review inclusion. In both cases the systematic reviews
are not specifically DTA related.
        </p>
      <p>Bekhuis et al. report that classification performance outcomes for DTA
reviews are better when compared to non-DTA reviews. This is likely due to the
fact that DTA reviews focus on a very specific topic, which is easier to capture
in features. From the results of Bekhuis et al. it is apparent that, while recall is
relatively high for classifiers based on TM features, the precision is often lacking.
This observation can also be made for the F1-measure presented in this paper
(see Figure 1). Therefore, finding a feature which increases the precision of the
classification method would greatly improve performance measures such as F1 and
would also reduce the workload (i.e., documents to read).</p>
      <p>Mo et al. compare methods using either bag-of-words or TM features. They
report that TM yields a better recall, which is a highly important metric when
considering systematic reviews, where reviewers want to find all relevant
documents.</p>
      <p>It is difficult to compare the employed methods directly because the
experiment designs and reported performance measures vary. This is one of the
difficulties systematic reviewers encounter when they consider various
classification systems, as also reported in [20]. The performance measures reported
in this paper are standardised according to the CLEF eHealth lab, which should
contribute towards a better understanding of classification methods.</p>
      </sec>
      <sec id="sec-2-6">
        <title>Acknowledgements</title>
        <p>This work was carried out on the High Performance Computing Cloud resources
of the Dutch national e-infrastructure with the support of the SURF Foundation.
Furthermore, we would like to thank P.D. Moerland, A.H. Zwinderman, and
M.M.G. Leeflang for their contributions and advice.</p>
      </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A. J. van Altena. Article</given-names>
            <surname>Miner</surname>
          </string-name>
          .
          <year>2017</year>
          . url: https : / / github . com / AMCeScience/article-miner.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A. J. van Altena. R-</given-names>
            <surname>CLEF</surname>
          </string-name>
          .
          <year>2017</year>
          . url: https://github.com/Flythe/RCLEF.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>A. J. van Altena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. D.</given-names>
            <surname>Moerland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Zwinderman</surname>
          </string-name>
          and
          <string-name>
            <given-names>S. D.</given-names>
            <surname>Olabarriaga</surname>
          </string-name>
          . `
          <article-title>Understanding big data themes from scienti c biomedical literature through topic modeling'</article-title>
          .
          <source>In: Journal of Big Data 3.1</source>
          (
          <issue>2016</issue>
          ), p.
          <fpage>23</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Bekhuis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tseytlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. J.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          and
          <string-name>
            <surname>D.</surname>
          </string-name>
          Demner-Fushman.
          <article-title>`Feature engineering and a proposed decision-support system for systematic reviewers of medical evidence'</article-title>
          .
          <source>In: PloS one 9</source>
          .1 (
          <issue>2014</issue>
          ),
          <year>e86277</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          . `
          <article-title>Probabilistic topic models'</article-title>
          .
          <source>In: Communications of the ACM 55.4</source>
          (
          <issue>2012</issue>
          ), pp.
          <volume>77</volume>
          {
          <fpage>84</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          and
          <string-name>
            <surname>M. I. Jordan.</surname>
          </string-name>
          `
          <article-title>Latent dirichlet allocation'</article-title>
          .
          <source>In: the Journal of Machine Learning Research</source>
          <volume>3</volume>
          (
          <year>2003</year>
          ), pp.
          <volume>993</volume>
          {
          <fpage>1022</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Manning</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Heer</surname>
          </string-name>
          . `
          <article-title>Topic model diagnostics: Assessing domain relevance via topical alignment'</article-title>
          .
          <source>In: Proceedings of the 30th International Conference on Machine Learning (ICML-13)</source>
          .
          <year>2013</year>
          , pp.
          <volume>612</volume>
          {
          <fpage>620</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G. V.</given-names>
            <surname>Cormack</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Grossman</surname>
          </string-name>
          . `
          <article-title>Engineering quality and reliability in technology-assisted review'</article-title>
          .
          <source>In: Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval. ACM</source>
          .
          <year>2016</year>
          , pp.
          <volume>75</volume>
          {
          <fpage>84</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>I.</given-names>
            <surname>Feinerer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hornik</surname>
          </string-name>
          and D. Meyer. `
          <article-title>Text mining infrastructure in R'</article-title>
          .
          <source>In: Journal of Statistical Software 25.5</source>
          (
          <issue>2008</issue>
          ), pp.
          <volume>1</volume>
          {
          <issue>54</issue>
          . url: http://www. jstatsoft.org/v25/i05/.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>