LIMSI@CLEF eHealth 2018 Task 2: Technology Assisted Reviews by Stacking Active and Static Learning

Christopher Norman 1,2, Mariska Leeflang 2, and Aurélie Névéol 1
1 LIMSI, CNRS, Université Paris Saclay, F-91405 Orsay, firstname.lastname@limsi.fr
2 Academic Medical Center, University of Amsterdam, Amsterdam, the Netherlands, m.m.leeflang@uva.nl

Abstract. This paper describes the participation of the LIMSI-MIROR team at CLEF eHealth 2018, task 2. The task addresses the automatic ranking of articles in order to assist with the screening process of Diagnostic Test Accuracy (DTA) systematic reviews. We ranked articles by stacking two models, one linear model trained on untargeted training data, and one model using active learning. The workload reduction to retrieve 95% of the relevant articles was estimated at 82.4%, and we observe a workload reduction of less than 70% in only two topics. The results suggest that automatic assistance is promising for ranking the DTA literature.

Keywords: Evidence Based Medicine, Information Storage and Retrieval, Review Literature as Topic, Supervised Machine Learning

1 Introduction

Systematic reviews seek to gather all available published evidence for a given topic and provide an informed analysis of the results. This work constitutes one of the strongest forms of scientific evidence. Systematic reviews are an integral part of evidence based medicine in particular, and serve a key role in informing and guiding public and institutional decision-making. Systematic reviews of Diagnostic Test Accuracy (DTA) studies have been shown to be particularly challenging compared to other types of reviews because of the difficulty of defining search strategies that offer acceptable recall [7]. For this reason, there is a need to investigate automation strategies to assist DTA systematic review authors, particularly in the time-consuming screening process.

Methods for automating the screening process in systematic reviews have been actively researched over the years [6], with promising results obtained using a range of machine learning methods. However, previous work has not addressed DTA studies.

This paper describes the work underlying our participation in the CLEF 2018 eHealth Task 2 [4, 8]. This work is part of an ongoing effort to provide automated assistance for the screening process in systematic reviews addressing a variety of topics, including DTA studies.

The remainder of this paper is organized as follows: Section 2 presents the dataset used for system development, Section 3 provides an overview of our system and describes each component, Section 4 reports our results, and Section 5 provides an analysis of our methods and participation in the task.

2 Material

In this work we have used the CLEF dataset [3] as the gold standard for evaluation. The first iteration (2017) of the CLEF dataset comprised 50 DTA systematic review topics (20 for training, 30 for testing), each associated with the full list of articles retrieved by an expert query and assessed for inclusion based on title and abstract or full text. The second iteration (2018) uses the previous 50 topics for training and supplies an additional 30 topics for testing.

For both iterations of the dataset we know the inclusion decisions based on the abstracts, as well as the inclusion decisions based on the full text. We thus have two definitions of positive examples, depending on whether we use the abstract decisions or the full text decisions as the gold standard.
We use a tripartite labeling to reflect this:
– No (N) is the set of articles that were excluded based on the abstract
– Maybe (M) is the set of articles that were preliminarily included based on the abstract, but later excluded based on the full text
– Yes (Y) is the set of articles that were included based on both the abstract and the full text, and later used in the meta-analysis

Table 1 shows a breakdown of the distribution of examples for each class in the CLEF dataset.

3 Methods

To rank candidate articles we construct three machine learning models.

3.1 Overview

cnrs static: Our static ranker uses logistic regression trained on a large number (> 500,000) of features. This model is trained once on train split 1 (Table 1), and can then be used to rank candidate articles in any unseen DTA systematic review, without a provided search query or topic description. The model is intended to capture diagnostic test accuracy studies without considering whether the articles are topically relevant.

Split / Topic | absolute number: Y, M, N | relative number: Y, M, N
train split 1 (2017 train split)
CD008643 4 7 15065 0.0% 0.0% 99.9%
CD009593 24 54 14844 0.2% 0.4% 99.5%
CD011549 1 1 12699 0.0% 0.0% 100.0%
CD010771 1 47 274 0.3% 14.6% 85.1%
CD010438 3 36 3211 0.1% 1.1% 98.8%
CD007427 17 106 1398 1.1% 7.0% 91.9%
CD008686 5 2 3946 0.1% 0.1% 99.8%
CD011548 5 108 12591 0.0% 0.9% 99.1%
CD007394 47 48 2450 1.8% 1.9% 96.3%
CD009323 9 113 3757 0.2% 2.9% 96.9%
CD010632 14 18 1472 0.9% 1.2% 97.9%
CD011975 60 559 7582 0.7% 6.8% 92.5%
CD009944 64 53 1064 5.4% 4.5% 90.1%
CD009591 41 103 7847 0.5% 1.3% 98.2%
CD011134 49 166 1738 2.5% 8.5% 89.0%
CD009020 12 150 1422 0.8% 9.5% 89.8%
CD010409 41 35 43287 0.1% 0.1% 99.8%
CD008691 20 53 1243 1.5% 4.0% 94.5%
CD011984 28 426 7738 0.3% 5.2% 94.5%
CD008054 41 233 2940 1.3% 7.2% 91.5%
train split 2 (2017 test split)
CD010783 11 19 10875 0.1% 0.2% 99.7%
CD009135 19 58 714 2.4% 7.3% 90.3%
CD009185 23 69 1523 1.4% 4.3% 94.3%
CD010023 14 38 929 1.4% 3.9% 94.7%
CD010653 0 45 7957 0.0% 0.6% 99.4%
CD009647 17 39 2729 0.6% 1.4% 98.0%
CD011145 48 154 10670 0.4% 1.4% 98.1%
CD008760 9 3 52 14.1% 4.7% 81.2%
CD010775 4 7 230 1.7% 2.9% 95.4%
CD009925 55 405 6071 0.8% 6.2% 93.0%
CD009372 10 15 2223 0.4% 0.7% 98.9%
CD010896 3 3 163 1.8% 1.8% 96.4%
CD010542 8 12 328 2.3% 3.4% 94.3%
CD008803 99 0 5121 1.9% 0.0% 98.1%
CD009519 46 58 5867 0.8% 1.0% 98.3%
CD010386 1 1 623 0.2% 0.2% 99.7%
CD008782 34 11 10462 0.3% 0.1% 99.6%
CD009579 79 59 6317 1.2% 0.9% 97.9%
CD010772 11 36 269 3.5% 11.4% 85.1%
CD009551 16 30 1865 0.8% 1.6% 97.6%
CD010173 10 13 5472 0.2% 0.2% 99.6%
CD010339 9 105 12689 0.1% 0.8% 99.1%
CD010633 3 1 1569 0.2% 0.1% 99.7%
CD010705 18 5 91 15.8% 4.4% 79.8%
CD012019 1 2 10314 0.0% 0.0% 100.0%
CD007431 15 9 2050 0.7% 0.4% 98.8%
CD010276 24 30 5441 0.4% 0.5% 99.0%
CD009786 6 4 2055 0.3% 0.2% 99.5%
CD008081 10 16 944 1.0% 1.6% 97.3%
CD010860 4 3 87 4.3% 3.2% 92.6%
test split (2018 test split)
CD011602 1 7 6149 0.0% 0.1% 99.9%
CD011515 1 126 7117 0.0% 1.7% 98.2%
CD010864 3 41 2461 0.1% 1.6% 98.2%
CD012083 5 6 311 1.6% 1.9% 96.6%
CD010680 0 26 8379 0.0% 0.3% 99.7%
CD011431 26 271 885 2.2% 22.9% 74.9%
CD012216 1 10 206 0.5% 4.6% 94.9%
CD012281 9 14 9853 0.1% 0.1% 99.8%
CD011686 2 53 9388 0.0% 0.6% 99.4%
CD009175 7 58 5579 0.1% 1.0% 98.8%
CD010213 33 566 14599 0.2% 3.7% 96.1%
CD010657 35 104 1720 1.9% 5.6% 92.5%
CD012599 19 556 7473 0.2% 6.9% 92.9%
CD011420 5 37 209 2.0% 14.7% 83.3%
CD012009 4 33 499 0.7% 6.2% 93.1%
CD009263 10 114 78679 0.0% 0.1% 99.8%
CD011926 29 11 4010 0.7% 0.3% 99.0%
CD008122 57 215 1639 3.0% 11.3% 85.8%
CD008587 35 44 9073 0.4% 0.5% 99.1%
CD011912 18 18 1370 1.3% 1.3% 97.4%
CD009694 9 7 145 5.6% 4.3% 90.1%
CD010296 38 15 4549 0.8% 0.3% 98.8%
CD012165 47 261 9914 0.5% 2.6% 97.0%
CD008759 42 18 872 4.5% 1.9% 93.6%
CD012179 117 187 9528 1.2% 1.9% 96.9%
CD010502 71 158 2756 2.4% 5.3% 92.3%
CD008892 30 39 1430 2.0% 2.6% 95.4%
CD012010 8 282 6540 0.1% 4.1% 95.8%
CD011053 7 5 2223 0.3% 0.2% 99.5%
CD011126 9 4 5987 0.1% 0.1% 99.8%
Table 1: The distribution of class labels in the dataset.

cnrs RF (uni-/bigram): We construct two relevance feedback (active learning) models using logistic regression on a smaller number (≈ 2,000) of features. These models are trained using relevance feedback on the target topic, starting with the topic description as an artificial seed document. The unigram model is a reimplementation of the CAL model by Cormack and Grossman [1, 2]. We also experiment with a model that uses bigrams in addition to unigrams. These models are intended to capture topicality, and to incrementally improve performance through the screening process.

cnrs combined: Our stacked metaclassifier uses a three-layer feedforward dense neural network to estimate the optimal ranking based on the output of the static model and the RF bigram model.

We describe each system in detail in the remainder of this section.

3.2 Static Ranking Model

Here we use a machine learning approach and train a classifier on the training split, largely identical to the implementation of our static model submitted in 2017 [5]. The decision function of the classifier can then be used to calculate probability scores for unseen candidate articles. This is a static model, intended to capture diagnostic test accuracy studies without considering whether the articles are topically relevant.

We use logistic regression trained using stochastic gradient descent (sklearn) on a sparse feature matrix consisting of a large number (> 500,000) of features. We have tried other classifiers, including SVMs, random forests, feed-forward neural networks, convolutional networks and LSTMs, but logistic regression yields consistently better performance in our experiments, at a fraction of the training time.

We handle class imbalance by class reweighting. We have implemented undersampling mechanisms, but these tend to decrease performance. We set the weight for the positive class to 80 for the initial intertopic classifier; we determined this to be a reasonable weight in previous experiments on another dataset [5]. This model was trained on the 2017 training split.
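As an illustration, a minimal sketch of how such a static ranker could be set up with scikit-learn is shown below. The class weight of 80 and the large sparse feature space follow the description above; the hashing-based feature extraction and the function names are our own simplifications, not the exact configuration used for the submitted runs.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Large sparse feature space (the paper reports > 500,000 features).
vectorizer = HashingVectorizer(n_features=2**20, alternate_sign=False)

# Logistic regression trained with stochastic gradient descent; the positive
# (included) class is reweighted to 80 to counter the heavy class imbalance.
static_model = SGDClassifier(loss="log_loss", class_weight={0: 1, 1: 80})


def train_static(texts, labels):
    """Fit the intertopic model once on the training split (title + abstract texts)."""
    static_model.fit(vectorizer.transform(texts), labels)


def score_static(texts):
    """Probability-like scores used to rank unseen candidate articles."""
    return static_model.predict_proba(vectorizer.transform(texts))[:, 1]
```

Because the model is trained once and never sees the target topic, the same fitted `static_model` can be reused to score the candidates of any unseen review.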
3.3 Active Learning

Here we use an active learning approach, where at each timestep we train a classifier (ranker) on the relevant articles screened so far. We start the process using the topic description as an artificial seed document. The model is intended to capture topical relevance, and to use the data collected through the screening process, which is generally more targeted than the data available in the training split.

The model largely follows the continuous active learning approach of Cormack and Grossman [1, 2], except that it uses bigrams in addition to unigrams. We summarize the procedure here for clarity. At each timestep we rank the candidate articles and show the top B articles to the oracle, and the oracle labels these as Y, M, or N. The number of articles B is initially set to 1 and is incremented by ⌊B⌋ at each timestep.

We use the following process to construct positive training data:
– If any Y have been encountered: we use all encountered Y as positive training data. The synthetic seed document and any encountered M are discarded.
– Else, if any M have been encountered but no Y: we use all encountered M as positive training data. The synthetic seed document is discarded.
– Else (no Y or M have been encountered): we use the synthetic seed document as positive training data.

To construct negative training data we sample 100 articles (or as many as remain) from the unseen candidates and temporarily label these N, irrespective of their true labels. Any articles already shown to the oracle are not considered for use as negative data.

We train our model using the above positive and negative data to re-rank the candidate articles, and repeat the process until all articles have been shown to the oracle; a sketch of this loop is given below. This model only uses the candidate articles and the topic description as training data, and thus does not depend on other training data, such as the topics in the training split.
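The loop can be summarized in code roughly as follows. This is a simplified sketch under our own assumptions (a TF-IDF unigram/bigram representation capped at roughly 2,000 features, a doubling batch schedule, and an `oracle` callback standing in for the human screener); it is not the exact implementation behind the submitted runs.

```python
import random

import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def active_learning_run(candidates, topic_description, oracle):
    """candidates: list of article texts; oracle(i) returns 'Y', 'M', or 'N'."""
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=2000)
    X = vectorizer.fit_transform(candidates + [topic_description])
    seed, X = X[-1], X[:-1]

    labels = {}   # article index -> oracle label for screened articles
    ranking = []  # order in which articles were shown to the oracle
    batch = 1
    while len(labels) < len(candidates):
        # Positive data: all Y if any, else all M, else the synthetic seed document.
        pos_Y = [i for i, l in labels.items() if l == "Y"]
        pos_M = [i for i, l in labels.items() if l == "M"]
        pos = X[pos_Y] if pos_Y else (X[pos_M] if pos_M else seed)

        # Negative data: up to 100 unscreened candidates, temporarily labelled N.
        unseen = [i for i in range(len(candidates)) if i not in labels]
        neg = X[random.sample(unseen, min(100, len(unseen)))]

        clf = LogisticRegression(max_iter=1000)
        clf.fit(sp.vstack([pos, neg]), [1] * pos.shape[0] + [0] * neg.shape[0])

        # Re-rank the unscreened candidates and show the top B to the oracle.
        scores = clf.predict_proba(X[unseen])[:, 1]
        for i in [unseen[j] for j in scores.argsort()[::-1][:batch]]:
            labels[i] = oracle(i)
            ranking.append(i)
        batch += batch  # the batch size grows at each timestep
    return ranking
```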
3.4 Stacked Model

We use a three-layer dense neural network as a function approximator to estimate the joint score for a candidate document given the scores from our static and active models. We use 16 nodes in each layer, apply 30% dropout after each layer, and use softmax activation on the final layer to simulate two-class logistic regression. The model is trained by sampling training data uniformly from recorded active learning output. We have tried uncertainty sampling, but this yielded inferior results.

As input to the model we use the scores we get from the static and active learning models, along with meta-level features. The full set of features is as follows:
1. Static model document score (static)
2. Active model document score (RF bigram)
3. Number of Y found
4. Amount of relevance feedback (absolute number)
5. Amount of relevance feedback (percentage)
6. Relevance feedback stage (whether the seed, M, or Y are used as positive training data)

Features 3 and 4 are normalized using the log transform sgn(x) · log2(1 + |x|) / 8 to keep values mainly in the range [0, 1]; we do not truncate large values. Feature 6 takes discrete values in {−1, 0, 1}. However, we observe that features 5 and 6 decrease model performance, and we therefore excluded them from the model used in our officially submitted runs.

This model is trained on data generated from train split 2 (Table 1) to avoid overfitting. We generate the training data for the stacked model by letting the active model run on the training data, and at each step in the process we record the score generated by the active learning model, as well as the above features. We do this 100 times for each topic. One data point thus consists of the score from the static model (feature 1) and features 2–6 from this pre-generated data. We train the stacked model on data sampled randomly from this pool of data points, by sampling 50 runs in each iteration, and sampling an equal number of positive and negative training examples from each run (with a minimum of 20 in total). The model is trained with batches of size 32, and the training data is resampled every training iteration.
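A sketch of this metaclassifier in Keras is given below, assuming the four retained inputs (features 1–4, with the count features passed through the log transform above). The hidden-layer activations and the exact layer arrangement around the softmax output are not stated above, so the ReLU activations and the two hidden layers of 16 units before the two-way softmax are our own reading.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers


def log_transform(x):
    """Normalization applied to the count features (features 3 and 4)."""
    return np.sign(x) * np.log2(1.0 + np.abs(x)) / 8.0


def build_stacked_model(n_features=4):
    """Small dense network scoring a candidate from the static and RF scores."""
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        layers.Dense(16, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(16, activation="relu"),
        layers.Dropout(0.3),
        # Two-way softmax output, simulating two-class logistic regression.
        layers.Dense(2, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model

# Each resampled batch of recorded active-learning output could then be fed with, e.g.:
# build_stacked_model().fit(features, labels, batch_size=32, epochs=1)
```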
4 Results

We present our results for average precision in Table 2, WSS@95 in Table 3, WSS@100 in Table 4, and Last Rel in Table 5, as well as the aggregate scores in Table 6. For comparison, we also calculate a baseline by evaluating each metric on the data ordered randomly. The baseline values are calculated as the average and standard deviation over 1000 repetitions. The RF unigram, RF bigram, and combined models were submitted as our official runs. The results omit one topic with no Y (CD010680).

5 Discussion

5.1 Datasets

One of the topics in the CLEF dataset, CD010653, has no Y. While we can still calculate performance scores relative to M, this topic should arguably have been omitted from the test data. One of the topics, CD008803, similarly has no M; this also happens to be the topic with the second largest number of Y.

As a general tendency, we can observe that the relative number of Y / M / N in the CLEF dataset varies dramatically across topics. At one end we have one topic consisting of 14.06% Y (CD008760) and one topic consisting of 15.79% Y (CD010705). At the other end we have five topics with less than 0.1% Y (CD011548, CD011549, CD012019, CD011515, and CD009263). The number of N also varies wildly, from 52 up to 78,679.

Topic | Y||MN: static, RF unigram, RF bigram, combined, baseline | YM||N: static, RF unigram, RF bigram, combined, baseline
ALL 0.169 0.176 0.124 0.203 0.014 ± 0 0.313 0.314 0.218 0.337 0.053 ± 0
CD008122 0.331 0.274 0.327 0.344 0.042 ± 0.013 0.744 0.706 0.652 0.748 0.146 ± 0.001
CD008587 0.045 0.033 0.043 0.094 0.004 ± 0 0.076 0.063 0.062 0.109 0.009 ± 0.001
CD008759 0.477 0.543 0.283 0.549 0.047 ± 0.001 0.562 0.620 0.326 0.609 0.101 ± 0.010
CD008892 0.278 0.342 0.329 0.511 0.022 ± 0.001 0.323 0.376 0.361 0.462 0.043 ± 0.002
CD009175 0.085 0.095 0.003 0.059 0.002 ± 0.001 0.206 0.156 0.025 0.130 0.013 ± 0.002
CD009263 0.060 0.022 0.000 0.103 0.000 ± 0.000 0.116 0.104 0.003 0.038 0.002 ± 0.001
CD009694 0.435 0.447 0.494 0.843 0.084 ± 0.014 0.734 0.774 0.411 0.694 0.102 ± 0.018
CD010213 0.040 0.061 0.018 0.053 0.002 ± 0.000 0.260 0.250 0.195 0.226 0.042 ± 0.003
CD010296 0.450 0.535 0.074 0.541 0.011 ± 0.002 0.512 0.563 0.082 0.568 0.017 ± 0.005
CD010502 0.209 0.254 0.186 0.334 0.028 ± 0.003 0.339 0.409 0.323 0.467 0.080 ± 0.007
CD010657 0.176 0.206 0.070 0.196 0.028 ± 0.001 0.386 0.406 0.213 0.421 0.079 ± 0.003
CD010864 0.079 0.054 0.013 0.020 0.002 ± 0.001 0.084 0.082 0.113 0.133 0.023 ± 0.000
CD011053 0.065 0.063 0.019 0.048 0.007 ± 0.005 0.105 0.105 0.035 0.080 0.011 ± 0.005
CD011126 0.111 0.107 0.018 0.042 0.003 ± 0.001 0.145 0.141 0.027 0.070 0.003 ± 0.001
CD011420 0.062 0.056 0.263 0.215 0.021 ± 0.000 0.341 0.336 0.644 0.742 0.178 ± 0.000
CD011431 0.216 0.166 0.167 0.231 0.026 ± 0.004 0.649 0.626 0.662 0.669 0.262 ± 0.018
CD011515 0.050 0.028 0.071 0.042 0.001 ± 0.001 0.298 0.369 0.302 0.360 0.017 ± 0.001
CD011602 0.002 0.002 0.002 0.003 0.001 ± 0.000 0.018 0.014 0.021 0.037 0.004 ± 0.002
CD011686 0.015 0.012 0.005 0.047 0.002 ± 0.001 0.289 0.201 0.111 0.162 0.005 ± 0.001
CD011912 0.212 0.195 0.453 0.266 0.013 ± 0.001 0.374 0.365 0.447 0.481 0.031 ± 0.007
CD011926 0.428 0.540 0.028 0.129 0.008 ± 0.000 0.479 0.569 0.037 0.165 0.013 ± 0.002
CD012009 0.051 0.149 0.027 0.041 0.009 ± 0.002 0.387 0.317 0.192 0.455 0.085 ± 0.010
CD012010 0.090 0.125 0.102 0.106 0.002 ± 0.001 0.253 0.295 0.272 0.354 0.050 ± 0.001
CD012083 0.612 0.436 0.335 0.602 0.022 ± 0.003 0.373 0.313 0.243 0.378 0.040 ± 0.004
CD012165 0.072 0.075 0.013 0.073 0.005 ± 0.001 0.347 0.348 0.046 0.291 0.031 ± 0.002
CD012179 0.183 0.193 0.075 0.201 0.015 ± 0.002 0.374 0.343 0.123 0.356 0.033 ± 0.002
CD012216 0.016 0.016 0.013 0.012 0.014 ± 0.008 0.268 0.246 0.222 0.285 0.089 ± 0.023
CD012281 0.012 0.024 0.091 0.155 0.001 ± 0.000 0.026 0.027 0.080 0.210 0.003 ± 0.001
CD012599 0.054 0.059 0.080 0.042 0.002 ± 0.000 0.266 0.266 0.260 0.253 0.074 ± 0.004
Table 2: Average precision score for each topic, evaluated using either inclusion decisions based on full text (Y||MN), or based on abstract and title (YM||N). The combined model uses the static and RF bigram as subcomponents.

Topic | Y||MN: static, RF unigram, RF bigram, combined, baseline | YM||N: static, RF unigram, RF bigram, combined, baseline
ALL 0.741 0.815 0.668 0.824 0.104 ± 0.024 0.513 0.617 0.519 0.657 0.028 ± 0.009
CD008122 0.800 0.794 0.772 0.788 0.018 ± 0.033 0.403 0.455 0.415 0.453 0.005 ± 0.013
CD008587 0.839 0.838 0.836 0.896 0.034 ± 0.047 0.772 0.746 0.696 0.759 0.012 ± 0.026
CD008759 0.746 0.764 0.612 0.736 0.019 ± 0.037 0.685 0.703 0.612 0.668 0.015 ± 0.030
CD008892 0.891 0.884 0.788 0.883 0.048 ± 0.052 0.040 0.534 0.694 0.486 0.006 ± 0.027
CD009175 0.936 0.916 0.546 0.915 0.073 ± 0.111 0.027 0.532 0.285 0.532 0.011 ± 0.029
CD009263 0.465 0.920 0.117 0.861 0.041 ± 0.084 0.418 0.408 0.122 0.557 0.006 ± 0.020
CD009694 0.826 0.832 0.678 0.813 0.045 ± 0.091 0.521 0.795 0.320 0.683 0.061 ± 0.073
CD010213 0.278 0.834 0.647 0.825 0.038 ± 0.049 0.065 0.590 0.556 0.341 0.002 ± 0.009
CD010296 0.928 0.924 0.723 0.924 0.028 ± 0.042 0.906 0.909 0.588 0.918 0.022 ± 0.034
CD010502 0.346 0.617 0.757 0.646 0.019 ± 0.030 0.298 0.587 0.405 0.609 0.002 ± 0.014
CD010657 0.739 0.741 0.345 0.757 0.034 ± 0.044 0.473 0.453 0.404 0.503 0.006 ± 0.018
CD010864 0.914 0.885 0.837 0.854 0.197 ± 0.193 0.215 0.506 0.571 0.619 0.017 ± 0.036
CD011053 0.909 0.913 0.537 0.903 0.076 ± 0.112 0.913 0.913 0.766 0.906 0.105 ± 0.095
CD011126 0.921 0.929 0.819 0.910 0.048 ± 0.091 0.933 0.935 0.860 0.917 0.096 ± 0.094
CD011420 0.719 0.715 0.831 0.823 0.114 ± 0.144 0.572 0.575 0.585 0.627 0.015 ± 0.034
CD011431 0.763 0.733 0.696 0.703 0.025 ± 0.048 0.017 0.162 0.275 0.173 0.003 ± 0.011
CD011515 0.947 0.945 0.948 0.947 0.459 ± 0.290 0.398 0.178 0.679 0.721 0.005 ± 0.020
CD011602 0.879 0.864 0.877 0.890 0.448 ± 0.283 0.750 0.786 0.806 0.870 0.059 ± 0.098
CD011686 0.937 0.910 0.844 0.875 0.284 ± 0.231 0.584 0.285 0.457 0.811 0.022 ± 0.034
CD011912 0.871 0.874 0.854 0.883 0.053 ± 0.067 0.843 0.850 0.654 0.841 0.032 ± 0.044
CD011926 0.933 0.933 0.483 0.916 0.017 ± 0.047 0.928 0.926 0.483 0.909 0.024 ± 0.041
CD012009 0.713 0.734 0.362 0.476 0.150 ± 0.158 0.584 0.592 0.362 0.476 0.026 ± 0.042
CD012010 0.020 0.671 0.579 0.744 0.064 ± 0.105 0.004 0.534 0.261 0.581 0.001 ± 0.013
CD012083 0.925 0.900 0.835 0.897 0.122 ± 0.144 0.180 0.512 0.727 0.605 0.117 ± 0.102
CD012165 0.818 0.824 0.308 0.828 0.013 ± 0.035 0.779 0.769 0.234 0.774 0.002 ± 0.012
CD012179 0.804 0.790 0.403 0.819 0.010 ± 0.022 0.750 0.723 0.363 0.769 0.002 ± 0.012
CD012216 0.669 0.655 0.597 0.577 0.444 ± 0.289 0.669 0.655 0.583 0.581 0.112 ± 0.103
CD012281 0.880 0.886 0.931 0.923 0.054 ± 0.095 0.716 0.745 0.622 0.762 0.031 ± 0.054
CD012599 0.080 0.413 0.807 0.877 0.053 ± 0.067 0.154 0.384 0.422 0.476 0.001 ± 0.009
Table 3: WSS@95 score for all topics in the CLEF dataset, evaluated using either inclusion decisions based on full text (Y||MN), or based on abstract and title (YM||N). The combined model uses the static and RF bigram as subcomponents.
Topic | Y||MN: static, RF unigram, RF bigram, combined, baseline | YM||N: static, RF unigram, RF bigram, combined, baseline
ALL 0.640 0.762 0.633 0.779 0.130 ± 0.024 0.349 0.460 0.339 0.510 0.027 ± 0.007
CD008122 0.459 0.496 0.378 0.481 0.016 ± 0.015 0.289 0.320 0.040 0.332 0.003 ± 0.003
CD008587 0.782 0.848 0.769 0.845 0.029 ± 0.028 0.419 0.475 0.393 0.412 0.012 ± 0.012
CD008759 0.031 0.276 0.325 0.368 0.021 ± 0.020 0.031 0.276 0.325 0.368 0.016 ± 0.016
CD008892 0.828 0.887 0.576 0.875 0.031 ± 0.031 0.072 0.358 0.576 0.390 0.014 ± 0.014
CD009175 0.986 0.966 0.596 0.965 0.123 ± 0.111 0.010 0.381 0.264 0.269 0.015 ± 0.015
CD009263 0.515 0.970 0.167 0.911 0.091 ± 0.084 0.018 0.061 0.047 0.218 0.008 ± 0.008
CD009694 0.876 0.882 0.728 0.863 0.095 ± 0.091 0.565 0.720 0.228 0.708 0.051 ± 0.051
CD010213 0.019 0.520 0.582 0.727 0.029 ± 0.029 0.001 0.043 0.274 0.061 0.001 ± 0.002
CD010296 0.918 0.914 0.638 0.917 0.026 ± 0.026 0.918 0.914 0.418 0.917 0.019 ± 0.018
CD010502 0.335 0.629 0.626 0.684 0.014 ± 0.014 0.324 0.581 0.163 0.585 0.004 ± 0.004
CD010657 0.550 0.526 0.331 0.553 0.028 ± 0.028 0.057 0.058 0.103 0.047 0.007 ± 0.007
CD010864 0.964 0.935 0.887 0.904 0.247 ± 0.193 0.254 0.423 0.383 0.351 0.021 ± 0.022
CD011053 0.959 0.963 0.587 0.953 0.126 ± 0.112 0.959 0.957 0.587 0.953 0.078 ± 0.070
CD011126 0.971 0.979 0.869 0.960 0.098 ± 0.091 0.971 0.979 0.869 0.960 0.073 ± 0.070
CD011420 0.769 0.765 0.881 0.873 0.164 ± 0.144 0.343 0.530 0.575 0.534 0.020 ± 0.020
CD011431 0.707 0.665 0.724 0.695 0.036 ± 0.034 0.019 0.029 0.033 0.064 0.003 ± 0.003
CD011515 0.997 0.995 0.998 0.997 0.509 ± 0.290 0.171 0.012 0.386 0.575 0.007 ± 0.008
CD011602 0.929 0.914 0.927 0.940 0.498 ± 0.283 0.800 0.836 0.856 0.920 0.109 ± 0.098
CD011686 0.987 0.960 0.894 0.925 0.334 ± 0.231 0.069 0.051 0.198 0.798 0.018 ± 0.017
CD011912 0.886 0.902 0.704 0.897 0.051 ± 0.048 0.877 0.866 0.460 0.877 0.027 ± 0.026
CD011926 0.302 0.871 0.383 0.867 0.033 ± 0.034 0.302 0.871 0.383 0.867 0.025 ± 0.024
CD012009 0.763 0.784 0.412 0.526 0.200 ± 0.158 0.437 0.457 0.270 0.285 0.024 ± 0.024
CD012010 0.070 0.721 0.629 0.794 0.114 ± 0.105 0.027 0.180 0.067 0.226 0.003 ± 0.003
CD012083 0.975 0.950 0.885 0.947 0.172 ± 0.144 0.168 0.540 0.294 0.618 0.084 ± 0.075
CD012165 0.072 0.362 0.179 0.442 0.020 ± 0.019 0.039 0.347 0.087 0.367 0.003 ± 0.003
CD012179 0.141 0.482 0.205 0.464 0.008 ± 0.008 0.141 0.367 0.200 0.401 0.003 ± 0.003
CD012216 0.719 0.705 0.647 0.627 0.494 ± 0.289 0.576 0.599 0.303 0.627 0.078 ± 0.075
CD012281 0.930 0.936 0.981 0.973 0.104 ± 0.095 0.724 0.726 0.453 0.730 0.040 ± 0.038
CD012599 0.129 0.301 0.857 0.619 0.051 ± 0.048 0.092 0.089 0.109 0.067 0.001 ± 0.002
Table 4: WSS@100 score for all topics in the CLEF dataset, evaluated using either inclusion decisions based on full text (Y||MN), or based on abstract and title (YM||N). The combined model uses the static and RF bigram as subcomponents.
Topic | Y||MN: static, RF unigram, RF bigram, combined, baseline | YM||N: static, RF unigram, RF bigram, combined, baseline
ALL 3349.448 1305.034 3798.000 1224.655 6405.696 ± 272.238 5708.400 5173.467 5500.600 4378.900 7131.769 ± 36.629
CD008122 1034 964 1189 991 1880.775 ± 29.665 1358 1300 1835 1276 1905.126 ± 6.638
CD008587 1998 1390 2113 1418 8890.107 ± 252.363 5317 4803 5559 5378 9042.512 ± 105.947
CD008759 903 675 630 589 912.361 ± 18.900 903 675 630 589 917.406 ± 15.133
CD008892 258 170 636 187 1452.336 ± 46.459 1391 962 636 914 1478.265 ± 21.061
CD009175 80 190 2282 195 4947.315 ± 626.643 5586 3492 4156 4125 5558.439 ± 85.764
CD009263 38214 2340 65642 6984 71659.995 ± 6650.289 77389 73961 75061 61604 78178.984 ± 632.362
CD009694 20 19 44 22 145.670 ± 14.589 70 45 125 47 152.735 ± 8.235
CD010213 14915 7297 6348 4144 14753.984 ± 445.766 15185 14543 11039 14269 15174.940 ± 22.935
CD010296 379 394 1665 382 4481.412 ± 121.595 379 394 2677 382 4516.729 ± 84.967
CD010502 1986 1108 1116 944 2942.072 ± 42.574 2018 1252 2500 1238 2973.515 ± 12.346
CD010657 836 882 1244 831 1806.271 ± 52.839 1753 1752 1668 1772 1846.911 ± 12.177
CD010864 90 164 283 240 1886.456 ± 482.878 1869 1445 1546 1625 2451.308 ± 54.825
CD011053 92 83 923 106 1952.562 ± 250.296 92 97 923 106 2060.096 ± 157.085
CD011126 174 128 784 238 5414.703 ± 543.169 174 128 784 238 5564.254 ± 418.618
CD011420 58 59 30 32 209.813 ± 36.150 165 118 107 117 246.108 ± 5.095
CD011431 346 396 326 361 1139.302 ± 40.689 1160 1148 1144 1106 1178.709 ± 3.773
CD011515 20 36 14 24 3553.976 ± 2097.615 6003 7160 4452 3079 7190.031 ± 54.788
CD011602 435 529 448 370 3088.216 ± 1740.224 1229 1011 886 495 5485.916 ± 605.398
CD011686 123 382 997 710 6291.365 ± 2182.568 8787 8965 7573 1903 9270.875 ± 161.630
CD011912 160 138 417 145 1334.026 ± 67.988 173 188 760 173 1368.069 ± 35.908
CD011926 2827 524 2501 537 3915.805 ± 135.943 2827 524 2501 537 3948.887 ± 97.154
CD012009 127 116 316 254 428.898 ± 84.684 302 291 392 383 523.100 ± 12.992
CD012010 6352 1907 2537 1405 6049.525 ± 719.498 6645 5601 6374 5284 6807.614 ± 22.027
CD012083 8 16 37 17 266.458 ± 46.415 268 148 228 123 294.965 ± 24.196
CD012165 9488 6521 8394 5706 10013.510 ± 193.570 9824 6673 9337 6468 10189.351 ± 31.388
CD012179 8446 5097 7813 5269 9750.778 ± 80.874 8446 6225 7863 5893 9800.761 ± 31.725
CD012216 61 64 77 81 109.840 ± 62.640 92 87 152 81 200.049 ± 16.240
CD012281 695 631 183 263 8851.610 ± 939.890 2728 2706 5400 2669 9479.651 ± 374.432
CD012599 7009 5626 1153 3070 7636.046 ± 387.458 7308 7328 7171 7512 8035.456 ± 13.165
Table 5: Last rel score for all topics in the CLEF dataset, evaluated using either inclusion decisions based on full text (Y||MN), or based on abstract and title (YM||N). The combined model uses the static and RF bigram as subcomponents.

Metric | Y||MN: static, RF unigram, RF bigram, combined, baseline | YM||N: static, RF unigram, RF bigram, combined, baseline
AP 0.169 0.176 0.124 0.203 0.014 ± 0.000 0.313 0.314 0.218 0.337 0.053 ± 0.000
WSS@95 0.741 0.815 0.668 0.824 0.104 ± 0.024 0.513 0.617 0.519 0.657 0.028 ± 0.009
WSS@100 0.640 0.762 0.633 0.779 0.130 ± 0.024 0.349 0.460 0.339 0.510 0.027 ± 0.007
Last Rel 3349.448 1305.034 3798.000 1224.655 6405.696 ± 272.238 5708.400 5173.467 5500.600 4378.900 7131.769 ± 36.629
Table 6: Aggregate scores, evaluated using either inclusion decisions based on full text (Y||MN), or based on abstract and title (YM||N). The combined model uses the static and RF bigram as subcomponents.
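For reference, the metrics reported in Tables 2–6 can be computed from a ranking as in the sketch below, where `ranking` is the order in which articles would be shown and `relevant` is the set of relevant articles (Y, or Y and M, depending on the evaluation). These are the standard definitions written out by us, not the official evaluation script.

```python
import math


def average_precision(ranking, relevant):
    """Mean of the precision values at the ranks of the relevant articles."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def wss(ranking, relevant, recall=0.95):
    """Work saved over sampling at the given recall level (WSS@95 by default)."""
    needed = math.ceil(recall * len(relevant))
    found = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            found += 1
            if found >= needed:
                return (len(ranking) - rank) / len(ranking) - (1.0 - recall)
    return 0.0


def last_rel(ranking, relevant):
    """Rank at which the last relevant article is retrieved."""
    return max(rank for rank, doc in enumerate(ranking, start=1) if doc in relevant)
```

With these definitions, wss(ranking, relevant, recall=1.0) equals (N − last_rel) / N on a single topic, which is the per-topic equivalence between WSS@100 and last rel discussed below.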
5.2 Performance

No single model performs best on all topics. Generally, however, the RF unigram model consistently outperforms the static model, and the combined model (static + RF bigram) outperforms the other three models.

Surprisingly, the RF unigram model consistently outperforms the RF bigram model, despite using a subset of the features of the RF bigram model. For this reason it seems likely that a stacked model consisting of the static model and the RF unigram model would have achieved better performance than the stacked model submitted as our official run.

The RF unigram model is particularly adept at finding all relevant articles, resulting in a better last rel score than the static model for 19 topics out of 29, and a better last rel score than the RF bigram model for 24 out of 29. This also results in a WSS@100 score of 76.2% for the RF unigram model, versus 64.0% for the static model and 63.3% for the RF bigram model. Note, however, that last rel produces scores of wildly varying scale, and the large averaged last rel scores for the static and RF bigram models are therefore almost entirely due to a few large outliers. In particular, 59% of the average last rel score for the RF bigram model is due to a single topic with a large number of candidate articles (CD009263). The metric may thus be useful when interpreted on individual topics, but not when averaged. The WSS@100 metric, which is equivalent to last rel on individual topics, produces scores on the same scale and therefore also makes sense when averaged.

6 Conclusions

Our best system combines a static model and a relevance feedback model using stacking. The workload reduction to retrieve 95% of relevant articles is estimated at 82.4% on average, with a minimum workload reduction of 47.6% and a maximum workload reduction of 94.7%. The workload reduction is consistent across topics, and we note a workload reduction of less than 70% in only two topics. Due to the highly variable number of candidate articles in different topics, however, we may still need to screen several thousand articles to find all relevant articles in any given systematic review.

Our remarks on the implementation of the shared task model and task organization from last year [5] remain valid for this edition of the TAR task.

Acknowledgments

This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 676207.

Bibliography

[1] Cormack, G.V., Grossman, M.R.: Autonomy and reliability of continuous active learning for technology-assisted review. arXiv preprint arXiv:1504.06868 (2015)
[2] Cormack, G.V., Grossman, M.R.: Technology-assisted review in empirical medicine: Waterloo participation in CLEF eHealth 2017. Working Notes of CLEF pp. 11–14 (2017)
[3] Kanoulas, E., Li, D., Azzopardi, L., Spijker, R.: Overview of the CLEF technologically assisted reviews in empirical medicine. In: Working Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum, Dublin, Ireland, September 11-14, 2017. CEUR Workshop Proceedings, CEUR-WS.org (2017)
[4] Kanoulas, E., Li, D., Azzopardi, L., Spijker, R.: Overview of the CLEF technologically assisted reviews in empirical medicine 2018. In: Working Notes of Conference and Labs of the Evaluation (CLEF) Forum. CEUR Workshop Proceedings (2018)
[5] Norman, C., Leeflang, M., Névéol, A.: LIMSI@CLEF eHealth 2017 task 2: Logistic regression for automatic article ranking (2017)
[6] O'Mara-Eves, A., Thomas, J., McNaught, J., Miwa, M., Ananiadou, S.: Using text mining for study identification in systematic reviews: a systematic review of current approaches. Systematic Reviews 4(1), 5 (2015)
[7] Petersen, H., Poon, J., Poon, S.K., Loy, C.: Increased workload for systematic review literature searches of diagnostic tests compared with treatments: Challenges and opportunities. JMIR Medical Informatics 2(1), e11 (2014)
[8] Suominen, H., Kelly, L., Goeuriot, L., Kanoulas, E., Azzopardi, L., Spijker, R., Li, D., Névéol, A., Ramadier, L., Robert, A., Palotti, J., Jimmy, Zuccon, G.: Overview of the CLEF eHealth evaluation lab 2018. In: CLEF 2018 - 8th Conference and Labs of the Evaluation Forum. Lecture Notes in Computer Science, Springer (2018)