<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LIMSI@CLEF eHealth 2017 Task 2: Logistic Regression for Automatic Article Ranking</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Christopher Norman</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mariska Leeflang</string-name>
          <email>m.m.leeflang@uva.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aurélie Névéol</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Academic Medical Center, University of Amsterdam</institution>
          ,
          <addr-line>Amsterdam</addr-line>
          ,
          <country country="NL">the Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>LIMSI, CNRS, Université Paris-Saclay</institution>
          ,
          <addr-line>F-91405 Orsay</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes the participation of the LIMSI-MIROR team at CLEF eHealth 2017, task 2. The task addresses the automatic ranking of articles in order to assist with the screening process of Diagnostic Test Accuracy (DTA) Systematic Reviews. We used a logistic regression classifier and handled class imbalance using a combination of class reweighting and undersampling. We also experimented with two strategies for relevance feedback. Our best run obtained an overall Average Precision of 0.179 and Work Saved over Sampling @95% Recall of 0.650. This run uses stochastic gradient descent for training but no feature selection or relevance feedback. We observe high performance variation across the queries in the test set. Nonetheless, our results suggest that automatic assistance is promising for ranking the DTA literature as it could reduce the screening workload for review writers by 65% on average.</p>
      </abstract>
      <kwd-group>
        <kwd>Evidence Based Medicine</kwd>
        <kwd>Information Storage and Retrieval</kwd>
        <kwd>Review Literature as Topic</kwd>
        <kwd>Supervised Machine Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Systematic reviews seek to gather all available published evidence for a given
topic and provide an informed analysis of the results. This work constitutes
some of the strongest forms of scientific evidence. Systematic reviews are an
integral part of evidence-based medicine in particular, and serve a key role in
informing and guiding public and institutional decision-making. Systematic
reviews of Diagnostic Test Accuracy (DTA) studies have been shown to be particularly
challenging compared to other types of reviews because of the difficulty of
defining search strategies offering adequate levels of sensitivity and specificity [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. For
this reason, there is a particular need to investigate automation strategies to
assist DTA systematic review writers in the time-consuming screening process.
      </p>
      <p>
        Methods for automating the screening process in systematic reviews have
been actively researched over the years [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], with promising results obtained using
a range of machine learning methods. However, previous work has not addressed
DTA studies.
      </p>
      <p>
        This paper describes the work underlying our participation in the CLEF
2017 eHealth Task 2 [
        <xref ref-type="bibr" rid="ref10 ref4">10, 4</xref>
        ]. This work is part of an ongoing effort to provide
automatic assistance for the screening process in systematic reviews addressing
a variety of topics, including DTA studies.
      </p>
      <p>The remainder of this paper is organized as follows. Section 2 presents the
datasets used for system development. Section 3 provides an overview of our
system and describes each component. Finally, section 4 reports our results and
section 5 provides an analysis of our methods and participation in the task.</p>
    </sec>
    <sec id="sec-2">
      <title>Datasets</title>
      <p>
        The task relied on a corpus comprising 50 DTA systematic review topics
associated with the full list of articles retrieved by an expert query and assessed for
inclusion based on title and abstract or full text. The corpus was split into a
development dataset comprising 20 topics and a test set comprising the remaining
30 topics. Our classifier was trained on the development dataset and evaluated
on the test dataset. We have also used a dataset of systematic reviews on drug
class efficacy due to Cohen et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] to develop the methods applied in this task.
Several groups have used this dataset in the past [
        <xref ref-type="bibr" rid="ref1 ref5">1, 5</xref>
        ], which gives us a
way to compare our results with previous work, although we can of course only
do so by using the same evaluation metrics and training modes as previous work.
      </p>
      <p>For both the CLEF and Cohen datasets we know the inclusion decisions
based on the abstracts, as well as the inclusion decisions based on the full text.
We thus have two definitions of positive examples, depending on whether we use
the abstract decisions or full text decisions as the gold standard.</p>
      <p>We use a tripartite labeling to reflect this:
- No (N) is the set of articles that were excluded based on the abstract
- Maybe (M) is the set of articles that were preliminarily included based on
the abstract, but later excluded based on the full text
- Yes (Y) is the set of articles that were included based on both the abstract
and the full text, and later used in the meta-analysis</p>
      <p>We also distinguish two training modes:
- Intertopic training uses articles from a different topic (systematic review)
for training
- Intratopic training uses articles from the current topic (systematic review)
for training</p>
    </sec>
    <sec id="sec-3">
      <title>Method</title>
      <p>We first give an overview of our system, which relies on logistic regression, in
section 3.1. Further details about the system are given in sections 3.2-3.5,
including features and strategies to handle class imbalance and implement relevance
feedback.</p>
      <sec id="sec-3-1">
        <title>Overview</title>
        <p>We have tried the following two classifiers:
- Classifier 1 uses logistic regression trained using stochastic gradient descent
on all features
- Classifier 2 uses standard logistic regression trained using standard
methods on a subset of the features, and with additional preprocessing to improve
the throughput</p>
        <p>We have tried three approaches to relevance feedback:
- no relevance feedback
- abrupt uses intertopic ranking until a sufficient number of relevant and
non-relevant articles have been identified, and then switches to using intratopic
ranking based on the identified articles
- gradual initially uses intertopic ranking, and gradually improves the model
using both Y and M identified through relevance feedback</p>
        <p>In total, we have submitted the following four runs to the CLEF evaluation:
- no AF full uses classifier 1 with no relevance feedback
- no AF uses classifier 2 with no relevance feedback
- abrupt uses classifier 2 with abrupt relevance feedback
- gradual uses classifier 2 with gradual relevance feedback</p>
      </sec>
      <sec id="sec-3-2">
        <title>Classification approach</title>
        <p>
          We are currently using two classification systems. Both use logistic regression
but differ in how the model is optimized and in the amount and type of pre- and
postprocessing performed. Both methods use implementations provided
by sklearn [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
        <p>Our first method, which is used in no AF full, tends to work well for
intertopic classification on previous datasets (see table 3), presumably because it
generalizes better. This system uses logistic regression trained using stochastic
gradient descent. The only preprocessing done is the normalization of numerals.</p>
        <p>Our second method, which is used in no AF, abrupt, and gradual, uses
standard methods for training (liblinear). This version tends to work well on
intratopic classification on previous datasets (see table 3), but does not scale as
well with data volume. We therefore need additional preprocessing to
reduce the number of features and keep running times down: we remove
features with variance below a predefined threshold, we only consider n-grams
with high mutual information with the target class in the training set, we
normalize numerals, and we extract the principal components from the resulting
data.</p>
        <p>Principal component analysis tends to reduce overfitting in our experiments,
and it also drastically reduces the time it takes to train and apply the classifier,
which matters most when we use relevance feedback.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Features</title>
        <p>For all classifiers we extract n-grams (n ≤ 5) from the titles and abstracts. We
also extract publication type, journal names, author-assigned keywords, MeSH
terms, and backward references, where these are available. Backward
references are only available when they point to articles in PubMed
Central, and this feature set is therefore fairly sparse.</p>
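        <p>A minimal sketch of the n-gram extraction, assuming whitespace tokenization, lowercasing, and n-grams of length 1 to 5. How numerals are normalized is not specified in this paper; mapping every number to a single placeholder token is an assumption, and the function names are hypothetical.</p>

```python
import re

def normalize_numerals(text, placeholder="NUM"):
    """Replace every integer or decimal number with one placeholder token (assumed scheme)."""
    return re.sub(r"\d+(?:\.\d+)?", placeholder, text)

def extract_ngrams(text, max_n=5):
    """Return all word n-grams of length 1..max_n, in order of increasing n."""
    tokens = normalize_numerals(text.lower()).split()
    ngrams = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

title = "Accuracy of 2 rapid tests in 120 patients"
features = extract_ngrams(title)  # 8 tokens give 8 + 7 + 6 + 5 + 4 = 30 n-grams
```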
        <p>Not all feature sets are useful for identifying DTA studies, but the current
model has been constructed such that irrelevant features should not adversely
affect the performance. All the feature sets have been shown to be useful in
some domain. For instance, MeSH terms might not be useful for DTA studies,
but we have previously found them to be useful in identifying topics related to
drug efficacy.</p>
      </sec>
      <sec id="sec-3-4">
        <title>Class imbalance</title>
        <p>Class imbalance can be handled using undersampling or class reweighting.
We are currently using a combination of both approaches.</p>
        <p>Class weights. We set the weight for the positive class to 80 for the initial
intertopic classifier. We determined this to be a reasonable weight
experimentally using the Cohen dataset.</p>
        <p>For the gradual relevance feedback we also attached higher weights to the
intratopic training examples identified through relevance feedback.</p>
        <p>Undersampling. In order to reduce the effects of the class imbalance we
undersample the training set to include an equal number of Y, M, and N. However,
by doing so we end up with only around 1500 training samples. PCA yields at
most as many principal components as we have input samples, and
1500 is generally too few principal components to build an accurate classifier.
For the second model we therefore perform undersampling in two steps. We first
select a maximum of 500 Y, 1000 M, and 1500 N that we feed into the feature
extraction pipeline, which thus determines the number of features in our model.
We then select a smaller undersample to use for training.</p>
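        <p>The two-step undersampling can be sketched as below. The caps of 500 Y, 1000 M and 1500 N come from the text; the pool sizes are synthetic, the helper name is hypothetical, and making the second, smaller sample class-balanced is an assumption based on the stated goal of equal numbers of Y, M and N.</p>

```python
import random

def undersample(examples, caps, rng):
    """Keep at most caps[label] randomly chosen examples per label."""
    by_label = {}
    for example_id, label in examples:
        by_label.setdefault(label, []).append(example_id)
    sample = []
    for label, ids in sorted(by_label.items()):
        k = min(len(ids), caps.get(label, len(ids)))
        sample.extend((i, label) for i in rng.sample(ids, k))
    return sample

rng = random.Random(0)
# Synthetic, heavily imbalanced pool (hypothetical sizes).
pool = [(f"doc{i}", "N") for i in range(5000)]
pool += [(f"doc{i}", "M") for i in range(5000, 5300)]
pool += [(f"doc{i}", "Y") for i in range(5300, 5340)]

# Step 1: larger sample that feeds the feature extraction pipeline.
feature_sample = undersample(pool, {"Y": 500, "M": 1000, "N": 1500}, rng)

# Step 2: smaller, balanced sample used to fit the classifier (assumed balanced).
n_min = min(sum(1 for _, lab in feature_sample if lab == c) for c in "YMN")
training_sample = undersample(feature_sample, {c: n_min for c in "YMN"}, rng)
```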
        <p>We take a new undersample in each iteration of relevance feedback.</p>
      </sec>
      <sec id="sec-3-5">
        <title>Relevance Feedback</title>
        <p>We use two schemes for relevance feedback. For both schemes we retrain the
classifier each time we retrieve relevance feedback.</p>
        <p>abrupt trains an initial intertopic classifier on the training dataset and ranks
the test dataset in descending order of confidence. The system then iteratively
asks for feedback for the top-ranked results. When enough positive and negative
examples have been identified, the system switches to using a classifier trained on
the examples identified from relevance feedback. Additional examples are added
to the intratopic classifier as they are discovered.</p>
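        <p>The control flow of the abrupt scheme might look like the sketch below. It is a simplification under assumptions: feedback is requested one article at a time, the scorers and the trainer are toy stand-ins passed in as functions, and the default thresholds of 4 positives and 10 negatives are the settings reported in this paper. All names are hypothetical.</p>

```python
def abrupt_feedback(candidates, intertopic_score, train_intratopic, oracle,
                    min_pos=4, min_neg=10):
    """Screen candidates one at a time, switching rankers once enough feedback exists."""
    remaining = list(candidates)
    labeled = {}
    order = []
    score = intertopic_score
    while remaining:
        best = max(remaining, key=score)   # top-ranked unscreened article
        remaining.remove(best)
        labeled[best] = oracle(best)       # relevance feedback from the screener
        order.append(best)
        n_pos = sum(labeled.values())
        n_neg = len(labeled) - n_pos
        if n_pos >= min_pos and n_neg >= min_neg:
            # Retrain the intratopic model on everything screened so far.
            score = train_intratopic(labeled)
    return order

# Toy demonstration: integer "articles", five of which are relevant.
relevant = {3, 7, 11, 19, 23}
calls = []

def toy_trainer(labeled):
    calls.append(len(labeled))
    positives = [d for d, y in labeled.items() if y]
    # Toy scorer: prefer candidates numerically close to a known positive.
    return lambda d: -min(abs(d - p) for p in positives)

order = abrupt_feedback(range(30), lambda d: -d, toy_trainer, lambda d: d in relevant)
```

        <p>In this toy run the switch happens after the fourth relevant article is found, at which point more than ten non-relevant articles have already been screened.</p>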
        <p>The idea behind this system is that on some topics in Cohen we can train
high-performing intratopic classifiers using very small amounts of data, and
we have observed that even when trained on small amounts of data these sometimes
outperform intertopic classifiers by a large margin. In these cases it might make
sense to switch to intratopic classification as soon as we can.</p>
        <p>We set the minimum number of positive examples to 4, and the minimum
number of negative examples to 10.</p>
        <p>gradual trains an initial intertopic classifier using the training set and ranks
the test set in descending order of confidence. The system then iteratively asks
for feedback for the top-ranked result. Articles queried for relevance feedback are
then added to the model as they are queried, but with higher weights than the
intertopic examples. The model thus starts out as an intertopic classifier, but
gradually turns into an intratopic classifier as more targeted data is added to
the model. Since the intratopic examples identified through relevance feedback
are given higher weights, these will eventually drown out the original classifier,
provided enough examples exist to be discovered.</p>
        <p>Besides using Y and N, we also use intratopic M as positive examples, with
lower weights than intratopic Y, but higher than intertopic Y. The reasoning
behind this is that we often encounter M earlier than Y, and in greater numbers,
in particular on topics with very few Y. We have observed on other datasets
that we can sometimes improve performance by using both Y and M as positive
examples, when the number of Y is very low.</p>
        <p>After the number of Y found is larger than 40, we stop using M as positive
examples.</p>
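        <p>The weighting rules of the gradual scheme can be summarized in a short sketch. Only the ordering of the weights (intratopic Y above intratopic M above intertopic Y), the intertopic positive weight of 80 and the cutoff of 40 Y are taken from the text. The other numeric weights are hypothetical, and dropping M entirely after the cutoff, rather than relabeling it, is an assumption.</p>

```python
INTERTOPIC_POS_WEIGHT = 80.0    # stated for the initial intertopic classifier
INTRATOPIC_M_WEIGHT = 160.0     # hypothetical value
INTRATOPIC_Y_WEIGHT = 320.0     # hypothetical value
INTRATOPIC_N_WEIGHT = 2.0       # hypothetical value
MAX_Y_BEFORE_DROPPING_M = 40    # cutoff stated in the text

def feedback_example(label, n_y_found):
    """Map a feedback label (Y, M or N) to (target, weight), or None to skip it."""
    if label == "Y":
        return 1, INTRATOPIC_Y_WEIGHT
    if label == "M":
        if n_y_found > MAX_Y_BEFORE_DROPPING_M:
            return None  # assumption: M is ignored rather than relabeled
        return 1, INTRATOPIC_M_WEIGHT
    return 0, INTRATOPIC_N_WEIGHT

early_m = feedback_example("M", n_y_found=3)
late_m = feedback_example("M", n_y_found=41)
```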
        <p>Reasonable parameter settings were identified experimentally on the Cohen
dataset.</p>
      </sec>
      <sec id="sec-3-6">
        <title>Use of the CLEF development dataset</title>
        <p>We do not split the training data into separate training and validation splits,
since we do not have the necessary number of Y to do this without hurting
the performance of the classifier. We do however use a small set of samples that
overlaps with the training set for validation. The performance we observe on this
validation set suffers from severe overfitting, but we can observe when the model
fails to build a classifier on the current undersample. In such cases we can observe
an AUROC &lt; 0.5 even on the training set. In these cases we simply discard the
classifier and try again with a new undersample. We observe that this improves
performance dramatically when we have a very small amount of training data
(approximately four or fewer positive examples).</p>
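        <p>The discard-and-retry check can be sketched as follows, assuming the model is a scoring function and AUROC is computed by direct pairwise comparison. The retry limit and the fallback of returning the last model when every attempt fails are assumptions, and the function names are hypothetical.</p>

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def fit_with_retries(draw_sample, fit, validation, max_tries=10):
    """Refit on fresh undersamples until the validation AUROC reaches 0.5."""
    x_val, y_val = validation
    for _ in range(max_tries):
        model = fit(draw_sample())
        if auroc([model(x) for x in x_val], y_val) >= 0.5:
            return model
    return model  # assumption: fall back to the last attempt

# Toy demonstration: the first fit is deliberately inverted and gets discarded.
attempts = []

def toy_fit(sample):
    attempts.append(sample)
    sign = -1.0 if len(attempts) == 1 else 1.0
    return lambda x: sign * x

model = fit_with_retries(lambda: "undersample", toy_fit,
                         ([0.1, 0.4, 0.6, 0.9], [0, 0, 1, 1]))
```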
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>
        We present a comparison with previous work on the Cohen dataset for WSS@95
in table 2 and for AUC in table 3. Results from previous literature are taken
from Khabsa et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and Cohen et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Exact intertopic AUC scores are
not explicitly reported by Cohen et al. and have instead been extracted from
Figure 1 in their paper. The majority of these results, with the exception of one
result by Cohen et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], use intratopic classification.
      </p>
      <p>We present our results on the CLEF dataset for average precision in table 4,
normalized average precision in table 5, WSS@95 in table 6, and in aggregate
in table 7. The results in these tables correspond to those submitted as official
runs. For comparison, we also calculate a baseline by evaluating each metric on
the data ordered randomly. This has been repeated 1000 times and we report
the average and standard deviation.</p>
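      <p>A minimal sketch of such a random-order baseline, here for average precision on a synthetic topic. The topic size and number of relevant articles are made up for illustration, and the function names and fixed seed are hypothetical.</p>

```python
import random
import statistics

def average_precision(ranked_labels):
    """Mean of precision@k over the ranks k that hold a relevant article."""
    hits, total = 0, 0.0
    for k, rel in enumerate(ranked_labels, start=1):
        if rel:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0

def random_baseline(labels, metric, repeats=1000, seed=0):
    """Evaluate a metric on random orderings; return its mean and standard deviation."""
    rng = random.Random(seed)
    shuffled = list(labels)
    values = []
    for _ in range(repeats):
        rng.shuffle(shuffled)
        values.append(metric(shuffled))
    return statistics.mean(values), statistics.stdev(values)

# Synthetic topic: 10 relevant articles among 1000 candidates.
labels = [1] * 10 + [0] * 990
mean_ap, sd_ap = random_baseline(labels, average_precision)
```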
      <p>We also report the mean, standard deviation, minimum and maximum WSS@95
and AUC over ten runs for a selection of topics in the CLEF dataset in table 8.</p>
    </sec>
    <sec id="sec-5">
      <title>Discussion</title>
      <sec id="sec-5-1">
        <title>Datasets</title>
        <p>One of the topics in the CLEF dataset, CD010653, has no Y. While we can still
calculate performance scores relative to M, this topic might arguably have been
omitted from the test data. One of the topics, CD008803, similarly has no M.
This also happens to be the topic with the largest number of Y.</p>
        <p>As a general tendency, we can observe that the relative number of Y / M
/ N in the CLEF dataset varies dramatically across topics. At the one end we
have one topic consisting of 14.06% Y (CD008760), and one topic consisting of
15.79% Y (CD010705). At the other end we have three topics with a mere 0.01%
Y (CD011548, CD011549, and CD012019). Most topics in the CLEF dataset
have a very small number of Y compared to Cohen, both in
relative and absolute terms. Several topics have a large number of M, however
(CD007427, CD008054, CD009020, CD009323, CD009591, CD011134, CD011548,
CD011975, CD011984, CD009925, CD010339, CD011145). Curiously, more
topics in the training set have a large number of M than in the test set, even though
the training set comprises fewer topics.
</p>
        <p>Table values without row or column labels (header fragment: w/o RF), in extraction order:
0.030, 0.023, 0.075, 0.013, 0.009, 0.012, 0.013, 0.025, 0.006, 0.014, 0.008, 0.018, 0.041, 0.002, 0.024, 0.034,
0.024, 0.011, 0.286, 0.041, 0.143, 0.014, 0.014, 0.006, 0.109, 0.015, 0.065, 0.086, 0.004, 0.195, 0.015,
0.052, 0.048, 0.087, 0.035, 0.023, 0.026, 0.025, 0.050, 0.022, 0.037, 0.021, 0.033, 0.085, 0.010, 0.034, 0.057,
0.034, 0.023, 0.237, 0.061, 0.164, 0.035, 0.048, 0.031, 0.098, 0.043, 0.106, 0.121, 0.015, 0.190, 0.016</p>
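        <table-wrap>
          <table>
            <thead>
              <tr><th>Metric</th><th>no AF full</th><th>no AF</th></tr>
            </thead>
            <tbody>
              <tr><td>WSS@95</td><td>0.640</td><td>0.500</td></tr>
              <tr><td>WSS@100</td><td>0.591</td><td>0.420</td></tr>
              <tr><td>last rel</td><td>1678</td><td>2263</td></tr>
              <tr><td>NCG@10</td><td>0.517</td><td>0.407</td></tr>
              <tr><td>NCG@20</td><td>0.802</td><td>0.639</td></tr>
              <tr><td>NCG@30</td><td>0.908</td><td>0.783</td></tr>
              <tr><td>NCG@40</td><td>0.946</td><td>0.843</td></tr>
              <tr><td>NCG@50</td><td>0.972</td><td>0.890</td></tr>
              <tr><td>NCG@60</td><td>0.984</td><td>0.921</td></tr>
              <tr><td>NCG@70</td><td>0.990</td><td>0.942</td></tr>
              <tr><td>NCG@80</td><td>0.997</td><td>0.960</td></tr>
              <tr><td>NCG@90</td><td>0.998</td><td>0.987</td></tr>
              <tr><td>NCG@100</td><td>1.000</td><td>0.998</td></tr>
              <tr><td>norm area</td><td>0.890</td><td>0.825</td></tr>
              <tr><td>ap</td><td>0.133</td><td>0.100</td></tr>
            </tbody>
          </table>
        </table-wrap>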
        <p>The number of N also varies wildly, from 52 up to 43287. Compared to the
Cohen dataset we also have a smaller minimum number of N, as well as a much
larger maximum number.</p>
        <p>If we compare the training and test sets, the training set contains almost
double the absolute number of M, many more N, but fewer Y.</p>
        <p>While relevance feedback sometimes gives an improvement in performance,
it often seems to only confuse the system (tables 4-7). This should
be contrasted with our experiments on the Cohen dataset, where the same
implementation reliably yields an improvement (table 3), and generally yields
performance intermediate between intertopic and intratopic classification, as one
would expect. There are perhaps better approaches to relevance feedback than
ours, which can reliably improve upon the baseline, but it might also be that
there is simply little to gain from relevance feedback on several of the topics. Of
particular note, we should not expect any improvements by using RF on topics
such as CD010386, CD010633, CD010860, CD010896, and CD012019, which have
a low absolute number of Y and M. It is also worth pointing out that our abrupt
scheme requires at least 4 Y before switching to the intratopic model, and any
differences between no AF and abrupt on these topics can thus only be due to
chance.</p>
        <p>We can see an improvement on the topic CD010705 when using relevance
feedback (tables 4-7). This topic is also the one with the highest percentage
of Y at 15.79%. We do not see any improvement for CD008760, the other topic
with a high percentage of Y (14.06%), but this may be due to the initial classifier
having much higher performance.</p>
        <p>We can observe that gradual outperforms abrupt on topic CD008760,
despite this topic having only 3 M, which is probably too low a number for gradual
to have an advantage. The simplest explanation for this is likely random chance.</p>
        <p>Overall, however, relevance feedback does not appear to lead
to an improvement for our system. For instance abrupt outperforms no AF 15
times out of 30, and gradual outperforms no AF only 10 times out of 30 (table
4).</p>
        <p>Of course, it seems unlikely for relevance feedback to be useful for those topics
where the number of positives is extremely low, even in theory. In particular, if
there is only one relevant article, as is the case for CD012019 and CD010386,
then relevance feedback cannot really add any value to the classification. Any
successful use of relevance feedback on such topics would necessarily have to use
the negative examples.</p>
        <p>We get better performance for no AF full than no AF. We have however
generally observed that this difference is reversed for intratopic
classification, which is what we should end up with after relevance feedback. Still,
it is possible that we would get better performance if we were to use no AF full
as a base for our relevance feedback experiments, since we would start with a
much better initial classifier.</p>
        <p>Ordinarily, screeners would be free to choose the order in which they screen
each article, and may proceed for instance in alphabetical or chronological order.
For the purposes of our baseline, we assume that any such order ordinarily
available to screeners would be indistinguishable from random order on average.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Metrics</title>
        <p>
          Average Precision has been selected as the main metric for this task as it was
previously found to be particularly well suited to evaluating retrieval performance for highly
imbalanced datasets [
          <xref ref-type="bibr" rid="ref3 ref9">9, 3</xref>
          ]. However, these studies rely on the common assumption
that we value high precision at the top of the ranking, whereas for systematic
review screening we value recall almost exclusively. Of particular note, average
precision heavily penalizes rankings where the top few results are non-relevant,
even if the ranking manages to place all relevant articles in the upper percentiles
of the ranking.
        </p>
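        <p>The contrast between the two metrics can be made concrete with a small synthetic example. The sketch assumes the usual definition of work saved over sampling, namely the fraction of articles left unscreened when the target recall is reached, minus one minus the recall; the rankings and function names are illustrative only.</p>

```python
import math

def average_precision(ranked_labels):
    """Mean of precision@k over the ranks k that hold a relevant article."""
    hits, total = 0, 0.0
    for k, rel in enumerate(ranked_labels, start=1):
        if rel:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0

def wss_at_recall(ranked_labels, recall=0.95):
    """Unscreened fraction at the target recall, minus (1 - recall)."""
    n_relevant = sum(ranked_labels)
    if n_relevant == 0:
        raise ValueError("ranking contains no relevant articles")
    needed = math.ceil(recall * n_relevant)
    found = 0
    for k, rel in enumerate(ranked_labels, start=1):
        found += rel
        if found >= needed:
            return (len(ranked_labels) - k) / len(ranked_labels) - (1.0 - recall)

# 10 relevant articles among 1000 candidates.
perfect = [1] * 10 + [0] * 990              # all relevant articles first
delayed = [0] * 5 + [1] * 10 + [0] * 985    # five non-relevant articles on top

ap_perfect, ap_delayed = average_precision(perfect), average_precision(delayed)
wss_perfect, wss_delayed = wss_at_recall(perfect), wss_at_recall(delayed)
# Average precision drops from 1.0 to roughly 0.48,
# while WSS@95 only moves from 0.940 to 0.935.
```

        <p>Five non-relevant articles at the top of the ranking cost a screener five extra articles out of a thousand, yet they roughly halve the average precision.</p>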
        <p>Furthermore, average precision is strongly correlated with the number of
positives in the topic: most of the cases where we achieve ap &gt; 0.2 are
for topics with high prevalence. While this is to be expected, it means that
average precision makes it difficult to compare performance across topics
(tables 1, 4-7). Similarly, Mean Average Precision will likely be dominated
by the results on the topics with many relevant articles and a small number
of total candidates, i.e. arguably the topics which are the least representative
systematic reviews of DTA studies, and where automated methods are likely the
least useful.</p>
      </sec>
      <sec id="sec-5-3">
        <title>Reliability of the Experiments</title>
        <p>Our classification method is stochastic, and thus does not produce the same
results every time we run it on the same input data. To
gauge the reliability of the experiment we repeat it ten times for a subset of the
topics and calculate the standard deviations, as well as examine the minimum
and maximum values (table 8).</p>
        <p>We can generally observe a fairly large variability for topics with a small total
number of candidates, such as CD008760 and CD010705, and for topics with a
comparably smaller proportion of Y, such as CD010339. When we consider topics
with a large number of candidates we can observe a large variability for
CD012019, but small variability for CD010386. We might speculate that small
topic size and a small relative number of Y are correlated with larger variability,
but it is clear that the variability for some topics is quite large, regardless of
the underlying causes and mechanisms. The standard deviation can be as large
as 0.139, which is large enough to cast doubt on the reliability of the
results. Furthermore, the minimum and maximum values are much more extreme
than we should expect from the standard deviations were
the values normally distributed, suggesting that the distribution is heavy-tailed
and prone to outliers.</p>
        <p>Considering the above, we might suspect that the differences in performance
in tables 4-7 are not significant. For instance abrupt outperforms gradual 17
times out of 30, but we do not know whether this means that abrupt is a better
method, or if this is simply due to random chance. We might speculate that
our gradual implementation works better for the cases where we have a sufficient
number of M, but the experiment is ultimately too low-powered to draw
conclusions. Future iterations of the campaign could consider whether
performance should be computed as an average over multiple runs, in order to get
more precise results for stochastic systems such as ours.</p>
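        <p>One way to quantify how little a count such as 17 wins out of 30 says is an exact two-sided sign test. This is a sketch added for illustration and was not part of the reported analysis; it assumes that topics are independent, that there are no ties, and that under the null hypothesis either system wins a topic with probability one half.</p>

```python
from math import comb

def sign_test_p_value(wins, n):
    """Two-sided exact binomial p-value for `wins` successes out of n fair coin flips."""
    k = max(wins, n - wins)
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * upper_tail)

p_value = sign_test_p_value(17, 30)  # roughly 0.58
```

        <p>Under these assumptions a split of 17 to 13 is entirely compatible with the two systems being equally good.</p>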
        <p>We can however see smaller variability in the mean performance across all
topics, which might suggest that these are more reliable estimates. However,
these give little indication as to how the performance depends on topic
composition.</p>
      </sec>
      <sec id="sec-5-4">
        <title>General Remarks on the Shared Task Model</title>
        <p>The Shared Task Model is typically implemented in evaluation campaigns that
seek to perform a community-wide technical evaluation of systems addressing a
particular task. A Shared Task thus offers an evaluation paradigm that includes
(1) a specific definition of the task and evaluation metrics, (2) an implementation
through the dissemination of datasets and evaluation tools, and (3) the execution
of the evaluation in a controlled setting where participants have access to data
at the same time and are evaluated blindly by an independent third party. As
outlined below, this year the TAR task was not conducted according to the
Shared Task Model.</p>
        <p>In this iteration of the evaluation campaign, the final set of evaluation
metrics was decided only shortly before participants were required to freeze their
systems. One of the expected outcomes of evaluation campaigns such as this is
indeed the discussion of the relative merits of the various metrics to be used.
However, changing the target metric close to the submission deadline means that
some participants may have optimized for different metrics than those ultimately
used for evaluation.</p>
        <p>The gold standard labeled test data was distributed directly to the
participants at the beginning of the test phase. This is explained by the lack of an
assessor through which participants could receive relevance feedback as has been
the case in e.g. TREC Total Recall. While common labeled test collections are
routinely used for research, this procedure is unusual in a shared task setting
where participants are typically asked to process a test dataset while being blind
to the gold standard associated with the dataset. This could alternatively have
been accomplished in part by requiring the submission of runs without relevance
feedback before the distribution of the gold standard labels.</p>
        <p>Another feature of the shared task model is the computation of performance
metrics for all participants by a common, independent party, which ensures that
all participants are evaluated under exactly the same conditions. This lends
greater reliability to the comparability and reproducibility of results. At the
time of writing, while a common evaluation tool has been released, the
performance reported by participants has been self-computed without validation from
the task organizers. In addition to result validation, it would also have been
useful to receive an indication of the overall performance of the participants prior to
the deadline for the submission of the working notes. This would have enabled a
discussion about the relative performance of the system, which is currently difficult
to have without comparing with previous literature using external datasets.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusions</title>
      <p>Our best system is the one using logistic regression trained using stochastic
gradient descent, using a minimum of preprocessing, and no relevance feedback.
This system achieves a workload reduction of 64.0% on average, with a minimum
workload reduction of 19.3%, and a maximum workload reduction of 92.0%. On
average, we would have to screen 1678 articles per topic to retrieve all relevant
articles. Overall, however, performance varies widely across topics.</p>
      <p>We do not generally see an improvement when using relevance feedback. For
the topics where relevance feedback is hypothetically feasible we sometimes see
an improvement, although the effect does not appear very reliable, and the low
power of the experiment means that the results are unlikely to be significant.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This project has received funding from the European Union's Horizon 2020
research and innovation programme under the Marie Skłodowska-Curie grant
agreement No 676207.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Cohen</surname>
            ,
            <given-names>A.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hersh</surname>
            ,
            <given-names>W.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peterson</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yen</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <source>Reducing Workload in Systematic Review Preparation Using Automated Citation Classification</source>
          pp.
          <volume>206</volume>
          -
          <issue>219</issue>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Cohen</surname>
            ,
            <given-names>A.M.:</given-names>
          </string-name>
          <article-title>Optimizing feature representation for automated systematic review work prioritization</article-title>
          .
          <source>AMIA Annual Symposium</source>
          proceedings pp.
          <volume>121</volume>
          -
          <issue>5</issue>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Davis</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goadrich</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>The relationship between precision-recall and ROC curves</article-title>
          .
          <source>In: Proceedings of the 23rd international conference on Machine learning</source>
          . pp.
          <volume>233</volume>
          -
          <fpage>240</fpage>
          .
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Kanoulas</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Azzopardi</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spijker</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Overview of the CLEF technologically assisted reviews in empirical medicine</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Khabsa</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elmagarmid</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ilyas</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hammady</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ouzzani</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Learning to identify relevant studies for systematic reviews using random forest and external information</article-title>
          .
          <source>Machine Learning</source>
          <volume>102</volume>
          (
          <issue>3</issue>
          ),
          <fpage>465</fpage>
          –
          <lpage>482</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>O'Mara-Eves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thomas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McNaught</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miwa</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ananiadou</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Using text mining for study identification in systematic reviews: a systematic review of current approaches</article-title>
          .
          <source>Systematic Reviews</source>
          <volume>4</volume>
          (
          <issue>1</issue>
          ),
          <fpage>5</fpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Pedregosa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varoquaux</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gramfort</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michel</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thirion</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grisel</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          , et al.:
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          (Oct),
          <fpage>2825</fpage>
          –
          <lpage>2830</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Petersen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Poon</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Poon</surname>
            ,
            <given-names>S.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Loy</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Increased workload for systematic review literature searches of diagnostic tests compared with treatments: Challenges and opportunities</article-title>
          .
          <source>JMIR Medical Informatics</source>
          <volume>2</volume>
          (
          <issue>1</issue>
          ),
          <elocation-id>e11</elocation-id>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Saito</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rehmsmeier</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets</article-title>
          .
          <source>PLoS ONE</source>
          <volume>10</volume>
          (
          <issue>3</issue>
          ),
          <elocation-id>e0118432</elocation-id>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Suominen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kelly</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kanoulas</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spijker</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neveol</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zuccon</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Palotti</surname>
            ,
            <given-names>J.R.M.</given-names>
          </string-name>
          :
          <article-title>Overview of the CLEF eHealth evaluation lab 2017</article-title>
          .
          In:
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction - 8th International Conference of the CLEF Association, CLEF 2017, Dublin, Ireland, September 11-14, 2017, Proceedings</source>
          .
          <series>Lecture Notes in Computer Science</series>
          ,
          <publisher-name>Springer</publisher-name>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>