<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CLEF 2017 Technologically Assisted Reviews in Empirical Medicine Overview</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Evangelos Kanoulas</string-name>
          <email>E.Kanoulas@uva.nl</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dan Li</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leif Azzopardi</string-name>
          <email>leif.azzopardi@strath.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rene Spijker</string-name>
          <email>R.Spijker-2@umcutrecht.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Cochrane Netherlands and UMC Utrecht, Julius Center for Health Sciences and Primary Care</institution>
          ,
          <country country="NL">Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Computer and Information Sciences, University of Strathclyde</institution>
          ,
          <addr-line>Glasgow</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Informatics Institute, University of Amsterdam</institution>
          ,
          <country country="NL">Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Systematic reviews are a widely used method to provide an overview of the current scientific consensus, by bringing together multiple studies in a reliable, transparent way. The large and growing number of published studies, and their increasing rate of publication, makes the task of identifying all relevant studies in an unbiased way both complex and time consuming, to the extent that it jeopardizes the validity of the reviews' findings and the ability to inform policy and practice in a timely manner. The CLEF 2017 e-Health Lab Task 2 focuses on the efficient and effective ranking of studies during the abstract and title screening phase of conducting Diagnostic Test Accuracy systematic reviews. We constructed a benchmark collection of fifty such reviews and the corresponding relevant and irrelevant articles found by the original Boolean queries. Fourteen teams participated in the task, submitting 68 automatic and semi-automatic runs, using information retrieval and machine learning algorithms over a variety of text representations, in a batch and iterative manner. This paper reports both the methodology used to construct the benchmark collection and the results of the evaluation.</p>
      </abstract>
      <kwd-group>
        <kwd>Evaluation</kwd>
        <kwd>Information Retrieval</kwd>
        <kwd>Systematic Reviews</kwd>
        <kwd>TAR</kwd>
        <kwd>Text Classification</kwd>
        <kwd>Active Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>Evidence-based medicine has become an important pillar in health care and policy making. In order to practice evidence-based medicine, it is important to have a clear overview of the current scientific consensus. These overviews are provided in systematic review articles that summarize all available evidence regarding a certain topic (e.g., a treatment or diagnostic test). In order to write a systematic review, researchers have to conduct a search that will retrieve all the studies that are relevant. The large and growing number of published studies, and their increasing rate of publication, makes the task of identifying relevant studies in an unbiased way both complex and time consuming, to the extent that it jeopardizes the validity of the reviews' findings and the ability to inform policy and practice in a timely manner. Hence, the need for automation in this process becomes of utmost importance. Finding all relevant studies in a corpus is a difficult task, known in the Information Retrieval (IR) domain as the total recall problem.</p>
      <p>To date, retrieval of evidence to inform systematic reviews is conducted in multiple stages:
1. Boolean Search: At the first stage, information specialists build a broad Boolean query expressing what constitutes relevant information. The query is then submitted to a medical database containing titles, abstracts, and indexing terms of a controlled vocabulary of medical studies. The result is a set, A, of potentially interesting studies.
2. Title and Abstract Screening: At a second stage, experts screen the titles and abstracts of the returned set and decide which of those hold potential value for their systematic review, a set D. If screening an abstract has a cost Ca, screening all |A| abstracts has a cost of Ca · |A|.
3. Study Screening: At a third stage, experts download the full text of the potentially relevant abstracts, D, identified in the previous phase and examine the content to decide whether these studies are indeed relevant or not. Examining a document typically has a larger cost Cd &gt; Ca. The result of the second screening is a set of references to be included in the systematic review.</p>
      <p>Unfortunately, the precision of the Boolean searches is typically low, hence
reviewers often need to look manually through many thousands of irrelevant titles
and abstracts in order to identify a small number of relevant ones. Furthermore,
the recall of the searches is often assumed to be 100%, which may not be the
case.</p>
      <p>
        To overcome some of the limitations of the Boolean search, researchers have
been testing the effectiveness of machine learning and information retrieval
methods. O’Mara-Eves et al.[
        <xref ref-type="bibr" rid="ref3">15</xref>
        ] provide a systematic review of the use of text mining
techniques for study identification in systematic reviews.
      </p>
      <p>The goal of this lab is to bring together academic, commercial, and
government researchers that will conduct experiments and share results on automatic
methods to retrieve relevant studies with high precision and high recall, and
release a reusable test collection that can be used as a reference for
comparing different retrieval and mining approaches in the field of medical systematic
reviews.
</p>
    </sec>
    <sec id="sec-2">
      <title>Benchmark Collection</title>
      <p>To construct the benchmark collection, the organizers of the task considered 58 systematic reviews on Diagnostic Test Accuracy conducted by Cochrane researchers. These reviews are publicly available through the Cochrane Library4
4 http://www.cochranelibrary.com/
and can be identified by setting the topic filter in the library to "Diagnostic" and "Diagnostic Test Accuracy" and the stage filter to "Review". At the date of publication of this article, 79 such studies are available; however, the last 22 were performed after the organizers put the collection together. The 58 systematic reviews considered can be found in the Appendix of this article in Table 6.</p>
      <p>Participants were provided with two data sets: (a) a development set, and
(b) a test set. The development set consists of 20 topics for Diagnostic Test
Accuracy (DTA) systematic reviews, while the test set consists of 30 topics. For
both sets, one topic file and two files of relevance judgments at abstract and
document level respectively are constructed (qrel’s).</p>
      <p>The topic file is generated through the following procedure. For each
systematic review, we reviewed the search strategy from the corresponding study
in Cochrane Library. A search strategy, among others, consists of the exact
Boolean query developed and submitted to a medical database, at the time the
review was conducted, and typically can be found in the Appendix of the study.
Rene Spijker, a co-author of this work and a Cochrane information specialist, examined the grammatical correctness of the search query and specified the date range which dictated the valid dates for the articles to be included in this systematic review. The date range was necessary because a study published after the systematic review should not be included even though it might be relevant, since that would require manually examining its content to quantify its relevance. Important note: A number of medical databases, and search interfaces to these databases, are available for search, and for each one information specialists construct a different variation of their query that better fits the data and metadata of the database.
for the MEDLINE database, using the Wolters Kluwer Ovid interface. Then we
submitted the constructed Boolean query to the OVID system5 and collected all
the returned PubMed document identification numbers (PMID’s) which satisfied
the date range constraint. This step was automated by a Python script we put
together and through an interface available to the University of Amsterdam6.
Out of the 58 reviews 8 were discarded since the provided Boolean query was
not in the right format, which made it difficult if not impossible to reconstruct
the set of PMID’s, hence the 50 topics in the development and test set.</p>
      <p>The topic file is in a text format and contains four sections, Topic, Title, Query, and PMIDs, where Topic is the topic ID, a substring of the DOI of the document (e.g. CD010438 for 10.1002/14651858.CD010438.pub2), and PMIDs are the document IDs returned by the Boolean query. The PMIDs can be used to access the corresponding documents through the National Center for Biotechnology Information (NCBI)7. An example of a topic file can be viewed below.
5 http://demo.ovid.com/demo/ovidsptools/launcher.htm
6 https://github.com/dli1/tar_data_collection
7 https://www.ncbi.nlm.nih.gov/books/NBK25497/
Topic: CD009551
Title: Polymerase chain reaction blood tests for the diagnosis of
invasive aspergillosis in immunocompromised people</p>
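      <p>As an illustration, the following minimal Python sketch parses such a topic file; the function name and the assumed section labels ("Topic:", "Title:", "Query:", "Pids:") are ours and may differ slightly from the distributed files.</p>
      <p># Minimal sketch (our own, not official task tooling): read a topic file with
# Topic / Title / Query / PMIDs sections into a dictionary.
def parse_topic_file(path):
    topic = {"topic": None, "title": "", "query": [], "pmids": []}
    section = None
    with open(path, encoding="utf-8") as f:
        for raw in f:
            line = raw.rstrip("\n")
            if line.startswith("Topic:"):
                topic["topic"] = line.split(":", 1)[1].strip()
            elif line.startswith("Title:"):
                topic["title"] = line.split(":", 1)[1].strip()
                section = "title"
            elif line.startswith("Query:"):
                section = "query"
            elif line.startswith("Pids:"):
                section = "pmids"
            elif section == "title" and line.strip():
                topic["title"] += " " + line.strip()  # titles may wrap over several lines
            elif section == "query":
                topic["query"].append(line)
            elif section == "pmids" and line.strip():
                topic["pmids"].append(line.strip())
    return topic</p>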
      <p>For the construction of the qrel files, we considered the reference section of the 50 systematic reviews. The references are split into three categories: Included, Excluded, and Additional. Included are the studies that are relevant to the systematic review. Excluded are the studies that were considered relevant at the abstract and title screening stage, but were considered irrelevant to the study at the article screening phase and hence excluded from it. Additional are references that do not impact the outcome of the study, and hence are irrelevant to it. The included references were the relevant studies in the document-level qrels, while both the included and excluded references were considered relevant in the abstract-level qrels. The format of the qrels followed the standard TREC format:</p>
      <p>Topic Iteration Document Relevance
where Topic is the topic ID of the systematic review, Iteration in our case is a dummy field, always zero and not used, Document is the PMID, and Relevance is a binary code of 0 for not relevant and 1 for relevant studies. The order of documents in the qrel files is not indicative of relevance. Studies that were returned by the Boolean query but were not relevant based on the above process were considered irrelevant. Those are studies that were excluded at the abstract and title screening phase. All other documents in MEDLINE were also assumed to be irrelevant, given that they were not judged by the human assessor.</p>
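      <p>For illustration, a small Python sketch that reads such a qrel file into per-topic sets of relevant PMIDs is given below; the function and file names are ours and purely illustrative.</p>
      <p>from collections import defaultdict

# Minimal sketch: read a TREC-format qrel file ("topic iteration pmid relevance")
# into a mapping from topic ID to the set of relevant PMIDs.
def read_qrels(path):
    relevant = defaultdict(set)
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 4:
                continue  # skip malformed lines
            topic, _iteration, pmid, rel = parts
            if rel == "1":
                relevant[topic].add(pmid)
    return relevant

# e.g. abstract_level = read_qrels("qrels_abstract.txt")  # hypothetical file name</p>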
      <p>Important note: As mentioned earlier, the references of a systematic review were produced after a number of Boolean queries were submitted to a number of medical databases, and their titles and abstracts were screened. The PMIDs provided, however, were only those that came out of the MEDLINE query. Therefore, there were a number of abstract-level relevant studies (the gray area in the Venn diagram below) that were not part of the result set of the Boolean query provided to the participants. For the development set, the qrel file contained those additional PMIDs, for those participants that would decide to search the entire MEDLINE database and not only consider the studies provided to them in the Topic files. To the best of our knowledge, no one submitted such a system; hence, to avoid any bias, we excluded those relevant studies from the test set.</p>
      <sec id="sec-2-1">
        <title>MEDLINE Boolean Query</title>
      </sec>
      <sec id="sec-2-2">
        <title>Relevant Studies</title>
        <p>
          Table 1 shows the distribution of the relevant documents at the abstract and document level for all the topics in the development set and the test set. The total number of unique PMIDs is 149,405 for the development set and 117,562 for the test set. Their percentages of relevant documents at the abstract level are quite close: 1.88% for the development set and 1.58% for the test set. This is not the case at the document level, however, where the number of relevant documents in the test set is almost twice as large as in the development set, even though the rates of relevant studies are 0.52% and 0.33%, respectively. In [
          <xref ref-type="bibr" rid="ref5">17</xref>
          ], a test collection was developed based on a random selection of 93 Cochrane systematic reviews (not just DTAs), which reported a slightly higher rate of relevance (about 1.2%). For comparison, the rate of relevant documents is 5.45% for the Ad hoc track of TREC-8 and 2.78% for the Web track of TREC 2002. Overall, the rate of relevant documents in this lab is not very high, making locating them quite a difficult task.
        </p>
        <p>Important note: As one can observe in Table 1, there are topics for which the output of the Boolean query is rather narrow, with as few as 64 studies to be reviewed for topic CD008760. Cochrane conducts systematic reviews on a regular basis, in an attempt to update each review every two to three years. Some of the reviews considered for the construction of the benchmark collection, such
as the CD008760 review, are updates to previous reviews. These updates only specify a query for a time range that starts after the last review on the topic was conducted. Hence, the 64 studies are the output of the Boolean query for this short time range, which explains the small number. If the Boolean query were run against the entire MEDLINE database, the number of studies would be in the range of tens of thousands, as is the case for some other reviews considered, e.g. CD008782.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Task Description</title>
      <p>
        The CLEF 2017 e-Health Lab [
        <xref ref-type="bibr" rid="ref9">8</xref>
        ], task 2, focused on retrieving studies for
conducting Diagnostic Test Accuracy (DTA) systematic reviews. Retrieval in this
area is generally considered very difficult, where sensitive searches result in large
quantities of references to be screened manually, and a breakthrough in this field
would likely be applicable to other areas as well. The task has a focus on the
second stage of the process, i.e. given the results of a Boolean search how to
make abstract and title screening more effective and efficient. Currently a
typical number needed to read (NNR), the number of studies to screen to identify
1 eligible study, for DTA systematic reviews is approximately 80 when applied
to potential abstracts that need further full text assessment. With an average
of 7000 results to be screened, which would take approximately 120 hours to
screen (1 minute per abstract [
        <xref ref-type="bibr" rid="ref6">18</xref>
        ]), a huge benefit can be gained by reducing the workload of this process.
      </p>
      <p>Given the results of the Boolean search from stage 1 as the starting point,
participants were asked to rank the set of the provided abstracts. The task
had two goals: (i) to produce an efficient ordering of the documents, such that
all the relevant abstracts are retrieved above the irrelevant ones, and (ii) to
identify the relevant subset of abstracts to be shown to a user, that is, a stopping point in the ranked list of abstracts where a researcher could confidently stop screening abstracts and titles. Therefore, we solicited two types of submissions: (i) ranking submission: automatic or manual methods that rank all abstracts, with the goal of retrieving relevant abstracts as early in the ranking as possible, and (ii) thresholding submission: methods that additionally identify such a stopping point; thresholding can be performed in a batch or iterative manner.</p>
      <p>We also considered two evaluation frameworks, (a) a simple evaluation, and
(b) a cost-effective evaluation. The assumption behind the simple evaluation
framework is the following: The user of your system is the researcher that
performs the abstract and title screening of the retrieved articles. Every time an
abstract is returned (i.e. ranked) there is an incurred cost/effort of CA, while
the abstract is either irrelevant (in which case no further action will be taken)
or relevant (and hence passed to the next stage of document screening) to the
topic under review. The assumption behind the cost-effective evaluation is the
following: The user that performs the screening is not the end-user. The user
can interchangeably perform abstract and title screening, or document
screening, and decide what PMIDs to pass to the end-user. Every time an abstract
is returned the user can either (a) read the abstract (with an incurred cost of
CA) and decide whether to pass this PMID to the end-user, or (b) read the
full document (with an incurred cost of CA+CD) and decide whether to pass
this PMID to the end-user, or (c) directly pass the PMID to the end user (with
an incurred cost of 0), or (d) directly discard the PMID and not pass it to the
end user (with an incurred cost of 0). For every PMID passed to the end-user there is a cost attached to it: CA if the abstract passed on is not relevant, and CA + CD if the abstract passed on is relevant (that is, we assume that the end-user completes a two-round abstract and document screening, as usual, but only for the PMIDs the algorithm+feedback user decided to be relevant). Although a small number of teams participated in the cost-effective sub-task, the lab focused on the simple evaluation sub-task, and this is what is described in the remainder of this report.</p>
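      <p>To make the bookkeeping of the cost-effective framework concrete, the sketch below tallies the cost of a sequence of per-PMID decisions; the action labels, the function name, and the example value for CD are ours and only illustrate the accounting described above, not the official evaluation code.</p>
      <p>C_A, C_D = 1.0, 2.0  # illustrative values: CA = 1 matches the setting used later; CD is an assumption

# decisions: iterable of (action, passed_on, relevant) triples, where action is one of
# "read_abstract", "read_document", "pass", "discard".
def cost_effective_cost(decisions):
    feedback_user_cost = 0.0
    end_user_cost = 0.0
    for action, passed_on, relevant in decisions:
        if action == "read_abstract":
            feedback_user_cost += C_A
        elif action == "read_document":
            feedback_user_cost += C_A + C_D
        # "pass" and "discard" cost the feedback user nothing
        if passed_on:
            # the end-user re-screens every PMID passed on: abstract only if it is
            # not relevant, abstract plus full document if it is relevant
            end_user_cost += (C_A + C_D) if relevant else C_A
    return feedback_user_cost + end_user_cost</p>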
    </sec>
    <sec id="sec-4">
      <title>Evaluation</title>
      <p>Evaluation within the context of using technology to assist in the reviewing
process is very much dependent on how the user(s) interact with the system - and
the goal of the technology assistance. For example, the goal of the assistance could be to automate the screening process, where the system assesses all the abstracts and returns a subset of the initial set to be screened by the end-user (i.e. screened in batch mode). Or, it could be used to identify all the relevant documents as soon as possible, in an iterative manner, where the system asks for feedback from the end-user to help improve the ranking. Of course, an open problem then is to decide when to stop requesting feedback and when to stop assessing abstracts. In that case a subset of abstracts is identified, which consists of the abstracts that have been screened during the feedback cycles and the remainder that are screened but not used for feedback (i.e. in batch mode). There are, of course, many
other possible variations. For the purposes of this initial track/task, we consider
the problem as a ranking task - that is to rank the set of documents associated
with the topic in decreasing order of relevance. We consider a document relevant
if the abstract passed the abstract screening phase (regardless of whether it was
included or excluded from the study).</p>
      <p>For this task we employ a number of standard measures, typically used in
IR ranking evaluations, along with other measures from related tracks and some
new measures we have developed.</p>
      <p>– Standard Measures</p>
      <p>Average Precision (AP)
Normalized cumulative gain @ 0% to 100% of documents shown; for the simple case that judgments are binary, normalized cumulative gain @ x% is simply Recall @ x% of shown documents [10]
Number of Relevant Found (nr)
Recall r = nr / R, where R is the total number of relevant documents
Number of documents returned/shown (n)
– Related Measures (from [6,5])</p>
      <p>LOSS-R: loss_r = (1 - r)^2
LOSS-E: loss_e = (n/(R + 100) · 100/N)^2, where N is the size of the collection
Reliability = loss_r + loss_e [6]</p>
      <p>Work Saved over Sampling at r, W SS@Recall = (T N +F N )=N (1 r)[5]
– Proposed Measures</p>
      <p>Last Rel Found: Minimum number of documents returned to retrieve all
nr relevant documents
Total Cost (TC);
Total Cost with Uniform penalty (TCU)</p>
      <p>Total Cost with Weighted Penalty (TCW)</p>
      <p>To calculate the cost-based measures, we considered three possible interactions to support a range of different ways to screen the items and to utilize feedback when ranking. We consider the following possibilities:
1. Suppose we have a ranking algorithm which uses no feedback from the user and simply ranks the list of abstracts. The list is then presented to the end-user, who evaluates it in a batch. In this case, no feedback is requested, and abstracts are marked NF.
2. Suppose we have a ranking algorithm which uses feedback (i.e. abstract(s) are presented to the user, feedback on their relevance is obtained, which is then used by the algorithm, thus simulating online feedback from the user). In this case, each abstract for which feedback from the user is requested is marked AF, while abstracts for which no feedback is requested are marked NF. Abstracts marked NF are then presented to the end-user to evaluate in a final batch.
3. For either option above, the algorithm may decide that an abstract is not relevant and thus does not need to be shown to a user; such abstracts are marked NS.</p>
      <p>To calculate the total cost (TC), we calculated:</p>
      <p>TC = #NF · Ca + #AF · (Ca + Cf)   (1)
where Ca is the cost of assessing the abstract, Cf is the cost of asking for feedback, #NF is the number of NF items, and #AF is the number of AF items.</p>
      <p>We also created two additional cost measures which included a penalty for
missing relevant abstracts (a) with a uniform penalty and (b) a weighted penalty.
The uniform penalty was calculated as follows:</p>
      <p>TCU = TC + ((R - r)/R) · (N - n) · Cp   (2)
where Cp is the cost of the penalty of missing a relevant abstract, and N is the total number of documents in the set for the topic. The assumption behind this penalty is that the end-user would need to continue examining abstracts before they would find the remaining (R - r) relevant items, and encounters them at a uniform rate in the remaining N - n abstracts which were not shown. So if half the relevant items were missing, then the penalty component would be (N - n) · Cp / 2. If no relevant items were missing, the penalty component would be zero.</p>
      <p>The weighted penalty was calculated as follows:</p>
      <p>TCW = TC + Σ_{i=1}^{(R - r)} (1/2^i) · (N - n) · Cp   (3)
where the assumption is that the end user would need to examine half of the remaining documents to find the next relevant abstract, per missing relevant abstract. So if all relevant items were missing, then the summation would tend to one and the penalty component tends to (N - n) · Cp, while if only one relevant item is missing, then the penalty component is (N - n) · Cp / 2.</p>
      <p>To compute these measures we set Ca = 1, Cf = 2, and Cp = 2, to represent the relative costs of the different actions. Note that these are not based on any empirical data and are used as a way to penalize feedback requests and unshown documents.</p>
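      <p>A compact Python sketch of these measures, under the parameter values above, is given below; it reflects our reading of Equations (1)-(3) and of the loss and WSS components, and is not the official evaluation script.</p>
      <p>Ca, Cf, Cp = 1.0, 2.0, 2.0  # the relative costs stated above

def total_costs(num_nf, num_af, R, r_found, N, n_shown):
    # num_nf / num_af: abstracts shown without / with a feedback request;
    # R: total relevant; r_found: relevant found; N: collection size for the topic;
    # n_shown: documents returned/shown.
    tc = num_nf * Ca + num_af * (Ca + Cf)                  # Eq. (1)
    tcu = tc + ((R - r_found) / R) * (N - n_shown) * Cp    # Eq. (2), uniform penalty
    weight = sum(1.0 / 2 ** i for i in range(1, R - r_found + 1))
    tcw = tc + weight * (N - n_shown) * Cp                 # Eq. (3), weighted penalty
    return tc, tcu, tcw

def reliability(recall, n_shown, R, N):
    loss_r = (1.0 - recall) ** 2                           # LOSS-R
    loss_e = (n_shown / (R + 100.0) * 100.0 / N) ** 2      # LOSS-E, as in [6]
    return loss_r + loss_e

def wss_at_recall(tn, fn, N, recall):
    # Work Saved over Sampling at the given recall [5]
    return (tn + fn) / N - (1.0 - recall)</p>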
    </sec>
    <sec id="sec-5">
      <title>Participants</title>
      <p>Fourteen groups from eleven countries submitted a total of 68 runs for this task.</p>
      <p>Table 2 categorizes the participating runs along five dimensions: (a)
automatic vs manual runs; (b) use of the development set; (c) use of supervised
and semi-supervised learning algorithms, (d) use of relevance feedback; and (e)
thresholding the ranked list of articles. The categorization has been performed by the lab coordinators – not by the participants – based on the participants' submitted descriptions of their algorithms. Hence, there is always a chance of mis-classifying some run. Out of the 68 runs submitted, 52 focused on the simple evaluation framework, while 16 on the cost-effective one. Out of the 52
submitted runs for the simple sub-task, 35 ranked all the PMIDs that were returned by
the Boolean query, while 17 tested different stopping criteria over the ranking.
Participants employed both supervised and unsupervised methods, for ranking
articles. A large number of runs were trained over the provided development
set, and their generalization was tested against the test topics. 26 runs used the
development set in some fashion, while 26 made no explicit use of it; it may be
the case that participants tried different models and algorithms over the
development set, and selected to submit the best performing ones, hence there may
be a flavor of model selection, however we did not consider this as use of the
development set. Participants represented the textual data in a variety of ways,
including document-topic features, bag-of-words, topic model distributions,
embeddings, and metadata. In the remainder of this section, by article we mean the abstract
and the title of an article. We are not aware of any participant that worked on
the full text of these articles.</p>
      <p>
        In particular, AMC took a batch supervised approach, training a Random
Forest over a topic model representation of the articles. A 75-topic model was
fitted over all articles in the collection, and the Topic-to-Document matrix was
used to extract features [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        AUTH took a learning-to-rank approach, using both batch and active
learning. Their model, HybridRankSVM, consists of two parts: an inter-topic model
which utilizes XGBoost and is trained over the entire development corpus and
an intra-topic model, an iteratively-built SVM, trained over relevance feedback
provided partially in the test topics. For the inter-topic model a total of 24 topic-document (or solely topic) features were computed over the title, abstract, and
mesh terms of the articles and the query. For the intra-topic model a TF-IDF
vectorization of the articles was used [
        <xref ref-type="bibr" rid="ref10 ref13">3</xref>
        ].
      </p>
      <p>CNRS trained a logistic regression model on n-gram features from the titles
and abstracts and structured data from the Medline citations. One of their
models was trained using stochastic gradient descent on the majority of the features,
and one on the principal components of a subset of the features. Class
imbalance was handled by reweighting and undersampling, while two approaches for
relevance feedback were investigated [13].</p>
      <p>ECNU took a learning-to-rank approach, using BM25, PL2, and BB2 as
features. The trained model was also combined with a vector space model [4].</p>
      <p>ETH used a LambdaMART model trained on features such as BM25, fuzzy search, vector content representations, and publication data. This model was used to
experiment with different stopping criteria. One of the approaches taken was to
use minimal relevance feedback to estimate the distribution of positive samples
by score. This was done by sampling from the articles, preferring articles with
higher score. A Gaussian distribution was fitted on the positive samples and
the resulting biased distribution was corrected. The correction worked by first
adapting the mean and then iteratively finding the standard deviation matching
the sampled data the best. For more details the reader can refer to [9].</p>
      <p>
        NCSU adopted a continuous active learning framework for this task. An
SVM classifier was trained on the relevance feedback labels and undersampling
of the negatively labeled articles removing those furthest from the SVM
decision hyperplane was employed. Different runs made use of different weights on
the labels depending on whether the abstract or the full text was considered
relevant [
        <xref ref-type="bibr" rid="ref8">20</xref>
        ].
      </p>
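      <p>As a rough illustration of a continuous active learning loop of this kind (our own generic sketch, which omits the undersampling and label weighting the team describes, and is not their implementation), consider:</p>
      <p>import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Generic CAL sketch: repeatedly train on the labels collected so far and request
# feedback on the top-scoring unlabelled abstracts.
def cal_loop(texts, oracle, seed_relevant, batch_size=25, rounds=40):
    X = TfidfVectorizer().fit_transform(texts)
    labels = {i: 1 for i in seed_relevant}          # index -> 0/1 relevance label
    for _ in range(rounds):
        idx = sorted(labels)
        y = np.array([labels[i] for i in idx])
        unlabeled = [i for i in range(len(texts)) if i not in labels]
        if not unlabeled:
            break
        if len(set(y)) == 1:
            batch = unlabeled[:batch_size]          # no label diversity yet: take a plain batch
        else:
            clf = LinearSVC().fit(X[idx], y)
            scores = clf.decision_function(X)
            order = np.argsort(-scores)
            batch = [i for i in order if i not in labels][:batch_size]
        for i in batch:
            labels[i] = oracle(i)                   # simulated relevance feedback
    return labels</p>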
      <p>NTU examined the role of convolutional neural networks for classifying
medical articles for systematic reviews [12].</p>
      <p>Padua used a two-dimensional probabilistic version of BM25 to rank articles.
The parameters were tuned using the development set. Further, the top abstract
returned by BM25 was provided to two non-experts who generated one
additional query each. The three queries were then used to re-rank articles. Different
approaches for relevance feedback and thresholding were investigated [14].</p>
      <p>
        QUT trained a learning-to-rank model using domain specific features. As
domain specific features, PICO annotations (Population, Intervention, Control,
Outcome) were used; these were extracted automatically from articles and
manually from the Boolean queries [
        <xref ref-type="bibr" rid="ref4">16</xref>
        ].
      </p>
      <p>
        Sheffield automatically parsed the Boolean queries to extract both the terms
and MeSH headings, and used TF-IDF cosine similarity to calculate the
similarity score between document title and abstracts [
        <xref ref-type="bibr" rid="ref1 ref12">1</xref>
        ].
      </p>
      <p>UOS explored two methods: (i) topic models, where they used Latent Dirichlet Allocation to identify topics within the set of retrieved articles, and then ranked articles by the topic most likely to be relevant to the query, and (ii) relevance
feedback, where they used Rocchio’s algorithm to update the query model for
subsequent rounds of interaction. A third approach combined the topic model
and relevance feedback approaches to quickly identify the relevant articles. For
the thresholding task, they applied a score threshold over BM25 [11].</p>
      <p>
        UCL took a supervised approach and trained a deep model architecture to
identify studies pertaining to a given review topic [
        <xref ref-type="bibr" rid="ref7">19</xref>
        ].
      </p>
      <p>Waterloo applied the Baseline Model Implementation (BMI) from the TREC
Total Recall Track (2015-2016). They further applied their "knee-method"
stopping criterion to BMI to determine how many abstracts should be examined for
each topic [7].
</p>
    </sec>
    <sec id="sec-6">
      <title>Results</title>
      <p>Table 3 presents a number of evaluation measures for those runs that ranked the
entire set of articles provided by the original Boolean queries; no thresholding
has been applied. Some runs, as can be seen in Tables 7, 8, 9, and 10, even though they applied no stopping criterion, still missed a number of documents. There may be multiple reasons for that, e.g. missing some topic, or not being able to download the abstract text, since participants were provided with PMIDs only. The number of documents for which feedback was requested appears in the second column of the table, while the remaining columns report different measures of performance.</p>
      <p>Figure 2 shows the recall-effort curves for the participating runs, that is the
recall value at different percentages of documents shown to the user. The straight pink line with the triangular markers on x=y is the result of randomly shuffling the output of the Boolean query, and it serves as a naive baseline, provided by the UOS team. The brown curve with the triangular markers is the BM25 retrieval function, also provided by the UOS team as a baseline; it ranks abstracts by BM25 over the Boolean query terms, with the default BM25 parameter settings.</p>
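      <p>A baseline of this kind can be sketched with an off-the-shelf BM25 implementation; the snippet below uses the rank_bm25 package purely as an illustration and is not the UOS team's code.</p>
      <p>from rank_bm25 import BM25Okapi  # pip install rank-bm25

# Illustrative BM25 baseline: rank a topic's abstracts by BM25 similarity to the
# terms extracted from the Boolean query, with default parameters.
def bm25_rank(abstracts, query_terms):
    tokenized = [doc.lower().split() for doc in abstracts]
    bm25 = BM25Okapi(tokenized)
    scores = bm25.get_scores([t.lower() for t in query_terms])
    return sorted(range(len(abstracts)), key=lambda i: scores[i], reverse=True)</p>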
      <p>Figure 3 presents the box-plots of Mean Average Precision values for runs
that do not make use of relevance feedback (left) and runs that make use of
relevance feedback (right) respectively. On average relevance feedback boosts
the effectiveness of the ranking algorithms, as expected; however, it may come with additional cost in terms of assessing the relevance of abstracts (depending on the screening setup considered).</p>
      <p>Table 4 presents a number of evaluation measures for those runs that applied a threshold criterion. The total number of abstracts shown to the user can be found in the second column of the table, the number of documents for which feedback was requested in the third, while the remaining columns report different measures of performance. The cost measures account both for the cost
of presenting a document to the user and for the additional cost of requesting
feedback for a document, while they also account for the cost one would need to
pay to reach 100% recall, under certain assumptions. Reliability considers the
cost of not finding all relevant documents but makes no discrimination between
the documents returned to the user and those for which feedback is requested.
Average precision is well defined under the stopping criterion but hard to use for comparing runs that use different thresholds. An easy-to-understand
measure is the achieved recall at the rank of the threshold.</p>
      <p>Figure 4 presents recall at the point of the threshold as a function of the number of documents presented to the user; that is, at different stopping criteria, but also with different ranking and thresholding algorithms. As expected, the more documents presented to the user (the lower the threshold criterion), the higher the achieved recall. Nevertheless, there are still algorithms that dominate
others. The figure presents the Pareto frontier. Figure 5 presents recall at the point of the threshold as a function of the number of feedback documents requested. As can be seen, although feedback documents are in principle helpful towards achieving a high recall, there are algorithms that used no relevance feedback and still achieved high recall at a threshold.
Table 5 provides statistics on the topics used in the test set, along with the average Average Precision (AAP) for each topic, a measure that can be seen as a proxy of the difficulty of each topic. The Pearson correlation coefficients between AAP and the percentage of relevant documents, the total number of documents, and the total number of relevant documents are -0.4868 (p-value = 0.006), 0.1295 (p-value = 0.495), and 0.8994 (p-value = 0), respectively. Figures 6 and 7 visually demonstrate
this correlation.</p>
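      <p>Such correlations can be reproduced with standard tooling, for instance (illustrative placeholder values, not the lab's data):</p>
      <p>from scipy.stats import pearsonr

# Per-topic AAP against a per-topic statistic, e.g. the percentage of relevant documents.
aap = [0.21, 0.35, 0.12, 0.44]            # placeholder values
pct_relevant = [1.2, 0.8, 2.5, 0.6]       # placeholder values
r, p_value = pearsonr(aap, pct_relevant)
print(f"Pearson r = {r:.4f}, p-value = {p_value:.3f}")</p>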
    </sec>
    <sec id="sec-7">
      <title>Conclusions</title>
      <p>The CLEF 2017 e-Health Lab Task 2 constructed a benchmark collection of 50
Diagnostic Test Accuracy systematic reviews to study the effectiveness and
efficiency of information retrieval and machine learning algorithms in prioritizing
the studies to be screened at the abstract and title screening stage, and
providing a stopping criterion over the ranked list. The results demonstrate that
automatic methods can be trusted for finding most, if not all, relevant studies
in a fraction of the time that manual screening would require. Given that across different runs many parameters change simultaneously, it is not easy to come to firm conclusions about the relative performance of automatic methods.</p>
      <p>Regarding the benchmark collection itself, there are a number of limitations to be considered: (a) Pivoting on the results of the OVID MEDLINE Boolean query limits our ability to identify all relevant studies, i.e. relevant studies that are returned by Boolean queries over different databases, and relevant studies that are actually not found by these Boolean queries. The former can be overcome by considering all the different queries submitted; for the latter, extra manual judgments would be required. (b) Pivoting on abstract and title only, we miss the opportunity to study the effect of automatic methods when applied to the full text of the studies, which would present an opportunity to completely overcome the multi-stage process of systematic reviews. However, most of the full-text articles are protected under copyright laws that do not give all participants access to them. (c) The evaluation setup of ranking does not allow us to consider the cost of the process, since given a ranking a researcher would still have to go over all ranked studies. A more realistic setup, e.g. a double-screening setup, could be considered. (d) In the construction of the relevance judgments we considered the included and excluded references of the systematic reviews under study, which prevented us from studying the noise and disagreement between reviewers. (e) In our effort to allow iterative algorithms, e.g. active learning algorithms, to be submitted, we handed the test set's relevance judgments directly to the participants, which is rather unusual for this type of evaluation exercise. An alternative would be the setup used by the TREC Total Recall track, where participants submitted their running algorithms to the organizers. (f) When it comes to evaluation measures there is a large variety of them, each of which takes a different, often useful, viewpoint on the effectiveness of an algorithm, but this makes it difficult to decide upon a single golden measure to rank participants' runs.
</p>
      <p>3. Anagnostou, A., Lagopoulos, A., Tsoumakas, G., Vlahavas, I.: HybridRankSVM: A cost-effective hybrid LtR approach for document ranking. In: Working Notes of CLEF 2017 - Conference and Labs of the Evaluation forum, Dublin, Ireland, September 11-14, 2017. CEUR Workshop Proceedings, CEUR-WS.org (2017)
4. Chen, J., Chen, S., Song, Y., Liu, H., Wang, Y., Hu, Q., He, L.: ECNU at 2017 eHealth task 2: Technologically assisted reviews in empirical medicine. In: Working Notes of CLEF 2017 - Conference and Labs of the Evaluation forum, Dublin, Ireland, September 11-14, 2017. CEUR Workshop Proceedings, CEUR-WS.org (2017)
5. Cohen, A.M., Hersh, W.R., Peterson, K., Yen, P.Y.: Reducing workload in systematic review preparation using automated citation classification. Journal of the American Medical Informatics Association 13(2), 206–219 (2006)
6. Cormack, G.V., Grossman, M.R.: Engineering quality and reliability in technology-assisted review. In: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 75–84. SIGIR '16, ACM, New York, NY, USA (2016), http://doi.acm.org/10.1145/2911451.2911510
7. Cormack, G.V., Grossman, M.R.: Technology-assisted review in empirical medicine: Waterloo participation in CLEF eHealth 2017. In: Working Notes of CLEF 2017 - Conference and Labs of the Evaluation forum, Dublin, Ireland, September 11-14, 2017. CEUR Workshop Proceedings, CEUR-WS.org (2017)
8. Goeuriot, L., Kelly, L., Suominen, H., Névéol, A., Robert, A., Kanoulas, E., Spijker, R., Palotti, J., Zuccon, G.: CLEF 2017 eHealth evaluation lab overview. In: CLEF 2017 - 8th Conference and Labs of the Evaluation Forum, Lecture Notes in Computer Science (LNCS). Springer (September 2017)
9. Hollmann, N., Eickhoff, C.: Relevance-based stopping for recall-centric medical document retrieval. In: Working Notes of CLEF 2017 - Conference and Labs of the Evaluation forum, Dublin, Ireland, September 11-14, 2017. CEUR Workshop Proceedings, CEUR-WS.org (2017)
10. Järvelin, K., Kekäläinen, J.: Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. 20(4), 422–446 (Oct 2002), http://doi.acm.org/10.1145/582415.582418
11. Kalphov, V., Georgiadis, G., Azzopardi, L.: SiS at CLEF 2017 eHealth TAR task. In: Working Notes of CLEF 2017 - Conference and Labs of the Evaluation forum, Dublin, Ireland, September 11-14, 2017. CEUR Workshop Proceedings, CEUR-WS.org (2017)
12. Lee, G.E.: Medical document classification for systematic reviews using convolutional neural networks: SysReview at CLEF eHealth 2017. In: Working Notes of CLEF 2017 - Conference and Labs of the Evaluation forum, Dublin, Ireland, September 11-14, 2017. CEUR Workshop Proceedings, CEUR-WS.org (2017)
13. Norman, C., Leeflang, M., Neveol, A.: LIMSI@CLEF eHealth 2017 task 2: Logistic regression for automatic article ranking. In: Working Notes of CLEF 2017 - Conference and Labs of the Evaluation forum, Dublin, Ireland, September 11-14, 2017. CEUR Workshop Proceedings, CEUR-WS.org (2017)
14. Nunzio, G.M.D., Beghini, F., Vezzani, F., Henrot, G.: An interactive two-dimensional approach to query aspects rewriting in systematic reviews. IMS Unipd at CLEF eHealth task 2. In: Working Notes of CLEF 2017 - Conference and Labs of the Evaluation forum, Dublin, Ireland, September 11-14, 2017. CEUR Workshop Proceedings, CEUR-WS.org (2017)</p>
      <p>Systematic review DOIs:
10.1002/14651858.CD010438.pub2/full
10.1002/14651858.CD010775.pub2/full
10.1002/14651858.CD009175.pub2/full
10.1002/14651858.CD011984/full
10.1002/14651858.CD009786.pub2/full
10.1002/14651858.CD008643.pub2/full
10.1002/14651858.CD009579.pub2/full
10.1002/14651858.CD009925/full
10.1002/14651858.CD009944.pub2/full
10.1002/14651858.CD007431.pub2/full
10.1002/14651858.CD007427.pub2/full
10.1002/14651858.CD008803.pub2/full
10.1002/14651858.CD008122.pub2/full
10.1002/14651858.CD009593.pub3/full
10.1002/14651858.CD008782.pub4/full
10.1002/14651858.CD009647.pub2/full
10.1002/14651858.CD009135.pub2/full
10.1002/14651858.CD008760.pub2/full
10.1002/14651858.CD011549/full
10.1002/14651858.CD009263.pub2/full
10.1002/14651858.CD009519.pub2/full
10.1002/14651858.CD009372.pub2/full
10.1002/14651858.CD011134.pub2/full
10.1002/14651858.CD010079.pub2/full
10.1002/14651858.CD010276.pub2/full
10.1002/14651858.CD008081.pub3/full
10.1002/14651858.CD009185.pub2/full
10.1002/14651858.CD011975/full
10.1002/14651858.CD009323.pub2/full</p>
      <p>[Table 3: evaluation results for the submitted runs without thresholding. Table 4: evaluation results for the submitted runs that applied a threshold.]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Alharbi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stevenson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Ranking abstracts to identify relevant evidence for systematic reviews: The university of sheffield's approach to clef ehealth 2017 task 2</article-title>
          . In: Working Notes of CLEF 2017 -
          <article-title>Conference and Labs of the Evaluation forum</article-title>
          , Dublin, Ireland,
          <source>September 11-14</source>
          ,
          <year>2017</year>
          . CEUR Workshop Proceedings, CEUR-WS.org (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. van Altena,
          <string-name>
            <surname>A.J.:</surname>
          </string-name>
          <article-title>Predicting publication inclusion for diagnostic accuracy test reviews using random forests and topic modelling</article-title>
          .
          <source>In: Working Notes of CLEF 2017 - Conference and Labs of the Evaluation forum, Dublin, Ireland, September 11-14, 2017. CEUR Workshop Proceedings, CEUR-WS.org (2017)</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          15.
          <string-name>
            <given-names>O</given-names>
            <surname>'Mara-Eves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Thomas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>McNaught</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Miwa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Ananiadou</surname>
          </string-name>
          ,
          <string-name>
            <surname>S.:</surname>
          </string-name>
          <article-title>Using text mining for study identification in systematic reviews: a systematic review of current approaches</article-title>
          .
          <source>Systematic reviews 4(1)</source>
          ,
          <volume>5</volume>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          16.
          <string-name>
            <surname>Scells</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zuccon</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deacon</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koopman</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Qut ielab at clef 2017 technology assisted reviews track: Initial experiments with learning to rank</article-title>
          . In: Working Notes of CLEF 2017 -
          <article-title>Conference and Labs of the Evaluation forum</article-title>
          , Dublin, Ireland,
          <source>September 11-14</source>
          ,
          <year>2017</year>
          . CEUR Workshop Proceedings, CEUR-WS.org (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          17.
          <string-name>
            <surname>Scells</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zuccon</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koopman</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deacon</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Geva</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Azzopardi</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>A test collection for evaluating retrieval of studies for inclusion in systematic reviews</article-title>
          . In: To appear
          <source>in Proceedings of the 40th international ACM SIGIR conference on Research and development in Information Retrieval. ACM</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          18.
          <string-name>
            <surname>Shemilt</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khan</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thomas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Use of cost-effectiveness analysis to compare the efficiency of study identification methods in systematic reviews</article-title>
          .
          <source>Systematic Reviews</source>
          <volume>5</volume>
          (
          <issue>1</issue>
          ),
          <volume>140</volume>
          (Aug
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          19.
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marshall</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          , Thomas,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Wallace</surname>
          </string-name>
          ,
          <string-name>
            <surname>B.</surname>
          </string-name>
          :
          <article-title>Identifying diagnostic test accuracy publications using a deep model</article-title>
          . In: Working Notes of CLEF 2017 -
          <article-title>Conference and Labs of the Evaluation forum</article-title>
          , Dublin, Ireland,
          <source>September 11-14</source>
          ,
          <year>2017</year>
          . CEUR Workshop Proceedings, CEUR-WS.org (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          20.
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Menzies</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Technologically assisted reviews in empirical medicine: Data balancing or reweighting</article-title>
          . In: Working Notes of CLEF 2017 -
          <article-title>Conference and Labs of the Evaluation forum</article-title>
          , Dublin, Ireland,
          <source>September 11-14</source>
          ,
          <year>2017</year>
          . CEUR Workshop Proceedings, CEUR-WS.org (2017)
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>/w rm ty 8 4</source>
          <volume>7 4 7 8 8 0 7 8 8 8 3 6 6 6 7 0 0 0 0 9 5 6 6 7 5 6 1 6 a d</volume>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <volume>3 9 0 8 9 0 3 4 9 6 0 2 6 7 8 4 6 4 4 4 4 6 7 1 3 7 1 8 0 2 te t</volume>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <source>ts kn leR 1342 3344</source>
          <volume>3099 3155 1972 1873 2678 2441 2404 2382 3727 3727 2503 1877 2068 2333 2305 3124 1464 1161 1469 914 bm i</volume>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <source>D oF 1 1</source>
          <volume>1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 s</volume>
          <source>t l n 7 4</source>
          <volume>4 3 5 7 7 1 3 7 7 7 0 0 8 1 8 5 5 5 5 3 8 6 6 6 9 9 1 0 g</volume>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <source>o n e 3 6</source>
          <volume>6 5 6 3 3 5 5 3 3 3 3 3 1 2 1 3 3 3 3 3 2 3 3 3 3 3 4</volume>
          4 in s u
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>