=Paper=
{{Paper
|id=Vol-1866/invited_paper_12
|storemode=property
|title=CLEF 2017 Technologically Assisted Reviews in Empirical Medicine Overview
|pdfUrl=https://ceur-ws.org/Vol-1866/invited_paper_12.pdf
|volume=Vol-1866
|authors=Evangelos Kanoulas,Dan Li,Leif Azzopardi,Rene Spijker
|dblpUrl=https://dblp.org/rec/conf/clef/KanoulasLAS17
}}
==CLEF 2017 Technologically Assisted Reviews in Empirical Medicine Overview==
Evangelos Kanoulas (1), Dan Li (1), Leif Azzopardi (2), and Rene Spijker (3)
(1) Informatics Institute, University of Amsterdam, Netherlands, E.Kanoulas@uva.nl, D.Li@uva.nl
(2) Computer and Information Sciences, University of Strathclyde, Glasgow, UK, leif.azzopardi@strath.ac.uk
(3) Cochrane Netherlands and UMC Utrecht, Julius Center for Health Sciences and Primary Care, Netherlands, R.Spijker-2@umcutrecht.nl
Abstract. Systematic reviews are a widely used method to provide an
overview of the current scientific consensus, by bringing together multiple
studies in a reliable, transparent way. The large and growing number
of published studies, and their increasing rate of publication, make the
task of identifying all relevant studies in an unbiased way both complex
and time consuming, to an extent that jeopardizes the validity of
the reviews' findings and their ability to inform policy and practice in a timely
manner. The CLEF 2017 e-Health Lab Task 2 focuses on the efficient
and effective ranking of studies during the abstract and title screening
phase of conducting Diagnostic Test Accuracy systematic reviews. We
constructed a benchmark collection of fifty such reviews and the corre-
sponding relevant and irrelevant articles found by the original Boolean
query. Fourteen teams participated in the task, submitting 68 automatic
and semi-automatic runs, using information retrieval and machine learn-
ing algorithms over a variety of text representations, in a batch and
iterative manner. This paper reports both the methodology used to con-
struct the benchmark collection, and the results of the evaluation.
Keywords: Evaluation, Information Retrieval, Systematic Reviews, TAR, Text
Classification, Active Learning
1 Introduction
Evidence-based medicine has become an important pillar in health care and
policy making. In order to practice evidence-based medicine, it is important to
have a clear overview of the current scientific consensus. These overviews are
provided in systematic review articles that summarize all available evidence
regarding a certain topic (e.g., a treatment or diagnostic test). In order to write
a systematic review, researchers have to conduct a search that will retrieve all
the studies that are relevant. The large and growing number of published studies,
and their increasing rate of publication, make the task of identifying relevant
studies in an unbiased way both complex and time consuming, to an extent
that jeopardizes the validity of the reviews' findings and their ability to inform policy
and practice in a timely manner. Hence, automation of this process
becomes of utmost importance. Finding all relevant studies in a corpus is a
difficult task, known in the Information Retrieval (IR) domain as the total recall
problem.
To date, retrieval of evidence to inform systematic reviews is conducted
in multiple stages:
1. Boolean Search: At the first stage, information specialists build a broad
Boolean query expressing what constitutes relevant information. The query
is then submitted to a medical database containing titles, abstracts, and
indexing terms of a controlled vocabulary of medical studies. The result is a
set, A, of potentially interesting studies.
2. Title and Abstract Screening: At the second stage, experts screen the
titles and abstracts of the returned set and decide which of them hold
potential value for their systematic review, yielding a set D. If screening an abstract
has a cost Ca, screening all |A| abstracts has a cost of Ca ∗ |A|.
3. Study Screening: At the third stage, experts download the full text of
the potentially relevant studies, D, identified in the previous phase and
examine the content to decide whether these studies are indeed relevant or
not. Examining a document typically has a larger cost, Cd > Ca. The
result of this second screening is a set of references to be included in the
systematic review.
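The cost model implied by the three stages can be sketched in a few lines. This is only an illustration: the function name and the example sizes and costs are assumptions, not values reported in this paper.

```python
def screening_cost(num_abstracts, num_full_texts, cost_abstract, cost_document):
    """Manual effort of stages 2 and 3: Ca * |A| + Cd * |D|."""
    return cost_abstract * num_abstracts + cost_document * num_full_texts


# Hypothetical review: 5000 abstracts at 1 unit each, 100 full texts at 10 units each.
total = screening_cost(5000, 100, cost_abstract=1, cost_document=10)
```

Because |A| is usually orders of magnitude larger than |D|, the abstract screening term tends to dominate even when Cd is much larger than Ca, which is why the lab concentrates on that stage.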
Unfortunately, the precision of the Boolean searches is typically low, hence re-
viewers often need to look manually through many thousands of irrelevant titles
and abstracts in order to identify a small number of relevant ones. Furthermore,
the recall of the searches is often assumed to be 100%, which may not be the
case.
To overcome some of the limitations of the Boolean search, researchers have
been testing the effectiveness of machine learning and information retrieval meth-
ods. O’Mara-Eves et al.[15] provide a systematic review of the use of text mining
techniques for study identification in systematic reviews.
The goal of this lab is to bring together academic, commercial, and govern-
ment researchers that will conduct experiments and share results on automatic
methods to retrieve relevant studies with high precision and high recall, and
release a reusable test collection that can be used as a reference for compar-
ing different retrieval and mining approaches in the field of medical systematic
reviews.
2 Benchmark Collection
To construct the benchmark collection, the organizers of the task considered
58 systematic reviews on Diagnostic Test Accuracy conducted by Cochrane
researchers. These reviews are publicly available through the Cochrane Library
(http://www.cochranelibrary.com/)
and can be identified by setting the topic filter in the library to "Diagnostic" and
"Diagnostic Test Accuracy" and the stage filter to "Review". At the date of
publication of this article 79 such studies are available; however, the last 22 were
performed after the organizers put the collection together. The 58 systematic
reviews considered can be found in Table 6 in the Appendix of this article.
Participants were provided with two data sets: (a) a development set, and
(b) a test set. The development set consists of 20 topics for Diagnostic Test
Accuracy (DTA) systematic reviews, while the test set consists of 30 topics. For
both sets, one topic file and two files of relevance judgments (qrels), at abstract and
document level respectively, were constructed.
The topic file was generated through the following procedure. For each
systematic review, we reviewed the search strategy from the corresponding study
in the Cochrane Library. A search strategy consists, among other things, of the exact
Boolean query developed and submitted to a medical database at the time the
review was conducted, and can typically be found in the Appendix of the study.
Rene Spijker, a co-author of this work and a Cochrane information specialist,
examined the grammatical correctness of the search query and specified the date
range which dictated the valid dates for the articles to be included in the
systematic review. The date range was necessary because a study published after
the systematic review should not be included even though it might be relevant,
since that would require manually examining its content to quantify its relevance.
Important note: A number of medical databases, and search interfaces to these
databases, are available for search, and for each one information specialists
construct a different variation of their query that better fits the data and meta-data
of the database. For this task, we only considered the Boolean query constructed
for the MEDLINE database, using the Wolters Kluwer Ovid interface. We then
submitted the constructed Boolean query to the OVID system (see footnote 5) and collected all
the returned PubMed document identification numbers (PMIDs) which satisfied
the date range constraint. This step was automated by a Python script we put
together, through an interface available to the University of Amsterdam (see footnote 6).
Out of the 58 reviews, 8 were discarded since the provided Boolean query was
not in the right format, which made it difficult if not impossible to reconstruct
the set of PMIDs; hence the 50 topics in the development and test sets.
The topic file is in a text format and contains four sections, Topic, Title,
Query, and PMIDs, where Topic is the topic ID, a substring of the DOI of the
document (e.g. CD010438 for 10.1002/14651858.CD010438.pub2), and PMIDs are
the document IDs returned by the Boolean query. The PMIDs can be used to
access the corresponding documents through the National Center for Biotechnology
Information (NCBI) (see footnote 7). An example of a topic file can be viewed below.
Footnote 5: http://demo.ovid.com/demo/ovidsptools/launcher.htm
Footnote 6: https://github.com/dli1/tar_data_collection
Footnote 7: https://www.ncbi.nlm.nih.gov/books/NBK25497/
Topic: CD009551
Title: Polymerase chain reaction blood tests for the diagnosis of
invasive aspergillosis in immunocompromised people
Query:
exp Aspergillosis/
exp Pulmonary Aspergillosis/
exp Aspergillus/
(aspergillosis or aspergillus or aspergilloma or "A.fumigatus" or
"A. flavus" or "A. clavatus" or "A. terreus" or "A. niger").ti,ab.
or/1-4
exp Nucleic Acid Amplification Techniques/
pcr.ti,ab.
"polymerase chain reaction*".ti,ab.
or/6-8
5 and 9
exp Animals/ not Humans/
10 not 11
Pmid’s:
25815649
26065322
...
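A topic file in this layout can be read with a short parser. The sketch below is an assumption about how one might do it, not the organizers' script: the function name `parse_topic` is hypothetical, and it assumes that the four section labels appear at the start of a line exactly as in the example above and never inside a query line.

```python
def parse_topic(text):
    """Parse one topic file into a dict with keys topic, title, query, pmids."""
    sections = {"topic": [], "title": [], "query": [], "pmids": []}
    # Section labels as they appear in the example topic file.
    labels = {"topic:": "topic", "title:": "title", "query:": "query", "pmid's:": "pmids"}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Treat a typographic apostrophe like a plain one when matching labels.
        lowered = line.lower().replace("\u2019", "'")
        for label, key in labels.items():
            if lowered.startswith(label):
                current = key
                line = line[len(label):].strip()
                break
        if current is not None and line:
            sections[current].append(line)
    return {
        "topic": " ".join(sections["topic"]),
        "title": " ".join(sections["title"]),  # titles may wrap over several lines
        "query": sections["query"],            # one Ovid query line per entry
        "pmids": [p for p in sections["pmids"] if p.isdigit()],
    }


example = """Topic: CD009551
Title: Polymerase chain reaction blood tests for the diagnosis of
invasive aspergillosis in immunocompromised people
Query:
exp Aspergillosis/
pcr.ti,ab.
Pmid's:
25815649
26065322
"""
parsed = parse_topic(example)
```

Keeping the query as a list of lines preserves the line numbering that Ovid expressions such as `or/1-4` refer to.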
For the construction of the qrel files, we considered the reference section of the
50 systematic reviews. The references are split into three categories: Included,
Excluded, and Additional. Included are the studies that are relevant to the
systematic review. Excluded are the studies that were considered relevant in the
abstract and title screening stage, but at the article screening phase were considered
irrelevant to the study and hence excluded from it. Additional are additional
references that do not impact the outcome of the study, and are hence irrelevant to it.
The included references were the relevant studies in the document-level qrels,
while both the included and excluded references were considered relevant in the
abstract-level qrels. The format of the qrels followed the standard TREC format:
Topic Iteration Document Relevance
where Topic is the topic ID of the systematic review, Iteration in our case is a
dummy field that is always zero and not used, Document is the PMID, and Relevance
is a binary code of 0 for not relevant and 1 for relevant studies. The order
of documents in the qrel files is not indicative of relevance. Studies that were
returned by the Boolean query but were not relevant based on the above process
were considered irrelevant. Those are studies that were excluded at the abstract
and title screening phase. All other documents in MEDLINE were also assumed
to be irrelevant, given that they were not judged by the human assessors.
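The four-column qrel format can be loaded as follows. This is a minimal sketch with a hypothetical function name; the relevance labels in the sample lines are made up for illustration and are not taken from the released qrels.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Read TREC-style qrel lines: Topic Iteration Document Relevance."""
    relevant = defaultdict(set)  # topic -> PMIDs judged relevant
    judged = defaultdict(set)    # topic -> all PMIDs listed for the topic
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic, _iteration, pmid, rel = parts  # the iteration field is unused
        judged[topic].add(pmid)
        if int(rel) > 0:
            relevant[topic].add(pmid)
    return relevant, judged


# Illustrative lines only; the labels are invented.
sample = [
    "CD009551 0 25815649 1",
    "CD009551 0 26065322 0",
]
relevant, judged = parse_qrels(sample)
```

Any PMID absent from `judged` for a topic would be treated as irrelevant, in line with the assumption stated above.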
Important Note: As mentioned earlier, the references of a systematic
review were produced after a number of Boolean queries were submitted
to a number of medical databases, and their titles and abstracts were screened.
The PMIDs provided, however, were only those that came out of the MEDLINE
query. Therefore, there were a number of abstract-level relevant studies (the gray
area in the Venn diagram below) that were not part of the result set of the
Boolean query provided to the participants. For the development set, the qrel
file contained those additional PMIDs, for those participants that would decide
to search the entire MEDLINE database, and not only consider the studies
provided to them in the Topic files. To the best of our knowledge, no one submitted
such a system; hence, to avoid any bias, we excluded those relevant studies from
the test set.
[Figure: Venn diagram of the MEDLINE database, the Boolean query results, and the relevant studies.]
Table 1 shows the distribution of the relevant documents at abstract and
document level for all the topics in the development set and the test set. The total
number of unique PMIDs is 149,405 for the development set and 117,562 for
the test set. The percentages of relevant documents at abstract level are quite
close: 1.88% for the development set and 1.58% for the test set. This
is not true at document level, however, where the percentage of relevant documents
is considerably higher in the test set (0.52%) than in the development set (0.33%).
In [17], a test collection
was developed based on a random selection of 93 Cochrane systematic reviews
(not just DTAs), and a slightly higher rate of relevance was reported (14/1159 = 1.2%).
By comparison, in the TREC campaigns the rate of relevant documents is
5.45% for the Adhoc track of TREC 8 and 2.78% for the Web track of TREC 2002.
Overall, the number of relevant documents is not very high in this lab, making
locating them quite a difficult task.
File  Topic     # total PMIDs  # abs rel  # doc rel  % abs rel  % doc rel
Development Set
1     CD010438           3250         39          3       1.20       0.09
11    CD007427           1521        123         17       8.09       1.12
14    CD009593          14922         78         24       0.52       0.16
19    CD011549          12705          2          1       0.02       0.01
23    CD011134           1953        215         49      11.01       2.51
28    CD008686           3966          7          5       0.18       0.13
33    CD011975           8201        619         60       7.55       0.73
35    CD009323           3881        122          9       3.14       0.23
37    CD009020           1584        162         12      10.23       0.76
38    CD011548          12708        113          5       0.89       0.04
4     CD011984           8192        454         28       5.54       0.34
43    CD010409          43363         76         41       0.18       0.09
44    CD008054           3217        274         41       8.52       1.27
45    CD010771            322         48          1      14.91       0.31
50    CD009591           7991        144         41       1.80       0.51
53    CD008691           1316         73         20       5.55       1.52
54    CD010632           1504         32         14       2.13       0.93
55    CD007394           2545         95         47       3.73       1.85
6     CD008643          15083         11          4       0.07       0.03
9     CD009944           1181        117         64       9.91       5.42
total                  149405       2804        486       1.88       0.33
Test Set
10    CD007431           2074         24         15       1.16       0.72
12    CD008803           5220         99         99       1.90       1.90
15    CD008782          10507         45         34       0.43       0.32
16    CD009647           2785         56         17       2.01       0.61
17    CD009135            791         77         19       9.73       2.40
18    CD008760             64         12          9      18.75      14.06
2     CD010775            241         11          4       4.56       1.66
21    CD009519           5971        104         46       1.74       0.77
22    CD009372           2248         25         10       1.11       0.44
25    CD010276           5495         54         24       0.98       0.44
26    CD009551           1911         46         16       2.41       0.84
27    CD012019          10317          3          1       0.03       0.01
29    CD008081            970         26         10       2.68       1.03
31    CD009185           1615         92         23       5.70       1.42
32    CD010339          12807        114          9       0.89       0.07
34    CD010653           8002         45          0       0.56       0.00
36    CD010542            348         20          8       5.75       2.30
39    CD010896            169          6          3       3.55       1.78
40    CD010023            981         52         14       5.30       1.43
41    CD010772            316         47         11      14.87       3.48
42    CD011145          10872        202         48       1.86       0.44
47    CD010705            114         23         18      20.18      15.79
48    CD010633           1573          4          3       0.25       0.19
49    CD010173           5495         23         10       0.42       0.18
5     CD009786           2065         10          6       0.48       0.29
51    CD010386            626          2          1       0.32       0.16
56    CD010783          10905         30         11       0.28       0.10
57    CD010860             94          7          4       7.45       4.26
7     CD009579           6455        138         79       2.14       1.22
8     CD009925           6531        460         55       7.04       0.84
total                  117562       1857        607       1.58       0.52
Table 1. Statistics of development and test set.
Important Note: As one can observe in Table 1, there are topics for which
the output of the Boolean query is rather narrow, with as few as 64 studies to be
reviewed for topic CD008760. Cochrane conducts systematic reviews on a
regular basis, in an attempt to update each review every two to three years. Some
of the reviews considered for the construction of the benchmark collection, such
as the CD008760 review, are updates to previous reviews. These updates only
specify a query for a time range that starts after the last review on the topic
was conducted. The 64 studies are thus the output of the Boolean query for
this short time range, which explains their small number. If the Boolean query were run
against the entire MEDLINE database, the number of studies would be in the
range of tens of thousands, as is the case for some other reviews considered, e.g.
CD008782.
3 Task Description
The CLEF 2017 e-Health Lab [8], Task 2, focused on retrieving studies for
conducting Diagnostic Test Accuracy (DTA) systematic reviews. Retrieval in this
area is generally considered very difficult, since sensitive searches result in large
quantities of references to be screened manually, and a breakthrough in this field
would likely be applicable to other areas as well. The task focused on the
second stage of the process, i.e. given the results of a Boolean search, how to
make abstract and title screening more effective and efficient. Currently, a typical
number needed to read (NNR), the number of studies to screen to identify
1 eligible study, for DTA systematic reviews is approximately 80 when applied
to potential abstracts that need further full text assessment. With an average
of 7000 results to be screened, which would take approximately 120 hours to
screen (1 minute per abstract [18]), reducing the workload in this process
would be of huge benefit.
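The workload figure can be checked with a quick calculation. The first two constants come from the text; the stopping fraction is a purely hypothetical value chosen to illustrate what an effective ranking could save.

```python
results_per_review = 7000   # average Boolean result size quoted in the text
minutes_per_abstract = 1    # screening time per abstract, as in [18]

# Screening everything: 7000 minutes, i.e. roughly 117 hours.
full_screening_hours = results_per_review * minutes_per_abstract / 60

# Hypothetical: a ranking that lets reviewers stop after 40% of the list.
stop_fraction = 0.4         # assumed value, for illustration only
hours_saved = full_screening_hours * (1 - stop_fraction)
```

Under that assumed stopping point, about 70 of the roughly 117 hours would be saved per review.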
Given the results of the Boolean search from stage 1 as the starting point,
participants were asked to rank the set of the provided abstracts. The task
had two goals: (i) to produce an efficient ordering of the documents, such that
all the relevant abstracts are retrieved above the irrelevant ones, and (ii) to
identify the relevant subset of abstracts to be shown to a user, that is, a stopping
point in the ranked list of abstracts where a researcher could confidently stop
screening abstracts and titles. Therefore, we solicited two types of submissions:
(i) ranking submissions: automatic or manual methods that rank all abstracts,
with the goal of retrieving relevant abstracts as early in the ranking as possible,
and (ii) thresholding submissions: methods that also decide where in the ranking
to stop; thresholding can be performed in a batch or iterative manner as well.
We also considered two evaluation frameworks: (a) a simple evaluation, and
(b) a cost-effective evaluation. The assumption behind the simple evaluation
framework is the following: The user of the system is the researcher that
performs the abstract and title screening of the retrieved articles. Every time an
abstract is returned (i.e. ranked) there is an incurred cost/effort of CA, while
the abstract is either irrelevant (in which case no further action will be taken)
or relevant (and hence passed to the next stage of document screening) to the
topic under review. The assumption behind the cost-effective evaluation is the
following: The user that performs the screening is not the end-user. The user
can interchangeably perform abstract and title screening, or document screening,
and decide what PMIDs to pass to the end-user. Every time an abstract
is returned the user can either (a) read the abstract (with an incurred cost of
CA) and decide whether to pass this PMID to the end-user, or (b) read the
full document (with an incurred cost of CA + CD) and decide whether to pass
this PMID to the end-user, or (c) directly pass the PMID to the end-user (with
an incurred cost of 0), or (d) directly discard the PMID and not pass it to the
end-user (with an incurred cost of 0). For every PMID passed to the end-user
there is a cost attached to it: CA if the abstract passed on is not relevant,
and CA + CD if the abstract passed on is relevant (that is, we assume that
the end-user completes a two-round abstract and document screening, as usual,
but only for the PMIDs the algorithm+feedback user decided to be relevant).
Although a small number of teams participated in the cost-effective sub-task,
the lab focused on the simple evaluation sub-task, and this is what is described
in the remainder of this report.
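One possible reading of the cost-effective framework is sketched below. It is an interpretation, not the lab's scoring code: the function and action names are hypothetical, the value of CD is an assumption (the text gives none), and the sketch assumes that the screening cost and the end-user cost simply add up.

```python
def cost_effective_total(decisions, relevant, ca=1.0, cd=5.0):
    """Total cost for a list of (pmid, action, passed) decisions.

    action: "abstract" (read abstract), "document" (read full text),
            "pass" (pass on unread) or "discard" (drop unread).
    passed: whether the PMID is handed to the end-user.
    relevant: set of PMIDs that are relevant.
    """
    action_cost = {"abstract": ca, "document": ca + cd, "pass": 0.0, "discard": 0.0}
    total = 0.0
    for pmid, action, passed in decisions:
        total += action_cost[action]  # effort spent by the screening user
        if passed:
            # The end-user screens the abstract, and the full text if relevant.
            total += (ca + cd) if pmid in relevant else ca
    return total


decisions = [
    ("1", "abstract", True),   # read abstract, pass on (relevant)
    ("2", "pass", True),       # pass on unread (not relevant)
    ("3", "discard", False),   # drop unread
    ("4", "document", False),  # read full text, do not pass on
]
total = cost_effective_total(decisions, relevant={"1"})
```

With these assumed costs the four decisions contribute 7, 1, 0 and 6 units respectively.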
4 Evaluation
Evaluation within the context of using technology to assist in the reviewing
process depends very much on how the user(s) interact with the system, and on
the goal of the technology assistance. For example, the goal of the assistance may be
to automate the screening process, where the system assesses all the abstracts and
returns a subset of the initial set to be screened by the end-user (i.e. screened in
batch mode). Or, it could be to identify all the relevant documents as soon
as possible, in an iterative manner, where the system asks for feedback from the
end-user to help improve the ranking. Of course, an open problem then is to
decide when to stop requesting feedback, and when to stop assessing abstracts.
In that case a subset of abstracts is identified, which consists of the abstracts that have
been screened during the feedback cycles and the remainder that are screened
but are not used for feedback (i.e. in batch mode). There are, of course, many
other possible variations. For the purposes of this initial track/task, we consider
the problem as a ranking task, that is, to rank the set of documents associated
with the topic in decreasing order of relevance. We consider a document relevant
if the abstract passed the abstract screening phase (regardless of whether it was
included or excluded from the study).
For this task we employ a number of standard measures, typically used in
IR ranking evaluations, along with other measures from related tracks and some
new measures we have developed.
– Standard Measures
• Average Precision (AP)
• Normalized cumulative gain @ 0% to 100% of documents shown; for the
simple case that judgments are binary, normalized cumulative gain @ %
is simply Recall @ % of shown documents [10]
• Number of Relevant Found ($n_r$)
• Recall $r = n_r / R$, where $R$ is the total number of relevant documents
• Number of documents returned/shown ($n$)
– Related Measures (from [6,5])
• LOSS-R: $\mathrm{loss}_r = (1 - r)^2$
• LOSS-E: $\mathrm{loss}_e = \left(\frac{n}{R + 100} \cdot \frac{100}{N}\right)^2$, where $N$ is the size of the
collection
• Reliability $= \mathrm{loss}_r + \mathrm{loss}_e$ [6]
• Work Saved over Sampling at recall $r$: $WSS@r = \frac{TN + FN}{N} - (1 - r)$ [5]
– Proposed Measures
• Last Rel Found: Minimum number of documents returned to retrieve all
$n_r$ relevant documents
• Total Cost (TC)
• Total Cost with Uniform Penalty (TCU)
• Total Cost with Weighted Penalty (TCW)
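Several of the measures listed above can be computed from a single ranked list, as in the following sketch. It is not the official evaluation script; the function name is hypothetical, and WSS and the cost measures are left out for brevity.

```python
def ranking_measures(ranking, relevant, collection_size=None):
    """ranking: document ids shown, best first; relevant: set of relevant ids."""
    R = len(relevant)
    n = len(ranking)
    N = collection_size if collection_size is not None else n
    hits, precision_sum, last_rel = 0, 0.0, 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
            last_rel = rank               # rank of the last relevant document so far
    recall = hits / R if R else 0.0
    loss_r = (1 - recall) ** 2
    loss_e = (n / (R + 100) * 100 / N) ** 2
    return {
        "ap": precision_sum / R if R else 0.0,
        "nr": hits,
        "recall": recall,
        "n": n,
        "last_rel": last_rel,
        "loss_r": loss_r,
        "loss_e": loss_e,
        "reliability": loss_r + loss_e,
    }


# Toy example: four documents, two of them relevant.
m = ranking_measures(["a", "b", "c", "d"], relevant={"a", "c"})
```

In the toy example the relevant documents sit at ranks 1 and 3, so AP is (1/1 + 2/3) / 2 and Last Rel Found is 3.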
To calculate the cost-based measures, we considered three possible
interactions to support a range of different ways to screen the items and to utilize
feedback when ranking. We consider the following possibilities:
1. Suppose we have a ranking algorithm which uses no feedback from the user and
simply ranks the list of abstracts. The list is then presented to the end-user,
who evaluates them in a batch. In this case, no feedback is requested, and
abstracts are marked NF.
2. Suppose we have a ranking algorithm which uses feedback (i.e. abstract(s)
are presented to the user, feedback on their relevance is obtained, which is
then used by the algorithm, thus simulating online feedback from the user).
In this case, each abstract for which feedback from the user is requested
is marked AF, while if no feedback is requested it is marked NF.
Abstracts marked NF are then presented to the end-user to evaluate in a
final batch.
3. For either of the above options, the algorithm may decide that an abstract is not
relevant, and thus does not need to be shown to a user; such abstracts are marked
NS.
To calculate the total cost (TC), we calculated:
$$TC = \#NF \cdot C_a + \#AF \cdot (C_a + C_f) \qquad (1)$$
where $C_a$ is the cost of assessing the abstract, $C_f$ is the cost of asking for feedback,
$\#NF$ is the number of NF items, and $\#AF$ is the number of AF items.
We also created two additional cost measures which include a penalty for
missing relevant abstracts: (a) a uniform penalty and (b) a weighted penalty.
The uniform penalty was calculated as follows:
$$TCU = TC + \frac{R - r}{R} \cdot (N - n) \cdot C_p \qquad (2)$$
where $C_p$ is the cost of the penalty of missing a relevant abstract, $N$ is the
total number of documents in the set for the topic, and $r$ here counts the
relevant abstracts found, so that $R - r$ relevant abstracts are missing. The assumption behind this
penalty is that the end-user would need to continue examining abstracts until
they find the remaining $(R - r)$ relevant items, and encounters them
at a uniform rate in the remaining $N - n$ abstracts which were not shown. So
if half the relevant items were missing, then the penalty component would be
$(N - n) C_p / 2$. If no relevant items were missing, the penalty component would
be zero.
The weighted penalty was calculated as follows:
$$TCW = TC + \sum_{i=1}^{R - r} \frac{1}{2^i} (N - n) \cdot C_p \qquad (3)$$
where the assumption is that the end-user would need to examine half of the
remaining documents to find the next relevant abstract, per missing relevant
abstract. So if all relevant items were missing, then the summation would tend
to one, and the penalty component tends to $(N - n) \cdot C_p$, while if only one
relevant item is missing, the penalty component is $(N - n) \cdot C_p / 2$.
To compute these measures we set $C_a = 1$, $C_f = 2$ and $C_p = 2$, to represent
the relative costs of the different actions. Note that these values are not based on any
empirical data and are used only as a way to penalize feedback and no-shows.
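The three cost measures can be sketched together as below. The function name and the example counts are assumptions made for illustration; `r` denotes the number of relevant abstracts found, so that `R - r` are missing, which is the reading suggested by the explanation of the penalties.

```python
def total_costs(num_nf, num_af, R, r, N, n, ca=1.0, cf=2.0, cp=2.0):
    """Return (TC, TCU, TCW) for one topic.

    num_nf, num_af: abstracts shown without / with a feedback request.
    R: relevant abstracts for the topic; r: relevant abstracts found.
    N: abstracts in the topic; n: abstracts shown (NF + AF).
    """
    tc = num_nf * ca + num_af * (ca + cf)
    missing = R - r
    not_shown = N - n
    # Uniform penalty: missing items are met at a uniform rate in the unseen rest.
    tcu = tc + (missing / R) * not_shown * cp if R else tc
    # Weighted penalty: each further missing item halves the remaining search.
    tcw = tc + sum(1 / 2 ** i for i in range(1, missing + 1)) * not_shown * cp
    return tc, tcu, tcw


# Hypothetical topic: 100 abstracts, 80 shown (20 with feedback), 5 of 10 relevant found.
tc, tcu, tcw = total_costs(num_nf=60, num_af=20, R=10, r=5, N=100, n=80)
```

With half of the relevant abstracts missing, the uniform penalty equals (N − n)·Cp/2, as stated above, while the weighted penalty is already close to its limit of (N − n)·Cp.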
5 Participants
Fourteen groups from eleven countries submitted a total of 68 runs for this task:
1. Amsterdam Medical Center, The Netherlands (AMC)
2. Aristotle University of Thessaloniki, Greece (AUTH)
3. Centre National de la Recherche Scientifique, France & Amsterdam Medical
Center, The Netherlands (CNRS)
4. East China Normal University, China (ECNU)
5. Eidgenoessische Technische Hochschule Zurich, Switzerland (ETH)
6. International Institute of Information Technology, Hyderabad, India (IIIT)
7. North Carolina State University, United States (NCSU)
8. Nanyang Technological University, Singapore (NTU)
9. University of Padua, Italy (Padua)
10. University of Sheffield, United Kingdom (Sheffield)
11. University College London, United Kingdom & Northeastern University,
USA (UCL)
12. University of Waterloo, Canada (Waterloo)
13. Queensland University of Technology & CSIRO, Australia (QUT)
14. University of Strathclyde, United Kingdom (UOS)
Table 2 categorizes the participating runs along five dimensions: (a) automatic
vs manual runs; (b) use of the development set; (c) use of supervised
and semi-supervised learning algorithms; (d) use of relevance feedback; and (e)
thresholding the ranked list of articles. The categorization has been performed
by the lab coordinators, not by the participants, based on the participants'
submitted descriptions of their algorithms. Hence, there is always a chance of
mis-classifying some run.
Team      Run                         Auto  Development  Supervised  Feedback  Threshold
AMC       amc.run.res                 X     X            X           x         x
AUTH      simple.run1/run2/run3/run4  X     X            X           X         x
BASELINE  BM25                        X     x            x           x         x
BASELINE  random.pubmed               X     x            x           x         x
CNRS      cnrs.abrupt.all             X     X            X           X         x
CNRS      cnrs.gradual.all            X     X            X           X         x
CNRS      cnrs.noaf.all               X     X            X           x         x
CNRS      cnrs.noaffull.all           X     X            X           x         x
ECNU      run1                        X     x            x           x         x
ECNU      run2                        X     X            X           x         X
ECNU      run3                        X     X            X           x         X
ETH       m1                          X     X            X           x         X
ETH       m2                          X     X            X           X         X
ETH       m4                          X     X            X           x         X
IIIT      run1/run2/run3/run4         X     x            x           X         X
NCSU      simple                      X     x            X           X         X
NCSU      abs                         X     x            X           X         X
NTU       run1/run2/run3              X     X            X           x         x
Padua     iafa_m10k150f0m10           x     X            X           x         x
Padua     iafap_m10p2f0m10            x     X            X           x         x
Padua     iafap_m10p5f0m10            x     X            X           x         x
Padua     iafas_m10k50f0m10           x     X            X           x         x
QUT       ca_bool_ltr                 X     X            X           x         x
QUT       ca_pico_ltr                 x     X            X           x         x
QUT       rf_bool_ltr                 X     X            X           x         x
QUT       rf_pico_ltr                 x     X            X           x         x
QUT       bool_es                     X     x            x           x         x
QUT       pico_es                     x     x            x           x         x
Sheffield run1/run2/run3/run4         X     x            x           x         x
UCL       abstract                    X     X            X           x         x
UCL       fulltext                    X     X            X           x         x
UOS       sis.AL30Q_BM25              X     x            x           X         X
UOS       sis.TMBEST_BM25             X     x            x           x         x
UOS       sis.TMAL30Q_BM25            X     x            x           X         x
UOS       sis.bm25_t1.5               X     x            x           x         X
UOS       sis.bm25_t1                 X     x            x           x         X
UOS       sis.bm25_t2.5               X     x            x           x         X
UOS       sis.bm25_t2                 X     x            x           x         X
Waterloo  A-rank-normal.txt           X     x            X           X         x
Waterloo  A-thresh-normal.txt         X     x            X           X         X
Waterloo  B-rank-normal.txt           X     x            X           X         x
Waterloo  B-thresh-normal.txt         X     x            X           X         X
Table 2. Categorization of participants' runs in the simple evaluation framework along
five dimensions (X: yes, x: no).
Out of the 68 runs submitted, 52 focused on the simple
evaluation framework, while 16 focused on the cost-effective one. Out of the 52
runs submitted for the simple sub-task, 35 ranked all the PMIDs that were returned by
the Boolean query, while 17 tested different stopping criteria over the ranking.
Participants employed both supervised and unsupervised methods for ranking
articles. A large number of runs were trained over the provided development
set, and their generalization was tested against the test topics. 26 runs used the
development set in some fashion, while 26 made no explicit use of it. It may be
the case that participants tried different models and algorithms over the
development set and chose to submit the best performing ones, hence there may
be a flavor of model selection; however, we did not consider this as use of the
development set. Participants represented the textual data in a variety of ways,
including document-topic features, bag-of-words, topic model distributions,
embeddings, and metadata. In the remainder of this section, by article we mean the abstract
and the title of an article. We are not aware of any participant that worked on
the full text of these articles.
In particular, AMC took a batch supervised approach, training a Random
Forest over a topic model representation of the articles. A 75-topic model was
fitted over all articles in the collection, and the Topic-to-Document matrix was
used to extract features [2].
AUTH took a learning-to-rank approach, using both batch and active learn-
ing. Their model, HybridRankSVM, consists of two parts: an inter-topic model
which utilizes XGBoost and is trained over the entire development corpus and
an intra-topic model, an iteratively-built SVM, trained over relevance feedback
provided partially in the test topics. For the inter-topic model a total of 24 topic-
document (or solely topic) features were computed over the title, abstract and
MeSH terms of the articles and the query. For the intra-topic model a TF-IDF
vectorization of the articles was used [3].
CNRS trained a logistic regression model on n-gram features from the titles
and abstracts and structured data from the Medline citations. One of their mod-
els was trained using stochastic gradient descent on the majority of the features,
and one on the principal components of a subset of the features. Class imbal-
ance was handled by reweighting and undersampling, while two approaches for
relevance feedback were investigated [13].
ECNU took a learning-to-rank approach, using BM25, PL2, and BB2 as
features. The trained model was also combined with a vector space model [4].
ETH used a LambdaMART model trained on features such as BM25, fuzzy
search, vector content representations, and publishing data. This model was used to
experiment with different stopping criteria. One of the approaches taken was to
use minimal relevance feedback to estimate the distribution of positive samples
by score. This was done by sampling from the articles, preferring articles with
higher score. A Gaussian distribution was fitted on the positive samples and
the resulting biased distribution was corrected. The correction worked by first
adapting the mean and then iteratively finding the standard deviation matching
the sampled data the best. For more details the reader can refer to [9].
NCSU adopted a continuous active learning framework for this task. An
SVM classifier was trained on the relevance feedback labels, and undersampling
of the negatively labeled articles, removing those furthest from the SVM decision
hyperplane, was employed. Different runs made use of different weights on
the labels depending on whether the abstract or the full text was considered
relevant [20].
NTU examined the role of convolutional neural networks for classifying med-
ical articles for systematic reviews [12].
Padua used a two-dimensional probabilistic version of BM25 to rank articles.
The parameters were tuned using the development set. Further, the top abstract
returned by BM25 was provided to two non-experts who generated one additional
query each. The three queries were then used to re-rank articles. Different
approaches for relevance feedback and thresholding were investigated [14].
QUT trained a learning-to-rank model using domain specific features. As
domain specific features, PICO annotations (Population, Intervention, Control,
Outcome) were used; these were extracted automatically from articles and man-
ually from the Boolean queries [16].
Sheffield automatically parsed the Boolean queries to extract both the terms
and the MeSH headings, and used TF-IDF cosine similarity to calculate the similarity
score between the query and the document titles and abstracts [1].
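As a generic illustration of ranking by TF-IDF cosine similarity (not the Sheffield implementation), the sketch below scores tokenized articles against a list of query terms. The function name, the smoothed IDF formula and the toy documents are all assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_rank(query_terms, documents):
    """Rank documents (id -> list of tokens) by TF-IDF cosine similarity to the query."""
    n_docs = len(documents)
    df = Counter(term for tokens in documents.values() for term in set(tokens))
    # Smoothed inverse document frequency (an assumed weighting variant).
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}

    def vector(tokens):
        tf = Counter(t for t in tokens if t in idf)
        return {t: f * idf[t] for t, f in tf.items()}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
            sum(w * w for w in v.values())
        )
        return dot / norm if norm else 0.0

    q = vector(query_terms)
    scores = {doc_id: cosine(q, vector(tokens)) for doc_id, tokens in documents.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy collection; ids and tokens are invented.
docs = {
    "d1": ["aspergillosis", "pcr", "blood"],
    "d2": ["diabetes", "insulin"],
    "d3": ["pcr", "assay"],
}
ranked = tfidf_rank(["aspergillosis", "pcr"], docs)
```

In the toy collection the article sharing both query terms is ranked first and the one sharing none is ranked last.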
UOS explored two methods: (i) topic models, where they used Latent Dirichlet
Allocation to identify topics within the set of retrieved articles, and then ranked
articles by the topic most likely to be relevant to the query, and (ii) relevance
feedback, where they used Rocchio’s algorithm to update the query model for
subsequent rounds of interaction. A third approach combined the topic model
and relevance feedback approaches to quickly identify the relevant articles. For
the thresholding task, they applied a score threshold over BM25 [11].
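The Rocchio update used in method (ii) can be sketched as follows. The weights alpha, beta, and gamma and the dictionary-based vector representation are illustrative assumptions; the settings actually used by the UOS team are not given here and are described in [11].

```python
from collections import defaultdict


def rocchio_update(query, relevant_docs, nonrelevant_docs,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """Return an updated query vector from judged documents.

    All vectors are dicts mapping term -> weight. The default alpha, beta and
    gamma are commonly quoted textbook values, used here only as placeholders.
    """
    updated = defaultdict(float)
    for term, weight in query.items():
        updated[term] += alpha * weight
    # Move the query towards the centroid of the relevant documents.
    for doc in relevant_docs:
        for term, weight in doc.items():
            updated[term] += beta * weight / len(relevant_docs)
    # Move the query away from the centroid of the non-relevant documents.
    for doc in nonrelevant_docs:
        for term, weight in doc.items():
            updated[term] -= gamma * weight / len(nonrelevant_docs)
    # Terms whose weight is not positive are dropped from the new query.
    return {term: weight for term, weight in updated.items() if weight > 0.0}
```

In an interactive setting, the updated query would be used to re-rank the remaining unjudged articles before the next round of feedback.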
UCL took a supervised approach and trained a deep model architecture to
identify studies pertaining to a given review topic [19].
Waterloo applied the Baseline Model Implementation (BMI) from the TREC
Total Recall Track (2015-2016). They further applied their "knee-method" stop-
ping criterion to BMI to determine how many abstracts should be examined for
each topic [7].
6 Results
Tables 7, 8, 9, and 10 provide the results for a selection of the evaluation measures
for all participating runs, against both the abstract-level and the document-level
relevance judgments, for the simple evaluation scenario. Figure 1 shows the
corresponding box plots for Average Precision, with the Mean Average Precision
against the abstract-level and document-level judgments, respectively, denoted by a
blue rectangle over the box plot.
In the following subsections we present results separately for ranking and
thresholding runs, so that comparisons can be more meaningful.
Fig. 1. Average precision using the abstract (top) and document (bottom) level
relevance judgments.
6.1 Ranking Abstracts
Table 3 presents a number of evaluation measures for those runs that ranked the
entire set of articles provided by the original Boolean queries; no thresholding
has been applied. As can be seen in Tables 7, 8, 9, and 10, some runs still missed
a number of documents even though they applied no stopping criterion.
There may be multiple reasons for that, e.g. missing some topic, or not being
able to download the abstract text, since participants were provided with PIDs
only. The number of documents for which feedback was requested appears in the
second column of the table, while the remaining columns report different
measures of performance.
Figure 2 shows the recall-effort curves for the participating runs, that is, the
recall value at different percentages of documents shown to the user. The straight
pink line with the triangular markers on x=y is the result of the Boolean query
randomly shuffled, and it serves as a naive baseline, provided by the UOS team.
The brown curve with the triangular markers is the BM25 retrieval function,
also provided by the UOS team as a baseline; it ranks abstracts by BM25 over
the Boolean query terms, with the default BM25 parameter settings.
Fig. 2. Recall at different percentage of shown documents.
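A recall-effort curve of this kind can be computed for a single topic from a ranked list and the set of relevant documents. The sketch below is illustrative only: the function name and the percentage grid are assumptions, and it is not the evaluation script used in the lab.

```python
import math


def recall_at_percentages(ranking, relevant, percentages=(10, 20, 50, 100)):
    """Recall after showing the top p% of the ranked documents, for each p."""
    relevant = set(relevant)
    if not ranking or not relevant:
        return {p: 0.0 for p in percentages}
    curve = {}
    for p in percentages:
        # Number of documents shown at this effort level, rounded up.
        cutoff = math.ceil(len(ranking) * p / 100)
        found = sum(1 for doc_id in ranking[:cutoff] if doc_id in relevant)
        curve[p] = found / len(relevant)
    return curve
```

For a randomly shuffled list the expected recall at p% effort is p%, which is why the random baseline follows the x=y line; a useful ranking lies above it.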
Run | Feedback | Last Rank Rel | wss@100 | wss@95 | Area Under Recall | AP
amc.run 0 2913 0.249 0.333 0.761 0.129
auth.simple.run1 41337 2143 0.519 0.693 0.928 0.297
auth.simple.run2 41377 2124 0.521 0.697 0.920 0.293
auth.simple.run3 23337 2183 0.511 0.678 0.924 0.285
auth.simple.run4 41537 2119 0.519 0.690 0.920 0.293
BASELINE.BM25 0 2851 0.285 0.400 0.809 0.174
BASELINE.pubmed.random 0 3722 0.040 0.034 0.484 0.045
cnrs.abrupt.all 19980 3414 0.173 0.243 0.735 0.143
cnrs.gradual.all 23683 3406 0.195 0.288 0.708 0.146
cnrs.noaf.all 0 2993 0.261 0.362 0.780 0.145
cnrs.noaffull.all 0 2250 0.412 0.497 0.839 0.179
ecnu.run1 0 3633 0.099 0.121 0.627 0.091
ntu.run1 0 3403 0.089 0.108 0.612 0.078
ntu.run2 0 3204 0.117 0.131 0.595 0.060
ntu.run3 0 3570 0.091 0.075 0.538 0.052
padua.iafa_m10k150f0m10 2350 2269 0.415 0.508 0.896 0.280
padua.iafap_m10p2f0m10 2367 2395 0.366 0.476 0.875 0.253
padua.iafap_m10p5f0m10 5893 2260 0.398 0.496 0.885 0.269
padua.iafas_m10k50f0m10 4320 2304 0.410 0.517 0.892 0.266
qut.ca_bool_ltr 0 3142 0.201 0.288 0.733 0.114
qut.ca_pico_ltr 0 3344 0.212 0.294 0.751 0.153
qut.rf_bool_ltr 0 3099 0.194 0.267 0.705 0.106
qut.rf_pico_ltr 0 3155 0.235 0.293 0.727 0.121
sheffield.run1 0 2678 0.310 0.422 0.818 0.170
sheffield.run2 0 2441 0.385 0.493 0.845 0.218
sheffield.run3 0 2404 0.384 0.473 0.841 0.199
sheffield.run4 0 2382 0.395 0.488 0.847 0.218
ucl.run_abstract 0 3801 0.072 0.064 0.507 0.060
ucl.run_fulltext 0 3755 0.077 0.076 0.522 0.053
uos.sis.TMAL30Q_BM25 35432 2305 0.398 0.530 0.837 0.162
uos.sis.TMBEST_BM25 0 3124 0.274 0.324 0.727 0.124
waterloo.A-rank-normal 117558 1464 0.601 0.700 0.927 0.279
waterloo.B-rank-normal 117558 1469 0.611 0.701 0.933 0.318
Table 3. Evaluation results for submitted runs ranking the entire set of articles
provided by the Boolean query.
Figure 3 presents the box plots of Mean Average Precision values for runs
that do not make use of relevance feedback (left) and runs that make use of
relevance feedback (right), respectively. On average, relevance feedback boosts
the effectiveness of the ranking algorithms, as expected; however, it may come
with additional cost in terms of assessing the relevance of abstracts (depending
on the screening setup considered).
Fig. 3. Box plots of Mean Average Precision for runs that do not make use of relevance
feedback and those that do.
6.2 Drawing a Threshold
Table 4 presents a number of evaluation measures for those runs that applied
a threshold criterion. The total number of abstracts shown to the user can be
found in the second column of the table, the number of documents for which
feedback was requested in the third, while the remaining columns report
different measures of performance. The cost measures account both for the cost
of presenting a document to the user and for the additional cost of requesting
feedback for a document, while they also account for the cost one would need to
pay to reach 100% recall, under certain assumptions. Reliability considers the
cost of not finding all relevant documents but makes no distinction between
the documents returned to the user and those for which feedback is requested.
Average precision is well defined under the stopping criterion but hard to
use for comparing runs that use different thresholds. An easy-to-understand
measure is the achieved recall at the rank of the threshold.
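Recall at the threshold rank can be computed directly from a ranked list, the threshold position, and the relevant set. The sketch below is a simplified per-topic illustration with hypothetical function names; it does not reproduce the cost or reliability measures, whose exact definitions are not restated in this section.

```python
def recall_at_threshold(ranking, relevant, threshold):
    """Fraction of relevant documents within the top `threshold` positions.

    `threshold` is the number of top-ranked documents shown to the user.
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    shown = ranking[:threshold]
    return sum(1 for doc_id in shown if doc_id in relevant) / len(relevant)


def smallest_full_recall_threshold(ranking, relevant):
    """Smallest threshold reaching 100% recall, or None if that is impossible."""
    relevant = set(relevant)
    last = None
    found = set()
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            found.add(doc_id)
            last = rank
    return last if relevant and found == relevant else None
```

Comparing the chosen threshold with the smallest full-recall threshold shows how many documents were shown unnecessarily, or how many more would have been needed.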
Run | Docs Shown | Feedback | Rel Docs Found | Cost w/ Uniform Penalty | Cost w/ Weighted Penalty | Area Under Recall | AP | Recall@Thresh | Reliability
ecnu.run2 30000 0 1191 4003 6641 0.64 0.16 0.71 0.44
ecnu.run3 30000 0 1197 4016 6717 0.65 0.17 0.72 0.44
eth.m1 51640 0 1686 2306 4740 0.81 0.22 0.93 0.20
eth.m2 51604 5063 1702 2676 4720 0.80 0.21 0.90 0.14
eth.m4 27046 0 1406 2527 5590 0.74 0.21 0.82 0.14
iiit.run1 15354 15354 1006 3550 6685 0.68 0.16 0.74 0.15
iiit.run2 15354 15354 1006 3550 6685 0.68 0.16 0.74 0.15
iiit.run3 15354 15354 1006 3550 6685 0.68 0.16 0.74 0.15
iiit.run4 15354 15354 1006 3550 6685 0.68 0.16 0.74 0.15
ncsu.abs 12942 12942 1073 4409 7695 0.61 0.11 0.71 0.33
ncsu.simple 27950 27950 1611 4145 6964 0.68 0.11 0.83 0.18
qut.bool_es 69951 0 1475 3480 4976 0.64 0.13 0.76 0.36
qut.pico_es 63018 0 1414 3527 5168 0.62 0.12 0.74 0.34
uos.bm25_1 103051 0 1828 3454 3786 0.81 0.17 0.99 0.45
uos.bm25_2.5 76104 0 1758 2905 3902 0.79 0.17 0.94 0.27
uos.bm25_2 84740 0 1784 3117 3748 0.80 0.17 0.95 0.33
uos.sis.AL30Q 94967 0 1809 3280 3865 0.80 0.17 0.97 0.38
waterloo.A-thresh 87767 87767 1842 8809 9543 0.93 0.28 1.00 0.50
waterloo.B-thresh 60936 60936 1548 6470 7150 0.91 0.31 0.97 0.43
Table 4. Evaluation results for submitted runs using different threshold criteria;
measures are computed using abstract-level relevance judgments.
Figure 4 presents recall at the point of the threshold as a function of the
number of documents presented to the user, that is, at different stopping criteria
but also with different ranking and thresholding algorithms. As expected, the
more documents presented to the user (the lower the threshold criterion), the
higher the achieved recall. Nevertheless, there are still algorithms that dominate
others. The figure presents the Pareto frontier. Figure 5 presents recall at the
point of the threshold as a function of the feedback documents requested. As
can be seen, although feedback documents are in principle helpful towards
achieving a high recall, there are algorithms that used no relevance feedback and
still achieved high recall at the threshold.
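The notion of one run dominating another can be made concrete with a small Pareto-frontier computation over (documents shown, recall at threshold) pairs. The sketch below uses a few rows copied from Table 4; the function name and the dominance convention (fewer documents shown and higher recall are both better) are assumptions made for illustration.

```python
def pareto_frontier(runs):
    """Return the names of the runs not dominated by any other run.

    `runs` maps a run name to (docs_shown, recall). Run A dominates run B if A
    shows no more documents and reaches at least the same recall, and is
    strictly better on at least one of the two.
    """
    frontier = []
    for name, (docs, recall) in runs.items():
        dominated = any(
            other_docs <= docs and other_recall >= recall
            and (other_docs < docs or other_recall > recall)
            for other, (other_docs, other_recall) in runs.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier


# A few (Docs Shown, Recall@Thresh) pairs taken from Table 4.
table4_subset = {
    "ecnu.run2": (30000, 0.71),
    "eth.m1": (51640, 0.93),
    "eth.m4": (27046, 0.82),
    "ncsu.abs": (12942, 0.71),
    "qut.bool_es": (69951, 0.76),
    "uos.bm25_1": (103051, 0.99),
}
frontier_runs = pareto_frontier(table4_subset)
```

Because only a subset of the runs is included, the resulting frontier is not the one drawn in the figure; it only illustrates the dominance test.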
6.3 Topic Difficulty
Table 5 provides statistics on the topics used in the test set, along with the
average Average Precision (AAP) for each topic, a measure that can be seen as
a proxy of the difficulty of each topic. The Pearson correlation coefficients between
AAP and the percentage of relevant documents, the total number of documents,
and the total number of relevant documents are 0.8994 (p-value < 0.001), -0.4868
(p-value = 0.006), and 0.1295 (p-value = 0.495), respectively. Figures 6 and 7
visually demonstrate these correlations.
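The coefficients can be recomputed from the per-topic values listed in Table 5. The sketch below is a plain-Python illustration: the variable names are assumptions, p-values are not computed, and small differences from the reported figures are possible because the table values are rounded.

```python
import math

# (Average AP, % relevant, documents, relevant) per topic, copied from Table 5.
TABLE5 = [
    (0.035, 0.42, 5495, 23), (0.036, 0.28, 10905, 30), (0.040, 0.32, 626, 2),
    (0.042, 0.03, 10317, 3), (0.051, 0.89, 12807, 114), (0.076, 2.68, 970, 26),
    (0.077, 1.16, 2074, 24), (0.078, 0.48, 2065, 10), (0.079, 0.56, 8002, 45),
    (0.094, 0.98, 5495, 54), (0.096, 0.43, 10507, 45), (0.096, 2.01, 2785, 56),
    (0.102, 1.11, 2248, 25), (0.107, 1.86, 10872, 202), (0.119, 3.55, 169, 6),
    (0.132, 1.90, 5220, 99), (0.146, 0.25, 1573, 4), (0.149, 5.75, 348, 20),
    (0.156, 2.41, 1911, 46), (0.158, 1.74, 5971, 104), (0.254, 5.70, 1615, 92),
    (0.266, 4.56, 241, 11), (0.269, 7.04, 6531, 460), (0.290, 5.30, 981, 52),
    (0.310, 7.45, 94, 7), (0.317, 2.14, 6455, 138), (0.351, 9.73, 791, 77),
    (0.395, 14.87, 316, 47), (0.423, 18.75, 64, 12), (0.524, 20.18, 114, 23),
]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


aap = [row[0] for row in TABLE5]
r_pct = pearson(aap, [row[1] for row in TABLE5])   # AAP vs. % relevant
r_docs = pearson(aap, [row[2] for row in TABLE5])  # AAP vs. total documents
r_rel = pearson(aap, [row[3] for row in TABLE5])   # AAP vs. total relevant
```

Printing `r_pct`, `r_docs`, and `r_rel` allows a direct comparison with the coefficients quoted in the text.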
Fig. 4. Recall at the threshold rank as a function of the number of documents shown
to the user.
Fig. 5. Recall at the threshold rank as a function of the number of documents for which
feedback is requested.
Topic | Average AP | % Relevant | Total Documents | Relevant Documents
CD010173 0.035 0.42 5495 23
CD010783 0.036 0.28 10905 30
CD010386 0.040 0.32 626 2
CD012019 0.042 0.03 10317 3
CD010339 0.051 0.89 12807 114
CD008081 0.076 2.68 970 26
CD007431 0.077 1.16 2074 24
CD009786 0.078 0.48 2065 10
CD010653 0.079 0.56 8002 45
CD010276 0.094 0.98 5495 54
CD008782 0.096 0.43 10507 45
CD009647 0.096 2.01 2785 56
CD009372 0.102 1.11 2248 25
CD011145 0.107 1.86 10872 202
CD010896 0.119 3.55 169 6
CD008803 0.132 1.90 5220 99
CD010633 0.146 0.25 1573 4
CD010542 0.149 5.75 348 20
CD009551 0.156 2.41 1911 46
CD009519 0.158 1.74 5971 104
CD009185 0.254 5.70 1615 92
CD010775 0.266 4.56 241 11
CD009925 0.269 7.04 6531 460
CD010023 0.290 5.30 981 52
CD010860 0.310 7.45 94 7
CD009579 0.317 2.14 6455 138
CD009135 0.351 9.73 791 77
CD010772 0.395 14.87 316 47
CD008760 0.423 18.75 64 12
CD010705 0.524 20.18 114 23
Table 5. Average Average Precision (AAP) per topic as a measure of topic difficulty,
along with statistics about relevant documents.
Fig. 6. Average Average Precision (AAP) as a function of the percentage of relevant
documents.
Fig. 7. Average Average Precision (AAP) as a function of the total number of docu-
ments.
7 Conclusions
The CLEF 2017 e-Health Lab Task 2 constructed a benchmark collection of 50
Diagnostic Test Accuracy systematic reviews to study the effectiveness and
efficiency of information retrieval and machine learning algorithms in prioritizing
the studies to be screened at the abstract and title screening stage, and in
providing a stopping criterion over the ranked list. The results demonstrate that
automatic methods can be trusted to find most, if not all, relevant studies
in a fraction of the time that manual screening requires. Given that many
parameters change simultaneously across different runs, it is not easy to draw
firm conclusions about the relative performance of automatic methods.
Regarding the benchmark collection itself, there are a number of limitations to
be considered: (a) Pivoting on the results of the OVID MEDLINE Boolean
query limits our ability to identify all relevant studies, i.e. relevant studies that
are returned by Boolean queries over different databases, and relevant studies
that are not found by these Boolean queries at all. The former can be overcome
by considering all the different queries submitted; for the latter, extra manual
judgments would be required. (b) Pivoting on abstract and title only, we miss the
opportunity to study the effect of automatic methods when applied to the full
text of the studies, which would present an opportunity to completely overcome
the multi-stage process of systematic reviews. However, most of the full-text
articles are protected under copyright laws that do not give all participants access
to them. (c) The evaluation setup of ranking does not allow us to consider the
cost of the process, since given a ranking a researcher would still have to go over
all studies ranked. A more realistic setup, e.g. a double-screening setup, could
be considered. (d) In the construction of relevance judgments we considered the
included and excluded references of the systematic reviews under study, which
prevented us from studying the noise and disagreement between reviewers. (e) In our
effort to allow iterative algorithms, e.g. active learning algorithms, to be
submitted, we handed the test sets’ relevance judgments directly to the participants,
which is rather unusual for this type of evaluation exercise. An alternative would
be the setup used by the TREC Total Recall track, where participants submitted their
running algorithms to the organizers. (f) When it comes to evaluation measures,
there is a large variety of them, each of which takes a different, often useful
viewpoint on the effectiveness of an algorithm, which makes it difficult to decide
upon a single golden measure to rank participants’ runs.
References
1. Alharbi, A., Stevenson, M.: Ranking abstracts to identify relevant evidence for
systematic reviews: The university of sheffield’s approach to clef ehealth 2017 task
2. In: Working Notes of CLEF 2017 - Conference and Labs of the Evaluation
forum, Dublin, Ireland, September 11-14, 2017. CEUR Workshop Proceedings,
CEUR-WS.org (2017)
2. van Altena, A.J.: Predicting publication inclusion for diagnostic accuracy test re-
views using random forests and topic modelling. In: Working Notes of CLEF 2017
- Conference and Labs of the Evaluation forum, Dublin, Ireland, September 11-14,
2017. CEUR Workshop Proceedings, CEUR-WS.org (2017)
3. Anagnostou, A., Lagopoulos, A., Tsoumakas, G., Vlahavas, I.: Hybridranksvm:
A cost-effective hybrid ltr approach for document ranking. In: Working Notes
of CLEF 2017 - Conference and Labs of the Evaluation forum, Dublin, Ireland,
September 11-14, 2017. CEUR Workshop Proceedings, CEUR-WS.org (2017)
4. Chen, J., Chen, S., Song, Y., Liu, H., Wang, Y., Hu, Q., He, L.: Ecnu at 2017
ehealth task 2: Technologically assisted reviews in empirical medicine. In: Working
Notes of CLEF 2017 - Conference and Labs of the Evaluation forum, Dublin, Ire-
land, September 11-14, 2017. CEUR Workshop Proceedings, CEUR-WS.org (2017)
5. Cohen, A.M., Hersh, W.R., Peterson, K., Yen, P.Y.: Reducing workload in sys-
tematic review preparation using automated citation classification. Journal of the
American Medical Informatics Association 13(2), 206–219 (2006)
6. Cormack, G.V., Grossman, M.R.: Engineering quality and reliability in technology-
assisted review. In: Proceedings of the 39th International ACM SIGIR Confer-
ence on Research and Development in Information Retrieval. pp. 75–84. SIGIR
’16, ACM, New York, NY, USA (2016), http://doi.acm.org/10.1145/2911451.2911510
7. Cormack, G.V., Grossman, M.R.: Technology-assisted review in empirical
medicine: Waterloo participation in clef ehealth 2017. In: Working Notes of CLEF
2017 - Conference and Labs of the Evaluation forum, Dublin, Ireland, September
11-14, 2017. CEUR Workshop Proceedings, CEUR-WS.org (2017)
8. Goeuriot, L., Kelly, L., Suominen, H., Névéol, A., Robert, A., Kanoulas, E., Spi-
jker, R., Palotti, J., Zuccon, G.: CLEF 2017 eHealth evaluation lab overview. In:
CLEF 2017 - 8th Conference and Labs of the Evaluation Forum, Lecture Notes in
Computer Science (LNCS). Springer (September 2017)
9. Hollmann, N., Eickhoff, C.: Relevance-based stopping for recall-centric medical
document retrieval. In: Working Notes of CLEF 2017 - Conference and Labs of
the Evaluation forum, Dublin, Ireland, September 11-14, 2017. CEUR Workshop
Proceedings, CEUR-WS.org (2017)
10. Järvelin, K., Kekäläinen, J.: Cumulated gain-based evaluation of ir techniques.
ACM Trans. Inf. Syst. 20(4), 422–446 (Oct 2002), http://doi.acm.org/10.1145/582415.582418
11. Kalphov, V., Georgiadis, G., Azzopardi, L.: Sis at clef 2017 ehealth tar task. In:
Working Notes of CLEF 2017 - Conference and Labs of the Evaluation forum,
Dublin, Ireland, September 11-14, 2017. CEUR Workshop Proceedings, CEUR-
WS.org (2017)
12. Lee, G.E.: Medical document classification for systematic reviews using convolu-
tional neural networks: Sysreview at clef ehealth 2017. In: Working Notes of CLEF
2017 - Conference and Labs of the Evaluation forum, Dublin, Ireland, September
11-14, 2017. CEUR Workshop Proceedings, CEUR-WS.org (2017)
13. Norman, C., Leeflang, M., Neveol, A.: Limsi@clef ehealth 2017 task 2: Logistic
regression for automatic article ranking. In: Working Notes of CLEF 2017 - Con-
ference and Labs of the Evaluation forum, Dublin, Ireland, September 11-14, 2017.
CEUR Workshop Proceedings, CEUR-WS.org (2017)
14. Nunzio, G.M.D., Beghini, F., Vezzani, F., Henrot, G.: An interactive two-
dimensional approach to query aspects rewriting in systematic reviews. ims unipd
at clef ehealth task 2. In: Working Notes of CLEF 2017 - Conference and Labs of
the Evaluation forum, Dublin, Ireland, September 11-14, 2017. CEUR Workshop
Proceedings, CEUR-WS.org (2017)
15. O’Mara-Eves, A., Thomas, J., McNaught, J., Miwa, M., Ananiadou, S.: Using text
mining for study identification in systematic reviews: a systematic review of current
approaches. Systematic reviews 4(1), 5 (2015)
16. Scells, H., Zuccon, G., Deacon, A., Koopman, B.: Qut ielab at clef 2017 technology
assisted reviews track: Initial experiments with learning to rank. In: Working Notes
of CLEF 2017 - Conference and Labs of the Evaluation forum, Dublin, Ireland,
September 11-14, 2017. CEUR Workshop Proceedings, CEUR-WS.org (2017)
17. Scells, H., Zuccon, G., Koopman, B., Deacon, A., Geva, S., Azzopardi, L.: A test
collection for evaluating retrieval of studies for inclusion in systematic reviews.
In: To appear in Proceedings of the 40th international ACM SIGIR conference on
Research and development in Information Retrieval. ACM (2017)
18. Shemilt, I., Khan, N., Park, S., Thomas, J.: Use of cost-effectiveness analysis to
compare the efficiency of study identification methods in systematic reviews. Sys-
tematic Reviews 5(1), 140 (Aug 2016)
19. Singh, G., Marshall, I., Thomas, J., Wallace, B.: Identifying diagnostic test accu-
racy publications using a deep model. In: Working Notes of CLEF 2017 - Confer-
ence and Labs of the Evaluation forum, Dublin, Ireland, September 11-14, 2017.
CEUR Workshop Proceedings, CEUR-WS.org (2017)
20. Yu, Z., Menzies, T.: Technologically assisted reviews in empirical medicine: Data
balancing or reweighting. In: Working Notes of CLEF 2017 - Conference and Labs
of the Evaluation forum, Dublin, Ireland, September 11-14, 2017. CEUR Workshop
Proceedings, CEUR-WS.org (2017)
10.1002/14651858.CD010438.pub2/full 10.1002/14651858.CD009551.pub3/full
10.1002/14651858.CD010775.pub2/full 10.1002/14651858.CD012019/full
10.1002/14651858.CD009175.pub2/full 10.1002/14651858.CD008686.pub2/full
10.1002/14651858.CD011984/full 10.1002/14651858.CD009020.pub2/full
10.1002/14651858.CD009786.pub2/full 10.1002/14651858.CD011548/full
10.1002/14651858.CD008643.pub2/full 10.1002/14651858.CD010896.pub2/full
10.1002/14651858.CD009579.pub2/full 10.1002/14651858.CD010023.pub2/full
10.1002/14651858.CD009925/full 10.1002/14651858.CD010772.pub2/full
10.1002/14651858.CD009944.pub2/full 10.1002/14651858.CD011145.pub2/full
10.1002/14651858.CD007431.pub2/full 10.1002/14651858.CD010409.pub2/full
10.1002/14651858.CD007427.pub2/full 10.1002/14651858.CD008054.pub2/full
10.1002/14651858.CD008803.pub2/full 10.1002/14651858.CD010771.pub2/full
10.1002/14651858.CD008122.pub2/full 10.1002/14651858.CD009694.pub2/full
10.1002/14651858.CD009593.pub3/full 10.1002/14651858.CD010705.pub2/full
10.1002/14651858.CD008782.pub4/full 10.1002/14651858.CD010633.pub2/full
10.1002/14651858.CD009647.pub2/full 10.1002/14651858.CD010173.pub2/full
10.1002/14651858.CD009135.pub2/full 10.1002/14651858.CD009591.pub2/full
10.1002/14651858.CD008760.pub2/full 10.1002/14651858.CD010386.pub2/full
10.1002/14651858.CD011549/full 10.1002/14651858.CD011021.pub2/full
10.1002/14651858.CD009263.pub2/full 10.1002/14651858.CD008691.pub2/full
10.1002/14651858.CD009519.pub2/full 10.1002/14651858.CD010632.pub2/full
10.1002/14651858.CD009372.pub2/full 10.1002/14651858.CD007394.pub2/full
10.1002/14651858.CD011134.pub2/full 10.1002/14651858.CD010783.pub2/full
10.1002/14651858.CD010079.pub2/full 10.1002/14651858.CD010860.pub2/full
10.1002/14651858.CD010276.pub2/full 10.1002/14651858.CD007424.pub2/full
10.1002/14651858.CD008081.pub3/full 10.1002/14651858.CD011431/full
10.1002/14651858.CD009185.pub2/full 10.1002/14651858.CD010339.pub2/full
10.1002/14651858.CD011975/full 10.1002/14651858.CD010653.pub2/full
10.1002/14651858.CD009323.pub2/full 10.1002/14651858.CD010542.pub2/full
Table 6. The DOIs of the studies considered for the construction of the benchmark
collection.
Run | Docs Shown | Feedback | Rel Docs Found | Last Rank Rel | wss@100 | wss@95 | Cost w/ Uniform Penalty | Cost w/ Weighted Penalty | Area Under Recall | AP | Recall@Thresh | Reliability
amc.run 117548 0 1857 2913 0.25 0.33 3918 3918 0.76 0.13 1.00 0.54
auth.simple.run1 117561 41337 1857 2143 0.52 0.69 6674 6674 0.93 0.30 1.00 0.54
auth.simple.run2 117561 41377 1857 2124 0.52 0.70 6677 6677 0.92 0.29 1.00 0.54
auth.simple.run3 117561 23337 1857 2183 0.51 0.68 5474 5474 0.92 0.28 1.00 0.54
auth.simple.run4 117561 41537 1857 2119 0.52 0.69 6687 6687 0.92 0.29 1.00 0.54
BASELINE.BM25 117550 0 1857 2851 0.28 0.40 3918 3918 0.81 0.17 1.00 0.54
BASELINE.pubmed.random 117562 0 1857 3722 0.04 0.03 3918 3918 0.48 0.04 1.00 0.54
cnrs.abrupt.all 117557 19980 1857 3414 0.17 0.24 5250 5250 0.73 0.14 1.00 0.54
cnrs.gradual.all 117557 23683 1857 3406 0.20 0.29 5497 5497 0.71 0.15 1.00 0.54
cnrs.noaf.all 117557 0 1857 2993 0.26 0.36 3918 3918 0.78 0.14 1.00 0.54
cnrs.noaffull.all 117557 0 1857 2250 0.41 0.50 3918 3918 0.84 0.18 1.00 0.54
ecnu.run1 117561 0 1857 3633 0.10 0.12 3918 3918 0.63 0.09 1.00 0.54
ecnu.run2 30000 0 1191 699 0.07 0.16 4003 6641 0.64 0.16 0.71 0.44
ecnu.run3 30000 0 1197 725 0.08 0.17 4016 6717 0.65 0.17 0.72 0.44
eth.m1 51640 0 1686 1372 0.24 0.28 2306 4740 0.81 0.22 0.93 0.20
eth.m2 51604 5063 1702 1435 0.14 0.24 2676 4720 0.80 0.21 0.90 0.14
eth.m4 27046 0 1406 785 0.12 0.16 2527 5590 0.74 0.21 0.82 0.14
iiit.run1 15354 15354 1006 548 0.11 0.14 3550 6685 0.68 0.16 0.74 0.15
iiit.run2 15354 15354 1006 548 0.11 0.14 3550 6685 0.68 0.16 0.74 0.15
iiit.run3 15354 15354 1006 548 0.11 0.14 3550 6685 0.68 0.16 0.74 0.15
iiit.run4 15354 15354 1006 548 0.11 0.14 3550 6685 0.68 0.16 0.74 0.15
ncsu.abs 12942 12942 1073 378 0.12 0.16 4409 7695 0.61 0.11 0.71 0.33
ncsu.simple 27950 27950 1611 928 0.14 0.27 4145 6964 0.68 0.11 0.83 0.18
ntu.run1 111170 0 1795 3403 0.09 0.11 3936 4130 0.61 0.08 0.98 0.55
ntu.run2 111170 0 1795 3204 0.12 0.13 3936 4130 0.59 0.06 0.98 0.55
ntu.run3 111196 0 1795 3570 0.09 0.07 3937 4130 0.54 0.05 0.98 0.55
padua.iafa_m10k150f0m10 117557 2350 1857 2269 0.41 0.51 4075 4075 0.90 0.28 1.00 0.54
padua.iafap_m10p2f0m10 117557 2367 1857 2395 0.37 0.48 4076 4076 0.88 0.25 1.00 0.54
padua.iafap_m10p5f0m10 117557 5893 1857 2260 0.40 0.50 4311 4311 0.89 0.27 1.00 0.54
padua.iafas_m10k50f0m10 117557 4320 1857 2304 0.41 0.52 4206 4206 0.89 0.27 1.00 0.54
Table 7. PART I: Evaluation results for submitted runs computed using abstract-level relevance judgments
Run | Docs Shown | Feedback | Rel Docs Found | Last Rank Rel | wss@100 | wss@95 | Cost w/ Uniform Penalty | Cost w/ Weighted Penalty | Area Under Recall | AP | Recall@Thresh | Reliability
qut.ca_bool_ltr 117557 0 1857 3142 0.20 0.29 3918 3918 0.73 0.11 1.00 0.54
qut.ca_pico_ltr 117557 0 1857 3344 0.21 0.29 3918 3918 0.75 0.15 1.00 0.54
qut.rf_bool_ltr 117557 0 1857 3099 0.19 0.27 3918 3918 0.70 0.11 1.00 0.54
qut.rf_pico_ltr 117557 0 1857 3155 0.23 0.29 3918 3918 0.73 0.12 1.00 0.54
qut.bool_es_test 69951 0 1475 1972 0.10 0.11 3480 4976 0.64 0.13 0.76 0.36
qut.pico_es_test 63018 0 1414 1873 0.11 0.13 3527 5168 0.62 0.12 0.74 0.34
sheffield.run1 117562 0 1857 2678 0.31 0.42 3918 3918 0.82 0.17 1.00 0.54
sheffield.run2 117562 0 1857 2441 0.39 0.49 3918 3918 0.84 0.22 1.00 0.54
sheffield.run3 117562 0 1857 2404 0.38 0.47 3918 3918 0.84 0.20 1.00 0.54
sheffield.run4 117562 0 1857 2382 0.40 0.49 3918 3918 0.85 0.22 1.00 0.54
ucl.run_abstract 117562 0 1857 3727 0.04 0.03 3918 3918 0.48 0.04 1.00 0.54
ucl.run_fulltext 117562 0 1857 3727 0.04 0.03 3918 3918 0.48 0.04 1.00 0.54
uos.bm25_threshold1 103051 0 1828 2503 0.28 0.40 3454 3786 0.81 0.17 0.99 0.45
uos.bm25_threshold2.5 76104 0 1758 1877 0.22 0.35 2905 3902 0.79 0.17 0.94 0.27
uos.bm25_threshold2 84740 0 1784 2068 0.23 0.37 3117 3748 0.80 0.17 0.95 0.33
uos.sis.AL30Q_BM25 94967 0 1809 2333 0.27 0.39 3280 3865 0.80 0.17 0.97 0.38
uos.sis.TMAL30Q_BM25 117551 35432 1857 2305 0.40 0.53 6280 6280 0.84 0.16 1.00 0.54
uos.sis.TMBEST_BM25 117557 0 1857 3124 0.27 0.32 3918 3918 0.73 0.12 1.00 0.54
waterloo.A-rank-normal 117558 117558 1857 1464 0.60 0.70 11755 11755 0.93 0.28 1.00 0.54
waterloo.A-thresh-normal 87767 87767 1842 1161 0.56 0.70 8809 9543 0.93 0.28 1.00 0.50
waterloo.B-rank-normal 117558 117558 1857 1469 0.61 0.70 11755 11755 0.93 0.32 1.00 0.54
waterloo.B-thresh-normal 60936 60936 1548 914 0.54 0.66 6470 7150 0.91 0.31 0.97 0.43
Table 8. PART II: Evaluation results for submitted runs computed using abstract-level relevance judgments.
Run | Docs Shown | Feedback | Rel Docs Found | Last Rank Rel | wss@100 | wss@95 | Cost w/ Uniform Penalty | Cost w/ Weighted Penalty | Area Under Recall | AP | Recall@Thresh | Reliability
amc.run 109547 0 607 1742 0.51 0.51 3777 3777 0.84 0.10 1.00 0.74
auth.simple.run1 109559 39337 607 853 0.80 0.82 6490 6490 0.95 0.23 1.00 0.74
auth.simple.run2 109559 39377 607 857 0.79 0.81 6493 6493 0.94 0.21 1.00 0.74
auth.simple.run3 109559 22337 607 839 0.80 0.82 5318 5318 0.95 0.22 1.00 0.74
auth.simple.run4 109559 39537 607 858 0.79 0.81 6504 6504 0.94 0.21 1.00 0.74
BASELINE.BM25 109549 0 607 1664 0.54 0.57 3777 3777 0.85 0.14 1.00 0.74
BASELINE.pubmed.random 109560 0 607 3316 0.09 0.07 3777 3777 0.48 0.02 1.00 0.74
cnrs.abrupt.all 109555 19980 607 2619 0.35 0.39 5155 5155 0.80 0.11 1.00 0.74
cnrs.gradual.all 109555 22684 607 2384 0.41 0.46 5342 5342 0.77 0.11 1.00 0.74
cnrs.noaf.all 109555 0 607 2263 0.42 0.50 3777 3777 0.82 0.10 1.00 0.74
cnrs.noaffull.all 109555 0 607 1678 0.59 0.64 3777 3777 0.89 0.13 1.00 0.74
ecnu.run1 109559 0 607 2905 0.26 0.27 3777 3777 0.66 0.06 1.00 0.74
ecnu.run2 29000 0 426 515 0.29 0.30 3069 5286 0.72 0.12 0.79 0.49
ecnu.run3 29000 0 426 486 0.30 0.31 3069 5286 0.73 0.12 0.79 0.49
eth.m1 47500 0 561 1000 0.44 0.58 1876 2170 0.86 0.16 0.97 0.24
eth.m2 46538 4662 553 1050 0.42 0.55 2196 2489 0.84 0.15 0.93 0.18
eth.m4 25381 0 476 596 0.31 0.36 1850 3739 0.80 0.15 0.87 0.15
iiit.run1 15234 15234 406 501 0.15 0.19 3583 5094 0.70 0.12 0.77 0.18
iiit.run2 15234 15234 406 501 0.15 0.19 3583 5094 0.70 0.12 0.77 0.18
iiit.run3 15234 15234 406 501 0.15 0.19 3583 5094 0.70 0.12 0.77 0.18
iiit.run4 15234 15234 406 501 0.15 0.19 3583 5094 0.70 0.12 0.77 0.18
ncsu.abs 12682 12682 480 356 0.35 0.38 3354 5978 0.69 0.07 0.81 0.31
ncsu.simple 27950 27950 607 960 0.66 0.66 2891 2891 0.80 0.06 1.00 0.13
ntu.run1 103170 0 606 2954 0.20 0.22 3606 3557 0.65 0.05 1.00 0.72
ntu.run2 103170 0 606 2779 0.20 0.20 3606 3557 0.62 0.04 1.00 0.72
ntu.run3 103194 0 606 3050 0.15 0.14 3607 3558 0.55 0.02 1.00 0.72
padua.iafa_m10k150f0m10 109555 2286 607 1055 0.71 0.71 3935 3935 0.93 0.22 1.00 0.74
padua.iafap_m10p2f0m10 109555 2206 607 1007 0.66 0.69 3929 3929 0.92 0.20 1.00 0.74
padua.iafap_m10p5f0m10 109555 5492 607 838 0.71 0.70 4156 4156 0.93 0.21 1.00 0.74
padua.iafas_m10k50f0m10 109555 4170 607 990 0.71 0.72 4065 4065 0.93 0.19 1.00 0.74
Table 9. PART I: Evaluation results for submitted runs computed using document-level relevance judgments.
Run | Docs Shown | Feedback | Rel Docs Found | Last Rank Rel | wss@100 | wss@95 | Cost w/ Uniform Penalty | Cost w/ Weighted Penalty | Area Under Recall | AP | Recall@Thresh | Reliability
qut.ca_bool_ltr 109555 0 607 2582 0.33 0.36 3777 3777 0.75 0.08 1.00 0.74
qut.ca_pico_ltr 109555 0 607 2638 0.36 0.40 3777 3777 0.78 0.11 1.00 0.74
qut.rf_bool_ltr 109555 0 607 2477 0.33 0.35 3777 3777 0.72 0.06 1.00 0.74
qut.rf_pico_ltr 109555 0 607 2610 0.35 0.38 3777 3777 0.76 0.09 1.00 0.74
qut.bool_es 65389 0 465 1595 0.22 0.25 3217 4041 0.68 0.10 0.81 0.43
qut.pico_es 58456 0 451 1424 0.21 0.23 3251 4519 0.67 0.09 0.78 0.40
sheffield.run1 109560 0 607 1801 0.52 0.54 3777 3777 0.84 0.12 1.00 0.74
sheffield.run2 109560 0 607 1928 0.53 0.58 3777 3777 0.87 0.18 1.00 0.74
sheffield.run3 109560 0 607 1902 0.52 0.59 3777 3777 0.87 0.15 1.00 0.74
sheffield.run4 109560 0 607 1846 0.54 0.59 3777 3777 0.87 0.18 1.00 0.74
ucl.run_abstract 109560 0 607 3472 0.13 0.12 3777 3777 0.51 0.04 1.00 0.74
ucl.run_fulltext 109560 0 607 3505 0.14 0.12 3777 3777 0.51 0.04 1.00 0.74
uos.bm25_threshold1 95721 0 601 1548 0.53 0.57 3311 3406 0.85 0.14 0.99 0.59
uos.bm25_threshold2.5 73548 0 591 1483 0.52 0.56 2580 2926 0.85 0.13 0.98 0.36
uos.bm25_threshold2 80976 0 597 1506 0.53 0.57 2820 3037 0.85 0.14 0.99 0.45
uos.sis.AL30Q_BM25 109549 33300 607 906 0.68 0.69 6074 6074 0.90 0.16 1.00 0.74
uos.sis.TMAL30Q_BM25 109550 33002 607 1228 0.65 0.67 6053 6053 0.87 0.11 1.00 0.74
uos.sis.TMBEST_BM25 109555 0 607 1980 0.50 0.49 3777 3777 0.76 0.09 1.00 0.74
waterloo.A-rank-normal 109556 109556 607 461 0.82 0.81 11333 11333 0.95 0.19 1.00 0.74
waterloo.A-thresh-normal 79765 79765 607 461 0.82 0.81 8251 8251 0.95 0.19 1.00 0.66
waterloo.B-rank-normal 109556 109556 607 469 0.83 0.82 11333 11333 0.95 0.23 1.00 0.74
waterloo.B-thresh-normal 52934 52934 575 375 0.78 0.77 5765 6559 0.94 0.23 0.98 0.53
Table 10. PART II: Evaluation results for submitted runs computed using document-level relevance judgments