=Paper=
{{Paper
|id=Vol-1866/paper_98
|storemode=property
|title=QUT ielab at CLEF 2017 Technology Assisted Reviews Track: Initial Experiments with Learning To Rank
|pdfUrl=https://ceur-ws.org/Vol-1866/paper_98.pdf
|volume=Vol-1866
|authors=Harrisen Scells,Guido Zuccon,Anthony Deacon,Bevan Koopman
|dblpUrl=https://dblp.org/rec/conf/clef/ScellsZDK17
}}
==QUT ielab at CLEF 2017 Technology Assisted Reviews Track: Initial Experiments with Learning To Rank==
QUT ielab at CLEF eHealth 2017 Technology Assisted Reviews Track: Initial Experiments with Learning To Rank
Harrisen Scells¹, Guido Zuccon¹, Anthony Deacon¹, Bevan Koopman²
¹ Queensland University of Technology, Brisbane, Australia
² Australian E-Health Research Centre, CSIRO, Brisbane, Australia
harrisen.scells@hdr.qut.edu.au, g.zuccon@qut.edu.au, aj.deacon@qut.edu.au, bevan.koopman@csiro.au
Abstract. In this paper we describe our participation in the CLEF eHealth 2017 Technology Assisted Reviews (TAR) track. This track aims to evaluate and advance search technologies that support the creation of biomedical systematic reviews. In this context, the track explores the task of screening prioritisation: the ranking of studies to be screened for inclusion in a systematic review. Our solution addresses this challenge with ranking strategies based on learning to rank techniques, exploiting features derived from the use of the PICO framework. PICO (Population, Intervention, Control or comparison, and Outcome) is a technique used in evidence based practice to frame and answer clinical questions, and is used extensively in the compilation of systematic reviews. Our experiments show that the use of PICO-based features within learning to rank provides improvements over the use of baseline features alone.
1 Introduction
A systematic review is a type of literature review that appraises and synthesises
the work of primary research studies to answer one or more research questions.
Most authors follow the Preferred Reporting Items for Systematic Reviews and
Meta-Analyses (PRISMA) method for conducting and reporting these reviews.
This includes the definition of a formal search strategy to retrieve studies which
are to be considered for inclusion in the review.
Given a research question and a set of inclusion/exclusion criteria, researchers undertaking a systematic review define a search strategy (the query) to be issued to one or more search engines that index published literature (e.g. PubMed). In medical and biomedical research, search strategies are commonly expressed as (large) Boolean queries. After the search strategy has been executed, the title, and then the abstract, of each retrieved study is reviewed in a process known as screening. Where a study appears relevant, its full text is then retrieved for more detailed examination.
The compilation of systematic reviews can take significant time and resources, hampering their effectiveness. Tsafnat et al. report that it can take several years to complete and publish a systematic review [4]. When systematic reviews take such significant time to complete, they can be out-of-date even at the time of publication. While the compilation of a systematic review involves several steps, one of the most time-consuming is screening. Thus, the development of IR methods that decrease the number of documents to be screened would have a major impact on the time and resources required to undertake systematic reviews. Similarly, ordering the studies to be screened according to their likelihood of satisfying the inclusion criteria of the systematic review (screening prioritisation) would allow relevant studies to be identified early in the screening process, providing a feedback loop to improve the development of search strategies. Screening prioritisation is typically done as a two-stage process: an initial set of studies is retrieved using a Boolean retrieval process; these are then ranked according to some relevance measure.
The challenge of compiling systematic reviews is fertile ground for information retrieval (IR) research, as IR can provide techniques to improve current screening and screening prioritisation processes. The CLEF eHealth 2017 Technology Assisted Reviews (TAR) track [1,2] joins our recent work [3] in devising resources for the evaluation of information retrieval techniques that attempt to automate and improve processes involved in the creation of systematic reviews. The TAR track considers two tasks: (1) produce an efficient ordering of the studies retrieved by a Boolean search strategy, such that all of the relevant abstracts are retrieved as early as possible, and (2) identify a subset of the ranked studies which contains all or as many of the relevant abstracts as possible for the least effort (i.e. total number of abstracts to be assessed). In our submissions, we tackle the first task, and use learning to rank to produce a re-ranking of the initial set of studies retrieved for screening by the systematic review's Boolean search strategy.
2 Our Approach for TAR
We trained a learning to rank model using domain specific features to provide an efficient ordering of the studies retrieved for a systematic review. Specifically, we aim to observe what effect PICO features have on learning to rank algorithms. PICO (Population, Intervention, Control or comparison, and Outcome) is a technique used in evidence based practice to frame and answer clinical questions, and is used extensively in the compilation of systematic reviews. We investigated several learning to rank algorithms and observed the effect that queries annotated with PICO elements had on the reordering of results, compared to the original Boolean queries.
We trained two learning to rank models using both the original queries provided by the task organiser and a modified set of queries annotated with elements of the PICO framework. In total, we used seven features to train our learning to rank models; Table 1 summarises them.

Id  Feature
1   IDFSum
2   IDFStd
3   IDFMax
4   IDFAvg
5   PopulationCount
6   InterventionCount
7   OutcomeCount
Table 1: Features used to train our learning to rank models.

The first four features (IDFSum, IDFStd, IDFMax, and IDFAvg) are computed from the inverse document frequency (idf) of each term in the document that also appears in the query: IDFSum is the sum of all idf scores, IDFStd is the standard deviation of the idf scores, IDFMax is the maximum idf score, and IDFAvg is the mean idf score. The other three features (PopulationCount, InterventionCount, and OutcomeCount) are the number of terms in the document and in the query that also appear in the respective PICO annotation. PICO annotations for documents were automatically extracted using RobotReviewer [5]; this automatic process only annotates the Population, Intervention, and Outcome elements of studies (the Control element is not annotated). PICO annotations for queries were manually collected by one of the team members, who is a clinician (AD). Afterwards, the search strategies (both the original Boolean query and the new Boolean query with PICO annotations) were manually transformed into Elasticsearch queries. The result is two Elasticsearch queries per topic: one representative of the original query made by the systematic review authors, and another annotated with PICO elements.
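To make the feature definitions concrete, the seven features of Table 1 can be sketched as follows. This is a minimal illustrative sketch, not our actual extraction code: the idf formula, the tokenisation into term sets, and the function names are all assumptions.

```python
import math
import statistics

def idf(term, doc_freqs, n_docs):
    # Hypothetical smoothed idf; the paper does not specify which variant was used.
    return math.log((n_docs + 1) / (doc_freqs.get(term, 0) + 1))

def extract_features(doc_terms, query_terms, pico, doc_freqs, n_docs):
    """Return the seven Table 1 features for one query-document pair.

    `pico` maps an element name to its set of annotated terms; Control is
    absent, mirroring RobotReviewer's output.
    """
    # Terms in the document that also appear in the query.
    overlap = set(doc_terms) & set(query_terms)
    idfs = [idf(t, doc_freqs, n_docs) for t in overlap] or [0.0]
    features = {
        "IDFSum": sum(idfs),                  # sum of idf scores
        "IDFStd": statistics.pstdev(idfs),    # standard deviation of idf scores
        "IDFMax": max(idfs),                  # maximum idf score
        "IDFAvg": statistics.fmean(idfs),     # mean idf score
    }
    # Counts of overlapping terms that also appear in each PICO annotation.
    for element in ("Population", "Intervention", "Outcome"):
        features[element + "Count"] = len(overlap & pico.get(element, set()))
    return features
```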
Initial testing on a recent collection we developed [3] allowed us to select a
number of candidate learning to rank algorithms that may be effective in the
screening prioritisation of systematic reviews.
We then empirically evaluated the five selected learning to rank algorithms listed in Table 2 and found that Coordinate Ascent provided the best MAP score on the CLEF eHealth 2017 validation data among the models. Each time we trained a model, we used the default parameter values for that model¹ and set aside the same 30% of queries for validation. Table 2 summarises the NCG@10 and average precision (AP) scores for both the original Boolean queries and the annotated PICO queries. We found that Coordinate Ascent was the best algorithm for learning to rank these types of studies. Additionally, we found that Random Forests and MART both achieved similar levels of NCG@10 and AP.
Additionally, we used Elasticsearch (version 5.3) to produce a re-ranking. We did this by issuing the Boolean and PICO queries to Elasticsearch, limiting the results to only the PubMed identifiers contained in the topic file for each query. We then let Elasticsearch rank these documents using BM25 with the default settings². We considered the Elasticsearch runs as our baseline.
¹ The default values for each model can be found at https://sourceforge.net/p/lemur/wiki/RankLib%20How%20to%20use/
² k1 = 1.2, b = 0.75.
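The baseline re-ranking step above can be sketched as an Elasticsearch request body. This is an illustrative reconstruction: the index name ("pubmed"), the field names, and the helper function are assumptions; only the overall shape (a BM25-scored query plus a non-scoring filter on the topic's PubMed identifiers) follows the text.

```python
def rerank_body(scored_query, topic_pmids, size=5000):
    """Build an Elasticsearch request body that BM25-ranks only the
    PubMed identifiers listed in a topic file."""
    return {
        "size": size,
        "query": {
            "bool": {
                # The (Boolean or PICO-annotated) query is scored with BM25
                # using the default parameters (k1 = 1.2, b = 0.75).
                "must": scored_query,
                # The filter clause restricts the results, without affecting
                # scores, to the studies retrieved by the search strategy.
                "filter": {"terms": {"pmid": topic_pmids}},
            }
        },
    }

# Hypothetical usage with a simple match query and two topic PMIDs.
body = rerank_body({"match": {"abstract": "aspirin myocardial infarction"}},
                   ["12345", "67890"])
```

Such a body would then be submitted with a client, e.g. `es.search(index="pubmed", body=body)` in the Python Elasticsearch client.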
                        NCG@10            AP
                    Boolean   PICO    Boolean   PICO
Elasticsearch        0.397    0.409    0.104    0.102
MART                 0.237    0.327    0.066    0.086
AdaRank              0.0875   0.2197   0.0255   0.0619
Coordinate Ascent*   0.305    0.378    0.076    0.114
LambdaMART           0.259    0.377    0.068    0.097
Random Forests*      0.247    0.275    0.061    0.088
Table 2: Evaluation of each learning to rank model using the features listed in Table 1 on the test data, compared to the Elasticsearch baseline. Algorithms marked with * indicate our submitted runs.
3 Results and Analysis
We found that a learning to rank approach to re-ranking studies for systematic reviews shows promising results. Table 2 compares our submitted runs to the baseline Elasticsearch ranking and to additional runs performed post-submission. The models trained using the search strategies annotated with PICO achieved slightly better results than those trained using the provided Boolean search strategies. None of our models scored higher than the baseline in NCG@10; however, the Coordinate Ascent model trained using PICO annotations outperformed the baseline in AP.
Additionally, we report AP, NCG@10, WSS@100, and the position of the last relevant document (last_rel) in Figure 1. These visualisations show that the Coordinate Ascent model provides the most effective ranking of documents (in terms of AP and WSS@100) and scores the highest amongst the learning to rank models on the recall-based measurement (NCG@10). Figure 1c shows that the learning to rank models trained on the Boolean search strategies positioned the last relevant document earliest in the re-ranked list, and that the baseline Elasticsearch runs do not do this as well.
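For reference, the rank-based measures reported in Figure 1 can be computed directly from a ranked list and its relevance judgements. The sketch below uses one common formulation of WSS@100, (N - last_rel)/N, i.e. the fraction of the list a reviewer could skip once the last relevant study has been seen; the official track evaluation scripts may differ in details.

```python
def average_precision(ranking, relevant):
    """Mean of the precision values at each rank where a relevant study occurs."""
    hits, total = 0, 0.0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

def last_rel(ranking, relevant):
    """Rank position of the last relevant study in the ranked list."""
    return max(i for i, doc in enumerate(ranking, start=1) if doc in relevant)

def wss_at_100(ranking, relevant):
    """Work Saved over Sampling at 100% recall for a full ranking of the
    screening set: fraction of the list after the last relevant study."""
    return (len(ranking) - last_rel(ranking, relevant)) / len(ranking)
```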
Figure 2 examines the effect PICO had on re-ranking. The effect appears negligible for the baseline; however, we notice an increase in precision when PICO annotations are used as training data for the learning to rank models. This suggests that the use of PICO provides a trade-off between precision and recall. Our results illustrate this clearly when precision-based measures are compared against recall-based measures.
4 Future Work
We plan to further increase the precision of our approach by tuning the hyperparameters of the best performing learning to rank models. Our learning to rank models were trained using only a small number of features; we will investigate the effect of other features commonly used for learning to rank, and explore more domain specific features in addition to PICO.
References
1. Goeuriot, L., Kelly, L., Suominen, H., Névéol, A., Robert, A., Kanoulas, E., Spijker, R., Palotti, J., Zuccon, G.: CLEF 2017 eHealth evaluation lab overview. In: CLEF 2017 - 8th Conference and Labs of the Evaluation Forum. Lecture Notes in Computer Science (LNCS), Springer (2017)
2. Kanoulas, E., Li, D., Azzopardi, L., Spijker, R.: CLEF 2017 technologically assisted reviews in empirical medicine overview. In: Working Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum, Dublin, Ireland, September 11-14, 2017. CEUR Workshop Proceedings, CEUR-WS.org (2017)
3. Scells, H., Zuccon, G., Koopman, B., Deacon, A., Azzopardi, L., Geva, S.: A test collection for evaluating retrieval of studies for inclusion in systematic reviews. In: Proceedings of SIGIR '17 (2017)
4. Tsafnat, G., Glasziou, P., Choong, M.K., Dunn, A., Galgani, F., Coiera, E.: Systematic review automation technologies. Systematic Reviews 3(1), 74 (2014)
5. Wallace, B.C., Kuiper, J., Sharma, A., Zhu, M.B., Marshall, I.J.: Extracting PICO sentences from clinical trial reports using supervised distant supervision. Journal of Machine Learning Research 17(132), 1-25 (2016)
[Figure 1: Comparison of the effects each algorithm had on different measures, for all runs including the baselines bool_es and pico_es. (a) Average Precision (AP) for each run; (b) Normalised Cumulative Gain at position 10 (NCG@10) for each run; (c) position in the re-ranked list of the last relevant study retrieved (mean last_rel); (d) Work Saved over Sampling at 100% recall (mean WSS@100) for each run.]
[Figure 2: Precision-recall curves (interpolated precision vs. recall) comparing the Boolean and PICO runs for (a) Elasticsearch, (b) Coordinate Ascent, (c) Random Forests, and (d) LambdaMART.]