=Paper=
{{Paper
|id=Vol-1609/16090015
|storemode=property
|title=The IR Task at the CLEF eHealth Evaluation Lab 2016: User-centred Health Information Retrieval
|pdfUrl=https://ceur-ws.org/Vol-1609/16090015.pdf
|volume=Vol-1609
|authors=Guido Zuccon,Joao Palotti,Lorraine Goeuriot,Liadh Kelly,Mihai Lupu,Pavel Pecina,Henning Müller,Julie Budaher,Anthony Deacon
|dblpUrl=https://dblp.org/rec/conf/clef/ZucconPGKLPMBD16
}}
==The IR Task at the CLEF eHealth Evaluation Lab 2016: User-centred Health Information Retrieval==
Guido Zuccon1, Joao Palotti2, Lorraine Goeuriot3, Liadh Kelly4, Mihai Lupu2, Pavel Pecina5, Henning Müller6, Julie Budaher3, and Anthony Deacon1
1 Queensland University of Technology, Brisbane, Australia, [g.zuccon, aj.deacon]@qut.edu.au
2 Vienna University of Technology, Vienna, Austria, [palotti,lupu]@ifs.tuwien.ac.at
3 Université Grenoble Alpes, France, [firstname.lastname]@imag.fr
4 Trinity College Dublin, Ireland, liadh.kelly@tcd.ie
5 Charles University, Prague, Czech Republic, pecina@ufal.mff.cuni.cz
6 University of Applied Sciences Western Switzerland, Switzerland, henning.mueller@hevs.ch
Abstract. This paper details the collection, systems and evaluation methods used in the IR Task of the CLEF 2016 eHealth Evaluation Lab. This task investigates the effectiveness of web search engines in providing access to medical information for people who have little or no medical knowledge. The task aims to foster advances in the development of search technologies for consumer health search by providing resources and evaluation methods to test and validate search systems.
The problem considered in this year's task was to retrieve web pages to support the information needs of health consumers who are faced with a medical condition and want to seek relevant health information online through a search engine. As part of the evaluation exercise, we gathered 300 queries that users posed with respect to 50 search task scenarios. The scenarios were developed from real cases of people seeking health information by posting requests for help on a web forum. The presence of query variations for a single scenario helped us capture the variable quality at which queries are posed. Queries were created in English and then translated into other languages. A total of 49 runs by 10 different teams were submitted for the English query topics; 2 teams submitted 29 runs for the multilingual topics.
Keywords: Evaluation, Health Search
1 Introduction
This document reports on the CLEF 2016 eHealth Evaluation Lab, IR Task (Task 3). The task investigated the problem of retrieving web pages to support the information needs of health consumers (including their next-of-kin) who are confronted with a health problem or medical condition and who use a search engine to seek a better understanding of their health. This task has been developed within the CLEF 2016 eHealth Evaluation Lab, which aims to foster the development of approaches to support patients, their next-of-kin, and clinical staff in understanding, accessing and authoring health information [15].
The use of the Web as a source of health-related information is a widespread practice among health consumers [19], and search engines are commonly used as a means to access health information available online [7]. Previous iterations of this task (i.e. the 2013 and 2014 CLEF eHealth Lab Task 3 [8,9]) aimed at evaluating the effectiveness of search engines in supporting people when searching for information about their conditions, e.g. to answer queries like "thrombocytopenia treatment corticosteroids length". These two evaluation exercises have provided valuable resources and an evaluation framework for developing and testing new and existing techniques. The fundamental contribution of these tasks to the improvement of search engine technology aimed at answering this type of health information need is demonstrated by the improvements in retrieval effectiveness provided by the best 2014 system [27] over the best 2013 system [30] (using different, but comparable, topic sets). The 2015 task instead focused on supporting consumers searching for self-diagnosis information [23], an important type of health information seeking activity [7]. This year's task expands on the 2015 task by considering not only self-diagnosis information needs, but also needs related to the treatment and management of health conditions. Previous research has shown that exposing people with no or scarce medical knowledge to complex medical language may lead to erroneous self-diagnosis and self-treatment, and that access to medical information on the Web can lead to the escalation of concerns about common symptoms (e.g., cyberchondria) [3,29]. Research has also shown that current commercial search engines are still far from being effective in answering such unclear and underspecified queries [33].
The remainder of this paper is structured as follows: Section 2 details the sub-tasks we considered this year; Section 3 describes the query set and the methodology used to create it; Section 4 describes the document collection used; Section 5 details the baselines created by the organisers as a benchmark for participants; Section 6 describes participant submissions; Section 7 details the methods used to create the assessment pools and relevance criteria; Section 8 lists the evaluation metrics used for this Task; finally, Section 9 concludes this overview paper.
2 Tasks
2.1 Sub-Task 1: Ad-hoc Search
Queries for this task are generated by mining health web forums to identify example information needs, as detailed in Section 3. Every query is treated as independent and participants are asked to generate retrieval runs in answer to such queries, as in a common ad-hoc search task. This task extends the evaluation framework used in 2015 (which considered, along with topical relevance, also the readability of the retrieved documents) to consider further dimensions of relevance, such as the reliability of the retrieved information.
2.2 Sub-Task 2: Query Variations
This task explores query variations for each single information need. Previous research has shown that different users tend to issue different queries for the same information need, and that the use of query variations in the evaluation of IR systems leads to as much variability as system variations [1,2,23]. This was also the case in this year's task. Note that we explored query variations in the 2015 IR task as well [23], and we found that, for the same image showing a health condition, different query creators issued very different queries: the queries differed not only in terms of the keywords they contained, but also with respect to their retrieval effectiveness.
Different query variations are generated for the same information need (extracted from a web forum entry, as explained in Section 3), thus capturing the variability intrinsic in how people search when they have the same information need. Participants were asked to exploit query variations when building their systems: participants were told which queries related to the same information need and they were required to produce one set of results to be used as the answer for all query variations of an information need. This task aims to foster research into building systems that are robust to query variations, for example, through considering the fusion of ranked lists produced in answer to each single query variation.
2.3 Sub-Task 3: Multilingual Ad-hoc Search
The multilingual task extends the Ad-hoc Search task by providing a translation of the queries from English into Czech, French, Hungarian, German, Polish, Spanish and Swedish. The goal of this sub-task is to support research in multilingual information retrieval, developing techniques to support users who can express their information need well in their native language and can read the results in English.
3 Query Set
We considered real health information needs expressed by the general public through posts published in public health web forums. Forum posts were extracted from the AskDocs section of Reddit (https://www.reddit.com/r/AskDocs/). This section allows users to post a description of a medical case or to ask a medical question seeking medical information such as a diagnosis or details regarding treatments. Users can also interact through comments. We selected posts that were descriptive, clear and understandable. Posts with information regarding the author or patient (in case the post author sought help for another person), such as demographics (age, gender), medical history and current medical condition, were preferred.
In order to collect query variants that could be compared, we also selected posts where a single, main information need could be identified. These constraints helped guarantee, as much as possible, that queries were created about the same aspects of the post.
The comments were also taken into account in the selection. Any user can add a comment to a post, and all users are labeled according to their medical expertise (to be labeled as a medical expert, users have to send Reddit proof such as a student ID or a diploma). We mainly selected posts with comments, including some from labeled users. The posts were manually selected by a student, and a total of 50 posts were used for query creation.
Each of the selected forum posts was presented to 6 query creators with different medical expertise: these included 3 medical experts (final year medical students undertaking rotations in hospitals) and 3 lay users with no prior medical knowledge.
All queries were preprocessed to correct spelling mistakes; this was done using the Linux program aspell. The correction was, however, manually supervised so that it was applied only when appropriate, for example so as not to change drug names. We explicitly did not remove punctuation marks from the queries; for example, participants could take advantage of the quotation marks used by the query creators to indicate proximity terms or other features.
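To illustrate the kind of supervised correction described above, the following is a minimal sketch (not the organisers' actual script) that uses `aspell list` to flag possible misspellings in each query for manual review, so that terms such as drug names are never changed automatically; the input file name and the whitelist are hypothetical.

```python
import subprocess

# Hypothetical whitelist of terms that must never be "corrected" (e.g. drug names).
WHITELIST = {"ibuprofen", "warfarin"}

def flag_misspellings(query: str):
    """Return the tokens in `query` that aspell considers misspelled."""
    # `aspell list` reads text on stdin and prints one misspelled word per line.
    out = subprocess.run(["aspell", "list"], input=query,
                         capture_output=True, text=True, check=True)
    return [w for w in out.stdout.split() if w.lower() not in WHITELIST]

if __name__ == "__main__":
    with open("queries.txt", encoding="utf-8") as f:   # hypothetical "qid<TAB>text" file
        for line in f:
            qid, text = line.rstrip("\n").split("\t", 1)
            suspicious = flag_misspellings(text)
            if suspicious:
                # Corrections were then applied manually, only when appropriate.
                print(f"{qid}: review {suspicious} in '{text}'")
```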
A total of 300 queries were created. Queries were numbered using the following convention: the first 3 digits of a query id identify a post number (information need), while the last 3 digits of a query id identify each individual query creator. Expert query creators used the identifiers 1, 2 and 3, and laypeople query creators used the identifiers 4, 5 and 6. In Figure 2 we show variants 1, 2 (both generated by laypeople) and 4 (generated by an expert) created for post number 103 (posts started from number 101), shown in Figure 1.
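As a concrete illustration of this numbering convention, the short sketch below splits a query id into its post number and creator identifier; it assumes the six-digit id format described above.

```python
def parse_query_id(query_id: str):
    """Split a six-digit query id into (post number, creator id)."""
    assert len(query_id) == 6 and query_id.isdigit()
    post = int(query_id[:3])      # first 3 digits: information need (post number)
    creator = int(query_id[3:])   # last 3 digits: individual query creator (1-6)
    return post, creator

# Per the convention stated above, creators 1-3 are medical experts, 4-6 laypeople.
print(parse_query_id("103004"))  # -> (103, 4): post 103, query creator 4
```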
For the query variations element of the task (Sub-Task 2), participants were told which queries were related to the same information need, to allow them to produce one set of results to be used as the answer for all query variations of an information need.
For the multilingual element of the challenge (Sub-Task 3), Czech, French, Hungarian, German, Polish, Spanish and Swedish translations of the queries were provided. Queries were translated by medical experts hired through a professional translation company.
Fig. 1: Post from Reddit's AskDocs section, used to generate the queries numbered 103001 to 103006.
4 Dataset
Previous IR tasks in the CLEF eHealth Lab have used the Khresmoi collection [12,10], a collection of about 1 million health web pages. This year we set a new challenge to the participants by using ClueWeb12-B13 (http://lemurproject.org/clueweb12/), a collection of more than 52 million web pages. As opposed to the Khresmoi collection, the crawl in ClueWeb12-B13 is not limited to certified Health On the Net websites and known health portals, but is a higher-fidelity representation of a common Internet crawl, making the dataset more in line with the content that current web search engines index and retrieve.
For participants who did not have access to the ClueWeb dataset, Carnegie Mellon University granted the organisers permission to make the dataset available through cloud computing instances provided by Microsoft Azure. (The organisers are thankful to Carnegie Mellon University, and in particular to Jamie Callan and Christina Melucci, for their support in obtaining the permission to redistribute ClueWeb12. The organisers are also thankful to Microsoft Azure, which provided the Azure cloud computing infrastructure made available to participants through the Microsoft Azure for Research Award CRM:0518649.) The Azure instances that were made available to participants for the IR challenge included (1) the ClueWeb12-B13 dataset, (2) standard indexes built with the Terrier (http://terrier.org/) [18] and Indri (http://www.lemurproject.org/indri.php) [28] toolkits, and (3) additional resources such as a spam list [6], PageRank scores, anchor texts [13], URLs, etc. made available through the ClueWeb12 website.
5 Baselines
We generated 55 runs, of which 19 were for Sub-Task 1 and 36 for Sub-Task 2, based on common baseline models and simple approaches for fusing query variations. In this section we describe the baseline runs.
...
103001  headaches relieved by blood donation
103002  high iron headache
...
103004  headaches caused by too much blood or "high blood pressure"
...
Fig. 2: Extract from the official query set released.
5.1 Baselines for Sub-Task 1
A total of 12 standard baselines were generated using:
– Indri v5.9 with default parameters for the LMDirichlet, Okapi, and TFIDF models.
– Terrier v4.0 with default parameters for the BM25, DirichletLM and TFIDF models.
For both systems, we created runs with and without the default pseudo-relevance feedback (PRF) of each toolkit. When using PRF, we added to the original query the top 10 terms of the top 3 documents. All these baseline runs were created using the Terrier and Indri instances made available to participants on the Azure platform.
Additionally, we created a set of baseline runs that take into account the
reliability and understandability of information.
Five reliability baselines were created based on the spam rankings distributed with ClueWeb12 (http://www.mansci.uwaterloo.ca/~msmucker/cw12spam/) [6]. For a given run, we removed all documents with a spam score smaller than a given threshold th. We used the BM25 baseline run of Terrier and 5 different values for th (50, 60, 70, 80 and 90).
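The spam-filtered baselines can be reproduced with a few lines of code. The sketch below is a minimal, assumption-laden version: it assumes a TREC-format run file and a file mapping each ClueWeb12 document id to its Waterloo spam percentile score, and drops from the run every document whose score falls below the threshold th; file names are hypothetical.

```python
def load_spam_scores(path):
    """Read 'score clueweb-docid' lines into a dict of spam percentile scores."""
    scores = {}
    with open(path) as f:
        for line in f:
            score, docid = line.split()
            scores[docid] = int(score)
    return scores

def filter_run(run_path, out_path, spam_scores, th=70):
    """Keep only documents whose spam score is at least `th` (TREC run format).
    Ranks are left as-is for brevity; trec_eval orders documents by score."""
    with open(run_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            qid, _, docid, rank, score, tag = line.split()
            if spam_scores.get(docid, 0) >= th:
                fout.write(line)

# Example (hypothetical file names):
# spam = load_spam_scores("clueweb12_spam_percentiles.txt")
# filter_run("bm25_terrier.run", "bm25_spam70.run", spam, th=70)
```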
Two understandability baselines were created using readability formulae. We created runs based on CLI (Coleman-Liau Index) and GFI (Gunning Fog Index) scores [4,11], which are a proxy for the number of years of schooling required to read the text being evaluated. These two readability formulae were chosen because they were shown to be robust across different methods for HTML preprocessing [24]. We followed one of the methods suggested in [24], in which the HTML documents are preprocessed using Justext (https://pypi.python.org/pypi/jusText) [26], the main text is extracted, periods are added at the end of sentences whenever necessary (e.g., in the presence of line breaks), and then readability scores are calculated. Given the initial score S for a document and its readability score R, the final score for each document is computed as S × 1/R.
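As an illustration of the understandability baselines, the sketch below follows the spirit of the procedure above: it extracts the main text with the justext package, computes a Gunning Fog Index estimate, and re-scores each retrieved document as S × 1/R. The syllable heuristic and the missing sentence-repair step are simplifications, not the exact pipeline of [24].

```python
import re
import justext  # pip install justext

def main_text(html: str) -> str:
    """Keep only non-boilerplate paragraphs extracted by Justext."""
    paragraphs = justext.justext(html, justext.get_stoplist("English"))
    return " ".join(p.text for p in paragraphs if not p.is_boilerplate)

def gunning_fog(text: str) -> float:
    """Rough Gunning Fog Index: 0.4 * (words/sentences + 100 * complex_words/words).
    Complex words are approximated as words with three or more vowel groups."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        return 1.0
    complex_words = sum(1 for w in words if len(re.findall(r"[aeiouy]+", w.lower())) >= 3)
    return 0.4 * (len(words) / sentences + 100.0 * complex_words / len(words))

def rescore(initial_score: float, html: str) -> float:
    """Combine the retrieval score S with the readability score R as S * 1/R."""
    readability = max(1.0, gunning_fog(main_text(html)))
    return initial_score * (1.0 / readability)
```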
5.2 Baselines for Sub-Task 2
We explored three ways to combine query variations:
– Concatenation: we concatenated the text of each query variation into a single
query.
– Reciprocal Rank Fusion [5]: we fuse the ranked lists obtained for each query variation using the reciprocal rank fusion approach, i.e.,

  RRFScore(d) = \sum_{r \in R} \frac{1}{k + r(d)},

where D is the set of documents to be ranked, R is the set of document rankings retrieved for each query variation by the same retrieval model, r(d) is the rank of document d in ranking r, and k is a constant set to 60, as in [5].
– Rank-Biased Precision Fusion: similarly to reciprocal rank fusion, we fuse the documents retrieved for each query variation using the Rank-Biased Precision (RBP) formula [20],

  RBPScore(d) = \sum_{r \in R} (1 - p) \cdot p^{r(d) - 1},

where p is the free parameter of the RBP model, used to estimate user persistence. Here we set p = 0.80.
For each of the three methods described above, we created a run based on each of the three baseline models for Terrier and Indri, with and without pseudo-relevance feedback. A total of 36 baseline runs were created for Sub-Task 2 (the combination of 3 fusion approaches, 2 toolkits, 3 models, with and without PRF). A sketch of the two rank-fusion approaches is given below.
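The sketch assumes each query variation's result list is given as an ordered list of document ids (rank 1 first) and fuses them with RRF (k = 60) and RBP-based weights (p = 0.80), following the formulas above; it is an illustration, not the organisers' scripts.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank(d))."""
    scores = defaultdict(float)
    for ranking in rankings:                 # one ranked list per query variation
        for rank, docid in enumerate(ranking, start=1):
            scores[docid] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def rbp_fuse(rankings, p=0.80):
    """RBP fusion: score(d) = sum over rankings of (1 - p) * p**(rank(d) - 1)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, docid in enumerate(ranking, start=1):
            scores[docid] += (1.0 - p) * p ** (rank - 1)
    return sorted(scores, key=scores.get, reverse=True)

# Example: three query variations for the same information need.
variations = [["d1", "d2", "d3"], ["d2", "d1", "d4"], ["d3", "d2", "d5"]]
print(rrf_fuse(variations)[:3])
print(rbp_fuse(variations)[:3])
```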
6 Participant Submissions
The number of registered participants for the CLEF eHealth IR Task was 58; of these, 10 submitted at least one run for at least one of the sub-tasks, as shown in Table 1. Each team could submit up to 3 runs for Sub-Tasks 1 and 2, and up to 3 runs for each language of Sub-Task 3.
We include below a summary of the approach of each team, as self-described when their runs were submitted.
Table 1: Participating teams and the number of submissions for each Sub-Task.

Team Name     University                                              Country         Sub-Task 1  Sub-Task 2  Sub-Task 3
CUNI          Charles University in Prague                            Czech Republic       2           2          21
ECNU          East China Normal University                            China                3           3           8
GUIR          Georgetown University                                   United States        3           3           -
InfoLab       Universidade do Porto                                   Portugal             3           3           -
KDEIR         Toyohashi University of Technology                      Japan                3           2           -
KISTI         Korea Institute of Science and Technology Information   Korea                3           -           -
MayoNLPTeam   Mayo Clinic                                             United States        3           -           -
MRIM          Laboratoire d'Informatique de Grenoble                  France               3           -           -
ub-botswana   University of Botswana                                  Botswana             3           -           -
WHUIRgroup    Wuhan University                                        China                3           3           -
10 Teams      10 Institutions                                         8 Countries         29          16          29
CUNI: The CUNI team participated in all the sub-tasks, but their main focus was on the multilingual search in sub-task 3. The monolingual runs in sub-tasks 1 and 2 are mainly intended for comparison with the multilingual runs in sub-task 3. In sub-task 1 (Ad-Hoc Search), Run 1 employs the Terrier implementation of the Dirichlet-smoothed language model with the µ parameter tuned on the data from previous CLEF eHealth tasks; Run 2 uses the Terrier vector space TF/IDF model. In sub-task 2 (Query Variations), all query variants of one information need are searched for by the retrieval system (Dirichlet-smoothed language model in Run 1 and vector space TF/IDF model in Run 2) and the resulting lists of ranked documents are merged and reranked by document scores to produce one ranked list of documents for each information need. In sub-task 3 (Multilingual Search), all the non-English queries (including variants) are translated into English using their own statistical machine translation systems adapted to translate search queries in the medical domain. For each non-English query, 15 translation variants (hypotheses) are obtained. Their multilingual Run 1 employs the single best translation for each query as provided by the translation systems. In their multilingual Run 2, the top 15 translation hypotheses are reranked using a discriminative regression model employing a) features provided by the translation system and b) various kinds of features extracted from the document collection, external resources (UMLS - Unified Medical Language System, Wikipedia), or the translations themselves. Run 3 employs the same reranking method applied to the translation system features only.
ECNU: The ECNU team proposes a Web-based query expansion model and a combination method to better address the task. They use as a baseline the Terrier implementation of the BM25 model. The other runs for sub-tasks 1, 2 and 3 explore Google search and MeSH to perform query expansion. The BM25, DFR BM25, BB2 and PL2 models of Terrier, and the TF-IDF and BM25 models of Indri, were used and combined. For the sub-task 3 runs, Google Translate was used to translate the queries from Czech, French, Polish and Swedish into English before applying the same methods as in sub-task 1.
GUIR: GUIR studies the use of medical terms for query reformulation. Synonyms and hypernyms from UMLS are used to generate reformulations of the queries; Terrier with a Divergence from Randomness model is used for retrieving and scoring documents. For sub-task 1, the results obtained from the reformulated queries are combined with the Borda rank aggregation algorithm. For sub-task 2, for each topic, the results obtained for every reformulated query in the topic are merged using the Borda rank aggregation algorithm.
InfoLab: Team InfoLab analyses the performance of several query expansion strategies using different methods to select the terms to be added to the original query. One of the methods uses the similarity between Wikipedia articles, found through an analysis of incoming and outgoing links, for term selection. The other method applies Latent Dirichlet Allocation to Wikipedia articles to extract topics, each containing a set of words that are used for term selection. In the end, readability metrics were used to re-rank the documents retrieved using the expanded queries.
KDEIR: KDEIR submitted runs for sub-tasks 1 and 2. In both sub-tasks the
Waterloo spam score was used to filter out the spammiest documents, and the
link structure present in the remaining documents was explored on top of their
language model baseline.
KISTI: KISTI attempts two approaches using word vectors learned with Word2Vec on medical Wikipedia pages. First, initial documents are obtained using a search engine. Based on these documents, pseudo-relevance feedback (PRF) is applied, with two different uses of the word vectors. In the first approach, PRF is performed with new relevance scores computed using the word vectors, while in the second approach it is performed with a new query expanded using the word vectors.
MayoNLPTeam: Mayo explores a Part-of-Speech (POS) based query term weighting approach which assigns different weights to the query terms according to their POS categories. The weights are learned by defining an objective function based on mean average precision. They apply the proposed approach, with the optimal weights obtained from the TREC 2011 and 2012 Medical Records Tracks, to the Query Likelihood model (Run 2) and the Markov Random Field (MRF) model (Run 3). The conventional Query Likelihood model was implemented as the baseline (Run 1).
MRIM: MRIM's objective is to investigate the effectiveness of word embeddings for query expansion in consumer health search, as well as the effect of the learning resource on the results. Their system uses the Terrier index provided by the organisers. As a retrieval model, the Dirichlet language model is used with default settings. Query expansion is applied using word embedding sources built on two training sets. Word2vec is used for the word embeddings.
ub-botswana: In this participation, the effectiveness of three retrieval strategies is evaluated. In particular, PL2 is deployed with a Boolean Fallback score modifier as the baseline system. With this score modifier, if any of the retrieved documents contains all undecorated query terms (i.e. query terms without any operators), then documents that do not contain all undecorated query terms are removed from the result set; otherwise, nothing is done. In another approach, the collection enrichment approach is employed, where the original query is expanded with additional terms from an external collection (a collection other than the one being searched). To deliver an effective ranking, the first two rankers are combined using data fusion techniques.
WHUIRgroup: WHUIR uses Indri to conduct the experiments. The CHV (Consumer Health Vocabulary) is used to expand queries, and a learning-to-rank algorithm is proposed to re-rank the results.
7 Assessments
A pool of 25,000 documents was created using the RBP-based Method A (summing contributions) by Moffat et al. [20], in which documents are weighted according to their overall contribution to the effectiveness evaluation as provided by the RBP formula (with p = 0.8, following Park and Zhang [25]). This strategy was chosen because it has been shown to be preferable to traditional fixed-depth or stratified pooling when evaluating systems under fixed assessment budget constraints [17]. A total of 100 runs were used (all baselines plus all participant runs for Sub-Tasks 1 and 2) to form the assessment pools.
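The pooling strategy can be sketched as follows, under the assumption that the summing-contributions method assigns each document a weight equal to the sum, over all runs that retrieved it, of the RBP rank weight (1 − p)p^(rank−1), and then fills the pool with the highest-weighted documents until the assessment budget is reached; this is a simplified reading of [20,17], not the organisers' exact implementation.

```python
from collections import defaultdict

def rbp_pool(runs, budget=25000, p=0.8):
    """RBP-based pooling ('summing contributions'): weight each document by the
    total RBP weight of the ranks at which the runs retrieved it, then keep the
    `budget` highest-weighted documents for assessment."""
    weight = defaultdict(float)
    for ranking in runs:                       # one ranked list of docids per run
        for rank, docid in enumerate(ranking, start=1):
            weight[docid] += (1.0 - p) * p ** (rank - 1)
    pooled = sorted(weight, key=weight.get, reverse=True)
    return pooled[:budget]
```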
Assessment was performed by paid final year medical students who had access to the queries, the documents, and relevance criteria drafted by a junior medical doctor. The relevance criteria were drafted considering the entirety of the forum posts used to create the queries; a link to the forum posts was also provided to the assessors.
Relevance assessments were provided with respect to the grades Highly relevant, Somewhat relevant and Not relevant. Readability/understandability and reliability/trustworthiness judgements were also collected for the documents in the assessment pool. These judgements were collected as an integer value between 0 and 100 (lower values meant a harder to understand document / lower reliability), provided by judges through a slider tool; these judgements were used to evaluate systems across different dimensions of relevance [32,31]. All assessments were collected through a purposely customised version of the Relevation! toolkit [16].
8 Evaluation Metrics
System evaluation was conducted using precision at 10 (P@10) and normalised discounted cumulative gain at 10 (nDCG@10) [14] as the primary and secondary measures, respectively. Precision was computed using binary relevance assessments obtained by collapsing the Highly relevant and Somewhat relevant assessments into the Relevant class, while nDCG was computed using the graded relevance assessments.
A separate evaluation was conducted using the multidimensional relevance assessments (topical relevance, understandability and trustworthiness), following the methods in [31]. For all runs, rank-biased precision (RBP) [20], with a persistence parameter p = 0.80 (see [25]), was computed along with the multidimensional modifications of RBP, namely uRBP (using binary understandability assessments), uRBPgr (using graded understandability assessments), and u+tRBP (using binary understandability and trustworthiness assessments).
Precision and nDCG were computed using trec_eval (http://trec.nist.gov/trec_eval/trec_eval_latest.tar.gz) along with RBP, while the multidimensional evaluation was performed using ubire (https://github.com/ielab/ubire) [31].
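To make the multidimensional measures concrete, the sketch below computes plain RBP and a uRBP-style variant in which the relevance gain at each rank is multiplied by an understandability gain (1 if the document was judged understandable, 0 otherwise, in the binary case); this follows the general scheme of [31,32] but is not a substitute for the ubire implementation.

```python
def rbp(ranking, rel, p=0.80):
    """Rank-biased precision: (1 - p) * sum_k p**(k-1) * gain(d_k)."""
    return (1.0 - p) * sum(p ** (k - 1) * rel.get(d, 0.0)
                           for k, d in enumerate(ranking, start=1))

def urbp(ranking, rel, understandable, p=0.80):
    """uRBP-style measure: the relevance gain is counted only for documents
    that are also judged understandable (binary understandability)."""
    return (1.0 - p) * sum(p ** (k - 1) * rel.get(d, 0.0) * understandable.get(d, 0.0)
                           for k, d in enumerate(ranking, start=1))

# Example with hypothetical judgements: relevance in {0, 0.5, 1}, understandability in {0, 1}.
rel = {"d1": 1.0, "d2": 0.5, "d3": 0.0}
und = {"d1": 0.0, "d2": 1.0, "d3": 1.0}
print(rbp(["d1", "d2", "d3"], rel), urbp(["d1", "d2", "d3"], rel, und))
```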
9 Conclusions
This paper describes the methods, results and analysis of the CLEF 2016 eHealth Evaluation Lab, IR Task. The task considers the problem of retrieving web pages for people seeking health information regarding medical conditions, treatments and suggestions. The task was divided into 3 sub-tasks: ad-hoc search, query variations, and multilingual ad-hoc search. Ten teams participated in the task; relevance assessment is underway, and the assessments along with the participants' results will be released at the CLEF 2016 conference (and will be available at the task's GitHub repository).
As a by-product of this evaluation exercise, the task makes available to the research community a collection with associated assessments and an evaluation framework (including readability and reliability evaluation) that can be used to evaluate the effectiveness of retrieval methods for health information seeking on the web (e.g. [21,22]).
Baseline runs, participant runs and results, assessments, topics and query variations are available online at the GitHub repository for this Task: https://github.com/CLEFeHealth/CLEFeHealth2016Task3.
10 Acknowledgments
This work has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 644753 (KConnect), and from the Austrian Science Fund (FWF) projects P25905-N23 (ADmIRE) and I1094-N23 (MUCKE). We would also like to thank the Microsoft Azure grant (CRM:0518649), the ESF for the financial support for relevance assessments and query creation, and the many assessors for their hard work.
References
1. L. Azzopardi. Query Side Evaluation: An Empirical Analysis of Effectiveness and Effort. In Proc. of SIGIR, 2009.
2. P. Bailey, A. Moffat, F. Scholer, and P. Thomas. User Variability and IR System Evaluation. In Proc. of SIGIR, 2015.
3. M. Benigeri and P. Pluye. Shortcomings of health information on the internet.
Health promotion international, 18(4):381–386, 2003.
4. M. Coleman and T. L. Liau. A computer readability formula designed for machine
scoring. Journal of Applied Psychology, 60:283–284, 1975.
5. G. V. Cormack, C. L. A. Clarke, and S. Buettcher. Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '09, pages 758–759, New York, NY, USA, 2009. ACM.
6. G. V. Cormack, M. D. Smucker, and C. L. Clarke. Efficient and effective spam filtering and re-ranking for large web datasets. Information Retrieval, 14(5):441–465, 2011.
7. S. Fox. Health topics: 80% of internet users look for health information online.
Pew Internet & American Life Project, 2011.
8. L. Goeuriot, G. J. Jones, L. Kelly, J. Leveling, A. Hanbury, H. Müller, S. Salantera, H. Suominen, and G. Zuccon. ShARe/CLEF eHealth Evaluation Lab 2013, Task 3: Information retrieval to address patients' questions when reading clinical reports. CLEF 2013 Online Working Notes, 8138, 2013.
9. L. Goeuriot, L. Kelly, W. Lee, J. Palotti, P. Pecina, G. Zuccon, A. Hanbury, H. Müller, and G. J. F. Jones. ShARe/CLEF eHealth Evaluation Lab 2014, Task 3: User-centred health information retrieval. In CLEF 2014 Evaluation Labs and Workshop: Online Working Notes, Sheffield, UK, 2014.
10. L. Goeuriot, L. Kelly, G. Zuccon, and J. Palotti. Building evaluation datasets for
consumer-oriented information retrieval. In Proceedings of the Tenth International
Conference on Language Resources and Evaluation (LREC 2016), Paris, France,
may 2016. European Language Resources Association (ELRA).
11. R. Gunning. The Technique of Clear Writing. McGraw-Hill, 1952.
12. A. Hanbury. Medical information retrieval: an instance of domain-specific search.
In Proceedings of SIGIR 2012, pages 1191–1192, 2012.
13. D. Hiemstra and C. Hauff. MIREX: MapReduce information retrieval experiments. arXiv preprint arXiv:1004.4489, 2010.
14. K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques.
ACM Transactions on Information Systems, 20(4):422–446, 2002.
15. L. Kelly, L. Goeuriot, H. Suominen, A. Neveol, J. Palotti, and G. Zuccon. Overview
of the CLEF eHealth Evaluation Lab 2016. In Information Access Evaluation.
Multilinguality, Multimodality, and Visualization. Springer Berlin Heidelberg, 2016.
16. B. Koopman and G. Zuccon. Relevation! An open source system for information retrieval relevance assessment. arXiv preprint, 2013.
17. A. Lipani, G. Zuccon, M. Lupu, B. Koopman, and A. Hanbury. The impact of fixed-cost pooling strategies on test collection bias. In Proceedings of the 2016 International Conference on the Theory of Information Retrieval, ICTIR '16, New York, NY, USA, 2016. ACM.
18. C. Macdonald, R. McCreadie, R. L. Santos, and I. Ounis. From puppy to maturity:
Experiences in developing terrier. Proc. of OSIR at SIGIR, pages 60–63, 2012.
19. D. McDaid and A. Park. Online health: Untangling the web. Evidence from the Bupa Health Pulse 2010 international healthcare survey. Technical report, 2011.
20. A. Moffat and J. Zobel. Rank-biased precision for measurement of retrieval effectiveness. ACM Trans. Inf. Syst., 27(1):2:1–2:27, Dec. 2008.
21. J. Palotti, L. Goeuriot, G. Zuccon, and A. Hanbury. Ranking health web pages
with relevance and understandability. In Proceedings of the 39th international ACM
SIGIR conference on Research and development in information retrieval, 2016.
22. J. Palotti, G. Zuccon, J. Bernhardt, A. Hanbury, and L. Goeuriot. Assessors Agree-
ment: A Case Study across Assessor Type, Payment Levels, Query Variations and
Relevance Dimensions. In Experimental IR Meets Multilinguality, Multimodality,
and Interaction: 7th International Conference of the CLEF Association, CLEF’16
Proceedings. Springer International Publishing, 2016.
23. J. Palotti, G. Zuccon, L. Goeuriot, L. Kelly, A. Hanbury, G. J. Jones, M. Lupu, and P. Pecina. CLEF eHealth Evaluation Lab 2015, Task 2: Retrieving Information about Medical Symptoms. In CLEF 2015 Online Working Notes. CEUR-WS, 2015.
24. J. Palotti, G. Zuccon, and A. Hanbury. The influence of pre-processing on the
estimation of readability of web documents. In Proceedings of the 24th ACM In-
ternational on Conference on Information and Knowledge Management, CIKM ’15,
pages 1763–1766, New York, NY, USA, 2015. ACM.
25. L. Park and Y. Zhang. On the distribution of user persistence for rank-biased
precision. In Proceedings of the 12th Australasian document computing symposium,
pages 17–24, 2007.
26. J. Pomikálek. Removing boilerplate and duplicate content from web corpora. PhD
thesis, Masaryk university, Faculty of informatics, Brno, Czech Republic, 2011.
27. W. Shen, J.-Y. Nie, X. Liu, and X. Liui. An investigation of the effectiveness of concept-based approach in medical information retrieval GRIUM@CLEF2014eHealthTask 3. In Proceedings of the CLEF eHealth Evaluation Lab, 2014.
28. T. Strohman, D. Metzler, H. Turtle, and W. B. Croft. Indri: A language model-
based search engine for complex queries. In Proceedings of the International Con-
ference on Intelligent Analysis, volume 2, pages 2–6. Citeseer, 2005.
29. R. W. White and E. Horvitz. Cyberchondria: studies of the escalation of medical
concerns in web search. ACM TOIS, 27(4):23, 2009.
30. D. Zhu, S. T.-I. Wu, J. J. Masanz, B. Carterette, and H. Liu. Using discharge
summaries to improve information retrieval in clinical domain. In Proceedings of
the CLEF eHealth Evaluation Lab, 2013.
31. G. Zuccon. Understandability biased evaluation for information retrieval. In Ad-
vances in Information Retrieval, pages 280–292, 2016.
32. G. Zuccon and B. Koopman. Integrating understandability in the evaluation of
consumer health search engines. In Medical Information Retrieval Workshop at
SIGIR 2014, page 32, 2014.
33. G. Zuccon, B. Koopman, and J. Palotti. Diagnose this if you can: On the effectiveness of search engines in finding medical self-diagnosis information. In Advances in Information Retrieval, pages 562–567. Springer, 2015.