    The IR Task at the CLEF eHealth Evaluation
     Lab 2016: User-centred Health Information
                     Retrieval

Guido Zuccon1 , Joao Palotti2 , Lorraine Goeuriot3 , Liadh Kelly4 , Mihai Lupu2 ,
   Pavel Pecina5 , Henning Müller6 , Julie Budaher3 , and Anthony Deacon1
           1
            Queensland University of Technology, Brisbane, Australia,
                       [g.zuccon, aj.deacon]@qut.edu.au
              2
                 Vienna University of Technology, Vienna, Austria,
                        [palotti,lupu]@ifs.tuwien.ac.at
                       3
                          Université Grenoble Alpes, France
                         [firstname.lastname]@imag.fr
                          4
                             Trinity College Dublin, Ireland
                               liadh.kelly@tcd.ie
                  5
                    Charles University, Prague, Czech Republic
                            pecina@ufal.mff.cuni.cz
        6
          University of Applied Sciences Western Switzerland, Switzerland
                            henning.mueller@hevs.ch


      Abstract. This paper details the collection, systems and evaluation
      methods used in the IR Task of the CLEF 2016 eHealth Evaluation Lab.
      This task investigates the effectiveness of web search engines in providing
      access to medical information for people who have little or no
      medical knowledge. The task aims to foster advances in the development
      of search technologies for consumer health search by providing resources
      and evaluation methods to test and validate search systems.
      The problem considered in this year’s task was to retrieve web pages
      to support the information needs of health consumers who are faced with
      a medical condition and who want to seek relevant health information
      online through a search engine. As part of the evaluation exercise, we
      gathered 300 queries that users posed with respect to 50 search task scenar-
      ios. The scenarios were developed from real cases of people seeking health
      information by posting requests for help on a web forum. The pres-
      ence of query variations for a single scenario helped us capture the
      variable quality at which queries are posed. Queries were created in En-
      glish and then translated into other languages. A total of 49 runs by
      10 different teams were submitted for the English query topics; 2 teams
      submitted 29 runs for the multilingual topics.

      Keywords: Evaluation, Health Search


1    Introduction
This document reports on the CLEF 2016 eHealth Evaluation Lab, IR Task
(Task 3). The task investigated the problem of retrieving web pages to support the
information needs of health consumers (including their next-of-kin) who are con-
fronted with a health problem or medical condition and who use a search engine
to seek a better understanding of their health. This task has been developed
within the CLEF 2016 eHealth Evaluation Lab, which aims to foster the devel-
opment of approaches to support patients, their next-of-kin, and clinical staff in
understanding, accessing and authoring health information [15].
    The use of the Web as a source of health-related information is a widespread
practice among health consumers [19], and search engines are commonly used as a
means to access health information available online [7]. Previous iterations of this
task (i.e. the 2013 and 2014 CLEF eHealth Lab Task 3 [8,9]) aimed at evaluating
the effectiveness of search engines in supporting people when searching for infor-
mation about their conditions, e.g. to answer queries like “thrombocytopenia
treatment corticosteroids length”. These two evaluation exercises have provided
valuable resources and an evaluation framework for developing and testing new
and existing techniques. The fundamental contribution of these tasks to the im-
provement of search engine technology aimed at answering this type of health
information need is demonstrated by the improvements in retrieval effectiveness
provided by the best 2014 system [27] over the best 2013 system [30] (using
different, but comparable, topic sets). The 2015 task instead focused on support-
ing consumers searching for self-diagnosis information [23], an important type
of health information seeking activity [7]. This year’s task expands on the 2015
task by considering not only self-diagnosis information needs, but also needs re-
lated to the treatment and management of health conditions. Previous research has
shown that exposing people with no or scarce medical knowledge to complex
medical language may lead to erroneous self-diagnosis and self-treatment, and
that access to medical information on the Web can lead to the escalation of con-
cerns about common symptoms (e.g., cyberchondria) [3,29]. Research has also
shown that current commercial search engines are still far from being effective
in answering such unclear and underspecified queries [33].
    The remainder of this paper is structured as follows: Section 2 details the
sub-tasks we considered this year; Section 3 describes the query set and the
methodology used to create it; Section 4 describes the document collection;
Section 5 details the baselines created by the organisers as a benchmark for
participants; Section 6 describes participant submissions; Section 7 details the
methods used to create the assessment pools and relevance criteria; Section 8
lists the evaluation metrics used for this Task; finally, Section 9 concludes this
overview paper.


2     Tasks
2.1   Sub-Task 1: Ad-hoc Search
Queries for this task are generated by mining health web forums to identify
example information needs, as detailed in Section 3. Every query is treated as
independent and participants are asked to generate retrieval runs in answer to
these queries, as in a common ad-hoc search task. This task extends the evaluation
framework used in 2015 (which considered, along with topical relevance, also
the readability of the retrieved documents) to consider further dimensions of
relevance, such as the reliability of the retrieved information.


2.2    Sub-Task 2: Query Variations

This task explores query variations for each single information need. Previous
research has shown that different users tend to issue different queries for the
same information need and that the use of query variations for the evaluation of
IR systems leads to as much variability as system variations [1,2,23]. This was
also the case in this year’s task. Note that we also explored query variations in
the 2015 IR task [23], and we found that for the same image showing a health
condition, different query creators issued very different queries: these differed not
only in terms of the keywords contained in the query, but also with respect to
their retrieval effectiveness.
    Different query variations are generated for the same information need (ex-
tracted from a web forum entry, as explained in Section 3), thus capturing the
variability intrinsic in how people search when they have the same information
need. Participants were asked to exploit query variations when building their
systems: participants were told which queries related to the same information
need and they were required to produce one set of results to be used as the answer
for all query variations of an information need. This task aims to foster research
into building systems that are robust to query variations, for example, through
considering the fusion of ranked lists produced in answer to each single query
variation.


2.3    Sub-Task 3: Multilingual Ad-hoc Search

The multilingual task extends the Ad-hoc Search task by providing a transla-
tion of the queries from English into Czech, French, Hungarian, German, Polish,
Spanish and Swedish. The goal of this sub-task is to support research in mul-
tilingual information retrieval, developing techniques to support users who can
express their information need well in their native language and can read the
results in English.


3     Query Set

We considered real health information needs expressed by the general public
through posts published in public health web forums. Forum posts were ex-
tracted from the AskDocs section of Reddit7. This section allows users to post a
description of a medical case or to ask a medical question, seeking medical
information such as a diagnosis or details regarding treatments. Users can also
interact through comments. We selected posts that were descriptive, clear and
understandable. Posts with information regarding the author or patient (in case
the post author sought help for another person), such as demographics (age,
gender), medical history and current medical condition, were preferred.
7
    https://www.reddit.com/r/AskDocs/
    In order to collect query variants that could be compared, we also selected
posts where a main and single information need could be identified. These con-
straints help ensure, as much as possible, that the queries created relate to the
same aspects of the post.
    The comments were also taken into account in the selection. Any user can
add a comment to a post, and all users are labeled according to their medical
expertise8. We mainly selected posts with comments, including some from la-
beled users. The posts were manually selected by a student, and a total of 50
posts were used for query creation.
    Each of the selected forum posts was presented to 6 query creators with
different levels of medical expertise: these included 3 medical experts (final year
medical students undertaking rotations in hospitals) and 3 lay users with no
prior medical knowledge.
    All queries were preprocessed to correct spelling mistakes; this was done
using the Linux program aspell. The correction was manually supervised, however,
so that it was applied only when appropriate, for example so as not to
change drug names. We explicitly did not remove punctuation marks from the
queries, e.g., participants could take advantage of the quotation marks used by
the query creators to indicate proximity terms or other features.
    A total of 300 queries were created. Queries were numbered using the follow-
ing convention: the first 3 digits of a query id identify a post number (information
need), while the last 3 digits identify the individual query creator. Expert query
creators used the identifiers 1, 2 and 3 and laypeople query creators used the
identifiers 4, 5 and 6. In Figure 2 we show variants 1 and 2 (both generated by
medical experts) and 4 (generated by a layperson), created for post number 103
(posts started from number 101), shown in Figure 1.
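    To make the numbering convention concrete, the short sketch below (a hypothetical
helper, not part of the released resources) splits a query id into its post number and
creator identifier and infers the creator type from the convention just described.

    def parse_query_id(query_id):
        """Split a query id such as '103001' into its components: the first 3 digits
        identify the post (information need), the last 3 digits the query creator
        (1-3: medical experts, 4-6: laypeople, per the convention above)."""
        post_number = int(query_id[:3])
        creator_id = int(query_id[3:])
        creator_type = "expert" if creator_id <= 3 else "layperson"
        return post_number, creator_id, creator_type

    print(parse_query_id("103001"))   # (103, 1, 'expert')
    print(parse_query_id("103004"))   # (103, 4, 'layperson')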
    For the query variations element of the task (sub-task 2), participants were
told which queries were related to the same information need, to allow them to
produce one set of results to be used as answer for all query variations of an
information need.
    For the multilingual element of the challenge (sub-task 3), Czech, French,
Hungarian, German, Polish, Spanish and Swedish translations of the queries
were provided. Queries were translated by medical experts hired through a pro-
fessional translation company.


4   Dataset
Previous IR tasks in the CLEF eHealth Lab have used the Khresmoi collec-
tion [12,10], a collection of about 1 million health web pages. This year we set
a new challenge to the participants by using the ClueWeb12-B139, a collection
of more than 52 million web pages. As opposed to the Khresmoi collection, the
crawl in ClueWeb12-B13 is not limited to certified Health On the Net websites
and known health portals, but is a higher-fidelity representation of a common
Internet crawl, making the dataset more in line with the content that current web
search engines index and retrieve.
8
  To be labeled as a medical expert, users have to send Reddit proof such as a
  student ID or a diploma.
9
  http://lemurproject.org/clueweb12/


Fig. 1: A post from Reddit’s AskDocs section, used to generate queries 103001 to 103006.
    For participants who did not have access to the ClueWeb dataset, Carnegie
Mellon University granted the organisers permission to make the dataset avail-
able through cloud computing instances10 provided by Microsoft Azure. The
Azure instances that were made available to participants for the IR challenge
included (1) the ClueWeb12-B13 dataset, (2) standard indexes built with the Ter-
rier11 [18] and the Indri12 [28] toolkits, and (3) additional resources such as a spam
list [6], PageRank scores, anchor texts [13], URLs, etc., made available through
the ClueWeb12 website.


5    Baselines

We generated 55 runs, of which 19 were for Sub-Task 1 and 36 for Sub-Task
2, based on common baseline models and simple approaches for fusing query
variations. In this section we describe the baseline runs.
10
   The organisers are thankful to Carnegie Mellon University, and in particular to
   Jamie Callan and Christina Melucci, for their support in obtaining the permission
   to redistribute ClueWeb 12. The organisers are also thankful to Microsoft Azure
   who provided the Azure cloud computing infrastructure that was made available to
   participants through the Microsoft Azure for Research Award CRM:0518649.
11
   http://terrier.org/
12
   http://www.lemurproject.org/indri.php

    ...
    103001    headaches relieved by blood donation
    103002    high iron headache
    ...
    103004    headaches caused by too much blood or "high blood pressure"
    ...

                  Fig. 2: Extract from the official query set released.



5.1     Baselines for Sub-Task 1

A total of 12 standard baselines were generated using:

 – Indri v5.9 with default parameters for the LMDirichlet, Okapi and TFIDF
   models.
 – Terrier v4.0 with default parameters for the BM25, DirichletLM and TFIDF
   models.

For both systems, we created runs with and without the default pseudo-
relevance feedback (PRF) of each toolkit. When using PRF, we added to the
original query the top 10 terms of the top 3 documents. All these baseline runs
were created using the Terrier and Indri instances made available to participants
on the Azure platform.
    Additionally, we created a set of baseline runs that take into account the
reliability and understandability of the information.
    Five reliability baselines were created based on the spam rankings distributed
with ClueWeb1213 [6]. For a given run, we removed all documents that had a
spam score smaller than a given threshold th. We used the BM25 baseline run
of Terrier, and 5 different values for th (50, 60, 70, 80 and 90).
13
     http://www.mansci.uwaterloo.ca/~msmucker/cw12spam/
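    To make this filtering step concrete, the sketch below is a minimal illustration
(hypothetical helper names and file paths). It assumes the spam ranking file lists one
'<percentile> <docno>' pair per line, with lower percentiles indicating spammier pages
as in the Waterloo rankings, and removes from a TREC-format run every document whose
percentile falls below the threshold th.

    def load_spam_scores(path):
        """Assumed file format: one '<percentile> <docno>' pair per line."""
        scores = {}
        with open(path) as f:
            for line in f:
                pct, docno = line.split()
                scores[docno] = int(pct)
        return scores

    def filter_run_by_spam(run_path, out_path, spam_scores, th):
        """Keep only documents whose spam percentile is >= th (lower = spammier).
        Ranks are not renumbered here; a complete run would re-assign them."""
        with open(run_path) as run, open(out_path, "w") as out:
            for line in run:
                topic, q0, docno, rank, score, tag = line.split()
                if spam_scores.get(docno, 0) >= th:
                    out.write(line)

    # e.g.: filter_run_by_spam("bm25.run", "bm25_spam70.run",
    #                          load_spam_scores("waterloo_spam.txt"), th=70)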
    Two understandability baselines were created using readability formulae. We
created runs based on the CLI (Coleman-Liau Index) and GFI (Gunning Fog Index)
scores [4,11], which are a proxy for the number of years of schooling required
to read the text being evaluated. These two readability formulae were chosen
because they were shown to be robust across different methods of HTML prepro-
cessing [24]. We followed one of the methods suggested in [24], in which the
HTML documents are preprocessed using Justext14 [26], the main text is ex-
tracted, periods at the end of sentences are added whenever necessary
(e.g., in the presence of line breaks), and then the readability scores are calculated.
Given the initial score S of a document and its readability score R, the final score
for each document is obtained as S × 1.0/R.
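    As a minimal sketch of this re-scoring step (illustrative names, not the organisers’
implementation): it assumes the main text has already been extracted from the HTML
(e.g., with Justext) and sentence-ending periods repaired, computes the Coleman-Liau
Index from its standard formula, and divides the retrieval score S by the readability
score R as described above; the guard against non-positive R is an added assumption.

    import re

    def coleman_liau_index(text):
        """CLI = 0.0588*L - 0.296*S - 15.8, where L is the average number of
        letters per 100 words and S the average number of sentences per 100 words."""
        words = re.findall(r"[A-Za-z]+", text)
        n_words = max(1, len(words))
        letters = sum(len(w) for w in words)
        sentences = max(1, len(re.findall(r"[.!?]", text)))
        L = 100.0 * letters / n_words
        S = 100.0 * sentences / n_words
        return 0.0588 * L - 0.296 * S - 15.8

    def readability_rescore(retrieval_score, text):
        """Combine the initial retrieval score S with the readability score R as
        S * 1.0 / R, boosting documents that need fewer years of schooling."""
        R = max(1.0, coleman_liau_index(text))   # guard against zero or negative R
        return retrieval_score * 1.0 / R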

5.2     Baselines for Sub-Task 2
We explored three ways to combine query variations:

 – Concatenation: we concatenated the text of each query variation into a single
   query.
 – Reciprocal Rank Fusion [5]: we fuse the ranked lists obtained for each query
   variation using the reciprocal rank fusion approach, i.e.,

            $\mathit{RRFScore}(d) = \sum_{r \in R} \frac{1}{k + r(d)}$,

   where D is the set of documents to be ranked, R is the set of document rankings
   retrieved for each query variation by the same retrieval model, r(d) is the
   rank of document d, and k is a constant set to 60, as in [5].
 – Rank-Biased Precision Fusion: similarly to the reciprocal rank fusion, we
   fuse the documents retrieved for each query variation with the Rank-Biased
   Precision (RBP) formula [20],

            $\mathit{RBPScore}(d) = \sum_{r \in R} (1 - p) \times p^{\,r(d) - 1}$,

   where p is the free parameter of the RBP model used to estimate user per-
   sistence. Here we set p = 0.80.

    For each of the three methods described above, we created a run based on
each of the three baseline models for Terrier and Indri, with and without pseudo-
relevance feedback. A total of 36 baseline runs were created for sub-task 2 (a
combination of 3 fusion approaches, 2 toolkits, 3 models, with and without PRF).
A sketch of the two rank-fusion approaches is given below.
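    The sketch below is an illustrative implementation of the two rank-fusion formulas
above (function names and data shapes are assumptions: each ranked list is an ordered
list of document ids, one list per query variation); it returns a single fused ranking
per information need.

    from collections import defaultdict

    def rrf_fuse(rankings, k=60):
        """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank(d))."""
        scores = defaultdict(float)
        for ranking in rankings:                      # one list per query variation
            for rank, doc in enumerate(ranking, start=1):
                scores[doc] += 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    def rbp_fuse(rankings, p=0.80):
        """RBP fusion: score(d) = sum over lists of (1 - p) * p^(rank(d) - 1)."""
        scores = defaultdict(float)
        for ranking in rankings:
            for rank, doc in enumerate(ranking, start=1):
                scores[doc] += (1.0 - p) * p ** (rank - 1)
        return sorted(scores, key=scores.get, reverse=True)

    # Example: three query variations of the same information need
    runs = [["d1", "d2", "d3"], ["d2", "d1", "d4"], ["d3", "d2", "d5"]]
    print(rrf_fuse(runs)[:3])
    print(rbp_fuse(runs)[:3])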


6      Participant Submissions
The number of registered participants for the CLEF eHealth IR Task was 58; of
these, 10 submitted at least one run for at least one of the sub-tasks, as shown
in Table 1. Each team could submit up to 3 runs for Sub-Tasks 1 and 2, and up
to 3 runs for each language of Sub-Task 3.
    We include below a summary of the approach of each team, as self-described
by the teams when their runs were submitted.
14
     https://pypi.python.org/pypi/jusText
   Table 1: Participating teams and the number of submissions for each Sub-Task.

                                                                                      Sub-Task
Team Name   Institution                                           Country            1   2   3
CUNI        Charles University in Prague                          Czech Republic     2   2  21
ECNU        East China Normal University                          China              3   3   8
GUIR        Georgetown University                                 United States      3   3   -
InfoLab     Universidade do Porto                                 Portugal           3   3   -
KDEIR       Toyohashi University of Technology                    Japan              3   2   -
KISTI       Korea Institute of Science and Technology Information Korea              3   -   -
MayoNLPTeam Mayo Clinic                                           United States      3   -   -
MRIM        Laboratoire d’Informatique de Grenoble                France             3   -   -
ub-botswana University of Botswana                                Botswana           3   -   -
WHUIRgroup  Wuhan University                                      China              3   3   -
10 Teams    10 Institutions                                       8 Countries       29  16  29




CUNI: The CUNI team participated in all the sub-tasks but their main focus
 was on the multilingual search in sub-task 3. The monolingual runs in
 sub-tasks 1 and 2 are mainly intended for comparison with the multilingual
 runs in sub-task 3. In sub-task 1 (Ad-Hoc Search), Run 1 employs the Terrier
 implementation of Dirichlet-smoothed language model with the µ parameter
 tuned on the data from previous CLEF eHealth tasks; Run 2 uses the Terrier
 vector space TF/IDF model. In sub-task 2 (Query variants), all query variants
 of one information need are searched for by the retrieval system (Dirichlet-
 smoothed language model in Run 1 and vector space TF/IDF model in Run 2)
 and the resulting lists of ranked documents are merged and reranked by doc-
 ument scores to produce one ranked list of documents for each information
 need. In sub-task 3 (Multilingual search), all the non-English queries (includ-
 ing variants) are translated into English using their own statistical machine
 translation systems adapted to translate search queries in the medical domain.
 For each non-English query, 15 translation variants (hypotheses) are obtained.
 Their multilingual Run 1 employs the single best translation for each query
 as provided by the translation systems. In their multilingual Run 2, the top
 15 translation hypotheses are reranked using a discriminative regression model
 employing a) features provided by the translation system and b) various kinds
 of features extracted from the document collection, external resources (UMLS -
 Unified Medical Language System, Wikipedia), or the translations themselves.
 Run 3 employs the same reranking method applied to the translation system
 features only.
ECNU: The ECNU team proposes a Web-based query expansion model and
 a combination method to better understand and address the task. They use as a
 baseline the Terrier implementation of the BM25 model. The other runs for sub-
 tasks 1, 2 and 3 explore Google search and MeSH to perform query expansion. The BM25,
 DFR BM25, BB2 and PL2 models of Terrier and the TF-IDF and BM25 models
 of Indri were used and combined. For the sub-task 3 runs, Google Translator
 was used to translate the queries from Czech, French, Polish and Swedish to
 English before applying the same methods as in sub-task 1.
GUIR: GUIR studies the use of medical terms for query reformulation. Syn-
 onyms and hypernyms from UMLS are used to generate reformulations of
 the queries; Terrier with Divergence from Randomness is used for retrieving
 and scoring documents. For sub-task 1, the results obtained from the reformulated
 queries are combined with the Borda rank aggregation algorithm. For sub-task 2,
 for each topic, the results obtained for every reformulated query in the topic are
 merged using the Borda rank aggregation algorithm.
Infolab: Team InfoLab analyses the performance of several query expansion
 strategies using different methods to select the terms to be added to the original
 query. One of the methods uses the similarity between Wikipedia articles, found
 through an analysis of incoming and outgoing links, for term selection. The
 other method applies Latent Dirichlet Allocation to Wikipedia articles to
 extract topics, each containing a set of words that are used for term selection.
 In the end, readability metrics were used to re-rank the documents retrieved
 using the expanded queries.
KDEIR: KDEIR submitted runs for sub-tasks 1 and 2. In both sub-tasks the
 Waterloo spam score was used to filter out the spammiest documents, and the
 link structure present in the remaining documents was explored on top of their
 language model baseline.
KISTI: KISTI attempts two approaches using word vectors learned by Word2Vec
 on medical Wikipedia pages. First, initial documents are obtained us-
 ing a search engine. Based on these documents, pseudo-relevance feedback (PRF)
 is applied with two different usages of the word vectors. In the first approach,
 PRF is performed with new relevance scores computed using the word vectors,
 while in the second approach PRF is performed with a new query expanded
 using the word vectors.
MayoNLPTeam: Mayo explores a Part-of-Speech (POS) based query term
 weighting approach which assigns different weights to the query terms according
 to their POS categories. The weights are learned by defining an objective func-
 tion based on the mean average precision. They apply the proposed approach
 with the optimal weights obtained from the TREC 2011 and 2012 Medical
 Records Tracks into the Query Likelihood model (Run 2) and Markov Random
 Field (MRF) models (Run 3). The conventional Query Likelihood model was
 implemented as the baseline (Run 1).
MRIM: MRIM’s objective is to investigate the effectiveness of word em-
 beddings for query expansion in consumer health search, as well as the effect of
 the resource used for learning the embeddings on the results. Their system uses
 the Terrier index provided by the organizers. As a retrieval model, the Dirichlet
 language model is used with default settings. Query expansion is applied using
 word embeddings learned from two different training sources. Word2vec is used
 to learn the word embeddings.
ub-botswana: In this participation, the effectiveness of three retrieval strate-
 gies is evaluated. In particular, PL2 is deployed with a Boolean Fallback score
 modifier as the baseline system. With this score modifier, if any of the retrieved
 documents contains all undecorated query terms (i.e. query terms without any
 operators), then documents that do not contain all undecorated query terms are
 removed from the result set; otherwise, nothing is done. In another
approach, the collection enrichment approach is employed, where the original
query is expanded with additional terms from an external collection (a collection
other than the one being searched). To deliver an effective ranking, the first two
rankers are combined using data fusion techniques.
WHUIRgroup: WHUIR uses Indri to conduct the experiments. The CHV (Consumer
 Health Vocabulary) is used to expand queries and a learning-to-rank algorithm is
 proposed to re-rank the results.


7   Assessments
A pool of 25,000 documents was created using the RBP-based Method A (sum-
ming contributions) by Moffat et al. [20], in which documents are weighted ac-
cording to their overall contribution to the effectiveness evaluation as provided
by the RBP formula (with p=0.8, following Park and Zhang [25]). This strategy
was chosen because it has been shown to be preferable to traditional fixed-depth
or stratified pooling when evaluating systems under fixed assessment budget
constraints [17]. A total of 100 runs were used (all baselines + all participant
runs for Sub-Tasks 1 and 2) to form the assessment pools.
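    A minimal sketch of the summing-contributions idea, under the assumption that a
document’s weight in a run is its RBP contribution p^(rank-1), accumulated over all
contributing runs, and that documents are pooled in decreasing order of accumulated
weight until the assessment budget is exhausted; names are illustrative and this is
not the organisers’ exact implementation.

    from collections import defaultdict

    def rbp_pool(runs, budget, p=0.8):
        """RBP-based pooling, Method A (summing contributions): weight each
        document by the sum over runs of p^(rank-1), then keep the `budget`
        documents with the largest accumulated weight."""
        weight = defaultdict(float)
        for ranking in runs:                          # each run: ordered list of doc ids
            for rank, doc in enumerate(ranking, start=1):
                weight[doc] += p ** (rank - 1)
        return sorted(weight, key=weight.get, reverse=True)[:budget]

    # e.g.: pool = rbp_pool(all_runs_for_a_topic, budget=250, p=0.8)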
    Assessment was performed by paid final year medical students who had access
to queries, documents, and relevance criteria drafted by a junior medical doctor.
The relevance criteria were drafted considering the entirety of the forum posts
used to create the queries; a link to the forum posts was also provided to the
assessors.
    Relevance assessments were provided with respect to the grades Highly rel-
evant, Somewhat relevant and Not relevant. Readability/understandability and
reliability/trustworthiness judgments were also collected for the documents in
the assessment pool. These judgements were collected as an integer value be-
tween 0 and 100 (lower values indicated a harder to understand document / lower
reliability) provided by judges through a slider tool; these judgements were used
to evaluate systems across different dimensions of relevance [32,31]. All assess-
ments were collected through a purposely customised version of the Relevation
toolkit [16].


8   Evaluation Metrics
System evaluation was conducted using precision at 10 (p@10) and normalised
discounted cumulative gain [14] at 10 (nDCG@10) as the primary and secondary
measures, respectively. Precision was computed using the binary relevance as-
sessments by collapsing Highly relevant and Somewhat relevant assessments into
the Relevant class, while nDCG was computed using the graded relevance as-
sessments.
    A separate evaluation was conducted using the multidimensional relevance
assessments (topical relevance, understandability and trustworthiness) follow-
ing the methods in [31]. For all runs, Rank biased precision (RBP) [20], with a
persistence parameter p = 0.80 (see [25]), was computed along with the multidi-
mensional modifications of RBP, namely uRBP (using binary understandability
assessments), uRBPgr (using graded understandability assessments), u+tRBP
(using binary understandability and trustworthiness assessments).
    Precision and nDCG were computed using trec_eval15 along with RBP,
while the multidimensional evaluation was performed using ubire16 [31].
15
   http://trec.nist.gov/trec_eval/trec_eval_latest.tar.gz
16
   https://github.com/ielab/ubire
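    For concreteness, a small sketch of how RBP at persistence p can be computed from a
list of gains (e.g., 1 for documents judged Highly or Somewhat relevant and 0 otherwise);
this is an illustrative re-implementation of the RBP formula of Moffat and Zobel [20],
not the evaluation script used for the task.

    def rbp(gains, p=0.80):
        """Rank-biased precision: (1 - p) * sum_i gain_i * p^(i - 1),
        where gains[i-1] is the gain (in [0, 1]) of the document at rank i."""
        return (1.0 - p) * sum(g * p ** i for i, g in enumerate(gains))

    # Example: a ranking whose 1st, 2nd and 5th documents are relevant
    print(round(rbp([1, 1, 0, 0, 1], p=0.80), 4))    # 0.4419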


9      Conclusions

This paper describes methods, results and analysis of the CLEF 2016 eHealth
Evaluation Lab, IR Task. The task considers the problem of retrieving web pages
for people seeking health information regarding medical conditions, treatments
and suggestions. The task was divided into 3 sub-tasks including ad-hoc search,
query variations, and multilingual ad-hoc search. Ten teams participated in the
task; relevance assessment is underway and the assessments, along with the partici-
pants’ results, will be released at the CLEF 2016 conference (and will be available
at the task’s GitHub repository).
    As a by-product of this evaluation exercise, the task makes available to the
research community a collection with associated assessments and an evaluation
framework (including readability and reliability evaluation) that can be used
to evaluate the effectiveness of retrieval methods for health information seeking
on the web (e.g. [21,22]).
    Baseline runs, participant runs and results, assessments, topics and query
variations are available online at the GitHub repository for this Task: https:
//github.com/CLEFeHealth/CLEFeHealth2016Task3.


10      Acknowledgments

This work has received funding from the European Union’s Horizon 2020 re-
search and innovation programme under grant agreement No 644753 (KCon-
nect), and from the Austrian Science Fund (FWF) projects P25905-N23 (AD-
mIRE) and I1094-N23 (MUCKE). We would also like to thank Microsoft Azure
for the grant (CRM:0518649), the ESF for the financial support for relevance
assessments and query creation, and the many assessors for their hard work.


References

 1. L. Azzopardi. Query Side Evaluation: An Empirical Analysis of Effectiveness and
    Effort. In Proc. of SIGIR, 2009.
 2. P. Bailey, A. Moffat, F. Scholer, and P. Thomas. User Variability and IR System
    Evaluation. In Proc. of SIGIR, 2015.
 3. M. Benigeri and P. Pluye. Shortcomings of health information on the internet.
    Health promotion international, 18(4):381–386, 2003.
 4. M. Coleman and T. L. Liau. A computer readability formula designed for machine
    scoring. Journal of Applied Psychology, 60:283–284, 1975.
 5. G. V. Cormack, C. L. A. Clarke, and S. Buettcher. Reciprocal rank fusion outper-
    forms condorcet and individual rank learning methods. In Proceedings of the 32nd
    International ACM SIGIR Conference on Research and Development in Informa-
    tion Retrieval, SIGIR ’09, pages 758–759, New York, NY, USA, 2009. ACM.
 6. G. V. Cormack, M. D. Smucker, and C. L. Clarke. Efficient and effective spam
    filtering and re-ranking for large web datasets. Information Retrieval, 14(5):441–
    465, 2011.
 7. S. Fox. Health topics: 80% of internet users look for health information online.
    Pew Internet & American Life Project, 2011.
 8. L. Goeuriot, G. J. Jones, L. Kelly, J. Leveling, A. Hanbury, H. Müller, S. Salantera,
    H. Suominen, and G. Zuccon. ShARe/CLEF eHealth Evaluation Lab 2013, Task 3:
    Information retrieval to address patients’ questions when reading clinical reports.
    CLEF 2013 Online Working Notes, 8138, 2013.
 9. L. Goeuriot, L. Kelly, W. Lee, J. Palotti, P. Pecina, G. Zuccon, A. Hanbury,
    H. Müller, and G. J. F. Jones. ShARe/CLEF eHealth Evaluation Lab 2014, Task 3:
    User-centred health information retrieval. In CLEF 2014 Evaluation Labs and
    Workshop: Online Working Notes, Sheffield, UK, 2014.
10. L. Goeuriot, L. Kelly, G. Zuccon, and J. Palotti. Building evaluation datasets for
    consumer-oriented information retrieval. In Proceedings of the Tenth International
    Conference on Language Resources and Evaluation (LREC 2016), Paris, France,
    may 2016. European Language Resources Association (ELRA).
11. R. Gunning. The Technique of Clear Writing. McGraw-Hill, 1952.
12. A. Hanbury. Medical information retrieval: an instance of domain-specific search.
    In Proceedings of SIGIR 2012, pages 1191–1192, 2012.
13. D. Hiemstra and C. Hauff. Mirex: Mapreduce information retrieval experiments.
    arXiv preprint arXiv:1004.4489, 2010.
14. K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques.
    ACM Transactions on Information Systems, 20(4):422–446, 2002.
15. L. Kelly, L. Goeuriot, H. Suominen, A. Neveol, J. Palotti, and G. Zuccon. Overview
    of the CLEF eHealth Evaluation Lab 2016. In Information Access Evaluation.
    Multilinguality, Multimodality, and Visualization. Springer Berlin Heidelberg, 2016.
16. B. Koopman and G. Zuccon. Relevation! an open source system for information
    retrieval relevance assessment. arXiv preprint, 2013.
17. A. Lipani, G. Zuccon, M. Lupu, B. Koopman, and A. Hanbury. The impact of
    fixed-cost pooling strategies on test collection bias. In Proceedings of the 2016
    International Conference on The Theory of Information Retrieval, ICTIR ’16, New
    York, NY, USA, 2016. ACM.
18. C. Macdonald, R. McCreadie, R. L. Santos, and I. Ounis. From puppy to maturity:
    Experiences in developing terrier. Proc. of OSIR at SIGIR, pages 60–63, 2012.
19. D. McDaid and A. Park. Online health: Untangling the web. Evidence from the
    Bupa Health Pulse 2010 international healthcare survey. Technical report, 2011.
20. A. Moffat and J. Zobel. Rank-biased precision for measurement of retrieval effec-
    tiveness. ACM Trans. Inf. Syst., 27(1):2:1–2:27, Dec. 2008.
21. J. Palotti, L. Goeuriot, G. Zuccon, and A. Hanbury. Ranking health web pages
    with relevance and understandability. In Proceedings of the 39th international ACM
    SIGIR conference on Research and development in information retrieval, 2016.
22. J. Palotti, G. Zuccon, J. Bernhardt, A. Hanbury, and L. Goeuriot. Assessors Agree-
    ment: A Case Study across Assessor Type, Payment Levels, Query Variations and
    Relevance Dimensions. In Experimental IR Meets Multilinguality, Multimodality,
    and Interaction: 7th International Conference of the CLEF Association, CLEF’16
    Proceedings. Springer International Publishing, 2016.
23. J. Palotti, G. Zuccon, L. Goeuriot, L. Kelly, A. Hanbury, G. J. Jones, M. Lupu,
    and P. Pecina. CLEF eHealth Evaluation Lab 2015, Task 2: Retrieving Information
    about Medical Symptoms. In CLEF 2015 Online Working Notes. CEUR-WS, 2015.
24. J. Palotti, G. Zuccon, and A. Hanbury. The influence of pre-processing on the
    estimation of readability of web documents. In Proceedings of the 24th ACM In-
    ternational on Conference on Information and Knowledge Management, CIKM ’15,
    pages 1763–1766, New York, NY, USA, 2015. ACM.
25. L. Park and Y. Zhang. On the distribution of user persistence for rank-biased
    precision. In Proceedings of the 12th Australasian document computing symposium,
    pages 17–24, 2007.
26. J. Pomikálek. Removing boilerplate and duplicate content from web corpora. PhD
    thesis, Masaryk university, Faculty of informatics, Brno, Czech Republic, 2011.
27. W. Shen, J.-Y. Nie, X. Liu, and X. Liu. An investigation of the effec-
    tiveness of concept-based approach in medical information retrieval GRIUM@
    CLEF2014eHealthTask 3. In Proceedings of the CLEF eHealth Evaluation Lab,
    2014.
28. T. Strohman, D. Metzler, H. Turtle, and W. B. Croft. Indri: A language model-
    based search engine for complex queries. In Proceedings of the International Con-
    ference on Intelligent Analysis, volume 2, pages 2–6. Citeseer, 2005.
29. R. W. White and E. Horvitz. Cyberchondria: studies of the escalation of medical
    concerns in web search. ACM TOIS, 27(4):23, 2009.
30. D. Zhu, S. T.-I. Wu, J. J. Masanz, B. Carterette, and H. Liu. Using discharge
    summaries to improve information retrieval in clinical domain. In Proceedings of
    the CLEF eHealth Evaluation Lab, 2013.
31. G. Zuccon. Understandability biased evaluation for information retrieval. In Ad-
    vances in Information Retrieval, pages 280–292, 2016.
32. G. Zuccon and B. Koopman. Integrating understandability in the evaluation of
    consumer health search engines. In Medical Information Retrieval Workshop at
    SIGIR 2014, page 32, 2014.
33. G. Zuccon, B. Koopman, and J. Palotti. Diagnose this if you can: On the effective-
    ness of search engines in finding medical self-diagnosis information. In Advances
    in Information Retrieval, pages 562–567. Springer, 2015.