=Paper=
{{Paper
|id=Vol-2696/paper_260
|storemode=property
|title=Overview of the CLEF eHealth 2020 Task 2: Consumer Health Search with Ad Hoc and Spoken Queries
|pdfUrl=https://ceur-ws.org/Vol-2696/paper_260.pdf
|volume=Vol-2696
|authors=Lorraine Goeuriot,Hanna Suominen,Liadh Kelly,Zhengyang Liu,Gabriella Pasi,Gabriela González Sáez,Marco Viviani,Chenchen Xu
|dblpUrl=https://dblp.org/rec/conf/clef/GoeuriotSKLPSVX20
}}
==Overview of the CLEF eHealth 2020 Task 2: Consumer Health Search with Ad Hoc and Spoken Queries==
Overview of the CLEF eHealth 2020 Task 2: Consumer Health Search with Ad Hoc and Spoken Queries*

Lorraine Goeuriot1[0000−0001−7491−1980], Hanna Suominen2,3,4[0000−0002−4195−1641], Liadh Kelly5[0000−0003−1131−5238], Zhengyang Liu2, Gabriella Pasi6[0000−0002−6080−8170], Gabriela Gonzalez Saez1[0000−0003−0878−5263], Marco Viviani6[0000−0002−2274−9050], and Chenchen Xu2,3

1 Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, F-38000 Grenoble, France, Lorraine.Goeuriot@imag.fr, gabriela-nicole.gonzalez-saez@univ-grenoble-alpes.fr
2 The Australian National University, Canberra, ACT, Australia, {hanna.suominen, zhengyang.liu, chenchen.xu}@anu.edu.au
3 Data61/Commonwealth Scientific and Industrial Research Organisation, Canberra, ACT, Australia
4 University of Turku, Turku, Finland
5 Maynooth University, Ireland, liadh.kelly@mu.ie
6 University of Milano-Bicocca, Dept. of Informatics, Systems, and Communication, Milan, Italy, {gabriella.pasi, marco.viviani}@unimib.it

Abstract. In this paper, we provide an overview of the CLEF eHealth Task 2 on Information Retrieval (IR), organized as part of the eighth annual edition of the CLEF eHealth evaluation lab by the Conference and Labs of the Evaluation Forum. Its aim was to address laypeople's difficulties in retrieving and digesting valid and relevant information, in their preferred language, to make health-centred decisions. The task was a novel extension of the most popular and established task in CLEF eHealth on Consumer Health Search (CHS), extended this year to spoken ad-hoc queries. In total, five submissions were made to its two subtasks; three addressed the ad-hoc IR task on text data and two considered the spoken queries. Herein, we describe the resources created for the task and the evaluation methodology adopted. We also summarize lab submissions and results. As in previous years, organizers have made data, methods, and tools associated with the lab tasks available for future research and development.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

* With equal contribution, LG & HS were the co-leaders of the task. LK, ZL, GP, GGS, MV, and CX were the task co-organizers and contributors to the evaluation conceptualization, dataset creation, assessments, and measurements.

Keywords: eHealth · Evaluation · Health Records · Medical Informatics · Information Storage and Retrieval · Speech Recognition · Test-set Generation · Self-Diagnosis · Dimensions of Relevance

1 Introduction

In recent years, electronic health (eHealth) content has become available in a variety of forms ranging from patient records and medical dossiers, scientific publications, and health-related websites to medical-related topics shared across social networks. Laypeople, clinicians, and policy-makers need to easily retrieve and make sense of such medical content to support their decision making. The increasing difficulties experienced by these stakeholders in retrieving and digesting valid and relevant information in their preferred language to make health-centred decisions have motivated CLEF eHealth to organise yearly shared challenges since 2013. More specifically, CLEF eHealth7 was established as a lab workshop in 2012 as part of the Conference and Labs of the Evaluation Forum (CLEF).
Since 2013, it has offered evaluation labs in the fields of layperson and professional health information extraction, management, and retrieval, with the aims of bringing together researchers working on related information access topics and providing them with datasets to work with and validate the outcomes. These labs and their subsequent workshops target:

1. developing processing methods and resources (e.g., dictionaries, abbreviation mappings, and data with model solutions for method development and evaluation) in a multilingual setting to enrich difficult-to-understand eHealth texts, support personalized reliable access to medical information, and provide valuable documentation;
2. developing an evaluation setting and releasing evaluation results for these methods and resources; and
3. contributing to participants' and organizers' professional networks and interaction with all interdisciplinary actors of the ecosystem for producing, processing, and consuming eHealth information.

The vision for the Lab is two-fold:

1. to develop tasks that potentially impact laypeople's understanding of medical information, and
2. to provide the community with an increasingly sophisticated dataset of clinical narrative, enriched with links to standard knowledge bases, evidence-based care guidelines, systematic reviews, and other further information, to advance the state-of-the-art in multilingual information extraction (IE) and information retrieval (IR) in healthcare.

7 https://clefehealth.imag.fr/

The eighth annual CLEF eHealth lab, CLEF eHealth 2020 [13], aiming to build upon the resource development and evaluation approaches offered in the previous years of the lab [53, 20, 11, 19, 12, 51, 21], offered the following two tasks:

– Task 1. Multilingual IE [28] and
– Task 2. Consumer Health Search (CHS).

The CHS task was a continuation of the previous CLEF eHealth IR tasks that ran in 2013, 2014, 2015, 2016, 2017, and 2018 [8, 10, 33, 62, 34, 17, 9], and embraced the Text REtrieval Conference (TREC)-style evaluation process, with a shared collection of documents and queries, the contribution of runs from participants, and the subsequent formation of relevance assessments and evaluation of participants' submissions. The 2020 task used the representative web corpus developed in the 2018 challenge. This year we offered spoken queries, as well as textual transcripts of these queries. The task was structured into two optional subtasks, covering (1) ad-hoc search and (2) query variation using the spoken queries, textual transcripts of the spoken queries, or provided automatic speech-to-text conversions of the spoken queries. The multilingual IE task focused on Spanish. Further details on this challenge are available in [13] and [28].

The remainder of this paper is structured as follows: First, in Section 2, we detail the task, evaluation, and datasets created. Second, in Section 3, we describe the task submissions and results. Finally, in Section 4, we discuss the study and provide conclusions.

2 Materials and Methods

In this section, we describe the materials and methods used in the two subtasks. After specifying our document collection, we address the spoken, transcribed, and speech-recognized queries. Then, we describe our evaluation methods. Finally, we introduce our human relevance assessments for information topicality, understandability, and credibility.

2.1 Documents

The 2018 CLEF eHealth Consumer Health Search document collection was used in this year's IR challenge.
As detailed in [17], this collection consists of web pages acquired from the CommonCrawl. An initial list of websites was identified for acquisition. The list was built by submitting queries on the 2018/2020 topics to the Microsoft Bing Application Programming Interfaces (APIs), through the Azure Cognitive Services, repeatedly over a period of a few weeks, and acquiring the uniform resource locators (URLs) of the retrieved results. The domains of the URLs were then included in the list, except some domains that were excluded for decency reasons. The list was further augmented by including a number of known reliable health websites and other known unreliable health websites. This augmentation was based on lists previously compiled by health institutions and agencies.

2.2 Queries

Historically, the CLEF eHealth IR task has released text queries representative of layperson medical information needs in various scenarios. In recent years, query variations issued by multiple laypeople for the same information need have been offered. In this year's task, we extended this to also include spoken queries. These spoken queries were generated in English by six laypeople, all native English speakers. Efforts were made to include a diverse set of accents. Narratives for query generation were those used in the 2018 challenge. These narratives relate to real medical queries compiled by the Khresmoi project [15], which were issued to a search engine by laypeople; full details are available in the CLEF eHealth 2018 IR task overview paper [17]. Spoken transcripts of these narratives were generated for use in query generation in this year's challenge.

To create a spoken query, the layperson listened to the narrative and generated their spoken query associated with the narrative. The layperson then listened to their generated spoken query and created a textual transcript of the query. To ensure accuracy in transcription, they were required to repeat this process of listening to their narrative and textually transcribing it. This allowed us to generate accurate transcripts of the spoken queries. We did not preprocess the textual transcripts of queries; for example, any spelling mistakes that may be present were not removed. The final generated query set consisted of 50 topics, with 6 query variations for each topic. Ethical approval was obtained to generate the spoken queries, and informed consent was obtained from study participants. Spoken queries were downloadable from a secured server for the purpose of participating in this year's CLEF eHealth IR challenge, on completion of a signed use agreement by the participating team.

We also provided participants with the textual transcripts of these spoken queries and automatic speech-to-text conversions. This transcription of the audio files was generated using the End-to-End Speech Processing Toolkit (ESPnet), Librispeech, CommonVoice, and the Google API (with three models). Speech recognition is assessed using Kaldi [40], an open-source speech recognition toolkit distributed under a free license. We use mel-frequency cepstral coefficient (MFCC) acoustic features (13 coefficients expanded with delta and double delta features and energy: 40 features in total) with various feature transformations, including linear discriminant analysis (LDA), maximum likelihood linear transformation (MLLT), and feature space maximum likelihood linear regression (fMLLR) with speaker adaptive training (SAT).
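As a concrete illustration of this acoustic front end, the short Python sketch below derives a 40-dimensional frame representation from 13 MFCCs, their delta and double-delta expansions, and a log-energy term. It is a minimal sketch only, assuming the librosa library rather than the Kaldi pipeline actually used for the task, and it omits the LDA, MLLT, fMLLR, and SAT transformations; the file name and sampling rate are illustrative assumptions.

```python
# Minimal sketch (librosa, not the Kaldi pipeline used in the task):
# 13 MFCCs + deltas + double deltas + log energy = 40 features per frame.
# The LDA/MLLT/fMLLR/SAT transforms mentioned in the text are not reproduced here.
import numpy as np
import librosa

def mfcc_40d(wav_path: str, sr: int = 16000) -> np.ndarray:
    """Return a (40, n_frames) feature matrix for one audio file."""
    y, sr = librosa.load(wav_path, sr=sr)               # mono waveform
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # 13 static coefficients
    d1 = librosa.feature.delta(mfcc)                    # first-order (delta) features
    d2 = librosa.feature.delta(mfcc, order=2)           # second-order (double delta)
    rms = librosa.feature.rms(y=y)                      # per-frame RMS energy
    log_e = np.log(rms ** 2 + 1e-10)                    # log energy, same framing
    return np.vstack([mfcc, d1, d2, log_e])             # 13 + 13 + 13 + 1 = 40 rows

# Hypothetical usage:
# feats = mfcc_40d("spoken_query_example.wav")
# print(feats.shape)  # (40, n_frames)
```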
The speech transcription process is carried out in two passes: an automatic transcript is first generated with a GMM-HMM model of 12000 states and 200000 Gaussians. The second pass is performed using a DNN acoustic model (nnet3 recipe in the Kaldi toolkit) trained on acoustic features normalized with the fMLLR matrix.

The TED-LIUM dataset [42] was used for training the acoustic models. It was developed for large vocabulary continuous speech recognition (LVCSR). The training part of the dataset is composed of 118 hours of speech. The English language model was trained with the MIT language model toolkit using the following corpora: News Commentary 2007-2012 [55], Gigaword version 5 [14], and TDT 2-4 [57]. The vocabulary consists of the 150K most frequent words.

2.3 Evaluation Methods

Similar to the 2016, 2017, and 2018 pools, we created the pool using the RBP-based Method A (summing contributions) by Moffat et al. [29], in which documents are weighted according to their overall contribution to the effectiveness evaluation as provided by the RBP formula (with p = 0.8, following Park and Zhang [35]). This strategy, named RBPA, was chosen because it has been shown that it should be preferred over traditional fixed-depth or stratified pooling when deciding upon the pooling strategy to be used to evaluate systems under fixed assessment budget constraints [24], as is the case for this task. Since the 2018 topics were used in 2020, the pool used in the 2020 CHS task was an extension of 2018's pool. In other words, the merged 2018&2020 pool was used in 2020. For Subtasks 1 and 2, participants could submit up to 4 runs in the TREC format. Evaluation measures were NDCG@10, BPref, and RBP. Metrics such as uRBP were used to capture various relevance dimensions, as elaborated below.

2.4 Human Assessments for Topicality, Understandability, and Credibility

Relevance assessments were conducted on three relevance dimensions: topicality, understandability, and credibility. Topicality referred to a classical relevance dimension ensuring that the document and the query are on the same topic and that the document answers the query. Understandability was an estimation of whether the document is understandable by a patient. In the assessment guidelines, assessors were required to estimate how readable they thought the documents were to a layperson, that is, a person without any medical background. Topicality and understandability have been used as relevance dimensions in the CHS task of CLEF eHealth for several years.

This year, to take into consideration the phenomenon of the spread of disinformation online (especially on health-related topics), we introduced a novel dimension, that is, credibility. Over the years, the interest in studying the concept of credibility has gradually moved from traditional communication environments, characterized by interpersonal and persuasive communication, to mass communication and interactive-mediated communication, with particular reference to online communication [26]. In this scenario, retrieving credible information is becoming a fundamental issue [18, 23, 38, 39, 59], also in the health-related context [44]. In general, credibility is described in the literature as a quality perceived by the information receiver [6], and it is composed of multiple dimensions that have to be considered and evaluated together in the process of information credibility assessment [6, 27, 45].
These dimensions usually include the source that dissem- inates content, characteristics related to the message diffused, and some social aspects if the information is disseminated through a virtual community [56]. For this reason, when evaluating information credibility in the health-related context, assessors were asked in the CLEF eHealth 2020 Task 2 to evaluate the aforementioned multiple aspects by considering, at the same time: 1. any information available about the trustworthiness of the source [2, 3, 25] of the health-related information (the fact that information comes from a Web site with a good or bad reputation, or the level of expertise of an individual answering on a blog or a question-answering system, etc.); 2. syntactic/semantic characteristics of the content [7] (in terms of, e.g., com- pleteness, language register, or style); and 3. any information emerging from social interactions when available [37] (the fact that the circle of social relationships of the author of a content is reliable or not, the fact that the author is involved in many discussions, etc.). Obviously, it must be taken into account that the ability to judge the credibility of information related to health depends very much on the sociocultural back- ground of the assessor, on the availability of information about the social network of the source of information, and on the ease versus complexity of identifying in or inferring from the document the different dimensions of credibility. Assessors considered the three dimensions in assessments (i.e., topicality, understandability, and credibility) on a 3-levels scale: – not relevant/understandable/credible, – somewhat relevant/understandable/credible, and – highly relevant/understandable/credible. In particular, we added a 4th option for credibility for assessors uncertainty: I am not able to judge. This was motivated by the fact that, as illustrated above, the documents to be assessed may actually lack (or it may not be entirely clear) the minimum information necessary to assess their level of credibility. Assessments were implemented online by expanding and customising the Rel- evation! tool for relevance assessments [22] to capture our task dimensions, scales, and other preferences. The number of assessors was 30, of which about 12 women (40%) and 18 men (60%). They were based in European countries and Australia. Their expertise ranged from being a medical doctor (in different disciplines) or a nurse to being a layperson with no or limited background in medicine or healthcare. Each assessor was assigned 1 to 5 queries to be evaluated. Each query (concerning a specific domain linked to health) was associated with 150 documents to be evaluated with respect to the three dimensions of relevance mentioned above. 3 Results CLEF eHealth IR/CHS tasks offered in 2013–2020 have brought together re- searchers working on health information access topics. The tasks have provided them with data and computational resources to work with and validate their outcomes. These contributions have accelerated pathways from scientific ideas through influencing research and development to societal impact. 
The task niche has been addressing health information needs of laypeople (including, but not limited to, patients, their families, clinical staff, health scientists, and healthcare policy makers) — and not healthcare experts only — in a range of languages and modalities — in retrieving and digesting valid and relevant eHealth information to make health-centered decisions [4, 16, 5, 48, 49]. Next, we report on the 2020 participants, method submissions, and their resulting evaluation outcomes. This expands the brief results section of our lab overview [13].

3.1 Participation

In 2020, 24 teams registered for CLEF eHealth Task 2. Of these teams, 3 took part in the task. Registering for a CLEF task consisted of filling in a form on the CLEF conference website with contact information and ticking boxes corresponding to the labs of interest. This was done several months before run submission, which partly explains the drop in the numbers; the difficulty and demands of the task are another explanation. Although the interest and participation numbers were considerably smaller than before [48–50], organizers were pleased with this newly introduced task, with its novel spoken queries element attracting interest and submissions (Table 1). In addition, every participating team took part in the more traditional ad-hoc subtask.

Table 1. Descriptive statistics about the submitting teams

Subtasks | No. of coauthors | Authors' affiliation | Affiliation country
ad-hoc Search & Spoken Queries Using Transcriptions | 3 | 1 university | Italy
ad-hoc Search & Spoken Queries Using Audio Files | 5 | 1 university | France
ad-hoc Search | 4 | 1 university | Australia

3.2 Participants' Method Submissions

Among the five submissions to the 2020 CLEF eHealth Task 2, the ad-hoc IR subtask was the most popular with its three submissions; the subtasks that used transcriptions of the spoken queries and the original audio files received one submission each. The submitting teams were from Australia, France, and Italy and had 4, 5, and 3 team members, respectively.8 Each team had members from a single university without other partner organizations.

8 Please note that these numbers are corrected from [13], based on the team's finalised working notes.

The Italian submission to the ad-hoc search and spoken queries using transcriptions subtasks was from the Information Management System (IMS) Group of the University of Padua [32]. Its members were Associate Professor Giorgio Maria Di Nunzio, Stefano Marchesin, and Federica Vezzani. The submission to the former task included BM25 of the original query; Reciprocal Rank fusion with BM25, Query Language Model (QLM), and Divergence from Randomness (DFR) approaches; Reciprocal Rank fusion with BM25, QLM, and DFR approaches using pseudo relevance feedback with 10 documents and 10 terms (query weight of 0.5); and Reciprocal Rank fusion with BM25 run on manual variants of the query. The submission to the latter task included Reciprocal Rank fusion with BM25; Reciprocal Rank fusion with BM25 using pseudo relevance feedback with 10 documents and 10 terms (query weight of 0.5); Reciprocal Rank fusion of BM25 with all transcriptions; and Reciprocal Rank fusion of BM25 with all transcripts using pseudo relevance feedback with 10 documents and 10 terms (query weight of 0.5).
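Several of the IMS runs above rely on Reciprocal Rank fusion to combine the rankings produced by BM25, QLM, and DFR models. As a point of reference, the sketch below implements the standard reciprocal rank fusion formula, score(d) = sum over runs of 1/(k + rank_run(d)); the smoothing constant k = 60 is the commonly used default and an assumption here, since the exact fusion parameters are given in the teams' working notes rather than in this overview.

```python
# Minimal sketch of standard reciprocal rank fusion (RRF); the constant k = 60
# is a common default and an assumption, not a value reported by the teams.
from collections import defaultdict
from typing import Dict, List

def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids; higher fused score ranks first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)          # contribution of this run
    return sorted(scores, key=scores.get, reverse=True)

# Toy example with three hypothetical system outputs (e.g., BM25, QLM, DFR):
bm25 = ["d3", "d1", "d2"]
qlm = ["d1", "d3", "d4"]
dfr = ["d2", "d1", "d3"]
print(reciprocal_rank_fusion([bm25, qlm, dfr]))  # documents ranked well by several runs rise to the top
```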
The French team, LIG-Health, was formed by Dr Philippe Mulhem, Aidan Mannion, Gabriela Gonzalez Saez, Dr Didier Schwab, and Jibril Frej from the Laboratoire d'Informatique de Grenoble of the Univ. Grenoble Alpes [30]. To the ad-hoc search task, they submitted runs using Terrier BM25 as a baseline and explored various query expansion methods using UMLS, the Consumer Health Vocabulary, fastText, and relevance feedback (Bose-Einstein) weighted expansion. For the spoken queries subtask, they used the various transcriptions with the same models, opting for the best-performing ones based on the 2018 qrels. They submitted merged runs for each query.

The Australian team – called SandiDoc – was from the Our Health In Our Hands (OHIOH) Big Data program, Research School of Computer Science, College of Engineering and Computer Science, The Australian National University [43]. Its members were Sandaru Seneviratne, Dr Eleni Daskalaki, Dr Artem Lenskiy, and Dr Zakir Hossain. Unlike the other two teams, SandiDoc took part in the ad-hoc search task only. Its IR method had three steps and was founded on TF×IDF scoring: First, both the dataset and the queries were pre-processed. Second, TF×IDF scores were computed for the queries and used to retrieve the most similar documents for the queries. Third, the team supplemented this method by working on the clefehealth2018 B dataset using the medical skip-gram word embeddings provided. To represent the documents and queries, the team used the average word vector representations as well as the average of minimum and maximum vector representations of the document or query. In documents, these representations were derived using the 100 most frequent words in a document. For each representation, the team calculated the similarity among documents and queries using the cosine measure to obtain the final task results. The team's aim was to experiment with different vector representations for text.

3.3 Organizers' Baseline Methods

In addition to these participants' methods, we as organizers developed baseline methods that were based on the renowned Okapi BM25 but now with query expansion optimized in a REINFORCE [58] fashion. In this section, we introduce the two steps of query expansion and document retrieval. First, we pre-trained our query expansion model on generally available corpora. Similar to the REINFORCE learning protocol as introduced in [58], this pre-training step was done by iterations of exploration trials and optimization of the current model by rewarding the explorations. Each time, an input query was enriched by the query expansion model from the last iteration, from which several candidate new queries were generated. The system was rewarded or penalized by matching the retrieved documents from these candidate queries against the ground-truth document ranking. To expand the trial of generating new queries and thus provide more training sources, the system used the context words found in these newly retrieved documents to construct queries for the next iteration. For this baseline model, the same datasets as in [31], TREC-CAR, Jeopardy, and MSA, were used for pre-training.

One key challenge for information retrieval in the eHealth domain is that laypeople may lack the professional knowledge to precisely describe medical topics or symptoms.
A layperson's input query into the system can be lengthy and inaccurate, while the documents to be matched for these queries are usually composed by people with a medical mindset and background and are thus rigorous in wording. The query expansion phase was added to increase the chance of matching more candidate documents by enriching the original query. With this intuition in mind, we employed a candidate query construction method and optimization target similar to those introduced in [31]. Given an original query q_0, the system retrieves a ranked document set D_0. The system constructed a new candidate query q'_0 by selecting words from the union of the words from the original query q_0 and the context words from the retrieved documents D_0. The new query was fed back into the retrieval system to fetch documents D'_0 as the learning source for the next iteration. The system iteratively applies the retrieval of documents and the reformulation of new candidate queries to create the supervision examples {(q'_0, D'_0), (q'_1, D'_1), . . .}. At each iteration, the system memorized the selection operations of words in constructing the new query as actions to be judged. The retrieved documents D'_k, along with their ranking, were then compared with the ground-truth document ranking. Correct ranking of the documents becomes the reward for the new query and thus also for the actions that generated it. In particular, the stochastic objective function for calculating the reward was:

$$ C_a = (R - \bar{R}) \sum_{t \in T} -\log P(t \mid q_0), $$

where R and \bar{R} are the reward from the new query and the baseline reward, and t ∈ T are the words from the new query. With the actions and reward properly defined, the system can be optimized under the REINFORCE learning framework [58]. At the inference stage, the system greedily picks the optimal selection operations to generate a few candidate queries from the input query.

After enriching the input queries with the pre-trained query expansion model, the second step of this baseline model reused the commonly used BM25 algorithm [41].

3.4 Evaluating Topicality

Tables 2 and 3 give the global results of all the teams for the two subtasks. They show the task metrics (MAP, BPref, NDCG@10, uRBP, and cRBP) for all the participants' runs and the organizers' baselines.

Table 2. Runs evaluation using the MAP, BPref, nDCG@10, uRBP, cRBP in the ad-hoc search task. Bold cells are the highest scores. Statistical significance tests were conducted, but the highest participants' runs were not statistically significantly higher than the best baselines.
ad-hoc Search Task Team Run MAP BPref NDCG@10 uRBP cRBP Bing all 0.0137 0.0164 0.4856 0.2433 0.3489 elastic BM25 QE Rein 0.1758 0.3071 0.5782 0.3072 0.4542 elastic BM25f noqe 0.2707 0.4207 0.7197 0.3494 0.5424 elastic BM25f qe 0.1107 0.2110 0.5625 0.2711 0.4399 indri dirichlet noqe 0.0790 0.1807 0.4104 0.1893 0.3299 indri dirichlet qe 0.0475 0.1234 0.3235 0.1504 0.2649 indri okapi noqe 0.1102 0.2227 0.4708 0.2180 0.3738 indri okapi qe 0.1192 0.2394 0.4732 0.2109 0.3703 Baseline indri tfidf noqe 0.1215 0.2396 0.4804 0.2139 0.3740 indri tfidf qe 0.1189 0.2344 0.4824 0.2163 0.3799 terrier BM25 cli 0.2643 0.3923 0.4963 0.2298 0.3836 terrier BM25 gfi 0.2632 0.3922 0.4923 0.2345 0.3828 terrier BM25 noqe 0.2627 0.3964 0.5919 0.2993 0.4615 terrier BM25 qe 0.2453 0.3784 0.5698 0.2803 0.4516 terrier DirichletLM noqe 0.2706 0.4160 0.6054 0.2927 0.4789 terrier DirichletLM qe 0.1453 0.2719 0.5521 0.2662 0.4456 terrier TF IDF noqe 0.2613 0.3958 0.6292 0.3134 0.4845 terrier TF IDF qe 0.2500 0.3802 0.608 0.3013 0.4831 bm25 orig 0.2482 0.3909 0.6493 0.3204 0.5125 IMS original rm3 rrf 0.2834 0.4320 0.6491 0.3141 0.5039 original rrf 0.2810 0.4232 0.6593 0.3225 0.5215 variant rrf 0.2022 0.3712 0.6339 0.3511 0.4863 FT Straight.res 0.2318 0.3669 0.5617 0.2909 0.4555 LIG Noexp Straight.res 0.2627 0.3964 0.5919 0.2993 0.4615 UMLS RF.res 0.2258 0.3616 0.5918 0.3123 0.4593 UMLS Straight.res 0.2340 0.3645 0.5769 0.3062 0.4614 SandiDoc tfidf 0.0239 0.0536 0.3235 0.2413 0.2281 Table 4 shows the topical relevance results for all the teams and the organizers baselines. The team achieving the highest results on 2 metrics is IMS, although the scores are not statistically significantly higher than the best baseline. Their top run, original rm3 rrf run uses Reciprocal Rank fusion with BM25, QLM, Table 3. 
Runs evaluation using the MAP, BPref, nDCG@10, uRBP, cRBP in the Spoken search task ad-hoc Search Task Team Run MAP BPref NDCG@10 uRBP cRBP Best ad-hoc baseline elastic BM25f noqe 0.2707 0.4207 0.7197 0.3494 0.5424 IMS bm25 all rrf 0.1952 0.3722 0.4478 0.2981 0.4622 bm25 all rrf rm3 0.2144 0.3975 0.4711 0.2854 0.4348 bm25 rrf 0.1962 0.3745 0.4553 0.2923 0.4636 bm25 rrf rm3 0.2188 0.4036 0.4781 0.2892 0.4493 LIG Merged FT binary RF 0.1626 0.2964 0.3627 0.2264 0.3592 Noexp RF 0.1810 0.3279 0.3963 0.2601 0.4067 UMLS binary RF 0.1582 0.3054 0.3658 0.2557 0.3864 UMLS weight RF 0.1671 0.3085 0.3721 0.2407 0.3722 LIG Participant 1 DE FT binary RF 0.1565 0.3096 0.3705 0.2441 0.3906 DE Noexp RF 0.1726 0.3192 0.3853 0.2645 0.4051 DE UMLS binary RF 0.1271 0.2783 0.3238 0.2327 0.3435 DE UMLS weight RF 0.1416 0.2928 0.3473 0.2333 0.3582 LIG Participant 2 DE FT binary RF 0.0995 0.2314 0.2764 0.1970 0.2906 DE Noexp RF 0.1206 0.2634 0.313 0.2066 0.3236 DE UMLS binary RF 0.1017 0.2384 0.2769 0.1998 0.2960 DE UMLS weight RF 0.1133 0.2575 0.3039 0.2313 0.3338 LIG Participant 3 VE FT binary RF 0.1274 0.2664 0.3159 0.2106 0.3176 VE Noexp RF 0.1447 0.3023 0.356 0.2106 0.3389 VE UMLS binary RF 0.1385 0.2984 0.3451 0.2055 0.3164 VE UMLS weight RF 0.1485 0.3114 0.3637 0.2194 0.3412 LIG Participant 4 PE FT binary RF 0.1090 0.2582 0.2985 0.1908 0.2987 PE Noexp RF 0.1301 0.2880 0.3382 0.2164 0.3375 PE UMLS binary RF 0.1246 0.2852 0.3285 0.2020 0.3061 PE UMLS weight RF 0.1282 0.2877 0.3348 0.2150 0.3317 LIG Participant 5 PE FT binary RF 0.0952 0.2287 0.2512 0.1424 0.2460 PE Noexp RF 0.1035 0.2412 0.2679 0.1644 0.2765 PE UMLS binary RF 0.1036 0.2470 0.2741 0.1514 0.2670 PE UMLS weight RF 0.0917 0.2227 0.2466 0.1535 0.2615 LIG Participant 6 DE FT binary RF 0.1509 0.2921 0.3496 0.2016 0.3522 DE Noexp RF 0.1744 0.3238 0.3849 0.2220 0.3724 DE UMLS binary RF 0.1478 0.3019 0.3499 0.2008 0.3233 DE UMLS weight RF 0.1594 0.3072 0.3618 0.2098 0.3411 DFR approaches using pseudo relevance feedback with 10 documents and 10 terms (query weight 0.5) and achieves 0.28 MAP and 0.43 BPref. The second best run for MAP and BPref is also from IMS, using the same ranking system without PRF. For NDCG, the organizers baseline using ElasticSearch BM25 without query expansion obtains higher results. Interstingly, we observe in that table that BM25 gives very good performances, but its various implementations can give very different results (MAP ranging from 0.11 to 0.26). Since the ad-hoc task used the same topics and documents as 2018 but in- tended to extend 2018’s pool, we compared teams results for all the ad-hoc task metrics in Figure 1. The figure shows that the extension of the pool had a rel- atively limited impact on the performances of each submitted systems, except for Bpref measure that shows a consistent decrease. Bpref, correlated to the av- erage precision, is more robust to reduced pools [1] and penalises systems for ranking non-relevant documents above relevant ones. Therefore, this decrease can be explained by the fact that the extension of the pool contained relevant documents. Fig. 1. Comparison of the task metrics with 2018 and 2020 pools. Blue and orange bars represent 2018 and 2018 results respectively Table 3 shows the results of the two teams taking part in the spoken queries subtask. For comparison purpose, the first line gives the results of the best overall baseline, ElasticSearch BM25. All the participants used the provided transcrip- tion. 
Team IMS, which obtained the highest results, used Reciprocal Rank fusion with BM25. They obtained their best results with Reciprocal Rank fusion with BM25 using pseudo relevance feedback with 10 documents and 10 terms (query weight 0.5). Team LIG submitted both runs merged over all participant speakers and per-speaker runs, using the transcriptions giving the highest performances. The merged runs gave better results, and no expansion method other than relevance feedback seemed to help retrieval. This table shows that retrieval from spoken queries is a challenging task and that transcription does not allow systems to achieve the same results as with the original query.

Table 4. Ranking of topicality relevance using MAP, BPref and NDCG@10 in the ad-hoc search task

Rank | MAP | BPref | NDCG@10
1 | IMS original rm3 rrf (0.2834) | IMS original rm3 rrf (0.4320) | Baseline elastic BM25f noqe (0.7197)
2 | IMS original rrf (0.2810) | IMS original rrf (0.4232) | IMS original rrf (0.6593)
3 | Baseline elastic BM25f noqe (0.2707) | Baseline elastic BM25f noqe (0.4207) | IMS bm25 orig (0.6493)
4 | Baseline terrier DirichletLM noqe (0.2706) | Baseline terrier DirichletLM noqe (0.4160) | IMS original rm3 rrf (0.6491)
5 | Baseline terrier BM25 cli (0.2643) | Baseline terrier BM25 noqe (0.3964) | IMS variant rrf (0.6339)
6 | Baseline terrier BM25 gfi (0.2632) | LIG Noexp Straight.res (0.3964) | Baseline terrier TF IDF noqe (0.6292)
7 | Baseline terrier BM25 noqe (0.2627) | Baseline terrier TF IDF noqe (0.3958) | Baseline terrier TF IDF qe (0.608)
8 | LIG Noexp Straight.res (0.2627) | Baseline terrier BM25 cli (0.3923) | Baseline terrier DirichletLM noqe (0.6054)
9 | Baseline terrier TF IDF noqe (0.2613) | Baseline terrier BM25 gfi (0.3922) | Baseline terrier BM25 noqe (0.5919)
10 | Baseline terrier TF IDF qe (0.2500) | IMS bm25 orig (0.3909) | LIG Noexp Straight.res (0.5919)

3.5 Evaluating Understandability

The evaluation of understandability has been measured with the understandability-ranked biased precision (uRBP) [61]. uRBP evaluates IR systems by taking into account both the topicality and understandability dimensions of relevance. In particular, the function for calculating uRBP was:

$$ uRBP = (1 - \rho) \sum_{k=1}^{K} \rho^{k-1}\, r(d@k) \cdot u(d@k), \qquad (1) $$

where r(d@k) is the relevance of the document d at position k, u(d@k) is the understandability value of the document d at position k, and the persistence parameter ρ models the user's desire to examine every answer; it was set to 0.50, 0.80, and 0.95 to obtain three versions of uRBP, corresponding to different user behaviours.

The results of the understandability evaluation for all the participants are shown in the second-to-last column of Table 2. Table 5 shows the top ten runs submitted in the ad-hoc task. The best run, by team IMS, was obtained with Reciprocal Rank fusion with BM25 run on manual variants of the query. The BM25 baseline gives very close performances. The run ranking does not differ from the topicality evaluation. This could be due to the fact that none of the submitted runs included features specifically designed to assess the understandability of the results. These results have been obtained with the binary relevance assessments, the graded understandability assessments, and rbp_eval 0.5 as distributed by the RBP group. For further details, please refer to https://github.com/jsc/rbp_eval.
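To make Equation (1) concrete, the sketch below computes uRBP for a single ranked list given per-rank relevance and understandability gains. It is an illustration only: the official scores were produced with the rbp_eval tool mentioned above, and the mapping of the 3-level assessment scale to gains of 0, 0.5, and 1 is an assumption made here for the example. cRBP (Equation 2 in Section 3.6) is obtained by simply replacing the understandability gains with credibility gains.

```python
# Illustrative sketch of Equation (1); official results were computed with rbp_eval.
# The 0 / 0.5 / 1 gain mapping of the 3-level assessment scale is an assumption.
from typing import List

def urbp(rel_gains: List[float], und_gains: List[float], rho: float = 0.8) -> float:
    """uRBP = (1 - rho) * sum_k rho^(k-1) * r(d@k) * u(d@k), for ranks k = 1..K."""
    return (1.0 - rho) * sum(
        (rho ** k) * r * u for k, (r, u) in enumerate(zip(rel_gains, und_gains))
    )

# Toy ranked list of five documents (gains are illustrative, not task data):
relevance = [1.0, 0.5, 0.0, 1.0, 0.5]
understandability = [0.5, 1.0, 1.0, 0.0, 0.5]
for rho in (0.5, 0.8, 0.95):  # impatient to persistent users, as in the task
    print(rho, round(urbp(relevance, understandability, rho), 4))
```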
Table 5. Understandability and credibility ranking evaluations using the uRBP and cRBP in the ad-hoc search task

Rank | uRBP | cRBP
1 | IMS variant rrf (0.3511) | Baseline elastic BM25f noqe (0.5424)
2 | Baseline elastic BM25f noqe (0.3494) | IMS original rrf (0.5215)
3 | IMS original rrf (0.3225) | IMS bm25 orig (0.5125)
4 | IMS bm25 orig (0.3204) | IMS original rm3 rrf (0.5039)
5 | IMS original rm3 rrf (0.3141) | IMS variant rrf (0.4863)
6 | Baseline terrier TF IDF noqe (0.3134) | Baseline terrier TF IDF noqe (0.4845)
7 | LIG UMLS RF.res (0.3123) | Baseline terrier TF IDF qe (0.4831)
8 | Baseline elastic BM25 QE Rein (0.3072) | Baseline terrier DirichletLM noqe (0.4789)
9 | LIG UMLS Straight.res (0.3062) | Baseline terrier BM25 noqe (0.4615)
10 | Baseline terrier TF IDF qe (0.3013) | LIG Noexp Straight.res (0.4615)

3.6 Evaluating Credibility

In this section, we report the results produced for the two subtasks: ad-hoc search and spoken queries retrieval. In particular, for the former, only ad-hoc IR is considered (no speech recognition). For each subtask, the results of both the baseline methods and the team submission methods in the context of credibility assessment related to IR are reported and commented on. The measures employed to assess credibility in the tasks considered are detailed below.

In the literature, accuracy is certainly the measure that has been used most frequently to evaluate a classification task, and, as such, it has usually been employed to evaluate the effectiveness of credibility assessment. In fact, to date, the problem of assessing the credibility of information has mostly been approached as a binary classification problem (by identifying credible versus non-credible information) [56]. Some works have also proposed the computation of credibility values for each piece of information considered, by proposing to users a credibility-based ranking of the considered information items, or by leaving to users the choice of trusting the information items based on these values [36, 56, 60]. Obviously, once an adequate threshold has been chosen, it would be possible to transform these approaches into approaches that produce a binary classification. However, our purpose in asking assessors to evaluate the documents from the point of view of their credibility is to be able to generate IR systems that can retrieve credible information, besides understandable and topically relevant information. For this reason, in CLEF eHealth 2020 we have adapted the understandability-ranked biased precision (uRBP) illustrated in [61] to credibility, by employing the so-called cRBP measure. In this case, the function for calculating cRBP is the same as the one used to calculate uRBP (see Equation 1 in Section 3.5), replacing u(d@k) by the credibility value of the document d at position k, c(d@k):

$$ cRBP = (1 - \rho) \sum_{k=1}^{K} \rho^{k-1}\, r(d@k) \cdot c(d@k). \qquad (2) $$

As in uRBP, the parameter ρ was set to three values, from an impatient user (0.50) to more persistent users (0.80 and 0.95). It is important to underline that, in defining the IR approaches implemented by both the organizers and the three groups that submitted runs, no explicit reference was made to solutions for assessing the credibility of documents. Therefore, any potential increase in the evaluation figures must be considered purely coincidental.

Evaluation with Accuracy. The results illustrated in this section were obtained with a binary credibility assessment for the 2020 data and a binary trustworthiness assessment for the 2018 data. A document assessed with a credibility/trustworthiness value ≥ 50% was considered as credible.
The accuracy of the credibility assessment was calculated over the top 100 documents retrieved for each query as follows:

$$ acc(q) = \frac{\#\text{credible retrieved docs in top 100}(q)}{\#\text{retrieved docs in top 100}(q)}. $$

Table 6. Credibility evaluations using the accuracy (Acc) in the ad-hoc search task

Baseline: Bing all 0.84 | elastic BM25 QE Rein 0.63 | elastic BM25f noqe.out 0.73 | elastic BM25f qe.out 0.53 | indri dirichlet noqe.out 0.67 | indri dirichlet qe.out 0.57 | indri okapi noqe.out 0.68 | indri okapi qe.out 0.63 | indri tfidf noqe.out 0.67 | indri tfidf qe.out 0.64 | terrier BM25 cli.out 0.85 | terrier BM25 gfi.out 0.85 | terrier BM25 noqe.out 0.80 | terrier BM25 qe.out 0.79 | terrier DirichletLM noqe.out 0.80 | terrier DirichletLM qe.out 0.64 | terrier TF IDF noqe.out 0.82 | terrier TF IDF qe.out 0.78
IMS: bm25 orig 0.75 | original rm3 rrf 0.74 | original rrf 0.80 | variant rrf 0.64
LIG: FT Straight.res 0.75 | Noexp Straight.res 0.80 | UMLS RF.res 0.74 | UMLS Straight.res 0.75
SandiDoc: tfidf 0.17

Table 6, referring to the ad-hoc search subtask, shows that most of the approaches tested presented a good accuracy value when it comes to the credibility of the retrieved documents. However, SandiDoc's submission had, unfortunately, a very low accuracy value. With respect to the spoken queries IR subtask, results were available only for IMS and LIG. For this second subtask, the accuracy values illustrated in Table 7 were lower than those for the previous task for all the approaches tested by IMS and LIG. This is most likely due to errors from speech recognition multiplying in IR, similar to what we experienced in the CLEF eHealth 2015 and 2016 tasks on speech recognition and IE to support nursing shift-change handover communication [47, 54].

Table 7. Credibility evaluations using the accuracy (Acc) in the spoken queries search task

IMS: bm25 all rrf 0.56 | bm25 all rrf rm3 0.59 | bm25 rrf 0.56 | bm25 rrf rm3 0.59
LIG: FT binary RF.res 0.52 | Noexp RF.res 0.52 | UMLS binary RF.res 0.49 | UMLS weight RF.res 0.51

Evaluation with cRBP. These results have been obtained with the binary relevance assessments and the graded credibility assessments, with the same program referred to in Section 3.5 (rbp_eval 0.5). To obtain cRBP with different persistence values, rbp_eval was run as follows for credibility:

rbp_eval -q -H qrels.credibility.clef201820201.test.binary runName

As for the values obtained by using the cRBP measure at 0.50, 0.80, and 0.95, it is possible to say that, regardless of the specific approach used, and with respect to the ad-hoc search task, they range in the 0.15–0.57 interval for the baseline, with an average of 0.40; in the 0.41–0.57 interval for IMS, with an average of 0.50; in the 0.40–0.49 interval for LIG, with an average of 0.45; and in the 0.16–0.23 interval for SandiDoc, with an average of 0.20. These results are fairly coherent with those obtained for accuracy. Considering the spoken queries retrieval subtask, also in this case the results are available only for IMS and LIG: for IMS they range in the 0.37–0.50 interval, with an average of 0.44, while for LIG they range in the 0.30–0.45 interval, with an average of 0.37.

4 Discussion

This year's challenge offered an ad-hoc search subtask and a spoken query retrieval subtask. In Section 3, we provided an analysis of the results obtained by the 3 teams who took part in these tasks. We showed the results achieved by several baselines provided by the task organizers.
We also discussed and compared these results in Section 3. We consider three dimensions of relevance: topicality, understandability, and credibility. The importance of considering several dimensions of relevance beyond the traditional topicality measure is highlighted in the results obtained, where we find that different retrieval techniques score higher under each of the relevance dimensions (topicality, understandability, and credibility).

As might be expected, retrieval performance is impacted when the queries are presented in spoken form. Speech-to-text conversion is required before the queries can be used in the developed retrieval approaches. The retrieval performance then is impacted by the quality of the speech-to-text conversion. Future studies will explore this phenomenon in greater detail.

We next look at the limitations of this year's challenge. We then reflect on prior editions of the challenge and the challenge's future, before concluding the paper.

4.1 Limitations

As previously illustrated, the concept of credibility has been studied in many different research fields, including both psychology/sociology and computer science. Introducing the concept of credibility in the context of IR is not an easy task. On the one hand, it represents a characteristic that can only be perceived by human beings [6]. On the other hand, it can be considered as an objective property of an information item, which an automated system can only estimate, and, as a consequence, the 'certainty' of the correct estimate can be expressed by numerical degrees [56]. For instance, let us consider the extreme example of information seekers who think that Covid-19 does not exist and that, therefore, any measure of social distancing is useless in preventing a non-existent contagion. From the point of view of topical relevance, documents that affirm that the virus is pure invention should be more relevant to them in a context of personalized search; however, documents claiming that it is useless to take precautions not to get infected should be assessed as non-credible in any case. In this context, the assessment of the global relevance of documents, which should take into account both objective and subjective dimensions of relevance, becomes highly problematic.

For the above reasons, it is difficult to measure and evaluate credibility with the traditional measures used in IR. As illustrated in Section 2.4, the measures that have been used so far to evaluate the effectiveness of a system in identifying credible or non-credible information are the traditional ones used in machine learning for classification tasks, in particular accuracy. To be able to evaluate IR systems that are capable of retrieving credible information (as well as understandable and topically relevant information), one could consider more IR-oriented metrics. In CLEF eHealth 2020, we have adapted uRBP to credibility, by employing the so-called cRBP measure (as illustrated in Section 3.6). However, this measure considers the credibility related to an information item as subjective, while we believe that we should assess credibility in an objective way. This makes this measure only partially suitable to our purposes (as the accuracy measure also was). In this scenario, it becomes essential to develop measures that go beyond taking information credibility into account as a binary value, as is done in classification systems.
This problem calls for advancing IR systems and developing related evaluation measures that factor in the joint goodness of the ranking produced by a search engine with respect to multiple dimensions of relevance. These include, but are not limited to, topicality, understandability, and credibility. Including credibility is critical in order to balance between the subjectivity (of assessors) and the objectivity of a fact reported in an information item. Consequently, its inclusion is certainly one of the most ambitious goals we set for our future work. 4.2 Comparison with Prior Work The inaugural CLEF eHealth CHS/IR task was organized in 2013 on the founda- tion set by the 2012 CLEF eHealth workshop. The principal finding of this work- shop, set to prepare for future evaluation labs, was identifying laypeople’s health information needs and related patient-friendly health information access meth- ods as a theme of the community’s research and development interest [46]. The resulting CLEF eHealth tasks on CHS/IR, offered yearly from 2013 to 2020 [53, 20, 11, 19, 12, 52, 21, 13], brought together researchers to work on the theme by providing them with timely task specifications, document collections, processing methods, evaluation settings, relevance assessments, and other tools. Targeted use scenarios for the designed, developed, and evaluated CHS/IR technologies in these CLEF eHealth tasks included easing patients, their families, clinical staff, health scientists, and healthcare policy makers in accessing and understanding health information. As a result, the annual tasks accelerated technology trans- fers from their conceptualisation in academia to generating societal impact [48, 49]. This achieved impact has led to CLEF eHealth establishing its presence and becoming by 2020 one of the primary evaluation lab and workshop series for all interdisciplinary actors of the ecosystem for producing, processing, and con- suming eHealth information [4, 16, 5]. Its niche in CHS/IR tasks is formed by addressing health information needs of laypeople — and not healthcare experts only — in accessing and understanding eHealth information in multilingual, multi-modal settings with simultaneous methodological contributions to dimen- sions of relevance assessments (e.g., topicality, understandability, and credibility of returned information). 4.3 A Vision for the Task beyond 2020 The general purpose of the CLEF eHealth workshops and its preceding CHS/IR tasks has been throughout the years from 2012 to 2020 to assist laypeople into finding and understanding health information in order to make enlightened deci- sions concerning their health and/or healthcare [46, 53, 20, 11, 19, 12, 51, 21, 13]. In that sense, the evaluation challenge will focus in the coming years on patient- centered IR in a both multilingual and multi-modal setting. Improving multilingual and multi-modal methods is crucial to guarantee a better access to information, and to understand it. Breaking language and modal- ity barriers has been a priority in CLEF eHealth over the years, and this will continue. Text has been the major media of interest, but as of 2020, also speech has been included as a major new way of people interacting with the systems. Patient-centered IR/CHS task has been running since 2013 — yet, every edi- tion has allowed to identify unique difficulties and challenges that have shaped the task evolution [53, 20, 11, 19, 12, 51, 21, 13]. 
The task has considered in the past, for example, multilingual queries, contextualized queries, spoken queries, and query variants. The resources used to build these queries have also been changed. Further exploration of query construction, aiming at a better understanding of information seekers' health information needs, is needed. The task will also further explore relevance dimensions (e.g., topicality, understandability, and credibility), with a particular emphasis on information credibility and methods to take these dimensions into consideration.

4.4 Conclusions

This paper provided an overview of the CLEF eHealth 2020 Task 2 on IR/CHS. The CLEF eHealth workshop series was established in 2012 as a scientific workshop with an aim of establishing an evaluation lab [46]. Since 2013, this annual workshop has been supplemented with two or more preceding shared tasks each year, in other words, the CLEF eHealth 2013–2020 evaluation labs [53, 20, 11, 19, 12, 51, 21, 13]. These labs have offered a recurring contribution to the creation and dissemination of text analytics resources, methods, test collections, and evaluation benchmarks in order to ease and support patients, their next of kin, clinical staff, and health scientists in understanding, accessing, and authoring eHealth information in a multilingual setting.

In 2020, the CLEF eHealth lab offered two shared tasks: one on multilingual IE and the other on Consumer Health Search. These tasks built on the IE and IR tasks offered by the CLEF eHealth lab series since its inception in 2013. Test collections generated by these shared tasks offered a specific task definition, implemented in a dataset distributed together with an implementation of relevant evaluation metrics to allow for direct comparability of the results reported by systems evaluated on the collections. These established CLEF IE and IR tasks used a traditional shared task model for evaluation in which a community-wide evaluation is executed in a controlled setting: independent training and test datasets are used and all participants gain access to the test data at the same time, following which no further updates to systems are allowed. Shortly after releasing the test data (without labels or other solutions), the participating teams submit their outputs from the frozen systems to the task organizers, who evaluate these results and report the resulting benchmarks to the community.

The annual CLEF eHealth workshops and evaluation labs have matured and established their presence in 2012–2020 in proposing novel tasks in IR/CHS. Given the significance of the tasks, all problem specifications, test collections, and text analytics resources associated with the lab have been made available to the wider research community through our CLEF eHealth website9.

9 https://clefehealth.imag.fr/

Acknowledgements

We gratefully acknowledge the contribution of the people and organizations involved in CLEF eHealth in 2012–2020 as participants or organizers. We thank the CLEF Initiative, Benjamin Lecouteux (Université Grenoble Alpes), João Palotti (Qatar Computing Research Institute), Harrisen Scells (University of Queensland), and Guido Zuccon (University of Queensland). We thank the individuals who generated spoken queries for the IR challenge. We also thank the individuals at the University of Queensland who contributed to the IR query generation tool and process.
We are very grateful to our assessors that helped despite the COVID-19 crisis: Paola Alberti, Vincent Arnone, Nathan Baran, Pierre Barbe, Francesco Bartoli, Nicola Brew-Sam, Angela Calabrese, Sabrina Caldwell, Daniele Cava- lieri, Madhur Chhabra, Luca Cuffaro, Yerbolat Dalabayev, Emine Darici, Marco Di Sarno, Mauro Guglielmo, Weiwei Hou, Yidong Huang, Zhengyang Liu, Fed- erico Moretti, Marie Revet, Paritosh Sharma, Haozhan Sun, Christophe Zeinaty. The lab has been supported in part by The Australian National University (ANU), College of Engineering and Computer Science, Research School of Com- puter Science; the Our Health in Our Hands (OHIOH) initiative; and the CLEF Initiative. OHIOH is a strategic initiative of The ANU which aims to transform healthcare by developing new personalised health technologies and solutions in collaboration with patients, clinicians, and healthcare providers. We acknowl- edge the Encargo of Plan TL (SEAD) to CNIO and BSC for funding, and the scientific committee for their valuable comments and guidance. References 1. Buckley, C., Voorhees, E.M.: Retrieval evaluation with incomplete information. In: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval. pp. 25–32 (2004) 2. Carminati, B., Ferrari, E., Viviani, M.: A multi-dimensional and event-based model for trust computation in the social web. In: International Conference on Social Informatics. pp. 323–336. Springer (2012) 3. Damiani, E., Viviani, M.: Trading anonymity for influence in open communities voting schemata. In: 2009 International Workshop on Social Informatics. pp. 63–67. IEEE (2009) 4. Demner-Fushman, D., Elhadad, N.: Aspiring to unintended consequences of natural language processing: A review of recent developments in clinical and consumer- generated text processing. Yearbook of Medical Informatics 1, 224–233 (2016) 5. Filannino, M., Uzuner, Ö.: Advancing the state of the art in clinical natural lan- guage processing through shared tasks. Yearbook of Medical Informatics 27(01), 184–192 (2018) 6. Fogg, B.J., Tseng, H.: The elements of computer credibility. In: Proc. of SIGCHI (1999) 7. Fontanarava, J., Pasi, G., Viviani, M.: Feature analysis for fake review detection through supervised classification. In: 2017 IEEE International Conference on Data Science and Advanced Analytics (DSAA). pp. 658–666. IEEE (2017) 8. Goeuriot, L., Jones, G.J., Kelly, L., Leveling, J., Hanbury, A., Müller, H., Salantera, S., Suominen, H., Zuccon, G.: ShARe/CLEF eHealth Evaluation Lab 2013, Task 3: Information retrieval to address patients’ questions when reading clinical reports. CLEF 2013 Online Working Notes 8138 (2013) 9. Goeuriot, L., Jones, G.J., Kelly, L., Leveling, J., Lupu, M., Palotti, J., Zuccon, G.: An Analysis of Evaluation Campaigns in ad-hoc Medical Information Retrieval: CLEF eHealth 2013 and 2014. Springer Information Retrieval Journal (2018) 10. Goeuriot, L., Kelly, L., Lee, W., Palotti, J., Pecina, P., Zuccon, G., Hanbury, A., Gareth J.F. Jones, H.M.: ShARe/CLEF eHealth Evaluation Lab 2014, Task 3: User-centred health information retrieval. In: CLEF 2014 Evaluation Labs and Workshop: Online Working Notes. Sheffield, UK (2014) 11. Goeuriot, L., Kelly, L., Suominen, H., Hanlen, L., Névéol, A., Grouin, C., Palotti, J., Zuccon, G.: Overview of the clef ehealth evaluation lab 2015. In: Information Ac- cess Evaluation. Multilinguality, Multimodality, and Visualization. Springer Berlin Heidelberg (2015) 12. 