<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Bari, Italy - rishabh.upadhyay@unimib.it (R. Upadhyay); gabriella.pasi@unimib.it (G. Pasi); marco.viviani@unimib.it (M. Viviani)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>An Overview on Evaluation Labs and Open Issues in Health-related Credible Information Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rishabh Upadhyay</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gabriella Pasi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Viviani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Milano-Bicocca, Department of Informatics, Systems, and Communication (DISCo)</institution>
          <addr-line>Information and Knowledge Representation, Retrieval, and Reasoning (IKR3) Lab, Edificio U14, Viale Sarca 336, 20126 Milan, Italy - https://ikr3.disco.unimib.it</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <abstract>
        <p>Faced with the problem of widespread health misinformation, in recent years credibility has increasingly been considered one of the important dimensions for health-related information access and retrieval. To encourage research in this field, a couple of evaluation labs have recently been set up to provide large test collections, baselines, and evaluation metrics to interested researchers. The purpose of this article is to provide an overview of such evaluation labs, discussing their characteristics and open issues.</p>
      </abstract>
      <kwd-group>
        <kwd>Health-related Information</kwd>
        <kwd>Credibility</kwd>
        <kwd>Consumer Health Search</kwd>
        <kwd>Evaluation Labs</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In recent years, the Web has increasingly been used to search for various health-related
information, ranging from medical therapies and treatments to lifestyle and wellness. According to
distinct surveys, between 60 and 70 percent of adults in the USA [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ], and around one in two EU
citizens [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], look for health information online. Recently, as health-related searches on Google
have become so popular, the term "Dr. Google" has also been coined [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Furthermore, exchanging
health-related information on social media is also becoming a common practice [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]; Twitter, for
example, is a widely used microblogging platform employed by both patients and healthcare
professionals [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], for both consumer health search and advertisement purposes.
      </p>
      <p>
        In this context, it is increasingly easy for people to run into health misinformation [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
Although numerous attempts have been made to provide online users with genuine content in
distinct domains, the development of automated approaches to tackle this issue in the health
scenario is still in its infancy. Nevertheless, relying on misinformation in such a context
can be particularly harmful, especially for users without sufficient health literacy.
      </p>
      <p>In this overview paper, our purpose is to outline and discuss the major issues related to
health-related information credibility, and to present two evaluation labs that have been established in
recent years with the aim of accounting for the above issues when considering IR in the health
domain.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Health-related Information Credibility and Evaluation Labs</title>
      <p>
        The problem of the credibility of online information has been studied for at least a decade now,
both in computer and data science [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. While in social sciences credibility is understood as a
subjective characteristic perceived by the information receiver [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], in computer science, and,
therefore, in the development of automated solutions to verify the genuineness of information,
it is necessary to produce an "objective" credibility assessment, by considering various
characteristics (i.e., features) related to the contents, their authors and the social relations between
users in the case of information spread through social platforms [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].1 Such an assessment can
be used to produce either a binary (i.e., credible versus non-credible), multi-class (e.g., credible,
non-credible, non-judged) or ordinal classification (e.g., non-credible, partially credible, credible,
highly credible) of the information, or even a credibility-based ranking.
      </p>
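      <p>As an illustration of the label schemes just mentioned, the following minimal sketch (with purely hypothetical score thresholds, not taken from any of the cited works) maps a numeric credibility score to a binary or ordinal label, or ranks documents by score:</p>

```python
# Hypothetical sketch: turning a numeric credibility score in [0, 1]
# (e.g., the output of a classifier) into the label schemes described above.
# The 0.5 threshold and the ordinal cut-points are illustrative assumptions.

def binary_label(score):
    return "credible" if score >= 0.5 else "non-credible"

def ordinal_label(score):
    # Four ordered levels, with evenly spaced illustrative cut-points.
    if score >= 0.75:
        return "highly credible"
    if score >= 0.5:
        return "credible"
    if score >= 0.25:
        return "partially credible"
    return "non-credible"

def credibility_ranking(scores):
    # Credibility-based ranking: document ids sorted by decreasing score.
    return sorted(scores, key=scores.get, reverse=True)

scores = {"d1": 0.9, "d2": 0.3, "d3": 0.6}
print(binary_label(scores["d1"]))    # credible
print(ordinal_label(scores["d2"]))   # partially credible
print(credibility_ranking(scores))   # ['d1', 'd3', 'd2']
```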
      <p>
        In recent years, some works have tried to consider the credibility of information as an aspect
of relevance in various Information Retrieval tasks [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ], including consumer health search
(CHS). In this context, some evaluation initiatives based on the Cranfield paradigm have been set
up to allow researchers to test the ability of their systems to account for credibility in relevance
assessment. Below, we discuss the major evaluation labs that address the above-mentioned
issue. Although in evaluation initiatives such as FIRE and NTCIR, information credibility has
somewhat been considered (in particular in the UrduFake [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and Lab-PoliInfo [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] labs), only
in TREC and CLEF a couple of recent labs have considered the health domain.
      </p>
      <p>
        2.1. TREC
The Text REtrieval Conference (TREC) introduced the Health Misinformation Track in 2019.2
Participants in this track are required, among other sub-tasks, to develop systems that "return
relevant and credible information that will help searchers make correct decisions" in the health
domain. Depending on the Track edition, other criteria have been considered beyond credibility,
which are briefly illustrated in the following sections.
      </p>
      <sec id="sec-2-1">
        <title>2.1.1. Data Collections</title>
        <p>The 2019 Health Misinformation Track used the ClueWeb12-B13 dataset as a corpus.3 Such a
corpus consists of English Web pages collected in 2012, related to various health issues and
containing both correct and incorrect information, of varying credibility and quality. The
2020 Track used a dataset provided by Common Crawl, in particular related to different news
collected in the first four months of 2020.4 On such a dataset, 74 COVID-19-related topics have
been selected, and Web pages filtered accordingly. In the current 2021 edition, the "noclean"
version of the C4 dataset used by Google has been employed.5
1The problem is made even more complex by the fact that some related but not totally overlapping terms are
used in the literature in addition to credibility, including veracity, trustworthiness, reliability, etc. This article is not
intended to disambiguate the use of these terms, but this is an issue that will certainly need to be further addressed.
2https://trec-health-misinfo.github.io/2019.html
3https://lemurproject.org/clueweb12/
4https://commoncrawl.org/2016/10/news-dataset-available/</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.1.2. Human Assessment</title>
        <p>In the different editions of the Health Misinformation Track, documents have been labeled
by human assessors with respect to distinct criteria: relevance, efficacy, and credibility in 2019,
usefulness, correctness, and credibility in 2020, and usefulness, credibility, and supportiveness in
2021. Efficacy concerns the presence in the document of "correct" information regarding the
topic’s treatment. This is similar to the correctness criterion employed in 2020. Both efficacy
and correctness have been assessed on a three-point scale, including a "non-judged" label.
Supportiveness is intended as the ability of the document to support or dissuade the use of
the treatment in the topic’s question. This criterion has been assessed on a three-point scale,
including a neutral value. It is important to note that in all three editions, documents judged as
non-relevant (or non-useful) have not been further assessed with respect to additional criteria.</p>
        <p>As regards the criterion that interests us most, namely credibility, it was assessed on a
three-point scale (including a "non-judged" label) in 2019, and on a binary scale in the last two
editions. Human assessors were asked to provide a credibility label based, among other things, on
the following aspects: the amount of expertise, authoritativeness, and trustworthiness of the
document, the indication of an author or an institute that published the Web document and
their credentials, the presence of citations to trustworthy/credible sources, the style of writing
(well written or poorly written), the purpose for which the document is written (to provide
information or for advertising purposes). In each edition, around 20,000 labeled documents
with over 50 topics have been provided.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.1.3. Baselines and Evaluation Metrics</title>
        <p>
          In both the 2019 and 2020 Health Misinformation Tracks, baselines based on the BM25 retrieval
model, implemented with the Anserini toolkit,6 with default parameters, have been
provided. Both the baselines and the submitted runs have been evaluated with respect to the
following measures, proposed in [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], to take into account the different criteria considered:
Normalized Local Rank Error (NLRE): the rank positions of documents are compared pairwise,
checking for "errors", which are defined as misplacements of documents, i.e., relevant or credible
documents placed after non-relevant or non-credible ones; Normalized Weighted
Cumulative Score (nWCS): a single label out of the multiple criteria is generated, and the standard
nDCG measure is computed; Convex Aggregating Measure (CAM): each criterion is considered
separately, and either AP or nDCG with respect to the ranking obtained with the single criterion
is computed; finally, the average AP or nDCG value is computed. In 2020, runs have also been
evaluated in terms of "traditional" evaluation measures, i.e., nDCG and MAP, to compare
measures accounting for relevance only with those that account for usefulness, credibility, and
correctness.
        </p>
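        <p>To make the CAM idea concrete, a minimal sketch is given below: nDCG is computed separately per criterion (here, relevance and credibility, with illustrative graded labels), and the per-criterion scores are averaged with equal weights. The gain values and the equal-weight average are assumptions for illustration; see Lioma et al. [13] for the exact formulation.</p>

```python
import math

# Sketch of the CAM idea: per-criterion nDCG, averaged over criteria.

def dcg(gains):
    # Discounted cumulative gain with a log2 rank discount.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(gains):
    # Normalize by the DCG of the ideal (descending) ordering.
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

def cam_ndcg(ranking, labels_by_criterion):
    # labels_by_criterion maps each criterion to {doc_id: graded label};
    # CAM here is the plain average of the per-criterion nDCG values.
    per_criterion = [
        ndcg([labels[d] for d in ranking])
        for labels in labels_by_criterion.values()
    ]
    return sum(per_criterion) / len(per_criterion)

ranking = ["d1", "d2", "d3"]
labels = {
    "relevance":   {"d1": 2, "d2": 0, "d3": 1},
    "credibility": {"d1": 1, "d2": 1, "d3": 0},
}
score = cam_ndcg(ranking, labels)
```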
        <sec id="sec-2-3-1">
          <title>2.2. CLEF</title>
          <p>
            5https://huggingface.co/datasets/allenai/c4
6https://github.com/castorini/anserini</p>
          <p>
            The Conference and Labs of the Evaluation Forum (CLEF) has included, starting from 2018, tasks
related to the automatic identification and verification of claims in social media, with the
CheckThat! Lab [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ]; however, it is only since 2020 that the concept of credibility has been
considered in the context of CHS in the eHealth Lab,7 of which we are co-organizers. The
aim is to assess the ability of systems to retrieve documents that are relevant, readable, and
credible; in addition, there is a sub-task that is specifically dedicated to credibility prediction.
          </p>
        </sec>
      </sec>
      <sec id="sec-2-4">
        <title>2.2.1. Data Collections</title>
        <p>For both the 2020 and 2021 eHealth editions, Web pages have been collected by repeatedly
submitting a set of CLEF eHealth 2018 queries to the Microsoft Bing APIs,8 over a period of
a few weeks. The list of obtained Web documents has been further augmented with other
reliable and unreliable Web pages, based on lists of Web sites previously
compiled by health institutions and agencies. Additionally, in the 2021 edition, social media
content from Reddit and Twitter has also been considered. Such content has been gathered with
respect to 150 health-related topics. Queries have been manually generated from such topics,
and used to filter posts and tweets from Reddit and Twitter, respectively. A Reddit document
consists of a so-called "submission", i.e., a post that has a title and a description, in which
a question is generally asked, and a "comment", i.e., a "reply" to the submission, whereas for
Twitter, a single tweet and its related metadata constitute the document.</p>
      </sec>
      <sec id="sec-2-5">
        <title>2.2.2. Human Assessment</title>
        <p>In CLEF eHealth, documents have been labeled with respect to three criteria, i.e., (topical)
relevance, readability (or understandability), and credibility. Relevance and readability have
been assessed on a three-point scale, i.e., non-relevant/readable, partially relevant/readable,
relevant/readable. Regarding credibility, it was considered useful to introduce a fourth label,
namely "not able to judge", given the peculiarity of this criterion.</p>
        <p>In particular, in assessing the credibility of Web pages and social content, human assessors
have been required to consider the availability of trustworthiness indicators of the source (e.g.,
expertise, Web reputation, etc.), the syntactic and semantic characteristics of the content (e.g.,
the writing style), the emotions that the text seeks to evoke, the presence of verifiable facts
and assertions (e.g., by the presence of citations or external links), the analysis of the social
relationships of the author of a post (in the case of social content).</p>
      </sec>
      <sec id="sec-2-6">
        <title>2.2.3. Baselines and Evaluation Metrics</title>
        <p>
          In CLEF eHealth 2020, organizers have developed baseline methods based on the Okapi BM25
retrieval model and query expansion optimized via reinforcement learning. The query expansion
model has been pre-trained using the TREC-CAR, Jeopardy, and Microsoft Academic datasets
from [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], and the expanded queries employed as the input of the BM25 model.
        </p>
        <sec id="sec-2-6-1">
          <title>7https://clefehealth.imag.fr/</title>
          <p>
            8https://github.com/CLEFeHealth/CLEFeHealth2018IRtask/blob/master/clef2018_queries_task2_task3.txt
In CLEF eHealth 2021, six baseline systems based on the Okapi BM25, Dirichlet Language Model (DirichletLM),
and Term Frequency times Inverse Document Frequency (TF×IDF) retrieval models with default
parameters have been provided. Further details can be found in [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ].
          </p>
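          <p>As a rough illustration of the kind of baseline retrieval model involved, a minimal BM25 scorer is sketched below; the parameter values k1 = 1.2 and b = 0.75 are common defaults, and the toy corpus is hypothetical. This is not the labs' actual implementation.</p>

```python
import math
from collections import Counter

# Minimal BM25 sketch over a toy tokenized corpus (illustrative only).

def bm25_scores(query, docs, k1=1.2, b=0.75):
    N = len(docs)
    avgdl = sum(len(tokens) for tokens in docs.values()) / N
    tfs = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    scores = {}
    for doc_id, tokens in docs.items():
        score = 0.0
        dl = len(tokens)
        for term in query:
            df = sum(1 for tf in tfs.values() if term in tf)
            if df == 0:
                continue
            # Smoothed inverse document frequency.
            idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
            tf = tfs[doc_id][term]
            # Saturated term frequency with length normalization.
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
        scores[doc_id] = score
    return scores

docs = {
    "d1": "garlic cures cold".split(),
    "d2": "vitamin c and the common cold".split(),
    "d3": "garlic recipes".split(),
}
ranked = sorted(bm25_scores("garlic cold".split(), docs).items(),
                key=lambda kv: kv[1], reverse=True)
# "d1" matches both query terms, so it ranks first.
```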
          <p>
            The following evaluation metrics have been used to assess the baselines and the submitted
runs: MAP, BPref, nDCG, uRBP, and cRBP, to evaluate the systems with respect to the ranking
produced by considering the three criteria, and Accuracy, F1-score, and AUC, to assess the goodness
of the (binary) classification of documents with respect to credibility only. The uRBP
(understandability Rank Biased Precision) and cRBP (credibility Rank Biased Precision) metrics
serve to account for the contribution of understandability and credibility in
the ranking produced by the retrieval models. uRBP has been introduced in [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ] while cRBP
has been employed for the first time (based on uRBP), in the 2020 edition of CLEF eHealth [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ].
          </p>
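          <p>The intuition behind uRBP and cRBP can be sketched as standard Rank-Biased Precision in which the relevance gain at each rank is multiplied by an understandability (or credibility) gain. The persistence value p = 0.8 and the toy gain values below are illustrative assumptions; see [17] for the exact uRBP formulation.</p>

```python
# Sketch of the uRBP/cRBP idea: RBP with the per-rank gain multiplied by an
# additional aspect gain (understandability or credibility). Illustrative only.

def aspect_rbp(rel_gains, aspect_gains, p=0.8):
    # (1 - p) * sum over ranks k of p^k * relevance(k) * aspect(k),
    # with k starting at 0 for the top-ranked document.
    return (1 - p) * sum(
        (p ** k) * r * a
        for k, (r, a) in enumerate(zip(rel_gains, aspect_gains))
    )

# Binary relevance and credibility gains per rank position:
rel = [1, 1, 0, 1]
cred = [1, 0, 1, 1]
# Only ranks where both gains are non-zero contribute to the score.
score = aspect_rbp(rel, cred)
```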
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Discussion and Open Issues</title>
      <p>Access to credible health-related information is one of the most challenging aspects of current
research in Information Retrieval. The development of evaluation initiatives that take this into
consideration, as outlined in this article, is undoubtedly promising; however, some aspects need
to be further considered.</p>
      <p>It is first necessary to consider that the health domain is characterized by the presence of
medical experts, and that health-related content can be marked by the use of a very specific
language. It is also necessary to consider that health-related information is disseminated both
in the form of Web pages and short texts (i.e., social content), and so it is necessary to consider
the problem of what constitutes a single unit of retrievable information. Finally, there is the
problem of evaluating the effectiveness of a retrieval system in considering credibility over
other relevance criteria.</p>
      <p>
        We have seen that current evaluation labs attempt to consider some of these issues. However,
in the future, it would be necessary to act in the following directions: (i) provide different
scenarios regarding content published by experts (for informational purposes) versus content
published, for example, in virtual communities (in a context of opinion exchange) (a first attempt
was made in CLEF eHealth 2020); (ii) better consider that evaluation by human assessors may
differ between Web pages and synthetic social content (a first attempt was made
in CLEF eHealth 2021); (iii) identify precisely, with respect to both Web pages and social content,
what content is being assessed (e.g., in a Web page there may be several sections with different
credibility, whereas the credibility of a social media post may be considered individually or with
respect to the thread containing it); (iv) develop new credibility-oriented assessment measures
(the work of [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and the metrics employed in both the TREC Health Misinformation Track and
CLEF eHealth constitute an important first step in this direction).
      </p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>This work is supported by the EU Horizon 2020 Research and Innovation Programme under the
Marie Skłodowska-Curie Grant Agreement No 860721 – DoSSIER: “Domain Specific Systems
for Information Extraction and Retrieval”.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K.</given-names>
            <surname>Purcell</surname>
          </string-name>
          , et al.,
          <article-title>Understanding the participatory news consumer</article-title>
          ,
          <source>Pew Internet and American Life Project</source>
          <volume>1</volume>
          (
          <year>2010</year>
          )
          <fpage>19</fpage>
          -
          <lpage>21</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Fox</surname>
          </string-name>
          , et al.,
          <article-title>The social life of health information</article-title>
          ,
          <source>California Healthcare Foundation</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>AA.VV.</surname>
          </string-name>
          ,
          <article-title>ICT usage in households and by individuals (isoc_i). Reference Metadata in Euro SDMX Metadata Structure (ESMS)</article-title>
          ,
          <source>Technical Report, EUROSTAT</source>
          ,
          <year>2021</year>
          . URL: https://ec.europa.eu/eurostat/web/products-eurostat-news/-/edn-20210406-1.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Millenson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Baldwin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zipperer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>Beyond Dr. Google: the evidence on consumer-facing digital tools for diagnosis</article-title>
          ,
          <source>Diagnosis</source>
          <volume>5</volume>
          (
          <year>2018</year>
          )
          <fpage>95</fpage>
          -
          <lpage>105</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Antheunis</surname>
          </string-name>
          , et al.,
          <article-title>Patients' and health professionals' use of social media in health care: motives, barriers and expectations</article-title>
          ,
          <source>Patient education and counseling 92</source>
          (
          <year>2013</year>
          )
          <fpage>426</fpage>
          -
          <lpage>431</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>W.-Y. S.</given-names>
            <surname>Chou</surname>
          </string-name>
          , et al.,
          <article-title>Addressing health-related misinformation on social media</article-title>
          ,
          <source>Jama</source>
          <volume>320</volume>
          (
          <year>2018</year>
          )
          <fpage>2417</fpage>
          -
          <lpage>2418</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Viviani</surname>
          </string-name>
          , G. Pasi,
          <article-title>Credibility in social media: opinions, news, and health information-a survey</article-title>
          ,
          <source>WIREs Data Mining and Knowledge Discovery</source>
          <volume>7</volume>
          (
          <year>2017</year>
          )
          <article-title>e1209</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Metzger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Flanagin</surname>
          </string-name>
          ,
          <article-title>Online health information credibility, Encyclopedia of Health Communication</article-title>
          . Thousand Oaks, CA: SAGE (
          <year>2011</year>
          )
          <fpage>976</fpage>
          -
          <lpage>978</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>W.</given-names>
            <surname>Weerkamp</surname>
          </string-name>
          , M. de Rijke,
          <article-title>Credibility-inspired ranking for blog post retrieval</article-title>
          ,
          <source>Information retrieval 15</source>
          (
          <year>2012</year>
          )
          <fpage>243</fpage>
          -
          <lpage>277</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D. G. P.</given-names>
            <surname>Putri</surname>
          </string-name>
          , et al.,
          <article-title>Social search and task-related relevance dimensions in microblogging sites</article-title>
          ,
          <source>in: International Conference on Social Informatics</source>
          , Springer,
          <year>2020</year>
          , pp.
          <fpage>297</fpage>
          -
          <lpage>311</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Amjad</surname>
          </string-name>
          , et al.,
          <source>UrduFake@FIRE2020: Shared Track on Fake News Identification in Urdu, in: Forum for Information Retrieval Evaluation</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>37</fpage>
          -
          <lpage>40</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kimura</surname>
          </string-name>
          , et al.,
          <article-title>Overview of the NTCIR-14 QA Lab-PoliInfo Task</article-title>
          ,
          <source>in: 14th NTCIR Conference on Evaluation of Information Access Technologies</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>121</fpage>
          -
          <lpage>140</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>C.</given-names>
            <surname>Lioma</surname>
          </string-name>
          , et al.,
          <article-title>Evaluation measures for relevance and credibility in ranked lists</article-title>
          ,
          <source>in: Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>91</fpage>
          -
          <lpage>98</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Barrón-Cedeno</surname>
          </string-name>
          , et al.,
          <article-title>Overview of CheckThat! 2020: Automatic identification and verification of claims in social media</article-title>
          ,
          <source>in: International Conference of the Cross-Language Evaluation Forum for European Languages</source>
          , Springer,
          <year>2020</year>
          , pp.
          <fpage>215</fpage>
          -
          <lpage>236</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>R.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <article-title>Task-oriented query reformulation with reinforcement learning</article-title>
          ,
          <source>arXiv preprint arXiv:1704.04572</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>L.</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          , et al.,
          <source>Consumer Health Search at CLEF eHealth</source>
          <year>2021</year>
          , in: CLEF (Working Notes),
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>G.</given-names>
            <surname>Zuccon</surname>
          </string-name>
          ,
          <article-title>Understandability biased evaluation for information retrieval</article-title>
          ,
          <source>in: European Conference on Information Retrieval</source>
          , Springer,
          <year>2016</year>
          , pp.
          <fpage>280</fpage>
          -
          <lpage>292</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>L.</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          , et al.,
          <article-title>Overview of the CLEF eHealth 2020 Task 2: Consumer Health Search with Ad Hoc and Spoken Queries</article-title>
          ,
          <source>in: CLEF (Working Notes)</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>