=Paper=
{{Paper
|id=Vol-2936/paper-62
|storemode=property
|title=Consumer Health Search at CLEF eHealth 2021
|pdfUrl=https://ceur-ws.org/Vol-2936/paper-62.pdf
|volume=Vol-2936
|authors=Lorraine Goeuriot,Hanna Suominen,Gabriella Pasi,Elias Bassani,Nicola Brew-Sam,Gabriela Nicole González Sáez,Liadh Kelly,Philippe Mulhem,Sandaru Seneviratne,Rishabh Upadhyay,Marco Viviani,Chenchen Xu
|dblpUrl=https://dblp.org/rec/conf/clef/GoeuriotSPBBSKM21
}}
==Consumer Health Search at CLEF eHealth 2021==
Lorraine Goeuriot1, Hanna Suominen2,3,4, Gabriella Pasi5, Elias Bassani5,6, Nicola Brew-Sam2, Gabriela González-Sáez1, Liadh Kelly7, Philippe Mulhem1, Sandaru Seneviratne2, Rishabh Upadhyay5, Marco Viviani5 and Chenchen Xu2,3

1 Université Grenoble Alpes, CNRS, Grenoble INP, LIG, F-38000 Grenoble, France
2 The Australian National University, Canberra, ACT, Australia
3 Data61/Commonwealth Scientific and Industrial Research Organisation, Canberra, ACT, Australia
4 University of Turku, Turku, Finland
5 University of Milano-Bicocca, DISCo, Italy
6 Consorzio per il Trasferimento Tecnologico - C2T, Milan, Italy
7 Maynooth University, Ireland

Abstract
This paper details the materials, methods, results, and analyses of the Consumer Health Search Task of the CLEF eHealth 2021 Evaluation Lab. This task investigates the effectiveness of information retrieval (IR) approaches in providing laypeople with access to medical information. For this, a TREC-style evaluation methodology was applied: a shared collection of documents and queries was distributed, participants' runs were received, relevance assessments were generated, and participants' submissions were evaluated. The task generated a new representative web corpus including web pages acquired from a 2021 CommonCrawl and social media content from Twitter and Reddit, along with a new collection of 55 manually generated layperson medical queries and their respective credibility, understandability, and topicality assessments for returned documents. This year's task focused on three subtasks: (i) ad-hoc IR, (ii) weakly supervised IR, and (iii) document credibility prediction. In total, 15 runs were submitted to the three subtasks: eight addressed the ad-hoc IR task, three the weakly supervised IR challenge, and four the document credibility prediction challenge. As in previous years, the organizers have made data and tools associated with the task available for future research and development.

Keywords
Dimensions of Relevance, eHealth, Evaluation, Health Records, Medical Informatics, Information Storage and Retrieval, Self-Diagnosis, Test-set Generation

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
Emails: lorraine.goeuriot@univ-grenoble-alpes.fr (L. Goeuriot); hanna.suominen@anu.edu.au (H. Suominen); gabriella.pasi@unimib.it (G. Pasi); elias.assani@unimib.it (E. Bassani); nicola.brew-sam@anu.edu.au (N. Brew-Sam); gabriela-nicole.gonzalez-saez@univ-grenoble-alpes.fr (G. González-Sáez); liadh.kelly@mu.ie (L. Kelly); philippe.mulhem@univ-grenoble-alpes.fr (P. Mulhem); sandaru.seneviratne@anu.edu.au (S. Seneviratne); r.upadhyay@campus.unimib.it (R. Upadhyay); marco.viviani@unimib.it (M. Viviani); chenchen.xu@anu.edu.au (C. Xu)
Web pages: http://mrim.imag.fr/User/lorraine.goeuriot/ (L. Goeuriot); https://researchers.anu.edu.au/researchers/suominen-h (H. Suominen); https://www.unimib.it/gabriella-pasi (G. Pasi); https://researchers.anu.edu.au/researchers/brew-sam-n (N. Brew-Sam); https://www.maynoothuniversity.ie/people/liadh-kelly (L. Kelly); https://ikr3.disco.unimib.it/people/marco-viviani/ (M. Viviani)
ORCID: 0000-0001-7491-1980 (L. Goeuriot); 0000-0002-4195-164 (H. Suominen); 0000-0002-6080-8170 (G. Pasi); 0000-0001-7544-4499 (N. Brew-Sam); 0000-0003-1131-5238 (L. Kelly); 0000-0002-2274-9050 (M. Viviani)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1. Introduction

In today's information-overloaded society, using a web search engine to find information related to health and medicine that is credible, easy to understand, and relevant to a given information need is increasingly difficult, thereby hindering patient and public involvement in healthcare [1]. These problems are referred to as the credibility, understandability, and topicality dimensions of relevance in information retrieval (IR) [2, 3, 4, 5, 6]. The CLEF eHealth lab (https://clefehealth.imag.fr/) of the Conference and Labs of the Evaluation Forum (CLEF, formerly known as Cross-Language Evaluation Forum, http://www.clef-initiative.eu/) is a research initiative that aims at providing datasets and gathering researchers working on information extraction, management, and retrieval tasks in the medical domain. Since its establishment in 2012, CLEF eHealth has included eight IR tasks on consumer health search (CHS) with more than twenty subtasks in total [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]. Its usage scenario is to ease and support laypeople, policymakers, and healthcare professionals in understanding, accessing, and authoring eHealth information in a multilingual setting.

In 2021, the CLEF eHealth initiative organised a CHS task with the following three IR subtasks:
1. Adhoc IR,
2. Weakly Supervised IR, and
3. Document Credibility Prediction.

The task has challenged researchers, scientists, engineers, analysts, and graduate students to develop better IR systems to support creating web-based search tools that return webpages better suited to laypeople's information needs from the perspectives of
1. information topicality (i.e., how relevant are the contents to the search topic),
2. information understandability (i.e., how easily can a layperson understand the contents), and
3. information credibility (i.e., should the contents be trusted).

As a continuation of the previous CLEF eHealth IR tasks that ran in 2013–2018 and 2020 [2, 18, 19, 20, 21, 22, 23, 24], the 2021 CHS task has embraced the Text REtrieval Conference (TREC)-style evaluation process. Namely, it has designed, developed, and deployed a shared collection of documents and queries, called for the contribution of runs from participants, and conducted the subsequent formation of relevance assessments and evaluation of the participants' submissions.

The main contributions of the CLEF eHealth 2021 task on CHS are as follows:
1. generating a novel representative web corpus,
2. collecting layperson medical queries,
3. attracting new submissions from participants,
4. contributing to IR evaluation metrics relevant to the three dimensions of document relevance, and
5. evaluating IR systems (i.e., runs submitted by the task participants or from the organizers' baseline systems) on newly conducted assessments.

The remainder of this paper is structured as follows: First, Section 2 details the task. Then, Section 3 introduces the document collection, topics, baselines, pooling strategy, and evaluation metrics. After this, Section 4 presents the participants and their approaches, while Section 5 addresses their results. Finally, Section 6 concludes the paper.

2. Description of the Tasks

In this section, we provide a description of the three subtasks offered in this year's CHS task, namely: Subtask 1 on ad-hoc IR, Subtask 2 on weakly supervised IR, and Subtask 3 on document credibility prediction.
2.1. Subtask 1: Adhoc Information Retrieval

Similar to previous years of the CHS task, this was a standard ad-hoc IR task. A document collection was provided to task participants along with realistic layperson medical information need use cases, both described in Section 3. The purpose of the task was to evaluate IR systems' abilities to provide users with credible, understandable, and topical documents. Participating teams submitted their runs, which were pooled together with baseline runs, and manual relevance assessments were conducted, as also described in Section 3. The systems' performance was assessed on multiple dimensions of relevance: credibility, understandability, and topicality. The evaluation metrics employed for this task are described in Section 3.6.

2.2. Subtask 2: Weakly Supervised Information Retrieval

This task aimed to evaluate the ability of machine learning-based ad-hoc IR models, trained with weak supervision, to retrieve relevant documents in the health domain. In order to train neural models to address the search task, a large collection of real-world, health-related queries extracted from commercial search engine query logs and synthetic (weak) relevance scores computed with a competitive IR system were considered. The submissions were evaluated against the same test set as the Subtask 1 submissions, in order to allow a full comparison of traditional versus neural approaches. Details on the methods and materials associated with this task are described in Section 3.

2.3. Subtask 3: Document Credibility Prediction

The purpose of this task was the automatic assessment of the credibility of information that is disseminated online, through the Web and social media. Using the dataset of Subtask 1, this task aimed at comparing approaches for estimating document credibility. The ground truth for this classification task is based on the credibility labels assigned to documents during the relevance assessment process. Evaluation of the runs includes classical classification metrics and investigates the ability of the participants' systems to perform well on both web documents and social media content.

3. Materials and Methods

In this section, we describe the materials and methods used in the CHS task of the CLEF eHealth evaluation lab 2021. After introducing our new document collection and novel topics, we describe our baseline systems and pooling methodology. Finally, we address our relevance assessments and evaluation metrics.

3.1. Document Collection

The 2021 CLEF eHealth Consumer Health Search document collection consisted of two separate crawls of documents: web documents acquired from the CommonCrawl and social media documents composed of Reddit and Twitter submissions.

First, we acquired web pages from the CommonCrawl. We extracted an initial list of websites from the 2018 CHS task of CLEF eHealth. This list was built by submitting a set of medical queries to the Microsoft Bing Application Programming Interfaces (through the Azure Cognitive Services) repeatedly over a period of a few weeks and acquiring the Uniform Resource Locators (URLs) of the retrieved results [22, 13]. We included the domains of the acquired URLs in the 2021 list, except some domains that were excluded for decency reasons. After this, similarly to 2018, we augmented the list by including a number of known reliable and unreliable health websites, domains, and social media contents of varying reliability levels. Finally, we further extended the list by including websites that were highly relevant for the task queries, finalizing our list of 600 domains. In summary, compared to the 2018 collection, we introduced thirteen new domains in 2021, and the documents were newly crawled from the latest CommonCrawl 2021-04 (https://commoncrawl.org/2021/02/january-2021-crawl-archive-now-available/).
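The crawler code actually used was distributed to participants (see Section 4). Purely as an illustration of the domain-restricted extraction step described above, a minimal sketch using the warcio package could look as follows; the file names and the load_domains helper are hypothetical, not part of the released tooling.

```python
from urllib.parse import urlparse

from warcio.archiveiterator import ArchiveIterator  # pip install warcio


def load_domains(path):
    """Read a domain whitelist, one domain per line (hypothetical file format)."""
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}


def extract_pages(warc_path, domains):
    """Yield (url, html) pairs for CommonCrawl responses whose host is in the domain list."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            host = urlparse(url).netloc.lower()
            # keep exact domain matches and their subdomains
            if any(host == d or host.endswith("." + d) for d in domains):
                yield url, record.content_stream().read()


if __name__ == "__main__":
    domains = load_domains("clef2021_chs_domains.txt")          # placeholder path
    for url, html in extract_pages("CC-MAIN-2021-04-example.warc.gz", domains):
        print(url, len(html))
```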
Second, we complemented the collection with social media documents from Reddit and Twitter. As the first step, we selected a list of 150 health topics related to various health conditions. Then, we generated search queries manually from those topics and submitted them to Reddit to retrieve posts and comments. After this, we applied the same process on Twitter to get related tweets from the platform. Please note that, for the purposes of the 2021 CHS task, we defined a social media document as the text obtained by a single interaction. This implies that
• on Reddit, a document is composed of a post, one comment on the post, and the associated meta-information, and
• on Twitter, a document is a single tweet with its associated meta-information.
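To make this notion of a "document" concrete, the following is a small illustrative sketch of the two document units described above; the field names are our own assumptions rather than the schema actually distributed with the collection.

```python
from dataclasses import dataclass, field


@dataclass
class RedditDocument:
    """One Reddit interaction: a post, a single comment on it, and meta-information."""
    doc_id: str
    post_title: str
    post_text: str
    comment_text: str
    meta: dict = field(default_factory=dict)  # e.g. subreddit, timestamps, scores (assumed fields)

    def text(self) -> str:
        # the retrievable text is the concatenation of post and comment
        return f"{self.post_title}\n{self.post_text}\n{self.comment_text}"


@dataclass
class TweetDocument:
    """One Twitter interaction: a single tweet and its meta-information."""
    doc_id: str
    tweet_text: str
    meta: dict = field(default_factory=dict)  # e.g. author, timestamp, retweet counts (assumed fields)

    def text(self) -> str:
        return self.tweet_text
```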
3.2. Topics

The set of topics of the CLEF eHealth IR task aimed at being representative of laypersons' medical information needs in various scenarios. This year, the set of topics was collected from two sources as follows:
• The first part was based on our insights from consulting laypeople with lived experience of multiple sclerosis (MS) or diabetes to motivate, validate, and refine the search scenarios; we, as experts in IR and CHS tasks, captured these layperson-informed insights as the scenarios.
• The second part was based on use cases from Reddit health forums; we extracted and manually selected a list of topics from Google Trends to best fit each use case.

As a result, we had a new set of 55 queries in English for authoring realistic search scenarios. To describe the scenarios, we enriched each query manually with labels in English, either to characterize the search intent (for manually created queries) or to capture the submission text (for social media queries) (Table 1).

Scenario 8
Query: best apps daily activity exercise diabetes
Narrative: I'm a 15 year old with diabetes. I'm planning to join the school hiking club. What are the best apps to track my daily activity and exercises?

Scenario 57
Query: multiple sclerosis stages phases
Narrative: I read that MS develops with several stages and includes phases of relapse. I want to know more about how this disease develops over time.

Scenario 126
Query: birth control suppression antral follicle count
Narrative: So I've been on birth control taking only active pills for months now and I went in to a doctor's office to inquire about egg freezing. She seemed optimistic that my ultrasound would be very reassuring but instead she came away from the ultrasound deeply concerned. I did an AMH test and it came back 0.3 which was consistent with what was seen in the ultrasound. I'm terrified. Like...suicidal terrified. I don't have any underlying conditions and while I'm not young I'm not old either. I don't smoke and I'm not terribly overweight. I don't have insulin resistance and my reproductive organs looks good other than my ovaries. I'd been taking the combo pill to skip periods for months now. I don't even take the brown pills. When I first heard that you can skip periods and only have 2-3 a year when taking brown pills I started on them. The first time I tried to have a period on the brown pills nothing happened. No period. This was 2 years ago. I freaked out but they said I just respond strongly to the pills. So I went off of the pills completely and had 2-3 normal periods starting about 2 months later. Could the birth control be suppressing my antral follicle count and amh this much? Please help.

Table 1: Illustration of Some of Our Queries and Narratives

Subtasks 1, 2, and 3 used these 55 queries, with 5 released for training and 50 reserved for testing; the test topics contained a balanced sample of the manually constructed and automatically extracted search scenarios.

3.3. Baseline Systems

For both Subtask 1 and Subtask 2, we provided six baseline systems. These systems applied the Okapi BM25 (BM for Best Matching), Dirichlet Language Model (DirichletLM), and Term Frequency times Inverse Document Frequency (TF×IDF) algorithms with default parameters. Each of them was implemented without and with pseudo relevance feedback (RF) using default parameters (DFR Bo1 model [25] on three documents, selecting ten terms). This resulted in the 3 × 2 = 6 baseline systems. The systems were implemented using Terrier version 5.4 [26], invoked as follows:

terrier batchretrieve -t topics.txt -w [TF_IDF|DirichletLM|BM25] [-q|]
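An approximate re-creation of these six baselines in Python with PyTerrier (the framework also used by the UniPd team, see Section 4) might look as follows. This is a sketch under the assumption of a local Terrier index and topics file, not the organizers' exact configuration.

```python
import pyterrier as pt  # pip install python-terrier

if not pt.started():
    pt.init()

# placeholders: the provided Terrier index and the topic file
index = pt.IndexFactory.of("./terrier_index")
topics = pt.io.read_topics("topics.txt")

baselines = {}
for wmodel in ("BM25", "DirichletLM", "TF_IDF"):
    retriever = pt.BatchRetrieve(index, wmodel=wmodel)
    baselines[wmodel] = retriever
    # pseudo relevance feedback with the DFR Bo1 model (3 feedback docs, 10 expansion terms)
    bo1 = pt.rewrite.Bo1QueryExpansion(index, fb_docs=3, fb_terms=10)
    baselines[wmodel + "_qe"] = retriever >> bo1 >> retriever

# produce one TREC-format run file per baseline
for name, system in baselines.items():
    run = system.transform(topics)
    pt.io.write_results(run, f"baseline_{name}.res")
```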
Regarding Subtask 3, where it is necessary to evaluate the effectiveness of approaches at assessing the credibility of information, it was decided to approach the problem as a binary classification (identification of credible versus non-credible information). For this reason, simple baselines were developed based on supervised classifiers acting on a set of features that can be extracted from the documents under consideration. Since the documents comprise both Web pages and social media content, only features that are directly extractable from the text of the documents were considered in baseline development. For social media content, it would also have been possible to consider other metadata related to the social network of the authors of the posts, but for consistency with the evaluation of Web pages, such metadata were not considered. The employed supervised classifiers are based on Support Vector Machines (SVM), Random Forests (RF), and Logistic Regression (LR), i.e., the machine learning techniques that have proven most effective for binary credibility assessment in the literature [27]. These baselines act on the following linguistic features: the TF×IDF text representation, the word embedding text representation obtained by Word2vec pre-trained on Google News (https://github.com/mmihaltz/word2vec-GoogleNews-vectors), and, given the health-related scenario considered, the word embedding representation obtained by Word2vec pre-trained on both PubMed and the Medical Information Mart for Intensive Care III (MIMIC-III) (https://github.com/ncbi-nlp/BioSentVec). Python and the scikit-learn library [28] were employed for implementing the machine learning algorithms and for producing the TF×IDF text representation. Furthermore, we considered threshold tuning to obtain the optimal cut-off for the runs, based on Youden's index [29].
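A minimal sketch of such a supervised baseline, here with TF×IDF features and the three classifier families named above, is given below; the training texts and labels are placeholders, and the organizers' actual feature extraction, hyper-parameters, and embedding-based variants may differ.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# hypothetical training data: document texts with binary credibility labels
train_texts = [
    "peer-reviewed guidance on diabetes management",
    "miracle cure shared in an online forum",
    "hospital page describing multiple sclerosis stages",
    "rumour about birth control side effects",
]
train_labels = [1, 0, 1, 0]  # 1 = credible, 0 = non-credible

# TF×IDF features; the Word2vec variants described above would instead feed
# averaged pre-trained embeddings (Google News or PubMed+MIMIC-III) to each classifier.
baselines = {
    "SVM": make_pipeline(TfidfVectorizer(), LinearSVC()),
    "RF": make_pipeline(TfidfVectorizer(), RandomForestClassifier()),
    "LR": make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000)),
}

for name, pipeline in baselines.items():
    pipeline.fit(train_texts, train_labels)

# Youden-index threshold tuning would be applied to continuous classifier scores
# (e.g. predict_proba / decision_function); a sketch is given in Section 5.4.1.
test_texts = ["home remedy that cures everything overnight"]
print({name: pipeline.predict(test_texts) for name, pipeline in baselines.items()})
```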
3.4. Pooling Methodology

Similar to the 2016, 2017, 2018, and 2020 pools, we created the pool using the rank-biased precision (RBP)-based Method A (summing contributions) [30], in which documents are weighted according to their overall contribution to the effectiveness evaluation as provided by the RBP formula (with p = 0.8, following a study published in 2007 on RBP [31]). This strategy, called RBPA, has been shown to be more efficient than traditional fixed-depth or stratified pooling for evaluating systems under fixed assessment budget constraints [32], as was the case for this task. All participants' runs were considered for the document pool, along with the six baselines provided by the organizers. In order to guarantee the judgement of documents from the participants' runs, half of the pool was composed of their documents and half of documents from the baselines' runs, which resulted in 250 documents per query in the pool.

3.5. Relevance Assessment

The credibility, understandability, and topicality assessments were performed by 26 volunteers (19 women and 7 men) in May–June 2021. Of these assessors, 16 were from Australia, 4 from Italy, 3 from France, 2 from Ireland, and 1 from Finland. We recruited, trained, and supervised them using bespoke written materials from April to June 2021. The recruitment took place via email and on social media, using both our existing contacts and snowballing. We implemented these assessments online by expanding and customising the Relevation! tool for relevance assessments [33] to capture the three dimensions of document relevance and their scales (see [17, Figures 4–6] for illustrations of the online assessment environment). We associated every query with 250 documents to be assessed with respect to their credibility, understandability, and topicality. Initially, we allocated each assessor 2 queries for their assessment and then revised these allocations based on their individual needs and availability. In the end, every assessor completed 1–4 queries. Ethical approval (2021/013) was obtained for all aspects of this assessment study involving human participants as assessors from the Human Research Ethics Committee of the Australian National University (ANU). Each study participant provided informed consent.

3.6. Evaluation Metrics

We considered evaluation measures that allowed us to evaluate both:
1. the effectiveness of the systems with respect to the ranking produced by taking into account the three criteria considered, and
2. the accuracy of the (binary) classification of documents with respect to credibility (Subtask 3).

Specifically, with regard to the first aspect, this included the following performance evaluation measures: Mean Average Precision (MAP), the preference-based BPref metric, normalized Discounted Cumulative Gain (nDCG), the understandability-based variant of Rank Biased Precision (uRBP), and the credibility-ranked Rank Biased Precision (cRBP) [6]. With respect to the second aspect, we referred to classical measures that assess the goodness of a classifier, such as Accuracy and the Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) [34]. In particular, since Subtask 3 is also devoted to assessing credibility in relation to information needs (topics), the average of a Topic-based Credibility Precision with respect to each topic, namely CP(q), has also been considered. A brief explanation of the three measures that need more detail, namely uRBP, cRBP, and CP(q), is provided below.

3.6.1. Understandability-Ranked Biased Precision

The uRBP measure evaluates IR systems by taking into account both the topicality and understandability dimensions of relevance. In particular, the function for calculating uRBP is [35]:

uRBP = (1 - \rho) \sum_{k=1}^{K} \rho^{k-1} \, r(d@k) \cdot u(d@k),   (1)

where r(d@k) is the relevance of the document d at position k, u(d@k) is the understandability value of the document d at position k, and the persistence parameter \rho models the user's desire to examine every answer; it was set to 0.50, 0.80, and 0.95 to obtain three versions of uRBP, corresponding to different user behaviours.

3.6.2. Credibility-Ranked Biased Precision

In CLEF eHealth 2020, we adapted uRBP to credibility, obtaining the so-called credibility-ranked biased precision (cRBP) measure [6]. The function for calculating cRBP is the same as that used to calculate uRBP, replacing u(d@k) with the credibility value c(d@k) of the document d at position k:

cRBP = (1 - \rho) \sum_{k=1}^{K} \rho^{k-1} \, r(d@k) \cdot c(d@k).   (2)

As in uRBP, the parameter \rho was set to three values, from an impatient user (0.50) to more persistent users (0.80 and 0.95).

3.6.3. Topic-based Credibility Precision

The precision in retrieving credible documents can be calculated over the top-k documents in the ranking, for each topic (query) q, as follows:

CP(q) = \frac{\#credible\_retrieved\_docs\_top\_k(q)}{\#retrieved\_docs\_top\_k(q)}.
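The following sketch shows how uRBP/cRBP (Equations 1–2) and CP(q) can be computed from a ranked list of gain values; the mapping from the graded assessment labels to numeric gains in [0, 1] is an assumption of this illustration, not a prescription of the official evaluation scripts.

```python
def graded_rbp(rel, dim, rho=0.8):
    """uRBP/cRBP (Eqs. 1-2): rel[k] and dim[k] are the topicality gain and the
    understandability (or credibility) gain of the document ranked at position k+1."""
    return (1 - rho) * sum(
        rho ** k * r * d for k, (r, d) in enumerate(zip(rel, dim))
    )


def credibility_precision(credible_flags, k=100):
    """CP(q): fraction of the top-k retrieved documents that are credible."""
    top = credible_flags[:k]
    return sum(top) / len(top) if top else 0.0


# toy example for one query: grades of the top 5 retrieved documents
relevance = [1.0, 0.5, 0.0, 1.0, 0.5]
understandability = [1.0, 1.0, 0.5, 0.0, 0.5]
credible = [True, False, True, True, False]

print(graded_rbp(relevance, understandability, rho=0.8))  # uRBP at rho = 0.8
print(credibility_precision(credible, k=5))               # CP(q) over the top 5
```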
4. Participants and Approaches

In 2021, 43 teams registered for the task on the web site, and two teams submitted runs to the subtasks. We provided the registered participants with the crawler code and the domain list of the crawl for both the web documents and the social media documents. They also had access, on demand, to indexes built by the organizers from the document collection. Participants' submissions were due by 8 May 2021. Of the four team-subtask submissions, two were to Subtask 1 on Adhoc IR, one was to Subtask 2 on Weakly Supervised IR, and one was to Subtask 3 on Document Credibility Prediction. The teams were from two countries (i.e., China and Italy) in two continents (i.e., Asia and Europe).

In Subtask 1, the submissions were by
1. a 4-member team from the School of Computer Science, Zhongyuan University of Technology (ZUT) in Zhengzhou, China [36], and
2. a 2-member team from the Information Management Systems (IMS) Research Group, University of Padova (UniPd), Padova, Italy [37].

In Subtasks 2 and 3, the submissions were by the leader of this IMS UniPd team [37], a regular participant in our previous CHS tasks. The two teams submitted the following types of document ranking approaches as runs to the Adhoc IR subtask (Table 2): Team ZUT used a learning-to-rank approach in all its four submitted runs, but with four different machine learning algorithms to train their models [36]. In contrast, Team UniPd based their four submissions on PyTerrier, a well-known Python framework for IR, as variants of Reciprocal Rank Fusion (RRF) with the provided Terrier index [37]. The UniPd team submitted closely related approaches to the other two subtasks (Table 2). Namely, they also submitted their aforementioned four approaches to the Weakly Supervised IR subtask, and two of them plus another two variants to the Document Credibility Prediction subtask [37].

Subtasks | Team | Run | Approach | Algorithmic Details
1 | ZUT | i | Learning to Rank | LM
1 | ZUT | ii | Learning to Rank | MR
1 | ZUT | iii | Learning to Rank | RFs
1 | ZUT | iv | Learning to Rank | RB
1–3 | UniPd | a | RRF using PyTerrier with the provided Terrier index | BM25, QLM, & DFR
1 & 2 | UniPd | b | RRF using PyTerrier with the provided Terrier index | BM25, QLM, & DFR using the RM3 relevance language model for pseudo RF
1–3 | UniPd | c | RRF using PyTerrier with the provided Terrier index | BM25, QLM, & DFR on manual variants of the query
1 & 2 | UniPd | d | RRF using PyTerrier with the provided Terrier index | BM25, QLM, & DFR on manual variants using RM3 pseudo RF
3 | UniPd | e | RRF using PyTerrier with the provided Terrier index | UniPd's runs a & b above (i.e., the ones without manual variants of the query) merged with min-max normalization
3 | UniPd | f | RRF using PyTerrier with the provided Terrier index | UniPd's runs c & d above (i.e., the ones with manual variants of the query) merged with min-max normalization

Table 2: Summary of Runs Submitted by Participating Teams. Acronyms: BM — Best Match, DFR — divergence from randomness, LM — LambdaMART, MR — MART, RB — RankBoost, RF — relevance feedback, RFs — random forests, RRF — Reciprocal Rank Fusion, QLM — query likelihood model.
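For reference, Reciprocal Rank Fusion merges several rankings by summing reciprocal ranks. A generic sketch is given below (this is not the UniPd team's code; see [37] for their exact setup), using the conventional smoothing constant k = 60.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids.

    rankings: iterable of lists, each ordered from best to worst.
    Returns the document ids sorted by the RRF score sum(1 / (k + rank)).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# toy example: fuse BM25, QLM, and DFR rankings for one query
bm25 = ["d1", "d2", "d3"]
qlm = ["d2", "d1", "d4"]
dfr = ["d3", "d2", "d1"]
print(reciprocal_rank_fusion([bm25, qlm, dfr]))
```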
5. Results

In 2021, the CLEF eHealth CHS task generated a new representative Web corpus including Web pages acquired from a 2021 CommonCrawl and social media content from Twitter and Reddit. A new collection of 55 manually generated layperson medical queries was also created, along with their respective credibility, understandability, and topicality assessments for returned documents. In total, 15 runs were submitted to the three subtasks on adhoc IR, weakly supervised IR, and document credibility prediction.

5.1. Coverage of Relevance Assessments

A total of 12,500 assessments were made on 11,357 documents: 7,400 Web documents and 3,957 social media documents. Figure 1 shows the number of social media and Web documents that were assessed for each query; its bottom part shows which queries were created from discussions with patients (expert queries) and which were created from discussions on social media (social media queries). Web documents made up the larger part of the document pool. Nevertheless, the proportion of social media documents was larger for queries based on social media. A special case is presented by queries 22, 63, and 116, for which the pool was composed of Web documents only; that is, the submitted runs retrieved only Web documents for these queries.

Figure 1: Proportion of social media and Web documents for each query; orange represents the proportion of Web documents and blue that of social media documents. The bottom line shows queries based on experts' discussions with patients and queries based on Reddit posts.

Dimension | Social Media: Highly | Social Media: Somewhat | Social Media: Not | Web: Highly | Web: Somewhat | Web: Not
Topical | 925 | 1,046 | 1,555 | 2,175 | 2,540 | 4,259
Understandable | 1,160 | 1,766 | 600 | 4,758 | 3,014 | 1,202
Credible | 12 | 1,427 | 1,967 | 4,552 | 3,123 | 753

Table 3: Number of topical (a.k.a. relevant), understandable, and credible documents among social media and Web documents.

While the distributions of the topicality and understandability assessments were similar for social media and Web documents, the credibility assessments differed considerably (Table 3): only 12 documents (less than 1% of the assessed social media documents) were assessed as highly credible in the social media set, in contrast to 4,552 (54% of the assessed Web documents) highly credible Web documents. Finally, a substantial proportion of the social media documents was assessed as not credible. Figure 2 shows the relationship between the three dimensions of relevance; its diagonal exemplifies the distribution of each dimension. Each plot shows the number of documents evaluated as "highly", "somewhat", or "not", with respect to two dimensions of relevance (one on each axis). For example, for the social media assessments, Figure 2 shows that, among the not credible documents, a fraction is not topically relevant, a set is somewhat topically relevant, and a small part is nevertheless highly topically relevant.

Figure 2: Relationship between pairs of relevance dimensions, and distribution of assessments for each type of document.

5.2. Subtask 1: Adhoc Information Retrieval

In this section, we present the results for the Adhoc IR subtask, where the systems were evaluated on different dimensions of relevance. For topicality, we evaluated the systems using the MAP, BPref, and NDCG@10 performance metrics. For readability, we made use of uRBP, which considers the topicality and understandability of the documents. Finally, in order to include credibility in our evaluation, we measured the systems' cRBP performance, which considers the topicality and credibility of the documents.

Table 4 presents the ranking of participant systems and organizers' baselines with respect to the three metrics of topical relevance: MAP, BPref, and NDCG@10. The team achieving the highest results was UniPd. Their top run, original_rm3_rrf, used Reciprocal Rank Fusion with the BM25, QLM, and DFR approaches using pseudo relevance feedback with 10 documents and 10 terms (query weight 0.5). It achieved 0.43 MAP and 0.51 BPref. The best system of the ZUT team was their run3, which used learning-to-rank techniques with a model trained using Random Forests. ZUT run3 was the best system of the team on all three metrics, with 0.4 MAP, 0.47 BPref, and 0.66 NDCG@10. For BPref and NDCG@10, the organizers' baseline using TF×IDF with query expansion obtained higher results.

Rank | Team – run (MAP) | MAP | Team – run (BPref) | BPref | Team – run (NDCG@10) | NDCG@10
1 | UniPd original_rm3_rrf | 0.431 | Baseline terrier_TF×IDF_qe | 0.511 | Baseline terrier_TF×IDF_qe | 0.654
2 | UniPd simplified_rm3_rrf | 0.431 | UniPd original_rm3_rrf | 0.508 | Baseline terrier_TF×IDF | 0.646
3 | ZUT run3_clef2021_task2 | 0.409 | UniPd simplified_rm3_rrf | 0.508 | Baseline terrier_BM25 | 0.636
4 | Baseline terrier_TF×IDF_qe | 0.397 | Baseline terrier_BM25_qe | 0.499 | Baseline terrier_BM25_qe | 0.635
5 | Baseline terrier_BM25_qe | 0.390 | Baseline terrier_TF×IDF | 0.474 | UniPd original_rrf | 0.619
6 | ZUT run4_clef2021_task2 | 0.373 | Baseline terrier_DirichletLM | 0.472 | ZUT run3_clef2021_task2 | 0.615
7 | Baseline terrier_DirichletLM | 0.369 | Baseline terrier_BM25 | 0.471 | UniPd original_rm3_rrf | 0.614
8 | Baseline terrier_TF×IDF | 0.366 | ZUT run3_clef2021_task2 | 0.469 | Baseline terrier_DirichletLM | 0.595
9 | Baseline terrier_BM25 | 0.364 | ZUT run4_clef2021_task2 | 0.447 | ZUT run4_clef2021_task2 | 0.565
10 | ZUT run2_clef2021_task2 | 0.338 | ZUT run1_clef2021_task2 | 0.441 | Baseline terrier_DirichletLM_qe | 0.536
11 | ZUT run1_clef2021_task2 | 0.338 | ZUT run2_clef2021_task2 | 0.428 | ZUT run1_clef2021_task2 | 0.526
12 | Baseline terrier_DirichletLM_qe | 0.242 | Baseline terrier_DirichletLM_qe | 0.369 | ZUT run2_clef2021_task2 | 0.482
13 | UniPd original_rrf | 0.236 | UniPd original_rrf | 0.300 | UniPd simplified_rm3_rrf | 0.478
14 | UniPd simplified_rrf | 0.215 | UniPd simplified_rrf | 0.298 | UniPd simplified_rrf | 0.459

Table 4: Ad-hoc task ranking of systems (each pair of columns ranks the systems by the respective metric).

Table 5 compares the results of our baseline systems with last year's Adhoc IR task results. For the MAP and BPref metrics, all the systems performed better on the 2021 test collection; for NDCG@10, DirichletLM with and without relevance feedback performed better on the 2020 test collection. Despite this clear improvement from 2020 to 2021, the ranking of systems changed for all metrics. Namely, in 2020, the best MAP and BPref performance was achieved by DirichletLM, and the best NDCG@10 by TF×IDF. In contrast, in this year's task, the best baseline MAP, BPref, and NDCG@10 were all achieved by TF×IDF with relevance feedback.

System | MAP 2020 | MAP 2021 | BPref 2020 | BPref 2021 | NDCG@10 2020 | NDCG@10 2021
terrier_BM25 | 0.2627 | 0.3641 | 0.3964 | 0.4707 | 0.5919 | 0.6364
terrier_BM25_qe | 0.2453 | 0.3903 | 0.3784 | 0.4994 | 0.5698 | 0.6352
terrier_DirichletLM | 0.2706* | 0.3694 | 0.4160* | 0.4724 | 0.6054 | 0.5952
terrier_DirichletLM_qe | 0.1453 | 0.2423 | 0.2719 | 0.3691 | 0.5521 | 0.5362
terrier_TF×IDF | 0.2613 | 0.3663 | 0.3958 | 0.4744 | 0.6292* | 0.6464
terrier_TF×IDF_qe | 0.2500 | 0.3974* | 0.3802 | 0.5106* | 0.6080 | 0.6535*

Table 5: MAP, BPref, and NDCG@10 of the baseline systems on the 2020 and 2021 Adhoc CHS task test collections. The best value per column is marked with *.
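The topicality figures in Tables 4 and 5 are standard TREC-style measures. As an illustration (not the organizers' evaluation scripts), they could be computed from the qrels and a run with the pytrec_eval package as follows, using toy qrels and run data.

```python
import pytrec_eval  # pip install pytrec-eval

# qrels: {query_id: {doc_id: graded topicality label}}
qrels = {"q1": {"d1": 2, "d2": 0, "d3": 1}}
# run: {query_id: {doc_id: retrieval score}}
run = {"q1": {"d1": 12.3, "d2": 9.1, "d3": 7.4}}

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"map", "bpref", "ndcg_cut"})
results = evaluator.evaluate(run)

for qid, metrics in results.items():
    print(qid, metrics["map"], metrics["bpref"], metrics["ndcg_cut_10"])
```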
The results for all participants for the readability and credibility evaluations are shown in Table 6. For the readability evaluation, the best run was the baseline system TF×IDF with relevance feedback; the best participant system was ZUT's run3, the same system that presented the best performance in terms of topicality. For the credibility evaluation, the best run was the baseline system DirichletLM, while the best participant system was UniPd's original_rm3_rrf, the team's best system in the topicality evaluation.

Rank | Team – run (rRBP) | rRBP | Team – run (cRBP) | cRBP
1 | Baseline terrier_TF×IDF_qe | 0.523 | Baseline terrier_DirichletLM | 0.458
2 | Baseline terrier_TF×IDF | 0.509 | UniPd original_rm3_rrf | 0.452
3 | Baseline terrier_BM25_qe | 0.507 | Baseline terrier_TF×IDF_qe | 0.450
4 | Baseline terrier_BM25 | 0.501 | Baseline terrier_BM25_qe | 0.432
5 | ZUT run3_clef2021_task2 | 0.494 | Baseline terrier_DirichletLM_qe | 0.429
6 | UniPd original_rrf | 0.491 | UniPd original_rrf | 0.427
7 | UniPd original_rm3_rrf | 0.475 | Baseline terrier_TF×IDF | 0.418
8 | Baseline terrier_DirichletLM | 0.463 | ZUT run3_clef2021_task2 | 0.414
9 | ZUT run4_clef2021_task2 | 0.450 | ZUT run4_clef2021_task2 | 0.409
10 | Baseline terrier_DirichletLM_qe | 0.408 | Baseline terrier_BM25 | 0.406
11 | ZUT run1_clef2021_task2 | 0.408 | UniPd simplified_rm3_rrf | 0.331
12 | UniPd simplified_rm3_rrf | 0.369 | ZUT run2_clef2021_task2 | 0.330
13 | UniPd simplified_rrf | 0.359 | ZUT run1_clef2021_task2 | 0.319
14 | ZUT run2_clef2021_task2 | 0.349 | UniPd simplified_rrf | 0.317

Table 6: Runs ranked for the readability relevance dimension (rRBP) and credibility (cRBP).
5.3. Subtask 2: Weakly Supervised Information Retrieval

The purpose of this subtask was to evaluate machine learning-based systems trained on the weakly supervised dataset. UniPd submitted their Subtask 1 runs to this subtask; these runs were not trained on the weakly supervised dataset, and their submission therefore provides a kind of baseline for a system that does not use the weak supervision data. Since no other team submitted runs to this subtask, we cannot provide any evaluation.

5.4. Subtask 3: Document Credibility Prediction

To assess document credibility, the UniPd team reused the runs computed in Subtask 1 and grouped them in order to produce a single score for each document. Their simple hypothesis was that documents that obtain a higher score across different search engines are also more credible. Their approach did not consider any additional information about the provenance of the documents.

5.4.1. Credibility Assessment as a Binary Classification Problem

The results illustrated in this section were obtained by performing a binary credibility assessment, using the CLEF 2020 eHealth dataset for training the baselines and testing on a subset of the CLEF 2021 eHealth data, namely the documents appearing in the two runs subtask1_ims_original and subtask1_ims_simplified submitted by the UniPd team. In this case, the topics against which the documents were retrieved are not taken into account. The results of classifying documents according to their credibility, using both the simple baselines (illustrated in Section 3.3) and the considered runs, are shown in Table 7. The results obtained from both baselines and runs were evaluated by means of classical measures in the context of document classification, that is, AUC and Accuracy (as discussed in Section 3.6).

Model | AUC | Accuracy
Baseline SVM_orig_tf_idf | 56.89% | 63.38%
Baseline RF_orig_tf_idf | 47.37% | 70.94%
Baseline LR_orig_tf_idf | 65.12% | 64.98%
Baseline SVM_orig_w2v_google | 59.81% | 64.26%
Baseline RF_orig_w2v_google | 51.82% | 71.11%
Baseline LR_orig_w2v_google | 67.70% | 65.47%
Baseline SVM_orig_w2v_bio | 65.30% | 64.33%
Baseline RF_orig_w2v_bio | 56.10% | 76.06%
Baseline LR_orig_w2v_bio | 69.50% | 65.80%
Baseline SVM_simp_tf_idf | 57.31% | 63.56%
Baseline RF_simp_tf_idf | 54.40% | 74.34%
Baseline LR_simp_tf_idf | 61.43% | 63.19%
Baseline SVM_simp_w2v_google | 60.24% | 65.54%
Baseline RF_simp_w2v_google | 54.43% | 75.64%
Baseline LR_simp_w2v_google | 62.35% | 64.87%
Baseline SVM_simp_w2v_bio | 63.97% | 64.66%
Baseline RF_simp_w2v_bio | 56.26% | 78.14%
Baseline LR_simp_w2v_bio | 67.87% | 67.45%
Run subtask1_ims_original | 62.33% | 48.37%
Run subtask1_ims_simplified | 51.45% | 45.11%

Table 7: AUC and Accuracy values for baselines and runs when credibility assessment is treated as a binary classification problem.

As can be observed from the table, having considered only linguistic features, the classification results obtained were not particularly strong for this problem, neither for the runs nor for the baselines. The idea discussed by the members of the UniPd group, that documents having a higher score across different search engines are also more credible, did not appear to be supported by these results, but it nevertheless merits future investigation, in particular in relation to the results reported in the next section.
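As an illustration of how AUC, Accuracy, and the Youden-index threshold tuning mentioned in Section 3.3 fit together, a short scikit-learn sketch with made-up scores follows; it is not the evaluation code used to produce Table 7.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve

# hypothetical ground truth (1 = credible) and predicted credibility scores
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
y_score = np.array([0.9, 0.4, 0.35, 0.6, 0.2, 0.55, 0.5, 0.1])

auc = roc_auc_score(y_true, y_score)

# Youden's index J = TPR - FPR; pick the threshold that maximizes it
fpr, tpr, thresholds = roc_curve(y_true, y_score)
best_threshold = thresholds[np.argmax(tpr - fpr)]

accuracy = accuracy_score(y_true, (y_score >= best_threshold).astype(int))
print(f"AUC = {auc:.2%}, Accuracy at Youden-optimal threshold = {accuracy:.2%}")
```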
Regarding the baselines, we observed that the supervised classifiers employing the word embedding text representation obtained by Word2vec pre-trained on biomedical datasets, such as PubMed and MIMIC-III, yielded the best results, compared to their counterparts employing text representations based on TF×IDF or on Word2vec pre-trained on the general-purpose Google News dataset. However, it must be considered that the Accuracy values in particular could be influenced by the fact that the dataset was strongly unbalanced, being made up of about 20% of documents labeled as credible and 80% labeled as non-credible. This also suggests the need for a deeper analysis of the way in which human assessors evaluate the documents proposed to them.

5.4.2. Topic-based Credibility Assessment

Regarding the problem of assessing the credibility of the documents retrieved with respect to the topics considered, we provide below the results for the runs submitted by UniPd, namely subtask2_ims_original and subtask2_ims_simplified. In this case, for each topic considered, the precision in retrieving credible documents with respect to the topic was evaluated based on the value of CP(q), calculated as shown in Section 3.6.3. In Table 8, we report the average CP(q) values over the set of topics considered. Specifically, the top-100 and top-200 documents retrieved for each topic were taken into account in computing the CP(q) metric.

Model | No. of top-k documents | Average CP(q)
Run subtask2_ims_original | k = 100 | 0.6767
Run subtask2_ims_simplified | k = 100 | 0.5221
Run subtask2_ims_original | k = 200 | 0.5205
Run subtask2_ims_simplified | k = 200 | 0.3433

Table 8: Average CP(q) values when credibility assessment is performed by considering documents retrieved with respect to specific topics.

As can be observed from the table, and in contrast with the binary credibility assessment discussed above (Section 5.4.1), the methods applied in the two submitted runs, which consider credibility in relation to the documents retrieved for specific topics, actually managed to place credible information in the top-k positions, in particular as k decreases. In general, the most effective method appears to be the one implemented in the run subtask2_ims_original; however, from specific observations of the results submitted by the participants (which are omitted in this paper), we can state that the second method, the one implemented in subtask2_ims_simplified, proved particularly effective for some topics (even more so than the method implemented in subtask2_ims_original) and much less so for others, leading to a lower average CP(q) value. This aspect is therefore worth investigating further in the future.

6. Conclusions

This paper has described the methods, results, and analyses of the Consumer Health Search (CHS) challenge at the CLEF eHealth 2021 Evaluation Lab. The task considered the problem of laypeople searching the web, across web pages and social media content, for medical information related to their health condition. The task included three subtasks on ad-hoc IR, weakly supervised IR, and document credibility prediction, to which 15 runs were submitted.
The CLEF eHealth CHS challenge first ran at the inaugural CLEF eHealth lab in 2013 and has since offered TREC-style IR challenges with medical datasets consisting of large medical web and layperson query collections. As a by-product of this evaluation exercise, the task contributes to the research community a collection with associated assessments and an evaluation framework that can be used to evaluate the effectiveness of retrieval methods for health information seeking on the web. Queries, assessments, and participants' runs for the 2021 CHS challenge are publicly available at https://github.com/CLEFeHealth/CHS-2021, and previous years' CHS collections are available at https://github.com/CLEFeHealth/.

Acknowledgments

The CHS task of the CLEF eHealth 2021 evaluation lab has been supported in part by the CLEF Initiative. It has also been supported in part by the Our Health in Our Hands (OHIOH) initiative of the Australian National University (ANU), as well as the ANU School of Computing, the ANU Research School of Population Health, and Data61/Commonwealth Scientific and Industrial Research Organisation. OHIOH is a strategic initiative of the ANU which aims to transform health care by developing new personalised health technologies and solutions in collaboration with patients, clinicians, and health care providers. Moreover, the task has been supported in part by the bi-lateral Kodicare (Knowledge Delta based improvement and continuous evaluation of retrieval engines) project funded by the French ANR (ANR-19-CE23-0029) and the Austrian FWF. Finally, the task has been supported in part by the EU Horizon 2020 Research and Innovation Programme under the Marie Skłodowska-Curie Grant Agreement No 860721 – DoSSIER: "Domain Specific Systems for Information Extraction and Retrieval". We are also thankful to the people involved in the query creation and relevance assessment exercises. Last but not least, we gratefully acknowledge the participating teams' hard work. We thank them for their submissions and interest in the task.

Author Contribution Statement: With equal contribution, Task 2 was led by LG, GP, and HS, and organized by EB, NB-S, GG-S, LK, PM, GP, SS, HS, RU, MV, and CX.

References

[1] S. H. Soroya, A. Farooq, K. Mahmood, J. Isoaho, S. e Zara, From information seeking to information avoidance: Understanding the health information behavior during a global health crisis, Information Processing and Management 58 (2021) 102440.
[2] L. Goeuriot, G. J. Jones, L. Kelly, J. Leveling, A. Hanbury, H. Müller, S. Salantera, H. Suominen, G. Zuccon, ShARe/CLEF eHealth Evaluation Lab 2013, Task 3: Information retrieval to address patients' questions when reading clinical reports, CLEF 2013 Online Working Notes 8138 (2013).
[3] H. Suominen, L. Kelly, L. Goeuriot, Scholarly influence of the Conference and Labs of the Evaluation Forum eHealth Initiative: Review and bibliometric study of the 2012 to 2017 outcomes, JMIR Research Protocols 7 (2018) e10961.
[4] J. Palotti, G. Zuccon, A. Hanbury, Consumer health search on the web: Study of web page understandability and its integration in ranking algorithms, J Med Internet Res 21 (2019) e10986.
[5] H. Suominen, L. Kelly, L. Goeuriot, The scholarly impact and strategic intent of CLEF eHealth Labs from 2012 to 2017, in: N. Ferro, C. Peters (Eds.), Information Retrieval Evaluation in a Changing World: Lessons Learned from 20 Years of CLEF, Springer International Publishing, Cham, 2019, pp. 333–363.
[6] L. Goeuriot, Z. Liu, G. Pasi, G. G. Saez, M. Viviani, C. Xu, Overview of the CLEF eHealth 2020 task 2: Consumer health search with ad hoc and spoken queries, in: Working Notes of Conference and Labs of the Evaluation (CLEF) Forum, CEUR Workshop Proceedings, 2020.
[7] H. Suominen (Ed.), The Proceedings of the CLEFeHealth2012 — the CLEF 2012 Workshop on Cross-Language Evaluation of Methods, Applications, and Resources for eHealth Document Analysis, NICTA, 2012.
[8] H. Suominen, S. Salanterä, S. Velupillai, W. W. Chapman, G. Savova, N. Elhadad, S. Pradhan, B. R. South, D. L. Mowery, G. J. Jones, J. Leveling, L. Kelly, L. Goeuriot, D. Martinez, G. Zuccon, Overview of the ShARe/CLEF eHealth Evaluation Lab 2013, in: Information Access Evaluation. Multilinguality, Multimodality, and Visualization, Springer Berlin Heidelberg, 2013, pp. 212–231.
[9] L. Kelly, L. Goeuriot, H. Suominen, T. Schreck, G. Leroy, D. L. Mowery, S. Velupillai, W. Chapman, D. Martinez, G. Zuccon, J. Palotti, Overview of the ShARe/CLEF eHealth Evaluation Lab 2014, in: Information Access Evaluation. Multilinguality, Multimodality, and Visualization, Springer Berlin Heidelberg, 2014, pp. 172–191.
[10] L. Goeuriot, L. Kelly, H. Suominen, L. Hanlen, A. Névéol, C. Grouin, J. Palotti, G. Zuccon, Overview of the CLEF eHealth Evaluation Lab 2015, in: Information Access Evaluation. Multilinguality, Multimodality, and Visualization, Springer Berlin Heidelberg, 2015.
[11] L. Kelly, L. Goeuriot, H. Suominen, A. Névéol, J. Palotti, G. Zuccon, Overview of the CLEF eHealth Evaluation Lab 2016, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer Berlin Heidelberg, 2016, pp. 255–266.
[12] L. Goeuriot, L. Kelly, H. Suominen, A. Névéol, A. Robert, E. Kanoulas, R. Spijker, J. Palotti, G. Zuccon, CLEF 2017 eHealth Evaluation Lab overview, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer Berlin Heidelberg, 2017, pp. 291–303.
[13] H. Suominen, L. Kelly, L. Goeuriot, A. Névéol, L. Ramadier, A. Robert, E. Kanoulas, R. Spijker, L. Azzopardi, D. Li, Jimmy, J. Palotti, G. Zuccon, Overview of the CLEF eHealth Evaluation Lab 2018, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer Berlin Heidelberg, 2018, pp. 286–301.
[14] L. Kelly, H. Suominen, L. Goeuriot, M. Neves, E. Kanoulas, D. Li, L. Azzopardi, R. Spijker, G. Zuccon, H. Scells, J. Palotti, Overview of the CLEF eHealth Evaluation Lab 2019, in: F. Crestani, M. Braschler, J. Savoy, A. Rauber, H. Müller, D. E. Losada, G. Heinatz Bürki, L. Cappellato, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction, Springer International Publishing, Cham, 2019, pp. 322–339.
[15] L. Goeuriot, H. Suominen, L. Kelly, A. Miranda-Escalada, M. Krallinger, Z. Liu, G. Pasi, G. Gonzalez Saez, M. Viviani, C. Xu, Overview of the CLEF eHealth evaluation lab 2020, in: A. Arampatzis, E. Kanoulas, T. Tsikrika, S. Vrochidis, H. Joho, C. Lioma, C. Eickhoff, A. Névéol, L. Cappellato, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction, Springer International Publishing, Cham, 2020, pp. 255–271.
[16] L. Goeuriot, H. Suominen, L. Kelly, L. A. Alemany, N. Brew-Sam, V. Cotik, D. Filippo, G. G. Saez, F. Luque, P. Mulhem, G. Pasi, R. Roller, S. Seneviratne, J. Vivaldi, M. Viviani, C. Xu, CLEF eHealth 2021 Evaluation Lab, in: Advances in Information Retrieval — 43rd European Conference on IR Research, Springer, Heidelberg, Germany, 2021.
[17] H. Suominen, L. Goeuriot, L. Kelly, L. Alonso Alemany, E. Bassani, N. Brew-Sam, V. Cotik, D. Filippo, G. Gonzalez-Saez, F. Luque, P. Mulhem, G. Pasi, R. Roller, S. Seneviratne, R. Upadhyay, J. Vivaldi, M. Viviani, C. Xu, Overview of the CLEF eHealth evaluation lab 2021, in: CLEF 2021 — 11th Conference and Labs of the Evaluation Forum, Lecture Notes in Computer Science (LNCS), Springer, Heidelberg, Germany, 2021.
[18] L. Goeuriot, L. Kelly, W. Lee, J. Palotti, P. Pecina, G. Zuccon, A. Hanbury, H. Müller, G. J. F. Jones, ShARe/CLEF eHealth Evaluation Lab 2014, Task 3: User-centred health information retrieval, in: CLEF 2014 Evaluation Labs and Workshop: Online Working Notes, Sheffield, UK, 2014.
[19] J. Palotti, G. Zuccon, L. Goeuriot, L. Kelly, A. Hanbury, G. J. Jones, M. Lupu, P. Pecina, CLEF eHealth Evaluation Lab 2015, Task 2: Retrieving information about medical symptoms, in: CLEF 2015 Online Working Notes, CEUR-WS, 2015.
[20] G. Zuccon, J. Palotti, L. Goeuriot, L. Kelly, M. Lupu, P. Pecina, H. Mueller, J. Budaher, A. Deacon, The IR task at the CLEF eHealth Evaluation Lab 2016: User-centred health information retrieval, in: CLEF 2016 Evaluation Labs and Workshop: Online Working Notes, CEUR-WS, 2016.
[21] J. Palotti, G. Zuccon, Jimmy, P. Pecina, M. Lupu, L. Goeuriot, L. Kelly, A. Hanbury, CLEF 2017 task overview: The IR task at the eHealth Evaluation Lab, in: Working Notes of Conference and Labs of the Evaluation (CLEF) Forum, CEUR Workshop Proceedings, 2017.
[22] Jimmy, G. Zuccon, J. Palotti, Overview of the CLEF 2018 consumer health search task, in: Working Notes of Conference and Labs of the Evaluation (CLEF) Forum, CEUR Workshop Proceedings, 2018.
[23] L. Goeuriot, G. J. Jones, L. Kelly, J. Leveling, M. Lupu, J. Palotti, G. Zuccon, An analysis of evaluation campaigns in ad-hoc medical information retrieval: CLEF eHealth 2013 and 2014, Springer Information Retrieval Journal (2018).
[24] L. Goeuriot, H. Suominen, L. Kelly, Z. Liu, G. Pasi, G. S. Gonzales, M. Viviani, C. Xu, Overview of the CLEF eHealth 2020 task 2: Consumer health search with ad hoc and spoken queries, in: Working Notes of Conference and Labs of the Evaluation (CLEF) Forum, CEUR Workshop Proceedings, 2020.
[25] G. Amati, Probabilistic Models for Information Retrieval Based on Divergence from Randomness, Ph.D. thesis, Glasgow University, Glasgow, UK, 2003.
[26] I. Ounis, C. Lioma, C. Macdonald, V. Plachouras, Research directions in Terrier: A search engine for advanced retrieval on the web, CEPIS Upgrade Journal 8 (2007).
[27] M. Viviani, G. Pasi, Credibility in social media: Opinions, news, and health information — a survey, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 7 (2017) e1209.
[28] L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Mueller, O. Grisel, V. Niculae, P. Prettenhofer, A. Gramfort, J. Grobler, R. Layton, J. VanderPlas, A. Joly, B. Holt, G. Varoquaux, API design for machine learning software: Experiences from the scikit-learn project, in: ECML PKDD Workshop: Languages for Data Mining and Machine Learning, 2013, pp. 108–122.
[29] W. J. Youden, Index for rating diagnostic tests, Cancer 3 (1950) 32–35.
[30] A. Moffat, J. Zobel, Rank-biased precision for measurement of retrieval effectiveness, ACM Transactions on Information Systems 27 (2008) 2:1–2:27.
[31] L. A. Park, Y. Zhang, On the distribution of user persistence for rank-biased precision, in: Proceedings of the 12th Australasian Document Computing Symposium, 2007, pp. 17–24.
[32] A. Lipani, J. Palotti, M. Lupu, F. Piroi, G. Zuccon, A. Hanbury, Fixed-cost pooling strategies based on IR evaluation measures, in: European Conference on Information Retrieval, Springer, 2017, pp. 357–368.
[33] B. Koopman, G. Zuccon, Relevation!: An open source system for information retrieval relevance assessment, in: Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, ACM, 2014, pp. 1243–1244.
[34] H. Suominen, S. Pyysalo, M. Hiissa, F. Ginter, S. Liu, D. Marghescu, T. Pahikkala, B. Back, H. Karsten, T. Salakoski, Performance evaluation measures for text mining, in: M. Song, Y. Wu (Eds.), Handbook of Research on Text and Web Mining Technologies, IGI Global, Hershey, Pennsylvania, USA, 2008, pp. 724–747.
[35] G. Zuccon, Understandability biased evaluation for information retrieval, in: Advances in Information Retrieval, 2016, pp. 280–292.
[36] H. Yang, X. Liu, B. Zheng, G. Yang, Learning to rank for Consumer Health Search, in: CLEF 2021 Evaluation Labs and Workshop: Online Working Notes, CEUR-WS, September 2021.
[37] G. Di Nunzio, F. Vezzani, IMS-UNIPD @ CLEF eHealth Task 2: Reciprocal Ranking Fusion in CHS, in: CLEF 2021 Evaluation Labs and Workshop: Online Working Notes, CEUR-WS, September 2021.