<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Study on Reciprocal Ranking Fusion in Consumer Health Search. IMS UniPD at CLEF eHealth 2020 Task 2</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giorgio Maria Di Nunzio</string-name>
          <email>giorgiomaria.dinunzio@unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Marchesin</string-name>
          <email>stefano.marchesin@unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Federica Vezzani</string-name>
          <email>federica.vezzani@unipd.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Information Engineering</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dept. of Linguistic and Literary Studies</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Dept. of Mathematics</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Padua</institution>
        </aff>
      </contrib-group>
      <abstract>
<p>In this paper, we describe the results of the participation of the Information Management Systems (IMS) group at CLEF eHealth 2020 Task 2, the Consumer Health Search Task. In particular, we participated in both subtasks: Ad-hoc IR and Spoken queries retrieval. The goal of our work was to evaluate the reciprocal ranking fusion approach over 1) different query variants; 2) different retrieval functions; 3) with/without pseudo-relevance feedback. The results show that, on average, the best performances are obtained by a ranking fusion approach together with pseudo-relevance feedback.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        CLEF eHealth is an evaluation challenge whose goal is to provide researchers
with datasets, evaluation frameworks, and events to evaluate the performance of
IR systems in the medical IR domain. In the CLEF eHealth 2020 edition [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], the
organizers set up two tasks to evaluate retrieval systems on different domains. In
this paper, we report the results of our participation in Task 2, "Consumer
Health Search" [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. This task investigates the problem of retrieving documents
to support the needs of health consumers who are confronted with a health issue.
In particular, we participated in both available subtasks: the Ad-hoc IR task
and the Spoken queries retrieval task.
      </p>
      <p>
        The contribution of our experiments to both subtasks can be summarized as
follows:
– A study of a manual query variation approach similar to [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ];
– An evaluation of a ranking fusion approach [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] on different document retrieval
strategies, with or without pseudo-relevance feedback [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Manual query variants for the first two queries (151001, 152001) of the Ad-hoc IR subtask.</p>
        </caption>
        <table>
          <thead>
            <tr><th>id</th><th>type</th><th>text</th></tr>
          </thead>
          <tbody>
            <tr><td>151001</td><td>original</td><td>anemia diet therapy</td></tr>
            <tr><td>151001</td><td>variant 1</td><td>anaemia diet cure</td></tr>
            <tr><td>151001</td><td>variant 2</td><td>diet treatment for the decrease in the total amount of red blood cells (RBCs) or hemoglobin in the blood</td></tr>
            <tr><td>152001</td><td>original</td><td>emotional and mental disorders</td></tr>
            <tr><td>152001</td><td>variant 1</td><td>psychiatric disorder</td></tr>
            <tr><td>152001</td><td>variant 2</td><td>psychological disorder</td></tr>
            <tr><td>152001</td><td>variant 3</td><td>mental illness</td></tr>
            <tr><td>152001</td><td>variant 4</td><td>mental disease</td></tr>
            <tr><td>152001</td><td>variant 5</td><td>mental disorder</td></tr>
            <tr><td>152001</td><td>variant 6</td><td>nervous breakdown</td></tr>
            <tr><td>152001</td><td>variant 7</td><td>emotional disturbance such as: anxiety, bipolar, conduct, eating, obsessive-compulsive (OCD) and psychotic disorders</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>The remainder of the paper introduces the methodology and a brief
summary of the experimental settings that we used to create the official
runs that we submitted for this task.</p>
    </sec>
    <sec id="sec-2">
      <title>Methodology</title>
      <p>In this section, we describe the methodology for merging the ranking lists provided
by different retrieval methods for different query variants.</p>
      <sec id="sec-2-1">
        <title>Subtask 1: Ad-hoc IR</title>
        <p>
          Query variants: In this subtask, we asked an expert in the field of medical
terminology to rewrite the original English query into as many variants as she
preferred. The aim of the query rewriting was to describe in the best possible way
(given the knowledge of the user) the information need expressed by the query.
In Table 1, we show the variants for the first two queries (151001, 152001). These
examples show how the number of variants, as well as the complexity of the
request (from a few keywords to complex sentences), may change across queries.
        </p>
        <p>
          Retrieval models: For each query, we run three different retrieval models: the
Okapi BM25 model [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], the divergence from randomness model [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], and the language
model using Dirichlet priors [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. We used the RM3 Positional Relevance model
to implement a pseudo-relevance feedback strategy including query expansion [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
        <p>
          Ranking fusion: Given the different ranking lists, we used the reciprocal rank
fusion (RRF) approach to merge them [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
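        <p>As an illustration, the following minimal Python sketch shows the RRF scoring scheme, assuming the smoothing constant k = 60 proposed in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]; the document identifiers are illustrative and this is not our exact implementation.</p>
        <preformat>
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of document ids with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over all lists that return
    it; k = 60 follows Cormack et al. [2] (an assumption here).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse BM25, QLM, and DFR rankings for one query.
fused = reciprocal_rank_fusion([
    ["d1", "d2", "d3"],  # BM25
    ["d2", "d1", "d4"],  # QLM (Dirichlet)
    ["d2", "d3", "d1"],  # DFR
])  # d2 comes first: it is ranked high by all three models
        </preformat>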
      </sec>
      <sec id="sec-2-2">
        <title>Subtask 2: Spoken Queries Retrieval</title>
        <table-wrap id="tab2">
          <label>Table 2</label>
          <caption>
            <p>Three examples of participant variants (out of six) for the first two queries of the Spoken queries retrieval subtask.</p>
          </caption>
          <table>
            <thead>
              <tr><th>id</th><th>type</th><th>text</th></tr>
            </thead>
            <tbody>
              <tr><td>151001</td><td>participant 1</td><td>anemia diet changes</td></tr>
              <tr><td>151001</td><td>participant 2</td><td>Diet for anemia</td></tr>
              <tr><td>151001</td><td>participant 3</td><td>What food can i eat on this diet</td></tr>
              <tr><td>152001</td><td>participant 1</td><td>causes of withdrawal</td></tr>
              <tr><td>152001</td><td>participant 2</td><td>What diseases may cause mental health?</td></tr>
              <tr><td>152001</td><td>participant 3</td><td>what mental health conditions can cause mood alterations cause somebody to become more withdrawn</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>
          Query variants: In this subtask, a number of query variants, recorded as audio by six users,
are already available. For this task, we used the different transcriptions of these
audio files: clean transcript, default variant, phone enhanced variant, and video
enhanced variant. In Table 2, we show three examples of variants (out of six) for
the first two queries.
        </p>
        <p>Retrieval models: For this subtask, we used only the Okapi BM25 retrieval
model and the RM3 pseudo-relevance feedback model.</p>
        <p>Ranking fusion: Given the different participants and the different transcripts,
we used the RRF approach to merge their rankings.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <p>In this section, we describe the experimental settings and the results for each
subtask.</p>
      <sec id="sec-3-1">
        <title>Search Engine</title>
        <p>For all the experiments, we used the Elasticsearch search engine
(https://www.elastic.co/products/elasticsearch) and the indexes
provided by the organizers of the task. We used the following parameter settings
for each retrieval model:
– BM25: k1 = 1.2, b = 0.75
– LMDirichlet: μ = 2000
– DFR: basic model = if, after effect = b, normalization = h2</p>
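        <p>For illustration, the following Python sketch shows how these three similarities can be declared when creating an Elasticsearch index via the Python client; the index name, the field names, and the h2 c value are assumptions, not the configuration of the indexes provided by the organizers.</p>
        <preformat>
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local instance

# Declare one similarity per retrieval model and attach each one to a
# text field, so the same content can be scored by each model.
es.indices.create(
    index="clef2020",  # placeholder name
    body={
        "settings": {
            "index": {
                "similarity": {
                    "sim_bm25": {"type": "BM25", "k1": 1.2, "b": 0.75},
                    "sim_lm": {"type": "LMDirichlet", "mu": 2000},
                    "sim_dfr": {
                        "type": "DFR",
                        "basic_model": "if",
                        "after_effect": "b",
                        "normalization": "h2",
                        "normalization.h2.c": "1.0",  # c value assumed
                    },
                }
            }
        },
        "mappings": {
            "properties": {
                "content": {"type": "text", "similarity": "sim_bm25"},
                "content_lm": {"type": "text", "similarity": "sim_lm"},
                "content_dfr": {"type": "text", "similarity": "sim_dfr"},
            }
        },
    },
)
        </preformat>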
        <p>The RM3 pseudo-relevance feedback model was implemented with the
following strategy: pick the 10 most relevant terms from the top 10 ranked documents,
add these terms to the original query with a weight equal to 0.5 (while the original
terms are weighted 1.0), run the expanded query, and produce the final ranking
list.</p>
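        <p>A minimal sketch of this expansion step follows; it approximates term selection with a plain frequency count over the top-ranked documents, whereas our runs used the positional relevance model [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and the "content" field name is a placeholder.</p>
        <preformat>
from collections import Counter

def expand_query(es, index, query_text, n_docs=10, n_terms=10, weight=0.5):
    """Pick the 10 most frequent terms from the top 10 documents and add
    them to the query with weight 0.5; original terms keep weight 1.0."""
    hits = es.search(index=index, body={
        "query": {"match": {"content": query_text}},
        "size": n_docs,
    })["hits"]["hits"]
    counts = Counter()
    for hit in hits:
        counts.update(hit["_source"]["content"].lower().split())
    expansion = " ".join(term for term, _ in counts.most_common(n_terms))
    # Weighted combination of original and expansion terms via boosts.
    return {"bool": {"should": [
        {"match": {"content": {"query": query_text, "boost": 1.0}}},
        {"match": {"content": {"query": expansion, "boost": weight}}},
    ]}}
        </preformat>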
      </sec>
      <sec id="sec-3-2">
        <title>Runs</title>
        <p>For each subtask, we submitted four runs.</p>
        <p>Subtask 1. For the Ad-hoc retrieval subtask, the runs are:
– clef bm25 orig: only BM25 (no rank fusion) using the original query only;
– clef original rrf: reciprocal rank fusion with the BM25, QLM, and DFR models and
the original query;
– clef original rm3 rrf: reciprocal rank fusion with the BM25, QLM, and DFR
models using RM3 pseudo-relevance feedback and the original query;
– clef variant rrf: BM25 and reciprocal rank fusion on the rankings produced
by the original and manual variants of the query.</p>
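        <p>As an illustration, the clef original rm3 rrf run could be assembled as in the following hypothetical sketch, which reuses the expansion and fusion functions sketched earlier; keeping one index per retrieval model is an assumption, and the index names are placeholders.</p>
        <preformat>
def run_original_rm3_rrf(es, query_text,
                         indexes=("clef_bm25", "clef_qlm", "clef_dfr")):
    """Expand the original query (expand_query, Section 3.1), rank with
    each retrieval model, and merge with reciprocal_rank_fusion
    (Section 2.1)."""
    rankings = []
    for index in indexes:
        expanded = expand_query(es, index, query_text)
        hits = es.search(index=index,
                         body={"query": expanded, "size": 1000})
        rankings.append([h["_id"] for h in hits["hits"]["hits"]])
    return reciprocal_rank_fusion(rankings)
        </preformat>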
        <p>Subtask 2. For the spoken queries retrieval subtask, the runs are:
– bm25 rrf: reciprocal rank fusion with BM25 on the six variants of the query;
– bm25 rrf rm3: reciprocal rank fusion with BM25 on the six variants of the
query using pseudo-relevance feedback with 10 documents and 10 terms
(query weight 0.5);
– bm25 all rrf: reciprocal rank fusion with BM25 on all transcripts of the six
variants of the query (a total of 18 variants per query);
– bm25 all rrf rm3: reciprocal rank fusion of BM25 with all transcripts using
RM3 pseudo-relevance feedback.</p>
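        <p>Similarly, a hypothetical sketch of the bm25 all rrf run is shown below: one BM25 ranking per transcript of each spoken variant of a query, all merged with the reciprocal_rank_fusion function sketched in Section 2.1; index and field names are, again, placeholders.</p>
        <preformat>
def run_bm25_all_rrf(es, index, transcripts):
    """transcripts: the list of transcript strings available for one
    query (up to 18 per query, as described above)."""
    rankings = []
    for text in transcripts:
        hits = es.search(index=index, body={
            "query": {"match": {"content": text}},
            "size": 1000,
        })["hits"]["hits"]
        rankings.append([h["_id"] for h in hits])
    return reciprocal_rank_fusion(rankings)
        </preformat>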
      </sec>
      <sec id="sec-3-3">
        <title>Results</title>
        <p>The organizers of this task provided the results (averaged across topics) achieved
by several baselines, compared with the runs of each participant. In Table 3, we show
a summary of these results.</p>
        <p>A preliminary analysis of the results shows that, in terms of standard
evaluation measures such as MAP, Rprec, and bpref, the use of the RM3 relevance
feedback model improves the effectiveness of the search engine (see Table 3).</p>
        <p>For subtask 1, the use of reciprocal rank fusion together with RM3 produced
satisfactory results, in most cases better than any baseline on many performance
measures. The run with manual query variants without relevance feedback did
not show any significant improvements.</p>
        <p>For subtask 2, the use of pseudo-relevance feedback achieved better results.
It is interesting to see that, despite the noise in the formulation of the query
by different participants, Precision@5 (P@5) was, in general, better than most of
the baselines.</p>
        <p>In terms of understandability (rRBP) and credibility (cRBP) of the retrieved
results [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], we report in Table 4 the values of these two measures by cut-off
(0.50, 0.50, 0.95), ordered by MAP (same ordering as Table 3). From this set
of results, one interesting thing emerges: the understandability of the results retrieved
with the manual query variants of the Ad-hoc subtask seems to improve compared
to the runs that use the original query. Investigating this finding will be part of
our future work.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgements</title>
      <p>This work was partially supported by the ExaMode Project, as a part of the
European Union Horizon 2020 Program under Grant 825292.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Gianni</given-names>
            <surname>Amati</surname>
          </string-name>
          and
          <string-name>
            <given-names>Cornelis Joost</given-names>
            <surname>Van Rijsbergen</surname>
          </string-name>
          .
          <article-title>Probabilistic models of information retrieval based on measuring the divergence from randomness</article-title>
          .
          <source>ACM Trans. Inf. Syst.</source>
          ,
          <volume>20</volume>
          (
          <issue>4</issue>
          ):
          <fpage>357</fpage>
          -
          <lpage>389</lpage>
          ,
          <year>October 2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Gordon V.</given-names>
            <surname>Cormack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Charles L. A.</given-names>
            <surname>Clarke</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Stefan</given-names>
            <surname>Buettcher</surname>
          </string-name>
          .
          <article-title>Reciprocal rank fusion outperforms condorcet and individual rank learning methods</article-title>
          .
          <source>In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '09</source>
          , pages
          <fpage>758</fpage>
          -
          <lpage>759</lpage>
          , New York, NY, USA,
          <year>2009</year>
          . Association for Computing Machinery.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>D. Frank</given-names>
            <surname>Hsu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Isak</given-names>
            <surname>Taksa</surname>
          </string-name>
          .
          <article-title>Comparing rank and score combination methods for data fusion in information retrieval</article-title>
          .
          <source>Information Retrieval</source>
          ,
          <volume>8</volume>
          (
          <issue>3</issue>
          ):
          <fpage>449</fpage>
          -
          <lpage>480</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Lorraine</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          , Hanna Suominen, Liadh Kelly, Zhengyang Liu, Gabriella Pasi, Gabriela Saez Gonzales, Marco Viviani, and
          <string-name>
            <given-names>Chenchen</given-names>
            <surname>Xu</surname>
          </string-name>
          .
          <article-title>Overview of the CLEF eHealth 2020 task 2: Consumer health search with ad hoc and spoken queries</article-title>
          .
          <source>In Working Notes of Conference and Labs of the Evaluation (CLEF) Forum, CEUR Workshop Proceedings</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Lorraine</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          , Hanna Suominen, Liadh Kelly, Antonio Miranda-Escalada, Martin Krallinger, Zhengyang Liu, Gabriella Pasi, Gabriela Saez Gonzales, Marco Viviani, and
          <string-name>
            <given-names>Chenchen</given-names>
            <surname>Xu</surname>
          </string-name>
          .
          <article-title>Overview of the CLEF eHealth evaluation lab 2020</article-title>
          . In Avi Arampatzis, Evangelos Kanoulas, Theodora Tsikrika, Stefanos Vrochidis, Hideo Joho, Christina Lioma, Carsten Eickhoff, Aurelie Neveol, Linda Cappellato, and Nicola Ferro, editors,
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction: Proceedings of the Eleventh International Conference of the CLEF Association (CLEF 2020)</source>
          , LNCS volume
          <volume>12260</volume>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Yuanhua</given-names>
            <surname>Lv</surname>
          </string-name>
          and
          <string-name>
            <given-names>ChengXiang</given-names>
            <surname>Zhai</surname>
          </string-name>
          .
          <article-title>Positional relevance model for pseudo-relevance feedback</article-title>
          .
          <source>In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '10</source>
          , pages
          <fpage>579</fpage>
          -
          <lpage>586</lpage>
          , New York, NY, USA,
          <year>2010</year>
          . Association for Computing Machinery.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Giorgio Maria</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          , Federica Beghini, Federica Vezzani, and
          <string-name>
            <given-names>Genevieve</given-names>
            <surname>Henrot</surname>
          </string-name>
          .
          <article-title>An interactive two-dimensional approach to query aspects rewriting in systematic reviews. IMS unipd at CLEF ehealth task 2</article-title>
          .
          <source>In Working Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum</source>
          , Dublin, Ireland, September 11-14,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Giorgio Maria</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          , Giacomo Ciuffreda, and Federica Vezzani.
          <article-title>Interactive sampling for systematic reviews. IMS unipd at CLEF 2018 ehealth task 2</article-title>
          .
          <source>In Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum</source>
          , Avignon, France, September 10-14,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Stephen E.</given-names>
            <surname>Robertson</surname>
          </string-name>
          and
          <string-name>
            <given-names>Hugo</given-names>
            <surname>Zaragoza</surname>
          </string-name>
          .
          <article-title>The probabilistic relevance framework: BM25 and beyond</article-title>
          .
          <source>Foundations and Trends in Information Retrieval</source>
          ,
          <volume>3</volume>
          (
          <issue>4</issue>
          ):
          <fpage>333</fpage>
          -
          <lpage>389</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Ian</given-names>
            <surname>Ruthven</surname>
          </string-name>
          and
          <string-name>
            <given-names>Mounia</given-names>
            <surname>Lalmas</surname>
          </string-name>
          .
          <article-title>A survey on the use of relevance feedback for information access systems</article-title>
          .
          <source>Knowl. Eng. Rev.</source>
          ,
          <volume>18</volume>
          (
          <issue>2</issue>
          ):
          <fpage>95</fpage>
          -
          <lpage>145</lpage>
          ,
          <year>June 2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Chengxiang</given-names>
            <surname>Zhai</surname>
          </string-name>
          and
          <string-name>
            <given-names>John</given-names>
            <surname>Lafferty</surname>
          </string-name>
          .
          <article-title>A study of smoothing methods for language models applied to ad hoc information retrieval</article-title>
          .
          <source>In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '01</source>
          , pages
          <fpage>334</fpage>
          -
          <lpage>342</lpage>
          , New York, NY, USA,
          <year>2001</year>
          . Association for Computing Machinery.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>Guido</given-names>
            <surname>Zuccon</surname>
          </string-name>
          .
          <article-title>Understandability biased evaluation for information retrieval</article-title>
          . In Nicola Ferro, Fabio Crestani, Marie-Francine Moens, Josiane Mothe, Fabrizio Silvestri, Giorgio Maria Di Nunzio, Claudia Hauff, and Gianmaria Silvello, editors,
          <source>Advances in Information Retrieval</source>
          , pages
          <fpage>280</fpage>
          -
          <lpage>292</lpage>
          , Cham,
          <year>2016</year>
          . Springer International Publishing.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>