=Paper=
{{Paper
|id=Vol-2696/paper_128
|storemode=property
|title=A Study on Reciprocal Ranking Fusion in Consumer Health Search. IMS UniPD at CLEF eHealth 2020 Task 2
|pdfUrl=https://ceur-ws.org/Vol-2696/paper_128.pdf
|volume=Vol-2696
|authors=Giorgio Maria Di Nunzio,Stefano Marchesin,Federica Vezzani
|dblpUrl=https://dblp.org/rec/conf/clef/Nunzio0V20
}}
==A Study on Reciprocal Ranking Fusion in Consumer Health Search. IMS UniPD at CLEF eHealth 2020 Task 2==
A Study on Reciprocal Ranking Fusion in Consumer Health Search. IMS UniPD at CLEF eHealth 2020 Task 2

Giorgio Maria Di Nunzio (1,2), Stefano Marchesin (1), and Federica Vezzani (3)

(1) Dept. of Information Engineering – University of Padua, [giorgiomaria.dinunzio,stefano.marchesin]@unipd.it
(2) Dept. of Mathematics – University of Padua
(3) Dept. of Linguistic and Literary Studies – University of Padua, federica.vezzani@unipd.it

Abstract. In this paper, we describe the results of the participation of the Information Management Systems (IMS) group in CLEF eHealth 2020 Task 2, the Consumer Health Search task. In particular, we participated in both subtasks: Ad-hoc IR and Spoken queries retrieval. The goal of our work was to evaluate the reciprocal ranking fusion approach over 1) different query variants; 2) different retrieval functions; 3) with and without pseudo-relevance feedback. The results show that, on average, the best performances are obtained by a ranking fusion approach together with pseudo-relevance feedback.

Copyright (c) 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

1 Introduction

CLEF eHealth is an evaluation challenge whose goal is to provide researchers with datasets, evaluation frameworks, and events to evaluate the performance of IR systems in the medical domain. In the CLEF eHealth 2020 edition [5], the organizers set up two tasks to evaluate retrieval systems on different domains. In this paper, we report the results of our participation in Task 2, "Consumer Health Search" [4]. This task investigates the problem of retrieving documents to support the needs of health consumers who are confronted with a health issue. In particular, we participated in both of the available subtasks: the Ad-hoc IR task and the Spoken queries retrieval task.

The contribution of our experiments to both subtasks can be summarized as follows:

– A study of a manual query variation approach similar to [7, 8];
– An evaluation of a ranking fusion approach [3] on different document retrieval strategies, with or without pseudo-relevance feedback [10].

The remainder of the paper introduces the methodology and a brief summary of the experimental settings that we used to create the official runs submitted for this task.

2 Methodology

In this section, we describe the methodology for merging the ranking lists produced by different retrieval methods for different query variants.

2.1 Subtask 1: Ad-hoc IR

Query variants: In this subtask, we asked an expert in the field of medical terminology to rewrite the original English query into as many variants as she preferred. The aim of the query rewriting was to describe in the best possible way (given the knowledge of the user) the information need expressed by the query. In Table 1, we show the variants for the first two queries (151001, 152001). These examples show how the number of variants, as well as the complexity of the request (from a few keywords to complex sentences), may change across queries.

Table 1. Examples of query variants given the original query for subtask 1.

  id      type       text
  151001  original   anemia diet therapy
  151001  variant 1  anaemia diet cure
  151001  variant 2  diet treatment for the decrease in the total amount of red blood cells (RBCs) or hemoglobin in the blood
  152001  original   emotional and mental disorders
  152001  variant 1  psychiatric disorder
  152001  variant 2  psychological disorder
  152001  variant 3  mental illness
  152001  variant 4  mental disease
  152001  variant 5  mental disorder
  152001  variant 6  nervous breakdown
  152001  variant 7  emotional disturbance such as: anxiety, bipolar, conduct, eating, obsessive-compulsive (OCD) and psychotic disorders

Retrieval models: For each query, we ran three different retrieval models: the Okapi BM25 model [9], the divergence from randomness (DFR) model [1], and the language model with Dirichlet priors [11]. We used the RM3 positional relevance model to implement a pseudo-relevance feedback strategy including query expansion [6].

Ranking fusion: Given the different ranking lists, we used the reciprocal ranking fusion (RRF) approach to merge them [2].
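For clarity, the following is a minimal sketch of the RRF merging step, in Python. The function name is ours and the constant k = 60 is the value suggested in [2]; the paper does not report the exact implementation or constant used, so both are assumptions made for illustration. In our setting, the input rankings are either the outputs of different retrieval models for the same query or the outputs of the same model for different query variants.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids with reciprocal rank fusion.

    rankings: list of lists, each an ordered list of document ids
              (e.g., one list per retrieval model or per query variant).
    k: smoothing constant; 60 is the value suggested by Cormack et al. [2].
    Returns the fused list of document ids, best first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse the rankings of three retrieval models for one query.
bm25 = ["d3", "d1", "d2"]
dfr = ["d1", "d3", "d4"]
qlm = ["d2", "d1", "d3"]
print(reciprocal_rank_fusion([bm25, dfr, qlm]))  # ['d1', 'd3', 'd2', 'd4']
```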
2.2 Subtask 2: Spoken queries retrieval

Query variants: In this subtask, a number of query variants are already available, (audio) recorded by six users. For this task, we used the different transcriptions of these audio files: clean transcript, default variant, phone enhanced variant, and video enhanced variant. In Table 2, we show three examples of variants (out of six) for the first two queries.

Table 2. Examples of query variants for subtask 2. Only the first three variants are shown.

  id      type           text
  151001  participant 1  anemia diet changes
  151001  participant 2  Diet for anemia
  151001  participant 3  What food can i eat on this diet
  152001  participant 1  causes of withdrawal
  152001  participant 2  What diseases may cause mental health?
  152001  participant 3  what mental health conditions can cause mood alterations cause somebody to become more withdrawn

Retrieval models: For this subtask, we used only the Okapi BM25 retrieval model and the RM3 pseudo-relevance feedback model.

Ranking fusion: Given the rankings obtained from the different participants and the different transcripts, we used the RRF approach to merge them.

3 Experiments

In this section, we describe the experimental settings and the results for each subtask.

3.1 Search Engine

For all the experiments, we used the Elasticsearch search engine (4) and the indexes provided by the organizers of the task. We used the following parameter settings for each retrieval model:

– BM25: k1 = 1.2, b = 0.75
– LMDirichlet: µ = 2000
– DFR: basic model = if, after effect = b, normalization = h2

The RM3 pseudo-relevance feedback model was implemented with the following strategy: pick the 10 most relevant terms from the top 10 ranked documents, add these terms to the original query with a weight equal to 0.5 (while the original terms are weighted 1.0), run the expanded query, and produce the final ranking list.

(4) https://www.elastic.co/products/elasticsearch
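As an illustration of the similarity settings listed above, the sketch below shows how the three retrieval models could be declared when creating an Elasticsearch index with the official Python client. The index name, field names, host, and the h2 normalization constant are placeholders of ours; since we used the indexes provided by the organizers, this is not the actual index configuration of the task, only a mapping of the reported parameters onto Elasticsearch's similarity module.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local instance

# Hypothetical index declaring the three similarities used in our runs.
settings = {
    "settings": {
        "index": {
            "similarity": {
                "sim_bm25": {"type": "BM25", "k1": 1.2, "b": 0.75},
                "sim_lmd": {"type": "LMDirichlet", "mu": 2000},
                "sim_dfr": {
                    "type": "DFR",
                    "basic_model": "if",
                    "after_effect": "b",
                    "normalization": "h2",
                    "normalization.h2.c": "1.0",  # assumed; the paper only reports h2
                },
            }
        }
    },
    "mappings": {
        "properties": {
            # One text field per similarity: a field is scored with the
            # similarity declared in its mapping.
            "body_bm25": {"type": "text", "similarity": "sim_bm25"},
            "body_lmd": {"type": "text", "similarity": "sim_lmd"},
            "body_dfr": {"type": "text", "similarity": "sim_dfr"},
        }
    },
}
es.indices.create(index="clef2020", body=settings)
```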
3.2 Runs

For each subtask, we submitted four runs.

Subtask 1. For the Ad-hoc IR subtask, the runs are:

– clef bm25 orig: BM25 alone (no rank fusion) using the original query only;
– clef original rrf: reciprocal rank fusion of the BM25, QLM, and DFR models on the original query;
– clef original rm3 rrf: reciprocal rank fusion of the BM25, QLM, and DFR models with RM3 pseudo-relevance feedback on the original query;
– clef variant rrf: BM25 and reciprocal rank fusion on the rankings produced by the original query and its manual variants.

Subtask 2. For the Spoken queries retrieval subtask, the runs are:

– bm25 rrf: reciprocal rank fusion of BM25 on the six variants of the query;
– bm25 rrf rm3: reciprocal rank fusion of BM25 on the six variants of the query using pseudo-relevance feedback with 10 documents and 10 terms (query weight 0.5);
– bm25 all rrf: reciprocal rank fusion of BM25 on all transcripts of the six variants of the query (a total of 18 variants per query);
– bm25 all rrf rm3: reciprocal rank fusion of BM25 on all transcripts using RM3 pseudo-relevance feedback.

3.3 Results

The organizers of this task provided the results (averaged across topics) achieved by many baselines compared to the runs of each participant. In Table 3, we show a summary of these results.

A preliminary analysis of the results shows that, in terms of standard evaluation measures such as MAP, Rprec, and bpref, the use of the RM3 relevance feedback model improves the effectiveness of the search engine (see Table 3). For subtask 1, the use of reciprocal rank fusion together with RM3 produced satisfactory results, in most cases better than any baseline on many performance measures. The run with manual query variants and without relevance feedback did not show any significant improvement. For subtask 2, the use of pseudo-relevance feedback achieved better results. It is interesting to see that, despite the noise introduced by the different participants' formulations of the query, Precision at 5 (P@5) was, in general, better than most of the baselines.

In terms of understandability (rRBP) and credibility (cRBP) of the retrieved results [12], we report in Table 4 the values of these two measures at different cut-offs (0.50, 0.80, 0.95), ordered by map (same ordering as Table 3). From this set of results, one interesting observation emerges: the readability (rRBP) of the results retrieved by the Ad-hoc run with manual query variants seems to improve compared to the runs that use the original query. This will be part of our future work.

Table 3. Summary of the results for subtasks 1 and 2. The upper part of the table shows the performances of many baselines (Base). The second and third parts of the table (bottom part) show the performance of our experiments for subtask 1 (AdHoc) and subtask 2 (Spoken).

  run                                          map    Rprec  bpref  recip_rank  P@5
  AdHocIR Base.elastic BM25f noqe.out          0.271  0.344  0.421  0.911       0.808
  AdHocIR Base.terrier DirichletLM noqe.out    0.271  0.357  0.416  0.869       0.736
  AdHocIR Base.terrier BM25 cli.out            0.264  0.357  0.392  0.760       0.620
  AdHocIR Base.terrier BM25 gfi.out            0.263  0.357  0.392  0.713       0.628
  AdHocIR Base.terrier BM25 noqe.out           0.263  0.345  0.396  0.852       0.716
  AdHocIR Base.terrier TF IDF noqe.out         0.261  0.347  0.396  0.854       0.764
  AdHocIR Base.terrier TF IDF qe.out           0.250  0.328  0.380  0.875       0.740
  AdHocIR Base.terrier BM25 qe.out             0.245  0.323  0.378  0.854       0.704
  AdHocIR Base.elastic BM25 QE Rein.txt        0.176  0.252  0.307  0.793       0.684
  AdHocIR Base.terrier DirichletLM qe.out      0.145  0.217  0.272  0.878       0.688
  AdHocIR Base.indri tfidf noqe.out            0.121  0.209  0.240  0.758       0.600
  AdHocIR Base.indri okapi qe.out              0.119  0.204  0.239  0.740       0.604
  AdHocIR Base.indri tfidf qe.out              0.119  0.199  0.234  0.685       0.608
  AdHocIR Base.elastic BM25f qe.out            0.111  0.163  0.211  0.892       0.720
  AdHocIR Base.indri okapi noqe.out            0.110  0.195  0.223  0.786       0.600
  AdHocIR Base.indri dirichlet noqe.out        0.079  0.160  0.181  0.748       0.540
  AdHocIR Base.indri dirichlet qe.out          0.048  0.110  0.123  0.637       0.436
  AdHocIR Base.Bing all.txt                    0.014  0.017  0.016  0.832       0.632
  AdHocIR IMS.original rm3 rrf.txt             0.283  0.364  0.432  0.864       0.780
  AdHocIR IMS.original rrf.txt                 0.281  0.362  0.423  0.916       0.800
  AdHocIR IMS.bm25 orig.txt                    0.248  0.328  0.391  0.888       0.796
  AdHocIR IMS.variant rrf.txt                  0.202  0.288  0.371  0.855       0.744
  Spoken IMS.bm25 rrf rm3.txt                  0.219  0.306  0.404  0.856       0.744
  Spoken IMS.bm25 all rrf rm3.txt              0.214  0.304  0.398  0.827       0.700
  Spoken IMS.bm25 rrf.txt                      0.196  0.280  0.374  0.854       0.760
  Spoken IMS.bm25 all rrf.txt                  0.195  0.286  0.372  0.841       0.772

Table 4. Understandability (rRBP) and Credibility (cRBP) results at different levels of cut-off for each run.

  run                                rRBP 0.50  rRBP 0.80  rRBP 0.95  cRBP 0.50  cRBP 0.80  cRBP 0.95
  AdHocIR IMS.original rm3 rrf.txt   0.322      0.314      0.304      0.523      0.504      0.453
  AdHocIR IMS.original rrf.txt       0.339      0.323      0.302      0.567      0.522      0.468
  AdHocIR IMS.bm25 orig.txt          0.347      0.320      0.292      0.551      0.513      0.448
  AdHocIR IMS.variant rrf.txt        0.353      0.351      0.310      0.513      0.486      0.414
  Spoken IMS.bm25 rrf rm3.txt        0.296      0.289      0.250      0.485      0.449      0.381
  Spoken IMS.bm25 all rrf rm3.txt    0.289      0.285      0.257      0.469      0.435      0.383
  Spoken IMS.bm25 rrf rm3.txt        0.296      0.289      0.250      0.506      0.464      0.373
  Spoken IMS.bm25 all rrf.txt        0.308      0.298      0.248      0.504      0.462      0.372
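The figures in Table 3 were computed by the task organizers. For readers who want to recompute these trec_eval-style measures for a run of their own, the following is a minimal sketch based on the pytrec_eval library; the file paths are placeholders, and this is not the official evaluation pipeline of the task.

```python
# Sketch: recompute trec_eval-style measures (map, Rprec, bpref, recip_rank, P@5)
# for a TREC-format run against the task qrels. Paths are placeholders.
from collections import defaultdict
import pytrec_eval

def read_qrels(path):
    """Read a TREC qrels file: topic 0 doc_id relevance."""
    qrels = defaultdict(dict)
    with open(path) as f:
        for line in f:
            topic, _, doc_id, rel = line.split()
            qrels[topic][doc_id] = int(rel)
    return qrels

def read_run(path):
    """Read a TREC run file: topic Q0 doc_id rank score tag."""
    run = defaultdict(dict)
    with open(path) as f:
        for line in f:
            topic, _, doc_id, _, score, _ = line.split()
            run[topic][doc_id] = float(score)
    return run

qrels = read_qrels("qrels.txt")                 # placeholder path
run = read_run("ims_original_rm3_rrf.txt")      # placeholder path

evaluator = pytrec_eval.RelevanceEvaluator(
    qrels, {"map", "Rprec", "bpref", "recip_rank", "P"}
)
per_topic = evaluator.evaluate(run)             # {topic_id: {measure: value}}

# Average each measure over topics, as reported in Table 3.
for measure in ("map", "Rprec", "bpref", "recip_rank", "P_5"):
    mean = sum(t[measure] for t in per_topic.values()) / len(per_topic)
    print(f"{measure}: {mean:.3f}")
```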
4 Final remarks and Future Work

The aim of our participation in CLEF eHealth Task 2 was to test the effectiveness of the reciprocal ranking fusion approach together with a pseudo-relevance feedback strategy. The initial results show a promising path, but a failure analysis and a topic-by-topic comparison are needed to understand when and how the different combinations in the retrieval pipeline are significantly better than simple models.
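As a pointer to the topic-by-topic comparison mentioned above, the sketch below contrasts two runs with paired significance tests on their per-topic average precision scores. It is an illustration only, not an analysis carried out in this paper: the per-topic values are made up, and in practice they would be obtained per query (e.g., with trec_eval -q or the pytrec_eval snippet shown earlier).

```python
# Sketch of a topic-by-topic comparison between two runs using per-topic AP.
# The dictionaries map topic ids to average precision; values are made up.
from scipy.stats import ttest_rel, wilcoxon

per_topic_a = {"t1": 0.31, "t2": 0.22, "t3": 0.40, "t4": 0.18, "t5": 0.27}  # e.g. a fused run
per_topic_b = {"t1": 0.28, "t2": 0.25, "t3": 0.33, "t4": 0.10, "t5": 0.26}  # e.g. BM25 alone

# Pair the scores on the topics shared by both runs.
topics = sorted(set(per_topic_a) & set(per_topic_b))
a = [per_topic_a[t] for t in topics]
b = [per_topic_b[t] for t in topics]

# Paired tests over the same topics: parametric (t-test) and non-parametric (Wilcoxon).
print(ttest_rel(a, b))
print(wilcoxon(a, b))
```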
5 Acknowledgements

This work was partially supported by the ExaMode Project, as a part of the European Union Horizon 2020 Program under Grant 825292.

References

1. Gianni Amati and Cornelis Joost Van Rijsbergen. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst., 20(4):357–389, October 2002.
2. Gordon V. Cormack, Charles L. A. Clarke, and Stefan Buettcher. Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '09, pages 758–759, New York, NY, USA, 2009. Association for Computing Machinery.
3. D. Frank Hsu and Isak Taksa. Comparing rank and score combination methods for data fusion in information retrieval. Information Retrieval, 8(3):449–480, 2005.
4. Lorraine Goeuriot, Hanna Suominen, Liadh Kelly, Zhengyang Liu, Gabriella Pasi, Gabriela Saez Gonzales, Marco Viviani, and Chenchen Xu. Overview of the CLEF eHealth 2020 task 2: Consumer health search with ad hoc and spoken queries. In Working Notes of Conference and Labs of the Evaluation (CLEF) Forum, CEUR Workshop Proceedings, 2020.
5. Lorraine Goeuriot, Hanna Suominen, Liadh Kelly, Antonio Miranda-Escalada, Martin Krallinger, Zhengyang Liu, Gabriella Pasi, Gabriela Saez Gonzales, Marco Viviani, and Chenchen Xu. Overview of the CLEF eHealth evaluation lab 2020. In Avi Arampatzis, Evangelos Kanoulas, Theodora Tsikrika, Stefanos Vrochidis, Hideo Joho, Christina Lioma, Carsten Eickhoff, Aurélie Névéol, Linda Cappellato, and Nicola Ferro, editors, Experimental IR Meets Multilinguality, Multimodality, and Interaction: Proceedings of the Eleventh International Conference of the CLEF Association (CLEF 2020), LNCS volume 12260, 2020.
6. Yuanhua Lv and ChengXiang Zhai. Positional relevance model for pseudo-relevance feedback. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '10, pages 579–586, New York, NY, USA, 2010. Association for Computing Machinery.
7. Giorgio Maria Di Nunzio, Federica Beghini, Federica Vezzani, and Geneviève Henrot. An interactive two-dimensional approach to query aspects rewriting in systematic reviews. IMS UniPD at CLEF eHealth Task 2. In Working Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum, Dublin, Ireland, September 11-14, 2017, 2017.
8. Giorgio Maria Di Nunzio, Giacomo Ciuffreda, and Federica Vezzani. Interactive sampling for systematic reviews. IMS UniPD at CLEF 2018 eHealth Task 2. In Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum, Avignon, France, September 10-14, 2018, 2018.
9. Stephen E. Robertson and Hugo Zaragoza. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333–389, 2009.
10. Ian Ruthven and Mounia Lalmas. A survey on the use of relevance feedback for information access systems. Knowl. Eng. Rev., 18(2):95–145, June 2003.
11. Chengxiang Zhai and John Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '01, pages 334–342, New York, NY, USA, 2001. Association for Computing Machinery.
12. Guido Zuccon. Understandability biased evaluation for information retrieval. In Nicola Ferro, Fabio Crestani, Marie-Francine Moens, Josiane Mothe, Fabrizio Silvestri, Giorgio Maria Di Nunzio, Claudia Hauff, and Gianmaria Silvello, editors, Advances in Information Retrieval, pages 280–292, Cham, 2016. Springer International Publishing.