<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LIG-Health at Adhoc and Spoken IR Consumer Health Search: expanding queries using UMLS and FastText.</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Philippe Mulhem</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gabriela Gonzalez Saez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aidan Mannion</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Didier Schwab</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jibril Frej</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Univ. Grenoble Alpes</institution>
          ,
          <addr-line>CNRS, Grenoble INP</addr-line>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper describes the work done by the LIG of Grenoble for the Adhoc and the Spoken Consumer Health Search tasks. Our focus for this participation is to study the effectiveness of simple query expansions for health-related retrieval. We explored several query expansions, using knowledge-based or embedding-based techniques, with and without weighting of expansions, and with and without Pseudo Relevance Feedback. The results obtained for Adhoc queries show that our baseline run outperforms the proposed query expansions. The results obtained for spoken queries show that different speakers lead to very different results, and that merging the results from several users improves the quality of the system.</p>
      </abstract>
      <kwd-group>
        <kwd>Query Expansion</kwd>
        <kwd>UMLS</kwd>
        <kwd>FastText</kwd>
        <kwd>Query fusion</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        This paper describes the experiments carried out by the LIG-Health team for the
CLEF 2020 evaluation campaign [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. We participated in the Consumer
Health Search task of CLEF eHealth 2020 [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], and more specifically in the Adhoc
subtask and the spoken queries subtask [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The people involved in these
experiments are members of the Information Retrieval group (MRIM) and the
Natural Language Processing group (GETALP) of the Laboratoire d'Informatique
de Grenoble.
      </p>
      <p>
        Our work targeted the two subtasks proposed: adhoc and spoken queries.
For both subtasks, we explored the use of two query expansion methods: one
knowledge-supported, using the UMLS meta-thesaurus [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and one using
FastText embeddings [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Binary and weighted expansions were processed in both
cases. For the retrieval stage, we considered both "Straight" (SR) and Relevance
Feedback (RF) settings. We study how some simple processes may be adapted to
both text and spoken queries. In the case of spoken queries, query expansion
may be questionable because of possible errors in the speech-to-text step. In all
cases, we made use of the assessments of CLEF eHealth 2018 to select the
submissions.
      </p>
      <p>We tackled the spoken queries by considering all the transcriptions provided,
and applying the two expansions and the two retrieval settings described above,
in order to determine the best configuration to submit. For the fusion of runs, we
considered a simple fusion of result lists.</p>
      <p>The remainder of the paper is organized as follows. In Section 2 we describe
in detail the two expansion approaches used, before describing our proposal in
Section 3. Section 4 focuses on the features and parameters of the Information
Retrieval system used. Section 5 describes the submitted runs. The official results
are presented in Section 6. We discuss the results in Section 7 before concluding
in Section 8.</p>
    </sec>
    <sec id="sec-2">
      <title>Expansion Approaches</title>
      <sec id="sec-2-1">
        <title>FastText-based</title>
        <p>
          The first expansion relies on FastText [
          <xref ref-type="bibr" rid="ref10 ref3">3, 10</xref>
          ]. FastText provides a
framework to learn and manage word embeddings. It is able to consider
subwords (using character n-grams), as opposed to more classical embedding models such as
Word2Vec [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], which create embeddings only for whole-word tokens. The
FastText embedding vector of a word is the sum of the vectors of its component
n-grams.
        </p>
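As an illustration of this subword decomposition, the following sketch (our own toy example, not the actual FastText implementation) extracts FastText-style character n-grams with boundary markers and sums stand-in n-gram vectors into a word vector; the random embedding table is purely illustrative.

```python
import numpy as np

def char_ngrams(word, n=5):
    """Character n-grams of a word, with FastText-style boundary markers < and >."""
    marked = f"<{word}>"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

rng = np.random.default_rng(0)
ngram_vectors = {}  # hypothetical n-gram embedding table (d = 300 in the paper)

def word_vector(word, d=300):
    """Word vector = sum of the vectors of its component n-grams."""
    vec = np.zeros(d)
    for g in char_ngrams(word):
        if g not in ngram_vectors:
            ngram_vectors[g] = rng.normal(size=d)  # stand-in for trained vectors
        vec += ngram_vectors[g]
    return vec

print(char_ngrams("where"))  # ['<wher', 'where', 'here>']
```

Because unseen words still decompose into known n-grams, this scheme yields vectors for out-of-vocabulary terms, which is why it suits noisy consumer health queries.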
        <p>We used the pre-trained word vectors for the English language, trained on
Common Crawl and Wikipedia using FastText. The features of the model used are
as follows:
- Continuous bag-of-words (CBOW) with positional weighting
- Vector embeddings of dimension d = 300
- Character n-grams of length 5
- Context window of size 5
- Sampling of 10 negative examples per positive example</p>
        <p>Using such embeddings, in our experiments, we expand each query using
terms whose cosine similarity with the original query terms is greater than an
experimentally determined threshold t. That is, denoting the cosine similarity
function as FT_cos, for each term w in the preprocessed query, we calculate its
FastText embedding vector f(w) and then add to the query all terms w' for which
FT_cos(f(w), f(w')) &gt;= t.</p>
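A minimal sketch of this expansion step, with a toy embedding table standing in for the pretrained FastText vectors (the words and vectors below are invented for illustration):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity FT_cos between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def expand_query(query_terms, embeddings, t=0.75):
    """Add every vocabulary term whose cosine similarity with some
    query term reaches the threshold t."""
    expansion = set()
    for q in query_terms:
        for w, vec in embeddings.items():
            if w not in query_terms and cosine(embeddings[q], vec) >= t:
                expansion.add(w)
    return list(query_terms) + sorted(expansion)

# Toy embedding table; real runs would load the Common Crawl + Wikipedia model.
emb = {
    "ache":  np.array([1.0, 0.1]),
    "pain":  np.array([0.9, 0.2]),
    "piano": np.array([0.0, 1.0]),
}
print(expand_query(["ache"], emb, t=0.75))  # ['ache', 'pain']
```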
      </sec>
      <sec id="sec-2-2">
        <title>UMLS-based</title>
        <p>
          The second expansion strategy used in this work relies on the Unified Medical
Language System (UMLS) Metathesaurus [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], a comprehensive biomedical
thesaurus incorporating a network of semantically related concepts linking a large
number of medical language resources. From the many information sources in the
UMLS Metathesaurus, we restricted our expansion search to one that is specifically
designed to deal with consumer-level medical vocabulary: the Open Access and
Collaborative Consumer Health Vocabulary
(https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/CHV/index.html),
known as the CHV, which contains
more than 88 000 synonyms for more than 57 000 concepts.
        </p>
        <p>The CHV is used to get the synonyms of query terms, and we denote the
function mapping a term to its CHV synonyms as CHV_syn in the following. As
the synonyms were often too general or too numerous in initial experiments,
we introduced in addition a filtering step based on the FastText similarity FT_cos.
Given that the goal of the UMLS-based expansion is to find expansion terms
that are semantically rather than syntactically related to the query terms, the
CHV synonyms were only included in the expanded query if their FastText
embedding had a cosine similarity less than 0.6 with the original query term
they were associated with.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Query expansions proposed</title>
      <p>The two query expansions proposed are described now. Each of them has two
versions: a binary one and a weighted one. As their names suggest, the binary
expansions do not assign any weight to query terms, while the weighted ones
are able to indicate a level of importance of a term in the query. We detail them
in the following.</p>
      <sec id="sec-3-1">
        <title>Embedding-based only expansion.</title>
        <p>
          This approach is quite similar to [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]; one major difference is that the embeddings
consider subwords, as described above in Section 2.1. We do not use any manually
defined knowledge for these expansions. Formulas 1 and 2 describe the binary
expansion based on FastText. In formula 1, the set VOC_FT denotes the
vocabulary that FastText manages. The manually defined threshold considered here,
0.75, is lower than the one in the UMLS expansion: it is consistent with [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] and
a trade-off between the quality of suggested terms and the quantity of terms found.
        </p>
        <p>T_exp_FT_binary(q_i) = { e | e ∈ VOC_FT \ q ∧ FT_cos(q_i, e) &gt;= 0.75 }   (1)

Exp_Query_FT_binary(q) = q ∪ (∪_{q_i ∈ q} T_exp_FT_binary(q_i))   (2)</p>
        <p>For the weighted FT-based expansion, the principle is the same as before, but:
- the initial query terms have a weight of 1;
- the expanded terms are weighted by the cosine similarity between their
FastText embedding and that of the query term they expand;
- if one expansion term occurs several times in the expansion, each (weighted)
occurrence is considered in the expansion.</p>
        <p>More formally, formulas 3 and 4 describe such an expansion:

T_exp_FT_weighted(q_i) = { (e, FT_cos(q_i, e)) | e ∈ VOC_FT \ q ∧ FT_cos(q_i, e) &gt;= 0.75 }   (3)

Exp_Query_FT_weighted(q) = { (q_i, 1) | q_i ∈ q } ∪ (∪_{q_i ∈ q} T_exp_FT_weighted(q_i))   (4)</p>
        <p>
          The use of knowledge-based query expansion is well studied, as in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. In
the specific case of medical search, the use of the UMLS meta-thesaurus is classical,
as in [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. The binary expansion is processed query term by query
term q_i, as described in formulas 5 and 6. In our experiments, for each query
term q_i from a query q, we look for the synonyms of q_i in the consumer health
vocabulary. Then, we apply a filtering that keeps the term if its similarity, using
FastText [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], is larger than 0.8. Again, this threshold has been manually defined
and is consistent with [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] (even if [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] showed that such a threshold cannot be
considered a rule of thumb). This filtering allows us to consolidate the trust we
have in the synonyms provided by the CHV.
        </p>
        <p>T_exp_UMLS_binary(q_i) = { e | e ∈ CHV_syn(q_i) ∧ FT_cos(q_i, e) &gt;= 0.8 }   (5)</p>
        <p>Exp_Query_UMLS_binary(q) = q ∪ (∪_{q_i ∈ q} T_exp_UMLS_binary(q_i))   (6)</p>
        <p>For the weighted UMLS-based expansion, the principle is the same as before, but:
- the initial query terms have a weight of 1;
- as the CHV does not weight synonymy relationships, we propose that the
expanded terms get the weight provided by FastText;
- if one expansion term occurs several times in the expansion, each (weighted)
occurrence is considered in the expansion.
More formally, formulas 7 and 8 describe such an expansion:

T_exp_UMLS_weighted(q_i) = { (e, FT_cos(q_i, e)) | e ∈ CHV_syn(q_i) ∧ FT_cos(q_i, e) &gt;= 0.8 }   (7)

Exp_Query_UMLS_weighted(q) = { (q_i, 1) | q_i ∈ q } ∪ (∪_{q_i ∈ q} T_exp_UMLS_weighted(q_i))   (8)</p>
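The weighted UMLS-based expansion described above can be sketched as follows; the CHV synonym table and the embedding vectors here are invented stand-ins (real runs would query the UMLS Metathesaurus and the pretrained FastText model):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity FT_cos between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical stand-ins for CHV_syn and the FastText embedding table.
chv_syn = {"hypertension": ["high blood pressure", "htn"]}
emb = {
    "hypertension":        np.array([1.0, 0.2]),
    "high blood pressure": np.array([0.9, 0.3]),
    "htn":                 np.array([0.1, 1.0]),
}

def umls_weighted_expansion(query_terms, threshold=0.8):
    """Original terms keep weight 1; a CHV synonym is kept, with its
    FastText cosine similarity as weight, when it reaches the threshold."""
    weighted = {q: 1.0 for q in query_terms}
    for q in query_terms:
        for s in chv_syn.get(q, []):
            w = cosine(emb[q], emb[s])
            if w >= threshold:
                weighted[s] = round(w, 3)
    return weighted

print(umls_weighted_expansion(["hypertension"]))
```

With these toy vectors, "high blood pressure" passes the 0.8 filter while "htn" is discarded, mirroring how the filter drops weakly related synonyms.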
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Information Retrieval System</title>
      <p>
        The information retrieval system used for the experiments is Terrier v5.2
(http://terrier.org/) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
We did not index the corpus ourselves, but used the index provided by the organizers.
This had an impact on the retrieval: simple tests revealed that the
index seems corrupted, leading to duplicate document identifiers in result lists.
We therefore post-processed the result lists to remove these duplicated
documents. Because this removal was applied on the top-1000 documents, our
result lists contain fewer than 1000 entries.
      </p>
      <p>
        The IR model used is BM25 [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], with b=0.75 after preliminary experiments and
other parameters left at their defaults. Many experiments show that BM25 is a very good
model to use [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. The Relevance Feedback model is Bose-Einstein (the Bo1
model of Terrier), with default parameters (top 3 documents considered, and
10 terms for expansion). The Bo1 relevance feedback model provides very good
results.
      </p>
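A minimal BM25 scoring sketch with b = 0.75 as used above; the value k1 = 1.2 and the exact idf variant are illustrative assumptions (Terrier's implementation differs in details such as term-frequency normalization):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freqs, n_docs, avg_dl, k1=1.2, b=0.75):
    """Score one document for a query with a textbook BM25 formula.
    doc_freqs maps a term to its document frequency in the collection."""
    tf = Counter(doc_terms)
    dl = len(doc_terms)  # document length in tokens
    score = 0.0
    for t in query_terms:
        if t not in tf:
            continue
        df = doc_freqs.get(t, 0)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * dl / avg_dl))
        score += idf * norm
    return score

# Invented toy document and collection statistics, for illustration only.
doc = "diet advice for high blood pressure".split()
s = bm25_score(["blood", "pressure"], doc, {"blood": 10, "pressure": 12},
               n_docs=1000, avg_dl=6.0)
print(round(s, 3))
```

The parameter b controls how strongly long documents are penalized; with b = 0.75, length normalization is applied at three quarters of its full strength.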
    </sec>
    <sec id="sec-5">
      <title>Runs description</title>
      <p>The different runs submitted were the best four runs among several configurations.
As described above in Section 3, adding the expanded configurations, we get a total of
10 runs:
1. Noexp: no expansion, straight query processing (i.e., without Relevance Feedback) ‡;
2. Noexp RF: no expansion, RF query processing †;
3. FT Straight binary: FastText-based query expansion, binary expansion mode, no RF ‡;
4. FT Straight weighted: FastText-based weighted query expansion, no RF;
5. FT RF binary: FastText-based binary expansion, RF query processing †;
6. FT RF weighted: FastText-based weighted query expansion, RF query processing;
7. UMLS Straight binary: UMLS-based query expansion, binary expansion mode, straight query processing ‡;
8. UMLS Straight weighted: UMLS-based weighted query expansion, straight query processing;
9. UMLS RF binary: UMLS-based binary query expansion, RF query processing †;
10. UMLS RF weighted: UMLS-based weighted query expansion, RF query processing †.</p>
      <p>As described below, we select among these configurations our submissions for
the two subtasks: Adhoc (marked ‡) and Spoken queries (marked †).</p>
      <sec id="sec-5-1">
        <title>Adhoc subtask</title>
        <p>For the selection of our submitted runs, we evaluated the quality, using MAP on the qrels
of CLEF eHealth Adhoc 2018, of the 10 configurations above.
The results obtained are presented in Table 1. The best reference run between
Noexp and Noexp RF, plus the top three runs, were submitted as our official
runs. For the Adhoc subtask, dedicated to retrieving documents for a single
query, a set of 50 queries is provided. The runs with the best results over these
50 queries are chosen (marked with a ‡ in Section 5).</p>
        <p>From Table 1, we see that, using the CLEF eHealth 2018 reference, the best
run is the non-expanded and non-RF one. When binary configurations achieve
the same quality as their weighted counterparts, we choose the binary configuration.
This explains why the UMLS and FT-based binary expansions with straight
query processing are selected. Overall, we notice that Relevance Feedback
query processing underperforms straight query processing for the FastText-based
expansions, and that the weighted FastText expansions behave the same as
their binary counterparts.</p>
        <p>On the Spoken subtask, the 50 topics from the Adhoc task had been recorded
by six users (Participant 1 to Participant 6). Per participant, six
transcriptions are provided: default enhanced transcription, ESPNET commonvoice,
ESPNET librispeech, ESPNET librispeech rnnlm, phone enhanced, and video enhanced.
We explore all of these transcriptions for each participant. This leads to a total
of 36 (= 6 participants × 6 transcriptions) versions of the set of queries.</p>
        <p>The full selection of the four submitted runs per user considers the 10 configurations
described previously over these versions of the queries. It follows two steps:
the first one selects one transcription per user, and in a second step we choose
the configurations used for the submission. More precisely:</p>
      </sec>
      <sec id="sec-5-2">
        <title>1. Selection of one transcription per participant</title>
        <p>We first choose the transcriptions that achieve the highest MAP values
(according to the CLEF eHealth 2018 qrels) over the non-expanded runs. These
results are presented in Figure 2. We see that the transcription quality varies
a lot depending on the speaker: for instance, the default enhanced
transcription is very good for Participants 1, 2, 3 and 6, but fails for Participants
4 and 5. By analyzing this figure, we select the following transcriptions:
- default enhanced transcription for Participant 1
- default enhanced transcription for Participant 2
- video enhanced for Participant 3
- phone enhanced for Participant 4
- phone enhanced for Participant 5
- default enhanced transcription for Participant 6</p>
      </sec>
      <sec id="sec-5-3">
        <title>2. Selection of four configurations per participant over the chosen transcription of step 1</title>
        <p>Then, we computed the averaged MAP (over the 6 participants, using the CLEF
eHealth 2018 Adhoc assessments provided) for each IR configuration.
These results are presented in Figure 3. Considering these averaged MAPs, we
choose the top four configurations, with only one non-expanded run (marked
with a † in Section 5):
(a) Noexp RF: no expansion, RF query processing;
(b) FT RF binary: FastText-based binary query expansion, RF query processing;
(c) UMLS RF binary: UMLS-based binary query expansion, RF query processing;
(d) UMLS RF weighted: UMLS-based weighted query expansion, RF query processing.</p>
        <p>We notice that all the selected configurations use Relevance Feedback.</p>
      </sec>
      <sec id="sec-5-4">
        <title>Fused runs</title>
        <p>We also submitted 4 runs that fuse the results of the same configuration for
each participant. To integrate these results, we used a simple sum of scores.
This allows us to study whether the integration of several transcriptions from several
participants outperforms single-participant transcriptions. The MAP evaluation
of these four configurations on the CLEF eHealth 2018 assessments, as presented
in Figure 4, shows a slight increase compared to the top results per user.</p>
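The sum-of-scores fusion can be sketched as follows (a simple CombSUM over the per-participant ranked lists; the document ids and scores below are invented):

```python
from collections import defaultdict

def fuse_by_score_sum(result_lists, depth=1000):
    """Fuse per-participant result lists by summing document scores,
    then re-rank the fused pool by the summed score."""
    fused = defaultdict(float)
    for results in result_lists:      # one ranked (doc_id, score) list per participant
        for doc_id, score in results:
            fused[doc_id] += score
    ranked = sorted(fused.items(), key=lambda x: -x[1])
    return ranked[:depth]

p1 = [("d1", 9.0), ("d2", 7.5)]   # Participant 1's run
p2 = [("d2", 8.0), ("d3", 6.0)]   # Participant 2's run
print(fuse_by_score_sum([p1, p2]))  # d2 first with summed score 15.5
```

Documents retrieved by several participants accumulate score, which is how the fusion rewards agreement between transcriptions.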
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Results</title>
      <p>
        We present here the official results obtained by our runs, for the adhoc and
spoken queries subtasks. We consider the following evaluation measures: MAP,
to assess globally the quality of the configurations; Bpref [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which takes into
account the fact that the evaluation relies on incomplete assessments; and the
classical ndcg@10, which focuses on the relevance of the top-10 results. For these
runs, the other measures provided by the organizers, like the RBP-based ones, lead
to similar rankings of the configurations tested.
      </p>
      <sec id="sec-6-1">
        <title>Adhoc</title>
        <p>Our official results for the Adhoc query runs are presented in Table 2. In this
table, we see that the best MAP and ndcg@10 results are obtained without any
query expansion and without any relevance feedback. This means that none of
the query expansions is able to increase the quality with regard to these
evaluation measures. However, binary UMLS and FT-based expansions, with straight
query processing, slightly outperform the un-expanded runs with straight query
processing.</p>
        <p>We also studied the results obtained for the RBP measures, in Table 3. We see
that the UMLS expanded run outperforms the non-expanded one for RBP and
RBP readability.</p>
      </sec>
      <sec id="sec-6-2">
        <title>Spoken queries</title>
        <p>The official results per participant are presented in Table 4. This
table confirms our evaluations on the 2018 assessments: there are large variations in
quality (for all the measures) depending on the participant considered. Among our
four submitted configurations, the best MAP for Participant 6 is 0.1744 (Noexp
with RF), whereas for Participant 5 the best MAP (UMLS-based binary
expansion with RF) is only 0.1036. The best measures (among the 3 presented) per
user are obtained without any query expansion in 12 cases out of 24. The
UMLS-based binary expansion provides the best measures only twice (MAP and Bpref),
for Participant 5. The weighted expansion using UMLS outperforms other
configurations in 4 cases. The weighted UMLS expansion outperforms the binary
UMLS expansion in 12 cases out of 18. FT-based expansion never produces the
best results, but yields the second-best Bpref and ndcg@10 for Participant 1.</p>
        <p>For the submitted merged results, we see that the merging always increases
(over the best single participant) the evaluation measures for each configuration.
Here, the binary UMLS-based expansion outperforms its weighted
counterpart for Bpref and ndcg@10. FT-based expansions underperform UMLS-based
expansions.</p>
        <p>For the RBP evaluations of the merged spoken runs, the non-expanded run
still outperforms our other submissions.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Discussion</title>
      <p>We focus first on the Adhoc subtask. According to the MAP values obtained
on the 2018 assessments, our official results are consistent: the best run is the
non-expanded one without relevance feedback. We see that the UMLS expanded
query with relevance feedback performs as well as the non-expanded run for the first
results, as the P@10 and ndcg@10 are almost equal. Binary UMLS and FT-based
expansions slightly outperform our non-expanded runs for the Bpref measure;
this shows that such expansions can be beneficial. We see in Figures 5 and 6
(i.e., our two best results according to MAP) that the expansion is much more
unstable compared to the median results.</p>
      <p>When considering spoken participant runs, we show again that some of the
expansions proposed are able to outperform the non-expanded configurations. More
precisely, UMLS-based expansions obtain larger MAP and Bpref values than
non-expanded runs for 33% (= 2/6) of the participants.</p>
      <p>The fused evaluation results that we get from the spoken queries
are consistent with the adhoc results: the expansions underperform the
non-expanded runs. With merged results, the expansions are never close to the quality
of the non-expanded runs: the reason is that the expansions over each user tend to
disperse the initial query expression, which is already subject to transcription errors. We
present in Figures 7 and 8 (i.e., our two best results according to ndcg@10) the results
per query. We see again here that the expansion underperforms the median
results more often than the non-expanded run does.</p>
      <p>A more detailed fusion process may improve the overall quality, but in any
case merging results for spoken queries needs to be able to retrieve similar queries
asked by several users, which is not an easy task.</p>
      <p>This work focuses only on simple expansions, and these expansions do
not succeed in increasing the quality of the results. The expansion terms are
not strongly enough related to the initial query. Future experiments will be
conducted to check exactly why these expansions fail.</p>
      <p>For the official evaluation measures related to credibility, the UMLS
expanded runs outperform the non-expanded one by 4.3% (cRBP 0.80) for the
Adhoc search, but the non-expanded runs still achieve a higher quality than the
expanded ones for the spoken runs.</p>
    </sec>
    <sec id="sec-8">
      <title>Conclusion</title>
      <p>We presented in this paper the retrieval configurations for the Adhoc and
Spoken queries subtasks of the Consumer Health Search task of CLEF eHealth
2020. We focused our proposal on several query expansions. The query
expansions rely on the UMLS meta-thesaurus and on word embeddings using FastText.</p>
      <p>The main finding is that the expansions proposed for the classical Adhoc task
underperform simple retrieval (with or without the Relevance Feedback strategy). For
spoken runs, we were able to detect that some query expansions (based on UMLS)
do compete well with simple retrieval without expansion.</p>
      <p>Other approaches based on reranking should be studied in the future, so as
to avoid the noise generated by the expansion of queries.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgement</title>
      <p>This work was partially supported by the ANR Kodicare project, grant
ANR-19-CE23-0029 of the French Agence Nationale de la Recherche.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Mohannad</given-names>
            <surname>Almasri</surname>
          </string-name>
          , Catherine Berrut, and
          <string-name>
            <surname>Jean-Pierre Chevallet</surname>
          </string-name>
          .
          <article-title>A Comparison of Deep Learning Based Query Expansion with Pseudo-Relevance Feedback and Mutual Information</article-title>
          . In Conference ECIR, volume
          <volume>42</volume>
          , pages
          <fpage>369</fpage>
          {
          <fpage>715</fpage>
          ,
          <string-name>
            <surname>Padoue</surname>
          </string-name>
          , Italy,
          <year>March 2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Olivier</given-names>
            <surname>Bodenreider</surname>
          </string-name>
          .
          <article-title>The Unified Medical Language System (UMLS): Integrating Biomedical Terminology</article-title>
          .
          <source>Nucleic Acids Res</source>
          .,
          <volume>32</volume>
          (
          <string-name>
            <surname>Database-Issue</surname>
          </string-name>
          ):
          <volume>267</volume>
          {
          <fpage>270</fpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Piotr</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          , Edouard Grave, Armand Joulin, and
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          .
          <article-title>Enriching Word Vectors with Subword Information</article-title>
          . CoRR, abs/1607.04606,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Chris</given-names>
            <surname>Buckley and Ellen M. Voorhees</surname>
          </string-name>
          .
          <article-title>Retrieval Evaluation with Incomplete Information</article-title>
          .
          <source>In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , SIGIR '
          <volume>04</volume>
          , page
          <volume>25</volume>
          {
          <fpage>32</fpage>
          , New York, NY, USA,
          <year>2004</year>
          .
          <article-title>Association for Computing Machinery</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>A.</given-names>
            <surname>Elekes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schaeler</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Boehm</surname>
          </string-name>
          .
          <article-title>On the Various Semantics of Similarity in Word Embedding Models</article-title>
          .
          <source>In 2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL)</source>
          , pages
          <fpage>1</fpage>
          {
          <fpage>10</fpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Lorraine</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          , Hanna Suominen, Liadh Kelly, Zhengyang Liu, Gabriella Pasi, Gabriela Saez Gonzales, Marco Viviani, and
          <string-name>
            <given-names>Chenchen</given-names>
            <surname>Xu</surname>
          </string-name>
          .
          <article-title>Overview of the CLEF eHealth 2020 Task 2: Consumer Health Search with Ad Hoc and Spoken Queries</article-title>
          . In Working Notes of Conference and
          <article-title>Labs of the Evaluation (CLEF) Forum</article-title>
          , CEUR Workshop Proceedings,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Lorraine</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          , Hanna Suominen, Liadh Kelly, Antonio Miranda-Escalada, Martin Krallinger, Zhengyang Liu, Gabriella Pasi, Gabriela Saez Gonzales, Marco Viviani, and
          <string-name>
            <given-names>Chenchen</given-names>
            <surname>Xu</surname>
          </string-name>
          .
          <article-title>Overview of the CLEF eHealth Evaluation Lab 2020</article-title>
          . In Avi Arampatzis, Evangelos Kanoulas, Theodora Tsikrika, Stefanos Vrochidis, Hideo Joho, Christina Lioma, Carsten Eickhoff, Aurelie Neveol, Linda Cappellato, and Nicola Ferro, editors,
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction: Proceedings of the Eleventh International Conference of the CLEF Association (CLEF</source>
          <year>2020</year>
          ) , LNCS Volume number:
          <volume>12260</volume>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Kotov and ChengXiang Zhai</surname>
          </string-name>
          .
          <article-title>Tapping into Knowledge Base for Concept Feedback: Leveraging Conceptnet to Improve Search Results for Difficult Queries</article-title>
          . In Eytan Adar, Jaime Teevan, Eugene Agichtein, and Yoelle Maarek, editors,
          <source>WSDM</source>
          , pages
          <volume>403</volume>
          {
          <fpage>412</fpage>
          . ACM,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Craig</surname>
            <given-names>Macdonald</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Richard</surname>
            <given-names>McCreadie</given-names>
          </string-name>
          , Rodrygo LT Santos, and
          <string-name>
            <given-names>Iadh</given-names>
            <surname>Ounis</surname>
          </string-name>
          .
          <article-title>From Puppy to Maturity: Experiences in Developing Terrier</article-title>
          .
          <source>Proc. of OSIR at SIGIR</source>
          , pages
          <volume>60</volume>
          {
          <fpage>63</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , Edouard Grave, Piotr Bojanowski,
          <string-name>
            <given-names>Christian</given-names>
            <surname>Puhrsch</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Armand</given-names>
            <surname>Joulin</surname>
          </string-name>
          .
          <article-title>Advances in Pre-Training Distributed Word Representations</article-title>
          . In
          <source>Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018)</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , Ilya Sutskever, Kai Chen, Greg S. Corrado, and
          <string-name>
            <given-names>Jeff</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <article-title>Distributed Representations of Words and Phrases and their Compositionality</article-title>
          . In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>26</volume>
          , pages
          <fpage>3111</fpage>
          –
          <lpage>3119</lpage>
          . Curran Associates, Inc.,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>Antonio</given-names>
            <surname>Miranda-Escalada</surname>
          </string-name>
          , Aitor Gonzalez-Agirre, Jordi Armengol-Estape, and Martin Krallinger.
          <article-title>Overview of Automatic Clinical Coding: Annotations, Guidelines, and Solutions for non-English Clinical Cases at CodiEsp Track of CLEF eHealth 2020</article-title>
          . In
          <source>Working Notes of Conference and Labs of the Evaluation (CLEF) Forum</source>
          , CEUR Workshop Proceedings,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>Navid</given-names>
            <surname>Rekabsaz</surname>
          </string-name>
          , Mihai Lupu, and
          <string-name>
            <given-names>Allan</given-names>
            <surname>Hanbury</surname>
          </string-name>
          .
          <article-title>Exploration of a Threshold for Similarity Based on Uncertainty in Word Embedding</article-title>
          . In Joemon M. Jose, Claudia Hauff, Ismail Sengor Altıngövde, Dawei Song, Dyaa Albakour, Stuart Watt, and John Tait, editors,
          <source>Advances in Information Retrieval</source>
          , pages
          <fpage>396</fpage>
          –
          <lpage>409</lpage>
          , Cham,
          <year>2017</year>
          . Springer International Publishing.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>Stephen</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Walker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Hancock-Beaulieu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Gatford</surname>
          </string-name>
          .
          <article-title>Okapi at TREC-3</article-title>
          . In
          <source>Overview of the Third Text REtrieval Conference (TREC-3)</source>
          , pages
          <fpage>109</fpage>
          –
          <lpage>126</lpage>
          . Gaithersburg, MD: NIST,
          <year>January 1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>Peilin</given-names>
            <surname>Yang</surname>
          </string-name>
          and
          <string-name>
            <given-names>Hui</given-names>
            <surname>Fang</surname>
          </string-name>
          .
          <article-title>A Reproducibility Study of Information Retrieval Models</article-title>
          .
          <source>ICTIR '16</source>
          , pages
          <fpage>77</fpage>
          –
          <lpage>86</lpage>
          , New York, NY, USA,
          <year>2016</year>
          . Association for Computing Machinery
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16. Zhenyu Liu and Wesley W. Chu.
          <article-title>Knowledge-based Query Expansion to Support Scenario-specific Retrieval of Medical Free Text</article-title>
          .
          <source>Information Retrieval</source>
          ,
          <volume>10</volume>
          (
          <issue>2</issue>
          ):
          <fpage>173</fpage>
          –
          <lpage>202</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>