Medical Query Expansion using Semantic Sources DBpedia and Wikidata

Sarah Dahir (a), Jalil ElHassouni (b), Abderrahim El Qadi (c), Hamid Bennis (a)

(a) IMAGE Laboratory, SCIAM Team, Graduate School of Technology, Moulay Ismail University of Meknes, Morocco
(b) LRIT-CNRST (URAC'29), Faculty of Sciences, Rabat IT Center, Mohammed V University in Rabat, Morocco
(c) ENSAM, Mohammed V University in Rabat, Morocco

ISIC 2021: International Semantic Intelligence Conference
Email: sarah.dahir2012@gmail.com (S. Dahir); abderrahim.elqadi@um5.ac.ma (A. El Qadi); hamid.bennis@gmail.com (H. Bennis)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

Abstract
Query Expansion (QE) is known for its effectiveness in increasing query relevance in Information Retrieval (IR) systems, and Linked Open Data (LOD) are currently used in different domains for various objectives, such as suggesting alternative options to users based on features of their previous searches and interests. We therefore suggest an approach to enhance IR in the medical domain through QE using two LOD bases: DBpedia and Wikidata. We use DBpedia entities found within a PubMed abstract as candidates for expansion, along with their associated labels ("rdfs:label") in the DBpedia base. We evaluate our suggested approach using the MEDLINE collection and the Indri search engine. Our expansion approach leads to significant improvements, especially in terms of precision and Mean Average Precision (MAP), compared to related approaches that use only one domain-dependent or domain-independent source.

Keywords: DBpedia, Information Retrieval, PubMed, Query Expansion, Wikidata.

1. Introduction

Information Retrieval Systems (IRS) match the user query to a collection of documents. As a result, a subset of documents is returned. This subset is considered relevant because it contains the query terms. But sometimes words from the user query are different from those contained in the relevant document set. This issue has been shown in various studies, one of them from the medical field. Covid-19 symptoms (fever, sore throat, shortness of breath, loss of taste, and loss of smell), as well as testing for coronavirus and preventive measures (face mask, hand sanitizer, social distancing, and hand washing), have become some of the most trending queries, along with other search trends related to the aftermath of the pandemic on several other domains such as the economy (e.g., unemployment and the stock market) and education (e.g., school closures) [1]. For instance, queries on the loss of smell reached 8% on March 23rd, 2020, and queries on testing for corona reached 97% on April 13th [1].

The lockdown caused by the pandemic increased, more than ever, our need for better IR for medical queries in general, especially since this type of query lacks the technical terms that domain experts use in web pages. This problem is often referred to as vocabulary mismatch.

One way to overcome this problem is to use query expansion. This process is done by adding new terms to the user query based on association rules between the terms [2]. However, adding many terms to the query can be more harmful than adding a few [3].

Linked Data (http://linkeddata.org/) take advantage of the Web to connect related data [4]. For this purpose, Uniform Resource Identifiers (URI) and the Resource Description Framework (RDF) are used, among other technologies and Linked Data standards. Some of these data sources are open and others require a license agreement:

• DBpedia: a knowledge base that contains structured information from Wikipedia. This knowledge base describes 6 million entities, including 5,000 diseases [5]. It allows, among other things, annotation of a text through the Web interface DBpedia Spotlight (https://www.dbpedia-spotlight.org/demo/), which performs Named Entity Recognition (a minimal example of calling this service is sketched after this list). Yet, we noticed throughout our multiple accesses to DBpedia Spotlight that the annotation stops functioning from time to time. To be more precise, on three occasions over four years we were unable to annotate texts using this Web application, and whenever it stops functioning, it stays that way for three to four days in a row.
• Wikidata: one of the largest datasets. It is a free knowledge database project hosted by Wikimedia, with 90,478,674 data items [6] (including concepts). Unlike other knowledge bases, Wikidata may be edited by users. Furthermore, it usually gives links that allow browsing a resource in other databases such as MeSH, PubMed, Freebase, etc.
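As an aside on the DBpedia bullet above, the following minimal sketch shows how such an annotation call could be made against the public DBpedia Spotlight REST endpoint. The endpoint URL, the confidence value, and the JSON field names reflect the public demo service as we understand it; they are assumptions for illustration, not details prescribed by this paper.

```python
import requests

def annotate_with_spotlight(text, confidence=0.5):
    """Annotate free text with DBpedia entities through the public Spotlight REST API
    (assumption: the demo endpoint below is reachable; as noted above, it is sometimes down)."""
    response = requests.get(
        "https://api.dbpedia-spotlight.org/en/annotate",
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    response.raise_for_status()
    resources = response.json().get("Resources", [])
    # Each resource carries the surface form found in the text and the DBpedia URI it was linked to.
    return [(r["@surfaceForm"], r["@URI"]) for r in resources]

# Example: annotate a fragment of a medical query.
print(annotate_with_spotlight("renal amyloidosis as a complication of tuberculosis"))
```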
In this work, we suggest expanding queries using two linked data sources (DBpedia, Wikidata) along with a search engine (PubMed) that allows searching the MEDLINE database, and the National Library of Medicine (NLM) controlled vocabulary thesaurus (Medical Subject Headings, MeSH), which is used to index PubMed articles.

This paper is organized as follows: Section 2 discusses related work. Section 3 gives the methodological details of our suggested approach, and Section 4 presents its evaluation results and gives an outlook on future work.

2. Related work

Query Expansion (QE) plays a crucial role in improving Web searches. The user's initial query is reformulated by adding additional meaningful terms with similar significance. There are many query expansion techniques:

Linguistic analysis [7] - [8]: deals with each query keyword separately from the others, using for example the lexical database WordNet [9] - [10] - [11], which has a limited coverage of concepts [12] and a very small number of relationships (synonyms, hypernyms, and hyponyms). Consequently, this kind of technique cannot solve ambiguity issues [13];

Query-log analysis: exploits the information in log files of earlier queries, such as the click activity of the user. But this technique requires large logs [14];

Linked Data techniques [15]: take into consideration the context of keywords. In [16] the authors explore just a small number of DBpedia properties, which means that important properties may not have been exploited. In [17] and [18] DBpedia is used to expand queries by using indexed terms from feedback documents that share similar DBpedia features with query terms.

In the medical domain, Linked Data allow matching terms used by patients to those used by domain experts. In [19], the authors used the Unified Medical Language System (UMLS) database to determine synonyms for phrases within the user query. In [20], the authors expanded medical queries using only the MeSH thesaurus. After that, they extracted documents based on the similarity between those expanded queries and clusters of medical documents. In our previous work [21], we used attribute (feature) values from Wikidata to expand medical queries. For this purpose, we considered only values that contained a query term. However, Wikidata is not domain specific; thus it lacks emphasis on medical data. But since Wikidata has links to numerous ontologies and databases from different domains, we decided to exploit one of those links that is specific to the medical domain: the PubMed link.
3. Proposed method

Be it domain dependent or independent, linked data or not, every external source has its advantages, its limits, and its specificities. As a result, we suggest in this work a medical query expansion approach (Figure 1) that combines various sources, including two knowledge bases from Linked Data, a medical database, and a medical thesaurus, as explained in the following steps (a code-level sketch of these steps is given after Figure 1):

1. We first look for the longest n-gram that covers most (if not all) DBpedia entities within the query and returns results in the Wikidata search engine. In case the n-gram does not feature all of the entities within the query, we use other n-grams too, featuring those entities. Table 1 shows an example of the queries used from the MEDLINE collection. Most of those queries are long (more than 4 keywords) and consist of many sentences. As a result, we must shorten them to avoid getting no results at all, while making sure that we keep the most valuable keywords.
2. Then, we search the n-gram(s) in Wikidata.
3. After that, we browse the PubMed identifier ("PubMed ID") of the first result in Wikidata.
4. Next, we perform Named Entity Recognition on the PubMed abstract of the previously browsed page, using DBpedia.
5. Then, we consider only DBpedia entities within the PubMed abstract as candidates for expansion, along with their associated labels ("rdfs:label") in DBpedia.
6. Finally, we expand the query using the entities, as well as their associated labels from the previous step, that are also not available in the MeSH terms of the PubMed page.

Table 1
Example of a long query, from the MEDLINE dataset, containing several sentences.

Query number: 14
Query content: renal amyloidosis as a complication of tuberculosis and the effects of steroids on this condition. only the terms kidney diseases and nephrotic syndrome were selected by the requester. prednisone and prednisolone are the only steroids of interest.

Figure 1: The flowchart of our suggested Query Expansion approach
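To make the six steps above concrete, here is a minimal sketch of how they could be wired together. It reuses the annotate_with_spotlight helper sketched in Section 1 and assumes that the PubMed link of step 3 is exposed as a Wikidata external-identifier claim (the property ID below is a placeholder to verify), that the abstract of step 4 can be fetched through NCBI E-utilities, and that labels are read from the public DBpedia SPARQL endpoint. None of these implementation choices are prescribed by the paper itself.

```python
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"
EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
DBPEDIA_SPARQL = "https://dbpedia.org/sparql"
PUBMED_ID_PROP = "P698"  # assumption: the external-identifier property holding the PubMed link

def first_wikidata_item(ngram):
    """Step 2: search the n-gram in Wikidata and keep the first result."""
    r = requests.get(WIKIDATA_API, params={
        "action": "wbsearchentities", "search": ngram,
        "language": "en", "format": "json"}, timeout=30).json()
    return r["search"][0]["id"] if r.get("search") else None

def pubmed_id_of(item_id):
    """Step 3: read the PubMed identifier claimed on the Wikidata item, if any."""
    r = requests.get(WIKIDATA_API, params={
        "action": "wbgetclaims", "entity": item_id,
        "property": PUBMED_ID_PROP, "format": "json"}, timeout=30).json()
    claims = r.get("claims", {}).get(PUBMED_ID_PROP, [])
    return claims[0]["mainsnak"]["datavalue"]["value"] if claims else None

def pubmed_abstract(pmid):
    """Step 4 input: fetch the abstract text from PubMed via NCBI E-utilities."""
    r = requests.get(EUTILS_EFETCH, params={
        "db": "pubmed", "id": pmid,
        "rettype": "abstract", "retmode": "text"}, timeout=30)
    return r.text

def dbpedia_labels(uri):
    """Step 5: retrieve the English rdfs:label values of a DBpedia entity."""
    query = (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> "
        f'SELECT ?l WHERE {{ <{uri}> rdfs:label ?l . FILTER(lang(?l) = "en") }}'
    )
    r = requests.get(DBPEDIA_SPARQL, params={
        "query": query, "format": "application/sparql-results+json"}, timeout=30).json()
    return [b["l"]["value"] for b in r["results"]["bindings"]]

def expand_query(query, ngram, mesh_terms):
    """Steps 2-6: expand the query with abstract entities and labels not already in MeSH."""
    item_id = first_wikidata_item(ngram)
    pmid = pubmed_id_of(item_id) if item_id else None
    if not pmid:
        return query  # no PubMed link found: leave the query unchanged
    abstract = pubmed_abstract(pmid)
    known = {m.lower() for m in mesh_terms}
    expansion = []
    for surface, uri in annotate_with_spotlight(abstract):  # helper from the Section 1 sketch
        for term in [surface] + dbpedia_labels(uri):
            if term.lower() not in known:  # step 6: skip terms already among the MeSH terms
                expansion.append(term)
    return query + " " + " ".join(dict.fromkeys(expansion))  # deduplicate, keep order
```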
4. Results and discussion

To evaluate our approach, we used the MEDLINE collection (Table 2). It is a set of articles from a medical journal that we indexed, with a stop-word list, using the Indri search engine.

Table 2
Description of the MEDLINE collection

Total number of texts: 1033
Number of topics: 30
Total number of tokens: 159970
Total number of distinct (unique) tokens: 13113
Average number of tokens per text: 100

4.1. Retrieval model

For the implementation of our approach, we used the Kullback-Leibler (KL) [22] IR model [23]. In KL (1), we compare the document's model with the query's model:

D_{KL}(P \| Q) = \sum_{x \in X} P(x) \log \frac{P(x)}{Q(x)}    (1)

where P and Q are discrete probability distributions defined on the same probability space. We use Dirichlet smoothing to avoid getting a null result when a term is not present in the created language model.
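To make the previous paragraph concrete, the sketch below scores one document against a query with the KL model of equation (1) and Dirichlet smoothing. It is a toy re-implementation for illustration, not the Indri/Lemur code actually used; the smoothing parameter mu = 2500 is a common default and an assumption here.

```python
import math
from collections import Counter

def kl_score(query_terms, doc_terms, collection_terms, mu=2500.0):
    """Rank-equivalent KL-divergence score (eq. 1): sum over query terms of
    P(w|query) * log P(w|document), with the document model Dirichlet-smoothed
    by the collection model so that unseen terms do not zero out the score."""
    q_tf, d_tf, c_tf = Counter(query_terms), Counter(doc_terms), Counter(collection_terms)
    q_len, d_len, c_len = len(query_terms), len(doc_terms), len(collection_terms)
    score = 0.0
    for w, qf in q_tf.items():
        p_wq = qf / q_len                              # query model P(x)
        p_wc = c_tf[w] / c_len                         # collection (background) model
        p_wd = (d_tf[w] + mu * p_wc) / (d_len + mu)    # Dirichlet-smoothed document model Q(x)
        if p_wd > 0.0:                                 # guard: a term absent from the whole collection would give log(0)
            score += p_wq * math.log(p_wd)
    # Higher is better; this differs from -D_KL(P||Q) only by a query-dependent constant.
    return score

# Example: score a toy document for a short medical query.
collection = "renal amyloidosis tuberculosis steroids prednisone kidney disease therapy".split()
print(kl_score("renal amyloidosis steroids".split(),
               "steroids in renal amyloidosis".split(), collection))
```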
4.2. Evaluation metrics

In this work we used the following evaluation measures (a toy computation of these measures is sketched at the end of this subsection):

• Precision (2): a measure that indicates how efficient a system is at retrieving only relevant documents [24]:

\mathrm{Precision} = \frac{\text{Number of relevant retrieved documents}}{\text{Number of retrieved documents}}    (2)

Precision at rank N is evaluated by considering only the top results returned by the system.

• Mean Average Precision (MAP) (3): the MAP for a set of queries is the mean of the Average Precision (AveP) scores over every query [25]:

\mathrm{MAP} = \frac{\sum_{q=1}^{Q} \mathrm{AveP}(q)}{Q}    (3)

where Q is the number of queries, and:

\mathrm{AveP} = \frac{\sum_{k=1}^{n} P(k)\,\mathrm{rel}(k)}{\text{Number of relevant documents}}    (4)

where rel(k) is equal to 1 if the element at rank k is a relevant document, and zero otherwise [25].

• Normalized Discounted Cumulative Gain (nDCG) (5): measures the quality of the ranking by dividing the Discounted Cumulative Gain (DCG) by the Ideal Discounted Cumulative Gain (IDCG) [26]:

\mathrm{NDCG}_p = \frac{\mathrm{DCG}_p}{\mathrm{IDCG}_p}    (5)

where:

\mathrm{DCG}_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2(i+1)}    (6)

with rel_i the relevance score of document i, obtained after document retrieval using an IR model, and:

\mathrm{IDCG}_p = \sum_{i=1}^{|REL|} \frac{2^{rel_i} - 1}{\log_2(i+1)}    (7)

where |REL| is the list of relevant documents ranked by their relevance in the corpus.

• Mean Reciprocal Rank (MRR) (8): the Reciprocal Rank (RR) is the multiplicative inverse of the rank of the first exact answer [27], and the MRR is the average of the RR over multiple queries Q [27]:

\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}    (8)

where rank_i is the rank position of the first relevant document for the i-th query [27].
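The small sketch below shows, for one query with binary relevance judgments, how the measures defined in equations (2)-(8) can be computed; MAP and MRR are then simply means over queries. It is a toy illustration rather than the evaluation tooling used for the numbers reported in the next subsection.

```python
import math

def precision_at(k, ranked, relevant):
    """Precision@k (eq. 2): share of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(ranked, relevant):
    """AveP (eq. 4): mean of P(k) over the ranks k where a relevant document appears."""
    hits, total = 0, 0.0
    for k, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

def reciprocal_rank(ranked, relevant):
    """RR (used by eq. 8): inverse rank of the first relevant document."""
    for k, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / k
    return 0.0

def ndcg_at(k, ranked, relevant):
    """nDCG@k (eqs. 5-7) with binary gains rel_i in {0, 1}."""
    dcg = sum((2 ** (1 if d in relevant else 0) - 1) / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1))
    idcg = sum(1 / math.log2(i + 1) for i in range(1, min(k, len(relevant)) + 1))
    return dcg / idcg if idcg else 0.0

# MAP (eq. 3) and MRR (eq. 8) are the means of average_precision and reciprocal_rank
# over all queries in the test set.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(precision_at(2, ranked, relevant), average_precision(ranked, relevant),
      reciprocal_rank(ranked, relevant), ndcg_at(4, ranked, relevant))
```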
4.3. Results and discussion

To evaluate our method (see Tables 3 and 4 and Figure 2), we first compared it with the "Wikidata expansion approach" [21], as we consider our work to be quite comparable to [21]: both works use Wikidata and are suitable for long queries. We also compared our approach with a non-expansion approach (baseline) and with a DBpedia method that uses the DBpedia labels of entities within the query for expansion. Second, we compared our work with "Clusters' Retrieval Derived from Expanding Statistical Language Modeling Similarity and Thesaurus-Query Expansion with Thesaurus" (CRDESLM-QET) [20], because it uses MeSH terms and is thus comparable to our work.

We chose to compare the approaches at 30 for most evaluation measures, and at 10 or 20 for precision, because users are more interested in the top results.

Table 3
Comparison between the "Wikidata expansion approach" [21] and our suggested query expansion approach, using the KL retrieval model on the MEDLINE collection in the Indri search engine.

Approach        P@20   MAP@30  MRR@30  NDCG@30
Baseline        0.461  0.446   0.818   0.637
DBpedia         0.483  0.450   0.837   0.645
Wikidata [21]   0.471  0.442   0.821   0.627
Our approach    0.525  0.500   0.844   0.671

Table 4
Comparison between our approach and an approach from related work [20]

Approach            P@10   MAP
CRDESLM-QET [20]    0.500  0.361
Our approach        0.600  0.567

Figure 2 shows the impact of using low and high values of C on P@20; we varied the number of expansion concepts C over 1, 2, 5, 10, 15, and 20.

Figure 2: P@20 for different numbers of expansion concepts (x-axis: number of expansion concepts; y-axis: P@20)

Based on the results in Table 3, our approach outperformed the state-of-the-art work in [21]. The use of KL to retrieve documents improved the P@20 of our "multi semantic sources expansion" approach compared to the "Wikidata expansion approach" [21] by 5.4%. Also, our expansion approach in this work gave a 5.8% improvement in terms of MAP, a 2.3% improvement in terms of MRR, and a 4.4% increase in terms of NDCG compared to the "Wikidata expansion approach" [21]. Similarly, our approach improved the results of both the baseline and the DBpedia approach, which performs better than Wikidata.

From Table 4, our approach outperforms CRDESLM-QET [20] in terms of P@10 by 10% and improves the MAP of CRDESLM-QET [20] by 20.6%.

From Figure 2, we noticed that using lower numbers of concepts, especially C = 5, leads to better results compared to using higher numbers. We think that by increasing the number of expansion terms, we increase the possibility of adding non-relevant terms to the query.

We believe that DBpedia improves on the Wikidata results because, in the DBpedia approach, we use labels that carry important information for the extraction of documents that are relevant but use different terms to refer to the user's query, whereas the Wikidata approach uses terms that may lead to the extraction of documents that are related to the query but do not necessarily correspond to the user's intent.

Moreover, we reckon that our approach outperformed the Wikidata expansion approach [21] because, unlike the previous work [21] that uses only Wikidata to expand queries, our multi semantic sources expansion approach benefits from several semantic sources, some of which are general or domain independent (DBpedia, Wikidata) while others are related to the medical domain (PubMed and MeSH). So, along with Wikidata, we decided to use, in this work, some domain-specific databases by taking advantage of identifier links (e.g., PubMed ID) that are available in almost every Wikidata page of a given resource or concept. And we had promising results because PubMed is one of the most valuable sources in the medical domain. Furthermore, our approach can be applied to queries of any domain by switching to other identifiers depending on the domain of the query.

As for CRDESLM-QET [20], it did not lead to high results because it uses only MeSH terms. Although MeSH terms are domain specific, they are very short (formed of few words) compared to PubMed abstracts. Also, MeSH is only a thesaurus that follows a tree structure. Consequently, it is not rich in terms of vocabulary compared to linked data sources.

In the future, we consider using other domain-specific linked data sources, such as UMLS, for comparison purposes.

5. Conclusion

Throughout the lockdown that occurred in nearly all countries, medical queries became some of the most trending ones. As a matter of fact, the need for relevant search results in this particular domain, at this moment, pushed us to give more attention to this field and do research in it.

Our approach relies on various sources to determine expansion concepts. Two of these sources are LOD, and the others are a search engine on medical databases (PubMed) and a controlled vocabulary (MeSH). Since our suggested expansion approach, which uses domain-independent as well as domain-dependent semantic sources, outperforms our DBpedia approach and the expansion approaches from earlier works [20] and [21], we may say that multiplying semantic sources in Automatic Query Expansion and exploiting domain-specific sources, like PubMed and MeSH, helps improve retrieval results. Furthermore, using low numbers of expansion concepts helps improve retrieval results. Moreover, our new approach can be used for any collection of documents, and not only for collections in the medical domain, because Wikidata varies the identifier links to a resource in other databases depending on the domain of the query.

In the future, we will try to further improve the results using other specific databases.
References

[1] Coronavirus search trends. Retrieved April 18, 2020, from https://trends.google.com/trends/story/
[2] Bouziri, A., Latiri, C., Gaussier, É.: Expansion de requêtes par apprentissage. Conférence en Recherche d'Informations et Applications (2016)
[3] Keikha, A., Ensan, F., Bagheri, E.: Query expansion using pseudo relevance feedback on wikipedia (2017)
[4] Linked open data. Retrieved April 18, 2020, from https://wiki.digitalclassicist.org/Linked_open_data
[5] DBpedia version 2016-04 | DBpedia. Retrieved October 31, 2020, from https://wiki.dbpedia.org/dbpedia-version-2016-04
[6] Wikidata. Retrieved October 31, 2020, from https://www.wikidata.org/wiki/Wikidata:Main_Page
[7] Moreau, F., Claveau, V., Sébillot, P.: Automatic morphological query expansion using analogy-based machine learning. ECIR'07 - 29th Eur. Conf. Inf. Retr., pp. 222–233 (2007)
[8] Bhogal, J., Macfarlane, A., Smith, P.: A review of ontology based query expansion. Inf. Process. Manag., vol. 43, no. 4, pp. 866–886 (2007)
[9] Jain, A., Mittal, K., Tayal, D. K.: Automatically incorporating context meaning for query expansion using graph connectivity measures. Progress in Artificial Intelligence, vol. 2, issue 2–3, pp. 129–139 (2014)
[10] Azad, H. K., Deepak, A.: A New Approach for Query Expansion using Wikipedia and WordNet. arXiv preprint arXiv:1901.10197 (2019)
[11] Dahir, S., Khalifi, H., El Qadi, A.: Query Expansion Using DBpedia and WordNet. In: Proceedings of the ArabWIC 6th Annual International Conference Research Track, pp. 1–6 (2019)
[12] Sinha, R., Mihalcea, R.: Unsupervised graph-based word sense disambiguation using measures of word semantic similarity. In: Proceedings of ICSC (2007)
[13] Carpineto, C., Romano, G.: A Survey of Automatic Query Expansion in Information Retrieval. ACM Comput. Surv., vol. 44, no. 1, pp. 1–50 (2012)
[14] Guisado-Gámez, J., Dominguez-Sal, D., Larriba-Pey, J.-L.: Massive Query Expansion by Exploiting Graph Knowledge Bases for Image Retrieval. Proc. Int. Conf. Multimed. Retr., pp. 33:33–33:40 (2014)
[15] Abbes, R., et al.: Apport du Web et du Web de Données pour la recherche d'attributs. Conférence en Recherche d'Information et Applications - CORIA (2013)
[16] Augenstein, I., Gentile, A. L., Norton, B., Zhang, Z., Ciravegna, F.: Mapping Keywords to Linked Data Resources for Automatic Query Expansion. In: The Semantic Web: ESWC 2013 Satellite Events. Lecture Notes in Computer Science, vol. 7955. Springer, Berlin, Heidelberg (2013)
[17] Dahir, S., El Qadi, A., Bennis, H.: Enriching User Queries Using DBpedia Features and Relevance Feedback. Procedia Computer Science, vol. 127, pp. 499–504 (2018)
[18] Dahir, S., El Qadi, A., Bennis, H.: An Association Based Query Expansion Approach Using Linked Data. In: 2018 9th International Symposium on Signal, Image, Video and Communications (ISIVC), pp. 340–344. IEEE (2018)
[19] Le Maguer, S., Hamon, T., Grabar, N., Claveau, V.: Recherche d'information médicale pour le patient : impact de ressources terminologiques. In: COnférence en Recherche d'Information et Applications, CORIA 2015, Paris, France. Actes de la conférence CORIA (2015)
[20] Keyvanpour, M., Serpush, F.: ESLMT: a new clustering method for biomedical document retrieval. Biomedical Engineering/Biomedizinische Technik, 64(6), pp. 729–741 (2019)
[21] Dahir, S., El Qadi, A., Bennis, H.: Query expansion using Wikidata attributes' values. In: Third International Conference on Computing and Wireless Communication Systems, ICCWCS 2019. European Alliance for Innovation (EAI) (2019)
[22] Boughanem, M., Kraaij, W., Nie, J. Y.: Modèles de langue pour la recherche d'information. In: Les systèmes de recherche d'informations, Majid Ihadjadene (Ed.), Hermes-Lavoisier, pp. 163–182 (2004)
[23] Lemur Retrieval Applications. http://www.lemurproject.org/lemur/retrieval.php
[24] Common Evaluation Measures. https://trec.nist.gov/pubs/trec10/appendices/measures.pdf
[25] Wikipedia contributors: Evaluation measures (information retrieval). Wikipedia, The Free Encyclopedia, 23 March 2019. Retrieved April 17, 2019
[26] Goharian, N.: Information Retrieval Evaluation, COSC 488. https://www.coursehero.com/file/8847955/Evaluation/
[27] Wikipedia contributors: Mean reciprocal rank. Wikipedia, The Free Encyclopedia, 6 December 2018. Retrieved April 28, 2020, from https://en.wikipedia.org/w/index.php?title=Mean_reciprocal_rank&oldid=872349108