Medical Query Expansion using Semantic Sources DBpedia and Wikidata

Sarah Dahir (a), Jalil ElHassouni (b), Abderrahim El Qadi (c), Hamid Bennis (a)

(a) IMAGE Laboratory, SCIAM Team, Graduate School of Technology, Moulay Ismail University of Meknes, Morocco
(b) LRIT-CNRST (URAC'29), Faculty of Sciences, Rabat IT Center, Mohammed V University in Rabat, Morocco
(c) ENSAM, Mohammed V University in Rabat, Morocco

ISIC 2021: International Semantic Intelligence Conference
Email: sarah.dahir2012@gmail.com (S. Dahir); abderrahim.elqadi@um5.ac.ma (A. El Qadi); hamid.bennis@gmail.com (H. Bennis)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

Abstract
Query Expansion (QE) is known for its effectiveness in increasing query relevance in Information Retrieval (IR) systems, and Linked Open Data (LOD) are currently used in different domains for various objectives, such as suggesting alternative options to users based on features of their previous searches and interests. We therefore suggest an approach to enhance IR in the medical domain through QE using two LOD bases: DBpedia and Wikidata. We use DBpedia entities found within a PubMed abstract as candidates for expansion, along with their associated labels ("rdfs:label") in the DBpedia base. We evaluate our suggested approach using the MEDLINE collection and the Indri search engine. Our expansion approach leads to significant improvements, especially in terms of precision and Mean Average Precision (MAP), compared to related approaches that use only one domain-dependent or domain-independent source.

Keywords: DBpedia, Information Retrieval, PubMed, Query Expansion, Wikidata.

1. Introduction

Information Retrieval Systems (IRS) match the user query to a collection of documents. As a result, a subset of documents is returned. This subset is considered relevant because it contains the query terms. But sometimes words from the user query are different from those contained in the relevant document set. This issue has been shown in various studies, one of them from the medical field. Covid-19 symptoms (fever, sore throat, shortness of breath, loss of taste, and loss of smell), as well as testing for coronavirus and preventive measures (face mask, hand sanitizer, social distancing, and hand washing), have become some of the most trending queries, along with other search trends related to the aftermath of the pandemic on several other domains such as the economy (e.g., unemployment and the stock market) and education (e.g., school closures) [1]. For instance, queries on the loss of smell reached 8% on March 23rd, 2020, and queries on testing for corona reached 97% on April 13th [1].

The lockdown caused by the pandemic increased, more than ever, our need for better IR for medical queries in general, especially since this type of query lacks the technical terms that domain experts use in web pages. This problem is often referred to as vocabulary mismatch.

One way to overcome this problem is to use query expansion. This process is done by adding new terms to the user query based on association rules between the terms [2]. However, adding many terms to the query can be more harmful than adding a few [3].

Linked Data (http://linkeddata.org/) take advantage of the Web to connect related data [4]. For this purpose, Uniform Resource Identifiers (URI) and the Resource Description Framework (RDF) are used, among other technologies and Linked Data standards. Some of these data sources are open and others require a license agreement:

• DBpedia: a knowledge base that contains structured information from Wikipedia. This knowledge base describes 6 million entities, including 5,000 diseases [5]. It allows, among other things, annotation of a text through the Web interface DBpedia Spotlight (https://www.dbpedia-spotlight.org/demo/), which performs Named Entity Recognition (a minimal example of calling this service is sketched after this list). Yet, we noticed throughout our multiple accesses to DBpedia Spotlight that the annotation stops functioning from time to time. To be more precise, on three occasions over four years we were unable to annotate texts using this Web application, and whenever it stops functioning, it stays that way for three to four days in a row.
• Wikidata: one of the largest datasets. It is a free knowledge database project hosted by Wikimedia, with 90,478,674 data items [6] (including concepts). Unlike other knowledge bases, Wikidata may be edited by users. Furthermore, it usually gives links that allow browsing a resource in other databases such as MeSH, PubMed, Freebase, etc.
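As an aside on the DBpedia bullet above, the following minimal sketch shows how such an annotation call could be made against the public DBpedia Spotlight REST endpoint. The endpoint URL, the confidence value, and the JSON field names reflect the public demo service as we understand it; they are assumptions for illustration, not details prescribed by this paper.

```python
import requests

def annotate_with_spotlight(text, confidence=0.5):
    """Annotate free text with DBpedia entities through the public Spotlight REST API
    (assumption: the demo endpoint below is reachable; as noted above, it is sometimes down)."""
    response = requests.get(
        "https://api.dbpedia-spotlight.org/en/annotate",
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    response.raise_for_status()
    resources = response.json().get("Resources", [])
    # Each resource carries the surface form found in the text and the DBpedia URI it was linked to.
    return [(r["@surfaceForm"], r["@URI"]) for r in resources]

# Example: annotate a fragment of a medical query.
print(annotate_with_spotlight("renal amyloidosis as a complication of tuberculosis"))
```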
In this work, we suggest expanding queries using two linked data sources (DBpedia, Wikidata) along with a search engine (PubMed) that allows searching the MEDLINE database, and the National Library of Medicine (NLM) controlled vocabulary thesaurus (Medical Subject Headings, MeSH), which is used to index PubMed articles.

This paper is organized as follows: Section 2 discusses related work. Section 3 gives the methodological details of our suggested approach, and Section 4 presents its evaluation results and gives an outlook on future work.

2. Related work

Query Expansion (QE) plays a crucial role in improving Web searches. The user's initial query is reformulated by adding additional meaningful terms with similar significance. There are many query expansion techniques:

Linguistic analysis [7] - [8]: deals with each query keyword separately from the others, using for example the lexical database WordNet [9] - [10] - [11], which has a limited coverage of concepts [12] and a very small number of relationships (synonyms, hypernyms, and hyponyms). Consequently, this kind of technique cannot solve ambiguity issues [13];

Query-log analysis: exploits the information in log files of earlier queries, such as the click activity of the user. But this technique requires large logs [14];

Linked Data techniques [15]: take into consideration the context of keywords. In [16] the authors explore just a small number of DBpedia properties, which means that important properties may not have been exploited. In [17] and [18] DBpedia is used to expand queries by using indexed terms from feedback documents that share similar DBpedia features with query terms.

In the medical domain, Linked Data allow matching terms used by patients to those used by domain experts. In [19], the authors used the Unified Medical Language System (UMLS) database to determine synonyms for phrases within the user query. In [20], the authors expanded medical queries using only the MeSH thesaurus. After that, they extracted documents based on the similarity between those expanded queries and clusters of medical documents. In our previous work [21], we used attribute (feature) values from Wikidata to expand medical queries. For this purpose, we considered only values that contained a query term. However, Wikidata is not domain specific; thus it lacks emphasis on medical data. But since Wikidata has links to numerous ontologies and databases from different domains, we decided to exploit one of those links that is specific to the medical domain: the PubMed link.
3. Proposed method

Be it domain dependent or independent, linked data or not, every external source has its advantages, its limits, and its specificities. As a result, we suggest in this work a medical query expansion approach (Figure 1) that combines various sources, including two knowledge bases from Linked Data, a medical database, and a medical thesaurus, as explained in the following steps (a code-level sketch of these steps is given after Figure 1):

1. We first look for the longest n-gram that covers most (if not all) DBpedia entities within the query and returns results in the Wikidata search engine. In case the n-gram does not feature all of the entities within the query, we use other n-grams too, featuring those entities. Table 1 shows an example of the queries used from the MEDLINE collection. Most of those queries are long (more than 4 keywords) and consist of many sentences. As a result, we must shorten them to avoid getting no results at all, while making sure that we keep the most valuable keywords.
2. Then, we search the n-gram(s) in Wikidata.
3. After that, we browse the PubMed identifier ("PubMed ID") of the first result in Wikidata.
4. Next, we perform Named Entity Recognition on the PubMed abstract of the previously browsed page, using DBpedia.
5. Then, we consider only DBpedia entities within the PubMed abstract as candidates for expansion, along with their associated labels ("rdfs:label") in DBpedia.
6. Finally, we expand the query using the entities, as well as their associated labels from the previous step, that are also not available in the MeSH terms of the PubMed page.

Table 1
Example of a long query, from the MEDLINE dataset, containing several sentences.

Query number: 14
Query content: renal amyloidosis as a complication of tuberculosis and the effects of steroids on this condition. only the terms kidney diseases and nephrotic syndrome were selected by the requester. prednisone and prednisolone are the only steroids of interest.

Figure 1: The flowchart of our suggested Query Expansion approach
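To make the six steps above concrete, here is a minimal sketch of how they could be wired together. It reuses the annotate_with_spotlight helper sketched in Section 1 and assumes that the PubMed link of step 3 is exposed as a Wikidata external-identifier claim (the property ID below is a placeholder to verify), that the abstract of step 4 can be fetched through NCBI E-utilities, and that labels are read from the public DBpedia SPARQL endpoint. None of these implementation choices are prescribed by the paper itself.

```python
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"
EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
DBPEDIA_SPARQL = "https://dbpedia.org/sparql"
PUBMED_ID_PROP = "P698"  # assumption: the external-identifier property holding the PubMed link

def first_wikidata_item(ngram):
    """Step 2: search the n-gram in Wikidata and keep the first result."""
    r = requests.get(WIKIDATA_API, params={
        "action": "wbsearchentities", "search": ngram,
        "language": "en", "format": "json"}, timeout=30).json()
    return r["search"][0]["id"] if r.get("search") else None

def pubmed_id_of(item_id):
    """Step 3: read the PubMed identifier claimed on the Wikidata item, if any."""
    r = requests.get(WIKIDATA_API, params={
        "action": "wbgetclaims", "entity": item_id,
        "property": PUBMED_ID_PROP, "format": "json"}, timeout=30).json()
    claims = r.get("claims", {}).get(PUBMED_ID_PROP, [])
    return claims[0]["mainsnak"]["datavalue"]["value"] if claims else None

def pubmed_abstract(pmid):
    """Step 4 input: fetch the abstract text from PubMed via NCBI E-utilities."""
    r = requests.get(EUTILS_EFETCH, params={
        "db": "pubmed", "id": pmid,
        "rettype": "abstract", "retmode": "text"}, timeout=30)
    return r.text

def dbpedia_labels(uri):
    """Step 5: retrieve the English rdfs:label values of a DBpedia entity."""
    query = (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> "
        f'SELECT ?l WHERE {{ <{uri}> rdfs:label ?l . FILTER(lang(?l) = "en") }}'
    )
    r = requests.get(DBPEDIA_SPARQL, params={
        "query": query, "format": "application/sparql-results+json"}, timeout=30).json()
    return [b["l"]["value"] for b in r["results"]["bindings"]]

def expand_query(query, ngram, mesh_terms):
    """Steps 2-6: expand the query with abstract entities and labels not already in MeSH."""
    item_id = first_wikidata_item(ngram)
    pmid = pubmed_id_of(item_id) if item_id else None
    if not pmid:
        return query  # no PubMed link found: leave the query unchanged
    abstract = pubmed_abstract(pmid)
    known = {m.lower() for m in mesh_terms}
    expansion = []
    for surface, uri in annotate_with_spotlight(abstract):  # helper from the Section 1 sketch
        for term in [surface] + dbpedia_labels(uri):
            if term.lower() not in known:  # step 6: skip terms already among the MeSH terms
                expansion.append(term)
    return query + " " + " ".join(dict.fromkeys(expansion))  # deduplicate, keep order
```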
4. Results and discussion

To evaluate our approach, we used the MEDLINE collection (Table 2). It is a set of articles from a medical journal that we indexed, with a stop-word list, using the Indri search engine.

Table 2
Description of the MEDLINE collection

Total number of texts: 1033
Number of topics: 30
Total number of tokens: 159970
Total number of distinct (unique) tokens: 13113
Average number of tokens per text: 100

4.1. Retrieval model

For the implementation of our approach, we used the Kullback-Leibler (KL) [22] IR model [23]. In KL (1), we compare the document's model with the query's model:

D_{KL}(P \| Q) = \sum_{x \in X} P(x) \log \frac{P(x)}{Q(x)}    (1)

where P and Q are discrete probability distributions defined on the same probability space. We use Dirichlet smoothing to avoid getting a null result when a term is not present in the created language model.
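To make the previous paragraph concrete, the sketch below scores one document against a query with the KL model of equation (1) and Dirichlet smoothing. It is a toy re-implementation for illustration, not the Indri/Lemur code actually used; the smoothing parameter mu = 2500 is a common default and an assumption here.

```python
import math
from collections import Counter

def kl_score(query_terms, doc_terms, collection_terms, mu=2500.0):
    """Rank-equivalent KL-divergence score (eq. 1): sum over query terms of
    P(w|query) * log P(w|document), with the document model Dirichlet-smoothed
    by the collection model so that unseen terms do not zero out the score."""
    q_tf, d_tf, c_tf = Counter(query_terms), Counter(doc_terms), Counter(collection_terms)
    q_len, d_len, c_len = len(query_terms), len(doc_terms), len(collection_terms)
    score = 0.0
    for w, qf in q_tf.items():
        p_wq = qf / q_len                              # query model P(x)
        p_wc = c_tf[w] / c_len                         # collection (background) model
        p_wd = (d_tf[w] + mu * p_wc) / (d_len + mu)    # Dirichlet-smoothed document model Q(x)
        if p_wd > 0.0:                                 # guard: a term absent from the whole collection would give log(0)
            score += p_wq * math.log(p_wd)
    # Higher is better; this differs from -D_KL(P||Q) only by a query-dependent constant.
    return score

# Example: score a toy document for a short medical query.
collection = "renal amyloidosis tuberculosis steroids prednisone kidney disease therapy".split()
print(kl_score("renal amyloidosis steroids".split(),
               "steroids in renal amyloidosis".split(), collection))
```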
4.2. Evaluation metrics

In this work we used the following evaluation measures (a toy computation of these measures is sketched at the end of this subsection):

• Precision (2): a measure that indicates how efficient a system is at retrieving only relevant documents [24]:

\mathrm{Precision} = \frac{\text{Number of relevant retrieved documents}}{\text{Number of retrieved documents}}    (2)

Precision at rank N is evaluated by considering only the top results returned by the system.

• Mean Average Precision (MAP) (3): the MAP for a set of queries is the mean of the Average Precision (AveP) scores over every query [25]:

\mathrm{MAP} = \frac{\sum_{q=1}^{Q} \mathrm{AveP}(q)}{Q}    (3)

where Q is the number of queries, and:

\mathrm{AveP} = \frac{\sum_{k=1}^{n} P(k)\,\mathrm{rel}(k)}{\text{Number of relevant documents}}    (4)

where rel(k) is equal to 1 if the element at rank k is a relevant document, and zero otherwise [25].

• Normalized Discounted Cumulative Gain (nDCG) (5): measures the quality of the ranking by dividing the Discounted Cumulative Gain (DCG) by the Ideal Discounted Cumulative Gain (IDCG) [26]:

\mathrm{NDCG}_p = \frac{\mathrm{DCG}_p}{\mathrm{IDCG}_p}    (5)

where:

\mathrm{DCG}_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2(i+1)}    (6)

with rel_i the relevance score of document i, obtained after document retrieval using an IR model, and:

\mathrm{IDCG}_p = \sum_{i=1}^{|REL|} \frac{2^{rel_i} - 1}{\log_2(i+1)}    (7)

where |REL| is the list of relevant documents ranked by their relevance in the corpus.

• Mean Reciprocal Rank (MRR) (8): the Reciprocal Rank (RR) is the multiplicative inverse of the rank of the first exact answer [27], and the MRR is the average of the RR over multiple queries Q [27]:

\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}    (8)

where rank_i is the rank position of the first relevant document for the i-th query [27].
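The small sketch below shows, for one query with binary relevance judgments, how the measures defined in equations (2)-(8) can be computed; MAP and MRR are then simply means over queries. It is a toy illustration rather than the evaluation tooling used for the numbers reported in the next subsection.

```python
import math

def precision_at(k, ranked, relevant):
    """Precision@k (eq. 2): share of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(ranked, relevant):
    """AveP (eq. 4): mean of P(k) over the ranks k where a relevant document appears."""
    hits, total = 0, 0.0
    for k, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

def reciprocal_rank(ranked, relevant):
    """RR (used by eq. 8): inverse rank of the first relevant document."""
    for k, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / k
    return 0.0

def ndcg_at(k, ranked, relevant):
    """nDCG@k (eqs. 5-7) with binary gains rel_i in {0, 1}."""
    dcg = sum((2 ** (1 if d in relevant else 0) - 1) / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1))
    idcg = sum(1 / math.log2(i + 1) for i in range(1, min(k, len(relevant)) + 1))
    return dcg / idcg if idcg else 0.0

# MAP (eq. 3) and MRR (eq. 8) are the means of average_precision and reciprocal_rank
# over all queries in the test set.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(precision_at(2, ranked, relevant), average_precision(ranked, relevant),
      reciprocal_rank(ranked, relevant), ndcg_at(4, ranked, relevant))
```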
4.3. Results and discussion

To evaluate our method (see Tables 3 and 4 and Figure 2), we first compared it with the "Wikidata expansion approach" [21], as we consider our work to be quite comparable to [21]: both works use Wikidata and are suitable for long queries. We also compared our approach with a non-expansion approach (baseline) and with a DBpedia method that uses the DBpedia labels of entities within the query for expansion. Second, we compared our work with "Clusters' Retrieval Derived from Expanding Statistical Language Modeling Similarity and Thesaurus-Query Expansion with Thesaurus" (CRDESLM-QET) [20], because it uses MeSH terms and is thus comparable to our work.

We chose to compare the approaches at 30 for most evaluation measures, and at 10 or 20 for precision, because users are more interested in the top results.

Table 3
Comparison between the "Wikidata expansion approach" [21] and our suggested query expansion approach, using the KL retrieval model on the MEDLINE collection in the Indri search engine.

Approach        P@20   MAP@30  MRR@30  NDCG@30
Baseline        0.461  0.446   0.818   0.637
DBpedia         0.483  0.450   0.837   0.645
Wikidata [21]   0.471  0.442   0.821   0.627
Our approach    0.525  0.500   0.844   0.671

Table 4
Comparison between our approach and an approach from related work [20]

Approach            P@10   MAP
CRDESLM-QET [20]    0.500  0.361
Our approach        0.600  0.567

Figure 2 shows the impact of using low and high values of C on P@20; we varied the number of expansion concepts C over 1, 2, 5, 10, 15, and 20.

Figure 2: P@20 for different numbers of expansion concepts (x-axis: number of expansion concepts; y-axis: P@20)

Based on the results in Table 3, our approach outperformed the state-of-the-art work in [21]. The use of KL to retrieve documents improved the P@20 of our "multi semantic sources expansion" approach compared to the "Wikidata expansion approach" [21] by 5.4%. Also, our expansion approach in this work gave a 5.8% improvement in terms of MAP, a 2.3% improvement in terms of MRR, and a 4.4% increase in terms of NDCG compared to the "Wikidata expansion approach" [21]. Similarly, our approach improved the results of both the baseline and the DBpedia approach, which performs better than Wikidata.

From Table 4, our approach outperforms CRDESLM-QET [20] in terms of P@10 by 10% and improves the MAP of CRDESLM-QET [20] by 20.6%.

From Figure 2, we noticed that using lower numbers of concepts, especially C = 5, leads to better results compared to using higher numbers. We think that by increasing the number of expansion terms, we increase the possibility of adding non-relevant terms to the query.

We believe that DBpedia improves on the Wikidata results because, in the DBpedia approach, we use labels that carry important information for the extraction of documents that are relevant but use different terms to refer to the user's query, whereas the Wikidata approach uses terms that may lead to the extraction of documents that are related to the query but do not necessarily correspond to the user's intent.

Moreover, we reckon that our approach outperformed the Wikidata expansion approach [21] because, unlike the previous work [21] that uses only Wikidata to expand queries, our multi semantic sources expansion approach benefits from several semantic sources, some of which are general or domain independent (DBpedia, Wikidata) while others are related to the medical domain (PubMed and MeSH). So, along with Wikidata, we decided to use, in this work, some domain-specific databases by taking advantage of identifier links (e.g., PubMed ID) that are available in almost every Wikidata page of a given resource or concept. And we had promising results because PubMed is one of the most valuable sources in the medical domain. Furthermore, our approach can be applied to queries of any domain by switching to other identifiers depending on the domain of the query.

As for CRDESLM-QET [20], it did not lead to high results because it uses only MeSH terms. Although MeSH terms are domain specific, they are very short (formed of few words) compared to PubMed abstracts. Also, MeSH is only a thesaurus that follows a tree structure. Consequently, it is not rich in terms of vocabulary compared to linked data sources.

In the future, we consider using other domain-specific linked data sources, such as UMLS, for comparison purposes.

5. Conclusion

Throughout the lockdown that occurred in nearly all countries, medical queries became some of the most trending ones. As a matter of fact, the need for relevant search results in this particular domain, at this moment, pushed us to give more attention to this field and do research in it.

Our approach relies on various sources to determine expansion concepts. Two of these sources are LOD, and the others are a search engine on medical databases (PubMed) and a controlled vocabulary (MeSH). Since our suggested expansion approach, which uses domain-independent as well as domain-dependent semantic sources, outperforms our DBpedia approach and the expansion approaches from earlier works [20] and [21], we may say that multiplying semantic sources in Automatic Query Expansion and exploiting domain-specific sources, like PubMed and MeSH, helps improve retrieval results. Furthermore, using low numbers of expansion concepts helps improve retrieval results. Moreover, our new approach can be used for any collection of documents, and not only for collections in the medical domain, because Wikidata varies the identifier links to a resource in other databases depending on the domain of the query.

In the future, we will try to further improve the results using other specific databases.
References

[1] Coronavirus search trends. Retrieved April 18, 2020, from https://trends.google.com/trends/story/
[2] Bouziri, A., Latiri, C., Gaussier, É.: Expansion de requêtes par apprentissage. Conférence en Recherche d'Informations et Applications (2016)
[3] Keikha, A., Ensan, F., Bagheri, E.: Query expansion using pseudo relevance feedback on wikipedia (2017)
[4] Linked open data. Retrieved April 18, 2020, from https://wiki.digitalclassicist.org/Linked_open_data
[5] DBpedia version 2016-04 | DBpedia. Retrieved October 31, 2020, from https://wiki.dbpedia.org/dbpedia-version-2016-04
[6] Wikidata. Retrieved October 31, 2020, from https://www.wikidata.org/wiki/Wikidata:Main_Page
[7] Moreau, F., Claveau, V., Sébillot, P.: Automatic morphological query expansion using analogy-based machine learning. ECIR'07 - 29th Eur. Conf. Inf. Retr., pp. 222–233 (2007)
[8] Bhogal, J., Macfarlane, A., Smith, P.: A review of ontology based query expansion. Inf. Process. Manag., vol. 43, no. 4, pp. 866–886 (2007)
[9] Jain, A., Mittal, K., Tayal, D. K.: Automatically incorporating context meaning for query expansion using graph connectivity measures. Progress in Artificial Intelligence, vol. 2, issue 2–3, pp. 129–139 (2014)
[10] Azad, H. K., Deepak, A.: A New Approach for Query Expansion using Wikipedia and WordNet. arXiv preprint arXiv:1901.10197 (2019)
[11] Dahir, S., Khalifi, H., El Qadi, A.: Query Expansion Using DBpedia and WordNet. In: Proceedings of the ArabWIC 6th Annual International Conference Research Track, pp. 1–6 (2019)
[12] Sinha, R., Mihalcea, R.: Unsupervised graph-based word sense disambiguation using measures of word semantic similarity. In: Proceedings of ICSC (2007)
[13] Carpineto, C., Romano, G.: A Survey of Automatic Query Expansion in Information Retrieval. ACM Comput. Surv., vol. 44, no. 1, pp. 1–50 (2012)
[14] Guisado-Gámez, J., Dominguez-Sal, D., Larriba-Pey, J.-L.: Massive Query Expansion by Exploiting Graph Knowledge Bases for Image Retrieval. Proc. Int. Conf. Multimed. Retr., pp. 33:33–33:40 (2014)
[15] Abbes, R., et al.: Apport du Web et du Web de Données pour la recherche d'attributs. Conférence en Recherche d'Information et Applications - CORIA (2013)
[16] Augenstein, I., Gentile, A. L., Norton, B., Zhang, Z., Ciravegna, F.: Mapping Keywords to Linked Data Resources for Automatic Query Expansion. In: The Semantic Web: ESWC 2013 Satellite Events. Lecture Notes in Computer Science, vol. 7955. Springer, Berlin, Heidelberg (2013)
[17] Dahir, S., El Qadi, A., Bennis, H.: Enriching User Queries Using DBpedia Features and Relevance Feedback. Procedia Computer Science, vol. 127, pp. 499–504 (2018)
[18] Dahir, S., El Qadi, A., Bennis, H.: An Association Based Query Expansion Approach Using Linked Data. In: 2018 9th International Symposium on Signal, Image, Video and Communications (ISIVC), pp. 340–344. IEEE (2018)
[19] Le Maguer, S., Hamon, T., Grabar, N., Claveau, V.: Recherche d'information médicale pour le patient : impact de ressources terminologiques. In: COnférence en Recherche d'Information et Applications, CORIA 2015, Paris, France. Actes de la conférence CORIA (2015)
[20] Keyvanpour, M., Serpush, F.: ESLMT: a new clustering method for biomedical document retrieval. Biomedical Engineering/Biomedizinische Technik, 64(6), pp. 729–741 (2019)
[21] Dahir, S., El Qadi, A., Bennis, H.: Query expansion using Wikidata attributes' values. In: Third International Conference on Computing and Wireless Communication Systems, ICCWCS 2019. European Alliance for Innovation (EAI) (2019)
[22] Boughanem, M., Kraaij, W., Nie, J. Y.: Modèles de langue pour la recherche d'information. In: Les systèmes de recherche d'informations, Majid Ihadjadene (Ed.), Hermes-Lavoisier, pp. 163–182 (2004)
[23] Lemur Retrieval Applications. http://www.lemurproject.org/lemur/retrieval.php
[24] Common Evaluation Measures. https://trec.nist.gov/pubs/trec10/appendices/measures.pdf
[25] Wikipedia contributors: Evaluation measures (information retrieval). Wikipedia, The Free Encyclopedia, 23 March 2019. Retrieved April 17, 2019
[26] Goharian, N.: Information Retrieval Evaluation, COSC 488. https://www.coursehero.com/file/8847955/Evaluation/
[27] Wikipedia contributors: Mean reciprocal rank. Wikipedia, The Free Encyclopedia, 6 December 2018. Retrieved April 28, 2020, from https://en.wikipedia.org/w/index.php?title=Mean_reciprocal_rank&oldid=872349108