<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Hybrid query expansion using linguistic resources and word embeddings</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nesrine Ksentini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Siwar Zayani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohamed Tmar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Faiez Gargouri</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>MIRACL Laboratory, City ons Sfax, University of Sfax</institution>
          ,
          <addr-line>B.P.3023 Sfax</addr-line>
          <country country="TN">TUNISIA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Given the large amount of textual data on the web, traditional information retrieval and query expansion techniques such as pseudo-relevance feedback are not always helpful to optimize the retrieval process. In this paper, we study the use of term relatedness in the context of query expansion with a hybrid approach based on linguistic resources and word embeddings such as the distributed neural language model word2vec. We perform experiments on the Cystic Fibrosis Database. The obtained results are more robust than those of the baseline retrieval system.</p>
      </abstract>
      <kwd-group>
        <kwd>Semantic relationships</kwd>
        <kwd>Word embeddings</kwd>
        <kwd>Word2Vec</kwd>
        <kwd>Skip-Gram</kwd>
        <kwd>MeSH thesaurus</kwd>
        <kwd>Information retrieval</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Information Retrieval Systems (IRS) have attracted increasing research attention with the
proliferation of web data. The major challenge of an IRS is to return relevant documents that
meet the user's need (even if they do not contain the query terms) and to reject irrelevant documents
(even if they contain the query terms). Query expansion has become an important task to improve
IRS results. Query expansion methods based on Pseudo-Relevance Feedback (PRF) are widely
used and rely heavily on the assumption that the top-ranked documents in the initial search
are relevant and contain good terms for query expansion [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This assumption does not always
hold. To overcome this limitation, semantic relatedness between terms must be defined,
either by using linguistic resources or by using statistical methods, in order to select
relevant terms [
        <xref ref-type="bibr" rid="ref2 ref3 ref4 ref5 ref6 ref7">2, 3, 4, 5, 6, 7</xref>
        ].
      </p>
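      <p>The PRF expansion step described above can be illustrated with a minimal sketch (the term-scoring scheme, toy documents and function names here are illustrative assumptions, not the exact method used later in this paper):</p>

```python
from collections import Counter

def prf_expand(query_terms, ranked_docs, k=5, n_expansion=3):
    """Pseudo-relevance feedback: assume the top-k ranked documents are
    relevant and add their most frequent non-query terms to the query."""
    counts = Counter()
    for doc in ranked_docs[:k]:
        counts.update(t for t in doc.split() if t not in query_terms)
    expansion = [term for term, _ in counts.most_common(n_expansion)]
    return query_terms + expansion

docs = ["cystic fibrosis lung infection treatment",
        "lung infection antibiotic therapy",
        "weather report for tomorrow"]
expanded = prf_expand(["cystic", "fibrosis"], docs, k=2, n_expansion=2)
```

      <p>If the top-ranked documents are not actually relevant, the added terms drift away from the information need, which is exactly the limitation addressed in this paper.</p>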
      <p>Several semantic resources have been proposed that make it possible to model domain knowledge,
such as dictionaries, taxonomies, ontologies and thesauri. Among statistical methods, we focus in
this paper on word embeddings, a mapping that associates each word in a document
collection with a vector representation whose size is significantly lower than the size of the
vocabulary of the document collection [8, 9, 10]. By augmenting the original query with expansion
words derived from word embeddings, we can better represent the user's information need on a
specific topic.</p>
      <p>The remainder of this paper is organized as follows. In section 2, we present a literature review
of the different works that determine semantic relationships by hybrid approaches.
In section 3, we focus on describing the details of the proposed hybrid approach in the query
expansion process.</p>
      <p>In section 4, the evaluation process is presented and discussed. Finally, we draw conclusions
and outline future work in section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>2. SEMANTIC RELATIONSHIPS BY HYBRID APPROACHES</title>
      <p>Defining semantic relationships between terms has become a primary task in order to improve
the performance of IRS [11, 12]. Several works have been proposed; they can be classified into
three main categories: those based on external semantic resources such as
ontologies and thesauri, those based on the content of documents using
statistical measures [13], and hybrid approaches combining the first two categories.
In this section, we focus on defining semantic relationships by hybrid approaches.</p>
      <sec id="sec-2-1">
        <title>Works using hybrid approaches:</title>
        <p>Hybrid or mixed methods, which integrate both knowledge from a corpus of documents and external
semantic resources, are widely used to detect and evaluate semantic relationships.
In [11], the authors combine the results obtained by a method based on external semantic
resources and a method based on the content of returned documents to define semantic
relationships between terms.</p>
        <p>In fact, for the first method, the system determines the terms that are related to the term of the
initial query using the WordNet thesaurus which is a lexical database [14] developed by a team
of experts from the cognitive science laboratory.</p>
        <p>It is the most famous lexical resource in the English language.</p>
        <p>The terms in this database are organized as sets of synonyms that represent concepts, called
synsets [15, 16]. Each synset represents a specific meaning of a word. Generally, each term can
be associated with one or more synsets (polysemy). These are classified by several types of
semantic relations (hyponymy, hypernymy, meronymy and holonymy).</p>
        <p>Figure 1 shows a graphical visualization of the word "Calcium" in WordNet.1
In [17], the authors explore the Skip-gram model to determine the semantic relationships
between medical concepts. Unlike the traditional Skip-gram model, which allows the creation
of distributed vector representations of words, the model proposed in this study exploits the
distributed representations of UMLS concepts extracted from medical corpora, including clinical
records and abstracts from medical journals.</p>
        <p>The proposed system attempts to determine which terms are related by applying the context
method. They collect snippets containing the first term returned by the search engine. Then
they determine a context around these terms, replacing the term sought by the second term.
Finally, they assume that the two terms are synonymous if there is no change of context.
The authors of [18] extend the CBOW model to learn distinct representations of the different
meanings of a term by aligning them to terms in the WordNet database. To this end, a revised</p>
        <sec id="sec-2-1-1">
          <title>1 http://www.snappywords.com/</title>
          <p>architecture of the CBOW algorithm is proposed which makes it possible to jointly learn, in the same space,
both the term and the different associated candidate senses.</p>
          <p>In [19], the authors propose a hybrid approach using two different sources: the UMLS model
and the word2vec word vector model.</p>
          <p>They use a natural language processing tool to identify medical concepts present
in a query using UMLS. They explore word embeddings using the Skip-gram architecture to
find the two closest terms to the original query. The vector size was set to 1000, and a
vocabulary of 25,469 vectors was included in this model.</p>
          <p>In [20], the authors propose a method making it possible to link the terms used by patients, extracted
from the forum "cancerdusein.org", to those used by professionals in the medical field in a
vocabulary devoted to breast cancer developed by the French National Cancer Institute (INCa).
The originality of this approach is to use texts written by patients (PAT), collected on forums, to
build a consumer health vocabulary (CHV) in French. This method is structured in six
steps:
- Constitution of the corpus:</p>
          <p>Use of the "cancerdusein.org" forum on breast cancer, where most of its members are
patients or their loved ones. This forum facilitates the sharing of information with other
patients.
- Extraction of the candidate terms:</p>
          <p>Using the BioTex [21] tool to search for corpus terms that belong to the medical domain,
obtaining a set T = {t1, · · · , tn}.
- Spelling correction:</p>
          <p>This step corrects the spelling of all terms t ∈ T using the Aspell software,
in order to obtain a set C = {c1, c2, · · · , ck}, where k is the number of correction
proposals for a term t.</p>
          <p>The Levenshtein distance is used to compare the term t and each proposal c, choosing only terms
with a distance to t less than or equal to 2.
- Abbreviations:</p>
          <p>In the same way as the previous step, they search the whole set T for terms that correspond
to abbreviations, adapting the Carry algorithm with a list of common
suffixes used in the biomedical field.</p>
          <p>For a term t belonging to T, we obtain a set A = {a1, a2, · · · , am}, where m is the number
of proposed abbreviations included in T.
- Similarity between two terms:</p>
          <p>In this step, the authors determine the similarity between terms by using 3 methods:
- Consider a semantically structured resource.
- Consider the co-occurrences of terms from documents indexed by the Google search
engine (standard Google similarity).</p>
          <p>- Consider co-occurrences in patient messages using the Jaccard measure.
- Formalisation in SKOS:</p>
          <p>Finally, the authors use the relationships obtained in the previous steps to create a SKOS
(Simple Knowledge Organisation System) ontology. This ontology associates an INCa
term with the different patient terms: preferential terms are used to define the
MeSH term representing the expert term, alternative terms are used to represent abbreviations,
and hidden terms are used to represent spelling errors.</p>
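          <p>The distance-2 filter used in the spelling-correction step above can be sketched as follows (the proposal list is a made-up example, not data from [20]):</p>

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# Keep only correction proposals within distance 2 of the patient's term.
proposals = ["chemotherapy", "chimiotherapy", "radiotherapy"]
kept = [c for c in proposals if levenshtein("chimiotherapie", c) <= 2]
```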
          <p>This method is applied to the field of breast cancer and tested on the French
language, but it can be applied to many other areas and can be adapted to other languages.</p>
          <p>Hybrid approaches represent a compromise between linguistic approaches, which use knowledge
bases, and statistical ones.</p>
          <p>They exploit the precision of linguistic approaches and the robustness of statistical
approaches. Compared with purely statistical approaches, hybrid approaches are faster and
more independent. The use of language resources in hybrid approaches makes it
possible to obtain results that better satisfy the needs of users.</p>
          <p>Indeed, the intervention of linguists makes it possible to reduce the noise that can be generated
by statistical approaches.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Semantic relationships in the query expansion process</title>
      <p>IRS have evolved with the appearance of the Semantic Web and aim to exploit semantic
relationships between terms in order to enrich the user's initial query. To improve the user's query,
we integrate the semantic relationships defined by our proposed hybrid approach into
the query expansion process based on the pseudo-relevance feedback (PRF) technique.</p>
      <sec id="sec-3-1">
        <title>Hybrid definition of semantic relationships: MeSH + Word2Vec:</title>
        <p>In this subsection, we present the combination of the linguistic definition and the statistical
definition of semantic relationships. The study of the defined relationships proceeds as
follows: first, we search the MeSH thesaurus for synonyms of the terms in the initial
query (linguistic definition). Then we define their vector representations (see figure 2) resulting
from the Skip-gram algorithm (statistical definition).</p>
        <p>Afterwards, we measure the semantic similarity between them by calculating the cosine
between their corresponding term vectors.</p>
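        <p>The cosine-based selection can be sketched as follows (the vectors below are toy values; the real ones come from the trained Skip-gram model):</p>

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two term vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def top_n_terms(query_vec, candidates, n):
    """Keep the n candidate terms whose vectors have the largest cosine
    similarity with the query term's vector."""
    ranked = sorted(candidates, key=lambda t: cosine(query_vec, candidates[t]),
                    reverse=True)
    return ranked[:n]

query_vec = np.array([1.0, 0.0, 1.0])
candidates = {"mucoviscidosis": np.array([0.9, 0.1, 1.1]),
              "weather":        np.array([-1.0, 1.0, 0.0]),
              "lung":           np.array([0.5, 0.2, 0.6])}
bag = top_n_terms(query_vec, candidates, n=2)
```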
        <p>Finally, we keep only the first N terms with the largest cosine values to define a bag of words for
the query expansion process.
3.0.1. Semantic relationships with MeSH
Defining semantic relationships between terms is paramount to improving users' queries and search
quality. In our case, we try to find synonymy relations between terms of the initial query and
MeSH thesaurus concepts with three methods:
• scopeNote method: Method based on concept descriptions extracted from the MeSH
thesaurus. Indeed, for each term in the initial query, we select its description, which
represents the medical definition of the term.
• termList method: Method based on the list of associated terms (TermList). This method
tries to select synonymous terms that are semantically linked to terms of the initial query.
- If a term of the initial query is a MeSH concept, we take the list of synonymous terms
linked to this concept.</p>
        <p>- If a term of the initial query is a MeSH term, we take its parent concept.
• fusion method: This third method mixes the two previous methods. Indeed, we choose
this method to add more semantically related terms to the context of the initial query.
3.0.2. Semantic relationships with Word2Vec
The Word2vec model proposed by [8] is based on a neural network that learns vector
representations of terms in order to detect synonymous terms or suggest additional terms [9].
Word2vec is a group of related models that are used to produce word embeddings.
This model is based on a simple neural architecture and computational simplifications,
allowing the exploitation of a very large amount of textual data for
learning. Indeed, Word2vec takes as its input a large corpus of text and produces a vector
for each unique word in the corpus.</p>
        <p>This model has different parameters, the most important of which are:
- The choice of the learning model (1 for the Skip-Gram model and 0 for the CBOW model) [8].
- The dimension of the vector space to be constructed: it represents the number of numerical
descriptors used to describe terms in the corpus [8].
- The size of the context window of a term: it represents the number of terms surrounding
the word in question (the authors in [8] suggest using contexts of size 10 with the Skip-Gram
architecture and 5 with the CBOW architecture).</p>
        <p>In our case, we have trained the word2vec model using the gensim library [22] and the Skip-Gram
architecture. The basic idea of the Skip-Gram architecture is to use the current word in order to
predict the surrounding window of context words. The Skip-Gram architecture weighs nearby
context words more heavily than more distant context words.</p>
        <p>For example, suppose we have a vocabulary represented by the set of words (pseudomonas, aeruginosa,
infection, cystic, fibrosis), and the target word is "infection".</p>
        <p>The Skip-Gram architecture is as follows (see figure 3):
As input layer, we find the target word "infection" with its binary (one-hot) representation, whose
length is equal to 5 (the size of the vocabulary in this example).</p>
        <p>As output layer, we find four binary vectors corresponding to the words of the context:
“pseudomonas”, “aeruginosa”, “cystic”, “fibrosis”.</p>
        <p>The projection layer (hidden layer) h is represented by a weight matrix W whose rows represent
words in the vocabulary and whose columns represent hidden neurons. Before the training step, the W matrix
is initialized with small random values.</p>
        <p>We can calculate a score for each word of the vocabulary, which represents a correspondence
measure between the context window and the target word. This score is calculated as the scalar
product between the predicted representation and the target word representation.
Subsequently, we use the hierarchical Softmax activation function to determine which words
are similar to the target word. This prediction is then corrected using backpropagation for each
word in the context window.</p>
        <p>Indeed, we use backpropagation to find the optimal weights of the neural network. These weights
minimize the loss function by applying the gradient descent algorithm.
This backpropagation corrects the global matrix by bringing words closer to their
respective contexts. Finally, the vectors resulting from the learning step are used to define semantic
relationships.</p>
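        <p>The forward pass and the correction step described above can be sketched in a few lines of numpy (for readability this toy uses the full softmax rather than the hierarchical variant, and a single context word):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["pseudomonas", "aeruginosa", "infection", "cystic", "fibrosis"]
V, d = len(vocab), 3                  # vocabulary size, embedding dimension

W_in = rng.normal(0.0, 0.1, (V, d))   # input matrix: one row per word
W_out = rng.normal(0.0, 0.1, (d, V))  # output matrix: one column per word

target = vocab.index("infection")
h = W_in[target]                      # hidden layer = the target word's row
scores = h @ W_out                    # scalar-product score for each word
probs = np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary

# One backpropagation step toward predicting a context word ("cystic"):
context = vocab.index("cystic")
grad = probs.copy()
grad[context] -= 1.0                  # gradient of cross-entropy w.r.t. scores
dh = W_out @ grad                     # gradient w.r.t. the hidden layer
lr = 0.1
W_out -= lr * np.outer(h, grad)       # correct the output matrix
W_in[target] -= lr * dh               # correct the target word's vector
```

        <p>After the step, the probability assigned to the context word increases, which is the sense in which backpropagation brings words closer to their contexts.</p>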
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation</title>
      <p>We explored the effectiveness of our proposed method on the standard ad-hoc task using the
Cystic Fibrosis2 database of around 1000 documents. The evaluation model of the search system
is based on the evaluation model of the Cranfield project.
4.1. Cystic Fibrosis Database
The Cystic Fibrosis Database (CF) is composed of 1239 documents discussing cystic fibrosis
aspects, and a set of 100 queries with the respective relevant documents as answers. Documents
in this collection focus on cystic fibrosis disease. They present the symptoms, diagnoses, and
treatments of this disease.
4.2. Results
We have proposed a hybrid approach to define semantic relationships between terms in order to
improve search results. Our approach is based on the combination of the linguistic definition
and the statistical definition of semantic relationships. Once the relationships are defined, we
exploit them in the query expansion process.</p>
      <p>We present results for retrieval experiments in Table 1 for both methods of semantic relationships
definition.</p>
      <p>We find from these results that the best results are obtained when using scopeNote and termList
to define semantic relations for the linguistic method, and the Skip-Gram model with the Softmax
activation function for the statistical method.</p>
      <p>In order to improve the results obtained, we combined the linguistic method (fusion) with the</p>
      <sec id="sec-4-1">
        <title>2 https://people.ischool.berkeley.edu/~hearst/irbook/cfc.html</title>
        <p>statistical method (Skip-Gram) to define semantic relationships and subsequently integrate
them into the expansion process. We add to the initial query only related terms based on the
first 5 returned documents, adopting the Pseudo-Relevance Feedback (PRF) technique.
Table 2 shows the obtained results before and after query expansion.</p>
        <p>In order to check and validate the influence of using the defined semantic relations on the
performance of our IRS, we use the Student's t-test [23], with a significance threshold of p less than or equal to 0.05. This
test makes it possible to compare the means of two groups of samples. This validation is indicated by
(*) for the obtained results in table 2. We note from the obtained results in table 2 an improvement
of recall and precision, with a relevance rate equal to 11.23%. (According to [24], from a 5%
improvement in the relevance rate, we can consider that the system with expansion is better than
the base system.)</p>
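        <p>The significance check can be reproduced with a paired Student's t-test on per-query scores; the values below are made-up illustrations, not the actual measurements of table 2:</p>

```python
from scipy import stats

# Hypothetical per-query average-precision scores for the same 8 queries:
baseline = [0.31, 0.42, 0.28, 0.55, 0.37, 0.49, 0.33, 0.40]
expanded = [0.36, 0.47, 0.35, 0.58, 0.41, 0.52, 0.39, 0.45]

# Paired test: each query is evaluated under both systems.
t_stat, p_value = stats.ttest_rel(expanded, baseline)
significant = p_value <= 0.05   # threshold used in this paper
```

        <p>A paired (rather than independent) test is appropriate here because both systems are evaluated on the same set of queries.</p>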
        <p>The shape of the recall/precision curve for the 100 queries of the Cystic Fibrosis
Database is presented in figure 4, which corresponds to the precision at 11 recall points. To
better check the performance of our search system, we used the Student's t-test [23] and we obtained
a significant result with p &lt; 0.004 &lt; 0.05.</p>
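        <p>The 11-point curve of figure 4 is the standard interpolated precision at recall 0.0, 0.1, ..., 1.0; a sketch for a single query (with made-up documents) is:</p>

```python
def eleven_point_precision(ranked, relevant):
    """Interpolated precision at the 11 standard recall points for one query."""
    hits, prec_at_recall = 0, []
    for rank, doc in enumerate(ranked, 1):
        if doc in relevant:
            hits += 1
            prec_at_recall.append((hits / len(relevant), hits / rank))
    points = []
    for i in range(11):
        r = i / 10
        # Interpolation: best precision achieved at any recall level >= r.
        best = max((p for rec, p in prec_at_recall if rec >= r), default=0.0)
        points.append(best)
    return points

curve = eleven_point_precision(["d1", "d4", "d2", "d7", "d3"], {"d1", "d2", "d3"})
```

        <p>Averaging these per-query curves over the 100 queries gives the plotted recall/precision curve.</p>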
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and future works</title>
      <p>We present in this paper a hybrid query expansion approach using linguistic resources and word
embeddings. Indeed, we define semantic relationships between terms based on the combination
of the linguistic definition (MeSH) and the statistical definition (Skip-Gram). We look for
synonymy relations between terms in the initial query and concepts of the MeSH thesaurus. Then,
we apply an artificial neural network to learn continuous vector representations of words
that are able to capture semantic relations.</p>
      <p>Experiments performed on the Cystic Fibrosis Database show that the query expansion process
improves retrieval results. As future work, we will try to perform experiments on larger databases.</p>
    </sec>
    <sec id="sec-6">
      <title>6. References</title>
      <p>[8] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in
vector space, arXiv preprint arXiv:1301.3781 (2013).
[9] T. Mikolov, W.-t. Yih, G. Zweig, Linguistic regularities in continuous space word
representations, in: Proceedings of the 2013 conference of the north american chapter of
the association for computational linguistics: Human language technologies, 2013, pp.
746–751.
[10] D. Roy, D. Paul, M. Mitra, U. Garain, Using word embeddings for automatic query expansion,
arXiv preprint arXiv:1606.07608 (2016).
[11] E. Agirre, E. Alfonseca, K. Hall, J. Kravalova, M. Pasca, A. Soroa, A study on similarity and
relatedness using distributional and wordnet-based approaches (2009).
[12] N. Ksentini, M. Tmar, F. Gargouri, Detection of semantic relationships between terms with
a new statistical method., in: WEBIST (2), 2014, pp. 340–343.
[13] M. Sahami, T. D. Heilman, A web-based kernel function for measuring the similarity of
short text snippets, in: Proceedings of the 15th international conference on World Wide
Web, 2006, pp. 377–386.
[14] G. A. Miller, WordNet: An electronic lexical database, MIT press, 1998.
[15] S. Liu, F. Liu, C. Yu, W. Meng, An effective approach to document retrieval via utilizing
wordnet and recognizing phrases, in: Proceedings of the 27th annual international ACM
SIGIR conference on Research and development in information retrieval, 2004, pp. 266–272.
[16] G. Feki, R. Fakhfakh, A. B. Ammar, C. B. Amar, Knowledge structures: Which one to
use for the query disambiguation?, in: 2015 15th international conference on intelligent
systems design and applications (ISDA), IEEE, 2015, pp. 499–504.
[17] L. De Vine, G. Zuccon, B. Koopman, L. Sitbon, P. Bruza, Medical semantic similarity with
a neural language model, in: Proceedings of the 23rd ACM international conference on
conference on information and knowledge management, 2014, pp. 1819–1822.
[18] M. Mancini, J. Camacho-Collados, I. Iacobacci, R. Navigli, Embedding words and senses
together via joint knowledge-enhanced training, arXiv preprint arXiv:1612.02703 (2016).
[19] H. Yang, T. Gonçalves, Improving personalized consumer health search (2018).
[20] M. D. Tapi Nzali, J. Azé, S. Bringay, C. Lavergne, C. Mollevi, T. Optiz, Reconciliation of
patient/doctor vocabulary in a structured resource, Health Informatics Journal 25 (2019)
1219–1231.
[21] J. A. Lossio-Ventura, C. Jonquet, M. Roche, M. Teisseire, Biotex: A system for biomedical
terminology extraction, ranking, and validation, in: ISWC: International Semantic Web
Conference, 1272, 2014, pp. 157–160.
[22] R. Rehurek, P. Sojka, Software framework for topic modelling with large corpora, in: In
Proceedings of the LREC 2010 workshop on new challenges for NLP frameworks, Citeseer,
2010.
[23] S. Yue, P. Pilon, A comparison of the power of the t test, mann-kendall and bootstrap tests
for trend detection/une comparaison de la puissance des tests t de student, de mann-kendall
et du bootstrap pour la détection de tendance, Hydrological Sciences Journal 49 (2004)
21–37.
[24] K. Sauvagnat, M. Boughanem, A la recherche de noeuds informatifs dans des corpus de
documents xml., in: CORIA, 2005, pp. 119–134.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <article-title>Query expansion with local conceptual word embeddings in microblog retrieval</article-title>
          ,
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>33</volume>
          (
          <year>2019</year>
          )
          <fpage>1737</fpage>
          -
          <lpage>1749</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.</given-names>
            <surname>Ksentini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gargouri</surname>
          </string-name>
          ,
          <article-title>Controlled automatic query expansion based on a new method arisen in machine learning for detection of semantic relationships between terms</article-title>
          ,
          <source>in: 2015 15th International Conference on Intelligent Systems Design and Applications</source>
          (ISDA), IEEE,
          <year>2015</year>
          , pp.
          <fpage>134</fpage>
          -
          <lpage>139</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Ksentini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Boughanem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gargouri</surname>
          </string-name>
          , Miracl at clef 2015:
          <article-title>User-centred health information retrieval task</article-title>
          .,
          <source>in: CLEF (Working Notes)</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Ksentini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gargouri</surname>
          </string-name>
          ,
          <article-title>The impact of term statistical relationships on rocchio's model parameters for pseudo relevance feedback</article-title>
          ,
          <source>International Journal of Computer Information Systems and Industrial Management Applications</source>
          <volume>8</volume>
          (
          <year>2016</year>
          )
          <fpage>135</fpage>
          -
          <lpage>44</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>N.</given-names>
            <surname>Ksentini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gargouri</surname>
          </string-name>
          ,
          <article-title>Towards automatic improvement of patient queries in health retrieval systems</article-title>
          ,
          <source>Applied Medical Informatics</source>
          <volume>38</volume>
          (
          <year>2016</year>
          )
          <fpage>73</fpage>
          -
          <lpage>80</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>N.</given-names>
            <surname>Ksentini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gargouri</surname>
          </string-name>
          ,
          <article-title>Towards a contextual and semantic information retrieval system based on non-negative matrix factorization technique</article-title>
          ,
          <source>in: International Conference on Intelligent Systems Design and Applications</source>
          , Springer,
          <year>2017</year>
          , pp.
          <fpage>892</fpage>
          -
          <lpage>902</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>W.</given-names>
            <surname>Shalaby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zadrozny</surname>
          </string-name>
          ,
          <article-title>Measuring semantic relatedness using mined semantic analysis</article-title>
          ,
          <source>arXiv preprint arXiv:1512.03465</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>