=Paper=
{{Paper
|id=Vol-2080/paper5
|storemode=property
|title=Local Word Embeddings for Query Expansion Based on Co-Authorship and Citations
|pdfUrl=https://ceur-ws.org/Vol-2080/paper5.pdf
|volume=Vol-2080
|authors=André Rattinger,Jean-Marie Le Goff,Christian Guetl
|dblpUrl=https://dblp.org/rec/conf/ecir/RattingerGG18
}}
==Local Word Embeddings for Query Expansion Based on Co-Authorship and Citations==
André Rattinger¹,², Jean-Marie Le Goff¹, and Christian Guetl²
¹ CERN, Switzerland (andre.rattinger@cern.ch, jean-marie.le.goff@cern.ch)
² Graz University of Technology, Austria (cguetl@iicm.edu)

Abstract. Word embedding techniques have recently gained considerable interest from natural language processing researchers, and they are a valuable resource for identifying a list of semantically related terms for a search query. These related terms are a natural addition for query expansion, but they may mismatch when the application domains use different jargon. Using the Skip-Gram algorithm of Word2Vec, terms are selected only from a specific subset of the corpus, which is extended by documents from co-authorship and citations. We demonstrate that locally-trained word embeddings with this extension provide a valuable augmentation and can improve retrieval performance. First results suggest that query expansion with word embeddings could also benefit from other related information.

Keywords: word embeddings, query expansion, co-authorship, word2vec, pseudo relevance feedback

1 Introduction

Methods to create fixed representations of words and documents have long been a staple of natural language processing (NLP) and information retrieval (IR) research. Recently, neural network based methods of generating those representations have gained popularity in IR. Models such as Word2Vec [10] transform the terms of a document into high-dimensional vectors. Semantically similar terms are close to each other in this vector space, and the vectors are much smaller than the vocabulary-sized representations of traditional methods. This work focuses on the IR task of query expansion and on the applicability of word embeddings even when the dataset available for training is limited. Word embeddings are a good fit for query expansion, as they can mitigate the vocabulary mismatch between a query and the relevant documents. Embeddings trained on a small topically-relevant corpus promise to produce better fitting expansion terms for a specific area. However, the size of the dataset can be a limitation in training and in the subsequent expansion process, as the quality of word embeddings benefits from bigger datasets. We therefore propose to use additional documents in the expansion process that promise to be relevant by association with the retrieved results: referenced documents and documents from co-authors.

We test our approach on a small topically-relevant corpus and compare the results with a bigger, more general dataset. The smaller corpus consists of research papers from a specific field, computational linguistics. The bigger dataset is made up of patents from all patent classes and is therefore very general. For an additional comparison, we also perform the same test with general purpose embeddings trained on articles from the English-language edition of Wikipedia. A more detailed description of the publication and patent datasets can be found in Section 3. Section 4 describes the general approach of the local query expansion method, and Section 5 describes the experimental setup for the retrieval experiments. The results of the different retrieval experiments can be found in Section 6, and Section 7 concludes the paper.
2 Related Work

A few attempts have been made at expanding queries with word embeddings. Roy et al. [13] demonstrate the effect generalization has on retrieval performance when using word embeddings: while global methods can increase overall retrieval performance, they perform worse than co-occurrence based techniques. Diaz et al. [4] recently proposed a method for locally-trained word embeddings for query expansion, which is the closest to the work presented here. The difference between that work and our research lies in the application of the methods, the focus on pseudo relevance feedback, and the implications of additional documents for the query expansion process. A different approach incorporates word embeddings to weight terms that are not part of the query [15]. This approach is similar to ours, but it uses a different weighting scheme and does not operate on a local basis. Query expansion over the indexed corpus has previously been combined with pseudo relevance feedback [7], but with fairly big datasets that do not provide the same degree of locality; the scope of those datasets is, however, similar to the patent dataset presented here. Another approach is to use information from a different local context, as was done as part of the personalization of word embeddings [2]. That approach, however, did not deliver results as promising as other localized methods.

3 Datasets

We conduct our experiments on two datasets: the ACL anthology collection and the English subset of the CLEF-IP 2011 collection.

The ACL collection [12] is a small information retrieval dataset containing almost 10,000 research papers from the ACL anthology. The documents are scientific publications from the field of computational linguistics. The collection includes 82 research questions and their relevance assessments. As this is a small dataset for information retrieval, it is supplemented by other documents available to us: other research papers by the authors as well as the work they cite in the articles contained in the collection. With this addition, the dataset contains 33,922 research articles. The additional research papers are used in query expansion, but not for the main indexing and retrieval; they are also not part of the relevance assessments.

The English-language subset of the CLEF-IP 2011 collection [11] contains about 400,000 patents and 1,350 topics. In contrast to the ACL dataset, a query is not represented by a set of terms but by a whole patent document, from which the search terms have to be extracted; the goal in patent retrieval is to find similar documents that might invalidate a patent application. To generate the search queries, terms in the documents are weighted with tf-idf to extract the most relevant words from a document (a sketch of this extraction step follows below). Search queries with a length of 30 terms produced the best results and are used as a baseline for further experiments. No supplementation with other documents is performed, because the size of the dataset is deemed sufficient. Citations are considered in the experiments if they cite patents within the corpus. Patent citations promise to be valuable because they are added not only by the author but also by the patent examiner.
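To make the query generation step concrete, the following is a minimal sketch of how such tf-idf based term extraction could look. The paper does not specify an implementation; scikit-learn is assumed here, and all names and the toy corpus are illustrative.

```python
# Hypothetical sketch: extract the highest-weighted tf-idf terms of one
# patent document to use as a search query (assumes scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer

def extract_query_terms(corpus, doc_index, n_terms=30):
    """Return the n_terms highest tf-idf terms of the document at doc_index."""
    vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
    tfidf = vectorizer.fit_transform(corpus)      # shape: (n_docs, n_vocab)
    row = tfidf[doc_index].toarray().ravel()      # weights for this document
    vocab = vectorizer.get_feature_names_out()
    top = row.argsort()[::-1][:n_terms]           # indices of top-weighted terms
    return [vocab[i] for i in top if row[i] > 0]

# Toy usage (illustrative only); the paper uses 30 terms per query.
corpus = ["a patent about neural word embeddings",
          "a patent about centrifugal pumps"]
print(extract_query_terms(corpus, doc_index=0, n_terms=5))
```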
The CLEF-IP collection as a whole is used as a reference corpus to show the effect of query expansion on a dataset that is not as topically constrained as the ACL dataset.

4 Local Query Expansion

Local methods for query expansion generally perform better than their global counterparts. This holds true for word embeddings as well as for other techniques [4, 14]. The ACL collection represents a subset of research papers that focuses on a specific topic, which lends itself well to the training of word embeddings compared to a big general dataset. The applicability to a smaller local context can be demonstrated with a small example: Latent Semantic Indexing (LSI) [3] is a well-known NLP technique used in IR. When looking at the most similar words generated by the word embedding model for the term "latent", the global model generates terms such as "inherent", "suppresses", "innate", "inhibition" or "implicit". Some of those words are actual synonyms in an overall context, similar to what a thesaurus would provide. The local embedding model trained on the ACL collection provides a different representation of the data: the terms most similar to "latent" are "plsa", "lsa", "dirichlet", "allocation", "plsi" and "probabilistic". All of these are either terms in the direct context of the LSI technique or refer to similar techniques used in NLP applications. A similar observation is made in [4], where the word "cut" is studied in a global context and compared to a local one. To train local word embeddings, a set of documents is required that reflects the local context. This is provided by the top-ranked documents in retrieval as well as the documents from references and co-authorships.

5 Experimental Setup

We use the Skip-Gram algorithm from the Word2Vec model to train the word embeddings. The embeddings are used to choose the terms that are closest to the query, using the cosine similarity between the projected vectors. This is done for noun phrases in the query as well as for the single query terms. The words most similar to the query terms are then incorporated into the expansion model. The expansion is based on the n most relevant documents from the retrieval run, a method also known as pseudo relevance feedback (PRF). PRF is a proven method for expanding a query and thereby achieving better retrieval performance. It is a natural addition to the expansion process and, together with word embeddings, helps to mitigate the vocabulary mismatch problem that arises in IR when different terms are used to describe the same concepts [5]. For the evaluation, both datasets underwent several setup steps. The setup steps are the same for both datasets with a few exceptions, notably stopword removal and tokenization.

5.1 Pre-processing

As a preparation for indexing, the corpus of each dataset is tokenized with a regex tokenizer and transformed to lower case. Stopwords are filtered with the SMART stopword list (http://jmlr.csail.mit.edu/papers/volume5/lewis04a/a11-smart-stop-list/english.stop), which was extended with query and publication specific stopwords. Stopword removal was only performed for indexing, but not for the word embedding models, as stopwords help by providing context for the training of the models [8]. Krovetz stemming [6] was applied to all documents to reduce the overall vocabulary size. This was beneficial for training the word embedding models on the comparatively small ACL dataset, as it reduces the size of the input space.
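A condensed sketch of this pre-processing pipeline is shown below. It assumes the third-party krovetzstemmer package as a stand-in Krovetz implementation (the paper does not name a library), and the inline stopword set is a placeholder for the full SMART list referenced above.

```python
# Hypothetical sketch of the pre-processing pipeline: regex tokenization,
# lowercasing, stopword filtering (indexing only) and Krovetz stemming.
# Assumes `pip install krovetzstemmer`; the tiny stopword set stands in
# for the SMART list downloaded from the URL in Section 5.1.
import re
from krovetzstemmer import Stemmer  # assumed third-party Krovetz stemmer

TOKEN_RE = re.compile(r"[a-z0-9]+")
stemmer = Stemmer()
stopwords = {"a", "an", "and", "of", "the"}  # placeholder for the SMART list

def preprocess(text, remove_stopwords=True):
    """Tokenize, lowercase, optionally drop stopwords, then Krovetz-stem."""
    tokens = TOKEN_RE.findall(text.lower())
    if remove_stopwords:                      # done for the retrieval index,
        tokens = [t for t in tokens if t not in stopwords]
    return [stemmer.stem(t) for t in tokens]  # but not for the embeddings

index_tokens = preprocess("Latent Semantic Indexing of documents")
embedding_tokens = preprocess("Latent Semantic Indexing of documents",
                              remove_stopwords=False)
```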
5.2 Indexing and Initial Retrieval

We use an extension of the Bo1 model [1], a variant of the divergence from randomness (DFR) weighting model. Bo1 was chosen because it represents a stable model within the DFR framework [1]. Weights are assigned in the following way:

$$ w(t) = \mathrm{tf} \cdot \log_2 \frac{1+f}{f} + \log_2 (1+f) \qquad (1) $$

where tf is the frequency of the term within the set of top-ranked feedback documents, and f is the term frequency of the term in the corpus divided by N, the number of documents indexed from the whole collection. Bo1 is used for initial weighting and candidate term selection, which provides a basis for measuring the information content of the different query expansion candidates, an important step in ranking them. For retrieval, the reference implementation of the Inverse Document Frequency model (InL2) from the Terrier retrieval platform was used for weighting [9]. The first round of retrieval provides the basis for the pseudo relevance feedback. The number of feedback documents was set to 3, which produced the best overall results.

5.3 Word2Vec Parameters and Learning of Embeddings

The initial Word2Vec models are learned on the whole corpus of each dataset. Another model is learned from the English-language edition of Wikipedia. These initial models are learned in advance because training a full local model from scratch for every query is very inefficient. Word2Vec provides several parameters that can be tuned to improve the results. As the ACL dataset is small, the number of training iterations is raised from the default of 5 to 20. The minimum frequency of the words appearing in the corpus was set to 8. The window size, which represents the maximum distance within a sentence between the word Word2Vec looks at and the word it is trying to predict, was set to 7. The Skip-Gram algorithm delivered superior results for the dataset compared to the continuous bag of words (CBOW) algorithm. This combination of settings produced the best results.

5.4 Retrieval and Query Expansion

Let Q be a query issued by the user, which can also be represented as a list of terms q_1, q_2, ..., q_n, and let C be the list of candidate terms for query expansion, represented as c_1, c_2, ..., c_k. The initial set C is selected from all terms found in the first m relevant documents. The pool of candidates C is then extended by all terms that appear in the reference sections of the relevant documents, as well as by the documents of their co-authors. This creates an extended list of candidate terms whose frequencies can then be used for the Bo1 weighting. The process of adding terms from co-author documents is only executed for the ACL dataset. From the resulting set of documents, the top k terms according to the weight assigned to them by Bo1 are used for further processing. This list of terms, filtered by the stopword lists, provides the basis for the lookup of similar terms e_1^(i), e_2^(i), ..., e_n^(i) with the Word2Vec model. Before generating candidate terms, the Word2Vec model is retrained on the same extended dataset on which the candidate term lookup was performed, with the same settings as in the initial training step described in the previous section. The lookup of terms in the model creates another list of extended candidates, which is weighted by the Bo1 model.
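A minimal sketch of the embedding training and similar-term lookup follows, using the parameters reported in Section 5.3. The paper does not name its Word2Vec toolkit; gensim is assumed here (parameter names follow gensim 4.x), and the toy corpus is purely illustrative.

```python
# Hypothetical sketch: train Skip-Gram embeddings with the settings from
# Section 5.3, then look up expansion candidates by cosine similarity.
from gensim.models import Word2Vec

# Toy corpus, repeated so terms pass the min_count threshold; in the paper
# this would be the tokenized ACL or CLEF-IP collection.
sentences = [["latent", "semantic", "indexing", "probabilistic", "model"],
             ["latent", "dirichlet", "allocation", "topic", "model"]] * 10

model = Word2Vec(
    sentences,
    sg=1,          # Skip-Gram rather than CBOW
    window=7,      # max distance between target and predicted word
    min_count=8,   # ignore words appearing fewer than 8 times
    epochs=20,     # raised from the default of 5 for the small corpus
)

# Cosine-similarity lookup of expansion candidates for a query term.
# On a locally trained model, "latent" should yield terms such as
# "plsa" or "dirichlet" (cf. Section 4), depending on the corpus.
candidates = model.wv.most_similar("latent", topn=10)
```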
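The final Bo1 weighting of the candidate list reduces to a few lines. The sketch below implements Equation (1) as given in Section 5.2; the function and variable names are illustrative, not from the paper.

```python
# Hypothetical sketch of the Bo1 weighting from Equation (1):
#   w(t) = tf * log2((1 + f) / f) + log2(1 + f)
# tf: frequency of t in the top-ranked feedback documents,
# f:  corpus frequency of t divided by the number of indexed documents N.
from math import log2
from collections import Counter

def bo1_weights(feedback_tokens, corpus_freq, n_docs):
    """Weight every candidate term found in the feedback documents."""
    tf = Counter(feedback_tokens)
    weights = {}
    for term, term_tf in tf.items():
        f = corpus_freq.get(term, 0) / n_docs
        if f > 0:  # skip terms absent from the indexed corpus
            weights[term] = term_tf * log2((1 + f) / f) + log2(1 + f)
    return weights

# Example: rank candidates and keep the top k terms for expansion.
feedback = ["latent", "latent", "dirichlet", "allocation", "model"]
corpus_freq = {"latent": 40, "dirichlet": 12, "allocation": 15, "model": 900}
top_k = sorted(bo1_weights(feedback, corpus_freq, n_docs=10000).items(),
               key=lambda kv: kv[1], reverse=True)[:3]
print(top_k)
```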
6 Results

In this section, we present the results for both datasets and the different configurations of query expansion. Retrieval performance is low for both datasets in terms of the main metrics used, which is also the case in the reference works that provide our baselines [11, 12]. The datasets are challenging because of the low number of relevant documents per query, as can be seen in Table 1. The following notation is used in the result tables:

– Baseline represents retrieval without any query expansion method applied.
– QE global denotes the global query expansion approach with a general purpose model trained on a dataset from the English-language edition of Wikipedia.
– QE local is the locally-trained model.
– QE local ext. is the locally-trained model with the extension of reference documents and documents from co-authors.

Table 1. Comparison of the datasets in terms of vocabulary size, documents and relevance judgments.

    Name      Topics   Vocab Size   Indexed Docs   Avg. Relevant Docs
    ACL           82      329,490          9,793                23.67
    CLEF-IP    1,350    2,648,818        420,193                 7.2

Table 2 shows the performance in terms of Mean Average Precision (MAP), Precision at 5 (P@5) and Precision at 10 (P@10) for the documents retrieved with the different query expansion methods. All of the query expansion methods outperform the baseline, but global query expansion does so only by a very slight margin. Local embeddings generally perform better, with the extended local model reaching the highest MAP and P@10 and the plain local model the highest P@5.

Table 2. Results comparing the local query expansion methods against the baseline for the ACL research paper collection. Significance testing has been performed with the paired t-test and is denoted by *.

    Method          MAP       P@5      P@10
    Baseline        0.1497    0.2268   0.1683
    QE global       0.1502    0.2268   0.1732
    QE local        0.1623*   0.2347   0.1805
    QE local ext.   0.1713*   0.2314   0.1822

Table 3 shows the results for the CLEF-IP dataset. Here the local method does not improve over the baseline by the same margin as it does on the ACL dataset.

Table 3. Results comparing the local query expansion methods against the baseline for the CLEF-IP patent collection.

    Method          MAP       P@5      P@10
    Baseline        0.0914    0.0630   0.0446
    QE local        0.0916    0.0631   0.0448
    QE local ext.   0.0923    0.0636   0.0455

7 Conclusion and Discussion

In this paper we showed the implications of local query expansion using documents from references and co-authors. The inclusion of those documents provides further information for term selection on two datasets with low baseline performance. Extending the approach to generate a bigger list of potentially relevant candidates improved retrieval performance on the ACL dataset. For the CLEF-IP patent dataset, only slight improvements can be observed, and no statistical significance could be shown. One potential issue might be the pre-training of the word embeddings: as training local embeddings from scratch is very costly and inefficient, retraining a previously created model can speed up the training. The results might indicate that a certain level of topical relevance needs to be achieved for this approach to be effective, even if the system was trained on a relevant corpus.
The addition of supplementary information from references and co-authors might not be as beneficial for datasets with better overall performance, as the number of retrieved documents that can reliably be used as a source for pseudo relevance feedback is greater. The retrieval results might be improved by switching the weighting of candidate terms from a distribution based method (Bo1) to association based term selection, which is used as a basis for other work on query expansion with word embeddings [15, 4]. Future work may shed more light on the implications of different weighting models as well as on how topically restrained embeddings have to be in order to achieve the best results.

Bibliography

[1] Giambattista Amati. Probability models for information retrieval based on divergence from randomness. PhD thesis, University of Glasgow, 2003.
[2] Nawal Ould Amer, Philippe Mulhem, and Mathias Géry. Toward word embedding for personalized information retrieval. In Neu-IR: The SIGIR 2016 Workshop on Neural Information Retrieval, 2016.
[3] Scott Deerwester. Improving information retrieval with latent semantic indexing. 1988.
[4] Fernando Diaz, Bhaskar Mitra, and Nick Craswell. Query expansion with locally-trained word embeddings. arXiv preprint arXiv:1605.07891, 2016.
[5] George W. Furnas, Thomas K. Landauer, Louis M. Gomez, and Susan T. Dumais. The vocabulary problem in human-system communication. Communications of the ACM, 30(11):964–971, 1987.
[6] Robert Krovetz. Viewing morphology as an inference process. In Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 191–202. ACM, 1993.
[7] Saar Kuzi, Anna Shtok, and Oren Kurland. Query expansion using word embeddings. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, pages 1929–1932. ACM, 2016.
[8] Jey Han Lau and Timothy Baldwin. An empirical evaluation of doc2vec with practical insights into document embedding generation. arXiv preprint arXiv:1607.05368, 2016.
[9] Craig Macdonald, Richard McCreadie, Rodrygo L.T. Santos, and Iadh Ounis. From puppy to maturity: Experiences in developing Terrier. In Proceedings of OSIR at SIGIR, pages 60–63, 2012.
[10] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[11] Florina Piroi, Mihai Lupu, Allan Hanbury, and Veronika Zenz. CLEF-IP 2011: Retrieval in the intellectual property domain. In CLEF (Notebook Papers/Labs/Workshop), 2011.
[12] Anna Ritchie. Citation Context Analysis for Information Retrieval. PhD thesis, University of Cambridge, UK, 2008.
[13] Dwaipayan Roy, Debjyoti Paul, Mandar Mitra, and Utpal Garain. Using word embeddings for automatic query expansion. arXiv preprint arXiv:1606.07608, 2016.
[14] Xing Wei and W. Bruce Croft. LDA-based document models for ad-hoc retrieval. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 178–185. ACM, 2006.
[15] Hamed Zamani and W. Bruce Croft. Embedding-based query language models. In Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval, pages 147–156. ACM, 2016.