<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exploring Summary-Expanded Entity Embeddings for Entity Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shahrzad Naseri</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>John Foley</string-name>
          <email>jjfoley@smith.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>James Allan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Brendan T. O'Connor</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>College of Information and Computer Sciences, University of Massachusetts Amherst</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, Smith College</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Entity retrieval is an important part of any modern retrieval system and often satisfies user information needs directly. Word and entity embeddings are a promising opportunity for new improvements in retrieval, especially in the presence of vocabulary mismatch problems. We present an approach to entity embedding that leverages the summary of entity articles from Wikipedia in order to form a richer representation of entities. We present a brief evaluation using the DBpedia-Entity v2 dataset. Our evaluation shows that our new, summary-inspired representation provides improvements over both standard retrieval and pseudo-relevance feedback baselines as well as over a straightforward word-embedding model. We observe that this representation is particularly helpful for the verbose queries in the INEX-LD and QALD-2 subsets of our test collection.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Recently, knowledge cards, conversational answers, and other focused responses to user queries have become possible for most search engines. Underlying most of these answers in search engine response pages is search based on knowledge graphs and the availability of rich information for named entities. In particular, named entities such as people, organizations, or concepts are often provided as the focused response to user queries. In a study of the Yahoo web search query logs, Pound et al. [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ] showed that more than 50% of the queries target specific entities or lists of entities. Since their study, more entity-focused responses have appeared in major web search engines.
      </p>
      <p>Copyright © CIKM 2018 for the individual papers by the papers' authors. Copyright © CIKM 2018 for the volume as a collection by its editors.</p>
      <p>Of course, rich knowledge bases play a key role in the use of entities in a search. Structured data published in knowledge bases such as DBpedia (1), Freebase (2), and YAGO (3) continue to grow in a variety of languages. In order to answer queries directly from such knowledge bases, the entity retrieval task has been defined: return a ranked list of entities relevant to the user's query. This task is typically approached by finding entities with a "meaning" that is similar to the query.</p>
      <p>
        Capturing that semantic ("meaning") similarity between vocabulary terms, pieces of text, and sentences has been a substantial problem in information retrieval and natural language processing (NLP), for which a wide variety of approaches have been introduced [
        <xref ref-type="bibr" rid="ref10 ref37">10, 37</xref>
        ]. Word embedding methods assign each term a low-dimensional (compared to the vocabulary size) vector, representing vocabulary terms by capturing co-occurrence information between terms via a likelihood approximation of terms appearing within a context window. Word2vec [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] and GloVe [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ] are examples of widely used word embeddings, obtained from a neural network-based language model and a matrix factorization technique, respectively.
      </p>
      <p>
        There has been substantial work on defining embeddings not just for single words but for entities [
        <xref ref-type="bibr" rid="ref24 ref45 ref46 ref49 ref8">45, 49, 8, 46, 24</xref>
        ], but there is no clear baseline for ranking entities with such compressed semantic representations. In fact, when trying to re-use task-specific entity embeddings for retrieval tasks, results can be less than impressive: e.g., RDF2Vec [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ] was designed for data mining and has been shown to under-perform simple retrieval baselines like BM25 on more specific tasks [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ]. Although fully-deep models that leverage entities exist [
        <xref ref-type="bibr" rid="ref44">44</xref>
        ], often we do not have enough data to train supervised embeddings.
        (1) http://dbpedia.org (2) http://freebase.org (3) http://www.mpi-inf.mpg.de/yago-naga/yago/
      </p>
      <p>We propose a simple entity embedding model that focuses on representing an entity based on other entities crucial to its summary. Here, we use the entities that appear inside a DBpedia abstract. Since we use links present in the abstract, these entity mentions were effectively annotated by the human authors of those articles.</p>
      <p>
        In summary, we investigate the problem of entity retrieval, improving retrieval results using word and entity embeddings. We use the queries of the DBpedia-Entity (v2) dataset introduced by Hasibi et al. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] to evaluate our EntityVec representation on its ability to directly rank entities. We demonstrate that this is an effective representation for use in entity ranking, one that provides gains beyond those provided by single-word embeddings and query expansion.
      </p>
      <p>The rest of this work is organized in the following manner: We provide some background on entity retrieval in Section 2. In Section 3 we present our approach in detail. We describe our experimental setup in Section 4, empirically validate our hypotheses in Section 5, and discuss conclusions in Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>In this section, we first introduce some prior work in entity retrieval. Then we discuss the key ideas behind word embedding techniques, whose purpose is to capture the semantic similarity between vocabulary terms.</p>
      <p>
        Entities are useful for a diverse set of tasks including but not limited to academic search [
        <xref ref-type="bibr" rid="ref45">45</xref>
        ], entity disambiguation [
        <xref ref-type="bibr" rid="ref49">49</xref>
        ], entity summarization [
        <xref ref-type="bibr" rid="ref15 ref16">16, 15</xref>
        ], and knowledge graph completion [
        <xref ref-type="bibr" rid="ref24 ref46">46, 24</xref>
        ]. We will focus our discussion on entity retrieval.
      </p>
      <sec id="sec-2-1">
        <title>Entity Retrieval</title>
        <p>
          Entity ranking is a task that focuses on retrieving entities in a knowledge base and presenting them in ranked order in response to a user's information need. This task was the focus of various benchmarking campaigns, including the INEX Entity Ranking track [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], the INEX Linked Data track [
          <xref ref-type="bibr" rid="ref42">42</xref>
          ], the TREC Entity track [
          <xref ref-type="bibr" rid="ref3 ref41 ref6">41, 6, 3</xref>
          ], the Semantic Search Challenge [
          <xref ref-type="bibr" rid="ref17 ref7">7, 17</xref>
          ], and the Question Answering over Linked Data (QALD) challenge series [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]. A common goal of all these campaigns was to address the user's need in an entity-specific way, instead of returning documents which might contain unnecessary information. However, the campaigns focused on different tasks such as list search [
          <xref ref-type="bibr" rid="ref11 ref3">3, 11</xref>
          ], related entity finding [
          <xref ref-type="bibr" rid="ref41">41</xref>
          ], and question answering [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]. All of the datasets from those campaigns were combined into the DBpedia-Entity v1 [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] and v2 [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] datasets.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Leveraging Knowledge Bases for Entity Retrieval</title>
        <p>
          Existing methods typically study the use of type information to improve entity retrieval accuracy [
          <xref ref-type="bibr" rid="ref2 ref21 ref4">4, 21, 2</xref>
          ]. Knowledge bases are typically represented as tuples of relations, often formatted in the Resource Description Framework (RDF) triple format. As a result, entities have rich fielded information, and fielded retrieval methods such as BM25F [
          <xref ref-type="bibr" rid="ref20 ref32 ref39">39, 32, 20</xref>
          ] and F-SDM [
          <xref ref-type="bibr" rid="ref48">48</xref>
          ] are especially helpful. Zhiltsov et al. in particular propose the use of name, attributes, categories, similar entities, and related entities as the fields for a fielded retrieval model [
          <xref ref-type="bibr" rid="ref48">48</xref>
          ].
        </p>
        <p>
          To take advantage of both structured and unstructured data, Schuhmacher et al. used a learning-to-rank approach which incorporates different features of both text and entities [
          <xref ref-type="bibr" rid="ref40">40</xref>
          ]. Foley et al. expand on results for their dataset by exploring minimal knowledge-base features for use in learning-to-rank [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. Both of these studies leverage crowd-sourced judgments of entity relevance for traditional TREC ad-hoc queries.
        </p>
      </sec>
      <sec id="sec-2-4">
        <title>Entity Retrieval without a Knowledge Base</title>
        <p>
          There have also been efforts to answer entity queries that cannot be satisfied via information in knowledge bases, due to the various ways of referring to an entity in the query. In earlier work on expert finding, entities were defined by their locations in text [
          <xref ref-type="bibr" rid="ref1 ref33">1, 33</xref>
          ]. More recently, Hong et al. [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] tried to enrich their knowledge base using linked web pages and queries from a query log. In addition, Graus et al. [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] presented a dynamic representation for entities by collecting different representations from a variety of resources and combining them.
        </p>
        <p>In this work, we focus on entities that can be found in knowledge bases.</p>
      </sec>
      <sec id="sec-2-6">
        <title>Neural and Embedding Approaches for Entity Retrieval</title>
        <p>As our primary direction of study for this work is toward an entity representation that improves retrieval, the most relevant efforts are those that leverage word or entity embeddings in their ranking tasks.</p>
        <p>
          Word embedding techniques learn a low-dimensional vector (compared to the vocabulary size) for each vocabulary term, such that similarity between the word vectors captures the semantic as well as the syntactic similarities between the corresponding words. Word embeddings are unsupervised learning methods, since they only need raw textual data without any labels. There are different methods to compute word embeddings. One of the most popular is using neural networks to predict words based on the context of a text. Mikolov et al. [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ] introduced word2vec, which learns vector representations of words via a neural network with a single layer. Word2vec comes in two forms, CBOW and Skip-gram. CBOW tries to predict a word based on its context, i.e., neighboring words. Skip-gram tries to predict the context: given the word w, it tries to predict the probability of a word w' appearing in a fixed window around w. Another model for learning embedding vectors is based on matrix factorization, e.g., GloVe vectors [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ]. Although many variants of word embeddings exist, Skip-gram embeddings are quite efficient and not significantly different from other variants if tuned correctly [
          <xref ref-type="bibr" rid="ref23 ref27">27, 23</xref>
          ].
        </p>
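        <p>To make the two objectives concrete, the following sketch (illustrative only; the toy sentence is an assumption, not our training data) enumerates the (word, context) pairs on which the Skip-gram objective is trained for a fixed window:</p>

```python
# Illustrative sketch (not the authors' code): enumerate the
# (center word, context word) training pairs that the Skip-gram
# objective uses, for a fixed window size.
def skipgram_pairs(tokens, window=2):
    """Pair each center word with every word within `window`
    positions of it, in both directions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["harry", "potter", "is", "a", "series"], window=1)
# With window=1, each adjacent pair appears in both directions,
# e.g. ("harry", "potter") and ("potter", "harry").
```

        <p>CBOW inverts this setup, predicting the center word from the set of context words.</p>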
        <p>
          Xiong et al. propose a model for ad-hoc document retrieval that represents documents and queries in both text and entity spaces, leveraging entity embeddings in their approach [
          <xref ref-type="bibr" rid="ref44">44</xref>
          ]. However, such deep models require a significant quantity of training data to learn effective models, and our approach uses far less supervision.
        </p>
        <p>
          Entity embeddings are also used for academic search [
          <xref ref-type="bibr" rid="ref45">45</xref>
          ], for entity disambiguation [
          <xref ref-type="bibr" rid="ref49">49</xref>
          ], for question answering [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], and for knowledge graph completion [
          <xref ref-type="bibr" rid="ref24 ref46">46, 24</xref>
          ]. The benchmark paper for TREC-CAR (Complex Answer Retrieval) determined that RDF2Vec entity embeddings [
          <xref ref-type="bibr" rid="ref38">38</xref>
          ] are not as effective as BM25 for their entity-focused paragraph ranking task [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ]. Our survey of related work suggests that opportunities to customize entity vectors for ranking remain relatively unexplored.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Embedding-Based Entity Retrieval</title>
      <p>
        Vocabulary mismatch is a long-standing problem in information retrieval. Previous work [
        <xref ref-type="bibr" rid="ref47">47</xref>
        ] has proposed incorporating word embeddings to address this problem. In this paper, we investigate the effect of word embeddings in entity retrieval with the goal of mitigating vocabulary mismatch.
      </p>
      <p>
        Moreover, since in entity retrieval we retrieve entities instead of documents, and since most of the queries are entity-centric, we learn an embedding representation for entities and explore the effect of those embeddings on entity retrieval. We hypothesize that mapping the query to the entity space and comparing it with the retrieved entities will improve the retrieval results. In this section, we describe our approach to validating the hypothesis that incorporating word embeddings and entity embeddings enhances entity retrieval accuracy. We also discuss query expansion [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], an approach that also attempts to address the vocabulary gap by augmenting the query with additional related words.
      </p>
      <sec id="sec-3-1">
        <title>General Scheme of Retrieval</title>
        <p>
          Given a query, q, that targets a specific entity, our task is to return a ranked list of entities likely to be relevant. Each entity is represented by a short textual description; in our experiments, we used the short abstract of each entity available in DBpedia. A list of candidate entities is also retrieved using term-based retrieval models such as the query likelihood model [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ], efficiently creating a large pool of candidate matches.
        </p>
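        <p>As a concrete illustration of this candidate-generation step, the following sketch scores documents with Dirichlet-smoothed query likelihood; the two toy "abstracts" are assumptions for illustration, not our index:</p>

```python
import math
from collections import Counter

# Hedged sketch of Dirichlet-smoothed query likelihood scoring, the kind
# of term-based model used to build the candidate pool.
def ql_score(query_terms, doc_terms, coll_counts, coll_len, mu=1000):
    """log P(q | d) under a Dirichlet-smoothed unigram language model."""
    tf = Counter(doc_terms)
    dlen = len(doc_terms)
    score = 0.0
    for t in query_terms:
        p_c = coll_counts.get(t, 0) / coll_len  # background probability
        if p_c == 0:
            continue  # a term unseen in the collection contributes nothing
        score += math.log((tf[t] + mu * p_c) / (dlen + mu))
    return score

docs = {
    "Harry_Potter": "harry potter fantasy novel series".split(),
    "Pasta":        "cooking recipes for pasta dishes".split(),
}
coll = Counter(t for d in docs.values() for t in d)
coll_len = sum(coll.values())
ranked = sorted(docs, reverse=True,
                key=lambda e: ql_score(["harry", "potter"], docs[e], coll, coll_len))
# "Harry_Potter" ranks above "Pasta" for the query "harry potter".
```

        <p>The Dirichlet smoothing parameter mu here corresponds to the smoothing parameter tuned in Section 4.3.</p>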
        <p>In our model, we try to enhance the accuracy of entity retrieval by representing queries and entities by their corresponding embedding vectors. We explore two methods to construct query and entity embedding vectors, which we refer to as the WordVec and EntityVec models.</p>
        <p>
          In the WordVec model, each query is represented by the average of the embedding vectors of the query's terms. Entities are represented in a similar way, by averaging over the embedding vectors of the terms in the entity's abstract. The GloVe [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ] pre-trained word embeddings are used as the word vectors in the WordVec model.
        </p>
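        <p>A minimal sketch of this averaging scheme, with a tiny hypothetical 3-dimensional vector table standing in for the real pre-trained GloVe vectors:</p>

```python
import numpy as np

# Minimal sketch of the WordVec representation: a query (or an entity
# abstract) becomes the mean of its term vectors. The tiny vector table
# below is a hypothetical stand-in for real pre-trained GloVe vectors.
glove = {
    "harry":  np.array([0.9, 0.1, 0.0]),
    "potter": np.array([0.8, 0.2, 0.1]),
    "novel":  np.array([0.2, 0.9, 0.3]),
}

def average_embedding(terms, table):
    """Mean of the vectors of in-vocabulary terms; None if none are known."""
    vecs = [table[t] for t in terms if t in table]
    if not vecs:
        return None
    return np.mean(vecs, axis=0)

q_vec = average_embedding(["harry", "potter"], glove)  # [0.85, 0.15, 0.05]
```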
        <p>
          In the EntityVec model, an embedding vector for entities is learned with the Skip-gram model implemented in gensim [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ]. To learn this embedding, following the approach presented in [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ], we replace the Wikipedia pages' hyperlinks (links referring to other pages, i.e., entities) with a placeholder representing the entity. Consider the following excerpt, where links to other Wikipedia articles (entities) are represented by italics:
        </p>
        <p>Harry Potter is a series of fantasy novels written by British author J. K. Rowling. The novels chronicle the life of a young wizard, Harry Potter, and his friends Hermione Granger and Ron Weasley, all of whom are students at Hogwarts School of Witchcraft and Wizardry.</p>
        <p>The excerpt will be replaced by:</p>
        <p>Harry Potter is a series of Fantasy_literature written by British author J._K._Rowling. The novels chronicle the life of a young Magician_(fantasy), Harry_Potter_(character), and his friends Hermione_Granger and Ron_Weasley, all of whom are students at Hogwarts.</p>
        <p>Here each link has been replaced by the corresponding article's title, with spaces replaced by underscores. Each entity in the original excerpt is now treated as a single "term", and an embedding is learned with the Skip-gram model.</p>
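        <p>The preprocessing step above can be sketched as follows; the [[Target|anchor]] link syntax and the sample text are illustrative assumptions, not our exact pipeline:</p>

```python
import re

# Hedged sketch of the hyperlink-replacement preprocessing. Each wiki link
# becomes a single placeholder token: the target article's title with
# spaces replaced by underscores.
def replace_links(wikitext):
    def to_token(match):
        return match.group(1).replace(" ", "_")
    # Matches [[Target title|anchor text]] as well as [[Target title]].
    return re.sub(r"\[\[([^|\]]+)(?:\|[^\]]+)?\]\]", to_token, wikitext)

text = "a young [[Magician (fantasy)|wizard]], [[Harry Potter (character)|Harry Potter]]"
print(replace_links(text))
# a young Magician_(fantasy), Harry_Potter_(character)
```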
        <p>As mentioned before, entities are represented by the abstract available in DBpedia. To incorporate this representation, the final embedding of a target entity is obtained by averaging the embedding vectors of the entities referred to in the abstract of the target entity.</p>
        <p>
          In the EntityVec model, queries are represented by the average of the embedding vectors of the entities in the query. The entities in the query are annotated using the TagMe [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] mention detection tool.
        </p>
        <p>For both WordVec and EntityVec, the similarity between the query and the document is calculated as the cosine similarity between their respective embedding vectors.</p>
        <p>The final entity retrieval score is obtained by linear interpolation of the baseline, WordVec, and EntityVec models.</p>
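        <p>The scoring described in the preceding paragraphs can be sketched as follows; the interpolation weights shown are illustrative placeholders, not our tuned values:</p>

```python
import math

# Sketch of the final scoring: cosine similarity in each embedding space,
# linearly interpolated with the base retrieval score.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def final_score(lm_score, wordvec_sim, entityvec_sim, w=(0.6, 0.2, 0.2)):
    # Linear interpolation of the baseline, WordVec, and EntityVec scores.
    return w[0] * lm_score + w[1] * wordvec_sim + w[2] * entityvec_sim

s = final_score(lm_score=0.5,
                wordvec_sim=cosine([1.0, 0.0], [1.0, 0.0]),
                entityvec_sim=cosine([1.0, 0.0], [0.0, 1.0]))
# 0.6 * 0.5 + 0.2 * 1.0 + 0.2 * 0.0 = 0.5
```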
        <p>Table 1 reports the learning corpora for the WordVec and EntityVec models. Moreover, we summarize the final embedding vectors for query and entity in Table 2.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Query Expansion</title>
        <p>In an intuitive sense, query and document embedding models address the vocabulary mismatch problem by expanding the representation. Therefore, it makes sense to compare our work to techniques in the query-expansion literature.</p>
        <p>
          Lavrenko and Croft introduce relevance modeling, an approach to query expansion that derives a probabilistic model of term importance from documents that receive high scores, given the initial query [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. They present a number of models, but the most utilized version is RM3, which is a mixture model between the top k expansion terms and the original query. Expansion terms t are given the following weights, derived from a set of pseudo-relevant documents D_Q for a query Q:
        </p>
        <p>w(t) = (1/Z) Σ_{d ∈ D_Q} P(d|Q) P(t|d)</p>
        <p>
          Terms that occur frequently in high-scoring documents, i.e., with high P(t|d) and P(d|Q), are given the most weight in the expansion. Z is merely a normalizer allowing the weights to be turned into a probability distribution over terms that occur in the pseudo-relevant document set D_Q. This baseline is often used for comparison in the entity-focused retrieval literature [
          <xref ref-type="bibr" rid="ref40 ref43 ref9">9, 40, 43</xref>
          ].
        </p>
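        <p>A small sketch of this relevance-model weighting; the toy documents and the uniform P(d|Q) are assumptions for illustration:</p>

```python
from collections import Counter

# Hedged sketch of the relevance-model weighting: w(t) is proportional
# to the sum over pseudo-relevant documents of P(d|Q) * P(t|d),
# normalized by Z so the weights form a probability distribution.
def rm_weights(pseudo_rel_docs, p_d_given_q):
    raw = Counter()
    for doc, p_d in zip(pseudo_rel_docs, p_d_given_q):
        tf = Counter(doc)
        dlen = len(doc)
        for t, c in tf.items():
            raw[t] += p_d * (c / dlen)  # P(d|Q) * P(t|d)
    z = sum(raw.values())               # normalizer Z
    return {t: w / z for t, w in raw.items()}

docs = [["harry", "potter", "novel"], ["harry", "wizard", "novel"]]
weights = rm_weights(docs, [0.5, 0.5])
# "harry" and "novel" each get weight 1/3; "potter" and "wizard" get 1/6.
```

        <p>RM3 then mixes the top k of these expansion terms with the original query terms.</p>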
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experimental Setup</title>
      <p>In this section, we introduce our experimental setup, baselines, and evaluation metrics. Next, we report and discuss our results.</p>
      <sec id="sec-4-1">
        <title>Dataset</title>
        <p>
          Our experiments are conducted on the entity search test collection DBpedia-Entity v2 [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. This dataset consists of queries gathered from seven previous competitions, with relevance judgments on entities from DBpedia version 2015-10.
        </p>
        <p>
          For word embeddings, we used the GloVe [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ] pre-trained word embeddings with 300 dimensions. The word embeddings were trained on a 6 billion token collection (the 2014 Wikipedia dump plus Gigaword 5).
        </p>
        <p>To train the entity embeddings, we used the full article text of Wikipedia pages obtained from the DBpedia 2016-10 dump.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Data Processing</title>
        <p>Retrieval results were obtained using an index built from the abstracts of the entities.</p>
        <p>
          We used TagMe [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] as the mention detection tool for the entities in the queries. We used the Word2Vec implementation in gensim [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ] for learning entity embeddings, i.e., EntityVec. As mentioned previously, to obtain EntityVec embeddings we followed the approach outlined by Ni et al. [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ] and replaced the outbound hyperlinks to Wikipedia pages with unique placeholder tokens. We learn embeddings for 3.0 million of the 4.8 million entities in Wikipedia.
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>Hyperparameter Settings</title>
        <p>The smoothing parameter of the language modeling approach is obtained by 2-fold cross-validation over the queries, chosen from the set {100, 500, 1000, 1500}. To tune the RM3 hyperparameters, i.e., the original query's weight and the number of expansion terms, we use 2-fold and 5-fold cross-validation. The original query's weight is varied from 0.1 to 0.9 in increments of 0.1, and the number of terms is varied from 10 to 90 in increments of 20. With the parameters tuned by 2-fold and 5-fold cross-validation, RM3 for short queries did not improve over the language model approach. We note that there were other parameter settings that did improve RM3 over the language model, but they were not discoverable in the 2-fold or 5-fold approaches. When we report RM3 results (Table 5), we report the results for 2-fold cross-validation.</p>
        <p>The parameters for learning the EntityVec embeddings are as follows: window size = 10, sub-sampling = 1e-3, cutoff min-count = 0. The learned embeddings have dimension 200 and are learned with the Skip-gram model.</p>
        <p>Mean Average Precision (MAP) over the top-ranked 1000 documents is selected as the main evaluation metric for retrieval effectiveness. Furthermore, we consider precision over the top 10 retrieved documents (P@10). Since we have graded relevance judgments, we also report nDCG@10. Statistically significant differences in performance are determined using the two-tailed paired t-test computed at a 95% confidence level based on the average precision per query.</p>
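        <p>For clarity, the reported metrics can be computed for a single query as in the following sketch (illustrative only, not the evaluation scripts we used):</p>

```python
import math

# Illustrative sketch of the three reported metrics for a single ranked
# list, given graded relevance judgments (qrels).
def precision_at_k(ranking, qrels, k=10):
    return sum(1 for d in ranking[:k] if qrels.get(d, 0) > 0) / k

def average_precision(ranking, qrels):
    hits, total = 0, 0.0
    for i, d in enumerate(ranking, start=1):
        if qrels.get(d, 0) > 0:
            hits += 1
            total += hits / i
    n_rel = sum(1 for g in qrels.values() if g > 0)
    return total / n_rel if n_rel else 0.0

def ndcg_at_k(ranking, qrels, k=10):
    dcg = sum(qrels.get(d, 0) / math.log2(i + 1)
              for i, d in enumerate(ranking[:k], start=1))
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0

qrels = {"e1": 2, "e3": 1}              # graded relevance judgments
run = ["e1", "e2", "e3", "e4"]          # system ranking for one query
ap = average_precision(run, qrels)      # (1/1 + 2/3) / 2 = 5/6
```

        <p>MAP is then the mean of the per-query average precision values.</p>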
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <p>
        In this section, we explore the results of our entity representation models atop two baselines. We look at both a standard unigram approach, language modeling (LM) [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ], and an approach built on query expansion, relevance modeling (RM3).
      </p>
      <p>In Table 3, we present the results of our model on top of the LM baseline for the short and verbose query subsets as well as their union. We discuss the results of our models with respect to query length in Section 5.2. This table is the appropriate place to look for the overall results of our models, particularly the "All Queries" section. Both proposed methods outperform the baseline LM model, suggesting that there is value in both our EntityVec representation and in the more traditional WordVec query expansion. Combining the two methods yields even greater accuracy across all measures.</p>
      <p>In Table 4, we present the results of our different models atop LM using the traditional dataset subsets inside DBpedia-Entity v2. Since these datasets were originally constructed for different variations of the entity ranking task, we were curious whether their different query types would yield different results. We discuss the results in terms of the different styles of queries in Section 5.3.</p>
      <p>Finally, in Table 5, we examine our approaches on top of a baseline with query expansion built in. We discuss the results of our models on this expanded baseline in Section 5.4.</p>
      <p>In the result tables, relative improvements over the base retrieval models, i.e., LM and RM3, are shown as percentages to the right of the scores. Win/Tie/Loss shows the number of queries improved, unchanged, or hurt, respectively, compared with the base retrieval model under the MAP measure. The symbols †, ‡, and § indicate statistical significance over the base retrieval model, (base retrieval model)+WordVec, and (base retrieval model)+EntityVec, respectively. As mentioned earlier, we use two base retrieval models (LM and RM3). The best method for each metric is marked in bold.</p>
      <sec id="sec-5-1">
        <title>Entity Representations for Short and Verbose Queries</title>
        <p>We found that results were quite different for verbose queries (defined as queries longer than four terms) and short queries, so our tables are broken into three sections to reflect the overall dataset and these query-length subsets.</p>
        <p>Based on the results in Table 3, we can see that both WordVec and EntityVec improve verbose queries more than they improve short queries (particularly as measured by MAP). We speculate this could be because short queries are more prone to ambiguity, so better query representations are built from verbose queries, where the additional words provide disambiguation and thus better matching of related entities. Also, for the WordVec model, the embedding of a short query does not seem to improve matching significantly. It is also possible that some short queries are more specific, so the embedding (implicitly incorporating related words) is less important. Further analysis is needed to understand this behavior fully, but we recommend that systems that use entity representations consider using query length to select an appropriate model.</p>
        <p>If we now look at the win/tie/loss analysis for these queries at the far right of Table 3, we can see that there are many ties. This is a result of some queries lacking entities in their description: in the current version of our model, we cannot generate an entity representation if our entity linker (TagMe, in this case) does not identify any entities in the query, so each representation is identical. Even ignoring ties, we can see that there are more wins than losses, so our vector modeling approaches are helpful when entities are identified, and the magnitude of MAP improvements is higher for EntityVec than for WordVec, even though WordVec can be used for all queries and EntityVec only changes a subset.</p>
      <p>We further note that combining WordVec and EntityVec results in additional gains, indicating that the two methods are complementary, capturing different signals.</p>
      <p>When we investigate the effect of our entity vector models on different types of queries, we can see some more interesting results in Table 4. Since the queries are of such diverse types, it is not surprising to observe some variation. We see that the WordVec model does not show a significant improvement on the SemSearch-ES and QALD-2 results. Since SemSearch-ES queries are mostly ambiguous keyword queries, it is possible that the WordVec representations are not specific enough to be helpful.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Entity Representations and Query Expansion</title>
        <p>
          Finally, we evaluate the proposed methods in the pseudo-relevance feedback scenario. We choose RM3, a state-of-the-art PRF method that has been shown to perform well on various collections [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ]. Table 5 shows the results for the proposed methods and the RM3 baseline.
        </p>
        <p>We observe the same kind of improvements over the RM3 baseline with our WordVec and EntityVec models that we saw on top of our keyword-query baseline. This is notable because it shows that our embedding models are largely orthogonal to a state-of-the-art query expansion model, which is often pointed to as the source of improvement for embedding approaches.</p>
        <p>We note that on this dataset, the RM3 method actually lowers effectiveness for short queries compared to using LM alone. The WordVec and EntityVec models compensate somewhat for that reduction, but are not sufficient to recover all of the loss.</p>
        <p>In future work, we hope to analyze the relevant entities discovered by our embedding approaches that are not present in the RM3 baselines, in order to better understand where our improvements are coming from. For the EntityVec gains, we hypothesize that we have been able to encode critical information about the entity graph by modifying entity vectors to include their most important neighbors.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion And Future Work</title>
      <p>In this study, we expanded on traditional entity
embeddings by incorporating information from related
entities that are mentioned in their summary. We
demonstrated the efficacy of this model on a
popular entity-ranking collection in comparison to simpler
word2vec-style models and traditional retrieval
models. In our comparison to RM3, a pseudo-relevance
feedback query-expansion approach, we demonstrated
that the utility of our entity modeling is not limited to
query expansion; at the very least, it provides a useful and
novel method of query expansion in comparison to this
popular approach.</p>
      <p>In order to fully validate our model, we intend to
compare it to other unsupervised and semi-supervised
entity embedding representations. We hope to explore
more comparisons in future work, as well as more
variations of our entity embedding model.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgement</title>
      <p>This work was supported in part by the Center for
Intelligent Information Retrieval and in part by NSF
grant #IIS-1617408. Any opinions, findings, and
conclusions or recommendations expressed in this material
are those of the authors and do not necessarily reflect
those of the sponsors.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K.</given-names>
            <surname>Balog</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Azzopardi</surname>
          </string-name>
          , and
          <string-name>
            <surname>M. De Rijke</surname>
          </string-name>
          .
          <article-title>Formal models for expert finding in enterprise corpora</article-title>
          .
          <source>In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <volume>43</volume>
          –
          <fpage>50</fpage>
          . ACM,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K.</given-names>
            <surname>Balog</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bron</surname>
          </string-name>
          , and
          <string-name>
            <surname>M. De Rijke</surname>
          </string-name>
          .
          <article-title>Query modeling for entity search based on terms, categories, and examples</article-title>
          .
          <source>ACM Transactions on Information Systems (TOIS)</source>
          ,
          <volume>29</volume>
          (
          <issue>4</issue>
          ):
          <fpage>22</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Balog</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Carmel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A. P.</given-names>
            <surname>de Vries</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Herzig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Roitman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Schenkel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Serdyukov</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Tran Duc</surname>
          </string-name>
          .
          <article-title>The first joint international workshop on entity-oriented and semantic search (JIWES)</article-title>
          .
          <source>ACM SIGIR Forum</source>
          , volume
          <volume>46</volume>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>K.</given-names>
            <surname>Balog</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Neumayer</surname>
          </string-name>
          .
          <article-title>Hierarchical target type identification for entity-oriented queries</article-title>
          .
          <source>In Proceedings of the 21st ACM international conference on Information and knowledge management</source>
          , pages
          <volume>2391</volume>
          –
          <fpage>2394</fpage>
          . ACM,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K.</given-names>
            <surname>Balog</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Neumayer</surname>
          </string-name>
          .
          <article-title>A test collection for entity search in dbpedia</article-title>
          .
          <source>In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <volume>737</volume>
          –
          <fpage>740</fpage>
          . ACM,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>Balog</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Serdyukov</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A. P. d.</given-names>
            <surname>Vries</surname>
          </string-name>
          .
          <article-title>Overview of the trec 2010 entity track</article-title>
          .
          <source>Technical report, Norwegian University of Science and Technology, Trondheim</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Blanco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Halpin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Herzig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pound</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Thompson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T. T.</given-names>
            <surname>Duc</surname>
          </string-name>
          .
          <article-title>Entity search evaluation over structured web data</article-title>
          .
          <source>In Proceedings of the 1st international workshop on entity-oriented search (SIGIR 2011)</source>
          . ACM, New York,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bordes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Usunier</surname>
          </string-name>
          .
          <article-title>Open question answering with weakly supervised embedding models</article-title>
          .
          <source>In Joint European Conference on Machine Learning and Knowledge Discovery in Databases</source>
          , pages
          <volume>165</volume>
          –
          <fpage>180</fpage>
          . Springer,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Dalton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dietz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Allan</surname>
          </string-name>
          .
          <article-title>Entity query feature expansion using knowledge base links</article-title>
          .
          <source>In Proceedings of the 37th international ACM SIGIR conference on Research &amp; development in information retrieval</source>
          , pages
          <volume>365</volume>
          –
          <fpage>374</fpage>
          . ACM,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Deerwester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. T.</given-names>
            <surname>Dumais</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. W.</given-names>
            <surname>Furnas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. K.</given-names>
            <surname>Landauer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Harshman</surname>
          </string-name>
          .
          <article-title>Indexing by latent semantic analysis</article-title>
          .
          <source>Journal of the American society for information science</source>
          ,
          <volume>41</volume>
          (
          <issue>6</issue>
          ):
          <fpage>391</fpage>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>G.</given-names>
            <surname>Demartini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Iofciu</surname>
          </string-name>
          , and
          <string-name>
            <surname>A. P. De Vries</surname>
          </string-name>
          .
          <article-title>Overview of the inex 2009 entity ranking track</article-title>
          .
          <source>In International Workshop of the Initiative for the Evaluation of XML Retrieval</source>
          , pages
          <volume>254</volume>
          –
          <fpage>264</fpage>
          . Springer,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>P.</given-names>
            <surname>Ferragina</surname>
          </string-name>
          and
          <string-name>
            <given-names>U.</given-names>
            <surname>Scaiella</surname>
          </string-name>
          .
          <article-title>Fast and accurate annotation of short texts with wikipedia pages</article-title>
          .
          <source>IEEE software</source>
          ,
          <volume>29</volume>
          (
          <issue>1</issue>
          ):
          <volume>70</volume>
          –
          <fpage>75</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Foley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>O'Connor</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Allan</surname>
          </string-name>
          .
          <article-title>Improving entity ranking for keyword queries</article-title>
          .
          <source>In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management</source>
          , pages
          <year>2061</year>
          –
          <year>2064</year>
          . ACM,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D.</given-names>
            <surname>Graus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tsagkias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Weerkamp</surname>
          </string-name>
          , E. Meij, and M. de Rijke.
          <article-title>Dynamic collective entity representations for entity ranking</article-title>
          .
          <source>In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining</source>
          , pages
          <volume>595</volume>
          –
          <fpage>604</fpage>
          . ACM,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>K.</given-names>
            <surname>Gunaratna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Thirunarayan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sheth</surname>
          </string-name>
          , and G. Cheng.
          <article-title>Gleaning types for literals in rdf triples with application to entity summarization</article-title>
          .
          <source>In International Semantic Web Conference</source>
          , pages
          <volume>85</volume>
          –
          <fpage>100</fpage>
          . Springer,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>K.</given-names>
            <surname>Gunaratna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Thirunarayan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Sheth</surname>
          </string-name>
          .
          <article-title>FACES: Diversity-aware entity summarization using incremental hierarchical conceptual clustering</article-title>
          .
          <source>In AAAI</source>
          , pages
          <volume>116</volume>
          –
          <fpage>122</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>H.</given-names>
            <surname>Halpin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Herzig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Blanco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pound</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Thompon</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. T.</given-names>
            <surname>Tran</surname>
          </string-name>
          .
          <article-title>Evaluating ad-hoc object retrieval</article-title>
          .
          <source>In IWEST@ ISWC</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>F.</given-names>
            <surname>Hasibi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Nikolaev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Balog</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Bratsberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kotov</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Callan</surname>
          </string-name>
          .
          <article-title>Dbpedia-entity v2: A test collection for entity search</article-title>
          .
          <source>In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , pages
          <volume>1265</volume>
          –
          <fpage>1268</fpage>
          . ACM,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>K.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Hakkani-Tur</surname>
          </string-name>
          .
          <article-title>Entity ranking for descriptive queries</article-title>
          .
          <source>In Spoken Language Technology Workshop (SLT)</source>
          ,
          <year>2014</year>
          IEEE, pages
          <volume>200</volume>
          –
          <fpage>205</fpage>
          . IEEE,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>K. Y.</given-names>
            <surname>Itakura</surname>
          </string-name>
          and
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Clarke</surname>
          </string-name>
          .
          <article-title>A framework for bm25f-based xml retrieval</article-title>
          .
          <source>In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <volume>843</volume>
          –
          <fpage>844</fpage>
          . ACM,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>R.</given-names>
            <surname>Kaptein</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Kamps</surname>
          </string-name>
          .
          <article-title>Exploiting the category structure of wikipedia for entity ranking</article-title>
          .
          <source>Artificial Intelligence</source>
          ,
          <volume>194</volume>
          :
          <fpage>111</fpage>
          –
          <fpage>129</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>V.</given-names>
            <surname>Lavrenko</surname>
          </string-name>
          and
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Croft</surname>
          </string-name>
          .
          <article-title>Relevance based language models</article-title>
          .
          <source>In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <volume>120</volume>
          –
          <fpage>127</fpage>
          . ACM,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Goldberg</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>Dagan</surname>
          </string-name>
          .
          <article-title>Improving distributional similarity with lessons learned from word embeddings</article-title>
          .
          <source>Transactions of the Association for Computational Linguistics</source>
          ,
          <volume>3</volume>
          :
          <fpage>211</fpage>
          –
          <fpage>225</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          .
          <article-title>Learning entity and relation embeddings for knowledge graph completion</article-title>
          .
          <source>In AAAI</source>
          , volume
          <volume>15</volume>
          , pages
          <fpage>2181</fpage>
          –
          <fpage>2187</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>V.</given-names>
            <surname>Lopez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Unger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cimiano</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Motta</surname>
          </string-name>
          .
          <article-title>Evaluating question answering over linked data</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web</source>
          ,
          <volume>21</volume>
          :3–
          <fpage>13</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lv</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhai</surname>
          </string-name>
          .
          <article-title>A comparative study of methods for estimating query language models with pseudo feedback</article-title>
          .
          <source>In Proceedings of the 18th ACM conference on Information and knowledge management</source>
          , pages
          <year>1895</year>
          –
          <year>1898</year>
          . ACM,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          , G. Corrado, and
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <article-title>Efficient estimation of word representations in vector space</article-title>
          .
          <source>arXiv preprint arXiv:1301.3781</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , I. Sutskever,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. S.</given-names>
            <surname>Corrado</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <volume>3111</volume>
          –
          <fpage>3119</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>F.</given-names>
            <surname>Nanni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Magnusson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Dietz</surname>
          </string-name>
          .
          <article-title>Benchmark for complex answer retrieval</article-title>
          .
          <source>In Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval</source>
          , pages
          <volume>293</volume>
          –
          <fpage>296</fpage>
          . ACM,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. K.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mass</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sheinwald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Cao</surname>
          </string-name>
          .
          <article-title>Semantic documents relatedness using concept graph representation</article-title>
          .
          <source>In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining</source>
          , pages
          <volume>635</volume>
          –
          <fpage>644</fpage>
          . ACM,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pennington</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <article-title>GloVe: Global vectors for word representation</article-title>
          .
          <source>In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)</source>
          , pages
          <fpage>1532</fpage>
          –
          <fpage>1543</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Perez-Aguera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Arroyo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Greenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Iglesias</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Fresno</surname>
          </string-name>
          .
          <article-title>Using bm25f for semantic search</article-title>
          .
          <source>In Proceedings of the 3rd international semantic search workshop</source>
          , page 2. ACM,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>D.</given-names>
            <surname>Petkova</surname>
          </string-name>
          and
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Croft</surname>
          </string-name>
          .
          <article-title>Hierarchical language models for expert finding in enterprise corpora</article-title>
          .
          <source>International Journal on Artificial Intelligence Tools</source>
          ,
          <volume>17</volume>
          (
          <issue>01</issue>
          ):
          <fpage>5</fpage>
          –
          <lpage>18</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Ponte</surname>
          </string-name>
          and
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Croft</surname>
          </string-name>
          .
          <article-title>A language modeling approach to information retrieval</article-title>
          .
          <source>In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <fpage>275</fpage>
          –
          <lpage>281</lpage>
          . ACM,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pound</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mika</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Zaragoza</surname>
          </string-name>
          .
          <article-title>Ad-hoc object retrieval in the web of data</article-title>
          .
          <source>In Proceedings of the 19th international conference on World wide web</source>
          , pages
          <fpage>771</fpage>
          –
          <lpage>780</lpage>
          . ACM,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rehurek</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Sojka</surname>
          </string-name>
          .
          <article-title>Software Framework for Topic Modelling with Large Corpora</article-title>
          .
          <source>In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks</source>
          , pages
          <fpage>45</fpage>
          –
          <lpage>50</lpage>
          , Valletta, Malta, May
          <year>2010</year>
          . ELRA. http://is.muni.cz/publication/884893/en.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>P.</given-names>
            <surname>Resnik</surname>
          </string-name>
          .
          <article-title>Using information content to evaluate semantic similarity in a taxonomy</article-title>
          .
          <source>arXiv preprint cmp-lg/9511007</source>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>P.</given-names>
            <surname>Ristoski</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Paulheim</surname>
          </string-name>
          .
          <article-title>RDF2Vec: RDF graph embeddings for data mining</article-title>
          .
          <source>In International Semantic Web Conference</source>
          , pages
          <fpage>498</fpage>
          –
          <lpage>514</lpage>
          . Springer,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>S.</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zaragoza</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Taylor</surname>
          </string-name>
          .
          <article-title>Simple BM25 extension to multiple weighted fields</article-title>
          .
          <source>In Proceedings of the thirteenth ACM international conference on Information and knowledge management</source>
          , pages
          <fpage>42</fpage>
          –
          <lpage>49</lpage>
          . ACM,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>M.</given-names>
            <surname>Schuhmacher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dietz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S. P.</given-names>
            <surname>Ponzetto</surname>
          </string-name>
          .
          <article-title>Ranking entities for web queries through text and knowledge</article-title>
          .
          <source>In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management</source>
          , pages
          <fpage>1461</fpage>
          –
          <lpage>1470</lpage>
          . ACM,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>P.</given-names>
            <surname>Serdyukov</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>de Vries</surname>
          </string-name>
          .
          <article-title>Delft University at the TREC 2009 entity track: Ranking Wikipedia entities</article-title>
          .
          <source>Technical report, Delft University of Technology, Netherlands</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kamps</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. R.</given-names>
            <surname>Camps</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Marx</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Schuth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Theobald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gurajada</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Mishra</surname>
          </string-name>
          .
          <article-title>Overview of the INEX 2012 Linked Data Track</article-title>
          .
          <source>In CLEF (Online Working Notes/Labs/Workshop)</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Callan</surname>
          </string-name>
          .
          <article-title>EsdRank: Connecting query and documents through external semi-structured data</article-title>
          .
          <source>In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management</source>
          , pages
          <fpage>951</fpage>
          –
          <lpage>960</lpage>
          . ACM,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Callan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          .
          <article-title>Word-entity duet representations for document ranking</article-title>
          .
          <source>In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , pages
          <fpage>763</fpage>
          –
          <lpage>772</lpage>
          . ACM,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Power</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Callan</surname>
          </string-name>
          .
          <article-title>Explicit semantic ranking for academic search via knowledge graph embedding</article-title>
          .
          <source>In Proceedings of the 26th international conference on world wide web</source>
          , pages
          <fpage>1271</fpage>
          –
          <lpage>1279</lpage>
          . International World Wide Web Conferences Steering Committee,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>B.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-t.</given-names>
            <surname>Yih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Deng</surname>
          </string-name>
          .
          <article-title>Embedding entities and relations for learning and inference in knowledge bases</article-title>
          .
          <source>arXiv preprint arXiv:1412.6575</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          [47]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zamani</surname>
          </string-name>
          and
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Croft</surname>
          </string-name>
          .
          <article-title>Embedding-based query language models</article-title>
          .
          <source>In Proceedings of the 2016 ACM international conference on the theory of information retrieval</source>
          , pages
          <fpage>147</fpage>
          –
          <lpage>156</lpage>
          . ACM,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          [48]
          <string-name>
            <given-names>N.</given-names>
            <surname>Zhiltsov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kotov</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Nikolaev</surname>
          </string-name>
          .
          <article-title>Fielded sequential dependence model for ad-hoc entity retrieval in the web of data</article-title>
          .
          <source>In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , pages
          <fpage>253</fpage>
          –
          <lpage>262</lpage>
          . ACM,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          [49]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zwicklbauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Seifert</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Granitzer</surname>
          </string-name>
          .
          <article-title>Robust and collective entity disambiguation through semantic embeddings</article-title>
          .
          <source>In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval</source>
          , pages
          <fpage>425</fpage>
          –
          <lpage>434</lpage>
          . ACM,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>