
Leveraging Wikipedia Knowledge for Entity Recommendations

Nitish Aggarwal

Peter Mika

pmika@yahoo-inc.com

Roi Blanco

roi@yahoo-inc.com

Paul Buitelaar

Insight Centre for Data Analytics, National University of Ireland, Galway, Ireland

Yahoo Labs, 125 Shaftesbury Ave, WC2H 8HR London, UK

User engagement is a fundamental goal of commercial search engines. To increase it, search engines give users an opportunity to explore the entities related to their queries. As most queries can be linked to entities in knowledge bases, search engines recommend entities that are related to the user's search query. In this paper, we present Wikipedia-based Features for Entity Recommendation (WiFER), which combines different features extracted from Wikipedia in order to provide related entity recommendations. We evaluate WiFER on a dataset of 4.5K search queries, where each query has around 10 related entities tagged by human experts on a 5-level label scale.

With the advent of large knowledge bases like DBpedia (http://wiki.dbpedia.org/), YAGO (http://www.mpi-inf.mpg.de/departments/databases-and-informationsystems/research/yago-naga/yago/) and Freebase (https://www.freebase.com/), search engines have started recommending entities related to web search queries. Pound et al. [7] reported that more than 50% of web search queries pivot around a single entity and can be linked to an entity in the knowledge bases. Consequently, the task of entity recommendation in the context of web search can be defined as finding the entities related to the entity appearing in a web search query. An intuitive way to obtain related entities is to take all the entities that are explicitly linked to a given entity in a knowledge base. However, most popular entities can easily have more than 1,000 directly connected entities, and knowledge bases tend to cover only some specific types of relations. For instance, "Tom Cruise" and "Brad Pitt" are not directly connected by any relation in the DBpedia graph; however, they can be considered related to each other, as both are popular Hollywood actors and have co-starred in movies. Therefore, to build a system for entity recommendation, there is a need to discover related entities beyond the relations explicitly defined in knowledge bases. Furthermore, these related entities require a ranking method to select the most related ones.

Blanco et al. [4] described the Spark system for related entity recommendation and suggested that such recommendations are successful at extending users' search sessions in Yahoo search. Microsoft also published a similar system [8] that performs personalized entity recommendation by analyzing click logs. In this paper, we present Wikipedia-based Features for Entity Recommendation (WiFER), which combines different features extracted from Wikipedia. It uses Distributional Semantics for Entity Relatedness (DiSER) [1, 2] and Explicit Semantic Analysis (ESA) [6] as features, in combination with others. The features are combined by using learning to rank methods [5]. WiFER is inspired by Spark. However, Spark utilizes proprietary data like query logs and query sessions, which are not publicly available. Therefore, we focus on extracting different features from Wikipedia to build the entity recommendation system.

Approach

Wikipedia-based Features for Entity Recommendation (WiFER) combines different features by using a learning to rank method. The features are extracted from Wikipedia by treating it as two different types of data source: a collection of textual content and a collection of Wikipedia hyperlinks. They are derived from the hypothesis that entities which often occur in the same context (a Wikipedia article) are more likely to be related to each other. We use the following features:

1. Probability (P1, P2): calculated by taking the ratio of the number of articles that contain the given entity to the total number of articles. For an entity $E$,

$$P(E) = \frac{1}{N} \sum_{i=1}^{N} o_i$$

where $o_i = 1$ if article $s_i$ contains the entity $E$ and $o_i = 0$ otherwise, and $N$ is the total number of articles. The value of $P$ for an entity is independent of the other entities; therefore it gives two values, $P_1 = P(E_1)$ and $P_2 = P(E_2)$, for an entity pair consisting of $E_1$ and $E_2$.

2. Joint probability (JPSYM): obtained by taking the ratio of the number of articles that contain both given entities to the total number of articles,

$$JP_{SYM}(E_1, E_2) = \frac{1}{N} \sum_{i=1}^{N} co_i$$

where $co_i = 1$ if article $s_i$ contains both entities $E_1$ and $E_2$, and $co_i = 0$ otherwise.

3. PMI (SISYM): the point-wise mutual information (PMI),

$$PMI(E_1, E_2) = \log \frac{P(E_1, E_2)}{P(E_1)\, P(E_2)}$$

where $P(E_1)$ and $P(E_2)$ are the prior probabilities described above, and $P(E_1, E_2)$ is the ratio of the number of articles that contain both entities $E_1$ and $E_2$ to the total number of articles.

4. Cosine similarity (CSSYM): calculated as

$$Cosine(E_1, E_2) = \frac{P(E_1, E_2)}{P(E_1)\, P(E_2)}$$

Because every feature is computed over both Wikipedia collections, WiFER generates 16 different feature values. To generate the feature values from the text collection, we consider only the surface form of an entity when counting its occurrences. In the collection of hyperlinks, however, we count an occurrence of an entity only if the entity appears as a hyperlink in an article. The Probability feature generates two values for an entity pair; each collection therefore provides 8 different feature values, and we obtain 16 values in total.
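The occurrence-based features above can be illustrated with a minimal sketch. It covers only the probability, joint probability and PMI features. The function name, the representation of a collection as one set of entity names per article, and the fallback value of 0.0 when a pair never co-occurs are assumptions made for illustration, not details taken from WiFER.

```python
import math
from typing import Dict, Iterable, Set


def cooccurrence_features(articles: Iterable[Set[str]], e1: str, e2: str) -> Dict[str, float]:
    """Compute P1, P2, joint probability and PMI for one entity pair.

    `articles` is a hypothetical representation of a collection: one set of
    entity names per article, built either from surface forms found in the
    text or from the hyperlinks of the article.
    """
    n = n1 = n2 = n12 = 0
    for entities in articles:
        n += 1
        has1 = e1 in entities
        has2 = e2 in entities
        n1 += has1           # articles containing e1
        n2 += has2           # articles containing e2
        n12 += has1 and has2  # articles containing both
    if n == 0:
        raise ValueError("the collection contains no articles")

    p1, p2, jp = n1 / n, n2 / n, n12 / n
    # PMI is undefined when the pair never co-occurs; returning 0.0 in that
    # case is an assumption of this sketch.
    pmi = math.log(jp / (p1 * p2)) if jp > 0 else 0.0
    return {"P1": p1, "P2": p2, "JP": jp, "PMI": pmi}


if __name__ == "__main__":
    toy = [{"Tom Cruise", "Brad Pitt"}, {"Tom Cruise"}, {"Brad Pitt"}, {"London"}]
    print(cooccurrence_features(toy, "Tom Cruise", "Brad Pitt"))
```

Under these assumptions the function would be called twice per entity pair, once with sets built from the text collection and once with sets built from the hyperlink collection, and the two result dictionaries would be concatenated into one feature vector.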

Evaluation

To evaluate our approach, we compare WiFER with the Spark entity recommendation system [4], which uses more than 100 features extracted from different data sources such as query logs and user search sessions. We evaluate performance on the same dataset that was used by Spark. It consists of 4,797 search queries. Every query refers to an entity in DBpedia and comes with a list of entity candidates. The entity candidates are tagged by professional editors on a 5-level label scale: Perfect, Excellent, Good, Fair, and Bad. In total, the dataset contains 47,623 query-entity pairs. We use the Gradient Boosting Decision Tree (GBDT) [5] ranking method. Because the number of retrieved related entities varies across queries, we use Normalized Discounted Cumulative Gain (nDCG) as the performance metric and report nDCG@10, nDCG@5, and nDCG@1. All nDCG scores are obtained by performing 10-fold cross validation. In addition to the experiments on the dataset with all entity types, we also evaluated the systems on subsets containing only person type entities or only location type entities.

Table 1 shows the retrieval performance of Spark and compares it with WiFER. WiFER achieved comparable results on the full dataset and on person type entities, but it did not cope well with location type entities. A possible reason is that most of the locations are too specific and do not have enough information on Wikipedia. To investigate whether WiFER can complement Spark, we combined all the Spark features with the WiFER features. WiFER alone could not outperform Spark; however, the combination of both, i.e. Spark+WiFER, achieved higher scores for all the test cases. Although WiFER obtained relatively lower scores for location type entities, it is able to complement Spark's performance. Further, we performed an extensive evaluation to investigate the importance of different features in entity recommendation (see [3] for more details).
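The nDCG@k metric used in this evaluation can be sketched as follows. The mapping of the five editorial labels to integer gains and the exponential gain with a log2 discount are common conventions assumed here for illustration; the text does not state which variant was used.

```python
import math
from typing import Sequence

# Assumed mapping from editorial labels to integer gains (not from the paper).
LABEL_GAIN = {"Perfect": 4, "Excellent": 3, "Good": 2, "Fair": 1, "Bad": 0}


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k ranked positions."""
    return sum((2 ** g - 1) / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_labels: Sequence[str], k: int) -> float:
    """nDCG@k for one query, given labels in the order the system ranked them."""
    gains = [LABEL_GAIN[label] for label in ranked_labels]
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    # A query whose candidates are all "Bad" has no ideal gain; score it 0.0.
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    ranking = ["Good", "Perfect", "Bad", "Fair"]
    for k in (1, 5, 10):
        print(k, round(ndcg_at_k(ranking, k), 4))
```

Normalizing by the ideal ordering of the same candidate list keeps scores comparable across queries with different numbers of candidates, which is the motivation given above for choosing nDCG.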

Conclusion

In this paper, we presented WiFER, which combines different features extracted from Wikipedia by using a learning to rank method. We showed that WiFER achieves accuracy comparable to Spark, which uses more than 100 features obtained from proprietary data sources like query logs and user search sessions. Moreover, since Spark does not utilize Wikipedia to build its features, we combined WiFER with the Spark features and showed that WiFER complements the overall performance of Spark.

References

[1] Aggarwal, Asooja, Ziad, and Buitelaar. Who are the American vegans related to Brad Pitt?: Exploring related entities. In Proceedings of the 24th International Conference on World Wide Web Companion, 2015.

[2] Aggarwal and Buitelaar. Wikipedia-based distributional semantics for entity relatedness. In 2014 AAAI Fall Symposium Series, 2014.

[3] Aggarwal, Mika, Blanco, and Buitelaar. Insights into entity recommendation in web search. In Proceedings of the Intelligent Exploration of Semantic Data, ISWC, 2015.

[4] Blanco, B. B. Cambazoglu, Mika, and Torzec. Entity recommendations in web search. In International Semantic Web Conference (ISWC), 2013.

[5] J. H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189-1232, 2001.

[6] Gabrilovich and Markovitch. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI'07, pages 1606-1611, 2007.

[7] Pound, Mika, and Zaragoza. Ad-hoc object retrieval in the web of data. In Proceedings of the 19th International Conference on World Wide Web, pages 771-780. ACM, 2010.

[8] Yu, Ma, B.-J. P. Hsu, and J. Han. On building entity recommender systems using user click log and Freebase knowledge. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining, pages 263-272. ACM, 2014.