<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">A Topic-Sensitive Model for Salient Entity Linking</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Lei</forename><surname>Zhang</surname></persName>
							<email>l.zhang@kit.edu</email>
							<affiliation key="aff0">
								<orgName type="department">Institute AIFB</orgName>
								<orgName type="institution">Karlsruhe Institute of Technology (KIT)</orgName>
								<address>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Cong</forename><surname>Liu</surname></persName>
							<email>cong.liu@student.kit.edu</email>
							<affiliation key="aff0">
								<orgName type="department">Institute AIFB</orgName>
								<orgName type="institution">Karlsruhe Institute of Technology (KIT)</orgName>
								<address>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Achim</forename><surname>Rettinger</surname></persName>
							<email>rettinger@kit.edu</email>
							<affiliation key="aff0">
								<orgName type="department">Institute AIFB</orgName>
								<orgName type="institution">Karlsruhe Institute of Technology (KIT)</orgName>
								<address>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">A Topic-Sensitive Model for Salient Entity Linking</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">B00A938CC68A68F12DB186B6DC7E79C2</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T09:12+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In recent years, the number of entities in large knowledge bases available on the Web has been increasing rapidly. Such entities can be used to bridge textual data with knowledge bases and thus help with many tasks, such as text understanding, word sense disambiguation and information retrieval. The key issue is to link the entity mentions in documents with the corresponding entities in knowledge bases, referred to as entity linking. In addition, for many entity-centric applications, entity salience for a document has become a very important factor. This raises an impending need to identify a set of salient entities that are central to the input document. In this paper, we introduce a new task of salient entity linking and propose a graph-based disambiguation solution, which integrates several features, especially a topic-sensitive model based on Wikipedia categories. Experimental results show that our method significantly outperforms the state-of-the-art entity linking methods in terms of precision, recall and F-measure.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>In recent years, large repositories of structured knowledge publicly available on the Web, such as Wikipedia, DBpedia, Freebase and YAGO, have become valuable resources for information extraction. In this regard, entity linking, which leverages such knowledge bases to link words or phrases in natural language text with the corresponding entities, has emerged as a topic of major interest.</p><p>The challenges of entity linking lie in entity recognition and disambiguation. The first stage serves to detect words or phrases in text, also called mentions, that are likely to denote entities; the second stage performs the disambiguation of the recognized mentions into entities. Many methods <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b4">5,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b7">8]</ref> have been proposed to address the problems of entity disambiguation and linking. However, these methods do not take into account the actual importance of entities w.r.t. the topics of the input document. In this work, the relation between the candidate entities and their associated categories is utilized to select the entities that are related to the document topics.</p><p>In addition, there is an impending need to identify a set of salient entities in a document that play an important role in its content, which would help to better understand its meaning or aboutness <ref type="bibr" target="#b8">[9]</ref>. This paper focuses on the task of salient entity linking, especially the disambiguation of salient entities.</p><p>The rest of the paper is organized as follows. We start with an overview of our framework for salient entity linking in Sec. 2. The details of the features and measures used for salient entity disambiguation are provided in Sec. 3. Based on them, we discuss the graph-based disambiguation utilizing a topic-sensitive model in Sec. 4. Evaluation results are then presented in Sec. 5, followed by the conclusions in Sec. 6.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Framework</head><p>Before we discuss our salient entity linking framework, we first formulate the task of entity linking and then introduce the problem of salient entity linking, an extension of the general entity linking task.</p><p>Definition 1 (Entity Linking). Let M = {m 1 , m 2 , . . . , m p } denote a set of entity mentions in a document D. Given a knowledge base KB containing a set of entities E = {e 1 , e 2 , . . . , e n }, the objective of entity linking is to determine the referent entities in KB for the mentions in M , where two functions are to be found. For entity recognition, the mentions need to be extracted from D, where a recognition function er : D → 2 M will be computed. The resulting mentions (i.e., a subset µ ⊆ M ) are then mapped to entities in KB, where a disambiguation function ed : µ → E must be derived.</p><p>Definition 2 (Salient Entity Linking). Given a knowledge base KB and a document D, the recognition function of salient entity linking is the same as in general entity linking, i.e., er : D → 2 M . For the set of mentions µ ⊆ M yielded by the recognition function, the disambiguation function ed : µ → E ∪ {Non-Salient}, which maps the set of mentions µ to entities in the KB or to non-salient entities, must be derived, where non-salient entities are entities that receive no focus of attention in D, i.e., the document D is not actually about such entities.</p><p>An illustration of our salient entity linking framework consisting of several components is given in Fig. <ref type="figure" target="#fig_0">1</ref>. In the following, we first introduce the components w.r.t. 
general entity linking and then discuss its extension with the components for salient entity linking by utilizing a topic-sensitive model.</p><p>For both general and salient entity linking, the input text is first processed by entity recognition, which detects the boundaries of mentions without knowing the actual referent entities or whether they are salient or non-salient entities. Then these mentions serve as the input of entity disambiguation, which is the focus of this work since we do not aim to compare the methods' ability to recognize entity names in the input text.</p><p>Given a detected mention, its candidate referent entities are extracted from the knowledge base. For entity disambiguation regarding general entity linking, our framework combines different features including prior mention importance, mention-entity compatibility and entity-entity coherence. The feature of prior mention importance assigns a prior importance to each detected mention as a weight, which is used as the initial evidence for graph-based disambiguation. While the local feature of mention-entity compatibility captures the most likely entity behind the mention and the entity that best fits the context, the global feature of entity-entity coherence collectively captures the linked entities in a document that are related to each other. These features are then employed by graph-based disambiguation based on a personalized PageRank algorithm.</p><p>To enable effective salient entity linking, we first perform text classification on the input text using a multi-class support vector machine (SVM) classifier based on Wikipedia categories<ref type="foot" target="#foot_0">1</ref> aligned with the training corpus. For each category, we compute the category probability of the input document, which serves as the feature of document-specific category importance. 
In addition, we compute the strength of entity-category association based on the depth between each candidate entity and its categories. Such features are then incorporated into graph-based disambiguation using a topic-sensitive PageRank algorithm.</p></div>
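The two-stage pipeline of Definition 2 can be sketched compactly as follows. This is a minimal illustration with hypothetical names, not the paper's implementation: er stands for the recognition function and ed for the disambiguation function that returns either a KB entity or the Non-Salient label.

```python
# A minimal sketch (hypothetical names) of the salient entity linking
# pipeline from Definition 2: er recognizes mentions in a document and
# ed maps each mention to a KB entity or to the Non-Salient label.

def salient_entity_linking(doc, er, ed):
    """Compose the recognition function er : D -> 2^M with the
    disambiguation function ed : mu -> E ∪ {Non-Salient}."""
    return {m: ed(m) for m in er(doc)}
```

For example, with a toy recognizer returning two mentions and a toy disambiguator that only links one of them, the pipeline yields one linked entity and one Non-Salient label per mention.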
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Features and Measures</head><p>In this section, we discuss the features and measures needed for salient entity disambiguation, while the graph model and algorithm will be presented in Sec. 4.</p><p>Prior Mention Importance. We employ the Wikipedia link structures for determining the prior mention importance. As each Wikipedia article describes an entity, article titles, redirect pages and link anchors can be used to refer to the entity. Based on the above sources, we extract all surface forms of entities.</p><p>For each mention m with the name m.s as surface form of an entity, we define the probability P (m.s) that captures how likely m.s refers to an entity as</p><formula xml:id="formula_0">P (m.s) = count link (m.s) / (count link (m.s) + count text (m.s))<label>(1)</label></formula><p>where count link (m.s) is the number of articles that contain m.s as anchor text and count text (m.s) is the number of articles where m.s appears as raw text.</p><p>Mention-Entity Compatibility. For each mention m and its candidate referent entity e, we calculate the semantic similarity SS(m, e) representing the local mention-entity compatibility of m and e as follows</p><formula xml:id="formula_1">SS(m, e) = α • LP (m, e) + β • CS(m, e)<label>(2)</label></formula><p>where LP (m, e) is the link probability of e for m, CS(m, e) is the context similarity between m and e, and α and β are tunable parameters with α + β = 1.</p><p>The link probability LP (m, e) can be calculated using the probability P (e|m.s) capturing how likely the mention name m.s refers to the entity e as follows</p><formula xml:id="formula_2">LP (m, e) = P (e|m.s) = count link (e, m.s) / Σ ei∈Em.s count link (e i , m.s)<label>(3)</label></formula><p>where count link (e, m.s) denotes the number of links using m.s as anchor text pointing to e and E m.s is the set of entities that have the surface form m.s.</p><p>Entity-Entity Coherence. 
The disambiguation is based on the feature of entity-entity coherence, which collectively captures the referent entities of the mentions contained in the same document that are related to each other. In this regard, we calculate the semantic relatedness between each pair of entities e i and e j by adopting the Wikipedia link-based measure described in <ref type="bibr" target="#b9">[10]</ref>, which is originally modeled after the Normalized Google Distance (NGD) <ref type="bibr" target="#b10">[11]</ref>, as follows</p><formula xml:id="formula_4">SR(e i , e j ) = 1 − (log(max(|E i |, |E j |)) − log(|E i ∩ E j |)) / (log(|E|) − log(min(|E i |, |E j |)))<label>(5)</label></formula><p>where E i and E j are the sets of entities that link to e i and e j in KB respectively, and E is the set of all entities in KB.</p><p>Document-specific Category Importance. For text classification of the input document, we employ John C. Platt's sequential minimal optimization for training a support vector machine (SVM) classifier <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b12">13]</ref>. Multi-category problems are solved using pairwise classification. To obtain proper probability estimates, we use the option that fits logistic regression models to the outputs of the SVM classifier. In our multi-category scenario, the predicted probabilities are coupled using Hastie and Tibshirani's pairwise coupling method <ref type="bibr" target="#b13">[14]</ref>. All these algorithms have been integrated into Weka 2 , a collection of machine learning algorithms for data mining tasks. Based on that, we calculate the category probability P (c i ) of the input text for each assigned category c i , which reflects the document-specific category importance.</p><p>Entity-Category Association. All candidate entities are mapped to the selected Wikipedia categories. 
In order to measure the entity-category association between an entity e and its assigned category c, we define the distance d(c, e) as the minimum depth at which the entity e is located in Wikipedia's category tree with the category c as the root. This is computed offline by performing a breadth-first search starting from the fundamental category that forms the root of Wikipedia's hierarchy to each entity. Then the semantic association SA(c, e) between entity e and category c can be calculated as</p><formula xml:id="formula_5">SA(c, e) = 1 d(c, e)<label>(6)</label></formula></div>
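As a concrete illustration, the measures of this section can be sketched in Python over toy link statistics and a toy category tree. This is a hedged sketch, not the paper's implementation: the conventions for zero link overlap in Eq. (5) and for entities unreachable from a category in Eq. (6) are our assumptions, since the paper does not specify these edge cases.

```python
import math
from collections import deque

def keyphraseness(count_link, count_text):
    """Eq. (1): how likely the surface form m.s denotes an entity,
    given its anchor-text and raw-text article counts."""
    return count_link / (count_link + count_text)

def semantic_relatedness(in_links_i, in_links_j, total_entities):
    """Eq. (5): Wikipedia link-based relatedness [10]. in_links_i and
    in_links_j are the sets of entities linking to e_i and e_j in KB;
    total_entities is |E|."""
    overlap = len(in_links_i & in_links_j)
    if overlap == 0:
        return 0.0  # assumption: clamp to 0 when there are no shared in-links
    a, b = len(in_links_i), len(in_links_j)
    return 1.0 - (math.log(max(a, b)) - math.log(overlap)) / (
        math.log(total_entities) - math.log(min(a, b)))

def min_depth(children, root, entity):
    """Breadth-first search from category `root` down the category tree;
    returns the minimum depth d(c, e) at which `entity` occurs, or None."""
    queue, seen = deque([(root, 0)]), {root}
    while queue:
        node, depth = queue.popleft()
        if node == entity:
            return depth
        for child in children.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return None

def semantic_association(children, category, entity):
    """Eq. (6): SA(c, e) = 1 / d(c, e); 0 if e is unreachable from c
    (an assumption, not spelled out in the paper)."""
    d = min_depth(children, category, entity)
    return 1.0 / d if d else 0.0
```

For instance, a surface form used as an anchor in 30 articles and as raw text in 70 gets keyphraseness 0.3, and an entity two levels below a category root gets SA = 0.5.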
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Graph Model and Algorithm</head><p>Based on the features and measures discussed in Sec. 3, we construct a directed weighted graph G = {N, R}, called disambiguation graph, where N = N M ∪ N E ∪ N C is the disjoint union of mention nodes N M , entity nodes N E and category nodes N C , and R is the set of directed edges representing relationships between these nodes. All detected mentions and their candidate referent entities are added into N M and N E , respectively, while the categories that the input text belongs to are added into N C . For each mention m and its candidate entity e, we add an edge from m to e into R. Additionally, we add an edge between e i and e j into R if they are connected in KB. Furthermore, for each association between an entity e and a category c, an edge from c to e will be added into R.</p><p>Once the disambiguation graph G is built, we apply a personalized PageRank algorithm <ref type="bibr" target="#b14">[15,</ref><ref type="bibr" target="#b15">16]</ref> over it. The calculation of the PageRank vector P r over G is equivalent to resolving the following equation</p><formula xml:id="formula_6">P r = d • T • P r + (1 − d) • v<label>(7)</label></formula><p>where T is the transition probability matrix, v is the initial evidence vector and d is the so-called damping factor, usually set as 0.85. Each entry T ij in T is the evidence propagation ratio from node i to node j, which is computed in Eq. 8.</p><formula xml:id="formula_8">T ij = SS(m i , e j ) / Σ k∈N E (i) SS(m i , e k ) if i ∈ N M , j ∈ N E ; SR(e i , e j ) / Σ k∈N E (i) SR(e i , e k ) if i ∈ N E , j ∈ N E ; SA(c i , e j ) / Σ k∈N E (i) SA(c i , e k ) if i ∈ N C , j ∈ N E<label>(8)</label></formula><p>2 http://www.cs.waikato.ac.nz/ml/weka</p><p>where N E (i) is the set of entity nodes such that for each node k ∈ N E (i), there is an edge from i to k in G. 
The entry v i in v is the initial evidence representing the prior importance of a mention m i if i ∈ N M or the document-specific importance of a category c i if i ∈ N C , which is calculated as follows</p><formula xml:id="formula_10">v i = λ • P (m i ) / (λ • Σ k∈N M P (m k ) + η • Σ k∈N C P (c k )) if i ∈ N M ; η • P (c i ) / (λ • Σ k∈N M P (m k ) + η • Σ k∈N C P (c k )) if i ∈ N C ; 0 otherwise<label>(9)</label></formula><p>where λ and η are tunable parameters with λ + η = 1, which reflect the sensitivity of prior mention importance and document-specific category importance to the final probability of each candidate entity. When η = 0, our method reduces to general entity linking without considering the topic-sensitive model. In contrast, when λ = 0, the initial evidence of the graph-based disambiguation depends only on the category importance. As a result of the personalized PageRank algorithm, each candidate entity e receives a final probability P (e). For each mention m having a set of candidate entities E m , we choose the entity with the maximal probability as the predicted linking entity, i.e., e m = arg max e∈Em P (e).</p><p>The process discussed above does not distinguish between salient and non-salient entities. In order to deal with salient entity linking, one important task of the topic-sensitive model is to validate whether the predicted linking entity e m for mention m is a salient entity. For this purpose, we learn a threshold τ such that if P (e m ) is greater than τ, we return e m as the linking entity for m; otherwise we return Non-Salient.</p></div>
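The iteration of Eq. (7) and the subsequent arg-max selection with the salience threshold τ can be sketched as follows. This is a toy power-iteration sketch under our own assumptions (a dense list-of-lists matrix, the usual damping factor 0.85, and a simple L1 convergence test), not the paper's implementation.

```python
def personalized_pagerank(T, v, d=0.85, tol=1e-12, max_iter=500):
    """Power iteration for Eq. (7): Pr = d * T * Pr + (1 - d) * v.
    T[i][j] is the propagation ratio from node i to node j (Eq. 8),
    so the update for node j sums over its incoming edges."""
    n = len(v)
    pr = list(v)
    for _ in range(max_iter):
        nxt = [d * sum(T[i][j] * pr[i] for i in range(n)) + (1 - d) * v[j]
               for j in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, pr)) < tol:
            break
        pr = nxt
    return pr

def link_salient_mentions(candidates, prob, tau):
    """Pick e_m = argmax P(e) over each mention's candidate entities and
    return it only if P(e_m) > tau; otherwise label it Non-Salient."""
    links = {}
    for mention, entities in candidates.items():
        best = max(entities, key=lambda e: prob.get(e, 0.0))
        links[mention] = best if prob.get(best, 0.0) > tau else "Non-Salient"
    return links
```

Since each node's evidence in v sums to one and T is row-stochastic over the nodes with outgoing edges, the probability mass of pr is preserved across iterations in this toy setting.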
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Experiments</head><p>We now discuss the experiments we performed to assess the performance of our approach. As the knowledge base, we used the English Wikipedia snapshot from July 2013. We employed the Reuters-128 entity salience dataset<ref type="foot" target="#foot_1">3</ref> , which is an extension of a part of the N3 entity linking datasets <ref type="bibr" target="#b16">[17]</ref>. The Reuters-128 dataset is an English corpus containing 128 economic news articles. It provides information for 880 named entities, including their positions in the documents and a URI of a DBpedia resource identifying each entity. The salience dataset further extends the Reuters-128 dataset with 3,551 common entities.</p><p>In order to construct the dataset, entity salience information was obtained through crowdsourcing using the CrowdFlower platform. For each named and common entity in the Reuters-128 dataset, the authors of the dataset collected at least three judgements. Only judgements from annotators with a trust score higher than 70% were considered trusted; if an annotator's trust score fell below 70%, all of their judgements were disregarded. Finally, each named and common entity in the dataset has been classified into one of the following classes: Most Salient (entities with the highest focus of attention in the article; the document is mostly about these entities, or they play a prominent role in its content), Less Salient (entities with less focus of attention, which play an important role in some parts of the article), and Not Salient (the article is not actually about these entities).</p><p>In our experiments, we consider the entities in both classes Most Salient and Less Salient as salient entities, while entities belonging to Not Salient are considered as non-salient entities. 
Using the Reuters-128 entity salience dataset, we conducted experiments to compare our approach with several entity linking methods. We used two variants of our approach: one employs only the graph-based disambiguation for general entity linking (λ = 1 and η = 0), and the other integrates the topic-sensitive model with the goal of salient entity linking (λ = 0.2 and η = 0.8). Each method must label each mention with either the correct entity or Not Salient. Note that we restrict the input to the labeled mentions to compare the methods' ability to distinguish between salient and non-salient entities, not their ability to recognize entity names in the input text. The adopted evaluation criteria include Micro-Precision, Micro-Recall, Micro-F1, Macro-Precision, Macro-Recall and Macro-F1.</p><p>The experimental results are shown in Table <ref type="table" target="#tab_1">1</ref>. By utilizing the topic-sensitive model, our approach to salient entity disambiguation significantly outperforms the baselines in terms of all evaluation criteria. The comparison of the two variants of our approach clearly shows that the topic-sensitive model indeed contributes to the final performance improvement.</p></div>
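The micro- and macro-averaged criteria used above can be computed as follows. This is a generic sketch of standard micro/macro averaging over per-document true-positive, false-positive and false-negative counts; the paper does not spell out its exact averaging procedure, so this reflects the common definitions.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall and F1 from raw counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def micro_macro(per_doc):
    """per_doc: list of (tp, fp, fn) tuples, one per document.
    Micro-averaging pools the counts over all documents;
    macro-averaging averages the per-document scores."""
    tp = sum(c[0] for c in per_doc)
    fp = sum(c[1] for c in per_doc)
    fn = sum(c[2] for c in per_doc)
    micro = precision_recall_f1(tp, fp, fn)
    scores = [precision_recall_f1(*c) for c in per_doc]
    macro = tuple(sum(s) / len(scores) for s in zip(*scores))
    return micro, macro
```

Micro-averaging weights documents by their number of entities, whereas macro-averaging gives every document equal weight, which is why the two can diverge on corpora with uneven document sizes.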
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusions</head><p>In this paper, we introduce the task of salient entity linking, which existing entity linking solutions cannot adequately address. To tackle this new problem, we propose a graph-based disambiguation framework that integrates several features, including prior mention importance, mention-entity compatibility, entity-entity coherence and, in particular, a topic-sensitive model capturing entity-category association and document-specific category importance. We have experimentally shown that our approach achieves a significant improvement over the baselines. The evaluation results also show that the topic-sensitive model indeed helps with salient entity disambiguation.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 :</head><label>1</label><figDesc>Fig. 1: Salient entity linking framework.</figDesc><graphic coords="2,159.04,115.83,293.96,166.15" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head></head><label></label><figDesc>s as anchor text pointing to e as destination and E m.s is the set of entities that have the surface form m.s. An entity e is characterized by its textual description e.c, called context of e and a mention m is characterized by its surrounding sentences m.c, called context of m. The context similarity CS(m, e) between m and e can be calculated using cosine similarity on the term vectors e.c of e.c and m.c of m.c as CS(m, e) = cos(e.c, m.c) = e.c, m.c |e.c| • |m.c|</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc></figDesc><table><row><cell>Methods</cell><cell>Mic. Prec.</cell><cell>Mic. Rec.</cell><cell>Mic. F1</cell><cell>Mac. Prec.</cell><cell>Mac. Rec.</cell><cell>Mac. F1</cell></row><row><cell>DBpedia Spotlight [2]</cell><cell>0.45</cell><cell>0.39</cell><cell>0.41</cell><cell>0.45</cell><cell>0.37</cell><cell>0.40</cell></row><row><cell>Wikipedia Miner [1]</cell><cell>0.60</cell><cell>0.48</cell><cell>0.54</cell><cell>0.60</cell><cell>0.42</cell><cell>0.52</cell></row><row><cell>NERD-ML [5,7]</cell><cell>0.67</cell><cell>0.50</cell><cell>0.57</cell><cell>0.65</cell><cell>0.46</cell><cell>0.54</cell></row><row><cell>WAT [4,8]</cell><cell>0.35</cell><cell>0.32</cell><cell>0.34</cell><cell>0.36</cell><cell>0.33</cell><cell>0.34</cell></row><row><cell>AGDISTIS [6]</cell><cell>0.73</cell><cell>0.50</cell><cell>0.59</cell><cell>0.73</cell><cell>0.48</cell><cell>0.58</cell></row><row><cell>Our Method (General)</cell><cell>0.70</cell><cell>0.46</cell><cell>0.56</cell><cell>0.69</cell><cell>0.45</cell><cell>0.55</cell></row><row><cell>Our Method (Salient)</cell><cell>0.83</cell><cell>0.51</cell><cell>0.63</cell><cell>0.82</cell><cell>0.50</cell><cell>0.62</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1 :</head><label>1</label><figDesc>The experimental results.</figDesc><table /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">In this work, we employ the 16 second-level categories including Mathematics, People, Science, Sport, Geography, Culture, Politics, Nature, Technology, Education, Health, Business, Belief, Society, Life and Concepts in Wikipedia, where the first-level category is the fundamental category.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1">https://github.com/KIZI/ner-eval-collection</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_2">http://ner.vse.cz/datasets/entitysalience-collection</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Acknowledgments. The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 611346.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Learning to link with wikipedia</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">N</forename><surname>Milne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">H</forename><surname>Witten</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CIKM</title>
				<imprint>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="509" to="518" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Dbpedia spotlight: shedding light on the web of documents</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">N</forename><surname>Mendes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Jakob</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>García-Silva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Bizer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">I-SEMANTICS</title>
				<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="1" to="8" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Collective entity linking in web text: a graph-based method</title>
		<author>
			<persName><forename type="first">X</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">SIGIR</title>
				<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="765" to="774" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Fast and accurate annotation of short texts with wikipedia pages</title>
		<author>
			<persName><forename type="first">P</forename><surname>Ferragina</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Scaiella</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Software</title>
		<imprint>
			<biblScope unit="volume">29</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="70" to="75" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Learning with the web: Spotting named entities on the intersection of NERD and machine learning</title>
		<author>
			<persName><forename type="first">M</forename><surname>Van Erp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Rizzo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Troncy</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">#MSM</title>
		<imprint>
			<biblScope unit="page" from="27" to="30" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">AGDISTIS -graph-based disambiguation of named entities using linked data</title>
		<author>
			<persName><forename type="first">R</forename><surname>Usbeck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">N</forename><surname>Ngomo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Röder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Gerber</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">A</forename><surname>Coelho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Auer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Both</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ISWC</title>
				<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="457" to="471" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Benchmarking the extraction and disambiguation of named entities on the semantic web</title>
		<author>
			<persName><forename type="first">G</forename><surname>Rizzo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Van Erp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Troncy</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">LREC</title>
				<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="4593" to="4600" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">From tagme to WAT: a new entity annotator</title>
		<author>
			<persName><forename type="first">F</forename><surname>Piccinno</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Ferragina</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ERD@SIGIR</title>
				<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="55" to="62" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Identifying salient entities in web pages</title>
		<author>
			<persName><forename type="first">M</forename><surname>Gamon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Yano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Apacible</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Pantel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CIKM</title>
				<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="2375" to="2380" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">An effective, low-cost measure of semantic relatedness obtained from wikipedia links</title>
		<author>
			<persName><forename type="first">I</forename><surname>Witten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Milne</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">WIKIAI</title>
		<imprint>
			<biblScope unit="page" from="25" to="30" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">The Google similarity distance</title>
		<author>
			<persName><forename type="first">R</forename><surname>Cilibrasi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">M B</forename><surname>Vitányi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Trans. Knowl. Data Eng</title>
		<imprint>
			<biblScope unit="volume">19</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="370" to="383" />
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Fast training of support vector machines using sequential minimal optimization</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">C</forename><surname>Platt</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in kernel methods</title>
		<imprint>
			<date type="published" when="1999">1999</date>
			<publisher>MIT Press</publisher>
			<biblScope unit="page" from="185" to="208" />
			<pubPlace>Cambridge, MA, USA</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Improvements to Platt&apos;s SMO algorithm for SVM classifier design</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">S</forename><surname>Keerthi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">K</forename><surname>Shevade</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Bhattacharyya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">R K</forename><surname>Murthy</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Neural Comput</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="637" to="649" />
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Classification by pairwise coupling</title>
		<author>
			<persName><forename type="first">T</forename><surname>Hastie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Tibshirani</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">NIPS</title>
		<imprint>
			<date type="published" when="1997">1997</date>
			<biblScope unit="page" from="507" to="513" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Scaling personalized web search</title>
		<author>
			<persName><forename type="first">G</forename><surname>Jeh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Widom</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">WWW</title>
		<imprint>
			<biblScope unit="page" from="271" to="279" />
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Topic-sensitive PageRank: A context-sensitive ranking algorithm for web search</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">H</forename><surname>Haveliwala</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Trans. Knowl. Data Eng</title>
		<imprint>
			<biblScope unit="volume">15</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="784" to="796" />
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">N³ - A collection of datasets for named entity recognition and disambiguation in the NLP interchange format</title>
		<author>
			<persName><forename type="first">M</forename><surname>Röder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Usbeck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hellmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Gerber</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Both</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">LREC</title>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="3529" to="3533" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
