A graph-based collective linking approach with Group Co-existence Strength

Chinmay Choudhary (c.choudhary1@university.ie), National University of Ireland (NUI) Galway

Colm O'Riordan (colm.oriordan@university.ie), National University of Ireland (NUI) Galway


This paper addresses a drawback of many existing graph-based collective entity-linking approaches by introducing the new concept of Group Co-existence Strength (GCS). It proposes an approach to the collective linking of text documents that extends a recent existing approach by taking into account GCS for all possible groups of candidate entities, along with standard attributes. Preliminary experimental results indicate that the proposed approach leads to performance gains on selected real-world data.

Introduction

Named Entity Disambiguation (NED) is the process of linking name-mentions in a document to the real-world entities to which they refer. These entities can be instances of a diverse range of categories, such as famous person, place, institution, country or scientific discovery, which collectively form a knowledge base (such as Wikipedia).

The entire Entity Linking (EL) process comprises three major steps: recognition of ambiguous name-mentions within a document, called Named Entity Recognition (NER); identification of candidate entities for each such name-mention; and disambiguation of these name-mentions by linking each one to the most appropriate entity among its candidates. Each step is a broad research area in itself. The work described in this paper focuses on the final step, and thus assumes that all name-mentions within a document are correctly demarcated and that a set of candidate entities for each name-mention has been identified beforehand.

Related Work

[1], [9], [7] and [2] are prominent examples of individual-linking approaches, which link each name-mention individually based on the similarity between its context within the document and the description of the entity, commonly referred to as Compatibility (CP). In contrast, [8], [15], [18], [19], [13], [16], [6], [12], [11], [4], [3] and [14] are examples of modern collective-linking approaches that adopt various supervised and unsupervised linking methods. [5] is a prominent graph-based approach that links all name-mentions within a single document simultaneously. Along with CP, it considers the semantic relationship between pairs of entities, referred to as Semantic Relatedness (SR), which indicates the chances of both entities being referred to in a single real-world document, depending on how closely they are associated with a common topic or field. This paper directly extends the approach of [5].

Drawback of contemporary approaches

Most graph-based collective linking approaches compute the overall linking from the SR between all possible pairs of entities that are candidates of two distinct name-mentions appearing within the document. Consider a set of entities associated with the entire document, in which each member is a candidate of a single distinct name-mention. The score indicating the suitability of this set as the collective link is computed as a function of the SR scores of all possible pairs that can be extracted from the set. This approach carries an inherent assumption, which can be stated as follows.

All members of a set of entities have higher chances of being referred to together in a single real-world document if most of the pairs extracted from the set possess a strong semantic relationship.

This assumption does not always hold, specifically if there is an outlier in the group, and this limits the accuracy of the system. Consider the text document stated as Example 1, which contains four name-mentions: Donald, Hillary, Fox and America.

Example 1. Donald will direct the upcoming movie from Fox with Hillary playing lead role in it. The movie will be released across America by the end of 2017.

Let W1 and W2 be two candidate collective links for the entire document, listed as follows.

W1 = [ Donald Trump, Hillary Clinton, Fox, United States of America ]
W2 = [ Donald Petrie, Hillary Swank, Fox Studios, North America ]

Here, by common sense and real-world knowledge, it is evident that W2 is a more appropriate link than W1. Modern approaches would nevertheless select W1, as most pairs of entities extracted from W1 have a stronger semantic relationship than their counterparts in W2 (for example, the pair {Donald Trump, Hillary Clinton} compared to the pair {Donald Petrie, Hillary Swank}). To address this issue, this paper introduces a new concept called Group Co-existence Strength (GCS) in section 4 and proposes an NED approach that takes it into consideration in section 5.

Group Co-existence Strength

The GCS of a group of entities indicates the chances of all its members being co-referred within any given real-world document. This strength depends on how symmetrically the entities are distributed with respect to each other in terms of mutual SR scores. One way to demonstrate this distribution is to plot all entities of the group on a graph, with the co-ordinates of each entity determined by its Semantic Distance (computed as a factor of SR) from pre-decided benchmark members of the same group. Groups whose members are plotted more compactly can be considered semantically stronger. For the candidate collective links W1 and W2 outlined for Example 1, the distribution plots are shown in figure 2 and figure 1 respectively. It is evident from the figures that W2 is more compactly distributed than W1, which has an anomaly. The GCS of any group of entities is indicated by the value of an indicator called the Group Strength Factor (GSF), described in section 4.1.

Group Strength Factor

The GSF of a set of entities of any size is the minimum of the Gaussian values obtained at the positions of all entities within the set, where the peak of the Gaussian lies at the average position and the standard deviation is a fixed vector (of size equal to the number of co-ordinates). For a set of entities S, let R be a set of reference entities such that R ⊂ S. For any entity E_i ∈ S, its position is defined by as many co-ordinates as there are members of R, with the j-th co-ordinate computed with respect to the j-th member of R (1 ≤ j ≤ size of R) by equation 1, where SR is Semantic Relatedness [8] and NF is a pre-determined Normalization Factor. The value of NF has no impact on overall performance, provided it is kept the same throughout the entire linking process. It exists only to keep the positions of entities, and the GSF values obtained from them, within a convenient range for analysis (for example, when GSF values would otherwise be too small to be easily distinguished, compared or plotted).

$$E_i^{\mathrm{coordinate}_j} = NF \cdot SR(E_i, R_j) \tag{1}$$

For the experiments described in this paper, NF is set to one. Each co-ordinate of the peak is the average of the same co-ordinate over all n entities belonging to the set S, computed through equation 2.

$$\mathrm{Peak}^{\mathrm{coordinate}_j} = \frac{\sum_{E \in S} E^{\mathrm{coordinate}_j}}{n} \tag{2}$$

Given the positions of all entities belonging to S and of the peak, the GSF score of S is determined by applying equation 3, where N represents a Normal distribution of the positions of the entities belonging to S around the peak (computed through equation 2) with a fixed standard deviation (σ).

$$GSF = \min_{E \in S} \big( E \sim \mathcal{N}(\mathrm{Peak}, \sigma) \big) \tag{3}$$

For the purpose of experimentation, a larger R gives more accurate positioning (with more co-ordinates) and thus more accurate final linking, though at the cost of lower time efficiency. Once the size of R has been fixed for the entire linking process, any subset of S of that size can be used as R with similar results, since the computation of GSF depends on the symmetry of the mutual distribution of all entities with respect to each other. In this paper the value of σ is arbitrarily set to 0.1 (the specific value does not matter, provided it is the same for all examples during both training and testing), whereas the size of R is set to 1, so all entities are plotted on a 1-D axis. The basic intuition behind GSF is that the Gaussian value obtained at an outlier is much lower than at the other members, because the outlier lies at a considerable distance from the peak, and this penalizes the entire group.
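The computation described by equations 1 to 3 can be sketched in Python for the 1-D case used in this paper (size of R equal to 1, σ = 0.1). This is a minimal illustration and not the authors' implementation: the function names are hypothetical, the co-ordinate values are invented for demonstration, and the "Gaussian value" is assumed here to be the normal probability density.

```python
import math

def gaussian(x, mean, sigma):
    """Normal density at x (assumed interpretation of the 'Gaussian value')."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def group_strength_factor(positions, sigma=0.1):
    """GSF on a 1-D axis: the minimum Gaussian value over all members."""
    peak = sum(positions) / len(positions)                    # equation 2
    return min(gaussian(p, peak, sigma) for p in positions)   # equation 3

# Hypothetical co-ordinates NF * SR(E_i, R_1) with NF = 1 (equation 1).
compact_group = [0.50, 0.55, 0.52, 0.48]   # members lie close together
outlier_group = [0.80, 0.78, 0.82, 0.20]   # one member lies far from the rest

gsf_compact = group_strength_factor(compact_group)
gsf_outlier = group_strength_factor(outlier_group)
```

Under these invented values the group containing an outlier receives a much smaller GSF than the compact group, which is the penalizing behaviour described above.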

Linking Approach

Collective linking of all name-mentions in a given document (with N name-mentions) involves computing a term called the Linking Factor (LF) for every candidate collective link by applying the heuristic formula stated as equation 4. The formula takes into account the sum of GSF values of all possible groups of entities of a particular size n (referred to as ΣGSF_n), for all sizes greater than two that can be extracted from the candidate's set of entities. The LF value of a particular candidate collective link (a set of N entities, one per name-mention) thus depends on ΣGSF_n (2 < n ≤ N), along with the sum of Semantic Relatedness scores (ΣSR) between all possible pairs of entities and the sum of Compatibility scores (ΣCP) between all name-mentions and their respective candidate entities.

$$LF = \sum_{n=3}^{N} \left( \varphi_1 n^3 + \varphi_2 n^2 + \varphi_3 n + \varphi_4 \right) \sum GSF_n + \phi_2 \sum SR + \phi_1 \sum CP \tag{4}$$

Here N is the total number of name-mentions appearing within the document, while φ1, φ2, φ3, φ4, ϕ1 and ϕ2 are parameters that can be learnt from a training dataset. Equation 4 is based on the intuition that the impact of GCS on the collective-linking process should not be the same for groups of all sizes; ΣGSF_n is therefore weighted by a polynomial in n, with degree 3 chosen to avoid both under-fitting and over-fitting. For a given document, consisting of a set of name-mentions and a group of candidate collective links (each a set with as many entities as there are name-mentions, every entity being associated with a single distinct name-mention), the LF score of every candidate can be computed and the candidate with the maximum score selected as the most appropriate.
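As an illustration of equation 4, the following Python sketch evaluates LF from pre-computed sums. The function and argument names are hypothetical, and the parameter values and sums are invented solely to make the arithmetic easy to follow; they are not learnt values.

```python
def size_weight(n, phi):
    """Cubic weight phi1*n^3 + phi2*n^2 + phi3*n + phi4 for groups of size n."""
    phi1, phi2, phi3, phi4 = phi
    return phi1 * n**3 + phi2 * n**2 + phi3 * n + phi4

def linking_factor(gsf_sums, sr_sum, cp_sum, phi, w_sr, w_cp):
    """Equation 4. gsf_sums maps each group size n >= 3 to the sum of GSF
    over all groups of that size; w_sr and w_cp weight the SR and CP sums."""
    gsf_term = sum(size_weight(n, phi) * total for n, total in gsf_sums.items())
    return gsf_term + w_sr * sr_sum + w_cp * cp_sum

# Invented inputs for a document with N = 4 name-mentions.
example_lf = linking_factor(
    gsf_sums={3: 2.0, 4: 1.0},
    sr_sum=3.0,
    cp_sum=4.0,
    phi=(0.0, 0.0, 0.0, 1.0),   # constant weight of 1 for every group size
    w_sr=0.5,
    w_cp=0.25,
)
```

With these inputs the GSF term contributes 3.0, the SR term 1.5 and the CP term 1.0. Selecting the final link then amounts to taking the candidate with the largest LF, for example with `max(candidates, key=...)`.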

Experimentation

As explained in section 1, the proposed approach ranks the candidates of all name-mentions appearing in a single document in order to link all such name-mentions to their most appropriate entities simultaneously. The dataset used for training and testing the approach should therefore consist of text documents in which all name-mentions are demarcated and the candidates for each are identified beforehand. Section 6.1 describes the structure of the final datasets and the process of generating them, whereas subsequent sections elaborate on the computation, training and testing procedures.

Dataset

The first phase of experimentation involves extracting information from the original IITB helpfulness dataset [17] to build three distinct final datasets used for training and testing the proposed approach. The IITB dataset comprises a collection of text documents on a varied range of subjects such as sports, science and politics. Details of all name-mentions within all documents, including the title of the correct Wikipedia link for each, are represented as a single large JSON document. As the proposed approach identifies the most appropriate collective link of all name-mentions simultaneously after learning the parameters of equation 4, three distinct final datasets with a suitable structure are created by modifying the original dataset. The following two sub-sections describe the structure of the final datasets and the process of their creation, respectively.

Final datasets

As already explained, final training requires the parameters of equation 4 to be learnt through Logistic Regression, which requires a set of positive and negative training examples. The final datasets consist of such examples, each labelled either positive or negative. A single example consists of the top 100 name-mentions appearing in a single text document, ranked according to their relevance, with each name-mention paired with one of its candidate entities. Examples in which all name-mentions are paired with their correct links, according to the information provided in the JSON document of the original IITB helpfulness dataset, are labelled positive, while all others are labelled negative.

Text documents within the IITB dataset contain varying numbers of name-mentions, the minimum being 100. Thus, for every document only the top 100 high-relevance name-mentions are considered and the others ignored, in order to keep all examples of the datasets homogeneous, which is convenient for training and testing. The relevance of each name-mention within a specific text document for the purpose of collective linking is indicated within the original IITB dataset as a relevance index. For each text document, all name-mentions appearing within it are sorted according to relevance index and the top 100 are retained. The characteristic feature that mainly distinguishes the three datasets is the degree of overlap among the examples each contains. For a collection of sets of entity-mention pairs forming a single dataset, the overlap of that dataset refers to the percentage of common members between any two candidate sets of entities that can be a collective link of the same text document. Details of all three datasets are summarized in Table 1.

Process of creation of Datasets

As the proposed approach identifies the most appropriate candidate entity for each name-mention within a single document simultaneously, its evaluation requires at least one incorrect and one correct candidate entity for each name-mention within all text documents. The correct link of every name-mention is provided as a Wikipedia title within the original IITB dataset, while the incorrect candidate is extracted from the See Also section of that correct link. All Wikipedia page hyperlinks within the See Also section of the Wikipedia article of the correct link of a given name-mention are sorted in decreasing order of similarity between the Bag of Words (BOW) extracted from each of them and the BOW extracted from the contents of the correct link. The Wikipedia page at the top of the list is taken as the second (incorrect) candidate of that name-mention.

With two candidates for each name-mention, the final datasets are created by pairing each name-mention with one of its candidates and arranging all pairs whose name-mentions appear in a single text document into one large collection. Each such collection, being a single possible collective link of the text document, forms a single example, labelled positive if all name-mentions are paired with their correct links and negative otherwise. Three distinct methods of performing this arrangement yield the three distinct datasets.
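The construction of examples described above can be sketched as follows. This is an illustrative assumption of how the steps fit together, not the authors' code: the dictionary keys (`text`, `relevance`, `correct`, `incorrect`) and all function names are hypothetical, the relevance values are invented, and the four mentions reuse Example 1 rather than real IITB data.

```python
def top_mentions(mentions, k=100):
    """Keep the k name-mentions with the highest relevance index."""
    return sorted(mentions, key=lambda m: m["relevance"], reverse=True)[:k]

def make_example(mentions, use_incorrect):
    """Pair every mention with one candidate. use_incorrect is the set of
    mention positions that receive the incorrect candidate."""
    pairs = [
        (m["text"], m["incorrect"] if i in use_incorrect else m["correct"])
        for i, m in enumerate(mentions)
    ]
    label = 0 if use_incorrect else 1   # positive only if every link is correct
    return pairs, label

def overlap(pairs_a, pairs_b):
    """Percentage of entity-mention pairs shared by two examples of one document."""
    shared = sum(1 for a, b in zip(pairs_a, pairs_b) if a == b)
    return 100.0 * shared / len(pairs_a)

# Toy document based on Example 1 (relevance values are invented).
mentions = [
    {"text": "Donald", "relevance": 0.9, "correct": "Donald Petrie", "incorrect": "Donald Trump"},
    {"text": "Hillary", "relevance": 0.8, "correct": "Hillary Swank", "incorrect": "Hillary Clinton"},
    {"text": "Fox", "relevance": 0.7, "correct": "Fox Studios", "incorrect": "Fox"},
    {"text": "America", "relevance": 0.6, "correct": "North America", "incorrect": "United States of America"},
]
kept = top_mentions(mentions, k=4)
positive_pairs, positive_label = make_example(kept, set())
negative_pairs, negative_label = make_example(kept, {0})
shared_percentage = overlap(positive_pairs, negative_pairs)
```

Varying how many positions are placed in `use_incorrect` is one possible way to control the degree of overlap between examples; the paper does not specify the three arrangement methods in detail.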

Computation

For a given set of entity-mention pairs forming a possible collective link, equation 4 computes the Linking Factor from three distinct quantities: the sum of Semantic Relatedness (SR) scores of all possible pairs of entities that can be extracted from the set, the sum of Compatibility (CP) scores of all entity-mention pairs forming the set, and the sums of Group Strength Factors (GSF) of all possible groups of entities of a specific size n (n > 2) that can be extracted from the set, for all possible values of n. The processes adopted for computing these scores are explained as follows.

1. Compatibility (CP): It is computed between the context of a name-mention and the entity description. For this experimentation, the context of a name-mention is taken to be the twenty words before and after it within the text-file content, and the entity description is simply the content of the respective Wikipedia article. For a name-mention NM and a Wikipedia entity W, let B_NM and B_W be the bags of N-grams extracted from their context and description respectively, with N ranging from 1 to 3. The Compatibility between NM and W is given by equation 5.

$$CP(NM, W) = TFIDF_{NM} \cdot V_{W/NM}^{T} \tag{5}$$

Here TFIDF_NM consists of the TF-IDF scores of all N-grams within B_NM with respect to all the text documents within the original IITB dataset. V_{W/NM} is a Boolean vector of length equal to the length of B_NM, with values obtained from equation 6 for all i = 1 to the length of V_{W/NM}.

$$V_{W/NM}[i] = \begin{cases} 1 & \text{if } B_{NM}[i] \in B_W \\ 0 & \text{otherwise} \end{cases} \tag{6}$$

2. Semantic Relatedness (SR): There are numerous approaches to computing Semantic Relatedness between Wikipedia entities, but the most common one is proposed in [10]. It uses the intersection and union of the hyperlinks shared between two given entities, applying equation 7. Following common practice, the same method is used in this experimentation.

$$SR(x, y) = 1 - \frac{\log(\max(|X|, |Y|)) - \log(|X \cap Y|)}{\log|W| - \log(\min(|X|, |Y|))} \tag{7}$$

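The components of the formula (|W|, |X|, |Y| and |X ∩ Y|) are defined in the notation list that appears with the figure and table captions at the end of the paper.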

3. Group Strength Factor (GSF): For any group of entities of size greater than two, GSF is computed by applying equation 3. Ideally, applying equation 4 to compute the Linking Factor (LF) for a given set of entity-mention pairs requires the GSF values of all possible groups of entities of size three or more that can be extracted from the set. Since there are 100 entities within each example, the total number of GSF computations that need to be performed for each example is as follows.

$$\sum_{n=3}^{100} \binom{100}{n}$$

This makes the time efficiency of training and testing extremely low, so that evaluating the hypothesis within the stipulated time period becomes infeasible. Considering this limitation, only groups of entities with a maximum size of 10 are taken into consideration in this experimentation. The maximum size is set to 10 because it is the largest value for which the experimentation remained feasible within the time constraint.
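The three scores in the list above can be sketched in Python as follows. The function names are hypothetical, TF-IDF scores are assumed to be supplied as a dictionary keyed by N-gram, and returning 0.0 when two entities share no hyperlink is an assumption made here to avoid taking the logarithm of zero. The last helper counts how many groups of size 3 up to a chosen maximum exist among a given number of entities, which illustrates why the group size is capped.

```python
import math

def ngrams(tokens, max_n=3):
    """Set of word N-grams for N = 1..max_n."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }

def compatibility(context_tfidf, description_ngrams):
    """Equations 5 and 6: sum of TF-IDF scores of the context N-grams
    that also occur in the entity description."""
    return sum(score for gram, score in context_tfidf.items()
               if gram in description_ngrams)

def semantic_relatedness(links_x, links_y, total_articles):
    """Equation 7 applied to two sets of hyperlinks."""
    shared = len(links_x & links_y)
    if shared == 0:
        return 0.0  # assumption: no shared hyperlinks means no relatedness
    larger = max(len(links_x), len(links_y))
    smaller = min(len(links_x), len(links_y))
    return 1 - (math.log(larger) - math.log(shared)) / (
        math.log(total_articles) - math.log(smaller))

def group_count(num_entities, max_size):
    """Number of groups of size 3..max_size drawn from num_entities entities."""
    return sum(math.comb(num_entities, n) for n in range(3, max_size + 1))

# Invented toy inputs.
description = ngrams(["fox", "studios", "movie"])
cp_score = compatibility(
    {("fox",): 0.5, ("movie",): 0.3, ("fox", "movie"): 0.2}, description)
sr_same = semantic_relatedness({1, 2, 3}, {1, 2, 3}, total_articles=1000)
sr_none = semantic_relatedness({1, 2}, {3, 4}, total_articles=1000)
all_groups = group_count(100, 100)
capped_groups = group_count(100, 10)
```

Comparing `all_groups` with `capped_groups` shows how sharply the cap at size 10 reduces the number of GSF evaluations per example, although the capped count is itself large.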

Final Matrix

As explained in section 6.2, a single example of the final dataset is formed by the collection of all name-mentions appearing in a single text document (the top 100 by relevance in this experimentation), each paired with one of its candidate entities. For each such example, 10 distinct values, namely the sum of SR values, the sum of CP values and the sums of GSF values for groups of sizes 3 to 10, are represented as a single 1 × 10 vector. An example e is thus represented as the vector V_e given by equation 8.

$$V_e = \left[ \sum GSF_{10} \;\; \sum GSF_{9} \;\; \dots \;\; \sum GSF_{3} \;\; \sum SR \;\; \sum CP \right] \tag{8}$$

The entire dataset, consisting of m examples, can thus be represented as an m × 10 matrix M_d given by equation 9, together with an m × 1 Boolean vector holding the labels of all m examples.

$$M_d = \left[ V_{e_1} \;\; V_{e_2} \;\; \dots \;\; V_{e_m} \right]^{T} \tag{9}$$

Training and Testing

Final collective linking is performed by computing the Linking Factor (LF) for each training example using the formula given as equation 4. In the current experimentation, since the maximum size of a group of entities is set to 10, the formula can be written as equation 10.

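$$\begin{aligned} LF ={} & (10^3 \alpha_1 + 10^2 \alpha_2 + 10\,\alpha_3 + \alpha_4) \sum GSF_{10} + (9^3 \alpha_1 + 9^2 \alpha_2 + 9\,\alpha_3 + \alpha_4) \sum GSF_{9} + \dots \\ & + (3^3 \alpha_1 + 3^2 \alpha_2 + 3\,\alpha_3 + \alpha_4) \sum GSF_{3} + \theta_1 \sum SR + \theta_2 \sum CP \end{aligned} \tag{10}$$

Here α1, α2, α3 and α4 play the role of the polynomial coefficients of equation 4, θ1 and θ2 weight the SR and CP sums, and each sum runs over the same groups or pairs as in equation 4.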

The values of the Linking Factor (LF) for all examples within a given dataset d can be represented as a single m × 1 matrix called LFMatrix. By rearranging equation 10, it can be shown that the LFMatrix of d can be computed by applying equation 11.

$$LFMatrix = (M_d \cdot Multiplier) \cdot P \tag{11}$$

Here M_d is the matrix defined as equation 9, while P and Multiplier are given as equations 12 and 13. P is the parameter matrix that needs to be learnt through Logistic Regression. It is initialized with random values and is updated after each iteration until convergence. To evaluate the performance of the proposed approach, a given dataset is split in the ratio 60:40, with the first 60% of examples used to learn the parameters of equation 4 (represented as the single matrix P in equation 12) through Logistic Regression and the last 40% used for testing. Training and testing are performed on all three datasets separately, and the results obtained on each are discussed in section 7.

$$P = \left[ \alpha_1 \;\; \alpha_2 \;\; \alpha_3 \;\; \alpha_4 \;\; \theta_1 \;\; \theta_2 \right]^{T} \tag{12}$$
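The Multiplier matrix is referred to as equation 13 but is not written out in this text, so the sketch below uses an assumed reconstruction: a 10 × 6 matrix whose shape and entries follow from matching equation 10 term by term with the column order of equation 8 and the parameter order of equation 12. The function names are hypothetical and the numeric inputs are invented. The sketch checks that the matrix form of equation 11 agrees with a direct evaluation of equation 10.

```python
SIZES = list(range(10, 2, -1))  # group sizes 10, 9, ..., 3 (column order of equation 8)

def build_multiplier():
    """Assumed 10x6 Multiplier: one row per column of M_d, one column per entry of P."""
    rows = [[n**3, n**2, n, 1, 0, 0] for n in SIZES]  # rows for the GSF_n sums
    rows.append([0, 0, 0, 0, 1, 0])                   # row for the SR sum (theta_1)
    rows.append([0, 0, 0, 0, 0, 1])                   # row for the CP sum (theta_2)
    return rows

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lf_matrix(m_d, p):
    """Equation 11: (M_d * Multiplier) * P, with p a flat list of six parameters."""
    features = matmul(m_d, build_multiplier())
    return [sum(f * w for f, w in zip(row, p)) for row in features]

def lf_direct(v_e, p):
    """Equation 10 evaluated term by term for one example vector V_e."""
    a1, a2, a3, a4, t1, t2 = p
    gsf_term = sum((a1 * n**3 + a2 * n**2 + a3 * n + a4) * s
                   for n, s in zip(SIZES, v_e[:8]))
    return gsf_term + t1 * v_e[8] + t2 * v_e[9]

# Invented example vector (eight GSF sums, then the SR and CP sums) and parameters.
v_e = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 0.5, 0.25]
p = [0.001, 0.01, 0.1, 1.0, 2.0, 3.0]
from_matrix = lf_matrix([v_e], p)[0]
from_formula = lf_direct(v_e, p)
```

Because `M_d * Multiplier` does not depend on P, it can be computed once per dataset and reused as the feature matrix in every Logistic Regression iteration.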

Preliminary Results and Future Work

Given the probability matrix for a test dataset, obtained as described in section 6.2, and a fixed threshold value of 0.5, a prediction is made for each example, yielding a predicted Boolean matrix that is compared with the actual Boolean matrix. Table 2 compares the average results achieved on the three datasets with the results of the Wikification approach [9] and the approach of [5], which are benchmark individual-linking and collective-linking graph-based approaches respectively. Although the results are yet to be compared with various state-of-the-art approaches, the preliminary results in Table 2 indicate that the proposed approach achieves a higher average recall (0.996) and F-score (0.81) than both benchmark approaches, with an average precision (0.69) equal to that of [5]. Future work will include more exhaustive testing and evaluation of the proposed approach on larger datasets.
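The thresholding and comparison step described above can be sketched as follows. The function name and the toy inputs are hypothetical, and treating a probability exactly equal to the threshold as a positive prediction is an assumption, since the paper does not state how ties are handled.

```python
def evaluate(probabilities, labels, threshold=0.5):
    """Threshold the probabilities and return (precision, recall, F-score)."""
    predictions = [p >= threshold for p in probabilities]
    tp = sum(1 for pred, y in zip(predictions, labels) if pred and y)
    fp = sum(1 for pred, y in zip(predictions, labels) if pred and not y)
    fn = sum(1 for pred, y in zip(predictions, labels) if not pred and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Invented probabilities and Boolean labels for four test examples.
precision, recall, f_score = evaluate([0.9, 0.8, 0.4, 0.6], [1, 0, 1, 1])
```

In this toy case two of the three predicted positives are correct and two of the three actual positives are found, so precision, recall and F-score all equal 2/3.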

Fig. 1. Distribution of entities in the set W2

Fig. 2. Distribution of entities in the set W1

Notation for equation 7:
- |W|: total number of articles on Wikipedia
- |X|: number of hyperlinks on entity x
- |Y|: number of hyperlinks on entity y
- |X ∩ Y|: number of hyperlinks shared by entities x and y

Table 1. Information about Datasets

| Dataset | Total number of examples | Number of positive examples | Number of negative examples | Percentage of overlap |
|---|---|---|---|---|
| Dataset 1 | 9161 | 97 | 9064 | More than 90% |
| Dataset 2 | 194 | 97 | 97 | Approximately 30% |
| Dataset 3 | 194 | 98 | 96 | Less than 10% |

Table 2. Comparison of results between various approaches

| Approach | Average Precision | Average Recall | Average F-Score |
|---|---|---|---|
| Wikify | 0.55 | 0.32 | 0.38 |
| Collective Graph-based | 0.69 | 0.76 | 0.73 |
| Our approach | 0.69 | 0.996 | 0.81 |

References

[1] R. C. Bunescu, M. Pasca. Using encyclopedic knowledge for named entity disambiguation. EACL, vol. 6, 2006.
[2] M. Dredze, P. McNamee, D. Rao, A. Gerber, T. Finin. Entity disambiguation for knowledge base population. Proceedings of the 23rd International Conference on Computational Linguistics. Association for Computational Linguistics, 2010.
[3] O. E. Ganea, M. Ganea, A. Lucchi, C. Eickhoff, T. Hofmann. Probabilistic bag-of-hyperlinks model for entity linking. Proceedings of the 25th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2016.
[4] B. Hachey, W. Radford, J. R. Curran. Graph-based named entity linking with Wikipedia. WISE. Springer, 2011.
[5] X. Han, L. Sun, J. Zhao. Collective entity linking in web text: a graph-based method. Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2011.
[6] J. Hoffart, M. A. Yosef, I. Bordino, H. Fürstenau, M. Pinkal, M. Spaniol, B. Taneva, S. Thater, G. Weikum. Robust disambiguation of named entities in text. Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2011.
[7] J. Hu, L. Fang, Y. Cao, H. J. Zeng, H. Li, Q. Yang, Z. Chen. Enhancing text clustering by leveraging Wikipedia semantics. Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2008.
[8] S. Kulkarni, A. Singh, G. Ramakrishnan, S. Chakrabarti. Collective annotation of Wikipedia entities in web text. Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2009.
[9] R. Mihalcea, A. Csomai. Wikify!: linking documents to encyclopedic knowledge. Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management. ACM, 2007.
[10] D. Milne, I. H. Witten. Learning to link with Wikipedia. Proceedings of the 17th ACM Conference on Information and Knowledge Management. ACM, 2008.
[11] A. Moro, A. Raganato, R. Navigli. Entity linking meets word sense disambiguation: a unified approach. Transactions of the Association for Computational Linguistics, vol. 2, 2014.
[12] A. M. Naderi. Unsupervised entity linking using graph-based semantic similarity. 2016.
[13] X. Pan, T. Cassidy, U. Hermjakob, H. Ji, K. Knight. Unsupervised entity linking with abstract meaning representation. HLT-NAACL, 2015.
[14] A. Pappu, R. Blanco, Y. Mehdad, A. Stent, K. Thadani. Lightweight multilingual entity extraction and linking. Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM, 2017.
[15] L. Ratinov, D. Roth, D. Downey, M. Anderson. Local and global algorithms for disambiguation to Wikipedia. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1. Association for Computational Linguistics, 2011.
[16] S. Singh, S. Riedel, B. Martin, J. Zheng, A. McCallum. Joint inference of entities, relations, and coreference. Proceedings of the 2013 Workshop on Automated Knowledge Base Construction. ACM, 2013.
[17] I. Yamada, T. Ito, S. Usami, S. Takagi, H. Takeda, Y. Takefuji. Evaluating the helpfulness of linked entities to readers. Proceedings of the 25th ACM Conference on Hypertext and Social Media, 2014.
[18] Z. Zheng, F. Li, M. Huang, X. Zhu. Learning to link entities with knowledge base. Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 2010.
[19] Y. Zhou, L. Nie, O. Rouhani-Kalleh, F. Vasile, S. Gaffney. Resolving surface forms to Wikipedia topics. Proceedings of the 23rd International Conference on Computational Linguistics. Association for Computational Linguistics, 2010.