Cross-Document Coreference Resolution using Latent Features Axel-Cyrille Ngonga Ngomo} , Michael Röder} , and Ricardo Usbeck} } University of Leipzig, Germany,  R & D, Unister GmbH, Leipzig, Germany, email: {ngonga|usbeck}@informatik.uni-leipzig.de Abstract. Over the last years, entity detection approaches which com- bine named entity recognition and entity linking have been used to detect mentions of RDF resources from a given reference knowledge base in un- structured data. In this paper, we address the problem of assigning a single URI to named entities which stand for the same real-object across documents but are not yet available in the reference knowledge base. This task is known as cross-document co-reference resolution and has been addressed by manifold approaches in the past. We present a pre- liminary study of a novel take on the task based on the use of latent fea- tures derived from matrix factorizations combined with parameter-free graph clustering. We study the influence of di↵erent parameters (window size, rank, hardening) on our approach by comparing the F-measures we achieve on the N3 benchmark. Our results suggest that using latent fea- tures leads to higher F-measures with an increase of up to 20.5% on datasets of the N3 collection. 1 Introduction The Document Web contains a large amount of information that is still not available on the Web of Data. For example, open extraction frameworks for unstructured data have been shown to harvest a considerable amount of new triples pertaining to real-objects for which no URI is available [3]. While no URI has been assigned to the said real-world objects, facts pertaining to these objects can be distributed across manifold data sources. Hence, simple URI generation approaches based on the labels of named entities can easily fail to generate the same URI when relying on two di↵erent labels that stand for the same real-world object. For example, simple URI generation schemes based on strings would fail to generate the same URI when presented with the strings “P. Diddy” and “Pu↵ Daddy” as labels for resources. Moreover, they would generate the same URI for “Golf” across di↵erent documents even if the “Golf” stood for the sport in some documents and for the car in others. In literature, detecting that two labels stand for the same real-object even across documents is referred to as cross- document co-reference resolution (CDCR) [1,2]. While a large number of CDCR approaches have been developed in previous works (see Section 2), none of the current approaches makes use of latent features to detect whether two labels 33 stand for the same real-object. In previous work, latent features have yet been shown to be able to generate reliable representations of real-world objects [9]. In this paper, we address the aforementioned research gap by presenting the first CDCR approach based on latent features. Our approach represents entity mentions as bags of words. Each entity mention is then regarded as a vector in the space spanned by all words used to describe at least one entity mention. In the subsequent step, we compute the latent features of the entity mentions. The similarity of the latent representation of the entity mentions is then transformed into a similarity graph which is clustered by using BorderFlow [8], a parameter- free graph clustering approach. All entity mentions which belong to the same cluster are regarded as mentions of the same real-world object and are assigned to the same URI. Our approach is open-source and available at http://github. com/AKSW/CoreferenceResolution. The rest of this paper is organized as follows: First, we give an overview of previous CDCR approaches. Then, we present our approach in detail. In Section 4, we evaluate our approach on the N3 benchmark dataset [14] and compare it with a baseline approach. We conclude the paper and discuss future work in Section 5. 2 Related Work In the following section, we will provide an overview over recent approaches towards CDCR with a focus on their underlying techniques w.r.t. the semantic and syntactic features they exploit. Mayfield et al.’s [6] CDCR approach comprises five stages: (1) intra-document processing, i.e., identification of mentions of entities, (2) entity pairs filtering, i.e., discarding of possible entity mappings to reduce computational costs, (3) calculating features of entities, (4) classification of entity matching by machine learning techniques and (5) clustering of entities to map each mention to the same equivalence class. Unfortunately, the authors evaluated their approach in the ACE 2008 English named entity recognition task which is no longer available. There, the approach achieved a value metric of 54.8 [10]. Haghighi et al. [4] present an unsupervised approach based upon a generative process which is capable to use modular syntactic and semantic features making use of latent information. For every document, the generative process creates a number of entities mentioned in the text. For every mention a noun phrase is created. However, since the inference algorithm only uses these noun phrases, their approach lacks on taking a larger context into account. Rahman et al. [12] introduce an approach which incorporates world knowl- edge into two baseline CDCR algorithms. Thereby, the authors use YAGO1 and FrameNet2 as underlying knowledge bases. Afterwards, they use a mention-entity 1 http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/ research/yago-naga/yago/ 2 https://framenet.icsi.berkeley.edu/fndrupal/ 34 pair classifier and a cluster-ranking model. The results show an improvement over each baseline. Singh et al. [16] present an approach consisting of (1) a large scale distributed inference mechanism based on Markov chain Monte Carlo methods and (2) they introduce sub-entity and super-entity variables representing clusters which are used to distribute or collect certain entities on a specific part of the machine cloud. Furthermore, they evaluate their approach on a 1.5 million document comprising web crawl using anker tags to Wikipedia as gold standard. Never- theless, the authors approach misses the opportunity to consider latent features resulting in large computational costs w.r.t. the size of the resulting Markov chain. Lee et al. [5] present an approach not only capable of co-referencing enti- ties but also events. Their idea is based upon linear regression which is used to merge clusters of entities. Furthermore, the authors featurize entities via se- mantic role labeling. Their approach is able to co-reference entities intra- and inter-document-wise. Although the authors claim to be better than the state-of- the-art with respect to the CoNLL 2011 shared task [11] their published corpus is not available anymore. In 2013, Beheshti et al. [2] provide a systematic analysis of state-of-the-art CDCR systems. The survey provides an in-depth structurization of the underly- ing methods and algorithms, which are widely used to solve CDCR problems on large scale. Furthermore, the authors highlight certain Big Data challenges, e.g., large amounts of pair-wise string similarity calculations and costly classification algorithms. Normally, these approaches are based on a trained set of parameters for se- mantic and syntactic similarity algorithms. Recently, Andrews et al. [1] describe an approach towards CDCR, here called entity clustering, that relies on learning parameters from test data without the need for training data. The generative process within assumes a mutation of semantic context and syntactic similarity while generating the documents with cross-referenced entities. Afterwards, the authors deploy a block Gibbs sampler to infer the clusters. Unfortunately, this approach is only empirically evaluated. With respect to the clustering aspect of this paper, Schae↵er [15] provides an exhaustive overview of common graph-clustering algorithms and their use cases. To the best of our knowledge, we present the first paper on CDCR based on latent features, matrix decomposition as well as graph-clustering. 3 Approach In this section, we present our approach to CDCR in more detail. We introduce the notation necessary to understand the approach as required by each section. Figure 1 gives an overview of the five steps that underly our approach. In a first step, a Matrix M is generated containing the context of every entity mention. After that, this matrix is decomposed into two smaller matrices L and R with M ⇡ LR> . In parallel, a second matrix S is created which contains the pairwise 35 similarities of the labels of the entity mentions. These matrices are used to generate a symmetric graph G in which (1) every entity mention is a node and (2) two nodes are connected if their similarity is higher than a certain threshold. G is finally clustered. Mentions that belong to the same cluster are considered to be mentions of the same entity. Hence, they are all assigned the same URI. Matrix Matrix Generation Factorization M L,R Graph Graph Generation Clustering String Similarity Matrix Generation S Fig. 1: The five steps of our approach. 3.1 Matrix Generation The first step of our approach consists of generating a matrix which describes the context of every named entity mention inside the texts by means of a bag of words. To this end, the given corpus is preprocessed by tokenizing the documents, removing stop words and indexing the remaining tokens. In these tokenized documents, the context of a named entity mention is defined as the multiset of tokens inside a window with the size ± that is centered on the named entity’s tokens. The contexts are stored in a matrix M containing a row for every named entity mention and a column for every indexed word. The entries of the matrix are the counts of the words inside the entity mention’s context. As an example, let us consider the sentence Example 1. Yesterday, VW’s CEO presented the new Golf in Munich. from which the stopwords {the, in} are removed. For the window size = 1, we get the bag-of-word multiset {new (1), Munich (1)} as representation of “Golf”. Within the vector space spawned by (presented, new, Munich, Germany), this mention has the vector representation (0, 1, 1, 0). In the following, we will consider five entity mentions g1 , g2 , g3 , g4 and g5 labelled with the same word “golf” as example. These entity mentions will be assumed to be represented by the vectors g1 = (2, 2, 2, 0), g2 = (1, 0, 0, 1), g3 = (0, 0, 0, 1), g4 = (1, 0, 0, 0) and g5 = (0, 1, 1, 0). 3.2 Matrix Factorization The matrix M is now a matrix of dimensions n ⇥ m (denoted M (n, m)). The goal of a matrix factorization is to compute the matrices L(n, ⇢) and R(m, ⇢) 36 such that M ⇡ LR> . We call ⇢ 2 N\0 the rank of the factorization. Several approaches have been used to factorize matrices. Here, we loosely follow the tensor factorization approach presented in [9]: Given two matrices L and R that are supposed to be the factors of M , the overall quadratic error of the approximation is the square Frobenius norm of E = M RL> , i.e., ||E||2F = ||M RL> ||2F . Previous works have shown that to prevent overfitting, the error function to minimize must be extended. While several approaches have been suggested to this end, we adopt the error expression given by ||E||2F 2 (||R||2F + ||L||2F ), where 2 [0, 1] controls how well L and R fit M . Thus, the error derivatives are as follows: @eij = 2eij ljk + rik (1) @rik and @eij = 2eij rik + ljk . (2) @ljk We can now adopt a gradient descent approach to update the matrices L and R and reduce the error they lead to by overwriting each lik resp rjk as follows: n ! @eij X ljk ljk ↵ = ljk + ↵ 2 eij rik ljk (3) @ljk i=1 and 0 1 Xj @eij rik rik ↵ = rik + ↵ @2 eij ljk rik A . (4) @rik j=1 We initialize L and R with random entries between 0 and max mij . For our example, we get 0 1 2220 B1 0 0 1C B C M =B C B0 0 0 1C . (5) @1 0 0 0A 0110 For ⇢ = 2, our approach computes 0 1 1.385 1.102 0 1 B 0.006 0.501 C 0.331 1.406 B C B1.059 0.446C L=B B 0.079 0.051C C and R=B C @1.118 0.363A . (6) @ 0.234 0.712 A 0.062 0.066 0.933 0.168 The intuition behind our approach is that L is a better and compressed descrip- tion of the entity mentions than M . Hence, we now use L in combination with a string similarity function to compute the similarity of entity mentions. 37 3.3 String Similarity Matrix The string similarity matrix S is an optional feature of our approach. Each entry sij of S describes the similarity between the label of the ith and the jth entity in our input corpus. Assuming a symmetric string similarity function such as the 3-gram similarity (which we use in our experiments), we also get a symmetric string similarity matrix S. We assume sij = 1 if no string similarity is specified. sij = 1 also holds for our example, as all mentions are labelled with “golf”. 3.4 Graph Generation The aim of the graph generation is to generate a similarity graph G = (V, E, w) that will allow detecting mentions of the same real-world object through clus- tering. The set of vertices of V is the set of entity mentions in our corpus. We l(i,·) ·l(j,·) define the weight function w : V ⇥ V ! [0, 1] as w(vi , vj ) = sij ⇥ ||l(i,·) ||⇥||l(j,·) || , where l(i,·) is the ith row-vector of L and stands for the latent description of the ith entity mention in the corpus. Given that many graph clustering approaches are polynomial in the number of edges, we can control |E| by only setting an edge between vi and vj if w(vi , vj ) ✓ 2 [0, 1]. For ✓ = 0.3 and ⇢ = 2 we end up with the graph displayed in Figure 2(a). As comparison, Figure 2(b) shows the graph obtained with by setting L = M , i.e., generating G without using latent features. g3 g4 g3 g5 0.95 0.31 0.34 0.58 0.82 g1 0.71 g1 0.61 0.61 0.41 0.92 0.71 g2 g5 g2 g4 (a) Graph generated using (b) Graph generated using M ⇢=2 instead of L Fig. 2: Graphs generated by our approach for the example dataset. 3.5 Graph Clustering We now cluster the graph G to detect mentions that stand for the same real- world object. Our approach can rely on any graph clustering approach. In our current implementation, we rely on the BorderFlow algorithm [8] because it is parameter-free. BorderFlow regards any set C ✓ V as having a border b(C) = 38 {v 2 C : 9u 2 V \C with (v, u) 2 E}. The flow ⌦(CP 1 , C2 ) between two sets C1 ✓ V and C2 ✓ V is defined as ⌦(C1 , C2 ) = w(v, u). Based on v2C1 ,u2C2 these definitions, BorderFlow implements a local graph clustering paradigm by mapping each node v 2 V to the set of nodes C ✓ V that is such that v 2 C and C is a node-maximal set w.r.t. the function ⌦(b(C), C) bf (C) = . (7) ⌦(b(C), V \C) While finding the optimal C for each v can be very time-consuming, the heuristic presented in [7] allows determining an approximation of C in an efficient manner. We employ this heuristic herein. Now, the result of BorderFlow is not a partitioning of the graph. Rather, clusters may overlap. We thus employ a hardening approach to generate a par- titioning of the input graph. To this end, each node v 2 V which belongs to two di↵erent clusters C1 and C2 is assigned to C1 i↵ bf (C1 [ {v}) + bf (C2 \{v}) bf (C2 [ {v}) + bf (C1 \{v}). (8) In all other cases, v is assigned to C2 . We call this form of hardening flow maximization. Other forms of hardening can be conceived of, e.g., minimizing the number of union operations that need to be carried out to achieve a partitioning of the graph (set-based ). A third possibility is the silhouette hardening that chooses the cluster C1 if the dissimilarity of v to each other element of C1 is smaller than the dissimilarity to all elements of C2 [13]. For our example, we get the clusters {g1 , g5 } and {g2 , g3 , g4 } for ⇢ = 2 when using BorderFlow with any partitioning approach. If we replace L with M , we get the clusters {g1 }, {g2 , g4 } and {g3 , g5 }. This result on toy data already suggests that matrix factorization leads to results that di↵er from those gathered when using raw data. In the subsequent section, we show empirically that using L to generate G leads to more accurate results than using M to generate G. 4 Evaluation 4.1 Experimental Setup Goals The goal of our experiments was two-fold. First, we wanted to measure the e↵ect of the di↵erent parameters on our approach. Moreover, we wanted to know whether the factorization outperforms a comparable baseline. To achieve the first goal of our experiments, we conducted experiments where we varied the rank ⇢ as well as the window size while keeping all other parameters fixed. We addressed the second goal by creating a baseline as follows: We ran our pipeline as described in the sections above with the sole di↵erence that (1) we did not carry out a factorization and (2) we use M instead of L as input for the graph clustering. All other steps (matrix generation, graph generation, graph clustering) remained unchanged. The similarity threshold for the graph generation is set to ✓ = 0.1 for all our experiments. 39 Datasets We use the three corpora of the N3 collection [14] in our experiments. – The News-100 corpus comprises 100 German news articles from news.de. Each of these articles contains the German word “Golf”—a homonym that has three di↵erent meanings inside these documents. The word could mean (a) a gulf, e.g., the Mexican gulf, (b) the ball sport or (c) a compact car of the German manufacturer Volkswagen. This is clearly the most difficult dataset, as many resources share exactly the same name but have di↵erent meanings. – The Reuters-128 corpus contains 128 English economy news articles from the Reuters news agency. The documents in this dataset are smaller than the ones from the News-100 corpus providing a shallow context. – The third corpus, RSS-500, contains 500 documents each with only one sentence. The sentences were randomly chosen from a larger amount of RSS news feeds, as described in [3]. Every sentence contains exactly two named entities. Table 1 provides further detailed information about the corpora. On average, each named entity occurs nearly 5 times in the News-100 corpus. Within the Reuters-128 corpus nearly two mentions per named entity exist on average while in the RSS-500 corpus only every tenth entity is mentioned more than once. Table 1: Features of the corpora News-100 Reuters-128 RSS-500 Documents 100 128 500 Tokens 48199 33413 31640 Entities 362 444 849 Mentions 1655 880 1000 4.2 Results Influence of rank In our first series of experiments, we fixed the window size to 4 and measured the influence of the rank ⇢ on the precision, recall and F- measure. The left side of Figure 3 shows the results of our experiments on the three datasets. Most importantly, our results show that we outperform the base- line in most settings. We achieve the best increase of performance on the RSS-500 corpus, where we achieve a 20.5% increase in F-measure over the baseline. This result suggest that our approach does not tend to overgeneralize through the compression on information that is carried out during the factorization. Instead, our results suggest that we get rid of a significant amount of noise while factor- izing. Our results on the other two datasets show that we also achieve a better F-measure (increases of 18.2% on Reuters-128 and 6.3% on News-100, see Ta- ble 2). An analysis of the results reveals that this increase is mostly due to the 40 significant increase in precision that we achieve in most settings. On the other hand, our recall is rarely ever worse than that of the baseline. This suggests that BorderFlow tends to generate smaller clusters with factorization than when the baseline approach is used. We measure the statistical significance of our results using a Wilcoxon signed rank-test with 95% confidence. Our results are signifi- cant in all cases. Influence of window size In this experiment, we set the rank to 100 for all experiments and measured the e↵ect of the window size on the overall F- measure of our approach. The right half of Figure 3 shows the results of this series of experiments on the three datasets. Overall, our results suggest that for this rank, the window size does not have a major influence on the F-measure. This also seems to hold for other ranks. Interestingly, a small window size seems to lead to good results in most cases when we use the factorization. While we assume that this might be due to the factorization being able to convert transfer information from other context to the words within the window while computing the latent features of each entity mention, we still need to study this behavior more thoroughly. This result indicates that small window sizes are sufficient for our approach to achieve better F-measures than the baseline on the CDCR problem. This might mean that a small set of words is already sufficient to disambiguate resources across di↵erent documents. 4.3 E↵ect of hardening In all results presented above, we used a hardening based on the borderflow ra- tio. We also implemented the set-based hardening and the silhouette hardening mentioned above and compared the results we achieve with these hardenings. Overall, our results suggest that the borderflow-maximization approach that we used for hardening generates the best results both for the baseline and our ap- proach. Moreover, we outperform the baseline independently from the hardening used. Table 2: Best improvements in F-measure of our approach (OA) over the baseline (BL) Flow Max. Set-Based Silhouette BL OA BL OA BL OA News-100 25.86 32.21 23.87 28.81 26.56 34.05 Reuters-128 47.89 66.16 47.00 56.65 47.59 59.60 RSS-500 71.11 91.62 69.57 85.71 68.97 88.22 41 Precision BL Precision Recall BL Recall F1-Score BL F1-Score News-100 News-100 1 1 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0.2 0 0 10 20 30 40 50 60 70 80 90 100 1 2 3 4 5 6 7 8 9 Reuters-128 Reuters-128 1 1 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0.2 0 0 10 20 30 40 50 60 70 80 90 100 1 2 3 4 5 6 7 8 9 RSS-500 RSS-500 1 1 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0.2 0 0 10 20 30 40 50 60 70 80 90 100 1 2 3 4 5 6 7 8 9 rank for factorization window size Fig. 3: Precision, recall and F1-score of our approach with di↵erent ranks (left) and di↵erent window sizes (right) compared to the baseline (BL). The diagrams show the results for the flow maximization hardening. 42 Discussion Overall, our initial results suggest that we indeed outperform the proposed baseline by using matrix factorization (see Table 2). Still, many ques- tions do remain open. The most important question that we did not address is when should a high rank be used? First, in our experiments, ⇢ = 10 was suffi- cient across all datasets to outperform the baseline. To the best of our knowl- edge, finding the optimal rank for a factorization problem is an open question. Nevertheless, we think that the answer to this question lies in the amount of information contained in the corpus. The higher the information density of a corpus, the higher the rank required to characterize entity adequately. A sec- ond question that remains unanswered is whether we can improve the results of the factorization by considering known resources in the dataset. We will address this question in future work by disambiguating using a combination of textual information and Linked Data. 5 Conclusion In this paper, we presented a CDCR approach based on latent features. We showed that our approach can outperform our baseline by more than 10% F- measure. We will use our approach to complement the entity linking frame- work [17] when it is used in batch mode, i.e., over a document corpus at once. Moreover, we will develop means to detect an appropriate rank for factorization. To this end, we plan to use the derivative of the mean squared error ||M LR> ||2F . Finally, we will develop a deterministic approach to initialize L and R. Prelim- inary results on random matrices show that we can already reduce the initial value of ||E||2F by more approximately 40%, leading to a significantly faster convergence of the factorization. Acknowledgments This work has been supported by the ESF and the Free State of Saxony and the FP7 project GeoKnow (GA No. 318159). References 1. N. Andrews, J. Eisner, and M. Dredze. Robust entity clustering via phylogenetic inference. In Association for Computational Linguistics (ACL), 2014. 2. S.-M.-R. Beheshti, S. Venugopal, S. H. Ryu, B. Benatallah, and W. Wang. Big data and cross-document coreference resolution: Current state and future opportunities. CoRR, abs/1311.3987, 2013. 3. D. Gerber, A.-C. Ngonga Ngomo, S. Hellmann, T. Soru, L. Bühmann, and R. Us- beck. Real-time rdf extraction from unstructured data streams. In ISWC, 2013. 4. A. Haghighi and D. Klein. Coreference resolution in a modular, entity-centered model. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 385–393. Association for Computational Linguistics, June 2010. 43 5. H. Lee, M. Recasens, A. Chang, M. Surdeanu, and D. Jurafsky. Joint entity and event coreference resolution across documents. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computa- tional Natural Language Learning, EMNLP-CoNLL ’12, pages 489–500, Strouds- burg, PA, USA, 2012. Association for Computational Linguistics. 6. J. Mayfield, D. Alexander, B. J. Dorr, J. Eisner, T. Elsayed, T. Finin, C. Fink, M. Freedman, N. Garera, P. McNamee, et al. Cross-document coreference reso- lution: A key technology for learning by reading. In AAAI Spring Symposium: Learning by Reading and Learning to Read, pages 65–70, 2009. 7. A.-C. Ngonga Ngomo. Parameter-free clustering of protein-protein interaction graphs. In Proceedings of Symposium on Machine Learning in Systems Biology 2010, 2010. 8. A.-C. Ngonga Ngomo and F. Schumacher. Borderflow: A local graph clustering algorithm for natural language processing. In CICLing, pages 547–558, 2009. 9. M. Nickel, V. Tresp, and H.-P. Kriegel. Factorizing yago: scalable machine learning for linked data. In WWW, pages 271–280, 2012. 10. NIST. Automatic Content Extraction 2008 Evaluation. http://www.itl.nist.gov/iad/mig//tests/ace/. 11. S. Pradhan, L. Ramshaw, M. Marcus, M. Palmer, R. Weischedel, and N. Xue. Conll-2011 shared task: Modeling unrestricted coreference in ontonotes. In Pro- ceedings of the Fifteenth Conference on Computational Natural Language Learning (CoNLL 2011), Portland, Oregon, June 2011. 12. A. Rahman and V. Ng. Coreference resolution with world knowledge. In Proceed- ings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 814–824, 2011. 13. P. J. Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics, 20:53–65, 1987. 14. M. Röder, R. Usbeck, S. Hellmann, D. Gerber, and A. Both. N3 - a collection of datasets for named entity recognition and disambiguation in the nlp interchange format. In The 9th edition of the Language Resources and EvaluationConference, 26-31 May, Reykjavik, Iceland, 2014. 15. S. E. Schae↵er. Graph clustering. Computer Science Review, 1(1):27–64, 2007. 16. S. Singh, A. Subramanya, F. Pereira, and A. McCallum. Large-scale cross- document coreference using distributed inference and hierarchical models. In Pro- ceedings of the 49th Annual Meeting of the Association for Computational Linguis- tics: Human Language Technologies - Volume 1, HLT ’11, pages 793–803, Strouds- burg, PA, USA, 2011. Association for Computational Linguistics. 17. R. Usbeck, A.-C. Ngonga Ngomo, S. Auer, D. Gerber, and A. Both. AGDIS- TIS - Agnostic Disambiguation of Named Entities Using Linked Open Data. In International Semantic Web Conference. 2014. 44