Cross-Document Coreference Resolution using
                      Latent Features

         Axel-Cyrille Ngonga Ngomo} , Michael Röder} , and Ricardo Usbeck}

         } University of Leipzig, Germany,  R & D, Unister GmbH, Leipzig, Germany,
                       email: {ngonga|usbeck}@informatik.uni-leipzig.de


            Abstract. Over the last years, entity detection approaches which com-
            bine named entity recognition and entity linking have been used to detect
            mentions of RDF resources from a given reference knowledge base in un-
            structured data. In this paper, we address the problem of assigning a
            single URI to named entities which stand for the same real-object across
            documents but are not yet available in the reference knowledge base.
            This task is known as cross-document co-reference resolution and has
            been addressed by manifold approaches in the past. We present a pre-
            liminary study of a novel take on the task based on the use of latent fea-
            tures derived from matrix factorizations combined with parameter-free
            graph clustering. We study the influence of di↵erent parameters (window
            size, rank, hardening) on our approach by comparing the F-measures we
            achieve on the N3 benchmark. Our results suggest that using latent fea-
            tures leads to higher F-measures with an increase of up to 20.5% on
            datasets of the N3 collection.


     1    Introduction

     The Document Web contains a large amount of information that is still not
     available on the Web of Data. For example, open extraction frameworks for
     unstructured data have been shown to harvest a considerable amount of new
     triples pertaining to real-objects for which no URI is available [3]. While no URI
     has been assigned to the said real-world objects, facts pertaining to these objects
     can be distributed across manifold data sources. Hence, simple URI generation
     approaches based on the labels of named entities can easily fail to generate the
     same URI when relying on two di↵erent labels that stand for the same real-world
     object. For example, simple URI generation schemes based on strings would fail
     to generate the same URI when presented with the strings “P. Diddy” and “Pu↵
     Daddy” as labels for resources. Moreover, they would generate the same URI for
     “Golf” across di↵erent documents even if the “Golf” stood for the sport in some
     documents and for the car in others. In literature, detecting that two labels
     stand for the same real-object even across documents is referred to as cross-
     document co-reference resolution (CDCR) [1,2]. While a large number of CDCR
     approaches have been developed in previous works (see Section 2), none of the
     current approaches makes use of latent features to detect whether two labels


33
     stand for the same real-object. In previous work, latent features have yet been
     shown to be able to generate reliable representations of real-world objects [9].
         In this paper, we address the aforementioned research gap by presenting the
     first CDCR approach based on latent features. Our approach represents entity
     mentions as bags of words. Each entity mention is then regarded as a vector in
     the space spanned by all words used to describe at least one entity mention. In
     the subsequent step, we compute the latent features of the entity mentions. The
     similarity of the latent representation of the entity mentions is then transformed
     into a similarity graph which is clustered by using BorderFlow [8], a parameter-
     free graph clustering approach. All entity mentions which belong to the same
     cluster are regarded as mentions of the same real-world object and are assigned
     to the same URI. Our approach is open-source and available at http://github.
     com/AKSW/CoreferenceResolution.
         The rest of this paper is organized as follows: First, we give an overview
     of previous CDCR approaches. Then, we present our approach in detail. In
     Section 4, we evaluate our approach on the N3 benchmark dataset [14] and
     compare it with a baseline approach. We conclude the paper and discuss future
     work in Section 5.


     2   Related Work

     In the following section, we will provide an overview over recent approaches
     towards CDCR with a focus on their underlying techniques w.r.t. the semantic
     and syntactic features they exploit.
          Mayfield et al.’s [6] CDCR approach comprises five stages: (1) intra-document
     processing, i.e., identification of mentions of entities, (2) entity pairs filtering,
     i.e., discarding of possible entity mappings to reduce computational costs, (3)
     calculating features of entities, (4) classification of entity matching by machine
     learning techniques and (5) clustering of entities to map each mention to the
     same equivalence class. Unfortunately, the authors evaluated their approach in
     the ACE 2008 English named entity recognition task which is no longer available.
     There, the approach achieved a value metric of 54.8 [10].
          Haghighi et al. [4] present an unsupervised approach based upon a generative
     process which is capable to use modular syntactic and semantic features making
     use of latent information. For every document, the generative process creates a
     number of entities mentioned in the text. For every mention a noun phrase is
     created. However, since the inference algorithm only uses these noun phrases,
     their approach lacks on taking a larger context into account.
          Rahman et al. [12] introduce an approach which incorporates world knowl-
     edge into two baseline CDCR algorithms. Thereby, the authors use YAGO1 and
     FrameNet2 as underlying knowledge bases. Afterwards, they use a mention-entity
     1
       http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/
       research/yago-naga/yago/
     2
       https://framenet.icsi.berkeley.edu/fndrupal/


34
     pair classifier and a cluster-ranking model. The results show an improvement over
     each baseline.
         Singh et al. [16] present an approach consisting of (1) a large scale distributed
     inference mechanism based on Markov chain Monte Carlo methods and (2) they
     introduce sub-entity and super-entity variables representing clusters which are
     used to distribute or collect certain entities on a specific part of the machine
     cloud. Furthermore, they evaluate their approach on a 1.5 million document
     comprising web crawl using anker tags to Wikipedia as gold standard. Never-
     theless, the authors approach misses the opportunity to consider latent features
     resulting in large computational costs w.r.t. the size of the resulting Markov
     chain.
         Lee et al. [5] present an approach not only capable of co-referencing enti-
     ties but also events. Their idea is based upon linear regression which is used
     to merge clusters of entities. Furthermore, the authors featurize entities via se-
     mantic role labeling. Their approach is able to co-reference entities intra- and
     inter-document-wise. Although the authors claim to be better than the state-of-
     the-art with respect to the CoNLL 2011 shared task [11] their published corpus
     is not available anymore.
         In 2013, Beheshti et al. [2] provide a systematic analysis of state-of-the-art
     CDCR systems. The survey provides an in-depth structurization of the underly-
     ing methods and algorithms, which are widely used to solve CDCR problems on
     large scale. Furthermore, the authors highlight certain Big Data challenges, e.g.,
     large amounts of pair-wise string similarity calculations and costly classification
     algorithms.
         Normally, these approaches are based on a trained set of parameters for se-
     mantic and syntactic similarity algorithms. Recently, Andrews et al. [1] describe
     an approach towards CDCR, here called entity clustering, that relies on learning
     parameters from test data without the need for training data. The generative
     process within assumes a mutation of semantic context and syntactic similarity
     while generating the documents with cross-referenced entities. Afterwards, the
     authors deploy a block Gibbs sampler to infer the clusters. Unfortunately, this
     approach is only empirically evaluated.
         With respect to the clustering aspect of this paper, Schae↵er [15] provides an
     exhaustive overview of common graph-clustering algorithms and their use cases.
         To the best of our knowledge, we present the first paper on CDCR based on
     latent features, matrix decomposition as well as graph-clustering.


     3   Approach

     In this section, we present our approach to CDCR in more detail. We introduce
     the notation necessary to understand the approach as required by each section.
     Figure 1 gives an overview of the five steps that underly our approach. In a first
     step, a Matrix M is generated containing the context of every entity mention.
     After that, this matrix is decomposed into two smaller matrices L and R with
     M ⇡ LR> . In parallel, a second matrix S is created which contains the pairwise


35
     similarities of the labels of the entity mentions. These matrices are used to
     generate a symmetric graph G in which (1) every entity mention is a node and
     (2) two nodes are connected if their similarity is higher than a certain threshold.
     G is finally clustered. Mentions that belong to the same cluster are considered
     to be mentions of the same entity. Hence, they are all assigned the same URI.


                 Matrix             Matrix
               Generation        Factorization
                            M                    L,R
                                                     Graph              Graph
                                                     Generation       Clustering

                      String Similarity
                      Matrix Generation
                                                 S
                            Fig. 1: The five steps of our approach.


     3.1   Matrix Generation
     The first step of our approach consists of generating a matrix which describes
     the context of every named entity mention inside the texts by means of a bag of
     words. To this end, the given corpus is preprocessed by tokenizing the documents,
     removing stop words and indexing the remaining tokens. In these tokenized
     documents, the context of a named entity mention is defined as the multiset of
     tokens inside a window with the size ± that is centered on the named entity’s
     tokens. The contexts are stored in a matrix M containing a row for every named
     entity mention and a column for every indexed word. The entries of the matrix
     are the counts of the words inside the entity mention’s context. As an example,
     let us consider the sentence
     Example 1. Yesterday, VW’s CEO presented the new Golf in Munich.
     from which the stopwords {the, in} are removed. For the window size                      =
     1, we get the bag-of-word multiset {new (1), Munich (1)} as representation of
     “Golf”. Within the vector space spawned by (presented, new, Munich, Germany),
     this mention has the vector representation (0, 1, 1, 0). In the following, we will
     consider five entity mentions g1 , g2 , g3 , g4 and g5 labelled with the same word
     “golf” as example. These entity mentions will be assumed to be represented by
     the vectors g1 = (2, 2, 2, 0), g2 = (1, 0, 0, 1), g3 = (0, 0, 0, 1), g4 = (1, 0, 0, 0) and
     g5 = (0, 1, 1, 0).

     3.2   Matrix Factorization
     The matrix M is now a matrix of dimensions n ⇥ m (denoted M (n, m)). The
     goal of a matrix factorization is to compute the matrices L(n, ⇢) and R(m, ⇢)


36
     such that M ⇡ LR> . We call ⇢ 2 N\0 the rank of the factorization. Several
     approaches have been used to factorize matrices. Here, we loosely follow the
     tensor factorization approach presented in [9]: Given two matrices L and R
     that are supposed to be the factors of M , the overall quadratic error of the
     approximation is the square Frobenius norm of E = M RL> , i.e., ||E||2F =
     ||M RL> ||2F . Previous works have shown that to prevent overfitting, the error
     function to minimize must be extended. While several approaches have been
     suggested to this end, we adopt the error expression given by ||E||2F 2 (||R||2F +
     ||L||2F ), where   2 [0, 1] controls how well L and R fit M . Thus, the error
     derivatives are as follows:
                                   @eij
                                        =    2eij ljk + rik                        (1)
                                   @rik

     and
                                   @eij
                                        =    2eij rik + ljk .                      (2)
                                   @ljk
     We can now adopt a gradient descent approach to update the matrices L and R
     and reduce the error they lead to by overwriting each lik resp rjk as follows:
                                                        n
                                                                         !
                                   @eij             X
                    ljk    ljk   ↵      = ljk + ↵ 2     eij rik    ljk             (3)
                                   @ljk             i=1

     and                                          0                   1
                                                     Xj
                                   @eij
                   rik    rik    ↵      = rik + ↵ @2     eij ljk   rik A .         (4)
                                   @rik              j=1

     We initialize L and R with random entries between 0 and max mij . For our
     example, we get                  0        1
                                        2220
                                      B1 0 0 1C
                                      B        C
                                  M =B         C
                                      B0 0 0 1C .                          (5)
                                      @1 0 0 0A
                                        0110
     For ⇢ = 2, our approach computes
                      0               1
                         1.385 1.102                        0        1
                      B 0.006 0.501 C                    0.331 1.406
                      B               C                B1.059 0.446C
                  L=B B 0.079 0.051C
                                      C        and   R=B             C
                                                       @1.118 0.363A .             (6)
                      @ 0.234 0.712 A
                                                         0.062 0.066
                         0.933 0.168

     The intuition behind our approach is that L is a better and compressed descrip-
     tion of the entity mentions than M . Hence, we now use L in combination with
     a string similarity function to compute the similarity of entity mentions.


37
     3.3    String Similarity Matrix

     The string similarity matrix S is an optional feature of our approach. Each entry
     sij of S describes the similarity between the label of the ith and the jth entity in
     our input corpus. Assuming a symmetric string similarity function such as the
     3-gram similarity (which we use in our experiments), we also get a symmetric
     string similarity matrix S. We assume sij = 1 if no string similarity is specified.
     sij = 1 also holds for our example, as all mentions are labelled with “golf”.


     3.4    Graph Generation

     The aim of the graph generation is to generate a similarity graph G = (V, E, w)
     that will allow detecting mentions of the same real-world object through clus-
     tering. The set of vertices of V is the set of entity mentions in our corpus. We
                                                                               l(i,·) ·l(j,·)
     define the weight function w : V ⇥ V ! [0, 1] as w(vi , vj ) = sij ⇥ ||l(i,·) ||⇥||l(j,·) || ,
     where l(i,·) is the ith row-vector of L and stands for the latent description of the
     ith entity mention in the corpus. Given that many graph clustering approaches
     are polynomial in the number of edges, we can control |E| by only setting an
     edge between vi and vj if w(vi , vj ) ✓ 2 [0, 1]. For ✓ = 0.3 and ⇢ = 2 we end up
     with the graph displayed in Figure 2(a). As comparison, Figure 2(b) shows the
     graph obtained with by setting L = M , i.e., generating G without using latent
     features.


                     g3                    g4            g3                    g5
                               0.95
                       0.31       0.34                      0.58     0.82

                                g1                   0.71           g1

                       0.61       0.61                      0.41
                               0.92                                0.71
                     g2                    g5            g2                    g4

                   (a) Graph generated using       (b) Graph generated using M
                   ⇢=2                             instead of L

            Fig. 2: Graphs generated by our approach for the example dataset.


     3.5    Graph Clustering

     We now cluster the graph G to detect mentions that stand for the same real-
     world object. Our approach can rely on any graph clustering approach. In our
     current implementation, we rely on the BorderFlow algorithm [8] because it is
     parameter-free. BorderFlow regards any set C ✓ V as having a border b(C) =


38
     {v 2 C : 9u 2 V \C with (v, u) 2 E}. The flow ⌦(CP
                                                      1 , C2 ) between two sets
     C1 ✓ V and C2 ✓ V is defined as ⌦(C1 , C2 ) =            w(v, u). Based on
                                                             v2C1 ,u2C2
     these definitions, BorderFlow implements a local graph clustering paradigm by
     mapping each node v 2 V to the set of nodes C ✓ V that is such that v 2 C
     and C is a node-maximal set w.r.t. the function
                                               ⌦(b(C), C)
                                   bf (C) =                 .                           (7)
                                              ⌦(b(C), V \C)
     While finding the optimal C for each v can be very time-consuming, the heuristic
     presented in [7] allows determining an approximation of C in an efficient manner.
     We employ this heuristic herein.
         Now, the result of BorderFlow is not a partitioning of the graph. Rather,
     clusters may overlap. We thus employ a hardening approach to generate a par-
     titioning of the input graph. To this end, each node v 2 V which belongs to two
     di↵erent clusters C1 and C2 is assigned to C1 i↵

                bf (C1 [ {v}) + bf (C2 \{v})      bf (C2 [ {v}) + bf (C1 \{v}).         (8)

     In all other cases, v is assigned to C2 . We call this form of hardening flow
     maximization. Other forms of hardening can be conceived of, e.g., minimizing the
     number of union operations that need to be carried out to achieve a partitioning
     of the graph (set-based ). A third possibility is the silhouette hardening that
     chooses the cluster C1 if the dissimilarity of v to each other element of C1 is
     smaller than the dissimilarity to all elements of C2 [13].
         For our example, we get the clusters {g1 , g5 } and {g2 , g3 , g4 } for ⇢ = 2 when
     using BorderFlow with any partitioning approach. If we replace L with M , we get
     the clusters {g1 }, {g2 , g4 } and {g3 , g5 }. This result on toy data already suggests
     that matrix factorization leads to results that di↵er from those gathered when
     using raw data. In the subsequent section, we show empirically that using L to
     generate G leads to more accurate results than using M to generate G.


     4     Evaluation
     4.1   Experimental Setup
     Goals The goal of our experiments was two-fold. First, we wanted to measure
     the e↵ect of the di↵erent parameters on our approach. Moreover, we wanted to
     know whether the factorization outperforms a comparable baseline. To achieve
     the first goal of our experiments, we conducted experiments where we varied
     the rank ⇢ as well as the window size         while keeping all other parameters
     fixed. We addressed the second goal by creating a baseline as follows: We ran
     our pipeline as described in the sections above with the sole di↵erence that (1)
     we did not carry out a factorization and (2) we use M instead of L as input
     for the graph clustering. All other steps (matrix generation, graph generation,
     graph clustering) remained unchanged. The similarity threshold for the graph
     generation is set to ✓ = 0.1 for all our experiments.


39
     Datasets We use the three corpora of the N3 collection [14] in our experiments.
      – The News-100 corpus comprises 100 German news articles from news.de.
        Each of these articles contains the German word “Golf”—a homonym that
        has three di↵erent meanings inside these documents. The word could mean
        (a) a gulf, e.g., the Mexican gulf, (b) the ball sport or (c) a compact car
        of the German manufacturer Volkswagen. This is clearly the most difficult
        dataset, as many resources share exactly the same name but have di↵erent
        meanings.
      – The Reuters-128 corpus contains 128 English economy news articles from
        the Reuters news agency. The documents in this dataset are smaller than
        the ones from the News-100 corpus providing a shallow context.
      – The third corpus, RSS-500, contains 500 documents each with only one
        sentence. The sentences were randomly chosen from a larger amount of RSS
        news feeds, as described in [3]. Every sentence contains exactly two named
        entities.
     Table 1 provides further detailed information about the corpora. On average,
     each named entity occurs nearly 5 times in the News-100 corpus. Within the
     Reuters-128 corpus nearly two mentions per named entity exist on average while
     in the RSS-500 corpus only every tenth entity is mentioned more than once.


                             Table 1: Features of the corpora
                                News-100     Reuters-128    RSS-500
                   Documents          100           128           500
                   Tokens           48199         33413         31640
                   Entities           362           444           849
                   Mentions          1655           880          1000


     4.2   Results
     Influence of rank In our first series of experiments, we fixed the window size
     to 4 and measured the influence of the rank ⇢ on the precision, recall and F-
     measure. The left side of Figure 3 shows the results of our experiments on the
     three datasets. Most importantly, our results show that we outperform the base-
     line in most settings. We achieve the best increase of performance on the RSS-500
     corpus, where we achieve a 20.5% increase in F-measure over the baseline. This
     result suggest that our approach does not tend to overgeneralize through the
     compression on information that is carried out during the factorization. Instead,
     our results suggest that we get rid of a significant amount of noise while factor-
     izing. Our results on the other two datasets show that we also achieve a better
     F-measure (increases of 18.2% on Reuters-128 and 6.3% on News-100, see Ta-
     ble 2). An analysis of the results reveals that this increase is mostly due to the


40
     significant increase in precision that we achieve in most settings. On the other
     hand, our recall is rarely ever worse than that of the baseline. This suggests that
     BorderFlow tends to generate smaller clusters with factorization than when the
     baseline approach is used. We measure the statistical significance of our results
     using a Wilcoxon signed rank-test with 95% confidence. Our results are signifi-
     cant in all cases.


     Influence of window size In this experiment, we set the rank to 100 for
     all experiments and measured the e↵ect of the window size on the overall F-
     measure of our approach. The right half of Figure 3 shows the results of this
     series of experiments on the three datasets. Overall, our results suggest that for
     this rank, the window size does not have a major influence on the F-measure.
     This also seems to hold for other ranks. Interestingly, a small window size seems
     to lead to good results in most cases when we use the factorization. While we
     assume that this might be due to the factorization being able to convert transfer
     information from other context to the words within the window while computing
     the latent features of each entity mention, we still need to study this behavior
     more thoroughly. This result indicates that small window sizes are sufficient
     for our approach to achieve better F-measures than the baseline on the CDCR
     problem. This might mean that a small set of words is already sufficient to
     disambiguate resources across di↵erent documents.


     4.3   E↵ect of hardening

     In all results presented above, we used a hardening based on the borderflow ra-
     tio. We also implemented the set-based hardening and the silhouette hardening
     mentioned above and compared the results we achieve with these hardenings.
     Overall, our results suggest that the borderflow-maximization approach that we
     used for hardening generates the best results both for the baseline and our ap-
     proach. Moreover, we outperform the baseline independently from the hardening
     used.


     Table 2: Best improvements in F-measure of our approach (OA) over the baseline
     (BL)
                             Flow Max.         Set-Based         Silhouette
                             BL    OA          BL     OA         BL     OA
             News-100       25.86    32.21    23.87   28.81     26.56   34.05
             Reuters-128    47.89    66.16    47.00   56.65     47.59   59.60
             RSS-500        71.11    91.62    69.57   85.71     68.97   88.22


41
                   Precision              BL Precision
                      Recall                 BL Recall
                   F1-Score               BL F1-Score
                                    News-100                                           News-100
          1                                                          1


        0.8                                                         0.8


        0.6                                                         0.6


        0.4                                                         0.4


        0.2                                                         0.2


          0                                                          0
              10   20   30     40    50   60   70     80   90 100         1   2   3   4      5      6   7   8   9


                                 Reuters-128                                          Reuters-128
          1                                                          1


        0.8                                                         0.8


        0.6                                                         0.6


        0.4                                                         0.4


        0.2                                                         0.2


          0                                                          0
              10   20   30     40    50   60   70     80   90 100         1   2   3   4      5      6   7   8   9


                                    RSS-500                                               RSS-500
          1                                                          1


        0.8                                                         0.8


        0.6                                                         0.6


        0.4                                                         0.4


        0.2                                                         0.2


          0                                                          0
              10   20   30     40    50   60   70     80   90 100         1   2   3   4      5      6   7   8   9
                             rank for factorization                                   window size


     Fig. 3: Precision, recall and F1-score of our approach with di↵erent ranks (left)
     and di↵erent window sizes (right) compared to the baseline (BL). The diagrams
     show the results for the flow maximization hardening.


42
     Discussion Overall, our initial results suggest that we indeed outperform the
     proposed baseline by using matrix factorization (see Table 2). Still, many ques-
     tions do remain open. The most important question that we did not address is
     when should a high rank be used? First, in our experiments, ⇢ = 10 was suffi-
     cient across all datasets to outperform the baseline. To the best of our knowl-
     edge, finding the optimal rank for a factorization problem is an open question.
     Nevertheless, we think that the answer to this question lies in the amount of
     information contained in the corpus. The higher the information density of a
     corpus, the higher the rank required to characterize entity adequately. A sec-
     ond question that remains unanswered is whether we can improve the results of
     the factorization by considering known resources in the dataset. We will address
     this question in future work by disambiguating using a combination of textual
     information and Linked Data.


     5   Conclusion
     In this paper, we presented a CDCR approach based on latent features. We
     showed that our approach can outperform our baseline by more than 10% F-
     measure. We will use our approach to complement the entity linking frame-
     work [17] when it is used in batch mode, i.e., over a document corpus at once.
     Moreover, we will develop means to detect an appropriate rank for factorization.
     To this end, we plan to use the derivative of the mean squared error ||M LR> ||2F .
     Finally, we will develop a deterministic approach to initialize L and R. Prelim-
     inary results on random matrices show that we can already reduce the initial
     value of ||E||2F by more approximately 40%, leading to a significantly faster
     convergence of the factorization.


     Acknowledgments
                        This work has been supported by the ESF and the Free State
                        of Saxony and the FP7 project GeoKnow (GA No. 318159).


     References
      1. N. Andrews, J. Eisner, and M. Dredze. Robust entity clustering via phylogenetic
         inference. In Association for Computational Linguistics (ACL), 2014.
      2. S.-M.-R. Beheshti, S. Venugopal, S. H. Ryu, B. Benatallah, and W. Wang. Big data
         and cross-document coreference resolution: Current state and future opportunities.
         CoRR, abs/1311.3987, 2013.
      3. D. Gerber, A.-C. Ngonga Ngomo, S. Hellmann, T. Soru, L. Bühmann, and R. Us-
         beck. Real-time rdf extraction from unstructured data streams. In ISWC, 2013.
      4. A. Haghighi and D. Klein. Coreference resolution in a modular, entity-centered
         model. In Human Language Technologies: The 2010 Annual Conference of the
         North American Chapter of the Association for Computational Linguistics, pages
         385–393. Association for Computational Linguistics, June 2010.


43
      5. H. Lee, M. Recasens, A. Chang, M. Surdeanu, and D. Jurafsky. Joint entity and
         event coreference resolution across documents. In Proceedings of the 2012 Joint
         Conference on Empirical Methods in Natural Language Processing and Computa-
         tional Natural Language Learning, EMNLP-CoNLL ’12, pages 489–500, Strouds-
         burg, PA, USA, 2012. Association for Computational Linguistics.
      6. J. Mayfield, D. Alexander, B. J. Dorr, J. Eisner, T. Elsayed, T. Finin, C. Fink,
         M. Freedman, N. Garera, P. McNamee, et al. Cross-document coreference reso-
         lution: A key technology for learning by reading. In AAAI Spring Symposium:
         Learning by Reading and Learning to Read, pages 65–70, 2009.
      7. A.-C. Ngonga Ngomo. Parameter-free clustering of protein-protein interaction
         graphs. In Proceedings of Symposium on Machine Learning in Systems Biology
         2010, 2010.
      8. A.-C. Ngonga Ngomo and F. Schumacher. Borderflow: A local graph clustering
         algorithm for natural language processing. In CICLing, pages 547–558, 2009.
      9. M. Nickel, V. Tresp, and H.-P. Kriegel. Factorizing yago: scalable machine learning
         for linked data. In WWW, pages 271–280, 2012.
     10. NIST.               Automatic     Content      Extraction      2008    Evaluation.
         http://www.itl.nist.gov/iad/mig//tests/ace/.
     11. S. Pradhan, L. Ramshaw, M. Marcus, M. Palmer, R. Weischedel, and N. Xue.
         Conll-2011 shared task: Modeling unrestricted coreference in ontonotes. In Pro-
         ceedings of the Fifteenth Conference on Computational Natural Language Learning
         (CoNLL 2011), Portland, Oregon, June 2011.
     12. A. Rahman and V. Ng. Coreference resolution with world knowledge. In Proceed-
         ings of the 49th Annual Meeting of the Association for Computational Linguistics:
         Human Language Technologies, pages 814–824, 2011.
     13. P. J. Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation
         of cluster analysis. Journal of computational and applied mathematics, 20:53–65,
         1987.
     14. M. Röder, R. Usbeck, S. Hellmann, D. Gerber, and A. Both. N3 - a collection of
         datasets for named entity recognition and disambiguation in the nlp interchange
         format. In The 9th edition of the Language Resources and EvaluationConference,
         26-31 May, Reykjavik, Iceland, 2014.
     15. S. E. Schae↵er. Graph clustering. Computer Science Review, 1(1):27–64, 2007.
     16. S. Singh, A. Subramanya, F. Pereira, and A. McCallum. Large-scale cross-
         document coreference using distributed inference and hierarchical models. In Pro-
         ceedings of the 49th Annual Meeting of the Association for Computational Linguis-
         tics: Human Language Technologies - Volume 1, HLT ’11, pages 793–803, Strouds-
         burg, PA, USA, 2011. Association for Computational Linguistics.
     17. R. Usbeck, A.-C. Ngonga Ngomo, S. Auer, D. Gerber, and A. Both. AGDIS-
         TIS - Agnostic Disambiguation of Named Entities Using Linked Open Data. In
         International Semantic Web Conference. 2014.


44