Efficient Computation of Relationship-Centrality
             in Large Entity-Relationship Graphs

 Stephan Seufert1 , Srikanta J. Bedathur2 , Johannes Hoffart1 , Andrey Gubichev3 , and
                                   Klaus Berberich1
                       1
                      Max Planck Institute for Informatics, Germany
               {sseufert,jhoffart,kberberi}@mpi-inf.mpg.de
                                 2
                                   IIIT Delhi, India
                            bedathur@iiitd.ac.in
                     3
                       Technische Universität München, Germany
                        andrey.gubichev@in.tum.de


       Abstract. Given two sets of entities – potentially the results of two queries on a
       knowledge-graph like YAGO or DBpedia– characterizing the relationship between
       these sets in the form of important people, events and organizations is an analytics
       task useful in many domains. In this paper, we present an intuitive and efficiently
       computable vertex centrality measure that captures the importance of a node
       with respect to the explanation of the relationship between the pair of query sets.
       Using a weighted link graph of entities contained in the English Wikipedia, we
       demonstrate the usefulness of the proposed measure.


1   Introduction
Consider a journalist researching the political relations between France and Germany.
In order to gain insight into the underlying relationship, it is an important task to identify
entities (e. g. events, organizations, etc.) that play an important role in the interactions
between these countries. This task can be greatly simplified if the journalist could simply
input two sets of entities – corresponding to the classes “French Politicians” and “Ger-
man Politicians” – and the system automatically generates a ranking of entities in the
knowledge base, reflecting their potential for characterizing the relationship between the
two entity-sets. Variants of this relationship characterization problem can be found in
settings ranging from political studies to analysis of relationships in computational biol-
ogy and economics. With the availability of massive entity-relationship networks such as
Wikipedia, DBLP, and BioCyc networks as well as large Semantic Web ontologies like
YAGO2 [5], solutions not only need to be effective, but also scalable. State-of-the-art
approaches for identifying important nodes in networks include various centrality mea-
sures (e.g., closeness- and betweenness-centrality) which operate on the entire network,
without any specific input entity sets.
    In this paper, we develop a novel centrality measure called relationship centrality,
that assesses the ‘strength’ of a node in the relationship path between the given two
sets of nodes. These scores can be computed exactly, or can be well-approximated to
scale to networks as large as the entire Wikipedia graph, comprising tens of millions of
edges. The resulting rankings can further be restricted to entities of certain types (e.g,
Organization or Location etc.), leveraging semantic knowledge-bases such as
YAGO2 [5] or DBPedia [1]. In the following section, we formally introduce our novel
centrality measure.
2   Relationship Centrality

Graph centrality measures, which assign to every node a score reflecting its importance in
the graph structure, are a valuable tool for analyzing different kinds of graphs. Although
they have been studied extensively in the scope of social networks, the use of centrality
measures in the context of Semantic Web is gaining importance only recently. The
classical measures proposed in past include closeness [3] and betweenness centrality [2].
However, these measures become computationally expensive when we consider large
networks. Also, their utility in ranking nodes with respect to an input set of entities is
rather limited.
    In contrast, the measure we introduce in this work, called relationship centrality, is
easier to compute since it is designed to assign scores that reflect the centrality only with
respect to the two input entity sets, rather than on a global scale. Formally, given two
query entity sets S and T from the network, we define the relationship centrality of a
node v as follows:
                                          XX          1
                                cR (v) =                     ,
                                                  ρ(s, v, t)
                                          s∈S t∈T

where ρ(s, v, t) is a penalty function for a path connecting a node s ∈ S and t ∈ T
passing through v, given by: ρ(s, v, t) = (1 + d(s, v)) · (1 + d(v, t)). The distance d(·, ·)
between two nodes connected by an edge can be customized based on the underlying
network to measure the semantic distance between the corresponding entities. The
corresponding edge-weighting schemes we envision can be based on the graph structure
(Milne-Witten inlink overlap measure [6]) or textual representations of the entities
(keyphrase overlap measure [4]), among others.
    Relationship centrality takes into account different paths than betweenness centrality
(which regards the shortest paths between all pairs of vertices). For every vertex in the
graph and every pair (s, t) ∈ S × T , the shortest path from s to t passing through v
contributes to the centrality score of v.
    The corresponding paths are computed as follows: For every vertex s ∈ S and
t ∈ T , the shortest distances to each vertex v ∈ V are computed using Dijkstra’s
algorithm. Then, the centrality scores for every vertex can be computed from the resulting
distance vectors. While the O(m + n log(n)) time complexity induced by each of the
|S| + |T | required shortest path computations is rather lightweight, for very large graphs
the corresponding computation time can be too demanding, especially for interactive
applications. For this purpose, we have experimented with an alternative scoring scheme
which only considers shortest path distances up to a value ∆ ∈ R. Then, the distance
from a query node q to a vertex v is approximated in the following way:
                                   (
                        ˜ v) =         d(q, v)      for d(q, v) ≤ ∆,
                        d(q,
                                       diam(G)      else,

where diam(G) denotes the diameter of the (weighted) graph. In our experimental
evaluation in Section 4, we evaluate the quality of computed scores based on different
choices of the cutoff parameter, ∆.
 Rank Entity                              Rank Entity                              Rank Entity
     1   2009 G-20 Pittsburgh sum.         1    The Expendables (2010 film)          1   Iraq War
     2   2010 G-20 Toronto summit          2    Crouching Tiger, Hidden Dragon       2   War in Afghanistan
     3   37th G8 summit                    3    The Forbidden Kingdom                3   Gulf War
     4   Iraq War                          4    Rush Hour 2                          4   Op. Enduring Freedom
     5   35th G8 summit                    5    Police Story (1985 film)             5   Yom Kippur War
     6   2009 G-20 London Summit           6    Once Upon a Time in China II         6   War on Terror
     7   36th G8 summit                    7    Fist of Fury                         7   Battle of Karameh
     8   2010 G-20 Seoul summit            8    Romeo Must Die                       8   Palestinian diaspora
     9   Presidency of G. W. Bush          9    Kung Fu Hustle                       9   Operation Opera
    10   2009 Nobel Peace Prize            10   Fearless (2006 film)                10   Suez Crisis
               (a) Q1                                   (b) Q2                               (c) Q3
                              Table 1. Top ranked entities for example queries
3        Application
In this section, we briefly present the application scenarios we envision. We target
analytical tasks at the downstream of Semantic Web applications. In particular, we
consider a large knowledge-base (such as YAGO or DBPedia), over which S and T sets
are derived as results of SPARQL queries.
Example: As a concrete scenario, the following two queries retrieve all organizations
conducting research on (variants of) lung cancer, and all tobacco companies respectively:
                SELECT ?p WHERE { ?p <rdf:type> <Organization> . ?p <worksOn> <Lung Cancer> }

                          SELECT ?c WHERE { ?c <rdf:type> <American Tobacco Company> }


    An analytics task could be to identify legal cases that played an important role in the
relationship between the entity sets corresponding to the query results. We can utilize
the relationship centrality measure developed above, and then use a type hierarchy, e. g.
Wikipedia categories or the WordNet lexical database, to retain only the relevant entity
types in the generated ranking.

4        Experimental Evaluation
In this section, we provide an overview over our experimental evaluation of the rela-
tionship centrality measure. In order to empirically validate the assigned scores, we
have compiled several example queries over an edge-weighted entity-relationship graph
obtained from Wikipedia: The vertices of the graph correspond to Wikipedia pages
that represent an entity contained in the YAGO knowledge base. Two vertices u, v are
connected via a weighted, undirected edge if there exists an internal Wikipedia link in
either direction between the corresponding articles, A(u), A(v). The weight we assign
to the edge (u, v) should capture the semantic relatedness of the respective concepts. For
this purpose, we employ the inlink overlap measure originally proposed by Milne and
Witten [6]. For nodes u, v the weight (inverse semantic relatedness) is given by

                                       log(max{|Iu |, |Iv |}) − log(|Iu ∩ Iv |)
                          d(u, v) =                                             ,
                                           log(n) − log(min{|Iu |, |Iv |})

where Iu and Iv denote the set of pages linking to A(u) and A(v), respectively and n
corresponds to the overall number of pages. The resulting weights lie in the interval
[0, ∞]. Using this measure, vertices u and v exhibit a high semantic relatedness if the
weight of the edge (u, v) is close to zero. Finally, we discard all edges with d(u, v) ≥ 1.
                       Query  ∆=∞             ∆=1              ∆ = 0.5
                                Time         Time       τ     Time      τ
                       Q1   41,960.40 ms 18,629.80 ms 1.0 4,616.05 ms 0.55
                       Q2   48,174.80 ms 15,002.70 ms 1.0 5,117.02 ms 0.60
                       Q3   71,162.50 ms 32,028.50 ms 1.0 7,858.39 ms 0.87
                          Table 2. Computation time and ranking quality
The resulting graph structure contains around 2.5 million vertices (entities) and roughly
37 million edges.
    In order to empirically evaluate our ranking, we use three example queries where the
sets of entities correspond to a collection of
Q1: Events between European politicians (S) and US American politicians (T )4
Q2: Movies between US action movie stars (S) and Asian action movie stars (T )5
Q3: Events between countries from the Middle East/Central Asia (S) and Western
   countries (T )6
In Table 1 we present the top-10 ranked results by relationship centrality for each of the
queries. The resulting rankings suggest that our measure is useful for the explanation of
the relationship between the sets of query entities. Regarding the computation time, we
give an overview over the effect of pruning the shortest path computation using different
cutoff parameters ∆, as well as the resulting rank correlation (measured by Kendall’s τ )
for the top 10 entities in Table 2.

5    Conclusions & Outlook
In this work we have presented the relationship centrality measure, a vertex centrality
score that reflects the potential of an individual vertex for the explanation of the rela-
tionship between two sets of query nodes. Our preliminary experimental results over
the edge-weighted Wikipedia entity-relationship graph indicate that our measure can
provide valuable insights into the relationship between sets of real-world entities. In
future work, we plan to conduct a large-scale evaluation of our result ranking in a user
study. In addition, we plan to use our centrality measure as a building block for extracting
interesting subgraphs between the query entities.

References
1. C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. Dbpedia
   - a crystallization point for the web of data. J. Web Sem., 7(3):154–165, 2009.
2. L. C. Freeman. A set of measures of centrality based upon betweenness. Sociometry, 40:35–41,
   1977.
3. L. C. Freeman. Centrality in social networks: Conceptual clarification. Social Networks,
   1(3):215–239, 1979.
4. J. Hoffart, S. Seufert, D. B. Nguyen, M. Theobald, and G. Weikum. KORE: Keyphrase
   Overlap Relatedness for Entity Disambiguation. In CIKM’12: Proceedings of the 21th ACM
   International Conference on Information and Knowledge Management, pages 545–555. ACM,
   2012.
5. J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A Spatially and Temporally
   Enhanced Knowledge Base from Wikipedia. Artificial Intelligence, 2013.
6. D. Milne and I. H. Witten. An Effective, Low-Cost Measure of Semantic Relatedness Obtained
   from Wikipedia Links. In WIKIAI’08: Proceedings of the 2008 AAAI Workshop on Wikipedia
   and Artificial Intelligence. AAAI, 2008.

 4
   S = {Angela Merkel, Nicolas Sarkozy, David Cameron, Silvio Berlusconi},T = {Barack Obama, Hillary Clinton}
 5
   S = {Chuck Norris, A. Schwarzenegger, Sylvester Stallone,Bruce Willis},T = {Jet Li, Jackie Chan, Chow Yun-Fat}
 6
   S = {Iraq, Iran, Israel, Palestine, Afghanistan, Pakistan},T = {Germany, France, Spain, Italy, Netherlands, Portugal}