<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Global Vertex Similarity for Large-Scale Knowledge Graphs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marco Caballero</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aidan Hogan</string-name>
          <email>ahogang@dcc.uchile.cl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DCC, Universidad de Chile &amp; IMFD</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>We investigate global measures of vertex similarity for knowledge graphs. While vertex similarity has been explored in the context of directed, unlabelled graphs, measures based on recursive algorithms or learning frameworks can be costly to compute, assume labelled data, and/or provide poorly-interpretable results. Knowledge graphs further imply unique challenges for vertex similarity in terms of scale and diversity. We thus propose and explore global measures of vertex similarity for knowledge graphs that (i) are unsupervised; (ii) offer explanations of similarity results; (iii) take into consideration edge labels; and (iv) are robust in terms of redundant or interdependent information. Given that these measures can still be costly to compute precisely, we propose an approximation strategy that enables computation at scale. We compare our measures with a recursive measure (SimRank) for computing vertex similarity over subsets of Wikidata.</p>
      </abstract>
      <kwd-group>
        <kwd>Similarity</kwd>
        <kwd>Knowledge Graphs</kwd>
        <kwd>Wikidata</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Knowledge graphs have become popular for integrating and managing
incomplete sources of data at large scale, where nodes represent entities of interest,
and edges represent the relations between them. While knowledge graphs can
be used for a variety of purposes and applications [20], often they are used to
support exploratory tasks, where the user may not know precisely what they are
looking for but will rather explore the data in order to gain actionable
knowledge [26]. In this exploratory context, adopting a graph-based data model opens
up opportunities to leverage graph algorithms for the purposes of discovering
implicit connections or patterns in a knowledge graph [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        An important class of graph algorithms is that of vertex similarity [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Given
a graph G = (V, E), which may be directed or undirected, the goal of vertex
similarity is to identify pairs of nodes (i.e., vertices) that are similar. Global
scenarios involve computing similarity scores for all pairs of nodes in V × V
offline, possibly restricted by a certain criterion, such as a similarity threshold
or a k-nearest neighbour constraint. Single-source scenarios take a given node
v ∈ V as input and find the nodes most similar to it at runtime. Unlike similarity
based (e.g.) on metric spaces, vertex similarity relies on the graph's structure.
(Copyright © 2020 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).)
      </p>
      <p>
        As in the case of graph analytics such as centrality, clustering, etc., there
is no one single notion of similarity between vertices. Thus a variety of
vertex similarity measures have been proposed, ranging from simple local measures
that take into consideration a bounded neighbourhood to more complex
recursive ones [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. When considering different similarity measures, a trade-off arises:
while recursive algorithms tend to give better results by considering
information beyond the local neighbourhood, local similarity measures tend to be more
efficient, more scalable, and arguably more explainable.
      </p>
      <p>
        Though vertex similarity has been used productively in various domains [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ],
knowledge graphs present three characteristic challenges for graph algorithms.
      </p>
      <p>Regarding diversity, a knowledge graph can be seen as multiple directed
graphs representing different relations, where for the purposes of vertex
similarity, it is not immediately clear how the similarity computed for the graphs of
individual relations can be combined into an overall measure. Different relations may
also encode different graph topologies. For example, a relation :citizenship
would yield a bipartite directed graph, where person nodes have low out-degree
and zero in-degree, while country nodes have very high in-degree and zero
out-degree. On the other hand, a relation :follows in a social network would yield
a cyclic directed graph consisting of user nodes, whose in-degree and out-degree
average to the same value, but follow different distributions. Furthermore, a
vertex similarity measure should be able to distinguish two directed labelled edges,
such as :Unforgiven –:director→ :CEastwood and :Unforgiven –:actor→ :CEastwood, as
encoding different information about the movie and the person.</p>
      <p>Regarding redundancy, knowledge graphs may provide redundant (i.e.,
entailed) or near-redundant (i.e., dependent) data. For example, a knowledge graph
may state that :Jill –:mother→ :Anne and :Anne –:child→ :Jill, where, though
the latter edge is intuitively redundant (i.e., it could be entailed from the former
triple with the appropriate ontology), it can affect the results of vertex
similarity by changing the structure of the graph. As an example of near-redundant
data, a knowledge graph may state :Honduras –:region→ :CentralAmerica and
:Honduras –:language→ :Spanish, where, while neither triple necessarily follows
from the other, there exists a dependency between both edges. Such redundancy
is more prevalent in knowledge graphs due to the fact that multiple
interdependent relations can be encoded; in fact, encoding such redundancy is a feature
of knowledge graphs supported by ontologies. When applying vertex similarity
over knowledge graphs, we must then be cautious not to overestimate similarity
by considering dependent edges as independent events.</p>
      <p>Regarding scale, global vertex similarity, when applied to V × V, may
generate a similarity relation of quadratic size (O(|V|²)), which will be infeasible
to materialise at scale. Even where only the results for one vertex are required,
or where thresholds are applied, recursive algorithms may require knowledge of
pairwise similarities in order to compute any similarities.</p>
      <p>It is in this context that we investigate four measures of vertex similarity for
knowledge graphs, two of which are taken from the literature and adapted by us
for the knowledge graph setting, and two of which are novel. We will motivate and
define these measures, and discuss how they can be optimised and approximated.
We will apply these similarity measures to sub-graphs of Wikidata to compare
their performance and to see how the results they provide correlate with each
other. We also evaluate the measures against two external ground truths: one
for similar movies, and another for similar music albums.
</p>
    </sec>
    <sec id="sec-2">
      <title>2 Background</title>
      <p>
        Vertex similarity. Measures of vertex similarity are mostly proposed in the
context of undirected and directed graphs [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. For this discussion we will assume
a directed graph G = (V, E). A vertex similarity measure σ : V × V → ℝ is
a measure that associates pairs of vertices (v1, v2) with a real-valued similarity
score. We will assume, without loss of generality, that a higher similarity score
indicates that the pair of nodes is more similar; i.e., that σ(v1, v2) &gt; σ(v1, v3)
indicates that v1 is more similar to v2 than it is to v3.
      </p>
      <p>Vertex similarity measures may be local, or may be non-local. We say that
a particular measure σ is local if the value of σ(v1, v2) depends on a bounded
neighbourhood surrounding v1 and v2; different notions of neighbourhood give
rise to different levels of locality, such as considering only the nodes themselves,
considering the graph induced by the nodes themselves and their direct
neighbours in G, or their neighbours' neighbours, and so forth. Otherwise, if σ(v1, v2)
depends on an unbounded neighbourhood, we say that σ is non-local.</p>
      <p>We also distinguish between recursive and non-recursive measures. We call a
measure σ recursive if the value of σ(v1, v2) depends on the value of σ for pairs
of nodes other than v1 and v2. Otherwise we call σ non-recursive. Non-recursive
measures tend to be local, while recursive measures tend to be non-local (though
recursive measures applied up to a bounded number of iterations are local).</p>
      <p>
        A number of local, non-recursive measures are introduced by Leicht et al. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ],
where nodes are considered similar if they have a similar neighbourhood (aka.
structural equivalence). For simplicity, we will consider the neighbourhood of
a node v in G = (V, E) generically as E(v) = {n | (v, n) ∈ E}; we could
also, for example, consider neighbours via incoming links in directed graphs. In
terms of concrete measures for σ(v1, v2), we can consider simply a neighbourhood
count (|E(v1) ∩ E(v2)|), or neighbourhood Jaccard similarity (|E(v1) ∩ E(v2)| / |E(v1) ∪ E(v2)|), or
neighbourhood cosine similarity (|E(v1) ∩ E(v2)| / √(|E(v1)| · |E(v2)|)) [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
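      <p>As a minimal illustration of these three local measures (this sketch is ours, not code from the paper), the following Python snippet computes them over a toy directed graph represented as an adjacency dictionary; the function names and graph encoding are our own assumptions:</p>

```python
import math

# Toy directed graph as an adjacency dictionary: v -> E(v).
E = {"a": {"x", "y", "z"}, "b": {"x", "y"}, "c": {"z"}}

def sim_count(E, v1, v2):
    # Neighbourhood count: |E(v1) ∩ E(v2)|
    return len(E.get(v1, set()).intersection(E.get(v2, set())))

def sim_jaccard(E, v1, v2):
    # Jaccard: |E(v1) ∩ E(v2)| / |E(v1) ∪ E(v2)|
    union = E.get(v1, set()).union(E.get(v2, set()))
    return sim_count(E, v1, v2) / len(union) if union else 0.0

def sim_cosine(E, v1, v2):
    # Cosine: |E(v1) ∩ E(v2)| / sqrt(|E(v1)| * |E(v2)|)
    d1, d2 = len(E.get(v1, set())), len(E.get(v2, set()))
    return sim_count(E, v1, v2) / math.sqrt(d1 * d2) if d1 and d2 else 0.0

assert sim_count(E, "a", "b") == 2  # a and b share x and y
```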
      <p>
        On the other hand, recursive measures of similarity are generally based on
the principle that nodes are considered similar if they have many similar
neighbours; in this case, we consider the similarity of neighbours. Many recursive
similarity measures are based on recursive centrality measures, following loosely
the intuition that similar nodes should be well-connected { or easily reachable {
from each other. Jeh and Widom [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] propose a vertex similarity measure based
on PageRank centrality, Blondel et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] a vertex similarity measure based on
HITS (hubs and authorities) centrality, and Leicht et al. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] a vertex
similarity measure based on Katz centrality. All such measures are recursive, and have
proven to give good results in practice. Of these measures, arguably the one that
has been most influential is that of Jeh and Widom [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], called SimRank, which
has led to various follow-up works on accuracy estimations [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], probabilistic
graphs [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], similarity joins [28], etc., for the algorithm.
      </p>
      <p>Some more recent approaches define vertex similarity based on latent features
of the graph using machine learning techniques. In particular, there are strong
connections between the world of graph embeddings (whose goal is to develop
meaningful numeric representations of the elements of a graph) and vertex
similarity, where, on the one hand, embedding techniques can be used to learn
similar numeric representations for nodes [23], while on the other hand, off-the-shelf
similarity measures can be used to guide the embeddings that are learnt [25].
Such measures can again be local [23] or non-local [25].</p>
      <p>
        Similarity in knowledge graphs. Similarity over knowledge graphs has been used
for general applications, such as clustering [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], entity comparison [21], entity
linking [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], link discovery [19], ontology matching [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], query relaxation [
        <xref ref-type="bibr" rid="ref12 ref16">12,16</xref>
        ],
semantic relatedness [22], as well as domain-specific use-cases, such as academic
search [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], crime reports [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], geographic databases [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], multimedia retrieval [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ],
protein search [
        <xref ref-type="bibr" rid="ref2 ref4">2,4</xref>
        ], etc. We identify two forms of similarity used in these works.
      </p>
      <p>
        Metric similarity involves the use of a (normalised) distance metric δ
(Euclidean, Manhattan, Levenshtein, etc.), where similarity will typically be defined
in terms of the distance as σ(v1, v2) = 1 − δ(v1, v2). The use of δ thus
induces/supposes a metric space to which nodes must be mapped. Such forms of similarity
have been used for geographic databases [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], link prediction [24], multimedia
retrieval [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], protein similarity search [
        <xref ref-type="bibr" rid="ref2 ref4">2,4</xref>
        ] and query relaxation [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. However,
such metric spaces are either domain-specific [
        <xref ref-type="bibr" rid="ref11 ref2 ref3 ref4">3,2,4,11</xref>
        ], or need to be specified
manually [
        <xref ref-type="bibr" rid="ref16">24,16</xref>
        ]; they are not inherent to graphs in general.
      </p>
      <p>
        Vertex similarity, on the other hand, is a measure intrinsic to graphs.
Variations of the aforementioned algorithms for vertex similarity have been adapted
for use in clustering [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], entity comparison [21], entity linking [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], query
relaxation over crime reports [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], semantic relatedness [22], etc.
      </p>
      <p>Novelty. We investigate vertex similarity measures for knowledge graphs,
including variants of two existing measures, and two novel measures. Our novel
measures are local and non-recursive. We show one to be competitive with
SimRank (a representative non-local, recursive measure) in terms of predicting a
ground truth, while being more scalable, more efficient, and more explainable.</p>
    </sec>
    <sec id="sec-3">
      <title>3 Vertex Similarity Measures</title>
      <p>
        We now define four vertex similarity measures: neighbourhood count (σnc),
neighbourhood selectivity (σns), neighbourhood rarity (σnr) and SimRank (σsr).
Of these measures, σns and σnr are, to the best of our knowledge, novel (though
σns is inspired by our prior proposal called concurrence [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]). Given a directed
graph G = (V, E), we recall the notation E(v) to denote the neighbouring nodes
of v in G. We assume for our similarity measures that σ(v1, v2) ∈ [0, ∞), where
σ(v1, v2) &gt; σ(v1, v3) implies that v1 is more similar to v2 than to v3.
Neighbourhood count Our first vertex similarity measure is the
neighbourhood count as defined previously: σnc(v1, v2) = |E(v1) ∩ E(v2)|. The measure
simply counts the number of neighbours that v1 and v2 have in common.
Neighbourhood selectivity Neighbourhood count treats all neighbours as
being equal. However, neighbours with a high (in)degree, like :UnitedStates,
will be shared by many pairs of nodes, yet would be counted the same as a
node with a low (in)degree, such as :VaticanCity. Equivalently, we can say that
the probability that a particular node v has :UnitedStates in its neighbourhood
is much higher than it having :VaticanCity in its neighbourhood. Specifically,
for a node n, let deg(n) = |{v ∈ V | n ∈ E(v)}| denote the degree of n in terms
of the number of neighbourhoods it appears in. We denote the probability of n
appearing in the neighbourhood of a random node as P(n) = deg(n)/|V|.
      </p>
      <p>Let {n1, …, nk} = E(v1) ∩ E(v2) denote the set of neighbours that v1 and
v2 have in common. Instead of simply counting the neighbours that two nodes
have in common (k), we propose neighbourhood selectivity, which measures the
vertex similarity between v1 and v2 as the probability of a random node v not
having all of the individual neighbours that v1 and v2 share in common:
σns(v1, v2) = P(¬(n1 ∧ … ∧ nk)) = 1 − P(n1) · … · P(nk) = 1 − (deg(n1) · … · deg(nk)) / |V|^k</p>
      <p>This assumes that having each individual neighbour is an independent event. The
idea behind this measure is that the more neighbours a pair of nodes share, and
the rarer those neighbours, the lower the probability of a random node
having all of those neighbours, and thus the higher the similarity measure.
Neighbourhood rarity The previous measure assumes that having different
neighbours corresponds to independent events. However, as argued in the
introduction, this is often not the case. For example, in a bipartite graph linking
movies to actors, having :SLaurel in the neighbourhood, and having :OHardy
in the neighbourhood, should not be considered as being independent events, as
these were a famous duo of actors who often starred in movies together. In the
presence of redundant or interdependent data (which, as we previously argued,
is common in the case of knowledge graphs) the previous measure would end
up overestimating the similarity of node pairs per its own design criteria.</p>
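      <p>To make the formula concrete, the following is a small self-contained Python sketch of neighbourhood selectivity (the naming, the dictionary-based graph encoding, and the on-the-fly degree computation are our own assumptions, not the paper's implementation):</p>

```python
def sim_ns(E, V, v1, v2):
    """Neighbourhood selectivity: one minus the probability that a random
    node has every neighbour shared by v1 and v2, treating each shared
    neighbour n as an independent event with P(n) = deg(n)/|V|."""
    shared = E.get(v1, set()).intersection(E.get(v2, set()))
    if not shared:
        return 0.0
    prob = 1.0
    for n in shared:
        deg_n = sum(1 for v in V if n in E.get(v, set()))  # deg(n)
        prob *= deg_n / len(V)
    return 1.0 - prob

# Toy graph: x is a ubiquitous neighbour, y a rare one.
V = {"a", "b", "c", "d", "e", "x", "y"}
E = {"a": {"x", "y"}, "b": {"x", "y"}, "c": {"x"}, "d": {"x"}, "e": {"x"}}
# Sharing the rare neighbour y makes (a, b) more similar than (c, d):
assert sim_ns(E, V, "a", "b") > sim_ns(E, V, "c", "d")
```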
      <p>
        Rather than combining the probabilities of (supposedly) independent events,
we propose neighbourhood rarity, which measures the vertex similarity of v1 and
v2 as the ratio of other nodes not having (at least) all of the neighbours that
v1 and v2 share in common. First, for a given set of nodes N, we generalise
the definition of degree to a set of nodes: let deg(N) = |{v ∈ V | N ⊆ E(v)}|.
Defending our slight abuse of notation, note that deg(n) = deg({n}). We can
now define the neighbourhood rarity measure simply as:
σnr(v1, v2) = 1 − (deg(E(v1) ∩ E(v2)) − 2)/|V|
We subtract 2 to exclude v1 and v2 from the count. As opposed to the previous
measure, if we now have interdependent neighbours in E(v1) ∩ E(v2), then
neighbourhood rarity will take this into account, as many other nodes will likewise
have those interdependent neighbours together, reducing the above similarity.
SimRank The measures we have presented thus far are non-recursive. We now
introduce a representative of a recursive algorithm, namely SimRank [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. First
we need some additional notation. Let E⁻(v) = {n | (n, v) ∈ E} denote the
nodes with incoming edges to v. The SimRank measure σsr is then defined
recursively as follows: if v1 = v2, then σsr(v1, v2) = 1; else if |E⁻(v1)| = 0 or
|E⁻(v2)| = 0, then σsr(v1, v2) = 0; otherwise:
σsr(v1, v2) = (C / (|E⁻(v1)| · |E⁻(v2)|)) · Σi Σj σsr(E⁻i(v1), E⁻j(v2))
      </p>
      <p>where E⁻i(v) and E⁻j(v) denote the ith and jth elements of E⁻(v), respectively,
with the sums ranging over 1 ≤ i ≤ |E⁻(v1)| and 1 ≤ j ≤ |E⁻(v2)|;
and where C ∈ [0, 1] is a hyperparameter called the decay constant (originally
C = 0.8). Intuitively, the similarity of v1 and v2 (where v1 ≠ v2) depends on
the sum of the pairwise similarities of their incoming neighbours, normalised by
the score they would have if all incoming neighbours had a pairwise similarity
of 1 (i.e., |E⁻(v1)| · |E⁻(v2)|), multiplied by the decay constant. This measure can
be computed iteratively by initialising σsr(v1, v2) = 1 if v1 = v2, and σsr(v1, v2) = 0
otherwise, and then applying the above equation iteratively.</p>
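      <p>The iterative computation can be sketched naively in Python as follows (our own encoding, not the off-the-shelf implementation used later in the paper: E_in maps each node to a list of its in-neighbours, and all pairwise scores are materialised, so this is only feasible at small scale):</p>

```python
def simrank(V, E_in, C=0.8, iters=10):
    """Naive iterative SimRank: E_in maps each node to the list of its
    in-neighbours. Returns a dict of pairwise similarity scores."""
    sim = {(u, v): 1.0 if u == v else 0.0 for u in V for v in V}
    for _ in range(iters):
        nxt = {}
        for u in V:
            for v in V:
                if u == v:
                    nxt[(u, v)] = 1.0
                elif not E_in.get(u) or not E_in.get(v):
                    nxt[(u, v)] = 0.0
                else:
                    total = sum(sim[(a, b)] for a in E_in[u] for b in E_in[v])
                    nxt[(u, v)] = C * total / (len(E_in[u]) * len(E_in[v]))
        sim = nxt
    return sim

# Toy graph: c points to both a and b, so a and b receive
# similarity C * simrank(c, c) = 0.8.
V = ["a", "b", "c"]
E_in = {"a": ["c"], "b": ["c"], "c": []}
sim = simrank(V, E_in)
```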
    </sec>
    <sec id="sec-4">
      <title>4 Knowledge Graph Vertex Similarity</title>
      <p>We assume a knowledge graph to be a directed, labelled graph K = (V, L, E),
where V is a set of nodes, L is a set of edge labels, and E ⊆ V × L × V is
a set of directed, labelled edges. Thus far our measures have been defined for
directed graphs only. This raises the question: how should the proposed
measures be applied to knowledge graphs? The first alternative would be to drop
the labels from the edges (and remove duplicate edges between the same pair
of nodes with different labels) in order to return to a directed graph. For a
directed labelled edge s –p→ o, this would involve projecting a directed edge
s → o. However, as we argued in the case of :Unforgiven –:director→ :CEastwood
and :Unforgiven –:actor→ :CEastwood, edge labels carry important
information for vertex similarity measures. Taking a more extreme example, consider
:CEastwood –:citizen→ :UnitedStates, :DTrump –:president→ :UnitedStates, and
:BObama –:president→ :UnitedStates. Intuitively, the similarity of :DTrump and
:BObama should benefit more from both being presidents of the :UnitedStates.</p>
      <p>A second alternative, which keeps the edge-label information, is to encode a directed
labelled edge s –p→ o as two directed edges of the form (p, s) → o and s → (p, o).
For example, the directed labelled edge :Unforgiven –:director→ :CEastwood
would become two directed edges: :Unforgiven → (:director, :CEastwood)
and (:director, :Unforgiven) → :CEastwood. In this case, the resulting
directed graph is lossless, meaning that we can reconstruct the full directed
edge-labelled graph from the directed graph. While this might seem unusual, particularly
in the case of the local neighbourhood measures, it is quite a practical
alternative, where elements of the neighbourhood now become pairs of edge labels and
nodes, which in turn allows us to distinguish the probabilities associated with,
for example, (:president, :UnitedStates) and (:citizen, :UnitedStates).</p>
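      <p>This encoding is straightforward to realise; a minimal Python sketch (our own representation, using tuples to stand in for the combined label–node and label–subject elements):</p>

```python
def encode(triples):
    """Losslessly encode a directed labelled graph as a directed graph:
    each labelled edge (s, p, o) yields s -> (p, o) and (p, s) -> o."""
    E = {}
    for s, p, o in triples:
        E.setdefault(s, set()).add((p, o))
        E.setdefault((p, s), set()).add(o)
    return E

K = [(":Unforgiven", ":director", ":CEastwood"),
     (":Unforgiven", ":actor", ":CEastwood")]
E = encode(K)
# :Unforgiven now has two distinct neighbours, (:director, :CEastwood) and
# (:actor, :CEastwood), so the two roles are no longer conflated.
assert len(E[":Unforgiven"]) == 2
```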
    </sec>
    <sec id="sec-5">
      <title>5 Implementation</title>
      <p>We now describe our algorithms for implementing our measures, their
complexity, and the approximations that we apply in the case of the local neighbourhood
measures to enable increased scalability and efficiency.</p>
      <p>
        Neighbourhood measures We implement the local neighbourhood measures
on the distributed Apache Spark framework [27]. In comparison with, for
example, the Hadoop framework based on MapReduce [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], Spark abstracts storage
not only on the hard disk, but also in-memory, which generally allows for more
e cient computation across a cluster of machines when su cient main memory
is available. Our implementation assumes an input graph in RDF format.
Implementation. We implement the similarity measures in the natural way, with
the exception of the neighbourhood rarity measure, which we compute in a
slightly indirect but more e cient way. For reasons of space, we provide a
highlevel sketch of the details of the algorithms used:
{ Neighbourhood count : we group the graph by neighbours n, collecting
together all nodes fv1; : : : ; vkg such that n 2 E(vi) for 1 i k; then for each
such group we write out all asymmetric, irre exive pairs of nodes (va; vb) for
1 a &lt; b k sharing that neighbour. We then count occurrences of the
resulting pairs, producing the nal neighbourhood count.
{ Neighbourhood selectivity: we apply the same process as before, but instead
of writing pairs of the form (va; vb), we output triples of the form (va; vb; k);
we can then group these triples by (va; vb) and compute the neighbourhood
selectivity over the bag of values ffk1; : : : ; kmgg for each pair.
{ Neighbourhood rarity: we apply a neighbourhood count generating pairs of
the form (va; vb; nc(va; vb)); we apply a similar process to count how many
neighbours a triple of nodes have in common (va; vb; vc; n0c(va; vb; vc)), where
1 a &lt; b k and a 6= c 6= b. We can then group both sets of data
together by (va; vb), giving us its neighbour count nc(va; vb), and a list of
pairs f(vc1; n0c(va; vb; vc1)); : : : ; (vcm; n0c(va; vb; vcm))g denoting other nodes
sharing at least one neighbour with va and vb, along with how many
neighbours the three nodes share. Finally we can count how many nodes vci there
are in the set such that n0c(va; vb; vci) = nc(va; vb).
      </p>
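      <p>As a single-machine Python sketch of the group-by-neighbour strategy for neighbourhood count (the actual implementation runs on Spark; the function names and the optional degree cutoff, which mirrors the approximation strategy described below, are our own assumptions):</p>

```python
from collections import Counter, defaultdict
from itertools import combinations

def nc_pairs(E, cutoff=None):
    """Global neighbourhood count: invert the graph, emit one pair (va, vb),
    with va ordered before vb, per shared neighbour, and count occurrences.
    If cutoff is given, neighbours n whose degree exceeds it are ignored."""
    by_neighbour = defaultdict(set)            # n -> {v | n in E(v)}
    for v, ns in E.items():
        for n in ns:
            by_neighbour[n].add(v)
    counts = Counter()
    for n, vs in by_neighbour.items():
        if cutoff is not None and len(vs) > cutoff:
            continue                           # skip high-degree neighbour n
        for va, vb in combinations(sorted(vs), 2):
            counts[(va, vb)] += 1
    return counts

E = {"a": {"x", "y"}, "b": {"x", "y"}, "c": {"x"}}
assert nc_pairs(E)[("a", "b")] == 2            # a and b share x and y
assert nc_pairs(E, cutoff=2)[("a", "b")] == 1  # x (degree 3) is dropped
```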
      <p>Complexity. Letting E = |E| and V = |V| for readability, the worst-case
runtime complexities are as follows. Neighbourhood count is Θ(V³ log V). This
worst case is tight, as seen for the V-clique, where we first sort the edges of
the graph (in time O(E log E)), and then write O(V²) pairs of nodes a total of
O(V) times, giving a bag of pairs of nodes of size O(V³), which we then sort in
time O(V³ log V³). This gives us Θ(E log E + V³ log V³), which can be simplified
to Θ(V³ log V) observing that E ≤ V² and that V³ log V³ = 3V³ log V.
Neighbourhood selectivity is likewise O(V³ log V) as it follows a similar procedure.
Neighbourhood rarity is rather Θ(V⁴ log V); for the V-clique we now write O(V³)
triples of nodes a total of O(V) times, giving rise to a bag of triples of nodes of
size O(V⁴), which we sort in Θ(V⁴ log V). However, noting that we produce pairs
and/or triples of nodes only for a particular neighbour, we can tighten these
bounds if we rather define D = max{deg(n) | n ∈ V}, where often D ≪ V. This
then gives worst-case complexities of Θ(E log E + V·D² log V) for neighbourhood
count and selectivity, and Θ(E log E + V·D³ log V) for neighbourhood rarity.</p>
      <p>Δ-approximation. Still, max{deg(n) | n ∈ V} can be a prohibitively large value
to raise to the power of 3 or 4, particularly for large-scale knowledge graphs. However, this
analysis does suggest a flexible mechanism for approximation: we can define a
threshold Δ such that we ignore neighbours n with deg(n) &gt; Δ. Intuitively, this
means that we will simply ignore high-degree nodes in the similarity
computation. While this may affect the final scores produced, we note that in the case of
neighbourhood selectivity and neighbourhood rarity, high-degree neighbours have
comparatively less influence on the measure. We will later experimentally check
the effect of varying Δ on the associated similarity scores. With Δ-approximation,
the complexity becomes simply O(E log E) in all three cases, as the
quadratic/cubic parameter D = max{deg(n) | n ∈ V} is now bounded by Δ. For this reason
we believe that these measures are promising for global vertex similarity on
large-scale knowledge graphs, with Δ allowing us to trade precision for scale.
Neighbourhood rarity-s. Finally, we can address the issue of frequent ties in
neighbourhood rarity at negligible practical cost by interleaving the computation of
neighbourhood selectivity into the process, where we then define a novel hybrid
measure of similarity that we call neighbourhood rarity-s as</p>
      <p>σnrs(v1, v2) = |V| · σnr(v1, v2) + σns(v1, v2)
such that the whole part of the similarity measure is given by σnr
(neighbourhood rarity), while the decimal part is given by σns (neighbourhood selectivity);
intuitively, σnrs first ranks similar pairs by σnr, using σns to break ties.
SimRank We use an off-the-shelf implementation of SimRank. The space
complexity of SimRank is O(V²) for V = |V|, as it needs to store all pairwise
similarities (aside from the cases that are trivially 0 or 1 throughout). The worst-case
time complexity of an iteration of SimRank is O(V⁴), where several such
iterations of SimRank are required to approximate the measure. Though
optimisations of SimRank have been proposed in the literature, to the best of our
knowledge they consider single-source rather than global similarity.</p>
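      <p>The tie-breaking behaviour of the hybrid score can be sketched in a few lines (an illustrative sketch with hypothetical names and toy values; it presumes that the rarity part contributes a whole number while selectivity stays below 1):</p>

```python
def sim_nrs(v_size, sim_nr, sim_ns):
    """Hybrid neighbourhood rarity-s score: |V| * sim_nr + sim_ns. The
    rarity part supplies the whole number and selectivity (below 1) the
    fraction, so pairs rank first by rarity, with selectivity breaking ties."""
    return v_size * sim_nr + sim_ns

# Two pairs tied on rarity (0.5 over |V| = 10) are separated by selectivity:
tied_a = sim_nrs(10, 0.5, 0.9)
tied_b = sim_nrs(10, 0.5, 0.3)
assert int(tied_a) == int(tied_b) == 5   # same rarity rank
assert tied_a > tied_b                   # tie broken by selectivity
```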
    </sec>
    <sec id="sec-6">
      <title>6 Experiments</title>
      <p>In this initial work, we perform experiments that consider small-to-medium sized
sub-graphs of Wikidata, where each focuses on a different entity type. We show
statistics relating to these sub-graphs in Table 1. With respect to our
experiments, we wish to address three initial research questions: (1) How do the
runtimes and results of the local neighbourhood measures compare? (2) How does the
value of Δ affect runtimes and similarity results? (3) How do the results for the
local neighbourhood measures and SimRank compare for ground-truth similarity
datasets? Noting that we failed to compute SimRank for any of these subgraphs,
we postpone discussion of SimRank results until the end of the section, where
we will sample smaller subgraphs for comparison.</p>
      <sec id="sec-6-1">
        <title>6.1 Runtime performance</title>
        <p>We first look at the runtime performance of the different measures and for
different values of Δ. In Table 2, we show the results for the smallest and largest
subgraphs. As general observations, we see that, as predicted by the complexity
analysis, there is little difference between the runtimes of σnc and σns (being
quadratic with respect to Δ), while σnrs is generally slower to execute (being
cubic with respect to Δ). Furthermore, as Δ increases, the runtimes of σnc and
σns grow more slowly than in the case of σnrs. In the case of Movies (and
Albums), Spark failed to compute σnrs with Δ = 512, where a single neighbour n
with deg(n) = 512 produces ((512² − 512)/2) · (512 − 2) = 66,716,160 quadruples.</p>
        <p>Effect of Δ on the similarity results The aforementioned results show the performance benefit of lowering Δ, but
what effect does this have on the similarity scores? To address this question, we
compared the similarity results for varying values of Δ against the most accurate
approximation (Δ = 512 in all cases, except σnrs in the case of Albums and
Movies, which did not run, where we compare with Δ = 256). In general, we see
that as Δ decreases exponentially, the correlations remain strong until Δ = 128,
reducing more dramatically at around Δ = 32. In general, we recommend keeping
Δ as high as feasible to ensure that the results are as faithful to the original
measures as possible; where necessary, however, lowering Δ can still lead to reasonable
approximations while tending towards O(E log E) worst-case runtime complexity.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.3 Ground-truth comparisons, full-scale</title>
        <p>While the previous results compare the changes in the local neighbourhood
results for varying Δ, they do not establish how well the measures correspond to
a general notion of "similarity". Unfortunately, many of the ground truths
considered in the literature consider directed graphs rather than knowledge graphs,
as is our setting. For this reason, we develop two novel ground truths based on
similarity relations scraped from two external websites: BestSimilar for Movies
and Last.fm for Albums. Both of these websites rely on user inputs, where
BestSimilar allows users to upvote or downvote similarity pairs, while Last.fm likewise has
a vast amount of user data to leverage for similarity. For this reason, we assume
that both sites offer reliable ground truths. The question then is: can we
replicate these similarity relations, based on (costly) user input, with our unsupervised
vertex similarity measures over the corresponding subgraphs of Wikidata?</p>
        <p>On BestSimilar, each movie is associated with a ranked list of 10
similar movies, from which we extract 100 such lists. On Last.fm, each album
is associated with a ranked list of 20 similar albums, from which we again
extract 100 lists. We then linked the associated movies and albums with their
Wikidata identifiers. We measure the Kendall's τ correlation between our
three local neighbourhood measures and each of the 100 ground truths, using
the highest available values of the approximation parameter. Given that our
subgraphs may mention nodes not appearing in the ground truth, and vice
versa, we only consider the ordering of elements appearing in both lists when
measuring the correlation. In Table 4, we show the distribution of results,
where we see that nrs provides a clear improvement over the other measures on
both datasets, which may justify its cost. We attribute this improvement to
its greater robustness to redundant information.</p>
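        <p>The restriction to shared elements can be sketched as follows: a minimal Kendall's τ computed only over items appearing in both ranked lists (the simple τ-a form, without tie handling; names are illustrative, and scipy or similar would be used in practice).</p>

```python
from itertools import combinations

def kendall_tau_common(rank_a, rank_b):
    # Keep only elements present in both ranked lists, in rank_a's order.
    common = [x for x in rank_a if x in set(rank_b)]
    if len(common) < 2:
        return None  # correlation undefined on fewer than two shared items
    pos_b = {x: i for i, x in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(common, 2):
        # x precedes y in rank_a; check whether rank_b agrees.
        if pos_b[x] < pos_b[y]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)
```

        <p>Identical orderings over the shared items give τ = 1, reversed orderings give τ = −1, and items unique to one list are ignored, as in the evaluation above.</p>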
      </sec>
      <sec id="sec-6-3">
        <title>Ground-truth comparisons, small-scale</title>
        <p>Finally, in order to compare with SimRank (sr), we reduced the
scale of the Movies and Albums subgraphs until we could successfully compute
the SimRank measure. The reduced datasets contained 275,123 and 169,963
triples, respectively. We show the results in Table 5, where we see that sr
performs best of the four measures. This is perhaps to be expected, as sr is
a recursive measure that can incorporate information from the full graph, not
just the locality of the two nodes. On the other hand, nrs remains quite
competitive with sr.</p>
        <p>We have investigated measures for vertex similarity in the context
of knowledge graphs. We proposed two new measures that only consider the
local neighbourhoods of the pairs of nodes being compared. One of these
measures, neighbourhood rarity-s, shows promising results: it is outperformed
by the non-local SimRank in a comparison with two ground-truth datasets, but
only by a narrow margin. On the other hand, the neighbourhood rarity-s
measure, in combination with the degree-based approximation, can be evaluated
at larger scale and more efficiently than the SimRank alternative.
Furthermore, it can avoid "double counting" of redundant information by
considering a common neighbourhood as a single event, rather than assuming
that it consists of multiple independent events (i.e., individual
neighbours). The locality of these measures also has further advantages that
we have not explored. For example, unlike SimRank, given a pair of nodes v1
and v2, we could in principle write a SPARQL 1.1 query to compute the (local)
similarity of the node pair.</p>
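        <p>The "single event" intuition can be illustrated with a small sketch: each shared (edge label, neighbour) feature is counted once and weighted by its rarity, in the style of an information-content weight. This is a hypothetical simplification for illustration only; the exact definition of nrs given earlier in the paper differs in detail.</p>

```python
import math
from collections import Counter

def rarity(feature, freq, total):
    # Rare features carry more information: -log of relative frequency.
    return -math.log(freq[feature] / total)

def rarity_similarity(feats_a, feats_b, freq, total):
    # Each shared feature contributes exactly once (a single "event"), so
    # redundant repetitions of the same information are not double counted.
    num = sum(rarity(f, freq, total) for f in feats_a & feats_b)
    den = sum(rarity(f, freq, total) for f in feats_a | feats_b)
    return num / den if den else 0.0

# Toy example: features are (edge label, neighbour) pairs in a music graph.
freq = Counter({("genre", "rock"): 500, ("artist", "Pink Floyd"): 3})
a = {("genre", "rock"), ("artist", "Pink Floyd")}
b = {("genre", "rock")}
common_only = rarity_similarity(a, b, freq, total=1000)  # frequent feature shared
identical = rarity_similarity(a, a, freq, total=1000)    # identical neighbourhoods
```

        <p>Sharing only a frequent feature such as the genre yields a low score, while sharing the rare artist feature would raise the score sharply, since rarity dominates the weighting.</p>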
        <p>More generally, we believe that in knowledge graphs, with the
addition of edge labels, the local neighbourhoods of nodes tend to be rich in
information: much richer than in directed graphs that only consider one
relation type. For this reason, we believe that simpler local measures of
vertex similarity can compete with more complex non-local measures in
knowledge graphs such as Wikidata. As part of future work, we plan to run
scale-up experiments to understand what values of the approximation parameter
make it possible to process all of Wikidata with each algorithm, to develop a
user interface, and to perform user studies.</p>
        <p>Online material. Please see http://aidanhogan.com/wikidata-sim.</p>
        <p>Acknowledgements. This work was funded by Fondecyt Grant
No. 1181896 and ANID Millennium Science Initiative Program ICN17_002.</p>
        <p>19. Ngomo, A.N., Auer, S.: LIMES – A Time-Efficient Approach for
Large-Scale Link Discovery on the Web of Data. In: International Joint
Conference on Artificial Intelligence (IJCAI). pp. 2312–2317 (2011)
20. Noy, N.F., Gao, Y., Jain, A., Narayanan, A., Patterson, A., Taylor, J.:
Industry-scale Knowledge Graphs: Lessons and Challenges. ACM Queue 17(2), 20
(2019)
21. Petrova, A., Sherkhonov, E., Grau, B.C., Horrocks, I.: Entity Comparison
in RDF Graphs. In: International Semantic Web Conference (ISWC). pp. 526–541.
Springer (2017)
22. Pirro, G., Euzenat, J.: A Feature and Information Theoretic Framework for
Semantic Similarity and Relatedness. In: International Semantic Web
Conference (ISWC). pp. 615–630 (2010)
23. Ribeiro, L.F.R., Saverese, P.H.P., Figueiredo, D.R.: struc2vec: Learning
node representations from structural identity. In: Special Interest Group on
Knowledge Discovery and Data Mining (SIGKDD). pp. 385–394 (2017)
24. Sherif, M.A., Ngomo, A.N.: A systematic survey of point set distance
measures for link discovery. Semantic Web 9(5), 589–604 (2018)
25. Tsitsulin, A., Mottin, D., Karras, P., Müller, E.: VERSE: Versatile Graph
Embeddings from Similarity Measures. In: World Wide Web Conference (WWW).
pp. 539–548 (2018)
26. Yahya, M., Berberich, K., Ramanath, M., Weikum, G.: Exploratory querying
of extended knowledge graphs. Proc. VLDB Endow. 9(13), 1521–1524 (2016)
27. Zaharia, M., Xin, R.S., Wendell, P., Das, T., Armbrust, M., Dave, A.,
Meng, X., Rosen, J., Venkataraman, S., Franklin, M.J., Ghodsi, A., Gonzalez,
J., Shenker, S., Stoica, I.: Apache Spark: a unified engine for big data
processing. Commun. ACM 59(11), 56–65 (2016)
28. Zheng, W., Zou, L., Chen, L., Zhao, D.: Efficient SimRank-Based
Similarity Join. ACM Trans. Database Syst. 42(3), 16:1–16:37 (2017)</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Alqadah</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bhatnagar</surname>
            ,
            <given-names>R.:</given-names>
          </string-name>
          <article-title>A game theoretic framework for heterogenous information network clustering</article-title>
          . In: Special Interest Group on
          <article-title>Knowledge Discovery and Data Mining (SIGKDD)</article-title>
          . pp.
          <volume>795</volume>
          {
          <issue>804</issue>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Apweiler</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bairoch</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>C.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barker</surname>
            ,
            <given-names>W.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boeckmann</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gasteiger</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopez</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Magrane</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martin</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Natale</surname>
            ,
            <given-names>D.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>O'Donovan</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Redaschi</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yeh</surname>
            ,
            <given-names>L.L.</given-names>
          </string-name>
          :
          <article-title>UniProt: the Universal Protein knowledgebase</article-title>
          .
          <source>Nucleic Acids Research</source>
          <volume>32</volume>
          ,
          <issue>D115</issue>
          {
          <fpage>D119</fpage>
          (01
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Battle</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kolas</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Enabling the geospatial Semantic Web with Parliament and GeoSPARQL</article-title>
          .
          <source>Semantic Web</source>
          <volume>3</volume>
          (
          <issue>4</issue>
          ),
          <volume>355</volume>
          {
          <fpage>370</fpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Belleau</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nolin</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tourigny</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rigault</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morissette</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>Bio2RDF: towards a mashup to build bioinformatics knowledge systems</article-title>
          .
          <source>Journal of Biomedical Informatics</source>
          <volume>41</volume>
          (
          <issue>5</issue>
          ),
          <volume>706</volume>
          {
          <fpage>716</fpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Blondel</surname>
            ,
            <given-names>V.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gajardo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heymans</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Senellart</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dooren</surname>
            ,
            <given-names>P.V.</given-names>
          </string-name>
          :
          <article-title>A Measure of Similarity between Graph Vertices: Applications to Synonym Extraction and Web Searching</article-title>
          .
          <source>SIAM Review</source>
          <volume>46</volume>
          (
          <issue>4</issue>
          ),
          <volume>647</volume>
          {
          <fpage>666</fpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Bonatti</surname>
            ,
            <given-names>P.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Decker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polleres</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Presutti</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Knowledge Graphs: New Directions for Knowledge Representation on the Semantic Web</article-title>
          .
          <source>Dagstuhl Reports</source>
          <volume>8</volume>
          (
          <issue>9</issue>
          ),
          <volume>29</volume>
          {
          <fpage>111</fpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Chiang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liou</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peng</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Exploring heterogeneous information networks and random walk with restart for academic search</article-title>
          .
          <source>Knowl. Inf. Syst</source>
          .
          <volume>36</volume>
          (
          <issue>1</issue>
          ),
          <volume>59</volume>
          {
          <fpage>82</fpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghemawat</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>MapReduce: simplified data processing on large clusters</article-title>
          .
          <source>Commun. ACM</source>
          <volume>51</volume>
          (
          <issue>1</issue>
          ),
          <volume>107</volume>
          {
          <fpage>113</fpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Du</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.:</given-names>
          </string-name>
          <article-title>Probabilistic SimRank computation over uncertain graphs</article-title>
          .
          <source>Inf. Sci</source>
          .
          <volume>295</volume>
          ,
          <issue>521</issue>
          {
          <fpage>535</fpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Euzenat</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shvaiko</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          : Ontology Matching. Springer (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Ferrada</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bustos</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hogan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>IMGpedia: a linked dataset with content-based analysis of Wikimedia images</article-title>
          .
          <source>In: International Semantic Web Conference (ISWC)</source>
          . pp.
          <volume>84</volume>
          {
          <fpage>93</fpage>
          . Springer (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Hogan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mellotte</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Powell</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stampouli</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Towards Fuzzy QueryRelaxation for RDF</article-title>
          .
          <source>In: Extended Semantic Web Conference (ESWC)</source>
          . pp.
          <volume>687</volume>
          {
          <issue>702</issue>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Hogan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zimmermann</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Umbrich</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polleres</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Decker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Scalable and distributed methods for entity matching, consolidation and disambiguation over linked data corpora</article-title>
          .
          <source>J. Web Semant</source>
          .
          <volume>10</volume>
          ,
          <issue>76</issue>
          {
          <fpage>110</fpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Hulpus</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prangnawarat</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hayes</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Path-Based Semantic Relatedness on Linked Data and Its Use to Word and Entity Disambiguation</article-title>
          .
          <source>In: International Semantic Web Conference (ISWC)</source>
          . pp.
          <volume>442</volume>
          {
          <issue>457</issue>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Jeh</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Widom</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>SimRank: a measure of structural-context similarity</article-title>
          .
          <source>In: SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD)</source>
          . pp.
          <volume>538</volume>
          {
          <issue>543</issue>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Kiefer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bernstein</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stocker</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>The fundamentals of iSPARQL: a virtual triple approach for similarity-based Semantic Web tasks</article-title>
          .
          <source>In: International Semantic Web Conference (ISWC)</source>
          , pp.
          <volume>295</volume>
          {
          <fpage>309</fpage>
          . Springer (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Leicht</surname>
            ,
            <given-names>E.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Holme</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Newman</surname>
            ,
            <given-names>M.E.</given-names>
          </string-name>
          :
          <article-title>Vertex similarity in networks</article-title>
          .
          <source>Physical Review E</source>
          <volume>73</volume>
          (
          <issue>2</issue>
          ),
          <volume>026120</volume>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Lizorkin</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Velikhov</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grinev</surname>
            ,
            <given-names>M.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Turdakov</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Accuracy estimate and optimization techniques for SimRank computation</article-title>
          .
          <source>VLDB J</source>
          .
          <volume>19</volume>
          (
          <issue>1</issue>
          ),
          <volume>45</volume>
          {
          <fpage>66</fpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>