<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>User-friendly Comparison of Similarity Algorithms on Wikidata</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Filip Ilievski</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pedro Szekely</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gleb Satyukov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amandeep Singh</string-name>
          <email>amandeepg@isi.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Information Sciences Institute, University of Southern California</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>While the similarity between two concept words has been evaluated and studied for decades, much less attention has been devoted to algorithms that can compute the similarity of nodes in very large knowledge graphs, like Wikidata. To facilitate investigations and headto-head comparisons of similarity algorithms on Wikidata, we present a user-friendly interface that allows exible computation of similarity between Qnodes in Wikidata. At present, the similarity interface supports four algorithms, based on: graph embeddings (TransE, ComplEx), text embeddings (RoBERTa), and class-based similarity. We demonstrate the behavior of the algorithms on representative examples about semantically similar, related, and entirely unrelated entity pairs. To support anticipated applications that require e cient similarity computations, like entity linking and recommendation, we also provide a REST API that can compute most similar neighbors for any Qnode in Wikidata.</p>
      </abstract>
      <kwd-group>
        <kwd>Wikidata</kwd>
        <kwd>Knowledge Graphs</kwd>
        <kwd>Similarity</kwd>
        <kwd>Embeddings</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Semantically similar entities are close in ontology space and they share essential
de ning properties, e.g., a car and a bus are both used for transportation and
have four wheels [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. A broader relation of semantic relatedness holds for entities
that are dissimilar, but they are cognitively and topically close, and tend to
appear in similar contexts, e.g., car and driver. Estimating similarity between
two concept words have been evaluated and studied for decades [
        <xref ref-type="bibr" rid="ref10 ref3">3, 10</xref>
        ]. Much less
attention has been devoted to algorithms that can compute similarity of nodes
in very large knowledge graphs, like Wikidata [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. E ective and e cient metrics
of Wikidata similarity are essential for a range of downstream applications, such
as entity linking [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ] and recommendation [
        <xref ref-type="bibr" rid="ref1 ref6">6, 1</xref>
        ].
      </p>
      <p>
        To facilitate investigations and head-to-head comparisons of similarity
algorithms on Wikidata, we present a user-friendly graphical user interface (GUI)
that allows exible computation of similarity between Qnodes in Wikidata. The
similarity interface is publicly available at https://kgtk.isi.edu/similarity. At
Copyright © 2021 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).
present, the similarity GUI supports four algorithms, based on: graph
embeddings(TransE [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], ComplEx [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]), text embeddings (RoBERTa [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]), and
classbased similarity. Through the similarity interface, users can investigate the
ability of di erent (families of) algorithms to capture similarity of concepts and
entities in Wikidata. To support applications that require e cient similarity
computations, like entity linking and recommendation, we also provide a REST
API that can compute the most similar neighbors for any Qnode in Wikidata.
The endpoint of this API is https://kgtk.isi.edu/nearest neighbors.
      </p>
      <p>We demonstrate the behavior of the algorithms on representative examples
about semantically similar, related, or entirely unrelated entity pairs. We show
that the class-based metric consistently captures semantic similarity, and assigns
lower scores to terms that are merely related or unrelated. RoBERTa-based
similarity behaves di erently, providing high scores to both semantically similar
and related pairs. The graph embedding-based metrics are somewhere in between
class-based similarity and RoBERTa.</p>
      <p>The code for the similarity GUI and our similarity API is freely available on
GitHub: https://github.com/usc-isi-i2/kgtk-similarity.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Similarity interfaces</title>
      <p>In this section, we describe the similarity interfaces that we have developed,
together with their currently supported algorithms.</p>
      <p>GUI Our GUI allows users to search for a primary Qnode based on its labels
or aliases. The user could then add any number of secondary Qnodes in the
same way, based on free text search against the node labels and aliases. We use
ElasticSearch to build a text index and enable this search. The interface then
displays the similarity between the primary node and each secondary Qnode,
according to each of the supported algorithms.</p>
      <p>Currently, we support four algorithms:
1. Class similarity computes the set of common is-a parents for two nodes.</p>
      <p>Here, the is-a relations are computed as a transitive closure over both the
subclass-of (P279) and the instance-of (P31) relations. Each shared parents
is weighted by its inverse document frequency (IDF), computed based on
the number of instances that transitively belong to that parent class.
2. TransE similarity computes the cosine similarity between the pre-computed</p>
      <p>TransE embeddings of two Wikidata nodes.
3. ComplEx similarity computes the cosine similarity between the pre-computed</p>
      <p>ComplEx embeddings of two Wikidata nodes.
4. Text similarity computes the cosine similarity between the pre-computed
RoBERTa-large embeddings of two Wikidata nodes. We pre-compute these
RoBERTa embeddings over a lexicalized version of each Wikidata Qnode,
based on its outgoing edges in the graph. We use outgoing edges that belong
to the following seven relation types: P31 (instance of), P279 (subclass of),
P106 (occupation), P39 (position held), P1382 (partially coincident with),
P373 (Commons Category), and P452 (industry).</p>
      <p>
        The similarity score between two nodes is computed on the y, which allows
us to facilitate estimation for any pair of Wikidata nodes. We use the operation
graph-embeddings of the Knowledge Graph ToolKit (KGTK) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] to compute
TransE and ComplEx embeddings. We use the KGTK text-embeddings
command to compute the text (RoBERTa-based) embeddings. A snapshot of the
similarity interface is shown in Figure 1.
      </p>
      <p>
        Nearest Neighbors API Our REST API returns K nearest neighbors for a
Qnode based on the ComplEx algorithm. We index the ComplEx embeddings
in a FAISS [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] index, which facilitates e cient retrieval.
3
      </p>
    </sec>
    <sec id="sec-3">
      <title>Analysis</title>
      <p>GUI examples In this section, we show the similarity scores provided by the
supported algorithms between the Wikidata Qnode for motorcycle (Q34493)
and ten other Qnodes. Speci cally, we include three semantically similar nodes:
bus (Q5638), Dirt Bike (Q3050907), and yacht (Q170173); four related, but
dissimilar nodes: engine (Q44167), helmet (Q173603), road (Q34442), and cyclist
(Q2125610); and three unrelated nodes: cheese (Q10943), Norway (Q20), and
shelf (Q2637814). Following our terminology introduced in the previous section,
motorcycle is the primary Qnode, and the ten additional Qnodes are secondary.</p>
      <p>Next, we order the same set of results based on their text-based score. The
result is shown in Figure 2. Here, we observe that the terms that are unrelated to
motorcycle (shelf, cheese, and Norway) are consistently assigned low scores. At
the same time, we observe that the terms that are semantically similar (e.g., dirt
bike) and merely related (e.g., cyclist) receive comparable scores. We conclude
that the RoBERTa-based text similarity metric is able to discern related from
unrelated nodes, but it is unable to distinguish between similar and related
terms. This can be expected, considering that the RoBERTa model is trained to
capture natural language co-occurrence, thus favoring both semantic and related
terms over unrelated ones.
Nearest neighbors API examples The nearest neighbors API can be
leveraged to obtain the top-K most similar Wikidata Qnodes for a given Qnode. For
instance, in order to obtain the top-5 most similar nodes to motorcycle (Q34493),
we query: https://kgtk.isi.edu/nearest neighbors?qnode=Q34493&amp;k=5. The
result is a list of the 5 most similar nodes, with their corresponding distance from
the motorcycle Qnode and their human-readable label in English:
[
{
qnode: "Q13586807",
score: 13.393990516662598,
label: "Manet Korado"
},
{
},
{
},
{
},
{
}
qnode: "Q376498",
score: 13.482695579528809,
label: "diesel motorcycle"
qnode: "Q28126796",
score: 15.886520385742188,
label: "Harley-Davidson FLSTFB Fat Boy"
qnode: "Q20076361",
score: 16.452970504760742,
label: "Honda SH50"
qnode: "Q18695780",
score: 16.553009033203125,
label: "Bultaco TSS Mk2"
]</p>
      <p>Curiously, the list of the most similar nodes is dominated by speci c
motorcycle models and categories. The most similar three nodes are direct subclasses
of the motorcycle class (connected by using the P279 relation). The remaining
two Qnodes are speci c models of motorcycles, represented as instances (P31)
of the motorcycle class in Wikidata. This con rms our earlier observation: the
graph embeddings like ComplEx assign a higher similarity to node pairs that
connect to similar structures in the Wikidata graph.
4</p>
    </sec>
    <sec id="sec-4">
      <title>Similarity in Downstream Tasks</title>
      <p>Meaningful estimation of similarity is at the core of a long list of applications in
natural language processing, information retrieval, and network analysis. Here,
we discuss the role of estimating similarity for two prominent applications: 1)
entity linking in tables, and 2) recommendation and deduplication. We also
discuss how our interfaces could support these applications.</p>
      <p>Entity linking in tables Understanding the reference of entities in tables relies
on two di erent notions of similarity. On the one hand, entities in the same
column typically are of the same type, or play the same role in a given context. For
example, a table with Russian politicians will include a column with politicians,
and a column with their positions. Thus, understanding entities within a column
relies on similarity indicators that can capture semantic similarity, such as our
class-based metric. On the other hand, entities mentioned in the same row rely
on metrics that capture aspects of relatedness, such as our text-based metric,
which relies on linguistic similarity, or our graph embedding metrics, which
capture structural similarity. Following our previous example, this would require a
metric that can assign a high score to the pairs: Vladimir Putin - Russia, and
Vladimir Putin - president.</p>
      <p>
        Recommendation and deduplication A special use case of Qnode
recommendation is assistance of Wikidata editors. Namely, when an editor introduces
a new Qnode, it is useful to have metrics which can detect very similar existing
entities and ask the editor to con rm that the new entity is di erent from the
most similar existing ones [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This procedure would help to avoid introducing
duplicates in Wikidata, which is a key challenge today, considering that millions
of redirects have been introduced in Wikidata since its inception [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. At the
same time, similarity methods could be run over the current set of entities in
Wikidata to detect potentially existing duplicates, which can be validated by
an editor before their merging. The class-based metric could be used to detect
potential duplicates, and it could be complemented with additional metrics (e.g.,
text-based similarity) when the taxonomic information is not present for a node.
5
      </p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>This demo paper presented a user-friendly interface for computation of pairwise
similarity between Qnodes in Wikidata. To facilitate head-to-head comparisons
of similarity, the interface rendered the scores for multiple node pairs by four
di erent algorithms: a class-based metric, two graph embedding metrics, and
a language model based (text) metric. We experimented with their scores on
semantically similar, related, or entirely unrelated entity pairs, observing that
the class-based metric favored semantically similar pairs, while the text-based
metric favored both semantically similar and related pairs, at the expense of the
unrelated ones. Graph embeddings scored pairs orthogonally to our similarity
categorization, by assigning higher scores to pairs that are structurally similar
in Wikidata. To support applications where similarity plays a key role, such as
entity linking, recommendation, and deduplication, we also provided a public
API that returns the top-K neighbors for a given Qnode.</p>
      <p>Ilievski et al.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>AlGhamdi</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Simperl</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          : Learning to recommend items to wikidata editors.
          <source>arXiv preprint arXiv:2107.06423</source>
          (
          <year>2021</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bordes</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Usunier</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garcia-Duran</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weston</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yakhnenko</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Translating embeddings for modeling multi-relational data</article-title>
          .
          <source>Advances in neural information processing systems</source>
          <volume>26</volume>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Budanitsky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hirst</surname>
          </string-name>
          , G.:
          <article-title>Evaluating wordnet-based measures of lexical semantic relatedness</article-title>
          .
          <source>Computational linguistics 32(1)</source>
          ,
          <volume>13</volume>
          {
          <fpage>47</fpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Cetoli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bragaglia</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>O</given-names>
            <surname>'Harney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.D.</given-names>
            ,
            <surname>Sloan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Akbari</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.:</surname>
          </string-name>
          <article-title>A neural approach to entity linking on wikidata</article-title>
          .
          <source>In: European conference on information retrieval</source>
          . pp.
          <volume>78</volume>
          {
          <fpage>86</fpage>
          . Springer (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Delpeuch</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Opentapioca: Lightweight entity linking for wikidata</article-title>
          . arXiv preprint arXiv:
          <year>1904</year>
          .
          <volume>09131</volume>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Gleim</surname>
            ,
            <given-names>L.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schimassek</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , Huser,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Peters</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          , Kramer,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Cochez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Decker</surname>
          </string-name>
          ,
          <string-name>
            <surname>S.</surname>
          </string-name>
          : Schematree:
          <article-title>Maximum-likelihood property recommendation for wikidata</article-title>
          .
          <source>In: European Semantic Web Conference</source>
          . pp.
          <volume>179</volume>
          {
          <fpage>195</fpage>
          . Springer (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Ilievski</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garijo</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chalupsky</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Divvala</surname>
            ,
            <given-names>N.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rogers</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , Liu,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Schwabe</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.</surname>
          </string-name>
          , et al.:
          <article-title>Kgtk: a toolkit for large knowledge graph manipulation and analysis</article-title>
          .
          <source>In: International Semantic Web Conference</source>
          . pp.
          <volume>278</volume>
          {
          <fpage>293</fpage>
          . Springer (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Johnson</surname>
          </string-name>
          , J.,
          <string-name>
            <surname>Douze</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jegou</surname>
          </string-name>
          , H.:
          <article-title>Billion-scale similarity search with gpus</article-title>
          .
          <source>arXiv preprint arXiv:1702.08734</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ott</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goyal</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Du</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joshi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levy</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stoyanov</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Roberta: A robustly optimized bert pretraining approach</article-title>
          . arXiv preprint arXiv:
          <year>1907</year>
          .
          <volume>11692</volume>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Pantel</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Philpot</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hovy</surname>
          </string-name>
          , E.:
          <article-title>Matching and integration across heterogeneous data sources</article-title>
          .
          <source>In: Proceedings of the 2006 international conference on Digital government research</source>
          . pp.
          <volume>438</volume>
          {
          <issue>439</issue>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Shenoy</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ilievski</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garijo</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schwabe</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szekely</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>A study of the quality of wikidata</article-title>
          .
          <source>arXiv preprint arXiv:2107.00156</source>
          (
          <year>2021</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Trouillon</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Welbl</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riedel</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gaussier</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bouchard</surname>
          </string-name>
          , G.:
          <article-title>Complex embeddings for simple link prediction</article-title>
          .
          <source>In: International conference on machine learning</source>
          . pp.
          <year>2071</year>
          {
          <year>2080</year>
          . PMLR (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Vrandecic</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , Krotzsch, M.:
          <article-title>Wikidata: a free collaborative knowledgebase</article-title>
          .
          <source>Communications of the ACM</source>
          <volume>57</volume>
          (
          <issue>10</issue>
          ),
          <volume>78</volume>
          {
          <fpage>85</fpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>