=Paper=
{{Paper
|id=Vol-2982/paper-2
|storemode=property
|title=User-friendly Comparison of Similarity Algorithms on Wikidata
|pdfUrl=https://ceur-ws.org/Vol-2982/paper-2.pdf
|volume=Vol-2982
|authors=Filip Ilievski,Pedro Szekely,Gleb Satyukov,Amandeep Singh
|dblpUrl=https://dblp.org/rec/conf/semweb/IlievskiSSS21
}}
==User-friendly Comparison of Similarity Algorithms on Wikidata==
Filip Ilievski, Pedro Szekely, Gleb Satyukov, and Amandeep Singh
Information Sciences Institute, University of Southern California
{ilievski,pszekely,gleb,amandeep}@isi.edu

Abstract. While the similarity between two concept words has been evaluated and studied for decades, much less attention has been devoted to algorithms that can compute the similarity of nodes in very large knowledge graphs, like Wikidata. To facilitate investigations and head-to-head comparisons of similarity algorithms on Wikidata, we present a user-friendly interface that allows flexible computation of similarity between Qnodes in Wikidata. At present, the similarity interface supports four algorithms, based on: graph embeddings (TransE, ComplEx), text embeddings (RoBERTa), and class-based similarity. We demonstrate the behavior of the algorithms on representative examples about semantically similar, related, and entirely unrelated entity pairs. To support anticipated applications that require efficient similarity computations, like entity linking and recommendation, we also provide a REST API that can compute the most similar neighbors for any Qnode in Wikidata.

Keywords: Wikidata · Knowledge Graphs · Similarity · Embeddings

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Semantically similar entities are close in ontology space and share essential defining properties, e.g., a car and a bus are both used for transportation and have four wheels [3]. A broader relation of semantic relatedness holds for entities that are dissimilar, but cognitively and topically close, and tend to appear in similar contexts, e.g., car and driver. The similarity between two concept words has been evaluated and studied for decades [3, 10]. Much less attention has been devoted to algorithms that can compute the similarity of nodes in very large knowledge graphs, like Wikidata [13]. Effective and efficient metrics of Wikidata similarity are essential for a range of downstream applications, such as entity linking [4, 5] and recommendation [6, 1].

To facilitate investigations and head-to-head comparisons of similarity algorithms on Wikidata, we present a user-friendly graphical user interface (GUI) that allows flexible computation of similarity between Qnodes in Wikidata. The similarity interface is publicly available at https://kgtk.isi.edu/similarity. At present, the similarity GUI supports four algorithms, based on: graph embeddings (TransE [2], ComplEx [12]), text embeddings (RoBERTa [9]), and class-based similarity. Through the similarity interface, users can investigate the ability of different (families of) algorithms to capture similarity of concepts and entities in Wikidata. To support applications that require efficient similarity computations, like entity linking and recommendation, we also provide a REST API that can compute the most similar neighbors for any Qnode in Wikidata. The endpoint of this API is https://kgtk.isi.edu/nearest_neighbors.

We demonstrate the behavior of the algorithms on representative examples about semantically similar, related, or entirely unrelated entity pairs. We show that the class-based metric consistently captures semantic similarity, and assigns lower scores to terms that are merely related or unrelated. RoBERTa-based similarity behaves differently, providing high scores to both semantically similar and related pairs. The graph embedding-based metrics fall somewhere in between class-based similarity and RoBERTa. The code for the similarity GUI and our similarity API is freely available on GitHub: https://github.com/usc-isi-i2/kgtk-similarity.

2 Similarity interfaces

In this section, we describe the similarity interfaces that we have developed, together with their currently supported algorithms.

GUI Our GUI allows users to search for a primary Qnode based on its labels or aliases. The user can then add any number of secondary Qnodes in the same way, based on free-text search against the node labels and aliases. We use ElasticSearch to build a text index and enable this search. The interface then displays the similarity between the primary node and each secondary Qnode, according to each of the supported algorithms. Currently, we support four algorithms:

1. Class similarity computes the set of common is-a parents for two nodes. Here, the is-a relations are computed as a transitive closure over both the subclass-of (P279) and the instance-of (P31) relations. Each shared parent is weighted by its inverse document frequency (IDF), computed based on the number of instances that transitively belong to that parent class (a schematic sketch follows this list).
2. TransE similarity computes the cosine similarity between the pre-computed TransE embeddings of two Wikidata nodes.
3. ComplEx similarity computes the cosine similarity between the pre-computed ComplEx embeddings of two Wikidata nodes.
4. Text similarity computes the cosine similarity between the pre-computed RoBERTa-large embeddings of two Wikidata nodes. We pre-compute these RoBERTa embeddings over a lexicalized version of each Wikidata Qnode, based on its outgoing edges in the graph. We use outgoing edges that belong to the following seven relation types: P31 (instance of), P279 (subclass of), P106 (occupation), P39 (position held), P1382 (partially coincident with), P373 (Commons category), and P452 (industry).
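To make the class-based metric concrete, the following is a minimal Python sketch of an IDF-weighted shared-parent similarity over plain dictionaries. The transitive closure and IDF weighting follow the description in item 1 above; the exact IDF formula and the aggregation into a single score (here, a Jaccard-style normalization of IDF mass) are our assumptions, and the KGTK implementation may differ.

<pre>
import math

def ancestors(qnode, parents):
    """Transitive closure over is-a edges (P31 and P279).
    `parents` maps each node to an iterable of its direct parents."""
    seen, stack = set(), [qnode]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def idf(cls, instance_count, total_instances):
    # Classes with fewer transitive instances are more informative.
    return math.log(total_instances / (1 + instance_count.get(cls, 0)))

def class_similarity(q1, q2, parents, instance_count, total_instances):
    a1, a2 = ancestors(q1, parents), ancestors(q2, parents)
    shared = a1 & a2
    if not shared:
        return 0.0
    mass = lambda classes: sum(idf(c, instance_count, total_instances) for c in classes)
    # Assumption: normalize the IDF mass of the shared parents
    # by the IDF mass of the union of both ancestor sets.
    return mass(shared) / mass(a1 | a2)
</pre>

Under this aggregation, two nodes whose ancestor sets coincide score 1.0, and nodes that share only very populous (low-IDF) classes score close to 0.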
The similarity score between two nodes is computed on the fly, which enables estimation for any pair of Wikidata nodes. We use the graph-embeddings operation of the Knowledge Graph Toolkit (KGTK) [7] to compute the TransE and ComplEx embeddings, and the KGTK text-embeddings command to compute the text (RoBERTa-based) embeddings. A snapshot of the similarity interface is shown in Figure 1.

Nearest Neighbors API Our REST API returns the K nearest neighbors for a Qnode based on the ComplEx algorithm. We index the ComplEx embeddings in a FAISS [8] index, which facilitates efficient retrieval.
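As an illustration of this retrieval step, below is a minimal FAISS sketch with random vectors standing in for the ComplEx embeddings. The paper does not state which FAISS index type the deployed service uses; IndexFlatL2, an exact L2 index, is the simplest choice, and is consistent with the API returning distances where lower means more similar (see Section 3).

<pre>
import numpy as np
import faiss  # pip install faiss-cpu

d = 100  # embedding dimensionality (illustrative)
vectors = np.random.rand(10000, d).astype("float32")  # stand-ins for ComplEx embeddings

index = faiss.IndexFlatL2(d)  # exact L2 search; a production index may be approximate
index.add(vectors)

query = vectors[:1]                      # embedding of the query Qnode
distances, ids = index.search(query, 5)  # 5 nearest neighbors
print(ids[0], distances[0])              # row ids map back to Qnodes
</pre>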
3 Analysis

GUI examples In this section, we show the similarity scores provided by the supported algorithms between the Wikidata Qnode for motorcycle (Q34493) and ten other Qnodes. Specifically, we include three semantically similar nodes: bus (Q5638), dirt bike (Q3050907), and yacht (Q170173); four related, but dissimilar nodes: engine (Q44167), helmet (Q173603), road (Q34442), and cyclist (Q2125610); and three unrelated nodes: cheese (Q10943), Norway (Q20), and shelf (Q2637814). Following the terminology introduced in the previous section, motorcycle is the primary Qnode, and the ten additional Qnodes are secondary.

Fig. 1. Similarity between motorcycle (Q34493) and the ten other terms, ordered by their class similarity score.

Figure 1 shows the obtained similarity scores, in descending order of their class-based scores. We observe that the class-based metric consistently prioritizes semantically similar nodes over the others, as its three top-scored nodes are semantically similar to motorcycle. The remaining nodes receive notably lower scores, with the exception of the motorcycle-engine pair, whose similarity is fairly high (0.42). We observe that the class metric makes little distinction between nodes that are related and nodes that are unrelated to motorcycle. These findings show that the class metric mostly captures semantic similarity, and does not capture semantic relatedness. This is intuitive, given that it is purely based on the Wikidata taxonomy, which naturally favors semantically similar terms.

Next, we order the same set of results based on their text-based score. The result is shown in Figure 2. Here, we observe that the terms that are unrelated to motorcycle (shelf, cheese, and Norway) are consistently assigned low scores. At the same time, the terms that are semantically similar (e.g., dirt bike) and merely related (e.g., cyclist) receive comparable scores. We conclude that the RoBERTa-based text similarity metric is able to discern related from unrelated nodes, but it is unable to distinguish between similar and related terms. This is expected, considering that the RoBERTa model is trained to capture natural language co-occurrence, thus favoring both similar and related terms over unrelated ones.

Fig. 2. Similarity between motorcycle (Q34493) and the ten other terms, ordered by their text similarity score.

Figure 3 provides a third ordering of the results, based on their TransE score. The scoring in this case correlates to a lesser extent with our a priori three-way categorization of the Qnodes, though on average semantic similarity is favored over relatedness, which in turn is favored over unrelatedness. This can be explained by the tendency of graph embeddings to capture structural similarity of nodes, i.e., to assign higher similarity to nodes that connect to similar other nodes (e.g., both engine and bus relate to car). For this reason, engine, bus, and helmet are assigned higher similarity than terms such as Norway and road.

Fig. 3. Similarity between motorcycle (Q34493) and the ten other terms, ordered by their TransE score.

Nearest neighbors API examples The nearest neighbors API can be leveraged to obtain the top-K most similar Wikidata Qnodes for a given Qnode. For instance, to obtain the top-5 most similar nodes to motorcycle (Q34493), we query: https://kgtk.isi.edu/nearest_neighbors?qnode=Q34493&k=5. The result is a list of the 5 most similar nodes, with their corresponding distance from the motorcycle Qnode and their human-readable label in English:

<pre>
[
  { qnode: "Q13586807", score: 13.393990516662598, label: "Manet Korado" },
  { qnode: "Q376498",   score: 13.482695579528809, label: "diesel motorcycle" },
  { qnode: "Q28126796", score: 15.886520385742188, label: "Harley-Davidson FLSTFB Fat Boy" },
  { qnode: "Q20076361", score: 16.452970504760742, label: "Honda SH50" },
  { qnode: "Q18695780", score: 16.553009033203125, label: "Bultaco TSS Mk2" }
]
</pre>

Curiously, the list of the most similar nodes is dominated by specific motorcycle models and categories. The three most similar nodes are direct subclasses of the motorcycle class (connected via the P279 relation). The remaining two Qnodes are specific models of motorcycles, represented as instances (P31) of the motorcycle class in Wikidata. This confirms our earlier observation: graph embeddings like ComplEx assign higher similarity to node pairs that connect to similar structures in the Wikidata graph.
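Programmatic access follows the same pattern as the query above. The following is a minimal Python sketch that assumes only what the example shows: the qnode and k query parameters, and a JSON list of objects with qnode, score, and label fields. Note that score is a distance, so lower values indicate higher similarity.

<pre>
import requests

# Query the nearest neighbors endpoint, as in the example above.
resp = requests.get(
    "https://kgtk.isi.edu/nearest_neighbors",
    params={"qnode": "Q34493", "k": 5},
)
resp.raise_for_status()
for neighbor in resp.json():
    # score is a distance: lower means more similar
    print(neighbor["qnode"], neighbor["score"], neighbor["label"])
</pre>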
4 Similarity in Downstream Tasks

Meaningful estimation of similarity is at the core of a long list of applications in natural language processing, information retrieval, and network analysis. Here, we discuss the role of similarity estimation in two prominent applications: 1) entity linking in tables, and 2) recommendation and deduplication. We also discuss how our interfaces could support these applications.

Entity linking in tables Understanding the reference of entities in tables relies on two different notions of similarity. On the one hand, entities in the same column are typically of the same type, or play the same role in a given context. For example, a table about Russian politicians will include a column with politicians and a column with their positions. Thus, understanding entities within a column relies on similarity indicators that can capture semantic similarity, such as our class-based metric. On the other hand, entities mentioned in the same row rely on metrics that capture aspects of relatedness, such as our text-based metric, which relies on linguistic similarity, or our graph embedding metrics, which capture structural similarity. Following our previous example, this would require a metric that can assign a high score to the pairs Vladimir Putin - Russia and Vladimir Putin - president.

Recommendation and deduplication A special use case of Qnode recommendation is assistance of Wikidata editors. Namely, when an editor introduces a new Qnode, it is useful to have metrics that can detect very similar existing entities and ask the editor to confirm that the new entity is different from the most similar existing ones [1]. This procedure would help avoid introducing duplicates in Wikidata, which is a key challenge today, considering that millions of redirects have been introduced in Wikidata since its inception [11]. At the same time, similarity methods could be run over the current set of entities in Wikidata to detect potentially existing duplicates, which can be validated by an editor before merging. The class-based metric could be used to detect potential duplicates, and it could be complemented with additional metrics (e.g., text-based similarity) when taxonomic information is not present for a node.
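As a concrete illustration of this workflow, the sketch below flags nearest neighbors whose distance falls below a threshold as duplicate candidates for editor review, reusing the nearest neighbors API from Section 3. The possible_duplicates helper and the threshold value are hypothetical; a real deployment would tune the threshold against labeled duplicate pairs.

<pre>
import requests

def possible_duplicates(qnode, k=10, max_distance=14.0):
    """Return nearest neighbors of `qnode` whose embedding distance is
    below `max_distance`, as duplicate candidates for editor review.
    Hypothetical helper; the threshold is illustrative, not tuned."""
    resp = requests.get(
        "https://kgtk.isi.edu/nearest_neighbors",
        params={"qnode": qnode, "k": k},
    )
    resp.raise_for_status()
    return [n for n in resp.json() if n["score"] <= max_distance]

# Example: candidates to review before creating a new motorcycle-like entity.
for candidate in possible_duplicates("Q34493"):
    print(candidate["qnode"], candidate["label"], candidate["score"])
</pre>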
5 Conclusions

This demo paper presented a user-friendly interface for the computation of pairwise similarity between Qnodes in Wikidata. To facilitate head-to-head comparisons of similarity, the interface renders the scores for multiple node pairs by four different algorithms: a class-based metric, two graph embedding metrics, and a language-model-based (text) metric. We experimented with their scores on semantically similar, related, or entirely unrelated entity pairs, observing that the class-based metric favored semantically similar pairs, while the text-based metric favored both semantically similar and related pairs, at the expense of the unrelated ones. Graph embeddings scored pairs orthogonally to our similarity categorization, assigning higher scores to pairs that are structurally similar in Wikidata. To support applications where similarity plays a key role, such as entity linking, recommendation, and deduplication, we also provided a public API that returns the top-K neighbors for a given Qnode.

References

1. AlGhamdi, K., Shi, M., Simperl, E.: Learning to recommend items to Wikidata editors. arXiv preprint arXiv:2107.06423 (2021)
2. Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., Yakhnenko, O.: Translating embeddings for modeling multi-relational data. Advances in Neural Information Processing Systems 26 (2013)
3. Budanitsky, A., Hirst, G.: Evaluating WordNet-based measures of lexical semantic relatedness. Computational Linguistics 32(1), 13–47 (2006)
4. Cetoli, A., Bragaglia, S., O'Harney, A.D., Sloan, M., Akbari, M.: A neural approach to entity linking on Wikidata. In: European Conference on Information Retrieval. pp. 78–86. Springer (2019)
5. Delpeuch, A.: OpenTapioca: Lightweight entity linking for Wikidata. arXiv preprint arXiv:1904.09131 (2019)
6. Gleim, L.C., Schimassek, R., Hüser, D., Peters, M., Krämer, C., Cochez, M., Decker, S.: SchemaTree: Maximum-likelihood property recommendation for Wikidata. In: European Semantic Web Conference. pp. 179–195. Springer (2020)
7. Ilievski, F., Garijo, D., Chalupsky, H., Divvala, N.T., Yao, Y., Rogers, C., Li, R., Liu, J., Singh, A., Schwabe, D., et al.: KGTK: A toolkit for large knowledge graph manipulation and analysis. In: International Semantic Web Conference. pp. 278–293. Springer (2020)
8. Johnson, J., Douze, M., Jégou, H.: Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734 (2017)
9. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
10. Pantel, P., Philpot, A., Hovy, E.: Matching and integration across heterogeneous data sources. In: Proceedings of the 2006 International Conference on Digital Government Research. pp. 438–439 (2006)
11. Shenoy, K., Ilievski, F., Garijo, D., Schwabe, D., Szekely, P.: A study of the quality of Wikidata. arXiv preprint arXiv:2107.00156 (2021)
12. Trouillon, T., Welbl, J., Riedel, S., Gaussier, É., Bouchard, G.: Complex embeddings for simple link prediction. In: International Conference on Machine Learning. pp. 2071–2080. PMLR (2016)
13. Vrandečić, D., Krötzsch, M.: Wikidata: a free collaborative knowledgebase. Communications of the ACM 57(10), 78–85 (2014)