ReliK: A Reliability Measure for Knowledge Graph Embeddings [Extended Abstract]

Maximilian K. Egger1, Wenyue Ma1, Davide Mottin1,*, Panagiotis Karras1,2, Ilaria Bordino3, Francesco Gullo3 and Aris Anagnostopoulos4

1 Aarhus University, Department of Computer Science, Aabogade 34, 8200 Aarhus, Denmark
2 Copenhagen University, Department of Computer Science, Universitetsparken 1, 2100 København, Denmark
3 UniCredit, Via Molfetta 101, 00171 Rome, Italy
4 Dept. of Computer, Control, and Management Eng., Sapienza University of Rome, Via Ariosto 25, 00185 Rome, Italy

Abstract
Can we assess a priori how well a knowledge graph embedding will perform on a specific downstream task and in a specific part of the knowledge graph? Knowledge graph embeddings (KGEs) represent entities and relationships of a knowledge graph (KG) as vectors. KGEs are generated by optimizing an embedding score, which assesses whether a triple exists in the graph. KGEs have proven effective in a variety of downstream tasks, including, for instance, predicting relationships among entities. However, the problem of anticipating the performance of a given KGE in a certain downstream task, locally to a specific individual triple, has not been tackled so far. In this paper, we fill this gap with ReliK, a Reliability measure for KGEs. ReliK relies solely on KGE embedding scores, is task- and KGE-agnostic, and requires no KGE retraining. Through extensive experiments, we attest that ReliK correlates well with both common downstream tasks, such as tail or relation prediction and triple classification, and advanced downstream tasks, such as rule mining and question answering, while preserving locality.

Keywords
Knowledge Graph Embeddings, Reliability, Knowledge Graphs

1. Introduction
Knowledge graphs (KGs) are sets of facts (i.e., triples such as “da Vinci,” “painted,” “Mona Lisa”) that interconnect entities (“da Vinci,” “Mona Lisa”) via relationships (“painted”) [1, 2].
Entities and relationships correspond to nodes and (labeled) edges of the KG, respectively. Knowledge graph embeddings (KGEs) [3] are popular techniques to generate a vector representation for the entities and relationships of a KG. A KGE is computed by optimizing a scoring function that provides an embedding score as an indication of whether a triple actually exists in the KG. KGEs have been extensively used as a crucial building block of state-of-the-art methods for a variety of downstream tasks commonly carried out on the Web, such as knowledge completion [4], whereby a classifier is trained on the embeddings to predict the existence of a triple, or head/tail prediction [5], which aims to predict entities of a triple, as well as more advanced ones, including rule mining [6], query answering [7], and entity alignment [8, 9, 10, 11].

SEBD 2024: 32nd Symposium on Advanced Database Systems, June 23–26, 2024, Villasimius, Sardinia, Italy
maximilian.egger@cs.au.dk (M. K. Egger); wenyuema@cs.au.dk (W. Ma); davide@cs.au.dk (D. Mottin); piekarras@gmail.com (P. Karras); ibordino@acm.org (I. Bordino); gullof@acm.org (F. Gullo); aris@diag.uniroma1.it (A. Anagnostopoulos)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

Motivation. So far, the choice of an appropriate KGE method has depended on the downstream task, the characteristics of the input KG, and the computational resources. The existence of many different scoring functions, including linear embeddings [12], bilinear ones [6], and those based on complex numbers [13] or projections [14], further complicates this choice. Alas, the literature lacks a unified measure to quantify beforehand how reliable the performance of a KGE method can be for a certain task, without performing such a potentially slow task.
Furthermore, KGE performance on a specific downstream task is typically assessed in a global way, that is, in terms of how accurate a KGE method is for that task on the entire KG. However, the performance of KGEs for several practical applications (e.g., knowledge completion [4]) typically varies across the parts of the KG. This calls for assessing the performance of a KGE locally to specific parts of the KG, rather than globally.

Contributions. We address all the above shortcomings of the state of the art in KGE and introduce ReliK (Reliability for KGEs), a simple yet principled measure that quantifies how reliably a KGE will perform on a certain downstream task in a specific part of the KG, without executing that task or (re)training that KGE. To the best of our knowledge, no measure like ReliK exists in the literature. ReliK relies exclusively on embedding scores as a black box, particularly on the ranking determined by those scores (rather than the scores themselves). ReliK is simple, intuitive, and easy to implement. Despite that, its exact computation requires processing all the possible combinations of entities and relationships, for every single fact of interest. We address this major challenge by devising an approximation to ReliK. Apart from experimenting with ReliK in basic downstream tasks, such as entity/relation prediction, we also showcase ReliK on two advanced downstream tasks: query answering, which finds answers to complex logical queries over KGs, and rule mining, which deduces logic rules, with the purpose of cleaning the KG from spurious facts or expanding the information therein.

2. Preliminaries
A knowledge graph (KG) 𝒦 = ⟨ℰ, ℛ, ℱ⟩ is a triple consisting of a set ℰ of 𝑛 entities, a set ℛ of relationships, and a set ℱ ⊆ ℰ × ℛ × ℰ of 𝑚 facts. A fact is a triple 𝑥ℎ𝑟𝑡 = (ℎ, 𝑟, 𝑡), where ℎ ∈ ℰ is the head, 𝑡 ∈ ℰ is the tail, and 𝑟 ∈ ℛ is the relationship.
For instance, entities “Leonardo da Vinci” and “Mona Lisa,” and relationship “painted” form the triple (“Leonardo da Vinci,” “painted,” “Mona Lisa”). The set ℱ of facts forms an edge-labeled graph whose nodes and labeled edges correspond to entities and relationships, respectively. We say a triple 𝑥ℎ𝑟𝑡 is positive if it actually exists in the KG (i.e., 𝑥ℎ𝑟𝑡 ∈ ℱ), and negative otherwise (i.e., 𝑥ℎ𝑟𝑡 ∉ ℱ).
Knowledge graph embedding. A KG embedding (KGE) [3, 5, 15] is a representation of entities and relationships in a 𝑑-dimensional (𝑑 ≪ |ℰ|) space, typically the real space R𝑑 or the complex space C𝑑. For instance, TransE [12] represents a triple 𝑥ℎ𝑟𝑡 via entity vectors eℎ, e𝑡 ∈ R𝑑 and relation vector e𝑟 ∈ R𝑑, while DistMult [6] represents the relationship as a matrix W𝑟 ∈ R𝑑×𝑑. Although KGEs can differ (significantly) from one another in their definition, a common key aspect of all KGEs is that they are typically defined based on a so-called embedding scoring function, or simply embedding score. This is a function 𝑠 : ℰ × ℛ × ℰ → R, which quantifies how likely it is that a triple 𝑥ℎ𝑟𝑡 ∈ ℰ × ℛ × ℰ exists in 𝒦, based on the embeddings of its head (ℎ), relationship (𝑟), and tail (𝑡). Specifically, the higher 𝑠(𝑥ℎ𝑟𝑡), the more likely the existence of 𝑥ℎ𝑟𝑡. For instance, TransE’s score 𝑠(𝑥ℎ𝑟𝑡) = −‖eℎ + e𝑟 − e𝑡‖ is the negative (ℓ1 or ℓ2) distance between the “translation” of ℎ’s embedding by 𝑟’s embedding and 𝑡’s embedding [12]. KGEs are typically learned via optimization (e.g., gradient descent) of a loss function defined based on the embedding score. This training process can be computationally expensive, especially if it has to be repeated for multiple KGEs. KGEs learned this way are shown to be effective for a number of downstream tasks [5], such as predicting the existence of a triple, but they do not offer any prior indication of their performance [16]. Moreover, existing benchmarks [15] report global performance on the entire graph rather than local performance on subgraphs.
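As an illustration of an embedding score, the TransE scoring function above can be sketched in a few lines of plain Python. This is a toy sketch with hand-picked lists standing in for learned embedding vectors; the function name and example values are ours, not from the paper's code.

```python
import math

def transe_score(e_h, e_r, e_t, norm=2):
    """TransE embedding score s(x_hrt) = -||e_h + e_r - e_t||.

    Higher (less negative) scores indicate more plausible triples;
    `norm` selects the l1 or l2 distance."""
    diffs = [h + r - t for h, r, t in zip(e_h, e_r, e_t)]
    if norm == 1:
        return -sum(abs(d) for d in diffs)
    return -math.sqrt(sum(d * d for d in diffs))

# An exact translation e_h + e_r = e_t attains the maximal score of 0,
# so it outranks any triple whose translation misses the tail:
# transe_score([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])   -> 0.0
# transe_score([1.0, 0.0], [0.0, 1.0], [0.0, 0.0])   -> -sqrt(2)
```

Note that only the ordering induced by such scores is comparable across triples; the absolute values depend on the embedding method, which is exactly why ReliK (introduced next) works with ranks rather than raw scores.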
To this end, in this work, we provide an answer to the following key question: Is there a measure that provides a prior indication of the performance of an embedding on a specific subgraph?

3. KGE Reliability
We would like a measure of reliability that properly assesses how the embedding of a triple would perform on certain tasks and data, without knowing them in advance. To this end, we aim to fulfill the following desiderata for a KGE reliability measure.
(R1) Embedding-agnostic. Is independent of the specific KGE method. This makes the measure fully general.
(R2) Learning-free. Requires no further KGE training. This is primarily motivated by efficiency, but also by other reasons, such as privacy or unavailability of the data used for KGE training.
(R3) Task-agnostic. Is independent of the specific downstream task, anticipating the performance of a KGE in general, for any downstream task.
(R4) Locality. Is a good predictor of KGE performance locally to a given triple, that is, in a close surrounding neighborhood of that triple. This is important, as a KGE model may be more or less effective in different parts of the KG it is applied to.

3.1. The Proposed ReliK Measure
Design principles. To fulfill (R1) and (R2), the KGE reliability measure should not engage with the internals of the computation of KGEs. Thus, we need to treat the embeddings as vectors and the embedding score as a black-box function that provides only an indication of the actual existence of a triple. Although the absolute embedding scores are incomparable to one another, we observe that the score distributions of positive and negative triples differ significantly. Specifically, we assume the relative ranking of a positive triple to be higher than that of a negative one; otherwise, we multiply the score by −1. This leads to the following main observation.
Observation 1.
A KGE reliability measure that uses the position of a triple relative to other triples, via a ranking defined based on the embedding score, fulfills (R1) and (R2).

Furthermore, comparing a triple to all other (positive or negative) triples might be ineffective, because the absolute score does not provide an indication of performance. We thus advocate that a local approach that considers triples relative to a neighborhood is more appropriate, and propose a measure that fulfills (R4). The soundness of (R4) is further attested in our experiments in Section 4. To meet (R3), the KGE reliability measure should not exploit any peculiarity of a downstream task in its definition. Indeed, this is accomplished by our measure, as we show next.
Definition. For a triple 𝑥ℎ𝑟𝑡 = (ℎ, 𝑟, 𝑡) we compute the neighbor set 𝒩⁻(ℎ) of all possible negative triples with head ℎ, that is, triples with head ℎ that do not exist in 𝒦. Similarly, we compute 𝒩⁻(𝑡) for tail 𝑡. We define the head-rank of a triple 𝑥ℎ𝑟𝑡 as the position of the triple in the ranking obtained using the score 𝑠 of a specific KGE, relative to all the negative triples having head ℎ:

rank_H(𝑥ℎ𝑟𝑡) = |{𝑥 ∈ 𝒩⁻(ℎ) : 𝑠(𝑥) > 𝑠(𝑥ℎ𝑟𝑡)}| + 1.

The tail-rank rank_T(𝑥ℎ𝑟𝑡) for tail 𝑡 is defined similarly. Our reliability measure, ReliK, for a triple 𝑥ℎ𝑟𝑡 is ultimately defined as the average of the reciprocals of the head- and tail-rank:

ReliK(𝑥ℎ𝑟𝑡) = (1/2) (1/rank_H(𝑥ℎ𝑟𝑡) + 1/rank_T(𝑥ℎ𝑟𝑡)).  (1)

We define the ReliK score of a set 𝑆 ⊆ ℱ of triples as the average ReliK of the triples in the set.
Rationale. ReliK ranges in (0, 1], with higher values corresponding to better reliability. In fact, the lower the head-rank rank_H(𝑥ℎ𝑟𝑡) or tail-rank rank_T(𝑥ℎ𝑟𝑡), the better the ranking of 𝑥ℎ𝑟𝑡 induced by the underlying embedding scores, relative to the nonexisting triples in 𝑥ℎ𝑟𝑡’s neighborhood, complies with the actual existence of 𝑥ℎ𝑟𝑡 in the KG.
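Putting the definition together, exact ReliK (Eq. (1)) can be sketched in plain Python, treating the embedding score as a black box. This is a minimal illustrative sketch, not the authors' implementation; the (h, r, t) triple format follows the paper, while the helper names are ours.

```python
def relik(triple, entities, relations, facts, score):
    """Exact ReliK (Eq. (1)): average reciprocal rank of `triple`
    among the negative triples sharing its head (resp. tail).

    `facts` is a set of (h, r, t) triples; `score` is the black-box
    embedding score, higher meaning "more likely to exist"."""
    h, r, t = triple
    s_x = score(triple)

    def rank(negatives):
        # 1-based position: count negatives that the score ranks above x_hrt
        return sum(1 for x in negatives if score(x) > s_x) + 1

    # N^-(h): nonexisting triples with head h; N^-(t): with tail t
    neg_h = [(h, rr, tt) for rr in relations for tt in entities
             if (h, rr, tt) not in facts]
    neg_t = [(hh, rr, t) for rr in relations for hh in entities
             if (hh, rr, t) not in facts]
    return 0.5 * (1.0 / rank(neg_h) + 1.0 / rank(neg_t))
```

On a toy KG where the score ranks the existing triple above every negative one, both ranks are 1 and ReliK attains its maximum of 1; conversely, an embedding that ranks every negative above the positive pushes ReliK towards 0.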
It is easy to see that ReliK achieves (R1) and (R2) by relying on the relative ranking rather than the absolute scores. It also fulfills (R3), as it involves no downstream task at all, and (R4), as it is based on the local (i.e., 1-hop) neighborhood of a target triple.

3.2. Efficiently Computing ReliK
Computing ReliK (Eq. (1)) requires Ω(|ℰ| · |ℛ|) time, as it needs to scan the entire negative neighborhood of the target triple. For large KGs, repeating this for a (relatively) high number of triples may be computationally too heavy. For this purpose, here we focus on approximate versions of ReliK, which properly trade off accuracy and efficiency. The main intuition behind the ReliK approximation is that the precise ranking of the various potential triples is not actually needed. Rather, what matters is just the number of those triples that exhibit a higher embedding score than the target triple. As such, we estimate ReliK by evaluating the fraction of triples in a sample that have a higher score than the triple under consideration, and then scaling this fraction to the total number of negative triples.
Let 𝑆_H be a random subset of 𝑘 elements selected without replacement, independently and uniformly at random, from the negative neighborhood 𝒩⁻(ℎ) of the head ℎ of a triple 𝑥ℎ𝑟𝑡. The size |𝑆_H| trades off efficiency against accuracy of the estimator, and it may be defined based on the size of 𝒩⁻(ℎ). Define also

rank_{S_H}(𝑥ℎ𝑟𝑡) = |{𝑥 ∈ 𝑆_H : 𝑠(𝑥) > 𝑠(𝑥ℎ𝑟𝑡)}| + 1

to be the rank of the score 𝑠(𝑥ℎ𝑟𝑡) that the KGE assigns to 𝑥ℎ𝑟𝑡, among all the triples in the sample. We similarly compute 𝑆_T and rank_{S_T} for the tail’s neighborhood 𝒩⁻(𝑡). We define our estimator as

ReliKApx(𝑥ℎ𝑟𝑡) = (1/2) [ (|𝒩⁻(ℎ)|/|𝑆_H| · rank_{S_H}(𝑥ℎ𝑟𝑡))⁻¹ + (|𝒩⁻(𝑡)|/|𝑆_T| · rank_{S_T}(𝑥ℎ𝑟𝑡))⁻¹ ].  (2)

In words, we simply scale up the rank induced by the sample to the entire set of negative triples.
Theorem 1.
Equation (2) is an upper bound for ReliK in expectation; that is, E[ReliKApx(𝑥ℎ𝑟𝑡)] ≥ ReliK(𝑥ℎ𝑟𝑡).

Theorem 1 follows immediately from Jensen’s inequality [17] and the fact that E[rank_{S_H}(𝑥ℎ𝑟𝑡)] = |𝑆_H| · rank_H(𝑥ℎ𝑟𝑡) / |𝒩⁻(ℎ)|.
Algorithm. To compute ReliKApx, we first sample, uniformly at random, 𝑘 negative triples from the head neighborhood and the tail neighborhood. We can save computation time by first filtering the triples in 𝑆_H ∪ 𝑆_T by score, that is, by considering only those with a score higher than that of the input triple 𝑥ℎ𝑟𝑡, and then checking whether a triple in 𝑆_H ∪ 𝑆_T has the head or the tail in common with 𝑥ℎ𝑟𝑡, to update the corresponding rank.
Time complexity. ReliKApx runs in 𝒪(𝑘) time to sample the negative triples for each positive triple. The sample size 𝑘 trades off accuracy against efficiency of the estimation.

4. Experimental evaluation
We evaluate ReliK on four downstream tasks, six embeddings, and four datasets.
Embeddings. We include six established KGE methods, TransE [12], DistMult [6], RotatE [13], PairRE [14], ComplEx [18], and ConvE [19], representative of the major embedding families (Sec. 5).
Datasets. We perform experiments on KGs with different characteristics: Countries [20] is a small graph of geographical locations; FB15k237 [21] is a sample of the Freebase KG [22] with 14k nodes, 310k facts, and 237 relationships; Codex [23] is a collection of three datasets of incremental size, Codex-S (2k entities, 36k triples), Codex-M (17k entities, 200k facts), and Codex-L (78k entities, 610k facts), extracted from Wikidata and Wikipedia; YAGO2 [24] is a KG automatically extracted from Wikipedia, which comprises 834k entities and 948k facts.
Experimental setup. We implement our approximate and exact ReliK in Python v3.9.13 (code at: https://github.com/AU-DIS/ReliK). We train the embeddings with PyKEEN v1.10.1 [15], with default parameters besides the embedding dimension 𝑑𝑖𝑚 = 50 and the sLCWA training loop.
We report the average of 5 experiments using 5-fold cross validation with an 80-10-10 split.

4.1. Common Downstream Tasks
We test ReliK on its ability to anticipate the results of common tasks for KGEs [5, 3]. We measure the Pearson correlation, and its statistical significance, between ReliK and two ranking tasks, tail and relation prediction, as well as the triple classification task. To evaluate ReliK on different areas of the graph and different graph topologies, we sample random subgraphs of Codex-S with 60 nodes, by initially selecting a starting node uniformly at random and then including nodes and edges via a random walk with restart [25] with restart probability 1 − 𝛼 = 0.2, until the subgraph comprises 60 nodes. For Codex-M and Codex-L we use size 100, and for FB15k237 we use 200 nodes. We report the average ReliK on 100 random subgraphs.
Ranking tasks (T1). In the first experiments, we measure the Pearson correlation between ReliK and the performance on ranking tasks in terms of mean reciprocal rank (MRR) [26]. The first task, tail prediction [12, 14, 13], assesses the ability of the embedding to predict the tail given the head and the relation, thus answering the query (ℎ, 𝑟, ?) where the tail is unknown. The second task, relation prediction, assesses the ability of the embedding to predict the undisclosed relation of a triple (ℎ, ?, 𝑡). The common measure used for tail and relation prediction is MRR, which indicates how close to the top the score ranks the correct tail (or relation). Consistently with previous approaches [12, 14, 13], we employ the filtered approach, in which we consider for evaluation only negative triples that do not appear in the train, test, or validation set. Table 1 reports the correlations alongside the statistical significance in terms of the p-value. We mark in red high p-values (> 0.05), which suggest no correlation, as well as Pearson values that indicate inverse correlation.
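For concreteness, the filtered-MRR evaluation described above can be sketched as follows. This is an illustrative sketch with hypothetical helper names; the filtering logic mirrors the description in the text (known facts are excluded from the competing candidates).

```python
def filtered_rank(correct, candidates, known_facts, score):
    """Rank of the correct completion among candidate triples, in the
    "filtered" setting: candidates that are themselves known facts are
    skipped, so only genuine negatives can outrank the correct answer."""
    s_c = score(correct)
    return 1 + sum(1 for x in candidates
                   if x != correct and x not in known_facts
                   and score(x) > s_c)

def mrr(ranks):
    """Mean reciprocal rank: average of 1/rank over the test queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)
```

A tail-prediction query (h, r, ?) would enumerate all triples (h, r, t') as candidates and feed the resulting filtered ranks, over all test triples, into `mrr`.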
Generally, ReliK exhibits significant correlation across embeddings and tasks. Notably, even though ReliK (see Eq. (1)) does not explicitly target tail or head rankings by including both, we observe significant correlation on tail prediction for most embeddings and datasets. Because of the considerable training time, we only report results for RotatE on Codex-S. Comparing the actual results of the various tasks, it is also clear that, in most cases in which we do not observe correlation, the results are too close to distinguish; for example, ComplEx has only results close to 0. In such cases, the results indicate that the particular embedding method needs additional training.

Table 1: Pearson correlation and statistical significance of ReliK for tail prediction, relation prediction, and triple classification. Results for Codex-M and Codex-S are similar to those of Codex-L.

                        Tail (MRR)           Relation (MRR)        Classific. (Acc.)
           KGE          Pearson  p-value     Pearson  p-value      Pearson    p-value
Codex-L    TransE        0.83    1.13e-26     0.97    3.812e-64     0.63      2.54e-12
           DistMult      0.49    2.10e-07     0.78    4.68e-22      0.60      3.74e-11
           RotatE        –       –            –       –             –         –
           PairRE       −0.04    0.68         0.95    3.33e-52     −4.47e-4   0.99
           ComplEx       0.82    1.03e-25     0.91    3.96e-39      0.06      0.57
           ConvE         0.59    4.26e-11    −0.07    0.48          0.31      1.57e-3
FB15k237   TransE        0.24    0.02         0.86    2.83e-30      0.34      5.79e-4
           DistMult     −0.05    0.65         0.64    5.57e-13      0.39      5.58e-05
           RotatE        –       –            –       –             –         –
           PairRE        0.80    1.51e-23     0.65    1.74e-13      0.08      0.44
           ComplEx       0.20    0.05         0.88    3.53e-34      0.14      0.18
           ConvE         0.09    0.37         0.85    4.47e-30      0.01      0.93

Classification task (T2). In this experiment, we test the correlation between ReliK and the accuracy of a threshold-based classifier on the embeddings. The classifier predicts the presence of a triple in the KG if the embedding score is larger than a threshold, a common scenario for link prediction [5]. Table 1 (right column) reports the correlations and their significance for all datasets.
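The threshold-based classifier used in (T2) can be sketched as follows. This is an illustrative setup, not the paper's exact code; in practice the threshold would be tuned on validation data.

```python
def make_classifier(score, threshold):
    """Predict that a triple exists iff its embedding score
    exceeds the threshold."""
    return lambda triple: score(triple) > threshold

def accuracy(classifier, labelled):
    """Fraction of (triple, is_positive) pairs classified correctly."""
    return sum(1 for x, y in labelled if classifier(x) == y) / len(labelled)
```

The experiment then correlates this accuracy, computed per subgraph, with the subgraph's average ReliK.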
At close inspection, we observe that in cases of unclear correlation, such as with PairRE, the respective classification results are too close to observe a difference. Those cases notwithstanding, ReliK is significantly correlated with accuracy. This result confirms that ReliK can serve as a proxy for the quality of complex models trained on embeddings.
Tuning subgraph size. Next, we analyze how ReliK correlates with the tasks presented in Section 4.1 on subgraphs of varying size, using the TransE embedding. Figure 1 reports the correlation values for all three tasks (tail, relation, and classifier), only including those values where the p-value is below 0.05.

Figure 1: Correlation vs. subgraph size on Codex-S.

We observe that ReliK’s correlation generally increases with subgraphs of up to 100 nodes on Codex-S. After that point, we note an unstable behavior in all tasks. This is consistent with the assumption that ReliK is a measure capturing local reliability. To strike a balance between quality and time, we test on subgraphs with 60 nodes for Codex-S in all experiments. Yet, as tasks are of different nature, the subgraph size can be tuned in accordance with the task to provide more accurate results.

4.2. Complex Downstream Tasks
Query answering (T3). We show how ReliK can improve query-answering tasks. Complex logical queries on KGs come in different query structures. We focus on queries that chain multiple predictions or intersect several predictions, among the query structures described in recent work [27, 28]. We keep the naming convention introduced by Ren and Leskovec [27].
We evaluate a selection of 1000 queries per type (1p, 2p, 3p, 2i, 3i) from their data on the FB15k237 graph. The queries of type p are 1 to 3 hops from a given entity with fixed relation labels that point to a solution, whereas queries of type i are the intersection of 2 or 3 predictions pointing towards the same entity. We evaluate ReliK on the ability to detect whether an instance of an answer is true or false. We compute ReliK on TransE embeddings trained on the entire FB15k237. Figure 2 (left) shows the average ReliK scores for positive and negative answers. ReliK clearly discriminates between positive and negative instances, often by a large margin.

Figure 2: Comparison between positive and negative instances for query answering on FB15k237 (left) and rule mining on Yago2 with ReliK (middle) and RR (right).

Rule mining (T4). ReliK effectively improves on the rule mining task as well. Rule mining methods [29, 30, 31] automatically retrieve logic rules over KGs having a predefined minimum confidence. A logic rule is an assertion such as 𝐴 ⇒ 𝐵, which states that 𝐵 follows from 𝐴. For instance, a rule could imply that all presidents of a country are citizens of that country. An instance of a rule is a triple matching 𝐵, given that 𝐴 is true. Logic rules are typically harvested with slow exhaustive algorithms similar to the Apriori algorithm for association rules [32]. We show that ReliK can discriminate between true and false instances.
Detecting true instances. To showcase performance on the downstream task (T4), we compare ReliK with the reciprocal rank (RR) of a combination of the tail and the relation embeddings on the ability to detect whether an instance of a rule is true or false. This task is particularly important to quantify the real confidence of a rule [33].
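The instance-vetting idea can be sketched as a simple threshold rule on reliability scores. This is a hypothetical helper of our own: the paper reports average ReliK per class rather than a fixed classifier, but separating instances by a score threshold illustrates how the discrimination would be used in practice.

```python
def vet_instances(instances, reliability, threshold):
    """Split rule instances into (kept, flagged) by reliability score:
    instances scoring below the threshold are flagged as likely false."""
    kept = [x for x in instances if reliability(x) >= threshold]
    flagged = [x for x in instances if reliability(x) < threshold]
    return kept, flagged
```

A rule's "real" confidence could then be re-estimated over the kept instances only, down-weighting spurious facts.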
To this end, we use a dataset [34] comprising 23 324 manually annotated instances over 26 rules extracted from YAGO2 using the AMIE [29] and RudiK [31] methods. We compute ReliK on TransE embeddings trained on the entire YAGO2. Figure 2 (middle and right) shows the average ReliK scores for positive and negative instances. ReliK discriminates between positive and negative instances, often by a large margin, whereas RR often confounds positive and negative instances.

5. Related Work
Knowledge graph embeddings are commonly used to detect missing triples, correct errors, or answer questions [3, 5]. Translational embeddings in the TransE [12] family and the recent PairRE [14] assume that the relationship performs a translation from the head to the tail. Semantic embeddings, such as DistMult [6] or HolE [35], interpret the relationship as a multiplicative operator. Complex embeddings, such as RotatE [13] and ComplEx [18], use complex-valued vectors and operations in the complex plane. Neural-network embeddings, such as ConvE [19], perform sequences of nonlinear operations.
Embedding calibration. An orthogonal direction to ours is embedding calibration [36, 37]. Calibration methods provide effective ways to improve existing embeddings on various tasks, by altering the embedding vectors in subspaces with low accuracy [36], by reweighing the output probabilities in the respective tasks [37], or by matrix factorization [38].
Evaluation of embeddings. ReliK bears an interesting connection with ranking-based quality measures, in particular with the mean reciprocal rank (MRR) and HITS@k for head, tail, and relation prediction [3, 36, 39, 12, 14, 13]. Even though MRR and HITS@k provide a global indication of performance, ReliK is suitable for local analysis. Moreover, current global measures have recently been shown to be biased towards high-degree nodes [40].

6. Conclusion
Aiming to develop a measure that predicts the performance of a knowledge graph embedding on a specific subgraph, we introduced ReliK, a KGE reliability measure agnostic to the choice of the embedding, the dataset, and the task. To allow for efficient computation, we proposed a sampling-based approximation, which we show to achieve results similar to the exact ReliK in a fraction of the time. Our experiments confirm that ReliK anticipates the performance on a number of common and complex downstream tasks for KGEs. These results suggest that ReliK may be used in other domains, as well as serve as a debugging tool for KGEs.

Acknowledgments
M. Egger is supported by the Danish Council for Independent Research grant DFF-1051-00062B. W. Ma is supported by the China Scholarship Council grant 202110320012.

References
[1] A. Hogan, E. Blomqvist, M. Cochez, C. d’Amato, G. de Melo, C. Gutierrez, S. Kirrane, J. E. L. Gayo, R. Navigli, S. Neumaier, A. N. Ngomo, A. Polleres, S. M. Rashid, A. Rula, L. Schmelzeisen, J. F. Sequeda, S. Staab, A. Zimmermann, Knowledge graphs, ACM CSUR 54 (2022) 71:1–71:37.
[2] G. Weikum, Knowledge graphs 2021: A data odyssey, PVLDB 14 (2021) 3233–3238.
[3] Q. Wang, Z. Mao, B. Wang, L. Guo, Knowledge graph embedding: A survey of approaches and applications, TKDE 29 (2017) 2724–2743.
[4] X. Wang, L. Chen, T. Ban, M. Usman, Y. Guan, S. Liu, T. Wu, H. Chen, Knowledge graph quality control: A survey, Fundamental Research 1 (2021) 607–626.
[5] S. Ji, S. Pan, E. Cambria, P. Marttinen, S. Y. Philip, A survey on knowledge graphs: Representation, acquisition, and applications, Trans. Neural Netw. Learn. Syst. 33 (2021) 494–514.
[6] B. Yang, S. W.-t. Yih, X. He, J. Gao, L. Deng, Embedding entities and relations for learning and inference in knowledge bases, in: ICLR, 2015.
[7] Y. Wu, Y. Xu, X. Lin, W. Zhang, A holistic approach for answering logical queries on knowledge graphs, in: ICDE, 2023, pp. 2345–2357.
[8] S. S. Bhowmick, E. C. Dragut, W.
Meng, Globally aware contextual embeddings for named entity recognition in social media streams, in: ICDE, 2023, pp. 1544–1557.
[9] J. Huang, Z. Sun, Q. Chen, X. Xu, W. Ren, W. Hu, Deep active alignment of knowledge graph entities and schemata, PACMMOD 1 (2023) 159:1–159:26.
[10] A. Zeakis, G. Papadakis, D. Skoutas, M. Koubarakis, Pre-trained embeddings for entity resolution: An experimental analysis, PVLDB 16 (2023) 2225–2238.
[11] Z. Zhong, M. Zhang, J. Fan, C. Dou, Semantics driven embedding learning for effective entity alignment, in: ICDE, 2022, pp. 2127–2140.
[12] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, O. Yakhnenko, Translating embeddings for modeling multi-relational data, NeurIPS 26 (2013).
[13] Z. Sun, Z. Deng, J. Nie, J. Tang, RotatE: Knowledge graph embedding by relational rotation in complex space, in: ICLR, 2019.
[14] L. Chao, J. He, T. Wang, W. Chu, PairRE: Knowledge graph embeddings via paired relation vectors, in: ACL, 2021, pp. 4360–4369.
[15] M. Ali, M. Berrendorf, C. T. Hoyt, L. Vermue, M. Galkin, S. Sharifzadeh, A. Fischer, V. Tresp, J. Lehmann, Bringing light into the dark: A large-scale evaluation of knowledge graph embedding models under a unified framework, TPAMI 44 (2021) 8825–8845.
[16] N. Jain, J.-C. Kalo, W.-T. Balke, R. Krestel, Do embeddings actually capture knowledge graph semantics?, in: ESWC, 2021, pp. 143–159.
[17] J. L. W. V. Jensen, Sur les fonctions convexes et les inégalités entre les valeurs moyennes, Acta Mathematica 30 (1906) 175–193.
[18] T. Trouillon, J. Welbl, S. Riedel, É. Gaussier, G. Bouchard, Complex embeddings for simple link prediction, in: ICML, PMLR, 2016, pp. 2071–2080.
[19] T. Dettmers, P. Minervini, P. Stenetorp, S. Riedel, Convolutional 2D knowledge graph embeddings, in: AAAI, volume 32, 2018.
[20] G. Bouchard, S. Singh, T. Trouillon, On approximate reasoning capabilities of low-rank vector spaces, in: AAAI, 2015.
[21] K. Toutanova, D.
Chen, Observed versus latent features for knowledge base and text inference, in: Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, 2015, pp. 57–66.
[22] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, J. Taylor, Freebase: a collaboratively created graph database for structuring human knowledge, in: SIGMOD, 2008, pp. 1247–1250.
[23] T. Safavi, D. Koutra, CoDEx: A comprehensive knowledge graph completion benchmark, in: EMNLP, 2020, pp. 8328–8350.
[24] J. Hoffart, F. M. Suchanek, K. Berberich, G. Weikum, YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia, Artificial Intelligence 194 (2013) 28–61.
[25] H. Tong, C. Faloutsos, J.-Y. Pan, Fast random walk with restart and its applications, in: ICDM, IEEE, 2006, pp. 613–622.
[26] N. Craswell, Mean reciprocal rank, in: Encyclopedia of Database Systems, 2009.
[27] H. Ren, J. Leskovec, Beta embeddings for multi-hop logical reasoning in knowledge graphs, NeurIPS 33 (2020) 19716–19726.
[28] Y. Bai, X. Lv, J. Li, L. Hou, Answering complex logical queries on knowledge graphs via query computation tree optimization (2023).
[29] L. A. Galárraga, C. Teflioudi, K. Hose, F. M. Suchanek, AMIE: association rule mining under incomplete evidence in ontological knowledge bases, in: TheWebConf, 2013, pp. 413–422.
[30] L. Galárraga, C. Teflioudi, K. Hose, F. M. Suchanek, Fast rule mining in ontological knowledge bases with AMIE+, VLDBJ 24 (2015) 707–730.
[31] S. Ortona, V. V. Meduri, P. Papotti, Robust discovery of positive and negative rules in knowledge bases, in: ICDE, 2018, pp. 1168–1179.
[32] R. Agrawal, R. Srikant, et al., Fast algorithms for mining association rules, in: VLDB, volume 1215, Santiago, Chile, 1994, pp. 487–499.
[33] M. Loster, D. Mottin, P. Papotti, J. Ehmüller, B. Feldmann, F. Naumann, Few-shot knowledge validation using rules, in: TheWebConf, 2021, pp. 3314–3324.
[34] M. Loster, D. Mottin, P. Papotti, J.
Ehmüller, B. Feldmann, F. Naumann, Colt dataset, 2021. URL: https://hpi.de/naumann/projects/repeatability/datasets/colt-dataset.html, accessed on November 1, 2023.
[35] M. Nickel, V. Tresp, H.-P. Kriegel, A three-way model for collective learning on multi-relational data, in: ICML, volume 11, 2011.
[36] T. Safavi, D. Koutra, E. Meij, Evaluating the calibration of knowledge graph embeddings for trustworthy link prediction, in: EMNLP, 2020.
[37] P. Tabacof, L. Costabello, Probability calibration for knowledge graph embedding models, in: ICLR, 2020.
[38] C. Demir, J. Lienen, A.-C. Ngonga Ngomo, Kronecker decomposition for knowledge graph embeddings, in: HT, 2022, pp. 1–10.
[39] F. Bianchi, G. Rossiello, L. Costabello, M. Palmonari, P. Minervini, Knowledge graph embeddings and explainable AI, in: Knowledge Graphs for eXplainable Artificial Intelligence: Foundations, Applications and Challenges, IOS Press, 2020, pp. 49–72.
[40] S. Tiwari, I. Bansal, C. R. Rivero, Revisiting the evaluation protocol of knowledge graph completion methods for link prediction, in: TheWebConf, 2021, pp. 809–820.