                        Knowledge Graph Embeddings:
               Are Relation-Learning Models Learning Relations?
                               Andrea Rossi                                                          Antonio Matinata
                           Roma Tre University                                                       Roma Tre University
                        andrea.rossi3@uniroma3.it                                               ant.matinata@stud.uniroma3.it
ABSTRACT
Link Prediction (LP) is the task of inferring relations between entities in a Knowledge Graph (KG). LP is difficult, due to the sparsity and incompleteness of real-world KGs. Recent advances in Machine Learning have led to a large and rapidly growing number of relation learning models, from the seminal work of Bordes et al. [4] to the recent model in [2]. Despite the flurry of papers in this area, just a few datasets and evaluation metrics have emerged as de facto benchmarking criteria. In our work, we question the effectiveness of these benchmarks in establishing the state of the art. The use of unreliable benchmarking practices can have hidden ethical implications, as it may yield distorted evaluation results and overall lead the research community into adopting ineffective design choices. To this end, we consider key desiderata of a benchmark, formulated as specific questions relevant to the LP task, and provide empirical evidence to answer those questions. Our analysis shows that existing datasets and metrics fall short in capturing a model's capability of solving LP. Specifically, we show that a model can score very high by learning to predict facts about a small fraction of the entities in the training set. Our study provides a more robust evaluation direction for future research on relation learning models, stressing that understanding why LP models reach certain performances is a crucial step towards explaining predicted relations.
1 INTRODUCTION
Knowledge Graphs (KGs) are structured representations of facts in the real world. In a KG, each node represents an entity, e.g. a person, a place or a concept; each label represents a relationship usable to link entities; each edge in the form ⟨subject, predicate, object⟩ represents a fact connecting entity subject with entity object through the relationship predicate. Examples of KGs are FreeBase [3], WikiData [20], DBPedia [1], Yago [16] and, in industry, Google KG [15] and Microsoft Satori [12]. Such KGs can contain billions of facts, yet they cover only a small subset of all the facts in the real world.

   KG embeddings are a way of representing the components of a KG as vectors or matrices (embeddings) in a low-dimensional vector space, called latent space. Embeddings are computed by training a model on the KG data, and thus carry the semantic meaning of the original KG relations. In other words, given the embeddings of two elements, it should be possible to identify their semantic correlations.

   Knowledge Graph Completion is the task of identifying missing edges (facts) in KGs, by either extracting them from external corpora or inferring them from the ones already in the KG [11]. The latter approach, called Link Prediction (LP), typically requires defining a scoring function that estimates the plausibility of any fact. With the rise of machine learning techniques, this has naturally combined with the use of KG embeddings: on the one hand, in the training phase, LP models learn entity and relationship embeddings that optimize the scores of the facts already contained in the KG; on the other hand, in the prediction phase, the scoring function is applied to the embeddings of the subject, predicate and object of a fact to compute its plausibility.

   LP models can be queried by providing a subject (or an object), e.g., "Barack Obama", and a predicate, e.g., "place of birth", representing a question of the form "What is Barack Obama's place of birth?". Answering such a query amounts to computing the score of each potential object with respect to the given subject and predicate, and finding which one yields the best value. That is, the answer to an LP query is a ranking of the KG entities by decreasing plausibility.
   This approach was explored in the seminal work by Bordes et al. describing the TransE LP model [4]. TransE interprets relations as translations operating on low-dimensional embeddings of the entities. In just a few years, TransE has inspired dozens of new relation-learning systems (see [21] for a survey), and a few datasets and metrics have emerged as a de facto benchmark.

   In this paper we critically analyze current benchmarks for LP models. Our study is motivated by the observation that, in current datasets, less than 15% of entities cover more than 80% of facts. Such a skew casts doubts on the suitability of these benchmarks for evaluating LP models. Indeed, we have empirically observed that a model can achieve state-of-the-art scores by learning to predict facts about a tiny fraction of the entities with the highest degree in the training set, which are also the most mentioned entities in the test set. As an informative example, in FB15K, the most commonly used among the LP datasets, the entity node with by far the highest degree is "United States", with ≈ 2% of all edges; the vast majority of the "nationality" facts in the test set refer to "United States" as well. Therefore a model that learns to predict U.S. citizens only can obtain results comparable to, or even better than, one that attempts to learn the nationality relation in detail.

Our contribution. We argue that the research community is not best served by benchmarks that allow such a discrepancy to go unnoticed. Therefore, we provide a constructive contribution towards the definition of more effective benchmarks for LP models: (i) we formulate some key questions to highlight some of the most desirable properties for LP benchmarks; (ii) we conduct an extensive experimental analysis to understand whether the currently employed benchmarks satisfy such properties or not, and why. In doing this, we also highlight the ethical implications potentially connected with the limitations of the current benchmarks.

© 2020 Copyright held by the owner/author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2020 Joint Conference, March 30-April 2, 2020 on CEUR-WS.org. Distribution of this paper is permitted under the terms of the Creative Commons license CC BY 4.0.

2 BENCHMARKS FOR LP
In this section we describe the currently employed LP benchmarks; analogously to [19], we consider a benchmark as the whole workload employed to evaluate competing systems, composed of both datasets and metrics.
   The most popular datasets for LP benchmarking consist of facts sampled from the FreeBase [3] and WordNet [13] KGs. FreeBase is an open KG with billions of facts about millions of real-world entities and thousands of different relationships. WordNet is a lexical KG whose entities are English words grouped by their sense, and whose edges describe relations among words. The main features of these datasets are summarized in Table 1.

   The FB15K dataset was extracted by the TransE authors by selecting all FreeBase facts mentioning entities with at least 100 mentions that are also featured in the Wikilinks database (https://code.google.com/archive/p/wiki-links/), thus including their low-degree neighbors. Defining a full-fledged benchmarking workload was beyond their intentions; nonetheless, most of the approaches inspired by TransE have been evaluated against the same dataset and metrics, making them a de facto benchmark.

   The FB15K-237 dataset is a FB15K subset built by [17] after observing that FB15K suffers so much from test leakage that a simple model based on observable features can reach state-of-the-art performances on it. The authors only considered facts involving the 401 most frequent relationships in FB15K, and filtered away relationships with implicitly identical or inverse meanings. In order to remove trivial facts, they also removed from the validation and test sets any facts linking entities already connected in the training set. We note that this is nonetheless a biased approach, as it prevents us from evaluating the ability of a model to learn useful patterns such as, for instance, ⟨x, father_of, y⟩ entails ⟨y, child_of, x⟩.

   The WN18 dataset, analogously to FB15K, was built by the authors of TransE. They used the WordNet ontology [13] as a starting point, and then filtered out, over multiple iterations, entities and relationships with too few mentions.

   The WN18-RR dataset was built by [5] applying policies similar to those of [17], after performing further investigations on test leakage in FB15K and WN18.

[Figure 1: Skew analysis for entity degree distribution and relationship mentions on training, validation and test facts for FB15K and WN18. (a) Distribution of entity degrees in FB15K and WN18. (b) Distribution of relationship mentions in FB15K; WN18 is omitted as it only features 18 relationships.]
              Entities   Relations    Train    Valid    Test
  FB15K         14951        1345    483142    50000   59071
  WN18          40943          18    141442     5000    5000
  FB15K-237     14541         237    272115    17535   20466
  WN18-RR       40943          11     86835     3034    3134

Table 1: Standard datasets for LP (triple counts per split).
Datasets structural analysis. We define the degree of an entity and the number of mentions of a relationship as, respectively, the number of distinct facts of a dataset in which that entity or relationship appears.

   Both FB15K and WN18 show severe skew in both entity degrees and relationship mentions. Figures 1(a) and 1(b) plot their distributions, showing that the large majority of entities (relationships) have a very low degree (number of mentions), whereas a small minority of them reach massive representation.
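As a minimal sketch of how these counts can be computed (ours, assuming an in-memory list of (subject, predicate, object) tuples rather than the actual benchmark files):

```python
from collections import Counter

def degree_stats(triples):
    """Count entity degrees and relationship mentions over a list of (s, p, o) facts."""
    entity_degree, rel_mentions = Counter(), Counter()
    for s, p, o in triples:
        entity_degree[s] += 1
        entity_degree[o] += 1
        rel_mentions[p] += 1
    return entity_degree, rel_mentions

# Illustrative usage; real benchmark files are tab-separated triples, e.g.
# triples = [line.strip().split("\t") for line in open("train.txt")]
triples = [("US", "capital", "Washington"), ("US", "contains", "NY"), ("France", "capital", "Paris")]
deg, men = degree_stats(triples)
top_entity, top_degree = deg.most_common(1)[0]
print(f"Top entity {top_entity} appears in {100 * top_degree / len(triples):.1f}% of facts")
```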
2.1 Metrics
Evaluation for LP models is typically performed on the task of Entity Prediction. Given the number of entities n, for any test fact ⟨s, p, o⟩: (i) s is removed from the triple, obtaining ⟨?, p, o⟩; (ii) all n entities are tested as the triple subject and ranked according to the resulting score: the original subject s should thus obtain a rank as low (i.e., as close to 1) as possible. An analogous pipeline is used for predicting the object o.

   These rankings enable the following global metrics:
   • Mean Rank (MR), i.e. the average rank of the correct subject (object) over all predictions.
   • Mean Reciprocal Rank (MRR), i.e. the average of the inverse ranks of the correct subject (object) over all predictions.
   • Hits@K, i.e. the fraction of correct subject (object) predictions with rank equal to or lower than K. The most common choices for K are 10 and 1.

   The above metrics can be computed in two different settings, dubbed the raw scenario and the filtered scenario. An incomplete triple ⟨?, p, o⟩ might accept multiple entities as correct answers, where an answer is correct if the resulting fact is already contained in the training, validation or test set. In the raw scenario these additional correct entities are still considered "mistakes": if they outscore the expected answer, they worsen its rank. On the contrary, in the filtered scenario they are considered acceptable, so if they outscore the expected entity they are simply ignored.
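For illustration only, the following sketch (ours; score_fn is a generic placeholder for a model's scoring function and known_facts is assumed to contain all training, validation and test triples) computes MR, MRR and Hits@K for object prediction in either scenario. It is not the exact evaluation code used in our experiments.

```python
import numpy as np

def evaluate(test_triples, known_facts, n_entities, score_fn, k=10, filtered=True):
    """Compute MR, MRR and Hits@K for object prediction (<s, p, ?> queries)."""
    ranks = []
    for s, p, o in test_triples:
        scores = np.array([score_fn(s, p, cand) for cand in range(n_entities)], dtype=float)
        if filtered:
            # Ignore other known correct answers: push them below the target.
            for cand in range(n_entities):
                if cand != o and (s, p, cand) in known_facts:
                    scores[cand] = -np.inf
        # Rank of the expected object (1 = best).
        rank = int(np.sum(scores > scores[o])) + 1
        ranks.append(rank)
    ranks = np.array(ranks, dtype=float)
    return {"MR": ranks.mean(),
            "MRR": (1.0 / ranks).mean(),
            f"Hits@{k}": float((ranks <= k).mean())}
```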
3 THE CASE FOR BENCHMARKING
In this section we define key questions that a good LP benchmark should answer in the affirmative, and we investigate whether the current benchmarks satisfy them or not, providing experimental evidence for all our claims. We finally provide an overall discussion of the potential ethical implications of the defined questions and of their extent in this research field.
   In our experiments, we take into account two representative models, namely TransE [4] and DistMult [23].

   TransE is one of the first KG embedding systems, and has inspired dozens of successors. It represents facts as translations in the latent space: its scoring function uses the relationship embedding as a translation vector to move from the embedding of the subject to the embedding of the object.

   DistMult is very popular due to its simplicity: its scoring function is a bilinear product among the embedding of the subject, a diagonal matrix based on the embedding of the relationship, and the embedding of the object. If properly fine-tuned, it has recently been shown to surpass most models in the state of the art [7].
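For reference, the two scoring functions can be rendered as the following NumPy sketch (our own formulation of the standard definitions; TransE is shown in its L2 variant, and sign conventions vary across implementations):

```python
import numpy as np

def transe_score(e_s, r_p, e_o):
    """TransE: the relation translates the subject towards the object;
    the smaller the residual ||e_s + r_p - e_o||, the more plausible the fact."""
    return -float(np.linalg.norm(e_s + r_p - e_o, ord=2))

def distmult_score(e_s, r_p, e_o):
    """DistMult: bilinear product with a diagonal relation matrix,
    i.e. the sum of the element-wise product of the three embeddings."""
    return float(np.sum(e_s * r_p * e_o))

# Both functions map a candidate fact to a plausibility score (higher = better).
```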
   We stress that our goal is not to determine which one, among TransE and DistMult, is better: our purpose is rather to investigate the effectiveness of current benchmarks in highlighting their differences.

  FB15K         MR       MRR     H@1      H@10
  TransE        70.6     0.497   33.50%   75.68%
  DistMult      156.3    0.469   18.04%   74.03%

  FB15K-237     MR       MRR     H@1      H@10
  TransE        353.5    0.272   18.51%   44.29%
  DistMult      741.9    0.139    7.81%   26.08%

  WN18          MR       MRR     H@1      H@10
  TransE        494.5    0.445   16.00%   81.37%
  DistMult      928.9    0.811   70.00%   93.66%

  WN18-RR       MR       MRR     H@1      H@10
  TransE        5304.3   0.180    2.39%   40.67%
  DistMult      9953.5   0.373   36.05%   39.81%

Table 2: Global performances of the models in our experiments (filtered scenario).

   Table 2 reports the global values of Hits@1, Hits@10, Mean Rank and Mean Reciprocal Rank for both models on all datasets. For our results, we focus on the filtered scenario; we have observed analogous findings in the raw scenario as well. We show that analyzing the behaviour of TransE and DistMult on the current benchmarks can lead to surprising (and even contradictory) conclusions.
3.1 Experimental Setup
Our experiments have been performed on a server environment with an Intel Core(TM) i7-3820 CPU at 3.60GHz, 16GB of RAM and an NVIDIA Quadro M5000 GPU. We have employed the Tensorflow implementation of TransE and DistMult provided by the OpenKE toolkit [6]; our Tensorflow version is 1.9. Since comparing TransE and DistMult is out of the scope of this paper, we have not performed a full-fledged hyperparameter tuning, keeping our setting as similar as possible to the default OpenKE configuration even across different datasets. For the sake of verifiability and reproducibility, we report the resulting combination in Table 3.

              Epochs   Batches per epoch   Embedding dimension   Learning rate   Optimizer
  TransE        1000                 100                   100           0.001         SGD
  DistMult      1000                 100                   100          0.0005        Adam

Table 3: Hyperparameter configurations.

3.2 Questioning current benchmarks
Our questions on the current LP benchmarks refer to their relevance, their fairness and their capability to highlight overfitting. For each question we provide a formulation, an analysis and an overall answer.

   Q1: Does the benchmark measure the ability of the system to learn relations?
Analysis. This question is related to the relevance of the benchmark. An LP benchmark is relevant if it actually measures how good the system is at learning relations.

   A limitation of current benchmarks lies in their use of global metrics (Hits@K, Mean Rank, Mean Reciprocal Rank) that relate to the overall number of accurate predictions rather than to their quality. This practice does not take into account that some facts may be inherently different from others. In other words, global metrics do not let the specific strengths and weaknesses of different models surface: this does not allow one to investigate in which aspects a model performs better or worse than the others, and why. Ultimately, we believe that this hinders our understanding of what our systems are actually learning.

   Furthermore, as pointed out by [22], current evaluation metrics are based on positive test facts only, and do not check whether false or even nonsensical facts receive low scores in turn.

   Finally, it has recently been observed [7] that the extensive use of the Hits@10 metric might be misleading when comparing different models: many systems achieve similar, very good Hits@10 values, but they show marked differences with more selective versions of the same metric, such as Hits@1.

   To prove our claim we observe that most datasets display very skewed degree distributions. Our experiments show that the degree of an entity in the training set largely affects the LP prediction accuracy in testing; nonetheless, this strong correlation is completely overlooked by the commonly employed global metrics. We plot in Figure 2 the correlation between the entity degree and the prediction performances for the entities with that degree. We measure performances with the Hits@10 and MR metrics. Our results provide strong evidence that a higher degree yields better predictions; this pattern holds for the vast majority of entities, up to 1K mentions. We note that, despite reaching comparable Hits@10 overall, DistMult can significantly outperform TransE on low-degree entities, while TransE is better on the few high-degree entities.

[Figure 2: Training entity degree vs average performances when predicting an entity with that degree (FB15K). (a) Entity degree in training set and average Hits@10. (b) Entity degree in training set and average Rank. Dashed lines have been obtained by fitting a degree-4 polynomial with the least squares technique.]

   We believe that insights like these are vital to understand what our models are actually learning, and to choose the most suitable model for a specific setting; nonetheless, they are completely unobtainable by just relying on current benchmarking metrics.
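The degree/performance correlation shown in Figure 2 can be reproduced along the following lines (an illustrative sketch under our own assumptions: ranks_by_target pairs each test prediction's filtered rank with the entity being predicted, and train_degree maps entities to their training degree):

```python
from collections import defaultdict

def hits_at_10_by_degree(ranks_by_target, train_degree):
    """Group test predictions by the training degree of the target entity
    and compute Hits@10 within each degree value."""
    buckets = defaultdict(list)
    for entity, rank in ranks_by_target:
        buckets[train_degree[entity]].append(rank)
    return {deg: sum(r <= 10 for r in ranks) / len(ranks)
            for deg, ranks in sorted(buckets.items())}

# Illustrative input: (target entity, filtered rank) pairs and a degree map.
pairs = [("US", 1), ("US", 2), ("Paris", 40), ("Oslo", 3)]
degrees = {"US": 5000, "Paris": 30, "Oslo": 12}
print(hits_at_10_by_degree(pairs, degrees))
```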
   In order to provide an explanation for the correlation between degree and LP performances, we have analyzed how the degree in training facts correlates with the average distance between the entities with that degree and their closest neighbor. We report our findings in Figure 3a; in this chart, in order to yield more robust results, for each entity we actually consider an average of the distances from the top three closest neighbors. Interestingly, higher degrees typically correspond to more "isolated" embeddings in the latent space, with greater distances from their closest neighbors.
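The neighbourhood analysis behind Figure 3a can be sketched as follows (our own illustration; distances are Euclidean in the latent space, and for very large KGs the pairwise distance matrix would need to be computed in chunks):

```python
import numpy as np

def avg_top3_neighbour_distance(E):
    """For each entity embedding, average Euclidean distance to its 3 nearest other entities."""
    sq = np.sum(E ** 2, axis=1)
    # Squared pairwise distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (E @ E.T), 0.0)
    np.fill_diagonal(d2, np.inf)                 # exclude the entity itself
    top3 = np.sort(np.sqrt(d2), axis=1)[:, :3]   # three smallest distances per row
    return top3.mean(axis=1)

# These averages can then be plotted against each entity's training degree, as in Figure 3a.
E = np.random.default_rng(0).normal(size=(1000, 100))
print(avg_top3_neighbour_distance(E)[:5])
```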
[Figure 3: Degree in training set vs top-3 closest neighbours distance, and an intuitive interpretation of its effects on predictions. (a) Closest neighbours distance by degree (TransE, FB15K); the dashed line has been obtained by fitting a degree-4 polynomial with the least squares technique. (b) Stylized example of the high entity degree effect.]

   We interpret this as illustrated in Figure 3b: a "rich" entity such as United States has a very isolated embedding, while Washington and New York lie in a dense area. On the one hand, due to the lack of alternatives in the close neighborhood, it is reasonably easy to operate transformations in the latent space and correctly answer United States to the question ⟨Washington, capital_of, ?⟩. On the other hand, the inverse question ⟨United States, capital, ?⟩ is much more difficult, because it requires learning a very precise transformation in order to disambiguate between Washington and New York.

Answer. Entities with a high degree, like United States, can boost the ability of a model to predict relations mentioning them (e.g., capital_of). Therefore, by looking only at global metrics like Hits@10, it is hard to understand whether a model has learned a given relation precisely or only its top mentioned entities.
   Q2: Does the benchmark measure the performances of models in a fair way?
Analysis. The fairness of a benchmark is the absence of unwanted biases in any operation of its workload. Fairness depends both on the metrics (i.e., what is measured) and on the composition of the test set (i.e., how the measure is computed). In the context of LP, fairness is compromised by the same correlation observed in the previous section between entity degrees and prediction accuracy. Since both the training set and the test set are obtained from the same uniform sample of the KG, an entity with a high degree in the training set will also be mentioned more than the others in the test set. The over-representation of some entities in the test set, combined with the fact that the same entities also enable better predictions, leads to an overall unfairness of the benchmark, favouring "easy" entities with high degrees over harder ones with medium and low degrees.

   We demonstrate this by studying how progressively skipping test predictions for the top-degree entities affects global performances. We show our results in Figure 4: in both the Mean Rank and Hits@10 curves, the more high-degree entities are ignored, the worse the performances become. In this regard, the Hits@10 graph also confirms the slightly different behaviours of TransE and DistMult, with the former seemingly depending on the degree more than the latter.

[Figure 4: Effects on global metrics when progressively ignoring test predictions on up to 95% of the entities with highest degree (FB15K). (a) Global Hits@10 when progressively skipping tests on top-degree entities. (b) Global Mean Rank when progressively skipping tests on top-degree entities.]
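The skipping procedure can be sketched as follows (our own illustration; test_records pairs each test prediction's target entity with its filtered rank, and the degree map and skip fraction are illustrative values):

```python
def hits_at_10_skipping_top(test_records, train_degree, skip_fraction):
    """Recompute Hits@10 after discarding test predictions whose target entity
    is among the top `skip_fraction` of entities by training degree."""
    by_degree = sorted(train_degree, key=train_degree.get, reverse=True)
    skipped = set(by_degree[:int(len(by_degree) * skip_fraction)])
    kept = [rank for entity, rank in test_records if entity not in skipped]
    return sum(r <= 10 for r in kept) / len(kept)

# Example: drop the predictions targeting roughly the top third of entities by degree.
records = [("US", 1), ("US", 1), ("Oslo", 25), ("Paris", 4)]
degrees = {"US": 5000, "Oslo": 12, "Paris": 30}
print(hits_at_10_skipping_top(records, degrees, 0.34))
```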
   We have computed the number of entities that contribute the most to the global Hits@10 metric. The results are impressive: in FB15K, 80% of the global Hits@10 come from 24.1% of the entities for TransE and from 28.5% of the entities for DistMult. An even more extreme situation is witnessed in WN18, where 80% of the global Hits@10 come from 9.87% of the entities for TransE and 11.6% of the entities for DistMult.
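This contribution analysis amounts to sorting entities by the number of hits they account for and measuring how few of them cover 80% of the total, e.g. (an illustrative sketch with made-up per-entity counts):

```python
def fraction_covering(hits_per_entity, coverage=0.8):
    """Smallest fraction of entities whose per-entity hits sum to `coverage` of all hits."""
    counts = sorted(hits_per_entity.values(), reverse=True)
    total, running = sum(counts), 0
    for i, c in enumerate(counts, start=1):
        running += c
        if running >= coverage * total:
            return i / len(counts)
    return 1.0

# Illustrative per-entity Hits@10 counts:
print(fraction_covering({"US": 900, "UK": 50, "Oslo": 30, "Paris": 20}))
```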
Answer. High-degree entities, in addition to being more easily inferred, are also over-represented in the test set, and a model may obtain significantly good evaluations by just focusing on a small number of high-degree entities.

   Q3: Does the benchmark discourage (or at least highlight) overfitting?
Analysis. Overfitting takes place when a model matches its dataset too closely, conforming to noise and irrelevant correlations in its samples: an overfitted model fails to generalize, and will not behave correctly when dealing with unseen data. A good benchmark should highlight the emergence of overfitting.

   In the LP scenario, as already pointed out by [17] and [5], FB15K and WN18 significantly suffer from test leakage, with inverse triples from the training set occurring in the test set. As a consequence, even extremely simple systems can reach state-of-the-art performances on those datasets [7], thus casting doubt on the generality of reported results.

   We have also observed that, when removing the top-degree entities from both the training and test sets and retraining the model, performances improve instead of worsening, as one would expect from the previous findings (note that in this experiment we also retrain, differently from the experiment reported in Figure 4). This counter-intuitive pattern is steadily visible in Table 4 when removing the top 10, 25 and 100 entities. This phenomenon may be partly caused by the fact that, when removing a high-degree entity along with all of its facts, a large number of test "questions" about its low-degree neighbors (on which the model would perform badly) are removed as well. Nonetheless, obtaining paradoxically better results when removing training samples that the system would apparently learn well is a typical sign of overfitting. In our case, removing from the training set entities whose embeddings would take very large portions of the embedding space may allow the other entities to be placed in better positions; this can be seen as a form of regularization.

                              MR     MRR    H@10
  Complete                    70     0.49   75.67%
  Top 10 entities removed     66     0.49   76.55%
  Top 25 entities removed     64     0.50   78.22%
  Top 100 entities removed    68     0.53   80.64%

Table 4: TransE performances in the filtered scenario when removing the top-degree entities from FB15K.
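The dataset manipulation behind Table 4 can be sketched as follows (our own illustration, not the exact preprocessing code: the n highest-degree entities and every fact mentioning them are dropped from all splits before retraining):

```python
from collections import Counter

def remove_top_entities(train, valid, test, n_removed):
    """Drop the n highest-degree entities (by training degree) and all their facts."""
    degree = Counter()
    for s, _, o in train:
        degree[s] += 1
        degree[o] += 1
    banned = {e for e, _ in degree.most_common(n_removed)}

    def keep(triples):
        return [t for t in triples if t[0] not in banned and t[2] not in banned]

    return keep(train), keep(valid), keep(test)

# Tiny illustrative example (in the real experiment n_removed is 10, 25 or 100):
toy = [("US", "capital", "Washington"), ("France", "capital", "Paris"), ("US", "contains", "NY")]
print(remove_top_entities(toy, [], [], 1))
```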
   Finally, all the currently employed benchmarks display a static separation between training set, validation set and test set. This is known to be a bad practice, because over time it may favour models overfitted on this specific configuration. Running K-fold cross-validation for the two models on both FB15K and WN18, we did not observe significant signs of this form of overfitting yet. Nonetheless, we advise employing K-fold cross-validation whenever possible, as a way to prevent it in the future.
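For example, a rotating K-fold split over the whole set of facts could be obtained along these lines (an illustrative sketch using scikit-learn; it is not the exact procedure used in our experiments):

```python
import numpy as np
from sklearn.model_selection import KFold

all_facts = np.array([("US", "capital", "Washington"),
                      ("France", "capital", "Paris"),
                      ("US", "contains", "NY"),
                      ("Italy", "capital", "Rome"),
                      ("Norway", "capital", "Oslo")])

# 5 rotating train/test splits instead of a single static separation.
for fold, (train_idx, test_idx) in enumerate(KFold(n_splits=5, shuffle=True, random_state=0).split(all_facts)):
    train_facts, test_facts = all_facts[train_idx], all_facts[test_idx]
    # train the model on train_facts, evaluate on test_facts ...
    print(f"fold {fold}: {len(train_facts)} train / {len(test_facts)} test")
```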
Answer. The counter-intuitive boost in performances when removing high-degree entities from both the training and test sets suggests an undetected form of overfitting towards these entities.

   The above observations on the relevance, the fairness and the capability to discourage overfitting of the current benchmarking practices can have interesting implications within the ethics of information processing.

   Relying on ineffective benchmarks undermines the capability to assess the quality of the software and of the data it should manage. We have highlighted that the current evaluation practices may not detect, or may even penalize, desirable properties in LP models. For instance, the results currently yielded in evaluation cannot tell whether a model is learning a large set of relations or, rather, a narrow set of entities. This can lead to the systematic overlooking of under-represented entities, because systems that actually reason on the entire set of entities can be outranked by models that overfit on a few over-represented ones.

   We also point out that the opacity of current results, computed with global metrics over large batches of test facts, makes it almost impossible to interpret the behaviours of models. This, of course, has negative effects on their explainability and, ultimately, on their trustworthiness.

4 RELATED WORKS
To the best of our knowledge, there are just a few papers investigating the validity of current LP benchmarks and providing interpretations for the performances of relation-learning models. Works related to ours can be roughly divided into two main categories, depending on whether they address limitations of the standard metrics or of the datasets used in this research field. For instance, the already mentioned work by [7] demonstrates that a carefully tuned implementation of DistMult can achieve state-of-the-art performances, surpassing most of its own successors, raising the question of whether we are developing better LP models or just tuning better hyperparameters.
Limitations of standard metrics. The most similar work to ours is [22]: the authors observe that the currently employed metrics tend to be biased, as they are computed using only "positive" test facts, originally belonging to the KG. For instance, if the fact ⟨Barack Obama, place of birth, Honolulu⟩ is in the test set, the test questions will be ⟨?, place of birth, Honolulu⟩ and ⟨Barack Obama, place of birth, ?⟩. This approach is highly biased as it only scores triples for which an answer is already known to exist. It is more akin to Question Answering than to Knowledge Graph Completion, because it never tests the plausibility of nonsensical facts, such as ⟨Honolulu, place of birth, ?⟩, or of facts that have no answer, such as ⟨Barack Obama, place of death, ?⟩. They then propose a new testing workload in which all possible pairs of entities are tested for all relationships, in order to check whether any false or nonsensical triples manage to obtain high plausibility scores.
Limitations of standard datasets. Some of the current LP benchmarking workloads have already been called into question by a few previous works, to which we refer in our analysis. In general, these works do not aim at performing a systematic investigation of the benchmark properties; on the contrary, they just highlight a specific issue, often in the context of presenting a new model or implementation. To the best of our knowledge, [17] has been the first study to openly discuss the limitations of FB15K, demonstrating that it heavily suffers from test leakage: many relationships in this dataset are semantically identical or inverse to others, allowing even a very simple model based on observed features to outperform most embedding-based state-of-the-art ones. The authors then proceeded to extract a more challenging subset from FB15K, called FB15K-237, containing non-trivial facts only. Unfortunately, FB15K-237 has been only partially adopted by the research community, with prominent models such as HolE [10], ComplEx [18] and ANALOGY [9] ignoring it.

   Starting from that analysis, [5] have further investigated test leakage in both FB15K and WN18. They have demonstrated that a simple rule-based system relying on inverse relationships can reach state-of-the-art performances on WN18; they have then applied a procedure similar to that of [17] to WN18 to generate its challenging subset, WN18-RR.
Other tasks. For the sake of completeness, we also observe that, when proposing new models for LP, many papers analyze their applicability and performances on related tasks too. For instance, the authors of [8] show the performances of their model in relation extraction from text using the NYT-FB dataset [14], where sentences from the New York Times Corpus are annotated with the Stanford NER and linked to FreeBase elements. Analyzing the properties of benchmarks for relation extraction tasks is out of the scope of our work.
5 CONCLUSIONS
We have analyzed the current LP benchmarks, observing that the training sets of their datasets display severely skewed distributions in both the degrees of entities and the mentions of relationships.

   We have experimentally demonstrated that LP models are deeply affected by these unbalanced conditions; nonetheless, these effects go completely unnoticed by the current evaluation workloads, thus casting doubts on their relevance. We have also shown that entities and relationships that are highly mentioned in training sets tend to be over-represented in test sets too, affecting the fairness of the evaluation workload. We have finally reported that, when ignoring entities with high degree (and thus high performances), LP models show a counter-intuitive improvement in performances, potentially attributable to overfitting.

   Overall, our results raise concerns on the effectiveness of these benchmarks. We demonstrate that relying on global metrics over heavily skewed distributions hinders our understanding of LP models; all in all, our results imply that, in their current state, these benchmarking practices may not be able to capture and fairly measure the capability of relation-learning models to effectively learn relations.

ACKNOWLEDGEMENTS
Work funded in part by Regione Lazio LR 13/08 Project "In Codice Ratio" (14832).

REFERENCES
[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A nucleus for a web of open data. In The Semantic Web, pages 722–735. Springer, 2007.
[2] I. Balažević, C. Allen, and T. M. Hospedales. TuckER: Tensor factorization for knowledge graph completion. arXiv preprint arXiv:1901.09590, 2019.
[3] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1247–1250. ACM, 2008.
[4] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pages 2787–2795. NIPS, 2013.
[5] T. Dettmers, P. Minervini, P. Stenetorp, and S. Riedel. Convolutional 2D knowledge graph embeddings. In AAAI Conference on Artificial Intelligence, 2018.
[6] X. Han, S. Cao, X. Lv, Y. Lin, Z. Liu, M. Sun, and J. Li. OpenKE: An open toolkit for knowledge embedding. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 139–144, 2018.
[7] R. Kadlec, O. Bajgar, and J. Kleindienst. Knowledge base completion: Baselines strike back. arXiv preprint arXiv:1705.10744, 2017.
[8] Y. Lin, Z. Liu, M. Sun, Y. Liu, and X. Zhu. Learning entity and relation embeddings for knowledge graph completion. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, volume 15, pages 2181–2187. AAAI, 2015.
[9] H. Liu, Y. Wu, and Y. Yang. Analogical inference for multi-relational embeddings. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 2168–2178. JMLR.org, 2017.
[10] M. Nickel, L. Rosasco, and T. Poggio. Holographic embeddings of knowledge graphs. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[11] H. Paulheim. Knowledge graph refinement: A survey of approaches and evaluation methods. Semantic Web, 8(3):489–508, 2017.
[12] R. Qian. Understand your world with Bing, 2013. Blog post in Bing Blogs.
[13] R. Richardson, A. F. Smeaton, and J. N. Murphy. Using WordNet as a knowledge base for measuring semantic similarity between words. Technical report, In Proceedings of AICS Conference, 1994.
[14] S. Riedel, L. Yao, and A. McCallum. Modeling relations and their mentions without labeled text. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 148–163. Springer, 2010.
[15] A. Singhal. Introducing the Knowledge Graph: things, not strings, 2012. Blog post in the Official Google Blog.
[16] F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: a core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, pages 697–706. ACM, 2007.
[17] K. Toutanova and D. Chen. Observed versus latent features for knowledge base and text inference. In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, pages 57–66, 2015.
[18] T. Trouillon, J. Welbl, S. Riedel, É. Gaussier, and G. Bouchard. Complex embeddings for simple link prediction. In International Conference on Machine Learning, pages 2071–2080, 2016.
[19] J. von Kistowski, J. A. Arnold, K. Huppler, K.-D. Lange, J. L. Henning, and P. Cao. How to build a benchmark. In Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering, 2015.
[20] D. Vrandečić and M. Krötzsch. Wikidata: a free collaborative knowledge base. Communications of the ACM, 57(10):78–85, 2014.
[21] Q. Wang, Z. Mao, B. Wang, and L. Guo. Knowledge graph embedding: A survey of approaches and applications. IEEE Transactions on Knowledge and Data Engineering (TKDE), 29(12):2724–2743, 2017.
[22] Y. Wang, D. Ruffinelli, S. Broscheit, and R. Gemulla. On evaluating embedding models for knowledge base completion. arXiv preprint arXiv:1810.07180, 2018.
[23] B. Yang, W.-t. Yih, X. He, J. Gao, and L. Deng. Embedding entities and relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575, 2014.