Knowledge Graph Embeddings: Are Relation-Learning Models Learning Relations?

Andrea Rossi, Roma Tre University, andrea.rossi3@uniroma3.it
Antonio Matinata, Roma Tre University, ant.matinata@stud.uniroma3.it

© 2020 Copyright held by the owner/author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2020 Joint Conference, March 30-April 2, 2020 on CEUR-WS.org. Distribution of this paper is permitted under the terms of the Creative Commons license CC BY 4.0.

ABSTRACT

Link Prediction (LP) is the task of inferring relations between entities in a Knowledge Graph (KG). LP is difficult due to the sparsity and incompleteness of real-world KGs. Recent advances in Machine Learning have led to a large and rapidly growing number of relation-learning models, from the seminal work of Bordes et al. [4] to the recent model in [2]. Despite the flurry of papers in this area, just a few datasets and evaluation metrics have emerged as de facto benchmarking criteria. In this work, we question the effectiveness of these benchmarks in establishing the state of the art. The use of unreliable benchmarking practices can have hidden ethical implications, as it may yield distorted evaluation results and lead the research community into adopting ineffective design choices. To this end, we consider key desiderata of a benchmark, formulated as specific questions relevant to the LP task, and provide empirical evidence to answer those questions. Our analysis shows that existing datasets and metrics fall short in capturing a model's capability of solving LP. Specifically, we show that a model can score very high by learning to predict facts about a small fraction of the entities in the training set. Our study provides a more robust evaluation direction for future research on relation-learning models, stressing that understanding why LP models reach certain performances is a crucial step towards explaining predicted relations.

1 INTRODUCTION

Knowledge Graphs (KGs) are structured representations of facts in the real world. In a KG, each node represents an entity, e.g. a person, a place or a concept; each label represents a relationship usable to link entities; each edge, in the form ⟨subject, predicate, object⟩, represents a fact connecting entity subject with entity object through the relationship predicate. Examples of KGs are FreeBase [3], WikiData [20], DBPedia [1], Yago [16] and, in industry, Google KG [15] and Microsoft Satori [12]. Such KGs can contain billions of facts, yet they cover only a small subset of all the facts in the real world.

KG embeddings are a way of representing the components of a KG as vectors or matrices (embeddings) in a low-dimensional space, called latent space. Embeddings are computed by training a model on the KG data, and thus carry the semantic meaning of the original KG relations. In other words, given the embeddings of two elements, it should be possible to identify their semantic correlations.

Knowledge Graph Completion is the task of identifying missing edges (facts) in KGs, either by extracting them from external corpora or by inferring them from the facts already in the KG [11]. The latter approach, called Link Prediction (LP), typically requires defining a scoring function that estimates the plausibility of any fact. With the rise of machine learning techniques, this has naturally combined with the use of KG embeddings: on the one hand, in the training phase, LP models learn entity and relationship embeddings that optimize the scores of the facts already contained in the KG; on the other hand, in the prediction phase, the scoring function is applied to the embeddings of the subject, predicate and object of a fact to compute its plausibility.

LP models can be queried by providing a subject (or an object), e.g., "Barack Obama", and a predicate, e.g., "place of birth", representing a question of the form "What is Barack Obama's place of birth?". Answering such a query amounts to computing the score of each potential object with respect to the current subject and predicate, and finding which one yields the best value. That is, the answer to an LP query is a ranking of the KG entities by decreasing plausibility.
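To make the query-answering process concrete, the following minimal sketch (our own illustration, not the implementation of any specific system) ranks candidate objects for the query ⟨Barack Obama, place of birth, ?⟩. The embeddings are random stand-ins for the vectors a trained model would provide, and the TransE-style distance is just one possible choice of scoring function.

```python
import numpy as np

# Stand-in embeddings; a trained LP model would provide these vectors.
rng = np.random.default_rng(0)
entity_emb = {name: rng.normal(size=50)
              for name in ("Barack Obama", "Honolulu", "Nairobi", "Jakarta")}
relation_emb = {"place of birth": rng.normal(size=50)}

def score(s, p, o):
    # Example TransE-style plausibility: the closer e_s + r_p is to e_o,
    # the higher the score (hence the negated distance).
    return -np.linalg.norm(entity_emb[s] + relation_emb[p] - entity_emb[o])

def answer_query(subject, predicate):
    # Score every entity as a candidate object, rank by decreasing plausibility.
    ranking = [(o, score(subject, predicate, o)) for o in entity_emb]
    return sorted(ranking, key=lambda pair: pair[1], reverse=True)

print(answer_query("Barack Obama", "place of birth"))
```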
This approach was explored in the seminal work by Bordes et al., which introduced the TransE LP model [4]. TransE interprets relations as translations operating on low-dimensional embeddings of the entities. In just a few years, TransE has inspired dozens of new relation-learning systems (see [21] for a survey), and a few datasets and metrics have emerged as a de facto benchmark.

In this paper we critically analyze current benchmarks for LP models. Our study is motivated by the observation that, in current datasets, less than 15% of entities cover more than 80% of facts. Such a skew casts doubts on the suitability of these benchmarks for evaluating LP models. Indeed, we have empirically observed that a model can achieve state-of-the-art scores by learning to predict facts about a tiny fraction of the entities with highest degree in the training set, which are also the most mentioned entities in the test set. As an informative example, in FB15K, the most commonly used among the LP datasets, the entity with the highest degree is by far "United States", with ≈ 2% of all edges; the vast majority of the "nationality" facts in the test set refer to "United States" as well. Therefore a model that learns to predict U.S. citizens only can obtain results comparable to, or even better than, one that attempts to learn the nationality relation in detail.

Our contribution. We argue that the research community is not best served by benchmarks that allow such a discrepancy to go unnoticed. Therefore, we provide a constructive contribution towards the definition of more effective benchmarks for LP models: (i) we formulate key questions that highlight some of the most desirable properties of LP benchmarks; (ii) we conduct an extensive experimental analysis to understand whether the currently employed benchmarks satisfy such properties or not, and why. In doing this, we also highlight the ethical implications potentially connected with the limitations of the current benchmarks.

2 BENCHMARKS FOR LP

In this section we describe the currently employed LP benchmarks; analogously to [19], we consider a benchmark as the whole workload employed to evaluate competing systems, composed of both datasets and metrics.

The most popular datasets for LP benchmarking consist of facts sampled from the FreeBase [3] and WordNet [13] KGs. FreeBase is an open KG with billions of facts about millions of real-world entities and thousands of different relationships. WordNet is a lexical KG whose entities are English words grouped by their sense, and whose edges describe relations among words. The main features of these datasets are summarized in Table 1.

Table 1: Standard datasets for LP.

             Entities  Relations  Train triples  Valid triples  Test triples
FB15K           14951       1345         483142          50000         59071
WN18            40943         18         141442           5000          5000
FB15K-237       14541        237         272115          17535         20466
WN18-RR         40943         11          86835           3034          3134
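These datasets are commonly distributed as plain-text files with one ⟨subject, predicate, object⟩ triple per line. The minimal loader below is our own sketch; the tab-separated layout and the file path are assumptions about the specific release being used, not a prescription.

```python
def load_triples(path):
    """Read one tab-separated <subject, predicate, object> triple per line."""
    triples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 3:
                triples.append(tuple(parts))
    return triples

# Hypothetical location of the FB15K training split.
train = load_triples("FB15K/train.txt")
print(len(train), "training facts")
```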
The FB15K dataset was extracted by the TransE authors by selecting all facts involving the 100 most mentioned FreeBase entities that are also featured in the Wikilinks database¹ (thus including their low-degree neighbors). Defining a full-fledged benchmarking workload was beyond their intentions; nonetheless, most of the approaches inspired by TransE have been evaluated against the same dataset and metrics, making them a de facto benchmark.

The FB15K-237 dataset is a subset of FB15K built by [17] after observing that FB15K suffers so much from test leakage that a simple model based on observable features can reach state-of-the-art performances on it. The authors only considered facts of the 401 most occurring relationships in FB15K, and filtered away those with implicitly identical or inverse meaning. In order to remove trivial facts, they also removed from the validation and test sets any fact linking entities already connected in the training set. We note that this is nonetheless a biased approach, as it prevents us from evaluating the ability of a model to learn useful patterns such as, for instance, ⟨x, father_of, y⟩ entails ⟨y, child_of, x⟩.

The WN18 dataset, analogously to FB15K, was built by the authors of TransE. They used the WordNet ontology [13] as a starting point, and then filtered out, over multiple iterations, entities and relationships with too few mentions.

The WN18-RR dataset was built by [5] by applying policies similar to those of [17], after performing further investigations on test leakage in FB15K and WN18.

¹ https://code.google.com/archive/p/wiki-links/

[Figure 1: Skew analysis for the distributions of entity degrees and relationship mentions over training, validation and test facts in FB15K and WN18. (a) Distribution of entity degrees in FB15K and WN18. (b) Distribution of relationship mentions in FB15K; WN18 is omitted as it only features 18 relationships.]

Datasets structural analysis. We define the degree of an entity and the number of mentions of a relationship as, respectively, the number of times that the entity or the relationship appears in different facts of a dataset. Both FB15K and WN18 show severe skew in both entity degrees and relationship mentions. Figures 1(a) and 1(b) plot their distributions, showing that the large majority of entities (relationships) have a very low degree (number of mentions), whereas a small minority of them reach massive representation.
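As a concrete rendering of these definitions, the sketch below (our own; it reuses the `train` list from the loader above) counts degrees and mentions, and estimates which fraction of top-degree entities accounts for 80% of all entity occurrences, a rough version of the skew statistic mentioned in the introduction.

```python
from collections import Counter

def degrees_and_mentions(triples):
    """Entity degree = number of facts mentioning the entity as subject or
    object; relationship mentions = number of facts using the relationship."""
    degrees, mentions = Counter(), Counter()
    for s, p, o in triples:
        degrees[s] += 1
        degrees[o] += 1
        mentions[p] += 1
    return degrees, mentions

def fraction_covering(degrees, target=0.80):
    """Smallest fraction of highest-degree entities whose occurrences cover
    `target` of all entity occurrences (a rough skew indicator)."""
    total = sum(degrees.values())
    covered = 0
    for i, (_, d) in enumerate(degrees.most_common(), start=1):
        covered += d
        if covered >= target * total:
            return i / len(degrees)
    return 1.0

degrees, mentions = degrees_and_mentions(train)
print("fraction of entities covering 80% of occurrences:",
      fraction_covering(degrees))
```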
2.1 Metrics

Evaluation of LP models is typically performed on the task of Entity Prediction. Given the number of entities n, for any test fact ⟨s, p, o⟩: (i) s is removed from the triple, obtaining ⟨?, p, o⟩; (ii) all n entities are tested as the triple subject and ranked according to the resulting score: the original subject s should thus obtain a rank as close to 1 as possible. An analogous pipeline is used for predicting the object o. These rankings enable the following global metrics:

• Mean Rank (MR), i.e. the average rank of the correct subject (object) over all predictions.
• Mean Reciprocal Rank (MRR), i.e. the average of the inverse ranks of the correct subject (object) over all predictions.
• Hits@K, i.e. the fraction of correct subject (object) predictions with rank equal to or lower than K. The most common choices for K are 10 and 1.

These metrics can be computed in two different settings, dubbed the raw scenario and the filtered scenario. As a matter of fact, an incomplete triple ⟨?, p, o⟩ may accept multiple entities as correct answers; an answer is correct if the resulting fact is already contained in the training, validation or test set. In the raw scenario these entities are still counted as "mistakes", and therefore, if they outscore the expected answer, they worsen the prediction rank. On the contrary, in the filtered scenario they are considered acceptable, so if they outscore the expected entity they are simply ignored.
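A minimal sketch of these metrics in the filtered setting is given below. It is our own illustration, assuming a trained model exposes a `score(s, p, o)` function like the one in the earlier listing and that `known_facts` is the set of all training, validation and test triples; only object prediction is shown, subject prediction being symmetric.

```python
def evaluate_objects(test_facts, all_entities, score, known_facts, k=10):
    """MR, MRR and Hits@K for object prediction in the filtered scenario:
    candidates forming other known facts are ignored when they outscore
    the expected object."""
    ranks = []
    for s, p, o in test_facts:
        target = score(s, p, o)
        rank = 1
        for cand in all_entities:
            if cand == o or (s, p, cand) in known_facts:
                continue  # filtered scenario: skip other correct answers
            if score(s, p, cand) > target:
                rank += 1
        ranks.append(rank)
    mr = sum(ranks) / len(ranks)
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits_at_k = sum(r <= k for r in ranks) / len(ranks)
    return mr, mrr, hits_at_k
```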
3 THE CASE FOR BENCHMARKING

In this section we define key questions that a good LP benchmark should answer in the affirmative, and we investigate whether the current benchmarks satisfy them or not, providing experimental evidence for all our claims. We finally discuss the potential ethical implications of the defined questions and their extent in this research field.

In our experiments, we take into account two representative models, namely TransE [4] and DistMult [23].

TransE is one of the first KG embedding systems, and has inspired dozens of successors. It represents facts as translations in the latent space: its scoring function uses the relationship embedding as a translation vector to move from the embedding of the subject to the embedding of the object.

DistMult is very popular due to its simplicity: its scoring function is a bilinear product among the embedding of the subject, a diagonal matrix based on the embedding of the relationship, and the embedding of the object. If properly fine-tuned, it has recently been shown to surpass most models in the state of the art [7].

We stress that our goal is not to determine which one, between TransE and DistMult, is better: our purpose is rather to investigate the effectiveness of current benchmarks in highlighting their differences.
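For reference, the two scoring functions can be written in a few lines. The sketch below is our own schematic rendition (the L1 norm for TransE and the random vectors are illustrative choices), not the toolkit implementation used in the experiments of Section 3.1.

```python
import numpy as np

def transe_score(e_s, r_p, e_o):
    # TransE: the relationship embedding translates the subject towards the
    # object; a smaller distance means a more plausible fact, hence the minus.
    return -np.linalg.norm(e_s + r_p - e_o, ord=1)

def distmult_score(e_s, r_p, e_o):
    # DistMult: bilinear product with a diagonal relation matrix, i.e. an
    # element-wise weighted dot product of subject and object embeddings.
    return float(np.sum(e_s * r_p * e_o))

d = 100  # illustrative embedding dimension (the one we use, cf. Table 3)
rng = np.random.default_rng(0)
e_s, r_p, e_o = rng.normal(size=(3, d))
print(transe_score(e_s, r_p, e_o), distmult_score(e_s, r_p, e_o))
```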
3.1 Experimental Setup

Our experiments were performed on a server with an Intel Core(TM) i7-3820 CPU at 3.60 GHz, 16 GB of RAM and an NVIDIA Quadro M5000 GPU. We employed the Tensorflow implementations of TransE and DistMult provided by the OpenKE toolkit [6]; our Tensorflow version is 1.9. Since comparing TransE and DistMult is out of the scope of this paper, we did not perform a full-fledged hyperparameter tuning, keeping our setting as similar as possible to the default OpenKE configuration even across different datasets. For the sake of verifiability and reproducibility we report the resulting combination in Table 3.

Table 3: Hyperparameter configurations.

           Epochs  Batches per epoch  Embedding dimension  Learning rate  Optimizer
TransE       1000                100                  100          0.001        SGD
DistMult     1000                100                  100         0.0005       Adam

Table 2 reports the global values of Hits@1, Hits@10, Mean Rank and Mean Reciprocal Rank for both models on all datasets. In the following we focus on the filtered scenario; we have observed analogous findings in the raw scenario as well. We show that analyzing the behaviour of TransE and DistMult on the current benchmarks can lead to surprising (and even contradictory) conclusions.

Table 2: Global performances of the models in our experiments (filtered scenario).

FB15K          MR      MRR     H@1      H@10
TransE         70.6    0.497   33.50%   75.68%
DistMult       156.3   0.469   18.04%   74.03%

FB15K-237      MR      MRR     H@1      H@10
TransE         353.5   0.272   18.51%   44.29%
DistMult       741.9   0.139   7.81%    26.08%

WN18           MR      MRR     H@1      H@10
TransE         494.5   0.445   16.00%   81.37%
DistMult       928.9   0.811   70.00%   93.66%

WN18-RR        MR      MRR     H@1      H@10
TransE         5304.3  0.180   2.39%    40.67%
DistMult       9953.5  0.373   36.05%   39.81%

3.2 Questioning current benchmarks

Our questions on the current LP benchmarks concern their relevance, their fairness and their capability to highlight overfitting. For each question we provide a formulation, an analysis and an overall answer.

Q1: Does the benchmark measure the ability of the system to learn relations?

Analysis. This question is related to the relevance of the benchmark. An LP benchmark is relevant if it actually measures how good the system is at learning relations.

A limitation of current benchmarks lies in their use of global metrics (Hits@K, Mean Rank, Mean Reciprocal Rank) that relate to the overall number of accurate predictions rather than to their quality. This practice does not take into account that some facts may be inherently different from others. In other words, global metrics do not let the specific strengths and weaknesses of different models surface: they do not allow us to investigate in which aspects a model performs better or worse than the others, and why. Ultimately, we believe that this hinders our understanding of what our systems are actually learning.

Furthermore, as pointed out by [22], current evaluation metrics are based on positive test facts only, and do not check whether false or even nonsensical facts receive low scores in turn.

Finally, it has recently been observed [7] that the extensive use of the Hits@10 metric might be misleading when comparing different models: many systems achieve similar, very good Hits@10 values, but they show marked differences with more selective versions of the same metric, such as Hits@1.

To support our claim we observe that most datasets display very skewed degree distributions. Our experiments show that the degree of an entity in the training set largely affects the prediction accuracy on test facts mentioning it; nonetheless, this strong correlation is completely overlooked by the commonly employed global metrics.

We plot in Figure 2 the correlation between the entity degree and the prediction performances for entities with that degree, measured with the Hits@10 and MR metrics. Our results provide strong evidence that a higher degree yields better predictions; this pattern holds for the vast majority of entities, up to roughly 1K mentions. We also note that, despite reaching comparable Hits@10 overall, DistMult significantly outperforms TransE on low-degree entities, while TransE is better on the few high-degree entities.

[Figure 2: Training entity degree vs average performances when predicting an entity with that degree (FB15K). (a) Entity degree in training set and average Hits@10. (b) Entity degree in training set and average Rank. Dashed lines obtained by fitting a polynomial of degree 4 with the least-squares technique.]

We believe that insights like these are vital to understand what our models are actually learning, and to choose the most suitable model for a specific setting; nonetheless, they are completely unobtainable by relying on the current benchmarking metrics alone.
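The per-degree breakdown behind Figure 2 can be obtained by grouping test predictions by the training degree of the expected entity. The sketch below is a minimal version of this analysis, assuming the `degrees` counter computed earlier and a hypothetical list of (expected_entity, rank) pairs collected during evaluation.

```python
from collections import defaultdict

def hits_at_10_by_degree(prediction_ranks, degrees):
    """Average Hits@10 of test predictions, grouped by the training degree
    of the expected entity."""
    buckets = defaultdict(list)
    for entity, rank in prediction_ranks:
        buckets[degrees[entity]].append(rank)
    return {deg: sum(r <= 10 for r in ranks) / len(ranks)
            for deg, ranks in sorted(buckets.items())}

# e.g. curve = hits_at_10_by_degree(prediction_ranks, degrees)
```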
In order to explain the correlation between degree and LP performances, we have analyzed how the degree of an entity in the training facts correlates with the average distance between its embedding and that of its closest neighbor. We report our findings in Figure 3(a); in this chart, in order to obtain more robust results, for each entity we actually average the distances from its three closest neighbors. Interestingly, higher degrees typically correspond to more "isolated" embeddings in the latent space, with greater distances from their closest neighbors.

[Figure 3: Degree in training set vs distance from the three closest neighbours, and intuitive interpretation of its effects on predictions. (a) Closest neighbours distance by degree (TransE, FB15K); the dashed line is obtained by fitting a polynomial of degree 4 with the least-squares technique. (b) Stylized example of the high entity degree effect.]

We interpret this as illustrated in Figure 3(b): a "rich" entity such as United States has a very isolated embedding, while Washington and New York lie in a dense area. On the one hand, due to the lack of alternatives in its close neighborhood, it is reasonably easy to operate transformations in the latent space and answer correctly United States to the question ⟨Washington, capital_of, ?⟩. On the other hand, the inverse question ⟨United States, capital, ?⟩ is much more difficult, because it requires learning a very precise transformation in order to disambiguate between Washington and New York.

Answer. Entities with high degree, like United States, can boost the ability of a model to predict relations mentioning them (e.g., capital_of). Therefore, by looking only at global metrics like Hits@10 it is hard to understand whether a model has learned a given relation precisely, or only its most mentioned entities.

Q2: Does the benchmark measure the performances of models in a fair way?

Analysis. The fairness of a benchmark is the absence of unwanted biases in any operation of its workload. Fairness depends both on the metrics (i.e., what is measured) and on the composition of the test set (i.e., how the measure is computed). In the context of LP, fairness is compromised by the same correlation between entity degrees and prediction accuracy observed in the previous section. Since both the training set and the test set are obtained from the same uniform sample of the KG, an entity with high degree in the training set will also be mentioned more than the others in the test set. The over-representation of some entities in the test set, combined with the fact that the same entities also enable better predictions, leads to an overall unfairness of the benchmark, favouring "easy" entities with high degrees over harder ones with medium and low degrees.

We demonstrate this by studying how progressively skipping test predictions for the top-degree entities affects global performances. We show our results in Figure 4: in both the Mean Rank and the Hits@10 curves, the more high-degree entities are ignored, the worse the performances become. In this regard, the Hits@10 graph also confirms the slightly different behaviours of TransE and DistMult, with the former seemingly depending on the degree more than the latter.

[Figure 4: Effects on global metrics when progressively ignoring test predictions on up to 95% of the entities with highest degree (FB15K). (a) Global Hits@10 when progressively skipping tests on top-degree entities. (b) Global Mean Rank when progressively skipping tests on top-degree entities.]

We have also computed the number of entities that contribute the most to the global Hits@10 metric. The results are impressive: in FB15K, 80% of the global Hits@10 comes from 24.1% of the entities for TransE and from 28.5% of the entities for DistMult. An even more extreme situation is witnessed in WN18, where 80% of the global Hits@10 comes from 9.87% of the entities for TransE and from 11.6% of the entities for DistMult.

Answer. High-degree entities, in addition to being more easily inferred, are also over-represented in the test set; a model may therefore obtain significantly good evaluations by just focusing on a small number of high-degree entities.
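The skipping experiment of Figure 4 amounts to recomputing the global metrics after discarding every test prediction whose expected entity falls within a given top fraction of the degree distribution. A minimal sketch (our own, reusing the `degrees` counter and the hypothetical (expected_entity, rank) pairs of the previous listings) follows.

```python
def metrics_skipping_top(prediction_ranks, degrees, skip_fraction):
    """Global MR and Hits@10 after ignoring predictions whose expected entity
    is among the `skip_fraction` of entities with the highest degree."""
    n_skip = int(skip_fraction * len(degrees))
    skipped = {e for e, _ in degrees.most_common(n_skip)}
    kept = [rank for entity, rank in prediction_ranks if entity not in skipped]
    mr = sum(kept) / len(kept)
    hits10 = sum(r <= 10 for r in kept) / len(kept)
    return mr, hits10

# e.g. progressively skip the top 5%, 25%, 50% and 95% of entities by degree:
# for frac in (0.05, 0.25, 0.50, 0.95):
#     print(frac, metrics_skipping_top(prediction_ranks, degrees, frac))
```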
Q3: Does the benchmark discourage (or at least highlight) overfitting?

Analysis. Overfitting takes place when a model matches its dataset too closely, conforming to noise and irrelevant correlations in its samples: an overfitted model fails to generalize, and will not behave correctly when dealing with unseen data. A good benchmark should highlight the emergence of overfitting.

In the LP scenario, as already pointed out by [17] and [5], FB15K and WN18 significantly suffer from test leakage, with inverse triples from the training set occurring in the test set. As a consequence, even extremely simple systems can reach state-of-the-art performances on those datasets [7], thus casting doubt on the generality of the reported results.

We have also observed that, when removing the top-degree entities from both the training and test sets and retraining the model, performances improve instead of worsening, as one would expect from the previous findings (note that in this experiment we also retrain, differently from the experiment reported in Figure 4). This counter-intuitive pattern is steadily visible in Table 4 when removing the top 10, 25 and 100 entities. The phenomenon may be partly caused by the fact that, when removing a high-degree entity along with all of its facts, a large number of test "questions" about its low-degree neighbors (on which the model would perform badly) are removed as well. Nonetheless, obtaining paradoxically better results when removing training samples that the system would apparently learn well is a typical sign of overfitting. In our case, removing from the training set entities whose embeddings would take up very large portions of the embedding space may allow the other entities to be placed in better positions; this can be seen as a form of regularization.

Table 4: TransE performances in the filtered scenario when removing the top-degree entities from FB15K.

                             MR    MRR    H@10
Complete                     70    0.49   75.67%
Top 10 entities removed      66    0.49   76.55%
Top 25 entities removed      64    0.50   78.22%
Top 100 entities removed     68    0.53   80.64%

Finally, all the currently employed benchmarks display a static separation between training, validation and test sets. This is known to be a bad practice, because over time it may favour models overfitted on this specific configuration. Running K-fold cross-validation for the two models on both FB15K and WN18, we did not observe significant signs of this form of overfitting yet. Nonetheless, we advise employing K-fold cross-validation whenever possible as a way to prevent it in the future.

Answer. The counter-intuitive boost in performances obtained when removing high-degree entities from both the training and test sets suggests an undetected form of overfitting towards these entities.
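The removal experiment of Table 4 boils down to dropping, from every split, all facts that mention one of the k highest-degree entities, and then retraining the model from scratch on the reduced splits. A sketch of the filtering step (our own, reusing the `train` list and the `degrees` counter defined in the earlier listings) is shown below.

```python
def drop_top_entities(triples, degrees, k):
    """Remove every fact mentioning one of the k highest-degree entities.
    Applied to the training, validation and test splits before retraining."""
    top = {e for e, _ in degrees.most_common(k)}
    return [(s, p, o) for s, p, o in triples if s not in top and o not in top]

# e.g. the FB15K variants of Table 4 (retraining is performed separately):
for k in (10, 25, 100):
    reduced_train = drop_top_entities(train, degrees, k)
    print(k, "entities removed ->", len(reduced_train), "training facts left")
```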
The above observations on the relevance, fairness and capability to discourage overfitting of the current benchmarking practices have interesting implications for the ethics of information processing. Relying on ineffective benchmarks undermines our capability to assess the quality of the software and of the data it should manage. We have highlighted that the current evaluation practices may fail to detect, or may even penalize, desirable properties of LP models. For instance, the results currently yielded in evaluation cannot tell whether a model is learning a large set of relations or, rather, a narrow set of entities. This can lead to a systematic overlooking of under-represented entities, because systems that actually reason on the entire set of entities can be outranked by models that overfit on a few over-represented ones.

We also point out that the opacity of the current results, computed with global metrics over large batches of test facts, makes it almost impossible to interpret the behaviours of models. This, of course, has negative effects on their explainability and, ultimately, on their trustworthiness.

4 RELATED WORKS

To the best of our knowledge, only a few papers investigate the validity of current LP benchmarks and provide interpretations for the performances of relation-learning models. Works related to ours can be roughly divided into two main categories, depending on whether they address limitations of the standard metrics or of the standard datasets used in this research field. For instance, the already mentioned work by [7] demonstrates that a carefully tuned implementation of DistMult can achieve state-of-the-art performances, surpassing most of its own successors and raising the question of whether we are developing better LP models or just tuning better hyperparameters.

Limitations of standard metrics. The work most similar to ours is [22]: the authors observe that the currently employed metrics tend to be biased, as they are computed using only "positive" test facts, originally belonging to the KG. For instance, if the fact ⟨Barack Obama, place of birth, Honolulu⟩ appears in the test set, the test questions will be ⟨?, place of birth, Honolulu⟩ and ⟨Barack Obama, place of birth, ?⟩. This approach is highly biased, as it only scores triples for which an answer is already known to exist. It is more akin to Question Answering than to Knowledge Graph Completion, because it never tests the plausibility of nonsensical facts, such as ⟨Honolulu, place of birth, ?⟩, or of facts that have no answer, such as ⟨Barack Obama, place of death, ?⟩. They then propose a new testing workload in which all possible pairs of entities are tested for all relationships, in order to check whether any false or nonsensical triples manage to obtain high plausibility scores.

Limitations of standard datasets. Some of the current LP benchmarking workloads have already been put into discussion by a few previous works, to which we refer in our analysis. In general, these works do not aim at a systematic investigation of benchmark properties; rather, they highlight a specific issue, often in the context of presenting a new model or implementation. To the best of our knowledge, [17] has been the first study to openly discuss the limitations of FB15K, demonstrating that it heavily suffers from test leakage: many relationships in this dataset are semantically identical or inverse to others, allowing even a very simple model based on observed features to outperform most embedding-based state-of-the-art ones. The authors then proceeded to extract a more challenging subset of FB15K, called FB15K-237, containing non-trivial facts only. Unfortunately, FB15K-237 has been only partially adopted by the research community, with prominent models such as HolE [10], ComplEx [18] and ANALOGY [9] ignoring it.

Starting from this analysis, [5] have further investigated test leakage in both FB15K and WN18. They have demonstrated that a simple rule-based system relying on inverse relationships can reach state-of-the-art performances on WN18; they have then applied a procedure similar to that of [17] on WN18 to generate its challenging subset WN18-RR.

Other tasks. For the sake of completeness, we also observe that, when proposing new models for LP, many papers analyze their applicability and performances on related tasks too. For instance, the authors of [8] show the performances of their model in relation extraction from text using the NYT-FB dataset [14], where sentences from the New York Times Corpus are annotated with Stanford NER and linked to FreeBase elements. Analyzing the properties of benchmarks for relation extraction tasks is out of the scope of our work.

5 CONCLUSIONS

We have analyzed the current LP benchmarks, observing that the training sets of their datasets display severely skewed distributions in both the degrees of entities and the mentions of relationships. We have experimentally demonstrated that LP models are deeply affected by these unbalanced conditions; nonetheless, these effects go completely unnoticed by the current evaluation workloads, thus casting doubts on their relevance. We have also shown that entities and relationships that are highly mentioned in the training sets tend to be over-represented in the test sets too, affecting the fairness of the evaluation workload. We have finally reported that, when entities with high degree (and thus high performances) are removed, LP models show a counter-intuitive improvement in performances, potentially attributable to overfitting.

Overall, our results raise concerns on the effectiveness of these benchmarks. We demonstrate that relying on global metrics over heavily skewed distributions hinders our understanding of LP models; all in all, our results imply that, in their current state, these benchmarking practices may not be able to capture and fairly measure the capability of relation-learning models to effectively learn relations.

ACKNOWLEDGEMENTS

Work funded in part by Regione Lazio LR 13/08 Project "In Codice Ratio" (14832).

REFERENCES

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A nucleus for a web of open data. In The Semantic Web, pages 722–735. Springer, 2007.
[2] I. Balažević, C. Allen, and T. M. Hospedales. TuckER: Tensor factorization for knowledge graph completion. arXiv preprint arXiv:1901.09590, 2019.
[3] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1247–1250. ACM, 2008.
[4] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pages 2787–2795. NIPS, 2013.
[5] T. Dettmers, P. Minervini, P. Stenetorp, and S. Riedel. Convolutional 2D knowledge graph embeddings. In AAAI Conference on Artificial Intelligence, 2018.
[6] X. Han, S. Cao, X. Lv, Y. Lin, Z. Liu, M. Sun, and J. Li. OpenKE: An open toolkit for knowledge embedding. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 139–144, 2018.
[7] R. Kadlec, O. Bajgar, and J. Kleindienst. Knowledge base completion: Baselines strike back. arXiv preprint arXiv:1705.10744, 2017.
[8] Y. Lin, Z. Liu, M. Sun, Y. Liu, and X. Zhu. Learning entity and relation embeddings for knowledge graph completion. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, volume 15, pages 2181–2187. AAAI, 2015.
[9] H. Liu, Y. Wu, and Y. Yang. Analogical inference for multi-relational embeddings. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 2168–2178. JMLR.org, 2017.
[10] M. Nickel, L. Rosasco, and T. Poggio. Holographic embeddings of knowledge graphs. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[11] H. Paulheim. Knowledge graph refinement: A survey of approaches and evaluation methods. Semantic Web, 8(3):489–508, 2017.
[12] R. Qian. Understand your world with Bing, 2013. Blog post on the Bing Blogs.
[13] R. Richardson, A. F. Smeaton, and J. N. Murphy. Using WordNet as a knowledge base for measuring semantic similarity between words. Technical report, in Proceedings of the AICS Conference, 1994.
[14] S. Riedel, L. Yao, and A. McCallum. Modeling relations and their mentions without labeled text. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 148–163. Springer, 2010.
[15] A. Singhal. Introducing the Knowledge Graph: things, not strings, 2012. Blog post on the Official Google Blog.
[16] F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: a core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, pages 697–706. ACM, 2007.
[17] K. Toutanova and D. Chen. Observed versus latent features for knowledge base and text inference. In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, pages 57–66, 2015.
[18] T. Trouillon, J. Welbl, S. Riedel, É. Gaussier, and G. Bouchard. Complex embeddings for simple link prediction. In International Conference on Machine Learning, pages 2071–2080, 2016.
[19] J. von Kistowski, J. A. Arnold, K. Huppler, K.-D. Lange, J. L. Henning, and P. Cao. How to build a benchmark. 2015.
[20] D. Vrandečić and M. Krötzsch. Wikidata: a free collaborative knowledge base. Communications of the ACM, 57(10):78–85, 2014.
[21] Q. Wang, Z. Mao, B. Wang, and L. Guo. Knowledge graph embedding: A survey of approaches and applications. IEEE Transactions on Knowledge and Data Engineering (TKDE), 29(12):2724–2743, 2017.
[22] Y. Wang, D. Ruffinelli, S. Broscheit, and R. Gemulla. On evaluating embedding models for knowledge base completion. arXiv preprint arXiv:1810.07180, 2018.
[23] B. Yang, W.-t. Yih, X. He, J. Gao, and L. Deng. Embedding entities and relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575, 2014.