Reflections on: Knowledge Graph Fact Prediction via Knowledge-Enriched Tensor Factorization

Ankur Padia, Konstantinos Kalpakis, Francis Ferraro, and Tim Finin
University of Maryland, Baltimore County, Baltimore, MD, USA
{ankurpadia, kalpakis, ferraro, finin}@umbc.edu

Abstract. We present a family of four novel methods for embedding knowledge graphs into real-valued tensors that capture the ordered relations found in RDF. Unlike many previous models, these can easily use prior background knowledge from users or existing knowledge graphs. We demonstrate our models on the task of predicting new facts on eight different knowledge graphs, achieving a 5% to 50% improvement over existing systems. Through experiments, we derived recommendations for selecting the best model based on knowledge graph characteristics. We also give a provably-convergent, linear tensor factorization algorithm.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Knowledge graphs are increasingly important due to their effectiveness in supporting a wide range of applications. While immensely useful in their current state, much work remains to be done to detect and correct the errors they contain and to add missing relations. Representation learning [2] provides a way to augment or reduce reliance on manually constructed ontology axioms and rules by using knowledge graph instances to discover common patterns and then using them to suggest changes to the graph. One popular approach is based on learning how to embed the entities and relations in a graph into a real-valued vector space, allowing the entities and relations to be represented by dense, real-valued vectors. There is considerable interest in understanding how graph embeddings can be used to augment knowledge graphs [11,19].

Current state-of-the-art systems compute embeddings to support a specific task, which might be link ranking (aka link recommendation) or fact prediction (Table 1). Link ranking tries to augment the graph by recommending relations that could hold between a subject–object pair, assuming (1) at least one relation exists between the pair and (2) we can determine a good threshold for choosing links that hold. The fact prediction task identifies potential facts and classifies them as correct or incorrect, learning a model that can find relations that are likely to hold. This task is more specific than link ranking and more directly solves an important problem. Since we are only interested in extending a knowledge graph with relations likely to be true, our approach is designed to solve it directly. Fact prediction models can also be used to filter triples produced by information extraction or inference systems.

Table 1. We focus on a binary classification fact prediction task rather than link ranking.

  Link ranking (aka link prediction or link recommendation; a ranking task)
    Input:   a relation r and an entity e_i, i.e., (e_i, r, ?)
    Output:  a ranking of possible entities e_j
    -- or --
    Input:   an entity pair e_i and e_j, i.e., (e_i, ?, e_j)
    Output:  a ranking of possible relations r
    Example: Where is the Statue of Liberty located?
             (1) Germany (2) United States (3) New York (city) (4) New York (state) (5) Brazil

  Fact prediction (aka link classification or fact classification; a classification task)
    Input:   a triple (aka fact) of e_i, r, and e_j
    Output:  0 (No) or 1 (Yes)
    Example: Is the Statue of Liberty located in Germany? Output: 0 (No)
Embedding entities and relations into a vector space has been shown to achieve good results. Tensor-based approaches like RESCAL [14] jointly learn the latent representation of entities and relations by factorizing the tensor representation of the knowledge graph. This can be further improved by imposing constraints on the factors, such as non-negativity, to achieve better performance on sparse graphs. RESCAL and its variants [9,10,21] have achieved state-of-the-art results in predicting missing relations on real-world knowledge graphs. However, their extensions require additional schema information, which may be unknown or require significant effort to provide. Neural network based approaches like TransE [4] learn an embedding of a knowledge graph by minimizing a ranking loss so that likely links rank higher than unlikely ones.

However, these models do not exploit the similarity among relations when computing embeddings, nor have they studied the role that relation similarities have on regularizing and constraining the relation embeddings and the subsequent effect on fact prediction. We address these deficiencies and make four contributions: we (1) develop a new graph embedding framework exploiting prior knowledge of relation similarity; (2) provide four new embedding models; (3) evaluate the models and previous systems on eight real-world knowledge graphs for fact prediction; and (4) prove convergence for a factorization algorithm matching or outperforming baselines. In this extended abstract, we describe one of the models, the Quadratic+Constraint (QC) model, in detail and refer interested readers to the full paper [18].

2 Similarity-driven knowledge graph embedding

Our approach for similarity-driven knowledge graph embedding relies on learning entity and relation embeddings such that when "combined" (multiplied), the result is a close approximation of the original facts and relation occurrences observed in the graph. We augment the learning process with a relational similarity matrix that provides a holistic judgment of how similar pairs of relations are, adding additional constraints to the learned embeddings.

Fig. 1. Heatmap for WN18RR's similarity matrix using transitivity (relation names abbreviated)

We represent a multi-relational knowledge graph of N_r binary relations among N_e entities by the order-3 tensor X of dimension N_e × N_e × N_r, where each N_e × N_e slice is an adjacency matrix for relation k and has value 1 when the ith entity (e_i) is connected to the jth entity (e_j) by relation k, and 0 otherwise. This binary tensor is often very large and sparse. Our goal is to construct dense, informative p-dimensional embeddings for entities and relations. We represent the collection of p-dimensional entity embeddings by the matrix A of size N_e × p, the relation embeddings by a compact order-3 tensor R of size p × p × N_r, and the similarity matrix by C of size N_r × N_r (symbols are the same as in [18]). Our objective is to reconstruct each of the k relation slices of X as the product

    X_k ≈ A_α R_k A_β^⊤.                                              (1)

Here both A_α and A_β are matrices: each row is the embedding of an entity. By changing the exact form of A—that is, the number of different entity matrices, or the different ways to index A—we can arrive at different models, each encapsulating both mathematical and philosophical differences.
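To make the tensor representation and the reconstruction in Eq. 1 concrete, the following is a minimal Python sketch, not the authors' released code: it builds the binary tensor from a toy set of triples and scores a candidate fact with randomly initialized factors. The toy graph, all names, and the embedding width p are illustrative assumptions, and for simplicity we use a single entity matrix (the quadratic case A_α = A_β = A described below).

    import numpy as np

    # Toy triples; any (subject, relation, object) list would do.
    triples = [("alice", "knows", "bob"), ("bob", "knows", "carol"),
               ("alice", "livesIn", "nyc")]
    entities = sorted({t[0] for t in triples} | {t[2] for t in triples})
    relations = sorted({t[1] for t in triples})
    e_idx = {e: i for i, e in enumerate(entities)}
    r_idx = {r: k for k, r in enumerate(relations)}

    Ne, Nr, p = len(entities), len(relations), 4
    # We index the relation axis first for convenience: X[k] is the
    # Ne x Ne adjacency slice for relation k.
    X = np.zeros((Nr, Ne, Ne))
    for s, r, o in triples:
        X[r_idx[r], e_idx[s], e_idx[o]] = 1.0

    # Randomly initialized factors; fitting them to minimize the
    # reconstruction loss is the subject of the rest of this section.
    rng = np.random.default_rng(0)
    A = rng.normal(size=(Ne, p))        # entity embeddings (one per row)
    R = rng.normal(size=(Nr, p, p))     # compact relation tensor

    def score(s, r, o):
        """Reconstructed entry of Eq. 1: a_i^T R_k a_j."""
        return A[e_idx[s]] @ R[r_idx[r]] @ A[e_idx[o]]

    print(score("alice", "knows", "bob"))

After training, thresholding such scores yields the binary fact prediction decision described in Table 1.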
In this extended abstract we describe just one of the four models, with full details on all available in [18]. We examine the case of having only a single entity embedding matrix, represented as A—that is, A_α = A_β = A. This results in a quadratic reconstruction problem, as we approximate X_k ≈ A R_k A^⊤. We also examine the case of having two separate entity embedding matrices, represented as A_1 and A_2. This results in a reconstruction problem that is linear in the entity embeddings, as we approximate X_k ≈ A_1 R_k A_2^⊤. We learn A and R by minimizing the augmented reconstruction loss

    min_{A,R}  f(A, R) + g(A, R) + f_s(A, R, C),                       (2)

where f is the reconstruction loss, g is the numerical regularization of the embeddings, and f_s is the knowledge-directed enrichment. The first term of (2) reflects each of the k relational criteria given by (1). The second employs standard numerical regularization of the embeddings, such as Frobenius minimization, that enhances the algorithm's numerical stability and supports the interpretability of the resulting embeddings. The third term uses our relational similarity matrix C to enrich the learning process with our extra knowledge.

Computing the relational similarity matrix C. We can view a knowledge graph's nodes and edges as representing a flow of information, with subjects and objects acting as information producers and consumers, respectively. Tensor factorization captures this interaction [14]. Relations that occur on more of the same entities are more likely to have some notion of being similar. Each element of the N_r × N_r matrix C represents the similarity between a pair of relations, i.e., slices X_i and X_j, and is computed using the following equation for all pairs of relations (1 ≤ i, j ≤ N_r):

    C_ij = |(S(X_i) ∪ O(X_i)) ∩ (S(X_j) ∪ O(X_j))| / |(S(X_i) ∪ O(X_i)) ∪ (S(X_j) ∪ O(X_j))|    (Symmetric)    (3)

where S(X_i) is the set of subjects of the slice X_i holding the ith relation, O(X_i) is likewise its set of objects, and |S(X)| gives the cardinality of the set.

We measure relation similarity using the overlap in the entities observed with each relation. In addition to computing similarity matrices based on relation symmetry using Eq. 3, we support four additional measures: (1) Agency: the number of times both relations share the same subject; (2) Patient: the number of times both relations share the same object; (3) Transitivity: the number of times an object of relation i is a subject of relation j; and (4) Reverse Transitivity: the number of times a subject of relation i is an object of relation j. We experimented with all of the similarity functions and report the evaluation results in Section 3. For most of our experiments we used the similarity obtained from transitivity, as we found it gave the best overall performance. Figure 1 shows the computed similarity matrix for the WordNet dataset. A sketch of this computation appears below.
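As a concrete illustration, here is a short Python sketch of computing C; it is our own reconstruction, not the released code. Eq. 3 fixes only the symmetric measure; for the other four measures we reuse the same Jaccard-style normalization, which is an assumption (the exact normalization is given in [18]).

    import numpy as np

    def subjects(Xk):
        return set(np.nonzero(Xk.sum(axis=1))[0])   # entities with outgoing edges

    def objects(Xk):
        return set(np.nonzero(Xk.sum(axis=0))[0])   # entities with incoming edges

    def similarity_matrix(X, mode="symmetric"):
        """C[i, j] for each pair of relation slices X[i], X[j]."""
        Nr = X.shape[0]
        C = np.zeros((Nr, Nr))
        for i in range(Nr):
            for j in range(Nr):
                if mode == "symmetric":          # entity overlap, Eq. 3
                    a = subjects(X[i]) | objects(X[i])
                    b = subjects(X[j]) | objects(X[j])
                elif mode == "agency":           # shared subjects
                    a, b = subjects(X[i]), subjects(X[j])
                elif mode == "patient":          # shared objects
                    a, b = objects(X[i]), objects(X[j])
                elif mode == "transitivity":     # objects of i that are subjects of j
                    a, b = objects(X[i]), subjects(X[j])
                else:                            # reverse transitivity
                    a, b = subjects(X[i]), objects(X[j])
                C[i, j] = len(a & b) / len(a | b) if (a | b) else 0.0
        return C

Note that the transitivity and reverse-transitivity measures are asymmetric, so the resulting C need not be a symmetric matrix.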
The QC model: Quadratic+Constraint. The QC model casts the decomposition of the order-3 tensor X into a compact relational tensor R and a quadratic entity matrix A, and solves the following problem:

    min_{A,R_k}  f(A, R_k) + g(A, R_k) + f_Lag(R, C)                   (4)

The terms of our objective are expressed as follows:

    f(A, R_k)   = (1/2) Σ_k ||X_k − A R_k A^⊤||_F^2                    (5)
    g(A, R_k)   = (1/2) λ_a ||A||_F^2 + (1/2) λ_r Σ_k ||R_k||_F^2      (6)
    f_Lag(R, C) = Σ_i Σ_j λ_ij (1 − C_ij − ||R_i − R_j||_F^2)          (7)

Here A is an N_e × p matrix in which each row represents an entity embedding, and R_k is a p × p matrix representing the embedding of the kth relation, capturing the interaction between the entities. The first term f forces the reconstruction to be similar to the original tensor, and the second regularizes the unknowns A and R_k to avoid overfitting. The f_Lag term represents the model's knowledge-directed enrichment component. To incorporate the similarity constraints, we solve the dual objective for the constraint in Eq. 8, so that relations with high similarity are near one another. We convert the constrained problem into an unconstrained one via the Lagrange multipliers λ_ij:

    ||R_i − R_j||_F^2 = 1 − C_ij,   1 ≤ i, j ≤ N_r.                    (8)

Computing the factor matrices A and R_k and the Lagrange multipliers λ_ij. We compute the unknown factor matrices using Adam [8], a stochastic gradient descent extension. The unknowns are updated in an alternating fashion, in which each parameter is updated while treating the others as constants. The model's unknown parameters in A and R_k are updated with different learning rates. We empirically found that the error value of the objective function decreases after a few iterations. Taking the partial derivative of Eq. 4 with respect to A and equating it to zero, we obtain the update rule for A. Since we are only indirectly constraining the embeddings of A through the slices of the compact relation tensor R, we obtain the same update rule for A as in RESCAL. By equating the partial derivatives of Eq. 4 with respect to the unknowns R_k and λ_ij to zero, we obtain updates for R_k and λ_ij as described in [18].
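The following PyTorch sketch shows one way to realize this scheme. It is a simplification, not the authors' implementation: [18] derives closed-form alternating updates, while here we simply let Adam update A and the R_k (with different learning rates, as described above) and apply a plain dual-ascent step to the multipliers λ_ij on the residual of Eq. 8. All sizes, learning rates, and the random data are illustrative assumptions.

    import torch

    def constraint_residual(R, C):
        # D[i, j] = ||R_i - R_j||_F^2 for every pair of relation slices.
        D = torch.cdist(R.flatten(1), R.flatten(1)) ** 2
        return (1.0 - C) - D                      # zero when Eq. 8 holds

    def qc_loss(X, A, R, lam, C, lam_a=0.1, lam_r=0.1):
        f = 0.5 * sum(torch.norm(X[k] - A @ R[k] @ A.T) ** 2
                      for k in range(len(X)))                              # Eq. 5
        g = 0.5 * lam_a * A.norm() ** 2 + 0.5 * lam_r * R.norm() ** 2      # Eq. 6
        f_lag = (lam * constraint_residual(R, C)).sum()                    # Eq. 7
        return f + g + f_lag

    Nr, Ne, p = 3, 20, 3                          # toy sizes; the paper sets p = Nr
    X = (torch.rand(Nr, Ne, Ne) < 0.05).float()   # stand-in binary tensor
    C = torch.eye(Nr)                             # stand-in similarity matrix
    A = torch.randn(Ne, p, requires_grad=True)
    R = torch.randn(Nr, p, p, requires_grad=True)
    lam = torch.zeros(Nr, Nr)                     # Lagrange multipliers

    # Different learning rates for entity and relation parameters, as above.
    opt = torch.optim.Adam([{"params": [A], "lr": 1e-2},
                            {"params": [R], "lr": 1e-3}])
    for step in range(200):
        opt.zero_grad()
        qc_loss(X, A, R, lam, C).backward()
        opt.step()                                        # primal step on A, R_k
        with torch.no_grad():                             # dual ascent on lambda_ij,
            lam += 1e-3 * constraint_residual(R, C)       # pushing toward Eq. 8

Treating λ_ij as learned quantities, rather than a single fixed penalty weight, is what gives the QC model the flexible per-pair penalization discussed in Section 3.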
3 Experimental evaluation

We evaluated the performance of the entity and relation embeddings on the fact prediction task, which distinguishes correct from incorrect triples, and compared the results with the state-of-the-art tensor decomposition models RESCAL and NN-RESCAL, and with two popular neural-based benchmarks, DistMult (which constrains the relation embedding matrices to be diagonal) and ComplEx (which represents entities and relations in a complex vector space). Here we present results only for the tensor-based models; additional evaluation data and comparisons with the neural-based models are in [18].

Datasets. Table 2 summarizes the key statistics of the datasets: the number of entities (N_e), relations (N_r), and facts (non-zero entries in the tensor), the average degree of entities across all relations (the ratio of facts to entities), and the graph density (the number of facts divided by the square of the number of entities). A smaller average degree or graph density indicates that the knowledge graph is sparser.

Table 2. Statistics of the eight datasets used in evaluation experiments

  Name      # Entities  # Relations  # Facts   Avg. Degree  Graph Density
  Kinship          104           26   10,686        102.75        0.98798
  UMLS             135           49    6,752         50.01        0.37048
  FB15-237      14,541          237  310,116         21.32        0.00147
  DB10k          4,397          140   10,000          2.27        0.00052
  FrameNet      22,298           16   62,344          2.79        0.00013
  WN18          40,943           18  151,442          3.70        0.00009
  FB13          81,061           13  360,517          4.45        0.00005
  WN18RR        40,943           11   93,003          2.27        0.00005

We used eight datasets in our evaluation, including both previous graph-embedding benchmarks and new ones. They include (1) Kinship [7], which encodes complex family relationships among 104 members of a tribe; (2) UMLS [7], a biomedical dataset based on the Unified Medical Language System; (3, 4) WN18 [3] and WN18RR [5] (WN18 with reverse relations removed), linguistic datasets of relations between words such as hypernym, holonym, meronym, and hyponym; (5, 6, 7) three general world knowledge graphs: FB13 [3], DB10k, and FB15-237, which contains more relations than FB13; and (8) FrameNet [1], a lexical database describing how language can be used to evoke complex representations of frames describing events, relations, or objects and their participants. For example, the Commerce buy frame represents the interrelated concepts surrounding stereotypical commercial transactions. Frames have roles for expected participants (e.g., Buyer, Goods, Seller), modifiers (e.g., Imposed purpose and Period of iterations), and inter-frame relations defining inheritance and usage hierarchies (e.g., Commerce buy inherits from the more general Getting and is inherited by the more specific Renting).

Tensor creation, parameter selection, and performance metric. We created a 0-1 tensor for each dataset, as described in [18]. If entity s has relation r with entity o, then the value of the (s, r, o) entry in the tensor is set to 1; otherwise it is set to 0. Each of the created tensors is used to generate a slice-similarity matrix using Eq. 3. We fix the parameters for the different datasets using coordinate descent, changing only one hyperparameter at a time and always making a change from the best configuration of hyperparameters found so far. The latent dimension is set equal to the number of relations. We use the same performance metric as RESCAL on three samples: stratified-uniform (sampling 60% correct and 40% incorrect triples from each relation), stratified-weighted (sampling 60% correct and 40% incorrect triples from the public dataset, in which a few relations are mentioned more frequently than others), and balanced-weighted (the dataset mentioned in [20]). A sketch of the scoring and AUC computation follows.
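This is a minimal sketch of the evaluation step, assuming learned factors A and R: each candidate triple (e_i, r_k, e_j) is scored by the reconstruction a_i^⊤ R_k a_j and AUC is computed against the 0/1 labels. The random factors, test triples, and labels here are stand-ins for a trained model and a sampled test set, and scikit-learn's roc_auc_score is our choice of AUC implementation.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def fact_scores(A, R, test_triples):
        # Each test triple is an (i, k, j) index tuple; score = a_i^T R_k a_j.
        return np.array([A[i] @ R[k] @ A[j] for i, k, j in test_triples])

    rng = np.random.default_rng(0)
    Ne, Nr, p = 50, 5, 5
    A = rng.normal(size=(Ne, p))                   # stand-in learned factors
    R = rng.normal(size=(Nr, p, p))
    test = [(rng.integers(Ne), rng.integers(Nr), rng.integers(Ne))
            for _ in range(100)]                   # stand-in sampled test triples
    labels = rng.integers(0, 2, size=100)          # 1 = correct fact, 0 = incorrect
    print("AUC:", roc_auc_score(labels, fact_scores(A, R, test)))

Because AUC is threshold-free, it measures how well the reconstruction scores separate correct from incorrect triples without committing to a particular classification cutoff.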
Table 3. Fact prediction AUC performance for all models. ★ indicates the best overall model. LR and LC are the linear regularized/constrained models; QR and QC are the quadratic regularized/constrained models.

  Model Name  Kinship   UMLS   WN18   FB13   DB10  Framenet  WN18RR  FB15-237
  Previous tensor factorization models
  RESCAL        93.24  88.53  62.13  65.37  61.27     82.54   66.63     92.56
  NN-RESCAL     92.19  88.37  83.93  79.13  81.72     82.60   68.49     93.03
  Linear/Quadratic Regularized/Constrained tensor factorization models
  LR            93.99  88.22  81.86  80.07  80.79     78.11   69.15     90.00
  QR            93.89  88.11  84.41  79.12  80.47     82.34   66.73     93.07
  LC            92.87  84.71  80.18  75.79  80.67     73.64   66.46     81.88
  QC ★          93.84  86.17  91.07  85.15  81.69     86.24   72.62     86.47

Results and discussion. In this section we analyze the results of our models, including a quantitative comparison with other tensor-based models and the impact of knowledge graph sparsity. Table 3 reports the AUC performance of all models; our models improve over the baselines by 5% to 50%.

The Kinship and UMLS datasets have a significantly higher graph density compared to the others, as shown in Table 2. Combining this observation with the results in Table 3, we notice that graphs with lower density result in larger performance variability across both the baseline systems and our models. This suggests that when learning knowledge graph embeddings on dense graphs, basic tensor methods like RESCAL can give acceptable performance, but that for lower density graphs, different embedding learning mechanisms may be better. Focusing on the datasets with lower graph density, we see that while the LC and LR models often matched or surpassed RESCAL, they achieved comparable or lower performance than their corresponding quadratic models (QC and QR). This is due to the fact that the distinction between subject and object made by the A_1 and A_2 embeddings tends not to hold in many of the standard datasets. That is, objects can behave as subjects (and vice versa), as in the WN18 dataset. Hence the distinction between the subject and the object may not always be needed.

The performance difference between the quadratic and linear versions is high for WN18 and FB13, though the difference is relatively small for DB10k. This is largely because the DBpedia dataset includes many datatype properties, i.e., properties whose values are literals rather than entities. In most cases, non-negative RESCAL outperforms the linear models. The QC model significantly outperforms RESCAL and performs relatively better than our other three models, emphasizing the importance of the flexible penalization that the Lagrange multipliers provide. Compared to all others, the QC model performs better in most cases, since the Lagrange multipliers introduce flexibility in penalizing the latent relational embeddings while learning. We also ran significance tests using a Wilcoxon signed-rank test for all algorithms and datasets at a significance level of 1% and found the QC model to perform better than the others. In summary, both the quadratic and linear models are important depending on the data, with the QC model performing the best overall and the linear models performing comparably.

Effect of knowledge graph density. We explored how our models handle knowledge graphs of different density by reducing the number of subjects while keeping the objects constant, making the graph sparser. We used the FB13 dataset, which has nearly 16K objects and 76K subjects, so each object entity is connected to nearly five subjects on average. Figure 2 shows the behavior of the different tensor-based models as subjects are removed in steps of 2%, with higher values corresponding to more subjects removed. We see that removing subjects gradually reduces performance. Compared to the other models, LC improves significantly faster when subjects are added back, irrespective of the similarity metric, eventually achieving performance comparable with the other tensor-based models. The QC model performs the best irrespective of the graph's density. A sketch of this subject-removal procedure appears below.

Fig. 2. Change in performance with sparse knowledge graph
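This is a sketch of the sparsification procedure under our own reading of the setup: a growing fraction of subject rows is zeroed out of every relation slice in 2% steps, while object columns are left untouched. The retraining and AUC measurement at each level are left as a comment, since they reuse the pipeline sketched earlier; the toy tensor and step range are illustrative.

    import numpy as np

    def drop_subjects(X, fraction, rng):
        """Zero out a random `fraction` of subject rows in every relation slice."""
        subjects = np.nonzero(X.sum(axis=(0, 2)))[0]   # entities with outgoing edges
        n_drop = int(fraction * len(subjects))
        dropped = rng.choice(subjects, size=n_drop, replace=False)
        X_sparse = X.copy()
        X_sparse[:, dropped, :] = 0.0                  # object columns stay untouched
        return X_sparse

    rng = np.random.default_rng(0)
    X = (rng.random((5, 100, 100)) < 0.05).astype(float)   # toy binary tensor
    for frac in np.arange(0.0, 0.5, 0.02):                 # 2% step size
        X_f = drop_subjects(X, frac, rng)
        # ...retrain each model on X_f and record its fact prediction AUC here...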
Effect of similarity encoding. Figure 3 shows the relative change in performance for each similarity metric compared to RESCAL, grouped by how we encode the knowledge. The gray boxes show the percent change of the two RESCAL versions. Most encoding approaches perform well, but the encoding can yield a significant performance gain for certain datasets. In DB10 (top left), the agency and symmetric encodings give poor performance with LR; changing the encoding to transitivity or reverse transitivity yields a large performance gain. On the other hand, for WN18RR both transitivity and reverse transitivity perform poorly with the LR model. The LC model performs similarly for all kinds of encoding. Moreover, QC performs consistently well compared to all the baselines without being affected by the similarity encoding. While we find that different kinds of similarity encoding can, and do, influence performance, the size of the effect depends on the dataset: whether an encoding uses a symmetric or transitive approach may be less important than whether accurate knowledge is encoded at all. The knowledge enrichment that the encoding provides can result in effective model generalization beyond what simple, knowledge-poor regularizations, like Frobenius norm regularization, give.

Fig. 3. % AUC change over RESCAL for five different similarity encodings

4 Conclusions and future work

We described a new framework for learning knowledge-enriched entity and relation embeddings, and four readily obtainable models that generalize existing efforts and demonstrate significant improvements over both state-of-the-art tensor decomposition and neural-based translation models. We motivated and empirically explored different methods for encoding prior knowledge into the tensor factorization algorithm, finding that using transitive relationship chains resulted in the highest overall performance. We further characterized the conditions under which each model performs well and concluded that the QC model is typically the best choice. Finally, we proved in [18] that the LR model has the desirable property of convergence.

Our future work will use the KGFP framework for fact prediction in three different scenarios: (1) improving information extraction in tasks like the TAC Knowledge Base Population [6]; (2) enhancing the knowledge graphs used in systems [13,12] that identify possible cybersecurity attacks; and (3) cleaning noisy knowledge graphs [16,17] by identifying and possibly correcting errors. The evaluation datasets and some code will be available at the KGFP repository [15].

References

1. Baker, C.F., Fillmore, C.J., Lowe, J.B.: The Berkeley FrameNet project. In: ACL. pp. 86–90. ACL (1998)
2. Bengio, Y., Courville, A., Vincent, P.: Representation learning: A review and new perspectives. Transactions on Pattern Analysis and Machine Intelligence 35(8) (2013)
3. Bordes, A., Glorot, X., Weston, J., Bengio, Y.: A semantic matching energy function for learning with multi-relational data. Machine Learning pp. 233–259 (2014)
4. Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., Yakhnenko, O.: Translating embeddings for modeling multi-relational data. In: NIPS. pp. 2787–2795 (2013)
5. Dettmers, T., Minervini, P., Stenetorp, P., Riedel, S.: Convolutional 2D knowledge graph embeddings. arXiv preprint arXiv:1707.01476 (July 2018), extended AAAI18 paper
6. Finin, T., Lawrie, D., McNamee, P., Mayfield, J., Oard, D., Peng, N., Gao, N., Lin, Y.C., MacKin, J., Dowd, T.: HLTCOE participation in TAC KBP 2015: Cold Start and TEDL. In: 8th Text Analysis Conf. NIST (November 2015)
7. Kemp, C., Tenenbaum, J.B., Griffiths, T.L., Yamada, T., Ueda, N.: Learning systems of concepts with an infinite relational model. In: AAAI. pp. 381–388. AAAI (2006)
8. Kingma, D., Ba, J.: Adam: A method for stochastic optimization. In: Third Int. Conf. on Learning Representations (December 2014)
9. Krompass, D., Baier, S., Tresp, V.: Type-constrained representation learning in knowledge graphs. In: Int. Semantic Web Conf. Springer (2015)
10. Krompass, D., Nickel, M., Jiang, X., Tresp, V.: Non-negative tensor factorization with RESCAL. In: Tensor Methods for Machine Learning, ECML Workshop (2013)
11. Meilicke, C., Fink, M., Wang, Y., Ruffinelli, D., Gemulla, R., Stuckenschmidt, H.: Fine-grained evaluation of rule- and embedding-based systems for knowledge graph completion. In: International Semantic Web Conference. pp. 3–20. Springer (2018)
12. Mittal, S., Joshi, A., Finin, T.: Cyber-All-Intel: An AI for security related threat intelligence. arXiv preprint arXiv:1905.02895 (2019)
13. Narayanan, S.N., Ganesan, A., Joshi, K.P., Oates, T., Joshi, A., Finin, T.: Early detection of cybersecurity threats using collaborative cognition. In: Int. Conf. on Collaboration and Internet Computing. IEEE (2018)
14. Nickel, M., Tresp, V., Kriegel, H.P.: A three-way model for collective learning on multi-relational data. In: 28th Int. Conf. on Machine Learning (ICML-11). pp. 809–816 (2011)
15. Padia, A.: KGFP repository. https://github.com/Ebiquity/KGFP.git
16. Padia, A.: Cleaning noisy knowledge graphs. In: Doctoral Consortium at the 16th Int. Semantic Web Conf. vol. 1962. CEUR Workshop Proceedings (October 2017)
17. Padia, A., Ferraro, F., Finin, T.: KGCleaner: Identifying and correcting errors produced by information extraction systems. arXiv preprint arXiv:1808.04816 (August 2018)
18. Padia, A., Kalpakis, K., Ferraro, F., Finin, T.: Knowledge graph fact prediction via knowledge-enriched tensor factorization. Journal of Web Semantics (January 2019)
19. Ristoski, P., Rosati, J., Di Noia, T., De Leone, R., Paulheim, H.: RDF2Vec: RDF graph embeddings and their applications. Semantic Web (2019)
20. Socher, R., Chen, D., Manning, C.D., Ng, A.: Reasoning with neural tensor networks for knowledge base completion. In: Advances in Neural Information Processing Systems. pp. 926–934 (2013)
21. Yang, B., Yih, W.T., He, X., Gao, J., Deng, L.: Embedding entities and relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575 (2014)