Using Joint Tensor Decomposition on RDF Graphs

Michael Hoffmann
AKSW Group, Leipzig, Germany
michoffmann.potsdam@gmail.com

Abstract. The decomposition of tensors has on multiple occasions shown state-of-the-art performance on classical Semantic Web tasks such as link prediction. However, the structure of the LOD cloud has not been taken into account in these developments so far. In particular, the decentralized architecture underlying Linked Data suggests that the incorporation of distributed information into these decompositions might lead to better factorization results. The goal of this thesis is hence to develop and evaluate models for joint tensor factorization on RDF graphs that arise from parameter estimation in statistical models or from algebraic approaches. In this paper, we focus on presenting the problem we address formally and on discussing preliminary experiments on semi-synthetic data. Our experiments yield promising results in spite of being derived from an ad-hoc approach.

1 Problem Statement

The Semantic Web landscape is populated by large interlinked RDF graphs containing partly redundant and often complementary knowledge. Since many knowledge graphs contain subgraphs of information on the same entities, we argue that it would be desirable to obtain a more complete picture of the relations between these resources while performing typical semantic tasks like knowledge extraction, entity matching or link prediction. Embedding the entities of those graphs into latent spaces that consider relations over multiple knowledge bases opens up new avenues for improvement on established methods for this family of semantic tasks, while simultaneously lending itself to new tasks like the generation of synthetic knowledge graphs for benchmarking or the detection of graph patterns. On the other hand, latent feature models have shown state-of-the-art performance on the aforementioned tasks, especially when processing very large graphs [13][7]. However, the state of the art focuses on one graph at a time and commonly does not take the topology of the Linked Data Web into consideration.

The aim of this thesis is to develop factorization approaches that can efficiently process groups of graphs by performing the decompositions of their adjacency tensors jointly, and to apply these decompositions in knowledge extraction and enrichment tasks. The intuition underlying this work is that by extracting latent features with respect to several graphs concurrently, we implicitly allow information to permeate the boundaries of single graphs and thus achieve more accurate factorizations. To achieve this goal and jointly factorize even the largest knowledge bases, we aim to develop scalable and time-efficient approaches that are able to incorporate meta-information such as literals and links between entities in different knowledge bases, while also hailing from sensible and interpretable models. Thus, this thesis will focus on the research of joint decompositions of knowledge base adjacency tensors and their applications to semantic tasks like the detection of graph patterns, in order to achieve better results in real-world applications like query answering systems.

2 Relevancy

The results of this research could lead to latent feature models that incorporate information that is spread out globally over multiple RDF graphs.
A large number of tasks (especially tasks which require abstract representations of resources) can profit from these results, including (1) question answering, (2) structured machine learning, (3) named entity linking and (4) relation extraction. For example, question answering frameworks could use the latent features to map slots in a natural-language query to entries in a SPARQL query. In relation extraction, the latent features of resources could potentially improve the selection of sentences for extracting RDF triples from text. Named entity linking could profit from the latent features we generate by providing numeric hints towards entity mentions in natural language. The runtime and accuracy of the provided solutions will be the deciding factors pertaining to their adoption by the Semantic Web community and the corresponding industry. Hence, we aim to develop time-efficient methods and apply them to the aforementioned tasks. Furthermore, to reason about these algorithms, it would be desirable to derive them explicitly from model hypotheses on the structure of real-world knowledge bases. Thus, we will develop factorization models that abide by the characteristics of Linked Data. We hope that our results will play a key role towards a new generation of tools for processing large amounts of interlinked data.

3 Related Work

There is a wealth of knowledge on tensor factorization in the literature. For the seminal survey on this topic, see Kolda and Bader [6]. On the topic of single tensor factorization, most models are based on either the DEDICOM [2] or the CP [3] family of decompositions. Nickel et al. proposed a relaxed DEDICOM model, referred to as RESCAL [12], that is shown to be highly scalable [13] while still exhibiting state-of-the-art performance on smaller datasets. Building on this, London et al. [10] suggested a similar model, adding a bias term and using the learned model as input for a neural network. Kolda [5] proposed a CP model to identify citation communities and measure document similarity. Kuleshov et al. [8] proposed to find the CP factors of their model via simultaneous matrix diagonalization. On the topic of joint tensor factorization, Khan et al. [4] proposed a Bayesian model for the problem of multi-tensor factorization. They relaxed the constraints of the problem so as to use normal distributions for the inference of the CP factors. The joint factorization of matrices and tensors has also shown improvements over singular approaches, see [1].

4 Research Questions

To address the issues stated above, we plan to analyze and generalize existing approaches and to use the insights gained there to develop new techniques and benchmark their performance on knowledge extraction and integration tasks against the current state of the art. As such, the research questions and corresponding subquestions at the heart of our work are as follows:

– "What existing approaches can be applied to joint factorization?"
  • Which existing approaches can be interpreted as stemming from statistical models? By increasing the sample size, one could generalize them to multiple knowledge bases.
  • Can we derive update rules that are computable on large datasets, by combining different cost functions or in a distributed manner?
– "How can we use the resulting embeddings to improve upon existing tasks, and what new avenues can be explored?"
  • Can one improve the performance of the translation of natural-language queries into formal query languages (SPARQL) by using embeddings that respect the spread of RDF data on the Web?
  • Can we use these latent embeddings for the generation of synthetic graphs which still capture the structure of RDF data on the Web? This will be useful for benchmarking various querying systems and the like.

5 Hypotheses

The basic hypothesis behind our work is that exploiting the architecture underlying the Linked Data Web can lead to the computation of better latent embeddings. We hence aim to develop novel methods for tensor factorization that make use of distributed information to compute embeddings. The approaches we develop will aim to be time-efficient and easy to distribute, so as to scale to billions of triples. Moreover, we will consider datasets that are updated regularly and aim to derive update rules or modularization techniques that ensure an efficient update of the latent features.

6 Preliminary Results

To begin testing the potential of a simple variant of joint decomposition, we applied the proven RESCAL-ALS [12] method to a data dump of the Side Effect Resource SIDER (http://linkeddatacatalog.dws.informatik.uni-mannheim.de/dataset/fu-berlin-sider/resource/ee6c1709-f2a6-4790-8549-152232dd9d93). From that data, we built an adjacency tensor that does not differentiate between literals and entities. We arrived at a full tensor of shape 11 × 30378 × 30378 that models the complete knowledge contained in the RDF graph. To model delocalized information, we produced two more tensors X, Y of the same shape by twice deleting a percentage p of entities from the full tensor, taking care that entities deleted in X are present in Y and vice versa; this ensures that no global information is deleted. The task at hand was to reconstruct the global knowledge contained in the full tensor based exclusively on the partial knowledge contained in X and Y. To that end, we searched for solutions of the RESCAL form that solve the problem

$$(\tilde{R}, \tilde{A}) = \operatorname*{arg\,min}_{(R,A)} \left( \lVert ARA^{t} - X \rVert_2^2 + \lVert ARA^{t} - Y \rVert_2^2 + \operatorname{reg}(A, R) \right) \qquad (1)$$

with $A \in \mathbb{R}^{30378 \times r}$, $R \in \mathbb{R}^{11 \times r \times r}$ for some $r < 30378$, and $\operatorname{reg}(A, R) = \lambda \left( \lVert A \rVert_2^2 + \lVert R \rVert_2^2 \right)$. Note that the solutions of (1) are exactly the solutions of

$$(\tilde{R}, \tilde{A}) = \operatorname*{arg\,min}_{(R,A)} \left( \left\lVert ARA^{t} - \tfrac{X + Y}{2} \right\rVert_2^2 + \operatorname{reg}(A, R) \right), \qquad (2)$$

which can be seen by summing the identity

$$\left( (ARA^{t})_{ijk} - X_{ijk} \right)^2 + \left( (ARA^{t})_{ijk} - Y_{ijk} \right)^2 = 2 \left( (ARA^{t})_{ijk} - \tfrac{X_{ijk} + Y_{ijk}}{2} \right)^2 + \tfrac{1}{2} \left( X_{ijk} - Y_{ijk} \right)^2$$

over all indices: the two objectives differ only by a constant that does not depend on A and R. This allows us to approximate solutions of (1) with the RESCAL-ALS algorithm developed in [12].
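To make this setup concrete, the following is a minimal, dense NumPy sketch of the whole procedure on a toy-sized tensor; it is not the code behind the reported numbers. The construction of X and Y, the toy dimensions, and all names (rescal_als, gone_x, ...) are illustrative assumptions; the ALS updates follow the standard RESCAL-ALS scheme of [12], with an unregularized least-squares R-step for brevity.

```python
import numpy as np

def rescal_als(X, rank, lam=0.1, iters=30, seed=0):
    """Minimal RESCAL-ALS sketch: X is a list of k dense (n x n) slices."""
    n = X[0].shape[0]
    A = np.random.default_rng(seed).standard_normal((n, rank))
    for _ in range(iters):
        # R-step: per-slice least squares, R_k = A^+ X_k (A^+)^T.
        P = np.linalg.pinv(A)
        R = [P @ Xk @ P.T for Xk in X]
        # A-step: standard regularized RESCAL update (Nickel et al. [12]).
        AtA = A.T @ A
        num = sum(Xk @ A @ Rk.T + Xk.T @ A @ Rk for Xk, Rk in zip(X, R))
        den = sum(Rk @ AtA @ Rk.T + Rk.T @ AtA @ Rk for Rk in R) + lam * np.eye(rank)
        A = np.linalg.solve(den.T, num.T).T  # A = num @ den^{-1}
    return A, R

# Semi-synthetic setup: delete a fraction p of the entities twice, with
# disjoint deletion sets, to obtain the partial tensors X and Y.
rng = np.random.default_rng(1)
n, k, p = 300, 11, 0.2  # toy dimensions; the paper uses n = 30378
full = [(rng.random((n, n)) < 0.01).astype(float) for _ in range(k)]
perm = rng.permutation(n)
gone_x, gone_y = perm[: int(p * n)], perm[int(p * n): 2 * int(p * n)]
X = [s.copy() for s in full]
Y = [s.copy() for s in full]
for s in X:
    s[gone_x, :] = 0.0
    s[:, gone_x] = 0.0
for s in Y:
    s[gone_y, :] = 0.0
    s[:, gone_y] = 0.0

# Joint factorization via eq. (2): fit the averaged tensor (X + Y) / 2.
Z = [(x + y) / 2.0 for x, y in zip(X, Y)]
A, R = rescal_als(Z, rank=20)
```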
To measure the goodness of fit, we first chose a rounding threshold t and rounded the entries of ARA^t against it. We then calculated the F1-measure (F1), precision (P) and recall (R) of the rounded ARA^t with respect to the full tensor and compared the results for different choices of t, r and p. We implemented the algorithm in Python, and all experiments were performed on an Intel Core i5-6500 CPU @ 3.20GHz × 4 with 16GB of RAM. As a baseline, we chose the best performing RESCAL fit of the incomplete tensors X and Y. Table 1 holds the best results of our baseline tests; all values in Tables 1-3 are rounded to two decimal places.

Table 1. RESCAL baseline (t = .2)

p      r = 5              r = 10             r = 15             r = 20
       F1    P     R      F1    P     R      F1    P     R      F1    P     R
10%    0.41  0.45  0.39   0.45  0.49  0.41   0.48  0.51  0.45   0.49  0.52  0.46
20%    0.37  0.46  0.31   0.40  0.49  0.34   0.42  0.63  0.31   0.40  0.51  0.33
30%    0.32  0.46  0.25   0.34  0.49  0.26   0.36  0.52  0.28   0.37  0.53  0.29

To enable a fine-grained comparison, Table 2 reports the joint approach for all values of t. As one can see, the joint approach outperforms the baseline by as much as 25% on the parameter pair (p = 30%, r = 15).

Table 2. Joint RESCAL factorization

p     t     r = 5              r = 10             r = 15             r = 20
            F1    P     R      F1    P     R      F1    P     R      F1    P     R
10%   .8    0.13  0.96  0.07   0.19  0.96  0.10   0.21  0.97  0.12   0.22  0.97  0.13
      .7    0.16  0.93  0.09   0.23  0.93  0.13   0.26  0.94  0.15   0.27  0.94  0.16
      .6    0.21  0.88  0.12   0.29  0.88  0.17   0.31  0.89  0.19   0.32  0.90  0.20
      .5    0.26  0.82  0.16   0.34  0.82  0.22   0.37  0.83  0.24   0.39  0.85  0.25
      .4    0.34  0.74  0.21   0.41  0.74  0.29   0.44  0.76  0.31   0.46  0.78  0.33
      .3    0.40  0.63  0.30   0.46  0.65  0.36   0.49  0.67  0.39   0.51  0.69  0.41
      .2    0.44  0.49  0.40   0.48  0.53  0.45   0.51  0.55  0.48   0.53  0.56  0.50
20%   .8    0.05  0.95  0.03   0.08  0.95  0.04   0.09  0.96  0.05   0.10  0.96  0.05
      .7    0.08  0.92  0.04   0.09  0.91  0.06   0.12  0.93  0.06   0.14  0.93  0.07
      .6    0.11  0.87  0.06   0.14  0.86  0.08   0.17  0.88  0.09   0.18  0.89  0.10
      .5    0.16  0.81  0.09   0.20  0.81  0.12   0.23  0.84  0.13   0.25  0.85  0.15
      .4    0.25  0.76  0.15   0.31  0.77  0.19   0.35  0.80  0.22   0.35  0.81  0.23
      .3    0.32  0.67  0.21   0.38  0.68  0.27   0.42  0.72  0.30   0.44  0.73  0.32
      .2    0.40  0.54  0.31   0.45  0.56  0.37   0.48  0.60  0.40   0.49  0.61  0.42
30%   .8    0.04  0.96  0.02   0.05  0.96  0.03   0.06  0.96  0.03   0.07  0.97  0.04
      .7    0.05  0.93  0.03   0.07  0.93  0.04   0.08  0.93  0.04   0.09  0.94  0.05
      .6    0.08  0.90  0.04   0.10  0.89  0.05   0.11  0.90  0.06   0.12  0.91  0.06
      .5    0.11  0.85  0.06   0.14  0.85  0.08   0.17  0.87  0.09   0.18  0.88  0.10
      .4    0.18  0.81  0.10   0.24  0.83  0.15   0.28  0.85  0.17   0.30  0.86  0.19
      .3    0.25  0.72  0.15   0.32  0.74  0.20   0.37  0.78  0.24   0.40  0.79  0.26
      .2    0.35  0.60  0.25   0.41  0.63  0.30   0.45  0.64  0.34   0.46  0.67  0.35

The large changes of the F1-measure in the joint approach around t = 0.5 can be explained by the fact that the algorithm approximates the tensor (X + Y)/2. Note that

$$\left( \tfrac{X + Y}{2} \right)_{ijk} = \begin{cases} 0 & \text{if } X_{ijk} = Y_{ijk} = 0 \\ 1 & \text{if } X_{ijk} = Y_{ijk} = 1 \\ \tfrac{1}{2} & \text{if } X_{ijk} \neq Y_{ijk} \end{cases} \qquad (3)$$

Thus, by rounding up at t = 0.5, one begins to recover the values that were correctly learned to be 1/2, among others. To remedy that situation, we used the ad-hoc approach of learning the tensor $\max(X, Y)_{ijk} := \max(X_{ijk}, Y_{ijk})$ under the same quadratic loss function, using the same RESCAL-ALS algorithm. (Note that, by our synthetic construction of X and Y, max(X, Y) represents the full knowledge graph, so the parameter p is irrelevant here.) A sample of the obtained results is contained in Table 3.

Table 3. RESCAL on max(X, Y)

t     r     F1    P     R
.2    5     0.46  0.46  0.46
.2    10    0.49  0.49  0.49
.3    15    0.53  0.63  0.45
.3    20    0.54  0.65  0.47

As one can see, the results improved compared with learning the arithmetic mean. This hints that there is improvement to be had by reweighting the entries in the joint decomposition, which we will explore in our future work.
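Continuing the toy sketch above, one possible implementation of the threshold-based scoring and of the max variant looks as follows; the helper name score and the chosen threshold are again illustrative, not the evaluation code used for the tables.

```python
def score(A, R, full, t):
    """Round the reconstruction A R_k A^T at threshold t and compute
    F1, precision and recall against the full tensor (cf. Tables 1-3)."""
    pred = np.concatenate([(A @ Rk @ A.T >= t).ravel() for Rk in R])
    gold = np.concatenate([s.ravel() for s in full]) > 0.5
    tp = float(np.sum(pred & gold))
    prec = tp / max(pred.sum(), 1)
    rec = tp / max(gold.sum(), 1)
    f1 = 2 * prec * rec / max(prec + rec, 1e-12)
    return f1, prec, rec

# Ad-hoc variant: learn max(X, Y) instead of the mean, which removes the
# 1/2-valued entries introduced by eq. (3).
M = [np.maximum(x, y) for x, y in zip(X, Y)]
A_max, R_max = rescal_als(M, rank=20)
print(score(A_max, R_max, full, t=0.3))
```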
7 Approach

To exemplify the general approach, consider the statistical model (S, p), with the measurable space

$$S := (\Omega, \mathcal{P}(\Omega)), \qquad \Omega := \prod^{m} \{0,1\}^{k \times n \times n} \qquad (4)$$

equipped with a probability measure

$$p : \mathcal{P}(\Omega) \to [0,1]; \quad \{X_1, \ldots, X_m\} \mapsto p(\{X_1, \ldots, X_m\}). \qquad (5)$$

This is the general setting; by choosing different parametric families of probability measures, one can use inference methods to obtain estimates for the parameters. Note that the choice of S makes this approach novel, because the usual methods for statistical inference on semantic adjacency tensors use measurable spaces of the form $(\{0,1\}^{k \times n \times n}, \mathcal{P}(\{0,1\}^{k \times n \times n}), p)$ to model the probabilities of links in a single tensor. This is well known and appears frequently in the Semantic Web literature. For instance, in [11] the authors use the probability function

$$p(X \mid A, R) := \prod_{i,j,k} \tilde{p}(x_{ijk} \mid \langle A_i, R_k A_j \rangle) \qquad (6)$$

with $\tilde{p}(x_{ijk} = 1 \mid \langle A_i, R_k A_j \rangle) := \sigma(\langle A_i, R_k A_j \rangle)$. They then estimate the parameters A and R by minimizing the negative log-likelihood. By extending the sample space to multiple tensors, one obtains a larger sample size for each individual entry, which will lead to better parameter estimates.

Furthermore, if one has to deal with multiple knowledge bases, the question of confidence arises. To address the problem of observing a link in knowledge base A but not in knowledge base B, one would have to introduce confidence values into the model in order to make informed predictions. Another approach will be to model the random events in our model as Markov processes. This would be appropriate for situations where the sources of our data are multiple versions of a knowledge graph at different points in time. The ability to incorporate Markov processes of graphs into our model is theoretically highly valuable, since they possess unparalleled modelling power for real-world processes.
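As one concrete instance of this setting, the sketch below writes down the joint negative log-likelihood that results from lifting the logistic model of eq. (6) from a single adjacency tensor to the m-tensor sample space of eq. (4), with the factors A and R shared across all observed tensors. This is only one possible parametric choice, made here for illustration; the inference procedure (e.g., gradient-based minimization) is left open.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def joint_nll(A, R, tensors, lam=0.1):
    """Negative log-likelihood of m observed adjacency tensors under a shared
    logistic factorization p(x_ijk = 1) = sigmoid(<A_i, R_k A_j>), i.e. the
    single-tensor model of eq. (6) lifted to the sample space of eq. (4).

    A: (n x r) factor matrix, R: list of k (r x r) core matrices,
    tensors: list of m observed tensors, each a list of k (n x n) slices."""
    eps = 1e-9  # numerical guard for log(0)
    nll = lam * (np.sum(A ** 2) + sum(np.sum(Rk ** 2) for Rk in R))
    for X in tensors:  # one likelihood term per observed knowledge base
        for Xk, Rk in zip(X, R):
            P = sigmoid(A @ Rk @ A.T)  # link probabilities for slice k
            nll -= np.sum(Xk * np.log(P + eps) + (1 - Xk) * np.log(1 - P + eps))
    return nll
```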
8 Evaluation Plan

To test our hypotheses, we will apply the derived algorithms in three stages. Stage one will be a test on synthetic or semi-synthetic data, where we build tensors from real-world data according to synthetic rules so as to model the existence of a tensor that contains the global knowledge, and measure the performance of the model using the F1-measure, just as in our preliminary results. Stage two will be a test on real-world tensors which contain a large number of related entities, e.g., YAGO [14] and DBpedia [9]. Here, we abstract away the existence of a global knowledge tensor while keeping the same evaluation scheme as in stage one. Finally, in stage three, we will apply the latent factors benchmarked in stage two to fields such as relation extraction, entity linking and question answering to boost their performance.

9 Reflections

As our preliminary results show, one can expect better latent embeddings by considering multiple sources of relational data. While there is a wealth of knowledge on the factorization of matrices or tensors, there is still relatively little work on the joint factorization of tensors. Furthermore, by considering quite general probability functions, we can model sophisticated dependencies between different knowledge graphs. This will be the start of this PhD thesis.

Acknowledgements

This work has been supported by the Eurostars projects DIESEL (E!9367) and QAMEL (E!9725) as well as the European Union's H2020 research and innovation action HOBBIT (GA 688227). I also want to thank my advisors Dr. Axel-C. Ngonga Ngomo, Ricardo Usbeck and the whole team at AKSW for all their helpful comments and fruitful coffee-driven conversations.

References

1. E. Acar, M. A. Rasmussen, F. Savorani, T. Næs, and R. Bro. Understanding data fusion within the framework of coupled matrix and tensor factorizations. Chemometrics and Intelligent Laboratory Systems, 129:53-63, 2013.
2. R. A. Harshman. Models for analysis of asymmetrical relationships among n objects or stimuli. In First Joint Meeting of the Psychometric Society and the Society for Mathematical Psychology, McMaster University, Hamilton, Ontario, 1978.
3. F. L. Hitchcock. The expression of a tensor or a polyadic as a sum of products. J. Math. Phys., 6(1):164-189, 1927.
4. S. A. Khan, E. Leppäaho, and S. Kaski. Multi-tensor factorization. arXiv preprint arXiv:1412.4679, 2014.
5. T. Kolda. Multilinear algebra for analyzing data with multiple linkages.
6. T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455-500, Aug. 2009.
7. D. Krompaß, M. Nickel, and V. Tresp. Large-scale factorization of type-constrained multi-relational data. In Data Science and Advanced Analytics (DSAA), 2014 International Conference on, pages 18-24. IEEE, 2014.
8. V. Kuleshov, A. T. Chaganty, and P. Liang. Tensor factorization via matrix factorization. CoRR, abs/1501.07320, 2015.
9. J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey, P. van Kleef, S. Auer, and C. Bizer. DBpedia - a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web Journal, 6(2):167-195, 2015.
10. B. London, T. Rekatsinas, B. Huang, and L. Getoor. Multi-relational learning using weighted tensor decomposition with modular loss. CoRR, abs/1303.1733, 2013.
11. M. Nickel and V. Tresp. Logistic tensor factorization for multi-relational data. ArXiv e-prints, June 2013.
12. M. Nickel, V. Tresp, and H.-P. Kriegel. A three-way model for collective learning on multi-relational data. In L. Getoor and T. Scheffer, editors, Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 809-816, New York, NY, USA, June 2011. ACM.
13. M. Nickel, V. Tresp, and H.-P. Kriegel. Factorizing YAGO: Scalable machine learning for linked data. In Proceedings of the 21st International Conference on World Wide Web, WWW '12, pages 271-280, New York, NY, USA, 2012. ACM.
14. F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: A core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, WWW '07, pages 697-706, New York, NY, USA, 2007. ACM.