=Paper=
{{Paper
|id=Vol-2635/paper8
|storemode=property
|title=Affinity Dependent Negative Sampling for Knowledge Graph Embeddings
|pdfUrl=https://ceur-ws.org/Vol-2635/paper8.pdf
|volume=Vol-2635
|authors=Mirza Mohtashim Alam,Hajira Jabeen,Mehdi Ali,Karishma Mohiuddin,Jens Lehmann
|dblpUrl=https://dblp.org/rec/conf/esws/AlamJAM020
}}
==Affinity Dependent Negative Sampling for Knowledge Graph Embeddings==
Mirza Mohtashim Alam (1,2), Hajira Jabeen (1,2), Mehdi Ali (1,2,3), Karishma Mohiuddin (2,3), and Jens Lehmann (1,2,3)

1 Smart Data Analytics Lab, University of Bonn, Germany
2 Department of Computer Science, University of Bonn, Germany
3 Fraunhofer IAIS

s6mialam@uni-bonn.de, jabeen@iai.uni-bonn.de, mehdi.ali@cs.uni-bonn.de, s6kamohi@uni-bonn.de, jens.lehmann@cs.uni-bonn.de

Supported by the Smart Data Analytics Lab, University of Bonn. Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. Knowledge graph embedding systems work in a vector space in which entities and relations profoundly rely on positive and negative instances. A subject, a predicate and an object together form a triple, which constitutes a positive or negative assertion. Positive triples can be taken directly from the knowledge base, but additional steps are required to understand the interactions between entities and relations in order to create meaningful negative triples. In this paper we present an approach that creates negative triples using the affinities of triples. Rather than generating negative samples randomly, we create false triples meaningfully, relying on the closeness of entities. We compare our method with distributional negative sampling and random negative sampling, and the results are found to be comparable and efficient.

Keywords: Knowledge graph analysis · Negative sample creation · Embedding models · Optimization techniques

1 Introduction

Knowledge Graphs (KGs) are a special kind of data in which everything is expressed in terms of a subject (head), a predicate (relation) and an object (tail). Subjects and objects (heads or tails) can be considered entities/nodes, while predicates (relations) can be considered edges/connections between two entities [1]. Most knowledge graphs encode only the available information, which mostly consists of true facts. Many vector-based embedding models, however, require negative sampling to create data that contrasts with the already available data (discussed in detail in later sections). In this paper, we aim to develop an informative way of generating more prominent negative samples in order to aid the state-of-the-art knowledge graph embedding models that are widely used for link prediction.

This paper is divided into several sections. Section 2 discusses closely related research. Section 3 describes the two embedding models on which we tested our negative sampling technique. Section 4 presents the methodology and the algorithm. Section 5 covers the implementation details (datasets used, hyperparameters, etc.). Section 6 discusses our experiments and results. Finally, Section 7 provides an overview of future directions.

2 Related Work

The transformation of KGs into vector space using embedding models has become a renowned area of research, and many embedding learning approaches have been presented in the literature in recent years, e.g., TransE [2], DistMult [3], ComplEx [4], TransH [5] and RotatE [6]. Almost all of these models depend on the existing data encoded in the input KG. Under a closed world assumption, knowledge absent from the KG would be treated as false; in practice, however, knowledge not present in the KG cannot be considered false, it can only be labelled as unknown.
On the other hand, like most machine learning algorithms, embedding models require negative examples to learn the embeddings efficiently. To cope with this, numerous negative sample generation methods have been proposed in the literature; in this section we briefly cover a few related ones. The most widely used method randomly perturbs the head entity (subject) or tail entity (object) of a triple with other entities from the entity set. This approach was used in TransE [2] and assumes that data not present in the given RDF data can serve as a negative triple. A possible problem is that relevant entities may be pulled out as replacement entities for corruption [7]. Another approach corrupts positive triples for every relation [8]; here, graphs with few relations suffer from an insufficient pool for corrupting positive instances [7]. Typed sampling corrupts triples based on entity and relation types using their metadata [7]. Other approaches discussed in [7] are relational sampling, nearest neighbor sampling and near miss sampling. Nearest neighbor sampling and near miss sampling aim to draw out similar entities as candidates for perturbing the subject or object of a triple; however, these approaches require pre-training of the embeddings. A negative sampling method that pulls out the most prominent candidates as replacement entities is Distributional Negative Sampling (DNS) [9]. This approach does not require any pre-training while corrupting the positive instances with more plausible entities. However, our implementation of the algorithm given in the DNS paper [9] has a long training time, which leaves room for improvement in training duration and computation. We therefore tried to improve the training time while maintaining decent model performance.

3 Embedding Models Used

TransE [2] and DistMult [3] have been selected as the default models to demonstrate our negative sampling technique, owing to their simplicity and low training effort.

3.1 DistMult

The bilinear scoring function of the DistMult model is obtained by multiplying the entity vectors (head and tail) with the corresponding relation matrix, which is diagonal [3]. With the entities represented as y_{e_1}, y_{e_2} and the corresponding diagonal relation matrix M_r, this leads to Equation 1 [3]:

    g_r^b(y_{e_1}, y_{e_2}) = y_{e_1}^{\top} M_r \, y_{e_2}    (1)

The model is trained by minimizing a margin-based ranking loss, where the entities are denoted by e_1, e_2, the relation by r, and the replacement entities of a corrupted triple by e_1' and e_2'. A positive triple is (e_1, r, e_2) ∈ T, where T is the set of all triples, and the set of corrupted triples is

    T' = {(e_1', r, e_2) | e_1' ∈ E, (e_1', r, e_2) ∉ T} ∪ {(e_1, r, e_2') | e_2' ∈ E, (e_1, r, e_2') ∉ T},

where E is the set of all entities. Here, the authors also corrupt either the head or the tail of a given triple to obtain the corrupted ones, and a ranking loss is used to minimize the objective. Equation 2 describes the loss function of [3]:

    L = \sum_{(e_1, r, e_2) \in T} \; \sum_{(e_1', r, e_2') \in T'} \max\{ S_{(e_1', r, e_2')} - S_{(e_1, r, e_2)} + 1, \, 0 \}    (2)
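To make Equations 1 and 2 concrete, the following is a minimal PyTorch sketch of the DistMult score and the margin-based ranking loss. It is an illustration under simplifying assumptions (embeddings held in nn.Embedding tables, the diagonal of M_r stored as a vector) and not the authors' implementation; all names and the toy indices are ours.

```python
import torch
import torch.nn as nn

class DistMultSketch(nn.Module):
    """Illustrative DistMult scorer: g(e1, r, e2) = y_e1^T diag(M_r) y_e2."""
    def __init__(self, num_entities, num_relations, dim=50):
        super().__init__()
        self.ent = nn.Embedding(num_entities, dim)
        self.rel = nn.Embedding(num_relations, dim)  # diagonal of M_r as a vector

    def score(self, h, r, t):
        # Equation 1, computed as an element-wise product followed by a sum,
        # which is equivalent to multiplying by the diagonal relation matrix
        return (self.ent(h) * self.rel(r) * self.ent(t)).sum(dim=-1)

def margin_ranking_loss(pos_scores, neg_scores, margin=1.0):
    # Equation 2: max(S_corrupted - S_positive + margin, 0), summed over the batch
    return torch.clamp(neg_scores - pos_scores + margin, min=0).sum()

# toy usage with hypothetical entity/relation ids
model = DistMultSketch(num_entities=135, num_relations=46)
h, r, t = torch.tensor([0]), torch.tensor([1]), torch.tensor([2])
h_neg = torch.tensor([7])  # corrupted head
loss = margin_ranking_loss(model.score(h, r, t), model.score(h_neg, r, t))
```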
3.2 TransE

TransE uses a similar randomly generated negative sampling technique in its original paper [2]. The idea behind its scoring function is that edges (relations) correspond to translations of the embeddings [2]: for a true triple, the translated head h + l should lie in the nearest vicinity of the tail t. Treating this as a distance function d, the loss function is stated in Equation 3 as given in [2], where γ is the margin and S'_{(h,l,t)} is the set of corrupted triples:

    L = \sum_{(h,l,t) \in S} \; \sum_{(h',l,t') \in S'_{(h,l,t)}} [\, \gamma + d(h + l, t) - d(h' + l, t') \,]_+    (3)

4 Affinity Dependent Negative Sampling

As mentioned earlier, we improve upon an existing approach presented in [9] to develop our negative sampling technique. During training, the aim is to generate C negative samples for each positive triple in the KG. Bernoulli sampling [10] is used to decide whether the head (subject) or the tail (object) of a given triple is to be corrupted. The cosine similarity between the selected candidate (the head or tail to be replaced by other entities) and all the other entities is then calculated and stored as the affinity vector M. Based on this affinity vector (the cosine similarities between the candidate entity and all other entities), we define an affinity function: for each entity i in the entity set ξ, the affinity function is given by Equation 4,

    Fitness_i = \frac{M_i}{\sum_j M_j}    (4)

This fitness vector can act as a probability vector over the entities in the entity list. Using the affinity function as the base probability vector, we randomly draw the C replacement entities according to their affinity; these C replacement entities are likely to have a high cosine similarity with the candidate entity (which is to be replaced). Finally, we form the negative triples by replacing the candidate entity with the chosen replacement entities. We do this for all triples in the batch and thus form the negative sample batch. For optimization, the Adagrad (adaptive gradient) optimizer [11] is used, and a self-adversarial loss function similar to [6] is used for both models. Negative samples are created with Algorithm 1 (a code sketch follows the listing).

Algorithm 1: Affinity Dependent Negative Sampling
  INPUT:  training set S = {(h, r, t)}, entity set ξ, relation set R, batch size β, number of negatives C
  OUTPUT: for a given batch β_{(h,r,t)}, return the negative triples N_{(h',r,t')}
  Function SampleNegative(S, ξ, R, β, C):
      for each triple t ∈ β_{(h,r,t)} do
          candidate_position ← Bern(t, R)                     ▷ Bernoulli sampling decides head or tail corruption
          candidate ← t[candidate_position]
          ξ_c ← ξ \ {candidate}                               ▷ remove the candidate from the total entity set
          M_score ← CosineSimilarity(candidate, ξ_c)
          M_score ← max(0, M_score)
          for i ∈ 1..length(M_score) do
              probability_fitness_i ← M_score_i / Σ_j M_score_j    ▷ generate the fitness vector (Equation 4)
          selected_entities ← random_choice(ξ_c, C, probability_fitness)
          NegativeTriples_(h',r,t') ← FormNegative(t, selected_entities, candidate_position)   ▷ form C negative triples per positive
          N_(h',r,t') ← N_(h',r,t') ∪ NegativeTriples_(h',r,t')    ▷ append to the batch-level negative set
      return N_(h',r,t')
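The following is a minimal numpy sketch of Algorithm 1 for a single positive triple. It assumes the entity embeddings are available as a dense matrix, replaces the Bern(·) step with a plain coin flip, and uses illustrative names throughout; the uniform fallback mirrors the rare all-negative similarity case described in the text below. Treat it as a sketch under these assumptions, not the reference implementation.

```python
import numpy as np

def adns_negatives(triple, entity_emb, num_negatives, rng=np.random.default_rng()):
    """Affinity dependent negative sampling for one (head, relation, tail) id triple."""
    h, r, t = triple
    corrupt_head = rng.random() < 0.5              # stand-in for Bernoulli head/tail selection
    candidate = h if corrupt_head else t

    # candidate pool: every entity except the one being replaced
    other_ids = np.array([e for e in range(len(entity_emb)) if e != candidate])

    # cosine similarity between the candidate and all other entities (affinity vector M)
    cand_vec = entity_emb[candidate]
    others = entity_emb[other_ids]
    sims = others @ cand_vec / (np.linalg.norm(others, axis=1) * np.linalg.norm(cand_vec) + 1e-12)

    sims = np.maximum(sims, 0.0)                   # clip negative similarities to zero
    if sims.sum() == 0.0:                          # rare all-negative case: fall back to uniform sampling
        fitness = None
    else:
        fitness = sims / sims.sum()                # Equation 4: fitness / probability vector

    # draw C replacement entities at once, weighted by affinity
    replacements = rng.choice(other_ids, size=num_negatives, p=fitness)
    if corrupt_head:
        return [(int(e), r, t) for e in replacements]
    return [(h, r, int(e)) for e in replacements]

# toy usage: 10 random entity embeddings of dimension 8, 3 negatives for triple (0, 2, 5)
emb = np.random.default_rng(0).normal(size=(10, 8))
print(adns_negatives((0, 2, 5), emb, num_negatives=3))
```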
In this algorithm, similar to [9], the first few lines decide whether to corrupt the head or the tail based on Bernoulli sampling. We then remove the candidate from the total entity set and compute the cosine similarity between the candidate entity (which is to be replaced) and all the other entities in the entity set ξ. We denote the resulting cosine similarity vector as M_score and set its negative values to 0 in the following line. Next, we generate the probability fitness vector probability_fitness with Equation 4. Using numpy's random choice, we obtain the C selected entities (with which we intend to corrupt the training instances). In the remaining lines, we corrupt the positive triple to form C corrupted triples, finally building a batch of corrupted triples matching the positive ones. In the very unusual case where every value of the cosine similarity vector is negative, we choose the replacement entities uniformly at random from the entity set ξ.

5 Implementation Details

5.1 Datasets and preprocessing

We have used small to medium-sized datasets in RDF (subject, predicate, object) format. We generated dictionaries (entity-to-id, relation-to-id) for the identification of specific entities and relations. We used the Nations [12], Kinship [13] and UMLS [14] datasets, which were also used in [15]; they are not very large and produce results in a short amount of time. Their statistics are provided in Table 1. Each entity and relation in the training and test sets shares the same identifier (via the dictionaries mentioned above).

Table 1. Statistical information of the datasets

Dataset   Entities   Relations   Training triples   Test triples   Validation triples
UMLS      135        46          5216               661            652
Kinship   104        25          8544               1074           1068
Nations   14         55          1592               201            199

5.2 Hardware and Tools

All results were obtained on a machine with 16 GB RAM, an Intel i7 4770 processor and an Nvidia RTX 260 GPU. The experiments were built with PyTorch (https://pytorch.org/), numpy (https://numpy.org/), scipy (https://www.scipy.org/) and pandas (https://pypi.org/project/pandas/).

5.3 Hyperparameters

For TransE, the optimal hyperparameters were: learning rate = 0.1, gamma = 14 (for the Kinship and Nations datasets), gamma = 10 (for the UMLS dataset), adversarial temperature = 0, and number of negative samples = 3. For TransE, the embeddings were normalized after each epoch and no regularization was applied. For DistMult, everything was the same except for the gamma, which is 10 (for UMLS); in this case no normalization was performed after each epoch, and regularization was applied to avoid over-fitting. In both cases, the self-adversarial loss of [6] and the Adagrad optimizer [11] were used.

5.4 Evaluation

Our evaluation compares model performance on the link prediction task under three different negative sampling strategies: random negative sampling, distributional negative sampling (DNS) and affinity dependent negative sampling (ADNS). The filtered setting of [6] is used: test triples are ranked against all candidate triples generated by corrupting the head or the tail, but candidate triples that appear in the training, validation or test set are excluded. The standard evaluation metrics for knowledge graph embedding models are reported: mean rank (MR), mean reciprocal rank (MRR), Hit@1, Hit@3, Hit@5 and Hit@10.
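For reference, the snippet below shows how these metrics are typically derived from a list of (filtered) ranks; the function and the toy ranks are illustrative and not taken from the authors' code.

```python
import numpy as np

def rank_metrics(ranks, ks=(1, 3, 5, 10)):
    """Compute MR, MRR and Hit@k from a list of filtered ranks (1 = best)."""
    ranks = np.asarray(ranks, dtype=float)
    metrics = {"MR": ranks.mean(), "MRR": (1.0 / ranks).mean()}
    for k in ks:
        metrics[f"Hit@{k}"] = (ranks <= k).mean()   # fraction of test triples ranked within top k
    return metrics

# toy usage with hypothetical ranks of five test triples
print(rank_metrics([1, 2, 4, 10, 37]))
```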
6 Discussion and Results

Compared with random negative sampling and Distributional Negative Sampling (DNS) [9], our method performed slightly better in certain cases. The authors of [9] showed that DNS improves model performance by choosing more salient negative samples. The negative sampling improves over time, as the embedding vectors gradually get better and more appropriate negative samples are created as training proceeds. In the case of DNS, training took a long time for us, since at each iteration it calculates the cosine similarity and then obtains each replacement entity by enumerating the cosine similarity vector, whose values act as acceptance and rejection probabilities. In this setting, the full cosine similarity vector has to be enumerated to obtain C negative triples (or at least C times, although more plausible negative samples might be found by traversing further than C). To mitigate this, we calculate the fitness vector from the cosine similarity vector and use it as the probability for choosing the plausible entities. By using the choice function of the numpy standard library (https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.choice.html), it is possible to avoid enumerating the whole cosine similarity vector and to draw the C replacement entities at once, which takes considerably less time. This function generates C random samples from a given one-dimensional array; if a probability vector is provided for that array, it becomes the basis for generating the C samples. We use the fitness vector as the probability argument for choosing C entities out of the total entity set at once, as illustrated below.
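As a small illustration of this single vectorized call, the snippet below draws C replacement entity ids in one step from a fitness vector; the entity ids and fitness values are made up for the example.

```python
import numpy as np

# hypothetical candidate pool and its fitness (probability) vector, as in Equation 4
entity_ids = np.array([3, 8, 11, 27, 42])
fitness = np.array([0.05, 0.40, 0.10, 0.30, 0.15])   # non-negative, sums to 1

C = 3
# one weighted draw of C replacement entities; no per-entity accept/reject enumeration is needed
replacements = np.random.choice(entity_ids, size=C, p=fitness)
print(replacements)
```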
The results for TransE and DistMult are stated in Tables 2 and 3, respectively.

Table 2. Evaluation of negative sampling on TransE

Dataset   Metric    Random Negative   DNS       ADNS
Nations   MR        5                 5         4
          MRR       26.96%            33.79%    42.42%
          Hit@1     0%                13.68%    20.15%
          Hit@3     40.05%            37.31%    53.98%
          Hit@5     60.95%            59.95%    73.88%
          Hit@10    93.53%            94.53%    98.01%
Kinship   MR        9                 13        8
          MRR       25.66%            24.89%    28.47%
          Hit@1     0.009%            0.03%     0.28%
          Hit@3     39.80%            38.13%    47.77%
          Hit@5     56.28%            49.30%    62.38%
          Hit@10    76.44%            64.20%    78.91%
UMLS      MR        3                 3         2
          MRR       64.47%            72.90%    80.21%
          Hit@1     39.86%            56.81%    64.45%
          Hit@3     88.12%            86.76%    95.54%
          Hit@5     93.95%            92.06%    97.05%
          Hit@10    97.13%            96.44%    98.03%

From the results it can be seen that our method slightly improves performance for both models in most cases. For the Nations dataset in Table 2, ADNS yields a significant improvement in the overall MR, MRR and Hits. For Kinship and UMLS, ADNS gives a slight improvement in MR, MRR and Hit@10, and for Kinship and Nations in Table 2 some of the other Hits improve significantly. In Table 3, only Hit@10 improves slightly for the Nations dataset; no other significant improvement in model performance is detected in this table, but the training time is reduced significantly. The training time of our method is longer than that of random negative sampling but shorter than that of distributional negative sampling, as implemented by us from the original paper. Figure 1 shows the convergence of the loss and the time taken for the UMLS dataset over 1000 steps, which also indicates the training time with each of the negative sampling methods shown in the legend.

Table 3. Evaluation of negative sampling on DistMult

Dataset   Metric    Random Negative   DNS       ADNS
Nations   MR        3                 2         2
          MRR       65.34%            82.32%    79.48%
          Hit@1     49.50%            73.63%    68.91%
          Hit@3     76.12%            88.06%    86.32%
          Hit@5     88.31%            94.53%    93.78%
          Hit@10    99.52%            99.75%    100.00%
Kinship   MR        6                 5         5
          MRR       48.45%            57.19%    58.74%
          Hit@1     33.43%            44.88%    47.63%
          Hit@3     54.24%            61.78%    61.87%
          Hit@5     65.27%            71.09%    71.09%
          Hit@10    85.38%            83.80%    85.94%
UMLS      MR        8                 5         6
          MRR       48.03%            71.15%    60.72%
          Hit@1     34.72%            63.69%    49.24%
          Hit@3     54.84%            74.21%    67.96%
          Hit@5     64.07%            78.97%    74.96%
          Hit@10    76.48%            86.16%    84.87%

Fig. 1. Convergence of loss with both models (figure not reproduced in this text version)

7 Conclusion and Future Work

Learning relational data plays a vital role in various areas such as drug-target interaction, social network analysis, recommender systems and disease analysis, wherever entities (e.g., movies, people, locations) and relations (e.g., friend of, located in, causes) are connected by edges. Learning such relational data requires negative samples to contrast with the positive instances. Apart from creating random negatives, other approaches can be time consuming, since fetching the desired negative samples adds extra effort. In this research we gave a glimpse of how better negative samples can be generated with decent performance and without sacrificing too much time. In our experiments, one model, TransE, showed good performance on these datasets, whereas DistMult did not provide a significant improvement in model performance but did reduce the training time. As discussed in Section 6, both models showed improved, similar or slightly lower performance, and in each case both had improved training time over DNS (in our implementation of both models). In future work we aim to use similarity measures other than cosine similarity, to expand our experiments to larger datasets such as Freebase (FB15k) and WordNet (WN18), and to include more recent KG models such as RotatE and ComplEx.

References

1. Dettmers, T., Minervini, P., Stenetorp, P., & Riedel, S. (2018, April). Convolutional 2D knowledge graph embeddings. In Thirty-Second AAAI Conference on Artificial Intelligence.
2. Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., & Yakhnenko, O. (2013). Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems (pp. 2787-2795).
3. Yang, B., Yih, W. T., He, X., Gao, J., & Deng, L. (2014). Learning multi-relational semantics using neural-embedding models. arXiv preprint arXiv:1411.4072.
4. Trouillon, T., Welbl, J., Riedel, S., Gaussier, É., & Bouchard, G. (2016). Complex embeddings for simple link prediction. In International Conference on Machine Learning (ICML).
5. Wang, Z., Zhang, J., Feng, J., & Chen, Z. (2014, June). Knowledge graph embedding by translating on hyperplanes. In Twenty-Eighth AAAI Conference on Artificial Intelligence.
6. Sun, Z., Deng, Z. H., Nie, J. Y., & Tang, J. (2019). RotatE: Knowledge graph embedding by relational rotation in complex space. arXiv preprint arXiv:1902.10197.
7. Kotnis, B., & Nastase, V. (2017). Analysis of the impact of negative sampling on link prediction in knowledge graphs. arXiv preprint arXiv:1708.06816.
8. Socher, R., Chen, D., Manning, C. D., & Ng, A. (2013). Reasoning with neural tensor networks for knowledge base completion. In Advances in Neural Information Processing Systems (pp. 926-934).
9. Dash, S., & Gliozzo, A. (2019). Distributional negative sampling for knowledge base completion. arXiv preprint arXiv:1908.06178.
10. Wang, Z., Zhang, J., Feng, J., & Chen, Z. (2014, June). Knowledge graph embedding by translating on hyperplanes. In Twenty-Eighth AAAI Conference on Artificial Intelligence.
11. Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul), 2121-2159.
12. Rummel, R. J. (1968). Dimensionality of Nations Project (No. 405846). University of Hawaii, Honolulu, Department of Political Science.
13. Denham, W. W. (1973). The detection of patterns in Alyawara nonverbal behavior (Doctoral dissertation, University of Washington, Seattle).
14. McCray, A. T. (2003). An upper-level ontology for the biomedical domain. International Journal of Genomics, 4(1), 80-84.
15. Kemp, C., Tenenbaum, J. B., Griffiths, T. L., Yamada, T., & Ueda, N. (2006, July). Learning systems of concepts with an infinite relational model. In AAAI (Vol. 3, p. 5).