Efficient Learning of Entity and Predicate Embeddings for Link Prediction in Knowledge Graphs

Pasquale Minervini, Claudia d'Amato, Nicola Fanizzi, Floriana Esposito
LACAM – Department of Computer Science – Università degli Studi di Bari Aldo Moro, Italy
{firstname.lastname}@uniba.it

Abstract. Knowledge Graphs are a widely used formalism for representing knowledge in the Web of Data. We focus on the problem of predicting missing links in large knowledge graphs, so as to discover new facts about the world. Recently, representation learning models that embed entities and predicates in continuous vector spaces have achieved new state-of-the-art results on this problem. A major limitation of these models is that the training process, which consists in learning the optimal entity and predicate embeddings for a given knowledge graph, can be very computationally expensive: it may even require days of computation for large knowledge graphs. In this work, by leveraging adaptive learning rates, we propose a principled method for reducing the training time by an order of magnitude, while learning more accurate link prediction models. Furthermore, we employ the proposed training method for evaluating a set of novel and scalable models. Our evaluation shows significant improvements over state-of-the-art link prediction methods on the WordNet and Freebase datasets.

1 Introduction

Knowledge Graphs (KGs) are graph-structured Knowledge Bases (KBs), where factual knowledge about the world is represented in the form of relationships between entities. They are widely used for representing relational knowledge in a variety of domains, such as citation networks and protein interaction networks. An example of their widespread adoption is the Linked Open Data (LOD) Cloud, a set of interlinked KGs such as Freebase [1] and WordNet [2]. As of April 2014, the LOD Cloud was composed of 1,091 interlinked KBs, describing over 8 × 10^6 entities and 188 × 10^6 relationships holding between them (State of the LOD Cloud 2014: http://lod-cloud.net/).

Despite their large size, many KGs are still largely incomplete. For example, consider Freebase (publicly available at https://developers.google.com/freebase/data), a core element in the Google Knowledge Vault project [3]: 71% of the persons described in Freebase have no known place of birth, 75% of them have no known nationality, and the coverage for less frequent predicates can be even lower [3].

In this work we focus on the problem of completing missing links in large KGs, so as to discover new facts about the world. In the literature, this problem is referred to as link prediction or knowledge graph completion, and it has received considerable attention over the last years [4,5].

Recently, representation learning [6] models such as the Translating Embeddings model (TransE) [7] achieved new state-of-the-art link prediction results on large and Web-scale KGs [4,5]. Such models learn a unique distributed representation, or embedding, for each entity and predicate in the KG: each entity is represented by a low-dimensional continuous embedding vector, and each predicate is represented by an operation in the embedding vector space, such as a translation (as in [7]) or an affine transformation (as in [8]). We refer to these models as embedding models, and to the learned distributed representations as embeddings.
The embeddings of all entities and predicates in the KG are learned jointly: the learning process consists in minimizing a global loss functional over the whole KG, by back-propagating the loss to the embeddings. (In natural language processing, a similar procedure is used by the word2vec model [9] for learning a unique distributed representation for each word in a corpus of documents.) As a consequence, the learned entity and predicate embeddings retain global, structural information about the whole KG, and can be used to serve several kinds of applications. In link prediction, the confidence of each candidate edge can be measured as a function of the embeddings of its source entity, its target entity, and its predicate.

A major limitation of the embedding models proposed so far, however, is that the learning procedure (i.e. learning the optimal embeddings of all entities and predicates in the KG) can be very time-consuming: it is based on an incremental optimization algorithm that may require days of computation to converge for large KGs [10].

In this work, we propose a novel and principled method for significantly reducing the learning time in embedding models, based on adaptive per-parameter learning rates. Furthermore, we employ the proposed training method for evaluating a variety of novel embedding models: our evaluation achieves new state-of-the-art link prediction results on the WordNet and Freebase datasets.

2 Basics

RDF Graphs. The most widely used formalism for representing knowledge graphs is the W3C Resource Description Framework (RDF, http://www.w3.org/TR/rdf11-concepts/), a recommended standard for representing knowledge on the Web. An RDF KB, also referred to as an RDF graph, is a set of RDF triples of the form ⟨s, p, o⟩, where s, p and o respectively denote the subject, the predicate and the object of the triple: s and o are entities, and p is a relation type. Each triple ⟨s, p, o⟩ describes a statement, which is interpreted as "a relationship p holds between the entities s and o".

Example 2.1 (Shakespeare). The statement "William Shakespeare is an author who wrote Othello and the tragedy Hamlet" can be expressed by the following RDF triples:

⟨Shakespeare, profession, Author⟩
⟨Shakespeare, author, Hamlet⟩
⟨Shakespeare, author, Othello⟩
⟨Hamlet, genre, Tragedy⟩

An RDF graph can be viewed as a labeled directed multigraph, where each entity is a vertex, and each RDF triple is represented by a directed edge whose label is a predicate, emanating from its subject vertex to its object vertex. In RDF KBs, the Open-World Assumption holds: a missing triple does not mean that the corresponding statement is false, but rather that its truth value is unknown (it cannot be observed).

In the following, given an RDF graph G, we denote by E_G the set of all entities occurring as subjects or objects in G, and by R_G the set of all predicates occurring in G:

  E_G = {s | ∃⟨s, p, o⟩ ∈ G} ∪ {o | ∃⟨s, p, o⟩ ∈ G},   R_G = {p | ∃⟨s, p, o⟩ ∈ G}.

For instance, in the case of the RDF graph shown in Ex. 2.1, the sets E_G and R_G are the following: E_G = {Author, Shakespeare, Hamlet, Othello, Tragedy} and R_G = {profession, author, genre}. Furthermore, we denote by S_G = E_G × R_G × E_G the space of possible triples of G, i.e. the set of all triples that can be created by using the entities and predicates in G (note that G ⊆ S_G). We refer to all triples in G as observed triples, and to all triples in S_G \ G as unobserved triples.
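For illustration, the sets E_G, R_G and S_G for the graph of Ex. 2.1 can be computed directly from the set of triples. The following is a minimal Python sketch; the set-of-tuples representation and the function names are our own choices, not part of the RDF standard:

```python
from itertools import product

# The RDF graph from Ex. 2.1, represented as a set of (subject, predicate, object) triples.
G = {
    ("Shakespeare", "profession", "Author"),
    ("Shakespeare", "author", "Hamlet"),
    ("Shakespeare", "author", "Othello"),
    ("Hamlet", "genre", "Tragedy"),
}

def entities(G):
    """E_G: all entities occurring as subject or object of some triple in G."""
    return {s for s, _, _ in G} | {o for _, _, o in G}

def predicates(G):
    """R_G: all predicates occurring in G."""
    return {p for _, p, _ in G}

E_G, R_G = entities(G), predicates(G)
S_G = set(product(E_G, R_G, E_G))   # space of possible triples, E_G x R_G x E_G
unobserved = S_G - G                # triples whose truth value is unknown

print(len(E_G), len(R_G), len(S_G), len(unobserved))   # 5 3 75 71
```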
Energy-Based Models. Embedding models for KGs can be described in terms of Energy-Based Models (EBMs) [11]: EBMs are a versatile and flexible framework for modeling dependencies between variables. In the fields of representation learning and deep learning [6], EBMs are employed as building blocks for constructing hierarchical models that achieve ground-breaking results in several learning tasks.

A fundamental component of an EBM is a scalar-valued energy function (or scoring function) E_θ(·), parametrized by θ, which associates a scalar energy value to a configuration of a set of variables. The energy of a configuration is inversely related to its probability: more likely configurations are associated with lower energy values, while less likely configurations are associated with higher energy values. Several tractable methods have been proposed for learning the parameters of an energy function [11,6]. In particular, the problem of learning the optimal parameters θ̂ can be cast as solving the following optimization problem [11]:

  θ̂ = argmin_{θ ∈ Θ} L(E_θ, D),

where Θ is the parameter space, and L(·) is a loss functional which measures the quality of the energy function E_θ(·) on the data D. Intuitively, the loss functional L(·) assigns a lower loss to energy functions that associate a lower energy (corresponding to a higher probability) to correct answers, and a higher energy (corresponding to a lower probability) to all incorrect answers.

Energy-Based Models for RDF Graphs. As discussed in [12], embedding models for KGs define an energy distribution E_θ : S_G → R over the space of possible triples S_G. For instance, the models proposed in [8,7,13,12] assign a score E(⟨s, p, o⟩) to each triple ⟨s, p, o⟩ in S_G. In a link prediction setting, such models are used as follows. First, the optimal parameters θ̂ of the energy function are learned: the parameters are composed of the embeddings of all entities and predicates in the KG. Then, the energy function E_θ̂(·) is used for ranking unobserved triples: those with lower energy values have a higher probability of representing true statements, and are considered more likely candidates for a completion of the KG.

Table 1: Energy functions used by several link prediction models, and the corresponding number of parameters: n_e = |E_G| and n_r = |R_G| respectively denote the number of entities and predicates in G, and k, d ∈ N are user-defined hyper-parameters.

| Model | Energy function E(⟨s, p, o⟩) | Parameters | Complexity |
|---|---|---|---|
| Unstructured [12] | ‖e_s − e_o‖²₂ | e_s, e_o ∈ R^k | O(n_e k) |
| TransE [7] | ‖(e_s + e_p) − e_o‖_{1,2} | e_p ∈ R^k | O(n_e k + n_r k) |
| SE [8] | ‖W_{p,1} e_s − W_{p,2} e_o‖_1 | W_{p,i} ∈ R^{k×k} | O(n_e k + n_r k²) |
| SME lin. [12] | (W_1 e_s + W_2 e_p + b_1)^T (W_3 e_o + W_4 e_p + b_2) | e_p ∈ R^k, W_i ∈ R^{d×k}, b_j ∈ R^d | O(n_e k + n_r k + dk) |
| SME bil. [12] | ((W_1 ×_3 e_p) e_s)^T ((W_2 ×_3 e_p) e_o) | e_p ∈ R^k, W_i ∈ R^{d×k×k}, b_j ∈ R^d | O(n_e k + n_r k + dk²) |
| NTN [13] | u_p^T f(e_s^T W_p e_o + W_{p,1} e_s + W_{p,2} e_o + b_p) | W_p ∈ R^{k×k×d}, W_{p,i} ∈ R^{d×k}, u_p, b_p ∈ R^d | O(n_e k + n_r dk²) |

Consider the RDF graph shown in Ex. 2.1. In such a graph, we prefer learning an energy-based model that assigns a lower energy (a higher probability) to the triple ⟨Othello, genre, Tragedy⟩, which is unobserved but represents the true statement "Othello is a Tragedy", and a higher energy (a lower probability) to other unobserved triples, such as ⟨Hamlet, genre, Author⟩.
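To make the ranking protocol concrete, the sketch below ranks candidate completions of ⟨Othello, genre, ·⟩ under a toy translation-based energy of the kind listed in Tab. 1; the two-dimensional embedding values are hand-picked for illustration, not learned:

```python
import numpy as np

# Toy 2-dimensional embeddings (illustrative values only, not learned).
entity_emb = {
    "Author":  np.array([1.0, 0.0]),
    "Tragedy": np.array([0.0, 1.0]),
    "Hamlet":  np.array([0.1, 0.9]),
    "Othello": np.array([0.2, 0.8]),
}
# One translation vector per predicate (illustrative values only).
predicate_emb = {"genre": np.array([-0.2, 0.2])}

def energy(s, p, o):
    """Lower energy = more plausible triple (here, the L1 distance of e_s + e_p from e_o)."""
    return np.abs(entity_emb[s] + predicate_emb[p] - entity_emb[o]).sum()

# Rank candidate objects for the unobserved triple <Othello, genre, ?>.
for o in sorted(entity_emb, key=lambda o: energy("Othello", "genre", o)):
    print(o, round(energy("Othello", "genre", o), 2))
# With these embeddings, 'Tragedy' is ranked first (lowest energy).
```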
3 Energy-Based Embedding Models

Several EBMs have been proposed in the literature for addressing the problem of link prediction in KGs [8,14,7,13,12,15]. These models share a fundamental characteristic: they can be used for learning a distributed representation (or embedding) for each entity and predicate in the KG. We refer to such models as embedding models, and denote the distributed representation of an entity or predicate z by adding a subscript to the corresponding vector or matrix representation, as in e_z ∈ R^k.

Formally, let G be an RDF graph. For each entity x ∈ E_G, embedding models learn a continuous vector representation e_x ∈ R^k, with k ∈ N, called the embedding vector of x. Similarly, for each predicate p ∈ R_G, embedding models learn an operation on the embedding vector space, characterized by a set of embedding parameters. This can be an empty set of parameters, as in the Unstructured model proposed in [12]; a translation vector e_p ∈ R^k, as in the Translating Embeddings model proposed in [7]; or a more complex set of parameters.

The distributed representations of all entities and predicates in G are then used for defining an energy distribution E : S_G → R over the space of possible triples of G. In particular, the energy E(⟨s, p, o⟩) of a triple ⟨s, p, o⟩ is defined as a function of the distributed representations of its subject s, its predicate p and its object o.

In Tab. 1 we report the energy functions adopted by several models proposed in the literature. For each model, we also report the number of parameters needed for storing the distributed representations of all entities and predicates: n_e = |E_G| denotes the number of entities in the KG, n_r = |R_G| denotes the number of predicates, and k, d ∈ N are user-defined hyper-parameters. In general, if the number of parameters in a model grows super-linearly with the number of entities and predicates in the KG, it becomes increasingly harder for the model to scale to very large and Web-scale KGs.

The Translating Embeddings model. Among the models outlined in Tab. 1, the recently proposed Translating Embeddings model (TransE) [7] has interesting characteristics, and has received considerable attention [4]:

– It achieves more accurate link prediction results than other state-of-the-art methods on several datasets.
– The number of parameters in TransE scales linearly in the number of entities n_e and predicates n_r in the KG: this allows TransE to potentially scale to large KGs.

The TransE model is very simple. In TransE, each entity x ∈ E_G is represented by its embedding vector e_x ∈ R^k, and each predicate p ∈ R_G is represented by a (vector) translation operation e_p ∈ R^k. The energy of a triple ⟨s, p, o⟩ is given by the L1 or L2 distance between (e_s + e_p) and e_o:

  E(⟨s, p, o⟩) = ‖(e_s + e_p) − e_o‖_{1,2}.

In TransE, all the embedding and translation vectors are learned jointly from the KG by using Stochastic Gradient Descent, as discussed in Sect. 4. The number of parameters needed by the TransE model for storing all the embedding and translation vectors is (n_e k + n_r k), a quantity that grows linearly with n_e and n_r. For this reason, TransE can potentially scale to very large and highly-relational KGs [7].
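As an illustration of the O(n_e k + n_r k) parameter storage and of the energy computation, the sketch below keeps all TransE parameters in two matrices and evaluates the energy of a batch of triples with element-wise operations; the sizes, the random initialization scheme and the function name are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_entities, n_predicates, k = 1000, 20, 50          # illustrative sizes

# All TransE parameters: O(n_e * k + n_r * k) numbers.
E = rng.normal(scale=0.1, size=(n_entities, k))      # entity embeddings e_x
P = rng.normal(scale=0.1, size=(n_predicates, k))    # predicate translations e_p

def transe_energy(triples, E, P, norm=1):
    """L1 (or L2) energy ||(e_s + e_p) - e_o|| for an array of (s, p, o) index triples."""
    s, p, o = triples[:, 0], triples[:, 1], triples[:, 2]
    residual = E[s] + P[p] - E[o]                    # shape: (batch, k)
    if norm == 1:
        return np.abs(residual).sum(axis=1)
    return np.sqrt((residual ** 2).sum(axis=1))

batch = np.array([[0, 3, 42], [17, 5, 256]])         # two (subject, predicate, object) index triples
print(transe_energy(batch, E, P))                    # one energy value per triple
```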
A New Set of Embedding Models. In the following, we propose a set of variants of the TransE model, which preserve its scalability properties. Let d(x, y) be a dissimilarity function chosen from the set

  d(x, y) ∈ { ‖x − y‖_1, ‖x − y‖_2, −x^T y },

i.e. from the L1 distance, the L2 distance, and the negative inner product. We propose the following embedding models, where each is defined by the corresponding energy function E(·):

– TransE: E(⟨s, p, o⟩) = d(e_s + e_p, e_o),
– TransE+: E(⟨s, p, o⟩) = d(e_s + e_{p,1}, e_o + e_{p,2}),
– ScalE: E(⟨s, p, o⟩) = d(e_s ◦ e_p, e_o),
– ScalE+: E(⟨s, p, o⟩) = d(e_s ◦ e_{p,1}, e_o ◦ e_{p,2}),

where e_s, e_o ∈ R^k are the embedding vectors of the entities appearing as the subject s and the object o; e_{p,·} ∈ R^k are the embedding parameters of the predicate p, denoting either a translation or a scaling vector; and ◦ denotes the Hadamard (element-wise) product, corresponding to the vector scaling operation. The energy function in TransE is the same as in [7], but we also allow using the negative inner product as a dissimilarity measure between the (translated) subject and object embedding vectors, if this improves the performance on the validation set. The TransE+ model generalizes TransE by also translating the object embedding vector e_o. The ScalE and ScalE+ models are similar to the previous two models, but replace the vector translation with a scaling operation. The rationale behind ScalE and ScalE+ is the following: scaling the embedding vector of an entity can be seen as weighting the (latent) features of that entity in the embedding vector space.

All proposed models share the same advantages as the TransE model: (i) the required number of parameters is O(n_e k + n_r k), which grows linearly with n_e and n_r, and (ii) the energy function and its gradient w.r.t. the embeddings of entities and predicates can be computed very efficiently, using element-wise vector operations.
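A minimal sketch of the four proposed energy functions, under the same matrix-based representation used above; the use of two predicate parameter matrices P1 and P2 (the second one only used by TransE+ and ScalE+) is our own encoding choice:

```python
import numpy as np

def dissimilarity(x, y, d="l1"):
    """d(x, y): L1 distance, L2 distance, or negative inner product (row-wise)."""
    if d == "l1":
        return np.abs(x - y).sum(axis=-1)
    if d == "l2":
        return np.sqrt(((x - y) ** 2).sum(axis=-1))
    return -(x * y).sum(axis=-1)                     # negative inner product

def energy(model, s, p, o, E, P1, P2, d="l1"):
    """Energy of the proposed models; E holds entity embeddings, P1/P2 predicate parameters."""
    es, eo = E[s], E[o]
    if model == "TransE":
        return dissimilarity(es + P1[p], eo, d)
    if model == "TransE+":
        return dissimilarity(es + P1[p], eo + P2[p], d)
    if model == "ScalE":
        return dissimilarity(es * P1[p], eo, d)      # Hadamard product = feature scaling
    if model == "ScalE+":
        return dissimilarity(es * P1[p], eo * P2[p], d)
    raise ValueError(model)

rng = np.random.default_rng(0)
E, P1, P2 = rng.normal(size=(5, 4)), rng.normal(size=(2, 4)), rng.normal(size=(2, 4))
for m in ("TransE", "TransE+", "ScalE", "ScalE+"):
    print(m, energy(m, 0, 1, 3, E, P1, P2, d="l1"))
```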
4 Improving the Efficiency of the Embeddings Learning Process

In [8,7,12], the authors propose a method for jointly learning the embeddings of all entities and predicates in a KG G. The method relies on a stochastic optimization process that iteratively updates the embeddings by reducing the energy of triples in G (observed triples) while increasing the energy of triples in S_G \ G (unobserved triples). During the learning process, unobserved triples are randomly generated by means of a corruption process, which replaces either the subject or the object of each observed triple with another entity in G. More formally, given an observed triple y ∈ G, let C_G(y) denote the set of all corrupted triples obtained by replacing either its subject or its object with another entity in G:

  C_G(⟨s, p, o⟩) = {⟨s̃, p, o⟩ | s̃ ∈ E_G} ∪ {⟨s, p, õ⟩ | õ ∈ E_G}.

The embeddings of all entities and predicates in the KG, which compose the model parameters, can be learned by minimizing a margin-based ranking loss. More formally, learning the optimal model parameters θ̂, corresponding to all the entity and predicate embeddings, is equivalent to solving the following constrained minimization problem:

  minimize_{θ ∈ Θ}  Σ_{y ∈ G} Σ_{ỹ ∈ C_G(y)} [γ + E_θ(y) − E_θ(ỹ)]_+        (1)
  subject to  ‖e_x‖ = 1  for all x ∈ E_G,

where [x]_+ = max{0, x}, and γ ≥ 0 is a hyper-parameter referred to as the margin. The objective function in Eq. 1 (corresponding to the loss functional L(·) discussed in Sect. 2) enforces the energy of observed triples to be lower than the energy of unobserved triples. The constraints in the optimization problem prevent the training process from trivially minimizing the loss by increasing the entity embedding norms.

Stochastic Gradient Descent. In the literature, the constrained loss minimization problem in Eq. 1 is solved using Stochastic Gradient Descent (SGD) in mini-batch mode, as summarized in Alg. 1.

Algorithm 1: Learning the model parameters via SGD
Require: Learning rate η, Batch size n, Iterations τ
Ensure: Optimal model parameters (embeddings) θ̂
1: Initialize the model parameters θ_0
2: for t ∈ ⟨1, . . . , τ⟩ do
3:   e_x ← e_x / ‖e_x‖, ∀x ∈ E_G                         {Normalize all entity embedding vectors}
4:   T ← SampleBatch(G, n)                               {Sample a batch of observed and corrupted triples}
5:   g_t ← ∇ Σ_{⟨y,ỹ⟩∈T} [γ + E_θ(y) − E_θ(ỹ)]_+         {Compute the gradient of the loss}
6:   Δ_t ← −η g_t                                        {Compute the update to the model parameters (embeddings)}
7:   θ_t ← θ_{t−1} + Δ_t                                 {Update the model parameters}
8: end for
9: return θ_τ

On each iteration, the algorithm samples a batch of triples from the knowledge graph G. Batches are obtained by first randomly permuting all triples in G, partitioning them into n_b batches of the same size, and then iterating over such batches. A single pass over all triples in G is called an epoch. For each triple y in the batch, the algorithm generates a corrupted triple ỹ uniformly sampled from C_G(y): this leads to a set T of observed/corrupted pairs of triples ⟨y, ỹ⟩. The observed/corrupted triple pairs are used for computing the gradient of the objective (loss) function in Eq. 1 w.r.t. the current model parameters θ. Finally, θ is updated in the direction of steepest descent of the objective function. This procedure is repeated until convergence.

The main drawback of SGD is that it requires an initial, careful tuning of the global learning rate η, which is then used for updating all model parameters, regardless of their peculiarities. If an entity x ∈ E_G occurs in only a limited number of triples in G, the corresponding embedding vector e_x ∈ R^k is updated less often, and it may require a much longer time to be learned. For this reason, SGD can be very time-consuming and slow to converge: for instance, it was reported in [10] that learning the optimal embeddings in TransE may require days of computation for large KGs. A possible solution to this problem consists in associating smaller learning rates with parameters that are updated more often, such as the embedding vectors of entities appearing more frequently, and larger learning rates with parameters that are updated less often.

Adaptive Learning Rates. In order to reduce the time required for learning all entity and predicate embeddings, in this work we propose leveraging adaptive per-parameter learning rates. While SGD uses a global, fixed learning rate η for updating all parameters, we propose relying on methods that estimate the optimal learning rate for each parameter, while still being tractable for learning very large models. We consider two highly scalable criteria for selecting the optimal learning rates, namely the Momentum method [16] and AdaGrad [17]: they specify alternative ways of computing the parameter update Δ_t, defined in Alg. 1 on line 6.

Momentum Method. The idea behind this method is to accelerate progress along dimensions where the sign of the gradient does not change, while slowing progress along dimensions where the sign of the gradient continues to change. The new update rule is defined as follows:

  Δ_t ← ρ Δ_{t−1} − η_m g_t,

where η_m ∈ R and the decay rate ρ are user-defined hyper-parameters.
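The following sketch puts the pieces together for TransE with the L1 distance: it samples one corrupted triple from C_G(y), evaluates the hinge term of Eq. 1, and applies a plain SGD update with manually derived (sub)gradients. The sampling scheme (corrupting the subject or the object with equal probability) and all names are assumptions of the sketch, not a reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(triple, n_entities):
    """Sample one corrupted triple from C_G(y): replace either the subject or the object."""
    s, p, o = triple
    if rng.random() < 0.5:
        return (rng.integers(n_entities), p, o)
    return (s, p, rng.integers(n_entities))

def sgd_step(E, P, batch, gamma=1.0, eta=0.01):
    """One iteration of Alg. 1 for TransE with the L1 distance (manual subgradients)."""
    E /= np.linalg.norm(E, axis=1, keepdims=True)        # line 3: normalize entity embeddings
    grad_E, grad_P = np.zeros_like(E), np.zeros_like(P)
    for (s, p, o) in batch:                              # line 4: observed/corrupted pairs
        s2, _, o2 = corrupt((s, p, o), E.shape[0])
        pos = E[s] + P[p] - E[o]
        neg = E[s2] + P[p] - E[o2]
        if gamma + np.abs(pos).sum() - np.abs(neg).sum() > 0:   # hinge [.]_+ is active
            gp, gn = np.sign(pos), np.sign(neg)          # line 5: subgradient of the loss
            grad_E[s] += gp;  grad_E[o] -= gp;  grad_P[p] += gp - gn
            grad_E[s2] -= gn; grad_E[o2] += gn
    E -= eta * grad_E                                    # lines 6-7: Delta_t = -eta * g_t
    P -= eta * grad_P
    return E, P

# Toy usage: 5 entities, 2 predicates, k = 4, and a tiny batch of (s, p, o) index triples.
E, P = rng.normal(size=(5, 4)), rng.normal(size=(2, 4))
E, P = sgd_step(E, P, batch=[(0, 0, 1), (2, 1, 3)])
```

The adaptive schemes discussed here (Momentum, and AdaGrad below) only change how Δ_t is computed from the gradient g_t; the rest of the procedure is unchanged.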
AdaGrad. This method is based on the idea that the learning rate of each parameter should grow with the inverse of the gradient magnitudes accumulated so far. The update rule in AdaGrad is:

  Δ_t ← − ( η_a / √( Σ_{j=1}^{t} g_j² ) ) g_t,

where η_a ∈ R is a user-defined hyper-parameter, and all operations are element-wise. AdaGrad adds almost no computational overhead, it has very strong convergence guarantees [17], and it has shown remarkable results on large-scale learning tasks in distributed environments [18].

5 Empirical Evaluations

This section is organized as follows. In Sect. 5.1 we describe the experimental settings, datasets and evaluation metrics. In Sect. 5.2, we show that adaptive learning rates markedly improve both the efficiency of the learning process and the predictive accuracy of embedding models. In Sect. 5.3, we empirically evaluate the novel embedding models proposed in Sect. 3, by training them using adaptive learning rates.

5.1 Experimental Settings

Table 2: Statistics for the datasets used in the Link Prediction and Triple Classification tasks.

| Dataset | Entities | Predicates | Train. Triples | Valid. Triples | Test Triples |
|---|---|---|---|---|---|
| Freebase (FB15k) [7] | 14,951 | 1,345 | 483,142 | 50,000 | 59,071 |
| WordNet (WN18) [7] | 40,943 | 18 | 141,442 | 5,000 | 5,000 |

Link Prediction. In these experiments, we used the metrics proposed in [7] for evaluating the rank of each test triple. In particular, for each test triple ⟨s, p, o⟩, its object o is replaced with every entity õ ∈ E_G in the KG G in turn, generating a set of corrupted triples of the form ⟨s, p, õ⟩. The energies of the corrupted triples are first computed by the model, then sorted in ascending order, and used to compute the rank of the correct triple. This procedure is repeated by corrupting the subject. Aggregated over all test triples, this procedure leads to the following two metrics: the averaged rank, denoted by Mean Rank, and the proportion of ranks not larger than 10, denoted by Hits@10. This is referred to as the Raw setting. In the Filtered setting, corrupted triples that occur in either the training, validation or test set are removed before computing the rank of each test triple. In both settings, a lower Mean Rank is better, while a higher Hits@10 is better.
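A compact sketch of this ranking protocol for the object side, covering both the Raw and the Filtered settings; the function names and the toy energy table in the usage example are hypothetical:

```python
import numpy as np

def rank_of_object(test_triple, energy_fn, entities, known_triples=None):
    """Rank of the correct object among all candidate objects (1 = best).

    If known_triples is given, corrupted triples occurring in it are filtered out
    (Filtered setting); otherwise the Raw setting is used.
    """
    s, p, o = test_triple
    candidates = []
    for cand in entities:
        if known_triples is not None and cand != o and (s, p, cand) in known_triples:
            continue                      # skip corruptions that are actually true triples
        candidates.append((energy_fn(s, p, cand), cand))
    candidates.sort()                     # ascending energy: lower = more plausible
    return 1 + [c for _, c in candidates].index(o)

def mean_rank_and_hits(test_triples, energy_fn, entities, known_triples=None, k=10):
    """Aggregate Mean Rank and Hits@k over a set of test triples (object side only)."""
    ranks = [rank_of_object(t, energy_fn, entities, known_triples) for t in test_triples]
    return np.mean(ranks), np.mean([r <= k for r in ranks])

# Toy usage with a hand-crafted energy table (illustrative values only):
toy_energy = {("A", "r", "B"): 0.1, ("A", "r", "C"): 0.5, ("A", "r", "A"): 0.9}
fn = lambda s, p, o: toy_energy.get((s, p, o), 1.0)
print(mean_rank_and_hits([("A", "r", "B")], fn, ["A", "B", "C"]))   # (1.0, 1.0)
```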
5.2 Evaluation of Adaptive Learning Rates

Learning Time. For comparing Momentum and AdaGrad with SGD on the task of solving the optimization problem in Eq. 1, we empirically evaluated these methods on the task of learning the parameters of TransE on WN18 and FB15k, using the optimal hyper-parameter settings reported in [7]: k = 20, γ = 2, d = L1 for WN18, and k = 50, γ = 1, d = L1 for FB15k. Following the evaluation protocol in [20], we compared the optimization methods by using a large grid of hyper-parameters. Let G_η = {10^−6, 10^−5, . . . , 10^1} and G_ρ = {1 − 10^−4, 1 − 10^−3, . . . , 1 − 10^−1, 0.5}. The grids of hyper-parameters considered for each of the optimization methods were the following:

• SGD and AdaGrad: rate η, η_a ∈ G_η.
• Momentum: rate η_m ∈ G_η, decay rate ρ ∈ G_ρ.

For each combination of optimization method and hyper-parameter values, we performed an evaluation consisting of 10 learning tasks, each using a different random initialization of the model parameters.

Fig. 1 shows the behavior of the loss functional for each of the optimization methods, using the best hyper-parameter settings selected after 100 training epochs.

[Fig. 1 – Average loss (the lower, the better) across 10 TransE parameter learning tasks on the WordNet (WN18) and Freebase (FB15k) datasets, using the optimal TransE settings reported in [7]. For each optimization method, the plots report the hyper-parameter values achieving the lowest average loss after 100 epochs: AdaGrad with η_a = 0.1 on both datasets; Momentum with η_m = 10^−4, (1 − ρ) = 0.1 on WN18 and η_m = 10^−3, (1 − ρ) = 0.5 on FB15k; SGD with η = 10^−3 on both datasets. Each panel plots the average loss against the training epoch (0–100).]

We can immediately observe that, for both WordNet (WN18) and Freebase (FB15k), AdaGrad (with rate η_a = 0.1) yields markedly lower values of the loss functional than SGD and Momentum, even after very few iterations (< 10 epochs). The duration of each epoch was similar for all methods: each epoch took approx. 1.6 seconds on WordNet (WN18), and approx. 4.6 seconds on Freebase (FB15k), on a single i7 CPU.

Quality of Learned Models. We also measured the quality of the models learned by AdaGrad, in terms of the Mean Rank and Hits@10 metrics, in comparison with SGD. For this purpose, we trained TransE using AdaGrad (instead of SGD) with η_a = 0.1 for 100 epochs, denoting the resulting model as TransE_A, and compared it with results obtained with TransE from the literature on Link Prediction tasks on the WordNet and Freebase datasets. Hyper-parameters were selected according to the performance on the validation set, using the same grids of hyper-parameters used for TransE in [7] for the Link Prediction tasks. The results obtained by TransE_A, in comparison with state-of-the-art results reported in [7], are shown in Tab. 3.

Table 3: Link Prediction results: test performance of several link prediction methods on the WordNet and Freebase datasets. Results show the Mean Rank (the lower, the better) and Hits@10 (the higher, the better) for both the Raw and the Filtered settings [7].

| Method | WN18 Mean Rank (Raw / Filt.) | WN18 Hits@10 % (Raw / Filt.) | FB15k Mean Rank (Raw / Filt.) | FB15k Hits@10 % (Raw / Filt.) |
|---|---|---|---|---|
| Unstructured [12] | 315 / 304 | 35.3 / 38.2 | 1,074 / 979 | 4.5 / 6.3 |
| RESCAL [19] | 1,180 / 1,163 | 37.2 / 52.8 | 828 / 683 | 28.4 / 44.1 |
| SE [8] | 1,011 / 985 | 68.5 / 80.5 | 273 / 162 | 28.8 / 39.8 |
| SME lin. [12] | 545 / 533 | 65.1 / 74.1 | 274 / 154 | 30.7 / 40.8 |
| SME bil. [12] | 526 / 509 | 54.7 / 61.3 | 284 / 158 | 31.3 / 41.3 |
| LFM [14] | 469 / 456 | 71.4 / 81.6 | 283 / 164 | 26.0 / 33.1 |
| TransE [7] | 263 / 251 | 75.4 / 89.2 | 243 / 125 | 34.9 / 47.1 |
| TransE_A | 169 / 158 | 80.5 / 93.5 | 189 / 73 | 44.0 / 60.1 |

Despite the much lower number of training iterations (we trained the model using AdaGrad for only 100 epochs, while in [7] TransE was trained using SGD for 1,000 epochs), TransE_A yields more accurate link prediction models (i.e. with lower Mean Rank and higher Hits@10 values) than every other prediction model in the comparison.
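For completeness, a sketch of the AdaGrad update of Sect. 4 as we apply it here (η_a = 0.1); the accumulator initialization and the small ε added for numerical stability are implementation details we assume, not taken from [17]:

```python
import numpy as np

class AdaGrad:
    """Per-parameter update: delta_t = -eta_a / sqrt(sum_j g_j^2) * g_t (element-wise)."""

    def __init__(self, shape, eta_a=0.1, eps=1e-8):
        self.eta_a = eta_a
        self.eps = eps                          # avoids division by zero (assumed detail)
        self.sum_sq = np.zeros(shape)           # running sum of squared gradients

    def delta(self, grad):
        self.sum_sq += grad ** 2
        return -self.eta_a * grad / (np.sqrt(self.sum_sq) + self.eps)

# Usage: one AdaGrad state per parameter matrix (entity and predicate embeddings).
rng = np.random.default_rng(0)
E = rng.normal(size=(5, 4))
opt_E = AdaGrad(E.shape, eta_a=0.1)
grad_E = rng.normal(size=E.shape)               # gradient of the loss w.r.t. E (Alg. 1, line 5)
E += opt_E.delta(grad_E)                        # theta_t = theta_{t-1} + delta_t (Alg. 1, line 7)
```

Note how parameters that have received few or small gradients so far, such as the embeddings of rarely occurring entities, keep a comparatively large effective learning rate, which is exactly the behavior motivated in Sect. 4.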
5.3 Evaluation of the Proposed Embedding Models

In this section, we evaluate the embedding models inspired by TransE and proposed in Sect. 3: ScalE, TransE+ and ScalE+. Model hyper-parameters were selected according to the performance on the validation set. In the following experiments, we considered a wider grid of hyper-parameters: in particular, we selected the embedding vector dimension k in {20, 50, 100, 200, 300}, the dissimilarity function d(x, y) in {‖x − y‖_1, ‖x − y‖_2, −x^T y}, and the margin γ in {1, 2, 5, 10}. All models were trained using AdaGrad, with η_a = 0.1, for only 100 epochs. The reduced training time enabled us to experiment with a wider range of hyper-parameters than related works in the literature [7]. Results are summarized in Tab. 4.

Table 4: Link Prediction results: test performance of the embedding models proposed in Sect. 3 on the WordNet and Freebase datasets. Results show the Mean Rank (the lower, the better) and Hits@10 (the higher, the better) [7].

| Model | Optimizer | WN18 Mean Rank (Raw / Filt.) | WN18 Hits@10 % (Raw / Filt.) | FB15k Mean Rank (Raw / Filt.) | FB15k Hits@10 % (Raw / Filt.) |
|---|---|---|---|---|---|
| TransE, from [7] | SGD | 263 / 251 | 75.4 / 89.2 | 243 / 125 | 34.9 / 47.1 |
| TransE | AdaGrad | 161 / 150 | 80.5 / 93.5 | 183 / 63 | 47.9 / 68.2 |
| TransE+ (Sect. 3) | AdaGrad | 159 / 148 | 79.6 / 92.6 | 196 / 78 | 44.9 / 62.4 |
| ScalE (Sect. 3) | AdaGrad | 187 / 174 | 82.7 / 94.5 | 194 / 62 | 49.8 / 73.0 |
| ScalE+ (Sect. 3) | AdaGrad | 298 / 287 | 83.7 / 95.5 | 185 / 59 | 50.0 / 71.5 |

We can see that, despite their very different geometric interpretations, all of the embedding models proposed in Sect. 3 achieve markedly higher Hits@10 results than every other link prediction method in Tab. 3. An explanation is that both TransE [7] and the proposed models TransE+, ScalE and ScalE+ have a limited model capacity (or complexity) in comparison with other models. For this reason, when training instances are scarce, they are less prone to overfitting than more expressive link prediction models, such as RESCAL [19], SME [12] and NTN [13]. In each experiment, the proposed models ScalE and ScalE+ always improve over TransE in terms of Hits@10.

By leveraging (i) adaptive learning rates and (ii) the proposed embedding models ScalE and ScalE+, we were able to achieve a record 95.5% Hits@10 on WordNet, and a 73.0% Hits@10 on Freebase. These results are markedly higher than the state-of-the-art results reported in [7]. It is also remarkable that the proposed method required a much lower learning time (100 epochs, approx. 30 minutes on Freebase, on a single CPU) than [7] (1,000 epochs, and careful learning rate tuning). A significantly lower training time – from days, as reported in [10], to minutes – can considerably improve the applicability of embedding models for knowledge graphs in the Web of Data.

6 Conclusions

We focused on the problem of link prediction in Knowledge Graphs. Recently, embedding models like TransE [7] achieved new state-of-the-art link prediction results, while showing the potential to scale to very large KGs.

In this paper, we proposed a method for considerably reducing the learning time in embedding models, based on adaptive learning rate selection, and we proposed a set of new models with interesting scalability properties. We extensively evaluated the proposed methods in several experiments on large real-world datasets. Our results show a significant improvement over state-of-the-art link prediction methods, while reducing the required training time by an order of magnitude. The contributions in this paper improve both the effectiveness and the applicability of embedding models on large and Web-scale KGs. Source code and datasets for reproducing the empirical evaluations discussed in this paper are available online at https://github.com/pminervini/DeepKGC.

References

1. K. D. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor, "Freebase: a collaboratively created graph database for structuring human knowledge," in Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, J. T. Wang, Ed. ACM, 2008, pp. 1247–1250.
2. G. A. Miller, "WordNet: a lexical database for English," Commun. ACM, vol. 38, no. 11, pp. 39–41, 1995.
3. X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang, "Knowledge Vault: a web-scale approach to probabilistic knowledge fusion," in The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, S. A. Macskassy et al., Eds. ACM, 2014, pp. 601–610.
4. A. Bordes and E. Gabrilovich, "Constructing and mining web-scale knowledge graphs: KDD 2014 tutorial," in The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, 2014, p. 1967.
5. ——, "Constructing and mining web-scale knowledge graphs: WWW 2015 tutorial," in Proceedings of the 24th International Conference on World Wide Web, WWW 2015, 2015, p. 1523.
6. Y. Bengio, A. C. Courville, and P. Vincent, "Representation learning: a review and new perspectives," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1798–1828, 2013.
7. A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko, "Translating embeddings for modeling multi-relational data," in Advances in Neural Information Processing Systems 26, C. Burges et al., Eds. Curran Associates, Inc., 2013, pp. 2787–2795.
8. A. Bordes, J. Weston, R. Collobert, and Y. Bengio, "Learning structured embeddings of knowledge bases," in Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2011, W. Burgard et al., Eds. AAAI Press, 2011.
9. T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, Z. Ghahramani, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2013, pp. 3111–3119.
10. K. Chang, W. Yih, B. Yang, and C. Meek, "Typed tensor decomposition of knowledge bases for relation extraction," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, A. Moschitti et al., Eds. ACL, 2014, pp. 1568–1579.
11. Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang, "A tutorial on energy-based learning," in Predicting Structured Data, G. Bakir et al., Eds. MIT Press, 2006.
12. A. Bordes, X. Glorot, J. Weston, and Y. Bengio, "A semantic matching energy function for learning with multi-relational data – application to word-sense disambiguation," Machine Learning, vol. 94, no. 2, pp. 233–259, 2014.
13. R. Socher, D. Chen, C. D. Manning, and A. Ng, "Reasoning with neural tensor networks for knowledge base completion," in Advances in Neural Information Processing Systems 26, C. Burges et al., Eds. Curran Associates, Inc., 2013, pp. 926–934.
14. R. Jenatton, N. L. Roux, A. Bordes, and G. R. Obozinski, "A latent factor model for highly multi-relational data," in Advances in Neural Information Processing Systems 25, F. Pereira et al., Eds. Curran Associates, Inc., 2012, pp. 3167–3175.
15. Z. Wang, J. Zhang, J. Feng, and Z. Chen, "Knowledge graph embedding by translating on hyperplanes," in Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, C. E. Brodley et al., Eds. AAAI Press, 2014, pp. 1112–1119.
16. D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, pp. 533–536, 1986.
17. J. C. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," Journal of Machine Learning Research, vol. 12, pp. 2121–2159, 2011.
18. J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. A. Ranzato, A. Senior, P. Tucker, K. Yang, Q. V. Le, and A. Y. Ng, "Large scale distributed deep networks," in Advances in Neural Information Processing Systems 25, F. Pereira et al., Eds. Curran Associates, Inc., 2012, pp. 1223–1231.
19. M. Nickel, V. Tresp, and H. Kriegel, "A three-way model for collective learning on multi-relational data," in Proceedings of the 28th International Conference on Machine Learning, ICML 2011, L. Getoor et al., Eds. Omnipress, 2011, pp. 809–816.
20. T. Schaul, I. Antonoglou, and D. Silver, "Unit tests for stochastic optimization," in International Conference on Learning Representations, 2014.