Efficient Learning of Entity and Predicate Embeddings for Link Prediction in Knowledge Graphs

Pasquale Minervini, Claudia d'Amato, Nicola Fanizzi, Floriana Esposito
LACAM – Department of Computer Science – Università degli Studi di Bari Aldo Moro, Italy
{firstname.lastname}@uniba.it

Abstract. Knowledge Graphs are a widely used formalism for representing knowledge in the Web of Data. We focus on the problem of predicting missing links in large knowledge graphs, so as to discover new facts about the world. Recently, representation learning models that embed entities and predicates in continuous vector spaces have achieved new state-of-the-art results on this problem. A major limitation of these models is that the training process, which consists in learning the optimal entity and predicate embeddings for a given knowledge graph, can be very computationally expensive: it may even require days of computation for large knowledge graphs. In this work, by leveraging adaptive learning rates, we propose a principled method for reducing the training time by an order of magnitude, while learning more accurate link prediction models. Furthermore, we employ the proposed training method for evaluating a set of novel and scalable models. Our evaluation shows significant improvements over state-of-the-art link prediction methods on the WordNet and Freebase datasets.

1 Introduction

Knowledge Graphs (KGs) are graph-structured Knowledge Bases (KBs), where factual knowledge about the world is represented in the form of relationships between entities. They are widely used for representing relational knowledge in a variety of domains, such as citation networks and protein interaction networks. An example of their widespread adoption is the Linked Open Data (LOD) Cloud, a set of interlinked KGs such as Freebase [1] and WordNet [2]. As of April 2014, the LOD Cloud was composed of 1,091 interlinked KBs, describing over 8 × 10^6 entities and 188 × 10^6 relationships holding between them (State of the LOD Cloud 2014: http://lod-cloud.net/).

Despite their large size, many KGs are still largely incomplete. For example, consider Freebase (publicly available at https://developers.google.com/freebase/data), a core element in the Google Knowledge Vault project [3]: 71% of the persons described in Freebase have no known place of birth, 75% of them have no known nationality, and the coverage for less frequent predicates can be even lower [3].

In this work we focus on the problem of completing missing links in large KGs, so as to discover new facts about the world. In the literature, this problem is referred to as link prediction or knowledge graph completion, and it has received considerable attention over the last years [4,5].

Recently, representation learning [6] models such as the Translating Embeddings model (TransE) [7] achieved new state-of-the-art link prediction results on large and Web-scale KGs [4,5]. Such models learn a unique distributed representation, or embedding, for each entity and predicate in the KG: each entity is represented by a low-dimensional continuous embedding vector, and each predicate is represented by an operation in the embedding vector space, such as a translation (as in [7]) or an affine transformation (as in [8]). We refer to these models as embedding models, and to the learned distributed representations as embeddings.
The embeddings of all entities and predicates in the KG are learned jointly: the learning process consists in minimizing a global loss functional over the whole KG, by back-propagating the loss to the embeddings. (In natural language processing, a similar procedure is used by the word2vec model [9] for learning a unique distributed representation for each word in a corpus of documents.) As a consequence, the learned entity and predicate embeddings retain global, structural information about the whole KG, and can be used to serve several kinds of applications. In link prediction, the confidence of each candidate edge can be measured as a function of the embeddings of its source entity, its target entity, and its predicate.

A major limitation of the embedding models proposed so far, however, is that the learning procedure (i.e. learning the optimal embeddings of all entities and predicates in the KG) can be very time-consuming: it is based on an incremental optimization algorithm that may require days of computation to converge for large KGs [10].

In this work, we propose a novel and principled method for significantly reducing the learning time in embedding models, based on adaptive per-parameter learning rates. Furthermore, we employ the proposed training method for evaluating a variety of novel embedding models: our evaluation achieves new state-of-the-art link prediction results on the WordNet and Freebase datasets.

2 Basics

RDF Graphs. The most widely used formalism for representing knowledge graphs is the W3C Resource Description Framework (RDF, http://www.w3.org/TR/rdf11-concepts/), a recommended standard for representing knowledge on the Web. An RDF KB, also referred to as an RDF graph, is a set of RDF triples of the form ⟨s, p, o⟩, where s, p and o respectively denote the subject, the predicate and the object of the triple: s and o are entities, and p is a relation type. Each triple ⟨s, p, o⟩ describes a statement, which is interpreted as "a relationship p holds between the entities s and o".

Example 2.1 (Shakespeare). The statement "William Shakespeare is an author who wrote Othello and the tragedy Hamlet" can be expressed by the following RDF triples:

⟨Shakespeare, profession, Author⟩
⟨Shakespeare, author, Hamlet⟩
⟨Shakespeare, author, Othello⟩
⟨Hamlet, genre, Tragedy⟩

An RDF graph can be viewed as a labeled directed multigraph, where each entity is a vertex, and each RDF triple is represented by a directed edge whose label is a predicate, emanating from its subject vertex to its object vertex. In RDF KBs, the Open-World Assumption holds: a missing triple does not mean that the corresponding statement is false, but rather that its truth value is unknown (it cannot be observed).

In the following, given an RDF graph G, we denote by E_G the set of all entities occurring as subjects or objects in G, and by R_G the set of all predicates occurring in G:

  E_G = {s | ∃⟨s, p, o⟩ ∈ G} ∪ {o | ∃⟨s, p, o⟩ ∈ G},   R_G = {p | ∃⟨s, p, o⟩ ∈ G}.

For instance, in the case of the RDF graph shown in Ex. 2.1, the sets E_G and R_G are the following: E_G = {Author, Shakespeare, Hamlet, Othello, Tragedy} and R_G = {profession, author, genre}. Furthermore, we denote by S_G = E_G × R_G × E_G the space of possible triples of G, i.e. the set of all triples that can be created by using the entities and predicates in G (note that G ⊆ S_G). We refer to all triples in G as observed triples, and to all triples in S_G \ G as unobserved triples.
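For illustration, the sets E_G, R_G and S_G for the graph of Ex. 2.1 can be computed directly from the set of triples. The following is a minimal Python sketch; the set-of-tuples representation and the function names are our own choices, not part of the RDF standard:

```python
from itertools import product

# The RDF graph from Ex. 2.1, represented as a set of (subject, predicate, object) triples.
G = {
    ("Shakespeare", "profession", "Author"),
    ("Shakespeare", "author", "Hamlet"),
    ("Shakespeare", "author", "Othello"),
    ("Hamlet", "genre", "Tragedy"),
}

def entities(G):
    """E_G: all entities occurring as subject or object of some triple in G."""
    return {s for s, _, _ in G} | {o for _, _, o in G}

def predicates(G):
    """R_G: all predicates occurring in G."""
    return {p for _, p, _ in G}

E_G, R_G = entities(G), predicates(G)
S_G = set(product(E_G, R_G, E_G))   # space of possible triples, E_G x R_G x E_G
unobserved = S_G - G                # triples whose truth value is unknown

print(len(E_G), len(R_G), len(S_G), len(unobserved))   # 5 3 75 71
```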
Energy-Based Models. Embedding models for KGs can be described in terms of Energy-Based Models (EBMs) [11]: EBMs are a versatile and flexible framework for modeling dependencies between variables. In the fields of representation learning and deep learning [6], EBMs are employed as building blocks for constructing hierarchical models that achieve ground-breaking results in several learning tasks.

A fundamental component of an EBM is a scalar-valued energy function (or scoring function) E_θ(·), parametrized by θ, which associates a scalar energy value to a configuration of a set of variables. The energy of a configuration is inversely related to its probability: more likely configurations are associated with lower energy values, while less likely configurations are associated with higher energy values. Several tractable methods have been proposed for learning the parameters of an energy function [11,6]. In particular, the problem of learning the optimal parameters θ̂ can be cast as solving the following optimization problem [11]:

  θ̂ = argmin_{θ ∈ Θ} L(E_θ, D),

where Θ is the parameter space, and L(·) is a loss functional which measures the quality of the energy function E_θ(·) on the data D. Intuitively, the loss functional L(·) assigns a lower loss to energy functions that associate a lower energy (corresponding to a higher probability) to correct answers, and a higher energy (corresponding to a lower probability) to all incorrect answers.

Energy-Based Models for RDF Graphs. As discussed in [12], embedding models for KGs define an energy distribution E_θ : S_G → R over the space of possible triples S_G. For instance, the models proposed in [8,7,13,12] assign a score E(⟨s, p, o⟩) to each triple ⟨s, p, o⟩ in S_G. In a link prediction setting, such models are used as follows. First, the optimal parameters θ̂ of the energy function are learned: the parameters are composed of the embeddings of all entities and predicates in the KG. Then, the energy function E_θ̂(·) is used for ranking unobserved triples: those with lower energy values have a higher probability of representing true statements, and are considered more likely candidates for a completion of the KG.

Table 1: Energy functions used by several link prediction models, and the corresponding number of parameters: n_e = |E_G| and n_r = |R_G| respectively denote the number of entities and predicates in G, and k, d ∈ N are user-defined hyper-parameters.

| Model | Energy function E(⟨s, p, o⟩) | Parameters | Complexity |
|---|---|---|---|
| Unstructured [12] | ‖e_s − e_o‖²₂ | e_s, e_o ∈ R^k | O(n_e k) |
| TransE [7] | ‖(e_s + e_p) − e_o‖_{1,2} | e_p ∈ R^k | O(n_e k + n_r k) |
| SE [8] | ‖W_{p,1} e_s − W_{p,2} e_o‖_1 | W_{p,i} ∈ R^{k×k} | O(n_e k + n_r k²) |
| SME lin. [12] | (W_1 e_s + W_2 e_p + b_1)^T (W_3 e_o + W_4 e_p + b_2) | e_p ∈ R^k, W_i ∈ R^{d×k}, b_j ∈ R^d | O(n_e k + n_r k + dk) |
| SME bil. [12] | ((W_1 ×_3 e_p) e_s)^T ((W_2 ×_3 e_p) e_o) | e_p ∈ R^k, W_i ∈ R^{d×k×k}, b_j ∈ R^d | O(n_e k + n_r k + dk²) |
| NTN [13] | u_p^T f(e_s^T W_p e_o + W_{p,1} e_s + W_{p,2} e_o + b_p) | W_p ∈ R^{k×k×d}, W_{p,i} ∈ R^{d×k}, u_p, b_p ∈ R^d | O(n_e k + n_r dk²) |

Consider the RDF graph shown in Ex. 2.1. In such a graph, we prefer learning an energy-based model that assigns a lower energy (a higher probability) to the triple ⟨Othello, genre, Tragedy⟩, which is unobserved but represents the true statement "Othello is a Tragedy", and a higher energy (a lower probability) to other unobserved triples, such as ⟨Hamlet, genre, Author⟩.
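To make the ranking protocol concrete, the sketch below ranks candidate completions of ⟨Othello, genre, ·⟩ under a toy translation-based energy of the kind listed in Tab. 1; the two-dimensional embedding values are hand-picked for illustration, not learned:

```python
import numpy as np

# Toy 2-dimensional embeddings (illustrative values only, not learned).
entity_emb = {
    "Author":  np.array([1.0, 0.0]),
    "Tragedy": np.array([0.0, 1.0]),
    "Hamlet":  np.array([0.1, 0.9]),
    "Othello": np.array([0.2, 0.8]),
}
# One translation vector per predicate (illustrative values only).
predicate_emb = {"genre": np.array([-0.2, 0.2])}

def energy(s, p, o):
    """Lower energy = more plausible triple (here, the L1 distance of e_s + e_p from e_o)."""
    return np.abs(entity_emb[s] + predicate_emb[p] - entity_emb[o]).sum()

# Rank candidate objects for the unobserved triple <Othello, genre, ?>.
for o in sorted(entity_emb, key=lambda o: energy("Othello", "genre", o)):
    print(o, round(energy("Othello", "genre", o), 2))
# With these embeddings, 'Tragedy' is ranked first (lowest energy).
```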
3 Energy-Based Embedding Models

Several EBMs have been proposed in the literature for addressing the problem of link prediction in KGs [8,14,7,13,12,15]. These models share a fundamental characteristic: they can be used for learning a distributed representation (or embedding) for each entity and predicate in the KG. We refer to such models as embedding models, and denote the distributed representation of an entity or predicate z by adding a subscript to the corresponding vector or matrix representation, as in e_z ∈ R^k.

Formally, let G be an RDF graph. For each entity x ∈ E_G, embedding models learn a continuous vector representation e_x ∈ R^k, with k ∈ N, called the embedding vector of x. Similarly, for each predicate p ∈ R_G, embedding models learn an operation on the embedding vector space, characterized by a set of embedding parameters. This can be an empty set of parameters, as in the Unstructured model proposed in [12]; a translation vector e_p ∈ R^k, as in the Translating Embeddings model proposed in [7]; or a more complex set of parameters.

The distributed representations of all entities and predicates in G are then used for defining an energy distribution E : S_G → R over the space of possible triples of G. In particular, the energy E(⟨s, p, o⟩) of a triple ⟨s, p, o⟩ is defined as a function of the distributed representations of its subject s, its predicate p and its object o.

In Tab. 1 we report the energy functions adopted by several models proposed in the literature. For each model, we also report the number of parameters needed for storing the distributed representations of all entities and predicates: n_e = |E_G| denotes the number of entities in the KG, n_r = |R_G| denotes the number of predicates, and k, d ∈ N are user-defined hyper-parameters. In general, if the number of parameters in a model grows super-linearly with the number of entities and predicates in the KG, it becomes increasingly harder for the model to scale to very large and Web-scale KGs.

The Translating Embeddings model. Among the models outlined in Tab. 1, the recently proposed Translating Embeddings model (TransE) [7] has interesting characteristics, and has received considerable attention [4]:

– It achieves more accurate link prediction results than other state-of-the-art methods on several datasets.
– The number of parameters in TransE scales linearly in the number of entities n_e and predicates n_r in the KG: this allows TransE to potentially scale to large KGs.

The TransE model is very simple. In TransE, each entity x ∈ E_G is represented by its embedding vector e_x ∈ R^k, and each predicate p ∈ R_G is represented by a (vector) translation operation e_p ∈ R^k. The energy of a triple ⟨s, p, o⟩ is given by the L1 or L2 distance between (e_s + e_p) and e_o:

  E(⟨s, p, o⟩) = ‖(e_s + e_p) − e_o‖_{1,2}.

In TransE, all the embedding and translation vectors are learned jointly from the KG by using Stochastic Gradient Descent, as discussed in Sect. 4. The number of parameters needed by the TransE model for storing all the embedding and translation vectors is (n_e k + n_r k), a quantity that grows linearly with n_e and n_r. For this reason, TransE can potentially scale to very large and highly-relational KGs [7].
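As an illustration of the O(n_e k + n_r k) parameter storage and of the energy computation, the sketch below keeps all TransE parameters in two matrices and evaluates the energy of a batch of triples with element-wise operations; the sizes, the random initialization scheme and the function name are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_entities, n_predicates, k = 1000, 20, 50          # illustrative sizes

# All TransE parameters: O(n_e * k + n_r * k) numbers.
E = rng.normal(scale=0.1, size=(n_entities, k))      # entity embeddings e_x
P = rng.normal(scale=0.1, size=(n_predicates, k))    # predicate translations e_p

def transe_energy(triples, E, P, norm=1):
    """L1 (or L2) energy ||(e_s + e_p) - e_o|| for an array of (s, p, o) index triples."""
    s, p, o = triples[:, 0], triples[:, 1], triples[:, 2]
    residual = E[s] + P[p] - E[o]                    # shape: (batch, k)
    if norm == 1:
        return np.abs(residual).sum(axis=1)
    return np.sqrt((residual ** 2).sum(axis=1))

batch = np.array([[0, 3, 42], [17, 5, 256]])         # two (subject, predicate, object) index triples
print(transe_energy(batch, E, P))                    # one energy value per triple
```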
A New Set of Embedding Models. In the following, we propose a set of variants of the TransE model, which preserve its scalability properties. Let d(x, y) be a dissimilarity function chosen from the set

  d(x, y) ∈ { ‖x − y‖_1, ‖x − y‖_2, −x^T y },

i.e. from the L1 distance, the L2 distance, and the negative inner product. We propose the following embedding models, where each is defined by the corresponding energy function E(·):

– TransE: E(⟨s, p, o⟩) = d(e_s + e_p, e_o),
– TransE+: E(⟨s, p, o⟩) = d(e_s + e_{p,1}, e_o + e_{p,2}),
– ScalE: E(⟨s, p, o⟩) = d(e_s ◦ e_p, e_o),
– ScalE+: E(⟨s, p, o⟩) = d(e_s ◦ e_{p,1}, e_o ◦ e_{p,2}),

where e_s, e_o ∈ R^k are the embedding vectors of the entities appearing as the subject s and the object o; e_{p,·} ∈ R^k are the embedding parameters of the predicate p, denoting either a translation or a scaling vector; and ◦ denotes the Hadamard (element-wise) product, corresponding to the vector scaling operation. The energy function in TransE is the same as in [7], but we also allow using the negative inner product as a dissimilarity measure between the (translated) subject and object embedding vectors, if this improves the performance on the validation set. The TransE+ model generalizes TransE by also translating the object embedding vector e_o. The ScalE and ScalE+ models are similar to the previous two models, but replace the vector translation with a scaling operation. The rationale behind ScalE and ScalE+ is the following: scaling the embedding vector of an entity can be seen as weighting the (latent) features of that entity in the embedding vector space.

All proposed models share the same advantages as the TransE model: (i) the required number of parameters is O(n_e k + n_r k), which grows linearly with n_e and n_r, and (ii) the energy function and its gradient w.r.t. the embeddings of entities and predicates can be computed very efficiently, using element-wise vector operations.
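A minimal sketch of the four proposed energy functions, under the same matrix-based representation used above; the use of two predicate parameter matrices P1 and P2 (the second one only used by TransE+ and ScalE+) is our own encoding choice:

```python
import numpy as np

def dissimilarity(x, y, d="l1"):
    """d(x, y): L1 distance, L2 distance, or negative inner product (row-wise)."""
    if d == "l1":
        return np.abs(x - y).sum(axis=-1)
    if d == "l2":
        return np.sqrt(((x - y) ** 2).sum(axis=-1))
    return -(x * y).sum(axis=-1)                     # negative inner product

def energy(model, s, p, o, E, P1, P2, d="l1"):
    """Energy of the proposed models; E holds entity embeddings, P1/P2 predicate parameters."""
    es, eo = E[s], E[o]
    if model == "TransE":
        return dissimilarity(es + P1[p], eo, d)
    if model == "TransE+":
        return dissimilarity(es + P1[p], eo + P2[p], d)
    if model == "ScalE":
        return dissimilarity(es * P1[p], eo, d)      # Hadamard product = feature scaling
    if model == "ScalE+":
        return dissimilarity(es * P1[p], eo * P2[p], d)
    raise ValueError(model)

rng = np.random.default_rng(0)
E, P1, P2 = rng.normal(size=(5, 4)), rng.normal(size=(2, 4)), rng.normal(size=(2, 4))
for m in ("TransE", "TransE+", "ScalE", "ScalE+"):
    print(m, energy(m, 0, 1, 3, E, P1, P2, d="l1"))
```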
4 Improving the Efficiency of the Embeddings Learning Process

In [8,7,12], the authors propose a method for jointly learning the embeddings of all entities and predicates in a KG G. The method relies on a stochastic optimization process that iteratively updates the embeddings by reducing the energy of triples in G (observed triples) while increasing the energy of triples in S_G \ G (unobserved triples). During the learning process, unobserved triples are randomly generated by means of a corruption process, which replaces either the subject or the object of each observed triple with another entity in G. More formally, given an observed triple y ∈ G, let C_G(y) denote the set of all corrupted triples obtained by replacing either its subject or its object with another entity in G:

  C_G(⟨s, p, o⟩) = {⟨s̃, p, o⟩ | s̃ ∈ E_G} ∪ {⟨s, p, õ⟩ | õ ∈ E_G}.

The embeddings of all entities and predicates in the KG, which compose the model parameters, can be learned by minimizing a margin-based ranking loss. More formally, learning the optimal model parameters θ̂, corresponding to all the entity and predicate embeddings, is equivalent to solving the following constrained minimization problem:

  minimize_{θ ∈ Θ}  Σ_{y ∈ G} Σ_{ỹ ∈ C_G(y)} [γ + E_θ(y) − E_θ(ỹ)]_+        (1)
  subject to  ‖e_x‖ = 1  for all x ∈ E_G,

where [x]_+ = max{0, x}, and γ ≥ 0 is a hyper-parameter referred to as the margin. The objective function in Eq. 1 (corresponding to the loss functional L(·) discussed in Sect. 2) enforces the energy of observed triples to be lower than the energy of unobserved triples. The constraints in the optimization problem prevent the training process from trivially minimizing the loss by increasing the entity embedding norms.

Stochastic Gradient Descent. In the literature, the constrained loss minimization problem in Eq. 1 is solved using Stochastic Gradient Descent (SGD) in mini-batch mode, as summarized in Alg. 1.

Algorithm 1: Learning the model parameters via SGD
Require: Learning rate η, Batch size n, Iterations τ
Ensure: Optimal model parameters (embeddings) θ̂
1: Initialize the model parameters θ_0
2: for t ∈ ⟨1, . . . , τ⟩ do
3:   e_x ← e_x / ‖e_x‖, ∀x ∈ E_G                         {Normalize all entity embedding vectors}
4:   T ← SampleBatch(G, n)                               {Sample a batch of observed and corrupted triples}
5:   g_t ← ∇ Σ_{⟨y,ỹ⟩∈T} [γ + E_θ(y) − E_θ(ỹ)]_+         {Compute the gradient of the loss}
6:   Δ_t ← −η g_t                                        {Compute the update to the model parameters (embeddings)}
7:   θ_t ← θ_{t−1} + Δ_t                                 {Update the model parameters}
8: end for
9: return θ_τ

On each iteration, the algorithm samples a batch of triples from the knowledge graph G. Batches are obtained by first randomly permuting all triples in G, partitioning them into n_b batches of the same size, and then iterating over such batches. A single pass over all triples in G is called an epoch. For each triple y in the batch, the algorithm generates a corrupted triple ỹ uniformly sampled from C_G(y): this leads to a set T of observed/corrupted pairs of triples ⟨y, ỹ⟩. The observed/corrupted triple pairs are used for computing the gradient of the objective (loss) function in Eq. 1 w.r.t. the current model parameters θ. Finally, θ is updated in the direction of steepest descent of the objective function. This procedure is repeated until convergence.

The main drawback of SGD is that it requires an initial, careful tuning of the global learning rate η, which is then used for updating all model parameters, regardless of their peculiarities. If an entity x ∈ E_G occurs in only a limited number of triples in G, the corresponding embedding vector e_x ∈ R^k is updated less often, and it may require a much longer time to be learned. For this reason, SGD can be very time-consuming and slow to converge: for instance, it was reported in [10] that learning the optimal embeddings in TransE may require days of computation for large KGs. A possible solution to this problem consists in associating smaller learning rates with parameters that are updated more often, such as the embedding vectors of entities appearing more frequently, and larger learning rates with parameters that are updated less often.

Adaptive Learning Rates. In order to reduce the time required for learning all entity and predicate embeddings, in this work we propose leveraging adaptive per-parameter learning rates. While SGD uses a global, fixed learning rate η for updating all parameters, we propose relying on methods that estimate the optimal learning rate for each parameter, while still being tractable for learning very large models. We consider two highly scalable criteria for selecting the optimal learning rates, namely the Momentum method [16] and AdaGrad [17]: they specify alternative ways of computing the parameter update Δ_t, defined in Alg. 1 on line 6.

Momentum Method. The idea behind this method is to accelerate progress along dimensions where the sign of the gradient does not change, while slowing progress along dimensions where the sign of the gradient continues to change. The new update rule is defined as follows:

  Δ_t ← ρ Δ_{t−1} − η_m g_t,

where η_m ∈ R and the decay rate ρ are user-defined hyper-parameters.
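The following sketch puts the pieces together for TransE with the L1 distance: it samples one corrupted triple from C_G(y), evaluates the hinge term of Eq. 1, and applies a plain SGD update with manually derived (sub)gradients. The sampling scheme (corrupting the subject or the object with equal probability) and all names are assumptions of the sketch, not a reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(triple, n_entities):
    """Sample one corrupted triple from C_G(y): replace either the subject or the object."""
    s, p, o = triple
    if rng.random() < 0.5:
        return (rng.integers(n_entities), p, o)
    return (s, p, rng.integers(n_entities))

def sgd_step(E, P, batch, gamma=1.0, eta=0.01):
    """One iteration of Alg. 1 for TransE with the L1 distance (manual subgradients)."""
    E /= np.linalg.norm(E, axis=1, keepdims=True)        # line 3: normalize entity embeddings
    grad_E, grad_P = np.zeros_like(E), np.zeros_like(P)
    for (s, p, o) in batch:                              # line 4: observed/corrupted pairs
        s2, _, o2 = corrupt((s, p, o), E.shape[0])
        pos = E[s] + P[p] - E[o]
        neg = E[s2] + P[p] - E[o2]
        if gamma + np.abs(pos).sum() - np.abs(neg).sum() > 0:   # hinge [.]_+ is active
            gp, gn = np.sign(pos), np.sign(neg)          # line 5: subgradient of the loss
            grad_E[s] += gp;  grad_E[o] -= gp;  grad_P[p] += gp - gn
            grad_E[s2] -= gn; grad_E[o2] += gn
    E -= eta * grad_E                                    # lines 6-7: Delta_t = -eta * g_t
    P -= eta * grad_P
    return E, P

# Toy usage: 5 entities, 2 predicates, k = 4, and a tiny batch of (s, p, o) index triples.
E, P = rng.normal(size=(5, 4)), rng.normal(size=(2, 4))
E, P = sgd_step(E, P, batch=[(0, 0, 1), (2, 1, 3)])
```

The adaptive schemes discussed here (Momentum, and AdaGrad below) only change how Δ_t is computed from the gradient g_t; the rest of the procedure is unchanged.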
AdaGrad. This method is based on the idea that the learning rate of each parameter should grow with the inverse of the gradient magnitudes accumulated so far. The update rule in AdaGrad is:

  Δ_t ← − ( η_a / √( Σ_{j=1}^{t} g_j² ) ) g_t,

where η_a ∈ R is a user-defined hyper-parameter, and all operations are element-wise. AdaGrad adds almost no computational overhead, it has very strong convergence guarantees [17], and it has shown remarkable results on large-scale learning tasks in distributed environments [18].

5 Empirical Evaluations

This section is organized as follows. In Sect. 5.1 we describe the experimental settings, datasets and evaluation metrics. In Sect. 5.2, we show that adaptive learning rates markedly improve both the efficiency of the learning process and the predictive accuracy of embedding models. In Sect. 5.3, we empirically evaluate the novel embedding models proposed in Sect. 3, by training them using adaptive learning rates.

5.1 Experimental Settings

Table 2: Statistics for the datasets used in the Link Prediction and Triple Classification tasks.

| Dataset | Entities | Predicates | Train. Triples | Valid. Triples | Test Triples |
|---|---|---|---|---|---|
| Freebase (FB15k) [7] | 14,951 | 1,345 | 483,142 | 50,000 | 59,071 |
| WordNet (WN18) [7] | 40,943 | 18 | 141,442 | 5,000 | 5,000 |

Link Prediction. In these experiments, we used the metrics proposed in [7] for evaluating the rank of each test triple. In particular, for each test triple ⟨s, p, o⟩, its object o is replaced with every entity õ ∈ E_G in the KG G in turn, generating a set of corrupted triples of the form ⟨s, p, õ⟩. The energies of the corrupted triples are first computed by the model, then sorted in ascending order, and used to compute the rank of the correct triple. This procedure is repeated by corrupting the subject. Aggregated over all test triples, this procedure leads to the following two metrics: the averaged rank, denoted by Mean Rank, and the proportion of ranks not larger than 10, denoted by Hits@10. This is referred to as the Raw setting. In the Filtered setting, corrupted triples that occur in either the training, validation or test set are removed before computing the rank of each test triple. In both settings, a lower Mean Rank is better, while a higher Hits@10 is better.
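A compact sketch of this ranking protocol for the object side, covering both the Raw and the Filtered settings; the function names and the toy energy table in the usage example are hypothetical:

```python
import numpy as np

def rank_of_object(test_triple, energy_fn, entities, known_triples=None):
    """Rank of the correct object among all candidate objects (1 = best).

    If known_triples is given, corrupted triples occurring in it are filtered out
    (Filtered setting); otherwise the Raw setting is used.
    """
    s, p, o = test_triple
    candidates = []
    for cand in entities:
        if known_triples is not None and cand != o and (s, p, cand) in known_triples:
            continue                      # skip corruptions that are actually true triples
        candidates.append((energy_fn(s, p, cand), cand))
    candidates.sort()                     # ascending energy: lower = more plausible
    return 1 + [c for _, c in candidates].index(o)

def mean_rank_and_hits(test_triples, energy_fn, entities, known_triples=None, k=10):
    """Aggregate Mean Rank and Hits@k over a set of test triples (object side only)."""
    ranks = [rank_of_object(t, energy_fn, entities, known_triples) for t in test_triples]
    return np.mean(ranks), np.mean([r <= k for r in ranks])

# Toy usage with a hand-crafted energy table (illustrative values only):
toy_energy = {("A", "r", "B"): 0.1, ("A", "r", "C"): 0.5, ("A", "r", "A"): 0.9}
fn = lambda s, p, o: toy_energy.get((s, p, o), 1.0)
print(mean_rank_and_hits([("A", "r", "B")], fn, ["A", "B", "C"]))   # (1.0, 1.0)
```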
5.2 Evaluation of Adaptive Learning Rates

Learning Time. For comparing Momentum and AdaGrad with SGD on the task of solving the optimization problem in Eq. 1, we empirically evaluated these methods on the task of learning the parameters of TransE on WN18 and FB15k, using the optimal hyper-parameter settings reported in [7]: k = 20, γ = 2, d = L1 for WN18, and k = 50, γ = 1, d = L1 for FB15k. Following the evaluation protocol in [20], we compared the optimization methods by using a large grid of hyper-parameters. Let G_η = {10^−6, 10^−5, . . . , 10^1} and G_ρ = {1 − 10^−4, 1 − 10^−3, . . . , 1 − 10^−1, 0.5}. The grids of hyper-parameters considered for each of the optimization methods were the following:

• SGD and AdaGrad: rate η, η_a ∈ G_η.
• Momentum: rate η_m ∈ G_η, decay rate ρ ∈ G_ρ.

For each combination of optimization method and hyper-parameter values, we performed an evaluation consisting of 10 learning tasks, each using a different random initialization of the model parameters.

Fig. 1 shows the behavior of the loss functional for each of the optimization methods, using the best hyper-parameter settings selected after 100 training epochs.

[Fig. 1 – Average loss (the lower, the better) across 10 TransE parameter learning tasks on the WordNet (WN18) and Freebase (FB15k) datasets, using the optimal TransE settings reported in [7]. For each optimization method, the plots report the hyper-parameter values achieving the lowest average loss after 100 epochs: AdaGrad with η_a = 0.1 on both datasets; Momentum with η_m = 10^−4, (1 − ρ) = 0.1 on WN18 and η_m = 10^−3, (1 − ρ) = 0.5 on FB15k; SGD with η = 10^−3 on both datasets. Each panel plots the average loss against the training epoch (0–100).]

We can immediately observe that, for both WordNet (WN18) and Freebase (FB15k), AdaGrad (with rate η_a = 0.1) yields markedly lower values of the loss functional than SGD and Momentum, even after very few iterations (< 10 epochs). The duration of each epoch was similar for all methods: each epoch took approx. 1.6 seconds on WordNet (WN18), and approx. 4.6 seconds on Freebase (FB15k), on a single i7 CPU.

Quality of Learned Models. We also measured the quality of the models learned by AdaGrad, in terms of the Mean Rank and Hits@10 metrics, in comparison with SGD. For this purpose, we trained TransE using AdaGrad (instead of SGD) with η_a = 0.1 for 100 epochs, denoting the resulting model as TransE_A, and compared it with results obtained with TransE from the literature on Link Prediction tasks on the WordNet and Freebase datasets. Hyper-parameters were selected according to the performance on the validation set, using the same grids of hyper-parameters used for TransE in [7] for the Link Prediction tasks. The results obtained by TransE_A, in comparison with state-of-the-art results reported in [7], are shown in Tab. 3.

Table 3: Link Prediction results: test performance of several link prediction methods on the WordNet and Freebase datasets. Results show the Mean Rank (the lower, the better) and Hits@10 (the higher, the better) for both the Raw and the Filtered settings [7].

| Method | WN18 Mean Rank (Raw / Filt.) | WN18 Hits@10 % (Raw / Filt.) | FB15k Mean Rank (Raw / Filt.) | FB15k Hits@10 % (Raw / Filt.) |
|---|---|---|---|---|
| Unstructured [12] | 315 / 304 | 35.3 / 38.2 | 1,074 / 979 | 4.5 / 6.3 |
| RESCAL [19] | 1,180 / 1,163 | 37.2 / 52.8 | 828 / 683 | 28.4 / 44.1 |
| SE [8] | 1,011 / 985 | 68.5 / 80.5 | 273 / 162 | 28.8 / 39.8 |
| SME lin. [12] | 545 / 533 | 65.1 / 74.1 | 274 / 154 | 30.7 / 40.8 |
| SME bil. [12] | 526 / 509 | 54.7 / 61.3 | 284 / 158 | 31.3 / 41.3 |
| LFM [14] | 469 / 456 | 71.4 / 81.6 | 283 / 164 | 26.0 / 33.1 |
| TransE [7] | 263 / 251 | 75.4 / 89.2 | 243 / 125 | 34.9 / 47.1 |
| TransE_A | 169 / 158 | 80.5 / 93.5 | 189 / 73 | 44.0 / 60.1 |

Despite the much lower number of training iterations (we trained the model using AdaGrad for only 100 epochs, while in [7] TransE was trained using SGD for 1,000 epochs), TransE_A yields more accurate link prediction models (i.e. with lower Mean Rank and higher Hits@10 values) than every other prediction model in the comparison.
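For completeness, a sketch of the AdaGrad update of Sect. 4 as we apply it here (η_a = 0.1); the accumulator initialization and the small ε added for numerical stability are implementation details we assume, not taken from [17]:

```python
import numpy as np

class AdaGrad:
    """Per-parameter update: delta_t = -eta_a / sqrt(sum_j g_j^2) * g_t (element-wise)."""

    def __init__(self, shape, eta_a=0.1, eps=1e-8):
        self.eta_a = eta_a
        self.eps = eps                          # avoids division by zero (assumed detail)
        self.sum_sq = np.zeros(shape)           # running sum of squared gradients

    def delta(self, grad):
        self.sum_sq += grad ** 2
        return -self.eta_a * grad / (np.sqrt(self.sum_sq) + self.eps)

# Usage: one AdaGrad state per parameter matrix (entity and predicate embeddings).
rng = np.random.default_rng(0)
E = rng.normal(size=(5, 4))
opt_E = AdaGrad(E.shape, eta_a=0.1)
grad_E = rng.normal(size=E.shape)               # gradient of the loss w.r.t. E (Alg. 1, line 5)
E += opt_E.delta(grad_E)                        # theta_t = theta_{t-1} + delta_t (Alg. 1, line 7)
```

Note how parameters that have received few or small gradients so far, such as the embeddings of rarely occurring entities, keep a comparatively large effective learning rate, which is exactly the behavior motivated in Sect. 4.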
5.3 Evaluation of the Proposed Embedding Models

In this section, we evaluate the embedding models inspired by TransE and proposed in Sect. 3: ScalE, TransE+ and ScalE+. Model hyper-parameters were selected according to the performance on the validation set. In the following experiments, we considered a wider grid of hyper-parameters: in particular, we selected the embedding vector dimension k in {20, 50, 100, 200, 300}, the dissimilarity function d(x, y) in {‖x − y‖_1, ‖x − y‖_2, −x^T y}, and the margin γ in {1, 2, 5, 10}. All models were trained using AdaGrad, with η_a = 0.1, for only 100 epochs. The reduced training time enabled us to experiment with a wider range of hyper-parameters than related works in the literature [7]. Results are summarized in Tab. 4.

Table 4: Link Prediction results: test performance of the embedding models proposed in Sect. 3 on the WordNet and Freebase datasets. Results show the Mean Rank (the lower, the better) and Hits@10 (the higher, the better) [7].

| Model | Optimizer | WN18 Mean Rank (Raw / Filt.) | WN18 Hits@10 % (Raw / Filt.) | FB15k Mean Rank (Raw / Filt.) | FB15k Hits@10 % (Raw / Filt.) |
|---|---|---|---|---|---|
| TransE, from [7] | SGD | 263 / 251 | 75.4 / 89.2 | 243 / 125 | 34.9 / 47.1 |
| TransE | AdaGrad | 161 / 150 | 80.5 / 93.5 | 183 / 63 | 47.9 / 68.2 |
| TransE+ (Sect. 3) | AdaGrad | 159 / 148 | 79.6 / 92.6 | 196 / 78 | 44.9 / 62.4 |
| ScalE (Sect. 3) | AdaGrad | 187 / 174 | 82.7 / 94.5 | 194 / 62 | 49.8 / 73.0 |
| ScalE+ (Sect. 3) | AdaGrad | 298 / 287 | 83.7 / 95.5 | 185 / 59 | 50.0 / 71.5 |

We can see that, despite their very different geometric interpretations, all of the embedding models proposed in Sect. 3 achieve markedly higher Hits@10 results than every other link prediction method in Tab. 3. An explanation is that both TransE [7] and the proposed models TransE+, ScalE and ScalE+ have a limited model capacity (or complexity) in comparison with other models. For this reason, when training instances are scarce, they are less prone to overfitting than more expressive link prediction models, such as RESCAL [19], SME [12] and NTN [13]. In each experiment, the proposed models ScalE and ScalE+ always improve over TransE in terms of Hits@10.

By leveraging (i) adaptive learning rates and (ii) the proposed embedding models ScalE and ScalE+, we were able to achieve a record 95.5% Hits@10 on WordNet, and a 73.0% Hits@10 on Freebase. These results are markedly higher than the state-of-the-art results reported in [7]. It is also remarkable that the proposed method required a much lower learning time (100 epochs, approx. 30 minutes on Freebase, on a single CPU) than [7] (1,000 epochs, and careful learning rate tuning). A significantly lower training time – from days, as reported in [10], to minutes – can considerably improve the applicability of embedding models for knowledge graphs in the Web of Data.

6 Conclusions

We focused on the problem of link prediction in Knowledge Graphs. Recently, embedding models like TransE [7] achieved new state-of-the-art link prediction results, while showing the potential to scale to very large KGs.

In this paper, we proposed a method for considerably reducing the learning time in embedding models, based on adaptive learning rate selection, and we proposed a set of new models with interesting scalability properties. We extensively evaluated the proposed methods in several experiments on large real-world datasets. Our results show a significant improvement over state-of-the-art link prediction methods, while reducing the required training time by an order of magnitude. The contributions in this paper improve both the effectiveness and the applicability of embedding models on large and Web-scale KGs. Source code and datasets for reproducing the empirical evaluations discussed in this paper are available online at https://github.com/pminervini/DeepKGC.

References

1. K. D. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor, "Freebase: a collaboratively created graph database for structuring human knowledge," in Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, J. T. Wang, Ed. ACM, 2008, pp. 1247–1250.
2. G. A. Miller, "WordNet: a lexical database for English," Commun. ACM, vol. 38, no. 11, pp. 39–41, 1995.
3. X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang, "Knowledge Vault: a web-scale approach to probabilistic knowledge fusion," in The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, S. A. Macskassy et al., Eds. ACM, 2014, pp. 601–610.
4. A. Bordes and E. Gabrilovich, "Constructing and mining web-scale knowledge graphs: KDD 2014 tutorial," in The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, 2014, p. 1967.
5. ——, "Constructing and mining web-scale knowledge graphs: WWW 2015 tutorial," in Proceedings of the 24th International Conference on World Wide Web, WWW 2015, 2015, p. 1523.
6. Y. Bengio, A. C. Courville, and P. Vincent, "Representation learning: a review and new perspectives," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1798–1828, 2013.
7. A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko, "Translating embeddings for modeling multi-relational data," in Advances in Neural Information Processing Systems 26, C. Burges et al., Eds. Curran Associates, Inc., 2013, pp. 2787–2795.
8. A. Bordes, J. Weston, R. Collobert, and Y. Bengio, "Learning structured embeddings of knowledge bases," in Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2011, W. Burgard et al., Eds. AAAI Press, 2011.
9. T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, Z. Ghahramani, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2013, pp. 3111–3119.
10. K. Chang, W. Yih, B. Yang, and C. Meek, "Typed tensor decomposition of knowledge bases for relation extraction," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, A. Moschitti et al., Eds. ACL, 2014, pp. 1568–1579.
11. Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang, "A tutorial on energy-based learning," in Predicting Structured Data, G. Bakir et al., Eds. MIT Press, 2006.
12. A. Bordes, X. Glorot, J. Weston, and Y. Bengio, "A semantic matching energy function for learning with multi-relational data – application to word-sense disambiguation," Machine Learning, vol. 94, no. 2, pp. 233–259, 2014.
13. R. Socher, D. Chen, C. D. Manning, and A. Ng, "Reasoning with neural tensor networks for knowledge base completion," in Advances in Neural Information Processing Systems 26, C. Burges et al., Eds. Curran Associates, Inc., 2013, pp. 926–934.
14. R. Jenatton, N. L. Roux, A. Bordes, and G. R. Obozinski, "A latent factor model for highly multi-relational data," in Advances in Neural Information Processing Systems 25, F. Pereira et al., Eds. Curran Associates, Inc., 2012, pp. 3167–3175.
15. Z. Wang, J. Zhang, J. Feng, and Z. Chen, "Knowledge graph embedding by translating on hyperplanes," in Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, C. E. Brodley et al., Eds. AAAI Press, 2014, pp. 1112–1119.
16. D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, pp. 533–536, 1986.
17. J. C. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," Journal of Machine Learning Research, vol. 12, pp. 2121–2159, 2011.
18. J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. A. Ranzato, A. Senior, P. Tucker, K. Yang, Q. V. Le, and A. Y. Ng, "Large scale distributed deep networks," in Advances in Neural Information Processing Systems 25, F. Pereira et al., Eds. Curran Associates, Inc., 2012, pp. 1223–1231.
19. M. Nickel, V. Tresp, and H. Kriegel, "A three-way model for collective learning on multi-relational data," in Proceedings of the 28th International Conference on Machine Learning, ICML 2011, L. Getoor et al., Eds. Omnipress, 2011, pp. 809–816.
20. T. Schaul, I. Antonoglou, and D. Silver, "Unit tests for stochastic optimization," in International Conference on Learning Representations, 2014.