=Paper=
{{Paper
|id=Vol-2322/dsi4-6
|storemode=property
|title= Analyzing Knowledge Graph Embedding Methods from a Multi-Embedding Interaction Perspective
|pdfUrl=https://ceur-ws.org/Vol-2322/dsi4-6.pdf
|volume=Vol-2322
|authors=Hung Nghiep Tran,Atsuhiro Takasu
|dblpUrl=https://dblp.org/rec/conf/edbt/TranT19
}}
== Analyzing Knowledge Graph Embedding Methods from a Multi-Embedding Interaction Perspective==
<pdf width="1500px">https://ceur-ws.org/Vol-2322/dsi4-6.pdf</pdf>
<pre>
      Analyzing Knowledge Graph Embedding Methods from a
             Multi-Embedding Interaction Perspective
                           Hung Nghiep Tran                                                              Atsuhiro Takasu
                         SOKENDAI                                                                National Institute of Informatics
         (The Graduate University for Advanced Studies)                                                    Tokyo, Japan
                            Japan                                                                       takasu@nii.ac.jp
                      nghiepth@nii.ac.jp

                                                                                    Google to provide semantic meanings into many traditional appli-
                                                                                    cations, such as semantic search engines, semantic browsing, and
ABSTRACT                                                                            question answering [2]. One important application of knowledge
Knowledge graph is a popular format for representing knowledge,                     graphs is recommender systems, where they are used to unite
with many applications to semantic search engines, question-                        multiple sources of data and incorporate external knowledge [5]
answering systems, and recommender systems. Real-world knowl-                       [36]. Recently, specific methods such as knowledge graph em-
edge graphs are usually incomplete, so knowledge graph embed-                       bedding have been used to predict user interactions and provide
ding methods, such as Canonical decomposition/Parallel factor-                      recommendations directly [10].
ization (CP), DistMult, and ComplEx, have been proposed to                              Real-world knowledge graphs are usually incomplete. For ex-
address this issue. These methods represent entities and relations                  ample, Freebase and Wikidata are very large but they do not
as embedding vectors in semantic space and predict the links                        contain all knowledge. This is especially true for the knowledge
between them. The embedding vectors themselves contain rich                         graphs used in recommender systems. During system operation,
semantic information and can be used in other applications such                     users review new items or like new items, generating new triples
as data analysis. However, mechanisms in these models and the                       for the knowledge graph, which is therefore inherently incom-
embedding vectors themselves vary greatly, making it difficult                      plete. Knowledge graph completion, or link prediction, is the task
to understand and compare them. Given this lack of understand-                      that aims to predict new triples.
ing, we risk using them ineffectively or incorrectly, particularly                      This task can be undertaken by using knowledge graph em-
for complicated models, such as CP, with two role-based em-                         bedding methods, which represent entities and relations as em-
bedding vectors, or the state-of-the-art ComplEx model, with                        bedding vectors in semantic space, then model the interactions
complex-valued embedding vectors. In this paper, we propose                         between these embedding vectors to compute matching scores
a multi-embedding interaction mechanism as a new approach                           that predict the validity of each triple. Knowledge graph embed-
to uniting and generalizing these models. We derive them the-                       ding methods are not only used for knowledge graph completion,
oretically via this mechanism and provide empirical analyses                        but the learned embedding vectors of entities and relations are
and comparisons between them. We also propose a new multi-                          also very useful. They contain rich semantic information similar
embedding model based on quaternion algebra and show that it                        to word embeddings [21] [20] [14], enabling them to be used
achieves promising results using popular benchmarks.                                in visualization or browsing for data analysis. They can also be
                                                                                    used as extracted or pretrained feature vectors in other learning
KEYWORDS                                                                            models for tasks such as classification, clustering, and ranking.
                                                                                        Among the many proposed knowledge graph embedding meth-
Knowledge Graph, Knowledge Graph Completion, Knowledge                              ods, the most efficient and effective involve trilinear-product-
Graph Embedding, Multi-Embedding, Representation Learning.                          based models, such as Canonical decomposition/Parallel factor-
                                                                                    ization (CP) [13] [17], DistMult [35], or the state-of-the-art Com-
                                                                                    plEx model [28]. These models solve a tensor decomposition
1    INTRODUCTION                                                                   problem with the matching score of each triple modeled as the
Knowledge graphs provide a unified format for representing                          result of a trilinear product, i.e., a multilinear map with three
knowledge about relationships between entities. A knowledge                         variables corresponding to the embedding vectors h, t, and r
graph is a collection of triples, with each triple (h, t, r ) denoting              of head entity h, tail entity t, and relation r , respectively. The
the fact that relation r exists between head entity h and tail en-                  trilinear-product-based score function for the three embedding
tity t. Many large real-world knowledge graphs have been built,                     vectors is denoted as ⟨h, t, r ⟩ and will be defined mathematically
including WordNet [22] representing English lexical knowledge,                      in Section 2.
and Freebase [3] and Wikidata [29] representing general knowl-                          However, the implementations of embedding vectors for the
edge. Moreover, knowledge graph can be used as a universal                          various models are very diverse. DistMult [35] uses one real-
format for data from applied domains. For example, a knowl-                         valued embedding vector for each entity or relation. The original
edge graph for recommender systems would have triples such as                       CP [13] uses one real-valued embedding vector for each relation,
(UserA, Item1, review) and (UserB, Item2, like).                                    but two real-valued embedding vectors for each entity when it is
   Knowledge graphs are the cornerstones of modern semantic                         as head and as tail, respectively. ComplEx [28] uses one complex-
web technology. They have been used by large companies such as                      valued embedding vector for each entity or relation. Moreover, a
                                                                                    recent heuristic for CP [17], here denoted as CPh , was proposed
First International Workshop on Data Science for Industry 4.0.                      to augment the training data, helping CP achieve results com-
Copyright ©2019 for the individual papers by the papers’ authors. Copying permit-   petitive with the state-of-the-art model ComplEx. This heuristic
ted for private and academic purposes. This volume is published and copyrighted
by its editors.
                                                                                    introduces an additional embedding vector for each relation, but

Published in the Workshop Proceedings of the EDBT/ICDT 2019 Joint Conference (March 26, 2019, Lisbon, Portugal) on CEUR-WS.org.
the underlying mechanism is different from that in ComplEx. All              2.2     Categorization
of these complications make it difficult to understand and com-              Based on the modeling of the second component, a knowledge
pare the various models, to know how to use them and extend                  graph embedding model falls into one of three categories, namely
them. If we were to use the embedding vectors for data analysis              translation-based, neural-network-based, or trilinear-product-based,
or as pretrained feature vectors, a good understanding would                 as described below.
affect the way we would use the complex-valued embedding vec-
tors from ComplEx or the different embedding vectors for head                   2.2.1 Translation-based: These models translate the head en-
and tail roles from CP.                                                      tity embedding by summing with the relation embedding vector,
   In this paper, we propose a multi-embedding interaction mech-             then measuring the distance between the translated images of
anism as a new approach to uniting and generalizing the above                head entity and the tail entity embedding, usually by L1 or L2
models. In the proposed mechanism, each entity e is represented              distance:
by multiple embedding vectors {e (1), e (2), . . . } and each relation                   S(h, t, r ) = − ||h + r − t ||p
r is represented by multiple embedding vectors {r (1), r (2), . . . }.                                      D
                                                                                                                                ! 1/p
                                                                                                                                         (1)
In a triple (h, t, r ), all embedding vectors of h, t, and r interact                                                         p
                                                                                                           Õ
                                                                                                     =−       |hd + rd − td |         ,
with each other by trilinear products to produce multiple interac-                                            d
tion scores. These scores are then weighted summed by a weight
                                                                             where
vector ω to produce the final matching score for the triple. We
show that the above models are special cases of this mechanism.                    • h, t, r are embedding vectors of h, t, and r , respectively,
Therefore, it unifies those models and lets us compare them di-                    • p is 1 or 2 for L1 or L2 distance, respectively,
rectly. The mechanism also enables us to develop new models by                     • D is the embedding size and d is each dimension.
extending to additional embedding vectors.                                      TransE [4] was the first model of this type, with score function
   In this paper, our contributions include the following.                   basically the same as the above equation. There have been many
                                                                             extensions such as TransR [19], TransH [33], and TransA [34].
      • We introduce a multi-embedding interaction mechanism
                                                                             Most extensions are done by linear transformation of the entities
        as a new approach to unifying and generalizing a class of
                                                                             into a relation-specific space before translation [19].
        state-of-the-art knowledge graph embedding models.
                                                                                These models are simple and efficient. However, their model-
      • We derive each of the above models theoretically via this
                                                                             ing capacity is generally weak because of over-strong assump-
        mechanism. We then empirically analyze and compare
                                                                             tions about translation using relation embedding. Therefore, they
        these models with each other and with variants.
                                                                             are unable to model some forms of data [31].
      • We propose a new multi-embedding model by an exten-
        sion to four-embedding vectors based on quaternion alge-               2.2.2 Neural-network-based: These models use a nonlinear
        bra, which is an extension of complex algebra. We show               neural network to compute the matching score for a triple:
        that this model achieves promising results.
                                                                                                    S(h, t, r ) =N N (h, t, r ),                 (2)
2     RELATED WORK                                                           where
Knowledge graph embedding methods for link prediction are                          • h, t, r are the embedding vectors of h, t, and r , respectively,
actively being researched [30]. Here, we only review the works                     • N N is the neural network used to compute the score.
that are directly related to this paper, namely models that use only            One of the simplest neural-network-based model is ER-MLP
triples, not external data such as text [32] or graph structure such         [7], which concatenates the input embedding vectors and uses a
as relation paths [18]. Models using only triples are relatively             multi-layer perceptron neural network to compute the matching
simple and they are also the current state of the art.                       score. NTN [26] is an earlier model that employs nonlinear ac-
                                                                             tivation functions to generalize the linear model RESCAL [24].
2.1     General architecture                                                 Recent models such as ConvE [6] use convolution networks in-
Knowledge graph embedding models take a triple of the form                   stead of fully-connected networks.
(h, t, r ) as input and output the validity of that triple. A general           These models are complicated because of their use of neural
model can be viewed as a three-component architecture:                       networks as a black-box universal approximator, which usually
                                                                             make them difficult to understand and expensive to use.
    (1) Embedding lookup: linear mapping from one-hot vectors
        to embedding vectors. A one-hot vector is a sparse dis-                 2.2.3 Trilinear-product-based: These models compute their
        crete vector representing a discrete input, e.g., the first          scores by using trilinear product between head, tail, and relation
        entity could be represented as [1, 0, . . . , 0]⊤ . A triple could   embeddings, with relation embedding playing the role of match-
        be represented as a tuple of three one-hot vectors repre-            ing weights on the dimensions of head and tail embeddings:
        senting h, t, and r , respectively. An embedding vector is                                S(h, t, r ) =⟨h, t, r ⟩
        a dense continuous vector of much lower dimensionality
        than a one-hot vector thus lead to efficient distributed                                             =h ⊤diaд(r )t
        representations [11] [12].                                                                                D
                                                                                                                  Õ
    (2) Interaction mechanism: modeling the interaction between                                              =           (h ⊙ t ⊙ r )d           (3)
        embedding vectors to compute the matching score of a                                                      d =1
        triple. This is the main component of a model.                                                            ÕD
    (3) Prediction: using the matching score to predict the validity                                         =           (hd td rd ) ,
        of each triple. A higher score means that the triple is more                                              d =1
        likely to be valid.                                                  where
      • h, t, r are embedding vectors of h, t, and r , respectively,             embedding vectors and setting appropriate weight vectors. Next,
      • diaд(r ) is the diagonal matrix of r ,                                   we specify our attempt at learning weight vectors automatically.
      • ⊙ denotes the element-wise Hadamard product,                             We also propose a four-embedding interaction model based on
      • D is the embedding size and d is the dimension for which                 quaternion algebra.
          hd , td , and rd are the entries.
   In this paper, we focus on this category, particularly on Dist-               3.1     Multi-embedding interaction mechanism
Mult, ComplEx, CP, and CPh with augmented data. These models                     We globally model each entity e as the multiple embedding vec-
are simple, efficient, and can scale linearly with respect to em-                tors {e (1), e (2), . . . , e (n) } and each relation r as the multiple em-
bedding size in both time and space. They are also very effective,               bedding vectors {r (1), r (2), . . . , r (n) }. The triple (h, t, r ) is there-
as has been shown by the state-of-the-art results for ComplEx                    fore modeled by multiple embeddings as h (i), t (j), r (k ), i, j, k ∈
and CPh using popular benchmarks [28] [17].                                      {1, ..., n}.
   DistMult [35] embeds each entity and relation as a single real-                  In each triple, the embedding vectors for head, tail, and re-
valued vector. DistMult is the simplest model in this category.                  lation interact with each and every other embedding vector to
Its score function is symmetric, with the same scores for triples                produce multiple interaction scores. Each interaction is modeled
(h, t, r ) and (t, h, r ). Therefore, it cannot model asymmetric data            by the trilinear product of corresponding embedding vectors. The
for which only one direction is valid, e.g., asymmetric triples                  interaction scores are then weighted summed by a weight vector:
such as (Paper1, Paper2, cite). Its score function is:                                                                Õ
                                                                                      S(h, t, r ; Θ, ω) =                     ω (i,j,k ) ⟨h (i), t (j), r (k) ⟩,
                            S(h, t, r ) =⟨h, t, r ⟩,                       (4)                                                                                   (8)
                                                                                                            i,j,k ∈ {1,...,n }
where h, t, r ∈ Rk .                                                             where
   ComplEx [28] is an extension of DistMult that uses complex-
valued embedding vectors that contain complex numbers. Each                            • Θ is the parameter denoting embedding vectors h (i), t (j), r (k ) ,
complex number c with two components, real a and imaginary                             • ω is the parameter denoting the weight vector used to
b, can be denoted as c = a + bi. The complex conjugate c of c is                         combine the interaction scores, with ω (i,j,k ) being an ele-
c = a − bi. The complex conjugate vector t of t is form from the                         ment of ω.
complex conjugate of the individual entries. Complex algebra
requires using the complex conjugate vector of tail embedding                    3.2     Deriving trilinear-product-based models
in the inner product and trilinear product [1]. Thus, these prod-                The existing trilinear-product-based models can be derived from
ucts can be antisymmetric, which enables ComplEx to model                        the proposed general multi-embedding interaction score function
asymmetric data [28] [27]. Its score function is:                                in Eq. (8) by setting the weight vector ω as shown in Table 1.
                                                                                    For DistMult, we can see the equivalence directly. For Com-
                         S(h, t, r ) =Re(⟨h, t, r ⟩),                      (5)
                                                                                 plEx, we need to expand its score function following complex
where h, t, r ∈ Ck and Re(c) means taking the real component                     algebra [1]:
of the complex number c.
                                                                                   S(h, t, r ) =Re(⟨h, t, r ⟩)
   CP [13] is similar to DistMult but embeds entities as head
and as tail differently. Each entity e has two embedding vectors                               =⟨Re(h), Re(t), Re(r )⟩ + ⟨Re(h), Im(t), Im(r )⟩                             (9)
e and e (2) depending on its role in a triple as head or as tail,                                 − ⟨Im(h), Re(t), Im(r )⟩ + ⟨Im(h), Im(t), Re(r )⟩,
respectively. Using different role-based embedding vectors leads
                                                                                 where
to an asymmetric score function, enabling CP to also model
asymmetric data. However, experiments have shown that CP’s                             • h, t, r ∈ Ck ,
performance is very poor on unseen test data [17]. Its score                           • Re(c) and Im(c) mean taking the real and imaginary com-
function is:                                                                             ponents of the complex vector c, respectively.

                          S(h, t, r ) =⟨h, t (2), r ⟩,                     (6)      Changing Re(h) to h (1) , Im(h) to h (2) , Re(t) to t (1) , Im(t) to
                                                                                 t (2) , Re(r ) to r (1) , and Im(r ) to r (2) , we can rewrite the score
where h, t (2), r ∈ Rk .                                                         function of ComplEx as:
    CPh [17] is a direct extension of CP. Its heuristic augments
the training data by making an inverse triple (t, h, r (a) ) for each                       S(h, t, r ) =Re(⟨h, t, r ⟩)
existing triple (h, t, r ), where r (a) is the augmented relation corre-                                =⟨h (1), t (1), r (1) ⟩ + ⟨h (1), t (2), r (2) ⟩                   (10)
sponding to r . With this heuristic, CPh significantly improves CP,                                              (2)   (1)        (2)        (2)     (2)        (1)
achieving results competitive with ComplEx. Its score function                                             − ⟨h , t          ,r         ⟩ + ⟨h , t         ,r         ⟩,
is:                                                                              which is equivalent to the weighted sum using the weight vectors
               S(h, t, r ) =⟨h, t   (2)                 (2)
                                          , r ⟩ and ⟨t, h , r   (a)
                                                                      ⟩,   (7)   in Table 1. Note that by the symmetry between h and t, we can
                                                                                 also obtain the equivalent weight vector ComplEx equiv. 1.
where h, h (2), t, t (2), r, r (a) ∈ Rk .                                        By symmetry between embedding vectors of the same entity
   In the next section, we present a new approach to analyzing                   or relation, we can also obtain the equivalent weight vectors
these trilinear-product-based models.                                            ComplEx equiv. 2 and ComplEx equiv. 3.
                                                                                    For CP, note that the two role-based embedding vectors for
3    MULTI-EMBEDDING INTERACTION                                                 each entity can be mapped to two-embedding vectors in our
In this section, we first formally present the multi-embedding in-               model and the relation embedding vector can be mapped to r (1) .
teraction mechanism. We then derive each of the above trilinear-                 For CPh , further note that its data augmentation is equivalent
product-based models using this mechanism, by changing the                       to adding the score of the original triple and the inverse triple
                                                            Table 1: Weight vectors for special cases.

                                                                              ComplEx     ComplEx       ComplEx                                   CPh
              Weighted terms                 DistMult          ComplEx                                                    CP              CPh
                                                                               equiv. 1    equiv. 2      equiv. 3                                equiv.
                ⟨h (1), t (1), r (1) ⟩            1                      1        1           0             0               0              0       0
                ⟨h (1), t (1), r (2) ⟩            0                      0        0           1             1               0              0       0
                ⟨h (1), t (2), r (1) ⟩            0                      0        0          -1             1               1              1       0
                ⟨h (1), t (2), r (2) ⟩            0                      1       -1           0             0               0              0       1
                ⟨h (2), t (1), r (1) ⟩            0                      0        0           1            -1               0              0       1
                ⟨h (2), t (1), r (2) ⟩            0                      -1       1           0             0               0              1       0
                ⟨h (2), t (2), r (1) ⟩            0                      1        1           0             0               0              0       0
                ⟨h (2), t (2), r (2) ⟩            0                      0        0           1             1               0              0       0


when training using stochastic gradient descent (SGD):                                   Quaternion numbers are extension of complex numbers to
                                    (2)               (2)     (a)                     four components [15] [8]. Each quaternion number q, with one
              S(h, t, r ) =⟨h, t          , r ⟩ + ⟨t, h , r         ⟩.        (11)
                                                                                      real component a and three imaginary components b, c, d, could
We can then map r (a) to r (2) to obtain the equivalence given in                     be written as q = a + bi + cj + dk where i, j, k are fundamental
Table 1. By symmetry between h and t, we can also obtain the                          quaternion units, similar to the imaginary number i in complex
equivalent weight vector CPh equiv. 1.                                                algebra. As for complex conjugates, we also have a quaternion
   From this perspective, all four models DistMult, ComplEx,                          conjugate q = a − bi − cj − dk.
CP, and CPh can be seen as special cases of the general multi-                           An intuitive view of quaternion algebra is that each quater-
embedding interaction mechanism. This provides an intuitive                           nion number represents a 4-dimensional vector (or 3-dimensional
perspective on using the embedding vectors in complicated mod-                        vector when the real component a = 0) and quaternion multi-
els. For the ComplEx model, instead of using a complex-valued                         plication is rotation of this vector in 4- (or 3-)dimensional space.
embedding vector, we can treat it as two real-valued embed-                           Compared to complex algebra, each complex number represents
ding vectors. These vectors can then be used directly in common                       a 2-dimensional vector and complex multiplication is rotation of
learning algorithms that take as input real-valued vectors rather                     this vector in 2-dimensional plane [1].
than complex-valued vectors. We also see that multiple embed-                            Several works have shown the benefit of using complex, quater-
ding vectors are a natural extension of single embedding vectors.                     nion, or other hyper-complex numbers in the hidden layers of
Given this insight, multiple embedding vectors can be concate-                        deep neural networks [9] [23] [25]. To the best of our knowledge,
nated to form a longer vector for use in visualization and data                       this paper is the first to motivate and use quaternion numbers
analysis, for example.                                                                for the embedding vectors of knowledge graph embedding.
                                                                                         Quaternion multiplication is noncommutative, thus there are
3.3    Automatically learning weight vectors                                          multiple ways to multiply three quaternion numbers in the tri-
As we have noted, the weight vector ω plays an important role                         linear product. Here, we choose to write the score function of
in the model, because it determines how the interaction mecha-                        the quaternion-based four-embedding interaction model as:
nism is implemented and therefore how the specific model can
be derived. An interesting question is how to learn ω automat-                                               S(h, t, r ) =Re(⟨h, t, r ⟩),                                    (13)
ically. One approach is to let the model learn ω together with
the embeddings in an end-to-end fashion. For a more detailed                          where h, t, r ∈ Hk .
examination of this idea, we will test different restrictions on the                     By expanding this formula using quaternion algebra [15] and
range of ω by applying tanh(ω), sigmoid(ω), and softmax(ω).                           mapping the four components of a quaternion number to four
   Note also that the weight vectors for related models are usually                   embeddings in the multi-embedding interaction model, respec-
sparse. We therefore enforce a sparsity constraint on ω by an                         tively, we can write the score function in the notation of the
additional Dirichlet negative log-likelihood regularization loss:                     multi-embedding interaction model as:
                          Õ                     |ω (i,j,k ) |                                  S(h, t, r ) =Re(⟨h, t, r ⟩)
      Ldir = −λdir                 (α − 1) log                , (12)
                                                  ||ω||                  1
                       i,j,k ∈ {1,...,n }                                                                 =⟨h (1), t (1), r (1) ⟩ + ⟨h (2), t (2), r (1) ⟩
where α is a hyperparameter controlling sparseness (a small α                                                + ⟨h (3), t (3), r (1) ⟩ + ⟨h (4), t (4), r (1) ⟩
will make the weight vector sparser) and λdir is the regularization
strength.                                                                                                    + ⟨h (1), t (2), r (2) ⟩ − ⟨h (2), t (1), r (2) ⟩
                                                                                                             + ⟨h (3), t (4), r (2) ⟩ − ⟨h (4), t (3), r (2) ⟩               (14)
3.4    Quaternion-based four-embedding                                                                             (1)   (3)        (3)         (2)    (4)        (3)
       interaction model                                                                                     + ⟨h , t          ,r         ⟩ − ⟨h , t         ,r         ⟩
                                                                                                                   (3)   (1)        (3)         (4)    (2)        (3)
Another question is whether using more embedding vectors in                                                  − ⟨h , t          ,r         ⟩ + ⟨h , t         ,r         ⟩
the multi-embedding interaction mechanism is helpful. Moti-                                                        (1)
                                                                                                             + ⟨h , t    (4)
                                                                                                                               ,r   (4)         (2)
                                                                                                                                          ⟩ + ⟨h , t   (3)
                                                                                                                                                             ,r   (4)
                                                                                                                                                                        ⟩
vated by the derivation of ComplEx from a two-embedding inter-
                                                                                                                   (3)   (2)        (4)         (4)    (1)        (4)
action model, we develop a four-embedding interaction model                                                  − ⟨h , t          ,r         ⟩ − ⟨h , t         ,r         ⟩,
by using quaternion algebra to determine the weight vector and
the interaction mechanism.                                                            where h, t, r ∈ Hk .
4     LOSS FUNCTION AND OPTIMIZATION                                                   training, validation, and test sets are removed from the corrupted
The learning problem in knowledge graph embedding methods                              triples set before computing the rank of the true triple.
can be modeled as the binary classification of valid and invalid
triples. Because knowledge graphs do not contain invalid triples,                      5.3    Training
we generate them by negative sampling [20]. For each valid triple                      We trained the models using SGD with learning rates auto-tuned
(h, t, r ), we replace the h or t entities in each training triple with                by Adam [16], that makes the choice of initial learning rate more
other random entities to obtain the invalid triples (h ′, t, r ) and                   robust. For all models, we found good hyperparameters with
(h, t ′, r ) [4].                                                                      grid search on learning rates ∈ {10−3, 10−4 }, embedding regu-
   We can then learn the model parameters by minimizing the                            larization strengths ∈ {10−2, 3 × 10−3, 10−3, 3 × 10−4, 10−4, 0.0},
negative log-likelihood loss for the training data with the pre-                       and batch sizes ∈ {212, 214 }. For a fair comparison, we fixed
dicted probability modeled by the logistic sigmoid function σ (·)                      the embedding sizes so that numbers of parameters for all mod-
on the matching score. This loss is the cross-entropy:                                 els are comparable. In particular, we use embedding sizes of
                              Õ                                                        400 for one-embedding models such as DistMult, 200 for two-
   L(D, D ′ ; Θ, ω) = −              log σ (S(h, t, r ; Θ, ω))
                                                                                       embedding models such as ComplEx, CP, and CPh , and 100 for
                            (h,t ,r )∈ D
                                  Õ                                                    four-embedding models. We also fixed the number of negative
                                                  log σ 1 − S(h ′, t ′, r ; Θ, ω) ,
                                                                                 
                        −                                                              samples at 1 because, although using more negative samples
                            (h ′ ,t ′ ,r )∈ D ′                                        is beneficial for all models, it is also more expensive and not
                                                                                (15)   necessary for this comparative analysis.
where D is true data (p̂ = 1), D ′ is negative sampled data (p̂ = 0),                     We constrained entity embedding vectors to have unit L2 -norm
and p̂ is the empirical probability.                                                   after each training iteration. All training runs were stopped early
   Defining the class label Y(h,t ,r ) = 2p̂(h,t ,r ) − 1, i.e., the labels            by checking the filtered MRR on the validation set after every 50
of positive triples are 1 and negative triples are −1, the above loss                  epochs, with 100 epochs patient.
can be written more concisely. In cluding the L2 regularization
of embedding vectors, this loss can be written as:                                     6     RESULTS AND DISCUSSION
                                                                                       In this section, we present experimental results and analyses for
                                      log(1 + e−Y(h,t ,r ) S(h,t ,r ;Θ,ω) )
                           Õ        
 L(D, D ′ ; Θ, ω) =
                                                                                       the models described in Section 3. We report results for derived
                       (h,t ,r )∈ D∪D ′
                                                                                       weight vectors and their variants, auto-learned weight vectors,
                                                    λ
                                                             
                                              +       ||Θ||22 ,                        and the quaternion-based four-embedding interaction model.
                                                   nD
                                                                                (16)   6.1    Derived weight vectors and variants
where D is true data, D ′ is negative sampled data, Θ are the                             6.1.1 Comparison of derived weight vectors . We evaluated the
embedding vectors corresponding to specific current triples, n is                      multi-embedding interaction model with the score function in Eq.
the number of multi-embedding, D is the embedding size, and λ                          (8), using the derived weight vectors in Table 1. The results are
is the regularization strength.                                                        shown in Table 2. They are consistent with the results reported
                                                                                       in other works [28]. Note that ComplEx and CPh achieved good
5 EXPERIMENTAL SETTINGS                                                                results, whereas DistMult performed less well. CP performed
5.1 Datasets                                                                           very poorly in comparison to the other models, even though it is
                                                                                       a classical model for the tensor decomposition task [13].
For our empirical analysis, we used the WN18 dataset, the most
                                                                                          For a more detailed comparison, we report the performance on
popular of the benchmark datasets built on WordNet [22] by
                                                                                       training data. Note that ComplEx and CPh can accurately predict
Bordes et al. [4]. This dataset has 40,943 entities, 18 relations,
                                                                                       the training data, whereas DistMult did not. This is evidence that
141,442 training triples, 5,000 validation triples, 5,000 test triples.
                                                                                       ComplEx and CPh are fully expressive while DistMult cannot
In our preliminary experiments, the relative performance on
                                                                                       model asymmetric data effectively.
all datasets was quite consistent, therefore choosing the WN18
                                                                                          The most surprising result was that CP can also accurately
dataset is appropriate for our analysis. We will consider the use
                                                                                       predict the training data at a comparable level to ComplEx and
of other datasets in in future work.
                                                                                       CPh , despite its very poor result on the test data. This suggests
                                                                                       that the problem with CP is not its modeling capacity, but in its
5.2     Evaluation protocols                                                           generalization performance to new test data. In other words, CP
Knowledge graph embedding methods are usually evaluated on                             is severely overfitting to the training data. However, standard
link prediction task [4]. In this task, for each true triple (h, t, r ) in             regularization techniques such as L2 regularization did not appear
the test set, we replace h and t by every other entity to generate                     to help. CPh can be seen as a regularization technique that does
corrupted triples (h ′, t, r ) and (h, t ′, r ), respectively [4]. The goal            help CP generalize well to unseen data.
of the model now is to rank the true triple (h, t, r ) before the
corrupted triples based on the predicted score S.                                         6.1.2 Comparison with other variants of weight vectors. In
   For each true triple in the test set, we compute its rank, then we                  Table 2, we show the results for two bad examples and two good
can compute popular evaluation metrics including MRR (mean                             examples of weight vector variants. Note that bad example 1
reciprocal rank) and Hit@k for k ∈ {1, 3, 10} (how many true                           performed similarly to CP and bad example 2 performed similarly
triples are correctly ranked in the top k) [28].                                       to DistMult. Good example 1 was similar to CPh and good example
   To avoid false negative error, i.e., corrupted triples are acciden-                 2 was similar to ComplEx.
tally valid triples, we follow the protocols used in other works                          This shows that the problem of bad weight vectors is not
for filtered metrics [4]. In this protocol, all valid triples in the                   unique to some specific models. Moreover, it shows that there
                                             Table 2: Results for the derived weight vectors on WN18.

                                   Weight setting                                    MRR       Hit@1      Hit@3      Hit@10
                                   DistMult (1, 0, 0, 0, 0, 0, 0, 0)                 0.796      0.674      0.915      0.945
                                   ComplEx (1, 0, 0, 1, 0, −1, 1, 0)                 0.937     0.928      0.946       0.951
                                   CP (0, 0, 1, 0, 0, 0, 0, 0)                       0.086      0.059      0.093      0.139
                                   CPh (0, 0, 1, 0, 0, 1, 0, 0)                      0.937     0.929      0.944       0.949
                                   DistMult on train                                 0.917      0.848      0.985      0.997
                                   ComplEx on train                                  0.996      0.994      0.998      0.999
                                   CP on train                                       0.994      0.994      0.996      0.999
                                   CPh on train                                      0.995      0.994      0.998      0.999
                                   Bad example 1 (0, 0, 20, 0, 0, 1, 0, 0)           0.107      0.079      0.116      0.159
                                   Bad example 2 (0, 0, 1, 1, 1, 1, 0, 0)            0.794      0.666      0.917      0.947
                                   Good example 1 (0, 0, 20, 1, 1, 20, 0, 0)         0.938     0.934      0.942       0.946
                                   Good example 2 (1, 1, −1, 1, 1, −1, 1, 1)         0.938     0.930      0.944       0.950

                                         Table 3: Results for the auto-learned weight vectors on WN18.

                                  Weight setting                                     MRR       Hit@1      Hit@3      Hit@10
                                  Uniform weight (1, 1, 1, 1, 1, 1, 1, 1)            0.787      0.658      0.915      0.944
                                  Auto weight no restriction                         0.774      0.636      0.911      0.944
                                  Auto weight ∈ (−1, 1) by tanh                      0.765      0.625      0.908      0.943
                                  Auto weight ∈ (0, 1) by sigmoid                    0.789      0.661      0.915      0.946
                                  Auto weight ∈ (0, 1) by softmax                    0.802      0.685      0.915      0.944
                                  Auto weight no restriction, sparse                 0.792      0.685      0.892      0.935
                                  Auto weight ∈ (−1, 1) by tanh, sparse              0.763      0.613      0.910      0.943
                                  Auto weight ∈ (0, 1) by sigmoid, sparse            0.793      0.667      0.915      0.945
                                  Auto weight ∈ (0, 1) by softmax, sparse            0.803      0.688      0.915      0.944

                       Table 4: Results for the quaternion-based four-embedding interaction model on WN18.

                                Weight setting                                         MRR       Hit@1      Hit@3     Hit@10
                                Quaternion-based four-embedding                        0.941     0.931      0.950      0.956
                                Quaternion-based four-embedding on train               0.997      0.995      0.999     1.000


are other good weight vectors, besides those for ComplEx and                          matching score is also symmetric. However, other automati-
CPh , that can achieve very good results.                                             cally learned weight vectors also performed similarly to Dist-
    We note that the good weight vectors exhibit the following                        Mult. Different restrictions by applying tanh(ω), sigmoid(ω),
properties.                                                                           and softmax(ω) did not help. We noticed that the learned weight
      • Completeness: all embedding vectors in a triple should be                     vectors were almost uniform, making them indistinguishable,
        involved in the weighted-sum matching score.                                  suggesting that the use of sparse weight vectors might help.
      • Stability: all embedding vectors for the same entity or                          We enforced a sparsity constraint by an additional Dirichlet
        relation should contribute equally to the weighted-sum                        negative log-likelihood regularization loss on ω, with α tuned
                                                                                          1 and λ
                                                                                      to 16                        −2
        matching score.                                                                           dir tuned to 10 . However, the results did not im-
      • Distinguishability: the weighted-sum matching scores for                      prove. Tracking of weight vectors value showed that the sparsity
        different triples should be distinguishable. For example, the                 constraint seemed to amplify the initial differences between the
        score ⟨h (1), t (2), r (1) ⟩ + ⟨h (2), t (1), r (2) ⟩ is indistinguishable    weight values instead of learning useful sparseness. This suggests
        because switching h and t forms a symmetric group.                            that the gradient information is too symmetric that the model
                                                                                      cannot break the symmetry of ω and escape the local optima.
    As an example, consider the ComplEx model, where the mul-
                                                                                         In general, these experiments show that learning good weight
tiplication of two complex numbers written in polar coordinate
                                                                                      vectors automatically is a particularly difficult task.
format, c 1 = |c 1 |e−iθ 1 and c 2 = |c 2 |e−iθ 2 , can be written as
c 1c 2 = |c 1 ||c 2 |e−i(θ 1 +θ 2 ) [1]. This is a rotation in the complex
plane, which intuitively satisfies the above properties.                              6.3      Quaternion-based four-embedding
6.2     Automatically learned weight vectors                                                   interaction model
                                                                                      In Table 4, we present the evaluation results for the proposed
We let the models learn ω together with the embeddings in an
                                                                                      quaternion-based four-embedding interaction model. The results
end-to-end fashion, aiming to learn good weight vectors auto-
                                                                                      were generally positive, with most metrics higher than those in
matically. The results are shown in Table 3.
                                                                                      Table 2 for state-of-the-art models such as ComplEx and CPh . Es-
  We first set uniform weight vector as a baseline. The results
                                                                                      pecially, H@10 performance was much better than other models.
were similar to those for DistMult because the weighted-sum
   Note that this model needs more extensive evaluation. One                                 https://doi.org/10.1002/sapm192761164
potential problem is its being prone to overfitting, as seen in the                     [14] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe:
                                                                                             Global Vectors for Word Representation. In Proceedings of the 2014 Conference
on train results, with H@10 at absolute 1.000. This might mean                               on Empirical Methods in Natural Language Processing (EMNLP). 1532–1543.
that better regularization methods may be needed. However, the                          [15] Isai Lvovich Kantor and Aleksandr Samuilovich Solodovnikov. 1989. Hyper-
                                                                                             complex Numbers: An Elementary Introduction to Algebras. Springer.
general results suggest that extending to more embedding vectors                        [16] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic
for multi-embedding interaction models is a promising approach.                              Optimization. In Proceedings of the 3rd International Conference on Learning
                                                                                             Representations (ICLR).
                                                                                        [17] Timothée Lacroix, Nicolas Usunier, and Guillaume Obozinski. 2018. Canonical
7     CONCLUSION                                                                             Tensor Decomposition for Knowledge Base Completion. In Proceedings of the
                                                                                             35th International Conference on Machine Learning (ICML’18).
This paper proposes a multi-embedding interaction mechanism                             [18] Yankai Lin, Zhiyuan Liu, Huanbo Luan, Maosong Sun, Siwei Rao, and Song
as a new approach to analyzing state-of-the-art knowledge graph                              Liu. 2015. Modeling Relation Paths for Representation Learning of Knowledge
embedding models such as DistMult, ComplEx, CP, and CPh . We                                 Bases. In Proceedings of the 2015 Conference on Empirical Methods in Natural
                                                                                             Language Processing.
show that these models can be unified and generalized under                             [19] Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. 2015. Learn-
the new approach to provide an intuitive perspective on using                                ing Entity and Relation Embeddings for Knowledge Graph Completion. In
the models and their embedding vectors effectively. We analyzed                              Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence.
                                                                                             2181–2187.
and compared the models and their variants empirically to better                        [20] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient
understand their properties, such as the severe overfitting prob-                            Estimation of Word Representations in Vector Space. In ICLR’13 Workshop.
                                                                                        [21] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013.
lem of the CP model. In addition, we propose and have evaluated                              Distributed Representations of Words and Phrases and Their Compositionality.
a new multi-embedding interaction model based on quaternion                                  In Advances in Neural Information Processing Systems. 3111–3119.
algebra, which showed some promising results.                                           [22] Miller, George A. 1995. WordNet: A Lexical Database for English. Commun.
                                                                                             ACM (1995), 39–41.
    There are several promising future directions. One direction                        [23] Toshifumi Minemoto, Teijiro Isokawa, Haruhiko Nishimura, and Nobuyuki
is to find new methods of modeling the interaction mechanism                                 Matsui. 2017. Feed Forward Neural Network with Random Quaternionic
between multi-embedding vectors and the effective extension to                               Neurons. Signal Processing C, 136 (2017), 59–68. https://doi.org/10.1016/j.
                                                                                             sigpro.2016.11.008
additional embedding vectors. Another direction is to evaluate                          [24] Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. 2011. A Three-Way
multi-embedding models such as the proposed quaternion-based                                 Model for Collective Learning on Multi-Relational Data. In Proceedings of the
                                                                                             28th International Conference on Machine Learning. 809–816.
four-embedding interaction model more extensively.                                      [25] Titouan Parcollet, Mirco Ravanelli, Mohamed Morchid, Georges Linarès, Chi-
                                                                                             heb Trabelsi, Renato De Mori, and Yoshua Bengio. 2019. Quaternion Recurrent
ACKNOWLEDGMENTS                                                                              Neural Networks. In Proceedings of the International Conference on Learning
                                                                                             Representations (ICLR’19).
This work was supported by a JSPS Grant-in-Aid for Scientific                           [26] Richard Socher, Danqi Chen, Christopher D Manning, and Andrew Y Ng. 2013.
                                                                                             Reasoning With Neural Tensor Networks for Knowledge Base Completion. In
Research (B) (15H02789).                                                                     Advances in Neural Information Processing Systems. 926–934.
                                                                                        [27] Théo Trouillon, Christopher R. Dance, Éric Gaussier, Johannes Welbl, Sebastian
REFERENCES                                                                                   Riedel, and Guillaume Bouchard. 2017. Knowledge Graph Completion via
                                                                                             Complex Tensor Factorization. The Journal of Machine Learning Research 18,
 [1] Lars V. Ahlfors. 1953. Complex Analysis: An Introduction to the Theory of               1 (2017), 4735–4772.
     Analytic Functions of One Complex Variable. New York, London (1953), 177.          [28] Theo Trouillon, Johannes Welbl, Sebastian Riedel, Eric Gaussier, and Guil-
 [2] Amit Singhal. 2012. Official Google Blog: Introducing the Knowledge Graph:              laume Bouchard. 2016. Complex Embeddings for Simple Link Prediction. In
     Things, Not Strings. https://googleblog.blogspot.com/2012/05/introducing-               International Conference on Machine Learning (ICML’16). 2071–2080.
     knowledge-graph-things-not.html.                                                   [29] Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: A Free Collaborative
 [3] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor.            Knowledgebase. Commun. ACM 57, 10 (Sept. 2014), 78–85. https://doi.org/10.
     2008. Freebase: A Collaboratively Created Graph Database for Structuring                1145/2629489
     Human Knowledge. In In SIGMOD Conference. 1247–1250.                               [30] Q. Wang, Z. Mao, B. Wang, and L. Guo. 2017. Knowledge Graph Embedding:
 [4] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Ok-            A Survey of Approaches and Applications. IEEE Transactions on Knowledge
     sana Yakhnenko. 2013. Translating Embeddings for Modeling Multi-Relational              and Data Engineering 29, 12 (Dec. 2017), 2724–2743. https://doi.org/10.1109/
     Data. In Advances in Neural Information Processing Systems. 2787–2795.                  TKDE.2017.2754499
 [5] Walter Carrer-Neto, María Luisa Hernández-Alcaraz, Rafael Valencia-García,         [31] Yanjie Wang, Rainer Gemulla, and Hui Li. 2018. On Multi-Relational Link
     and Francisco García-Sánchez. 2012. Social Knowledge-Based Recommender                  Prediction with Bilinear Models. In Thirty-Second AAAI Conference on Artificial
     System. Application to the Movies Domain. Expert Systems with Applications              Intelligence.
     39, 12 (Sept. 2012), 10990–11000. https://doi.org/10.1016/j.eswa.2012.03.025       [32] Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge
 [6] Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel.               Graph and Text Jointly Embedding. In Proceedings of the 2014 Conference on
     2018. Convolutional 2d Knowledge Graph Embeddings. In In Thirty-Second                  Empirical Methods in Natural Language Processing. 1591–1601.
     AAAI Conference on Artificial Intelligence.                                        [33] Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge
 [7] Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin                  Graph Embedding by Translating on Hyperplanes. In AAAI Conference on
     Murphy, Thomas Strohmann, Shaohua Sun, and Wei Zhang. 2014. Knowledge                   Artificial Intelligence. Citeseer, 1112–1119.
     Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion. In Proceed-         [34] Han Xiao, Minlie Huang, Yu Hao, and Xiaoyan Zhu. 2015. TransA: An Adaptive
     ings of the 20th ACM SIGKDD International Conference on Knowledge Discovery             Approach for Knowledge Graph Embedding. In AAAI Conference on Artificial
     and Data Mining - KDD ’14. ACM Press, New York, New York, USA, 601–610.                 Intelligence. arXiv:1509.05490
     https://doi.org/10.1145/2623330.2623623                                            [35] Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. 2015.
 [8] Ron Goldman. 2010. Rethinking Quaternions. Synthesis Lectures on Computer               Embedding Entities and Relations for Learning and Inference in Knowledge
     Graphics and Animation 4, 1 (Oct. 2010), 1–157. https://doi.org/10.2200/                Bases. In International Conference on Learning Representations.
     S00292ED1V01Y201008CGR013                                                          [36] Fuzheng Zhang, Nicholas Jing Yuan, Defu Lian, Xing Xie, and Wei-Ying Ma.
 [9] Nitzan Guberman. 2016. On Complex Valued Convolutional Neural Networks.                 2016. Collaborative Knowledge Base Embedding for Recommender Systems.
     arXiv:1602.09046 [cs.NE] (Feb. 2016). arXiv:cs.NE/1602.09046                            ACM Press, 353–362. https://doi.org/10.1145/2939672.2939673
[10] Ruining He, Wang-Cheng Kang, and Julian McAuley. 2017. Translation-
     Based Recommendation. In Proceedings of the Eleventh ACM Conference on
     Recommender Systems (RecSys ’17). ACM, New York, NY, USA, 161–169.
     https://doi.org/10.1145/3109859.3109882
[11] Geoffrey E. Hinton. 1986. Learning Distributed Representations of Concepts.
     In Proceedings of the Eighth Annual Conference of the Cognitive Science Society,
     Vol. 1. Amherst, MA, 12.
[12] G E Hinton, J L McClelland, and D E Rumelhart. 1984. Distributed Repre-
     sentations. In Parallel Distributed Processing. Carnegie-Mellon University,
     Pittsburgh, PA, 33.
[13] Frank L. Hitchcock. 1927. The Expression of a Tensor or a Polyadic as a Sum
     of Products. Journal of Mathematics and Physics 6, 1-4 (April 1927), 164–189.

</pre>