Multiple Run Ensemble Learning with Low Dimensional Knowledge Graph Embeddings

Chengjin Xu1, Mojtaba Nayyeri1, Sahar Vahdati2, and Jens Lehmann1,3

1 University of Bonn, Bonn, Germany
2 University of Oxford, Oxford, UK
3 Fraunhofer IAIS, Dresden, Germany
{Xu, Nayyeri}@cs.uni-bonn.de, sahar.vahdati@cs.ox.ac.uk, jens.lehmann@iais.fraunhofer.de

Abstract. Knowledge graphs (KGs) represent facts about a domain in a structured form. Although KGs can be quantitatively huge and consist of millions of triples, their coverage is usually still only a small fraction of the available knowledge. Among the most promising recent approaches for tackling this incompleteness problem is link prediction using knowledge graph embedding models. Various embedding models have been proposed so far, among which the RotatE model is reported to obtain state-of-the-art performance in link prediction tasks. However, RotatE mainly outperforms other models when using a high embedding dimension (e.g. 1000). In this paper, we study the performance of different models when the same model is trained several times with a low dimension and the resulting copies are combined. For example, our studies show better results when, instead of training a model once with a high dimension of 1200, we train the model six times in parallel with a dimension of 200 and then combine the six copies. This improves results while keeping the overall number of adjustable parameters the same. In order to justify our findings, we perform experiments on various models including TransE, DistMult, RotatE and ComplEx. Experimental results on standard benchmark datasets show that multiple low-dimensional models outperform a single high-dimensional model while the overall number of parameters remains the same.

Keywords: Graph Embedding · Ensemble Learning · Link Prediction.

1 Introduction

The advent of knowledge graph (KG) technology has propelled knowledge representation to the next level and influenced the ultimate results of several AI-based applications such as question answering and recommendation systems [14,9]. Several KGs such as WordNet [8], FreeBase [2], and YAGO [10] have been published with different utilization purposes. In recent years, KG technology has been combined with recent advances in hardware such as GPU processing, which has opened up a new horizon for using machine learning approaches on structured data at scale in science and industry.

Despite the advantages of KGs in downstream tasks, one of the main challenges of existing KGs is their incompleteness [11]. KG completion using link prediction approaches aims at addressing this problem. Among various link prediction approaches, KG embedding (KGE) has gained significant attention recently. A KGE model takes a KG in the form of triple facts (h, r, t), where h and t are entities (nodes) and r is a relation (link) between them. A d-dimensional vector is assigned to each element of a triple (h, r, t), and these vectors are adjusted by optimizing a loss function. The likelihood of a triple is then measured by a score function over the embedding vectors (h, r, t). The score functions of the models therefore play an important role in the performance of KGEs.
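As a rough, self-contained illustration of this embedding parameterization and of the parameter budget used throughout the paper (one model with dimension d_h versus k copies with dimension d_l such that d_h = k × d_l), the following PyTorch sketch compares the number of adjustable embedding parameters of one 1200-dimensional model with that of six 200-dimensional copies. The entity and relation counts are hypothetical (roughly FB15k-sized); this is our own sketch, not the authors' released code.

```python
import torch
import torch.nn as nn

# Hypothetical KG sizes, roughly FB15k-scale; not the authors' exact configuration.
NUM_ENTITIES, NUM_RELATIONS = 14_951, 1_345

def embedding_parameters(dim: int) -> int:
    """Number of adjustable embedding parameters of one KGE model with dimension `dim`."""
    entity_emb = nn.Embedding(NUM_ENTITIES, dim)
    relation_emb = nn.Embedding(NUM_RELATIONS, dim)
    return sum(p.numel() for p in entity_emb.parameters()) + \
           sum(p.numel() for p in relation_emb.parameters())

# One high-dimensional model (d_h = 1200) vs. six low-dimensional copies (d_l = 200).
high = embedding_parameters(1200)
low_ensemble = 6 * embedding_parameters(200)
assert high == low_ensemble  # same overall embedding budget: 1 x d_h = k x d_l
print(high, low_ensemble)
```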
After early proposals of novel KGE models, the research field has continued by designing and publishing new models with a focus on score functions. Among those, TransE is one of the primary models; it computes the score of a triple by measuring the distance between the tail vector and the relation-specific translated head (i.e. h + r ≈ t). Several variants of TransE such as TransR [7], TransH [15], and TransD [4] were proposed later to address limitations of the original model, such as its inability to encode one-to-N, symmetric and reflexive relations. One of the recent state-of-the-art models is RotatE [12], which utilizes rotation in complex space to compute the score of triples. Each element of the head vector is rotated by a relation-specific angle onto the corresponding element of the tail vector, i.e. h ◦ r ≈ t (◦ denotes element-wise complex multiplication, which induces a rotation in complex space). The evaluations reported in the initial work introducing RotatE [12], as well as the ones studied afterwards [1], show that RotatE obtains state-of-the-art performance in link prediction tasks by using a relatively high dimension (1000 on the Freebase datasets). QuatE [17] is another KGE model that outperforms other models by taking advantage of the quaternion space, whose elements have four components. Similar to RotatE, the high performance of QuatE is achieved by using an embedding dimension of 1000. Due to the quaternion design of this model, the best performing setting with this high dimension results in 4000 adjustable parameters per embedding vector. A similar fact is observable for the ComplEx model, which achieves high accuracy with dimension 1000 [6].

Such observations led us to a systematic evaluation of the state-of-the-art models, which shows that most of these works rely on, first, a high dimension and, second, multiple parts per entity/relation embedding due to the use of either the complex space (with real and imaginary parts) or the quaternion space (with four components). In other words, all the above-mentioned models use a single model with multi-part, high-dimensional embeddings (the number of parameters is 1 × d_h). In contrast to those models, we use the same model multiple (k) times in parallel trainings with a low dimension (the number of parameters is k × d_l). In order to have a fair comparison, we enforce 1 × d_h = k × d_l. The experimental results show that an ensemble of several copies of the same model trained with low dimensions performs better than training that model once with a high dimension, while the overall number of adjustable parameters is the same.

2 Related Work

Here we review the existing KGE models including TransE, RotatE, ComplEx, and DistMult. Each model defines a score function f(h, r, t) which takes the triple embeddings (h, r, t) and returns the degree of correctness of the triple.

TransE [3] Given a triple (h, r, t), the TransE model computes the score by measuring the distance between the relation-specific translated head (h + r) and the tail t as f(h, r, t) = −‖h + r − t‖, which enforces h + r ≈ t (h, r, t ∈ R^d) for each positive triple (h, r, t) in the vector space.

RotatE [12] RotatE aims at mapping each element of the head embedding (h_i) to the corresponding element of the tail embedding (t_i) by using a relation-specific rotation r_i = e^{iθ_i}. The score of each triple (h, r, t) is computed as f(h, r, t) = −‖h ◦ r − t‖, where h, r, t ∈ C^d. This enforces h ◦ r ≈ t for each positive triple (h, r, t).
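As a minimal sketch (our own illustration, not the official implementations), the two score functions above can be written with PyTorch tensors as follows. The L1 norm is used here, and RotatE's embeddings are represented as native complex tensors rather than the real/imaginary split used in the original code.

```python
import math
import torch

def transe_score(h: torch.Tensor, r: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """TransE: f(h, r, t) = -||h + r - t|| for real-valued embeddings of shape (batch, d)."""
    return -(h + r - t).abs().sum(dim=-1)

def rotate_score(h: torch.Tensor, theta: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """RotatE: f(h, r, t) = -||h o r - t|| in complex space, where r_i = exp(i * theta_i)."""
    r = torch.polar(torch.ones_like(theta), theta)   # unit-modulus complex rotations
    return -(h * r - t).abs().sum(dim=-1)            # element-wise rotation, then L1 norm

# Toy usage with batch size 2 and embedding dimension d = 4.
d = 4
h_re, r_re, t_re = (torch.randn(2, d) for _ in range(3))
print(transe_score(h_re, r_re, t_re))

h_c, t_c = (torch.randn(2, d, dtype=torch.cfloat) for _ in range(2))
theta = torch.rand(2, d) * 2 * math.pi               # relation-specific rotation angles
print(rotate_score(h_c, theta, t_c))
```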
DistMult [16] captures the interaction between elements that have the same index in h and t. Its score function is f_r(h, t) = h^⊤ diag(r) t = \sum_{i=1}^{d} r_i · h_i · t_i. The model can capture symmetric relations, but not antisymmetric ones.

ComplEx [13] was proposed as a more elegant way to address the shortcoming of DistMult in modeling antisymmetric relations. Its main contribution is to embed KGs in complex space. The score function is defined as f(h, r, t) = Re(⟨r, h, t̄⟩), where r, h, t ∈ C^d. Although effective, ComplEx is not expressive enough to model composition relations [12].

Several surveys have reviewed the existing KGE models and reported on their performance from different aspects and under different settings. In [5], combined evaluations have been carried out for multiple models and settings. That work lifts the models into one score function and combines them during the training phase, whereas we focus on stretching and squeezing the dimensions of copies of the same model, which are trained separately and combined only for testing.

3 Proposed Approach

Recent models such as RotatE, ComplEx, and QuatE obtain state-of-the-art performance in link prediction. They go beyond the real space and embed a KG into a complex or quaternion space. Therefore, the embedding vectors contain two (complex) or four (quaternion) parts. Consequently, designing KGE models with embeddings containing multiple parts (e.g. complex vectors with real and imaginary parts) has been shown to be effective in boosting performance. Based on this, we formulated the methodology of this research as follows:

Hypothesis 1. Multiple combined slices of a KGE model in low dimensions perform better than a single version of that model in high dimensions.

Our aim is to provide evidence in order to evaluate this hypothesis.

In contrast to the above-mentioned approaches, which use a single model with a high dimension, we propose an approach that combines multiple models of the same type, where each model contains low-dimensional embeddings. Therefore, the overall number of adjustable parameters remains the same compared to the approaches using a single model with high dimensions. In order to formulate this scenario, let M be a model (e.g. RotatE) with embedding dimension d_l. We follow the steps below in our approach:

(a) We first generate k copies of an underlying model M. The j-th slice of the model is denoted by M_j, j = 1, ..., k, and the corresponding d_l-dimensional embeddings of (h, r, t) are denoted by (h_j, r_j, t_j). The vectors are randomly initialized at the beginning of the learning process.
(b) We then train each of the models M_j, j = 1, ..., k, separately.
(c) Finally, the testing is performed by using the following score function:

    f(h, r, t) = \sum_{j=1}^{k} f_{M_j}(h_j, r_j, t_j),          (1)

where f_{M_j}(h_j, r_j, t_j) is the score of the triple (h, r, t) computed by the j-th copy of the model.

Note that all models are trained on the same KG and the only difference between them is the initialization of the embedding vectors. To have a fair comparison, we keep the overall number of adjustable parameters (embeddings) equal to that of the original single model, i.e. d_h = k × d_l. We will later show in the evaluation part that such a simple approach improves the performance of KGEs without additional cost.
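The following sketch illustrates steps (a)-(c) and the combined scoring of Eq. (1), assuming TransE as the underlying model M and a hypothetical `score` interface (our illustration under these assumptions, not the authors' implementation; in practice each copy would be trained separately before the combined scoring is applied at test time).

```python
import torch
import torch.nn as nn

class TransECopy(nn.Module):
    """One low-dimensional copy M_j: independently initialized TransE embeddings of size d_l."""
    def __init__(self, num_entities: int, num_relations: int, dim: int):
        super().__init__()
        self.ent = nn.Embedding(num_entities, dim)
        self.rel = nn.Embedding(num_relations, dim)

    def score(self, h_idx, r_idx, t_idx):
        h, r, t = self.ent(h_idx), self.rel(r_idx), self.ent(t_idx)
        return -torch.norm(h + r - t, p=1, dim=-1)   # TransE score of each triple

def ensemble_score(models, h_idx, r_idx, t_idx):
    """Eq. (1): sum the scores of the k separately trained copies at test time."""
    return torch.stack([m.score(h_idx, r_idx, t_idx) for m in models]).sum(dim=0)

# k = 6 copies with d_l = 200 on a hypothetical FB15k-sized KG;
# each copy would be trained on the same KG with a different random initialization.
models = [TransECopy(14_951, 1_345, 200) for _ in range(6)]
h_idx = torch.tensor([0, 1]); r_idx = torch.tensor([3, 7]); t_idx = torch.tensor([5, 2])
print(ensemble_score(models, h_idx, r_idx, t_idx))
```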
4 Experiments

Dataset We use FB15k and FB15k-237 for evaluation. FB15k contains 483,142 triples in the training set and 50,000 and 59,071 triples in the validation and test sets, respectively. FB15k-237 contains 272,115, 17,535 and 20,466 triples in the training, validation and test sets, respectively.

Evaluation Metrics We use Mean Reciprocal Rank (MRR) and Hits@n (n = 1, 3, 10) for evaluation. The detailed process of computing these metrics can be found in [3].

Experimental Setup We evaluate our proposed approach by training TransE, RotatE, DistMult and ComplEx both as single models with a high dimension and as multiple models with a low dimension. For single models with high dimensions, we set the embedding dimension to d_h ∈ {2, 4, 6, 8, 10, 12} × 100. For models sliced into multiple versions with low dimensions, we set the number of models to k = 2, ..., 6 and the dimension to d_l = 200. Note that each single model, e.g. RotatE with dimension d_h, is compared with multiple models, e.g. CRotatE (a combination of k = 6 RotatE models) with dimension d_l. Therefore, the number of parameters in both settings is the same, i.e. k × d_l = 1 × d_h, e.g. 6 × 200 = 1 × 1200. The implementation was done using PyTorch on GPU servers. We use the loss function of RotatE [12] for the models in Table 1. We use uniform negative sampling with only one negative sample.
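As a rough reminder of how the evaluation metrics above are obtained (our own sketch, not the evaluation code used for the experiments), MRR and Hits@n can be computed from the rank of the correct entity in each test query, e.g. under the filtered ranking protocol described in [3]:

```python
import torch

def mrr_and_hits(ranks: torch.Tensor, ns=(1, 3, 10)):
    """ranks: 1-based rank of the correct entity for every test query."""
    ranks = ranks.float()
    metrics = {"MRR": (1.0 / ranks).mean().item()}
    for n in ns:
        metrics[f"Hits@{n}"] = (ranks <= n).float().mean().item()
    return metrics

# Toy example: ranks of the true tail entity for five test triples.
print(mrr_and_hits(torch.tensor([1, 2, 4, 15, 1])))
```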
Table 1: Link prediction results on FB15K and FB15K-237.

Model                | FB15K                        | FB15K-237
                     | MRR   Hits@1 Hits@3 Hits@10  | MRR   Hits@1 Hits@3 Hits@10
TransE (d=200)       | 0.698 0.597  0.777  0.859    | 0.278 0.187  0.305  0.465
TransE (d=400)       | 0.718 0.623  0.790  0.866    | 0.285 0.192  0.315  0.476
TransE (d=600)       | 0.718 0.623  0.791  0.867    | 0.285 0.191  0.315  0.475
TransE (d=800)       | 0.716 0.620  0.790  0.866    | 0.283 0.191  0.310  0.473
TransE (d=1000)      | 0.712 0.616  0.786  0.864    | 0.280 0.188  0.308  0.470
TransE (d=1200)      | 0.704 0.604  0.781  0.862    | 0.277 0.186  0.303  0.464
CTransE (d=2*200)    | 0.720 0.624  0.794  0.870    | 0.288 0.193  0.318  0.479
CTransE (d=3*200)    | 0.726 0.632  0.799  0.873    | 0.292 0.198  0.324  0.485
CTransE (d=4*200)    | 0.729 0.635  0.800  0.874    | 0.295 0.200  0.324  0.490
CTransE (d=5*200)    | 0.731 0.639  0.802  0.875    | 0.297 0.202  0.327  0.490
CTransE (d=6*200)    | 0.732 0.640  0.802  0.876    | 0.298 0.202  0.329  0.491
DistMult (d=200)     | 0.653 0.540  0.744  0.827    | 0.222 0.148  0.241  0.372
DistMult (d=400)     | 0.675 0.564  0.764  0.839    | 0.227 0.148  0.248  0.388
DistMult (d=600)     | 0.682 0.568  0.774  0.852    | 0.228 0.147  0.248  0.392
DistMult (d=800)     | 0.685 0.570  0.777  0.854    | 0.227 0.144  0.250  0.397
DistMult (d=1000)    | 0.690 0.577  0.781  0.854    | 0.228 0.143  0.251  0.398
DistMult (d=1200)    | 0.688 0.573  0.781  0.855    | 0.227 0.142  0.249  0.399
CDistMult (d=2*200)  | 0.693 0.576  0.787  0.867    | 0.229 0.153  0.250  0.379
CDistMult (d=3*200)  | 0.703 0.587  0.799  0.874    | 0.232 0.156  0.255  0.384
CDistMult (d=4*200)  | 0.709 0.593  0.806  0.880    | 0.237 0.161  0.259  0.386
CDistMult (d=5*200)  | 0.716 0.601  0.810  0.882    | 0.237 0.160  0.260  0.389
CDistMult (d=6*200)  | 0.718 0.603  0.815  0.883    | 0.237 0.160  0.260  0.390
ComplEx (d=200)      | 0.635 0.510  0.734  0.835    | 0.229 0.153  0.250  0.381
ComplEx (d=400)      | 0.682 0.568  0.773  0.848    | 0.230 0.150  0.252  0.393
ComplEx (d=600)      | 0.687 0.572  0.780  0.860    | 0.229 0.145  0.252  0.398
ComplEx (d=800)      | 0.698 0.585  0.789  0.864    | 0.229 0.144  0.252  0.400
ComplEx (d=1000)     | 0.697 0.582  0.791  0.864    | 0.228 0.141  0.254  0.400
ComplEx (d=1200)     | 0.696 0.580  0.791  0.862    | 0.226 0.139  0.249  0.400
CComplEx (d=2*200)   | 0.684 0.562  0.782  0.869    | 0.235 0.158  0.255  0.391
CComplEx (d=3*200)   | 0.697 0.575  0.798  0.878    | 0.237 0.159  0.259  0.393
CComplEx (d=4*200)   | 0.705 0.583  0.807  0.882    | 0.239 0.161  0.261  0.396
CComplEx (d=5*200)   | 0.710 0.590  0.809  0.884    | 0.240 0.162  0.264  0.395
CComplEx (d=6*200)   | 0.710 0.590  0.810  0.886    | 0.240 0.162  0.264  0.398
RotatE (d=200)       | 0.680 0.563  0.773  0.859    | 0.282 0.190  0.309  0.469
RotatE (d=400)       | 0.719 0.617  0.801  0.871    | 0.295 0.203  0.324  0.482
RotatE (d=600)       | 0.729 0.631  0.807  0.873    | 0.293 0.200  0.323  0.485
RotatE (d=800)       | 0.735 0.639  0.810  0.873    | 0.294 0.200  0.323  0.482
RotatE (d=1000)      | 0.730 0.634  0.805  0.870    | 0.292 0.199  0.321  0.481
RotatE (d=1200)      | 0.727 0.630  0.802  0.868    | 0.290 0.197  0.319  0.478
CRotatE (d=2*200)    | 0.726 0.622  0.809  0.879    | 0.297 0.205  0.325  0.486
CRotatE (d=3*200)    | 0.739 0.639  0.821  0.885    | 0.301 0.208  0.331  0.492
CRotatE (d=4*200)    | 0.744 0.644  0.826  0.888    | 0.306 0.213  0.334  0.493
CRotatE (d=5*200)    | 0.748 0.651  0.829  0.890    | 0.306 0.213  0.337  0.496
CRotatE (d=6*200)    | 0.753 0.656  0.832  0.891    | 0.307 0.213  0.338  0.496

Table 2: Link prediction results on FB15K with one-to-N scoring and N3 regularization.

Model                | FB15K
                     | MRR   Hits@1 Hits@3 Hits@10
DistMult (d=200)     | 0.816 0.776  0.841  0.889
DistMult (d=400)     | 0.831 0.792  0.854  0.898
DistMult (d=600)     | 0.835 0.799  0.858  0.901
DistMult (d=800)     | 0.837 0.800  0.860  0.903
DistMult (d=1000)    | 0.839 0.799  0.866  0.909
DistMult (d=1200)    | 0.836 0.796  0.865  0.909
CDistMult (d=2*200)  | 0.836 0.800  0.859  0.901
CDistMult (d=3*200)  | 0.842 0.806  0.865  0.905
CDistMult (d=4*200)  | 0.844 0.809  0.867  0.908
CDistMult (d=5*200)  | 0.846 0.812  0.868  0.908
CDistMult (d=6*200)  | 0.847 0.813  0.869  0.909
ComplEx (d=200)      | 0.826 0.790  0.849  0.890
ComplEx (d=400)      | 0.840 0.806  0.860  0.898
ComplEx (d=600)      | 0.840 0.805  0.863  0.902
ComplEx (d=800)      | 0.842 0.802  0.870  0.908
ComplEx (d=1000)     | 0.842 0.803  0.869  0.909
ComplEx (d=1200)     | 0.843 0.802  0.871  0.910
CComplEx (d=2*200)   | 0.845 0.814  0.864  0.902
CComplEx (d=3*200)   | 0.851 0.820  0.870  0.906
CComplEx (d=4*200)   | 0.856 0.825  0.873  0.909
CComplEx (d=5*200)   | 0.858 0.827  0.876  0.911
CComplEx (d=6*200)   | 0.859 0.829  0.877  0.911
QuatE (d=1000)       | 0.833 0.800  0.859  0.900

Results The results are shown in Table 1 and Table 2. Table 1 presents the results on FB15k and FB15k-237. As can be seen, multiple models with a low dimension (whose names start with "C", such as CTransE) outperform single models with high dimensions. For example, on FB15K, the single ComplEx model with dimension 800 obtains 0.698, 0.585, 0.789 and 0.864 on MRR, Hits@1, Hits@3 and Hits@10, respectively, whereas CComplEx with k = 4, d_l = 200 obtains 0.705, 0.583, 0.807 and 0.882. Therefore, on all metrics except Hits@1, CComplEx outperforms ComplEx while both models use the same number of parameters (4 × 200 = 1 × 800).

5 Conclusion

In this paper, we compared the performance of a single model using a high dimension with multiple combined models, each using a low dimension. Our experimental evaluation on FB15k and FB15k-237 shows that instead of using a single model with dimension d_h, using k models with dimension d_l = d_h/k results in higher accuracy.

Acknowledgement This work is supported by the EC Horizon 2020 grants LAMBDA (GA no. 809965) and CLEOPATRA (GA no. 812997), and the German nationally funded BMBF project MLwin.

References

1. F. Akrami, M. S. Saeef, Q. Zhang, W. Hu, and C. Li. Realistic re-evaluation of knowledge graph completion methods: An experimental study. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pages 1995–2010, 2020.
2. K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1247–1250. ACM, 2008.
3. A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko. Translating embeddings for modeling multi-relational data.
In Advances in Neural Information Processing Systems, pages 2787–2795, 2013.
4. G. Ji, S. He, L. Xu, K. Liu, and J. Zhao. Knowledge graph embedding via dynamic mapping matrix. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 687–696, 2015.
5. D. Krompaß and V. Tresp. Ensemble solutions for link-prediction in knowledge graphs. In Proceedings of the 2nd Workshop on Linked Data for Knowledge Discovery, Porto, Portugal, pages 1–10, 2015.
6. T. Lacroix, N. Usunier, and G. Obozinski. Canonical tensor decomposition for knowledge base completion. arXiv preprint arXiv:1806.07297, 2018.
7. Y. Lin, Z. Liu, M. Sun, Y. Liu, and X. Zhu. Learning entity and relation embeddings for knowledge graph completion. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
8. G. A. Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995.
9. M. Nickel, K. Murphy, V. Tresp, and E. Gabrilovich. A review of relational machine learning for knowledge graphs. Proceedings of the IEEE, 104(1):11–33, 2015.
10. M. Nickel, V. Tresp, and H.-P. Kriegel. Factorizing YAGO: scalable machine learning for linked data. In Proceedings of the 21st International Conference on World Wide Web, pages 271–280. ACM, 2012.
11. H. Paulheim. Knowledge graph refinement: A survey of approaches and evaluation methods. Semantic Web, 8(3):489–508, 2017.
12. Z. Sun, Z.-H. Deng, J.-Y. Nie, and J. Tang. RotatE: Knowledge graph embedding by relational rotation in complex space. arXiv preprint arXiv:1902.10197, 2019.
13. T. Trouillon, J. Welbl, S. Riedel, É. Gaussier, and G. Bouchard. Complex embeddings for simple link prediction. In International Conference on Machine Learning, pages 2071–2080, 2016.
14. Q. Wang, Z. Mao, B. Wang, and L. Guo. Knowledge graph embedding: A survey of approaches and applications. IEEE Transactions on Knowledge and Data Engineering, 29(12):2724–2743, 2017.
15. Z. Wang, J. Zhang, J. Feng, and Z. Chen. Knowledge graph embedding by translating on hyperplanes. In AAAI, volume 14, pages 1112–1119, 2014.
16. B. Yang, W.-t. Yih, X. He, J. Gao, and L. Deng. Embedding entities and relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575, 2014.
17. S. Zhang, Y. Tay, L. Yao, and Q. Liu. Quaternion knowledge graph embeddings. In Advances in Neural Information Processing Systems, pages 2735–2745, 2019.