Contextual Language Models for Knowledge Graph Completion

Russa Biswas, Radina Sofronova, Mehwish Alam, and Harald Sack
FIZ Karlsruhe - Leibniz Institute for Information Infrastructure, Karlsruhe Institute of Technology, AIFB
firstname.lastname@fiz-karlsruhe.de

Abstract. Knowledge Graphs (KGs) have become the backbone of various machine learning based applications over the past decade. However, KGs are often incomplete and inconsistent. Several representation learning based approaches have been introduced to complete the missing information in KGs. In parallel, Neural Language Models (NLMs) have gained huge momentum in NLP applications. However, exploiting contextual NLMs to tackle the Knowledge Graph Completion (KGC) task is still an open research problem. In this paper, a GPT-2 based KGC model is proposed and evaluated on two benchmark datasets. The initial results obtained from fine-tuning the GPT-2 model for triple classification underline the importance of using NLMs for KGC. Furthermore, the impact of contextual language models on KGC is discussed.

Keywords: GPT-2 · Knowledge Graph Embedding · Triple Classification

1 Introduction

Knowledge Graphs (KGs) such as DBpedia, YAGO, Freebase, etc. have emerged as the backbone of various applications in Natural Language Processing (NLP) such as entity linking [9], question answering [2], etc. KGs are multi-relational directed graphs with real world entities as nodes and the relationships between them represented on the edges. Facts are represented as triples <h, r, t>, where h and t are the head and tail entities respectively and r represents the relation between them. However, these KGs are often incomplete. Knowledge Graph Completion (KGC) is the task of predicting missing links between entities, mining missing relations, and discovering new facts. Recent years have witnessed extensive research on KGC with a focus on representation learning. Most of these models use structural information, i.e., the triple information, such as TransE [3] and ConvE [5], whereas a few others include textual entity descriptions, such as TEKE [22], DKRL [25], etc. However, the models considering textual information leverage only static word embedding approaches, such as word2vec, GloVe, etc., to generate the latent representation of the textual entity descriptions. Consequently, the semantic information encoded in contextual entity embeddings is not exploited for KGC.

On the other hand, pre-trained contextualized Neural Language Models (NLMs) such as BERT [11] and GPT-2 [20] have gained huge momentum in NLP applications. These models are trained on huge amounts of free text, which results in the encoding of semantic information and leads to better linguistic representations of words. GPT-2 is one of the distinguished models that has achieved state-of-the-art results on various language understanding tasks. It operates on a transformer decoder architecture with attention masks to predict the next word of a sequence. However, exploiting contextualized NLMs for the task of KGC is still an open research problem. KG-BERT [29] is one of the pioneers in this line of research, in which the BERT model is fine-tuned on KG data and used for link prediction and triple classification as sub-tasks of KGC.
The results presented in [29] show that the information contained in pre-trained NLMs plays an important role in predicting the missing links in a KG. Inspired by KG-BERT, a novel GPT-2 based KGC model is explored in this work for the triple classification sub-task. The triples in a KG are treated as sentences and triple classification is cast as a sequence classification problem. Furthermore, an analysis of contextualized NLMs for KGC is provided.

The rest of the paper is organised as follows. To begin with, a review of the related work is provided in Section 2, followed by the preliminaries in Section 3. Section 4 outlines the proposed approach, followed by the experimental results in Section 5. Finally, an outlook on future work is provided in Section 6.

2 Related Work

This section presents the state-of-the-art (SOTA) models for KG embeddings, with a focus on models that consider textual descriptions. A large variety of KG embedding approaches has been explored for the task of link prediction, such as translational models like TransE [3] and its variants, semantic matching models like DistMult [28], neural network based models like ConvE [5], graph structure based models like GAKE [6], and literal (e.g., text, image, number, etc.) based models like DKRL [25], Jointly(ALSTM) [27], MKBE [17], etc.

In a translational model such as TransE [3], given a triple (e_h, r, e_t) in a KG G, the relation r is considered as a translation operation between the head and tail entities in a low dimensional vector space, defined by e_h + r ≈ e_t, where e_h, r, and e_t are the embeddings of the head entity, the relation, and the tail entity respectively. Another set of algorithms improves KG embeddings by taking into account different kinds of literals, such as numeric, text, or image literals; a detailed analysis of these methods is provided in [7]. DKRL [25] extends TransE [3] by incorporating textual entity descriptions into the model. The textual entity descriptions are encoded using a continuous bag-of-words approach as well as a deep convolutional neural network based approach. Jointly(ALSTM) [27] is another entity description based embedding model which extends the DKRL model with a gate strategy and uses an attentive LSTM to encode the textual entity descriptions. KG-BERT [29] is a contextual NLM based model in which BERT is fine-tuned on KG data and used for downstream KGC tasks. However, except for KG-BERT, none of these models use contextual NLMs to encode the triples or the entity descriptions. Therefore, this study proposes a novel model in which GPT-2 is fine-tuned on the KG for KGC.

3 Preliminaries

A detailed explanation of pre-trained NLMs and KGC is provided in this section.

3.1 Language Models

A Language Model (LM) learns the probability of word occurrences based on a text corpus and is used for various machine learning based NLP applications such as Machine Translation [12], Speech Recognition [30], etc. It is the task of assigning a probability to each sequence of words, or a probability for the likelihood of a given word based on a sequence of words [8]. LMs can be broadly divided into:

– Statistical Language Models (SLMs) are n-gram based approaches that assign probabilities to a sequence s of n words, given by

  P(s) = P(w_1 w_2 ... w_n) = P(w_1) P(w_2 | w_1) ... P(w_n | w_1 w_2 ... w_{n-1}),   (1)

  where w_i denotes the i-th word in the sequence s.
  The probability of a word sequence is thus the product of the conditional probabilities of each word given the previous words, i.e., the context [10]. SLMs fail to assign probabilities to n-grams that do not appear in the training corpus, which is tackled using smoothing techniques. However, the curse of dimensionality prevents SLMs from being trained on huge corpora.

– Neural Language Models (NLMs), on the other hand, are neural network based LMs that learn distributed representations of words in a continuous low-dimensional vector space. Semantically similar words appear closer to each other in the embedding space. The contextual information is captured at different levels of the text corpus, such as sentences, sub-words, and characters, as well as the entire corpus. NLMs such as Word2Vec [13], BERT [11], GPT [19], etc. are beneficial for several NLP downstream tasks, such as question answering [23], sentiment analysis [26], etc. As mentioned in [18], these models can be further sub-divided into (i) non-contextual and (ii) contextual embeddings. Non-contextual word embeddings such as Word2Vec, GloVe, etc. are static in nature and context independent. Although the latent representations of the words capture their semantic meaning, they do not change dynamically according to the context in which the words appear. Contextual embeddings such as BERT, GPT, etc., in contrast, encode the semantics of a word differently depending on its context.

All these language models are trained on huge unlabelled text corpora, resulting in a large number of model parameters. The pre-trained models therefore help in learning universal language representations of words. This provides a better initialization of the model and leads to better generalization performance on downstream tasks. Pre-training of NLMs also helps to avoid overfitting on small corpora [18], and it improves the reusability of the model since training from scratch is not needed. However, fine-tuning of a pre-trained contextual NLM is often required to adapt the model to the specific data of the downstream task. It bridges the gap between the data on which a particular NLM was trained and the target data distribution.

3.2 Knowledge Graph Completion

KGC is the task of predicting missing instances or links in order to deal with the incompleteness and sparsity of KGs. As explained in [4], KGC methods can be broadly divided into the following classes:

– Rule Based Models that use rules or statistical features, such as NELL [15], KGRL [24], etc., to infer new knowledge in KGs.
– Representation Learning Based Models such as TransE [3], ConvE [5], etc., that learn latent representations of the entities and relations in a low-dimensional continuous vector space, in which semantically similar entities are placed closer to each other. These representations are then used for the KGC tasks of link prediction and triple classification.

In the link prediction task, the missing head or tail entity in a triple <?, r, t> or <h, r, ?> is predicted by defining a mapping function ψ : E × R × E → ℝ, where E and R are the sets of entities and relations in the KG and ℝ denotes the real numbers. A score is assigned to each triple, where a higher score indicates that the triple is more likely to be true. The triple classification task involves training a binary classifier that decides whether a given triple is false (0) or true (1).
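To make this scoring-based formulation concrete, the following minimal Python sketch shows a TransE-style scoring function ψ together with a threshold-based triple classifier. It is only an illustration of the definitions above, not the model evaluated later in this paper; the entity names, random embeddings, and threshold are purely hypothetical.

```python
import numpy as np

# Purely illustrative: random vectors stand in for learned entity/relation embeddings.
rng = np.random.default_rng(42)
dim = 50
entities = {e: rng.normal(size=dim) for e in ["AlbertEinstein", "Germany", "Europe"]}
relations = {r: rng.normal(size=dim) for r in ["bornIn", "locatedIn"]}

def psi(h: str, r: str, t: str) -> float:
    # TransE-style score: the closer e_h + r is to e_t, the higher the score.
    return -float(np.linalg.norm(entities[h] + relations[r] - entities[t]))

def classify_triple(h: str, r: str, t: str, threshold: float) -> int:
    # Triple classification: label 1 (true) if the score exceeds a threshold, else 0 (false).
    return int(psi(h, r, t) > threshold)

print(psi("AlbertEinstein", "bornIn", "Germany"))
print(classify_triple("AlbertEinstein", "bornIn", "Germany", threshold=-10.0))
```

In link prediction, the same score would be used to rank all candidate head (or tail) entities for a partial triple.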
4 Language Models for Knowledge Graph Completion

This section provides an analysis of NLMs on KGs, followed by a detailed description of the GPT-2 based KGC task. The basic idea of the approach lies in the fact that contextual NLMs trained on huge corpora also capture relational information present in the training data [16]. Consequently, NLM models can be further exploited to predict missing links in a KG. However, the impact of pre-trained contextual NLMs on KGC is still an open research question.

BERT for KGC One of the pioneers in this domain is the KG-BERT [29] model, in which the pre-trained BERT model is fine-tuned on KGs for KGC. Each triple <h, r, t> is considered as a sentence and is provided as an input sequence to the BERT model for fine-tuning. For the entities, KG-BERT has been trained with either the entity names or their textual entity descriptions. The first token of every input sequence is always [CLS], whereas the separator token [SEP] separates the head entity, the relation, and the tail entity. Therefore, each input sequence for the BERT model is given by ([CLS] head entity/description [SEP] relation [SEP] tail entity/description [SEP]). A sigmoid scoring function is introduced on top of the final layer for triple classification, producing a 2-dimensional vector with values in [0, 1].

GPT-2 for KGC Inspired by KG-BERT, GPT-2 [20] is exploited in this work for KGC. GPT-2 is a large transformer-based language model trained on 8 million web pages with up to 1.5 billion parameters. The model predicts the next word based on all the previous words in the text corpus. An attention mechanism is used to selectively focus on segments of the input text. The architecture comprises a 12-layer decoder-only transformer using 12 masked self-attention heads with 64-dimensional states each. Adam optimization is used, and the learning rate is increased linearly from zero to a maximum of 2.5 × 10^-4. The model was able to outperform previous NLMs on language tasks like question answering, reading comprehension, summarization, translation, etc. The basic difference between BERT and GPT-2 is that BERT uses transformer encoder blocks whereas GPT-2 uses transformer decoder blocks.

Similar to KG-BERT, GPT-2 is fine-tuned with KG triples, where each triple is considered as an input sequence. In this model, two variants have been used to model the input sequence for the fine-tuning task. Given a triple <Albert Einstein, bornIn, Germany>, the input sequence is modelled as
– Albert Einstein bornIn Germany [EOS],
– [BOS] Albert Einstein [EOS] bornIn [EOS] Germany [EOS],
where [BOS] and [EOS] are the beginning-of-sequence and end-of-sequence tokens respectively. Both entity names and descriptions are considered for the head and tail entities. The input sequences are fed into the GPT-2 model architecture, which is a transformer decoder based on the original implementation [20]. It consists of stacked decoder blocks of the transformer architecture, and the context vector is initialised with zero for the first word embedding. Masked self-attention is used to extract information from the prior words in the sentence as well as the context word. The word vectors in the first layer of GPT-2 follow byte pair encoding, i.e., tokens are parts of words: the tokenized word list is compressed into a set of vocabulary items by considering the most common word components.
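The paper does not provide implementation details for this step; the sketch below shows one possible way to build the two input sequence variants and inspect their byte pair encoding, assuming the Hugging Face transformers library, whose pre-trained GPT-2 tokenizer reuses the <|endoftext|> token as both its beginning-of-sequence and end-of-sequence marker.

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
bos, eos = tokenizer.bos_token, tokenizer.eos_token  # both "<|endoftext|>" for GPT-2

head, relation, tail = "Albert Einstein", "bornIn", "Germany"

# Variant 1: plain concatenation of the triple, terminated by the end-of-sequence token.
seq_v1 = f"{head} {relation} {tail} {eos}"
# Variant 2: head, relation and tail separated by [BOS]/[EOS]-style delimiters.
seq_v2 = f"{bos} {head} {eos} {relation} {eos} {tail} {eos}"

# Byte pair encoding splits the sequences into sub-word tokens of the GPT-2 vocabulary.
print(tokenizer.tokenize(seq_v1))
print(tokenizer(seq_v2)["input_ids"])
```

For the description-based variant, the entity names above would simply be replaced by the textual entity descriptions.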
The GPT-2 sequence classification module is leveraged to determine the plausibility of the triples. Since GPT-2 outputs one token at a time, the classifier is built on the last token. A sigmoid scoring function producing a 2-dimensional vector with values in [0, 1] is introduced for triple classification.

5 Experiments

This section provides an analysis of the initial results obtained by deploying the GPT-2 model on the triple classification task for KGC. The model has been evaluated on the two benchmark datasets WN11 and FB13.

Table 1. Dataset Statistics

Dataset   #Ent.    #Rel.   #Train    #Val.   #Test
WN11      38,696   11      112,581   2,609   10,544
FB13      75,043   13      316,232   5,908   23,733

Table 2. Results of Language Models on Triple Classification (accuracy in %)

Model Type: KG embeddings with textual / contextual LMs
Model                           WN11   FB13
TEKE                            86.1   84.2
KG-BERT (labels)                93.5   79.2
KG-BERT (description)           -      90.4
Ours with GPT-2 (labels)        83     73
Ours with GPT-2 (description)   85     89

Datasets The two benchmark datasets WN11 and FB13 are subsets of the WordNet and Freebase KGs respectively and were introduced in [21]. WordNet [14] is a large lexical KG of English comprising nouns, verbs, adjectives and adverbs. These are grouped into sets of cognitive synonyms known as synsets, each of which expresses a distinct concept. The synsets are interlinked by means of conceptual-semantic and lexical relations. Freebase [1] is a large collaborative KG consisting of structured data captured from various sources, including individual, user-submitted wiki contributions. The statistics of the KGs used for fine-tuning with GPT-2 and the subsequent triple classification are provided in Table 1.

Experimental Setup The pre-trained GPT-2 base model with 12 decoder layers, a hidden size of 768, 12 attention heads and 117M parameters is used for fine-tuning. The sets of hyperparameters explored are as follows: batch size = {256, 128, 32, 8, 1}, epochs = {5, 3}, and learning rate = {2e-5, 5e-5}. The experiments with GPT-2 have been performed on an Ubuntu 16.04.5 LTS system with 503GB RAM and a Tesla V100S GPU.

Results The results depicted in Table 2 represent initial results on the triple classification task using the pre-trained GPT-2 model on KGs. Since all the triples in the training set are true, a negative sampling method is used to generate synthetic negative triples for the training of the classifier. The negative triples are generated by replacing the head and the tail entities with arbitrary entities based on a local closed world assumption. In this work, a filtered setting is used, i.e., if true triples happen to be generated by the negative sampling method, they are removed. Therefore, the sets of triples in the train, test, and validation sets are disjoint. A minimal sketch of this sampling procedure is given after the baseline description below.

TEKE [22] and KG-BERT are considered as baseline models as they use NLMs to model the KGs for KGC. TEKE exploits structural information of the KGs using an embedding layer and a BiLSTM layer followed by a mutual attention layer.
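The exact sampling code used for these experiments is not published with the paper; the following Python sketch merely mirrors the corruption strategy described above under one common reading of the local closed world assumption: the head or the tail of a true triple is replaced by a randomly chosen entity, and any corruption that is itself a true triple is discarded (filtered setting). Entity and triple names are hypothetical.

```python
import random

def negative_samples(true_triples, entity_list, n_per_triple=1, seed=0):
    """Corrupt the head or the tail of each true triple; keep only corruptions
    that are not themselves known true triples (filtered setting)."""
    rng = random.Random(seed)
    true_set = set(true_triples)
    negatives = []
    for h, r, t in true_triples:
        for _ in range(n_per_triple):
            while True:
                if rng.random() < 0.5:                 # corrupt the head ...
                    candidate = (rng.choice(entity_list), r, t)
                else:                                  # ... or the tail
                    candidate = (h, r, rng.choice(entity_list))
                if candidate not in true_set:          # discard accidental true triples
                    negatives.append(candidate)
                    break
    return negatives

# Hypothetical toy example, not taken from WN11 or FB13.
triples = [("AlbertEinstein", "bornIn", "Germany"), ("Germany", "locatedIn", "Europe")]
entities = ["AlbertEinstein", "Germany", "Europe", "MarieCurie"]
print(negative_samples(triples, entities))
```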
Table 3. Results with the pre-trained GPT-2 model for Triple Classification with different parameter settings

Dataset  Feature      Model details                  Precision  Recall  F1-score
WN11     Labels       batch=128, epoch=10, lr=2e-5   0.76       0.76    0.76
WN11     Labels       batch=32, epoch=3, lr=5e-5     0.74       0.74    0.74
WN11     Labels       batch=1, epoch=3, lr=5e-5      0.83       0.83    0.83
WN11     Description  batch=8, epoch=5, lr=2e-5      0.79       0.79    0.79
WN11     Description  batch=1, epoch=3, lr=5e-5      0.85       0.85    0.85
FB13     Labels       batch=32, epoch=10, lr=2e-5    0.69       0.64    0.61
FB13     Labels       batch=256, epoch=5, lr=2e-5    0.68       0.68    0.68
FB13     Description  batch=1, epoch=3, lr=5e-5      0.90       0.89    0.89

The results of the baselines are taken from the KG-BERT paper [29], except for the KG-BERT (labels) variant on FB13. The experiment for this variant is performed with the same settings as mentioned in [29]. It is observed from the results that GPT-2 achieves results comparable to those of the previous models. The results are better for the GPT-2 variant with descriptions, because the textual entity descriptions carry more contextual information, resulting in better representations of the triples. The same behaviour has been observed for KG-BERT. Since the NLMs are trained on large corpora, the model parameters contain a huge amount of linguistic knowledge, which helps in overcoming the data sparsity problem in KGs. Furthermore, a main advantage of contextual NLM based KGC methods is that they do not consider the structural information of the entities in a KG; hence they are independent of any underlying structure of the KG. These models are therefore also applicable to less popular entities in KGs, which have fewer triples than others. The task of triple classification in KGC with GPT-2 is similar to sequence classification on text, and the self-attention mask helps in identifying the important words in the sequences. The variants with labels, i.e., the entity names, work better for WN11 than for FB13 for both KG-BERT and the proposed GPT-2 based model. This is because WordNet is a linguistic KG and the NLMs are able to capture more information from the entity names than is the case for FB13.

Table 3 depicts the precision, recall, and F1-score of the model with different hyperparameter settings. It is observed that the best results are obtained with batch=1, epoch=3, and lr=5e-5. Changing the number of epochs does not lead to much variation in model performance, whereas the batch size does: the lower the batch size, the better the performance of the model.

6 Conclusion and Future Work

This work presents an analysis of the effect of exploiting NLMs for KGC. A novel GPT-2 based KGC model has been proposed. The initial results on the triple classification sub-task show that the semantic information stored in the NLMs can provide vital information for the KGC task. In the future, further hyperparameter tuning will be carried out to improve model performance, and additional experiments on the link prediction sub-task will be conducted.

References

1. Bollacker, K., Evans, C., Paritosh, P., Sturge, T., Taylor, J.: Freebase: a collaboratively created graph database for structuring human knowledge. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. pp. 1247–1250 (2008)
2. Bordes, A., Chopra, S., Weston, J.: Question answering with subgraph embeddings. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 615–620 (2014)
3. Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., Yakhnenko, O.: Translating embeddings for modeling multi-relational data. Advances in Neural Information Processing Systems 26 (2013)
4. Chen, Z., Wang, Y., Zhao, B., Cheng, J., Zhao, X., Duan, Z.: Knowledge graph completion: A review. IEEE Access 8, 192435–192456 (2020)
5. Dettmers, T., Minervini, P., Stenetorp, P., Riedel, S.: Convolutional 2D knowledge graph embeddings. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)
6. Feng, J., Huang, M., Yang, Y., Zhu, X.: GAKE: Graph aware knowledge embedding. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. pp. 641–651. The COLING 2016 Organizing Committee, Osaka, Japan (Dec 2016), https://www.aclweb.org/anthology/C16-1062
7. Gesese, G.A., Biswas, R., Alam, M., Sack, H.: A survey on knowledge graph embeddings with literals: Which model links better literal-ly? Semantic Web (Preprint), 1–31
8. Goldberg, Y.: Neural network methods for natural language processing. Synthesis Lectures on Human Language Technologies 10(1), 1–309 (2017)
9. Hoffart, J., Yosef, M.A., Bordino, I., et al.: Robust disambiguation of named entities in text. In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP 2011. pp. 782–792 (2011)
10. Jing, K., Xu, J.: A survey on neural network language models. arXiv preprint arXiv:1906.03591 (2019)
11. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT. pp. 4171–4186 (2019)
12. Koehn, P.: Statistical machine translation. Cambridge University Press (2009)
13. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems. pp. 3111–3119 (2013)
14. Miller, G.A.: WordNet: a lexical database for English. Communications of the ACM 38(11), 39–41 (1995)
15. Paulheim, H., Bizer, C.: Improving the quality of linked data using statistical distributions. International Journal on Semantic Web and Information Systems (IJSWIS) 10(2), 63–86 (2014)
16. Petroni, F., Rocktäschel, T., Riedel, S., Lewis, P., Bakhtin, A., Wu, Y., Miller, A.: Language models as knowledge bases? In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). pp. 2463–2473 (2019)
17. Pezeshkpour, P., Chen, L., Singh, S.: Embedding multimodal relational data for knowledge base completion. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 3208–3218 (2018)
18. Qiu, X., Sun, T., Xu, Y., Shao, Y., Dai, N., Huang, X.: Pre-trained models for natural language processing: A survey. Science China Technological Sciences pp. 1–26 (2020)
19. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language understanding by generative pre-training (2018)
20. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI Blog 1(8), 9 (2019)
21. Socher, R., Chen, D., Manning, C.D., Ng, A.: Reasoning with neural tensor networks for knowledge base completion. In: Advances in Neural Information Processing Systems. pp. 926–934 (2013)
22. Wang, Z., Li, J., Liu, Z., Tang, J.: Text-enhanced representation learning for knowledge graph.
In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI). pp. 4–17 (2016)
23. Wang, Z., Ng, P., Ma, X., Nallapati, R., Xiang, B.: Multi-passage BERT: A globally normalized BERT model for open-domain question answering. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). pp. 5878–5882 (2019)
24. Wei, Y., Luo, J., Xie, H.: KGRL: an OWL2 RL reasoning system for large scale knowledge graph. In: 2016 12th International Conference on Semantics, Knowledge and Grids (SKG). pp. 83–89. IEEE (2016)
25. Xie, R., Liu, Z., Jia, J., Luan, H., Sun, M.: Representation learning of knowledge graphs with entity descriptions. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 30 (2016)
26. Xu, H., Liu, B., Shu, L., Yu, P.S.: BERT post-training for review reading comprehension and aspect-based sentiment analysis. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 2324–2335 (2019)
27. Xu, J., Qiu, X., Chen, K., Huang, X.: Knowledge graph representation with jointly structural and textual encoding. In: Proceedings of the 26th International Joint Conference on Artificial Intelligence. pp. 1318–1324 (2017)
28. Yang, B., Yih, W., He, X., Gao, J., Deng, L.: Embedding entities and relations for learning and inference in knowledge bases. In: Bengio, Y., LeCun, Y. (eds.) 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings (2015), http://arxiv.org/abs/1412.6575
29. Yao, L., Mao, C., Luo, Y.: KG-BERT: BERT for knowledge graph completion. arXiv preprint arXiv:1909.03193 (2019)
30. Yu, D., Deng, L.: Automatic Speech Recognition. Springer (2016)