<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Entity Type Prediction in Knowledge Graphs using Embeddings</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Russa Biswas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Radina Sofronova</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mehwish Alam</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Harald Sack</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>FIZ Karlsruhe - Leibniz Institute for Information Infrastructure</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Karlsruhe Institute of Technology, Institute AIFB</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Open Knowledge Graphs (such as DBpedia, Wikidata, YAGO) have been recognized as the backbone of diverse applications in the field of data mining and information retrieval. Hence, the completeness and correctness of Knowledge Graphs (KGs) are vital. Most of these KGs are created either via automated information extraction from Wikipedia snapshots, via information accumulated by users, or by using heuristics. However, it has been observed that the type information in these KGs is often noisy, incomplete, and incorrect. To deal with this problem, a multi-label classification approach for entity typing using KG embeddings is proposed in this work. We compare our approach with the current state-of-the-art type prediction method and report on experiments with the KGs.</p>
      </abstract>
      <kwd-group>
        <kwd>Type Prediction</kwd>
        <kwd>Knowledge Graph Embeddings</kwd>
        <kwd>Knowledge Graph Completion</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Open Knowledge Graphs (KGs) such as DBpedia, Wikidata, YAGO, etc. have been recognized as the foundations for diverse KG-based applications including Natural Language Processing, data mining, and Information Retrieval. Most of these KGs are created either via automated information extraction from Wikipedia snapshots, via information accumulated by users, or by using heuristics. However, each KG follows a different knowledge organization and is based on differently structured ontologies. Moreover, it has been observed that type information is often noisy or incomplete. On the other hand, these KGs contain huge amounts of data, which makes them difficult for applications to use. Therefore, recent years have witnessed extensive research on latent representations of KGs in a low-dimensional vector space. In this work, the proposed method addresses the entity typing problem in DBpedia as a multi-label classification problem using these embeddings.</p>
      <p>Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>Entity typing is the process of assigning a type to an entity and is a fundamental task in KG completion. For example, the triple &lt;dbr:Albert_Einstein, rdf:type, dbo:Scientist&gt; states that Albert Einstein is assigned to the type class Scientist. The type information in DBpedia is derived directly by an external extraction framework from the Wikipedia infobox types. Since Wikipedia is a crowd-sourced encyclopedia, this type information is often incomplete. Therefore, a huge number of entities in DBpedia are assigned only a coarse-grained rdf:type. Table 1 provides the distribution of entities of five types. For example, the class dbo:SportsTeam has 14 subclasses in DBpedia and 352006 entities, out of which only 8.9% are assigned to its subclasses. Hence, there arises a necessity for fine-grained types for the entities in the KGs.</p>
      <p>
        On the other hand, most existing state-of-the-art KG embedding approaches, such as the translational models TransE [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and TransR [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], exploit only the structure of the KG. However, besides the structural information, implicit textual semantic information is also stored in the KGs, as illustrated in Figure 1. This subgraph depicts that the birthplace of "Albert Einstein" is "Ulm", which is located in the country "Germany". The labels in the triples of the subgraph, such as birthplace, country, Ulm, etc., contain implicit textual information that is not captured by translational embedding models.
      </p>
      <p>
        In this paper, a multi-label classification approach is proposed for fine-grained entity typing. To do so, the model uses different existing word embedding models such as Word2Vec [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], GloVe [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], and FastText [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] to learn KG embeddings capturing the graph structure as well as the implicit textual information available. The main contributions of this paper are:
- Vector representation of entities and relations in DBpedia using the existing word embedding models.
- A multi-label classification based approach for fine-grained entity typing.
- An analysis and comparison of the aforementioned word embedding models for the task of entity type prediction.
      </p>
      <p>The rest of the paper is structured as follows. To begin with, a review of the related work is provided in Section 2. Section 3 contains a detailed description of the approach, followed by the experimental setup and a report on the results in Section 4. Finally, an outlook on future work is provided in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        This section presents prior related work on entity typing, considering both Wikipedia infobox type prediction and RDF type prediction.
Wikipedia Infobox Type Prediction. One of the initial works in this domain was proposed by Wu et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Their system KYLIN considers pages having similar infoboxes and determines the common attributes in them to learn a CRF extractor. Sultana et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] focus on an automated approach by training an SVM classifier on TF-IDF features of the first k sentences of an article as well as on categories and Named Entity mentions. Biswas et al. [
        <xref ref-type="bibr" rid="ref2 ref3">3, 2</xref>
        ] provide a neural network based approach for infobox prediction using word embeddings on the abstract, table of contents, and categories of Wikipedia articles.
      </p>
      <p>
        RDF Type Prediction. A statistical heuristic link based type prediction mechanism, SDType, has been proposed by Paulheim et al. and was evaluated on DBpedia and OpenCyc [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Another approach to RDF type prediction in KGs has been studied by Melo et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], where type prediction is performed via the hierarchical SLCN algorithm, using a set of incoming and outgoing relations as features for classification. In [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], the authors propose a supervised hierarchical SVM classification approach for DBpedia by exploiting the contents of Wikipedia articles. However, none of these methods exploit embeddings to perform the type prediction. In this work, different word embedding algorithms are exploited on the KGs for the task of entity typing.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Entity Typing using Embeddings</title>
      <p>The task of entity type prediction is a multi-label classification problem, considering the entity type information as classes, as discussed in this section.</p>
      <sec id="sec-3-1">
        <title>Word Embeddings on KGs</title>
        <p>
          Each triple or fact in the KG is considered as a sentence, where the relation serves as a verb and the two entities are considered as the subject and the object of this relation in the sentence. For example, &lt;dbr:Albert_Einstein, dbo:birthPlace, dbr:Ulm&gt; is considered as a sentence. These sentences are then used as a corpus for all three word embedding models. The URIs are considered for training. The dimension of the vectors for each of the embedding models is 100, and the embeddings from all the models for DBpedia are available in our GitHub [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
Word2Vec. Word2Vec aims to learn distributed representations for words, reducing the high-dimensional word representations of a large corpus. It comprises two model architectures, Continuous Bag of Words (CBOW) and Skip-gram. In the CBOW approach, the model predicts the current word from a window of context words, whereas the Skip-gram model tries to predict the context words based on the current word. In this work, the CBOW approach of the Word2Vec model has been used to learn the vector representation of the entities and relations in the KG based on the context entity or relation.
        </p>
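        <p>The corpus construction described above can be sketched as follows (our illustration of the idea, not the authors' released code; the URIs and helper name are hypothetical). Each triple is flattened into a three-token "sentence" over URIs, which any sentence-based embedding trainer, such as a CBOW Word2Vec implementation, could then consume:

```python
def triples_to_corpus(triples):
    """Turn (subject, predicate, object) URI triples into token 'sentences'."""
    # Each fact keeps its subject-verb-object order, so a context window
    # covers the entities and the relation of the same fact.
    return [[s, p, o] for (s, p, o) in triples]

triples = [
    ("dbr:Albert_Einstein", "dbo:birthPlace", "dbr:Ulm"),
    ("dbr:Ulm", "dbo:country", "dbr:Germany"),
]
corpus = triples_to_corpus(triples)
assert corpus[0] == ["dbr:Albert_Einstein", "dbo:birthPlace", "dbr:Ulm"]
```

A corpus built this way can be passed unchanged to any trainer that expects tokenized sentences.</p>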
        <p>FastText. FastText is an extension of the Word2Vec model which follows both the CBOW and Skip-gram architectures. The main difference from Word2Vec is that it learns the representation of each word in the corpus from its character n-grams. This benefits the capture of representations for shorter or rare words, whose embeddings can be obtained by breaking the words down into n-grams. Therefore, it helps in having embeddings for unseen facts in KGs.</p>
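        <p>The character n-gram idea can be illustrated with a minimal sketch (our own toy code, not the FastText implementation; the boundary markers here are arbitrary symbols of our choosing). A word is wrapped in boundary markers and split into overlapping n-grams, and FastText represents the word as the sum of its n-gram vectors, which is why rare or unseen tokens still receive embeddings:

```python
def char_ngrams(word, n=3):
    """Split a word into overlapping character n-grams with boundary markers."""
    marked = "^" + word + "$"  # mark word start and end
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

grams = char_ngrams("Ulm")
assert grams == ["^Ul", "Ulm", "lm$"]
```
</p>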
        <p>GloVe. GloVe is another word embedding model, which exploits global word-word co-occurrence statistics in the corpus. The model is essentially a log-bilinear model with a weighted least-squares objective. The main underlying intuition is that ratios of word-word co-occurrence probabilities have the potential for encoding some form of meaning. The co-occurrence of the entities and the properties is important in learning the latent representation of KGs.</p>
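        <p>The statistics GloVe builds on can be sketched with a toy co-occurrence counter over the triple "sentences" (our illustration only; GloVe itself then fits vectors whose dot products model the logarithm of these counts):

```python
from collections import Counter

def cooccurrence(sentences, window=2):
    """Count ordered token co-occurrences within a fixed context window."""
    counts = Counter()
    for sent in sentences:
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    counts[(w, sent[j])] += 1
    return counts

sents = [["dbr:Albert_Einstein", "dbo:birthPlace", "dbr:Ulm"],
         ["dbr:Ulm", "dbo:country", "dbr:Germany"]]
c = cooccurrence(sents)
assert c[("dbr:Albert_Einstein", "dbr:Ulm")] == 1
```

With a window of 2, an entity co-occurs both with its relation and with the entity on the other side of the fact, which is why the co-occurrence of entities and properties matters here.</p>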
      </sec>
      <sec id="sec-3-2">
        <title>Entity Typing</title>
        <p>Two approaches have been used to determine the entity types in this work: (i) a supervised Convolutional Neural Network (CNN) based approach and (ii) vector similarity.</p>
        <p>
          Convolutional Neural Network. The entity typing problem is converted into a classification problem with the rdf:type classes as labels, in which a 1D CNN model [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] is built on top of the embedding models. The model takes as input the entity vectors generated by the embedding models and predicts the entity's type. It consists of a convolutional layer, which involves a feature detector, followed by a global max pooling layer. The ReLU activation function is used in the convolutional layer. The output of the pooling layer is then passed through a fully connected final layer, in which the sigmoid function calculates the probabilities of an entity belonging to the different classes. 128 filters with kernel sizes 3, 4, and 6 are chosen for the model.
        </p>
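        <p>The building blocks of this architecture can be sketched in pure Python (a toy simplification of ours with a single scalar filter; the paper's model uses 128 filters of sizes 3, 4, and 6 over 100-dimensional embeddings, and all names here are our own):

```python
import math

def conv1d(seq, kernel):
    """Valid 1D convolution of a scalar sequence with one filter, ReLU applied."""
    k = len(kernel)
    out = []
    for i in range(len(seq) - k + 1):
        s = sum(seq[i + j] * kernel[j] for j in range(k))
        out.append(max(s, 0.0))  # ReLU activation
    return out

def global_max_pool(feature_map):
    """Keep only the strongest filter response."""
    return max(feature_map)

def sigmoid(x):
    """Squash a score into a per-class probability for multi-label output."""
    return 1.0 / (1.0 + math.exp(-x))

seq = [0.2, -0.1, 0.4, 0.3, -0.2]  # toy entity embedding
kernel = [1.0, 0.0, -1.0]          # one filter of size 3
pooled = global_max_pool(conv1d(seq, kernel))
prob = sigmoid(pooled)             # probability of one type class
assert round(prob, 2) == 0.65
```

Because each class gets its own sigmoid output, an entity can receive several types at once, which is what makes the setup multi-label rather than multi-class.</p>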
        <p>Vector Similarity. In order to assign a fine-grained type to an entity with an already assigned coarse-grained type, the class hierarchy in DBpedia has been exploited. For example, in DBpedia, the rdf:type class of the entity dbr:Baker&amp;McKenzie is dbo:LawFirm. The class hierarchy of dbo:LawFirm is traversed to find the highest-level parent class after dbo:Agent, namely dbo:Organisation. All the subclasses of dbo:Organisation in the hierarchy are then extracted, and the cosine similarity between each of these subclasses and the entity dbr:Baker&amp;McKenzie is calculated. Since the entities of a class represent the characteristic features of the class, the average vector of the entity vectors belonging to a certain class has been chosen as the class vector.</p>
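        <p>The similarity baseline reduces to two operations, sketched below (our reading of the text, with hypothetical 2-dimensional vectors for brevity): averaging the vectors of entities already known to belong to a class, and scoring candidate subclasses by cosine similarity.

```python
import math

def average_vector(vectors):
    """Class vector: the mean of the entity vectors belonging to the class."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical embeddings of entities already typed with some class
class_vec = average_vector([[1.0, 0.0], [0.8, 0.2]])
entity_vec = [0.9, 0.1]
sim = cosine(entity_vec, class_vec)
assert round(sim, 6) == 1.0
```

In this scheme the subclass whose class vector is cosine-closest to the entity vector would be assigned as the fine-grained type.</p>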
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments and Results</title>
      <p>This section contains a description of the experiments and an analysis of the results.
Dataset. In order to have fine-grained type prediction of the entities which are already coarse-grained typed in DBpedia 2016-10 (https://wiki.dbpedia.org/downloads-2016-10), 3 datasets have been generated to evaluate the method. To determine the robustness of the method, the datasets comprise classes with a small number of entities as well as ones with a large entity count. The statistics of the datasets are provided in Table 2.</p>
      <p>[Table 2. Models (results in accuracy) on the three datasets (59 classes with 500 entities/class, 86 classes with 2k entities/class, 81 classes with 4k entities/class), reporting vector similarity (Hits@3, Hits@1) and CNN accuracy for Word2Vec, FastText, and GloVe. For the 59-class dataset: Word2Vec 47.83%, 28.46%, 56%; FastText 29.81%, 17.44%, 54%; GloVe 7.07%, 3.54%, 53.7%. For the 86-class dataset, the Word2Vec Hits@3 and CNN accuracy both read 58%.]</p>
      <p>Results. The vector similarity approach is considered as the baseline model in this work. The CNN model is evaluated on an 80%-20% training/test split of each of the datasets, as depicted in Table 2. It is trained with a batch size of 32, 125 hidden layers, and 1000 epochs.</p>
      <p>It has been observed from the results that the CNN built on top of the embedding models achieves better results in the entity typing task. However, the vector similarity result at Hits@3 for the Word2Vec vectors is comparable to the CNN for the 86-class dataset. The vector similarity results show that the vectors generated by the GloVe model are not very similar to each other; even then, the CNN predicts the correct type with much better accuracy. Also, the 81-class dataset is a subset of the 86-class dataset with more entities per class, which strengthens the fact that neural network models work better with more data. For the dataset with 4000 entities per class, the CNN works best for all the embedding models as compared to the other methods.</p>
      <p>
        Also, the method has been compared with the available SDType dataset. This dataset consists of the entity types predicted by the SDType method. It is to be noted that only a small fraction of entities is common between the SDType dataset and our datasets, as depicted in column 3 of Table 3. The count of the entities in SDType whose type information matches the ground truth is provided in the last column of the same table. Due to huge differences between the datasets, a direct comparison of the models on this dataset is not possible. However, an analysis based only on the overlapping entities is available in the GitHub [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and Future Work</title>
      <p>[Table 3. Comparison with the SDType dataset (http://downloads.dbpedia.org/2016-10/core-i18n/en/instance_types_sdtyped_dbo_en.ttl.bz2), with columns for the number of entities in our dataset (E), the overlap #E ∩ #E_SDType, and the count of SDType entities matching the ground truth. #Entities in our dataset: 59 classes with 500 entities/class: 28106; 86 classes with 2k entities/class: 172000; 81 classes with 4k entities/class: 324000.]</p>
      <p>In this paper, different word embedding approaches for entity typing in a KG have been analyzed. The achieved results demonstrate that vectors coupled with a CNN work better for the task. On the other hand, the set theory concept (a set is represented by its members, which exhibit the same properties), when applied to generate the class vectors from the entity vectors, proved to be beneficial. In future, these embedding models will be used for other KG completion tasks such as link prediction and triple classification. Also, for the entity typing task, more information, such as the DBpedia categories, will be included in this embedding space to improve the results.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. DBpedia Embeddings. https://github.com/ISE-FIZKarlsruhe/Entity-Typingwith-Word-Embeddings</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Biswas</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koutraki</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sack</surname>
          </string-name>
          , H.:
          <article-title>Predicting wikipedia infobox type information using word embeddings on categories</article-title>
          .
          <source>In: EKAW (Posters &amp; Demos)</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Biswas</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , Turker, R.,
          <string-name>
            <surname>Moghaddam</surname>
            ,
            <given-names>F.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koutraki</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sack</surname>
          </string-name>
          , H.:
          <article-title>Wikipedia infobox type prediction using embeddings</article-title>
          .
          <source>In: DL4KGS@ ESWC</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bojanowski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grave</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joulin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Enriching word vectors with subword information</article-title>
          .
          <source>TACL</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Bordes</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Usunier</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garcia-Duran</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weston</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yakhnenko</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Translating embeddings for modeling multi-relational data</article-title>
          .
          <source>In: NIPS</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Convolutional neural networks for sentence classification</article-title>
          .
          <source>In: EMNLP</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kliegr</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zamazal</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>LHD 2.0: A Text Mining Approach to Typing Entities in Knowledge Graphs</article-title>
          .
          <source>J. Web Sem</source>
          . (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Learning entity and relation embeddings for knowledge graph completion</article-title>
          .
          <source>In: Twenty-ninth AAAI conference on artificial intelligence</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Melo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , Volker, J.:
          <article-title>Type Prediction in RDF Knowledge Bases Using Hierarchical Multilabel Classi cation</article-title>
          .
          <source>In: WIMS</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>Efficient Estimation of Word Representations in Vector Space</article-title>
          .
          <source>CoRR</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Type Inference on Noisy RDF Data</article-title>
          . In: ISWC (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Pennington</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
          </string-name>
          , C.D.: Glove:
          <article-title>Global vectors for word representation</article-title>
          .
          <source>In: EMNLP</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Sultana</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hasan</surname>
            ,
            <given-names>Q.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Biswas</surname>
            ,
            <given-names>A.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Das</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rahman</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ding</surname>
            ,
            <given-names>C.H.Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Infobox Suggestion for Wikipedia Entities</article-title>
          . In: CIKM (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weld</surname>
            ,
            <given-names>D.S.</given-names>
          </string-name>
          :
          <article-title>Autonomously Semantifying Wikipedia</article-title>
          . In: CIKM (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>