<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Entity Type Prediction in Knowledge Graphs using Embeddings</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Russa Biswas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Radina Sofronova</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mehwish Alam</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Harald Sack</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>FIZ Karlsruhe - Leibniz Institute for Information Infrastructure</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Karlsruhe Institute of Technology, Institute AIFB</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Open Knowledge Graphs (such as DBpedia, Wikidata, YAGO) have been recognized as the backbone of diverse applications in the field of data mining and information retrieval. Hence, the completeness and correctness of Knowledge Graphs (KGs) are vital. Most of these KGs are created either via automated information extraction from Wikipedia snapshots, via information accumulated by users, or by using heuristics. However, it has been observed that the type information in these KGs is often noisy, incomplete, and incorrect. To deal with this problem, a multi-label classification approach for entity typing using KG embeddings is proposed in this work. We compare our approach with the current state-of-the-art type prediction method and report on experiments with the KGs.</p>
      </abstract>
      <kwd-group>
        <kwd>Type Prediction</kwd>
        <kwd>Knowledge Graph Embeddings</kwd>
        <kwd>Knowledge Graph Completion</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Open Knowledge Graphs (KGs) such as DBpedia, Wikidata, YAGO, etc. have been recognized as the foundations for diverse KG-based applications including Natural Language Processing, data mining, and Information Retrieval. Most of these KGs are created either via automated information extraction from Wikipedia snapshots, via information accumulated by users, or by using heuristics. However, each KG follows a different knowledge organization and is based on differently structured ontologies. Moreover, it has been observed that type information is often noisy or incomplete. On the other hand, these KGs contain huge amounts of data, which makes them difficult for applications to use. Therefore, recent years have witnessed extensive research on latent representations of KGs in a low-dimensional vector space. In this work, the proposed method addresses the entity typing problem in DBpedia as a multi-label classification problem using these embeddings.</p>
      <p>Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>Entity typing is the process of assigning a type to an entity and is a fundamental task in KG completion. For example, the triple &lt;dbr:Albert_Einstein, rdf:type, dbo:Scientist&gt; states that Albert Einstein is assigned to the type class Scientist. The type information in DBpedia is derived directly by an external extraction framework from the Wikipedia infobox types. Since Wikipedia is a crowd-sourced encyclopedia, this type information is often incomplete. Therefore, a huge number of entities in DBpedia are assigned only a coarse-grained rdf:type. Table 1 provides the distribution of entities of five types. For example, the class dbo:SportsTeam has 14 subclasses in DBpedia and 352006 entities, out of which only 8.9% are assigned to its subclasses. Hence, there arises a necessity for fine-grained types for the entities in the KGs.</p>
      <p>
        On the other hand, most existing state-of-the-art KG embedding approaches, such as the translational models TransE [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and TransR [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], exploit only the structure of the KG. However, besides the structural information, implicit textual semantic information is also stored in the KGs, as illustrated in Figure 1. This subgraph depicts that the birthplace of "Albert Einstein" is "Ulm", which is located in the country "Germany". The labels in the triples of the subgraph, such as birthplace, country, Ulm, etc., contain implicit textual information that is not captured by translational embedding models.
      </p>
      <p>
        In this paper, a multi-label classification approach is proposed for fine-grained entity typing. To do so, the model uses different existing word embedding models such as Word2Vec [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], GloVe [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], and FastText [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] to learn KG embeddings capturing the graph structure as well as the implicit textual information available. The main contributions of this paper are:
- Vector representation of entities and relations in DBpedia using the existing word embedding models.
- A multi-label classification based approach for fine-grained entity typing.
- An analysis and comparison of the aforementioned word embedding models for the task of entity type prediction.
      </p>
      <p>The rest of the paper is structured as follows. To begin with, a review of the related work is provided in Section 2. Section 3 contains a detailed description of the approach, followed by the experimental setup and a report on the results in Section 4. Finally, an outlook on future work is provided in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        This section presents prior related work on entity typing, considering both Wikipedia infobox type prediction and RDF type prediction.
Wikipedia Infobox Type Prediction. One of the initial works in this domain was proposed by Wu et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Their system KYLIN considers pages having similar infoboxes and determines the common attributes in them to learn a CRF extractor. Sultana et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] focus on an automated approach by training an SVM classifier on TF-IDF features of the first k sentences of an article as well as on categories and Named Entity mentions. Biswas et al. [
        <xref ref-type="bibr" rid="ref2 ref3">3, 2</xref>
        ] provide a neural network based approach for infobox prediction using word embeddings on the abstract, table of contents, and categories of Wikipedia articles.
      </p>
      <p>
        RDF Type Prediction. A statistical heuristic link based type prediction mechanism, SDType, has been proposed by Paulheim et al. and was evaluated on DBpedia and OpenCyc [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Another approach to RDF type prediction in KGs has been studied by Melo et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], where type prediction is performed via the hierarchical SLCN algorithm, using a set of incoming and outgoing relations as features for classification. In [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], the authors propose a supervised hierarchical SVM classification approach for DBpedia by exploiting the contents of Wikipedia articles. However, none of these methods exploit embeddings to perform the type prediction. In this work, different word embedding algorithms are exploited on the KGs for the task of entity typing.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Entity Typing using Embeddings</title>
      <p>The task of entity type prediction is a multi-label classification problem, considering the entity type information as classes, as discussed in this section.</p>
      <sec id="sec-3-1">
        <title>Word Embeddings on KGs</title>
        <p>
          Each triple or fact in the KG is considered as a sentence, where the relation serves as a verb and the two entities are considered as the subject and the object of this relation in the sentence. For example, &lt;dbr:Albert_Einstein, dbo:birthPlace, dbr:Ulm&gt; is considered as a sentence. These sentences are then used as a corpus for all three word embedding models. The URIs are considered for training. The dimension of the vectors for each of the embedding models is 100, and the embeddings from all the models for DBpedia are available in our GitHub [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
Word2Vec. Word2Vec aims to learn distributed representations for words, reducing the high-dimensional word representations of a large corpus. It comprises two model architectures, Continuous Bag of Words (CBOW) and Skip-gram. In the CBOW approach, the model predicts the current word from a window of context words, whereas the Skip-gram model tries to predict the context words based on the current word. In this work, the CBOW approach of the Word2Vec model has been used to learn the vector representation of the entities and relations in the KG based on the context entity or relation.
        </p>
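        <p>The corpus construction described above can be sketched as follows (our illustration of the idea, not the authors' released code; the URIs and helper name are hypothetical). Each triple is flattened into a three-token "sentence" over URIs, which any sentence-based embedding trainer, such as a CBOW Word2Vec implementation, could then consume:

```python
def triples_to_corpus(triples):
    """Turn (subject, predicate, object) URI triples into token 'sentences'."""
    # Each fact keeps its subject-verb-object order, so a context window
    # covers the entities and the relation of the same fact.
    return [[s, p, o] for (s, p, o) in triples]

triples = [
    ("dbr:Albert_Einstein", "dbo:birthPlace", "dbr:Ulm"),
    ("dbr:Ulm", "dbo:country", "dbr:Germany"),
]
corpus = triples_to_corpus(triples)
assert corpus[0] == ["dbr:Albert_Einstein", "dbo:birthPlace", "dbr:Ulm"]
```

A corpus built this way can be passed unchanged to any trainer that expects tokenized sentences.</p>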
        <p>FastText. FastText is an extension of the Word2Vec model which follows both the CBOW and Skip-gram architectures. The main difference from Word2Vec is that it learns the representation of each word in the corpus from its character n-grams. This benefits the capture of representations for shorter or rare words, whose embeddings can be obtained by breaking the words down into n-grams. Therefore, it helps in having embeddings for unseen facts in KGs.</p>
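        <p>The character n-gram idea can be illustrated with a minimal sketch (our own toy code, not the FastText implementation; the boundary markers here are arbitrary symbols of our choosing). A word is wrapped in boundary markers and split into overlapping n-grams, and FastText represents the word as the sum of its n-gram vectors, which is why rare or unseen tokens still receive embeddings:

```python
def char_ngrams(word, n=3):
    """Split a word into overlapping character n-grams with boundary markers."""
    marked = "^" + word + "$"  # mark word start and end
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

grams = char_ngrams("Ulm")
assert grams == ["^Ul", "Ulm", "lm$"]
```
</p>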
        <p>GloVe. GloVe is another word embedding model, which exploits global word-word co-occurrence statistics in the corpus. The model is essentially a log-bilinear model with a weighted least-squares objective. The main underlying intuition is that ratios of word-word co-occurrence probabilities have the potential for encoding some form of meaning. The co-occurrence of the entities and the properties is important in learning the latent representation of KGs.</p>
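        <p>The statistics GloVe builds on can be sketched with a toy co-occurrence counter over the triple "sentences" (our illustration only; GloVe itself then fits vectors whose dot products model the logarithm of these counts):

```python
from collections import Counter

def cooccurrence(sentences, window=2):
    """Count ordered token co-occurrences within a fixed context window."""
    counts = Counter()
    for sent in sentences:
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    counts[(w, sent[j])] += 1
    return counts

sents = [["dbr:Albert_Einstein", "dbo:birthPlace", "dbr:Ulm"],
         ["dbr:Ulm", "dbo:country", "dbr:Germany"]]
c = cooccurrence(sents)
assert c[("dbr:Albert_Einstein", "dbr:Ulm")] == 1
```

With a window of 2, an entity co-occurs both with its relation and with the entity on the other side of the fact, which is why the co-occurrence of entities and properties matters here.</p>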
      </sec>
      <sec id="sec-3-2">
        <title>Entity Typing</title>
        <p>Two approaches have been used to determine the entity types in this work: (i) a supervised Convolutional Neural Network (CNN) based approach and (ii) vector similarity.</p>
        <p>
          Convolutional Neural Network. The entity typing problem is converted into a classification problem with the rdf:type classes as labels, in which a 1D CNN model [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] is built on top of the embedding models. The model takes as input the entity vectors generated by the embedding models and predicts the entity's type. It consists of a convolutional layer, which involves a feature detector, followed by a global max pooling layer. The ReLU activation function is used in the convolutional layer. The output of the pooling layer is then passed through a fully connected final layer, in which the sigmoid function calculates the probabilities of an entity belonging to the different classes. 128 filters with kernel sizes 3, 4, and 6 are chosen for the model.
        </p>
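        <p>The building blocks of this architecture can be sketched in pure Python (a toy simplification of ours with a single scalar filter; the paper's model uses 128 filters of sizes 3, 4, and 6 over 100-dimensional embeddings, and all names here are our own):

```python
import math

def conv1d(seq, kernel):
    """Valid 1D convolution of a scalar sequence with one filter, ReLU applied."""
    k = len(kernel)
    out = []
    for i in range(len(seq) - k + 1):
        s = sum(seq[i + j] * kernel[j] for j in range(k))
        out.append(max(s, 0.0))  # ReLU activation
    return out

def global_max_pool(feature_map):
    """Keep only the strongest filter response."""
    return max(feature_map)

def sigmoid(x):
    """Squash a score into a per-class probability for multi-label output."""
    return 1.0 / (1.0 + math.exp(-x))

seq = [0.2, -0.1, 0.4, 0.3, -0.2]  # toy entity embedding
kernel = [1.0, 0.0, -1.0]          # one filter of size 3
pooled = global_max_pool(conv1d(seq, kernel))
prob = sigmoid(pooled)             # probability of one type class
assert round(prob, 2) == 0.65
```

Because each class gets its own sigmoid output, an entity can receive several types at once, which is what makes the setup multi-label rather than multi-class.</p>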
        <p>Vector Similarity. In order to assign a fine-grained type to an entity with an already assigned coarse-grained type, the class hierarchy in DBpedia has been exploited. For example, in DBpedia, the rdf:type class of the entity dbr:Baker&amp;McKenzie is dbo:LawFirm. The class hierarchy of dbo:LawFirm is traversed to find the highest-level parent class after dbo:Agent, namely dbo:Organisation. All the subclasses of dbo:Organisation in the hierarchy are then extracted, and the cosine similarity between each of these subclasses and the entity dbr:Baker&amp;McKenzie is calculated. Since the entities of a class represent the characteristic features of the class, the average vector of the entity vectors belonging to a certain class has been chosen as the class vector.</p>
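        <p>The similarity baseline reduces to two operations, sketched below (our reading of the text, with hypothetical 2-dimensional vectors for brevity): averaging the vectors of entities already known to belong to a class, and scoring candidate subclasses by cosine similarity.

```python
import math

def average_vector(vectors):
    """Class vector: the mean of the entity vectors belonging to the class."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical embeddings of entities already typed with some class
class_vec = average_vector([[1.0, 0.0], [0.8, 0.2]])
entity_vec = [0.9, 0.1]
sim = cosine(entity_vec, class_vec)
assert round(sim, 6) == 1.0
```

In this scheme the subclass whose class vector is cosine-closest to the entity vector would be assigned as the fine-grained type.</p>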
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments and Results</title>
      <p>This section contains a description of the experiments and an analysis of the results.
Dataset. In order to have fine-grained type prediction of the entities which are already coarse-grained typed in DBpedia 2016-10 (https://wiki.dbpedia.org/downloads-2016-10), 3 datasets have been generated to evaluate the method. To determine the robustness of the method, the datasets comprise classes with a small number of entities as well as ones with a large entity count. The statistics of the datasets are provided in Table 2.</p>
      <p>[Table 2. Models (results in accuracy) on the three datasets (59 classes with 500 entities/class, 86 classes with 2k entities/class, 81 classes with 4k entities/class), reporting vector similarity (Hits@3, Hits@1) and CNN accuracy for Word2Vec, FastText, and GloVe. For the 59-class dataset: Word2Vec 47.83%, 28.46%, 56%; FastText 29.81%, 17.44%, 54%; GloVe 7.07%, 3.54%, 53.7%. For the 86-class dataset, the Word2Vec Hits@3 and CNN accuracy both read 58%.]</p>
      <p>Results. The vector similarity approach is considered as the baseline model in this work. The CNN model is evaluated on an 80%-20% training/test split of each of the datasets, as depicted in Table 2. It is trained with a batch size of 32, 125 hidden layers, and 1000 epochs.</p>
      <p>It has been observed from the results that the CNN built on top of the embedding models achieves better results in the entity typing task. However, the vector similarity result at Hits@3 for the Word2Vec vectors is comparable to the CNN for the 86-class dataset. The vector similarity results show that the vectors generated by the GloVe model are not very similar to each other; even then, the CNN predicts the correct type with much better accuracy. Also, the 81-class dataset is a subset of the 86-class dataset with more entities per class, which strengthens the fact that neural network models work better with more data. For the dataset with 4000 entities per class, the CNN works best for all the embedding models as compared to the other methods.</p>
      <p>
        Also, the method has been compared with the available SDType dataset. This dataset consists of the entity types predicted by the SDType method. It is to be noted that only a small fraction of entities is common between the SDType dataset and our datasets, as depicted in column 3 of Table 3. The count of the entities in SDType whose type information matches the ground truth is provided in the last column of the same table. Due to huge differences between the datasets, a direct comparison of the models on this dataset is not possible. However, an analysis based only on the overlapping entities is available in the GitHub [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and Future Work</title>
      <p>[Table 3. Comparison with the SDType dataset (http://downloads.dbpedia.org/2016-10/core-i18n/en/instance_types_sdtyped_dbo_en.ttl.bz2), with columns for the number of entities in our dataset (E), the overlap #E ∩ #E_SDType, and the count of SDType entities matching the ground truth. #Entities in our dataset: 59 classes with 500 entities/class: 28106; 86 classes with 2k entities/class: 172000; 81 classes with 4k entities/class: 324000.]</p>
      <p>In this paper, different word embedding approaches for entity typing in a KG have been analyzed. The achieved results demonstrate that vectors coupled with a CNN work better for the task. On the other hand, the set theory concept (a set is represented by its members, which exhibit the same properties), when applied to generate the class vectors from the entity vectors, proved to be beneficial. In future, these embedding models will be used for other KG completion tasks such as link prediction and triple classification. Also, for the entity typing task, more information, such as the DBpedia categories, will be included in this embedding space to improve the results.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. DBpedia Embeddings. https://github.com/ISE-FIZKarlsruhe/Entity-Typingwith-Word-Embeddings</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Biswas</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koutraki</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sack</surname>
          </string-name>
          , H.:
          <article-title>Predicting wikipedia infobox type information using word embeddings on categories</article-title>
          .
          <source>In: EKAW (Posters &amp; Demos)</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Biswas</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , Turker, R.,
          <string-name>
            <surname>Moghaddam</surname>
            ,
            <given-names>F.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koutraki</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sack</surname>
          </string-name>
          , H.:
          <article-title>Wikipedia infobox type prediction using embeddings</article-title>
          .
          <source>In: DL4KGS@ ESWC</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bojanowski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grave</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joulin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Enriching word vectors with subword information</article-title>
          .
          <source>TACL</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Bordes</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Usunier</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garcia-Duran</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weston</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yakhnenko</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Translating embeddings for modeling multi-relational data</article-title>
          .
          <source>In: NIPS</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Convolutional neural networks for sentence classification</article-title>
          .
          <source>In: EMNLP</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kliegr</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zamazal</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>LHD 2.0: A Text Mining Approach to Typing Entities in Knowledge Graphs</article-title>
          .
          <source>J. Web Sem</source>
          . (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Learning entity and relation embeddings for knowledge graph completion</article-title>
          .
          <source>In: Twenty-ninth AAAI conference on artificial intelligence</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Melo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , Volker, J.:
          <article-title>Type Prediction in RDF Knowledge Bases Using Hierarchical Multilabel Classi cation</article-title>
          .
          <source>In: WIMS</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>Efficient Estimation of Word Representations in Vector Space</article-title>
          .
          <source>CoRR</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Type Inference on Noisy RDF Data</article-title>
          . In: ISWC (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Pennington</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
          </string-name>
          , C.D.: Glove:
          <article-title>Global vectors for word representation</article-title>
          .
          <source>In: EMNLP</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Sultana</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hasan</surname>
            ,
            <given-names>Q.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Biswas</surname>
            ,
            <given-names>A.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Das</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rahman</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ding</surname>
            ,
            <given-names>C.H.Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Infobox Suggestion for Wikipedia Entities</article-title>
          . In: CIKM (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weld</surname>
            ,
            <given-names>D.S.</given-names>
          </string-name>
          :
          <article-title>Autonomously Semantifying Wikipedia</article-title>
          . In: CIKM (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>