<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards an Approach based on Knowledge Graph Refinement for Relation Linking and Entity Linking</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Azanzi Jiomekong</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Brice Foko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vadel Tsague</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Uriel Melie</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gaoussou Camara</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Yaounde I</institution>
          ,
          <addr-line>Yaounde</addr-line>
          ,
          <country country="CM">Cameroon</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Unité de Formation et de Recherche en Sciences Appliquées et des TIC, Université Alioune Diop de Bambey</institution>
          ,
          <addr-line>Bambey</addr-line>
          ,
          <country country="SN">Sénégal</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <abstract>
<p>This paper presents our contribution to the SMART 3.0 Relation Linking and Entity Linking research problems. This contribution consists of an approach based on knowledge graph refinement techniques for completing the graph with missing entities and relations. The model defined is inspired by the TransE algorithm and uses a scoring function that defines the distance between two nodes in the graph. The scoring function for Relation Linking is defined using Naive Bayes. Concerning Entity Linking, we identify and extract candidate terms from questions. Thereafter, these terms are used to search for relevant entities in the knowledge graph using the LookUp and Wikibase APIs.</p>
      </abstract>
      <kwd-group>
        <kwd>Question Answering</kwd>
        <kwd>Relation Linking</kwd>
        <kwd>Entity Linking</kwd>
        <kwd>Knowledge Graphs Refinement</kwd>
        <kwd>Wikidata</kwd>
        <kwd>DBpedia</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Consider the following question from the dataset and its annotated relations:</p>
      <p>{
  "Question": "Where is University of Yaounde I located?",
  "relations": [ "dbo:country", "dbo:state", "dbo:city" ]
}</p>
      <p>SPARQL query with Relation Linking:
SELECT DISTINCT * WHERE {
  dbr:University_of_Yaoundé_I dbo:country ?o1 .
  dbr:University_of_Yaoundé_I dbo:state ?o2 .
  dbr:University_of_Yaoundé_I dbo:city ?o3 .
}
Result: 3 good results (:Cameroon, :Centre_Region_(Cameroon), :Yaoundé)</p>
      <p>SPARQL query without Relation Linking:
SELECT DISTINCT * WHERE {
  dbr:University_of_Yaoundé_I ?p ?o .
}
Result: 40 results</p>
      <p>The predicted relations are used to translate the user question into its corresponding SPARQL query. The illustration is given in table 1. This example shows that Relation
Linking is a crucial component to enable QA over Knowledge Graphs.</p>
      <p>
        SMART 2021 was devoted to the Answer Type prediction task and the Relation prediction task.
However, only 2 papers (25%) were proposed for the relation prediction task. A comparison of
related work results is presented in table 2. This is understandable because relation linking
for question answering is known to be a hard task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Indeed, some questions have multiple
relations, some relations are semantically far apart, and sometimes the tokens deciding the relations are
spread across the question. On the other hand, some relations are implicit in the text, and there
are lexical gaps between the relation surface forms and the KG property labels.
      </p>
      <p>
        [Table 2: comparison of related work. Khaoula and Nadine [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] use BERT with fastai and data augmentation on Wikidata and on DBpedia; Thang et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] use BERT on DBpedia and on Wikidata. Reported Precision and Recall include 0.75094 and 0.8163.]
      </p>
      <p>
        On the other hand, Entity Linking [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] consists of identifying entities in a question expressed in natural language and linking them to
the KG, so that they can be used for retrieving the correct answer. Indeed, knowing the entities
hidden in a user question allows us to significantly narrow down the search space for an answer [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. For instance, consider the following question and its annotated entity:
      </p>
      <p>{
  "Question": "Which is the profession of Ousmane Sonko?",
  "entities": [ "http://dbpedia.org/resource/Ousmane_Sonko" ]
}</p>
      <p>SELECT DISTINCT * WHERE {
  ?s dbp:occupation ?o .
}
Result: more than 10,000 results</p>
      <p>Without the entity, querying the "dbp:occupation" property returns more than 10,000 results. By predicting the entity http://dbpedia.org/resource/Ousmane_Sonko, the search space is reduced to the 1-hop entities of the "Ousmane_Sonko" resource in DBpedia ("dbr:Ousmane_Sonko dbp:occupation"), yielding one good result. Thereafter, this entity is used to translate the user
question into its corresponding SPARQL query. The illustration is given in table 3. This
example shows that EL is a crucial component in KGQA systems.</p>
      <p>
        To solve the EL and RL research problems, SMART 3.0 provides us with two
large-scale datasets: a DBpedia dataset (composed of 760 classes) and a Wikidata
dataset (composed of 50K classes). In addition to these datasets, a restricted
vocabulary of all the relations used in the datasets for the RL task is provided. Using
these datasets, we propose a solution to the EL and RL problems. This solution is based on
Knowledge Graph refinement techniques [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Whatever the method used to construct KGs,
it is known that they will never be perfect. Thus, to increase their utility, various refinement
methods have been proposed to add missing knowledge such as missing entities, missing types of
entities, and/or missing relations between entities [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. To complete a knowledge
graph, internal methods use information hidden in the graph and external methods use external
knowledge sources such as text corpora, existing vocabularies, ontologies and/or knowledge
graphs. In this research, we use external KG refinement techniques for link prediction
and entity discovery. Thus, to solve the RL task, we consider the restricted vocabulary
provided as our external knowledge. To solve the EL task, we consider the DBpedia knowledge
graph as our external knowledge, and our goal is to identify terms in the question and map
these terms to the graph in order to determine the most relevant entities.
      </p>
      <p>To implement our solutions, we needed to set up a development environment. Thus, we used:
• Operating system: Ubuntu,
• Programming languages: JavaScript,
• JavaScript library: Natural4.</p>
      <p>The source code is provided on GitHub5.</p>
      <p>In the rest of the paper, we present the Relation Linking task in Section 2, the Entity Linking
task in Section 3 and the conclusion in Section 4.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Relation Linking</title>
      <p>To solve the Relation Linking task, we proceeded as follows: analysis of the dataset (Section 2.1),
model definition (Section 2.2) and model training and use (Section 2.3).</p>
      <sec id="sec-2-1">
        <title>2.1. Analysis of the dataset</title>
        <p>The aim of this step was to gather, portray and provide a deeper understanding of the structure
of the dataset so as to provide efficient solutions. As presented by the organizers, SMART 3.0
datasets are designed to provide one or many relations given a question in natural language.</p>
        <p>To understand the RL task, we thought it necessary to investigate a subset of this dataset. Thus,
fifty questions from each dataset were randomly selected and their annotations automatically
removed. Once the annotations were removed, a manual annotation of this subset of the dataset
took place.</p>
        <p>The main findings from these datasets:
• Data are not equally distributed amongst the different relations;
• Question types can be divided into Wh-, How many, Of which, Thus, Did, etc. The
taxonomy presented in Fig. 1 gives the details;
• Question types can be used to identify a type of relation in the vocabulary.
Given the findings of the analysis of the dataset, we built the question taxonomy presented in
Fig. 1. This taxonomy will be used to build the model.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Model definition</title>
        <p>The aim of the model definition step was to define a model that can be applied to any dataset
to solve the Relation Linking research problem. The first thing we did was to define a graph
structure that can be used to represent questions and the relations to predict (see Fig. 2). This
consists of matching the taxonomy of Fig. 1 to the corresponding relations. Thereafter,
each question is matched to the corresponding node in the question taxonomy. The idea
is that, if a question is linked to a node in the taxonomy and this node is linked to a
relation in the vocabulary, then this question is linked to this relation. Thus, we defined the
"hasEmbedding" relation between a question and its embedding in the taxonomy. To define</p>
        <sec id="sec-2-2-1">
          <title>4https://github.com/NaturalNode/natural 5https://github.com/jiofidelus/tsotsa/tree/SMART_22</title>
          <p>
            this relation, knowledge should be extracted from the question. Knowledge extraction is the
creation of knowledge from structured (relational databases, XML), semi-structured (source
code) and unstructured (text, documents, images) sources [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ]. The current case consists of
extracting terms from unstructured sources which are questions.
          </p>
          <p>We define the Relation Linking dataset as multi-relational data using a knowledge graph (see
Fig. 2) given by the quadruple G = (V, E, T, R):
• V = {v} is the set of nodes. These nodes are the questions, the embedding vectors and the
relations that are hidden in questions;
• E = {(h, r, t)} is the set of edges. These are triples, with node h connected to node t
via the relation r. These are the relations between two questions, or between a question and the set
of relations hidden in this question;
• T = {τ(v)} denotes the node types. The type of a relation node is its label and the
type of a question node is its id and the title of the question;
• R = {r} denotes the relation types. These are the labels of the relations.
In addition to this graph, we suppose that we have an external knowledge source (the vocabulary
provided by the organizers) that can be used to complete the graph with missing knowledge.</p>
          <p>The nodes in the graph of Fig. 2 are connected by the following relations:
• "hasEmbedding": this relation defines how close a question is to an embedding
vector defined in the question taxonomy. This is very useful because when two
questions have the same embedding in the taxonomy, they will have the same relation(s).
• "hasRelation": represents the relation between a node and a term in the reference
vocabulary.</p>
          <p>Given this new configuration of the training dataset, we reduced the problem of Relation
Linking to the problem of KG completion. To solve this problem, our goal is to provide an
efficient tool to complete the graph by adding new facts from the external vocabulary furnished
by the SMART organizers. Thus, our task is to predict the link between a question and a set of
relations.</p>
          <p>
            Inspired by the TransE algorithm [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ] for learning low-dimensional embeddings of entities, we
defined the model presented by equation 1. We consider that relationships are represented
as translations. The model is defined as a function which takes a node and a relation and
predicts all the nodes related to it through this relation. This prediction is based on the proximity
between two nodes.
          </p>
          <p>f : V × R × V → ℝ
f(h, r, t) = ||h + r − t||    (1)
h + r ≈ t if f(h, r, t) ≈ θ
The function f is used to verify whether the triples in the knowledge graph hold. The triple (h, r, t)
holds if f(h, r, t) ≈ θ - in this case, we can say that h + r ≈ t, θ being the model
parameter. If θ reaches a certain value (or belongs to a certain interval of values), then we can
say that h and t are linked by the relation r.</p>
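          <p>The scoring function above can be sketched as follows. This is a minimal illustration: the embedding vectors and the threshold θ are toy values chosen for the example, not learned model parameters.</p>
```javascript
// Sketch of the TransE-style score f(h, r, t) = ||h + r - t||.
// The embeddings below are toy vectors, not learned parameters.
function score(h, r, t) {
  let sum = 0;
  h.forEach((hi, i) => {
    const d = hi + r[i] - t[i];
    sum += d * d;
  });
  return Math.sqrt(sum);
}

// A triple (h, r, t) is considered to hold when the score is at most
// the model parameter theta (a toy threshold here).
function holds(h, r, t, theta) {
  return !(score(h, r, t) > theta);
}

const h = [1, 2];  // head node embedding (toy values)
const r = [3, 1];  // relation embedding (toy values)
const t = [4, 3];  // tail node embedding (toy values)
console.log(score(h, r, t));       // 0, since h + r equals t exactly
console.log(holds(h, r, t, 0.1));  // true
```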
          <p>The model that we just defined is based on node embedding, the nodes being the questions
and the list of relations of this question. Thus, we should find a way to model these nodes in
such a way that they can be used to train our models to predict relations. This is done using the
"question vector (question2vect)" function.</p>
          <p>Before the models were trained, the first thing we did was to encode the questions so as to
simplify the definition of (Question, hasEmbedding, Vector) triples. We extracted knowledge
from the question title and used this knowledge to define the embedding vector of each question.
To this end, we suppose that a question can be well represented by some terms in its title, and we
define the question2vect function (see equation 2). This function takes a question and returns
a vector containing the terms used as the embedding of this question.</p>
          <p>question2vect(q) = (t1, t2, ..., tn)    (2)</p>
          <p>In equation 2, ti (i &lt;= n) denotes the different terms (made up of words or groups of words)
in the question title that can be representative of the question. The question2vect algorithm is
given in listing 1.</p>
          <p>Listing 1: Question2vect algorithm to transform a question into a vector</p>
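          <p>A minimal JavaScript sketch of such a question2vect function, assuming a simple tokenizer and a small illustrative stop-word list (the paper's actual term selection relies on TF-IDF, described in Section 2.3):</p>
```javascript
// Simplified sketch of question2vect: turn a question title into a vector of
// representative terms. The stop-word list below is a small stand-in; the
// actual selection of representative terms uses TF-IDF.
const STOP_WORDS = new Set(["where", "is", "the", "of", "a", "in", "which", "was"]);

function question2vect(question) {
  return question
    .toLowerCase()
    .replace(/[?.,!:%#\/\\"]/g, " ")  // drop special characters
    .split(/\s+/)                     // tokenize on whitespace
    .filter(w => w.length > 0)
    .filter(w => !STOP_WORDS.has(w));
}

console.log(question2vect("Where is University of Yaounde I located?"));
// [ "university", "yaounde", "i", "located" ]
```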
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Models training and use</title>
        <p>We used equation 1 to define specific models for each relation.</p>
        <p>• Learning the "hasEmbedding" relation: the function f_hasEmbedding is given by
equation 3.</p>
        <p>f_hasEmbedding(q, hasEmbedding, v) = |q − v| = α    (3)
To evaluate the function f_hasEmbedding, we use the Levenshtein distance. The
overall algorithm is given in listing 2.</p>
        <sec id="sec-2-3-1">
          <title>Listing 2: f_hasEmbedding algorithm</title>
          <p>Input:
Q: Array of String // All the words contained in the question title
V: Vector of terms // The terms contained in the embedding vector
Output:
alpha: Int // The distance between Q (the question) and V (the embedding vector)
Steps:
1) n = number of strings contained in Q
2) m = number of terms contained in V
// Recursive implementation of the Levenshtein distance
3) Calculate the distance between Q and V
3.1) if m = 0, return n
3.2) if n = 0, return m
3.3) if n = m, do HasEmbedded(Queue(Q), Queue(V))
3.4) else return 1 + min(HasEmbedded(Queue(Q), V), HasEmbedded(Q, Queue(V)), HasEmbedded(Queue(Q), Queue(V)))
// The function min(x_1, ..., x_n) of step (3.4) returns the minimum value of its parameters
// The function Queue(S) takes the question title and returns an array of strings
// containing the words composing the question except the first word</p>
          <p>To extract knowledge from questions, we used the Term Frequency-Inverse Document
Frequency (TF-IDF) technique. We identified and removed the less frequent words and the stop
words in questions to obtain the information that can best represent the
question. This information was used to define the f_hasEmbedding model.
Globally, two vectors are close if α ≈ θ, θ being the model parameter defined during
the training process. During the experiments, we varied θ to obtain different performances.
• Learning "hasRelation": To predict the "hasRelation" relation, we consider that each
embedding node has a probability of belonging to a set of relations. The scoring function defining
this probability can be obtained using statistical learning, Naive Bayes, etc. In this
research, we used Naive Bayes. Listing 3 describes how we determine the relations
between a node embedding and a set of relations.</p>
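          <p>Listing 2 can be sketched in JavaScript as follows; this is our reading of the recursive word-level Levenshtein distance it describes (the function name and the word-equality test are our assumptions), not the submitted implementation:</p>
```javascript
// Recursive word-level Levenshtein distance between the question words Q and
// the embedding vector terms V, following the shape of listing 2.
function hasEmbedded(q, v) {
  if (v.length === 0) return q.length;
  if (q.length === 0) return v.length;
  if (q[0] === v[0]) {
    // matching words cost nothing; compare the remaining words (Queue in listing 2)
    return hasEmbedded(q.slice(1), v.slice(1));
  }
  // otherwise: deletion, insertion or substitution, whichever is cheapest
  return 1 + Math.min(
    hasEmbedded(q.slice(1), v),
    hasEmbedded(q, v.slice(1)),
    hasEmbedded(q.slice(1), v.slice(1))
  );
}

const alpha = hasEmbedded(
  ["where", "is", "university", "located"],
  ["where", "is", "located"]
);
console.log(alpha); // 1: only "university" has to be deleted
```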
        </sec>
        <sec id="sec-2-3-2">
          <title>Listing 3: f_hasRelation algorithm</title>
          <p>Input:
V: a vector of embedded vectors
The candidate relations to be predicted, given in the external vocabulary</p>
          <p>To train the f_hasRelation model, we used the DBpedia and Wikidata datasets, with the goal of
determining the triples (V, hasRelation, R). This consists of filling the matrix presented in Fig.
3 with values, so that each value represents the probability that a term in the embedding vector
has as relation an element of the vocabulary.</p>
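          <p>A minimal sketch of this Naive Bayes scoring idea, with invented toy training pairs in place of the SMART data:</p>
```javascript
// Toy Naive Bayes scorer for (embedding terms -> vocabulary relation).
// The training examples below are invented for illustration only.
function trainNB(examples) {
  const relCount = {};
  const termCount = {};
  const vocab = new Set();
  for (const ex of examples) {
    relCount[ex.relation] = (relCount[ex.relation] || 0) + 1;
    if (!termCount[ex.relation]) termCount[ex.relation] = {};
    for (const t of ex.terms) {
      termCount[ex.relation][t] = (termCount[ex.relation][t] || 0) + 1;
      vocab.add(t);
    }
  }
  return { relCount, termCount, vocabSize: vocab.size, total: examples.length };
}

// P(relation | terms) is proportional to P(relation) times the product of
// P(term | relation), computed in log space with add-one (Laplace) smoothing.
function predict(model, terms) {
  let best = null;
  let bestScore = -Infinity;
  for (const rel of Object.keys(model.relCount)) {
    let nRel = 0;
    for (const t of Object.keys(model.termCount[rel])) nRel += model.termCount[rel][t];
    let score = Math.log(model.relCount[rel] / model.total);
    for (const t of terms) {
      const c = model.termCount[rel][t] || 0;
      score += Math.log((c + 1) / (nRel + model.vocabSize));
    }
    if (score > bestScore) { bestScore = score; best = rel; }
  }
  return best;
}

const model = trainNB([
  { terms: ["where", "located"], relation: "dbo:country" },
  { terms: ["who", "wrote"], relation: "dbo:author" }
]);
console.log(predict(model, ["where", "located"])); // "dbo:country"
```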
          <p>[Fig. 3: the matrix Z, with one question per row.]</p>
          <p>Model refinement consists of extending the question taxonomy to obtain a larger number
of nodes. This is done by varying the parameter θ given in equation 1. To finalize this work,
we applied our approach (with embedding vectors of different sizes) to the test data and
had our results analyzed by the SMART 3.0 organizers. The results obtained with different
embedding vectors are presented in table 4 for DBpedia and table 5 for Wikidata. In these
tables:</p>
          <p>Method
Naive Bayes
Naive Bayes + WH-taxonomy (θ = 1)
Naive Bayes + taxonomy (θ &gt; 4)
Naive Bayes + taxonomy (θ ∈ [1-4])</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Entity Linking</title>
      <p>To solve the EL task, we used a simple software engineering technique consisting of extracting terms
from questions and using these terms to search for entities in the KG (see Fig. 4).</p>
      <p>To well understand the EL task, we selected ten questions from the training dataset,
identified the entities in the question titles and compared them to the provided answers.
Thereafter, we made simple queries on DBpedia online using the Wikibase6 and DBpedia
LookUp7 APIs (listing 4 presents an example) to search for the same entities in the DBpedia
KG.</p>
      <p>Listing 4: Example of the use of the LookUp and Wikibase APIs
// Querying using the LookUp API
https://lookup.dbpedia.org/api/search?query=Ousmane%20Sonko
// Querying using the Wikibase API
https://en.wikipedia.org/w/api.php?action=query&amp;generator=allpages&amp;prop=pageprops|info&amp;inprop=url&amp;ppprop=wikibase_item&amp;gaplimit=5&amp;gapfrom=Ousmane%20Sonko</p>
      <p>Globally, we reduced the EL task to two sub-problems: entity retrieval in the question and
entity mapping to the KG (illustrated by Fig. 4). Thus, for each question:
• Search for all candidate terms that might be entities in the question. In our case, a term
can be a word or a group of words. Examples of terms are "Tchad" and "Success Masra".
• Use the identified terms to search for candidate entities in the KG.</p>
      <p>If we take the example of the question "Where was Maurice Kamto born?", the candidate terms from this
question are "Maurice", "Kamto", "was Maurice", "Maurice Kamto", "Kamto born", "was Maurice Kamto" and "was Maurice Kamto born".</p>
      <sec id="sec-3-1">
        <title>6https://www.mediawiki.org/wiki/API:Main_page 7https://lookup.dbpedia.org/</title>
        <p>From these terms, we realized
that the verbs are not pertinent words for finding entities, so we put in place a filter process to
retrieve and remove all the verbs. Applied to our example, we obtain the following candidates:
{"Maurice", "Kamto", "Maurice Kamto"}. These three terms are used to query the KG online to
search for entities. Once the entities are retrieved, we count the occurrences of each result (see
Table 6). The entity with the greatest number of occurrences is selected as the entity we are
searching for. In our example, the entity with 3 occurrences against 1 for the
others is Maurice_Kamto.</p>
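        <p>The occurrence-counting selection step can be sketched as follows; the result lists are hard-coded stand-ins for real LookUp/Wikibase responses:</p>
```javascript
// Pick the entity returned most often across the candidate-term queries.
// The result lists below are hard-coded stand-ins for API responses.
function selectEntity(resultLists) {
  const counts = {};
  for (const list of resultLists) {
    for (const entity of list) counts[entity] = (counts[entity] || 0) + 1;
  }
  let best = null;
  let max = 0;
  for (const entity of Object.keys(counts)) {
    if (counts[entity] > max) { max = counts[entity]; best = entity; }
  }
  return best;
}

const results = [
  ["dbr:Maurice_(emperor)", "dbr:Maurice_Kamto"], // results for "Maurice"
  ["dbr:Joseph_Fotso", "dbr:Maurice_Kamto"],      // results for "Kamto"
  ["dbr:Maurice_Kamto"]                           // results for "Maurice Kamto"
];
console.log(selectEntity(results)); // "dbr:Maurice_Kamto"
```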
        <p>Entity retrieved / Number of occurrences:
dbr:Maurice_(emperor) / 1
dbr:Mauritius / 1
dbr:Joseph_Fotso / 1
dbr:Maurice_Kamto / 3</p>
        <p>In the example we just presented, the entities were very easy to retrieve. However, the
analysis of the dataset showed that the situation is not always so easy. The analysis of the
different types of questions showed that the terms used to seek entities can be:
1. A proper name: "Success Masra", "Cameroon", "Ousmane Sonko";
2. A common name: "Marketing";
3. A group of words between two stop words.</p>
        <p>Cases one and two can be easily solved. However, the third case is the most difficult
because one has to determine which stop words delimit the entities. To identify the
entities defined by the third case, we built two tables, one composed of the stop words and
the second composed of the stop words that are generally used to link terms and form entities.
Thus, for each question, we match the question with the stop words and search for the entity in
the KG.</p>
        <p>The stop words are used to delimit the names of entities to be searched for in the question. The
stop word table is created using the set of English language stop words. We added the
term "END_EL" at the end of all the questions. This term is considered a stop word and is
used to complete the stop word table. The overall stop words are used to build a set of candidate
terms. This approach allowed us to define the different ways an entity can be presented
in a question. If we take the example "where is the University of Yaounde I?": the stop word
"of" and "END_EL" are used to define the term "Yaounde I", and the stop words "the" and "of" are
used to define the term "University" as a candidate term to search for an entity in the KG. Given
that the word "of" is contained in the second table (words generally used to link words) and that "of" is
between the two candidate terms "University" and "Yaounde I", a new term "University of
Yaounde I" is added to the list of candidate terms. This approach is illustrated by Fig. 5, and
listing 5 presents the algorithm.</p>
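        <p>Our reading of this delimitation step, as a JavaScript sketch (the stop-word and linking-word lists are abbreviated stand-ins for the two tables, and the joining rule is our interpretation of the example above):</p>
```javascript
// Split a question on stop words to get candidate terms, then join two adjacent
// candidates separated by a linking word such as "of". Lists are abbreviated.
const STOP_WORDS = new Set(["where", "is", "the", "of", "in", "was", "end_el"]);
const LINK_WORDS = new Set(["of"]);

function candidateTerms(question) {
  // tokenize and append the END_EL sentinel described in the text
  const tokens = question.replace(/[?.,]/g, " ").split(/\s+/).filter(t => t.length > 0);
  tokens.push("END_EL");
  const candidates = [];
  let current = [];
  let lastSep = null;
  for (const tok of tokens) {
    if (STOP_WORDS.has(tok.toLowerCase())) {
      if (current.length > 0) {
        const term = current.join(" ");
        // a linking word between two candidates yields a joined term,
        // e.g. "University" + "of" + "Yaounde I"
        if (LINK_WORDS.has(lastSep)) {
          if (candidates.length > 0) {
            candidates.push(candidates[candidates.length - 1] + " " + lastSep + " " + term);
          }
        }
        candidates.push(term);
        current = [];
      }
      lastSep = tok.toLowerCase();
    } else {
      current.push(tok);
    }
  }
  return candidates;
}

console.log(candidateTerms("where is the University of Yaounde I ?"));
// [ "University", "University of Yaounde I", "Yaounde I" ]
```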
      </sec>
      <sec id="sec-3-2">
        <title>Listing 5: The algorithm executed to solve the EL task</title>
        <p>Input:</p>
        <p>Q: Array of question // All the questions from the test data
stop_words: Array of String // The English stop word list. Examples: of, the, in, or, etc.
filter_words: Array of String // The filter word list. Examples: ne, of, the
Steps:</p>
        <p>For each question in Q:
1) Remove the special characters: \ , / " # : %
2) Put all the strings in lowercase
3) Tokenize the result of step 2
4) Match the tokens obtained in step (3) with {stop_words} to get candidate entities
// Step 5 allows one to obtain the embedding vector V associated with the question
5) Match the candidate entities with {filter_words} to produce a clean candidate list
7) Use the Wikibase API and LookUp API to search for entities
8) Save the retrieved entities
We applied this approach to the test data. Table 7 presents the results obtained.</p>
        <p>Precision: 0.41999, Recall: 0.54481, F1-score: 0.45729.</p>
        <p>We approached the problem of entity linking as software developers. However, this has the
following limits: setting up the stop word arrays for the questions is a difficult task. On the
other hand, the number of combinations to determine the entity is very high, which makes the time
complexity very high.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>To solve the RL and EL tasks, this paper proposes an approach based on KG refinement
techniques. This consists of using external knowledge to complete the graph with missing entities
and missing relations.</p>
      <p>Concerning Relation Linking, we used the vocabulary provided by the SMART 3.0 organizers as
our external knowledge. Thereafter, we used Naive Bayes to build the embedding matrix
used to predict links between a question and its relation(s). The evaluation on the test
data gave a maximum Precision of 0.53426, Recall of 0.47036 and F-measure of 0.48661 for DBpedia,
and a Precision of 0.27901, Recall of 0.31036 and F1-score of 0.28541 for Wikidata.</p>
      <p>Concerning Entity Linking, we used the KG itself as external knowledge. Thus, our goal was
to identify the most relevant terms (one word or a group of words) that can be matched to the
KG and, from these terms, determine the relevant entities. The EL task was run on the DBpedia dataset
only, and the evaluation by the SMART 3.0 organizers gave a Precision of 0.41999, a Recall
of 0.54481 and an F1-score of 0.45729.</p>
      <p>
        We started this challenge with no prior knowledge of the Entity Linking and Relation Linking
tasks, and we proposed a solution. Future work consists of exploring different machine learning,
deep learning and natural language processing models that can help us improve
the quality of our results. We particularly found good papers on relation linking [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ] and on
entity linking [
        <xref ref-type="bibr" rid="ref10 ref11 ref9">9, 10, 11</xref>
        ]. We are planning to explore and test these methods to solve the Entity
Linking and Relation Linking tasks.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>We would like to thank the ISWC conference and the SMART challenge organizers and all the
reviewers. This work was partially supported by Neuralearn.ai (Start-up Cameroon).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Rossiello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Mihindukulasooriya</surname>
          </string-name>
          , I. Abdelaziz,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bornea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gliozzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Naseem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kapanipathi</surname>
          </string-name>
          ,
          <article-title>Generative relation linking for question answering over knowledge bases, 2021</article-title>
          . URL: https://arxiv.org/abs/2108.07337. doi:10.48550/ARXIV.2108.07337.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <article-title>[2] Reaching out for the Answer: Relation Prediction</article-title>
          , volume
          <volume>3119</volume>
          of
          <article-title>SeMantic Answer Type and Relation Prediction Task at ISWC 2021 Semantic Web Challenge (SMART2021)</article-title>
          ,
          <source>CEUR Workshop Proceedings</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <article-title>[3] The Combination of BERT and Data Oversampling for Relation Set Prediction</article-title>
          , volume
          <volume>3119</volume>
          of
          <article-title>SeMantic Answer Type and Relation Prediction Task at ISWC 2021 Semantic Web Challenge (SMART2021)</article-title>
          ,
          <source>CEUR Workshop Proceedings</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , W. Hua,
          <string-name>
            <given-names>K.</given-names>
            <surname>Stratos</surname>
          </string-name>
          ,
          <article-title>EntQA: Entity linking as question answering</article-title>
          ,
          <source>CoRR abs/2110</source>
          .02369 (
          <year>2021</year>
          ). URL: https://arxiv.org/abs/2110.02369.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Cimiano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Paulheim</surname>
          </string-name>
          ,
          <article-title>Knowledge graph refinement: A survey of approaches and evaluation methods</article-title>
          ,
          <source>Semant. Web</source>
          <volume>8</volume>
          (
          <year>2017</year>
          )
          <fpage>489</fpage>
          -
          <lpage>508</lpage>
          . URL: https://doi.org/10.3233/SW-160218. doi:10.3233/SW-160218.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Jiomekong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Camara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tchuente</surname>
          </string-name>
          ,
          <article-title>Extracting ontological knowledge from Java source code using Hidden Markov Models</article-title>
          ,
          <source>Open Computer Science</source>
          <volume>9</volume>
          (
          <year>2019</year>
          )
          <fpage>181</fpage>
          -
          <lpage>199</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bordes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Usunier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Garcia-Durán</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Yakhnenko</surname>
          </string-name>
          ,
          <article-title>Translating embeddings for modeling multi-relational data</article-title>
          ,
          <source>in: Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS'13</source>
          , Curran Associates Inc., Red Hook, NY, USA,
          <year>2013</year>
          , pp.
          <fpage>2787</fpage>
          -
          <lpage>2795</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>Rossiello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Mihindukulasooriya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Abdelaziz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bornea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gliozzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Naseem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kapanipathi</surname>
          </string-name>
          ,
          <article-title>Generative relation linking for question answering over knowledge bases</article-title>
          ,
          <source>in: The Semantic Web - ISWC 2021: 20th International Semantic Web Conference, ISWC 2021, Virtual Event, October 24-28, 2021, Proceedings</source>
          , Springer-Verlag, Berlin, Heidelberg,
          <year>2021</year>
          , pp.
          <fpage>321</fpage>
          -
          <lpage>337</lpage>
          . URL: https://doi.org/10.1007/978-3-030-88361-4_19. doi:10.1007/978-3-030-88361-4_19.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Gomes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. C.</given-names>
            <surname>de Mello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ströele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. F.</given-names>
            <surname>de Souza</surname>
          </string-name>
          ,
          <article-title>A study of approaches to answering complex questions over knowledge bases</article-title>
          ,
          <source>Knowl. Inf. Syst.</source>
          <volume>64</volume>
          (
          <year>2022</year>
          )
          <fpage>2849</fpage>
          -
          <lpage>2881</lpage>
          . URL: https://doi.org/10.1007/s10115-022-01737-x. doi:10.1007/s10115-022-01737-x.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>C.</given-names>
            <surname>Ran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <article-title>Learning entity linking features for emerging entities</article-title>
          ,
          <year>2022</year>
          . doi:10.48550/ARXIV.2208.03877.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Hua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Stratos</surname>
          </string-name>
          ,
          <article-title>EntQA: Entity linking as question answering</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2110.02369. doi:10.48550/ARXIV.2110.02369.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>