Towards an Approach based on Knowledge Graph Refinement for Relation Linking and Entity Linking

Azanzi Jiomekong¹, Brice Foko¹, Vadel Tsague¹, Uriel Melie¹ and Gaoussou Camara²
¹ Department of Computer Science, University of Yaounde I, Yaounde, Cameroon
² Unité de Formation et de Recherche en Sciences Appliquées et des TIC, Université Alioune Diop de Bambey, Bambey, Sénégal

Abstract
This paper presents our contribution to the SMART 3.0 Relation Linking and Entity Linking research problems. This contribution consists of an approach based on knowledge graph refinement techniques for completing the graph with missing entities and relations. The model we define is inspired by the TransE algorithm and a scoring function that defines the distance between two nodes in the graph. The scoring function for Relation Linking is defined using Naive Bayes. Concerning Entity Linking, we identify and extract candidate terms from questions. Thereafter, these terms are used to search for relevant entities in the knowledge graph using the LookUp and Wikibase APIs.

Keywords
Question Answering, Relation Linking, Entity Linking, Knowledge Graph Refinement, Wikidata, DBpedia

1. Introduction

Knowledge Graph Question Answering (KGQA) is a popular task in the fields of Natural Language Processing (NLP) and Information Retrieval (IR). The goal of this task is to answer a natural language question using the facts in a Knowledge Graph (KG) [1]. To build efficient QA systems, several subtasks should be considered: Answer Type prediction, Entity Linking and Relation Linking. While SMART 2020¹ focused on the Answer Type (AT) subtask and SMART 2021² on the Answer Type and Relation Linking (RL) subtasks, SMART 3.0³ considers the Answer Type prediction, Relation Linking and Entity Linking (EL) subtasks. During the SMART 3.0 challenge, we worked on all three subtasks. This paper presents our contribution to the Entity Linking and Relation Linking subtasks.
Relation Linking consists of predicting the relation(s) to use for identifying the correct answer to a question given in natural language. In effect, knowing the relations between entities in a user query expressed in natural language allows us to significantly narrow down the search space for an answer [1]. For instance, given the question "Where is University of Yaounde I located?", querying all the 1-hop entities of the "University_of_Yaoundé_I" resource in DBpedia ("dbr:University_of_Yaoundé_I") returns 40 results; by predicting the relations dbo:country, dbo:state and dbo:city, this search space can be reduced. Thereafter, these relations are used to translate the question into its corresponding SPARQL query. The illustration is given in Table 1. This example shows that Relation Linking is a crucial component for enabling QA over Knowledge Graphs. SMART 2021 was devoted to the Answer Type prediction task and the Relation prediction task. However, only 2 (25%) of the papers addressed the relation prediction task.

¹ https://smart-task.github.io/2020/
² https://smart-task.github.io/2021/
³ https://smart-task.github.io/2022/

A question with relations predicted:
{
  "Question": "Where is University of Yaounde I located?",
  "relations": ["dbo:country", "dbo:state", "dbo:city"]
}

SPARQL query over DBpedia (without RL):
SELECT DISTINCT * WHERE {
  dbr:University_of_Yaoundé_I ?p ?o .
}
Result: 40 results

SPARQL query with RL:
SELECT DISTINCT * WHERE {
  dbr:University_of_Yaoundé_I dbo:country ?o1 .
  dbr:University_of_Yaoundé_I dbo:state ?o2 .
  dbr:University_of_Yaoundé_I dbo:city ?o3 .
}
Result: 3 good results - :Cameroon, :Centre_Region_(Cameroon), :Yaoundé

Table 1
SPARQL example of how Relation Linking can help to reduce the solution space over the DBpedia dataset
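The translation from predicted relations to a SPARQL query (as in Table 1) can be sketched as a small helper. This is an illustration only; the `buildQuery` function and its names are ours, not part of the SMART tooling:

```javascript
// Assemble a SPARQL query that restricts the search space of a linked
// entity to its predicted relations, as in Table 1.
// Note: `buildQuery` is a hypothetical helper, not part of SMART 3.0.
function buildQuery(entity, relations) {
  const patterns = relations
    .map((rel, i) => `  ${entity} ${rel} ?o${i + 1} .`)
    .join("\n");
  return `SELECT DISTINCT * WHERE {\n${patterns}\n}`;
}

const query = buildQuery("dbr:University_of_Yaoundé_I",
                         ["dbo:country", "dbo:state", "dbo:city"]);
console.log(query);
```

Running this prints the restricted query of Table 1, with one triple pattern per predicted relation.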
A comparison of related work results is presented in Table 2. This may be expected, because relation linking for question answering is known to be a hard task [1]. In effect, some questions have multiple relations, some relations are semantically far apart, and sometimes the tokens deciding the relations are spread across the question. On the other hand, some relations are implicit in the text, and there are lexical gaps between the relation surface forms and the labels of the KG properties.

System                  KG        Approach  Techniques                  Precision  Recall   F-measure
Khaoula and Nadine [2]  Wikidata  BERT      fastai + data augmentation  0.75094    0.8163   0.76018
Khaoula and Nadine [2]  DBpedia   BERT      fastai + data augmentation  0.86475    0.9204   0.86232
Thang et al. [3]        DBpedia   BERT      data oversampling           0.8368     0.8295   0.8315
Thang et al. [3]        Wikidata  BERT      data oversampling           0.6163     0.6110   0.6070

Table 2
Relation Linking results during SMART 2021

On the other hand, Entity Linking [4] consists of identifying entities in a question given in natural language and linking them to the KG, so that they can be used for retrieving the correct answer. In effect, the ability to know the entities hidden in a user query expressed in natural language allows us to significantly narrow down the search space for an answer [4]. For instance, given the following question: "Which is the profession of Ousmane Sonko?"

A question with entities predicted:
{
  "Question": "Which is the profession of Ousmane Sonko?",
  "entities": ["http://dbpedia.org/resource/Ousmane_Sonko"]
}

SPARQL query over DBpedia (without EL):
SELECT DISTINCT * WHERE {
  ?s dbr:occupation ?o .
}
Result: more than 10,000 results

SPARQL query with EL:
SELECT DISTINCT * WHERE {
  dbr:Ousmane_Sonko dbp:occupation ?o1 .
}
Result: good results - Politician, Tax inspector

Table 3
SPARQL example of how EL prediction can help to reduce the solution space over the DBpedia dataset
It is possible to reduce the search space from more than 10,000 entities, obtained while querying all the 1-hop entities of the "Ousmane_Sonko" resource in DBpedia ("dbr:Ousmane_Sonko dbp:occupation"), to one good result. This is done by predicting the following entity: http://dbpedia.org/resource/Ousmane_Sonko. Thereafter, this entity is used to translate the user question into its corresponding SPARQL query. The illustration is given in Table 3. This example shows that EL is a crucial component in KGQA systems.

To solve the EL and RL research problems, SMART 3.0 provides us with two versions of a large-scale dataset: the first is a DBpedia dataset (composed of 760 classes) and the second is a Wikidata dataset (composed of 50K classes). In addition to these datasets, a restricted vocabulary of all the relations used in the datasets for the RL task is provided. Using these datasets, we propose a solution to the EL and RL problems. This solution is based on knowledge graph refinement techniques [5]. Whatever the method used to construct KGs, it is known that they will never be perfect. Thus, to increase their utility, various refinement methods have been proposed to add missing knowledge such as missing entities, missing entity types, and/or missing relations between entities [5]. To complete a knowledge graph, internal methods use information hidden in the graph, while external methods use external knowledge sources such as text corpora, existing vocabularies, ontologies and/or knowledge graphs. In this research, we use external KG refinement techniques for link prediction and entity discovery. Thus, to solve the RL task, we consider the restricted vocabulary provided as our external knowledge. To solve the EL task, we consider the DBpedia knowledge graph as our external knowledge, and our goal is to identify terms in the question and map these terms to the graph in order to determine the most relevant entities.
To implement our solutions, we needed to set up a development environment. Thus, we used:

• Operating system: Ubuntu;
• Programming language: JavaScript;
• JavaScript library: Natural⁴.

The source code is provided on GitHub⁵. In the rest of the paper, we present the Relation Linking task in Section 2, the Entity Linking task in Section 3 and the conclusion in Section 4.

2. Relation Linking

To solve the Relation Linking task, we proceeded as follows: analysis of the dataset (Section 2.1), model definition (Section 2.2), and model training and use (Section 2.3).

2.1. Analysis of the dataset

The aim of this step was to gather, portray and provide a deeper understanding of the structure of the dataset so as to design efficient solutions. As presented by the organizers, the SMART 3.0 datasets are designed to provide one or many relations given a question in natural language. To understand the RL task, we thought it necessary to investigate a subset of this dataset. Thus, fifty questions from each dataset were randomly selected and their annotations automatically removed. Once the annotations were removed, a manual annotation of this subset of the dataset took place. The main findings from these datasets are:

• Data are not equally distributed among the different relations;
• Question types can be divided into Wh-, How many, Of which, Thus, Did, etc. The taxonomy presented in Fig. 1 gives the details;
• Question types can be used to identify a type of relation in the vocabulary.

Given the findings of the analysis of the dataset, we built the question taxonomy presented in Fig. 1. This taxonomy will be used to build the model.

2.2. Model definition

The aim of the model definition step was to define a model that can be applied to any dataset to solve the Relation Linking research problem. The first thing we did was to define a graph structure that can be used to represent questions and the relations to predict (see Fig. 2).
This consists of matching the taxonomy of Fig. 1 to the corresponding relations. Thereafter, each question is matched to the corresponding node in the question taxonomy. The idea is that, if a question is linked to a node in the taxonomy and this node is linked to a relation in the vocabulary, then this question is linked to this relation. Thus, we defined the "hasEmbedding" relation between a question and its embedding in the taxonomy. To define this relation, knowledge should be extracted from the question. Knowledge extraction is the creation of knowledge from structured (relational databases, XML), semi-structured (source code) and unstructured (text, documents, images) sources [6]. The current case consists of extracting terms from unstructured sources, namely questions.

⁴ https://github.com/NaturalNode/natural
⁵ https://github.com/jiofidelus/tsotsa/tree/SMART_22

Figure 1: Taxonomy of the types of questions built during the analysis of the dataset

We define the Relation Linking dataset as multi-relational data using a knowledge graph (see Fig. 2) given by the quadruple G = (V, E, R, T):

• V, the set of nodes. These nodes are the questions, the embedding vectors and the relations hidden in questions;
• E = {(v_i, r, v_j)}, the set of edges. These are triples in which node v_i is connected to node v_j via the relation r: the relations between two questions, or between a question and the set of relations hidden in this question;
• T = {t(v_i)}, the node types. The type of a relation node is its label, and the type of a question node is its id and the title of the question;
• R, the set of relation types. These are the labels of the relations.

In addition to this graph, we suppose that we have an external knowledge source (the vocabulary provided by the organizers) that can be used to complete the graph with missing knowledge. The graph of Fig.
2 contains the following relations:

• hasEmbedding: this relation defines how close a question is to an embedding vector defined in the taxonomy of questions. This is very useful because when two questions have the same embedding in the taxonomy, they have the same relation(s).
• hasRelation: this relation represents the link between a node and a term in the reference vocabulary.

Given this new configuration of the training dataset, we reduced the problem of Relation Linking to the problem of KG completion. To solve this problem, our goal is to provide an efficient tool to complete the graph by adding new facts from the external vocabulary furnished by the SMART organizers. Thus, our task is to predict the link between a question and a set of relations.

Figure 2: Knowledge graph-based representation of the dataset and the link with external knowledge

Inspired by the TransE algorithm [7] for learning low-dimensional embeddings of entities, we defined the model presented by equation 1. We consider that relations are represented as translations: the model is a function which takes a node and a relation and predicts all the nodes that are related to it through this relation. This prediction is based on the proximity between two nodes.

f_r : V × R × V → ℝ
f_r(v_i, r, v_j) = ||v_i + v_r − v_j||        (1)
v_i + v_r ≈ v_j if f_r(v_i, r, v_j) ≈ β

The function f_r is used to verify whether the triples in the knowledge graph hold. The triple (v_i, r, v_j) holds if f_r(v_i, r, v_j) ≈ β; in this case, we can say that v_i + v_r ≈ v_j, β being the model parameter. If β reaches a certain value (or belongs to a certain interval of values), we can say that v_i and v_j are linked by the relation r.

The model we just defined is based on node embeddings, the nodes being the questions and the lists of relations of these questions. Thus, we should find a way to model these nodes in such a way that they can be used to train our models to predict relations.
This is done using the "question2vect" (question vector) function. Before training the models, the first thing we did was to encode the questions so as to simplify the definition of the (Question, hasEmbedding, Vector) triples. We extracted knowledge from the question titles and used this knowledge to define the embedding vector of each question. To this end, we suppose that a question can be well represented by some terms in its title, and we define the question2vect function (see equation 2). This function takes a question and returns a vector containing the terms used as the embedding of this question.

question2vect(Q) = (t_1, t_2, ..., t_n)        (2)

In equation 2, t_i (i ≤ n) denotes the different terms (made up of words or groups of words) in the question title that can be representative of the question. The question2vect algorithm is given in Listing 1.

Listing 1: question2vect algorithm to transform a question into a vector

Input:
  Q : Array of String            // The words contained in the question
  RL_stopWord : Array of String  // The stop word list
Output:
  V : Vector of terms            // The embedding vector
Steps:
  1) Remove all special characters inside Q: \ , / " # : %
  2) Extract from Q the words in RL_stopWord and put them inside Extr such that
     |size(Q) + size(Extr) - size(RL_stopWord)| < beta   // hasEmbedding
  3) Put the results of step (2) in lower case and delete duplicates
  4) Return the vector V from step (3) with the extracted terms (t_1, t_2, ..., t_n)

2.3. Model training and use

We used equation 1 to define specific models for each relation.

• Learning the "hasEmbedding" relation: the function f_hasEmbedding is given by equation 3.
f_hasEmbedding(Q, hasEmbedding, V) = |Q − V| = α        (3)

To evaluate the function f_hasEmbedding, we use the Levenshtein distance. The overall algorithm is given in Listing 2.

Listing 2: f_hasEmbedding algorithm

Input:
  Q : Array of String  // All the words contained in the question title
  V : Vector of terms  // The terms contained in the embedding vector
Output:
  alpha : Int          // The distance between Q (the question) and V (the embedding vector)
Steps:
  1) n = number of strings contained in Q
  2) m = number of terms contained in V
  // Recursive implementation of the Levenshtein distance
  3) Compute the distance between Q and V:
     3.1) if m = 0, return n
     3.2) if n = 0, return m
     3.3) if n = m, do HasEmbedded(Queue(Q), Queue(V))
     3.4) else return 1 + min(HasEmbedded(Queue(Q), V),
                              HasEmbedded(Q, Queue(V)),
                              HasEmbedded(Queue(Q), Queue(V)))
  // The function min(x_1, ..., x_n) of step (3.4) returns the minimum value of its parameters
  // The function Queue(S) takes the question title and returns an array of strings
  // containing the words that compose the question, except the first word

To extract knowledge from questions, we used the Term Frequency-Inverse Document Frequency (TF-IDF) technique. We identified and removed the less frequent words and the stop words in questions to obtain the information that best represents each question. This information was used to define the f_hasEmbedding model. Globally, two vectors are close if α ≈ β, β being the model parameter defined during the training process. During the experiments, we varied β to obtain different performances.
• Learning the "hasRelation" relation: to predict the "hasRelation" relation, we consider that each embedding node has a probability of belonging to a set of relations. The scoring function defining this probability can be obtained using statistical learning, Naive Bayes, etc. In this research, we used Naive Bayes. Listing 3 describes how we determine the relations between a node embedding and a set of relations.

Listing 3: f_hasRelation algorithm

Input:
  V : a vector of embedded vectors
  Z : Embedding matrix  // Matrix containing the probability of a vector v_i
                        // (an element of V) being linked to a relation R_j
Output:
  R : String            // The relation predicted
Steps:
  for each vector v_i in V:
    Z_i = row i of the matrix Z
    for k = 0, k < sizeOf(Z_i):   // the size of the row
      if max(|Z_ik|)              // searching for the max probability in the row
        return R_k                // returning the relation having the max probability

To train the f_hasRelation model, we used the DBpedia and Wikidata datasets, with the goal of determining the triples (V, hasRelation, R). This consists of filling the matrix presented in Fig. 3 with values, so that each value represents the probability that a term in the embedding vector has as relation an element of the vocabulary.

Figure 3: The embedding matrix Z. Rows correspond to questions, and columns to the candidate relations given in the external vocabulary; the size of the matrix grows as the embedding vector grows.

Model refinement consists of extending the taxonomy of questions to obtain a larger number of nodes. This is done by varying the parameter β given in equation 1. To finalize this work, we applied our approach (with embedding vectors of different sizes) to the test data, and our results were analyzed by the SMART 3.0 organizers.
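The two scoring components described above can be sketched in JavaScript, the language of our implementation. This is an illustrative re-implementation rather than our exact code: the distance of Listing 2 is written here in the standard recursive word-level Levenshtein form (comparing the leading words of the two sequences), and the row scan of Listing 3 becomes a simple argmax; the function names are ours.

```javascript
// Word-level Levenshtein distance between a question Q (array of words)
// and an embedding vector V (array of terms), per Listing 2.
function hasEmbeddingDistance(q, v) {
  if (v.length === 0) return q.length;
  if (q.length === 0) return v.length;
  if (q[0] === v[0]) return hasEmbeddingDistance(q.slice(1), v.slice(1));
  return 1 + Math.min(
    hasEmbeddingDistance(q.slice(1), v),          // drop a question word
    hasEmbeddingDistance(q, v.slice(1)),          // drop a vector term
    hasEmbeddingDistance(q.slice(1), v.slice(1))  // substitute one word
  );
}

// Row-wise lookup of Listing 3: return the relation with the highest
// probability in one row of the embedding matrix Z.
function predictRelation(row, relations) {
  let best = 0;
  for (let k = 1; k < row.length; k++) {
    if (row[k] > row[best]) best = k;
  }
  return relations[best];
}
```

For example, `hasEmbeddingDistance(["where", "is"], ["where", "was"])` returns 1, and `predictRelation([0.1, 0.7, 0.2], ["dbo:country", "dbo:state", "dbo:city"])` returns "dbo:state".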
The results for the different embedding vectors are presented in Table 4 for DBpedia and Table 5 for Wikidata. In these tables:

Method                                 Precision  Recall   F-measure
Naive Bayes                            0.37266    0.27052  0.29319
Naive Bayes + WH-taxonomy (β = 1)      0.37757    0.27602  0.29864
Naive Bayes + taxonomy (β > 4)         0.38377    0.28017  0.30302
Naive Bayes + taxonomy (β ∈ [1, 4])    0.53426    0.47036  0.48661

Table 4
Results of the approach applied to the DBpedia dataset

Method       Precision  Recall   F-measure
Naive Bayes  0.27901    0.31036  0.28541

Table 5
Results of the approach applied to the Wikidata dataset

• Naive Bayes: we used Naive Bayes to learn the model without any consideration of the model parameter β;
• Naive Bayes + WH-taxonomy: we used the question taxonomy of Fig. 1. In this case, the model parameter β is 1, representing for instance the term "Who" in the question "Who are the gymnasts coached by Amanda Reddin?";
• Naive Bayes + taxonomy (β > 4): represents the question taxonomy of Fig. 1 extended with more words extracted from the question title. In this case, we took β > 4 words;
• Naive Bayes + taxonomy (β ∈ [1, 4]): represents the use of the question taxonomy with a value of β fixed between 1 and 4 words.

3. Entity Linking

To solve the EL task, we used a simple software engineering technique consisting of extracting terms from questions and using these terms to search for entities in the KG (see Fig. 4). To better understand the EL task, we selected ten questions from the training dataset, identified the entities in the question titles, and compared them to the provided answers. Thereafter, we made simple queries on DBpedia online using the Wikibase⁶ and DBpedia LookUp⁷ APIs (Listing 4 presents an example) to search for the same entities in the DBpedia KG.

Listing 4: Example of the use of the LookUp and Wikibase APIs

// Querying using the LookUp API
https://lookup.dbpedia.org/api/search?
query=Ousmane%20Sonko

// Querying using the Wikibase API
https://en.wikipedia.org/w/api.php?action=query&generator=allpages&prop=pageprops|info&inprop=url&ppprop=wikibase_item&gaplimit=5&gapfrom=Ousmane%20Sonko

Globally, we reduced the EL task to two sub-problems: entity retrieval in the question and entity mapping to the KG (illustrated by Fig. 4). Thus, for each question Q:

• Search for all candidate terms that might be entities in the question Q. In our case, a term can be a word or a group of words. Examples of terms are "Tchad" and "Success Masra".
• Use the terms identified to search for candidate entities in the KG.

⁶ https://www.mediawiki.org/wiki/API:Main_page
⁷ https://lookup.dbpedia.org/

Figure 4: Mapping a question to the knowledge graph (question2vect extracts the terms t1, t2, ..., tn, which are then mapped to the large-scale graph)

Take the example of the question "Where was Maurice Kamto born?". From this question, the candidate terms are "Maurice", "Kamto", "was Maurice", "Maurice Kamto", "Kamto born", "was Maurice" and "was Maurice Kamto born". From these terms, we realized that the verbs are not pertinent words for finding entities, and we put in place a filter process to retrieve and remove all the verbs. Applied to our example, we obtain the following candidates: {"Maurice", "Kamto", "Maurice Kamto"}. These three terms are used to query the KG online to search for entities. Once the entities are retrieved, we count the occurrences of each result (see Table 6). The entity having the greatest number of occurrences is selected as the entity we are searching for. In the case of our example, the entity which has 3 occurrences, against 1 for the others, is Maurice_Kamto.
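The candidate-term generation and occurrence counting described above can be sketched as follows. This is an illustration: the stop word excerpt and the `lookup` callback, which stands in for the LookUp and Wikibase API calls, are ours.

```javascript
// Stop words and filtered verbs (a small excerpt for illustration).
const STOP_WORDS = new Set(["where", "was", "is", "the", "of", "born"]);

// Generate candidate terms from a question: single words plus adjacent
// pairs, after removing special characters, stop words and verbs.
function candidateTerms(question) {
  const words = question
    .replace(/[?\\,\/"#:%]/g, "")
    .split(/\s+/)
    .filter(w => w.length > 0 && !STOP_WORDS.has(w.toLowerCase()));
  const terms = [...words];
  for (let i = 0; i + 1 < words.length; i++) {
    terms.push(`${words[i]} ${words[i + 1]}`);
  }
  return terms;
}

// Query the KG for each term and keep the entity returned most often;
// `lookup` is a stand-in for the LookUp/Wikibase API calls.
function selectEntity(terms, lookup) {
  const counts = new Map();
  for (const term of terms) {
    for (const e of lookup(term)) counts.set(e, (counts.get(e) || 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1])[0][0];
}

console.log(candidateTerms("Where was Maurice Kamto born?"));
// → [ 'Maurice', 'Kamto', 'Maurice Kamto' ]
```

With a `lookup` returning the entity lists of Table 6, `selectEntity` picks the most frequent candidate, dbr:Maurice_Kamto.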
Entities retrieved       Occurrences
dbr:Maurice_(emperor)    1
dbr:Mauritius            1
dbr:Joseph_Fotso         1
dbr:Maurice_Kamto        3

Table 6
Results of mapping {"Maurice", "Kamto", "Maurice Kamto"} to the DBpedia KG

In the example we just presented, the entities were very easy to retrieve. However, the analysis of the dataset showed that the situation is not always so easy. The analysis of the different types of questions showed that the terms used to search for entities can be:

1. A proper noun: "Success Masra", "Cameroon", "Ousmane Sonko";
2. A common noun: "Marketing";
3. A group of words between two stop words.

Cases one and two can be easily solved. However, the third case is the most difficult, because one has to determine which stop words delimit the entities. To identify the entities defined by the third case, we made two tables: one composed of the stop words, and a second one composed of the stop words that are generally used to link terms and form entities. Thus, for each question, we match the question against the stop words and search for the entity in the KG. The stop words are used to delimit the names of the entities to be searched for in the question. The stop word table is created using the set of English stop words. We added the term "END_EL" at the end of all the questions. This term is considered a stop word and is used to complete the stop word table. The overall stop words are used to build a set of candidate terms. This approach allowed us to define the different ways an entity can appear in a question. Take the example "where is the University of Yaounde I?": the stop words "of" and "END_EL" are used to define the term "Yaounde I", and the stop words "the" and "of" are used to define the term "University" as candidate terms to search for entities in the KG.
Given that the word "of" is contained in the second table (the words generally used to link terms) and that "of" is between the two candidate terms "University" and "Yaounde I", a new term "University of Yaounde I" is added to the list of candidate terms. This approach is illustrated by Fig. 5, and Listing 5 presents the algorithm.

Figure 5: Description of the Entity Linking approach

Listing 5: The algorithm executed to solve the EL task

Input:
  Q : Array of questions        // All the questions from the test data
  stopwords : Array of String   // The English stop word list. Examples: of, the, in, or, etc.
  filterwords : Array of String // The filter word list. Examples: ne, of, the
Steps:
  For each question in Q:
  1) Remove the special characters: \ , / " # : %
  2) Put all the strings in lower case
  3) Tokenize the result of step (2)
  4) Match the tokens obtained in step (3) with {stopwords} to get the candidate entities
  // Step 5 allows us to obtain the embedding vector V associated with the question
  5) Match the candidate entities with {filterwords} to produce a clean candidate list
  6) Use the Wikibase API and the LookUp API to search for entities
  7) Save the entities retrieved

We applied this approach to the test data. Table 7 presents the results obtained.

Precision  Recall   F1-score
0.41999    0.54481  0.45729

Table 7
Results of the EL task on the test data

We approached the problem of entity linking as software developers. However, this approach has the following limits: setting up the stop word arrays for the questions is a difficult task, and the number of combinations to examine when determining an entity is very high, which makes the time complexity very high.

4.
Conclusion

To solve the RL and EL tasks, this paper proposes an approach based on KG refinement techniques. This consists of using external knowledge to complete the graph with missing entities and missing relations. Concerning Relation Linking, we used the vocabulary provided by the SMART 3.0 organizers as our external knowledge. Thereafter, we used Naive Bayes to build the embedding matrix used to predict the links between a question and its relation(s). The evaluation on the test data gave a maximum Precision of 0.53426, Recall of 0.47036 and F-measure of 0.48661 for DBpedia, and a Precision of 0.27901, Recall of 0.31036 and F1-score of 0.28541 for Wikidata. Concerning Entity Linking, we used the KG itself as external knowledge. Thus, our goal was to identify the most relevant terms (one word or a group of words) that can be matched to the KG and, from these terms, determine the relevant entities. The EL task was run on the DBpedia dataset only, and the evaluation by the SMART 3.0 organizers gave a Precision of 0.41999, a Recall of 0.54481 and an F1-score of 0.45729.

We started this challenge with no prior knowledge of the Entity Linking and Relation Linking tasks, and we proposed a solution. Future work consists of exploring different machine learning, deep learning and natural language processing models that can help us improve the quality of our results. We found particularly good papers on relation linking [8, 9] and entity linking [9, 10, 11]. We plan to explore and test these methods to solve the Entity Linking and Relation Linking tasks.

Acknowledgments

We would like to thank the ISWC conference and the SMART challenge organizers and all the reviewers. This work was partially supported by Neuralearn.ai (Start-up Cameroon).

References

[1] G. Rossiello, N. Mihindukulasooriya, I. Abdelaziz, M. Bornea, A. Gliozzo, T. Naseem, P. Kapanipathi, Generative relation linking for question answering over knowledge bases, 2021.
URL: https://arxiv.org/abs/2108.07337. doi:10.48550/ARXIV.2108.07337.
[2] Reaching out for the Answer: Relation Prediction, volume 3119 of SeMantic Answer Type and Relation Prediction Task at ISWC 2021 Semantic Web Challenge (SMART2021), CEUR Workshop Proceedings, 2021.
[3] The Combination of BERT and Data Oversampling for Relation Set Prediction, volume 3119 of SeMantic Answer Type and Relation Prediction Task at ISWC 2021 Semantic Web Challenge (SMART2021), CEUR Workshop Proceedings, 2021.
[4] W. Zhang, W. Hua, K. Stratos, EntQA: Entity linking as question answering, CoRR abs/2110.02369 (2021). URL: https://arxiv.org/abs/2110.02369.
[5] P. Cimiano, H. Paulheim, Knowledge graph refinement: A survey of approaches and evaluation methods, Semant. Web 8 (2017) 489–508. URL: https://doi.org/10.3233/SW-160218. doi:10.3233/SW-160218.
[6] A. Jiomekong, G. Camara, M. Tchuente, Extracting ontological knowledge from java source code using hidden markov models, Open Computer Science 9 (2019) 181–199.
[7] A. Bordes, N. Usunier, A. Garcia-Durán, J. Weston, O. Yakhnenko, Translating embeddings for modeling multi-relational data, in: Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS'13, Curran Associates Inc., Red Hook, NY, USA, 2013, pp. 2787–2795.
[8] G. Rossiello, N. Mihindukulasooriya, I. Abdelaziz, M. Bornea, A. Gliozzo, T. Naseem, P. Kapanipathi, Generative relation linking for question answering over knowledge bases, in: The Semantic Web – ISWC 2021: 20th International Semantic Web Conference, ISWC 2021, Virtual Event, October 24–28, 2021, Proceedings, Springer-Verlag, Berlin, Heidelberg, 2021, pp. 321–337. URL: https://doi.org/10.1007/978-3-030-88361-4_19. doi:10.1007/978-3-030-88361-4_19.
[9] J. Gomes, R. C. de Mello, V. Ströele, J. F. de Souza, A study of approaches to answering complex questions over knowledge bases, Knowl. Inf. Syst. 64 (2022) 2849–2881.
URL: https://doi.org/10.1007/s10115-022-01737-x. doi:10.1007/s10115-022-01737-x.
[10] C. Ran, W. Shen, J. Gao, Y. Li, J. Wang, Y. Jia, Learning entity linking features for emerging entities, 2022. doi:10.48550/ARXIV.2208.03877.
[11] W. Zhang, W. Hua, K. Stratos, EntQA: Entity linking as question answering, 2021. URL: https://arxiv.org/abs/2110.02369. doi:10.48550/ARXIV.2110.02369.