=Paper=
{{Paper
|id=Vol-3337/smart3
|storemode=property
|title=Towards an Approach based on Knowledge Graph Refinement for Answer Type prediction (full paper)
|pdfUrl=https://ceur-ws.org/Vol-3337/smart-paper3.pdf
|volume=Vol-3337
|authors=Azanzi Jiomekong,Vadel Tsague,Brice Foko,Uriel Melie,Gaoussou Camara
|dblpUrl=https://dblp.org/rec/conf/semweb/JiomekongTFMC22
}}
==Towards an Approach based on Knowledge Graph Refinement for Answer Type prediction (full paper)==
Towards an Approach based on Knowledge Graph Refinement for Answer Type prediction

Azanzi Jiomekong¹, Vadel Tsague¹, Brice Foko¹, Uriel Melie¹ and Gaoussou Camara²
¹ Department of Computer Science, University of Yaounde I, Yaounde, Cameroon
² Unité de Formation et de Recherche en Sciences Appliquées et des TIC, Université Alioune Diop de Bambey, Bambey, Sénégal

Abstract
This paper presents our contribution to the SMART 3.0 Answer Type (AT) prediction research problem. This contribution consists of a method for Answer Type prediction that can be applied to any Knowledge Graph. Inspired by the TransE algorithm for link prediction in knowledge graphs, we defined our models. Concerning category prediction, we trained the model on 80% of the training dataset furnished by the organizers and tested it on the remaining 20%, using different model parameters. Applied to the test data provided by the organizers, we obtained a maximum accuracy of 0.96 on the DBpedia dataset and a category_match value of 0.94 on the Wikidata dataset.

Keywords
Question Answering, Answer Type Prediction, Knowledge Graphs, Knowledge Graph Refinement, Wikidata, DBpedia

1. Introduction

Question Answering (QA) systems aim to respond to natural language questions of users. To this end, Knowledge Base Question Answering (KBQA) is required to understand the meaning of natural language questions and furnish reliable responses. KBQA is thus a popular task in the fields of Natural Language Processing (NLP) and Information Retrieval (IR); it aims to answer a natural language question using the facts in a Knowledge Base [1]. To build efficient QA systems, several subtasks should be considered. While SMART 2020¹ focused on the Answer Type (AT) subtask and SMART 2021² on the Answer Type and Relation Linking (RL) subtasks, SMART 3.0³ considers the Answer Type prediction, Relation Linking, and Entity Linking (EL) subtasks. During the SMART 3.0 challenge, we addressed all three subtasks; this paper is devoted to our solution to the Answer Type prediction problem.

Answer Type prediction consists of the classification of the expected answer type. In effect, knowing the AT enables the system to significantly narrow down the search space for an answer to a given question [1]. For example, given the question "Where was Maurice Kamto born?", the predicted answer type makes it possible to reduce the search space from 67 candidate answers to 7 entities, instead of querying all the 1-hop entities of the Maurice Kamto resource in DBpedia (dbr:Maurice_Kamto). Table 1 illustrates this example.

¹ https://smart-task.github.io/2020/
² https://smart-task.github.io/2021/
³ https://smart-task.github.io/2022/

Question with predicted answer type:

{ "question": "Where was Maurice Kamto born?",
  "category": "resource",
  "type": ["dbo:Country", "dbo:PopulatedPlace", "dbo:Place",
           "dbo:Location", "dbo:Settlement", "dbo:Village"] }

SPARQL query over DBpedia (without the answer type):

SELECT DISTINCT ?type ?o ?p WHERE {
  dbr:Maurice_Kamto ?p ?o .
  ?o rdf:type ?type .
  FILTER strstarts(str(?type), str(dbo:))
}
Result: 67 results

SPARQL query with the AT predicted:

SELECT DISTINCT ?type WHERE {
  dbr:Maurice_Kamto dbo:birthPlace ?o .
  ?o rdf:type ?type .
  FILTER strstarts(str(?type), str(dbo:))
}
Result: 7 results

Table 1: SPARQL example of how AT prediction can help to reduce the solution space over the DBpedia dataset
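To make this example concrete, the following is a minimal Python sketch (ours, not part of the paper's published code) that runs both queries of Table 1 against the public DBpedia endpoint and counts the candidate answers. It assumes the third-party requests library, and the exact counts may drift as DBpedia evolves.

import requests

# Public DBpedia SPARQL endpoint (availability and counts may vary over time).
ENDPOINT = "https://dbpedia.org/sparql"

PREFIXES = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
"""

# All dbo: types reachable through any 1-hop neighbour (no answer type known).
WITHOUT_AT = PREFIXES + """
SELECT DISTINCT ?type ?o ?p WHERE {
  dbr:Maurice_Kamto ?p ?o .
  ?o rdf:type ?type .
  FILTER strstarts(str(?type), str(dbo:))
}
"""

# The same search restricted by the predicted answer type (a birth place).
WITH_AT = PREFIXES + """
SELECT DISTINCT ?type WHERE {
  dbr:Maurice_Kamto dbo:birthPlace ?o .
  ?o rdf:type ?type .
  FILTER strstarts(str(?type), str(dbo:))
}
"""

def count_results(query: str) -> int:
    """Run a SELECT query and return the number of result rows."""
    resp = requests.get(ENDPOINT, params={
        "query": query, "format": "application/sparql-results+json"})
    resp.raise_for_status()
    return len(resp.json()["results"]["bindings"])

print("without AT:", count_results(WITHOUT_AT))  # 67 in the paper's example
print("with AT:   ", count_results(WITH_AT))     # 7 in the paper's example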
" , Result: 67 results " category " : " resource Result: 7 results ", " type " : [ " dbo : Country " , " dbo : PopulatePlace ", " dbo : P l a c e " , " dbo : L o c a t i o n " , " dbo : S e t t l e m e n t " , " dbo : v i l l a g e " ] } Table 1 SPARQL example on how AT prediction can help to reduce the solution space over DBpedia dataset SMART 2020 was devoted to the AT task and SMART 2021 to AT and RL tasks. Fourteen papers were proposed for the AT task since 2020 - all these papers were using DBpedia and 6 were using DBpedia and Wikidata. DBpedia dataset were used by all the participants during the year 2020 and 2021. Three participants (37.5%) were using Wikidata in 2020 and 3 (50%) in 2021. Past participants were having good accuracy (some reaching 90%) and good NDCG and mrr (some reaching 80%). Thus, our aim in this research is to augment the repository of solutions to the AT task by proposing new methods. To solve the AT task, SMART 3.0 provides us with two versions of large-scale dataset: one dataset is the DBpedia dataset (composed of 760 classes) and the second one is the Wikidata dataset (composed of 50K classes). Using these datasets we are proposing a solution to the AT task. This solution is based on Knowledge Graphs refinement techniques [2]. In effect, whatever the method used to construct KGs, it is known that they will never be perfect. Thus, to increase their utility, various refinement methods are proposed. These solutions are used to predict missing knowledge such as missing entities, missing types of entities, and/or missing relations that exist between entities [2]. Internal KG refinement techniques aim to complete the KG by using information hidden in the graph and external methods aim to use external knowledge sources such as text corpora to complete the graph. In this paper, we are proposing the use of internal method for category prediction and for the prediction of literal types (number, string, date) and Booleans types. For the prediction of types of resources, we are using external methods. To implement our solutions, we needed to set up a development environment. Thus, we used: • Operating system: Ubuntu, Figure 1: Data distribution for each category and literal for DBpedia • Programming languages: JavaScript and Python, • Python libraries: Json4 , CSV5 , • JavaScript library: Natural6 . The source code is available on GitHub7 . We started this work with the analysis of the dataset (Section 2). Thereafter, we processed to the model definition (Section 3) and the model training and use (Section 4). We finalize with the conclusion (Section 5). 2. Analysis of the dataset The aim of this step was to gather, portray and provide a deeper understanding of the structure of the dataset so as to provide an efficient solution. As presented by the organizers datasets are designed to provide a single answer category either "boolean", "literal", or "resource". They assign the "boolean" category as boolean, "literal" category into an answer type either "number", "date", or "string". For resource categories, DBpedia and Wikidata ontologies classes are used to identify the answer type. To understand the dataset, we thought it necessary to investigate a subset of this dataset. Thus, fifty questions for each dataset were randomly selected and their annotations automatically removed. Once the annotations were removed, a manual annotation of this subset of the dataset took place. 
The manual analysis of the dataset gave us the insight that the "resource", "literal" and "boolean" categories are not equally distributed in the dataset. The analysis of the distribution by category, presented by Fig. 1 for DBpedia and Fig. 2 for Wikidata, confirmed this insight. The main findings on this dataset were that:
• Data are not equally distributed amongst the different categories (see Fig. 1 for DBpedia and Fig. 2 for Wikidata);
• The question types can be divided into Wh-, How many, Of which, Did, etc. questions.

⁴ https://docs.python.org/3/library/json.html
⁵ https://docs.python.org/3/library/csv.html
⁶ https://github.com/NaturalNode/natural
⁷ https://github.com/jiofidelus/tsotsa/tree/SMART_22

Figure 1: Data distribution for each category and literal for DBpedia
Figure 2: Data distribution for each category and literal for Wikidata
Figure 3: Taxonomy of types of questions built during the analysis of the dataset

Given the findings of this analysis, we built the question taxonomy presented by Fig. 3. This taxonomy will be used to build and refine the model.

3. Model definition

The aim of the model definition step was to define a model that can be applied to any dataset to solve the Answer Type prediction research problem. The first thing we did was to define a graph structure that can represent any Question Answering dataset (see Fig. 4). This consists of matching the taxonomy of Fig. 3 to the corresponding answer types and answer categories, and thereafter matching the set of questions to the question taxonomy. The idea is that if a question is linked to a node in the taxonomy, and this node is linked to a category and/or an answer type, then this question is linked to this category and/or answer type. Thus, we define the "hasEmbedding" relation between a question and its embedding in the taxonomy. To define this relation, knowledge should be extracted from the question. Knowledge extraction is the creation of knowledge from structured (relational databases, XML), semi-structured (source code) and unstructured (text, documents, images) sources [3]. In our case, we are faced with extracting terms from unstructured sources, namely questions.

Figure 4: Knowledge graph-based representation of the dataset

From Fig. 4, we define the Question Answering dataset as multi-relational data using a knowledge graph defined by the quadruple G = (V, E, R, T) where:
• V = {v_i} is the set of nodes. These nodes are questions, embedding vectors, question categories, and question types;
• E = {(v_i, r, v_j)} is the set of edges, which are triples in which node v_i is connected to node v_j via the relation r. These are the relations between two questions, or between a question and its type or its category;
• T = {t(v_i)} denotes the node types: category and type nodes are typed by their labels, while question nodes are typed by their id and the title of the question;
• R = {r} denotes the relation types, i.e., the labels of the relations.
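Before detailing the relations, here is a minimal illustration (a sketch of ours, with made-up node labels, not the authors' implementation) of how the quadruple G = (V, E, R, T) can be materialized with plain Python structures:

from dataclasses import dataclass, field

@dataclass
class KG:
    """A tiny in-memory knowledge graph G = (V, E, R, T)."""
    nodes: dict = field(default_factory=dict)    # V with T: node id -> node type
    triples: set = field(default_factory=set)    # E: (v_i, r, v_j)
    relations: set = field(default_factory=set)  # R: relation labels

    def add_node(self, node_id: str, node_type: str) -> None:
        self.nodes[node_id] = node_type

    def add_triple(self, v_i: str, r: str, v_j: str) -> None:
        self.relations.add(r)
        self.triples.add((v_i, r, v_j))

g = KG()
g.add_node("Q1", "question")          # question node, typed by id + title in the paper
g.add_node("vec_where", "embedding")  # embedding vector node from the taxonomy
g.add_node("resource", "category")    # category node, typed by its label
g.add_triple("Q1", "hasEmbedding", "vec_where")
g.add_triple("vec_where", "hasCategory", "resource")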
The following relations are used to link the different nodes in the graph of Fig. 4:
• hasEmbedding: this relation defines how close a question is to an embedding vector defined in the question taxonomy. It is very useful because two questions that have the same embedding in the taxonomy will have the same category.
• isCloseTo: this relation links two questions that have the same embedding. This information is particularly important during the training of the model: in effect, two questions that are close to each other share the same category.
• hasCategory: this relation links a node (either a question or an embedding vector of the taxonomy) to its category;
• hasType_b: this relation models the relation of the "boolean" category to its type, which is "boolean";
• hasType_l: this relation models the relation of the "literal" category to its type, which can be "number", "string" or "date";
• hasType_r: this relation models the relation of the "resource" category to its type in the knowledge graph.

Given this new configuration of the training dataset, we reduced the problem of Answer Type prediction to the problem of KG completion. To solve this problem, our goal is to provide an efficient tool that completes the graph by adding new facts without requiring extra knowledge. Inspired by the TransE algorithm [4] for learning low-dimensional embeddings of entities, we defined our model, presented by equation 1. We consider that relationships are represented as translations: for instance, if (Q1, isCloseTo, Q2) holds, then the embedding of Q1 should be close to the embedding of Q2. The model is a function that takes a node and a relation and predicts all the nodes connected through this relation, based on the proximity between nodes.

f_r : V × R × V → ℝ
f_r(v_i, r, v_j) = ||v_i + v_r − v_j||        (1)
v_i + v_r ≈ v_j if f_r(v_i, r, v_j) ≈ β

The function f_r is used to verify whether a triple of the knowledge graph holds: the triple (v_i, r, v_j) holds if f_r(v_i, r, v_j) ≈ β, in which case v_i + v_r ≈ v_j, β being a model parameter. If the score approaches β (or falls within a given interval of values), we say that v_i and v_j are linked by the relation r.
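A minimal numeric sketch of this scoring function (ours; the toy vectors and the threshold reading of f_r ≈ β are illustrative):

import numpy as np

def f_r(v_i: np.ndarray, v_r: np.ndarray, v_j: np.ndarray) -> float:
    """Equation 1: score of the triple (v_i, r, v_j) as ||v_i + v_r - v_j||."""
    return float(np.linalg.norm(v_i + v_r - v_j))

def triple_holds(v_i, v_r, v_j, beta=0.5):
    """The paper accepts a triple when f_r is approximately beta; a simple
    reading (assumed here) is a threshold test on the distance."""
    return f_r(v_i, v_r, v_j) <= beta

# Toy vectors: two almost-identical question embeddings, and isCloseTo
# modeled as a near-zero translation vector.
q1 = np.array([1.0, 0.0, 1.0])
q2 = np.array([1.0, 0.1, 1.0])
r_is_close_to = np.zeros(3)
print(triple_holds(q1, r_is_close_to, q2))  # True: q1 + r is close to q2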
The model we just defined is based on node embeddings, the nodes being the questions, the question categories, and the question types. Thus, we must model these nodes in such a way that they can be used to train our models to predict relations. This is done by encoding the questions so as to simplify the definition of (Question, hasEmbedding, Vector) triples. We extracted knowledge from the question title and used this knowledge to define the embedding vector of each question. To this end, we assume that a question can be well represented by some terms (words or groups of words) in its title, and we define the question2vect function (equation 2), which takes a question and returns a vector containing the terms used as the embedding of this question.

question2vect(Q) = (t_1, t_2, ..., t_n)        (2)

In equation 2, t_i (i ≤ n) denotes the terms (words or groups of words) of the question title that can be representative of the question. For category prediction, these can be the "WH-", "Is", or "How many" terms used to identify the category of a question. For instance, using the taxonomy of Fig. 3, the size of the vectors returned by question2vect is 15. Using this taxonomy, listing 1 provides some examples of the evaluation of the question2vect function on questions of the dataset, and listing 2 presents the algorithm executed to transform a question into a vector of terms.

Listing 1: Example of the transformation of some questions into a vector using the taxonomy of questions

question2vect("What are the awards won by the producer of Puss in Boots?")
  = (What, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
question2vect("Who are the gymnasts coached by Amanda Reddin?")
  = (0, 0, 0, 0, Who, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
question2vect("How many superpowers does wonder woman have?")
  = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, How many, 0, 0, 0)
question2vect("Is Azerbaijan a member of European Go Federation?")
  = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, Is, 0, 0)

Listing 2: Question2vect algorithm

Input:
  Q             // the question
  lessFrequent  // the less frequent words in the train dataset
  T             // the current taxonomy
Output:
  V             // the vector of terms
Steps:
  1- listOfWord = tokenize(Q)        // tokenize the question to obtain the list
                                     // of words it contains
  2- for w in listOfWord:
       if w in lessFrequent: delete w
  3- V = buildVector(listOfWord, T)  // build the vector, filling the empty slots
                                     // with 0 so that the size of the vector
                                     // corresponds to |T|

4. Models training and use

We used equation 1 to define a specific model for each relation.

• Learning the "hasEmbedding" relation: the function f_hasEmbedding is given by equation 3.

f_hasEmbedding(Q, hasEmbedding, V) = |Q − V| = α        (3)

Globally, a node Q has embedding node V if α ≈ β, β being the model parameter defined during the training process. Listing 3 presents the algorithm used to determine the embedding vector of a question.

Listing 3: hasEmbedding algorithm

Input:
  Q  // a question
  V  // an embedding vector
Output:
  hasEmbedding = false  // a boolean telling whether question Q has embedding vector V
Steps:
  1) W = question2vect(Q)
  2) If ||W − V|| < beta   // compare the number of differing words between W and V
       hasEmbedding = true
     else
       hasEmbedding = false

• Learning the "isCloseTo" relation: the function f_isCloseTo is given by equation 4.

f_isCloseTo(Q_i, isCloseTo, Q_j) = |Q_i − Q_j| = θ        (4)

Globally, a question Q_1 is close to a question Q_2 if θ ≈ γ, γ being the model parameter defined during the training process. The algorithm of listing 4 links two questions that are close to each other.

Listing 4: isCloseTo algorithm

Input:
  Q_1, Q_2  // two questions
Output:
  isCloseTo = false  // a boolean telling whether question Q_1 is close to question Q_2
Steps:
  1) theta = 0
  2) Remove the special characters inside Q_1 and Q_2: \ , / " # : %
  3) Put all the words inside Q_1 and Q_2 in lowercase
  4) Tokenize Q_1 and Q_2
  5) For each token t_1 of Q_1:
       if t_1 is inside the token list of Q_2: theta = theta + 1
  6) If theta >= gamma: isCloseTo = true   // enough shared tokens means the
                                           // two questions are close
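Before moving to category prediction, here is a short runnable sketch (ours) that mirrors listings 3 and 4 with a token-overlap distance; the thresholds beta and gamma stand in for the values fixed during training.

import re

def tokenize(question: str) -> set:
    """Lowercase, drop the special characters of listing 4, and split."""
    cleaned = re.sub(r'[\\,/"#:%?]', " ", question.lower())
    return set(cleaned.split())

def has_embedding(w: set, v: set, beta: int = 2) -> bool:
    """Listing 3, reading ||W - V|| as the number of differing terms
    between the question vector W and the taxonomy vector V."""
    return len(w ^ v) < beta

def is_close_to(q1: str, q2: str, gamma: int = 3) -> bool:
    """Listing 4: count the tokens of Q_1 found in Q_2 and declare the
    questions close when enough tokens are shared."""
    theta = len(tokenize(q1) & tokenize(q2))
    return theta >= gamma

print(is_close_to("Who are the gymnasts coached by Amanda Reddin?",
                  "Who are the players coached by Jose Mourinho?"))  # True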
• Learning the "hasCategory" relation: to predict the "hasCategory" relation, we consider that each embedding node has a probability of belonging to each category. The scoring function defining this probability can be obtained using statistical learning, Naïve Bayes, etc. For instance, with statistical learning, the scoring function is given by equation 5: to find whether a node V hasCategory node C, we count the number of times node V hasCategory node C in the training dataset and divide it by the number of times node V hasCategory any category.

f_hasCategory(V, hasCategory, C) = ( Σ_{i=1..N} freq((V, C), Train) ) / ( Σ_{i=1..N} freq((V, *), Train) )        (5)

Globally, an embedding vector V has category C if C is the category with the greatest probability. The algorithm is presented by listing 5.

Listing 5: hasCategory algorithm

Input:
  V  // a vector
  C  // the set of categories
Output:
  c_v  // the category of the vector V
Steps:
  for each c_i in C:
    1) nb = number of times V has category c_i in the train dataset
    2) N  = number of times V has any category in the train dataset
    3) beta_i = nb / N   // the probability that V has category c_i
    4) if beta_i is the maximum probability obtained: c_v = c_i

• Learning the "hasType_b" and "hasType_l" relations: these relations are predicted using simple rules. In effect, once the category is predicted, a set of rules matches each element of the taxonomy to the corresponding type. The algorithms are presented by listing 6 for predicting the boolean type and listing 7 for predicting the literal type.

Listing 6: hasType_b algorithm

// We use simple rules
Input:
  V  // a vector
Output:
  boolCat  // a boolean telling whether the category is boolean or not
Steps:
  if question.category = "Boolean":
    boolCat = Boolean

Listing 7: hasType_l algorithm

Input:
  C         // category of a question
  literals  // list of literals
  V         // embedded vector
Output:
  t  // type of the embedded vector
Steps:
  If category(C, V) = Literal     // check if the category of the vector is a literal
    // search the literal in the list of literals that matches the vector V
    t = matchLiteral(literals, V)

• Learning the "hasType_r" relation: for this relation, we combine the internal knowledge graph refinement method with an external one. We define the embedding vector of the question; if it matches a resource provided by the SMART organizers, this resource is saved. If not, we run a SPARQL query on the online graph in order to get the matching resource. The algorithm is given by listing 8, and example SPARQL queries are given by listing 9 for DBpedia and listing 10 for Wikidata.

Listing 8: hasType_r algorithm

Input:
  V : Vector    // the embedding vector of a question
  T : Taxonomy  // the taxonomy
Output:
  R : List<type>  // list of types of the KG
Steps:
  1) Search inside the taxonomy T for the embedding vector that has the same value as V
  2) If the vector found in step (1) has one or multiple type values defined inside
     our KG, insert them into the list R
  3) Else
     3.1) W = ToString(V)
     3.2) Search on the external KG the entities that match W, using the Wikibase API
     3.3) With the result of step (3.2), execute a SPARQL query to get the associated types
     3.4) Insert the types found in step (3.3) into R
  4) Return R
// The ToString(V) function returns a String obtained by concatenating each term of V

Listing 9: SPARQL query of step (3.3) of the hasType_r algorithm for the DBpedia KG

PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbr:  <http://dbpedia.org/resource/>
PREFIX dbo:  <http://dbpedia.org/ontology/>
SELECT DISTINCT ?type WHERE {
  <'+ Search_parameter +'> rdf:type ?type
  FILTER strstarts(str(?type), str(dbo:))
}
// Search_parameter must be replaced by the entities found in step (3.2)
// of the hasType_r algorithm

Listing 10: SPARQL query of step (3.3) of the hasType_r algorithm for the Wikidata KG

#defaultView:Table
PREFIX bd:       <http://www.bigdata.com/rdf#>
PREFIX mwapi:    <https://www.mediawiki.org/ontology#API/>
PREFIX wdt:      <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX rdfs:     <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?type ?typeLabel WHERE {
  wd:'+ Search_parameter +' (wdt:P279 | wdt:P31) ?type .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
ORDER BY ASC(?type)
LIMIT 10
// Search_parameter must be replaced by the entities found in step (3.2)
// of the hasType_r algorithm
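As a companion to listings 8-10, the following sketch (ours) approximates steps (3.2) and (3.3) for Wikidata using the public wbsearchentities action of the MediaWiki API and the public WDQS endpoint; the URLs are the standard public ones and the error handling is deliberately minimal.

import requests

WIKI_API = "https://www.wikidata.org/w/api.php"
WDQS = "https://query.wikidata.org/sparql"

def search_entity(label: str):
    """Step 3.2: find a Wikidata entity id matching the textual form of V."""
    params = {"action": "wbsearchentities", "search": label,
              "language": "en", "format": "json"}
    hits = requests.get(WIKI_API, params=params).json()["search"]
    return hits[0]["id"] if hits else None

def entity_types(qid: str, limit: int = 10) -> list:
    """Step 3.3: fetch the types of the entity via wdt:P279|wdt:P31 (listing 10)."""
    query = f"""
    SELECT DISTINCT ?type WHERE {{
      wd:{qid} (wdt:P279|wdt:P31) ?type .
    }} ORDER BY ASC(?type) LIMIT {limit}
    """
    resp = requests.get(WDQS, params={"query": query, "format": "json"},
                        headers={"User-Agent": "AT-prediction-sketch/0.1"})
    rows = resp.json()["results"]["bindings"]
    return [r["type"]["value"] for r in rows]

qid = search_entity("Azerbaijan")        # e.g. 'Q227'
print(entity_types(qid) if qid else [])  # list of Wikidata type IRIs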
To extract knowledge from the questions, we used the TF-IDF technique: we identify the less frequent words in the questions and remove them, so as to keep the information that best represents each question. This information was used to define the f_hasEmbedding and f_isCloseTo models. Thereafter, we defined a function that calculates the distance between two questions: it subtracts the terms that are identical and counts the remaining terms. If this count is less than β, the two nodes are close.

To train the f_hasCategory model, the DBpedia and Wikidata datasets were divided into training (80%) and testing (20%) sets. The 20% test split was selected proportionally to the distribution of each data category. From the 80% training split, we removed all the answer categories and answer types that were provided as responses to the Answer Type prediction question.
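A small sketch (ours) of this proportional 80/20 split, grouping questions by answer category so that the test split preserves the category distribution:

import random
from collections import defaultdict

def stratified_split(dataset: list, test_ratio: float = 0.2, seed: int = 42):
    """Split question records into train/test, keeping the proportion of each
    answer category identical in both parts."""
    random.seed(seed)
    by_category = defaultdict(list)
    for q in dataset:
        by_category[q.get("category", "unknown")].append(q)
    train, test = [], []
    for items in by_category.values():
        random.shuffle(items)
        cut = int(len(items) * test_ratio)
        test.extend(items[:cut])
        train.extend(items[cut:])
    return train, test

# train, test = stratified_split(load_dataset("smarttask_dbpedia_train.json"))
# (load_dataset as in the sketch of Section 2; the file name is hypothetical)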
Using the training datasets, we computed the function f_hasCategory (see equation 5 and the algorithm of listing 5) for each category. This is done by calculating the frequency of appearance of a node embedding V classified in category C, divided by the number of appearances of the node embedding V with any category. Once trained on the training dataset, we applied the model to the test dataset and obtained an accuracy of 0.89 for DBpedia and 0.84 for Wikidata. These values were low compared to related work results, so we decided to refine the model. Given that the first taxonomy (see Fig. 3) did not permit good accuracy, we decided to refine the taxonomy.

The refinement was done by augmenting the taxonomy of Fig. 3 with more terms used to identify the question category. To this end, the question2vect algorithm (see listing 2) was used to obtain a larger taxonomy (see Fig. 5) of 67 classes. On the other hand, we wanted to test a model other than the statistical one, so we chose to test a Naive Bayes classifier with the same approach. The statistical scoring function and the Bayes function were applied to the training dataset, and we obtained a maximum accuracy of 0.95 on DBpedia and 0.93 on Wikidata.

Figure 5: Knowledge graph-based representation of the dataset

To finalize this work, we applied our approach to the test data provided by the SMART 3.0 organizers and analyzed our results. The results of the different models are presented by Table 2 for DBpedia and Table 3 for Wikidata. These tables show that the more the taxonomy is extended, the better the results are.

Table 2: Results of the Answer Type prediction on DBpedia test data. For resource types, each scoring function is combined with KG matching.

Scoring function                                         | Accuracy | NDCG@5 | NDCG@10
Statistic_-WH (taxonomy = 15 classes)                    | 0.881    | 0.346  | 0.303
Statistic_-WH (taxonomy = 15 classes + other=resources)  | 0.890    | 0.400  | 0.357
Statistic_-WH (taxonomy = 50 classes + other=resources)  | 0.957    | 0.320  | 0.293
Statistic_-WH (taxonomy = 67 classes + other=resources)  | 0.964    | 0.435  | 0.392
Bayes                                                    | 0.954    | 0.432  | 0.389

Table 3: Results of the Answer Type prediction on Wikidata test data (KG matching approach)

Accuracy | MRR
0.94     | 0.15

The parameters of our models (see Table 2) are defined as follows:
• Statistic_-WH (taxonomy = 15 classes): uses the question taxonomy of Fig. 3;
• Statistic_-WH (taxonomy = 15 classes + other=resources): uses the question taxonomy of Fig. 3, in which all the questions that cannot be embedded to a node of the taxonomy are classified in the "resource" category. In effect, during the study of the dataset, we found that most questions (more than 80%) without an embedding vector were of the "resource" category;
• Statistic_-WH (taxonomy = 50 classes + other=resources): uses the question taxonomy extended to 50 classes;
• Statistic_-WH (taxonomy = 67 classes + other=resources): uses the question taxonomy extended to 67 classes;
• Bayes: uses Naive Bayes as the scoring function.

5. Conclusion

To solve the Answer Type task, we proposed in this paper a model based on Knowledge Graph refinement techniques. It considers that the prediction of the answer category and answer type can be reduced to the prediction of links in a knowledge graph. This method was applied to the DBpedia and Wikidata datasets provided by the SMART organizers. To predict the answer category, we used two scoring functions: one built using statistical learning and a second built using Naive Bayes. The evaluation on the test data provided by the organizers gave a maximum accuracy of 0.96 for DBpedia and a category_match of 0.94 for Wikidata. Concerning the prediction of the answer type, we combined Naive Bayes with rule-based graph matching techniques. The evaluation on the DBpedia test data gave NDCG@5 = 0.435 and NDCG@10 = 0.392; concerning Wikidata, the MRR is 0.15.
Given that these results are low compared to past challenge results, we are currently building a state of the art of machine learning methods and models that can improve link prediction in knowledge graphs. The results of this survey will be used to compare the different methods and come up with the best method for link prediction in question answering systems.

Acknowledgments

We would like to thank the ISWC conference, the SMART challenge organizers, and all the reviewers. This work was partially supported by Neuralearn.ai (Start-up, Cameroon).

References

[1] N. Mihindukulasooriya, M. Dubey, A. Gliozzo, J. Lehmann, A.-C. N. Ngomo, R. Usbeck, SeMantic AnsweR Type prediction task (SMART) at ISWC 2020 Semantic Web Challenge, CoRR abs/2012.00555 (2020). URL: https://arxiv.org/abs/2012.00555.
[2] H. Paulheim, Knowledge graph refinement: A survey of approaches and evaluation methods, Semantic Web 8 (2017) 489–508. doi:10.3233/SW-160218.
[3] A. Jiomekong, G. Camara, M. Tchuente, Extracting ontological knowledge from Java source code using Hidden Markov Models, Open Computer Science 9 (2019) 181–199.
[4] A. Bordes, N. Usunier, A. García-Durán, J. Weston, O. Yakhnenko, Translating embeddings for modeling multi-relational data, in: Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS'13, Curran Associates Inc., Red Hook, NY, USA, 2013, pp. 2787–2795.