=Paper=
{{Paper
|id=Vol-3337/smart3
|storemode=property
|title=Towards an Approach based on Knowledge Graph Refinement for Answer Type prediction (full paper)
|pdfUrl=https://ceur-ws.org/Vol-3337/smart-paper3.pdf
|volume=Vol-3337
|authors=Azanzi Jiomekong,Vadel Tsague,Brice Foko,Uriel Melie,Gaoussou Camara
|dblpUrl=https://dblp.org/rec/conf/semweb/JiomekongTFMC22
}}
==Towards an Approach based on Knowledge Graph Refinement for Answer Type prediction (full paper)==
Towards an Approach based on Knowledge Graph Refinement for Answer Type prediction

Azanzi Jiomekong¹, Vadel Tsague¹, Brice Foko¹, Uriel Melie¹ and Gaoussou Camara²
¹ Department of Computer Science, University of Yaounde I, Yaounde, Cameroon
² Unité de Formation et de Recherche en Sciences Appliquées et des TIC, Université Alioune Diop de Bambey, Bambey, Sénégal

Abstract
This paper presents our contribution to the SMART 3.0 Answer Type (AT) prediction research problem. This contribution consists of a method for Answer Type prediction that can be applied to any Knowledge Graph. Inspired by the TransE algorithm for link prediction in knowledge graphs, we defined our models. Concerning category prediction, we trained the model on 80% of the training dataset furnished by the organizers and tested it on the remaining 20%, using different model parameters. Applied to the test data provided by the organizers, we obtained a maximum accuracy of 0.96 on the DBpedia dataset and a category_match value of 0.94 on the Wikidata dataset.

Keywords
Question Answering, Answer Type Prediction, Knowledge Graphs, Knowledge Graph Refinement, Wikidata, DBpedia

1. Introduction

Question Answering (QA) systems aim to respond to natural language questions of users. To this end, Knowledge Base Question Answering (KBQA) is required to understand the meaning of natural language questions and furnish reliable responses. KBQA is thus a popular task in the fields of Natural Language Processing (NLP) and Information Retrieval (IR); it aims to answer a natural language question using the facts in a Knowledge Base [1]. To build efficient QA systems, several subtasks should be considered. While SMART 2020¹ focused on the Answer Type (AT) subtask and SMART 2021² on the Answer Type and Relation Linking (RL) subtasks, SMART 3.0³ considers the Answer Type prediction, Relation Linking, and Entity Linking (EL) subtasks. During the SMART 3.0 challenge, we addressed all three subtasks; this paper is devoted to our solution to the Answer Type prediction problem.

Answer Type prediction consists of the classification of the expected answer type. In effect, knowing the AT enables the system to significantly narrow down the search space for an answer to a given question [1]. For example, given the question "Where was Maurice Kamto born?", the predicted answer type makes it possible to reduce the search space from 67 candidate answers to 7 entities, instead of querying all the 1-hop entities of the Maurice Kamto resource in DBpedia (dbr:Maurice_Kamto). Table 1 illustrates this example.

¹ https://smart-task.github.io/2020/
² https://smart-task.github.io/2021/
³ https://smart-task.github.io/2022/

Question with predicted answer type:

{ "question": "Where was Maurice Kamto born?",
  "category": "resource",
  "type": ["dbo:Country", "dbo:PopulatedPlace", "dbo:Place",
           "dbo:Location", "dbo:Settlement", "dbo:Village"] }

SPARQL query over DBpedia (without the answer type):

SELECT DISTINCT ?type ?o ?p WHERE {
  dbr:Maurice_Kamto ?p ?o .
  ?o rdf:type ?type .
  FILTER strstarts(str(?type), str(dbo:))
}
Result: 67 results

SPARQL query with the AT predicted:

SELECT DISTINCT ?type WHERE {
  dbr:Maurice_Kamto dbo:birthPlace ?o .
  ?o rdf:type ?type .
  FILTER strstarts(str(?type), str(dbo:))
}
Result: 7 results

Table 1: SPARQL example of how AT prediction can help to reduce the solution space over the DBpedia dataset
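To make this example concrete, the following is a minimal Python sketch (ours, not part of the paper's published code) that runs both queries of Table 1 against the public DBpedia endpoint and counts the candidate answers. It assumes the third-party requests library, and the exact counts may drift as DBpedia evolves.

import requests

# Public DBpedia SPARQL endpoint (availability and counts may vary over time).
ENDPOINT = "https://dbpedia.org/sparql"

PREFIXES = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
"""

# All dbo: types reachable through any 1-hop neighbour (no answer type known).
WITHOUT_AT = PREFIXES + """
SELECT DISTINCT ?type ?o ?p WHERE {
  dbr:Maurice_Kamto ?p ?o .
  ?o rdf:type ?type .
  FILTER strstarts(str(?type), str(dbo:))
}
"""

# The same search restricted by the predicted answer type (a birth place).
WITH_AT = PREFIXES + """
SELECT DISTINCT ?type WHERE {
  dbr:Maurice_Kamto dbo:birthPlace ?o .
  ?o rdf:type ?type .
  FILTER strstarts(str(?type), str(dbo:))
}
"""

def count_results(query: str) -> int:
    """Run a SELECT query and return the number of result rows."""
    resp = requests.get(ENDPOINT, params={
        "query": query, "format": "application/sparql-results+json"})
    resp.raise_for_status()
    return len(resp.json()["results"]["bindings"])

print("without AT:", count_results(WITHOUT_AT))  # 67 in the paper's example
print("with AT:   ", count_results(WITH_AT))     # 7 in the paper's example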
" , Result: 67 results " category " : " resource Result: 7 results ", " type " : [ " dbo : Country " , " dbo : PopulatePlace ", " dbo : P l a c e " , " dbo : L o c a t i o n " , " dbo : S e t t l e m e n t " , " dbo : v i l l a g e " ] } Table 1 SPARQL example on how AT prediction can help to reduce the solution space over DBpedia dataset SMART 2020 was devoted to the AT task and SMART 2021 to AT and RL tasks. Fourteen papers were proposed for the AT task since 2020 - all these papers were using DBpedia and 6 were using DBpedia and Wikidata. DBpedia dataset were used by all the participants during the year 2020 and 2021. Three participants (37.5%) were using Wikidata in 2020 and 3 (50%) in 2021. Past participants were having good accuracy (some reaching 90%) and good NDCG and mrr (some reaching 80%). Thus, our aim in this research is to augment the repository of solutions to the AT task by proposing new methods. To solve the AT task, SMART 3.0 provides us with two versions of large-scale dataset: one dataset is the DBpedia dataset (composed of 760 classes) and the second one is the Wikidata dataset (composed of 50K classes). Using these datasets we are proposing a solution to the AT task. This solution is based on Knowledge Graphs refinement techniques [2]. In effect, whatever the method used to construct KGs, it is known that they will never be perfect. Thus, to increase their utility, various refinement methods are proposed. These solutions are used to predict missing knowledge such as missing entities, missing types of entities, and/or missing relations that exist between entities [2]. Internal KG refinement techniques aim to complete the KG by using information hidden in the graph and external methods aim to use external knowledge sources such as text corpora to complete the graph. In this paper, we are proposing the use of internal method for category prediction and for the prediction of literal types (number, string, date) and Booleans types. For the prediction of types of resources, we are using external methods. To implement our solutions, we needed to set up a development environment. Thus, we used: • Operating system: Ubuntu, Figure 1: Data distribution for each category and literal for DBpedia • Programming languages: JavaScript and Python, • Python libraries: Json4 , CSV5 , • JavaScript library: Natural6 . The source code is available on GitHub7 . We started this work with the analysis of the dataset (Section 2). Thereafter, we processed to the model definition (Section 3) and the model training and use (Section 4). We finalize with the conclusion (Section 5). 2. Analysis of the dataset The aim of this step was to gather, portray and provide a deeper understanding of the structure of the dataset so as to provide an efficient solution. As presented by the organizers datasets are designed to provide a single answer category either "boolean", "literal", or "resource". They assign the "boolean" category as boolean, "literal" category into an answer type either "number", "date", or "string". For resource categories, DBpedia and Wikidata ontologies classes are used to identify the answer type. To understand the dataset, we thought it necessary to investigate a subset of this dataset. Thus, fifty questions for each dataset were randomly selected and their annotations automatically removed. Once the annotations were removed, a manual annotation of this subset of the dataset took place. 
The manual analysis of the dataset gave us the insight that the "resource", "literal" and "boolean" categories are not equally distributed in the dataset. The analysis of the distribution by category, presented by Fig. 1 for DBpedia and Fig. 2 for Wikidata, confirmed this insight. The main findings on this dataset were that:
• Data are not equally distributed amongst the different categories (see Fig. 1 for DBpedia and Fig. 2 for Wikidata);
• The question types can be divided into Wh-, How many, Of which, Did, etc. questions.

⁴ https://docs.python.org/3/library/json.html
⁵ https://docs.python.org/3/library/csv.html
⁶ https://github.com/NaturalNode/natural
⁷ https://github.com/jiofidelus/tsotsa/tree/SMART_22

Figure 1: Data distribution for each category and literal for DBpedia
Figure 2: Data distribution for each category and literal for Wikidata
Figure 3: Taxonomy of types of questions built during the analysis of the dataset

Given the findings of this analysis, we built the question taxonomy presented by Fig. 3. This taxonomy will be used to build and refine the model.

3. Model definition

The aim of the model definition step was to define a model that can be applied to any dataset to solve the Answer Type prediction research problem. The first thing we did was to define a graph structure that can represent any Question Answering dataset (see Fig. 4). This consists of matching the taxonomy of Fig. 3 to the corresponding answer types and answer categories, and thereafter matching the set of questions to the question taxonomy. The idea is that if a question is linked to a node in the taxonomy, and this node is linked to a category and/or an answer type, then this question is linked to this category and/or answer type. Thus, we define the "hasEmbedding" relation between a question and its embedding in the taxonomy. To define this relation, knowledge should be extracted from the question. Knowledge extraction is the creation of knowledge from structured (relational databases, XML), semi-structured (source code) and unstructured (text, documents, images) sources [3]. In our case, we are faced with extracting terms from unstructured sources, namely questions.

Figure 4: Knowledge graph-based representation of the dataset

From Fig. 4, we define the Question Answering dataset as multi-relational data using a knowledge graph defined by the quadruple G = (V, E, R, T) where:
• V = {v_i} is the set of nodes. These nodes are questions, embedding vectors, question categories, and question types;
• E = {(v_i, r, v_j)} is the set of edges, which are triples in which node v_i is connected to node v_j via the relation r. These are the relations between two questions, or between a question and its type or its category;
• T = {t(v_i)} denotes the node types: category and type nodes are typed by their labels, while question nodes are typed by their id and the title of the question;
• R = {r} denotes the relation types, i.e., the labels of the relations.
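Before detailing the relations, here is a minimal illustration (a sketch of ours, with made-up node labels, not the authors' implementation) of how the quadruple G = (V, E, R, T) can be materialized with plain Python structures:

from dataclasses import dataclass, field

@dataclass
class KG:
    """A tiny in-memory knowledge graph G = (V, E, R, T)."""
    nodes: dict = field(default_factory=dict)    # V with T: node id -> node type
    triples: set = field(default_factory=set)    # E: (v_i, r, v_j)
    relations: set = field(default_factory=set)  # R: relation labels

    def add_node(self, node_id: str, node_type: str) -> None:
        self.nodes[node_id] = node_type

    def add_triple(self, v_i: str, r: str, v_j: str) -> None:
        self.relations.add(r)
        self.triples.add((v_i, r, v_j))

g = KG()
g.add_node("Q1", "question")          # question node, typed by id + title in the paper
g.add_node("vec_where", "embedding")  # embedding vector node from the taxonomy
g.add_node("resource", "category")    # category node, typed by its label
g.add_triple("Q1", "hasEmbedding", "vec_where")
g.add_triple("vec_where", "hasCategory", "resource")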
The following relations are used to link the different nodes in the graph of Fig. 4:
• hasEmbedding: this relation defines how close a question is to an embedding vector defined in the question taxonomy. It is very useful because two questions that have the same embedding in the taxonomy will have the same category.
• isCloseTo: this relation links two questions that have the same embedding. This information is particularly important during the training of the model: in effect, two questions that are close to each other share the same category.
• hasCategory: this relation links a node (either a question or an embedding vector of the taxonomy) to its category;
• hasType_b: this relation models the relation of the "boolean" category to its type, which is "boolean";
• hasType_l: this relation models the relation of the "literal" category to its type, which can be "number", "string" or "date";
• hasType_r: this relation models the relation of the "resource" category to its type in the knowledge graph.

Given this new configuration of the training dataset, we reduced the problem of Answer Type prediction to the problem of KG completion. To solve this problem, our goal is to provide an efficient tool that completes the graph by adding new facts without requiring extra knowledge. Inspired by the TransE algorithm [4] for learning low-dimensional embeddings of entities, we defined our model, presented by equation 1. We consider that relationships are represented as translations: for instance, if (Q1, isCloseTo, Q2) holds, then the embedding of Q1 should be close to the embedding of Q2. The model is a function that takes a node and a relation and predicts all the nodes connected through this relation, based on the proximity between nodes.

f_r : V × R × V → ℝ
f_r(v_i, r, v_j) = ||v_i + v_r − v_j||        (1)
v_i + v_r ≈ v_j if f_r(v_i, r, v_j) ≈ β

The function f_r is used to verify whether a triple of the knowledge graph holds: the triple (v_i, r, v_j) holds if f_r(v_i, r, v_j) ≈ β, in which case v_i + v_r ≈ v_j, β being a model parameter. If the score approaches β (or falls within a given interval of values), we say that v_i and v_j are linked by the relation r.
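A minimal numeric sketch of this scoring function (ours; the toy vectors and the threshold reading of f_r ≈ β are illustrative):

import numpy as np

def f_r(v_i: np.ndarray, v_r: np.ndarray, v_j: np.ndarray) -> float:
    """Equation 1: score of the triple (v_i, r, v_j) as ||v_i + v_r - v_j||."""
    return float(np.linalg.norm(v_i + v_r - v_j))

def triple_holds(v_i, v_r, v_j, beta=0.5):
    """The paper accepts a triple when f_r is approximately beta; a simple
    reading (assumed here) is a threshold test on the distance."""
    return f_r(v_i, v_r, v_j) <= beta

# Toy vectors: two almost-identical question embeddings, and isCloseTo
# modeled as a near-zero translation vector.
q1 = np.array([1.0, 0.0, 1.0])
q2 = np.array([1.0, 0.1, 1.0])
r_is_close_to = np.zeros(3)
print(triple_holds(q1, r_is_close_to, q2))  # True: q1 + r is close to q2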
The model we just defined is based on node embeddings, the nodes being the questions, the question categories, and the question types. Thus, we must model these nodes in such a way that they can be used to train our models to predict relations. This is done by encoding the questions so as to simplify the definition of (Question, hasEmbedding, Vector) triples. We extracted knowledge from the question title and used this knowledge to define the embedding vector of each question. To this end, we assume that a question can be well represented by some terms (words or groups of words) in its title, and we define the question2vect function (equation 2), which takes a question and returns a vector containing the terms used as the embedding of this question.

question2vect(Q) = (t_1, t_2, ..., t_n)        (2)

In equation 2, t_i (i ≤ n) denotes the terms (words or groups of words) of the question title that can be representative of the question. For category prediction, these can be the "WH-", "Is", or "How many" terms used to identify the category of a question. For instance, using the taxonomy of Fig. 3, the size of the vectors returned by question2vect is 15. Using this taxonomy, listing 1 provides some examples of the evaluation of the question2vect function on questions of the dataset, and listing 2 presents the algorithm executed to transform a question into a vector of terms.

Listing 1: Example of the transformation of some questions into a vector using the taxonomy of questions

question2vect("What are the awards won by the producer of Puss in Boots?")
  = (What, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
question2vect("Who are the gymnasts coached by Amanda Reddin?")
  = (0, 0, 0, 0, Who, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
question2vect("How many superpowers does wonder woman have?")
  = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, How many, 0, 0, 0)
question2vect("Is Azerbaijan a member of European Go Federation?")
  = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, Is, 0, 0)

Listing 2: Question2vect algorithm

Input:
  Q             // the question
  lessFrequent  // the less frequent words in the train dataset
  T             // the current taxonomy
Output:
  V             // the vector of terms
Steps:
  1- listOfWord = tokenize(Q)        // tokenize the question to obtain the list
                                     // of words it contains
  2- for w in listOfWord:
       if w in lessFrequent: delete w
  3- V = buildVector(listOfWord, T)  // build the vector, filling the empty slots
                                     // with 0 so that the size of the vector
                                     // corresponds to |T|

4. Models training and use

We used equation 1 to define a specific model for each relation.

• Learning the "hasEmbedding" relation: the function f_hasEmbedding is given by equation 3.

f_hasEmbedding(Q, hasEmbedding, V) = |Q − V| = α        (3)

Globally, a node Q has embedding node V if α ≈ β, β being the model parameter defined during the training process. Listing 3 presents the algorithm used to determine the embedding vector of a question.

Listing 3: hasEmbedding algorithm

Input:
  Q  // a question
  V  // an embedding vector
Output:
  hasEmbedding = false  // a boolean telling whether question Q has embedding vector V
Steps:
  1) W = question2vect(Q)
  2) If ||W − V|| < beta   // compare the number of differing words between W and V
       hasEmbedding = true
     else
       hasEmbedding = false

• Learning the "isCloseTo" relation: the function f_isCloseTo is given by equation 4.

f_isCloseTo(Q_i, isCloseTo, Q_j) = |Q_i − Q_j| = θ        (4)

Globally, a question Q_1 is close to a question Q_2 if θ ≈ γ, γ being the model parameter defined during the training process. The algorithm of listing 4 links two questions that are close to each other.

Listing 4: isCloseTo algorithm

Input:
  Q_1, Q_2  // two questions
Output:
  isCloseTo = false  // a boolean telling whether question Q_1 is close to question Q_2
Steps:
  1) theta = 0
  2) Remove the special characters inside Q_1 and Q_2: \ , / " # : %
  3) Put all the words inside Q_1 and Q_2 in lowercase
  4) Tokenize Q_1 and Q_2
  5) For each token t_1 of Q_1:
       if t_1 is inside the token list of Q_2: theta = theta + 1
  6) If theta >= gamma: isCloseTo = true   // enough shared tokens means the
                                           // two questions are close
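Before moving to category prediction, here is a short runnable sketch (ours) that mirrors listings 3 and 4 with a token-overlap distance; the thresholds beta and gamma stand in for the values fixed during training.

import re

def tokenize(question: str) -> set:
    """Lowercase, drop the special characters of listing 4, and split."""
    cleaned = re.sub(r'[\\,/"#:%?]', " ", question.lower())
    return set(cleaned.split())

def has_embedding(w: set, v: set, beta: int = 2) -> bool:
    """Listing 3, reading ||W - V|| as the number of differing terms
    between the question vector W and the taxonomy vector V."""
    return len(w ^ v) < beta

def is_close_to(q1: str, q2: str, gamma: int = 3) -> bool:
    """Listing 4: count the tokens of Q_1 found in Q_2 and declare the
    questions close when enough tokens are shared."""
    theta = len(tokenize(q1) & tokenize(q2))
    return theta >= gamma

print(is_close_to("Who are the gymnasts coached by Amanda Reddin?",
                  "Who are the players coached by Jose Mourinho?"))  # True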
• Learning the "hasCategory" relation: to predict the "hasCategory" relation, we consider that each embedding node has a probability of belonging to each category. The scoring function defining this probability can be obtained using statistical learning, Naïve Bayes, etc. For instance, with statistical learning, the scoring function is given by equation 5: to find whether a node V hasCategory node C, we count the number of times node V hasCategory node C in the training dataset and divide it by the number of times node V hasCategory any category.

f_hasCategory(V, hasCategory, C) = ( Σ_{i=1..N} freq((V, C), Train) ) / ( Σ_{i=1..N} freq((V, *), Train) )        (5)

Globally, an embedding vector V has category C if C is the category with the greatest probability. The algorithm is presented by listing 5.

Listing 5: hasCategory algorithm

Input:
  V  // a vector
  C  // the set of categories
Output:
  c_v  // the category of the vector V
Steps:
  for each c_i in C:
    1) nb = number of times V has category c_i in the train dataset
    2) N  = number of times V has any category in the train dataset
    3) beta_i = nb / N   // the probability that V has category c_i
    4) if beta_i is the maximum probability obtained: c_v = c_i

• Learning the "hasType_b" and "hasType_l" relations: these relations are predicted using simple rules. In effect, once the category is predicted, a set of rules matches each element of the taxonomy to the corresponding type. The algorithms are presented by listing 6 for predicting the boolean type and listing 7 for predicting the literal type.

Listing 6: hasType_b algorithm

// We use simple rules
Input:
  V  // a vector
Output:
  boolCat  // a boolean telling whether the category is boolean or not
Steps:
  if question.category = "Boolean":
    boolCat = Boolean

Listing 7: hasType_l algorithm

Input:
  C         // category of a question
  literals  // list of literals
  V         // embedded vector
Output:
  t  // type of the embedded vector
Steps:
  If category(C, V) = Literal     // check if the category of the vector is a literal
    // search the literal in the list of literals that matches the vector V
    t = matchLiteral(literals, V)

• Learning the "hasType_r" relation: for this relation, we combine the internal knowledge graph refinement method with an external one. We define the embedding vector of the question; if it matches a resource provided by the SMART organizers, this resource is saved. If not, we run a SPARQL query on the online graph in order to get the matching resource. The algorithm is given by listing 8, and example SPARQL queries are given by listing 9 for DBpedia and listing 10 for Wikidata.

Listing 8: hasType_r algorithm

Input:
  V : Vector    // the embedding vector of a question
  T : Taxonomy  // the taxonomy
Output:
  R : List<type>  // list of types of the KG
Steps:
  1) Search inside the taxonomy T for the embedding vector that has the same value as V
  2) If the vector found in step (1) has one or multiple type values defined inside
     our KG, insert them into the list R
  3) Else
     3.1) W = ToString(V)
     3.2) Search on the external KG the entities that match W, using the Wikibase API
     3.3) With the result of step (3.2), execute a SPARQL query to get the associated types
     3.4) Insert the types found in step (3.3) into R
  4) Return R
// The ToString(V) function returns a String obtained by concatenating each term of V

Listing 9: SPARQL query of step (3.3) of the hasType_r algorithm for the DBpedia KG

PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbr:  <http://dbpedia.org/resource/>
PREFIX dbo:  <http://dbpedia.org/ontology/>
SELECT DISTINCT ?type WHERE {
  <'+ Search_parameter +'> rdf:type ?type
  FILTER strstarts(str(?type), str(dbo:))
}
// Search_parameter must be replaced by the entities found in step (3.2)
// of the hasType_r algorithm

Listing 10: SPARQL query of step (3.3) of the hasType_r algorithm for the Wikidata KG

#defaultView:Table
PREFIX bd:       <http://www.bigdata.com/rdf#>
PREFIX mwapi:    <https://www.mediawiki.org/ontology#API/>
PREFIX wdt:      <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX rdfs:     <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?type ?typeLabel WHERE {
  wd:'+ Search_parameter +' (wdt:P279 | wdt:P31) ?type .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
ORDER BY ASC(?type)
LIMIT 10
// Search_parameter must be replaced by the entities found in step (3.2)
// of the hasType_r algorithm
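As a companion to listings 8-10, the following sketch (ours) approximates steps (3.2) and (3.3) for Wikidata using the public wbsearchentities action of the MediaWiki API and the public WDQS endpoint; the URLs are the standard public ones and the error handling is deliberately minimal.

import requests

WIKI_API = "https://www.wikidata.org/w/api.php"
WDQS = "https://query.wikidata.org/sparql"

def search_entity(label: str):
    """Step 3.2: find a Wikidata entity id matching the textual form of V."""
    params = {"action": "wbsearchentities", "search": label,
              "language": "en", "format": "json"}
    hits = requests.get(WIKI_API, params=params).json()["search"]
    return hits[0]["id"] if hits else None

def entity_types(qid: str, limit: int = 10) -> list:
    """Step 3.3: fetch the types of the entity via wdt:P279|wdt:P31 (listing 10)."""
    query = f"""
    SELECT DISTINCT ?type WHERE {{
      wd:{qid} (wdt:P279|wdt:P31) ?type .
    }} ORDER BY ASC(?type) LIMIT {limit}
    """
    resp = requests.get(WDQS, params={"query": query, "format": "json"},
                        headers={"User-Agent": "AT-prediction-sketch/0.1"})
    rows = resp.json()["results"]["bindings"]
    return [r["type"]["value"] for r in rows]

qid = search_entity("Azerbaijan")        # e.g. 'Q227'
print(entity_types(qid) if qid else [])  # list of Wikidata type IRIs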
To extract knowledge from the questions, we used the TF-IDF technique: we identify the less frequent words in the questions and remove them, so as to keep the information that best represents each question. This information was used to define the f_hasEmbedding and f_isCloseTo models. Thereafter, we defined a function that calculates the distance between two questions: it subtracts the terms that are identical and counts the remaining terms. If this count is less than β, the two nodes are close.

To train the f_hasCategory model, the DBpedia and Wikidata datasets were divided into training (80%) and testing (20%) sets. The 20% test split was selected proportionally to the distribution of each data category. From the 80% training split, we removed all the answer categories and answer types that were provided as responses to the Answer Type prediction question.
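A small sketch (ours) of this proportional 80/20 split, grouping questions by answer category so that the test split preserves the category distribution:

import random
from collections import defaultdict

def stratified_split(dataset: list, test_ratio: float = 0.2, seed: int = 42):
    """Split question records into train/test, keeping the proportion of each
    answer category identical in both parts."""
    random.seed(seed)
    by_category = defaultdict(list)
    for q in dataset:
        by_category[q.get("category", "unknown")].append(q)
    train, test = [], []
    for items in by_category.values():
        random.shuffle(items)
        cut = int(len(items) * test_ratio)
        test.extend(items[:cut])
        train.extend(items[cut:])
    return train, test

# train, test = stratified_split(load_dataset("smarttask_dbpedia_train.json"))
# (load_dataset as in the sketch of Section 2; the file name is hypothetical)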
Using the training datasets, we computed the function f_hasCategory (see equation 5 and the algorithm of listing 5) for each category. This is done by calculating the frequency of appearance of a node embedding V classified in category C, divided by the number of appearances of the node embedding V with any category. Once trained on the training dataset, we applied the model to the test dataset and obtained an accuracy of 0.89 for DBpedia and 0.84 for Wikidata. These values were low compared to related work results, so we decided to refine the model. Given that the first taxonomy (see Fig. 3) did not permit good accuracy, we decided to refine the taxonomy.

The refinement was done by augmenting the taxonomy of Fig. 3 with more terms used to identify the question category. To this end, the question2vect algorithm (see listing 2) was used to obtain a larger taxonomy (see Fig. 5) of 67 classes. On the other hand, we wanted to test a model other than the statistical one, so we chose to test a Naive Bayes classifier with the same approach. The statistical scoring function and the Bayes function were applied to the training dataset, and we obtained a maximum accuracy of 0.95 on DBpedia and 0.93 on Wikidata.

Figure 5: Knowledge graph-based representation of the dataset

To finalize this work, we applied our approach to the test data provided by the SMART 3.0 organizers and analyzed our results. The results of the different models are presented by Table 2 for DBpedia and Table 3 for Wikidata. These tables show that the more the taxonomy is extended, the better the results are.

Table 2: Results of the Answer Type prediction on DBpedia test data. For resource types, each scoring function is combined with KG matching.

Scoring function                                         | Accuracy | NDCG@5 | NDCG@10
Statistic_-WH (taxonomy = 15 classes)                    | 0.881    | 0.346  | 0.303
Statistic_-WH (taxonomy = 15 classes + other=resources)  | 0.890    | 0.400  | 0.357
Statistic_-WH (taxonomy = 50 classes + other=resources)  | 0.957    | 0.320  | 0.293
Statistic_-WH (taxonomy = 67 classes + other=resources)  | 0.964    | 0.435  | 0.392
Bayes                                                    | 0.954    | 0.432  | 0.389

Table 3: Results of the Answer Type prediction on Wikidata test data (KG matching approach)

Accuracy | MRR
0.94     | 0.15

The parameters of our models (see Table 2) are defined as follows:
• Statistic_-WH (taxonomy = 15 classes): uses the question taxonomy of Fig. 3;
• Statistic_-WH (taxonomy = 15 classes + other=resources): uses the question taxonomy of Fig. 3, in which all the questions that cannot be embedded to a node of the taxonomy are classified in the "resource" category. In effect, during the study of the dataset, we found that most questions (more than 80%) without an embedding vector were of the "resource" category;
• Statistic_-WH (taxonomy = 50 classes + other=resources): uses the question taxonomy extended to 50 classes;
• Statistic_-WH (taxonomy = 67 classes + other=resources): uses the question taxonomy extended to 67 classes;
• Bayes: uses Naive Bayes as the scoring function.

5. Conclusion

To solve the Answer Type task, we proposed in this paper a model based on Knowledge Graph refinement techniques. It considers that the prediction of the answer category and answer type can be reduced to the prediction of links in a knowledge graph. This method was applied to the DBpedia and Wikidata datasets provided by the SMART organizers. To predict the answer category, we used two scoring functions: one built using statistical learning and a second built using Naive Bayes. The evaluation on the test data provided by the organizers gave a maximum accuracy of 0.96 for DBpedia and a category_match of 0.94 for Wikidata. Concerning the prediction of the answer type, we combined Naive Bayes with rule-based graph matching techniques. The evaluation on the DBpedia test data gave NDCG@5 = 0.435 and NDCG@10 = 0.392; concerning Wikidata, the MRR is 0.15.
Given that these results are low compared to past challenge results, we are currently building a state of the art of machine learning methods and models that can improve link prediction in knowledge graphs. The results of this survey will be used to compare the different methods and come up with the best method for link prediction in question answering systems.

Acknowledgments

We would like to thank the ISWC conference, the SMART challenge organizers, and all the reviewers. This work was partially supported by Neuralearn.ai (Start-up, Cameroon).

References

[1] N. Mihindukulasooriya, M. Dubey, A. Gliozzo, J. Lehmann, A.-C. N. Ngomo, R. Usbeck, SeMantic AnsweR Type prediction task (SMART) at ISWC 2020 Semantic Web Challenge, CoRR abs/2012.00555 (2020). URL: https://arxiv.org/abs/2012.00555.
[2] H. Paulheim, Knowledge graph refinement: A survey of approaches and evaluation methods, Semantic Web 8 (2017) 489–508. doi:10.3233/SW-160218.
[3] A. Jiomekong, G. Camara, M. Tchuente, Extracting ontological knowledge from Java source code using Hidden Markov Models, Open Computer Science 9 (2019) 181–199.
[4] A. Bordes, N. Usunier, A. García-Durán, J. Weston, O. Yakhnenko, Translating embeddings for modeling multi-relational data, in: Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS'13, Curran Associates Inc., Red Hook, NY, USA, 2013, pp. 2787–2795.