 Semantic Answer Type Prediction using Dense
             Type Embeddings

 G P Shrivatsa Bhargav, Dinesh Khandelwal, Dinesh Garg, and Saswati Dana

                             IBM Research, India
           {gpshri27, dikhand1, garg.dinesh, sdana027}@in.ibm.com



      Abstract. In this paper we describe our submission to the SMART 2021
      Answer Type Prediction task. We propose a BERT based solution to the
      problem. The proposed approach relies on type embeddings obtained
      based on the type names. It allows our model to predict types at test
      time that were not seen during training. Analysis of the training dataset
      reveals the presence of noise in the labels. Therefore, we develop a label
      augmentation scheme to reduce the noise in the annotations and increase
      the quality of the training data. Our model trained on the de-noised data
      achieves 0.986 accuracy on the answer category prediction task and 0.825
and 0.790 NDCG@5 and NDCG@10, respectively, on the test set.

      Keywords: Answer Type Prediction · Question Answering · Natural
      Language Processing


1   Introduction
Answer Type Prediction in SMART 2021 [4] comprises two sub-tasks. The first
task is to predict the answer category of the given natural language question. The
set of possible answer categories is resource, boolean, literal. The second task
is to predict the answer type of the given question. The set of possible answer
types depends on the answer category. If the category is resource, the types are a
subset of the DBpedia or Wikidata ontology classes. If the category is literal, the
type could be either number, date or string. If the category is boolean, the type
is always boolean. Table 1 shows examples from the SMART 2021 Answer Type
Prediction dataset. The metric used to evaluate answer category prediction is
accuracy score. Type prediction performance is measured using lenient NDCG@k
with linear decay [1] (with k=5,10).
    In this paper, we explore how transformers like BERT [2] can be used to
effectively address the problem of Answer Type Prediction.


2   Dataset processing
In this paper, we focus on the SMART2021-AT DBpedia dataset. The dataset
has 40621 samples for training and validation. An additional 10093 samples are
held out for testing. We perform elementary data cleanup (for example, removing

Table 1. Examples from the SMART 2021 Answer Type Prediction dataset with
DBpedia as the Knowledge Graph.

    Question                                              Category     Type
    Who are the gymnasts coached by Amanda Reddin?        resource     [dbo:Gymnast, dbo:Athlete, dbo:Person, dbo:Agent]
    How many superpowers does wonder woman have?          literal      [number]
    When did Margaret Mead marry Gregory Bateson?         literal      [date]
    Is Azerbaijan a member of European Go Federation?     boolean      [boolean]



samples with null labels, removing duplicates, etc.) and split the dataset into
training and validation sets in the ratio 80 : 20. Table 2 summarizes the size of
different sets.


                                Table 2. Dataset size

                          Set          Number of samples
                          Train        29356
                          Validation   7340
                          Test         9104



    To establish a performance (accuracy and NDCG@k) upper-bound on this
dataset, we performed an experiment where we evaluated1 the training and
validation sets against themselves. The goal was to check what accuracy and
NDCG@k a system would obtain if its predictions exactly matched the gold an-
notations. On the training set, we found that the category prediction accuracy
was 1 whereas, NDCG@5=0.8261 and NDCG@10=0.7529. This indicates that
the gold label set is not complete. Such incompleteness/noise in the training
data will directly impact the model’s performance. Upon inspection, we found
that some of the ancestor types (also known as super types or parent types) were
missing from the gold type lists. In some training samples, the types were not
sorted in descending order of their depths in the ontology. We therefore
modified the gold type list of each training sample in two ways: (i) we completed
the type list by adding the missing ancestor types; (ii) we sorted the completed
type list in descending order of depth. We refer to these two
steps collectively as label augmentation. Table 3 summarizes the impact of each
of the above steps. The metrics in Table 3 also serve as a soft upper-bound on
the performance of any model trained on this dataset. We train our models on
the modified training set (+ ancestor types + sorting) but we validate on the
unmodified validation set.
1 github.com/smart-task/smart-dataset/blob/master/evaluation/dbpedia/evaluate.py
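The two label-augmentation steps can be sketched as follows. This is a minimal illustration, assuming a mapping parent_of from each DBpedia type to its direct super type and a mapping depth_of from each type to its depth in the ontology; these names and the function structure are ours and do not come from the released code.

    def complete_with_ancestors(gold_types, parent_of):
        """Add every missing ancestor (super type) of the gold types."""
        completed = set(gold_types)
        for t in gold_types:
            cur = parent_of.get(t)
            while cur is not None and cur not in completed:
                completed.add(cur)
                cur = parent_of.get(cur)
        return completed

    def augment_labels(gold_types, parent_of, depth_of):
        """Complete the gold type list and sort it from finest to coarsest."""
        completed = complete_with_ancestors(gold_types, parent_of)
        # Descending depth: deepest (most specific) types come first.
        return sorted(completed, key=lambda t: depth_of[t], reverse=True)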

            Table 3. The impact of each gold label augmentation operation.

      Data                         Category accuracy     NDCG@5       NDCG@10
      Train                        1.0                   0.826        0.753
      + sorting                    1.0                   0.846        0.768
      + ancestor types + sorting   1.0                   0.892        0.808
      Validation                   1.0                   0.823        0.748
      + sorting                    1.0                   0.842        0.804
      + ancestor types + sorting   1.0                   0.888        0.804



   The metric NDCG@k is sensitive to the ordering of the predicted types. The
evaluation script expects the predicted types to be sorted from the finest to the
coarsest (i.e., decreasing order of depth in the ontology). But in several samples,
there are multiple types of the same depth. In such cases, their ordering in the
predicted type list can be arbitrary. This phenomenon could be the reason why
the train and validation NDCG@k are not 1.


3     Proposed Approach

In this section, we describe our proposed approach to the Answer Type
Prediction task.


3.1     Problem Reformulation

To simplify the modelling task, we work on an equivalent reformulation of the
problem. The reformulated problem can be stated as follows: given a natural
language question, the first task is to predict the answer category from the set of
labels C = {resource, number, date, string, boolean}. If the predicted category is
resource, then the second task is to rank the set of DBpedia types T = {t1 , t2 , . . .}
from most relevant to least relevant. We train the model on the reformulated
task, transform its predictions and report the metrics on the original task.


3.2     Question Encoding

We embed the given natural language question Qi = <qi1, qi2, . . .> (where qij
are the tokens) into a vector space and use this embedding to predict the cate-
gories and rank the types. We leverage BERT to obtain the question embeddings
as follows: (i) we surround the question tokens with the special tokens [CLS], [C]
and [SEP] to obtain a sequence of the form <[CLS], [C], qi1, qi2, . . ., [SEP]>;
(ii) the above sequence is passed through BERT to obtain a sequence of vectors
<v[CLS], v[C], vqi1, vqi2, . . .>.
    v[CLS] is used for the purpose of ranking the set of types T whereas v[C] is
used to predict the answer category from the set of labels C.
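    The encoding step can be sketched with the Hugging Face transformers library as follows. This is a minimal illustration under our own assumptions (the extra [C] token is registered as an additional special token and prepended to the question, and the variable names are ours); the released implementation may handle tokenization differently.

    from transformers import BertModel, BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    tokenizer.add_special_tokens({"additional_special_tokens": ["[C]"]})

    encoder = BertModel.from_pretrained("bert-base-uncased")
    encoder.resize_token_embeddings(len(tokenizer))  # make room for the new [C] token

    question = "Who are the gymnasts coached by Amanda Reddin?"
    # Prepending [C] yields the sequence [CLS] [C] q_i1 q_i2 ... [SEP].
    inputs = tokenizer("[C] " + question, return_tensors="pt",
                       truncation=True, max_length=64)
    hidden = encoder(**inputs).last_hidden_state  # shape (1, seq_len, 768)
    v_cls = hidden[:, 0]  # used below for type ranking
    v_c = hidden[:, 1]    # used below for category prediction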

3.3    Category Prediction
The vector v[C] is passed through a fully connected layer (with parameters W
and b) followed by a softmax layer in order to obtain the probability for each
of the categories in C. To tune the parameters for this task, we use the cross
entropy loss:

    L_C = -\sum_{Q_i \in D_{train}} \log p(c^*_i \mid Q_i)

where D_{train} is the training set and c^*_i is the true category for the question Q_i.
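    In code, the category head is a single linear layer over v[C] trained with cross entropy. A minimal sketch (the variable names and the illustrative gold label are ours):

    import torch
    import torch.nn as nn

    num_categories = 5                              # resource, number, date, string, boolean
    category_head = nn.Linear(768, num_categories)  # the parameters W and b

    logits = category_head(v_c)                     # (1, 5) logits computed from v_[C]
    probs = logits.softmax(dim=-1)                  # p(c | Q_i) over the five labels

    gold_category_id = torch.tensor([0])            # illustrative gold label index ("resource")
    # nn.CrossEntropyLoss applies log-softmax internally, so it takes raw logits;
    # over a batch of questions this is exactly the loss L_C above.
    loss_c = nn.CrossEntropyLoss()(logits, gold_category_id)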

3.4    Type Embedding
In order to rank the types T in the next step, we require a vector embedding ei for
every type ti ∈ T . We use BERT to obtain the initial embeddings of the types. To
do so, we first normalize the name of each type ti to obtain an English phrase. For
example, the type “dbo:GovernmentalAdministrativeRegion” is transformed to
“Governmental Administrative Region”. The normalized type names are passed
through BERT and the resulting output corresponding to the [CLS] vector is
used as the initial type embedding. Thus, we obtain an embedding matrix ET ∈
R|T |×d , where d is the embedding dimension. Each row ei of ET corresponds
to the embedding of the type ti. By creating the type embeddings this way, we
ensure a reasonable representation of every type, even those unseen during
training.
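    A sketch of this initialization: CamelCase type names are split into words, and the [CLS] output of BERT over the resulting phrase becomes one row of E_T. The regular expression and the batching are our own choices, not necessarily those of the released code.

    import re
    import torch

    def normalize_type_name(dbo_type):
        """e.g. 'dbo:GovernmentalAdministrativeRegion' -> 'Governmental Administrative Region'."""
        name = dbo_type.split(":")[-1]
        return re.sub(r"(?<!^)(?=[A-Z])", " ", name)

    @torch.no_grad()
    def initial_type_embeddings(types, tokenizer, encoder):
        phrases = [normalize_type_name(t) for t in types]
        batch = tokenizer(phrases, padding=True, return_tensors="pt")
        # [CLS] vector of each normalized type name -> one row of E_T.
        return encoder(**batch).last_hidden_state[:, 0]  # shape (|T|, 768)

    # E_T is kept trainable and updated during training (Section 3.6);
    # `types` would be the list of 791 DBpedia type names:
    # E_T = torch.nn.Parameter(initial_type_embeddings(types, tokenizer, encoder))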

3.5    Type Ranking
For a given question Qi , we predict the probability p(tj = 1|Qi ) for each type
tj and then rank the types from the most probable to the least. p(tj = 1|Qi ) is
computed as follows:
    p(t_j = 1 \mid Q_i) = \frac{1}{1 + \exp(-v_{[CLS]}^{\top} e_{t_j})}
    We train the model to maximize the probabilities of the gold types using the
following loss:
    L_T = -\sum_{Q_i \in D_{train}} \mathbb{1}_{res}(Q_i) \sum_{t_j \in T} \big[ \lambda_1 \mathbb{1}_{Q_i}(t_j) \log p(t_j \mid Q_i) + \lambda_2 (1 - \mathbb{1}_{Q_i}(t_j)) \log(1 - p(t_j \mid Q_i)) \big]

    where D_{train} is the training dataset, \mathbb{1}_{res}(Q_i) indicates whether the category
of question Q_i is “resource”, \mathbb{1}_{Q_i}(t_j) indicates whether the type t_j is a
valid answer type for the question Q_i, and λ1 and λ2 are scalar hyperparameters.
In every training sample, the number of negative types is far greater than the
number of positive types. Due to this imbalance, the model would otherwise learn
to predict near-zero probability for all types. To remedy this, we use the
hyperparameters λ1 and λ2 to balance the losses corresponding to the positive and negative types.
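    A sketch of the ranking score and the balanced loss, assuming v_cls and E_T from the previous sketches, a 0/1 matrix gold of valid answer types, and a 0/1 mask is_resource over the questions in the batch (all float tensors); these tensor names are ours, and λ1 and λ2 are computed per batch as later listed in Table 5.

    import torch

    # v_cls: (batch, 768) question embeddings; E_T: (|T|, 768) type embedding matrix.
    scores = v_cls @ E_T.t()               # v_[CLS] . e_{t_j} for every type
    probs = torch.sigmoid(scores)          # p(t_j = 1 | Q_i)

    pos, neg = gold, 1.0 - gold
    lam1 = 1.0 / pos.sum().clamp(min=1.0)  # 1 / num. positive types in the batch
    lam2 = 1.0 / neg.sum().clamp(min=1.0)  # 1 / num. negative types in the batch

    eps = 1e-8                             # numerical safety for the logarithms
    per_type = (lam1 * pos * torch.log(probs + eps)
                + lam2 * neg * torch.log(1.0 - probs + eps))
    # Only questions whose gold category is "resource" contribute to L_T.
    loss_t = -(is_resource.unsqueeze(1) * per_type).sum()

Ranking a question then amounts to sorting the types by their predicted probability in descending order.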

3.6    Training
We jointly optimize LC and LT in a multi task fashion. The multi-task learning
objective function is:
    L = L_C + \alpha L_T
where α is a scalar hyperparameter that controls the relative importance of LC
and LT . The parameters of the question encoder BERT, W , b and ET are all
updated during training.

3.7    Inference
At inference time, we first run the answer category prediction; the type
prediction then depends on the predicted category. Table 4 illustrates how the output
on the reformulated task is converted to the output on the original task, and a
small code sketch of this mapping follows the table.


                            Table 4. Inference strategy

    Category prediction      Transformed Output
    resource                 category = resource, type = types sorted by p(tj |Qi )
    number                   category = literal, type = number
    date                     category = literal, type = date
    string                   category = literal, type = string
    boolean                  category = boolean, type = boolean
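    A minimal sketch of this mapping (the function and output field names are ours; ranked_types is the type list sorted by p(t_j | Q_i) in descending order):

    def to_original_task(predicted_category, ranked_types):
        """Map the reformulated prediction back to the original task's output."""
        if predicted_category == "resource":
            return {"category": "resource", "type": ranked_types}
        if predicted_category in ("number", "date", "string"):
            return {"category": "literal", "type": [predicted_category]}
        return {"category": "boolean", "type": ["boolean"]}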




4     Experiments
4.1    Implementation Details
We implemented the proposed approach in PyTorch. The source code and the
trained models have been released on GitHub2. BERT-Base is used in all our
experiments. We use Adam [3] to optimize the objective function. Validation set
performance is used for early stopping and model selection. Table 5 summarizes
the hyperparameters and their values. Our set of DBpedia types contains 791
elements. We do not restrict the type set to only those seen during training. All
the models were trained on a single Nvidia K80 GPU. Each epoch of training
required approximately 35 minutes.

4.2    Results
Table 6 shows the performance of our models on the validation and test sets.
First, we train BERT on the training dataset without any label augmentation.
The model achieves near-perfect accuracy on the answer category prediction
2 https://github.com/IBM/answer-type-prediction

                            Table 5. Hyperparameters

           Hyperparameter            Value
           Learning rate             3e−5
           Hidden dimension d        768
           Batch size                8
           α                         1.0
           Max question length       64
           Max number of epochs      6
           λ1                        1/num. positive types in the batch
           λ2                        1/num. negative types in the batch



task. Next, we train BERT on the training dataset after label augmentation.
Label augmentation gives a boost of 1% in NDCG@5 and 1.4% in NDCG@10. The
performance of this model is only 6.8% and 1.9% short of the soft upper bound
established in Table 3.

               Table 6. Results on the validation and test sets.

                                      BERT    BERT + label augmentation
           Val set      Cat. acc.     0.986   0.986
                        NDCG@5        0.810   0.820
                        NDCG@10       0.771   0.785
           Test set     Cat. acc.     -       0.985
                        NDCG@5        -       0.825
                        NDCG@10       -       0.790




4.3   Analysis

We performed an analysis of the validation set to understand the strengths and
weaknesses of our model (BERT + label augmentation).
    Table 7 shows the confusion matrix on the answer category prediction task.
Table 8 shows randomly sampled examples of each kind of mistake made by
the classifier. In all cases where the model predicts boolean instead of literal
and boolean instead of resource, the model is correct and the gold annotation
is incorrect. The confusion between resource and literal is prominent but hard
to resolve. The answer category (resource or literal) is completely dependent on
the knowledge base. It is impossible to decide resource vs literal without finding
the answer to the question first.
    To study the errors made by the answer type predictor, we randomly sample
the validation set examples on which the model’s NDCG@5 is less than 0.2. We

show some of these examples in Table 9. In examples 1 and 3, we see that the
model’s ranking is correct but the gold annotations are incorrect. Example 2,
however, is a mistake by our model, and the reason is unclear.


               Table 7. Confusion matrix of the answer category classifier

                                                   Prediction
                                       resource     literal    boolean
                    Gold   resource    0.9927       0.0064     0.0008
                           literal     0.0675       0.9276     0.0048
                           boolean     0.0023       0          0.9976




         Table 8. Examples of errors made by the answer category classifier

   Gold       Prediction   Questions
   resource   literal      How many seats are in prefectural assembly?
                           What is the demised place of Leo III
                           What is the symbol for pi?
                           what year did tim duncan enter the nba
   resource   boolean      Did Barbados have a diplomatic relationship with Nigeria in the past?
                           Was Natalia Molchanova born in the Bashkir Autonomous Soviet Socialist Republic?
                           is ANZUS a signatory?
                           Was Gustav Mahler's birth place located in the administrative territorial entity of Kalista ?
   literal    resource     In what country is Mikhail Fridman a citizen?
                           What's the original language for Close Encounters of the Third Kind?
                           Who sponsors the FC Bayern Munich?
   literal    boolean      Did Masaccio die before the statement of Gregorian
                           Is the language of Neptune, Czech?
                           Is Thom Enriquez part of the film crew for Beauty and the Beast?
                           Is there an audio recording of Charles Duke?
   boolean    resource     What is the geography of the planet, Mars?

Table 9. Examples of errors made by the type ranking module. The examples shown
in this table are randomly sampled from those questions whose NDCG@5 is less than
0.2.

    Question                    Gold types                  Top 5 Predicted types
    What country signed the     [‘dbo:Person’,              [‘dbo:Country’,
    North Atlantic Treaty       ‘dbo:Agent’]                ‘dbo:Location’,
    that has a spoken lan-                                  ‘dbo:PopulatedPlace’,
    guage of Portuguese?                                    ‘dbo:Place’, ‘dbo:State’]

    What year doug williams     [‘dbo:FootballLeagueSeason’, [‘dbo:Software’,
    won the super bowl          ‘dbo:SoccerClub’]            ‘dbo:Work’,
                                                             ‘dbo:VideoGame’,
                                                             ‘dbo:TelevisionShow’,
                                                             ‘dbo:FootballLeagueSeason’]

    Where is the headquarters   [‘dbo:Company’,             [‘dbo:Location’,
    of the car manufacturer     ‘dbo:Organisation’,         ‘dbo:Place’,
    Lyon                        ‘dbo:Agent’]                ‘dbo:Settlement’,
                                                            ‘dbo:PopulatedPlace’,
                                                            ‘dbo:City’]


5      Conclusions
In this paper we explored how BERT can be used to address the problem of
answer type prediction. We first established a soft upper bound on the perfor-
mance of models that are trained on the SMART 2021 dataset. We developed a
label augmentation scheme to de-noise the gold annotations and hence improve
the model. Our model achieves 0.986 accuracy on the answer category prediction
task and 0.825 and 0.790 NDCG@5 and NDCG@10 respectively on the test set.
On the validation set, the performance of our model (after label augmentation)
is only 0.068 and 0.019 short of the soft upper-bound established in Table 3. We
also present error analysis that shows the nature of errors made by our model
and the noise in the dataset.


References
1. Balog, K., Neumayer, R.: Hierarchical target type identification for entity-
   oriented queries. In: CIKM ’12, pp. 2391–2394. Association for Computing
   Machinery, New York, NY, USA (2012). https://doi.org/10.1145/2396761.2398648
2. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirec-
   tional transformers for language understanding. In: NAACL-HLT. pp. 4171–4186
   (2019)
3. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Bengio, Y.,
   LeCun, Y. (eds.) 3rd International Conference on Learning Representations, ICLR
   2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings (2015),
   http://arxiv.org/abs/1412.6980

4. Mihindukulasooriya, N., Dubey, M., Gliozzo, A., Lehmann, J., Ngonga Ngomo, A.C.,
   Usbeck, R., Rossiello, G., Kumar, U.: Semantic Answer Type and Relation Predic-
   tion Task (SMART 2021). arXiv (2022)