Augmentation-based Answer Type Classification
           of the SMART dataset

                    Aleksandr Perevalov and Andreas Both

         Anhalt University of Applied Sciences, Köthen (Anhalt), Germany
              {aleksandr.perevalov,andreas.both}@hs-anhalt.de


      Abstract. Recent progress in deep learning has enabled AI researchers and
      developers to achieve state-of-the-art results with minimal effort.
      Specifically, in a task such as text classification, text preprocessing
      and feature generation no longer play a significant role thanks to
      landmark models such as BERT and related models. In this paper, we
      present our solution for the Semantic Answer Type prediction task
      (SMART task). The solution is based on the application of several data
      augmentation techniques: machine translation to popular languages,
      round-trip translation, and named entity annotation with linked data.
      The final submission was generated as a weighted result from several
      successful system outputs.

      Keywords: Answer type classification · Text classification · Text aug-
      mentation.


1   Introduction
Understanding a question’s answer type is one of the significant steps in a
question-answering process [4]. With the help of an answer type classifier, a
Question Answering system (QA system) can narrow the answer search space
and filter out inappropriate answer candidates [6].
    In general, the answer type classification task can be interpreted as a multi-
class text classification task. However, the SMART task [5] proposes a more
complicated structure of the data. There are two class levels: answer category
(resource, literal, boolean) and answer type.
    According to the official description of the data¹: if the category is
“resource”, answer types are ontology classes from either the DBpedia
ontology² or the Wikidata ontology³. If the category is “literal”, answer
types are either “number”, “date”, or “string”. For the category “boolean”,
no additional specialization is defined. It is worth mentioning that in this
work we concentrate only on the DBpedia dataset.
      Copyright © 2020 for this paper by its authors. Use permitted under Creative
  Commons License Attribution 4.0 International (CC BY 4.0).
¹ https://smart-task.github.io/
² http://mappings.dbpedia.org/server/ontology/classes/
³ https://www.wikidata.org/wiki/Wikidata:WikiProject_Ontology

     Each “resource” answer type contains a ranked list of the DBpedia ontol-
ogy types. All items contained in a list are part of one hierarchy, for exam-
ple: ["dbo:Person", "dbo:Agent"] or ["dbo:Opera", "dbo:MusicalWork",
"dbo:Work"]. The most general ontology type is at the end of a list.
     The DBpedia dataset contains 21,964 questions (train: 17,571, test: 4,393).
The evaluation metric for the answer category prediction task is accuracy; the
metric for answer type prediction is lenient NDCG@k with a linear decay [2].
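
The ranking metric can be illustrated with a minimal sketch. It assumes that the linear-decay gain of a predicted type has the form 1 - d/h, where d is the distance in the type hierarchy to the closest gold type and h is the maximum hierarchy depth. The function names and this exact gain form are illustrative assumptions and are not taken from the official evaluation script.

```python
import math
from typing import Sequence


def linear_decay_gain(distance: int, max_depth: int) -> float:
    """Gain of a predicted type that is `distance` steps away from the
    closest gold type (assumed form: 1 - d/h, clipped at 0)."""
    return max(0.0, 1.0 - distance / max_depth)


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k ranked positions."""
    return sum(g / math.log2(pos + 2) for pos, g in enumerate(gains[:k]))


def ndcg_at_k(gains: Sequence[float], ideal_gains: Sequence[float], k: int) -> float:
    """DCG@k of the prediction normalized by the DCG@k of an ideal ranking."""
    ideal = dcg_at_k(sorted(ideal_gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0
```

With this sketch, a prediction whose gains equal the ideal gains scores 1.0, and any type that lies further from the gold types lowers the score.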
     Our solution focuses on data augmentation techniques. In Section 2 we
describe the dataset in detail. Section 3 describes the data augmentation
methods we used, as well as an algorithm for merging answer type lists. In
Section 4 we show our experimental results and describe the local evaluation
pipeline. Finally, Section 5 presents the conclusions.


2    Dataset analysis and transformation
The original dataset is presented using the JSON format. To train a model on
the data, it needs to be transformed into a feature-target form.
    In the case of answer category prediction, the task is straightforward:
there is just one target value per question, so it is treated as a multi-class
classification task. Predicting an answer type is more complicated: we have to
predict a list whose items are ordered according to the taxonomy level and
must belong to one hierarchy (e.g., dbo:Opera, dbo:MusicalWork, and dbo:Work).
The first constraint does not allow us to treat this task as a single
multi-class classification. That is why we decided to treat each item of a
list as an individual target value, so that we can train a separate model for
each of them. We kept only the 5 most general types for each question because
95% of the answer type lists are no longer than this value. The head of the
resulting dataset is presented in Figure 1.


               Fig. 1. Tabular representation of the training dataset.
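
The transformation into the tabular form can be sketched as follows. The JSON field names ("question", "category", "type"), the sample records, and the choice to keep the last five list items (the most general types sit at the end of a list) are assumptions made for illustration.

```python
import json
from typing import Dict, List, Optional

MAX_TYPES = 5  # number of answer type columns kept per question


def to_row(record: Dict) -> Dict[str, Optional[str]]:
    """Flatten one question record into a feature-target row."""
    types: List[str] = record.get("type") or []
    # Assumption: the most general types are at the end of the list,
    # so the last MAX_TYPES items are the ones that are kept.
    kept = types[-MAX_TYPES:]
    padded: List[Optional[str]] = kept + [None] * (MAX_TYPES - len(kept))
    row: Dict[str, Optional[str]] = {
        "question": record["question"],
        "category": record["category"],
    }
    for position, answer_type in enumerate(padded, start=1):
        row[f"type_{position}"] = answer_type
    return row


# Hypothetical records in the assumed JSON layout.
raw = json.loads("""
[
  {"question": "Which opera premiered in Paris in 1875?", "category": "resource",
   "type": ["dbo:Opera", "dbo:MusicalWork", "dbo:Work"]},
  {"question": "When was Bach born?", "category": "literal", "type": ["date"]},
  {"question": null, "category": "boolean", "type": ["boolean"]}
]
""")

# Questions without a textual representation are dropped.
rows = [to_row(r) for r in raw if r.get("question")]
```

Each row then offers one target column for the category classifier and up to five target columns for the type classifiers; unused positions hold None.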


    Hence, we consider the solution for the SMART challenge task to be a
two-level architecture where the higher-level decisions activate lower-level
classifiers:

Level 1 The category is classified (Figure 1, column: “category”).
          Thereafter, the classification system can select the next required
          classifiers.
Level 2 The second-level decisions are considered to be two independent
          tasks:
            – Classification of literal type (Figure 1, column: “type 1”)
            – Classification of resource types (Figure 1, columns: “type 1”,
              “type 2”, “type 3”, “type 4”, “type 5”)

[Figure 2: bar chart of question counts per resource answer type: dbo:Agent
4183, dbo:Person 2716, dbo:Place 2247, dbo:Location 2247, dbo:PopulatedPlace
1654, dbo:Organisation 1402, dbo:Work 899, dbo:Settlement 827, dbo:Country 751,
dbo:City 692]

                                      Fig. 2. TOP 10 resource answer types

     The training dataset had 43 questions with empty textual representation.
 These questions were removed. The resulting dataset has the following charac-
 teristics:
   – 17,528 questions are contained;
   – Distribution: 9,573 questions point to resources, 5,156 point to a literal
     datatype, and 2,799 are Boolean questions;
   – The 95th percentile of the answer type lists’ length is 5;
   – The maximum number of tokens in a question is 60.
     In Figure 2, the top 10 most common resource answer types are presented. It
 shows that nearly all of the top 10 resource types are dbo:Agent, dbo:Place,
 or their sub-classes; the exception is dbo:Work.


 3       Proposed solution
 3.1       Classifier Architecture
 The classification pipeline was created with a tree-like structure and 7 classifiers
 in total (see Figure 3). First, the category is classified. Then, depending on the
 category, the corresponding models are chosen.
     For example, if the category is “resource”, then the pipeline classifies a
 question using 5 models reflecting the decisions for “type 1”, “type 2”,
 “type 3”, “type 4”, and “type 5” (cf. Figure 1). Given the results of these
 classifiers, the answer type list is created from the computed results
 (obeying the correct order). As there are only 5 models (one model per list
 item), the answer type list contains at most 5 items. It may be shorter when
 a model predicts None.
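
The tree-like dispatch can be sketched as below. The classifiers are represented by plain callables that map a question to a label (or None); the function name `classify` and the stub models are hypothetical and stand in for the fine-tuned models.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# A classifier maps a question text to a label, or to None for "no type".
Classifier = Callable[[str], Optional[str]]


def classify(
    question: str,
    category_model: Classifier,
    literal_model: Classifier,
    resource_models: Sequence[Classifier],
) -> Tuple[Optional[str], List[str]]:
    """Level 1 picks the category, Level 2 runs only the matching models."""
    category = category_model(question)
    if category == "boolean":
        # No further specialization is defined for boolean questions.
        return category, []
    if category == "literal":
        literal = literal_model(question)
        return category, ([literal] if literal is not None else [])
    # Resource: one model per list position; None predictions are dropped,
    # so the list has at most len(resource_models) items.
    predictions = [model(question) for model in resource_models]
    return category, [p for p in predictions if p is not None]


# Stub models (hypothetical) for 1 category + 1 literal + 5 resource classifiers.
resource_stubs = [
    lambda q: "dbo:Person",
    lambda q: "dbo:Agent",
    lambda q: None,
    lambda q: None,
    lambda q: None,
]
result = classify(
    "Who wrote Faust?", lambda q: "resource", lambda q: "string", resource_stubs
)
```

Here `result` holds the category together with a two-item type list, because three of the five position models returned None.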


                     Fig. 3. Tree-like classification pipeline C


3.2    Data Augmentation

To extend the training data, we used several augmentation strategies for the
given dataset:

D1 Machine translation to German, French, Spanish, and Russian is used for
   each question. Hence, in total there are 5 times as many questions (in 5
   different languages), resulting in 87,640 questions. As the dataset has
   become multilingual, we use a multilingual model. There are two ways of
   predicting with such a model: use only the original English text, or
   obtain predictions for all languages and apply a majority voting algorithm.
D2 Round-trip translation [1] (English-German-English, English-Russian-English).
   In total, there are 3 times as many questions, and we use a single-language
   model. The dataset consists of 52,584 questions.
D3 Each question is annotated with its named entities pointing to DBpedia
   resources, and each named entity is replaced with one of its RDF types. The
   data is extracted from DBpedia with the help of DBpedia Spotlight⁴. The
   dataset consists of 163,488 questions.

   Google Cloud Translation⁵ was used to translate the data for D1 and D2
automatically.
⁴ https://www.dbpedia-spotlight.org/
⁵ https://cloud.google.com/translate
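
The majority voting mode mentioned for D1 can be sketched as follows. The tie-breaking rule (the earliest label in the input wins, so the English prediction should be listed first) is an assumption; the paper does not specify how ties are resolved.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(predictions: Sequence[Optional[str]]) -> Optional[str]:
    """Return the label predicted for most language versions of a question.

    Assumption: on a tie, the label that appears first in `predictions`
    wins, so the English prediction should be placed first.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical per-language predictions (en, de, fr, es, ru) for one question.
votes = ["dbo:Person", "dbo:Agent", "dbo:Person", "dbo:Person", "dbo:Place"]
winner = majority_vote(votes)
```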


               Fig. 4. Example of merging 3 lists with specified weights

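   [Figure 5: each dataset D0, D1, ..., Dm-1 is used to train a classifier
   pipeline C0, C1, ..., Cm-1; the pipeline results RC0, RC1, ..., RCm-1 are
   combined by a weighted rank sum into RFinal.]

                        Fig. 5. Overview of the final process.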


    Hence, in addition to the original dataset, which we call D0, we have
created 3 more datasets (D1, D2, and D3). Together they are used to spawn 4
independent classifier pipelines (C0, C1, C2, and C3). Consequently, the
results RCi of all classifier pipelines Ci need to be merged. Figure 4 shows
an example of the merging process. The next section gives a detailed
description of the process.


3.3    Results Merging

Each classification pipeline (C0, C1, C2, and C3) provides a list of
classification results. It is reasonable to assume that the pipelines differ
in classification quality.
    Hence, we merge the classification results (identified by RC0, RC1, RC2,
and RC3) to establish a final result set RFinal, as shown in Figure 5. The
merging of RCi with i ∈ {0, 1, 2, 3} is performed by computing a weighted
rank for each answer type that was predicted by at least one classifier
pipeline Ci. The rank P_{A_n, R_{Final}} of an answer type A_n in RFinal is
computed as follows:

\[
P_{A_n, R_{Final}} = \sum_{i=0}^{m-1} W_i \cdot P_{A_n, R_{C_i}},
\quad \text{where} \quad
P_{A_n, R_{C_i}} =
\begin{cases}
\text{rank of } A_n \text{ in } R_{C_i} & \text{if } A_n \in R_{C_i} \\
\text{fallback rank } f & \text{otherwise}
\end{cases}
\]

and m is the number of classification pipelines.
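
The weighted rank computation can be sketched as below. The sketch assumes 1-based ranks, that a lower weighted rank is better, and that ties keep the order in which types were first seen; none of these details is stated explicitly in the text, and the function name is illustrative.

```python
from typing import List, Sequence


def merge_ranked_lists(
    results: Sequence[Sequence[str]],
    weights: Sequence[float],
    fallback_rank: float = 10.0,
) -> List[str]:
    """Merge ranked answer type lists by their weighted rank sum."""
    if len(results) != len(weights):
        raise ValueError("one weight per result list is required")

    # Every type predicted by at least one pipeline is a candidate.
    candidates: List[str] = []
    for ranked in results:
        for answer_type in ranked:
            if answer_type not in candidates:
                candidates.append(answer_type)

    def weighted_rank(answer_type: str) -> float:
        total = 0.0
        for ranked, weight in zip(results, weights):
            ranked_list = list(ranked)
            if answer_type in ranked_list:
                rank = ranked_list.index(answer_type) + 1  # 1-based (assumed)
            else:
                rank = fallback_rank  # type missing from this pipeline
            total += weight * rank
        return total

    # Assumption: the smallest weighted rank comes first in the final list.
    return sorted(candidates, key=weighted_rank)


merged = merge_ranked_lists(
    [["dbo:Person", "dbo:Agent"],
     ["dbo:Person", "dbo:Agent"],
     ["dbo:Agent", "dbo:Person"]],
    weights=[0.3, 0.3, 0.4],
)
```

In this example dbo:Person receives a weighted rank of about 1.4 and dbo:Agent about 1.6, so dbo:Person stays first although the highest-weighted pipeline ranked it second.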

Typically, the quality of Level-1 decisions is high. However, there is a
special case where the classifier pipelines predict different answer
categories for the same question. In this case, we currently follow a static
rule-based decision process that favors the more specific predictions: if one
classifier pipeline predicts the category boolean, then all other results
are discarded. Otherwise, if one classifier pipeline predicts the literal
category, then all non-literal results are discarded.
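
The static rule for conflicting categories can be sketched as a simple priority order. The function name and the (category, type list) data layout are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# Priority of categories when pipelines disagree: boolean wins over
# literal, and literal wins over resource.
CATEGORY_PRIORITY = ("boolean", "literal", "resource")


def resolve_category(
    predictions: Sequence[Tuple[str, List[str]]],
) -> Tuple[str, List[List[str]]]:
    """Pick the winning category and keep only the type lists of the
    pipelines that predicted it."""
    categories = [category for category, _ in predictions]
    for preferred in CATEGORY_PRIORITY:
        if preferred in categories:
            kept = [types for category, types in predictions if category == preferred]
            return preferred, kept
    raise ValueError("no known category was predicted")


# Hypothetical outputs of three pipelines for the same question.
category, kept_lists = resolve_category([
    ("resource", ["dbo:Person", "dbo:Agent"]),
    ("literal", ["string"]),
    ("resource", ["dbo:Agent"]),
])
```

Only the type lists that survive this step would then be passed to the weighted rank merging.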


4     Experiments
4.1   Evaluation
We used the BERT-base-cased and BERT-base-multilingual-cased models [3] in our
classification pipeline. The training data was split into two sets: a training
set and a validation set. The validation set was created by randomly choosing
4400 questions, and the test set consists of 4381 questions. The models were
fine-tuned on the training set with the following hyperparameters: EPOCHS=2,
MAX_LEN=60, BATCH_SIZE=16.
     The training was performed on GPU resources provided by the Kaggle.com
platform (NVIDIA Tesla P100 GPU, 16 GB RAM). The results shown in Table 1
enable us to compare the effectiveness of each augmentation technique. They
were obtained locally on the validation set (MV denotes the majority voting
algorithm, see Section 3.2):


                          Table 1. Local validation results

                     Metric    D0     D1     D1+MV  D2     D3
                     Accuracy  0.969  0.968  0.962  0.357  0.959
                     NDCG@5    0.533  0.704  0.708  0.165  0.363
                     NDCG@10   0.499  0.661  0.665  0.140  0.317


    The best-performing variants are the multilingual ones (D1 and D1+MV). The
round-trip translation approach (D2) caused overfitting because of the small
differences between the question forms. The same situation occurred with the
named entity annotation approach (D3). The original dataset (D0) reached an
accuracy comparable to D1 but clearly lower NDCG scores. A detailed analysis
of the errors is given in Section 4.2.
    For the final analysis, we took into account only the predictions from the
models trained on the original (D0) and the multilingual dataset (D1). We used
both prediction approaches for the multilingual data: using the multilingual
model to predict the answer type of the English questions, and using the same
model to retrieve predictions for all 5 languages and taking the majority vote
result. The predictions were merged using the algorithm described in
Section 3.3. We tried several weight combinations to achieve the highest
quality. The evaluation results for the final submission are presented in
Table 2.


                         Table 2. Final evaluation results

Metric    0.3·D0 + 0.3·D1 + 0.4·D1+MV   0.5·D1 + 0.5·D1+MV   0.3·D1 + 0.7·D1+MV   0.7·D1 + 0.3·D1+MV
Accuracy             0.976                     0.965                0.965                0.972
NDCG@5               0.762                     0.752                0.752                0.759
NDCG@10              0.725                     0.714                0.716                0.722


    The highest score on the test dataset was achieved with a merged combination
of 3 predictions (see the second column of Table 2). We evaluated weight
combinations where each weight Wi was chosen between 0.0 and 1.0 such that the
sum of all used weights equals 1.0. The best weight combination found in this
process was: 30% for D0, 30% for D1, and 40% for D1+MV. The fallback rank f for
the merging algorithm was set to 10 (see Subsection 3.3). This combination was
submitted as the final solution for the task. As the weights were obtained
manually and intuitively, we cannot make a statement about their applicability
to other datasets. Moreover, these weights may be overfitted to the test set
because the organizers evaluated the final predictions on the whole test
dataset without private/public test splits. Hence, the weights were selected
according to the public test set results. This is a limitation of our merging
approach.
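
The weights above were chosen manually. As a purely illustrative alternative, the candidate combinations could be enumerated on a grid; the step size of 0.1 and the scoring callback are assumptions, and the score should ideally come from a held-out validation set rather than the test set.

```python
from itertools import product
from typing import Callable, Iterator, Tuple


def weight_grid(n: int, steps: int = 10) -> Iterator[Tuple[float, ...]]:
    """Yield all n-tuples of weights in [0, 1] on a 1/steps grid that sum
    to 1. Integer arithmetic avoids floating-point drift in the sum."""
    for combo in product(range(steps + 1), repeat=n):
        if sum(combo) == steps:
            yield tuple(part / steps for part in combo)


def best_weights(
    score: Callable[[Tuple[float, ...]], float], n: int, steps: int = 10
) -> Tuple[float, ...]:
    """Return the grid point with the highest score (hypothetical scorer)."""
    return max(weight_grid(n, steps), key=score)


# All combinations for three prediction sources, e.g. D0, D1, and D1+MV.
grid = list(weight_grid(3))
```

For three sources and a step of 0.1 this yields 66 candidate combinations, one of which is (0.3, 0.3, 0.4).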

4.2   Error analysis
As reported in the previous subsection, the approach D1 outperformed D2 and D3
because the models trained on D2 and D3 overfitted, which was caused by the
nearly identical surface forms of the generated questions. A corresponding
example for D2 is given below:
  Original: Who replaced Charles Evans Hughes as the Chief Justice of The
            United States?
 En-De-En: Who succeeded Charles Evans Hughes as Chief Justice of the
            United States?
 En-Ru-En: Who replaced Charles Evans Hughes as Chief Justice of the
            United States?
    Hence, we have to recognize that the questions generated using round-trip
translation do not differ significantly from the original: the En-De-En
variant differs in one word, the absence of one definite article, and the
lower-case letter “t” in the last definite article. Almost the same is true
for the En-Ru-En translation.
    We assume that round-trip translation via languages that are less widely
used or more distant from English could mitigate this issue.
     An example for D3 is given below:
     Original: Who replaced Charles Evans Hughes as the Chief Justice of The
               United States?
    Variant 1: Who replaced DBpedia:Athlete as the DBpedia:Person of The
               DBpedia:PopulatedPlace?
    Variant 2: Who replaced DBpedia:Person as the DBpedia:Person of The
               DBpedia:Country?

    Each named entity was replaced with the type of its URI in DBpedia. As a
resource in DBpedia may have several types, a question may yield up to several
thousand variants corresponding to each combination of the types. There are
two major limitations of this approach: the DBpedia resource may contain
errors w.r.t. its type, and the Named Entity Linking tool may extract and link
entities incorrectly. In the given example, “the Chief Justice of The United
States” should be replaced with a single type, while it was replaced with two
different types, both of which are incorrect.
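
The growth in the number of variants follows from taking every combination of the candidate types. A minimal sketch is given below; the entity annotations are hard-coded here instead of being requested from an entity linking service, and the function name is hypothetical.

```python
from itertools import product
from typing import Dict, List


def type_variants(question: str, entity_types: Dict[str, List[str]]) -> List[str]:
    """Create one question variant per combination of entity types.

    `entity_types` maps each entity mention to its candidate RDF types.
    """
    mentions = list(entity_types)
    variants = []
    for combo in product(*(entity_types[m] for m in mentions)):
        text = question
        for mention, rdf_type in zip(mentions, combo):
            text = text.replace(mention, rdf_type)
        variants.append(text)
    return variants


# Annotations mirroring the example above (hard-coded for illustration).
annotations = {
    "Charles Evans Hughes": ["DBpedia:Athlete", "DBpedia:Person"],
    "Chief Justice": ["DBpedia:Person"],
    "United States": ["DBpedia:PopulatedPlace", "DBpedia:Country"],
}
variants = type_variants(
    "Who replaced Charles Evans Hughes as the Chief Justice of The United States?",
    annotations,
)
```

With 2, 1, and 2 candidate types for the three mentions, the sketch produces 2 × 1 × 2 = 4 variants of a single question, which illustrates how quickly the dataset grows.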
    D1 showed the best performance; a fragment of it is given below:
     Original: Who replaced Charles Evans Hughes as the Chief Justice of The
               United States?
     German: Wer hat Charles Evans Hughes als Oberster Richter der
               Vereinigten Staaten abgelöst?
      French: Qui a remplacé Charles Evans Hughes en tant que juge en chef
               des États-Unis?

    However, regardless of the augmentation approach, there is one significant
limitation of our prediction approach: each element of the answer type list is
predicted independently, and therefore the elements may not belong to the same
hierarchy. For example, for the question “What is the horse characters of
Madame Sans-Gêne play?” the predicted answer type list is ["dbo:Person",
"dbo:Work"] while the true value is ["dbo:Animal", "dbo:Eukaryote",
"dbo:Species"]. Besides being completely incorrect, the prediction contains
the items "dbo:Person" and "dbo:Work", which are located in different
ontology branches (hierarchies).
    Consequently, a mechanism for checking the consistency of the predicted
hierarchy should be created. One possible solution is to predict only the most
specific answer type and to derive the remaining list items from the actual
ontology hierarchy.
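
Such a mechanism could look like the following sketch. The parent map contains only the relations that appear in the examples of this paper (dbo:Opera, dbo:MusicalWork, dbo:Work and dbo:Person, dbo:Agent); a real implementation would load the full ontology, and both function names are hypothetical.

```python
from typing import Dict, List

# Excerpt of a child -> parent map, limited to the examples in this paper.
PARENT: Dict[str, str] = {
    "dbo:Opera": "dbo:MusicalWork",
    "dbo:MusicalWork": "dbo:Work",
    "dbo:Person": "dbo:Agent",
}


def expand_to_hierarchy(most_specific: str, parent: Dict[str, str]) -> List[str]:
    """Build the answer type list from the most specific type upwards."""
    chain = [most_specific]
    seen = {most_specific}
    while chain[-1] in parent:
        nxt = parent[chain[-1]]
        if nxt in seen:  # guard against cycles in a malformed map
            break
        chain.append(nxt)
        seen.add(nxt)
    return chain


def is_single_hierarchy(types: List[str], parent: Dict[str, str]) -> bool:
    """Check that every item is an ancestor of the item before it."""
    return all(
        later in expand_to_hierarchy(earlier, parent)[1:]
        for earlier, later in zip(types, types[1:])
    )


opera_chain = expand_to_hierarchy("dbo:Opera", PARENT)
```

The check rejects the list ["dbo:Person", "dbo:Work"] from the example above, because dbo:Work is not an ancestor of dbo:Person in the map.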


5     Conclusion

In this work, we described our solution for the Semantic Answer Type predic-
tion task. The goal was to predict the corresponding answer category and answer
types. To solve the task, we created a tree-like classification pipeline and imple-
mented several text augmentation methods described in Section 3.
    The results of our experiments show that the multilingual dataset has the
highest performance in contrast to the other augmented data. To prepare the
final submission, we used the weighted merging algorithm on top of our best
predictions (see Section 4).
    There is room for improvement. In future work, we would use an ensemble
learning approach to merge the results instead of the current static approach.
Additionally, we would consider each language classifier independently,
assuming that differences in translation quality lead to differences in
classification quality. Also, a mechanism validating hierarchy consistency and
hierarchy levels might be added to the prediction process.


References
1. Aiken, M., Park, M.: The efficacy of round-trip translation for MT evaluation.
   Translation Journal 14(1), 1–10 (2010)
2. Balog, K., Neumayer, R.: Hierarchical target type identification for
   entity-oriented queries. In: Proceedings of the 21st ACM International
   Conference on Information and Knowledge Management. pp. 2391–2394 (2012).
   https://doi.org/10.1145/2396761.2398648
3. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.N.: BERT: Pre-training of deep
   bidirectional transformers for language understanding. ArXiv e-prints (2018)
4. Hao, T., Xie, W., Wu, Q., Weng, H., Qu, Y.: Leveraging question target word
   features through semantic relation expansion for answer type classification.
   Knowledge-Based Systems 133, 43–52 (2017).
   https://doi.org/10.1016/j.knosys.2017.06.030
5. Mihindukulasooriya, N., Dubey, M., Gliozzo, A., Lehmann, J., Ngomo, A.C.N.,
   Usbeck, R.: SeMantic AnsweR Type prediction task (SMART) at ISWC 2020 Semantic
   Web Challenge. CoRR/arXiv abs/2012.00555 (2020),
   https://arxiv.org/abs/2012.00555
6. Xu, D., Jansen, P., Martin, J., Xie, Z., Yadav, V., Madabushi, H.T., Tafjord, O.,
   Clark, P.: Multi-class hierarchical question classification for multiple choice
   science exams. In: Proceedings of The 12th Language Resources and Evaluation
   Conference. pp. 5370–5382 (2020)