<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Augmentation-based Answer Type Classification of the SMART dataset</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Aleksandr</forename><surname>Perevalov</surname></persName>
							<email>aleksandr.perevalov@hs-anhalt.de</email>
							<affiliation key="aff0">
								<orgName type="institution">Anhalt University of Applied Sciences</orgName>
								<address>
									<settlement>Köthen (Anhalt)</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Andreas</forename><surname>Both</surname></persName>
							<email>andreas.both@hs-anhalt.de</email>
							<affiliation key="aff0">
								<orgName type="institution">Anhalt University of Applied Sciences</orgName>
								<address>
									<settlement>Köthen (Anhalt)</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Augmentation-based Answer Type Classification of the SMART dataset</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">AC16ACC88692E65DABFF075ED53BD088</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T21:16+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Answer type classification</term>
					<term>Text classification</term>
					<term>Text augmentation</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Recent progress in deep learning has enabled AI researchers and developers to achieve state-of-the-art results with minimal effort. Specifically, in tasks such as text classification, text preprocessing and feature generation no longer play a significant role thanks to landmark models such as BERT and related architectures. In this paper, we present our solution for the Semantic Answer Type prediction task (SMART task). The solution is based on the application of several data augmentation techniques: machine translation into popular languages, round-trip translation, and named entity annotation with linked data. The final submission was generated as a weighted combination of several successful system outputs.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Understanding a question's answer type is one of the significant steps in the question answering process <ref type="bibr" target="#b3">[4]</ref>. With the help of an answer type classifier, a Question Answering system (QA system) can narrow the answer search space and filter out inappropriate answer candidates <ref type="bibr" target="#b5">[6]</ref>.</p><p>In general, the answer type classification task can be interpreted as a multi-class text classification task. However, the SMART task <ref type="bibr" target="#b4">[5]</ref> defines a more complex data structure. There are two class levels: answer category (resource, literal, boolean) and answer type.</p><p>According to the official description of the data<ref type="foot" target="#foot_0">1</ref>: If the category is "resource", answer types are ontology classes from either the DBpedia ontology<ref type="foot" target="#foot_1">2</ref> or the Wikidata ontology<ref type="foot" target="#foot_2">3</ref>. If the category is "literal", answer types are either "number", "date", or "string". For the category "boolean", no additional specialization is defined. It is worth mentioning that in this work we concentrate only on the DBpedia dataset.</p><p>Each "resource" answer type contains a ranked list of DBpedia ontology types. All items contained in a list are part of one hierarchy, for example: ["dbo:Person", "dbo:Agent"] or ["dbo:Opera", "dbo:MusicalWork", "dbo:Work"]. The most general ontology type is at the end of a list.</p><p>The DBpedia dataset contains 21,964 questions (train: 17,571, test: 4,393). The evaluation metric for the answer category prediction task is accuracy; the metric for answer type prediction is lenient NDCG@k with a linear decay <ref type="bibr" target="#b1">[2]</ref>.</p><p>Our solution focuses on data augmentation techniques. In Section 2 we describe the dataset in detail. Section 3 describes the data augmentation methods we used, as well as the algorithm for merging answer type lists. In Section 4 we present our experimental results and describe the local evaluation pipeline. Finally, in Section 5 the conclusions are presented.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Dataset analysis and transformation</head><p>The original dataset is provided in the JSON format. To train a model on the data, it needs to be transformed into a feature-target form.</p><p>For predicting the answer category, the task is straightforward: there is exactly one target value per question, and it is treated as a multi-class classification task. Predicting the answer type is more complicated: we have to predict a list whose items are ordered by taxonomy level and must belong to one hierarchy (e.g., dbo:Opera, dbo:MusicalWork, and dbo:Work). The first constraint does not allow us to treat this task as a plain multi-class classification. That is why we decided to treat each item of a list as an individual target value, so that we can train a separate model for each of them. We kept only the 5 most general types for each question because 95% of the answer type lists are no longer than this. The head of the resulting dataset is presented in Figure <ref type="figure" target="#fig_0">1</ref>. Hence, we consider the solution for the SMART challenge task to be represented as a two-level architecture where the higher-level decisions activate lower-level classifiers: Level 1 The category is classified (Figure <ref type="figure" target="#fig_0">1</ref>, column: "category"). Thereafter, the classification system can decide which classifiers are required next. Level 2 The second-level decisions are considered to be two independent tasks:</p><p>-Classification of the literal type (Figure <ref type="figure" target="#fig_0">1</ref>, column: "type 1") -Classification of the resource types (Figure <ref type="figure" target="#fig_0">1</ref>, columns: "type 1", "type 2", "type 3", "type 4", "type 5") The training dataset had 43 questions with an empty textual representation. These questions were removed. 
The resulting dataset has the following characteristics:</p><p>-It contains 17,528 questions; -Distribution: 9,573 questions point to resources, 5,156 point to a literal datatype, and 2,799 are Boolean questions; -The 95th percentile of the answer type lists' length is 5; -The maximum number of tokens in a question is 60.</p><p>In Figure <ref type="figure">2</ref>, the top 10 most common resource answer types are presented. It shows that all of the top 10 resource types belong either to dbo:Agent or dbo:Place or their sub-classes.</p><p>3 Proposed solution</p></div>
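The transformation described above (one target column per answer type list item, truncated to the 5 most general types, with empty questions dropped) can be sketched in plain Python. This is a minimal sketch: the sample records and the `type_1`..`type_5` column names are stand-ins for the SMART JSON fields and the columns shown in Figure 1, not the authors' actual code.

```python
# Hedged sketch of the feature-target transformation of the SMART data.
MAX_TYPES = 5  # 95% of answer type lists are no longer than 5

def to_feature_target(records):
    rows = []
    for rec in records:
        if not rec["question"]:           # drop empty textual representations
            continue
        types = rec["type"][-MAX_TYPES:]  # keep the 5 most general types (list tail)
        row = {"question": rec["question"], "category": rec["category"]}
        for i in range(MAX_TYPES):        # one target column per list position
            row[f"type_{i + 1}"] = types[i] if i < len(types) else None
        rows.append(row)
    return rows

# Invented mini-sample in the SMART JSON shape, for illustration only.
sample = [
    {"question": "What is the opera Salome based on?", "category": "resource",
     "type": ["dbo:Opera", "dbo:MusicalWork", "dbo:Work"]},
    {"question": "", "category": "boolean", "type": ["boolean"]},
]
rows = to_feature_target(sample)
```

Each resulting row can then be fed to the category classifier and to the five per-position type classifiers.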
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Classifier Architecture</head><p>The classification pipeline has a tree-like structure with 7 classifiers in total (see Figure <ref type="figure" target="#fig_1">3</ref>). First, the category is classified. Then, depending on the category, the corresponding models are chosen.</p><p>For example, if the category is "resource", the pipeline classifies a question using 5 models reflecting the decisions for "type 1", "type 2", "type 3", "type 4", and "type 5" (cf. Figure <ref type="figure" target="#fig_0">1</ref>). Given the results of these classifiers, the answer type list is created from the computed results (obeying the correct order). As there are only 5 models (one model per list item), the answer type list will contain no more than 5 items; it may contain fewer when a prediction is None.</p></div>
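The two-level, tree-like dispatch can be sketched as follows. The classifier objects here are hypothetical stand-ins with a `predict(question) -> label` interface, not the actual fine-tuned BERT models:

```python
# Minimal sketch of the tree-like classification pipeline (Level 1: category,
# Level 2: literal type or the 5 per-position resource type models).
class ConstantClf:
    """Toy classifier that always returns the same label."""
    def __init__(self, label):
        self.label = label
    def predict(self, question):
        return self.label

def classify(question, category_clf, literal_clf, resource_clfs):
    category = category_clf.predict(question)          # Level 1
    if category == "boolean":                          # no further specialization
        return {"category": "boolean", "type": ["boolean"]}
    if category == "literal":                          # Level 2: one literal model
        return {"category": "literal", "type": [literal_clf.predict(question)]}
    # category == "resource": Level 2, run the 5 per-position models in order
    types = [clf.predict(question) for clf in resource_clfs]
    return {"category": "resource",
            "type": [t for t in types if t is not None]}  # drop None predictions
```

A real pipeline would replace `ConstantClf` with the fine-tuned models; the dispatch logic stays the same.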
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Data Augmentation</head><p>To extend the training data, we used several augmentation strategies for the given dataset: D 1 Machine translation into German, French, Spanish, and Russian is applied to each question. Hence, in total there are 5x more questions (across 5 different languages), resulting in 87,640 questions. As the dataset becomes multilingual, we use a multilingual model. There are two prediction modes for such a dataset: use the original English text, or use the predictions for all languages combined by a majority voting algorithm. D 2 Round-trip translation <ref type="bibr" target="#b0">[1]</ref> (English-German-English, English-Russian-English): in total, there are 3x more questions, and we use a single-language model. The dataset consists of 52,584 questions; D 3 Each question is annotated with its named entities pointing to DBpedia resources, and each named entity is replaced with one of its RDF types. The data is extracted from DBpedia with the help of DBpedia Spotlight<ref type="foot" target="#foot_3">4</ref>. The dataset consists of 163,488 questions.</p><p>Google Cloud Translation<ref type="foot" target="#foot_4">5</ref> was used to translate the data for D 1 and D 2 automatically. </p></div>
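The majority voting mode for D 1 can be sketched in a few lines. The tie-handling rule (first-seen label wins) is our assumption for illustration, as the text does not specify it:

```python
from collections import Counter

# Majority voting over per-language predictions of the multilingual model:
# the most frequent label across the 5 language variants of a question wins.
# Ties go to the label seen first (Counter preserves insertion order).
def majority_vote(labels):
    return Counter(labels).most_common(1)[0][0]
```

For example, if three of the five language variants of a question are classified as "resource", the voted category is "resource" regardless of the other two predictions.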
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Hence, in addition to the original dataset -we call it D 0 -we have created three more datasets (D 1 , D 2 , and D 3 ) that are used to spawn four independent classifier pipelines (C 0 , C 1 , C 2 , and C 3 ). Consequently, the results R Ci of all classifier pipelines C i need to be merged. Figure <ref type="figure" target="#fig_2">4</ref> shows an example of the merging process. The next section gives a detailed description of the process.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Results Merging</head><p>Each classification pipeline -C 0 , C 1 , C 2 , and C 3 -provides a list of classification results. It is reasonable to assume that they also differ in classification quality.</p><p>Hence, we merge the classification results -identified by R C0 , R C1 , R C2 , and R C3 -to establish a final result set R Final as shown in Figure <ref type="figure" target="#fig_3">5</ref>. The merging of R Ci with i ∈ {0, 1, 2, 3} is computed by numerically calculating a weighted rank for each answer type that was predicted by at least one classifier pipeline C i . The rank P An,R Final of an answer type A n in R Final is computed as follows:</p><formula xml:id="formula_1">P An,R Final = Σ m i=0 W i • P An,RC i , where P An,RC i = (rank of A n in R Ci ) if A n ∈ R Ci , and (fallback rank f ) otherwise,</formula><p>and m + 1 is the number of classification pipelines.</p><p>Typically, the quality of Level-1 decisions would be high. However, there also exists a special case where different answer categories are predicted by the classifier pipelines. In this case, we currently follow a static rule-based decision process that favors the more specific predictions: if one classifier pipeline predicted the category boolean, then all other results are discarded. Otherwise, if one classifier pipeline predicts the literal category, then all non-literal categories are discarded.</p></div>
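The weighted-rank merging rule above can be sketched directly in Python. The function signature and the convention that a lower weighted rank is better (rank 1 is the top position) are our assumptions; the formula itself follows the text:

```python
# Sketch of merging ranked answer type lists: P_{A_n,R_Final} = sum_i W_i * P_{A_n,R_Ci},
# where an answer type missing from R_Ci gets the fallback rank f.
def merge_ranked_lists(result_lists, weights, fallback=10):
    all_types = {t for lst in result_lists for t in lst}
    scores = {}
    for t in all_types:
        scores[t] = sum(
            w * ((lst.index(t) + 1) if t in lst else fallback)
            for lst, w in zip(result_lists, weights)
        )
    # lower weighted rank = better, so sort ascending
    return sorted(all_types, key=lambda t: scores[t])
```

With two pipelines ranking "A" and "B" in opposite orders, the pipeline with the larger weight dominates the final ordering, which matches the intuition behind weighting pipelines by their quality.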
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experiments</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Evaluation</head><p>We used the BERT-base-cased and BERT-base-multilingual-cased models <ref type="bibr" target="#b2">[3]</ref> in our classification pipeline. The training data was split into a train set and a validation set.</p><p>The validation set was created by randomly choosing 4,400 questions, and the test set consists of 4,381 questions. The models were fine-tuned on the training set with the following hyperparameters: EPOCHS=2, MAX_LEN=60, BATCH_SIZE=16.</p><p>The training process was performed on GPU resources provided by the Kaggle.com platform (NVIDIA Tesla P100 GPU, 16 GB RAM). The results shown in Table <ref type="table" target="#tab_1">1</ref> enable us to compare the effectiveness of each augmentation technique. The results were obtained on the validation set locally (MV corresponds to the majority voting algorithm, see Section 3.2): </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><table><row><cell></cell><cell>D0</cell><cell>D1</cell><cell>D1+MV</cell><cell>D2</cell><cell>D3</cell></row><row><cell>Accuracy</cell><cell>0.969</cell><cell>0.968</cell><cell>0.962</cell><cell>0.357</cell><cell>0.959</cell></row><row><cell>NDCG@5</cell><cell>0.533</cell><cell>0.704</cell><cell>0.708</cell><cell>0.165</cell><cell>0.363</cell></row><row><cell>NDCG@10</cell><cell>0.499</cell><cell>0.661</cell><cell>0.665</cell><cell>0.140</cell><cell>0.317</cell></row></table><p>The best performing dataset is the multilingual one (D 1 ). The round-trip translation (D 2 ) approach caused overfitting because of the small differences in question forms. The same situation occurred with the named entity annotation approach (D 3 ). The original dataset (D 0 ) showed comparable performance. A detailed analysis of the errors is given in Section 4.2.</p><p>For the final submission, we took only the predictions from the models trained on the original (D 0 ) and the multilingual dataset (D 1 ) into account. We used both prediction modes for the multilingual data: using the multilingual model to predict the answer type of the English questions, and using the same model while retrieving predictions for all 5 languages and taking the majority vote result. The predictions were merged using the algorithm described at the end of the previous section; we evaluated several weight combinations to achieve the highest quality. The evaluation results for the final submission are presented in Table <ref type="table" target="#tab_2">2</ref>. The highest score on the test dataset was achieved with a merged combination of 3 predictions (see the second column of Table <ref type="table" target="#tab_2">2</ref>). We evaluated weight combinations where each weight w i was chosen between 0.0 and 1.0, s.t. the sum of all used weights equals 1.0. The best weight combination found by this process is: 30% D 0 , 30% D 1 , and 40% D 1+MV . The fallback rank f for the merging algorithm was set to 10 (see Subsection 3.3). This combination was submitted as the final solution for the task. As the weights were obtained manually and intuitively, we cannot make a statement about their applicability to other datasets. 
Moreover, these weights may be overfitted to the test set, because the final scores were provided by the organizers on the whole test dataset, without a private/public test split. Hence, the weights were selected according to the public test set results. This is a limitation of our merging approach.</p></div>
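The manual weight search described above can be approximated by a simple grid enumeration over weight triples that sum to 1.0. The 0.1 step size and the scoring callback are assumptions for illustration; the authors state their weights were chosen manually and intuitively:

```python
import itertools

# Hedged sketch of a weight search: enumerate triples (w0, w1, w2) on a
# 0.1 grid with sum 1.0 and keep the combination with the best validation
# score. `score_fn` stands in for the NDCG evaluation on the validation set.
def best_weights(score_fn, step=0.1, n=3):
    grid = [round(step * k, 1) for k in range(int(1 / step) + 1)]
    best, best_score = None, float("-inf")
    for combo in itertools.product(grid, repeat=n):
        if abs(sum(combo) - 1.0) > 1e-9:   # keep only weights summing to 1.0
            continue
        score = score_fn(combo)
        if score > best_score:
            best, best_score = combo, score
    return best
```

On a toy score function peaked at (0.3, 0.3, 0.4), this enumeration recovers the paper's final weight combination.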
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Error analysis</head><p>As we reported in the previous subsection, the approach D 1 outperformed D 2 and D 3 , where the models overfit because of the nearly identical surface forms of the generated questions. The corresponding example of D 2 is given below. We have to recognize that the questions generated using round-trip translation do not differ significantly: the En-De-En variant differs in one word, the absence of one definite article, and a non-capitalized "T" in the last definite article; almost the same is true for the En-Ru-En translation.</p><p>We can assume that round-trip translation via languages that are less popular or more distant from English could possibly resolve this issue.</p><p>For D 3 , each named entity was replaced with one of its RDF types from DBpedia. As a resource in DBpedia may have several types, a question may yield up to several thousand variants corresponding to the combinations of types. There are two major limitations of this approach: a DBpedia resource may contain errors w.r.t. its types, and the Named Entity Linking tool may extract and link entities incorrectly. In the given example, "the Chief Justice of The United States" should be replaced with a single type, while it was replaced with two different types, which is incorrect.</p><p>Among the augmentation approaches, D 1 showed the best performance; an example of its fragment is given below:</p><p>Original: Who replaced Charles Evans Hughes as the Chief Justice of The United States? German: Wer hat Charles Evans Hughes als Oberster Richter der Vereinigten Staaten abgelöst? French: Qui a remplacé Charles Evans Hughes en tant que juge en chef des États-Unis?</p><p>However, despite the augmentation approaches, there is one significant limitation of our prediction approach: each element of the answer type list is predicted independently, and therefore the elements may not be from the same hierarchy. For example, for the question "What is the horse characters of Madame Sans-Gêne play?" 
the predicted answer type list is ["dbo:Person", "dbo:Work"] while the true value is ["dbo:Animal", "dbo:Eukaryote", "dbo:Species"]. Not only is the prediction completely incorrect, it also contains the items "dbo:Person" and "dbo:Work", which are located in different ontology branches (hierarchies).</p><p>Consequently, a mechanism for checking the consistency of the hierarchy should be created. One possible solution is to predict only the most specific answer type and to derive the remaining list items from the actual hierarchy.</p></div>
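The proposed hierarchy-aware fix can be sketched as follows. The hard-coded subclass fragment is a toy stand-in for the DBpedia class tree, used here only to illustrate the idea of expanding the most specific predicted type upward:

```python
# Hedged sketch: predict only the most specific answer type and derive the
# rest of the list by walking up the subclass hierarchy, so all list items
# lie on one branch by construction.
SUPERCLASS = {  # toy fragment of the DBpedia class tree (illustrative only)
    "dbo:Opera": "dbo:MusicalWork",
    "dbo:MusicalWork": "dbo:Work",
    "dbo:Animal": "dbo:Eukaryote",
    "dbo:Eukaryote": "dbo:Species",
}

def expand_to_hierarchy(most_specific, max_len=5):
    chain, current = [], most_specific
    while current is not None and len(chain) < max_len:
        chain.append(current)                 # add the type, then step up
        current = SUPERCLASS.get(current)     # None once the root is reached
    return chain
```

In a real system, the `SUPERCLASS` map would be derived from the DBpedia ontology (e.g., from its rdfs:subClassOf statements) rather than hard-coded.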
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusion</head><p>In this work, we described our solution for the Semantic Answer Type prediction task. The goal was to predict the corresponding answer category and answer types. To solve the task, we created a tree-like classification pipeline and implemented several text augmentation methods described in Section 3.</p><p>The results of our experiments show that the multilingual dataset yields the highest performance compared to the other augmented data. To prepare the final submission, we used the weighted merging algorithm on top of our best predictions (see Section 4).</p><p>Obviously, there is room for improvement. In future work, we would use an ensemble learning approach to merge the results instead of the current static approach. Additionally, we would consider each language classifier independently, assuming that differing translation quality leads to different classification quality. Also, a hierarchy conformance and hierarchy-level validation mechanism might be integrated into the prediction process.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. Tabular representation of the training dataset.</figDesc><graphic coords="2,134.77,491.90,345.83,90.25" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 3 .</head><label>3</label><figDesc>Fig. 3. Tree-like classification pipeline C</figDesc><graphic coords="4,177.99,200.09,259.37,141.37" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Fig. 4 .</head><label>4</label><figDesc>Fig. 4. Example of merging 3 lists with specified weights</figDesc><graphic coords="5,134.77,115.84,345.82,125.49" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Fig. 5 .</head><label>5</label><figDesc>Fig. 5. Overview of the final process.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Original:</head><label></label><figDesc>Who replaced Charles Evans Hughes as the Chief Justice of The United States? En-De-En: Who succeeded Charles Evans Hughes as Chief Justice of the United States? En-Ru-En: Who replaced Charles Evans Hughes as Chief Justice of the United States?</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>The example of D 3</head><label>3</label><figDesc>is given below: Original: Who replaced Charles Evans Hughes as the Chief Justice of The United States? Variant 1: Who replaced DBpedia:Athlete as the DBpedia:Person of The DBpedia:PopulatedPlace? Variant 2: Who replaced DBpedia:Person as the DBpedia:Person of The DBpedia:Country?</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1 .</head><label>1</label><figDesc>Local validation results</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2 .</head><label>2</label><figDesc>Final evaluation results</figDesc><table><row><cell></cell><cell cols="4">.3D0+.3D1+.4D1+MV .5D1+.5D1+MV .3D1+.7D1+MV .7D1+.3D1+MV</cell></row><row><cell>Accuracy</cell><cell>0.976</cell><cell>0.965</cell><cell>0.965</cell><cell>0.972</cell></row><row><cell>NDCG@5</cell><cell>0.762</cell><cell>0.752</cell><cell>0.752</cell><cell>0.759</cell></row><row><cell>NDCG@10</cell><cell>0.725</cell><cell>0.714</cell><cell>0.716</cell><cell>0.722</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://smart-task.github.io/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">http://mappings.dbpedia.org/server/ontology/classes/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">https://www.wikidata.org/wiki/Wikidata:WikiProject_Ontology</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">https://www.dbpedia-spotlight.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">https://cloud.google.com/translate</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">The efficacy of round-trip translation for MT evaluation</title>
		<author>
			<persName><forename type="first">M</forename><surname>Aiken</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Park</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Translation Journal</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="1" to="10" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Hierarchical target type identification for entityoriented queries</title>
		<author>
			<persName><forename type="first">K</forename><surname>Balog</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Neumayer</surname></persName>
		</author>
		<idno type="DOI">10.1145/2396761.2398648</idno>
		<ptr target="https://doi.org/10.1145/2396761.2398648" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 21st ACM international conference on Information and knowledge management</title>
				<meeting>the 21st ACM international conference on Information and knowledge management</meeting>
		<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="2391" to="2394" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">BERT: Pre-training of deep bidirectional transformers for language understanding</title>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">N</forename><surname>Toutanova</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">ArXiv e-prints</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Leveraging question target word features through semantic relation expansion for answer type classification</title>
		<author>
			<persName><forename type="first">T</forename><surname>Hao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Weng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Qu</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.knosys.2017.06.030</idno>
		<ptr target="https://doi.org/10.1016/j.knosys.2017.06.030" />
	</analytic>
	<monogr>
		<title level="j">Knowledge-Based Systems</title>
		<imprint>
			<biblScope unit="volume">133</biblScope>
			<biblScope unit="page" from="43" to="52" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">SeMantic AnsweR Type prediction task (SMART) at ISWC</title>
		<author>
			<persName><forename type="first">N</forename><surname>Mihindukulasooriya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dubey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gliozzo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lehmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">C N</forename><surname>Ngomo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Usbeck</surname></persName>
		</author>
		<idno>CoRR/arXiv abs/2012.00555</idno>
		<ptr target="https://arxiv.org/abs/2012.00555" />
	</analytic>
	<monogr>
		<title level="m">2020 Semantic Web Challenge</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Multi-class hierarchical question classification for multiple choice science exams</title>
		<author>
			<persName><forename type="first">D</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Jansen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Martin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Yadav</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">T</forename><surname>Madabushi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Tafjord</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Clark</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of The 12th Language Resources and Evaluation Conference</title>
				<meeting>The 12th Language Resources and Evaluation Conference</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="5370" to="5382" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
