<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Augmentation-based Answer Type Classification of the SMART dataset</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aleksandr Perevalov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andreas Both</string-name>
<email>andreas.both@hs-anhalt.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Anhalt University of Applied Sciences</institution>
          ,
          <addr-line>Kothen (Anhalt)</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Recent progress in deep learning has enabled AI researchers and developers to invest minimal effort to achieve state-of-the-art results. Specifically, in a task such as text classification, text preprocessing and feature generation no longer play a significant role, thanks to landmark models such as BERT and other related models. In this paper, we present our solution for the Semantic Answer Type prediction task (SMART task). The solution is based on the application of several data augmentation techniques: machine translation to popular languages, round-trip translation, and named entity annotation with linked data. The final submission was generated as a weighted result from several successful system outputs.</p>
      </abstract>
      <kwd-group>
<kwd>Answer type classification</kwd>
        <kwd>Data augmentation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Understanding a question's answer type is one of the significant steps in a
question-answering process [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. With the help of an answer type classifier, a
Question Answering system (QA system) can narrow the answer search space
and filter out inappropriate answer candidates [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        In general, the answer type classification task can be interpreted as a
multiclass text classification task. However, the SMART task [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] proposes a more
complicated structure of the data. There are two class levels: answer category
(resource, literal, boolean) and answer type.
      </p>
      <p>According to the official description of the data¹: If the category is
"resource", answer types are ontology classes from either the DBpedia ontology²
or the Wikidata ontology³. If the category is "literal", answer types are either
"number", "date", or "string". For the category "boolean", no additional
specialization is defined. It is worth mentioning that in this work we concentrate
only on the DBpedia dataset.</p>
      <p>Copyright © 2020 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).
¹ https://smart-task.github.io/
² http://mappings.dbpedia.org/server/ontology/classes/
³ https://www.wikidata.org/wiki/Wikidata:WikiProject_Ontology</p>
      <p>Each \resource" answer type contains a ranked list of the DBpedia
ontology types. All items contained in a list are part of one hierarchy, for
example: ["dbo:Person", "dbo:Agent"] or ["dbo:Opera", "dbo:MusicalWork",
"dbo:Work"]. The most general ontology type is at the end of a list.</p>
      <p>
        The DBpedia dataset contains 21,964 (train - 17,571, test - 4,393) questions.
The evaluation metric for the answer category prediction task is accuracy; the
metric for answer type prediction is lenient NDCG@k with linear decay
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>Our solution focuses on data augmentation techniques. In Section 2 we
describe the dataset in detail. Section 3 describes the data
augmentation methods we used, as well as an algorithm for merging answer
type lists. In Section 4 we show our experimental results and describe the local
evaluation pipeline. Finally, in Section 5 the conclusions are presented.</p>
    </sec>
    <sec id="sec-2">
      <title>Dataset analysis and transformation</title>
      <p>The original dataset is presented in JSON format. To train a model on
the data, it needs to be transformed into a feature-target form.</p>
      <p>In the case of the answer category prediction, the task is trivial: there is just
one target value per question, and it is considered a multi-class classification
task. When predicting an answer type, things are more complicated: we have
to predict a list whose items are ordered according to the level of the taxonomy and
have to match one hierarchy (e.g., dbo:Opera, dbo:MusicalWork, and dbo:Work).
The first constraint does not allow us to consider this task as a multi-class
classification. That is why we decided to treat each item of a list as an individual
target value, so we can train separate models for each of them. We took only the 5
most general types for each question because 95% of the answer type lists'
lengths are not more than this value. The head of the resulting dataset is presented in
Figure 1.</p>
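      <p>To make this transformation concrete, the following minimal sketch (not the authors' published code) builds such a feature-target table from the SMART JSON layout; the file name and the padding of short type lists with None are our assumptions.</p>
      <preformat>
import json
import pandas as pd

MAX_TYPES = 5  # 95% of answer type lists are not longer than this

with open("smarttask_dbpedia_train.json") as f:  # assumed file name
    data = json.load(f)

rows = []
for item in data:
    # The most general type is at the end of the list; keep the 5 most general.
    types = (item.get("type") or [])[-MAX_TYPES:]
    # Pad short lists with None so each list position becomes its own target.
    types += [None] * (MAX_TYPES - len(types))
    rows.append({"question": item["question"],
                 "category": item["category"],
                 **{f"type_{i + 1}": t for i, t in enumerate(types)}})

df = pd.DataFrame(rows)
print(df.head())  # cf. Figure 1
      </preformat>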
      <p>Hence, we consider the solution for the SMART challenge task to be
represented as a two-level architecture where the higher-level decisions activate
lower-level classifiers. Summarizing the analysis of the training data:
– it contains 17,528 questions;
– distribution: 9,573 questions point to resources, 5,156 point to a literal datatype,
and 2,799 are Boolean questions;
– the 95th percentile of the answer type lists' length is 5;
– the maximum number of tokens in a question is 60.</p>
      <p>In Figure 2, the top 10 most common resource answer types are presented. It
shows that all top 10 resource types belong either to dbo:Agent or dbo:Place
or their sub-classes.</p>
    </sec>
    <sec id="sec-3">
      <title>Proposed solution</title>
      <sec id="sec-3-1">
<title>Classifier Architecture</title>
        <p>The classification pipeline was created with a tree-like structure and 7 classifiers
in total (see Figure 3). First, the category is classified. Then, depending on the
category, the corresponding models are chosen.</p>
        <p>For example, if the category is "resource", then the pipeline classifies a
question using 5 models reflecting the decisions for "type 1", "type 2", "type 3",
"type 4", and "type 5" (cf. Figure 1). Given the results of these classifiers, the
answer type list is created from the computed results (obeying the correct
order). As there are only 5 models (one model per list item), the answer type
list will contain no more than 5 items. Sometimes it may contain fewer (when a
prediction is None).
To extend the training data, we used several augmentation strategies for the
given dataset:
D1 Machine translation to German, French, Spanish, and Russian is applied to
each question. Hence, in total there are 5x more questions (separated into 5
different languages), resulting in 87,640 questions. As the dataset has become
a multilingual one, we use a multilingual model. There are two types of
prediction for such a dataset: use the original English text, or use predictions
for all languages and a majority voting algorithm.</p>
        <p>
          D2 Round-trip translation [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] (English-German-English, English-Russian-English):
in total, there are 3x more questions, and we use a single-language model.
The dataset consists of 52,584 questions (see the code sketch below).
        </p>
        <p>D3 Each question is annotated with its named entities pointing to DBpedia
resources; each named entity is replaced with one of its RDF types. The data
is extracted from DBpedia with the help of DBpedia Spotlight⁴. The dataset
consists of 163,488 questions.</p>
        <p>Google Cloud Translation⁵ was used to translate the data for D1 and D2
automatically.
⁴ https://www.dbpedia-spotlight.org/
⁵ https://cloud.google.com/translate</p>
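        <p>For illustration, here is a minimal sketch (not the authors' code) of the round-trip translation augmentation (D2); translate is a hypothetical stand-in for a translation client such as Google Cloud Translation.</p>
        <preformat>
from typing import Callable, List

def round_trip(question: str, pivot: str,
               translate: Callable[[str, str, str], str]) -> str:
    """Translate English to the pivot language and back to English."""
    pivoted = translate(question, "en", pivot)
    return translate(pivoted, pivot, "en")

def augment_d2(questions: List[str],
               translate: Callable[[str, str, str], str],
               pivots=("de", "ru")) -> List[str]:
    """Originals plus two round-trip variants: 3x more questions."""
    augmented = list(questions)
    for pivot in pivots:
        augmented.extend(round_trip(q, pivot, translate) for q in questions)
    return augmented
        </preformat>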
        <p>Hence, in addition to the original dataset, which we call D0, we have created
three more datasets (D1, D2, and D3) that are used to spawn 4 independent
classifier pipelines (C0, C1, C2, and C3). Consequently, the results RCi of all
classifier pipelines Ci need to be merged. Figure 4 shows an example of the merging
process. The next section gives a detailed description of the process.
Each classification pipeline (C0, C1, C2, and C3) provides a list of
classification results. It is reasonable to assume that they also have distinguishable
classification quality.</p>
        <p>Hence, we merge the classification results, identified by $R_{C_0}$, $R_{C_1}$, $R_{C_2}$,
and $R_{C_3}$, to establish a final result set $R_{Final}$ as shown in Figure 5. The merging
of the $R_{C_i}$ with $i \in \{0, 1, 2, 3\}$ is computed by calculating a weighted
rank for each answer type that was predicted by at least one classifier pipeline
$C_i$. The rank $P_{A_n, R_{Final}}$ of an answer type $A_n$ in $R_{Final}$ is computed as
$$P_{A_n, R_{Final}} = \sum_{i=0}^{m} W_i \cdot P_{A_n, R_{C_i}},$$
where $P_{A_n, R_{C_i}}$ is the rank of $A_n$ in $R_{C_i}$ if $A_n \in R_{C_i}$ and the
fallback rank $f$ otherwise, and $m$ is the number of classification pipelines.</p>
        <p>Typically, the quality of Level-1 decisions would be high. However, there also
exists a special case where different answer categories are predicted by the
classifier pipelines. In this case, we currently follow a static rule-based decision
process that favors the more specific predictions, i.e., if one classifier pipeline
predicted the category boolean, then all other results are discarded; else, if
one classifier pipeline predicts the literal category, then all non-literal
categories are discarded.</p>
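        <p>A minimal sketch (not the authors' implementation) of this weighted rank merging follows, assuming each result list is ordered from rank 0 (best) downward:</p>
        <preformat>
from collections import defaultdict
from typing import List

def merge_results(results: List[List[str]], weights: List[float],
                  fallback: float = 10.0) -> List[str]:
    """Merge ranked answer type lists into R_Final by weighted rank."""
    candidates = {t for result in results for t in result}
    scores = defaultdict(float)
    for answer_type in candidates:
        for result, weight in zip(results, weights):
            if answer_type in result:
                rank = result.index(answer_type)
            else:
                rank = fallback  # fallback rank f for missing types
            scores[answer_type] += weight * rank
    # A lower weighted rank means a better position in R_Final.
    return sorted(candidates, key=lambda t: scores[t])

# e.g. merge_results([["dbo:City", "dbo:Place"], ["dbo:Place"]], [0.6, 0.4])
# returns ["dbo:Place", "dbo:City"]
        </preformat>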
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <sec id="sec-4-1">
        <title>Evaluation</title>
        <p>
          We used Bert-base-cased and Bert-base-multilingual-cased models [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] in our
classification pipeline. The training data was split into two sets: a train set and a validation set.
The validation set was created by a random choice of 4,400 questions, and the test
set consists of 4,381 questions. The models were fine-tuned on the training set
with the following hyperparameters: EPOCHS=2, MAX_LEN=60, BATCH_SIZE=16.
        </p>
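        <p>A minimal sketch of such a fine-tuning run (not the authors' training script), using the Hugging Face transformers library with the stated hyperparameters; the placeholder data stands in for the feature-target table of Section 2:</p>
        <preformat>
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertForSequenceClassification, BertTokenizer

EPOCHS, MAX_LEN, BATCH_SIZE = 2, 60, 16

# Placeholder data; in practice taken from the Section 2 table.
train_questions = ["Who is the mayor of Berlin?", "Is the Earth flat?"]
train_labels = [0, 2]  # e.g. 0=resource, 1=literal, 2=boolean

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=3)

enc = tokenizer(train_questions, padding="max_length", truncation=True,
                max_length=MAX_LEN, return_tensors="pt")
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"],
                        torch.tensor(train_labels))
loader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(EPOCHS):
    for input_ids, attention_mask, labels in loader:
        optimizer.zero_grad()
        loss = model(input_ids=input_ids, attention_mask=attention_mask,
                     labels=labels).loss
        loss.backward()
        optimizer.step()
        </preformat>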
        <p>The training process was performed on GPU resources provided by the
Kaggle.com platform (NVIDIA Tesla P100 GPU, 16 GB RAM). The results shown
in Table 1 enable us to compare the effectiveness of each augmentation technique.
The results were obtained on the validation set locally (MV corresponds to the
Majority Voting algorithm, see Section 3.2).</p>
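        <p>The MV step itself can be illustrated with a short sketch (our illustration, assuming one prediction per language variant):</p>
        <preformat>
from collections import Counter
from typing import List

def majority_vote(predictions: List[str]) -> str:
    """Pick the most frequent prediction across the 5 language variants."""
    return Counter(predictions).most_common(1)[0][0]

print(majority_vote(["dbo:Person", "dbo:Person", "dbo:Agent",
                     "dbo:Person", "dbo:Agent"]))  # dbo:Person
        </preformat>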
        <p>The best performing datasets are the multilingual ones (D1). The round-trip
translation (D2) approach caused overfitting because of the small differences in
question forms. The same situation occurred with the named entity annotation
approach (D3). The original dataset (D0) showed comparable performance. A
detailed analysis of the errors is given in Section 4.2.</p>
        <p>For the final analysis, we only took predictions from the models trained on
the original (D0) and the multilingual dataset (D1) into account. We used both
prediction approaches for the multilingual data: using the multilingual model to
predict the answer type of English questions, and using the same model while
retrieving predictions for all 5 languages and taking the majority vote result.
The predictions were merged using the algorithm described at the end of the
previous section; we used several weight combinations to achieve the highest
quality. The evaluation results for the final submission are presented in Table 2.</p>
        <p>The highest score on the test dataset was achieved with a merged combination
of 3 predictions (see the second column of Table 2). We evaluated weight
combinations where each weight w_i was chosen between 0.0 and 1.0, s.t. the
sum of all used weights equals 1.0. The following best weight combination was
found using this process: 30% D0, 30% D1, and 40% D1+MV. The fallback
rank f for the merging algorithm was set to 10 (see Subsection 3.3).
This combination was submitted as the final solution for the task. As the weights
were obtained manually and intuitively, we cannot make a statement about their
applicability to other datasets. Moreover, these weights may be overfitted to
the test set because the final predictions were evaluated by the organizers on
the whole test dataset without private/public test splits. Hence, the weights
were selected according to the public test set results. This is a limitation of our
merging approach.</p>
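        <p>The weights were picked manually; purely for illustration, an exhaustive search over the same constraint (grid weights summing to 1.0) could look as follows, where score is an assumed function returning the lenient NDCG@k of the merged predictions on the validation set:</p>
        <preformat>
from itertools import product

def best_weights(score, step=0.1, n_sources=3):
    """Search weight tuples on a grid that sum to 1.0."""
    grid = [round(step * i, 2) for i in range(int(1 / step) + 1)]
    best_combo, best_score = None, float("-inf")
    for combo in product(grid, repeat=n_sources):
        if abs(sum(combo) - 1.0) > 1e-9:
            continue  # only combinations whose weights sum to 1.0
        s = score(combo)
        if s > best_score:
            best_combo, best_score = combo, s
    return best_combo, best_score
        </preformat>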
      </sec>
      <sec id="sec-4-2">
        <title>Error analysis</title>
        <p>As we reported in the previous subsection, the approach D1 outperformed D2
and D3 due to the model overfitting caused by the nearly identical surface forms of
the obtained questions. A corresponding example for D2 is given below:</p>
        <p>Original: Who replaced Charles Evans Hughes as the Chief Justice of The United States?
En-De-En: Who succeeded Charles Evans Hughes as Chief Justice of the United States?
En-Ru-En: Who replaced Charles Evans Hughes as Chief Justice of the United States?</p>
        <p>Hence, we have to recognize that the questions generated using round-trip
translation do not differ significantly: En-De-En differs in one word, the absence
of a definite article, and the non-capitalized letter "T" in the last definite article;
almost the same is true for the En-Ru-En translation.</p>
        <p>We can assume that round-trip translation via languages that are less common
or more distant from English would possibly resolve this issue.
An example for D3 is given below:</p>
        <p>Original: Who replaced Charles Evans Hughes as the Chief Justice of The United States?
Variant 1: Who replaced DBpedia:Athlete as the DBpedia:Person of The DBpedia:PopulatedPlace?
Variant 2: Who replaced DBpedia:Person as the DBpedia:Person of The DBpedia:Country?</p>
        <p>Each named entity was replaced with its URI's type in DBpedia. Since a
resource in DBpedia may have multiple types, a question can yield up to several
thousand variants corresponding to each combination of the types. There are two major limitations
of this approach: the DBpedia resource may contain errors w.r.t. its type, and the
Named Entity Linking tool may extract and link entities incorrectly. In the given
example, "the Chief Justice of The United States" should be replaced with
a single type, while it was replaced with two different types, which is incorrect.</p>
        <p>The D1 approach showed the best performance; an example fragment of it is given below:</p>
        <p>Original: Who replaced Charles Evans Hughes as the Chief Justice of The United States?
German: Wer hat Charles Evans Hughes als Oberster Richter der Vereinigten Staaten abgelöst?
French: Qui a remplacé Charles Evans Hughes en tant que juge en chef des États-Unis ?</p>
        <p>However, despite the augmentation approaches, there is one significant
limitation of our prediction approach: each element of the answer type list is
predicted independently, and therefore the elements may not come from the same
hierarchy. For example, for the question "What is the horse characters of Madame
Sans-Gêne play?" the predicted answer type list is ["dbo:Person", "dbo:Work"],
while the true value is ["dbo:Animal", "dbo:Eukaryote", "dbo:Species"].
Besides being completely incorrect, the prediction contains the items "dbo:Person" and
"dbo:Work", which are located in different ontology branches (hierarchies).</p>
        <p>Consequently, a mechanism for checking the correctness of the hierarchy
should be created. One possible solution may be to predict only the
most specific answer type and derive the rest of the list from the actual
hierarchy.</p>
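        <p>A minimal sketch of such a correctness check (our illustration, assuming a parent mapping from each DBpedia class to its superclass):</p>
        <preformat>
from typing import Dict, List, Optional

def is_one_hierarchy(types: List[str],
                     parent: Dict[str, Optional[str]]) -> bool:
    """True if each type in the list is followed by one of its ancestors."""
    for child, ancestor in zip(types, types[1:]):
        node = parent.get(child)
        while node is not None and node != ancestor:
            node = parent.get(node)  # walk up the ontology
        if node != ancestor:
            return False
    return True

parent = {"dbo:City": "dbo:Settlement",
          "dbo:Settlement": "dbo:PopulatedPlace",
          "dbo:PopulatedPlace": "dbo:Place", "dbo:Place": None}
assert is_one_hierarchy(["dbo:City", "dbo:Settlement", "dbo:Place"], parent)
assert not is_one_hierarchy(["dbo:City", "dbo:Work"], parent)
        </preformat>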
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this work, we described our solution for the Semantic Answer Type
prediction task. The goal was to predict the corresponding answer category and answer
types. To solve the task, we created a tree-like classification pipeline and
implemented several text augmentation methods described in Section 3.</p>
      <p>The results of our experiments show that the multilingual dataset yields the
highest performance in contrast to the other augmented data. To prepare the
final submission, we used the weighted merging algorithm on top of our best
predictions (see Section 4).</p>
      <p>Obviously, there is room for improvement. In future work, we would use an
ensemble learning approach to merge the results instead of the current static
approach. Additionally, we would consider each language classifier
independently, assuming that differing translation quality leads to different
classification quality. Also, a hierarchy accordance and hierarchy level validation
mechanism might be used in the prediction process.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Aiken, M., Park, M.: The efficacy of round-trip translation for MT evaluation. Translation Journal 14(1), 1–10 (2010)</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. Balog, K., Neumayer, R.: Hierarchical target type identification for entity-oriented queries. In: Proceedings of the 21st ACM International Conference on Information and Knowledge Management. pp. 2391–2394 (2012). https://doi.org/10.1145/2396761.2398648</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.N.: BERT: Pre-training of deep bidirectional transformers for language understanding. ArXiv e-prints (2018)</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. Hao, T., Xie, W., Wu, Q., Weng, H., Qu, Y.: Leveraging question target word features through semantic relation expansion for answer type classification. Knowledge-Based Systems 133, 43–52 (2017). https://doi.org/10.1016/j.knosys.2017.06.030</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. Mihindukulasooriya, N., Dubey, M., Gliozzo, A., Lehmann, J., Ngomo, A.C.N., Usbeck, R.: SeMantic AnsweR Type prediction task (SMART) at ISWC 2020 Semantic Web Challenge. CoRR abs/2012.00555 (2020), https://arxiv.org/abs/2012.00555</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. Xu, D., Jansen, P., Martin, J., Xie, Z., Yadav, V., Madabushi, H.T., Tafjord, O., Clark, P.: Multi-class hierarchical question classification for multiple choice science exams. In: Proceedings of the 12th Language Resources and Evaluation Conference. pp. 5370–5382 (2020)</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>