<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Augmentation-based Answer Type Classification of the SMART dataset</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aleksandr Perevalov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andreas Both</string-name>
<email>andreas.both@hs-anhalt.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Anhalt University of Applied Sciences</institution>
          ,
          <addr-line>Kothen (Anhalt)</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Recent progress in deep learning has enabled AI researchers and developers to invest minimal effort to achieve state-of-the-art results. Specifically, in a task such as text classification, text preprocessing and feature generation no longer play a significant role, thanks to landmark models such as BERT and other related models. In this paper, we present our solution for the Semantic Answer Type prediction task (SMART task). The solution is based on the application of several data augmentation techniques: machine translation to popular languages, round-trip translation, and named entity annotation with linked data. The final submission was generated as a weighted result from several successful system outputs.</p>
      </abstract>
      <kwd-group>
<kwd>Answer type classification</kwd>
        <kwd>Data augmentation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Understanding a question's answer type is one of the significant steps in a
question-answering process [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. With the help of an answer type classifier, a
Question Answering system (QA system) can narrow the answer search space
and filter out inappropriate answer candidates [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        In general, the answer type classification task can be interpreted as a
multiclass text classification task. However, the SMART task [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] proposes a more
complicated structure of the data. There are two class levels: answer category
(resource, literal, boolean) and answer type.
      </p>
      <p>According to the official description of the data¹: If the category is
"resource", answer types are ontology classes from either the DBpedia ontology²
or the Wikidata ontology³. If the category is "literal", answer types are either
"number", "date", or "string". For the category "boolean", no additional
specialization is defined. It is worth mentioning that in this work we concentrate
only on the DBpedia dataset.</p>
      <p>Copyright © 2020 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).
¹ https://smart-task.github.io/
² http://mappings.dbpedia.org/server/ontology/classes/
³ https://www.wikidata.org/wiki/Wikidata:WikiProject_Ontology</p>
      <p>Each \resource" answer type contains a ranked list of the DBpedia
ontology types. All items contained in a list are part of one hierarchy, for
example: ["dbo:Person", "dbo:Agent"] or ["dbo:Opera", "dbo:MusicalWork",
"dbo:Work"]. The most general ontology type is at the end of a list.</p>
      <p>
        The DBpedia dataset contains 21,964 (train - 17,571, test - 4,393) questions.
The evaluation metric for the answer category prediction task is accuracy; the
metric for answer type prediction is lenient NDCG@k with linear decay
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>Our solution focuses on data augmentation techniques. In Section 2 we
describe the dataset in detail. Section 3 describes the data
augmentation methods we used, as well as an algorithm for merging answer
type lists. In Section 4 we show our experimental results and describe the local
evaluation pipeline. Finally, in Section 5 the conclusions are presented.</p>
    </sec>
    <sec id="sec-2">
      <title>Dataset analysis and transformation</title>
      <p>The original dataset is presented in JSON format. To train a model on
the data, it needs to be transformed into a feature-target form.</p>
      <p>In the case of the answer category prediction, the task is trivial: there is just
one target value per question, and it is considered a multi-class classification
task. When predicting an answer type, things are more complicated: we have
to predict a list whose items are ordered according to the level of the taxonomy and
have to match one hierarchy (e.g., dbo:Opera, dbo:MusicalWork, and dbo:Work).
The first constraint does not allow us to consider this task as a multi-class
classification. That is why we decided to treat each item of a list as an individual
target value, so we can train separate models for each of them. We took only the 5
most general types for each question because 95% of the answer type lists'
lengths are not more than this value. The head of the resulting dataset is presented in
Figure 1.</p>
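      <p>To make this transformation concrete, the following minimal sketch (not the authors' published code) builds such a feature-target table from the SMART JSON layout; the file name and the padding of short type lists with None are our assumptions.</p>
      <preformat>
import json
import pandas as pd

MAX_TYPES = 5  # 95% of answer type lists are not longer than this

with open("smarttask_dbpedia_train.json") as f:  # assumed file name
    data = json.load(f)

rows = []
for item in data:
    # The most general type is at the end of the list; keep the 5 most general.
    types = (item.get("type") or [])[-MAX_TYPES:]
    # Pad short lists with None so each list position becomes its own target.
    types += [None] * (MAX_TYPES - len(types))
    rows.append({"question": item["question"],
                 "category": item["category"],
                 **{f"type_{i + 1}": t for i, t in enumerate(types)}})

df = pd.DataFrame(rows)
print(df.head())  # cf. Figure 1
      </preformat>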
      <p>Hence, we consider the solution for the SMART challenge task to be
represented as a two-level architecture where the higher-level decisions activate
lower-level classifiers. Summarizing the analysis of the training data:
– it contains 17,528 questions;
– distribution: 9,573 questions point to resources, 5,156 point to a literal datatype,
and 2,799 are Boolean questions;
– the 95th percentile of the answer type lists' length is 5;
– the maximum number of tokens in a question is 60.</p>
      <p>In Figure 2, the top 10 most common resource answer types are presented. It
shows that all top 10 resource types belong either to dbo:Agent or dbo:Place
or their sub-classes.</p>
    </sec>
    <sec id="sec-3">
      <title>Proposed solution</title>
      <sec id="sec-3-1">
<title>Classifier Architecture</title>
        <p>The classification pipeline was created with a tree-like structure and 7 classifiers
in total (see Figure 3). First, the category is classified. Then, depending on the
category, the corresponding models are chosen.</p>
        <p>For example, if the category is "resource", then the pipeline classifies a
question using 5 models reflecting the decisions for "type 1", "type 2", "type 3",
"type 4", and "type 5" (cf. Figure 1). Given the results of these classifiers, the
answer type list is created from the computed results (obeying the correct
order). As there are only 5 models (one model per list item), the answer type
list will contain no more than 5 items. Sometimes it may contain fewer (when a
prediction is None).
To extend the training data, we used several augmentation strategies for the
given dataset:
D1 Machine translation to German, French, Spanish, and Russian is applied to
each question. Hence, in total there are 5x more questions (separated into 5
different languages), resulting in 87,640 questions. As the dataset has become
a multilingual one, we use a multilingual model. There are two types of
prediction for such a dataset: use the original English text, or use predictions
for all languages and a majority voting algorithm.</p>
        <p>
          D2 Round-trip translation [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] (English-German-English, English-Russian-English):
in total, there are 3x more questions, and we use a single-language model.
The dataset consists of 52,584 questions (see the code sketch below).
        </p>
        <p>D3 Each question is annotated with its named entities pointing to DBpedia
resources; each named entity is replaced with one of its RDF types. The data
is extracted from DBpedia with the help of DBpedia Spotlight⁴. The dataset
consists of 163,488 questions.</p>
        <p>Google Cloud Translation⁵ was used to translate the data for D1 and D2
automatically.
⁴ https://www.dbpedia-spotlight.org/
⁵ https://cloud.google.com/translate</p>
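        <p>For illustration, here is a minimal sketch (not the authors' code) of the round-trip translation augmentation (D2); translate is a hypothetical stand-in for a translation client such as Google Cloud Translation.</p>
        <preformat>
from typing import Callable, List

def round_trip(question: str, pivot: str,
               translate: Callable[[str, str, str], str]) -> str:
    """Translate English to the pivot language and back to English."""
    pivoted = translate(question, "en", pivot)
    return translate(pivoted, pivot, "en")

def augment_d2(questions: List[str],
               translate: Callable[[str, str, str], str],
               pivots=("de", "ru")) -> List[str]:
    """Originals plus two round-trip variants: 3x more questions."""
    augmented = list(questions)
    for pivot in pivots:
        augmented.extend(round_trip(q, pivot, translate) for q in questions)
    return augmented
        </preformat>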
        <p>Hence, in addition to the original dataset, which we call D0, we have created
three more datasets (D1, D2, and D3) that are used to spawn 4 independent
classifier pipelines (C0, C1, C2, and C3). Consequently, the results RCi of all
classifier pipelines Ci need to be merged. Figure 4 shows an example of the merging
process. The next section gives a detailed description of the process.
Each classification pipeline (C0, C1, C2, and C3) provides a list of
classification results. It is reasonable to assume that they also have distinguishable
classification quality.</p>
        <p>Hence, we merge the classification results, identified by $R_{C_0}$, $R_{C_1}$, $R_{C_2}$,
and $R_{C_3}$, to establish a final result set $R_{Final}$ as shown in Figure 5. The merging
of the $R_{C_i}$ with $i \in \{0, 1, 2, 3\}$ is computed by calculating a weighted
rank for each answer type that was predicted by at least one classifier pipeline
$C_i$. The rank $P_{A_n, R_{Final}}$ of an answer type $A_n$ in $R_{Final}$ is computed as
$$P_{A_n, R_{Final}} = \sum_{i=0}^{m} W_i \cdot P_{A_n, R_{C_i}},$$
where $P_{A_n, R_{C_i}}$ is the rank of $A_n$ in $R_{C_i}$ if $A_n \in R_{C_i}$ and the
fallback rank $f$ otherwise, and $m$ is the number of classification pipelines.</p>
        <p>Typically, the quality of Level-1 decisions would be high. However, there also
exists a special case where different answer categories are predicted by the
classifier pipelines. In this case, we currently follow a static rule-based decision
process that favors the more specific predictions, i.e., if one classifier pipeline
predicted the category boolean, then all other results are discarded; else, if
one classifier pipeline predicts the literal category, then all non-literal
categories are discarded.</p>
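        <p>A minimal sketch (not the authors' implementation) of this weighted rank merging follows, assuming each result list is ordered from rank 0 (best) downward:</p>
        <preformat>
from collections import defaultdict
from typing import List

def merge_results(results: List[List[str]], weights: List[float],
                  fallback: float = 10.0) -> List[str]:
    """Merge ranked answer type lists into R_Final by weighted rank."""
    candidates = {t for result in results for t in result}
    scores = defaultdict(float)
    for answer_type in candidates:
        for result, weight in zip(results, weights):
            if answer_type in result:
                rank = result.index(answer_type)
            else:
                rank = fallback  # fallback rank f for missing types
            scores[answer_type] += weight * rank
    # A lower weighted rank means a better position in R_Final.
    return sorted(candidates, key=lambda t: scores[t])

# e.g. merge_results([["dbo:City", "dbo:Place"], ["dbo:Place"]], [0.6, 0.4])
# returns ["dbo:Place", "dbo:City"]
        </preformat>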
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <sec id="sec-4-1">
        <title>Evaluation</title>
        <p>
          We used Bert-base-cased and Bert-base-multilingual-cased models [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] in our
classification pipeline. The training data was split into two sets: a train set and a validation set.
The validation set was created by a random choice of 4,400 questions, and the test
set consists of 4,381 questions. The models were fine-tuned on the training set
with the following hyperparameters: EPOCHS=2, MAX_LEN=60, BATCH_SIZE=16.
        </p>
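        <p>A minimal sketch of such a fine-tuning run (not the authors' training script), using the Hugging Face transformers library with the stated hyperparameters; the placeholder data stands in for the feature-target table of Section 2:</p>
        <preformat>
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertForSequenceClassification, BertTokenizer

EPOCHS, MAX_LEN, BATCH_SIZE = 2, 60, 16

# Placeholder data; in practice taken from the Section 2 table.
train_questions = ["Who is the mayor of Berlin?", "Is the Earth flat?"]
train_labels = [0, 2]  # e.g. 0=resource, 1=literal, 2=boolean

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=3)

enc = tokenizer(train_questions, padding="max_length", truncation=True,
                max_length=MAX_LEN, return_tensors="pt")
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"],
                        torch.tensor(train_labels))
loader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(EPOCHS):
    for input_ids, attention_mask, labels in loader:
        optimizer.zero_grad()
        loss = model(input_ids=input_ids, attention_mask=attention_mask,
                     labels=labels).loss
        loss.backward()
        optimizer.step()
        </preformat>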
        <p>The training process was performed on GPU resources provided by the
Kaggle.com platform (NVIDIA Tesla P100 GPU, 16 GB RAM). The results shown
in Table 1 enable us to compare the effectiveness of each augmentation technique.
The results were obtained on the validation set locally (MV corresponds to the
Majority Voting algorithm, see Section 3.2).</p>
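        <p>The MV step itself can be illustrated with a short sketch (our illustration, assuming one prediction per language variant):</p>
        <preformat>
from collections import Counter
from typing import List

def majority_vote(predictions: List[str]) -> str:
    """Pick the most frequent prediction across the 5 language variants."""
    return Counter(predictions).most_common(1)[0][0]

print(majority_vote(["dbo:Person", "dbo:Person", "dbo:Agent",
                     "dbo:Person", "dbo:Agent"]))  # dbo:Person
        </preformat>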
        <p>The best performing datasets are the multilingual ones (D1). The round-trip
translation (D2) approach caused overfitting because of the small differences in
question forms. The same situation occurred with the named entity annotation
approach (D3). The original dataset (D0) showed comparable performance. A
detailed analysis of the errors is given in Section 4.2.</p>
        <p>For the final analysis, we only took predictions from the models trained on
the original (D0) and the multilingual dataset (D1) into account. We used both
prediction approaches for the multilingual data: using the multilingual model to
predict the answer type of English questions, and using the same model while
retrieving predictions for all 5 languages and taking the majority vote result.
The predictions were merged using the algorithm described at the end of the
previous section; we used several weight combinations to achieve the highest
quality. The evaluation results for the final submission are presented in Table 2.</p>
        <p>The highest score on the test dataset was achieved with a merged combination
of 3 predictions (see the second column of Table 2). We evaluated weight
combinations where each weight w_i was chosen between 0.0 and 1.0, s.t. the
sum of all used weights equals 1.0. The following best weight combination was
found using this process: 30% D0, 30% D1, and 40% D1+MV. The fallback
rank f for the merging algorithm was set to 10 (see Subsection 3.3).
This combination was submitted as the final solution for the task. As the weights
were obtained manually and intuitively, we cannot make a statement about their
applicability to other datasets. Moreover, these weights may be overfitted to
the test set because the final predictions were evaluated by the organizers on
the whole test dataset without private/public test splits. Hence, the weights
were selected according to the public test set results. This is a limitation of our
merging approach.</p>
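        <p>The weights were picked manually; purely for illustration, an exhaustive search over the same constraint (grid weights summing to 1.0) could look as follows, where score is an assumed function returning the lenient NDCG@k of the merged predictions on the validation set:</p>
        <preformat>
from itertools import product

def best_weights(score, step=0.1, n_sources=3):
    """Search weight tuples on a grid that sum to 1.0."""
    grid = [round(step * i, 2) for i in range(int(1 / step) + 1)]
    best_combo, best_score = None, float("-inf")
    for combo in product(grid, repeat=n_sources):
        if abs(sum(combo) - 1.0) > 1e-9:
            continue  # only combinations whose weights sum to 1.0
        s = score(combo)
        if s > best_score:
            best_combo, best_score = combo, s
    return best_combo, best_score
        </preformat>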
      </sec>
      <sec id="sec-4-2">
        <title>Error analysis</title>
        <p>As we reported in the previous subsection, the approach D1 outperformed D2
and D3 due to the model overfitting caused by the nearly identical surface forms of
the obtained questions. A corresponding example for D2 is given below:</p>
        <p>Original: Who replaced Charles Evans Hughes as the Chief Justice of The United States?
En-De-En: Who succeeded Charles Evans Hughes as Chief Justice of the United States?
En-Ru-En: Who replaced Charles Evans Hughes as Chief Justice of the United States?</p>
        <p>Hence, we have to recognize that the questions generated using round-trip
translation do not differ significantly: En-De-En differs in one word, the absence
of a definite article, and the non-capitalized letter "T" in the last definite article;
almost the same is true for the En-Ru-En translation.</p>
        <p>We can assume that round-trip translation via languages that are less common
or more distant from English would possibly resolve this issue.
An example for D3 is given below:</p>
        <p>Original: Who replaced Charles Evans Hughes as the Chief Justice of The United States?
Variant 1: Who replaced DBpedia:Athlete as the DBpedia:Person of The DBpedia:PopulatedPlace?
Variant 2: Who replaced DBpedia:Person as the DBpedia:Person of The DBpedia:Country?</p>
        <p>Each named entity was replaced with its URI's type in DBpedia. Since a
resource in DBpedia may have multiple types, a question can yield up to several
thousand variants corresponding to each combination of the types. There are two major limitations
of this approach: the DBpedia resource may contain errors w.r.t. its type, and the
Named Entity Linking tool may extract and link entities incorrectly. In the given
example, "the Chief Justice of The United States" should be replaced with
a single type, while it was replaced with two different types, which is incorrect.</p>
        <p>The D1 approach showed the best performance; an example fragment of it is given below:</p>
        <p>Original: Who replaced Charles Evans Hughes as the Chief Justice of The United States?
German: Wer hat Charles Evans Hughes als Oberster Richter der Vereinigten Staaten abgelöst?
French: Qui a remplacé Charles Evans Hughes en tant que juge en chef des États-Unis ?</p>
        <p>However, despite the augmentation approaches, there is one significant
limitation of our prediction approach: each element of the answer type list is
predicted independently, and therefore the elements may not come from the same
hierarchy. For example, for the question "What is the horse characters of Madame
Sans-Gêne play?" the predicted answer type list is ["dbo:Person", "dbo:Work"],
while the true value is ["dbo:Animal", "dbo:Eukaryote", "dbo:Species"].
Besides being completely incorrect, the prediction contains the items "dbo:Person" and
"dbo:Work", which are located in different ontology branches (hierarchies).</p>
        <p>Consequently, a mechanism for checking the correctness of the hierarchy
should be created. One possible solution may be to predict only the
most specific answer type and derive the rest of the list from the actual
hierarchy.</p>
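        <p>A minimal sketch of such a correctness check (our illustration, assuming a parent mapping from each DBpedia class to its superclass):</p>
        <preformat>
from typing import Dict, List, Optional

def is_one_hierarchy(types: List[str],
                     parent: Dict[str, Optional[str]]) -> bool:
    """True if each type in the list is followed by one of its ancestors."""
    for child, ancestor in zip(types, types[1:]):
        node = parent.get(child)
        while node is not None and node != ancestor:
            node = parent.get(node)  # walk up the ontology
        if node != ancestor:
            return False
    return True

parent = {"dbo:City": "dbo:Settlement",
          "dbo:Settlement": "dbo:PopulatedPlace",
          "dbo:PopulatedPlace": "dbo:Place", "dbo:Place": None}
assert is_one_hierarchy(["dbo:City", "dbo:Settlement", "dbo:Place"], parent)
assert not is_one_hierarchy(["dbo:City", "dbo:Work"], parent)
        </preformat>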
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this work, we described our solution for the Semantic Answer Type
prediction task. The goal was to predict the corresponding answer category and answer
types. To solve the task, we created a tree-like classification pipeline and
implemented several text augmentation methods described in Section 3.</p>
      <p>The results of our experiments show that the multilingual dataset yields the
highest performance in contrast to the other augmented data. To prepare the
final submission, we used the weighted merging algorithm on top of our best
predictions (see Section 4).</p>
      <p>Obviously, there is room for improvement. In future work, we would use an
ensemble learning approach to merge the results instead of the current static
approach. Additionally, we would consider each language classifier
independently, assuming that differing translation quality leads to different
classification quality. Also, a hierarchy accordance and hierarchy level validation
mechanism might be used in the prediction process.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Aiken, M., Park, M.: The efficacy of round-trip translation for MT evaluation. Translation Journal 14(1), 1–10 (2010)</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. Balog, K., Neumayer, R.: Hierarchical target type identification for entity-oriented queries. In: Proceedings of the 21st ACM International Conference on Information and Knowledge Management. pp. 2391–2394 (2012). https://doi.org/10.1145/2396761.2398648</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.N.: BERT: Pre-training of deep bidirectional transformers for language understanding. ArXiv e-prints (2018)</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. Hao, T., Xie, W., Wu, Q., Weng, H., Qu, Y.: Leveraging question target word features through semantic relation expansion for answer type classification. Knowledge-Based Systems 133, 43–52 (2017). https://doi.org/10.1016/j.knosys.2017.06.030</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. Mihindukulasooriya, N., Dubey, M., Gliozzo, A., Lehmann, J., Ngomo, A.C.N., Usbeck, R.: SeMantic AnsweR Type prediction task (SMART) at ISWC 2020 Semantic Web Challenge. CoRR abs/2012.00555 (2020), https://arxiv.org/abs/2012.00555</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. Xu, D., Jansen, P., Martin, J., Xie, Z., Yadav, V., Madabushi, H.T., Tafjord, O., Clark, P.: Multi-class hierarchical question classification for multiple choice science exams. In: Proceedings of the 12th Language Resources and Evaluation Conference. pp. 5370–5382 (2020)</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>