<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Hierarchical Contextualized Representation Models for Answer Type Prediction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Natthawut Kertkeidkachorn</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rungsiman Nararatwong</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Phuc Nguyen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ikuya Yamada</string-name>
          <email>ikuya@ousia.jp</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hideaki Takeda</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ryutaro Ichise</string-name>
          <email>ichiseg@nii.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Institute of Advanced Industrial Science and Technology</institution>
          ,
          <addr-line>Tokyo 135-0064</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>National Institute of Informatics</institution>
          ,
          <addr-line>Tokyo 101-8430</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Studio Ousia</institution>
          ,
          <addr-line>Tokyo 100-0004</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The SeMantic AnsweR Type prediction (SMART) challenge proposes the task of determining the types of answers to natural language questions. Understanding answer types plays a crucial role in question answering. In this paper, we present Hierarchical Contextualized Representation models, namely HiCoRe, for the SMART task. HiCoRe builds on top of state-of-the-art contextualized models and a hierarchical strategy to deal with the hierarchical answer types. The SMART results show that HiCoRe obtains promising performance for answer type prediction on the DBpedia and Wikidata datasets.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        answers. The answer type comprises three main categories: Boolean, Literal, and
Resource. Boolean has no subtypes, while Literal and Resource
can be classified into fine-grained types. Literal has three fine-grained types:
Number, Date, and String. For Resource, the fine-grained types correspond
to the target ontology. In the SMART dataset, DBpedia [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and Wikidata [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
are selected as the target ontologies. DBpedia contains 760 fine-grained
types, while Wikidata contains more than 50,000. Table
1 illustrates example questions and expected answer types from the DBpedia
and Wikidata ontologies. A question can have multiple answer types. For
example, given the question "Who is the heaviest player of the Chicago Bulls?",
the expected answer types are [dbo:BasketballPlayer,
dbo:Athlete, dbo:Person, dbo:Agent]4.
      </p>
      <p>In this paper, we propose the Hierarchical Contextualized Representation
Models, namely HiCoRe, for the answer type prediction. Our approach utilizes
advanced contextualized word representation models together with the
hierarchical strategy to deal with the hierarchical type of the ontology in the SMART
task.</p>
      <p>The rest of the paper is organized as follows. We describe our approach in
Section 2. In Section 3, the experimental setup and the experimental results are
reported. Related work is discussed in Section 4. In Section 5, we conclude
our work.</p>
    </sec>
    <sec id="sec-2">
      <title>Approach</title>
      <p>The hierarchical structure of an ontology requires a classification method that
recognizes multi-layer labeling, including relations among the labels. Therefore, we
created a stack of groups of classifiers, one group for each level (depth) of the ontology.
Suppose an ontology O consists of classes ci,j ∈ C, where i indicates the level
to which class ci,j belongs, and j denotes each class on the i-th level. A classifier
mi,kn ∈ M, 1 ≤ kn ≤ zn, is responsible for predicting a subset of the classes ci.</p>
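To make the setup concrete, the stack of per-level classifier groups might be organized as in the following sketch (hypothetical code, not the authors' implementation; the lambda classifiers are toy stand-ins for fine-tuned models):

```python
# Hypothetical sketch (not the authors' code) of a stack of per-level
# classifier groups: each classifier m_{i,k} at level i owns a subset of
# that level's classes. Toy lambdas stand in for fine-tuned models.
from collections import defaultdict

class HierarchicalStack:
    def __init__(self):
        # level i -> list of (classifier, classes it is responsible for)
        self.levels = defaultdict(list)

    def add_classifier(self, level, classifier, classes):
        self.levels[level].append((classifier, set(classes)))

    def predict(self, question):
        predictions = []
        for level in sorted(self.levels):            # walk levels top-down
            for classifier, classes in self.levels[level]:
                label = classifier(question)
                if label in classes:                 # only emit owned classes
                    predictions.append((level, label))
        return predictions

stack = HierarchicalStack()
stack.add_classifier(1, lambda q: "dbo:Agent" if "who" in q.lower() else "dbo:Place",
                     ["dbo:Agent", "dbo:Place"])
stack.add_classifier(2, lambda q: "dbo:Person", ["dbo:Person", "dbo:Organisation"])
print(stack.predict("Who founded Intel?"))  # [(1, 'dbo:Agent'), (2, 'dbo:Person')]
```

Each group can hold several classifiers per level, matching the m1,1-style indexing used below.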
      <sec id="sec-2-1">
        <title>4 dbo: is http://dbpedia.org/ontology/</title>
        <p>There may be a single classifier or multiple classifiers at each level, depending on
configuration. The classifiers can also be of the same or different types; they operate
independently and are individually customizable.</p>
        <p>The overall architecture, as shown in Figure 1, is a modular pipeline in which the
ontology and training data flow through the process to train all of the classifiers.
The intuition is for every level to have some very accurate classifiers that are
responsible for a few classes with a large amount of training data, as well as
some less accurate classifiers with more classes to classify or less data to train on.
Since we always know the distribution of the training data with regard to
classes prior to training, we created a filtering function that assigns classes to
the classifiers based on pre-defined thresholds. For example, our thresholds for
the first level of the DBpedia dataset are 400, 100, and 50. With this setting, our
classifier m1,1 classifies the classes dbo:Place, dbo:Agent, and dbo:Work, since each
appears in the training data at least 400 times.</p>
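The threshold-based assignment described above could be sketched as follows (hypothetical helper with made-up class counts; the 400/100/50 thresholds are the first-level DBpedia values from the text):

```python
# Hypothetical sketch of the threshold-based filtering function: classes are
# banded by training frequency, so the first classifier gets the
# best-resourced classes. Counts are made up; thresholds follow the text.
from collections import Counter

def assign_classes(class_counts, thresholds):
    """Map classifier index -> classes; the None key collects leftover
    classes for a default classifier."""
    assignment = {i: [] for i in range(len(thresholds))}
    assignment[None] = []
    for cls, count in class_counts.items():
        for i, t in enumerate(thresholds):
            if count >= t:
                assignment[i].append(cls)
                break
        else:
            assignment[None].append(cls)
    return assignment

counts = Counter({"dbo:Place": 950, "dbo:Agent": 800, "dbo:Work": 420,
                  "dbo:Event": 120, "dbo:Species": 60, "dbo:Device": 10})
bands = assign_classes(counts, thresholds=[400, 100, 50])
print(bands[0])     # ['dbo:Place', 'dbo:Agent', 'dbo:Work'] -> classifier m1,1
print(bands[None])  # ['dbo:Device'] -> left for the default classifier
```

Classes that fall below every threshold end up with the default classifier discussed in the next paragraph.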
        <p>The filtering function uses the ontology to select relevant questions, i.e., those
with at least one answer (class) that the target classifier can classify. Since part
of the data may not satisfy any of the filtering function's conditions, we may
unintentionally ignore a portion of the training data. Thus, there should be
a default classifier that processes the rest of the data if possible, either at every
level or independently. While the training data may overlap among the classifiers
on the same level, resulting in increased training time, this method ensures
that we feed all relevant data to every classifier, thus maximizing accuracy.</p>
        <p>The testing data flow through the pipeline to the classifiers differently than
the training data. During testing, since we can only learn the likely answers from
predictions, we may need the classifiers to perform their tasks sequentially from
the first level to the last for selective testing. This method can speed up the
testing process if we expect the classifiers at lower levels to be less accurate,
due to less training data or more classes to classify, and therefore to rely
on the outcomes from a higher level to make predictions. Alternatively, every
classifier may make predictions on all questions, in which case, at the end of
the entire process, the answer selector chooses final answers based on a predefined
policy.</p>
        <p>At each level, in cases where some questions have multiple same-level answers
that require a combination of classifiers to predict, the same-level classifications
should not be sequential unless constrained by computing resources or other
limitations. On the other hand, sequentially performing classifications may yield
better results if no answers belong to the same level, for the
entire dataset or parts of it, and the classifiers at higher ranks (0, 1, ...) are
better at predicting than those at lower ranks (..., z−1, z). All in all, it is up to
human judgment and experimentation to decide which classifiers to use and how
they should interact with each other.</p>
        <sec id="sec-2-1-1">
          <title>Answer Type Classifier</title>
          <p>
            Multi-class Classification. We fine-tuned Bidirectional Encoder
Representations from Transformers (BERT) [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ] for our classification tasks. BERT performs
outstandingly well as a base model for transfer learning across various NLP
tasks. For sequence classification such as ours, we focused solely on
each sequence's aggregate representation, which corresponds to the first token
([CLS]) of the sequence. In other words, we used BERT to create a vector
representation of each question, then turned it into the input for our downstream
classification task.
          </p>
          <p>Following the instructions described in BERT's original paper, we used BERT's
final hidden vector C ∈ R^H as the sequence representation. The multi-class
classifier consists of a single classification layer with weights W ∈ R^(K×H), where K is
the number of labels. We computed the classification loss as log(softmax(CW^T)).
The loss function restricts the use of a multi-class classifier in our pipeline to
classifications that expect only a single answer, meaning that it is not
suitable for any parts of the pipeline where there can be multiple answers. On the
other hand, any group of consecutive same-level classifiers, where each classifier
expects a single answer, may take advantage of the sequential classification we
mentioned earlier to improve the overall accuracy.</p>
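A minimal numerical illustration of this head (a plain-Python sketch with toy values, not the actual model code): the [CLS] vector C of dimension H is multiplied by W (K×H) to obtain logits, and the loss is the negative log-softmax probability of the gold label (the text writes log(softmax(·)); the standard training loss is its negative).

```python
# Toy illustration of the classification head: logits = C W^T,
# loss = negative log-softmax of the gold label. All values are made up;
# in the real model, C comes from BERT's [CLS] hidden state.
import math

def softmax(logits):
    m = max(logits)                        # stabilize before exponentiating
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def classification_loss(C, W, gold):
    # logits[k] = sum_h C[h] * W[k][h], i.e., C W^T
    logits = [sum(c * w for c, w in zip(C, row)) for row in W]
    return -math.log(softmax(logits)[gold])

C = [0.5, -1.0, 2.0]            # toy [CLS] vector, H = 3
W = [[0.1, 0.2, 0.3],           # K = 2 labels, W is K x H
     [0.4, -0.5, 0.6]]
print(round(classification_loss(C, W, gold=1), 4))
```

Because softmax normalizes across all K labels, exactly one answer is favored, which is why this head cannot emit multiple types.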
          <p>Multi-label Classification. Our multi-label classifier is also a fine-tuned BERT
model, similar to the multi-class classifier. The only difference is its loss function,
in which we use sigmoid(CW^T) instead of softmax to allow the classifier to output
multiple answers. Unlike multi-class classification, multi-label classification should
not be part of selective testing, i.e., sequential classification.</p>
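Assuming the replacement is an element-wise sigmoid over the same logits (a standard choice for multi-label heads), prediction could look like this sketch:

```python
# Sketch of the multi-label variant, assuming an element-wise sigmoid over
# the same logits: each label gets an independent probability, and every
# label above a threshold is emitted. All values are toy examples.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def multilabel_predict(C, W, labels, threshold=0.5):
    logits = [sum(c * w for c, w in zip(C, row)) for row in W]
    return [lab for lab, z in zip(labels, logits) if sigmoid(z) >= threshold]

C = [1.0, 0.5]
W = [[2.0, 1.0],     # dbo:Person  -> logit 2.5
     [1.0, 1.0],     # dbo:Athlete -> logit 1.5
     [-2.0, 0.0]]    # dbo:Place   -> logit -2.0
print(multilabel_predict(C, W, ["dbo:Person", "dbo:Athlete", "dbo:Place"]))
# ['dbo:Person', 'dbo:Athlete']
```

Since each label is scored independently, any number of labels can pass the threshold, which is exactly what multi-type answers require.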
        </sec>
        <sec id="sec-2-1-2">
          <title>Answer Selector</title>
          <p>For the DBpedia dataset, we used the DBpedia Lookup service5 to find the DBpedia
URIs of relevant keywords. We used the Natural Language Toolkit (NLTK)
platform6 for Python to extract nouns and adjectives as the keywords and retrieved
the URIs for post-processing.</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>5 https://wiki.dbpedia.org/lookup</title>
        <p>DBpedia Lookup provides the URIs not only of
keywords in a query but also of similar ones. Using the outputs without
any filtering would likely mix irrelevant answers into the correct ones. Therefore,
we built a filtering function that adds the set of answers for a keyword
returned by the service only if at least one of the answers matches what the models
in the pipeline have predicted.</p>
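The match-based filtering could look like this sketch (hypothetical data; a real DBpedia Lookup response would be parsed into the same keyword-to-types map):

```python
# Hypothetical sketch of the Lookup filtering: a keyword's candidate types
# (as returned by DBpedia Lookup) are kept only if at least one of them
# matches a type the pipeline's models predicted. The data below is made up.
def filter_lookup_answers(lookup_results, predicted):
    """lookup_results: {keyword: set of candidate dbo: types}."""
    selected = set()
    for keyword, candidates in lookup_results.items():
        if candidates & predicted:   # at least one candidate matches a prediction
            selected |= candidates
    return selected

lookup = {
    "player": {"dbo:Athlete", "dbo:Person"},
    "bulls":  {"dbo:Animal"},        # irrelevant Lookup hit, filtered out
}
predicted = {"dbo:Athlete", "dbo:BasketballPlayer"}
print(sorted(filter_lookup_answers(lookup, predicted)))
# ['dbo:Athlete', 'dbo:Person']
```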
        <p>Another post-processing task, for both datasets, is answer selection. We
defined three selection strategies: top-down, bottom-up, and independent.
The top-down strategy prioritizes answers at higher levels: it includes lower-level
answers only if their parents are present. The bottom-up strategy does the
opposite: it traces the branch where the answer belongs up to the top level and adds
all elements on that branch as answers. The independent strategy does not
change the answers.</p>
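Under the assumption that the ontology is available as a child-to-parent map, the top-down and bottom-up strategies might be implemented as follows (illustrative classes; the independent strategy would simply return its input unchanged):

```python
# Sketch of the selection strategies over a child-to-parent map.
def depth(node, parent):
    d = 0
    while parent.get(node) is not None:
        node = parent[node]
        d += 1
    return d

def top_down(answers, parent):
    # keep an answer only if its parent was also selected (roots always kept)
    kept = set()
    for a in sorted(answers, key=lambda x: depth(x, parent)):
        p = parent.get(a)
        if p is None or p in kept:
            kept.add(a)
    return kept

def bottom_up(answers, parent):
    # trace each answer up to the root and add the whole branch
    kept = set()
    for a in answers:
        while a is not None:
            kept.add(a)
            a = parent.get(a)
    return kept

parent = {"dbo:Agent": None, "dbo:Person": "dbo:Agent",
          "dbo:Athlete": "dbo:Person"}
print(sorted(top_down({"dbo:Agent", "dbo:Athlete"}, parent)))  # ['dbo:Agent']
print(sorted(bottom_up({"dbo:Athlete"}, parent)))
# ['dbo:Agent', 'dbo:Athlete', 'dbo:Person']
```

In the example, top-down drops dbo:Athlete because its parent dbo:Person was not predicted, while bottom-up fills in the whole branch above dbo:Athlete.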
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments and Results</title>
      <p>In this section, the experimental setup and results are presented. The details of
the experiments are as follows.</p>
      <sec id="sec-3-1">
        <title>Experimental Setup</title>
        <p>In the experimental setup, we present the datasets, experiment settings, and
evaluation metrics.</p>
        <p>Datasets. The SMART task consists of two datasets: DBpedia and
Wikidata. In the DBpedia dataset, the target ontology is the DBpedia ontology, while in
Wikidata the target ontology is Wikidata. The statistics of the SMART
datasets are listed in Table 2. Since neither dataset provides a validation
set, we randomly selected 10% of the training set of each dataset to construct
a validation set.</p>
        <p>Settings. We experimented with several contextualized models,
including distilbert-base-uncased, bert-base-uncased, bert-large-uncased, roberta-base,
and roberta-large, to train the answer type classifier. We implemented the
contextualized models using the Hugging Face repository7. We manually set
hyper-parameters and tested on the validation set to find a reasonable
configuration. As a result, we set the hyper-parameters as follows: batch size: 16,
learning rate: 5e-5, epochs: 10-45, dropout rate: 0.1.</p>
        <p>Before training, we studied the distributions of the training data with regard to
classes (labels) at each level and found a similar pattern across all levels in both
datasets. As shown in Figure 2, there are generally a few classes with a large
amount of training data, while the rest have only a little. Therefore, for
every level, we created a set of classifiers based on how much information we</p>
        <sec id="sec-3-1-1">
          <title>6 https://www.nltk.org</title>
        </sec>
        <sec id="sec-3-1-2">
          <title>7 https://huggingface.co/models</title>
          <p>have to train them. For DBpedia, we created up to three classifiers per level
with thresholds of 400, 100, and 50, meaning that any class with at least 400
training samples is included in the first classifier, and so on. The thresholds
for Wikidata are 1000, 300, 100, and 50.</p>
          <p>Evaluation Metrics. In the fine-tuning process on the validation set, we
use standard accuracy, F1-macro, and F1-weighted from the sklearn library8 for the
category classification, while only F1-macro and F1-weighted are used to evaluate
the resource types at each level of the ontology hierarchy. We use these
metrics to find the hyper-parameters best suited to each level. Due
to the structure of the ontologies in the datasets, there are five levels in DBpedia
and 11 levels in Wikidata.</p>
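For reference, F1-macro as computed by sklearn's f1_score(average='macro') can be reproduced in a few lines (a pure-Python sketch with toy labels):

```python
# Pure-Python illustration of F1-macro: the unweighted mean of per-class F1.
# sklearn.metrics.f1_score(y_true, y_pred, average='macro') computes the same.
def f1_macro(y_true, y_pred):
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)

y_true = ["boolean", "literal", "resource", "resource"]
y_pred = ["boolean", "resource", "resource", "resource"]
print(round(f1_macro(y_true, y_pred), 3))  # 0.6
```

F1-weighted differs only in averaging the per-class scores weighted by each class's support, which matters here given the skewed class distributions.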
        </sec>
        <sec id="sec-3-1-3">
          <title>8 https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report</title>
          <p>[Table: DBpedia — 0.749, 0.721; Wikidata — Accuracy (Category): 0.96, MRR: 0.59]</p>
          <p>For the final evaluation on the test set, we follow the metrics provided by the
SMART challenge9. In the SMART challenge, the evaluation metrics vary
by dataset. For DBpedia, category accuracy and normalized discounted
cumulative gain (nDCG) are used, with nDCG cut off at 5 (nDCG@5) and 10
(nDCG@10). The evaluation metrics for Wikidata are category accuracy and
mean reciprocal rank (MRR).</p>
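The ranking metrics can be sketched as follows (a binary-relevance simplification; the official SMART evaluation script may compute gains differently, e.g., weighting candidate types by their distance in the hierarchy):

```python
# Binary-relevance sketch of MRR and nDCG@k over a ranked list of types.
import math

def mrr(ranked, gold):
    # reciprocal rank of the first correct type
    for i, t in enumerate(ranked, start=1):
        if t in gold:
            return 1.0 / i
    return 0.0

def ndcg_at_k(ranked, gold, k):
    dcg = sum(1.0 / math.log2(i + 1)
              for i, t in enumerate(ranked[:k], start=1) if t in gold)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(gold), k) + 1))
    return dcg / ideal if ideal else 0.0

ranked = ["dbo:Place", "dbo:Person", "dbo:Agent"]
gold = {"dbo:Person", "dbo:Agent"}
print(mrr(ranked, gold))                      # 0.5
print(round(ndcg_at_k(ranked, gold, 5), 3))   # 0.693
```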
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Results</title>
      <p>Answer type classification could be viewed as entity type classification, where
the answer to the question is given as the entity. There are many research</p>
        <sec id="sec-3-2-1">
          <title>9 https://smart-task.github.io</title>
          <p>
            works [
            <xref ref-type="bibr" rid="ref1 ref7 ref8">1, 7, 8</xref>
            ] related to entity typing in the NLP community. Nevertheless, the
SMART dataset does not provide the answers to the questions. Therefore,
predicting the answer type is much more challenging than conventional entity
type classification due to the absence of the answer entity. There is one study
investigating answer type prediction in the same setting as the SMART dataset. In
that study [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ], a type matcher is applied to the question to obtain attention words
for building a classifier based on syntactic structure features. Nonetheless,
that work does not consider the hierarchical structure of answer types.
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>In this paper, we introduced a novel method using hierarchical contextualized
representation models, named HiCoRe, for answer type prediction. HiCoRe adopts
state-of-the-art contextualized word representations together with a
hierarchical strategy to deal with answer type prediction. In HiCoRe, we investigated
a variety of BERT classifiers, which can be configured at each hierarchical
level. By fine-tuning BERT-based models in HiCoRe, we reached promising
results on the SMART dataset. Future improvements may include data
augmentation and question-answer generation for training, especially for classes with few
examples. The source code is available at https://github.com/rungsiman/smart.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Abhishek</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anand</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Awekar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Fine-grained entity type classification by jointly learning representations and label embeddings</article-title>
          . pp.
          <volume>797</volume>
          –
          <fpage>807</fpage>
          . Association for Computational Linguistics, Valencia,
          <source>Spain (Apr</source>
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kobilarov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cyganiak</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ives</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Dbpedia: A nucleus for a web of open data</article-title>
          .
          <source>In: The semantic web</source>
          , pp.
          <volume>722</volume>
          –
          <fpage>735</fpage>
          . Springer (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bogatyy</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Predicting answer types for question-answering</article-title>
          . https://cs224d.stanford.edu/reports/Bogatyy.pdf, accessed:
          <fpage>2020</fpage>
          -09-25
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          : BERT:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>In: Proceedings of the 2019 Conference of NAACL-HLT</source>
          . pp.
          <volume>4171</volume>
          –
          <issue>4186</issue>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Mihindukulasooriya</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dubey</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gliozzo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ngomo</surname>
            ,
            <given-names>A.C.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Usbeck</surname>
          </string-name>
          , R.:
          <article-title>SeMantic AnsweR Type prediction task (SMART) at ISWC 2020 Semantic Web Challenge</article-title>
          . CoRR abs/2012.00555 (
          <year>2020</year>
          ), https://arxiv.org/abs/2012.00555
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Vrandečić</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , Krotzsch, M.:
          <article-title>Wikidata: a free collaborative knowledgebase</article-title>
          .
          <source>Communications of the ACM</source>
          <volume>57</volume>
          (
          <issue>10</issue>
          ),
          <volume>78</volume>
          –
          <fpage>85</fpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barbosa</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Neural fine-grained entity type classification with hierarchy-aware loss</article-title>
          .
          <source>In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long Papers). pp.
          <volume>16</volume>
          –
          <fpage>25</fpage>
          . Association for Computational Linguistics, New Orleans,
          <source>Louisiana (Jun</source>
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Yogatama</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gillick</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lazic</surname>
          </string-name>
          , N.:
          <article-title>Embedding methods for fine grained entity type classification</article-title>
          .
          <source>In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)</source>
          . pp.
          <volume>291</volume>
          –
          <issue>296</issue>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>