Reaching out for the Answer: Answer Type Prediction

Kanchan Shivashankar, Khaoula Benmaarouf, and Nadine Steinmetz
Technische Universität Ilmenau, Germany
firstname.lastname@tu-ilmenau.de

Abstract. Natural language is complex, and similar statements can be expressed in various ways. Understanding natural language is therefore an ongoing challenge in the field of Question Answering. To get from the question to the answer, the QA pipeline can be broken down into several sub-steps, such as predicting the potential type of the answer, entity linking, and detecting references to properties relevant for the formal query. The SMART Task challenge, co-located with ISWC 2021, focused on two of these sub-tasks: relation linking and answer type prediction. With this paper, we present our approach for the latter task. Our solution is a two-staged process combining two separate multi-label classification tasks. For the answer category prediction, we utilize the RoBERTa language model. Questions predicted as resource questions are then further classified regarding the concrete answer types – a list of ontology classes – utilizing the BERT language model.

Keywords: Question Answering · Answer Type Prediction · Semantic Web

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

The discipline of Question Answering (QA) is a very popular research domain. It combines the fields of Information Retrieval (IR) and Natural Language Processing (NLP). The objective is to answer questions posed by humans in natural language (NL) by extracting relevant information from available knowledge bases. Its popularity arises from the increasing demand to find relevant and useful information in huge amounts of data. This might seem like a very simple problem at first glance: for humans, language is an integral part of life, and we use it daily with ease. However, teaching a computer to understand and interact in NL is very complex. There are several aspects to understanding natural language. It depends not only on syntax and semantics; we also rely on context, tone, body language and other cues to fully understand an utterance. And despite that, we as humans are prone to misinterpretations and misunderstandings. Training a machine to understand and interact with only written text is therefore a real challenge and leaves little margin for error. Additionally, once the text is correctly interpreted, extracting the relevant pieces from the vast amount of available information is like trying to find a needle in a haystack.

In general, there are two parts to solving this problem. The first part is to break down the question to extract its meaning. This also includes identifying the expected answer type for the question, which is the problem addressed in this paper. This transformation is performed using different tools from NLP. The second part is to use the extracted meaning to build a query. The query is used to find answers in the data source, i.e. the knowledge base. The extracted answers are then sorted using ranking algorithms to prioritize the most accurate and relevant answer for the question. The scope of this paper is the prediction of answer types for a given question. The data source from which the answer must be retrieved can be structured data such as knowledge bases or unstructured data in the form of documents.
In this paper, we are dealing with two popular KBs: Wikidata and DBpedia. We present our solution developed in the context of the SMART challenge (co-located with ISWC 2021) – specifically Task 1: Answer Type Prediction. We took part in the challenge for both ontologies, Wikidata and DBpedia [13], which both provide large and distinct sets of answer type classes. The prediction task helps retrieve more accurate and relevant information, but is challenging due to the highly granular nature of the available classes (Wikidata: 50k classes, DBpedia: 760 classes).

We propose a two-step approach to solve this problem. First, a classification model based on RoBERTa is used at the higher level to predict the answer category. Second, to predict the more granular resource types, we use a second multi-label classification based on BERT.

The remainder of the paper is organized as follows: In Section 2 we give an overview of the evolution of the field of Question Answering, knowledge bases, and state-of-the-art methods. Section 3 describes the characteristics of the datasets used. Section 4 introduces our approach. This is followed by a detailed evaluation in Section 5. We conclude with some remarks and the scope for future work.

2 Related Work

"Can machines think?" This is one of the most popular questions of the 20th century, posed by the renowned mathematician Alan Turing in his paper "Computing Machinery and Intelligence" [17]. This paper led to the birth of what we today call Artificial Intelligence (AI). In his paper, the author proposes a simple yet complex problem – the Turing Test. While the rules of the test are simple, building a machine that passes it is quite complex. The test – simply put – requires a machine to interpret and generate natural language. This is the main objective of all tasks in Natural Language Processing (NLP). NLP is one of the most widely researched fields in AI. From CAPTCHAs and chatbots to voice-activated virtual assistants – all of these are applications of NLP.

Question Answering is an advanced form of Information Retrieval (IR) that helps extract short and concise answers from a pool of information in order to answer an NL question. There are many architectures for QA systems, but they typically consist of three main parts: question processing, document processing and answer processing. The question processing step involves extracting important information from the question, such as keywords and the answer type. In document processing, a subset of relevant information (for example paragraphs) is selected which could potentially contain the answer. The final stage of answer processing performs a detailed analysis of the information extracted in the previous stage in order to locate and extract the answer [11].

Obviously, approaches have evolved over time. The most basic is the linguistic approach, breaking down the question with different text processing techniques such as tokenization, POS tagging and parsing to build a query. Easy access to and availability of huge amounts of data made statistical approaches advantageous; they outperformed traditional linguistic approaches. Another important approach is pattern matching, ideally used in settings that require less computational power and rely on predefined templates or formats for question answering. Both linguistic and statistical approaches are suitable only for closed-domain applications [14].
Knowledge Bases (KBs) represent real-world entities as nodes and relations as edges. Each directed edge, together with its head and tail entity, constitutes a triple, also called a fact. These facts can be accessed through the query language SPARQL; however, it requires expert knowledge and is difficult to learn for non-programmers. KBQA systems give lay persons access to this information by letting them pose a natural language question. The question is internally converted into a SPARQL query, which retrieves the answer, all without requiring knowledge of the underlying schema. The underlying process comprises two main steps, namely Entity (subject or object) Linking and Relation (predicate) Linking. This forms the main pipeline of a KBQA system [2].

Several existing methods for KBQA directly parse the natural language question into a structured query using semantic parsing, including syntactic parsing, semantic role labeling etc. Some of them map the natural language question to a triple-based representation, but they fail to capture the original semantic structure of the question using triples [7]. Keyword-based methods use keywords in the question and map them to predicates by keyword matching. They are successful in answering simple questions but need to be extended to complex questions. Keyword matching is difficult because a predicate can have many surface representations in natural language. This can be solved to an extent by generating alternative queries, but results in many candidate answers to choose from [6].

QA systems can be based on different types of KBs: semantic knowledge graphs (as is the case in this paper), relational databases or simple document collections [1]. A Knowledge Base Question Answering (KBQA) system is equipped to answer questions on the stored knowledge. With the evolution of KBs and NLP, today the challenge is to answer complex questions with multiple relations amongst multiple subjects. This involves multi-hop reasoning, constrained relations, numerical operations or a combination of the above [4].

Some of the state-of-the-art solutions consist of complex end-to-end neural network models performing well on various NLP tasks. But they require a lot of computational power and training time, and they limit the opportunity to fully understand the underlying problem. The SMART challenge took place for the second time in 2021. The previous edition was co-located with ISWC 2020 and lists eight participants for the answer type prediction task [12]. The best approach used BERT (Bidirectional Encoder Representations from Transformers) and achieved an accuracy of more than 95% for category prediction, more than 70% accuracy for DBpedia and 68% accuracy for Wikidata resource type predictions [16], proving the performance of neural network models in KBQA systems. But these systems fail to understand the nuances of NL beyond the training data.

Although statistical approaches seem like the best option, the complexity of language and limits on computational power require a combination of both. A hybrid approach combines analytical and statistical methods to build adaptable systems with better performance. Semantic parsing aims at representing the semantic structure of the question, and neural networks have shown promising results in KBQA. Combining semantic parsing with neural networks has shown promising results on several KBQA datasets [15, 10].
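To make the notion of facts and their retrieval concrete, the following minimal sketch runs a hand-written SPARQL query against the public DBpedia endpoint using the SPARQLWrapper library. The example question and query are illustrative assumptions and not part of our pipeline; a KBQA system would generate such a query automatically from the NL question.

```python
# Minimal sketch: retrieving a fact (triple) from DBpedia via SPARQL.
# The query corresponds to the hypothetical question
# "Who is the author of The Hobbit?" and is written by hand here.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://dbpedia.org/sparql")
endpoint.setQuery("""
    SELECT ?author WHERE {
        dbr:The_Hobbit dbo:author ?author .
    }
""")
endpoint.setReturnFormat(JSON)

results = endpoint.query().convert()
for binding in results["results"]["bindings"]:
    # prints the object of the matching triple, e.g. the resource for J. R. R. Tolkien
    print(binding["author"]["value"])
```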
Answering complex questions using knowledge bases is an open and challenging task, and there are many challenges to overcome. Multiple predicates are required to constrain and query the answer set. Each KB has a unique underlying ontological structure. The nuances of natural language and complex questions consisting of multiple entity-relation pairs introduce additional challenges. There may also exist multiple entities in the knowledge base with the same name, which creates ambiguity.

3 Data Analysis

Data analysis is an essential step to get to know the data. It provides insight into the quality and features of the data and helps to decide which pre-processing steps to apply. In machine learning, the quality and quantity of the dataset (train and test) both determine how well the model performs. High quality of the training data improves the model's learning curve. Supervised learning also requires that the data is labeled accurately. Failure to remove or correct noisy data results in wrong predictions. Examples of noisy data are missing or null values, duplicate data, mislabeled data etc.

The datasets are based on two different knowledge bases, Wikidata and DBpedia, with different ontologies. Wikipedia is a large collection of data across multiple disciplines and languages. There is great potential for this data to be searched, analyzed and reused, but it cannot easily be accessed, queried or downloaded due to the lack of structured data management. Wikidata aims to overcome these inconsistencies by creating new ways for Wikipedia to manage its data on a global scale [3]. The DBpedia community project also extracts structured, multilingual knowledge from Wikipedia infoboxes and tabular summaries and makes it freely available on the Web using Semantic Web and Linked Data technologies [8]. The structured data can be used to query information from Wikipedia content.

The datasets are provided in JSON format. The training data fields include (a) a unique ID, (b) the question, (c) the answer category, and (d) the answer type. The test datasets contain only fields (a) and (b). There are three different answer categories: boolean, literal, and resource. Literal has three different answer types: number, date and string. The answer type for the category boolean is also a boolean. The answer category resource requires a list of classes from the respective ontology: 760 classes for DBpedia and 50k classes for Wikidata.

We performed some pre-processing and analyses on the datasets. Firstly, to have more data for training the model, we merged the training datasets of the SMART 2020 and SMART 2021 challenges. The combined dataset is checked for null and duplicate values, which are removed. Table 1 provides details on the training and test datasets.

Table 1: Statistics on SMART Task training and test datasets.
Knowledge Base | SMART 2020 | SMART 2021 | Total Training Data | Test Data
Wikidata       | 18,251     | 43,554     | 46,713              | 10,864
DBpedia        | 17,571     | 36,670     | 39,508              | 10,093

Fig. 1: Category distribution for (a) Wikidata and (b) DBpedia.

In addition, we also identified the distribution of the different answer categories and types present in the dataset. The most commonly occurring category is resource, with around 80% of the questions belonging to this category. We also found erroneous labels for the category and type fields in the Wikidata training dataset. As shown in Figure 1, there are a few records with multiple different types for the literal category, e.g.
number and string assigned to the same record.

Fig. 2: Commonly occurring resource type labels for (a) Wikidata and (b) DBpedia.

The top 10 most commonly occurring resource types for both datasets are shown in Figure 2. The Wikidata training dataset contains mostly questions about persons, with locations in second place; for the DBpedia dataset it is the other way around. The DBpedia training dataset contains 380 different classes of the DBpedia ontology, of which 62 classes occur only once in the dataset. For the Wikidata dataset these numbers are 3,311 and 1,027, respectively.

4 Method

Our prediction approach consists of two parts: category prediction and resource type prediction. Both parts are multi-label classification problems, but with different numbers of class labels. Figure 3 depicts our overall process.

Fig. 3: The overall prediction process in two stages.

4.1 Category Prediction

The label categories are split into five main classes: resource, literal-number, literal-string, literal-date, and boolean. Thus, this part is treated as a classification problem with five different classes. RoBERTa (Robustly optimized BERT Pretraining Approach) is an optimized version of the BERT model with some modifications to key hyper-parameters [9]. We use the roberta-base pre-trained model from the transformers library for both the tokenizer and the sequence classification model. Transformers is a Hugging Face library for NLP with state-of-the-art general purpose architectures and pre-trained models readily available for use. The model is pre-trained on a large corpus of English text and is case-sensitive.

Pre-processing. BERT can take up to two sentences as input; the second sentence is used in case of a sentence-pair classification task, and both sentences are separated by special tokens. The input data for training the classification model contains the input text, in our case the question, as well as the expected label, i.e. the answer category. The model expects a numerical representation of the textual input. The first step is tokenization, where the text is split into a sequence of words. Some words are further split into sub-words or characters, due to the limited number of words that can be represented in the vocabulary. These tokens are then replaced by their indices in the vocabulary. We use a pre-trained tokenizer for our model.

The RoBERTa tokenizer is derived from the GPT-2 tokenizer and uses byte-level byte-pair encoding, which is a form of sub-word tokenization. It has been trained to treat spaces as parts of the tokens, so a word is encoded differently depending on whether it occurs at the beginning of the sentence (without a preceding space) or not. Words not present in the vocabulary are marked as unknown tokens. The batch encoding method of the RoBERTa tokenizer takes the question in textual format and returns two outputs: input IDs and attention masks. The input IDs are a list of vocabulary indices for the tokens, with special tokens (CLS and SEP) indicating the beginning and the end of the input to the model. The input questions are also padded to a uniform length across all questions. The attention mask consists of a sequence of 1s (text tokens) and 0s (padding) to distinguish the actual text from the padded sequence for the BERT model. All questions are padded to the maximum length of 64. This step is repeated for the test data questions.
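The following minimal sketch illustrates this encoding step with the Hugging Face transformers library. The example questions and variable names are our own illustrations, but the batch encoding call, the padding to a maximum length of 64, and the returned input IDs and attention masks correspond to the procedure described above.

```python
# Minimal sketch of the question encoding step, assuming the
# Hugging Face transformers library is installed.
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

questions = [
    "Who is the author of The Hobbit?",   # made-up example questions
    "When was the Eiffel Tower built?",
]

# Batch-encode: returns input IDs (vocabulary indices including the
# special start/end tokens) and attention masks (1 = text, 0 = padding).
encoded = tokenizer(
    questions,
    padding="max_length",   # pad every question to the same length
    truncation=True,
    max_length=64,          # maximum question length used in our setup
    return_tensors="pt",
)

input_ids = encoded["input_ids"]            # shape: (batch, 64)
attention_mask = encoded["attention_mask"]  # shape: (batch, 64)
```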
Label encoding is performed to assign a unique integer to each category label. The boolean and resource category names remain unchanged. We flatten the literal category by combining it with its types (i.e. number, string and date) to produce new labels (literal-number, literal-string, literal-date). The five category labels are stored in a new column of the data frame.

Training. The RoBERTa sequence classification model is built on the BERT base architecture with 12 layers, a hidden size of 768, 12 attention heads, and 125M parameters. The classification model is defined with the number of classes required for the classification task, in our case five. The training dataset (tokenized questions and encoded labels) is split in a ratio of 80:20 for training and validation. The classifier model is fine-tuned on the training data for 4 epochs, using a batch size of 32 and a learning rate of 2e-5, with the Adam optimizer. Accuracy is used as the validation metric and categorical cross-entropy as the loss function. For both Wikidata and DBpedia, we achieved a training accuracy of 98% and a validation accuracy of 97%.

4.2 Resource Type Prediction

In case the category prediction assigns the resource category to a question, the following steps are applied to predict the concrete answer types from the respective ontology.

Pre-processing. For the classification processes for DBpedia and Wikidata, we utilize the same pre-processing pipeline except for the first step: for DBpedia, we remove the prefix from the class labels, then apply lowercasing and resolve the camel case format. For instance, dbo:TelevisionShow is transformed to television show. The Wikidata type labels remain unprocessed.

To extend the training data, several augmentation strategies might be applied. For our approach, we utilized the Copy-Paste method three times. Thus, the sample size is increased to 102,612 records for DBpedia and to 72,336 records for Wikidata. The questions and class labels are assigned to blocks. We remove duplicates inside each block and create an index mapping between the position of a label in the vocabulary and its block index. We add class labels from the vocabulary that do not occur in the training data. Then, the labels are randomly merged into the sentence. The syntactic order of the questions is shuffled in three different orders and recombined. We maintain lists consisting of question, length, block, and relations. The training data Q is appended to these lists separately, the question list is split into Q1 and Q2, and the subsequent data is appended randomly to the split question lists. Duplicates are removed again. For the tokenization step, we utilized the FastAiBertTokenizer within the fastai library as described in [5].

Training. We utilized the pre-trained model BERT-base-uncased, which is not case-sensitive and has a much smaller number of layers than the BERT-large model. As loss function we use binary cross entropy with logits, which applies a sigmoid activation to the model outputs before computing the binary cross entropy, so that each output is mapped to a value between 0 and 1 and every class is predicted independently. We achieved higher accuracies using binary cross entropy compared to multi-label cross entropy. The maximum sequence length was set to 256, with a batch size of 32, and the model is trained for 20 iterations. In terms of validation, we evaluated three different methods: a static 80/20 split, a static 99/1 split, and k-fold cross validation with 3 folds.
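As an illustration of this multi-label setup, the sketch below assembles a BERT-base-uncased encoder with a binary-cross-entropy-with-logits output layer, one logit per ontology class, in plain PyTorch and transformers. Note that our actual implementation uses the fastai library [5]; the class name, training step and example question here are illustrative assumptions, while the uncased model, the loss function and the maximum sequence length of 256 follow the description above.

```python
# Illustrative sketch (not our fastai-based implementation) of the
# multi-label resource type classifier described above.
import torch
from torch import nn
from transformers import BertModel, BertTokenizer

NUM_TYPES = 760   # e.g. number of DBpedia ontology classes
MAX_LEN = 256     # maximum sequence length used in our setup

class ResourceTypeClassifier(nn.Module):
    def __init__(self, num_types: int):
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-base-uncased")
        self.head = nn.Linear(self.encoder.config.hidden_size, num_types)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Use the pooled [CLS] representation; return one raw logit per type.
        return self.head(out.pooler_output)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = ResourceTypeClassifier(NUM_TYPES)

# BCEWithLogitsLoss applies a sigmoid to each logit and computes binary
# cross entropy per label, so each ontology class is predicted independently.
criterion = nn.BCEWithLogitsLoss()

batch = tokenizer(["Which television shows were created by John Cleese?"],
                  padding="max_length", truncation=True,
                  max_length=MAX_LEN, return_tensors="pt")
targets = torch.zeros(1, NUM_TYPES)   # multi-hot vector of gold types (dummy here)

logits = model(batch["input_ids"], batch["attention_mask"])
loss = criterion(logits, targets)
loss.backward()   # one illustrative training step; optimizer omitted
```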
The results achieved are shown in the next section.

5 Evaluation

In this section, we present the evaluation results for our approach at the SMART challenge edition of ISWC 2021. The following metrics have been used for the evaluation:

– Accuracy: for category prediction, the percentage of questions classified into the correct category.
– Lenient NDCG@k: for DBpedia type prediction, a normalized discounted cumulative gain in which the gain of a predicted type is based on its distance to the most specific type of the answer.
– MRR: for Wikidata type prediction, the Reciprocal Rank (RR) is the reciprocal of the rank at which the first relevant type was retrieved; averaged over all questions, this gives the Mean Reciprocal Rank.

The best results achieved by our two-staged approach are shown in Tables 2 and 3. The category prediction achieved an accuracy of 0.991 for both datasets, which is the best among all participants. As stated in the previous section, we evaluated different combinations of validation approaches and our data augmentation strategy. For DBpedia, we achieved the best results using the static 80/20 split for validation and the data augmentation strategy as described in Section 4.2. For Wikidata, data augmentation did not improve the results, therefore we skipped it. However, cross validation using three folds achieved the best results compared to the static 80/20 and 99/1 splits.

Table 2: Results of our approach and the best competitor at the SMART task 2021 challenge for DBpedia.
               | category accuracy | NDCG@5 | NDCG@10
our approach   | 0.991             | 0.734  | 0.658
best competitor| 0.984             | 0.842  | 0.854

Table 3: Results of our approach and the best competitor at the SMART task 2021 challenge for Wikidata.
               | category accuracy | MRR
our approach   | 0.991             | 0.45
best competitor| 0.98              | 0.7

6 Conclusion

In this paper, we presented our solution for the answer type prediction task as proposed by the SMART challenge co-located with ISWC 2021. We used two different multi-label classification approaches for the category prediction and the type prediction for resource questions. Our results show that there is a significant difference in the necessary handling of the DBpedia and Wikidata datasets. Obviously, the Wikidata dataset is a lot harder to train on and to achieve good results with. As there is potential for improvement regarding our approach, we consider more intelligent data augmentation processes. Both datasets are very imbalanced regarding the distribution of type assignments, especially for specific ontology classes. Therefore, it might be helpful to support the training process with more balanced data. Obviously, the difficult part of the challenge is the resource questions and the prediction of ontology classes. Future work will therefore include an investigation of whether the prediction of classes can be performed independently of the requested ontology. Questions with similar type lists from both datasets could be consolidated – which would require a class mapping between the ontologies – and the training process could then benefit from a larger number of different variants of resource questions with similar answer types.

7 Acknowledgements

This work was partially funded by the German Research Foundation (DFG) under grants no. SA 782/30-1 and STE 3033/1-1.

References

1. Barlas, I., Ginart, A., Dorrity, J.L.: Self-evolution in knowledge bases. In: IEEE Autotestcon, 2005. pp. 325–331. IEEE (2005)
2. Buzaaba, H., Amagasa, T.: Question answering over knowledge base: a scheme for integrating subject and the identified relation to answer simple questions. SN Computer Science 2(1), 1–13 (2021)
3. Vrandečić, D., Krötzsch, M.: Wikidata: a free collaborative knowledgebase (2014)
4. Fu, B., Qiu, Y., Tang, C., Li, Y., Yu, H., Sun, J.: A survey on complex question answering over knowledge base: Recent advances and challenges (2020)
5. Howard, J., Gugger, S.: fastai: A layered API for deep learning. CoRR abs/2002.04688 (2020), https://arxiv.org/abs/2002.04688
6. Jang, H., Oh, Y., Jin, S., Jung, H., Kong, H., Lee, D., Jeon, D., Kim, W.: KBQA: Constructing structured query graph from keyword query for semantic search. In: Proceedings of the International Conference on Electronic Commerce. ICEC '17, Association for Computing Machinery, New York, NY, USA (2017). https://doi.org/10.1145/3154943.3154955
7. Kaufmann, E., Bernstein, A., Fischer, L.: NLP-Reduce: A naive but domain-independent natural language interface for querying ontologies. In: 4th European Semantic Web Conference (ESWC). pp. 1–2. Springer, Berlin (2007)
8. Lehmann, J., et al.: DBpedia – a large-scale, multilingual knowledge base extracted from Wikipedia (2015)
9. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A robustly optimized BERT pretraining approach (2019)
10. Luo, K., Lin, F., Luo, X., Zhu, K.: Knowledge base question answering via encoding of complex query graphs. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. pp. 2185–2194 (2018)
11. Malik, N., Sharan, A., Biswas, P.: Domain knowledge enriched framework for restricted domain question answering system. In: 2013 IEEE International Conference on Computational Intelligence and Computing Research. pp. 1–7 (2013). https://doi.org/10.1109/ICCIC.2013.6724163
12. Mihindukulasooriya, N., Dubey, M., Gliozzo, A., Lehmann, J., Ngomo, A.N., Usbeck, R.: Semantic answer type prediction task (SMART) at ISWC 2020 semantic web challenge. CoRR abs/2012.00555 (2020), https://arxiv.org/abs/2012.00555
13. Mihindukulasooriya, N., Dubey, M., Gliozzo, A., Lehmann, J., Ngonga Ngomo, A.C., Usbeck, R., Rossiello, G., Kumar, U.: Semantic answer type and relation prediction task (SMART 2021). arXiv (2022)
14. Pundge, A.M., Khillare, S., Mahender, C.N.: Question answering system, approaches and techniques: A review. International Journal of Computer Applications 141(3), 0975–8887 (2016)
15. Qu, Y., Liu, J., Kang, L., Shi, Q., Ye, D.: Question answering over Freebase via attentive RNN with similarity matrix based CNN. arXiv preprint arXiv:1804.03317 (2018)
16. Setty, V., Balog, K.: Semantic answer type prediction using BERT: IAI at the ISWC SMART task 2020 (2021)
17. Turing, A.M.: Computing machinery and intelligence. In: Parsing the Turing Test, pp. 23–65. Springer (2009)