Multilingual Hierarchical Expected Answer Type
Classification using the SMART 2021 Dataset
Aleksandr Perevalov and Andreas Both
Anhalt University of Applied Sciences, Köthen (Anhalt), Germany
{aleksandr.perevalov,andreas.both}@hs-anhalt.de
Abstract. Knowledge Graph Question Answering (KGQA) systems are required
to understand natural language in order to transform textual questions into
structured queries to a knowledge graph. One of the important tasks of
natural-language understanding (NLU) in the context of KGQA systems is the
classification of the expected answer type (EAT). In this paper, we present our
approach to EAT classification within the SeMantic Answer Type Prediction Task
2021. The approach is based on machine-translation-based data augmentation; it
supports 104 input languages1 and works over DBpedia and Wikidata. The obtained
evaluation results demonstrate reasonable quality in comparison to both
last year’s and this year’s solutions.
Keywords: Expected Answer Type Classification · Target Type Identi-
fication · Multilingual Question Answering · Knowledge Graph Question
Answering.
1 Introduction
The expected answer type (EAT, or answer type) prediction in the context of
knowledge graph question answering (KGQA) is one of the tasks of natural-
language understanding (NLU). Knowing the EAT enables a system to significantly
narrow down the search space for an answer to a given question. For example,
given the question “In what city was Angela Merkel born?”, it is possible to
reduce the search space from 861 to 6 entities while querying all the 1-hop
entities of the Angela Merkel resource dbr:Angela_Merkel2 in the DBpedia [1]
knowledge base. The corresponding example is illustrated in Figure 1.
In this work, we present our solution for the 2021 SeMantic Answer Type
and Relation Prediction3 (SMART) Task [11] over the DBpedia and Wikidata
datasets in the Answer Type Prediction track (SMART2021-AT). In the simplest
1 https://github.com/google-research/bert/blob/master/multilingual.md#list-of-languages
2 dbr:Angela_Merkel is the prefix-based shortcut for the DBpedia resource http://dbpedia.org/resource/Angela_Merkel
3 cf. https://smart-task.github.io/2021/
# without EAT prediction
SELECT (COUNT(DISTINCT ?obj) as ?count)
WHERE {
  dbr:Angela_Merkel ?p ?obj .
}
# ?count = 861

# with EAT prediction
SELECT (COUNT(DISTINCT ?obj) as ?count)
WHERE {
  dbr:Angela_Merkel ?p ?obj .
  ?obj rdf:type ?type .
  FILTER(?type = dbo:City)
}
# ?count = 6

Fig. 1. Motivation for the EAT classification in KGQA. The example question: “In
what city was Angela Merkel born?”
case, the EAT classification task can be treated as a multi-class text classification
task. In the context of the SMART task, the data structure is more sophisti-
cated. There are two class levels: answer category (resource, literal, boolean) and
answer type. Hence, the class taxonomy is not flat and requires approaches for
hierarchical classification.
The official description of the data states4: If the category is “resource”, an-
swer types are ontology classes from either the DBpedia ontology5 or the Wiki-
data ontology6. If the category is “literal”, answer types are either “number”,
“date”, or “string”. For the category “boolean”, no additional specialization is
defined. The number of unique “resource” classes is high. Moreover, they are
represented in the form of a list.
While following our long-term research agenda of increasing the accessibil-
ity of KGQA systems and their components through multilingualization, the
presented solution is mainly based on multilingual language models and
datasets (as in our last year’s iteration [13]). We used open-source machine trans-
lation models [17] to translate the provided data into 10 languages (German,
Spanish, Mandarin Chinese, Italian, Romanian, Vietnamese, Russian, French,
Czech, Japanese). Thereafter, a multilingual BERT-based [7] classifier was fine-
tuned on the original and translated data. We used a multi-level classification
pipeline to make the final predictions. In the conclusion of the paper, we discuss
the final results as well as our findings during the working process.
This work is structured as follows: in Section 2, we review the related research;
Section 3 describes the exploratory data analysis of the provided datasets; we
describe our approach in Section 4 and present the evaluation results in Section 5;
Section 6 concludes the paper.
2 Related Work
Entity- and type-centric models were introduced in [2] to identify the answer
type of a question. These models are used to rank the queries given the
entity- or type-related content [9]. The idea of incorporating additional context
to improve answer type predictions was proposed in [18]. One of the
4 https://smart-task.github.io/
5 http://mappings.dbpedia.org/server/ontology/classes/
6 https://www.wikidata.org/wiki/Wikidata:WikiProject_Ontology
ISWC 2020 Semantic Web challenges addressed answer type classification
(the SeMantic AnsweR Type prediction task, SMART) [10]. It showed that
transformer-based models achieve the best results on this task [16,12].
An approach based on using external data (e.g., KGQA datasets) was intro-
duced in [14]. Recently, the authors of [6] proposed a system for EAT prediction
in a “distantly supervised fashion” (i.e., no manual data annotation is required).
3 Exploratory Data Analysis
There are two class levels in the datasets: answer type category (resource,
literal, boolean) and answer type. As the class taxonomy is not flat, it requires
approaches for hierarchical classification. The official description of the
data states7: If the category is “resource”, answer types are ontology classes
from either the DBpedia ontology8 or the Wikidata ontology9. If the category
is “literal”, answer types are either “number”, “date”, or “string”. For the
category “boolean” no additional specialization is defined. Such data has to be
analyzed on different levels for class distribution, noise, missing values, etc. In
this section, we demonstrate the analysis of both DBpedia and Wikidata answer
type classification datasets.
3.1 DBpedia Dataset
The DBpedia dataset contains Train (40,621 examples) and Test (10,093 exam-
ples) subsets. After removing the null values from the Train subset, the number
of examples decreased to 37,061 (-3,560 examples). An example from the DBpedia
dataset for answer type prediction is shown in Figure 2. Here, if the category
is resource, the answer type field is represented as an ordered list of DBpedia
ontology types, where the first type is the most specific and the last is the most
general one (see the first question in the figure). While manually analyzing
this year’s SMART dataset, we were able to identify several noisy examples (see
Figure 3). In this example, the type field contains duplicated DBpedia ontology
types.
We also analyzed the distributions of values for the category, literal, and type
fields. The results are shown in Figure 4. There is a strong imbalance
towards the resource value, while the literal values are more or less balanced.
The distribution of the type values for resource questions is shown in Figure 5.
Surprisingly, the most frequent values of the answer type field contain noise.
For example, the second most frequent value has an incorrect order of DBpedia
types. The third most frequent value contains DBpedia types from different
hierarchies (dbo:Place and dbo:Agent).
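The cleaning and distribution analysis described above can be sketched in a few lines of Python (a minimal illustration; the local file name task1_dbpedia_train.json is an assumption, not part of the official data release):

import json
import pandas as pd

# Load the DBpedia training subset (file name is an assumption).
with open("task1_dbpedia_train.json", encoding="utf-8") as f:
    df = pd.DataFrame(json.load(f))

# Remove examples with null question, category, or type values.
df = df.dropna(subset=["question", "category", "type"]).reset_index(drop=True)

# First-level (category) class distribution.
print(df["category"].value_counts())

# Distribution of the most specific type of "resource" questions
# (the first element of the ordered type list).
resource = df[df["category"] == "resource"]
print(resource["type"].str[0].value_counts().head(10))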
7 https://smart-task.github.io/
8 http://mappings.dbpedia.org/server/ontology/classes/
9 https://www.wikidata.org/wiki/Wikidata:WikiProject_Ontology
[
{
"id": "dbpedia_1",
"question": "Who are the gymnasts coached by Amanda Reddin?",
"category": "resource",
"type": ["dbo:Gymnast", "dbo:Athlete", "dbo:Person", "dbo:Agent"]
},
{
"id": "dbpedia_2",
"question": "When did Margaret Mead marry Gregory Bateson?",
"category": "literal",
"type": ["date"]
},
{
"id": "dbpedia_3",
"question": "Is Azerbaijan a member of European Go Federation?",
"category": "boolean",
"type": ["boolean"]
}
]
Fig. 2. An example of the DBpedia dataset
{
"id":26178,
"question":"What kind of music is the album farewell aldebaran",
"category":"resource",
"type":[
"dbo:Genre", "dbo:TopicalConcept", "dbo:MusicGenre",
"dbo:Genre", "dbo:TopicalConcept", "dbo:MusicGenre"
]
}
Fig. 3. A noisy example in the DBpedia dataset
Fig. 4. Data distribution for the category and literal values for the DBpedia
training subset
Fig. 5. Data distribution of the type values for the DBpedia training subset
[
{
"id": "1",
"question": "Who is the child of Ranavalona I's husband?",
"category": "resource",
"type": ["person", "omnivore", "natural person"]
}
]
Fig. 6. An example of the Wikidata dataset. Only the “resource” category is shown, as
the other categories are the same as in the DBpedia dataset in Figure 2.
3.2 Wikidata Dataset
The Wikidata dataset contains Train (43,604 examples) and Test (10,864 exam-
ples) subsets. We also cleaned the training data by removing the null values.
Consequently, the number of examples decreased to 43,554 (-50 examples). In
this respect, the Wikidata data has significantly fewer null values than
the DBpedia data. An example from the Wikidata dataset for answer type
prediction is shown in Figure 6.
Here, if the category is resource, the answer type field is represented as a
list of Wikidata classes that are retrieved according to the following SPARQL
query:
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?subClasses WHERE {
  wd:Q123456789 wdt:P31 ?x . # subject to be replaced with actual answer entity
  ?x wdt:P279 ?subClasses .
}
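For illustration, this retrieval could be executed against the public Wikidata endpoint, e.g., with the SPARQLWrapper library (a sketch; the example entity wd:Q567, Angela Merkel, and the use of SPARQLWrapper are assumptions and not part of the original dataset construction):

from SPARQLWrapper import SPARQLWrapper, JSON

# Fetch the gold answer types for one answer entity from the public Wikidata endpoint.
sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
sparql.setQuery("""
    PREFIX wd: <http://www.wikidata.org/entity/>
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>
    SELECT ?subClasses WHERE {
        wd:Q567 wdt:P31 ?x .  # wd:Q567 (Angela Merkel) used only as an example
        ?x wdt:P279 ?subClasses .
    }
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
print([b["subClasses"]["value"] for b in results["results"]["bindings"]])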
While manually analyzing the training subset of Wikidata, we were able to
identify several noisy examples of a different nature than in DBpedia (see
Figure 7). In this example, the type field mixes human-readable class labels with
a raw Wikidata identifier (Q26869183).
We also analyzed the distributions of values for the category, literal, and type
fields. The results are shown in Figure 8. For Wikidata, we observed the
same data distribution patterns: there is a strong imbalance towards the resource
{
"id":10395,
"question":"what is the grammatcal mood of turkish",
"category":"resource",
"type":[
"grammatical category", "Q26869183", "grammatical mood"
]
}
Fig. 7. A noisy example in the Wikidata dataset
Fig. 8. Data distribution for the category and literal values for the Wikidata
training subset
value, while the literal values are more or less balanced. The distribution of the
type values for resource questions is shown in Figure 9. Notably, the answer type
values are extremely imbalanced towards the ones related to the person class.
3.3 Summary
We assume that the observed noise in the data was unintended and that there was a
risk of similar data quality issues in the test dataset. Hence, we decided not to use
this year’s SMART Task data. In our local training and evaluation process, the
data from the previous year was used [10]. The corresponding data analysis for
the previous year’s data is available in our paper [13].
4 Approach
While following our long-term research agenda on enhancing multilingual acces-
sibility of KGQA systems [15], we base our approach on multilingual data aug-
mentation. We used only the data from SMART 2020 (the previous year’s chal-
lenge). For both DBpedia and Wikidata, all the textual questions in the corre-
sponding datasets were machine-translated from English into German, Spanish,
Fig. 9. Data distribution for the answer type values for the Wikidata training subset
Fig. 10. Architecture of the classification pipeline: a question is first processed by the
category classifier (boolean / literal / resource); the category and literal value classifiers
are common models for DBpedia and Wikidata, while the resource types classifier is
trained separately for DBpedia and Wikidata
Chinese, Italian, Romanian, Vietnamese, Russian, French, Czech, and Japanese
using the Helsinki-NLP OPUS-MT models [17]. For the DBpedia dataset, we fetched
additional data from the LC-QuAD 1.0 [19] dataset. The same was done w.r.t. the Wikidata
dataset and LC-QuAD 2.0 [8].
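A minimal sketch of this augmentation step with the OPUS-MT models available in the “transformers” library is shown below (the concrete model name Helsinki-NLP/opus-mt-en-de for the English-to-German direction and the batching details are assumptions about the exact setup):

from transformers import MarianMTModel, MarianTokenizer

# One OPUS-MT model per target language; English-to-German shown here.
model_name = "Helsinki-NLP/opus-mt-en-de"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

def translate(questions):
    # Translate a batch of English questions; the category/type labels stay unchanged.
    batch = tokenizer(questions, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

print(translate(["In what city was Angela Merkel born?"]))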
We used a multi-level hierarchical classification pipeline for both DBpedia
and Wikidata. The pipeline consists of the following models: (1) a category classifier
and (2) a literal classifier, which are common models for both knowledge graphs, and
(3) a resource classifier, which is trained separately for DBpedia and Wikidata. The archi-
tecture of the classification pipeline is shown in Figure 10. The resource classifier
for DBpedia was trained in the multi-class classification setting to predict the
most specific type of the hierarchy. When the prediction is executed, the rest of
the hierarchy is fetched from DBpedia via the following SPARQL query based
on the predicted type:
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?parentClass WHERE {
  dbo:City rdfs:subClassOf* ?parentClass . # dbo:City to be replaced with the predicted type
  FILTER(?parentClass != dbo:City && CONTAINS(STR(?parentClass), "dbpedia"))
}
The resource classifier for Wikidata was trained in the multi-label classification
setting. It works without any additional steps in the prediction phase.
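The prediction logic of the pipeline can be summarized as the following sketch (the classifier callables and the expand_hierarchy helper are hypothetical placeholders for the trained models and the SPARQL lookup shown above):

def predict_answer_type(question, category_clf, literal_clf, resource_clf, expand_hierarchy):
    # First level: answer category.
    category = category_clf(question)  # "boolean", "literal", or "resource"
    if category == "boolean":
        return category, ["boolean"]
    if category == "literal":
        # Second level for literals: "number", "date", or "string".
        return category, [literal_clf(question)]
    # Second level for resources: the most specific class plus its ancestors
    # (fetched via SPARQL for DBpedia; not needed for the multi-label Wikidata classifier).
    specific_type = resource_clf(question)
    return category, [specific_type] + expand_hierarchy(specific_type)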
         Accuracy NDCG@5 NDCG@10 MRR
DBpedia  0.991    0.643  0.577   -
Wikidata 0.980    -      -       0.430
Table 1. Final results obtained on the private test set by the organizers
5 Evaluation Results
For the classification model, the multilingual BERT model [7] was used. We utilized
the “transformers”10 Python library for implementing the classification pipeline. For
the multi-class classifiers (category, literal, resource for DBpedia), one fully-
connected layer of size n was added to the BERT model, where n is the number
of classes. The input to this layer was the last hidden state of the BERT model
(i.e., the representation of the [CLS] token). We used categorical cross-entropy [5]
as the loss function for the multi-class models. The multi-label classifier – resource
for Wikidata – was also provided with a fully connected layer of the same kind;
however, binary cross-entropy loss [5] was used, as each of the output neurons
represents the probability of a predicted label. The models were trained using an
early stopping criterion targeted at minimizing the loss with a patience of one epoch.
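A sketch of how the classification heads could be instantiated with the “transformers” library follows (the high-level AutoModelForSequenceClassification wrapper and the num_labels values are assumptions; the actual implementation may add the fully-connected layer manually):

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

# Multi-class heads (category, literal, resource for DBpedia): softmax over n classes,
# trained with categorical cross-entropy.
category_model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=3
)

# Multi-label head (resource for Wikidata): one sigmoid output per class,
# trained with binary cross-entropy.
wikidata_resource_model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased",
    num_labels=50,  # placeholder; the real number of Wikidata classes differs
    problem_type="multi_label_classification",
)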
The prediction results were evaluated with the following metrics. For both
DBpedia and Wikidata, the quality of the category predictions was measured
using accuracy. The quality of the answer type predictions was measured using
Mean Reciprocal Rank (MRR) for Wikidata and lenient Normalized Discounted
Cumulative Gain @k with a linear decay (NDCG@k) for DBpedia, where
k ∈ {5, 10} [3].
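For reference, a plain implementation of MRR over predicted type lists could look as follows (an illustrative sketch; the lenient NDCG@k with linear decay used for DBpedia is more involved and is therefore omitted here):

def mean_reciprocal_rank(gold_sets, ranked_predictions):
    # Reciprocal rank of the first correct type, averaged over all questions.
    total = 0.0
    for gold, ranked in zip(gold_sets, ranked_predictions):
        for rank, predicted_type in enumerate(ranked, start=1):
            if predicted_type in gold:
                total += 1.0 / rank
                break
    return total / len(gold_sets)

# One question: the first correct type appears at rank 2, so MRR = 0.5.
print(mean_reciprocal_rank([{"person", "natural person"}],
                           [["omnivore", "person", "human"]]))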
The evaluation was done by the organizers on a private dataset using an in-
ternal process. The final results are shown in Table 1. The obtained results
demonstrate reasonable quality in comparison to the other participants’ solutions
as well as to last year’s results. The accuracy score of the solution is among
the best results on both DBpedia and Wikidata, while the metrics related to
the ranking of the answer type hierarchies (NDCG@k, MRR) show relatively
poor results. The source code of our solution is available online11. In addition,
we have deployed a demo interface for EAT classification over DBpedia
online12.
6 Discussion and Conclusion
We would like to raise the following questions concerning the evaluation process.
First, as we observed a significant imbalance in the data w.r.t. the category
values (see Section 3), we think that the usage of accuracy score to measure
10 https://huggingface.co/bert-base-multilingual-cased
11 https://github.com/Perevalov/smart-2021
12 https://webengineering.ins.hs-anhalt.de:41009/eat-classification
the results is not the best option, since it is not robust to imbalanced
data. Instead, we propose to use precision and recall scores computed in the
classification setting. Secondly, as the answer types of the resource category
questions in Wikidata are not ordered and do not form a hierarchy, we think
that the usage of mean reciprocal rank is not appropriate, as this metric is intended
for evaluating ordered result sets. Hence, a measure for unordered lists, such as
precision and recall calculated in an information-retrieval setting, is naturally
applicable to this task.
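To make the proposal concrete, the following sketch shows how such set-based precision and recall could be computed per question (an illustration of the suggested metric, not part of the official evaluation):

def set_precision_recall(gold_types, predicted_types):
    # Precision and recall for one question, treating answer types as unordered sets.
    gold, predicted = set(gold_types), set(predicted_types)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Example with the gold types from Figure 6: precision = 1/2, recall = 1/3.
print(set_precision_recall(["person", "omnivore", "natural person"],
                           ["person", "human"]))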
In this paper, we demonstrated our approach for hierarchical EAT prediction
based on a multi-level classification pipeline. As we used multilingual data and
models for training, the classification pipeline supports input in 104 languages.
The evaluation process demonstrated reasonable results w.r.t. the quality met-
rics. For future work, we aim to improve the quality of the resource
answer type classification, and we plan to conduct a study on the impact
of EAT classification on QA quality for multiple QA systems using a component-
oriented QA framework (e.g., [4]).
References
1. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.: DBpedia: A
nucleus for a web of open data. In: Aberer, K., Choi, K.S., Noy, N., Allemang, D.,
Lee, K.I., Nixon, L., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., Schreiber,
G., Cudré-Mauroux, P. (eds.) The Semantic Web. pp. 722–735. Springer Berlin
Heidelberg, Berlin, Heidelberg (2007)
2. Balog, K., Neumayer, R.: Hierarchical target type identification for entity-oriented
queries. In: Proceedings of the 21st ACM international conference on Information
and knowledge management. pp. 2391–2394. CIKM ’12, ACM, New York, NY,
USA (2012). https://doi.org/10.1145/2396761.2398648
3. Balog, K., Neumayer, R.: Hierarchical target type identification for entity-oriented
queries. pp. 2391–2394 (10 2012). https://doi.org/10.1145/2396761.2398648
4. Both, A., Diefenbach, D., Singh, K., Shekarpour, S., Cherix, D., Lange, C.: Qa-
nary – a methodology for vocabulary-driven open question answering systems. In:
Sack, H., Blomqvist, E., d’Aquin, M., Ghidini, C., Ponzetto, S.P., Lange, C. (eds.)
The Semantic Web. Latest Advances and New Domains. pp. 625–641. Springer
International Publishing, Cham (2016)
5. Cox, D.R.: The regression analysis of binary sequences. Journal of the Royal Sta-
tistical Society: Series B (Methodological) 20(2), 215–232 (1958)
6. Dash, S., Mihindukulasooriya, N., Gliozzo, A., Canim, M.: Type prediction sys-
tems. CoRR (2021), https://arxiv.org/abs/2104.01207
7. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.N.: BERT: Pre-training of deep
bidirectional transformers for language understanding. ArXiv e-prints (2018)
8. Dubey, M., Banerjee, D., Abdelkawi, A., Lehmann, J.: LC-QuAD 2.0: A large
dataset for complex question answering over Wikidata and DBpedia. In: Ghi-
dini, C., Hartig, O., Maleshkova, M., Svátek, V., Cruz, I., Hogan, A., Song, J.,
Lefrançois, M., Gandon, F. (eds.) The Semantic Web – ISWC 2019. pp. 69–78.
Springer International Publishing, Cham (2019)
9. Garigliotti, D., Hasibi, F., Balog, K.: Target type identification for entity-bearing
queries. In: Proceedings of the 40th International ACM SIGIR Conference on Re-
search and Development in Information Retrieval. pp. 845–848. SIGIR ’17, ACM,
New York, NY, USA (2017). https://doi.org/10.1145/3077136.3080659
10. Mihindukulasooriya, N., Dubey, M., Gliozzo, A., Lehmann, J., Ngomo, A.C.N.,
Usbeck, R.: SeMantic AnsweR Type prediction task (SMART) at ISWC 2020 Se-
mantic Web Challenge. CoRR/arXiv abs/2012.00555 (2020), https://arxiv.
org/abs/2012.00555
11. Mihindukulasooriya, N., Dubey, M., Gliozzo, A., Lehmann, J., Ngonga Ngomo,
A.C., Usbeck, R., Rossiello, G., Kumar, U.: Semantic answer type and relation
prediction task (SMART 2021). arXiv (2022)
12. Nikas, C., Fafalios, P., Tzitzikas, Y.: Two-stage semantic answer type predic-
tion for question answering using BERT and class-specificity rewarding. In: Pro-
ceedings of the SeMantic AnsweR Type prediction task (SMART) at ISWC
2020. CEUR Workshop Proceedings, vol. 2774, pp. 19–28. CEUR-WS.org (2020),
http://ceur-ws.org/Vol-2774/paper-03.pdf
13. Perevalov, A., Both, A.: Augmentation-based answer type classification of the
SMART dataset. In: Proceedings of the SeMantic AnsweR Type prediction task
(SMART) at ISWC 2020. CEUR Workshop Proceedings, vol. 2774, pp. 1–9. CEUR-
WS.org (2020), http://ceur-ws.org/Vol-2774/paper-01.pdf
14. Perevalov, A., Both, A.: Improving answer type classification quality through com-
bined question answering datasets. In: Knowledge Science, Engineering and Man-
agement. pp. 191–204. Springer International Publishing, Cham (2021)
15. Perevalov, A., Diefenbach, D., Usbeck, R., Both, A.: QALD-9-plus: A multilingual
dataset for question answering over DBpedia and Wikidata translated by native
speakers. In: 2022 IEEE 16th International Conference on Semantic Computing
(ICSC). IEEE (2022)
16. Setty, V., Balog, K.: Semantic answer type prediction using BERT IAI at the ISWC
SMART task 2020. In: Proceedings of the SeMantic AnsweR Type prediction task
(SMART) at ISWC 2020. CEUR Workshop Proceedings, vol. 2774, pp. 10–18.
CEUR-WS.org (2020), http://ceur-ws.org/Vol-2774/paper-02.pdf
17. Tiedemann, J., Thottingal, S.: OPUS-MT — Building open translation services
for the World. In: Proceedings of the 22nd Annual Conference of the European
Association for Machine Translation (EAMT). Lisbon, Portugal (2020)
18. Tonon, A., Catasta, M., Prokofyev, R., Demartini, G., Aberer, K.,
Cudré-Mauroux, P.: Contextualized ranking of entity types based on
knowledge graphs. Journal of Web Semantics 37-38, 170–183 (2016).
https://doi.org/10.1016/j.websem.2015.12.005
19. Trivedi, P., Maheshwari, G., Dubey, M., Lehmann, J.: LC-QuAD: A corpus for
complex question answering over knowledge graphs. In: d’Amato, C., Fernandez,
M., Tamma, V., Lecue, F., Cudré-Mauroux, P., Sequeda, J., Lange, C., Heflin,
J. (eds.) The Semantic Web – ISWC 2017. pp. 210–218. Springer International
Publishing, Cham (2017)