COALA – A Rule-Based Approach to Answer Type Prediction

Nadine Steinmetz and Kai-Uwe Sattler
Technische Universität Ilmenau, Germany
firstname.lastname@tu-ilmenau.de

Abstract. For answering a question correctly, the prior detection of the answer type is essential. Especially in the field of Question Answering (QA) over knowledge bases, answers might be of many different types, as natural language is ambiguous and a question might lead to different relevant queries. For semantic knowledge bases, data types (such as date, string, or number) as well as all ontology classes (such as athlete, championship, or television show) have to be taken into account. Therefore, the prior detection of the answer type is a helpful sub-task for QA systems, but also a complex classification problem. We present our rule-based approach COntext Aware anaLysis of Answer types (COALA). Our approach is based on the extraction of several question features and context-aware disambiguation to retrieve the correct answer type. COALA has been developed in the course of the SMART task challenge, and we evaluated our approach on over 21,000 questions.

Keywords: Question Answering · Answer Type Detection · Semantic Web

1 Introduction

Answer type prediction (ATP) is an essential subtask for automatic question answering (QA). It can either be considered a classification task or an analytical problem. For predicting the answer type using a classifier, a large dataset for annotation and, obviously, human annotators are required. Consequently, the classifier is tied to a specific version of the ontology and the knowledge base (KB). If ontology classes, properties, or knowledge facts are removed or added, the classifier needs to be retrained on additionally annotated source data. In contrast to that, an analytical approach is rather flexible. The prediction task mostly requires mapping processes to the underlying KB.
In case of changes to the ontology or the KB, the data source is adapted while the mapping processes and the analysis in general stay the same. Changes in the underlying ontology might be rare for well-established knowledge bases. But especially for newly created KBs and ontologies, changes are common and trained approaches might not take these changes into account. In contrast, changes within the KB in terms of new entities and facts are quite frequent. Here, an analytical approach is able to adapt to the changes, in contrast to pre-trained classifiers.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

With this paper, we introduce our analytical approach to detect the answer type of an input question utilizing rules. We analyzed a large number of questions provided with the training dataset for the SMART task challenge aligned with ISWC 2020 [5]. We extracted question features to be able to find the correct references for the detection of the answer type. Following this approach, we defined a set of rules and developed a plain but efficient application which is able to detect the answer types for the majority of questions w.r.t. the SMART datasets.

The remainder of the paper is organized as follows: Section 2 covers some related approaches on ATP and QA. We describe our approach in Section 3. Based on the training and the test dataset provided by the SMART task challenge, we evaluated our approach and present the results in Section 4. We also discuss flaws and errors in that section and give an outlook on improvements and future work in Section 5.

2 Related Work

Our approach – along with ATP approaches in general – contributes to a higher accuracy of QA systems. More specifically, the answer and answer types are part of a semantic knowledge graph.
For this purpose, several approaches have been developed and evaluated, especially in the course of the QALD challenge1. Developers of QA systems face several challenges to overcome. Unger et al. provided a fundamental publication on Semantic QA systems [8], and Höffner et al. presented a survey on the challenges of QA systems [3]. The prediction of answer types is a subtask within the procedure of transforming a natural language question into a formal query and ranking the results. Therefore, hereinafter we will examine related work with a focus on answer type prediction instead of complete QA systems.

Mostly, previous work deals with a very limited number of types to be predicted. For instance, Murdock et al. presented their approach on typing answer candidates in the course of the development of IBM Watson2. In consequence, the authors did not restrict the answers to a specific set of types, because "nearly any word can be used as a type" [6]. Bast et al. presented an approach on question answering for Freebase [2]. For their system, they included a simple answer type classification based on whether the question asks for a person, a place, or a date. This procedure is similar to the answer type check we included for our QA system HYDRA [7]. In addition to the types utilized by Bast et al., HYDRA checks for the answer type number when the question starts with how many or how much.

1 http://qald.aksw.org/
2 https://www.ibm.com/watson

For QA on a knowledge graph, the number of different types depends on the number of classes defined in the underlying ontology and additionally on the types utilized as ranges of the ontology properties, such as numbers, dates, etc. Therefore, the typing of expected answers requires an intelligent but efficient approach. Ziegler et al. presented an approach for efficiency-aware answer type prediction for compositional questions [9].
The idea is to decompose a complex question and retrieve the most specific types for all sub-questions. Similar to our approach, they use CoreNLP POS tagging to find references for types in the questions. But there is no differentiation between different patterns, and of the multiple references some might not be constructive but rather misleading. For our approach, we examined a large number of questions to identify the position and pattern of references to the expected answer type of a question. Although we analyze the context of the question to disambiguate answer type candidates, we also keep in mind the efficiency of the subtask within the complete QA pipeline.

3 Method

Our analytical approach includes several separate processing steps. The first step is the detection of the question type. We describe this step in detail in Section 3.1. According to the detected question type, the question is analyzed for further characteristics. These subsequent analysis steps are specified in Section 3.2. In general, the result of the question analysis is the answer category and type(s). Our overall approach is depicted in Figure 1 and further described in detail in the next sections.

3.1 Question Type Detection

For the SMART task challenge, a set of answer categories and types has been predefined:

– boolean
– literal: number
– literal: date
– literal: string
– resource: set of ontology classes

These answer types can be deduced from different question types. Therefore, we identified different question types in the first step. Naturally, we first analyzed the first words of the questions in the training dataset (the numbers in brackets give the number of occurrences in the training set of the challenge having types of the DBpedia ontology):

– Most frequently, questions start with which or what (≈ 7,514 – 42%)

Fig. 1.
Overview of our approach

– Questions starting with an auxiliary verb, such as is or did, are second most frequent (≈ 2,841 – 16%)
– There are questions that begin with a main verb and are more of a request, such as list, give, show (≈ 1,283 – 7%)
– Questions starting with who, whose or whom are also very frequent (≈ 2,189 – 12%)
– There are several hundred questions starting with when (≈ 1,146 – 6%), where (≈ 389 – 2%), or how much/how many (≈ 845 – 5%).
– In addition to the main question types above, a considerably high number of questions in the dataset start with an entity, as in Paris is the capital of which country? (≈ 1,200 – 7%).
– Naturally, there is a relatively low number of further questions with miscellaneous beginnings.

Obviously, many answer types can be deduced easily from the first words of a question. Questions starting with an auxiliary verb, such as Was Albert Einstein a vegetarian?, require a boolean as answer type. Within the dataset, out of 2,799 questions having boolean as answer type, 2,764 start with an auxiliary verb. Questions starting with when, such as When did Tycho Brahe start working in Uraniborg?, require a date as answer type. The dataset contains approx. 1,486 questions assigned with the answer type date, of which 1,146 start with the word when. Questions starting with how much or how many, such as How many organizations work for Environmentalism?, obviously require the answer type to be number. The dataset contains 1,633 questions assigned with the answer type number, and 770 of them start with the words how much or how many.

Consequently, we deduce the answer types of these question types directly from the first words of the sentence. For questions starting with who or where, the general answer type can be anticipated from the first words as dbo:Person and dbo:Place respectively.
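These first-word deductions can be sketched as a small lookup function. This is a minimal illustration in Python; the function name and return values are our own choices rather than COALA's published interface, and the list of auxiliary verbs is abridged:

```python
def categorize_by_first_words(question: str):
    """Deduce (category, type) from the first words of a question,
    following the first-word rules described above. Returns None when
    the answer type cannot be read off the first words directly."""
    words = question.lower().split()
    if not words:
        return None
    first = words[0]
    # Auxiliary verb in first position -> yes/no question -> boolean.
    if first in {"is", "are", "was", "were", "do", "does", "did", "has", "have", "can"}:
        return ("boolean", "boolean")
    # "When ..." -> date literal.
    if first == "when":
        return ("literal", "date")
    # "How much"/"How many" -> number literal.
    if first == "how" and len(words) > 1 and words[1] in {"much", "many"}:
        return ("literal", "number")
    # "Who"/"Where" -> general resource classes, refined by later analysis.
    if first in {"who", "whose", "whom"}:
        return ("resource", "dbo:Person")
    if first == "where":
        return ("resource", "dbo:Place")
    # which/what, requests, and entity-first questions need deeper analysis.
    return None

print(categorize_by_first_words("Was Albert Einstein a vegetarian?"))  # -> ('boolean', 'boolean')
```

Questions falling through to `None` – the which/what and entity-first patterns – are exactly those handled by the noun and verb extraction described next.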
But, in order to detect the more specific type, we perform an elaborate analysis on these questions. The same applies for all other question types for which the answer type cannot be deduced directly from the first words. For these remaining questions, all answer types are taken into account and we try to detect the answer type according to several processing steps. Our analysis approach regarding the different question types is described in detail in the following sections.

3.2 Question Analysis

This section describes the analysis of a question in detail. We first perform some preprocessing steps to prepare the question for further steps. Answer types are mainly derived from the ontology classes that are directly asked for using a (combined) noun, or from the range of an ontology property that is referenced in the question by a verb or noun. For instance, the question Which languages were influenced by Perl? requires the answer to be an instance of dbo:ProgrammingLanguage and the class is referenced by the noun languages. For the question Who is starring in Spanish movies produced by Benicio del Toro? the referencing verb is starring and not the (combined) noun Spanish movies. Therefore, we extract the first occurrence of a noun or the first occurrence of a main verb. Indications and methods are described in detail in the next sections.

Preprocessing In the first step, the questions are cleaned of redundant quotation/punctuation marks and other characters not necessary for the analysis. Subsequently, the question is annotated with Part-of-Speech (POS) tags using the CoreNLP annotation pipeline3. The annotator pipeline consists of the annotators tokenize, ssplit, and pos. As the test and training datasets contain questions collected from web inquiries, the capitalization is incorrect in some cases and we then do not find any relevant combined nouns or verbs. In that case, we add the truecase annotator to the pipeline.
3 https://stanfordnlp.github.io/CoreNLP/

For the identification of the key term that hints at the answer type, we follow two different strategies:

– extraction of the first noun, and
– extraction of the first verb.

These strategies have emerged from several analyses on the training dataset. For instance, we compared the labels of the ontology classes in the type list with the questions and extracted the POS for the words with the highest similarity to the class labels. After several (manual) evaluation rounds, these two strategies turned out to be the most effective.

First Noun Extraction Consider the question What is the birth place of Konrad Adenauer?. Here, the (combined) noun birth place references a property in the ontology and we need to examine the range of this property. Furthermore, consider the question What team does John McGeever play for?. The noun team references a class from the underlying ontology which represents the answer type of the question. Finally, the request Give me the number of home stadiums of teams managed by John Spencer. references the answer type with the word number. For all these questions, the first noun is the phrase that needs to be extracted to further examine the answer type of the question. There are a few rules we follow to do that:

– The noun must follow the first which/what in the sentence, if these words occur in the question. For instance, the question The capital of Vilnius is in which sovereign state? requires the answer to be a sovereign state and not a capital.
– We take into account several sequential combinations of word types regarding Part-of-Speech: [ JJ N IN N ], [ N IN N ], [ JJ N ], [ N IN DT N ], [ N ]. Table 1 shows sample phrases for each combination. For this task, another option would be the usage of a language model (cf. [1]). A comparison of evaluation results will be included in future work.
– A combination might be followed by one or more simple nouns, such as [ JJ NN NN ] as in official birth place.
– After extraction of relevant combined nouns, we only consider the longest combination with the lowest start index.
– The combination of nouns must not contain a proper noun, as proper nouns reference named entities and do not reference the answer type.
– A combined noun must not be followed by an apostrophe. For the question What is a neutron's gyromagnetic ratio? the first noun would be neutron, but the referencing combined noun for the answer type is gyromagnetic ratio.

Table 1. Sample phrases for combined nouns and the respective sequential POS pattern.

POS pattern   Sample question with marked terms following the pattern
JJ N IN N     What is the approximate date of birth of Eusebius of Caesarea?
N IN N        What is branch of biology that starts with z?
JJ N          What is the official language of Papua New Guinea?
N IN DT N     What is the name of the opera based on Twelfth Night?
N             Which films are located in New York City?

Combined nouns are extracted from the question according to these rules – along with the start index, to be able to extract the first occurrence and compare it to the start index of the extracted verbs.

Verb Extraction Consider the following sample questions:

– What are Breann McGregor and Anika Knudsen, both known for?
– Who coached the Marquette golden eagles men team in 09 to 10 and then again in 13 to 14?
– Who painted Mona Lisa?

For these questions, the verbs known for, coached, and painted reference the required answer type respectively. The CoreNLP tagger uses the Penn Treebank tagset. Within this tagset, the tags for verbs start with VB, such as VBZ for a verb, 3rd person singular present, or VBD for a verb, past tense. Therefore, we extract the first occurrence of a term tagged by the POS tagger with a tag starting with VB. Naturally, there are combined verbs, such as is married or will be dancing.
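This extraction of the first maximal run of verb-tagged tokens can be sketched over already POS-tagged input. The tagged pairs below are hand-written for illustration; in COALA they come from CoreNLP, and the helper's name is our own:

```python
def first_verb_group(tagged):
    """Return (start_index, tokens) for the first maximal run of
    consecutively VB*-tagged tokens in a list of (token, POS) pairs,
    or None if no verb tag occurs."""
    i = 0
    while i < len(tagged):
        if tagged[i][1].startswith("VB"):
            j = i
            # Extend the run while the verb tags continue consecutively.
            while j < len(tagged) and tagged[j][1].startswith("VB"):
                j += 1
            return i, [tok for tok, _ in tagged[i:j]]
        i += 1
    return None

# Penn Treebank tags, here assigned by hand instead of a real tagger:
tagged = [("Who", "WP"), ("is", "VBZ"), ("married", "VBN"), ("to", "TO"), ("Angela", "NNP")]
print(first_verb_group(tagged))  # -> (1, ['is', 'married'])
```

The returned start index is what gets compared against the start index of the extracted combined noun, so that the earlier of the two references wins.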
Therefore, we extract combinations of terms that are tagged with a verb tag consecutively.

Class Identification The extracted combined noun is stemmed4 and mapped to the search index for DBpedia ontology classes. The search index consists of the original labels of the classes, provided by DBpedia, as well as additional labels obtained by utilizing the datamuse API5 for each original label. In addition, we enriched the search index with generalized versions of the labels. This means, for each original label, we extracted the last term (only in case the label consists of more than one term) and added this term as an additional label. In this way, we still find the ontology class basketball player when only player is referenced in the question.

Property Identification The extracted noun combination could also reference a property. Hence, we also stem and map the phrase to the search index for properties. Similar to the search index for classes, this index consists of the original labels of the properties and additional labels obtained via datamuse. In contrast to the class label index, we also utilized data from the Wikidata knowledge graph. Wikidata often contains a lot more information on entities in terms of specific characteristics. For instance, for the chemical substance Malathion, DBpedia only lists one essential property describing the substance (iupac name), whereas Wikidata lists around 20 facts about it. Moreover, many questions in the training and test dataset of the challenge refer to facts only contained in Wikidata. Therefore, we decided to add properties, their ranges, and their occurrence with entities to our search index.

Disambiguation Due to the enrichment of the search indices for properties and classes with additional labels, we are able to find the relevant classes and properties for a higher number of natural language phrases.
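The label generalization used for the class index above can be sketched as follows. The class labels are abridged and illustrative, and both the datamuse enrichment and stemming are omitted here:

```python
from collections import defaultdict

def build_class_index(class_labels):
    """Map search keys to sets of ontology classes. Besides the full
    label, the last term of each multi-word label is added as a
    generalized key, so e.g. 'player' still finds dbo:BasketballPlayer."""
    index = defaultdict(set)
    for cls, label in class_labels.items():
        label = label.lower()
        index[label].add(cls)
        terms = label.split()
        if len(terms) > 1:
            index[terms[-1]].add(cls)  # generalized label: last term only
    return index

# Tiny illustrative subset of class labels (not the full DBpedia ontology):
labels = {"dbo:BasketballPlayer": "basketball player",
          "dbo:SoccerPlayer": "soccer player",
          "dbo:Film": "film"}
index = build_class_index(labels)
print(sorted(index["player"]))  # -> ['dbo:BasketballPlayer', 'dbo:SoccerPlayer']
```

Note that the generalized key is deliberately ambiguous – `player` maps to several classes – which is exactly why the disambiguation step described next is needed.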
But, concurrently, we created ambiguous mappings, and a single extracted phrase might map to more than one class or property. Therefore, the disambiguation step is necessary to find the most appropriate and relevant property or class.

For the disambiguation between different classes, it is essential to semantically distinguish them from each other. For an algorithm-based disambiguation, context information for each class is necessary. Unfortunately, the DBpedia ontology does not provide much information about each class. Therefore, we created context texts by gathering all abstracts of the instances of each class. We use these texts to map the context information from the question to the context information of each relevant class for the extracted phrase. In this way, a ranking score for each class is calculated by simply counting the occurrences of contextual terms from the question within the context information of each class. All scores are normalized to a range between [0.0...1.0].

Another option for the disambiguation of classes is to utilize Wikipedia page links and check whether the relevant classes are linked – directly or indirectly, via their instances – to named entities mentioned in the question. We identify named entities within the question, retrieve all resources (in case of ambiguous surface forms) and check whether these resources are instances of the relevant classes or whether there are links between the resources and the instances of the classes. Again, we simply count the number of links and normalize the number to a ranking score between [0.0...1.0].

For the disambiguation of phrases that are mapped to ontology properties, we likewise utilize the named entities mentioned in the question.

4 Utilizing the Snowball Stemmer: https://snowballstem.org/
5 https://www.datamuse.com/api/
Here, we retrieve all properties that are part of a triple with the named entities (as subject or object). We then check whether the list of properties mapped from the extracted noun phrase contains properties that are associated with the named entities identified in the question. A property achieves a ranking score by simply counting the named entities it is associated with.

In this way, all classes and properties resulting from the mapping of the extracted noun phrase achieve a ranking score. Naturally, the class or property with the highest score is selected as most relevant for the extracted phrase. In case of a class, it directly constitutes the answer type. In case of a property, the range of the property is assigned as answer type for the question.

4 Evaluation

Our work has been developed as part of the SMART task challenge at ISWC 2020. For this challenge, a training dataset and eventually a test dataset have been published. The training dataset consists of 17,571 items, of which 44 items do not contain a proper question. Therefore, we processed 17,527 questions for the training dataset. The test dataset consists of 4,381 items, of which 3 items do not contain a proper question. Therefore, we processed 4,378 questions for the test dataset. Both datasets are publicly available6.

As we developed a rule-based approach (in contrast to a trained approach), we did not split up the training dataset for evaluation purposes and provide results for the complete dataset. The system output consists of a result category, which is either literal, boolean, or resource, and a (list of) type(s). The types depend on the category. For the literal category, the type list either contains string, date, or number. If the category is resource, the type list consists of several ontology classes from the DBpedia ontology. Therefore, the evaluation includes three different metrics: accuracy, NDCG@5, and NDCG@10.
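For reference, NDCG@k over a ranked type list can be sketched with binary relevance. This is a common textbook formulation and an assumption on our part; the challenge's official evaluation script may use hierarchy-aware graded gains instead:

```python
import math

def ndcg_at_k(predicted, gold, k):
    """NDCG@k with binary gains: a predicted type counts as relevant
    iff it occurs in the gold type list."""
    gains = [1.0 if t in gold else 0.0 for t in predicted[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    # Ideal DCG: all relevant types ranked first.
    ideal = [1.0] * min(len(gold), k)
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k(["dbo:Person", "dbo:Athlete"], ["dbo:Person", "dbo:Athlete"], 5))  # -> 1.0
```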
The accuracy score is used as metric for the category prediction of the system output. NDCG@5 and NDCG@10 are used to calculate the quality of the ranking of the type list. NDCG@5 is calculated over the first 5 results in the type list and NDCG@10 over the first 10 results respectively.

4.1 Results

Evaluation results are presented in Table 2. The results show that our approach achieved similar results for the training and the test dataset, with better results for the class prediction of the training dataset. Out of 17,527 and 4,378 questions, 14,272 and 3,683 questions respectively were answered by our approach. For the remaining questions – 3,255 and 695 respectively – the answer type was detected as unknown.

6 https://github.com/smart-task/smart-dataset/tree/master/datasets/DBpedia

Table 2. Evaluation results for the SMART task challenge datasets against the DBpedia ontology

Dataset    Processed  Answered  Accuracy  NDCG@5  NDCG@10
Training   17,527     14,272    0.746     0.655   0.620
Test       4,378      3,683     0.744     0.540   0.521

4.2 Discussion

The reasons for failures of our approach can be summarized in the following categories:

1. Follow-up errors because of erroneous Part-of-Speech tagging
2. High complexity of the question
3. Missing mapping from verb/noun to class and property labels
4. Incomprehension of questions

We discuss these issues in detail in the following paragraphs.

Follow-Up Errors The questions have been derived from web questions and often lack correct capitalization or are completely in upper case. Therefore, the POS tagging has a higher error rate than for correct sentences. Our approach is based on the correct extraction of the first noun or verb. If that fails, the approach fails in most cases. As stated in Section 3.2, we add the truecase annotator to the CoreNLP pipeline, but only when no noun or verb could be extracted. Therefore, the error rate is still considerably high because of wrong capitalization.
Sometimes, we are able to derive a general class from the characteristics of the question. For instance, if the question starts with who or where, we add the respective ontology class dbo:Person or dbo:Place. But this results in a lower ranking score if the expected class is more specific.

Complexity Some questions are quite complex and even hard to answer at all, so the detection of the answer type is also more complex and cannot be achieved by our straightforward approach. For instance, the question What is the songwriter of Hard Contract known for? requires first detecting who the songwriter of the song Hard Contract is. As the property dbo:knownFor does not have a range constraint, the concrete object for the songwriter having this property must be retrieved. Subsequently, the type can be deduced. But the answer type detection should be a first step of a QA approach, supporting the result of the complete approach. At the end of the detection for this sample question, we already have the answer to the question – after an elaborate procedure. From our point of view, the answer type detection should be a tradeoff between required cost and result.

In addition, the dataset also contains questions which even humans are unable to answer. Naturally, it is hard to find rules for those types of questions. But, of course, the limits of artificial intelligence should not be the individual human brain. So, this is an issue for future work.

Missing Mapping As described in Section 3.2, we utilize the original labels of the classes and properties derived from DBpedia (for classes and properties) and Wikidata (only for properties). For the properties, Wikidata also provides additional labels via skos:altLabel. In addition, we retrieved more labels by utilizing the datamuse API and thereby enriched our search index for classes and properties.
But still, some phrases cannot be mapped to a property or class, and therefore the approach either fails to predict an answer type at all or detects a wrong type. The lexical gap thus remains a major difficulty for the field of Semantic Question Answering and an aspect to work on further.

Incomprehension In addition to very complex questions, the dataset also contains a few items where the question is rather an abstract than a question (cf. the question with id dbpedia 4857) and it is not clear what it asks for. Also, there are questions like What it is? which we naturally skipped, too.

5 Summary

With this paper, we present our straightforward rule-based approach COALA for answer type prediction. As ATP is supposed to be a subtask of a QA system, we focused on a simple but efficient procedure. As shown in Section 4, we achieved an accuracy for the category prediction of almost 75% for both the training and the test dataset. The NDCG score for the ranking quality of the type lists ranges between 52% and 65%.

As discussed in Section 4.2, we identified several error categories and future work tasks. We expect the greatest effect on accuracy and ranking quality from investing more in the comprehension of complex questions. We already achieve a good result with the identification of the references for the answer types. But questions with sub-sentences and listings of required characteristics for the demanded answer are still a challenge. Therefore, we plan to investigate lexical patterns further to identify all references mentioned in a question and take these hints into account for more specific predictions of the answer type. Another challenge is the missing mapping of natural language phrases to the labels of ontology classes and properties. For this issue, we plan to examine the ClueWeb12 dataset7 and find additional phrases, similar to Ziegler et al. [9] and Lin et al. [4].

7 http://lemurproject.org/clueweb12/index.php
Overall, the typing of answer candidates should be an efficient and contributing subtask of QA systems. With our approach, we anticipate some tasks that are necessary for the construction of the correct SPARQL query in any case, such as question type analysis and named entity identification and disambiguation. In this way, the process contributes to the complete QA procedure and provides the expected type of answer for a higher accuracy of the overall system.

References

1. Akbik, A., Blythe, D., Vollgraf, R.: Contextual string embeddings for sequence labeling. In: Proceedings of the 27th International Conference on Computational Linguistics. pp. 1638–1649. Association for Computational Linguistics, Santa Fe, New Mexico, USA (Aug 2018), https://www.aclweb.org/anthology/C18-1139
2. Bast, H., Haussmann, E.: More accurate question answering on Freebase. In: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. pp. 1431–1440. CIKM '15, Association for Computing Machinery, New York, NY, USA (2015)
3. Höffner, K., Walter, S., Marx, E., Usbeck, R., Lehmann, J., Ngonga Ngomo, A.C.: Survey on challenges of question answering in the semantic web. Semantic Web 8(6), 895–920 (2017)
4. Lin, T., Mausam, Etzioni, O.: Entity linking at web scale. In: Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction (AKBC-WEKEX). pp. 84–88. Association for Computational Linguistics, Montréal, Canada (Jun 2012), https://www.aclweb.org/anthology/W12-3016
5. Mihindukulasooriya, N., Dubey, M., Gliozzo, A., Lehmann, J., Ngomo, A.C.N., Usbeck, R.: SeMantic AnsweR Type prediction task (SMART) at ISWC 2020 Semantic Web Challenge. CoRR/arXiv abs/2012.00555 (2020), https://arxiv.org/abs/2012.00555
6. Murdock, J.W., Kalyanpur, A., Welty, C., Fan, J., Ferrucci, D.A., Gondek, D.C., Zhang, L., Kanayama, H.: Typing candidate answers using type coercion.
IBM Journal of Research and Development 56(3.4), 7:1–7:13 (2012)
7. Steinmetz, N., Arning, A., Sattler, K.: From natural language questions to SPARQL queries: A pattern-based approach. In: Datenbanksysteme für Business, Technologie und Web (BTW 2019), 18. Fachtagung des GI-Fachbereichs „Datenbanken und Informationssysteme" (DBIS), 4.-8. März 2019, Rostock, Germany, Proceedings. pp. 289–308 (2019). https://doi.org/10.18420/btw2019-18
8. Unger, C., Freitas, A., Cimiano, P.: An Introduction to Question Answering over Linked Data, pp. 100–140. Springer International Publishing, Cham (2014)
9. Ziegler, D., Abujabal, A., Saha Roy, R., Weikum, G.: Efficiency-aware answering of compositional questions using answer type prediction. In: Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers). pp. 222–227. Asian Federation of Natural Language Processing, Taipei, Taiwan (Nov 2017), https://www.aclweb.org/anthology/I17-2038