CitySAT: a System for the Semantic Answer Type Prediction Task⋆

Chaeyoon Kim1 and Ernesto Jiménez-Ruiz1,2

1 City, University of London, London
2 SIRIUS, University of Oslo, Norway

⋆ Copyright ©2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. This paper describes the CitySAT system that we developed for the DBpedia Answer Type (AT) prediction task of the SMART 2021 challenge. The challenge can be interpreted as a multi-class classification task that takes natural language questions and returns pairs of predicted answer category and types. For training, we merged the SMART 2021 DBpedia dataset with the 2020 dataset released for the previous year's AT task. Three local Machine Learning (ML) models are deployed to cover the three prediction sub-tasks (category prediction, literal type prediction and resource type prediction). The best model obtains 98.36% accuracy for category prediction using a Logistic Regression (LR) classifier. Similarly, another LR model reaches 97.90% accuracy on the literal type prediction task. Lastly, we built a Multi-Layer Perceptron (MLP) model to deal with the large number of ontology classes (∼760 for DBpedia) in the resource type prediction task; the best MLP model achieves 79.34% accuracy on the merged training dataset. The final system obtained 98.4% accuracy, 84.2% NDCG@5, and 85.4% NDCG@10 on the (official) test dataset.

Keywords: Semantic answer type prediction · SMART DBpedia challenge · multi-class classification

1 Introduction

In computer science, Answer Type prediction (AT) is a research area that offers a simplified version of the Question Answering (QA) task. It inherits the QA goal of understanding the meaning of natural language questions, but identifies their answer types instead of retrieving the most relevant answer among candidate answers. Given the number of classes among the candidate answer types, the AT task can be interpreted as a multi-class text classification task over pre-defined classes.

The SeMantic Answer type and Relation prediction Task (SMART, https://smart-task.github.io/), organized at the International Semantic Web Conference (ISWC), extends the AT challenge to a more complex knowledge base structure in which the answer category and type form two hierarchical levels. The organisers provide two large-scale datasets for the AT task: one edition using the DBpedia ontology (∼760 classes) and the other using the Wikidata taxonomy (∼50K classes). This paper describes our participation in the SMART 2021 AT task [6], concentrating only on the DBpedia dataset, which pairs natural language questions with their answer category and DBpedia ontology type classes, the goal being to predict suitable types for new questions. We chose this dataset because of the clearer separation between ontology (i.e., terminology) and data (i.e., assertions) in DBpedia. The CitySAT system is evaluated on the accuracy of the answer category prediction and, additionally, on the lenient Normalized Discounted Cumulative Gain at 5 and 10 answer types (NDCG@5 and NDCG@10), so that it can be compared with similar systems.
The best configuration achieves 98.4% accuracy, 84.2% NDCG@5, and 85.4% NDCG@10. As the NDCG evaluation metric follows a linear decay, NDCG@10 is highly valued, and our results indicate that the more predefined knowledge the system is given, the better it predicts.

The rest of the paper is structured as follows. Section 2 (Context) surveys the materials used and related work. Section 3 (Methods) presents a detailed view of the whole pipeline, from data loading to evaluation, with justifications for each step. Section 4 (Results) summarizes the highest-scoring experiments. Section 5 (Discussion and future work) examines the correlation between our research motivation and the results, and outlines the next stage of this work. The code for reproducing the experimental results of this study is publicly available at https://github.com/chaeyoonyunakim/smart-2021-AT.

2 Context: Materials and related work

The SMART 2020 challenge [7] published eight AT systems on the DBpedia dataset, which our research used as reference models for SMART 2021.

Starting with the training datasets, previous work (e.g., [2] and [10]) supports the positive effect of larger training data, and this study therefore tried to maximize the volume of the training data by merging every relevant resource. Table 1 gives an overview of our merged training dataset size compared to the SMART 2020 and 2021 challenge datasets. The earlier DBpedia edition contained 21,940 questions, whereas the later edition increased the data size to 45,774. In the end, our study settled on a merged training set of 39,556 questions and model answers.

Table 1. The size of the datasets: the SMART 2021 challenge dataset increased by 231% w.r.t. 2020.

               SMART 2020                 SMART 2021            Merged
         Train     Test    Total     Train     Test    Total     Train
        17,571    4,369   21,940    36,670    9,104   45,774    39,556

SMART datasets are designed to provide a single answer category, either "boolean", "literal", or "resource". The "boolean" category is assigned the answer type "boolean"; the "literal" category is assigned an answer type of either "number", "date", or "string"; and for the "resource" category, DBpedia ontology classes are placed in the answer type.

Another important observation from the previous work is that the participating systems achieved very high accuracy for their answer category classifiers at the upper level of the hierarchical data structure. For the 2020 results, Table 2 shows that the top-ranked systems exceeded 90% accuracy. This led to the initial decision that our experiments could be conducted on traditional Central Processing Units (CPUs). Due to the increased data volume, however, it is worth defining how to selectively adapt the reference models in this study. Hence, an initial data analysis of the lower hierarchy level (i.e., the answer type targets) follows in the next section.

Table 2. SMART 2020 Leaderboard - DBpedia AT task.

System                        Accuracy   NDCG@5   NDCG@10
Setty et al. [11]               0.98       0.80     0.79
Nikas et al. [8]                0.96       0.78     0.76
Perevalov et al. [10]           0.98       0.76     0.73
Kertkeidkachorn et al. [5]      0.96       0.75     0.72
Ammar et al. [1]                0.94       0.62     0.61
Vallurupalli et al. [14]        0.88       0.54     0.52
Steinmetz et al. [12]           0.74       0.54     0.52
Bill et al. [2]                 0.79       0.31     0.30

2.1 Initial data analysis

Regarding the shape of the bottom-level data for the answer type, Figure 1 details how many ontological classes appear in the resource type. Both the 2020 and 2021 DBpedia editions mostly consist of six or fewer answer types per question, but the distribution of type counts is more positively skewed in the 2020 edition. Therefore, the previous systems considered the first five (e.g., [10]) or six values (e.g., [2]) for their local classifiers when predicting the resource type. Some questions of the 2021 edition have more than 10 ontology classes: 666 questions are assigned between 11 and 30 types, and 9 questions have up to 627 types, all of which are excluded from the NDCG evaluation metric as it measures up to 10 classes. In this study, we train with every number of type values up to 10 so that the results can be compared with last year's performance.
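The kind of distribution summary described above can be reproduced with a few lines of pandas. The snippet below is only an illustrative sketch: the file names are assumptions standing in for the official SMART DBpedia training releases, and it is not the code used to produce Figure 1.

import json
import pandas as pd

def resource_type_counts(path):
    # load one SMART DBpedia JSON release and count ontology classes per question
    with open(path, encoding="utf-8") as f:
        df = pd.DataFrame(json.load(f))
    resource = df[df["category"] == "resource"].dropna(subset=["type"])
    return resource["type"].apply(len)

for year, path in [("2020", "smarttask_dbpedia_train_2020.json"),
                   ("2021", "smarttask_dbpedia_train_2021.json")]:
    counts = resource_type_counts(path)
    print(year,
          f"share with <=6 types: {(counts <= 6).mean():.1%}",
          f"questions with >10 types: {(counts > 10).sum()}",
          f"max types: {counts.max()}")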
To understand the context of the answer type, Figure 2 illustrates an example question and how its ontological classes are listed in the given dataset. Among the classes there is an exception, dbo:Location, which is defined as an entity in DBpedia but is treated as a class at the same level as dbo:Place specifically in the SMART 2020 and SMART 2021 shared tasks. As explained in [10], the 2020 edition is sorted so that the most general class comes last. In contrast, the 2021 edition lists the classes in mixed order. For example, the answer "type" in Figure 2 could be rearranged to ["dbo:River", "dbo:Stream", · · ·, "dbo:NaturalPlace", "dbo:Place", "dbo:Location"] if the same principle as in the 2020 edition were applied. Because of the computational logic of NDCG, this study starts from the hypothesis that the order of the classes in the answer type does not affect the resulting normalised distance, and therefore no extra logic is needed to sort the class order.

Fig. 1. Comparison table and charts between the two years of the SMART AT dataset for the resource answer type.

Fig. 2. Graphical representation of a sample question and resource answer type.

Fig. 3. Tabular representation of the SMART 2021 training dataset.

3 Methods

3.1 Data loading and manipulation

In the SMART AT task, the training dataset and test dataset share the same JSON format (see the example below). Comparing the DBpedia datasets from the two SMART editions, we found that the "id" attribute changed its value format from "dbpedia_1" to "1", and this has been taken into account when merging the datasets according to the arrangement of the 2021 edition.

{
  "id": "1",
  "question": "Who are the gymnasts coached by Amanda Reddin?",
  "category": "resource",
  "type": ["dbo:Gymnast", "dbo:Athlete", "dbo:Person", "dbo:Agent"]
}

This study uses the Python pandas library [13] for data manipulation. Firstly, each year's dataset is loaded into a tabular representation as shown in Figure 3. Through data cleaning, both years' datasets are made to contain a single answer category (either boolean, literal, or resource) and at least one answer type per question. Lastly, duplicates are removed when a question and its corresponding answer are the same, while diverse answers for the same question are kept. Figure 4 gives example cases of our dataset pre-processing.

To construct a robust data structure, feature engineering is used to drop observations with missing values, such as a zero number of type values in the resource category, and invalid values, such as "n/a" in the question. Because NDCG is evaluated up to the 10th answer type, we treat answers with more than 10 type components as outliers. Conversely, when fewer components are present than required, a missing-value indicator has to be set, since the evaluation metric would otherwise return an infinite type-path distance, meaning no relevance between the output (i.e., predicted types) and the ground truth (i.e., gold types).
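A minimal pandas sketch of the pre-processing just described is given below. The column names follow the tabular representation in Figure 3, and the decision to truncate overly long type lists to ten entries (rather than dropping those questions) is an assumption made for illustration.

import pandas as pd

def clean(df: pd.DataFrame, max_types: int = 10) -> pd.DataFrame:
    # drop rows with missing fields or invalid questions such as "n/a"
    df = df.dropna(subset=["question", "category", "type"])
    df = df[df["question"].str.strip().str.lower() != "n/a"]
    # every question must carry at least one answer type
    df = df[df["type"].apply(len) > 0]
    # NDCG is evaluated up to the 10th type, so longer lists are cut back
    df = df.assign(type=df["type"].apply(lambda t: t[:max_types]))
    # remove exact (question, answer) duplicates, keep questions with diverse answers
    df = df.assign(_key=df["type"].apply(tuple))
    df = df.drop_duplicates(subset=["question", "category", "_key"])
    return df.drop(columns="_key").reset_index(drop=True)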
As the ground truth was not released for the 2021 challenge, this study used 20% of the merged training dataset as validation data to check the performance of the system. For the submission, the test dataset (9,104 questions) is predicted by the best model trained on the entire training dataset (39,556 questions and answers).

Fig. 4. (a) Top 10 list of duplicates when the datasets are merged; (b) an example case of keeping: a single question with multiple answers, where the variants are kept; (c) an example of removing: duplicated question and answer pairs.

3.2 Data derivation: Extract, Transform, Load (ETL)

This study uses the NLTK library [3] to parse the natural language question inputs into tokenized words, and to normalise the text with PorterStemmer (stemming) and WordNetLemmatizer (lemmatization) after stop-words removal. Initially, this study was specifically interested in handling Wh-term questions (Who, What, When, Where, Which, Whom, Whose, Why, and also How), which account for 84.4% of the training questions. We therefore customized the stop-word dictionaries to exclude the Wh-terms. Unfortunately, the new stop-words removal worked less well despite this initial interest, so this study keeps the original NLTK stop words for further analysis.

Additionally, this study explores term frequency (TF) and term frequency-inverse document frequency (TF-IDF) for text feature extraction, using CountVectorizer and TfidfVectorizer from the scikit-learn library [9], following the lessons of [11]. Empirically, the combination of stemming and TF over the first 10,000 unigrams or bigrams turned out to be the most suitable for the SMART AT task.

The categorical targets (answer category and type) are mapped into numerical labels. In particular, we distribute the answer type over the desired number of target positions and encode "missing" if nothing exists at a given position. For example, Figure 5 illustrates a conceptual sample when we select 10 values of the answer type across all categories. The classifiers can be programmed to take the first and only value (i.e., the type1 column in Figure 5) if the category is either "boolean" or "literal", whereas one or more values are taken when the category is "resource". The order of the values is mixed, as discussed for Figure 2, and our choice is to select the first five to ten values.

Fig. 5. Conceptual table referring to each allocated location of values in type.

Fig. 6. System model design to perform hierarchical classification.

Furthermore, the selected number of values is mapped into a dictionary whose keys indicate the value's location. For example, by setting the argument type_no to 11 (i.e., type1 to type10) in the reference code (see the Appendix), the dictionary keys ["type1", · · ·, "type10"] and dictionary values [{"dbo:Opera": 0, · · ·, "dbo:RadioStation": 297}, · · ·, {"dbo:Politician": 0, · · ·, "dbo:Entomologist": 21}] will create 10 JSON files, from {"type1": {"dbo:Opera": 0, · · ·, "dbo:RadioStation": 297}} to {"type10": {"dbo:Settlement": 0, · · ·, "dbo:Village": 21}}. The final accumulated dictionary of type maps is {"type1": {"dbo:EducationalInstitution": 0, · · ·, "dbo:Holiday": 282}, "type2": {"dbo:MusicalWork": 0, · · ·, "dbo:Presenter": 171}, · · ·, "type10": {"dbo:Politician": 0, · · ·, "dbo:Entomologist": 21}}.
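The way such per-position maps can be assembled is sketched below. This is only an illustration of the idea: the sorting of class labels and the file naming are assumptions, and only a fragment of the actual CitySAT code is reproduced in the Appendix.

import json

def build_type_maps(type_lists, type_no=11):
    # type_lists: the answer-type lists of the training questions
    type_maps = {}
    for i in range(1, type_no):                      # positions type1 .. type10
        labels = sorted({types[i - 1] for types in type_lists if len(types) >= i})
        mapping = {label: idx for idx, label in enumerate(labels)}
        mapping["missing"] = len(mapping)            # indicator for absent positions
        type_maps[f"type{i}"] = mapping
        with open(f"type{i}.json", "w", encoding="utf-8") as f:
            json.dump({f"type{i}": mapping}, f)      # one JSON file per position
    return type_maps

Calling build_type_maps with type_no set to 11 yields the ten maps referred to above, each including the "missing" indicator that the type_to_int function in the Appendix relies on.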
3.3 Multi-classification model design

To perform hierarchical classification, this study uses local classifiers per level, as shown in Figure 6. This design is widely used in the state of the art (e.g., [8], [10], and [11]) because of the imbalanced number of classes among the classification groups: whilst the category prediction and the "literal" type prediction each involve three unique classes, the "resource" type prediction targets ∼760 unique classes.

To decide a suitable classifier for each level, we implemented a small sample batch of Python code with several Machine Learning (ML) algorithms, importing an SVM classifier (SVC), LogisticRegression (LR), and MLPClassifier (MLP) from scikit-learn [9], and used them on the SMART 2020 dataset (17,571 questions and answers for training, 4,369 questions for test) for comparison with the performance of the reference models. As a baseline check, three sample ML models were initially trained for answer category classification, and all returned acceptable performance: SVC(kernel='linear', random_state=0, probability=True) reaches 87%, LogisticRegression(multi_class='multinomial', solver='lbfgs') reaches 88%, and MLPClassifier(hidden_layer_sizes=(11, 11, 11), max_iter=500) reaches 85%, respectively. Additionally, this study finds that the MLP model is more efficient for classification problems with many classes, such as the resource type (∼760 classes), than for the literal type (3 classes). We then moved the confirmed implementation to the merged training dataset to align with the SMART 2021 task requirements.

Fig. 7. Diagram of the final multi-class classifier design: three ML models cover the different numbers of unique features.

3.4 Multi-classification model implementation

The CitySAT system is implemented in a Google Colab CPU environment with two processing threads. The evaluation of each stage's algorithms has been conducted on validation data, which is 20% of the training dataset.

By expanding the experiments from Section 3.3, we obtained the best optimised hyperparameters for a combined system of two LRs and an MLP, as briefly captured in the Appendix. Figure 7 shows the resulting design of our classification model from the top to the bottom level, starting from a single LR model that classifies the answer category at the top level. Two different models are used to classify the answer type at the bottom level: an LR model for the type of the literal category and an MLP model for the type of the resource category.

As a last step, to meet the submission format specifications, it is essential to decode all mapped data and convert the output back to the JSON format shown below. The overall CitySAT system workflow is depicted in Figure 8.

{
  "id": "5586",
  "category": "resource",
  "type": ["dbo:Company", "dbo:Activity", "dbo:RecordLabel", "dbo:Agent",
           "dbo:Species", "dbo:Organisation", "dbo:AdministrativeRegion",
           "dbo:Location", "dbo:Country", "dbo:PopulatedPlace"]
}

Fig. 8. System workflow chart of the final CitySAT. One of the key algorithms performs the data transformation and maps the intended number of types in the experiment-looping sub-process of the type classification (the partially retrieved code is given in the Appendix).
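A sketch of the decoding step described in Section 3.4 is given below: the numeric predictions for the resource positions are mapped back to DBpedia class labels through the inverted type maps and written out in the submission format. The argument names (categories, literal_types, resource_types) are illustrative and not taken from the CitySAT code.

import json

def to_submission(ids, categories, literal_types, resource_types, type_maps,
                  out_path="citysat_submission.json"):
    # invert {"dbo:Class": int} into {int: "dbo:Class"} for every type position
    inv = {pos: {idx: label for label, idx in mapping.items()}
           for pos, mapping in type_maps.items()}
    records = []
    for i, qid in enumerate(ids):
        category = categories[i]
        if category == "boolean":
            types = ["boolean"]
        elif category == "literal":
            types = [literal_types[i]]                    # "number", "date" or "string"
        else:
            types = [inv[f"type{j + 1}"][code]
                     for j, code in enumerate(resource_types[i])]
            types = [t for t in types if t != "missing"]  # drop the missing indicator
        records.append({"id": qid, "category": category, "type": types})
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)

The resulting file follows the submission format shown above.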
4 Results

Table 3 summarizes our submission results, with the corresponding configuration parameters listed in Table 4. The final CitySAT submission (*) obtained the best model performance on the DBpedia test dataset: 0.984 accuracy, 0.842 NDCG@5, and 0.854 NDCG@10.

Table 3. Confirmed results from our participation in the SMART 2021 challenge.

                  Validation                        Test
Conf.   Accuracy  NDCG@5  NDCG@10      Accuracy  NDCG@5  NDCG@10
1        0.969     0.732   0.649        0.970     0.778   0.683
2        0.973     0.735   0.656        0.981     0.839   0.739
3        0.953     0.699   0.622        0.967     0.810   0.713
4        0.973     0.737   0.658        0.984     0.836   0.737
5        0.973     0.736   0.656        0.984     0.842   0.742
6*       0.973     0.736   0.738        0.984     0.842   0.854

Table 4. Configuration settings of the submissions.

Conf.   Stopwords  Stemming  Lemma   Text feature  Iteration  No. of types
1        False      True      True    TF            100/10      5
2        False      True      True    TF            100/10      5
3        True       True      True    TF-IDF        100/10      5
4        False      True      False   TF            200/20      5
5        False      True      False   TF            200/10      5
6*       False      True      False   TF            200/10     10

Most importantly, the results show that the system performs better when it is given more knowledge-based information. For example, training with ten resource types (i.e., ontology classes) associated with a question is more informative for an MLP model than giving five types per question.

Table 5 shows the results of the other systems participating in the SMART 2021 challenge. The better performance in NDCG@10 was highly valued, as the other two scores (Accuracy and NDCG@5) are similar across systems. The best result of CitySAT is ranked at the top of the DBpedia AT task.

Table 5. SMART 2021 Leaderboard - DBpedia AT task.

System              Accuracy   NDCG@5   NDCG@10
CitySAT               0.984      0.842    0.854
Bhargav et al.        0.985      0.825    0.790
Celebi et al.         0.985      0.725    0.704
Hoang et al.          0.985      0.727    0.664
Steinmetz et al.      0.991      0.734    0.658
Perevalov et al.      0.991      0.643    0.577

5 Discussion and future work

With the DBpedia data provided in the SMART 2021 AT challenge, this study tried various explorations of text normalization. Including the filtering of Wh-terms in the stop words, there were multiple configuration settings that we expected to improve classification performance but that did not do so in this challenge. This opens the door to future work on how to improve the text features so that they capture semantic meaning in a way closer to a human thought process.

Although the CitySAT models are optimised in Section 3.4 to find the best combination of local classifiers for the two levels, there may be further options and different combinations of models to explore in the future. In particular, once we expand our evaluation environment to a Graphics Processing Unit (GPU), more ML models become applicable to our problem.

As in previous studies, we also found that several of the participating systems used fine-tuned BERT models [4]. Because of limitations on computing resources, however, this study intentionally deploys ML models on CPUs with an inexpensive computational cost during the project. In the future, plugging a BERT model into CitySAT could show whether it brings any performance benefits in the AT task.

Acknowledgements

We would like to thank the ISWC conference and the SMART challenge organisers. This work was partially supported by the SIRIUS Centre for Scalable Data Access (Research Council of Norway).

References

1. Ammar, A., Mehryar, S., Celebi, R.: A methodology for hierarchical classification of semantic answer types of questions. In: SMART@ISWC. pp. 41–48 (2020)
2. Bill, E., Jiménez-Ruiz, E.: Question embeddings for semantic answer type prediction. In: SMART@ISWC (2020)
3. Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc. (2009)
4. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers). pp. 4171–4186 (2019). https://doi.org/10.18653/v1/n19-1423
5. Kertkeidkachorn, N., Nararatwong, R., Nguyen, P., Yamada, I., Takeda, H., Ichise, R.: Hierarchical contextualized representation models for answer type prediction. In: SMART@ISWC. pp. 49–56 (2020)
6. Mihindukulasooriya, N., Dubey, M., Gliozzo, A., Lehmann, J., Ngomo, A.N., Usbeck, R., Rossiello, G., Kumar, U.: Semantic answer type and relation prediction task (SMART 2021). CoRR abs/2112.07606 (2021), https://arxiv.org/abs/2112.07606
7. Mihindukulasooriya, N., Dubey, M., Gliozzo, A., Lehmann, J., Ngonga Ngomo, A.C., Usbeck, R.: SeMantic AnsweR Type Prediction Task at ISWC 2020 Semantic Web Challenge. CEUR-WS 2774 (2020), http://ceur-ws.org/Vol-2774/
8. Nikas, C., Fafalios, P., Tzitzikas, Y.: Two-stage semantic answer type prediction for question answering using BERT and class-specificity rewarding. In: SMART@ISWC. pp. 19–28 (2020)
9. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011)
10. Perevalov, A., Both, A.: Augmentation-based answer type classification of the SMART dataset. In: SMART@ISWC. pp. 1–9 (2020)
11. Setty, V., Balog, K.: Semantic answer type prediction using BERT: IAI at the ISWC SMART task 2020. arXiv preprint arXiv:2109.06714 (2021)
12. Steinmetz, N., Sattler, K.U.: COALA - a rule-based approach to answer type prediction. In: SMART@ISWC. pp. 29–40 (2020)
13. The pandas development team: pandas-dev/pandas: Pandas (Feb 2020). https://doi.org/10.5281/zenodo.3509134
14. Vallurupalli, S., Sleeman, J., Finin, T., et al.: Fine and ultra-fine entity type embeddings for question answering. In: International Semantic Web Conference (2020)
Appendix

# imports for the code fragments below
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# distribute categorical targets (answer category & type) to numerical labels:
# take the class at position `type_no` of the answer-type list, or the
# "missing" indicator if the list is shorter than `type_no`
def type_to_int(self, data, type_no):
    return data.type.map(
        lambda x: self.type_maps[f"type{type_no}"][x[type_no - 1]]
        if len(x) >= type_no
        else self.type_maps[f"type{type_no}"]["missing"]
    )

# model for category classification
clf_category = LogisticRegression(
    random_state=seed, penalty='elasticnet', solver='saga',
    l1_ratio=0.2, n_jobs=-1, verbose=2, max_iter=200
).fit(X_train_category, y_train_category)

# model for the type of the literal category
clf_literal = LogisticRegression(
    random_state=seed, penalty='elasticnet', solver='saga',
    l1_ratio=0.5, n_jobs=-1, verbose=2, max_iter=200
).fit(X_train_category[train_literal_rows, :], y_train_literal)

# model for the type of the resource category
clf_type = MLPClassifier(
    random_state=seed, max_iter=10,
    hidden_layer_sizes=(1000, 500, 300), verbose=2
).fit(X_train_category[train_resource_rows], y_train_type)
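For completeness, the following sketch shows how the three fitted models above could be chained at prediction time, together with a stemmed term-frequency vectorizer of the kind described in Section 3.2. It is an illustration rather than the system's actual code: train_questions, category_encoder, and literal_encoder are assumed to be the cleaned training questions and fitted label encoders, and the single call to clf_type stands in for the per-position loop used in the experiments.

# requires: nltk.download("punkt"); nltk.download("stopwords")
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def stem_tokens(text):
    # lower-case, tokenize, drop stop words and punctuation, then stem
    tokens = nltk.word_tokenize(text.lower())
    return [stemmer.stem(t) for t in tokens if t.isalnum() and t not in stop_words]

# term-frequency features over the first 10,000 uni/bigrams (Section 3.2)
vectorizer = CountVectorizer(tokenizer=stem_tokens, ngram_range=(1, 2),
                             max_features=10000).fit(train_questions)

def predict_question(question):
    X = vectorizer.transform([question])
    category = category_encoder.inverse_transform(clf_category.predict(X))[0]
    if category == "boolean":
        return {"category": "boolean", "type": ["boolean"]}
    if category == "literal":
        literal = literal_encoder.inverse_transform(clf_literal.predict(X))[0]
        return {"category": "literal", "type": [literal]}
    # resource: predicted codes still need decoding via the inverted type maps
    return {"category": "resource", "type": clf_type.predict(X).ravel().tolist()}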