Reaching out for the Answer: Relation Prediction

Khaoula Benmaarouf and Nadine Steinmetz
Technische Universität Ilmenau, Germany
firstname.lastname@tu-ilmenau.de

Abstract. This paper presents our contribution to the SMART challenge 2021 (SeMantic Answer Type and Relation prediction task), specifically the relation prediction task for both the DBpedia and Wikidata datasets. We introduce our approach to predict the ontology properties (relations) mentioned in a natural language question in order to be able to answer the question correctly. Our solution is based on a pre-trained BERT model using fastai, in combination with data augmentation (for DBpedia). Both techniques have separately proven very effective for text classification problems and outperform other approaches. We apply a multi-label classification method built on the fastai library to the SMART task and achieve very good results, as we demonstrate in an experimental evaluation on the relation vocabularies of DBpedia and Wikidata.

Keywords: Question Answering · Text Augmentation · Relation Prediction

1 Introduction

This paper describes our approach to relation prediction for natural language questions within the context of the SMART (SeMantic Answer Type and Relation prediction task) challenge 2021 [7]. This task is focused on one of the most popular tasks in Natural Language Processing (NLP): Knowledge Base Question Answering (KBQA). The aim of KBQA is to transform a natural language question into a formal query – specifically SPARQL – to be able to answer the question. In order to achieve this, two main subtasks can be utilized within the processing pipeline: answer type prediction and relation prediction.

Question Answering (QA) systems are commonly used as an interface between a large amount of (un)structured data and users, who are enabled to request the data without knowledge of a formal query language. There are two types of QA: open and closed domain. Open-domain QA systems are not restricted to specific topics, but aim to provide concise answers to questions from arbitrary domains. The downside of open-domain QA is that it is difficult for the system to find answers to all possible questions, facing challenges such as ambiguity or incomplete knowledge bases. Closed-domain QA systems, on the other hand, are focused on particular domains, where the QA application has been developed for a specific task, which helps the system to provide answers fast and (mostly) correctly – for instance, QA systems in the medical field (Alzheimer's disease) [2] or chatbots applied to specific customer service tasks. In both cases, the QA application can benefit from various subtasks within the QA pipeline. Relation prediction detects references within the natural language question in order to assign the correct ontology properties which are necessary for the formal query (specifically SPARQL). This paper proposes a solution for the relation prediction task using Bidirectional Encoder Representations from Transformers (BERT), where the prediction task is considered a multi-label classification problem.

This paper is structured as follows: Section 2 discusses previous work related to our subtask. Section 3 gives an overview of the datasets and Section 4 presents an analysis of both datasets (DBpedia and Wikidata).
The preprocessing and training steps are discussed in Sections 5 and 6. Evaluation results are described in Section 7. Finally, Section 8 concludes with a summary of our results as well as an outlook.

2 Related Work

The approach on relation and entity linking by Sakor et al. is based on a set of rules and a mapping of connected entities and relations [11]. The approach is therefore independent of the underlying knowledge graph and can be transferred to various knowledge graphs. The initial solution has been tested on Wikidata and achieved good results, but the publicly available API also provides links to the DBpedia knowledge graph.

Abolghasemi et al. proposed an instance-based method to detect the relation of a new question using similar paraphrases of questions in the training data [1]. This method uses two subnetworks: a question-question network, which uses semantic matching between the input question and the training questions to find the training question with the shortest distance to the input question, and a second subnetwork created from the question-answer relation, where the output of the question-question network is used to retrieve the corresponding answer from the dataset. The approach benefits from the assumption that there are various lexical representations for each question about a relation. Based on these similarities, the authors claim that the likeness of questions can be utilized to uncover the relation hidden behind question phrases. The SimpleQuestions dataset was used for training. The approach achieved an accuracy of 93.41%, an improvement over other state-of-the-art models.

Zhao et al. proposed a solution to the problem of incompleteness in KGQA: most research processes each question independently, without taking the hidden relations inherited from the neighborhood into consideration [14]. The authors used attention-based graph embeddings to capture both entity and relation features between entities in the near neighborhood. The implemented KGQA system achieves an increased F1 score for the relation prediction task over the model without relation configuration on the datasets SimpleQuestions, WebQuestions, GQ, and QALD-5.

The problem of QA has been investigated by Mohammed et al. with a focus on the accuracy-complexity tradeoff: simple, straightforward CNN and GRU baselines plus a few heuristics were applied to the SimpleQuestions dataset [9]. The results show that these basic deep learning approaches achieve results similar to the state of the art. The authors performed several experiments utilizing bidirectional Gated Recurrent Units (BiGRUs) and Convolutional Neural Networks (CNNs), amongst others. The best approaches reached accuracies of 82.3% and 82.8%, respectively, for relation prediction.

Since its publication, BERT [3] has been widely used for tasks that require the transformation of language patterns, such as text summarization, language translation, or question answering. Relation prediction can also be considered a transformation task: from a natural language question to a set of relation labels. Mihindukulasooriya et al. proposed the approach SLING, using a BERT-embedding-based classifier and the AMR graph of the question [8].
After creating AMR triples from the AMR graph representation, the authors combine and rank the results of supervised and unsupervised classification tasks. Naseem et al. presented an approach utilizing a pre-trained BERT model and leveraging the AMR (Abstract Meaning Representation) of a question for the relation linking task [10]. The two-staged approach first identifies the number and position of potential relations in the sentence and the respective AMR graph. In the next step, the most relevant relation is predicted for each previously identified spot. With their approach, the authors outperform several other approaches ([11] and [8] amongst others) on the datasets QALD-9, LC-QuAD 1.0/2.0, and SimpleQuestions.

3 Datasets

Table 1: Datasets of DBpedia and Wikidata.

Dataset    Train    Test    Total
DBpedia    34,204   8,552   42,756
Wikidata   24,112   6,029   30,141

The SMART task provides datasets for the two knowledge bases DBpedia and Wikidata. Some statistical details on the datasets provided for the challenge are shown in Table 1 and Section 4. Our approach considers the task as a relation prediction classification, where each question is assigned its relation categories. While this task can be considered short-text classification, a few unique characteristics of the datasets contribute to data sparsity and make the classification challenging. For the challenge, the following datasets are provided for both knowledge bases: relation vocabularies, train data, and test questions. To train a model on the data, it needs to be transformed into a feature-target form.

3.1 DBpedia

The DBpedia dataset consists of 42,756 samples, split into 80% training data and 20% test data, as shown in Table 1. The dataset is divided into three files (relation vocabulary, test questions, and train data). The relation vocabulary contains properties from the mapped ontology (having http://dbpedia.org/ontology/ as prefix) and unmapped properties (having http://dbpedia.org/property/ as prefix). The total number of relations in the vocabulary is 717. The train file contains the following four attributes: question, relations, number of relations, and ID. The number of relations specifies how many different relation classes are contained in the question – which results in one list of relevant relations per question. The test questions file only contains the questions and the ID. A sample of the DBpedia train data is shown in Figure 1.

Fig. 1: Five train samples of the DBpedia dataset.

3.2 Wikidata

The Wikidata dataset consists of 24,112 training samples and 6,029 test samples, as shown in Table 1. The train data contains five attributes: questions, relations, relation labels, num of rels, and id. For Wikidata, ontology properties have a unique identifier which is not human-readable (attribute relations) – usually starting with a P, such as P2397 for the property with the human-readable label "YouTube channel ID" – and additionally human-readable labels (attribute relation labels). The length of the vocabulary is 3,639. Figure 2 shows five sample records from the training dataset.

Fig. 2: Five training samples of the Wikidata dataset.
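To make the feature-target transformation concrete, the following minimal sketch loads a train split and extracts (question, relations) pairs, together with the relation frequencies used for the analysis in Section 4. The file name and the assumption that the split is a JSON array with "question" and "relations" attributes are illustrative, based on the dataset description above.

    import json
    from collections import Counter

    # Load the train split (file name is illustrative).
    with open("smart2021_dbpedia_train.json", encoding="utf-8") as f:
        records = json.load(f)

    # Feature-target form: the question text is the feature,
    # the list of referenced relations is the multi-label target.
    pairs = [(r["question"], r["relations"])
             for r in records if r.get("question") and r.get("relations")]

    # Relation frequencies over the training data (cf. Section 4).
    freq = Counter(rel for _, relations in pairs for rel in relations)
    print(freq.most_common(10))                     # head of the distribution
    print(sum(1 for n in freq.values() if n == 1))  # long-tail singletons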
4 Data Analysis

4.1 DBpedia

We analyzed the training datasets for the frequencies of the occurring properties to be able to assess the distribution of properties and the sparsity of data for specific classes (class as in classification of relation sets). As shown in Figure 3, the distribution of properties is long-tailed. The most frequent mapped and unmapped properties are dbo:genre and dbp:birthPlace, respectively. Out of 338 unique mapped properties, 42 occur only once. For the unmapped properties, these numbers are 357 and 58, respectively.

Fig. 3: Top 10 (a) mapped and (b) unmapped ontology properties within the DBpedia train dataset.

4.2 Wikidata

Figure 4 shows the most frequently used labels in the training dataset, where instance of is the most frequent relation with 6,418 occurrences – more than four times the count of the second most frequent relation (point of time with 1,314 occurrences). Out of 3,171 unique relations in the dataset, a large number (1,889) occur only once.

Fig. 4: Top 10 relations within the Wikidata train dataset.

4.3 Handling of Imbalanced Data

As shown in the previous sections, the training datasets are very imbalanced regarding the distribution and frequency of relations. This results in low accuracies for questions referring to relations in the long tail of the distribution. Although not considered for the approach presented in this paper, we examined strategies to compensate for such imbalances. Unfortunately, simple data augmentation strategies, as discussed in Section 5.2, do not suffice. The training dataset would need to be enriched with questions that contain the long-tail relations, but with different contexts and wording than contained in the dataset. We consider this future work to further improve the results of our approach.

5 Preprocessing

Our approach includes several preprocessing steps. For the classification processes for DBpedia and Wikidata, we utilize the same preprocessing pipeline except for the first step. For DBpedia, we remove the prefix from the relation labels, lower-case the letters, and resolve the camel-case format of the labels. For example, dbo:RecordLabel is transformed to record label. After that, the pipelines for Wikidata and DBpedia are identical. In the next steps, the pipeline takes the questions and relations through data parsing and data augmentation, removes duplicates inside each block, and creates an index mapping between the position of a label inside the vocabulary and its block index. Labels that exist in the vocabulary but do not occur in the training set are added. A binary label matrix is created, and the labels are merged into the sentence randomly. The training dataset is split into train and validation sets and, finally, the text is tokenized using the BERT tokenizer.

5.1 Parsing Data

Parsing of questions, relations, and blocks is done on the training data in such a way that the syntactic order of each question is shuffled into three different orders, and all three variants of the questions are then combined. An index between the vocabulary and the related blocks is used to retrieve index values quickly and speed up the process. We utilize lists consisting of: question, length, block, and relations. The training data Q is appended to these lists separately, and the question list is then split into Q1 and Q2. Subsequent data is appended randomly to the split question lists. We remove duplicates that cause redundancy, as this makes it easier for the model to find patterns in unique blocks without being biased towards one block over another merely because it occurs redundantly. The label matrix is transformed into binary format to be suitable for training.
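Two of these steps can be illustrated with a short sketch: the normalization of DBpedia relation labels and the construction of the binary label matrix. The toy vocabulary below stands in for the full 717-entry relation vocabulary, and scikit-learn's MultiLabelBinarizer is used here in place of our own matrix-building code.

    import re
    from sklearn.preprocessing import MultiLabelBinarizer

    def normalize(label):
        # "dbo:RecordLabel" -> "record label": strip the prefix,
        # split the camel case, and lower-case the result.
        name = label.split(":", 1)[-1]
        return re.sub(r"(?<!^)(?=[A-Z])", " ", name).lower()

    print(normalize("dbo:RecordLabel"))  # record label

    # Toy stand-in for the normalized relation vocabulary.
    vocabulary = ["record label", "birth place", "genre", "author"]
    targets = [["genre", "author"], ["birth place"]]

    # One row per question, one column per vocabulary entry;
    # a 1 marks that the question references the relation.
    mlb = MultiLabelBinarizer(classes=vocabulary)
    print(mlb.fit_transform(targets))  # [[0 0 1 1] [0 1 0 0]]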
5.2 Data Augmentation

To extend the training data, several augmentation strategies can be applied [4]. The method used for augmentation in our approach is the copy-paste method, a technique which duplicates existing data to increase the sample size, optionally adding slightly modified or synthetic data. Increasing the amount of data helps the model to "see" a specific pattern more often, which is useful when the dataset is relatively small for feeding a neural network model. In our case, the data has been copied three times. Thus, the sample size increased to 102,612 records for DBpedia and to 72,336 records for Wikidata.

5.3 BERT Tokenization

For the tokenization of the input data, we utilized a FastAIBertTokenizer built on the fastai library [6]. The BERT tokenizer takes the input text and maps it to its integer representation in the BERT word-embedding dictionary, adding special tokens: [CLS] at the beginning of each input text, [SEP] at the end, [PAD] for padding so that all inputs reach the assigned maximum length, and [UNK] for tokens that do not exist in the BERT vocabulary. Input text that exceeds the given maximum length is truncated automatically, so that all input matrices have the same size. More details on tokenization with BERT are described in [3].
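The FastAIBertTokenizer is essentially a thin wrapper that makes a pre-trained BERT tokenizer usable within fastai. The sketch below reproduces the described behaviour with the Hugging Face tokenizer instead – an illustrative substitution, not the exact class from our pipeline.

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    encoded = tokenizer(
        "Who is the author of the novel Dune?",
        max_length=256,        # longer inputs are truncated automatically
        truncation=True,
        padding="max_length",  # shorter inputs are filled with [PAD]
    )

    # The first tokens show the added [CLS] marker and the word pieces:
    # ['[CLS]', 'who', 'is', 'the', 'author', ...]
    print(tokenizer.convert_ids_to_tokens(encoded["input_ids"])[:10])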
6 Training of the Model

6.1 Language Model and Prerequisites

Before the release of transformers [13], most modern NLP systems utilized gated recurrent neural networks (RNNs), such as long short-term memory (LSTM) networks and gated recurrent units (GRUs), with additional attention mechanisms. Recurrent neural networks were the state of the art in sequence modeling, especially for NLP problems such as machine translation and text summarization, as they can memorize sequence dependencies with the help of gates [12]. Since RNNs take the input tokens one by one according to their position in the sequence, sequential computation and training time increase, especially for longer sequence lengths [5]. In addition, RNNs face challenges in handling long-term dependencies: with an increasing amount of sequence data, it becomes harder for the model to memorize all past dependencies [13]. Vanishing and exploding gradients also prevent RNNs from capturing long-term dependencies [12].

In 2017, an encoder-decoder model called the transformer was introduced to solve the problems facing RNNs [13]. Transformers can be used for classification problems in a supervised learning setting, as in our case. For our approach, the BERT model is used, which is a model from the transformer family. We utilized BERT BASE UNCASED, which is not case-sensitive and has a much lower number of layers compared to the BERT LARGE model: the base model has only 12 encoder layers and a total of 110M parameters [3]. The input text is tokenized using the BERT tokenizer, and the transformed input is then given to the BERT model to classify the input text into the given classes.

6.2 Hyperparameters

As loss function, we use binary cross entropy with logits, which applies a sigmoid activation layer to the model output before computing the binary cross entropy, mapping each label to 0 or 1. Binary cross entropy is preferred over a multi-class cross entropy because it yields higher accuracies. The evaluation metric used is the F1 score. A simple accuracy metric cannot be used, as it does not take the imbalance of the dataset into consideration, while the F1 score combines precision and recall into a score that indicates how well the model predicts each label. We experiment with six different learning rates. The maximum sequence length is set to 256, with a batch size of 32, and the model is trained for 20 iterations.

In terms of training and validation, we utilized two different validation methods: static splits and k-fold cross-validation with 3 folds. As shown in our results in the next section, we achieved better results with the cross-validation method for the Wikidata dataset, but not for DBpedia.
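As a minimal sketch of this configuration, the snippet below wires up the loss and the metric on dummy tensors. It is illustrative only: the actual training runs through fastai's training loop, and the random targets merely stand in for the binary label matrix from Section 5.

    import torch
    from torch import nn
    from sklearn.metrics import f1_score

    NUM_LABELS = 717  # size of the DBpedia relation vocabulary (Section 3.1)
    BATCH_SIZE = 32

    # BCEWithLogitsLoss fuses the sigmoid with binary cross entropy and
    # scores each relation as an independent yes/no decision per question.
    loss_fn = nn.BCEWithLogitsLoss()

    logits = torch.randn(BATCH_SIZE, NUM_LABELS)  # stand-in classifier output
    targets = torch.randint(0, 2, (BATCH_SIZE, NUM_LABELS)).float()
    loss = loss_fn(logits, targets)

    # F1 on the thresholded sigmoid outputs; unlike plain accuracy,
    # micro-averaged F1 reflects precision and recall per label.
    preds = (torch.sigmoid(logits) > 0.5).int().numpy()
    print(loss.item(), f1_score(targets.int().numpy(), preds, average="micro"))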
7 Evaluation

The results of the different combinations of strategies are shown in Table 2. We utilized the basic approach with the trained model, data augmentation to increase the training data, and three different validation split methods: 80/20 split, 99/1 split, and cross-validation as a more flexible version of the validation step. The quality of the results did not increase for both datasets when adding all additional strategies. The cross-validation strategy achieved better results for Wikidata, but not for DBpedia. For DBpedia, the 80/20 split achieved better results than the 99/1 split. The data augmentation step, in turn, was only successful for DBpedia. For Wikidata, the recall decreases significantly when using the augmented training dataset. While we were not able to identify the exact reason for this behavior, we noticed a significantly increased number of predicted relations when utilizing the augmented training dataset: an average of 9 relations is predicted, compared to an average of only 2 relations for the results with the best F1 score. The pipelines of the best-performing combinations are depicted in Figure 5. Overall, our best strategy combinations outperform the other competitor of the SMART 2021 challenge, as shown in Table 2.

Fig. 5: The pipelines of the best-performing combinations of strategies for both datasets – Wikidata and DBpedia.

Table 2: Results for DBpedia and Wikidata using different combinations of strategies: cv – cross-validation (3 folds), da – data augmentation.

Wikidata
Approach                 Precision   Recall    F1
base + 80/20             0.72152     0.74473   0.70697
base + 99/1              0.79481     0.73229   0.74884
base + cv                0.75094     0.81630   0.76018
base + 80/20 + da        0.79532     0.20055   0.29985
competitor SMART 2021    0.61630     0.61105   0.60701

DBpedia
Approach                 Precision   Recall    F1
base + 80/20             0.84094     0.84204   0.83586
base + 80/20 + da        0.86135     0.87602   0.86232
base + 99/1 + da         0.86475     0.87129   0.86194
base + cv + da           0.82497     0.92040   0.85404
competitor SMART 2021    0.83682     0.82958   0.83151

8 Conclusion

In this paper, we presented our approach for the Relation Prediction Task of the SMART challenge at ISWC 2021. The goal was to predict the set of relations needed to create the formal SPARQL query that answers the question. We created a classification pipeline and additionally implemented data augmentation and cross-validation methods. The results of our experiments show different combinations of these strategies for both datasets – Wikidata and DBpedia. The combination of strategies achieved very good results compared to the other participant of the relation prediction task. We consider the problem as a set of sequence classification tasks, each making use of a fine-tuned BERT classifier. For the more fine-grained (and more challenging) problem of relation prediction – since there can be hundreds or thousands of classes – we proposed enriching the trained BERT model with additional strategies. For future work, we consider a more adaptive strategy to deal with the imbalanced datasets and to utilize data augmentation only for the long tail of the frequency distribution of the properties. Also, for Wikidata, the human-readable relation labels should be considered instead of the IDs, although the latter are the ones required for the classification task.

9 Acknowledgements

This work was partially funded by the German Research Foundation (DFG) under grant no. SA 782/30-1 and STE 3033/1-1.

References

1. Abolghasemi, A., Momtazi, S.: Neural relation prediction for simple question answering over knowledge graph. CoRR abs/2002.07715 (2020), https://arxiv.org/abs/2002.07715
2. Buzaaba, H., Amagasa, T.: Question answering over knowledge base: A scheme for integrating subject and the identified relation to answer simple questions. SN Comput. Sci. 2(1), 25 (2021). https://doi.org/10.1007/s42979-020-00421-7
3. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805 (2018), http://arxiv.org/abs/1810.04805
4. Feng, S.Y., Gangal, V., Wei, J., Chandar, S., Vosoughi, S., Mitamura, T., Hovy, E.: A survey of data augmentation approaches for NLP (2021)
5. Ghojogh, B., Ghodsi, A.: Attention mechanism, transformers, BERT, and GPT: Tutorial and survey. http://dx.doi.org/10.31219/osf.io/m6gcn (2020)
6. Howard, J., Gugger, S.: fastai: A layered API for deep learning. CoRR abs/2002.04688 (2020), https://arxiv.org/abs/2002.04688
7. Mihindukulasooriya, N., Dubey, M., Gliozzo, A., Lehmann, J., Ngonga Ngomo, A.C., Usbeck, R., Rossiello, G., Kumar, U.: Semantic answer type and relation prediction task (SMART 2021). arXiv (2022)
8. Mihindukulasooriya, N., Rossiello, G., Kapanipathi, P., Abdelaziz, I., Ravishankar, S., Yu, M., Gliozzo, A., Roukos, S., Gray, A.: Leveraging semantic parsing for relation linking over knowledge bases. The Semantic Web – ISWC 2020, pp. 402–419 (2020)
9. Mohammed, S., Shi, P., Lin, J.: Strong baselines for simple question answering over knowledge graphs with and without neural networks (2018)
10. Naseem, T., Ravishankar, S., Mihindukulasooriya, N., Abdelaziz, I., Lee, Y., Kapanipathi, P., Roukos, S., Gliozzo, A., Gray, A.G.: A semantics-aware transformer model of relation linking for knowledge base question answering. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pp. 256–262. Association for Computational Linguistics (2021). https://doi.org/10.18653/v1/2021.acl-short.34
11. Sakor, A., Singh, K., Patel, A., Vidal, M.E.: Falcon 2.0: An entity and relation linking tool over Wikidata. In: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 3141–3148. CIKM '20, Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3340531.3412777
12. Siami-Namini, S., Tavakoli, N., Namin, A.S.: A comparative analysis of forecasting financial time series using ARIMA, LSTM, and BiLSTM. CoRR abs/1911.09512 (2019), http://arxiv.org/abs/1911.09512
13. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems 30 (NIPS 2017), pp. 5998–6008 (2017), https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html
14. Zhao, F., Hou, J., Li, Y., Bai, L.: Relation prediction for answering natural language questions over knowledge graphs. In: 2021 International Joint Conference on Neural Networks (IJCNN), pp. 1–8 (2021). https://doi.org/10.1109/IJCNN52387.2021.9534205