<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Semantic Answer Type Prediction using BERT</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vinay Setty</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Krisztian Balog</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Stavanger</institution>
          ,
          <country country="NO">Norway</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper summarizes our participation in the SMART Task of the ISWC 2020 Challenge. A particular question we are interested in answering is how well neural methods, and specifically transformer models, such as BERT, perform on the answer type prediction task compared to traditional approaches. Our main finding is that coarse-grained answer types can be identified effectively with standard text classification methods, with over 95% accuracy, and BERT can bring only marginal improvements. For fine-grained type detection, on the other hand, BERT clearly outperforms previous retrieval-based approaches.</p>
      </abstract>
      <kwd-group>
        <kwd>Answer type prediction</kwd>
        <kwd>answer category classification</kwd>
        <kwd>natural language understanding</kwd>
        <kwd>question answering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The importance of being able to identify the types or semantic categories of
the requested answers has long been recognized in question answering (QA) research as
a key step towards interpreting the meaning of natural language questions [
        <xref ref-type="bibr" rid="ref4 ref8">4, 8</xref>
        ].
This task may be performed either against a set of coarse-grained types (e.g.,
at the TREC QA track [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]) or against fine-grained type systems of knowledge
bases, such as DBpedia [
        <xref ref-type="bibr" rid="ref1 ref5">1, 5</xref>
        ]. The Semantic Answer Type prediction (SMART)
task [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], organized as a challenge at the 2020 International Semantic Web
Conference (ISWC '20), provides a large-scale evaluation platform for assessing
answer type prediction at both coarse-grained and fine-grained taxonomical levels.
      </p>
      <p>Specifically, given a natural language question as input, first a high-level
answer type category is to be predicted, which can be one of resource, literal,
or boolean. If the predicted category is resource, a more specific ontological
class is to be provided, using the type system of DBpedia or Wikidata. If the
predicted category is literal, it also has to be further classified as number,
date, or string. In this paper, we refer to the task of coarse-grained answer
detection as category classification and to the problem of fine-grained prediction
of (resource) types as type prediction. Table 1 shows some examples. As seen
from the examples, answers for the resource category are provided as a ranked
list of types.</p>
      <p>The main research objectives in this work are to assess: (1) How do neural
approaches perform compared to traditional feature-based classification approaches
on the category classification task? (2) How do neural classification approaches
fare against well-established (fusion-based) IR approaches on the type
prediction problem? We find that (1) is essentially a "solved" problem. Our baseline
SVM classifier with word unigrams as features achieved 95% accuracy. Neural
approaches yield only minor improvements. As for (2), type prediction has
previously been approached as a ranking problem, due to the large number of
possible types (∼760 types in DBpedia and ∼50k types in Wikidata) that rendered
classification-based approaches infeasible. We draw on recent work on extreme
multi-label classification and demonstrate substantial gains over the IR baselines.
It appears that fine-grained type detection on Wikidata is more challenging than
on DBpedia. However, the two are not directly comparable due to the different
evaluation measures that are employed, which calls for further analysis.</p>
      <p>Code and resources developed in this work are made publicly available at
https://github.com/iai-group/smart-task.</p>
    </sec>
    <sec id="sec-2">
      <title>Approach</title>
      <p>
        We follow a two-phase approach. In the first phase, we perform category
classification, that is, a supervised classifier predicts the high-level category of the
answer type. Then, in the second phase, we perform type prediction to identify
the top-k types for the questions for which the answer type was predicted to be
a resource. For category classification we use two classifiers: SVM with word
unigrams as features and fine-tuned BERT (Section 2.1). Type prediction has
previously been approached as a ranking task [
        <xref ref-type="bibr" rid="ref1 ref5">1, 5</xref>
        ], due to the large number
of possible types. As an alternative, we cast it as an extreme classification
problem (Section 2.2).
      </p>
      <sec id="sec-2-1">
        <title>Category Classification</title>
        <p>We flatten the high-level categories into the following five categories: boolean,
literal-number, literal-string, literal-date, and resource. Since the
category classification task is the same for both DBpedia and Wikidata, we combine the
training datasets for the two and predict the categories for their respective test
datasets using the combined model.</p>
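        <p>As an illustration, this flattening step can be sketched as follows (function and argument names are hypothetical, not taken from our implementation):</p>
        <preformat>
```python
def flatten_category(category, literal_type=None):
    """Map the two-level (category, literal type) annotation onto a single
    five-way label: boolean, literal-number, literal-string, literal-date,
    or resource."""
    if category == "literal":
        return "literal-" + literal_type  # number, string, or date
    return category  # boolean or resource

print(flatten_category("literal", "date"))  # literal-date
print(flatten_category("boolean"))          # boolean
```
        </preformat>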
        <p>Feature-based classification As a first approach to category classification, we
use TF-IDF-weighted word unigrams as features. The vocabulary construction
and IDF computations are based only on the training portion of the dataset,
to avoid any assumptions on the test data. Our implementation is based on the
CountVectorizer and TfidfVectorizer classes from the scikit-learn library (https://scikit-learn.org/) with
default parameters. We then train an SVM classifier with a linear kernel. We
also experimented with a Naive Bayes classifier, but decided to exclude
it after observing inferior performance.</p>
        <p>
          Neural approach As a second approach, we fine-tune a pre-trained BERT
model (RoBERTa) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] for a sequence classification task to classify the category.
Our implementation uses the HuggingFace API (https://huggingface.co/) for fine-tuning and category
classification.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Type Prediction</title>
        <p>
          IR-based methods We employ two ranking-based approaches from [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], which
were introduced for the task of identifying target types of (entity-bearing) search
queries. These approaches are representatives of two main families of object
ranking strategies, which have been termed early and late fusion design patterns
in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. According to the type-centric (TC, a.k.a. early fusion) approach, first a
textual representation is built for each type by concatenating the descriptions of
entities that are assigned that type. Then, these type description (pseudo)
documents can be ranked using standard IR models. Specifically, we use the DBpedia
short abstracts of entities and rank the type documents using BM25. The
second strategy is termed entity-centric (EC, a.k.a. late fusion). There, the top-k
most relevant entities from the underlying knowledge base are retrieved using
the question as a query. Then, the relevance score of a given type is computed
by aggregating the relevance scores of entities with that type. We use BM25 as
the underlying retrieval model and a "catch-all" entity representation, following
the settings in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. The cut-off parameter k is chosen empirically based on the
training set (k = 20).
        </p>
        <p>
          Neural method Due to the large number of possible labels, using standard
Transformer models is not feasible. Instead, we cast the type prediction task
as an extreme multi-label text classification (XMC) problem: given a question
as input text, return the top-k most relevant types from a large collection of
possible types. Vanilla Transformer models such as BERT [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], RoBERTa [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], and
XLNet [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] are ineffective in this scenario due to the memory and computation
requirements imposed by the large number of possible labels. This was also
confirmed by our experiments: fine-tuning the above-mentioned Transformer
models using the HuggingFace framework exhausted all the memory on a 32GB
Nvidia Tesla V100 GPU. While this may work on a GPU with larger memory, we
could not verify this, since we do not have access to such a GPU, and it may still be
computationally very expensive to train these models. In addition to the computational
limitations, as we show in Section 4, the types are very sparse, with most of
them having only a few training instances. To address these challenges,
a model designed for XMC is essential. For this purpose, we use X-Transformers [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], a recent solution that extends
Transformer models to XMC, which we refer to as XBERT in the rest of the paper.
        </p>
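        <p>To make the entity-centric (late fusion) scoring concrete, here is a minimal sketch of the aggregation step, with hypothetical retrieval scores standing in for BM25 output:</p>
        <preformat>
```python
from collections import defaultdict

def entity_centric_type_scores(retrieved, entity_types, k=20):
    """Late fusion: aggregate the retrieval scores of the top-k entities
    over the types assigned to them, and rank types by aggregated score."""
    type_scores = defaultdict(float)
    for entity, score in retrieved[:k]:
        for t in entity_types.get(entity, []):
            type_scores[t] += score
    return sorted(type_scores.items(), key=lambda kv: -kv[1])

# Hypothetical BM25 scores for entities retrieved for a question.
retrieved = [("Paris", 9.1), ("Lyon", 7.4), ("Seine", 3.2)]
entity_types = {
    "Paris": ["dbo:City", "dbo:Place"],
    "Lyon": ["dbo:City", "dbo:Place"],
    "Seine": ["dbo:River", "dbo:Place"],
}
print(entity_centric_type_scores(retrieved, entity_types))
```
        </preformat>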
        <p>XBERT consists of three components:
1. Semantic Label Indexing (SLI), which performs hierarchical clustering on
the labels to reduce the label space.
2. Deep Neural Matching (DNM), which fine-tunes the Transformer models for each
of the label clusters identified by SLI.
3. Ensemble Ranking (ER), which ranks the instances within the label clusters
by training a linear ranker conditioned on the label clusters and the DNM
Transformer's output.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experimental Evaluation</title>
      <p>In this section, we discuss our experimental setup, introduce the evaluation
measures, and present our results.</p>
      <sec id="sec-3-1">
        <title>Data</title>
      </sec>
      <sec id="sec-3-2">
        <title>Methods</title>
        <p>The following methods are compared:
- SVM: Support Vector Machine for category classification
- BERT: RoBERTa for category classification
- XBERT: X-Transformers for type prediction
- IR/TC: Type-centric IR approach for type prediction
- IR/EC: Entity-centric IR approach for type prediction
We train all neural models on a single Nvidia Tesla V100 GPU with 32GB
memory.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Evaluation Metrics</title>
        <p>
          Category classification is evaluated in terms of classification accuracy. Type
prediction is cast as a ranking task and is evaluated using rank-based metrics. It,
however, considers only those questions that fall into the literal or resource
answer categories. Furthermore, evaluation is performed differently for DBpedia
and for Wikidata, given the nature of their respective type taxonomies. Types in
the DBpedia Ontology are organized hierarchically, up to 7 levels deep. There, a
graded evaluation metric, Normalized Discounted Cumulative Gain (NDCG@k),
is used. Specifically:
- For literal answer types, only a single predicted type is considered, which
can be either correct (NDCG=1) or incorrect (NDCG=0).
- For resource answer types, a ranked list of top-k ontology classes is
considered and evaluated in terms of lenient NDCG@k with linear decay [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. The
gain for a given predicted type is 0 if it is not on the same path as any of
the gold types, and otherwise it is 1 − d(t, t_q)/h, where d(t, t_q) is the distance
between the predicted type and the closest matching gold type in the type
hierarchy, and h is the maximum depth of the type hierarchy.
In the case of Wikidata, the type hierarchy is rather flat. Therefore, type prediction
is evaluated using a binary notion of relevance, with Mean Reciprocal Rank
(MRR) as the metric.
        </p>
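        <p>The linear-decay gain can be sketched as follows; the distance function over the type hierarchy is a hypothetical toy stand-in (returning None when the predicted type shares no path with a gold type):</p>
        <preformat>
```python
def linear_decay_gain(pred_type, gold_types, distance, h):
    """Gain is 1 - d(t, t_q)/h for the closest gold type on the same path,
    and 0 if the predicted type shares no path with any gold type."""
    dists = [distance(pred_type, g) for g in gold_types]
    dists = [d for d in dists if d is not None]  # None: not on the same path
    if not dists:
        return 0.0
    return 1.0 - min(dists) / h

# Toy hierarchy distances (hypothetical); an exact match has distance 0.
dist = lambda t, g: {("dbo:City", "dbo:City"): 0,
                     ("dbo:City", "dbo:Place"): 2}.get((t, g))
print(linear_decay_gain("dbo:City", ["dbo:Place"], dist, h=7))
```
        </preformat>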
        <p>We report results on the training dataset, using 5-fold cross-validation. For
our official submissions, we also report the performance on the test set.</p>
      </sec>
      <sec id="sec-3-4">
        <title>Results</title>
        <p>Category Classification It can be seen from the results in Table 3 that both
feature-based and neural approaches perform quite well for category
classification. BERT has a slight advantage over SVM. We hypothesize that, due to the
clear patterns which the models can learn, high-level category classification
is a fairly easy task, hence the high accuracy scores. However, most mistakes
occur for the resource class, which is the majority class in both datasets.</p>
        <p>[Table 3: Category classification accuracy (Train/Test) of SVM and BERT on the DBpedia and Wikidata datasets.]</p>
        <p>Type Prediction Since different metrics are used for DBpedia and Wikidata,
we report results on the two datasets separately, in Tables 4 and 5, respectively.
Recall that (stage-two) type prediction is applied on top of (stage-one) category
classification (SVM or BERT) and is only carried out when the predicted
category is resource. We thus prefix the method names in the result tables with
SVM- or BERT- to indicate how category classification was performed.</p>
        <p>
          On DBpedia (Table 4), XBERT clearly outperforms the IR approaches. We
attribute this to the fact that XBERT is tailored for the XMC problem, which allows it to
deal with the large number of types and with the sparsity of tail resource types. The slight
difference between SVM-XBERT and BERT-XBERT is due to the mistakes
made by SVM in category classification. Given the large advantage of XBERT
over the IR approaches, our official submissions on Wikidata (Table 5) only
considered the former. It should nevertheless be noted that the IR approaches are
unsupervised methods that do not need any training data. Supervised
alternatives have been shown to perform significantly better [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. We leave that comparison
to future work.
        </p>
        <p>
In this section, we analyze the errors made by our best performing
approach, BERT-XBERT. First, we look at resource types where most errors
occur, that is, types which are present in the gold labels but are missing from the
predicted labels. Table 6 shows the top-10 errors in type prediction for
DBpedia and Wikidata, together with their total instance counts. Ideally, we would
expect the number of mistakes to be directly proportional to the total
frequency of the resource type. In DBpedia, some types, such as dbo:State,
dbo:Activity, dbo:Band, and dbo:Profession, break this pattern. Similarly, in
Wikidata, natural person, political territorial entity, and big city
are some of the types with which the BERT-XBERT model struggles.
        </p>
        <p>In Table 7, we show anecdotal examples of the mistakes made by the
BERT-XBERT approach. Most of these errors are due to irrelevant types returned in
the result list. In several cases, the predicted labels do contain the gold label
but place it at lower ranks, which affects the NDCG and MRR scores. In some
cases the predicted labels are appropriate, even though they do not exactly match
the gold labels. For example, for the last question in Table 7, publication is one
of the gold labels, which is not predicted, but written work and periodical
are still relevant among the predicted labels. We also spotted several instances
with double questions, such as "What conflict occurred in Philoctetes and who
was involved?", and questions with grammatical errors and typos.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>In this paper, we presented our solution for the SMART Task challenge of ISWC
2020, which was the best performing approach on both datasets and tasks,
across all evaluation metrics. Our findings suggest that for coarse-grained
category prediction, simple feature-based approaches are quite effective, with over
95% accuracy, while sophisticated neural Transformer architectures only improve
marginally. For fine-grained type prediction, on the other hand, Transformer
models for extreme multi-label classification clearly outperform retrieval-based
approaches.</p>
      <p>Our future work concerns an in-depth analysis of the results on DBpedia vs.
Wikidata, to understand the differences and modeling requirements for small
and hierarchical (DBpedia) vs. large and shallow (Wikidata) type taxonomies.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K.</given-names>
            <surname>Balog</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Neumayer</surname>
          </string-name>
          .
          <article-title>Hierarchical target type identification for entity-oriented queries</article-title>
          .
          <source>In Proceedings of the 21st ACM international conference on Information and knowledge management</source>
          ,
          <source>CIKM '12</source>
          , pages
          <fpage>2391</fpage>
          –
          <fpage>2394</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>W.-C.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-F.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and I. S.</given-names>
            <surname>Dhillon</surname>
          </string-name>
          .
          <article-title>Taming pretrained transformers for extreme multi-label text classification</article-title>
          .
          <source>In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining</source>
          , pages
          <volume>3163</volume>
          –
          <fpage>3171</fpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          . BERT:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers), pages
          <fpage>4171</fpage>
          –
          <fpage>4186</fpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Ferrucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. W.</given-names>
            <surname>Brown</surname>
          </string-name>
          , J.
          <string-name>
            <surname>Chu-Carroll</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Fan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Gondek</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Kalyanpur</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Lally</surname>
            ,
            <given-names>J. W.</given-names>
          </string-name>
          <string-name>
            <surname>Murdock</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Nyberg</surname>
            ,
            <given-names>J. M.</given-names>
          </string-name>
          <string-name>
            <surname>Prager</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Schlaefer</surname>
            , and
            <given-names>C. A.</given-names>
          </string-name>
          <string-name>
            <surname>Welty</surname>
          </string-name>
          .
          <article-title>Building Watson: An overview of the DeepQA project</article-title>
          .
          <source>AI Magazine</source>
          ,
          <volume>31</volume>
          (
          <issue>3</issue>
          ):
          <volume>59</volume>
          –
          <fpage>79</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Garigliotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hasibi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Balog</surname>
          </string-name>
          .
          <article-title>Target type identification for entity-bearing queries</article-title>
          .
          <source>In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '17</source>
          , pages
          <fpage>845</fpage>
          –
          <fpage>848</fpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          .
          <article-title>RoBERTa: A robustly optimized BERT pretraining approach</article-title>
          . arXiv preprint arXiv:1907.11692,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.</given-names>
            <surname>Mihindukulasooriya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dubey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gliozzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-C. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Usbeck. SeMantic AnsweR</surname>
          </string-name>
          <article-title>Type prediction task (SMART) at ISWC 2020 Semantic Web Challenge</article-title>
          . CoRR, abs/2012.00555,
          <year>2020</year>
          . URL https://arxiv.org/abs/2012.00555.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Geng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Long</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Jiang</surname>
          </string-name>
          .
          <article-title>Multi-task learning for conversational question answering over a large-scale knowledge base</article-title>
          .
          <source>In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLPIJCNLP)</source>
          , pages
          <fpage>2442</fpage>
          –
          <fpage>2451</fpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Voorhees</surname>
          </string-name>
          .
          <article-title>The TREC question answering track</article-title>
          .
          <source>Nat. Lang. Eng.</source>
          ,
          <volume>7</volume>
          (
          <issue>4</issue>
          ):
          <volume>361</volume>
          –
          <fpage>378</fpage>
          , Dec.
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          , J. Carbonell,
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          . XLNet:
          <article-title>Generalized autoregressive pretraining for language understanding</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <volume>5753</volume>
          –
          <fpage>5763</fpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Balog</surname>
          </string-name>
          .
          <article-title>Design patterns for fusion-based object retrieval</article-title>
          .
          <source>In Proceedings of the 39th European conference on Advances in Information Retrieval, ECIR '17</source>
          , pages
          <fpage>684</fpage>
          –
          <fpage>690</fpage>
          . Springer,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>