rewordQALD9: A Bilingual Benchmark with Alternative Rewordings of QALD Questions

Manuela Sanguinetti1,∗, Maurizio Atzori1 and Nicoletta Puddu2
1 Department of Mathematics and Computer Science, University of Cagliari, Italy
2 Dipartimento di Lettere, Lingue e Beni Culturali, University of Cagliari, Italy

SEMANTICS 2022 EU: 18th International Conference on Semantic Systems, September 13-15, 2022, Vienna, Austria
∗ Corresponding author.
manuela.sanguinetti@unica.it (M. Sanguinetti); atzori@unica.it (M. Atzori); n.puddu@unica.it (N. Puddu)
ORCID: 0000-0002-0147-2208 (M. Sanguinetti); 0000-0001-6112-7310 (M. Atzori); 0000-0003-4924-0825 (N. Puddu)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
In this paper we describe an extended version of the QALD dataset, a well-known benchmark resource used for the task of Question Answering over Knowledge Graphs (KGQA). Along the lines of similar projects, the purpose of this work is to make available a) high-quality data even for languages other than English, and b) multiple reformulations of the same question, in order to test systems' robustness. The QALD version we used is the one released for the 9th edition of the challenge on Question Answering over Linked Data, and the languages involved are English and Italian. Besides revised and improved Italian translations of the questions, the resource presented here features a number of alternative rewordings of both the English and the Italian questions. The usability of the resource has been tested on the QAnswer multilingual Question Answering system and through the GERBIL platform. The resource has been publicly released for research purposes.

Keywords: question answering, knowledge graphs, benchmark, paraphrases

1. Introduction

The task of Question Answering over Knowledge Graphs (KGQA) consists in mapping a natural language question into a formal query (most typically in SPARQL) over an underlying knowledge graph, such as Wikidata, DBpedia or Freebase. Regardless of the approach adopted (whether based on templates, semantic parsing, or end-to-end neural models), the availability of data for both training and testing systems is critical. A well-known benchmark dataset for this task is the one released in conjunction with the challenge series on Question Answering over Linked Data (QALD) [1], which recently reached its tenth edition.1 It consists of a set of questions, each accompanied by the corresponding SPARQL query, the answer (or list of answers), and information on the answer type. Except for the latest edition of the QALD challenge, where Wikidata was adopted as the underlying KG, questions and related queries in previous dataset versions targeted DBpedia. Besides being a widely used benchmarking dataset, the QALD dataset has also served multiple purposes, including the automatic extraction of query templates [2, 3], the development of micro-benchmarking platforms [4], and the creation of QALD-like resources based on its extension [5, 6].

1 https://www.nliwod.org/challenge
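For illustration, the following minimal Python sketch shows the rough shape of such an entry, serialized in the QALD-JSON layout also used by GERBIL. The field names reflect that layout, but the reduced set of fields and the specific values (question text, query, answer URI) are an assumed, illustrative example rather than an excerpt from the dataset.

import json

# Illustrative, simplified QALD-style entry (values are assumptions, not dataset content).
example_entry = {
    "id": "42",
    "answertype": "resource",
    "question": [
        {"language": "en", "string": "Where did Abraham Lincoln die?"},
        {"language": "it", "string": "Dove è morto Abraham Lincoln?"},
    ],
    # SPARQL query against DBpedia (prefix declarations omitted here for brevity).
    "query": {"sparql": "SELECT ?uri WHERE { dbr:Abraham_Lincoln dbo:deathPlace ?uri }"},
    # Answers follow the SPARQL results JSON structure.
    "answers": [{
        "head": {"vars": ["uri"]},
        "results": {"bindings": [
            {"uri": {"type": "uri", "value": "http://dbpedia.org/resource/Washington,_D.C."}}
        ]},
    }],
}

print(json.dumps({"questions": [example_entry]}, indent=2, ensure_ascii=False))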
Another aspect that has characterized the resource since QALD3 – and which still constitutes one of the main challenges in KGQA – is multilinguality: the same question is in fact expressed in multiple languages (up to 11 in QALD9), thus enabling the development of KGQA systems that also support languages other than English. However, the quality of the non-English translations in the dataset is sometimes poor, with sentences that are unnatural, if not entirely ungrammatical. A recent paper [5] addressed these issues by working specifically on improving the quality of the translated questions for a subset of QALD9 languages (French, German, and Russian) and by adding brand new translations into underrepresented languages, such as Armenian, Belarusian, Lithuanian, Bashkir, and Ukrainian.

The work introduced here partially follows a similar path, in that it aims to provide manually curated Italian translations of the QALD9 questions. Among the main goals of our work is also the development of a resource that can be used as a testbed for systems' robustness and flexibility in handling a wider range of language variations, possibly covering both standard and informal registers. Therefore, a further enhancement introduced with this work is the addition of question reformulations to the original QALD9 data. Besides the increased size of the final resource, we believe the main contributions of this work are 1) the improved quality of the Italian questions, and 2) the availability of alternative rewordings of the same questions, which could be useful as adversarial examples (similar in spirit to Ribeiro et al. [7]) or for the development of template-based KGQA systems.

The rest of the paper provides a brief description of the resulting dataset, which we call rewordQALD9, and a preliminary baseline evaluation obtained with QAnswer [8], a state-of-the-art multilingual KGQA system. The resource is freely available at the following link: https://github.com/msang/rewordQALD9

2. Dataset Development and Description

The two main tasks required for the development of rewordQALD9 were the revision of the existing translations and the reformulation of questions, the latter possibly including informal registers; for this purpose, eight undergraduate students majoring in translation studies were involved in the project. They were all Italian native speakers proficient in English. As for the second task in particular, specific guidelines were provided to the annotators, according to which question reformulations should consist of more or less pervasive changes in the sentence structure or lexicon; we recommended altering the question form as much as possible, as long as its meaning was preserved. Proposed changes included (but were not limited to) the use of different parts of speech, diathesis alternation, different verb tenses, and the use of synonyms. To further enhance the variety of language forms, paraphrases were also expected to include linguistic phenomena more typical of informal registers (e.g. dislocations, clefts, etc.) and features that mimic spoken language (as if the question were posed to a conversational agent or personal assistant). Once these tasks were completed, a pairwise review was carried out, so that each annotator had the opportunity to verify another annotator's work. Finally, a third annotator, not involved in the previous stages, performed a final check of the entire dataset, to ensure that both revisions and paraphrases were consistent with the task specifications.
In addition, duplicate or near-duplicate questions in the original dataset were removed (for example, the question "Where did Abraham Lincoln die" appeared in both the training and test set). Annotators provided an average of two paraphrases per question (ranging from one to three), with a higher proportion of paraphrases in the Italian section. The resulting dataset thus consists of 995 additional questions for English and 1,156 for Italian, for a total of around 1.5K and 1.7K questions for English and Italian, respectively, as summarized in Table 1.2 To ensure its usability and compatibility with well-known benchmarking platforms such as GERBIL [9], the dataset has been encoded in the QALD-JSON format.3

Table 1
Overview of the final dataset.

  Lang.   Split   Orig. questions   Paraphrases   Tot.   Tot. per lang.
  En      Train   405               729           1134   1546
          Test    146               266           412
  It      Train   405               843           1248   1707
          Test    146               313           459

Before testing the resource on available KGQA systems, we first carried out a preliminary qualitative analysis, with the aim of providing an overall picture of the data at hand by means of simple descriptive measures. We considered five aspects in particular, all reflecting the potentially challenging nature of the additional questions for the QA task: they range from average question length and lexical diversity to syntactic and semantic similarity. The average question length was measured both as the average character count and as the average number of tokens per question, while lexical diversity is expressed through the type/token ratio (i.e. the number of unique tokens over the total number of tokens in a question). All three were computed using the LinguaF library,4 as in Perevalov et al. [5]. Syntactic and semantic similarity were taken into account to verify whether annotators effectively reworded the original questions using structurally different forms while still conveying the same content. Therefore, comparisons were made between each pair of original and paraphrased questions. To compute syntactic similarity, all questions were first parsed using the UDPipe REST service5 [10], and the pairs of syntactic trees were then compared using the block.eval.f1 module of UDAPI6 [11]. The resulting F1 score was used as a proxy measure of the structural similarity between the tree pairs. Finally, semantic similarity was computed using a multilingual pre-trained language model based on Sentence-BERT [12]7 with cosine similarity as the metric. All the results obtained with these measures were averaged over the whole dataset, as shown in Table 2.

2 In terms of time, it took around 25h for each annotator to process an average of 69 English-Italian question pairs. Also, an intermediate and a final meeting were scheduled to discuss particularly difficult or possibly ambiguous questions.
3 https://github.com/dice-group/gerbil/wiki/Question-Answering#web-service-interface
4 https://github.com/Perevalov/LinguaF
5 https://lindat.mff.cuni.cz/services/udpipe/
6 https://udapi.readthedocs.io/en/latest/udapi.block.eval.html#module-udapi.block.eval.f1
7 Hosted on the HuggingFace Model Hub: https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
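To make the last two measures concrete, the short Python sketch below reproduces the semantic-similarity computation with the sentence-transformers library and the multilingual MiniLM model cited in footnote 7, together with a per-question type/token ratio. It is a minimal reconstruction under the definitions stated above, not the exact script used for the analysis, and the question pair at the bottom is only a hypothetical example.

from sentence_transformers import SentenceTransformer, util

# Multilingual Sentence-BERT model (see footnote 7).
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

def semantic_similarity(original: str, paraphrase: str) -> float:
    # Cosine similarity between the sentence embeddings of the two questions.
    embeddings = model.encode([original, paraphrase], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()

def type_token_ratio(question: str) -> float:
    # Lexical diversity: unique tokens over total tokens (simple whitespace tokenization).
    tokens = question.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

original = "Where did Abraham Lincoln die?"
paraphrase = "In which place did Abraham Lincoln pass away?"
print(round(semantic_similarity(original, paraphrase), 2), round(type_token_ratio(paraphrase), 2))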
Table 2
Dataset basic statistics.

                                         EN       IT
  Avg. question length (tok./sent.)      7.69     7.96
  Avg. question length (char./sent.)     35.84    39.38
  Lexical diversity (type/tok.)          0.18     0.19
  Syntactic similarity (F1)              41.32%   51.21%
  Semantic similarity (cos(θ))           0.91     0.92

Table 3
Baseline evaluation on QAnswer (metric: F1-QALD).

                  EN        IT
  QALD9           0.3235    0.2826
  rewordQALD9     0.3186    0.2603

The figures reported in Table 2 show in particular that lexical diversity is quite low, meaning that annotators tended to use the same terms as those in the original questions, while resorting to alternative structural forms that overall did not alter the meaning of the original question. While higher lexical variability would have been preferable, the latter two measures show the suitability of the resource for testing systems' robustness to alternative reformulations of the same question.

3. Baseline Evaluation and Future Work

To test the usability of the resource, as well as the possible impact of the reworded questions on the QA task, we compared the results of a multilingual KGQA system on the original QALD9 questions and on the rewordQALD9 dataset introduced here. For our experiments, we selected QAnswer [8] as it is, to the best of our knowledge, the only available system that supports both English and Italian (among other languages). The system's performance was evaluated on GERBIL8 according to the F1-QALD metric, as computed by the platform. Results are reported in Table 3. As expected, performance consistently decreases on rewordQALD9, mainly due to the increased difficulty the system faces in handling reformulations: as a matter of fact, a separate evaluation on the paraphrases alone yielded an F1-QALD of 0.3113 for English and 0.244 for Italian, thus confirming the more challenging nature of the reworded questions for the QA system.

The baseline evaluation just described, although not expanded here due to space constraints, paves the way for further investigations assessing the usefulness of the resource for micro-benchmarking purposes (along the lines of Singh et al. [4]), e.g. measuring the impact of given reformulations on the accuracy of the system's output. As future work, we intend to provide a more systematic account of such impact, both by comparing multiple KGQA systems (whenever at least one of the two dataset languages is supported) and by verifying possible correlations between the systems' output and the descriptive measures defined in Section 2. Furthermore, although conducted independently, this work has several aspects in common with the work proposed in Perevalov et al. [5], especially regarding motivations and approach. An eventual integration of the two projects is certainly a desirable outcome, with a view to further extending and improving the original QALD benchmark in terms of both multilinguality and interoperability among resources.

8 www.gerbil-qa.aksw.org/gerbil/config
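As a rough, reproducible reference for this kind of score, the sketch below computes a macro-averaged F1 over gold and predicted answer sets. It is only a simplified approximation of the QALD-style evaluation: the scores in Table 3 are those computed by the GERBIL platform, whose official implementation also handles corner cases such as empty answer sets according to the challenge rules, and the two toy questions at the bottom are hypothetical.

from typing import Dict, Set

def f1_per_question(gold: Set[str], predicted: Set[str]) -> float:
    # Precision/recall/F1 over the answer sets of a single question.
    if not gold and not predicted:
        return 1.0  # assumption: two empty answer sets count as a perfect match
    if not gold or not predicted:
        return 0.0
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

def macro_f1(gold_answers: Dict[str, Set[str]], system_answers: Dict[str, Set[str]]) -> float:
    # Macro-average of the per-question F1 over all questions in the gold set.
    scores = [f1_per_question(gold, system_answers.get(qid, set()))
              for qid, gold in gold_answers.items()]
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical toy example: one question answered exactly, one partially.
gold = {"q1": {"http://dbpedia.org/resource/Washington,_D.C."}, "q2": {"A", "B", "C"}}
pred = {"q1": {"http://dbpedia.org/resource/Washington,_D.C."}, "q2": {"A", "D"}}
print(round(macro_f1(gold, pred), 4))  # 0.7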
Acknowledgments

We warmly thank the students who participated in this work: V. Carta, F. Carrucciu, M. Desogus, S. Enis, G. Fanari, E. Massa, M. Montis, and C. Sannia. We also thank Aleksandr Perevalov for his help in using QAnswer. The work of M. Sanguinetti and M. Atzori is partially funded by the PRIN 2017 (2019-2022) project HOPE - High quality Open data Publishing and Enrichment and by the Fondazione di Sardegna project ASTRID - Advanced learning STRategies for high-dimensional and Imbalanced Data (CUP F75F21001220007).

References

[1] R. Usbeck, R. Hari Gusmita, M. Saleem, A.-C. Ngonga Ngomo, 9th Challenge on Question Answering over Linked Data (QALD-9), in: Joint Proceedings of SemDeep4 and NLIWoD4, CEUR-WS.org, 2018.
[2] A.-K. Hartmann, E. Marx, T. Soru, Generating a large dataset for neural question answering over the DBpedia knowledge base, in: Workshop on Linked Data Management, 2018.
[3] D. Vollmers, R. Jalota, D. Moussallem, H. Topiwala, A. N. Ngomo, R. Usbeck, Knowledge graph question answering using graph-pattern isomorphism, CoRR (2021). URL: https://arxiv.org/abs/2103.06752.
[4] K. Singh, M. Saleem, A. Nadgeri, F. Conrads, J. Z. Pan, A.-C. N. Ngomo, J. Lehmann, QaldGen: Towards microbenchmarking of question answering systems over knowledge graphs, in: C. Ghidini et al. (Eds.), The Semantic Web – ISWC 2019, Lecture Notes in Computer Science, Springer International Publishing, 2019, pp. 277–292.
[5] A. Perevalov, D. Diefenbach, R. Usbeck, A. Both, QALD-9-plus: A multilingual dataset for question answering over DBpedia and Wikidata translated by native speakers, in: Proceedings of ICSC, 2022, pp. 229–234.
[6] L. Siciliani, P. Basile, P. Lops, G. Semeraro, MQALD: Evaluating the impact of modifiers in Question Answering over Knowledge Graphs, Semantic Web 1 (2021) 1–14.
[7] M. T. Ribeiro, S. Singh, C. Guestrin, Semantically equivalent adversarial rules for debugging NLP models, in: Proceedings of ACL, Association for Computational Linguistics, 2018, pp. 856–865.
[8] D. Diefenbach, A. Both, K. Singh, P. Maret, Towards a question answering system over the semantic web, Semantic Web 11 (2020) 421–439.
[9] R. Usbeck, M. Röder, M. Hoffmann, F. Conrads, J. Huthmann, A. Ngonga Ngomo, C. Demmler, C. Unger, Benchmarking question answering systems, Semantic Web 10 (2019) 293–304.
[10] M. Straka, UDPipe 2.0 prototype at CoNLL 2018 UD shared task, in: Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, 2018, pp. 197–207.
[11] M. Popel, Z. Žabokrtský, M. Vojtek, Udapi: Universal API for Universal Dependencies, in: M. de Marneffe, J. Nivre, S. Schuster (Eds.), Proceedings of the NoDaLiDa Workshop on Universal Dependencies, Association for Computational Linguistics, 2017, pp. 96–101.
[12] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, in: Proceedings of EMNLP, Association for Computational Linguistics, 2019.