<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>rewordQALD9: A Bilingual Benchmark with Alternative Rewordings of QALD Questions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Manuela Sanguinetti</string-name>
          <email>manuela.sanguinetti@unica.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maurizio Atzori</string-name>
          <email>atzori@unica.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicoletta Puddu</string-name>
          <email>n.puddu@unica.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Mathematics and Computer Science, University of Cagliari</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dipartimento di Lettere, Lingue e Beni Culturali, University of Cagliari</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <abstract>
        <p>In this paper we describe an extended version of the QALD dataset, a well-known benchmark resource used for the task of Question Answering over knowledge graphs (KGQA). Along the lines of similar projects, the purpose of this work is to make available a) high-quality data even for languages other than English, and b) multiple reformulations of the same question, to test systems' robustness. The QALD version we used is the one released for the 9th edition of the challenge of Question Answering over Linked Data, and the languages involved are English and Italian. Besides a revised and improved quality of Italian translations of questions, the resource presented here features a number of alternative rewordings of both English and Italian questions. The usability of the resource has been tested on the QAnswer multilingual Question Answering system and through the GERBIL platform. The resource has been publicly released for research purposes.</p>
      </abstract>
      <kwd-group>
        <kwd>question answering</kwd>
        <kwd>knowledge graphs</kwd>
        <kwd>benchmark</kwd>
        <kwd>paraphrases</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The task of Question Answering over Knowledge Graphs (KGQA) consists in mapping
a natural language question into a formal query - most typically in SPARQL - against an underlying
knowledge graph, such as Wikidata, DBpedia or Freebase. Regardless of the approach adopted
(whether based on templates, semantic parsing, or end-to-end neural approaches), the availability
of data for both training and testing systems is critical.</p>
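      <p>As a minimal illustration of this mapping, consider the following question-query pair. The question does appear in QALD9 (it is mentioned again in Section 2); the SPARQL query shown is a plausible formulation over DBpedia for illustration, not necessarily the dataset's gold query:

```python
# A sketch of the KGQA task input/output: a natural language question
# paired with a SPARQL query that answers it over DBpedia.
# The query is an illustrative formulation, not the QALD9 gold query.
question = "Where did Abraham Lincoln die?"

sparql = """
SELECT DISTINCT ?place WHERE {
  <http://dbpedia.org/resource/Abraham_Lincoln>
      <http://dbpedia.org/ontology/deathPlace> ?place .
}
"""

# A KGQA system receives only the question string and must produce
# the query (and hence the answer set).
print(question)
print(sparql.strip())
```

A template-based system would instantiate a query skeleton like the one above from the entities and relations detected in the question; end-to-end neural approaches instead learn the mapping directly.</p>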
      <p>
        A well-known benchmark dataset for this task is the one released in conjunction with the
challenge series on Question Answering over Linked Data (QALD) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], just recently in its tenth
edition.1 It consists of a set of questions accompanied by the corresponding SPARQL query,
along with the answer (or list of answers) and the information on the answer type. Except for
the latest edition of the QALD challenge, where Wikidata has been adopted as underlying KG,
questions and related queries in the previous dataset versions used to target DBpedia. Besides
representing a widely used dataset for benchmarking, the QALD dataset has also served multiple
purposes, among which the automatic extraction of query templates [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ], the development
of micro-benchmarking platforms [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] or the creation of QALD-like resources based on its
extension [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. Another aspect that characterizes the resource since QALD3 – and which
still constitutes one of the main challenges in KGQA – is multilinguality: the same question is
in fact expressed in multiple languages (up to 11 in QALD9), thus enabling the development
of KGQA systems that also support languages other than English. However, the quality of
non-English translations in the dataset is sometimes poor, with sentences that are unnatural, if
not entirely ungrammatical. A recent paper addressed these issues [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] by working specifically
on improving the quality of translated questions for a sub-set of QALD9 languages (French,
German, and Russian) and adding brand new translations into underrepresented languages,
such as Armenian, Belarusian, Lithuanian, Bashkir, and Ukrainian.
      </p>
      <p>
        The work introduced here partially follows a similar path, in that it aims to provide
manually curated translations of Italian questions from the QALD9 dataset. Among the main goals of our
work is also the development of a resource that could be used as a testbed for systems’ robustness
and flexibility in handling a wider range of language variations, possibly including both standard
and informal register. Therefore, a further enhancement of the resource introduced with this
work is the addition of question reformulations to the original QALD9 data. Besides the increased
size of the final resource, we believe the main contributions of this work are 1) the improved
quality of questions for Italian, 2) the availability of alternative rewordings of the same questions,
which could be useful as adversarial examples (similar in spirit to Ribeiro et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]) or for the
development of template-based KGQA systems.
      </p>
      <p>
        The rest of the paper provides a brief description of the resulting dataset, which we called
rewordQALD9, and a preliminary baseline evaluation obtained with QAnswer [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], a state-of-the-art
multilingual KGQA system. The resource is freely available at the following link:
https://github.com/msang/rewordQALD9
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Dataset Development and Description</title>
      <p>The two main tasks required for the development of rewordQALD9 were translation revision
and question reformulation, the latter possibly including informal registers; for this purpose, eight
undergraduate students majoring in translation studies were involved in the project. They were
all Italian native speakers proficient in English. As for the second task in particular, specific
guidelines were provided to the annotators, according to which question reformulations should
consist of more or less pervasive changes in the sentence structure or lexicon; we recommended
altering the question form as much as possible, as long as its meaning is preserved. Proposed
changes included (but were not limited to) the use of different parts of speech, diathesis alternation,
different verb tenses, and the use of synonyms. To further enhance variety in language forms,
paraphrases were expected to include linguistic phenomena more typical of informal
registers (e.g. dislocations, clefts, etc.) and features that mimic spoken language (as if the
question were posed to a conversational agent/personal assistant).</p>
      <p>Once these tasks were completed, a pairwise review was carried out, so that each annotator
had the opportunity to verify another annotator’s work. Finally, a third annotator, not involved
in the previous stages, performed a final check over the entire dataset, to ensure that both
revisions and paraphrases were consistent with the task specifications. In addition, duplicate or
near-duplicate questions in the original dataset were removed (for example, the question "Where
did Abraham Lincoln die" appeared both in the training and the test set). Annotators provided
an average of two paraphrases per question (ranging from one to three), with
a higher proportion of paraphrases in the Italian section. The resulting dataset thus consists of
995 additional questions for English and 1,156 for Italian, and a total amount of around 1.5K
and 1.7K questions for English and Italian, respectively, as also summarized in Table 1.2</p>
      <p>
        To ensure its usability and compatibility with well-known benchmarking platforms such as
GERBIL [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], the dataset has been encoded in the QALD-JSON format.3
      </p>
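      <p>For reference, a single entry in this format looks roughly as follows (the field names follow the GERBIL web-service documentation linked in footnote 3; the concrete values here are invented for illustration):

```python
import json

# Sketch of one entry in the QALD-JSON format: parallel question
# strings per language, the gold SPARQL query, and the answer set.
# All values below are invented for illustration.
entry = {
    "id": "1",
    "question": [
        {"language": "en", "string": "Where did Abraham Lincoln die?"},
        {"language": "it", "string": "Dove è morto Abraham Lincoln?"},
    ],
    "query": {
        "sparql": "SELECT DISTINCT ?p WHERE { "
                  "<http://dbpedia.org/resource/Abraham_Lincoln> "
                  "<http://dbpedia.org/ontology/deathPlace> ?p }"
    },
    "answers": [],
}
dataset = {"questions": [entry]}

# Serialize the dataset as benchmarking platforms expect it.
print(json.dumps(dataset, ensure_ascii=False, indent=2))
```
</p>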
      <p>
        Before testing the resource on available KGQA systems, we first carried out some preliminary
qualitative analysis with the aim to provide an overall picture of the data at hand by means
of simple descriptive measures. We considered in particular five aspects for the analysis, all
reflecting the possible challenging nature of the additional questions to the QA task: they range
from average question length and lexical diversity to syntactic and semantic similarity.
The average question length was measured both in terms of average character count and average
number of tokens per question, while lexical diversity is expressed through the type/token ratio
(i.e. the number of unique tokens over the total number of tokens in a question). All three
were computed using the LinguaF library,4 as in Perevalov et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Syntactic and semantic
similarities were taken into account in this analysis to verify whether annotators effectively
reworded original questions using structurally different forms, while still conveying the same
content. Therefore, comparisons were made between each pair of original and paraphrased
questions. To compute syntactic similarity, all questions were first parsed using the UDPipe
REST service,5 [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and the pairs of syntactic trees were then compared using the block.eval.f1
module of UDAPI6 [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The resulting F1 score has been used as a proxy measure to determine
the structural similarity between the tree pairs. Finally, semantic similarity was computed using
a multilingual pre-trained language model based on Sentence-BERT [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]7 and cosine similarity
as metric. All the results obtained with these measures were averaged over the whole dataset,
2In terms of time, it took around 25h for each annotator to process an average of 69 English-Italian question pairs.
Also, an intermediate and a final meeting were scheduled to discuss particularly difficult or possibly ambiguous
questions.
3https://github.com/dice-group/gerbil/wiki/Question-Answering#web-service-interface
4https://github.com/Perevalov/LinguaF
5https://lindat.mff.cuni.cz/services/udpipe/
6https://udapi.readthedocs.io/en/latest/udapi.block.eval.html#module-udapi.block.eval.f1
7Hosted on the HuggingFace Model Hub: https://huggingface.co/sentence-transformers/
paraphrase-multilingual-MiniLM-L12-v2
as shown in Table 2. The values reported in the table show in particular that lexical diversity is
quite low, meaning that annotators tended to reuse the terms of the original questions while
resorting to alternative structural forms that, overall, did not alter the meaning of
the original question. While a higher lexical variability would have been preferable, the latter
two measures show the suitability of the resource for testing systems' robustness to alternative
reformulations of the same question.
      </p>
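      <p>The lexical-diversity measure above can be sketched as follows (a simplified re-implementation using whitespace tokenization; the paper computes it with the LinguaF library, whose tokenization may differ):

```python
def type_token_ratio(question: str) -> float:
    # Lexical diversity: number of unique tokens over the total
    # number of tokens. Whitespace tokenization is a simplification.
    tokens = question.lower().split()
    return len(set(tokens)) / len(tokens)

# Hypothetical original/reworded pair for illustration.
original = "Where did Abraham Lincoln die?"
reworded = "In which place did Abraham Lincoln pass away?"

print(type_token_ratio(original))  # 1.0: every token occurs once
print(type_token_ratio(reworded))
```

Averaging this ratio over all questions yields the lexical-diversity figure of the kind reported in Table 2.</p>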
    </sec>
    <sec id="sec-3">
      <title>3. Baseline Evaluation and Future Work</title>
      <p>
        To test the resource's usability, as well as the possible impact of the reworded questions on the QA
task, we compared the results of a multilingual KGQA system on the original QALD9 questions
and on the rewordQALD9 introduced here. For our experiments, we selected QAnswer [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] as it
is, to the best of our knowledge, the only available system that supports both English and Italian
(among other languages). The system's performance was evaluated on GERBIL8 according to
the F1-QALD metric, as computed in the platform. Results are reported in Table 3.
      </p>
      <p>As expected, performance consistently decreases on rewordQALD9, mainly due to the increased
difficulty the system faces in handling reformulations: indeed, a separate
evaluation restricted to the paraphrases reported an F1-QALD of 0.3113 for English and
0.244 for Italian, confirming the more challenging nature of the reworded questions for the QA
system.</p>
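      <p>The set-based F1 underlying this evaluation can be sketched as follows (a simplified version: GERBIL's F1-QALD additionally applies special rules for questions with empty gold or system answer sets, which are omitted here):

```python
def f1_per_question(gold: set, system: set) -> float:
    # Set-based F1 between the gold answer set and the system's answers.
    if not gold or not system:
        return 0.0
    correct = len(gold & system)
    if correct == 0:
        return 0.0
    precision = correct / len(system)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)

def macro_f1(pairs):
    # Macro average over questions, as in QALD-style evaluation.
    scores = [f1_per_question(gold, system) for gold, system in pairs]
    return sum(scores) / len(scores)

# Two toy questions: one answered exactly, one answered wrongly.
pairs = [
    ({"dbr:Washington,_D.C."}, {"dbr:Washington,_D.C."}),
    ({"dbr:Rome"}, {"dbr:Paris"}),
]
print(macro_f1(pairs))  # 0.5
```
</p>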
      <p>
        The baseline evaluation just described, although not expanded here due to space
constraints, paves the way for further investigations that assess the resource's usefulness for
micro-benchmarking purposes (along the lines of Singh et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]), e.g. measuring the impact of given
reformulations on the system’s output accuracy. As future work, we intend to provide a more
systematic account of such impact, both comparing multiple KGQA systems (whenever at least
one of the two dataset languages is supported) and verifying possible correlations between
the systems’ output and the descriptive measures defined in Section 2. Furthermore, although
conducted independently, this work has several aspects in common with what is proposed in
Perevalov et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], especially regarding motivations and approach. The eventual integration of
the two projects would certainly be a welcome outcome, with a view to further extending and
improving the original QALD benchmark in terms of both multilinguality and interoperability
among resources.
8www.gerbil-qa.aksw.org/gerbil/config
      </p>
      <p>
We warmly thank the students who participated in this work: V. Carta, F. Carrucciu, M. Desogus,
S. Enis, G. Fanari, E. Massa, M. Montis, and C. Sannia. We also thank Aleksandr Perevalov for
his help in using QAnswer. The work of M. Sanguinetti and M. Atzori is partially funded by the
PRIN 2017 (2019-2022) project HOPE - High quality Open data Publishing and Enrichment and
the Fondazione di Sardegna project ASTRID - Advanced learning STRategies for high-dimensional
and Imbalanced Data (CUP F75F21001220007).
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Usbeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. Hari</given-names>
            <surname>Gusmita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Saleem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-C.</given-names>
            <surname>Ngonga Ngomo</surname>
          </string-name>
          ,
          <article-title>9th Challenge on Question Answering over Linked Data (QALD-9)</article-title>
          ,
          <source>in: Joint proceedings of SemDeep-4 and NLIWoD4, CEUR-WS.org</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.-K.</given-names>
            <surname>Hartmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Marx</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Soru</surname>
          </string-name>
          ,
          <article-title>Generating a large dataset for neural question answering over the DBpedia knowledge base</article-title>
          ,
          <source>in: Workshop on Linked Data Management</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Vollmers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jalota</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Moussallem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Topiwala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Usbeck</surname>
          </string-name>
          ,
          <article-title>Knowledge graph question answering using graph-pattern isomorphism</article-title>
          ,
          <source>CoRR</source>
          (
          <year>2021</year>
          ). URL: https://arxiv.org/abs/2103.06752.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>K.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Saleem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nadgeri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Conrads</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Z.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-C. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <article-title>QaldGen: Towards microbenchmarking of question answering systems over knowledge graphs</article-title>
          , in: C. e. a. Ghidini (Ed.),
          <source>The Semantic Web - ISWC 2019. Lecture Notes in Computer Science</source>
          , Springer International Publishing,
          <year>2019</year>
          , pp.
          <fpage>277</fpage>
          -
          <lpage>292</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Perevalov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Diefenbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Usbeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Both</surname>
          </string-name>
          ,
          <article-title>QALD-9-plus: A multilingual dataset for question answering over DBpedia and Wikidata translated by native speakers</article-title>
          ,
          <source>in: Proceedings of ICSC</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>229</fpage>
          -
          <lpage>234</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Siciliani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lops</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Semeraro</surname>
          </string-name>
          ,
          <article-title>MQALD: Evaluating the impact of modifiers in Question Answering over Knowledge Graphs</article-title>
          ,
          <source>Semantic Web</source>
          <volume>1</volume>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>14</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Guestrin</surname>
          </string-name>
          ,
          <article-title>Semantically equivalent adversarial rules for debugging NLP models</article-title>
          ,
          <source>in: Proceedings of ACL, Association for Computational Linguistics</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>856</fpage>
          -
          <lpage>865</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Diefenbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Both</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Maret</surname>
          </string-name>
          ,
          <article-title>Towards a question answering system over the semantic web</article-title>
          ,
          <source>Semantic Web</source>
          <volume>11</volume>
          (
          <year>2020</year>
          )
          <fpage>421</fpage>
          -
          <lpage>439</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Usbeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Röder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hofmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Conrads</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Huthmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Ngonga</given-names>
            <surname>Ngomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Demmler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Unger</surname>
          </string-name>
          ,
          <article-title>Benchmarking question answering systems</article-title>
          ,
          <source>Semantic Web</source>
          <volume>10</volume>
          (
          <year>2019</year>
          )
          <fpage>293</fpage>
          -
          <lpage>304</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Straka</surname>
          </string-name>
          ,
          <article-title>UDPipe 2.0 prototype at CoNLL 2018 UD shared task</article-title>
          ,
          <source>in: Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw</source>
          Text to Universal Dependencies,
          <year>2018</year>
          , pp.
          <fpage>197</fpage>
          -
          <lpage>207</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Popel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zabokrtský</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vojtek</surname>
          </string-name>
          ,
          <article-title>Udapi: Universal API for Universal Dependencies</article-title>
          , in: M.-C. de Marneffe,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nivre</surname>
          </string-name>
          , S. Schuster (Eds.),
          <source>Proceedings of the NoDaLiDa Workshop on Universal Dependencies, Association for Computational Linguistics</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>96</fpage>
          -
          <lpage>101</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>Sentence-BERT: Sentence embeddings using siamese BERT-networks</article-title>
          ,
          <source>in: Proceedings of EMNLP, Association for Computational Linguistics</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>