<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>How to Revert Question Answering on Knowledge Graphs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gaurav Maheshwari</string-name>
          <email>gaurav.maheshwari@uni-bonn.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohnish Dubey</string-name>
          <email>dubey@cs.uni-bonn.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Priyansh Trivedi</string-name>
          <email>priyansh.trivedi@uni-bonn.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jens Lehmann</string-name>
          <email>jens.lehmann@cs.uni-bonn.de</email>
          <email>jens.lehmann@iais.fraunhofer.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fraunhofer IAIS</institution>
          ,
          <addr-line>Bonn</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Bonn</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <author-notes>
        <fn>
          <p>These three authors contributed equally.</p>
        </fn>
      </author-notes>
      <abstract>
        <p>A large-scale question answering dataset has the potential to enable the development of more robust and accurate question answering systems. In this direction, we introduce a framework for creating such datasets that decreases the manual intervention and domain expertise traditionally needed. We describe in detail the architecture of the framework and the design decisions we made while creating it.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Knowledge bases (KB), such as DBpedia [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], Freebase [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and Wikidata [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] contain
a large amount of interesting information. However, this information can often only
be queried by users familiar with query languages and the structure of the knowledge
base. To mitigate this problem, numerous question answering (QA) approaches for
knowledge graphs have been devised (see [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]). Many of them rely on machine learning
techniques, which require large amounts of labeled training data. We introduced
LC-QuAD [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] (Large-Scale Complex Question Answering Dataset), consisting of 5000
natural language questions (NLQ) along with the intended SPARQL queries required to
answer questions over DBpedia [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In this article, we describe the framework used for
creating LC-QuAD and how it can be applied to other data sources.
      </p>
      <p>
        Traditionally, question answering datasets have been created by manually converting a
set of questions to their logical forms (for instance, [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]). Although a dataset created
in this manner can contain varied questions, it requires substantial manual work from
domain experts, and hence is not scalable. Thus, there is a need for an alternative workflow with
(i) less human intervention and (ii) reduced domain expertise.
      </p>
      <p>
        In this direction, we present a framework for generating questions and their
corresponding logical forms. We created our framework by reverse engineering the
architecture of Semantic Parsing based Question Answering systems (SQA) such as [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
These systems convert an NLQ into a formal query language expression, whereas we start
with a formal query language expression and convert it into an NLQ. This reverse task is
easier because formal query languages have well defined semantics, and the entities and
predicates occurring in the query are explicitly mentioned. Moreover, the target language
(NL) is much more resilient to minute errors.
      </p>
      <p>
        We now describe how we reverse engineered the architecture of AskNow [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to create
a question generation framework. Figure 1 shows an illustration of our architecture and
that of AskNow. Throughout the description, we will use the question: "Name the capital
of Germany?" as our running example, to elaborate our dataset generation process in
contrast to the process of answering this question.
      </p>
      <p>
        AskNow [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] breaks down the process of answering questions into two parts:
conversion of questions into an expression of a semi-formal language, namely Normalized
Query Structure (NQS) and subsequently converting the NQS expression to that of a
formal language. NQS acts like a surface level syntactic template for queries, which
maps NLQs having different syntactic structures into a common structure. Given the
question, AskNow uses rules based on NL features to create a corresponding NQS
instance. Thereafter, it maps the entity (dbr:Germany) and the predicate (dbo:capital)
present in the question to resources in the target KB, i.e. DBpedia. After the mapping, the
NQS instance is converted into SPARQL using machine-generated and handmade rules.
Finally, the query is executed on the query endpoint of DBpedia to retrieve the required
answer: dbr:Berlin.
      </p>
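      <p>To make the pipeline concrete, the two stages can be sketched in a few lines of Python (a toy simplification of our own; AskNow's actual NQS rules are based on rich NL features and are far more general):</p>

```python
# Toy sketch of the two AskNow-style stages (a hypothetical simplification,
# not the system's actual code).

# Stage 1: map an NLQ onto a semi-formal, NQS-like structure.
def to_nqs(question):
    # A single naive rule: "Name the R of E?" yields {relation R, entity E}.
    words = question.rstrip("?").split()
    assert words[:2] == ["Name", "the"] and words[3] == "of"
    return {"relation": words[2], "entity": words[4]}

# Stage 2: ground the NQS slots to KB resources and emit SPARQL.
LEXICON = {"capital": "dbo:capital", "Germany": "dbr:Germany"}

def to_sparql(nqs):
    return "SELECT ?uri WHERE { %s %s ?uri . }" % (
        LEXICON[nqs["entity"]], LEXICON[nqs["relation"]])

print(to_sparql(to_nqs("Name the capital of Germany?")))
# → SELECT ?uri WHERE { dbr:Germany dbo:capital ?uri . }
```

      <p>Executing the emitted query against the DBpedia endpoint would then return the answer dbr:Berlin.</p>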
      <p>We begin our process where AskNow ends, i.e. by collecting the answers of the
questions we aim to generate. For our framework, a list of these answers acts as seed
entities. Let "dbr:Berlin" be one such entity. We then create SPARQL queries which
all have this entity as one of their answers. To do so, we generate a subgraph of
all the triples within two hops of these entities. The triple &lt;dbr:Germany dbo:capital
dbr:Berlin&gt; would be one such triple in the subgraph. Since the number of such triples
increases exponentially with the number of hops, we reduce the size of the subgraph by
stochastically removing predicates. We also remove every triple whose predicate does not occur
in our whitelist (https://figshare.com/articles/White_List_Relations/5008283). We then juxtapose
triples of this subgraph onto a list of SPARQL templates
(https://figshare.com/articles/Templates/5242027) to generate a set of valid SPARQL queries,
all of which have dbr:Berlin as one of their answers. For instance: SELECT ?uri WHERE {
dbr:Germany dbo:capital ?uri. } Note that while, in our example, the SPARQL query comprises
only one triple, we can generate queries having two or more triples in the same manner.</p>
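      <p>This generation step can be sketched over a toy in-memory triple set as follows (an illustrative sketch: the function names, the whitelist, and the single template below are our own assumptions, not the framework's actual code):</p>

```python
import random

# Toy KB as a list of (subject, predicate, object) triples.
KB = [
    ("dbr:Germany", "dbo:capital", "dbr:Berlin"),
    ("dbr:Berlin", "dbo:country", "dbr:Germany"),
    ("dbr:Germany", "dbo:wikiPageID", "11867"),  # a noisy predicate
]

WHITELIST = {"dbo:capital", "dbo:country"}  # predicates worth asking about

def subgraph_2hop(seed):
    """All triples within two hops of the seed entity."""
    one_hop = [t for t in KB if seed in (t[0], t[2])]
    frontier = {x for t in one_hop for x in (t[0], t[2])}
    return [t for t in KB if t[0] in frontier or t[2] in frontier]

def generate_queries(seed, keep_prob=1.0):
    kept = []
    for s, p, o in subgraph_2hop(seed):
        # Drop non-whitelisted predicates and, stochastically, some others.
        if p not in WHITELIST or random.random() > keep_prob:
            continue
        # One illustrative template: the seed entity is the query's answer.
        if o == seed:
            kept.append("SELECT ?uri WHERE { %s %s ?uri . }" % (s, p))
    return kept

print(generate_queries("dbr:Berlin"))
# → ['SELECT ?uri WHERE { dbr:Germany dbo:capital ?uri . }']
```

      <p>With keep_prob below 1.0, the stochastic pruning of the subgraph described above takes effect; the noisy dbo:wikiPageID triple is always discarded by the whitelist.</p>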
      <p>At this point, instead of converting the SPARQL queries to NLQs directly, we
employ a two-step process similar to AskNow's. This helps in further reducing the manual
intervention required for the conversion. We first transform each query into a
Normalized Natural language Question Template (NNQT) instance. Corresponding to every
SPARQL template, there exist NNQTs, which are questions with placeholders.
These placeholders can be replaced with the surface forms of the entities and predicates
present in the query, to form a coherent question. The NNQTs interpret the
semantics of their corresponding SPARQL template and verbalize it, thereby transforming our
aforementioned problem into that of converting a (grammatically) semi-correct NLQ into a
grammatically correct question. These NNQTs are KB-independent and can thus be
appropriated for any target KB. An NNQT instance for our running example would look
like: What is the ⟨capital⟩ of ⟨Germany⟩?</p>
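      <p>The slot-filling step from a SPARQL template to an NNQT instance can be sketched as follows (the label table and the angle-bracket convention are our own illustrative assumptions):</p>

```python
# Surface-form labels for KB resources (an illustrative lookup table;
# in practice these come from the KB's rdfs:label values).
LABELS = {"dbo:capital": "capital", "dbr:Germany": "Germany"}

# One NNQT paired with a SPARQL template; slots are kept in angle
# brackets so the human corrector can see which tokens came from the KB.
NNQT = "What is the ⟨predicate⟩ of ⟨entity⟩?"

def instantiate(nnqt, entity, predicate):
    """Fill the NNQT slots with the resources' surface forms."""
    out = nnqt.replace("⟨predicate⟩", "⟨%s⟩" % LABELS[predicate])
    return out.replace("⟨entity⟩", "⟨%s⟩" % LABELS[entity])

print(instantiate(NNQT, "dbr:Germany", "dbo:capital"))
# → What is the ⟨capital⟩ of ⟨Germany⟩?
```
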
      <p>As is evident, this NNQT instance can easily be converted into a grammatically correct
NLQ by native English speakers, without understanding the SPARQL syntax or having
any knowledge of the underlying KB schema. This allows our process to scale with
minimal effort. The person correcting these questions is also expected to paraphrase
them, in order to increase the diversity of our dataset. The resultant NLQ of our
NNQT instance would at this point be: "Name the capital of germany ?" Finally, an
independent reviewer revises every corrected question, and is expected to make minute
tweaks and edits in case of errors. The reviewer is not expected to paraphrase the queries,
thereby significantly reducing the time required for this step. The final output of our
process would be: "Name the capital of Germany?"</p>
      <p>
        Throughout the process, we use numerous techniques to increase the diversity and
complexity of the generated questions. Some of these techniques are: (i) replacing the
entity surface forms within the NNQT instances with their synonyms, using WordNet [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
(ii) similarly, using Wikidata [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] for predicate surface forms, (iii) encouraging the
question corrector to paraphrase the questions while keeping their semantics intact, and (iv)
declaring multiple NNQTs for every SPARQL template and stochastically selecting one
of them. Owing to the modularity of our framework, more such techniques can be added in
the future.
      </p>
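      <p>Technique (iv), for instance, amounts to a uniform random draw over the declared NNQTs; a minimal sketch (the paraphrase list below is hypothetical):</p>

```python
import random

# Several NNQT paraphrases for the same SPARQL template (hypothetical).
NNQT_VARIANTS = [
    "What is the ⟨predicate⟩ of ⟨entity⟩?",
    "Name the ⟨predicate⟩ of ⟨entity⟩.",
    "Tell me the ⟨predicate⟩ of ⟨entity⟩.",
]

def pick_nnqt(variants, rng=random):
    # A uniform stochastic choice adds surface-level diversity
    # to the generated questions.
    return rng.choice(variants)

rng = random.Random(0)  # seeded for reproducibility
print(pick_nnqt(NNQT_VARIANTS, rng) in NNQT_VARIANTS)
# → True
```
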
    </sec>
    <sec id="sec-2">
      <title>Conclusions</title>
      <p>In this article, we described a framework for generating QA datasets containing questions
and their equivalent logical forms. The general design of this framework follows the
reverse process of SQAs, and in effect reduces the effort involved in creating questions.</p>
      <sec id="sec-2-1">
        <p>
          We have successfully used our framework to create a dataset, LC-QuAD [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], containing 5000
questions and their corresponding SPARQL queries. Our framework is available as an open-source
repository (https://github.com/AskNowQA/LC-QuAD), under the GPL 3.0 License
(https://www.gnu.org/licenses/gpl.html).
        </p>
        <p>In the future, we aim to explore more techniques to increase the diversity and
complexity of the generated questions. We will also explore machine-translation-based
techniques to further reduce the need for manually correcting the questions.</p>
        <p>Acknowledgements This work was partly supported by grants from the European
Union’s Horizon 2020 research and innovation programme for the projects
Big Data Europe (GA no. 644564), HOBBIT (GA no. 688227) and WDAqua (GA
no. 642795).</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>K.</given-names>
            <surname>Bollacker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Paritosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sturge</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Taylor</surname>
          </string-name>
          . Freebase:
          <article-title>a collaboratively created graph database for structuring human knowledge</article-title>
          .
          <source>In Proceedings of the 2008 ACM SIGMOD Conference on Management of Data</source>
          , pages
          <fpage>1247</fpage>
          -
          <lpage>1250</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Q.</given-names>
            <surname>Cai</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Yates</surname>
          </string-name>
          .
          <article-title>Large-scale semantic parsing via schema matching and lexicon extension</article-title>
          .
          <source>In ACL</source>
          , pages
          <fpage>423</fpage>
          -
          <lpage>433</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>M.</given-names>
            <surname>Dubey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dasgupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Höffner</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          .
          <article-title>Asknow: A framework for natural language query formalization in sparql</article-title>
          .
          <source>In International Semantic Web Conference</source>
          , pages
          <fpage>300</fpage>
          -
          <lpage>316</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>K.</given-names>
            <surname>Höffner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Walter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Marx</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Usbeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.-C.</given-names>
            <surname>Ngonga Ngomo</surname>
          </string-name>
          .
          <article-title>Survey on challenges of question answering in the semantic web</article-title>
          .
          <source>The Semantic Web</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>26</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Isele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jakob</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jentzsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kontokostas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. N.</given-names>
            <surname>Mendes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hellmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Morsey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Van Kleef</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>DBpedia-a large-scale, multilingual knowledge base extracted from wikipedia</article-title>
          .
          <source>The Semantic Web</source>
          , pages
          <fpage>167</fpage>
          -
          <lpage>195</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>G. A.</given-names>
            <surname>Miller</surname>
          </string-name>
          .
          <article-title>Wordnet: a lexical database for english</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <volume>38</volume>
          (
          <issue>11</issue>
          ):
          <fpage>39</fpage>
          -
          <lpage>41</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>P.</given-names>
            <surname>Trivedi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Maheshwari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dubey</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          .
          <article-title>Lc-quad: A corpus for complex question answering over knowledge graphs</article-title>
          .
          <source>In International Semantic Web Conference</source>
          . Springer,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>C.</given-names>
            <surname>Unger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bühmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-C. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cimiano</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Gerber</surname>
          </string-name>
          .
          <article-title>Template-based question answering over rdf data</article-title>
          .
          <source>In Proceedings of the 21st International World Wide Web Conference</source>
          , pages
          <fpage>639</fpage>
          -
          <lpage>648</lpage>
          . ACM,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>C.</given-names>
            <surname>Unger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-C. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Cabrio</surname>
          </string-name>
          .
          <article-title>6th open challenge on question answering over linked data (qald-6)</article-title>
          .
          <source>In Semantic Web Evaluation Challenge</source>
          , pages
          <fpage>171</fpage>
          -
          <lpage>177</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>D.</given-names>
            <surname>Vrandečić</surname>
          </string-name>
          . Wikidata:
          <article-title>A new platform for collaborative data collection</article-title>
          .
          <source>In Proceedings of the 21st International World Wide Web Conference</source>
          , pages
          <fpage>1063</fpage>
          -
          <lpage>1064</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>