<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Question Answering Over Linked Data: What is Difficult to Answer? What Affects the F scores?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Muhammad Saleem</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Samaneh Nazari Dastjerdi</string-name>
          <email>Samaneh.nazari-dastjerdi@tu-ilmenau.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ricardo Usbeck</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Axel-Cyrille Ngonga Ngomo</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>TU Ilmenau</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University Leipzig</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University Paderborn</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present a fine-grained analysis of the Question Answering over Linked Data (QALD-6) challenge. We divide the QALD-6 questions into 8 main categories and compare state-of-the-art question answering (QA) systems over Linked Data across the individual categories. We show the difficulty (in terms of the overall F scores of the QA systems) of each category. We also show the effect of various natural language and SPARQL features, such as the number of triple patterns, the number of keywords, the answer size, the type of answers, aggregate functions, and the SPARQL query forms, on the overall F scores of the QA systems.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
The SPARQL query language is a W3C standard4 for retrieving data from the
Linked Open Data (LOD) cloud5. However, learning SPARQL can be tricky for
naive users. In particular, formulating meaningful queries to retrieve the desired
data is not a trivial task for common users. To this end, the significant growth
of the LOD datasets has motivated a considerable amount of work on question
answering over Linked Data [
        <xref ref-type="bibr" rid="ref1 ref10 ref2 ref3 ref4 ref5 ref9">1,2,3,4,5,9,10</xref>
        ]. Consequently, this has motivated a
good number of benchmarks (see, e.g., QALD [
        <xref ref-type="bibr" rid="ref6 ref7 ref8">8,7,6</xref>
        ]) to test QA systems. QALD
is a series of question answering challenges over Linked Data. The questions are
collected from various sources such as real-life log files, users, or experts. The
questions are based on different versions of the DBpedia dataset and are provided
in many languages such as English, Hindi, German, and French. For each natural
language question, QALD provides the corresponding SPARQL query, the exact
answers, and the list of keywords extracted from the question. The overall goal is
to compare QA systems over Linked Data with respect to different key performance
indicators such as precision, accuracy, and F scores. Such benchmarks provide
the possibility to analyze the strengths and weaknesses of many QA systems
objectively.
4 https://www.w3.org/TR/rdf-sparql-query/
5 http://lod-cloud.net/
      </p>
      <p>While the QALD benchmarks contain a variety of questions, the QALD challenge
does not provide a fine-grained analysis of the questions themselves or a detailed
evaluation of the results. It is paramount to know: What were the easy categories?
Which categories are in general hard to answer, and why? Which features increase
or decrease the complexity of a question? With which feature category does
a system's performance correlate? Where do systems fail? In this report, we
provide a fine-grained analysis of the QALD-6 challenge results. In particular,
we are interested in the following questions:
{ What are the general categories of QALD-6 questions? Which features can be
derived?
{ How do QA systems over Linked Data perform across the different question
categories?
{ Which types of questions are hard to answer, which are relatively easy, and
why?</p>
      <p>
{ What is the effect of the number of triple patterns on the performance?
{ What is the effect of the number of keywords on the performance?
{ What is the effect of the number of answers on the performance?
{ What is the effect of the answer type (e.g., String, Date, Resource, or Boolean)
on the performance?
{ Do SPARQL aggregates such as sum, min, max, or avg increase or decrease
the performance?
It is important to mention that in this paper we are not presenting the details
of the QA systems over Linked Data. Rather, we are interested in a detailed
analysis of the QALD-6 questions and the corresponding results. A description
of the QA systems that participated in QALD-6 can be found in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>Results and Discussion</title>
      <p>In this section, we provide a fine-grained analysis of the QALD-6 results. In
particular, we first divide the overall QALD-6 questions into 8 main categories. We
then show the performance of the QA systems (that participated in the QALD-6
challenge) for the individual categories. We then discuss the complexity of each
category in terms of how difficult it is to answer correctly. After that, we measure
the effect of various performance indicators on the F scores of the QA systems.
The overall goal is to find answers to each of the questions discussed in the
previous section and to pinpoint limitations of QA systems.</p>
      <sec id="sec-2-1">
        <title>QALD-6 Question Categories and Results</title>
        <p>First, we want to investigate what the general categories of QALD-6 questions
are. To this end, we categorize the complete set of 100 QALD-6 questions into the
8 categories given in Table 1. We categorized the questions according to their
starting word. We can see that questions of type "Who?" (21 questions in total)
and "What?" (22 questions in total) are the majority. Questions of type "When?"
and "Where?" (6 and 3, respectively) are not that common. The main conclusion
of this analysis is that the QALD-6 questions are not uniformly distributed across
the different categories. Thus, adding more questions of type "Where?" could be
useful for the overall quality of QALD challenges in the future.</p>
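        <p>The starting-word categorization described above can be sketched in a few
lines of Python. This is a minimal illustration assuming the 8 categories of Table 1,
with multi-word prefixes checked before single-word ones so that, e.g., "In which"
is not matched as "Which":

```python
# The 8 question categories from Table 1, ordered so that multi-word
# prefixes ("Give me", "How many", "In which") are tried first.
CATEGORIES = ["Give me", "How many", "In which",
              "When", "Where", "Which", "Who", "What"]

def categorize(question: str) -> str:
    """Assign a QALD-6 question to a category by its starting word(s)."""
    q = question.strip().lower()
    for category in CATEGORIES:
        if q.startswith(category.lower()):
            return category
    return "Other"
```
</p>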
        <p>Once we know the general categories of the QALD-6 questions, we next want
to investigate how QA systems over Linked Data perform across the different
question categories.</p>
        <p>Our main hypothesis is that a system can win the QALD challenge even if it
performs worse than other systems in some specific question categories.</p>
        <p>This fine-grained analysis will enable developers of QA systems to learn their
systems' shortcomings for a particular category and hence let them
focus on these particular categories in future improvements. Figure 1 shows the
comparison of the QA systems across the individual categories of the QALD-6
questions. We can see that CANALI is the overall winner of the challenge.
However, it is not the winner across all categories. For example, in the "Give
me?" category, UTQA outperforms CANALI. This clearly suggests that the CANALI
developers should focus on correctly answering questions starting with "Give
me". In conclusion, our hypothesis is confirmed: there is not a single winner
across all categories.
The second research question investigates the complexity of each category,
i.e., which types of questions are hard to answer and which are relatively easy?
Figure 2 shows the average F score for the individual question categories over all 6
QA systems. Note that the F score used here is the macro F score, i.e.,
the average over the F scores for the individual questions.6 The results show that
questions of type "When?" (avg. F score = 0.57) are the easiest to answer correctly,
followed by the categories "In Which?" (avg. F score = 0.55), "Who?" (avg.
F score = 0.48), "What?" (avg. F score = 0.44), "Which?" (avg. F score = 0.43),
"Where?" (avg. F score = 0.38), "Give Me?" (avg. F score = 0.34), and "How
many?" (avg. F score = 0.32). Interestingly, questions starting with "How
Many?" are the most difficult to answer. The result of such questions is
a single value, and the corresponding SPARQL queries mostly
use the COUNT function. However, sometimes such a query simply uses a property:
e.g., "How many inhabitants has France" uses a property, not a count. It seems that
aggregate functions, e.g., count, min, max, and avg, are hard to answer correctly. We
will further investigate the effect of aggregate functions in Section 2.5.
Along with the questions in natural language, the QALD challenge also provides
a corresponding SPARQL query to answer every question. In addition, it
also provides the exact keywords or named entities in each question. We want
to investigate the effect of both of these features on the average F scores of the
QA systems. The result in Figure 3a shows that the number of triple patterns
in the SPARQL query has an inverse relationship with the average F score. If
the number of triple patterns in the query is only 1, the average F score is
0.47, which drops to 0.41 with 2 triple patterns and further to only 0.27 with
3 triple patterns. The reason for this behaviour is that the complexity of the
SPARQL query increases with the number of triple patterns. Consequently, the
more triple patterns, the harder the questions.
6 https://qald.sebastianwalter.org/6/documents/qald-6_results.pdf
Fig. 3: (a) Effect of #triple patterns. (b) Effect of #keywords.</p>
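        <p>The triple-pattern counts used above can be estimated directly from the query
text. A rough sketch, assuming a single basic graph pattern whose triples are
separated by " . " (property paths, OPTIONAL blocks, FILTERs, and subqueries are
not handled):

```python
import re

def count_triple_patterns(query: str) -> int:
    """Roughly count triple patterns in a simple SPARQL query.

    Assumes one basic graph pattern between { and } whose triples are
    separated by " . "; more complex query shapes are not handled.
    """
    m = re.search(r"\{(.*)\}", query, re.S)
    if not m:
        return 0
    parts = [p.strip() for p in m.group(1).split(" . ")]
    return sum(1 for p in parts if p)
```
</p>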
        <p>Figure 3b shows the effect of the number of keywords on the average F score
of the systems. When the number of keywords is 1, the average F score is 0.53,
which is reduced to 0.43 with 2 keywords. However, it increases again to 0.45
with 3 keywords. This shows that the number of keywords does not have a
significant impact on the average F scores. However, this point needs further
investigation, as the result could be different for other QALD versions, i.e.,
the upcoming QALD-7 and QALD-8.</p>
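        <p>The average F scores discussed throughout are macro-averaged, i.e., the mean
of the per-question F scores (see the footnoted results sheet). A minimal sketch:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_f1(per_question: list) -> float:
    """Macro F score: average of F scores over individual questions."""
    scores = [f1(p, r) for p, r in per_question]
    return sum(scores) / len(scores)
```

Note that a system failing completely on one question (F = 0) still contributes
that zero to the average, which is why hard categories drag the macro score down.
</p>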
      </sec>
      <sec id="sec-2-2">
        <title>Effect of Answer Size and Answer Types</title>
        <p>QALD questions can have exactly one answer or a set of answers. Moreover, the
answers can be of four types: 1) String, 2) Boolean, 3) RDF Resource(s), i.e., an
HTTP URI, and 4) Date. In this section, we want to investigate the effect of these
two features, i.e., the size of the answer set as well as the type of the answers,
on the average F scores of the QA systems. The result in Figure 4a shows that
the number of answers has a direct relationship with the average F score. When
the number of answers is 1, the average F score is 0.41; with 2 answers it is 0.45;
and with 3 answers it is 0.50. The reason for this behaviour is that if there is
more than one answer, a given QA system can correctly identify some of the
answers, if not all of them exactly. Thus, the system can still achieve a high
F score with only a partial answer set.</p>
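        <p>This partial-credit effect can be illustrated with set-based precision and
recall (a sketch, not the official QALD evaluation script):

```python
def set_f1(gold: set, returned: set) -> float:
    """F score of a returned answer set against the gold-standard answers."""
    if not returned or not gold:
        return 0.0
    tp = len(gold & returned)  # correctly returned answers
    if tp == 0:
        return 0.0
    precision = tp / len(returned)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```

A system returning 2 of 3 gold answers, and nothing wrong, gets precision 1.0
and recall 2/3, i.e., F = 0.8 rather than 0.
</p>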
        <p>Fig. 4: (a) Effect of #answers. (b) Effect of answer type.</p>
        <p>Finally, we want to investigate the effect of aggregate functions and the SPARQL
query forms on the overall F scores of the state-of-the-art QA systems. For
each question, QALD-6 provides information on whether any aggregate function
is required in the corresponding SPARQL query to correctly answer the given
question. In addition, SPARQL has four query forms7, namely SELECT, ASK,
DESCRIBE, and CONSTRUCT. The QALD challenges make use of the SELECT
and ASK query forms.</p>
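        <p>The two features evaluated next, the query form and the use of aggregates,
can be read off the query string. A rough sketch using keyword matching (not a
real SPARQL parser):

```python
import re

# The SPARQL aggregate functions considered in this analysis.
AGGREGATES = ("COUNT", "SUM", "MIN", "MAX", "AVG")

def query_form(query: str) -> str:
    """Return the SPARQL query form: SELECT, ASK, DESCRIBE, or CONSTRUCT."""
    m = re.search(r"\b(SELECT|ASK|DESCRIBE|CONSTRUCT)\b", query, re.I)
    return m.group(1).upper() if m else "UNKNOWN"

def uses_aggregate(query: str) -> bool:
    """True if the query applies one of the SPARQL aggregate functions."""
    return any(re.search(r"\b%s\s*\(" % f, query, re.I) for f in AGGREGATES)
```
</p>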
        <p>Figure 5a shows that aggregates are much harder to answer: the average
F score is 0.22 when an aggregate is required and 0.47 when no aggregate is
used. Surprisingly, ASK queries (avg. F score = 0.38) are harder to answer
correctly than SELECT queries (avg. F score = 0.38). Note that the answer of
an ASK query is a Boolean value; thus, this result is related to the results
presented in Figure 4b.
7 SPARQL query forms: https://www.w3.org/TR/rdf-sparql-query/#QueryForms
Fig. 5: (a) Effect of aggregates. (b) Effect of SPARQL query forms.</p>
        <p>In this paper, we presented a fine-grained analysis of the QALD-6 challenge. We
divided the QALD-6 questions into 8 different categories. We then compared
the existing QA systems over Linked Data across each of these 8 categories. We
showed that there is no sole winner across all of the categories. It turned out that
questions of the categories "When" and "In Which" are relatively easy compared to
the categories "Give Me" and "How Many". We showed that the number of triple
patterns has an inverse relationship with the average F scores of the QA systems.
In addition, it was shown that the number of keywords does not significantly
affect the overall F scores of the QA systems. Yet, the number of answers has a
direct relationship with the average F scores of the systems. Date-type questions
are easier than questions whose answer is of type String. Aggregates are much
harder for most QA systems to handle correctly. Finally, ASK query forms are
harder than SELECT query forms w.r.t. the average F scores.</p>
        <p>In the future, we want to do the same analysis for all of the QALD challenges.
We will then present the combined results of all the challenges. We believe this
will lead us to more concrete and stable results.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Acknowledgments</title>
      <p>This work has been supported by the H2020 project HOBBIT (GA no. 688227)
as well as the EuroStars projects DIESEL (no. 01QE1512C) and QAMEL (no.
01QE1549C). This work has also been supported by the German Federal Ministry
of Transport and Digital Infrastructure (BMVI) in the projects LIMBO (no.
19F2029I) and OPAL (no. 19F2028A) as well as by the German Federal Ministry
of Education and Research (BMBF) within 'KMU-innovativ: Forschung für die
zivile Sicherheit' in particular 'Forschung für die zivile Sicherheit' and the project
SOLIDE (no. 13N14456).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>P.</given-names>
            <surname>Baudis</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Sedivy</surname>
          </string-name>
          .
          <article-title>Modeling of the question answering task in the yodaqa system</article-title>
          .
          <source>In Experimental IR Meets Multilinguality</source>
          , Multimodality, and Interaction - 6th
          <source>International Conference of the CLEF Association, CLEF</source>
          <year>2015</year>
          , Toulouse, France, September 8-
          <issue>11</issue>
          ,
          <year>2015</year>
          , Proceedings, pages
          <volume>222</volume>
          –
          <fpage>228</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>A.</given-names>
            <surname>Freitas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. G.</given-names>
            <surname>Oliveira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Curry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>O'Riain</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. C. P.</given-names>
            <surname>da Silva</surname>
          </string-name>
          .
          <article-title>Treo: combining entity-search, spreading activation and semantic relatedness for querying linked data</article-title>
          .
          <source>In 1st Workshop on Question Answering over Linked Data (QALD-1)</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>V.</given-names>
            <surname>Lopez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fernandez</surname>
          </string-name>
          , E. Motta, and
          <string-name>
            <given-names>N.</given-names>
            <surname>Stieler</surname>
          </string-name>
          . PowerAqua:
          <article-title>Supporting users in querying and exploring the Semantic Web</article-title>
          .
          <source>Semantic Web Journal</source>
          ,
          <volume>3</volume>
          :
          <fpage>249</fpage>
          –
          <fpage>265</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>S.</given-names>
            <surname>Shekarpour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Marx</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-C. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          . Sina:
          <article-title>Semantic interpretation of user queries for question answering on interlinked data</article-title>
          .
          <source>Journal of Web Semantics</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>C.</given-names>
            <surname>Unger</surname>
          </string-name>
          , L. Buhmann, J.
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>A. N.</given-names>
          </string-name>
          <string-name>
            <surname>Ngomo</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Gerber</surname>
            , and
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Cimiano</surname>
          </string-name>
          .
          <article-title>Template-based question answering over RDF data</article-title>
          .
          <source>In 21st WWW conference</source>
          , pages
          <volume>639</volume>
          –
          <fpage>648</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>C.</given-names>
            <surname>Unger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Forascu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Lopez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          , E. Cabrio,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cimiano</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Walter</surname>
          </string-name>
          .
          <article-title>Question answering over linked data (QALD-4)</article-title>
          .
          <source>In CLEF</source>
          , pages
          <volume>1172</volume>
          –
          <fpage>1180</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>C.</given-names>
            <surname>Unger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Forascu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Lopez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          , E. Cabrio,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cimiano</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Walter</surname>
          </string-name>
          .
          <article-title>Question answering over linked data (QALD-5)</article-title>
          . In Working Notes of CLEF 2015 -
          <article-title>Conference and Labs of the Evaluation forum</article-title>
          , Toulouse, France, September 8-
          <issue>11</issue>
          ,
          <year>2015</year>
          .,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>C.</given-names>
            <surname>Unger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ngonga</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Cabrio</surname>
          </string-name>
          .
          <article-title>6th open challenge on question answering over linked data (qald-6)</article-title>
          . In The Semantic Web:
          <article-title>ESWC 2016 Challenges</article-title>
          .,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>R.</given-names>
            <surname>Usbeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-C.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          , L. Buhmann, and
          <string-name>
            <given-names>C.</given-names>
            <surname>Unger</surname>
          </string-name>
          .
          <article-title>Hawk – hybrid question answering using linked data</article-title>
          .
          <source>In The Semantic Web. Latest Advances and New Domains</source>
          , volume
          <volume>9088</volume>
          of Lecture Notes in Computer Science, pages
          <volume>353</volume>
          –
          <fpage>368</fpage>
          . Springer International Publishing,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>K.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Huang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhao</surname>
          </string-name>
          .
          <article-title>Question answering via phrasal semantic parsing</article-title>
          .
          <source>In Experimental IR Meets Multilinguality</source>
          , Multimodality, and Interaction - 6th
          <source>International Conference of the CLEF Association, CLEF</source>
          <year>2015</year>
          , Toulouse, France, September 8-
          <issue>11</issue>
          ,
          <year>2015</year>
          , Proceedings, pages
          <volume>414</volume>
          –
          <fpage>426</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>