      SemGraphQA@QALD-5: LIMSI participation at QALD-5@CLEF

    Romain Beaumont (1,2), Brigitte Grau (1,3) and Anne-Laure Ligozat (1,3)

            (1) LIMSI-CNRS, (2) Universite Paris-Sud, (3) ENSIIE, France
                          {firstname.lastname}@limsi.fr



        Abstract. For our participation in QALD-5, we developed a system for
        answering questions over a knowledge base. We propose an unsupervised
        method for the semantic analysis of questions that generates queries
        through graph transformations in two steps. The first step is independent
        of the knowledge base schema and relies on very general constraints on
        the query structure, which allows us to maintain semantic ambiguities in
        different graphs. These ambiguities are then resolved globally in the
        final step, when querying the knowledge base.


Keywords: Semantic Web, Question-Answering system, Semantic question analysis.


1     Introduction
More and more structured knowledge is made available on the Web, as with
DBpedia [Auer et al., 2007], Freebase [Bollacker et al., 2008] or Yago2 [Hoffart
et al., 2011], allowing a user to find answers related to some precise information
need. To give users simple access to these bases, interfaces that accept
questions in Natural Language (NL) are being developed; they require neither
knowing how the knowledge is structured, i.e. the knowledge base schema, nor
learning a formal query language such as SPARQL. A SPARQL query relies on a
conjunction of triples that can be represented as a graph: nodes are entities
and edges are typed relations. The two interrelated problems we face when
transforming an NL question into such a graph are: i) identifying entities,
relations, entity classes and operators, and ii) determining the structure of
the query, i.e. how these entities and relations are linked and how the
operators apply. [Unger et al., 2012] and [Xu et al., 2014] make initial
structural choices based on the knowledge base schema before mapping to the
knowledge base, while [Yahya et al., 2013] and [He et al., 2014] resolve all
the ambiguities jointly.
    We chose to develop an unsupervised method based on graph transformation
that acts in two steps: first, syntactic graphs are transformed to build all
the structurally valid syntactico-semantic representations of the question,
which maintain the ambiguities related to the queried knowledge base schema
and content. The second step consists in generating all the semantic graphs
made of valid triples, determined according to the knowledge base schema and
content, i.e. DBpedia, each leading to a SPARQL query. The ambiguities are
resolved only at this last step, similarly to [Zou et al., 2014]. Each
semantic graph is weighted according to the scores attributed to its triples,
and the answer is the result of the highest-ranked query that returns one.
    The system, named SemGraphQA, was designed for answering DBpedia queries,
and cannot answer hybrid ones. It obtained an F1 of 0.31 at QALD-5.


2   Description of the system

The aim of the system is to transform a question Q into semantic graphs G^S_Q,
each corresponding to a possible meaning. The nodes of a graph are entities
and its edges are relations. Every entity and relation has a relevance score,
which is used to compute the score of the graph. This transformation involves
several steps (Figure 1).
     Question phrases are identified in order to link them to semantic items of
the knowledge base: entities, types or relations (Figure 1a). Entity
identification is performed by DBpedia Spotlight [Daiber et al., 2013], type
identification is done using the type labels, and relation identification by a
method we propose based on variations extracted from WordNet. Each candidate
semantic item is given a relevance score.
     The syntactic analysis of the question by the Stanford Parser [Klein and
Manning, 2003] produces a syntactic graph, which is simplified and rewritten to
generate a G^Sy_Q graph (Syntactic Graph of the Question), whose nodes are the
words and whose edges are the syntactic relations. In this graph, the semantic
nature of each word (entity or relation) is unknown (Figure 1b).
     The G^Sy_Q graph is multiplied by assuming that each word can be either an
entity or a relation, in order to keep the possible ambiguities; the resulting
graphs are then transformed and filtered to obtain the graphs G^SS_Q that
comply with structural constraints (Figure 1c). For example, if two
relation-words are linked by a syntactic relation, there is an implicit entity
and we add an entity-word between them. In the question Who is the daughter of
Bill Clinton married to?, when daughter and married are both considered
relation-words, they are linked directly, so we create an unknown entity-word
between the two, which makes explicit the fact that the daughter of Bill
Clinton has to be found in order to answer the question.
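As an illustration of this rule, here is a minimal, runnable sketch (not the
authors' code) that inserts an unknown entity-word between two directly linked
relation-words; the graph representation (an edge list plus a node-type map)
is our own choice for illustration.

# Hypothetical sketch of the implicit-entity rule described above:
# whenever two relation-words are directly linked, an intermediate
# entity-word (an unknown entity) is inserted between them.

def insert_implicit_entities(edges, node_types):
    """edges: list of (node_a, node_b); node_types: node -> 'entity'|'relation'."""
    new_edges, counter = [], 0
    for a, b in edges:
        if node_types[a] == 'relation' and node_types[b] == 'relation':
            counter += 1
            x = f'?implicit{counter}'          # new unknown entity node
            node_types[x] = 'entity'
            new_edges += [(a, x), (x, b)]      # relation -- entity -- relation
        else:
            new_edges.append((a, b))
    return new_edges

# Example for "Who is the daughter of Bill Clinton married to?" with
# 'daughter' and 'married' both typed as relation-words:
types = {'Who': 'entity', 'married': 'relation',
         'daughter': 'relation', 'Bill Clinton': 'entity'}
edges = [('Who', 'married'), ('married', 'daughter'),
         ('daughter', 'Bill Clinton')]
print(insert_implicit_entities(edges, types))
# [('Who', 'married'), ('married', '?implicit1'), ('?implicit1', 'daughter'),
#  ('daughter', 'Bill Clinton')]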
     The last step consists in generating the possible semantic graphs, noted
G^S_Q, using the ambiguities of each node and edge obtained after the mapping
phase. These graphs are made of triples consistent with the knowledge base
(Figure 1d). The semantic graphs are sorted according to a relevance score, and
the corresponding queries are executed in descending order until at least one
answer is returned.
     The method we propose aims at never eliminating an interpretation that has
a coherent semantic structure, i.e. a structure that can be instantiated in the
queried data. The constraints on the graph structure are very general and not
specific to the schema of a given knowledge base. Thus, these constraints could
also be used when searching information in texts. The constraints linked to the
queried base are applied at the last step, and the final filtering is operated
by the search in the knowledge base, based on whether an answer is found or
not.

Fig. 1. System overview (example question: Who is the daughter of Bill Clinton
married to?): (a) mapping of question phrases to candidate entities, classes
and relations; (b) simplified syntactic graph; (c) syntactico-semantic graph
G^SS_Q with an implicit entity; (d) semantic graph G^S_Q; answer: Marc
Mezvinsky.

3     Semantic component mapping
In order to map question terms to semantic components of the knowledge base
(KB), i.e. entities, relations and classes (cf. Fig. 1a), we determine a set of
phrases that could be related to a semantic component according to its label.
These phrases correspond to all n-grams, with n at most 5, that contain a noun,
a verb or an adjective, as each type of word may lead to a type of component.
A reliability score is associated with each component found.
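As an illustration, here is a minimal sketch of this candidate phrase
extraction, using NLTK for tokenization and POS tagging; the paper does not
specify which tools were actually used for this step.

# Sketch of candidate phrase extraction: all n-grams (n <= 5) that contain
# a noun, a verb or an adjective. NLTK is an assumption for illustration.
# requires: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
from nltk import word_tokenize, pos_tag
from nltk.util import ngrams

CONTENT_TAGS = ('NN', 'VB', 'JJ')   # noun, verb, adjective tag prefixes

def candidate_phrases(question, max_n=5):
    tagged = pos_tag(word_tokenize(question))
    phrases = set()
    for n in range(1, max_n + 1):
        for gram in ngrams(tagged, n):
            # keep the n-gram if at least one word is a content word
            if any(tag.startswith(CONTENT_TAGS) for _, tag in gram):
                phrases.add(' '.join(word for word, _ in gram))
    return phrases

print(candidate_phrases("Which languages are spoken in Estonia?"))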

3.1     Entity extraction and identification
Entity mapping consists of recognizing an entity mention in a question, i.e. a
word or a phrase that refers to an entity, linking it to a KB resource and de-
termining its class. For this purpose, we make use of DBpedia Spotlight (DBS)
[Daiber et al., 2013]. DBS determines possible entities, extracts their
mentions, associates them with a score, and classifies them according to the
DBpedia ontology1, which contains a hierarchy of classes. We use the DBS score
as the reliability score of an entity. It is based on a score computed during
the indexing of the entities and on contextual scores depending on the other
question terms. As we do not want to choose between entities at an early stage,
we keep all the results provided by DBpedia Spotlight.
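As an illustration of this mapping step, here is a minimal sketch that queries
a DBpedia Spotlight REST endpoint and keeps every returned candidate with its
score; the public endpoint URL and the selection of fields are assumptions for
illustration, not the authors' setup.

# Sketch of the entity mapping step using the DBpedia Spotlight web service.
import requests

SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"  # assumption

def spotlight_entities(question, confidence=0.0):
    resp = requests.get(SPOTLIGHT_URL,
                        params={"text": question, "confidence": confidence},
                        headers={"Accept": "application/json"})
    resp.raise_for_status()
    entities = []
    for r in resp.json().get("Resources", []):
        entities.append({
            "mention": r["@surfaceForm"],           # phrase in the question
            "uri": r["@URI"],                       # DBpedia resource
            "score": float(r["@similarityScore"]),  # used as reliability score
            "types": r["@types"].split(","),        # ontology classes
        })
    return entities   # all candidates are kept, none is discarded at this stage

print(spotlight_entities("Who is the daughter of Bill Clinton married to?"))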

3.2     Class identification
Some question phrases refer to classes, and not to entities. For example, in the
question Which languages are spoken in Estonia?, languages corresponds to the
class dbo:Language and in Which software has been developed by organizations
founded in California? , organizations corresponds to the class dbo:Company
and software to the class dbo:Software.
    The identification of classes enables us to further associate them with the
domain or range of a relation. Target classes are DBpedia classes and Yago
classes2 . Their identification is based on a comparison of candidate n-grams
with class labels. Recognized classes are considered as unknown entities that
could be found when querying the KB.
    Scores of recognized classes are based on the Jaccard distance of matching
words to favour specific classes. For example, the class StatesOfTheUnitedStates
will obtain a greater score than States. Thus, the score S_class for evaluating
the reliability of a class is:

    S_class = α × (# words) / (maximum number of words in a label)        (1)
1
    http://mappings.dbpedia.org/server/ontology/classes/
2
    DBpedia includes classes of Yago [Hoffart et al., 2011]
We set maximum to 10. As DBpedia classes are less numerous than Yago classes
(529 vs 379,900) and more often present in the DBpedia triples, we set the
parameter α to 2 for DBpedia classes and to 1 for Yago classes, in order to
lower the score of Yago classes.
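A small sketch of this score follows, under our reading of Eq. (1) in which
# words is the number of words of the matched class label, capped by
maximum = 10; this interpretation is not spelled out in the paper.

# Sketch of the class reliability score of Eq. (1): alpha favours DBpedia
# classes (alpha = 2) over Yago classes (alpha = 1), and longer labels get a
# higher score (StatesOfTheUnitedStates > States).
import re

MAX_LABEL_WORDS = 10   # the "maximum" of the paper

def class_score(label, is_dbpedia_class):
    # split CamelCase labels such as "StatesOfTheUnitedStates" into words
    n_words = min(len(re.findall(r'[A-Z]?[a-z]+|[A-Z]+(?![a-z])', label)),
                  MAX_LABEL_WORDS)
    alpha = 2.0 if is_dbpedia_class else 1.0
    return alpha * n_words / MAX_LABEL_WORDS

print(class_score("StatesOfTheUnitedStates", is_dbpedia_class=False))  # 0.5
print(class_score("States", is_dbpedia_class=False))                   # 0.1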
    A special case is due to interrogative words. They are processed separately
and each interrogative word is mapped to a class, for example Who is mapped
to dbo:Agent and Where is mapped to dbo:Place.

3.3   Relation identification
Identification of a relation consists of associating a mention extracted from
the question with a KB relation. Each KB relation is associated with a label,
which corresponds to a possible mention in texts. In order to maximize recall
for relations, we start from all candidate mentions, made of the different
question n-grams, rather than pre-filtering them according to a classification
into relation and non-relation classes.
    However, few relations are mentioned in questions using their KB label and
several kinds of variation have to be considered:
 1. The POS category of the word in the label is different from the POS category
    of the word in the mention. For example, in Who is Barack Obama married
    to?, the verb married has to be related to the relation whose label is the
    noun spouse.
 2. Mentions and labels are paraphrases or present an even greater semantic
    distance. For example, a mention can refer to a hyponym, as in Figure 1a.
 3. Labels and mentions are made of several words, i.e. n-grams with n greater
    than 1, and each of these words can present variations.
   In order to overcome this lexical gap, we built a lexicon with variants acquired
from WordNet [Fellbaum, 1998]. We selected the following relations in WordNet:
 – derivationally related forms: successor ↔ succeed
 – synonyms: author ↔ writer
 – hypernyms/hyponyms : relative ↔ sister
    The algorithm for acquiring variants starts from a label. For each of its
words, the network is recursively explored up to a maximum depth d. We limit
the depth in order to keep only the most relevant variants. We experimentally
set d = 4.
   Each variant is assigned a score based on the path length:

    S_r = 1 / (path length between the two terms)                         (2)

with the path length equal to the number of crossed hypernym/hyponym links.
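Here is a minimal sketch (not the authors' code) of this variant acquisition:
starting from a relation label word, synonyms, derivationally related forms
and hyper/hyponyms are collected up to depth d = 4 and scored by 1 / path
length. Variants found at distance 0 (synonyms, derived forms) are given the
maximal score 1.0, which is our interpretation of the formula.

# Sketch of the WordNet variant acquisition behind Eq. (2).
# requires: nltk.download('wordnet')
from nltk.corpus import wordnet as wn

MAX_DEPTH = 4

def variants(word):
    scores, best_depth = {}, {}
    def visit(synset, links):
        # stop when too deep or when this synset was reached by a shorter path
        if links > MAX_DEPTH or best_depth.get(synset, MAX_DEPTH + 1) <= links:
            return
        best_depth[synset] = links
        for lemma in synset.lemmas():
            names = [lemma.name()]
            names += [d.name() for d in lemma.derivationally_related_forms()]
            for name in names:
                name = name.replace('_', ' ')
                score = 1.0 / max(1, links)
                scores[name] = max(scores.get(name, 0.0), score)
        for related in synset.hypernyms() + synset.hyponyms():
            visit(related, links + 1)
    for synset in wn.synsets(word):
        visit(synset, 0)
    return scores

print(sorted(variants('spouse').items(), key=lambda kv: -kv[1])[:10])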
    We built a dictionary of variations for all the relations in DBpedia. These
relations all have an English label, for example dbo:child, which indicates all the
children of a person, has the label child and dbo:birthDate, which indicates the
birthday of a person, has the label birth date. A total of 1,252,327 variations
were generated for the 12,331 relations, with a mean of 102 variations per
relation.
4     Generation of a semantic representation of a question

4.1   Question Semantic Representation

Each possible meaning of a question is represented by a set of triples
(e_1, R, e_2), with e_1 and e_2 entities and R a relation, forming a connected
and weighted semantic graph G^S_Q. Entities are linked by a relation only if a
syntactic relation exists between their mentions. Only triples that have an
instantiation in the KB are kept.
    An entity is described by four fields: mention, type, value and score. The
mention and type fields can be empty; for example, an interrogative word is an
entity with only a type and a label. The value field is a URI, a variable x
when the entity is implicit in the question, or ? when it is the entity to
find. The entity score is computed by the mapping process (cf. Section 3.1)
and has a value when a URI is found.
    A relation has a label (in the knowledge base), a domain, a range and an
identifier in the knowledge base (URI). For example, the relation dbo:birthDate
(http://dbpedia.org/ontology/birthDate) has the label birth date, the domain
dbo:Person and the range xsd:date.
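For illustration, the representation described above could be sketched as
follows; the field and class names follow the text, but this is not the
authors' actual data model.

# Hypothetical sketch of the entity, relation and triple representation.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Entity:
    mention: Optional[str]   # phrase in the question, may be empty
    type: Optional[str]      # e.g. "dbo:Person", may be empty
    value: str               # a URI, a variable "x" (implicit), or "?" (answer)
    score: float             # reliability score from the mapping step

@dataclass
class Relation:
    label: str               # e.g. "birth date"
    domain: Optional[str]    # e.g. "dbo:Person"
    range: Optional[str]     # e.g. "xsd:date"
    uri: str                 # e.g. "http://dbpedia.org/ontology/birthDate"
    score: float             # reliability score from the mapping step

@dataclass
class Triple:                # one (e1, R, e2) element of a semantic graph
    e1: Entity
    relation: Relation
    e2: Entity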
    When question terms are ambiguous, i.e. they have been related to different
elements of the knowledge base, several semantic graphs are generated to
represent the different meanings of the question.


4.2   Triple Generation

Triples are generated from the syntactic dependency graph given by the Stanford
parser [Klein and Manning, 2003]. This process is decomposed into two stages:
the production of syntactically well-formed syntactico-semantic graphs G^SS_Q,
and the generation of semantically coherent semantic graphs G^S_Q.
    The transformation of a syntactic graph into a G^SS_Q graph is done by the
following steps:

 1. the syntactic graph is simplified to keep only the relevant dependency
    relations: syntactic relations such as determiner, noun compound modifier,
    controlling subject, passive auxiliary, auxiliary, open clausal complement
    and copula are filtered out. The remaining relations form a graph of words
    (see Figure 1b);
 2. a word may refer to an entity or a relation. This entails several possible
    graph structures, and a graph is generated for every possibility, with each
    word replaced by an entity-word or a relation-word node;
 3. these graphs are then filtered by well-formedness criteria, as sketched
    below: each entity-word must be linked to a relation-word; if two
    relation-words are linked, an entity-word corresponding to an implicit
    entity is added; every relation-word must be linked to two entity-words;
    and the graph must be connected (cf. Figure 1c).
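As referenced above, here is a minimal, self-contained sketch (not the authors'
implementation) of such a well-formedness filter; the graph representation is
the same illustrative edge list and node-type map as before, and implicit
entities are assumed to have been inserted already.

# Sketch of the structural filter applied to candidate G^SS_Q graphs.

def is_well_formed(edges, node_types):
    """edges: list of (a, b); node_types: node -> 'entity' | 'relation'.
    Every edge endpoint is assumed to appear in node_types."""
    nodes = set(node_types)
    neighbours = {n: set() for n in nodes}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    for node, kind in node_types.items():
        linked = neighbours[node]
        if kind == 'entity' and not any(node_types[m] == 'relation' for m in linked):
            return False        # entity-word not linked to any relation-word
        if kind == 'relation' and sum(node_types[m] == 'entity' for m in linked) != 2:
            return False        # relation-word must link exactly two entity-words

    # connectivity check by graph traversal
    if not nodes:
        return False
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(neighbours[n] - seen)
    return seen == nodes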
     Once these G^SS_Q graphs of words typed as relation or entity are
generated, it is possible to generate the semantic graphs G^S_Q constituted of
semantic triples:

 1. every triple element is mapped to each semantic item found during the
    mapping step, which makes it possible to generate different semantic
    triples for each triple of a G^SS_Q graph;
 2. the semantic coherence of each triple is checked by keeping only the
    triples which have an instantiation in the knowledge base;
 3. the Cartesian product of the remaining triple lists is computed to obtain
    the semantic graphs G^S_Q, which are the possible semantic representations
    of the question (cf. Figure 1d), as sketched below.
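A sketch of this last stage follows, assuming the public DBpedia SPARQL
endpoint and the SPARQLWrapper library, neither of which is mentioned in the
paper; the candidate triples and the dbo:spouse/dbo:partner alternatives are
illustrative only.

# Sketch of the semantic graph generation: each candidate triple is kept only
# if it has an instantiation in DBpedia (checked with a SPARQL ASK query), and
# the Cartesian product of the surviving candidate lists gives the G^S_Q graphs.
from itertools import product
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = SPARQLWrapper("https://dbpedia.org/sparql")
ENDPOINT.setReturnFormat(JSON)

def has_instantiation(subject, predicate, obj):
    """subject/predicate/obj are URIs in <...> form or SPARQL variables."""
    ENDPOINT.setQuery(f"ASK WHERE {{ {subject} {predicate} {obj} . }}")
    return ENDPOINT.query().convert()["boolean"]

def semantic_graphs(candidate_triples):
    """candidate_triples: one list of (s, p, o) alternatives per triple of the
    G^SS_Q graph."""
    kept = [[t for t in alternatives if has_instantiation(*t)]
            for alternatives in candidate_triples]
    if any(not alternatives for alternatives in kept):
        return []                     # a triple has no valid instantiation
    return list(product(*kept))       # Cartesian product = candidate G^S_Q

# Example for "Who is the daughter of Bill Clinton married to?":
graphs = semantic_graphs([
    [("<http://dbpedia.org/resource/Bill_Clinton>",
      "<http://dbpedia.org/ontology/child>", "?x")],
    [("?x", "<http://dbpedia.org/ontology/spouse>", "?who"),
     ("?x", "<http://dbpedia.org/ontology/partner>", "?who")],
])
print(graphs)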

    Generating G^SS_Q graphs before generating G^S_Q graphs makes it possible
to filter out a subset of the graphs with structural criteria, without taking
into account the mapping ambiguities on each node and edge, and hence to
produce fewer semantic graphs.


4.3   Incompatible triple identification

Once all the possible triples are known, it is possible to evaluate their
consistency with respect to the semantic constraints provided by the KB
schema. In KB triples, each of the two entities can be associated with a type,
and relations are defined by a domain and a range. Using this information, we
can compute a compatibility score that gives a lower score to triples that are
not compatible, i.e. whose joint entity types do not fit the domain and range.
Indeed, we cannot simply dismiss the incompatible triples, because the
knowledge base may be inconsistent: some entities are linked by a relation
without observing the domain or range constraints. For example, the answer to
the question Who developed Minecraft? corresponds in DBpedia to the triple
⟨res:Minecraft dbo:developer res:4J Studios⟩. The domain of dbo:developer is
dbo:UnitOfWork and the range is dbo:Agent. res:Minecraft has type
dbo:VideoGame, which is not compatible with dbo:UnitOfWork.
     We define the compatibility between an entity e of type C_e and a domain D
of type C_D as follows: e is incompatible with D only if C_D is not included in
the set formed of C_e and of its supertypes and subtypes. There are three
possibilities: compatibility, incompatibility, and the case where the
information is missing because the domain or the entity types are unknown. For
example, the relation dbo:childOrganisation, identified as a candidate relation
for the mention daughter between Bill Clinton and his daughter (an implicit
entity mention whose type is unknown), is incompatible because its range
dbo:Organisation is not compatible with the type dbo:President of Bill Clinton.
     This compatibility verification makes it possible to compute a
compatibility score S_c for a triple:

    S_c = S_cd × S_cr                                                     (3)

with S_cd and S_cr the compatibility scores of the domain and range. They take
the value β in case of compatibility, γ in case of missing information and δ in
case of incompatibility. A compatible domain, or range, must obtain a better
score than missing information or incompatibility, so we take β > γ > δ, and we
set β = 1.0, γ = 0.75 and δ = 0.5.
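The following sketch illustrates one possible reading of this compatibility
check and of Eq. (3); the class hierarchy is passed in as a plain dictionary,
which is an assumption for illustration, not the authors' data structure.

# Sketch of the compatibility score of Eq. (3): each side of the triple is
# compared with the relation's domain/range and scored beta (compatible),
# gamma (information missing) or delta (incompatible).
BETA, GAMMA, DELTA = 1.0, 0.75, 0.5

def side_score(entity_types, expected_class, hierarchy):
    """entity_types: known classes of the entity; expected_class: the
    relation's domain or range; hierarchy: class -> its super- and subtypes."""
    if not entity_types or expected_class is None:
        return GAMMA                                  # missing information
    for c in entity_types:
        if expected_class == c or expected_class in hierarchy.get(c, set()):
            return BETA                               # compatible
    return DELTA                                      # incompatible

def compatibility_score(e1_types, e2_types, domain, range_, hierarchy):
    return (side_score(e1_types, domain, hierarchy)
            * side_score(e2_types, range_, hierarchy))

# Minecraft example from the text: dbo:VideoGame vs. domain dbo:UnitOfWork.
hierarchy = {"dbo:VideoGame": {"dbo:Software", "dbo:Work", "owl:Thing"}}
print(compatibility_score({"dbo:VideoGame"}, {"dbo:Agent"},
                          "dbo:UnitOfWork", "dbo:Agent", hierarchy))
# 0.5 * 1.0 = 0.5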

4.4     Weighting of a semantic graph
Each semantic graph is assigned a weighting score S given by the formula:

    S = (1/n) × Σ_{i=1}^{n} S_{t_i}                                       (4)

with n the number of triples in the graph and S_{t_i} the score of triple i:

    S_{t_i} = S_c × (S_{e1} + S_r + S_{e2}) / 3                           (5)

with S_{e1} and S_{e2} the scores of the two entities, S_r the score of the
relation and S_c the compatibility score of the triple.
    The graphs are then converted to SPARQL queries and executed in descend-
ing order of their scores.
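A small sketch of Eqs. (4) and (5) together with the ranked execution loop
follows; the SPARQLWrapper library and the public DBpedia endpoint are
assumptions made for illustration, and first_answer only covers SELECT queries.

# Sketch of the graph scoring and of the ranked query execution.
from SPARQLWrapper import SPARQLWrapper, JSON

def triple_score(s_e1, s_r, s_e2, s_c):
    return s_c * (s_e1 + s_r + s_e2) / 3.0            # Eq. (5)

def graph_score(triple_scores):
    return sum(triple_scores) / len(triple_scores)    # Eq. (4)

def first_answer(scored_queries, endpoint="https://dbpedia.org/sparql"):
    """scored_queries: list of (score, sparql_select_query) pairs; queries are
    run in descending score order until one returns at least one binding."""
    sparql = SPARQLWrapper(endpoint)
    sparql.setReturnFormat(JSON)
    for _, query in sorted(scored_queries, key=lambda p: p[0], reverse=True):
        sparql.setQuery(query)
        bindings = sparql.query().convert()["results"]["bindings"]
        if bindings:
            return bindings
    return None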

4.5     Question types and operators
The SPARQL query may involve the use of operators. Thus, we developed a
basic method to handle some cases. The following question types are identified:
 – count questions: questions starting with How many, which either need a
   COUNT operator in the SPARQL query or a relation whose label usually
   contains a word such as count or total;
 – boolean questions: questions usually starting with Is or Are, which are
   answered with an ASK query;
 – factual questions: all the other questions, which are answered with SELECT
   queries.
The question type is identified using simple keyword matching. The same
analysis is then done for all three kinds of questions; only the generation of
the SPARQL query differs slightly, to handle the difference in the expected
answer.
    In several questions, binary operators are present. These operators can be,
for example, more than, less than or the same as, and they are expressed in
SPARQL queries as a FILTER. We currently handle the more than and less than
operators by considering them as a special kind of relation. These relations
are processed in the same way as knowledge base relations, until the
transformation into SPARQL queries, where they are represented as a FILTER.
For example, the question Which caves have more than 3 entrances? is expressed
by the query SELECT DISTINCT ?2 WHERE { FILTER(?1 > 3) .
?2 dbp:entranceCount ?1 . ?2 rdf:type dbo:Cave . }
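The following sketch shows one way such a comparison operator could be turned
into a FILTER query of the form above; the function and variable names are
illustrative, and PREFIX declarations are omitted as in the example query.

# Sketch of the comparison-operator handling: "more than" / "less than" are
# kept as pseudo-relations and turned into a SPARQL FILTER at query
# generation time.
def count_filter_query(target_class, count_relation, operator, value):
    op = {"more than": ">", "less than": "<"}[operator]
    return (
        "SELECT DISTINCT ?x WHERE { "
        f"?x rdf:type {target_class} . "
        f"?x {count_relation} ?n . "
        f"FILTER(?n {op} {value}) }}"
    )

# "Which caves have more than 3 entrances?"
print(count_filter_query("dbo:Cave", "dbp:entranceCount", "more than", 3))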
    Some binary operators, such as same-as, are not handled yet; neither are
unary operators, expressed in SPARQL with ORDER BY and LIMIT, that correspond
to superlatives in the question such as "oldest child".
5    Experiments
The results in Table 1 were obtained on the QALD-5 test question set, which is
made up of 60 questions: 50 questions on DBpedia and 10 hybrid questions. Our
system is evaluated on the 50 DBpedia questions.


                     Processed  Right  Partial  Recall  Precision   F1
         Xser           42       26       7      0.72     0.74     0.73
         APEQ           26        8       5      0.48     0.40     0.44
         QAnswer        37        9       4      0.35     0.46     0.40
         SemGraphQA     31        7       3      0.32     0.31     0.31
         YodaQA         33        8       2      0.25     0.28     0.26

                      Table 1. QALD-5 test evaluation



     Among the 39 questions that the system does not answer correctly, there
are overlapping errors in 26 questions: an entity is not found in 7 questions,
a relation is not found in 25 questions and a type is not found in 4 questions.
In the 13 remaining questions with errors, the issues include the ordering
operation (usually expressed as a superlative in the question) and some binary
operators, such as same-as, which are not handled by the system yet. Another
cause of error is an incorrect ranking of the semantic graphs.
     The main difficulty lies in the identification of relations, due to the
lexical distance between the relation mention and its label in the knowledge
base, or to relation wordings in the question that do not use a content word.


6    Conclusion
In the context of systems for answering questions over knowledge bases of the
Semantic Web, we developed SemGraphQA, a system based on graph transformations
designed to handle structural and lexical ambiguities. It postpones choices
until the latest possible point, while limiting ambiguities to those consistent
with the graph structure. The method we used is unsupervised and can easily be
applied to knowledge bases other than DBpedia. Moreover, the semantic
representation generated is independent of the query language.
    We plan to build a hybrid system over knowledge bases and texts, in order
to take into account the information present in textual corpora, using the same
semantic representation generation method with an adapted identification of
semantic items.


References
[Auer et al., 2007] Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., and
  Ives, Z. (2007). Dbpedia: A nucleus for a web of open data. Springer.
[Bollacker et al., 2008] Bollacker, K., Evans, C., Paritosh, P., Sturge, T., and Taylor,
  J. (2008). Freebase: a collaboratively created graph database for structuring human
  knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on
  Management of data, pages 1247–1250. ACM.
[Daiber et al., 2013] Daiber, J., Jakob, M., Hokamp, C., and Mendes, P. N. (2013).
  Improving efficiency and accuracy in multilingual entity extraction. In Proceedings
  of the 9th International Conference on Semantic Systems (I-Semantics).
[Fellbaum, 1998] Fellbaum, C. (1998). WordNet: An Electronic Lexical Database.
  Bradford Books.
[He et al., 2014] He, S., Liu, K., Zhang, Y., Xu, L., and Zhao, J. (2014). Question
  answering over linked data using first-order logic. In Proceedings of Empirical Methods
  in Natural Language Processing.
[Hoffart et al., 2011] Hoffart, J., Suchanek, F. M., Berberich, K., Lewis-Kelham, E.,
  De Melo, G., and Weikum, G. (2011). Yago2: exploring and querying world knowledge
  in time, space, context, and many languages. In Proceedings of the 20th international
  conference companion on World wide web, pages 229–232. ACM.
[Klein and Manning, 2003] Klein, D. and Manning, C. D. (2003). Accurate unlexical-
  ized parsing. In Proceedings of the 41st Annual Meeting on Association for Com-
  putational Linguistics-Volume 1, pages 423–430. Association for Computational Lin-
  guistics.
[Unger et al., 2012] Unger, C., Bühmann, L., Lehmann, J., Ngonga Ngomo, A.-C.,
  Gerber, D., and Cimiano, P. (2012). Template-based question answering over rdf
  data. In Proceedings of the 21st international conference on World Wide Web, pages
  639–648. ACM.
[Xu et al., 2014] Xu, K., Feng, Y., and Zhao, D. (2014). Xser@QALD-4: Answering
  natural language questions via phrasal semantic parsing.
[Yahya et al., 2013] Yahya, M., Berberich, K., Elbassuoni, S., and Weikum, G. (2013).
  Robust question answering over the web of linked data. In Proceedings of the 22nd
  ACM international conference on Conference on information & knowledge manage-
  ment, pages 1107–1116. ACM.
[Zou et al., 2014] Zou, L., Huang, R., Wang, H., Yu, J. X., He, W., and Zhao, D.
  (2014). Natural language question answering over rdf: A graph data driven approach.
  In Proceedings of the 2014 ACM SIGMOD International Conference on Management
  of Data, SIGMOD ’14, pages 313–324, New York, NY, USA. ACM.