Extending DUDES for Ranked Template Generation

    Hyunwhan Joe1, Sungkwon Yang1, Yongsun Shim1, Sueun Jang1, Hong-Gee Kim1
                      1 Biomedical Knowledge Engineering Laboratory,

                        Seoul National University, Seoul, Korea
                 {hyunwhanjoe, sungkwon.yang, yongsun0926,
                         jchr119, hgkim}@snu.ac.kr


        Abstract. Question answering systems for Linked Open Data represented in
        RDF have received attention lately. These systems allow users to access datasets
        without any prior knowledge of the data model, schema, or query language.
        Template generation is one method such systems use to transform natural
        language questions into SPARQL queries. TBSL is a representative
        template-based system. It first transforms a question into an intermediate
        semantic representation and then transforms that representation into SPARQL
        templates. Several candidate templates can be generated for a single question.
        In this paper we propose an example of a possible scoring method on the
        intermediate semantic representations that can later be used for ranking the
        templates.

        Keywords: Question Answering, Semantic Web, Natural Language Patterns.


1       Introduction

There is a large amount of RDF data interlinked as Linked Open Data
(LOD). The problem is that end-users interested in this data have to be familiar with
Semantic Web technologies to access it. Question answering (QA) systems
are one solution to this problem. QA systems allow users to access datasets
without any knowledge of RDF, vocabularies, or SPARQL. One approach to QA is
the template-based approach. A template is a SPARQL query skeleton that represents the
general structure of the intended query. The template is not a full SPARQL query: it
has slots, each of which records what kind of entity will fill it (resource,
class, or property) and the matching lexical term. An example of a template
can be seen in Fig. 1. A representative QA system of the template-based approach is
TBSL [1].
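   The notion of a template with typed slots can be sketched as a small data structure. The sketch below is only an illustration under stated assumptions: the class and field names (Template, Slot, SlotType), the query skeleton, and the ex: identifiers are hypothetical and are not taken from TBSL or from Fig. 1. The skeleton string is not validated against a SPARQL engine.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class SlotType(Enum):
    """The three kinds of entity that can fill a slot."""
    RESOURCE = "resource"
    CLASS = "class"
    PROPERTY = "property"


@dataclass(frozen=True)
class Slot:
    variable: str        # placeholder in the skeleton, e.g. "?p"
    slot_type: SlotType  # what kind of entity should fill it
    term: str            # matching lexical term from the question


@dataclass
class Template:
    skeleton: str                      # query structure with unfilled slots
    slots: List[Slot] = field(default_factory=list)

    def fill(self, bindings: Dict[str, str]) -> str:
        """Replace slot variables with the identifiers given in bindings."""
        query = self.skeleton
        for slot in self.slots:
            if slot.variable in bindings:
                query = query.replace(slot.variable, bindings[slot.variable])
        return query


# Hypothetical template for "Who produced the most films?" (an assumption,
# not a reproduction of Fig. 1).
template = Template(
    skeleton=(
        "SELECT DISTINCT ?x WHERE { ?x ?p ?y . ?y rdf:type ?c . } "
        "ORDER BY DESC(COUNT(?y)) LIMIT 1"
    ),
    slots=[
        Slot("?c", SlotType.CLASS, "films"),
        Slot("?p", SlotType.PROPERTY, "produced"),
    ],
)

# Entity linking would supply real identifiers; these are placeholders.
filled = template.fill({"?p": "ex:producer", "?c": "ex:Film"})
print(filled)
```

   Keeping the slot type and lexical term next to the variable is what lets a later entity linking step look up candidates of the right kind for each slot.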
  TBSL consists of three major modules, applied in order: template generation, entity linking,
and query filtering and ranking. A natural language question is the input to the template
generation module, and the output is one or more templates. The templates are
the input to entity linking, which fills the slots with candidate entities and
outputs candidate SPARQL queries. The queries are then filtered and
ranked, and the highest-ranked query is used to retrieve the answer from the
dataset.

Fig. 1. An example of a template with slots for the question “Who produced the most
films?”

    The query filtering and ranking module is needed because the template generation
module can produce several templates, which in turn lead to several queries. In this paper,
we address the query ranking issue by attaching possible ranking scores to the candidate
templates generated. The intuition is that certain templates tend to be used more often
than others depending on the grammar patterns of the question. This paper describes
ongoing work and presents preliminary results. Section 2 describes the template
generation process in the detail needed for Section 3, which gives
an example of a possible template ranking score.


2      Template Generation

The idea behind TBSL is that the structure of the SPARQL query is determined by the
syntax and the domain-independent expressions of the natural language question. The
SPARQL equivalents of these expressions are the same across all datasets, which
is why they are considered domain-independent. Examples are question words
such as who, what, where, and when. During the template generation process, the
natural language question is parsed into its syntactic structure. TBSL uses Lexicalized
Tree Adjoining Grammar (LTAG) [2] for parsing, but for ease of explanation we
use dependency and constituency parsing together in this paper instead.
   A template is not generated directly from the parse tree of the question. The parse tree is first
transformed into an intermediate representation that captures the semantics of the
original question. The intermediate representation is DUDES [3], a variation of
Underspecified Discourse Representation Structures (UDRS) [4]. A template is then
formed from this DUDES.
   Each node of the dependency parse tree of the question has a corresponding
DUDES, which is the semantic representation of that node. Each DUDES also has
additional constituent constraints. The DUDES for domain-independent expressions
are defined manually beforehand, while the DUDES for domain-dependent expressions
are built automatically based on POS tags. Named entities are assigned resource
DUDES. Nouns are represented as either class DUDES or property DUDES.
Verbs are represented either as property DUDES or as empty DUDES, which assume that the
property slot comes from a noun elsewhere.
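   The POS-based assignment for domain-dependent expressions can be sketched as a simple lookup. This is a minimal illustration, assuming Penn Treebank tags and treating proper nouns as a stand-in for named entities. The names DudesKind and candidate_kinds are hypothetical, and the sketch returns only the kinds of DUDES a word may receive, not the DUDES themselves or their variants.

```python
from enum import Enum
from typing import List


class DudesKind(Enum):
    RESOURCE = "resource"
    CLASS = "class"
    PROPERTY = "property"
    EMPTY = "empty"  # property slot is expected to come from a noun elsewhere


def candidate_kinds(pos_tag: str) -> List[DudesKind]:
    """Map a Penn Treebank POS tag to the candidate DUDES kinds.

    Assumption: proper nouns (NNP, NNPS) approximate named entities.
    Domain-independent expressions are defined manually and are not covered.
    """
    if pos_tag in {"NNP", "NNPS"}:
        return [DudesKind.RESOURCE]
    if pos_tag in {"NN", "NNS"}:
        return [DudesKind.CLASS, DudesKind.PROPERTY]
    if pos_tag.startswith("VB"):
        return [DudesKind.PROPERTY, DudesKind.EMPTY]
    return []


for word, tag in [("produced", "VBD"), ("films", "NNS")]:
    print(word, [kind.value for kind in candidate_kinds(tag)])
```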
   The final DUDES representing the semantics of the question is formed by starting
from the bottom node of the dependency tree and merging its DUDES with the
DUDES of the node above if the dependency relations match. This forms a new DUDES,
which merges with the DUDES above it, and the process continues until no
DUDES remain to merge. An example of this merging process can be seen in Fig. 2. Since
a node can have more than one DUDES, more than one final DUDES
can be produced. This in turn leads to several templates being generated.
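   The way several candidates per node multiply into several final DUDES can be sketched with a bottom-up traversal. The sketch is a simplification under stated assumptions: a DUDES is reduced to a string label, merging is reduced to collecting labels, and the check that dependency relations match is omitted. The Node class, the tree shape, and the candidate lists are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import List, Tuple


@dataclass
class Node:
    word: str
    candidates: List[str]  # labels standing in for candidate DUDES
    children: List["Node"] = field(default_factory=list)


def merge_bottom_up(node: Node) -> List[Tuple[str, ...]]:
    """Return every merged result for the subtree rooted at node."""
    # Children are merged first, so results flow from the leaves upward.
    child_results = [merge_bottom_up(child) for child in node.children]
    merged = []
    for own in node.candidates:
        # One merged result per choice of candidate in each child subtree.
        for combo in product(*child_results):
            labels = (own,) + tuple(lbl for sub in combo for lbl in sub)
            merged.append(labels)
    return merged


# Assumed tree for "Who produced the most films?"
tree = Node(
    "produced",
    ["produced:property", "produced:empty-a", "produced:empty-b"],
    children=[
        Node("Who", ["who"]),
        Node("films", ["films:class"], children=[Node("the most", ["the-most"])]),
    ],
)

final_dudes = merge_bottom_up(tree)
for labels in final_dudes:
    print(labels)
```

   With one candidate at every node except the verb, the number of final results equals the number of verb candidates. A real implementation would discard combinations whose dependency relations do not match.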


Fig. 2. An example of DUDES merging for the question “Who produced the most
films?”


3      Template Ranking

In this section we continue to use the question “Who produced the most films?” as an
example for possible template scoring. This question leads to three possible DUDES,
mainly because the “produced” node has three possible DUDES
defined. In the first, the verb “produced” is interpreted as a property. The other
two assume that the property is contributed by a noun elsewhere. Taken alone,
“produced” has three possible DUDES, but with more
context we can see that the first DUDES is more likely to be correct. The context is
that when a verb is followed by “the most” and a noun, we can assume that the verb is
behaving as a property. An exception would be the verb “has”, which does
not behave as a property itself; the noun after it does. This is not an issue, since “has”
is a domain-independent expression that is defined beforehand, so such a
sentence will not be interpreted as a verb + “the most” + noun sentence.
   As an example of how context scoring can be used, suppose each DUDES starts with a
default score such as zero. Each DUDES also has grammar conditions that, if met,
increase its score, for example by adding one. The DUDES in which
“produced” is interpreted as a property would have a grammar condition such as:
if “the most” is followed by a noun, then when this DUDES merges with another DUDES, the
resulting DUDES has a higher score than the DUDES in which “produced” is
considered empty. In the example in Fig. 3, the DUDES for “films” and “the
most” merge into a “the most films” DUDES. This DUDES then merges with the
“produced” DUDES, and the resulting DUDES has a higher score. The final DUDES
score carries over to the generated template, where it is used later during query
filtering and ranking.
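   The scoring idea can be sketched as follows. This is a minimal illustration, not the implementation used in TBSL. It assumes Penn Treebank POS tags, a default score of zero, a bonus of one, and a single grammar condition (verb + “the most” + noun). The names FinalDudes, verb_the_most_noun, and score_dudes are hypothetical, and excluding forms of “have” in the pattern check is a simplification of the paper's treatment of “has” as a domain-independent expression.

```python
from dataclasses import dataclass
from typing import List, Tuple

Tagged = List[Tuple[str, str]]  # (word, POS tag) pairs


@dataclass
class FinalDudes:
    label: str
    verb_is_property: bool  # True if the verb was interpreted as a property
    score: int = 0          # default score


def verb_the_most_noun(tagged: Tagged) -> bool:
    """Check for the pattern: verb, "the", "most", noun."""
    for i in range(len(tagged) - 3):
        (w0, p0), (w1, _), (w2, _), (_, p3) = tagged[i:i + 4]
        if (
            p0.startswith("VB")
            and w0.lower() not in {"has", "have", "had"}  # assumed exclusion
            and w1.lower() == "the"
            and w2.lower() == "most"
            and p3.startswith("NN")
        ):
            return True
    return False


def score_dudes(dudes: FinalDudes, tagged: Tagged) -> FinalDudes:
    """Add one to the score when the grammar condition is met."""
    bonus = 1 if dudes.verb_is_property and verb_the_most_noun(tagged) else 0
    return FinalDudes(dudes.label, dudes.verb_is_property, dudes.score + bonus)


question = [("Who", "WP"), ("produced", "VBD"), ("the", "DT"),
            ("most", "JJS"), ("films", "NNS")]
candidates = [
    FinalDudes("produced:empty-a", False),
    FinalDudes("produced:property", True),
    FinalDudes("produced:empty-b", False),
]

# The score would be carried over to the template generated from each DUDES.
ranked = sorted((score_dudes(d, question) for d in candidates),
                key=lambda d: d.score, reverse=True)
for d in ranked:
    print(d.label, d.score)
```

   Because the score is attached before templates are generated, the later query filtering and ranking step can use it as one signal without re-parsing the question.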


Fig. 3. An example of possible template scoring for the question “Who produced the
most films?”


4         Conclusion

In this paper we proposed a possible method for scoring the templates generated
by a template-based QA system. We used the architecture of TBSL, which uses
DUDES as an intermediate semantic representation. The final DUDES
are scored based on context: certain natural language patterns are given higher
scores because they are more likely representations of the question. The
templates generated from the DUDES carry these scores, which are used later for query
filtering and ranking.

Acknowledgments. This work was supported by an Institute for Information &
communications Technology Promotion (IITP) grant funded by the Korea
government (MSIP) (No. 2013-0-00109, WiseKB: Big data based self-evolving
knowledge base and reasoning platform).


References
    1. Unger, C., Bühmann, L., Lehmann, J., Ngomo, A., Gerber, D., Cimiano, P.: SPARQL
       template-based question answering. In: The 21st International Conference on World Wide Web,
       (2012).
    2. Schabes, Y.: Mathematical and Computational Aspects of Lexicalized Grammars. PhD
       thesis, University of Pennsylvania, (1990).
    3. Cimiano, P.: Flexible semantic composition with DUDES. In: The Eighth International
       Conference on Computational Semantics, (2009).
    4. Reyle, U.: Dealing with ambiguities by underspecification: Construction, representation
       and deduction. Journal of Semantics, 10(2):123-179, (1993).