   Linguistic Modeling of Linked Open Data for
               Question Answering

               Matthias Wendt, Martin Gerlach, and Holger Düwiger

           Neofonie GmbH, Robert-Koch-Platz 4, 10115 Berlin, Germany
                    {wendt,gerlach,duewiger}@neofonie.de,
             WWW home page: http://www.neofonie.de/Forschung




       Abstract. With the evolution of linked open data sources, question
       answering regains importance as a way to make data accessible and
       explorable to the public. The triple structure of RDF data seems at
       the same time to predetermine question answering in its native
       subject-verb-object form. The devices of natural language, however,
       often exceed this triple-centered model. RDF does not preclude richer
       structures; rather, it is a matter of modeling. As part of a
       government-funded research project named Alexandria, we implemented an
       approach to question answering that enables the user to ask questions
       in ways that may involve more than binary relations.


Introduction
In recent years, the Semantic Web has evolved from a mere idea into a growing
environment of Linked Open Data (LOD)1 sources and applications. This is due in
particular to two current trends: the first is automatic data harvesting from
unstructured or semi-structured knowledge that is freely available on the internet,
most notably by the DBpedia project [1]. The second is the evolution of linked data
sources that allow collaborative editing, such as Freebase2. The growth of LOD gives
rise to a growing demand for means of semantic data exploration. Question Answering
(QA), being the natural device for querying things and acquiring knowledge, is a
straightforward way for end users to access semantic data.
    RDF3 and other languages for triple-centered models, which are often used to
model and describe linked data, seem to predetermine a specific way of thinking - and
of asking questions. Many RDF sources offer information in the form “X birth-place

  Copyright © 2012 by the paper’s authors. Copying permitted only for private and
  academic purposes. This volume is published and copyrighted by its editors.
  In: C. Unger, P. Cimiano, V. Lopez, E. Motta, P. Buitelaar, R. Cyganiak (eds.): Pro-
  ceedings of Interacting with Linked Data (ILD 2012), Workshop co-located with the
  9th Extended Semantic Web Conference, Heraklion, Greece, 28-05-2012, published
  at http://ceur-ws.org
1 See http://www.w3.org/wiki/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
2 http://www.freebase.com/
3 http://www.w3.org/TR/1999/REC-rdf-syntax-19990222/

Y” and “X birth-date Z”, etc. Of course, in natural language, we are used to
formulating more complex queries. It is natural to make statements like “X was born
in Y on Z”. While this does not matter as long as singular events like birth or death
are involved, things become more complicated as soon as events are involved that can
occur more than once. For example, the question “Who was married to Angelina Jolie
in 2006?” can only be answered if the temporal (and potentially limited) nature of a
relation like marriage is taken into account.
    In this paper we present the QA-driven ontology design behind Alexandria4, a
platform for exploring data of public interest that comprises a system for answering
questions in German. The domain consists of persons, organizations and locations as
well as works such as books, music albums, films and paintings. Moreover, the
Alexandria ontology is designed to hold information on events relating the various
resources, including temporal information and relations involving more than two
participants – so-called n-ary relations. We also describe the mapping algorithm used
in our question answering system and how it benefits from the ontology design.
    The ontology is built from and continuously updated with data primarily from
Freebase, with smaller parts from DBpedia, news feeds, and user generated content.


Related Work

Open-domain question answering is a topic of current research interest. Several
approaches to the subject are based on linguistic analysis of natural language
questions for generating queries against linked data.
    FREyA [5] and PowerAqua [9, 10] are both question answering systems that are to
a certain degree independent of the underlying ontology schema. Both systems work on
existing Linked Open Data as is and can be configured to use multiple ontologies. They
rely on rather shallow approaches to query mapping, in favor of portability and schema-
independence. However, this also limits them to the data structures and languages used
by the schemas (e.g., DBpedia does not support N-ary relations).
    There are also systems based on deeper, compositional mapping approaches. For
example, ORAKEL [3, 4] translates syntax trees constructed by lexicalized tree
adjoining grammars (LTAGs) into a representation in first order logic, which can be
converted to F-Logic [8] or SPARQL5 queries, depending on the target knowledge base.
ORAKEL also supports n-ary relations in principle. Though the system is very similar
to the one presented in this paper, it has not been shown to scale up to a large
data set.
    In contrast to other projects that use Linked Open Data for question answering,
our approach is an attempt to combine the advantages of the availability of huge LOD
sources with those of tailoring the T-Box to the use case of QA. While the latter
facilitates the fully automated mapping of natural language questions to SPARQL
queries, we trade off the possibility of using existing labels for T-Box entities,
which, combined with existing lexical resources such as WordNet6, GermaNet7, etc.,
would boost lexical coverage.
    Another difference to the above-mentioned projects is that the focus of Alexandria
is on answering questions in German, not English.
4 http://alexandria.neofonie.de/
5 http://www.w3.org/TR/rdf-sparql-query/
6 http://wordnet.princeton.edu/
7 http://www.sfs.uni-tuebingen.de/lsd/

Design of the Alexandria Ontology
The design of the Alexandria ontology was driven by practical demands of the
application as well as by linguistic considerations. According to [7], our approach
can be seen as a unification of the “Type 4” and the “Type 3” approaches to ontology
creation. The knowledge base has to meet the following requirements:

Linguistic Suitability The data model needs to be suitable for natural language
   question answering, i.e. mapping natural language parse tree structures onto our
   data must be possible.
LOD Compatibility Compatibility with existing LOD sources like Freebase and
   DBpedia needs to be maintained in order to facilitate mass data import for
   practical use.
Scalability Large amounts of data need to be stored, maintained and updated while
   keeping the time for answering a question at minimum.

    One of the major aspects relating to linguistic suitability in the Alexandria use case
is that its target domain goes beyond what we refer to in the following as attributive
data, i.e. data about things that are commonly known as named entities like persons,
organizations, places, etc. In addition, the domain was designed to contain what we
call eventive data, i.e. (historic) events and relations to participants within them.
    As mentioned above, there are certain relations, such as birth, where this distinction
is not important, because n-ary relations consisting of unique binary parts (like place
and date of birth) can be covered by joining on a participant (the person) as proposed
in [4]. The distinction between eventive and attributive data becomes important when
relations are involved that (may) occur repetitively and/or include a time span.
Questions like “Who was married to Angelina Jolie in 2001?” and “Which subject
did Angela Merkel major in at the German Academy of Sciences?” can no longer be
generally answered by joining binary facts, as the following sketch illustrates.
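    Consider a purely binary encoding of two marriages of the same person; the
following Turtle is a hypothetical sketch, with made-up property names that do not
occur in our ontology:

res:Person_A dom:spouse      res:Person_B , res:Person_C ;
             dom:weddingYear "2000"^^xsd:gYear , "2005"^^xsd:gYear .

Nothing in these triples records which wedding year belongs to which spouse, so a
question that constrains the marriage to a particular year cannot be answered
reliably.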
    It is possible to model such eventive n-ary facts as proposed in Pattern 1, use case
3 of the W3C Working Group Note on N-ary Relations on the Semantic Web8 . This
approach is also close to the semantic model advocated in Neo-Davidsonian theories
[12], where participants in an event are connected to the event using roles.
    As for the aspect of LOD compatibility, it is our aim to access existing large-
scale sources to populate our knowledge base. DBpedia was the first LOD source to
retrieve and constantly update its data repository by crawling Wikipedia9. Apart from
the possibility for end users to add and update information, the majority of the data
contained in Freebase is obtained from Wikipedia as well. Therefore, using one (or
both) of these sources is an obvious starting point for harvesting information on the
broad range of popular entities required by Alexandria.
    However, though DBpedia contains much valuable attributive data for entities of
our interest, it does not offer eventive information as described above, nor does its
T-Box provide a model for adding such n-ary facts.
    As opposed to DBpedia, which relies on the RDF standard, Freebase implements
a proprietary format. Whereas in RDF, all information is abstractly represented by
triples, Freebase abstractly represents information as links between topics. The Free-
base data model incorporates n-ary relations by means of Compound Value Types10 ,
8 http://www.w3.org/TR/swbp-n-aryRelations/
9 http://www.wikipedia.org/
10 http://wiki.freebase.com/wiki/Compound_Value_Type

also called “Mediators”. A mediator links multiple topics and literals to express a single
fact.
    So Freebase’s data model suits our requirements, but we need to use RDF in order
to use Virtuoso Open Source Edition11, which has proven to scale well for both
loading and querying the amounts of data we expected.
    Using the Freebase query API to pull a set of topics and links, an RDF based
knowledge base can be built according to the Neo-Davidsonian model. The API also
supports querying link updates for continuously updating the knowledge base.
    There are straightforward mappings of Freebase topic and mediator types onto
OWL12 classes, and of Freebase link types onto OWL properties. For example, a
marriage relation is imported from Freebase as follows:
nary:m_02t82g4 rdf:type     dom:Marriage ;
               dom:spouse   res:Angelina_Jolie ;
               dom:spouse   res:Brad_Pitt ;
               alx:hasStart "2005"^^xsd:gYear .
The subject URI is generated from the Freebase mediator ID. The resource URIs are
generated from Freebase topic names with some extra processing and stored
permanently for each Freebase topic ID.
    We differentiate the following three layers of our ontology (which correspond to
three namespace prefixes alx:, dom:, and res: that appear in the examples):

The Upper Model (alx:) contains the abstract linguistic classes needed for a
   language-, domain- and task-independent organization of knowledge. The Alexandria
   upper model is inspired by [6].
The Domain Model (dom:) contains the concrete classes and properties for entities,
   events and relations of the modeled domain (e.g., Marriage, Study) as subclasses
   and subproperties of the upper model classes and properties. It provides the
   domain-specific distinctions which are necessary for the task of question
   answering.
The A-Box (res:) consists of all “resources”, i.e. entity, event and relation
   instances, known to Alexandria.

    The examples of Angela Merkel’s education and Angelina Jolie’s and Brad Pitt’s
marriage, which we used above, would be represented as shown in Table 1.
    Syntactically, we model our domain concepts as OWL subclasses of one or more
upper model concepts and our domain properties as OWL subproperties of one or
more upper model properties, where the latter correspond to the Neo-Davidsonian
roles mentioned earlier. We can then obtain hints to the upper model concepts and
roles of interest by mapping question verbs onto domain concepts and then try to
match the roles defined for the upper model classes to the respective given parts of the
question.
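    For illustration, the marriage example could be declared in the T-Box along the
following lines. This is a minimal sketch; the exact upper model names are
assumptions extrapolated from the examples in this paper:

dom:Marriage rdfs:subClassOf    alx:AttributiveRelation ,
                                alx:TemporalRelation .
dom:spouse   rdfs:subPropertyOf alx:hasCarrier ,
                                alx:hasAttribute .

Declaring dom:spouse under both roles reflects the symmetric modeling of marriage
shown in Table 1, where carrier and attribute may be swapped.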


Putting the Model in Action: Question Answering
As mentioned above, one of the major design goals of our ontology schema was to
stay reasonably close to the phenomena and structure of natural language. Achieving
11 http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/
12 Web Ontology Language, builds on RDF, see http://www.w3.org/TR/owl2-overview/

Upper model          Domain          U.m. role   Domain         Participant
concept              concept         props.      props.
agentive process     study           agent       student        Angela Merkel
effective process    study           affected    subject        Quantum Chemistry
locative relation    study           location    institution    German Academy of
                                                                Sciences
                                     located     student        Angela Merkel
attributive rel.     marriage*       carrier     spouse         Brad Pitt
                                     attribute   spouse         Angelina Jolie
temporal concept     marriage        start       wedding date   2005
*
  In this example, marriage is modeled as a symmetric relation, expressing a spouse as
attribute of the other spouse, i.e. carrier and attribute may be swapped.
                          Table 1. Upper and domain model




this facilitates the mapping of a natural language question to a SPARQL graph
pattern that conveys the information need expressed in the question. The basic idea
of the translation algorithm is to treat the mapping of natural language to SPARQL
as a graph mapping problem.
    From a linguistic viewpoint, the syntactic structure of a sentence may be
represented in the form of a dependency tree, as obtained by the application of a
dependency parser. A dependency graph is formalized as follows:
    Given a set L of dependency labels, a dependency graph for the sentence x =
w1 . . . wn is a directed graph D = (VD , ED ) with:

 1. VD = {w0 , w1 , . . . , wn } a set of vertices and
 2. ED ⊆ VD × L × VD a set of labeled edges

     The vertices are composed of the tokens in the sentence plus an artificial root node
w0 . A well-formed dependency graph is a tree rooted at w0 .
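    For the running example “Mit wem ist Angelina Jolie seit 2005 verheiratet?”
(cf. Figure 1), a dependency graph, written as a set of labeled edges, might look as
follows. This is a simplified reconstruction consistent with the label set used
below; the actual attachments depend on the trained parser:

     {(w0 , ROOT, ist), (ist, SB, Jolie), (ist, PD, verheiratet), (ist, MO, mit),
      (mit, PNK, wem), (ist, MO, seit), (seit, PNK, 2005)}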
     Likewise, the structure of a SPARQL Select query basically consists of a graph
pattern (in the Where clause) and a projection. Given a set of variable names NV (?x,
?y . . . ), the set of concept names NC , a set of role names NP , a set of resource names
NR and a set of literals NL defined by the ontology, we define a SPARQL Select Graph
G = (VG , EG , PG ) as:

 1. VG ⊆ NV ∪ NR ∪ NC ∪ NL a set of vertices
 2. EG ⊆ (NV ∪ NR ) × NP × VG a set of labeled edges
 3. the projection PG ⊆ NV

    Formally, we define the translation as a mapping f (D) of a dependency graph D
to a SPARQL Select Graph G.
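    For the running example “Mit wem ist Angelina Jolie seit 2005 verheiratet?”,
the intended output of f is a query along the following lines. This is a sketch
assembled from the vocabulary of the examples in this paper, not necessarily the
exact query generated by the system:

     SELECT ?x WHERE {
       ?e a dom:Marriage ;
          alx:hasCarrier   res:Angelina_Jolie ;
          alx:hasAttribute ?x ;
          alx:hasStart     "2005"^^xsd:gYear .
       ?x a dom:Person .
     }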


Linguistic Processing

The dependency graph is the result of the application of a linguistic analysis to the
input sentence. An example of a resulting dependency structure may be found on the

left in Figure 1. The analysis consists of tokenization, POS tagging13 and dependency
parsing. Dependency parsing is conducted using the MaltParser [11], which was trained
on the German Tiger corpus14 [2]. The corpus has been slightly adapted by adding a
small sub-corpus of German questions and a minor change to the set of role labels used.
     To normalize surface form variation and identify morphosyntactic features,
lemmatization and morphological analysis are applied to each of the tokens. This is
roughly illustrated by the lemmata in square brackets at the verbal nodes (e.g.
“verheiratet” has the lemma “verheiraten”).


Compositional Semantics




Fig. 1. Dependency parse of the sentence “Mit wem ist Angelina Jolie seit 2005 ver-
heiratet?” and examples of the lexicalization (1) for a subset of its nodes, and the
application of the actions BIND (2) and MERGE (3).


    The mapping of the dependency graph to the SPARQL query is largely done in
two steps: lexicalization and composition. By lexicalization we refer to the process of
mapping tokens (lemmata) or multi-word units to corresponding ontological units.
    We refer to the identification of resources (identified by resource URIs) of the
A-Box as lexical named entity identification. For this, we make use of the title (the
name of an entity) and the alternative names (consisting of synonyms and different
surface forms) that are imported from Freebase into a Lucene15 index, which contains
the Alexandria resource URI (e.g. res:Angelina_Jolie) and the OWL classes it belongs
to. While the user enters a question, matching entities are looked up in the index
based on the whole words entered so far, and a disambiguation choice is continuously
updated. The user can select from the found entities at any time, whereupon the
respective part of the question is updated.
    The second noteworthy component in lexical named entity identification is the
identification of dates (and time). For these, we have adapted the open source date
parser provided by the Yago project16 to German.
13 For German tokenization and POS tagging we use OpenNLP with some pre-trained
   models (http://incubator.apache.org/opennlp/).
14 http://www.ims.uni-stuttgart.de/projekte/TIGER/
15 http://lucene.apache.org/
16 http://www.mpi-inf.mpg.de/yago-naga/javatools/

    All other linguistic tokens or configurations (linguistic units) corresponding
to T-Box concepts are mapped using hand-crafted lexica.
    The complete set of mappings for the question shown in Fig. 1 is shown in Table 2.


Linguistic unit    Ontological unit      Mapping type   Source
“wer”              dom:Person            T-Box class    Custom lexica
“mit”              alx:hasAttribute      T-Box role     Custom lexica
“Angelina Jolie”   res:Angelina_Jolie    A-Box URI      Lucene
“2005”             "2005"^^xsd:date      Literal        Date/literal parser
“verheiraten”      dom:Marriage          T-Box class    Custom lexica
“sein”             owl:Thing             T-Box class    Custom lexica

                     Table 2. Types of lexical mappings


    Our syntax-semantics mapping is largely done by composing the lexical semantic
entries attached to each dependency node. This lexicalized approach employs the
notion of a semantic description. A semantic description represents the semantic
contribution of a dependency node or (partial) dependency tree and encodes obligatory
semantic complements (slots). During composition, the slots are filled by semantic
descriptions (properties) until the semantic description is satisfied.
    By virtue of the lexical mapping, each linguistic unit is mapped to a set of
semantic descriptions, also called readings.
    Given a set of variable names NV , the set of concept names NC , a set of role
names NP , a set of resource names NR and a set of literals NL defined by the
ontology, a semantic description S of an ontological entity n ∈ NV ∪ NL ∪ NR is
defined as a five-tuple S = (n, c, Sl, P r, F l) with:

 1. c ∈ NC the concept URI of the semantic description
 2. Sl = [r1 , r2 , . . . , rn ] an ordered list of slots (ri ∈ NP )
 3. P r = {(p1 , S1 ), . . . , (pm , Sm )} the bound properties, with Sj a semantic
    description and pj ∈ NP
 4. F l ⊆ {proj, asc, desc} a set of flags, with proj indicating that n is to be
    part of the projection of the output graph

    For convenience, we define the following access functions for a semantic
description S = (n, c, Sl, P r, F l):

 1. node(S) = n
 2. pred(S) = {p | (p, o) ∈ P r}

    A semantic description S = (n, c, Sl, P r, F l) is well-formed if the bound
properties and the slots are disjoint, i.e. pred(S) ∩ Sl = ∅, and all bound
properties are uniquely bound, i.e. for all (p, o), (p, o1 ) ∈ P r it holds that
o = o1 .
    By definition, there is a strong correlation between a semantic description and
a SPARQL Select query. A SPARQL Select query can be built recursively from a semantic
description S = (n, c, Sl, P r, F l), starting from an initially empty input graph
G = (VG = ∅, EG = ∅, PG = ∅):

     toSPARQL(S, G) : G
     -- with S = (n, c, Sl, P r, F l) and G = (V, E, P )
     V ← V ∪ {n}
     E ← E ∪ {(n, a, c)}
     if proj ∈ F l then P ← P ∪ {n}

     foreach (pi , oi ) in (p1 , o1 ), . . . , (pm , om ) = P r
     begin
        E ← E ∪ {(n, pi , node(oi ))}
        G ← toSPARQL(oi , G)
     end
     return G


    To give an example in an informal notation, the linguistic units of the sentence
“Mit wem ist Angelina Jolie seit 2005 verheiratet?” are displayed in Table 3. The
first column shows the linguistic unit; the ontological entity described corresponds
to the variable or resource URI in the second column. The prefix ?! in a variable
designation (e.g. ?!x) is equivalent to the flag proj, denoting that the variable
will be part of the projection, i.e. proj ∈ F l. Note that the verbal nodes
“verheiratet” and “sein” are each mapped to a (distinct) variable ?e, which
corresponds to the Neo-Davidsonian event variable e.
    Slots and bound properties are displayed in the third column. A slot is
designated by an argument that is just a ?, whereas a bound variable is denoted by
the prefix ? followed by a lower case letter (e.g. ?v). In the example, the semantic
descriptions for “verheiratet” and “mit” contain the slots alx:hasCarrier (verb only)
and alx:hasAttribute.


         “Angelina Jolie” res:Angelina Jolie a dom:Person
         “seit 2005”      ?e                   a alx:TemporalRelation ;
                                               alx:hasStart 2005 .
         “mit”            ?e                   a alx:AttributiveRelation ;
                                               alx:hasAttribute ?
         “wem”            ?!x                  a dom:Person
         “sein”           ?e                   a alx:AttributiveRelation
         “verheiratet”    ?e                   a dom:Marriage ;
                                               alx:hasCarrier ? ;
                                               alx:hasAttribute ?
                     Table 3. Semantic descriptions of lexical units.




Putting it all Together

The composition algorithm employs a fixed set of two-place composition operators,
called actions. An action maps two semantic descriptions related by an edge in the
dependency graph to a composed semantic description, corresponding to the semantics
of the covered subgraph of the dependency tree.
    The two most important actions are BIND and MERGE. These two basic operations
on the semantic descriptions involved in the composition intuitively correspond to
(1) the mapping of syntactic roles to semantic roles (otherwise called semantic role
labeling) and (2) the aggregation of two nodes into one in the output graph pattern.
Given two semantic descriptions S1 = (n1 , c1 , Sl1 , P r1 , F l1 ) and
S2 = (n2 , c2 , Sl2 , P r2 , F l2 ), the semantic operators are defined as follows:


   BIND(S1 , S2 )  =  (v, c1 , Sl1 − {max(Sl1 )}, P r1 ∪ {(max(Sl1 ), S2 )}, F l1 )
                          if range(max(Sl1 )) ⊓ c2 ≠ ⊥
                      NULL otherwise

   MERGE(S1 , S2 ) =  (v, lcs(c1 , c2 ), Sl1 ∪ Sl2 , P r1 ∪ P r2 , F l1 ∪ F l2 )
                          if c1 ⊓ c2 ≠ ⊥ and pred(S1 ) ∩ pred(S2 ) = ∅
                      NULL otherwise

Here v denotes the node of the composed description; in the case of MERGE, the
nodes n1 and n2 are aggregated into v.
    The function lcs(c1 , c2 ) gives the least common subsumer of two concepts, i.e.
a concept c such that c1 ⊑ c and c2 ⊑ c, and for all e with c1 ⊑ e and c2 ⊑ e it
holds that c ⊑ e. The semantic role labeling implemented by BIND depends on a total
order of semantic roles, which has to be configured in the system, e.g.:
    alx:hasAgent > alx:hasAffected > alx:hasRange
    This order determines the ordering of the slots Sl in a semantic description. It
stipulates a hierarchy over the semantic roles of an n-ary node in the SPARQL graph
pattern. It is reflected by an ordering over the syntactic role labels which is roughly
equivalent to the linguistic notion of an obliqueness hierarchy [13], for example:
    SB > OC > OC2
    Formally, the obliqueness hierarchy defines a total order >L over the set of
dependency labels L.
    For the composition, each of the labels in the label alphabet is assigned one of
the semantic operators; the algorithm applies the operator assigned to the respective
edge label to build the composition. The following table shows an excerpt of this
mapping:

     SB     OC     PNK    MO      PD      PUNC
     BIND   BIND   BIND   MERGE   MERGE   IGNORE
    The action IGNORE simply skips the interpretation of the subtree. The composition
algorithm iterates over the nodes in the dependency graph in a top-down manner, for
each edge applying the action defined for the edge label pairwise to each reading of
the source and target node.
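    To illustrate on the running example (cf. Figure 1 and Table 3), and assuming
the simplified parse sketched earlier, the composition proceeds roughly as follows;
the exact order of the steps depends on the parse and the configured hierarchies:

     BIND along (mit, PNK, wem):    fills the slot alx:hasAttribute of “mit”,
                                    yielding ?e a alx:AttributiveRelation ;
                                    alx:hasAttribute ?!x .
     MERGE along (ist, MO, mit):    aggregates this description with the verbal
                                    description of ?e, uniting slots and bound
                                    properties.
     BIND along (ist, SB, Jolie):   fills the remaining slot alx:hasCarrier,
                                    yielding ?e alx:hasCarrier res:Angelina_Jolie .

Once all slots are filled, the accepted reading of ?e corresponds to the graph
pattern of the query sketched in the previous section.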
    The algorithm works in a directed manner by sorting the outgoing edges of each
node in the dependency graph according to a partial order ≥D .

                           (v1 , l1 , w1 ) ≥D (v2 , l2 , w2 ) ⇔ l1 >L l2

    This ordering stipulates a hierarchy over the syntactic arguments in the
dependency graph that is reflected by a total ordering >W on the role labels of the
SPARQL pattern graph. The correspondence between these orders controls the order in
which the graph is traversed and therefore, in particular, the correlation of
syntactic and semantic roles (semantic role labeling).
    The mapping is implemented by the transformation algorithm sketched below. It
takes as input a dependency graph D = (VD , ED ), with the root node w0 as the
initial current node c in the graph traversal. The nodes are traversed in the order
of the hierarchy to ensure the correct binding. Note that the transformation may have
multiple semantic descriptions as its output. An output semantic description
S = (n, c, Sl, P r, F l) is only accepted if all of its slots Sl have been filled. We
then apply the toSPARQL operation to arrive at the final SPARQL query.

      transform(D, c) : S
      -- c: the current node, initially the root w0
      S ← readings(c) : the current set of readings for c

      foreach (c, l, v) in sort(outgoing(c), ≥D )
      begin
         Rv ← transform(D, v)
         S' ← ∅
         foreach (rc , rv ) in S × Rv
         begin
             s ← apply(operator(l), rc , rv )
             if s ≠ NULL
                  S' ← S' ∪ {s}
         end
         S ← S'
      end
      return S


Results
The n-ary modeling requires more triples for simple (binary) facts than using plain
RDF/OWL properties as DBpedia does, because there is always an instance of a relation
concept comprising rdf:type and participant role triples. The marriage example above,
for instance, takes four triples where a single binary spouse property would do.
     At the time of writing, the Alexandria ontology contained approx. 160 million
triples representing more than 7 million entities and more than 13 million relations
between them (including literal value facts like amounts, dates, dimensions, etc.).
We imported the triples into Virtuoso Open Source Edition, which scales as expected
with respect to our goals.
     80% of all query types understood by the algorithm (i.e. mappable onto valid
SPARQL queries) take less than 20 ms on average. This comprises single-threaded
linguistic processing on a 64-bit Linux system running on Intel Xeon E5420 cores at
2.5 GHz, and pure in-memory SPARQL processing by Virtuoso Open Source Edition on a
64-bit Linux system running on eight Intel Xeon L5520 cores at 2.3 GHz with 32 GB
of RAM.
     The question answering system works fast enough to be used in a multi-user Web
frontend like http://alexandria.neofonie.de.
     The performance of the algorithm is in part due to the speed of the MaltParser
with a liblinear model, which parses a question in less than 5 ms. By using a
liblinear model, however, we trade off parsing accuracy against performance in terms
of processing time per question. This sometimes becomes noticeable in cases where
subject-object order variation in German leads to an erroneous parse.
     The performance of the question answering system has been measured using the
training set of the QALD-2 challenge17. As question answering in Alexandria currently
covers only German, all 100 questions were first translated to German. The results
are shown in Table 4. For 49 of the questions, no query could be generated. The
second row shows the results for the questions for which a SPARQL query could be
generated.
     It has to be noted that the results provided in the gold standard rely on the
DBpedia SPARQL endpoint. As Alexandria is built upon its own schema and the imported
17 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/index.php?x=challenge&q=2

 Answer set                                  avg. precision avg. recall avg. f-measure
 All answers                                      0.25         0.27         0.25
 Generated answers                                0.49         0.52         0.48
 Answers without data mismatch                    0.59         0.57         0.57
 Generated answers without data mismatch          0.92         0.89         0.89
Table 4. Quality of results of the Alexandria question answering on the QALD-2
training set translated to German.




data comes from Freebase instead of DBpedia, the comparability of the results is
limited. Comparing both datasets reveals various mismatches. For example, questions
having a set of resources as answer type are compared via the indirection of labels.
This works only because both Freebase and DBpedia extract their labels from
Wikipedia; however, some of the labels have been changed during the mapping.
    Overall, we have identified the following error types:

 1. different labels for the same entities
 2. different number of results for aggregate questions
 3. query correct but different results
 4. training data specifies “out of scope” where we can provide results
 5. question out of scope for Alexandria

    Type 1 applies particularly often to the labels of movies, most of which are of
the form “Minority Report (film)” in DBpedia but “Minority Report” in Alexandria.
Another source of errors (type 2) arises when aggregate questions (involving a count)
retrieve a different number of resources. The question “How many films did Hal Roach
produce?”, for example, yields 509 results in DBpedia and 503 results in Alexandria.
The third type corresponds to a difference in the data sets themselves, i.e. when
different information is stored. For example, in Alexandria the highest mountain is
“Mount Everest”, whereas in DBpedia it is “Dotsero”.
    The last two error types involve questions that are out of scope (4 and 5). The
data model used in Alexandria differs from the model in DBpedia as a result of the
considerations explained above. On the other hand, Alexandria lacks information,
since we concentrate on a mapped subset of Freebase. According to the evaluation,
the answer “out of scope” is correct if the question cannot be answered using DBpedia.
    Out of the 82 questions containing (any) erroneous results, 63 belong to one of
the error classes mentioned above. The last two rows of Table 4 show the results for
all questions that do not belong to any of these error classes.


Acknowledgements

Research partially funded by the German Federal Ministry for Economics and
Technology as part of the THESEUS research program18.
18 http://theseus-programm.de/en/about.php

References
 1. Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R.,
    Hellmann, S.: DBpedia: a crystallization point for the Web of Data. Web Semantics:
    Science, Services and Agents on the World Wide Web 7(3) (2009)
 2. Brants, S., Dipper, S., Hansen, S., Lezius, W., Smith, G.: The TIGER Treebank.
    In: Proc. of the Workshop on Treebanks and Linguistic Theories (2002)
 3. Cimiano, P.: ORAKEL: A Natural Language Interface to an F-Logic Knowledge
    Base. In: Proc. of the 9th International Conference on Applications of Natural
    Language to Information Systems (NLDB) (2004) 401–406
 4. Cimiano, P., Haase, P., Heizmann, J., Mantel, M.: Orakel: A portable natural
    language interface to knowledge bases. Technical Report, University of Karlsruhe
    (2007)
 5. Damljanovic, D., Agatonovic, M., Cunningham, H.: FREyA: an Interactive Way
    of Querying Linked Data using Natural Language. In: Proc. of QALD-1 at
    ESWC 2011
 6. Elhadad, M., Robin, J.: SURGE: a Comprehensive Plug-in Syntactic Realization
    Component for Text Generation. Technical Report (1998)
 7. Hovy, E.: Methodologies for the Reliable Construction of Ontological Knowledge.
    In: Proc. of ICCS (2005) 91–106
 8. Kifer, M., Lausen, G., Wu, J.: Logical foundations of object-oriented and frame-
    based languages. Journal of the ACM 42 (1995) 741–843
 9. Lopez, V., Motta, E., Uren, V.: PowerAqua: Fishing the Semantic Web. In: Proc.
    of ESWC (2006) 393–410
10. Lopez, V., Nikolov, A., Sabou, M., Uren, V., Motta, E., d’Aquin, M.: Scaling Up
    Question-Answering to Linked Data. In: Cimiano, P., Pinto, H. (eds.): Knowledge
    Engineering and Management by the Masses. LNCS 6317 (2010) 193–210
11. Nivre, J., Hall, J.: MaltParser: A language-independent system for data-driven
    dependency parsing. In: Proc. of the 4th Workshop on Treebanks and Linguistic
    Theories (2005) 13–95
12. Parsons, T.: Events in the Semantics of English: A study in subatomic semantics.
    MIT Press (1990)
13. Pollard, C., Sag, I.: Information-based Syntax and Semantics, Vol. 1. CSLI Lecture
    Notes 13 (1987)