Vol-2367/paper_10 · https://ceur-ws.org/Vol-2367/paper_10.pdf · dblp: https://dblp.org/rec/conf/gvd/SchrageM19
           Towards an Ontology-Driven Evolutionary
      Programming-Based Approach for Answering Natural
             Language Queries against RDF Data

                        Sebastian Schrage                                                 Wolfgang May
         Georg-August University of Göttingen, Institute                  Georg-August University of Göttingen, Institute
                   of Computer Science                                              of Computer Science
               schrage@cs.uni-goettingen.de                                      may@cs.uni-goettingen.de

31st GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 11.06.2019 - 14.06.2019, Saarburg, Germany. Copyright is held by the author/owner(s).

ABSTRACT
In this paper, we present an ontology-driven evolutionary learning system for natural language querying of complex relational databases or RDF graphs. It gives users who are not familiar with formal database query languages the opportunity to express complex queries against a database. The approach learns how to arrange given functions, and when to use them, in order to process Natural Language Queries (NLQs).

1. INTRODUCTION
Natural language interfaces to databases (NLIDB) are likely the easiest way for a user to access a database. They require the user to learn neither the specific query language nor the schema or ontology of the data set. This lack of knowledge, however, must be compensated by the interface. It not only has to understand the user input and extract the information from the natural language query (NLQ); the user may also have a different concept in mind than the one implemented. Such deviations range from small differences in vocabulary, abbreviations, or incomplete names, over ambiguous formulations, to relationships that are not in the model or several entities fused into one. Therefore, an NLIDB should be flexible enough to allow users to operate on their own concepts, not on those of the implementer.
The approach consists of two major parts: the evolutionary agent framework, loosely based on the work of Turk [11] and Hoverd et al. [2], and the NLQ-to-SPARQL application of this framework, which uses NLP data preprocessed by Stanford's CoreNLP [3] and ontology-based methods to translate an NLQ into a SPARQL query against a given RDF database that is described by an ontology. According to the definition by Vikhar [12], the framework can be categorized as evolutionary programming: unlike in genetic algorithms, the structure of the subprograms is fixed and only their execution order can differ, and unlike in evolution strategies, the data type of the solution is not limited to a numeric vector. Due to the nature of the NLQ, the task can easily be decomposed into several layers of sub-objectives; therefore a multiobjective evolutionary algorithm (MOEA) approach, as extensively investigated by Li et al. [13], is used. If the system is not able to produce the correct solution, it can be trained with further example queries of this kind together with corresponding SPARQL queries, and the framework extends the model via the evolutionary learning algorithm to learn this new kind of query.

2. RELATED WORK
Basically, there are two environments in which NLIDB systems are developed: Knowledge Graphs (KGs) like DBpedia [1], and smaller data sets like Mondial [7]. For KGs with huge amounts of data and entities but no reliable, well-defined ontology, approaches based on predefined graph pattern matching, like that of Steinmetz et al. [10], or pattern learning approaches like STF by Hu et al. [9], which rely less on ontologies, have shown the most success.
On the other hand, for smaller data sets with well-defined ontologies, which are also the scope of this work, approaches that focus more on ontology usage, like Athena by Saha et al. [6], or on schema usage, like Precise by Popescu et al. [4], have shown the better results. Both approaches first analyze the NLQ, then assign values to recognized parts representing how confident those parts are considered, and then try to connect them into a minimal graph that spans all parts considered evident, with edges weighted according to the confidence values, or with high penalties if no connection is found at all.
Structure of the paper: Next, a short overview of the system architecture is given, followed by a description of the learning framework. Then, the NLQ-to-SPARQL application is discussed, with some example queries. If the approach could not answer a question correctly, a brief explanation why is given. Finally, a brief conclusion is given.

3. SYSTEM OVERVIEW
This approach is based on evolutionary programming. The central component is an agent whose input is obtained by preprocessing the NLQ with CoreNLP [3] and which outputs a SPARQL query (cf. Figure 1).
The system is initialized with an ontology that covers the application domain. The ontology, given as an OWL ontology, is analyzed by RDF2SQL [5], and the results are stored in the Semantical Data Dictionary, a collection of relational tables stored in an SQL database.

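The overall runtime flow, preprocessing an NLQ and handing the result to an agent that emits SPARQL (cf. Figure 1), can be illustrated with a toy pipeline. All names, the trivial tagging, and the naive token-to-pattern translation below are illustrative stand-ins, not the system's actual implementation:

```python
# Hypothetical sketch of the NLQ -> preprocessing -> agent -> SPARQL pipeline.
from typing import NamedTuple

class Token(NamedTuple):
    string: str      # surface form
    lemma: str
    position: int
    pos_tag: str     # part-of-speech tag
    ner_tag: str     # named-entity tag

def preprocess(nlq: str) -> list[Token]:
    """Stand-in for the CoreNLP preprocessing (POS, NER, dependencies)."""
    return [Token(w, w.lower(), i, "NNP" if w[0].isupper() else "NN", "O")
            for i, w in enumerate(nlq.rstrip("?").split())]

def agent(tokens: list[Token]) -> str:
    """Stand-in for the learned agent: maps preprocessed tokens to SPARQL."""
    classes = [t.lemma for t in tokens if t.pos_tag == "NN"]
    patterns = " . ".join(f"?x{i} rdf:type :{c}" for i, c in enumerate(classes))
    return f"SELECT * WHERE {{ {patterns} }}"

query = agent(preprocess("Which countries border France?"))
```

The real system replaces both stand-ins: CoreNLP produces the token annotations, and the learned agent graph produces the query fragments.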

When an NLQ is asked, it is first processed with CoreNLP [3] using its part-of-speech module, its entity recognition module, and its grammatical dependencies module. Then the preprocessed NLQ is given to an agent which returns a SPARQL query that can be stated against an RDF data store or further processed by OBDA applications.
As depicted in Figure 1, at runtime there is a single agent. During the learning phase, there are multiple agents, and the learning phase results in the “fittest” agent for a given learning set, as described in the following section.

4. LEARNING FRAMEWORK
The structure of agents that are subject to evolutionary programming has been developed accordingly: the inner structure of an agent consists of application-specific nodes. There are different node types, and of each type there may be multiple instances. The general idea of the node types is to provide a set of operations which might be useful for solving the task; it is not known which of them are needed, in which order and in which cases they must be executed, and with which settings, to reach the objectives. Additionally, there are connections between nodes for the data flow inside the agent. The information flow is handled in so-called products, which are an application-specific predefined encapsulation of arbitrary data types. Which kinds of products, and how many at the same time, are accepted by a node is type-specific. An agent is a network of such nodes (for an example see Figure 2), and the computed solution is returned as a set of products.

4.1 General Notions

4.1.1 Agent Configuration
The configuration, i.e. the concrete internal structure of an agent, implements its functionality as the cooperation of the nodes. It is a directed graph (which may contain cycles) consisting of a set of nodes and a set of connections. There are input nodes, a single output node, and inner, processing nodes. The graph must be connected, i.e. no isolated fragments are allowed.

4.1.2 Nodes
All nodes have the same general structure: each node n of type t has at least one input or one output conduit, usually one or several of both. The conduits are typed according to which kind of data, called products (cf. Section 5.2), they communicate. The product types are organized in a class hierarchy. The input conduits are enumerated as in_1, in_2, ... with types type(in_i); the output conduits are enumerated as out_1, out_2, ... with types type(out_j). There may be several input conduits with the same product type. A node has one or more output conduits for every product type that it can produce (which in turn can be connected to multiple inputs). A node can generate one or more results of one or more product types.
Every node type t implements a certain functionality, which satisfies a certain signature wrt. its inputs in_1, ..., in_c(t) and outputs out_1, ..., out_d(t) (i.e., c(t) and d(t) are the in- and outdegree of node type t, respectively):
f_t(type(in_1), ..., type(in_c(t))) → (type(out_1)*, ..., type(out_d(t))*),
where the * means that there might be zero, one, or more elements (e.g., if the node implements a conditional, one out of two outputs will be set; if a node cannot do anything useful with the current inputs, no output might be generated; and if a list is split, the (only) output is fed with the sequence of all its elements). From a practical point of view, the output can also be seen as a set of elements of arbitrary product types.
Every output conduit can be connected to multiple input conduits, and every input conduit can have multiple incoming connections from output conduits.
The used product types and the concrete functionality of the node types (f_t, including the number of input and output conduits and their product types) depend on the application. The structure of the agent as a graph of nodes of these node types, with conduits connecting compatible outputs and inputs, is subject to learning. Usually, learning starts with a concrete proposal of a standard agent, which is then improved during the learning process.

4.2 The Evolutionary Process
The evolutionary process controls the evolution of agents in order to improve their competences. It starts with a set of mutated standard agents. Then, in an iterative process, the agents have a chance to change their configuration every time they reproduce. The basic idea from [2] is that each solution to each problem is assigned an amount of energy. An agent gets energy for correct solutions of the problems. With a growing population of agents, the energy is divided among more agents, which pressures them to win more energy overall and suppresses unlimited growth in numbers.

4.2.1 Stepwise Evolution
The framework is organized as a sequence of runs. There is a fixed training set T provided by the user, consisting of test pairs t = (p_t, sol_t) of a problem p_t and a corresponding solution. The solutions, and often also their components, are assigned an initial energy (= value).

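The agent configuration of Section 4.1 — a directed graph of typed nodes whose conduits carry products — might be sketched as follows. All class and field names are illustrative assumptions, not the system's data model:

```python
# Sketch of an agent as a graph of typed nodes with typed conduits.
from dataclasses import dataclass, field

@dataclass
class Node:
    node_type: str
    in_types: list   # product types accepted on in_1, in_2, ...
    out_types: list  # product types emitted on out_1, out_2, ...

@dataclass
class Agent:
    nodes: dict = field(default_factory=dict)      # name -> Node
    connections: set = field(default_factory=set)  # (src, out_i, dst, in_j)

    def connect(self, src, out_i, dst, in_j):
        # Conduits are typed: only compatible output/input pairs may be wired.
        if self.nodes[src].out_types[out_i] != self.nodes[dst].in_types[in_j]:
            raise TypeError("incompatible conduit types")
        self.connections.add((src, out_i, dst, in_j))

    def is_connected(self):
        # The graph must be connected (no isolated fragments), ignoring direction.
        if not self.nodes:
            return True
        adj = {n: set() for n in self.nodes}
        for s, _, d, _ in self.connections:
            adj[s].add(d)
            adj[d].add(s)
        seen, stack = set(), [next(iter(self.nodes))]
        while stack:
            n = stack.pop()
            if n not in seen:
                seen.add(n)
                stack.extend(adj[n])
        return seen == set(self.nodes)

a = Agent()
a.nodes["pos"] = Node("reader", [], ["nlpdata"])
a.nodes["cvgen"] = Node("generator", ["nlpdata"], ["variable"])
a.nodes["out"] = Node("output", ["variable"], [])
a.connect("pos", 0, "cvgen", 0)
a.connect("cvgen", 0, "out", 0)
```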

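The energy-based selection of Sections 4.2 and 4.2.1 can be sketched as a single run step. The threshold value, the function names, and the rule that a component's energy is shared among all agents that found it are assumptions for illustration:

```python
# Sketch of one evolutionary run: energy attached to solution components is
# paid out to agents that found them; agents above the sustain threshold
# survive, and agents above twice the threshold also reproduce.
E_SUS = 1.0  # illustrative sustain threshold e_sus

def run_step(agents, training_set, solve, mutate):
    energy = {a: 0.0 for a in agents}
    for problem, solution_components in training_set:
        for comp, value in solution_components.items():
            finders = [a for a in agents if comp in solve(a, problem)]
            for a in finders:                      # a component's energy is
                energy[a] += value / len(finders)  # shared among its finders
    survivors = []
    for a in agents:
        if energy[a] >= E_SUS:        # enough energy to be sustained
            survivors.append(a)
        if energy[a] >= 2 * E_SUS:    # enough energy to reproduce
            survivors.append(mutate(a))
    return survivors
```

With a growing population, each component's payout is split among more finders, which reproduces the growth-limiting pressure described above.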
Initial and new agents can be created from a problem-specific standard agent. Each run is done by an agent set, whose population changes by evolution. All agents have to solve the problems, and the produced solutions are evaluated. Then, for each solution (resp. solution component) it is checked to which extent the solution of an agent matches that solution component. The energy assigned to the solution components is distributed to the agents that found it.
Given a threshold e_sus that defines how much energy is required to sustain an agent, the next step is to check which agents collected at least e_sus energy. Those are then added to the agent set of the next round. If an agent earned more than 2·e_sus, it reproduces (i.e., mutates) itself, and the offspring is added to the agent set as well. After being unchanged for a certain number of runs, an agent can also mutate itself.

4.2.2 Reproduction
If e ≥ 2·e_sus for an agent, it reproduces itself, and both it and its offspring are added to the agent set. During reproduction, one of two scenarios can happen.
1.) The offspring is a perfect copy of its parent, without any changes. This means that in the next run there are more agents of this configuration, and during reproduction the likelihood of successful mutations is higher.
2.) The offspring is a mutation of its parent, i.e., it makes a random number of changes (based on a normal distribution centered around a value > 0) on its nodes and connections. These changes can be adding a new node/connection, removing a node/connection, or changing the configuration of a node.

Figure 2: Visual sketch of a learned agent. Each node is represented as a circle, and each type has a distinct color and icon. The size of the glow around each node represents its activity, and the smaller dots represent the conduits of the nodes.

5. NLQ-TO-SPARQL TRANSLATION
For every application, the specific node types must be designed and implemented. This requires a profound idea of useful small steps of the process. The learning process then consists of combining such local behavior into a smooth global behavior.
For the NLQ-to-SPARQL translation, the task of the agent is to translate the outcome of the CoreNLP analysis into a SPARQL query. So, the solution components mentioned above are query fragments. Different issues have to be handled and combined by the agent: named-entity recognition, translation of class names and property names into the notions of the database (represented by its ontology), the structural generation of a SPARQL query from basic graph patterns (BGPs), logical connectives, and conditions, and dealing with the variables.
The training set consists of a set of pairs of NLQs and the corresponding (usually handmade) annotated SPARQL queries, which adhere to conventions that give hints for the translation of the sentence structure.

5.1 Ontology Representation and Access
The Semantic Data Dictionary (SDD) [5] gives comprehensive access to the metadata: basically, it contains the same knowledge as an OWL ontology (from which it can be extracted; originally it provides an OBDA RDF-to-SQL mapping), extended with knowledge about which concrete (sub)classes provide which properties, and about their ranges. It is used here instead of the OWL ontology because it is easier to access and does not require further reasoning.
The SDD has no information about the instances in the data set. Since identifying instances in NLQs is one of the major tasks for answering them, a data structure that supports efficient searching is necessary. Therefore the SDD is extended with an identifier mapping IM: string → (class, property)*, e.g. “Monaco” ↦ ((Country, name), (City, name)). To identify which properties are potential identifiers for the mapping, the training set is searched for cases where the SPARQL solution contains a variable whose name is not equal to its class – these denote the named entities (“Great (Britain)” in the example, whose class is Country). For instances of these classes, all string-valued properties are checked for whether their value equals the name of the training set variable (i.e., “Great Britain”). If so, the property is considered an identifying property and generates an entry in the identifier mapping for each instance of this class with this property.

5.2 Products
The products for the NLQ application are divided into two major groups: primitive and compound products. Usually, primitive products can contain complex information, but as a product they are seen as a whole (see Table 1 for the exact data composition of each product; e.g., most products carry the position in the sentence from which they have been derived). At the primary stage of the processing, there are primitive products of type nlpdata, which is a reduced version of the output of CoreNLP [3]. Nlpdata can be turned either into tripleparts or symbols, which are primitive products towards the SPARQL side. Triplepart is an abstract superclass of the product types variable, constant, and predicate, while symbol is the abstract superclass of the product types operator (e.g., +, ≤, ≥, =, ≠), aggregation, and except. Products of type variable can be part of the solution set (i.e., of the fragments sol_{t,j} of the query expression to be generated) and can generate SPARQL statements of the form ?x rdf:type class, where the class information is contained in the information of the variable. Constants are fixed (literal) values from the NLQ, like names or numbers.

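The reproduction step of Section 4.2.2 might be sketched as follows. The dictionary-based agent representation, the probabilities, and the "log" stand-in for actual graph edits are illustrative assumptions:

```python
# Sketch of reproduction: the offspring is either an exact copy or a mutant
# with a (normally distributed, positive) number of random changes to its
# nodes and connections.
import copy
import random

def reproduce(agent, copy_prob=0.5, mean_changes=2.0):
    child = copy.deepcopy(agent)
    if random.random() < copy_prob:
        return child                     # scenario 1: perfect copy
    n_changes = max(1, round(random.gauss(mean_changes, 1.0)))
    for _ in range(n_changes):           # scenario 2: mutate nodes/connections
        op = random.choice(["add_connection", "remove_connection",
                            "add_node", "remove_node", "reconfigure_node"])
        child.setdefault("log", []).append(op)  # stand-in for the real edit
    return child
```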

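Building the identifier mapping IM of Section 5.1 can be sketched over a toy fact table; the facts and field names below are invented for illustration:

```python
# Sketch of IM: string -> (class, property)*, built from string-valued
# identifying properties of instances.
from collections import defaultdict

# (class, property, value) facts for string-valued identifying properties
facts = [
    ("Country", "name", "Monaco"),
    ("City",    "name", "Monaco"),
    ("Country", "name", "Great Britain"),
]

identifier_mapping = defaultdict(list)
for cls, prop, value in facts:
    identifier_mapping[value].append((cls, prop))

# "Monaco" is ambiguous: it identifies both a country and a city, so a later
# node has to decide between the candidate (class, property) pairs.
```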
Products of type predicate are a set of properties (i.e., the properties used in the ontology that may fit the verbal query). Products of type except correspond to negation in the NLQ.
Compound products are either triple, condition, or graph products. Triples always consist of a subject, which must be an object-valued variable, a predicate, and an object which is also a variable (object- or literal-valued). Note that IRI constants cannot yet exist, since they do not occur in NLQs; constant values occur only in comparisons in conditions. Triples can be translated directly to SPARQL. Conditions consist of a left product of type variable, a right product of type variable or constant, and one operator. Products of type graph are basically lists of triples and conditions, but can also contain primitive products that are not yet integrated with the rest of the graph.
For calculating to what extent an agent found a solution component (i.e., a fragment of the query), the partial tasks are valuated as sketched in Table 1.

5.3 Nodes and Operations
Each node type implements an operation that corresponds to a single conceptual step. The node types are grouped into the following four categories: reader, generator, relator, and reducer. Node types also have parameters to configure their concrete instances. Parameter settings can be changed by mutations through evolution. Further, nodes can have a confidence value out of [evident, derived, necessary]. Each node gets a confidence value assigned when created or mutated and passes it on to all its products. Some nodes are sensitive to those values and base decisions on them.
The nodes can access the SDD via an SQL database, and WordNet via its API [8]. In the following, some of the node types are described. For the generated output products, the components are indexed with their provenance; im denotes the identifier mapping from the SDD described in Section 5.1.

5.3.1 Reader Node Types
Reader nodes receive information from the CoreNLP output. Some examples of reader nodes are:
Part of Speech: This is the most essential node type of all. Its only parameter is which part-of-speech tag it handles. A Part-of-Speech node gets the whole set of POS output from CoreNLP, and if the POS tag of an incoming item matches the parameter of the node, it generates an nlpdata product with the content {string_pos, lemma_pos, position_pos, POS_pos, namedEntity_pos}.
Synonym: Such nodes use WordNet to find the terms used in the ontology for a word. For this, the nodes maintain a dictionary built by querying WordNet for synonyms syn of each known term term of the ontology. If an nlpdata{syn, ...} is received, the node replaces it by nlpdata{term, ...}.
Proper name: The idea of this node type is to find a sequence of words in the NLQ which together equal a known identifier in the database, e.g. “Great Britain”. For each longest exact match in the input, it combines the input nlpdata products into a single product of type nlpdata.

5.3.2 Generator Node Types
Generators turn one product into another type of product using information from the SDD. Some of the more fundamental ones are:
Class Variable Generator - CVGen: Such nodes generate variables which range over a class. For this, they check the string and the lemma of an nlpdata and try to find a match in the SDD. If a matching class is found, the node generates var{name_nlp, position_nlp, confidence_node, ClassName_SDD, false, POStag_nlp}.
Identifier Node - IdGen: While the CVGen nodes are responsible for variables ranging over classes, the IdGen nodes generate products for identifying a specific instance of a class. Incoming nlpdata is checked for containing a string or lemma which also occurs in a property value of the identifier mapping for a property name_im. The node then generates subj := var{name_nlp, position_nlp, confidence_node, domain_SDD, false} describing the class, pred := pred{name_im, position_nlp, confidence_node, properties_im} for the identifying property, the literal-valued obj := var{name_im, position_nlp, confidence_node, string, true} for the value, a triple{subj, pred, obj} containing these three triple parts, a val := const{name_im, position_nlp}, and condition{obj, =, val}.

5.3.3 Relator Node Types
Relators take two or more products and relate them into a compound product, usually triples or conditions. Such products are possible fragments of the final query. The modifier nodes and reducer nodes described below in Sections 5.3.4 and 5.3.5 will remove non-helpful fragments later. E.g.:
Triple Generator - TriGen: This node type generates any ontologically possible relationship in the form of corresponding triples. For a subject and either a predicate or an object, a filler for the missing position is generated. Either a var{name_pred, position_pred, confidence_node, ranges(pred)_SDD, isLiteral?_SDD} is created as object, or a pred{name_object, position_object, confidence_node, properties_SDD} is generated, where those properties from the SDD are taken that are defined for the subject's class and whose range contains the object's class.
For literal-valued properties this is often the only way to generate the object, since such values are not of a class of the ontology and cannot be found by a CVGen.

5.3.4 Modifier Node Types
Nodes of modifier types perform context-sensitive tasks and have only one input conduit, which accepts graph products. An example of this kind is the
Reificator: The goal of nodes of this type is to access the literal values of attributed relations, which are usually modeled in RDF through reification. The terminology of the reified classes is normally not used in NLQs; instead, the direct relation between the entities is used, as in “percentage of Russia located in Asia”, where “percentage” seems to be a property of countries and not, as in a reified modeling, of an “EncompassedInfo” resource. The SDD has information about those classes, and if such a relation is encountered, the Reificator breaks down the direct relation into the detour over the reified class and additionally generates the predicates for the properties of the reification. The output is reifvar := var{name_SDD, min(position_A, position_B), confidence_node, reifiedclass_SDD, false, -}, triple{Variable_A, reifiedPropertyA_SDD, reifvar}, triple{reifvar, reifiedPropertyBinv_SDD, Variable_B}, and pred{name_SDD, position_view, confidence_node, property_SDD}, which relate Variable_A to a new variable reifvar ranging over the reified class, etc.

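A CVGen-style generator node (Section 5.3.2) might be sketched as follows. The SDD contents, the dictionary-based product encoding, and the field names are illustrative assumptions:

```python
# Sketch of a CVGen node: an nlpdata product whose string or lemma matches a
# class name in the SDD becomes a class variable product.
SDD_CLASSES = {"country": "Country", "city": "City", "organization": "Organization"}

def cvgen(nlpdata, node_confidence="evident"):
    for key in (nlpdata["string"].lower(), nlpdata["lemma"].lower()):
        if key in SDD_CLASSES:
            return {                       # var{name, position, confidence,
                "type": "variable",        #     class, isLiteral, POS tag}
                "name": nlpdata["string"],
                "position": nlpdata["position"],
                "confidence": node_confidence,
                "class": SDD_CLASSES[key],
                "isLiteral": False,
                "pos_tag": nlpdata["pos_tag"],
            }
    return None                            # no matching class: no output

v = cvgen({"string": "countries", "lemma": "country", "position": 1, "pos_tag": "NNS"})
```

Note how the lemma lookup is what lets “countries” match the class Country even though the surface string does not.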

                                          Table 1: Overview of all Product types.
 Product           Content                                                   Superclass           success calculation (sketch)
 nlpdata           string, lemma, position, named entity tag, POS tag        product              none
 triplepart        name, position, confidence                                product              (abstract class)
    variable       domain(s), isLiteralValued (t/f), POS tag                 triplepart           name + position + domain + all
    constant       -                                                         triplepart           name
    predicate      properties                                                triplepart           name + position + properties + all
 symbol            -                                                         product              (abstract class)
    operator       value, position                                           symbol               value + position
    aggregation    type, variable                                            symbol               type + variable + all
    except         position                                                  symbol               position
 triple            subject (variable), predicate, object (variable/constant) product              subject + predicate + object + all
 condition         left (triplepart), operator, right (triplepart)           product              left + operator + right + all
 graph             list of products                                          product              sum of its parts
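As a minimal sketch, the product hierarchy of Table 1 could be represented as follows; the class and field names are illustrative and merely follow the table, and the success calculation is omitted:

```python
# Sketch of the product hierarchy from Table 1 as Python dataclasses.
from dataclasses import dataclass, field
from typing import List, Union

class Product:                       # common superclass of all products
    pass

@dataclass
class NlpData(Product):              # raw token information from the NLQ
    string: str
    lemma: str
    position: int
    ne_tag: str = ""
    pos_tag: str = ""

@dataclass
class TriplePart(Product):           # abstract: name, position, confidence
    name: str
    position: int
    confidence: float

@dataclass
class Variable(TriplePart):          # adds domain(s), isLiteralValued, POS tag
    domains: List[str] = field(default_factory=list)
    is_literal_valued: bool = False
    pos_tag: str = ""

@dataclass
class Predicate(TriplePart):         # adds properties
    properties: List[str] = field(default_factory=list)

@dataclass
class Constant(TriplePart):          # no additional content
    pass

@dataclass
class Triple(Product):               # subject, predicate, object products
    subject: Variable
    predicate: Predicate
    object: Union[Variable, Constant]
```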


5.3.5    Reducer Node Types
   Reducer nodes reduce the number of products circulating
in the agent. Such nodes use the SDD and the context of the
products to remove products that are invalid or considered
not to be helpful. The following nodes are a selection that
demonstrates the general functions of this node type.
Fusion Node - Fus: Such nodes reduce the domains or
properties if more precise information is available. Especially
relator nodes often generate two triples describing the same
fact, but since either the predicate or the object is inferred,
the properties or the domains, respectively, are often too
general in the inferred triple part. A fusion node checks a
graph product whether it contains triples A and B such that
subjectA = subjectB, predA ⊆ predB and
objectClassA ⊇ objectClassB, and in this case replaces both
triples by triple{subjectA, predA, objectB}.

Conflicting literal solver - CSolv: Nodes of this type react
to graph products with multiple object-valued variables that
refer with the same property to a single literal-valued
variable. While this is a valid operation in SPARQL, in an
NLQ such a situation is expressed in a way that would
trigger the operator generator, e.g. “where the population is
equal” or “with the same name”. In this case, the node
removes all but one of the conflicting triples, based on the
grammatical distance.

5.4    Standard Agent
   A solid initial basis for the structure of the agents is
constructed (automatically) from the information contained
in the training set and the application-specific nodes and
products. First, for every primitive product type, the set of
POS tags and keywords for node parameters to which they
can correspond is computed. From this, typical agent
substructures for each kind of primitive products, i.e.,
variables, properties, operators, and excepts, are constructed
algorithmically.
   Next, substructures are generated that depend on how
these primitive products are used by relators (for generating
triples and conditions). Their output conduits are directly
connected to the output node. So far, this is already a very
basic agent. At this point, already more than half of the
possible rewarded energy from the used training set is
achieved, but only very simple queries are sufficiently
answered.
   For achieving better results, better agents must then
evolve from the evolutionary process, where they “learn” to
make use of the context-sensitive nodes.

6.   EVALUATION
   The approach has been tested on the Mondial RDF data
set with a set of 51 questions. Only a few of them are simple
selections which can be answered with a single SPARQL
triple; instead, the focus is on more complex and ambiguous
questions. The standard agent can answer 45% correctly,
while the best learned agent was able to give the correct
answer to 84% of the questions (examples shown in Table 2).
Since the approach is still under development and some key
features are still missing, mainly the aggregation functions
and the translation from the internal representation into
syntactically correct SPARQL, there is no extensive
comparison with [6, 4] yet. The main problems at the
moment are the distinction between “and” meaning that
both sides should have a certain property, like query 11
(Table 2), and “and” meaning a union of both sides, like
query 12. Agents so far were only able to answer one or the
other correctly. Another big issue is the lexical gap, as
already stated by Steinmetz et al. [10]; e.g., query 13 is
answered wrongly because the approach is unable to map
“inhabitants” to the property population and therefore uses
a union over all numeric properties of cities. Further logical
concepts are not covered at all, like the population density in
query 15 (only the properties population and area exist in
the ontology). Furthermore, the approach is not aware of
what it is describing as a whole; therefore, in query 16 it
does not just list all seas, but tries to find “world” as an
instance, does not succeed, and completes it to a union over
several instances with “world” in their name, like the “World
Health Organization” and the “World Trading Union”,
which are not directly relatable with seas, and drifts into
complete nonsense. Both are problems mentioned by Saha et
al. [6] as well, and to the best of our knowledge, these
problems have not yet been solved exhaustively for generic
cases.

7.   CONCLUSION
   In this paper, we developed an approach that enables
agents used in artificial life to work as a functional NLIDB.
For this, we developed a framework that enables those
agents to solve complex problems (other than surviving in
their environment) which can be broken down into
sub-objectives. The agents, which are based on evolutionary
programming,
 Nr    NLQ                                                                                                            Correct
 1     Give me all rivers with a length shorter than 100 kilometers.                                                    ✓
 2     List all names except for Deserts.                                                                               ✓
 3     Give me everything located in Asia.                                                                              ✓
 4     Which cities are in Europe?                                                                                      ✓
 5     What is the depth of the Sea of Japan?                                                                           ✓
 6     How many percent of India are Sikh?                                                                              ✓
 7     Give me all cities where the population is greater then the population of the capital of their country.          ✓
 8     Show me all waters with their name                                                                               ✓
 9     Is there a city where the latitude and longitude are equal                                                       ✓
 10    Is the percentage of Turkish people greater than the percentage of Croat people in Austria                       ✓
 11    Which rivers are located in Poland and Germany?                                                                  ✓
 12    Give me the name of all mountains and islands                                                                    ✗
 13    Give me all cities that have more than 1000000 inhabitants, and are not located at any river that is more        ✗
       than 1000 km long
 14    Give me all cities that have a population higher than 1000000, and are not located at any river that is more     ✓
       than 1000 km long
 15    How high is the population density in Japan?                                                                     ✗
 16    How many seas are there in the world?                                                                            ✗

                                      Table 2: Example queries from the test set
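The two readings of “and” discussed in the evaluation (queries 11 and 12 in Table 2) can be illustrated on a toy data set; the data values and function names below are purely illustrative and do not reflect the system's implementation:

```python
# Toy illustration of the "and" ambiguity: intersection vs. union reading.
# The data is a small Mondial-like extract (values are illustrative).

located_in = {
    "Oder":    {"Poland", "Germany"},
    "Vistula": {"Poland"},
    "Rhine":   {"Germany"},
}

def and_as_intersection(a, b):
    """Query-11 reading: rivers located in Poland AND Germany (in both)."""
    return {r for r, cs in located_in.items() if a in cs and b in cs}

def and_as_union(a, b):
    """Query-12 reading: results for one side AND the other (either one)."""
    return {r for r, cs in located_in.items() if a in cs or b in cs}
```

An agent committed to one reading answers queries of the other kind wrongly, which is exactly the behavior observed for queries 11 and 12.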


had to be extended and transferred from a linear to a
multi-dimensional evaluation system to cope with the
complexity of NLQ processing. For this purpose, an
evaluation technique was developed which not only takes
into account the agents with the highest score, but also those
which have specialized in a new direction and thus extend
the functionality of the whole approach. The agents have
been equipped with specialized operations for their
architecture, but also with many common ontological or
graph-pattern-based operations, and are able to link them in
a meaningful way to transform them into an NLIDB. The
intermediate results, although not yet final, are comparable
with existing approaches. Since some features are still
missing and the evaluation is not executed in SPARQL but
in the internal query language for the moment, this is not
precisely comparable. But it gives reason for the assumption
that this approach might be comparable to other
state-of-the-art approaches and might also provide
additional flexibility in some cases.

8.   REFERENCES
 [1] S. Auer, C. Bizer, G. Kobilarov et al. DBpedia: A
     nucleus for a web of open data. In ISWC, Springer
     LNCS 4825, pp. 722–735, 2007.
 [2] T. Hoverd and S. Stepney. Energy as a driver of
     diversity in open-ended evolution. In ECAL 2011,
     pp. 356–363. ACM, 2011.
 [3] C. D. Manning, M. Surdeanu, J. Bauer et al. The
     Stanford CoreNLP natural language processing
     toolkit. In ACL, pp. 55–60, 2014.
 [4] A.-M. Popescu, O. Etzioni, and H. Kautz. Towards a
     theory of natural language interfaces to databases. In
     Intelligent User Interfaces, pp. 149–157. ACM, 2003.
 [5] L. Runge, S. Schrage, and W. May. Systematical
     representation of RDF-to-relational mappings for
     ontology-based data access. Technical report, available
     at https://www.dbis.informatik.uni-goettingen.
     de/Publics/17/odbase17.html, 2017.
 [6] D. Saha, A. Floratou, K. Sankaranarayanan et al.
     Athena: An ontology-driven system for natural
     language querying over relational data stores. VLDB,
     9:1209–1220, 2016.
 [7] The Mondial database. http:
     //dbis.informatik.uni-goettingen.de/Mondial.
 [8] C. Fellbaum (ed.). WordNet: An Electronic Lexical
     Database. MIT Press, 1998.
 [9] S. Hu, L. Zou, X. Zhang. A state-transition framework
     to answer complex questions over knowledge base. In
     EMNLP, pp. 2098–2108, 2018.
[10] N. Steinmetz, A. Arning, K.-U. Sattler. From natural
     language questions to SPARQL queries: A
     pattern-based approach. In BTW, LNI, pp. 289–308,
     2019.
[11] G. Turk. Sticky feet: Evolution in a multi-creature
     physical simulation. In ALife XII, pp. 496–503.
     MIT Press, 2010.
[12] P. A. Vikhar. Evolutionary algorithms: A critical
     review and its future prospects. In ICGTSPICC,
     pp. 261–265, 2016.
[13] Y.-L. Li, Y.-R. Zhou, Z.-H. Zhan, J. Zhang. A primary
     theoretical study on decomposition-based
     multiobjective evolutionary algorithms. IEEE Trans.
     Evolutionary Computation, 20(4):563–576, 2015.