<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>QAnswer - Enhanced Entity Matching for Question Answering over Linked Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stefan Ruseti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexandru Mirea</string-name>
          <email>alexandru.daniel.mirea@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Traian Rebedea</string-name>
          <email>traian.rebedea@cs.pub.ro</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefan Trausan-Matu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>TeamNet International</institution>
          ,
          <addr-line>Bucharest</addr-line>
          ,
          <country country="RO">Romania</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University Politehnica of Bucharest</institution>
          ,
          <country country="RO">Romania</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>QAnswer is a question answering system that uses DBpedia as a knowledge base and converts natural language questions into SPARQL queries. In order to improve the match between entities, relations, and natural language text, we use Wikipedia to extract lexicalizations of DBpedia entities and then match them with the question. These entities are validated against the ontology, while missing ones can be inferred. The proposed system was tested in the QALD-5 challenge, where it obtained an F1 score of 0.30, placing QAnswer second in the challenge, even though the system used only a small subset of the properties in DBpedia due to the long extraction process.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Question answering systems have been considered an important goal of Artificial
Intelligence and a subject of research for many years. While the traditional way of
searching for information is most often based on keywords, a much more
natural way of searching is using natural language questions.
Also, instead of generating a list of documents where the answer can be found,
such a system returns a precise answer.</p>
      <p>Usually, question answering systems either use a knowledge base with
structured information, or try to extract the answer from free text, or employ a
combination of these two approaches. Knowledge bases contain more precise
information, but cannot cover all possible questions. In order for a system
to answer a question, it must first be able to “understand” the meaning of the
question. This usually means translating the natural language question into a
structured form, which is later used for querying a knowledge base.</p>
      <p>Our system uses the structured knowledge base DBpedia and a Wikipedia-based
approach to match phrases from the question to entities in the ontology.
Different solutions for matching each type of entity were developed, and the most
probable interpretation is converted into a SPARQL query. Missing properties or
types can also be inferred if they were not matched in the previous step.</p>
      <p>This paper describes in the next section other state-of-the-art systems
that participated in the QALD competition. Section 3 presents the general
architecture of our approach and describes each of its steps. The results obtained
in the QALD-5 challenge are included in Section 4, along with an analysis of some
of the questions the system answered incorrectly. In the end, we present the
conclusions of this paper and some future development directions.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>As QAnswer uses a knowledge base for question answering (QA), the focus of
this section will be on QA systems and approaches that employ ontologies or
other sources of structured knowledge to generate the answer. One of the most
important competitions for this type of system is Question Answering over
Linked Data - QALD3, which is part of the Question Answering lab at CLEF.
This competition, now at its fifth edition, provides valuable datasets with
questions and answers that can be used to evaluate and compare such systems.
For one of the tasks, the questions are chosen so that they can be answered
using DBpedia resources, so this task was perfectly suited to evaluating our
system.</p>
      <p>
        The most successful system in both QALD-4 and QALD-5 was Xser [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The
system uses a two-layered architecture, where the first step produces a Directed
Acyclic Graph (DAG) from the question with phrases labelled as resource,
relation, type or variable, by using a structured perceptron [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. These tags are
independent of any knowledge base, so their approach can easily be applied
to a different ontology. The second step of their system maps the discovered
entities to the ones in a given knowledge base, such as DBpedia, by using a Naïve
Bayes approach. The downside of their approach is the need for a large corpus
of questions annotated with DAGs.
      </p>
      <p>
        Another interesting solution was used by gAnswer [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], which uses a
graph-driven approach. The disambiguation takes place at the evaluation step, when
the generated graph is matched with subgraphs in the ontology. While usually
expressions are extracted for a given property, they use a database with common
expressions [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and then map them to paths in the ontology. This way, more
complex ontology constructs can be linked with expressions.
      </p>
      <p>
        Casia [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] uses a similar approach for matching properties based on an
expressions database. However, the system chooses between all possible phrases and
their interpretation in the same step by using a Markov Logical Network [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>
        There are also other types of question answering systems, which do not
use a knowledge base, but rather try to select the most appropriate answer for a
given question from texts available online. These systems are usually able
to answer a greater variety of questions, but they are not as precise. Systems
that combine the two strategies are called hybrid systems, and they are probably
the most successful. The best example is IBM Watson [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which was able to beat
the best human players in the Jeopardy game.
      </p>
      <sec id="sec-2-1">
        <title>3 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/</title>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Proposed Solution</title>
      <p>The system has a pipeline architecture, described in Fig. 1. The pipeline starts
from the natural language question and ends with the answer.</p>
      <p>The answer is provided based on the generated SPARQL query, which is
executed on a Virtuoso-opensource4 endpoint, loaded with the DBpedia 2014
dump5. In order to generate the query, two conditions must be met:
the DBpedia entities must be discovered in the text, and they have to be correctly
linked. The links between the entities can only be based on syntactic dependency
links between words. Because of this, the first step in the pipeline is the question
parsing which is done using the Stanford CoreNLP library6.</p>
      <p>
        After this first step, a directed graph is generated, with vertices corresponding
to tokens in the question, which are annotated with lemma and part-of-speech
tags, and edges corresponding to the collapsed dependencies generated by
Stanford CoreNLP [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Also, numbers and dates are detected in this step by using
Stanford NER7, a named entity recognition module.
      </p>
      <p>Once the graph is generated, DBpedia entities must be detected. There are
three types of entities in the ontology: individuals, types, and properties. A
separate detection method was developed for each, because of their
particularities. During these steps, multiple graphs will be generated because of
the different possible matches, some of them containing multiple words. In the
end, only one graph is selected, based on scores that will be explained
in the following sections.</p>
      <sec id="sec-3-1">
        <title>4 https://github.com/openlink/virtuoso-opensource 5 http://downloads.dbpedia.org/2014/ 6 http://nlp.stanford.edu/software/corenlp.shtml 7 http://nlp.stanford.edu/software/CRF-NER.shtml</title>
        <p>The most probable graph enters the last stages of the pipeline, where missing
entities are inferred, while existing ones are validated using the ontology. The
SPARQL query is generated in the last step, creating triples and subqueries
based on the structure of the graph and the direction of the properties.</p>
        <sec id="sec-3-1-1">
          <title>Individual Detection</title>
          <p>The detection of individuals from the ontology is probably the simplest task
out of the three types of detection tasks, because the list of possible ways of
expressing an individual is very limited. At the same time, it is a very important
step, because individuals are much more specific than types and properties, which
could be inferred in some cases based on the individuals discovered in this first
stage.</p>
          <p>In the general form, the named entity recognition (NER) task is a more
difficult problem than the task at hand. This happens because the list of possible
entities is usually open (almost anything can be a name). In our case, the entities
must be part of DBpedia, any other match being useless. For each possible
individual to match, most of the ways in which it can be expressed already exist
in DBpedia, in the form of Wikipedia redirects. Although this approach might
lose some expressions for certain individuals, it works well enough in most cases.</p>
          <p>First, the individuals are filtered by their “importance”, so we can ignore
very unlikely candidates which can have a negative effect on the precision. The
importance is estimated by the number of Wikipedia pages that have links to
the individual’s page. For each such individual, all the redirects that contain
only ASCII characters are kept.</p>
          <p>
            All these expressions are inserted in a trie data structure for words, in order
to perform very fast searches. When searching for matches, the longest sequence
of words that match is kept and for each sequence, the individuals are sorted
based on how close their label is to the sequence of words, measured with the
edit distance [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ], and also by their importance.
          </p>
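          <p>As an illustration, the word-level trie with longest-match lookup could be sketched as follows. This is a minimal sketch with our own class and method names (not QAnswer's actual implementation); the sorting of candidate individuals by edit distance and importance is omitted.</p>

```python
class WordTrie:
    """Word-level trie mapping token sequences to candidate individuals."""

    def __init__(self):
        self.children = {}   # word -> WordTrie
        self.entities = []   # individuals whose expression ends at this node

    def insert(self, words, entity):
        node = self
        for w in words:
            node = node.children.setdefault(w.lower(), WordTrie())
        node.entities.append(entity)

    def longest_match(self, tokens, start):
        """Return (end, entities) for the longest expression beginning at `start`,
        or None if no stored expression starts there."""
        node, best = self, None
        for i in range(start, len(tokens)):
            node = node.children.get(tokens[i].lower())
            if node is None:
                break
            if node.entities:
                best = (i + 1, node.entities)
        return best
```

          <p>For example, after inserting the redirect “Barack Obama”, a lookup starting at the token “barack” inside a longer question returns the full two-word span rather than any shorter prefix.</p>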
          <p>Matched sequences of words can also overlap, each sequence being linked to
a different individual. In this case, there is no way we can decide at this point
which is the correct one, so the current graph is replaced by two different ones.
Also, the same sequence of words might produce different graphs if the possible
individuals that match have very different types. The type of each individual is
important in the properties matching, which will be described later.</p>
          <p>Another rule added to improve accuracy is the idea of a “strong” link. These
links correspond to a sequence of more than one word, where each word that
is not a preposition or conjunction is capitalized. Such cases
strongly indicate a correct match, so any other possible sequence that overlaps
with it, as well as any possible type or property match, will be ignored.</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>Type Detection</title>
          <p>Although the number of types in DBpedia is much smaller than the number of
individuals, their detection is harder because the ways of expressing them are not
known a priori, and because most of them are common words. Since Wikipedia
redirects are not available for types, a new method of determining possible
expressions needed to be found.</p>
          <p>
            The “naïve” solution. Most of the times, a type appears in text by its label,
which makes it very easy to discover. Starting from this, a database can be
generated with entries containing pairs of types and words. Besides the actual
label, synonyms of the label extracted from WordNet [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ] are also added. This
approach discovers many of the correct matches, but it is hard to apply to types
whose labels contain more than one word.
          </p>
          <p>A Wikipedia-based approach. In order to overcome the shortcomings of
the “naïve” approach, a more general solution was needed. Because of the strong
connection between DBpedia and Wikipedia, each Wikipedia article corresponds
to a DBpedia individual, which also has a type, meaning that the page should
contain a formulation of that type within the text. Luckily, the type is most of
the times expressed in the first sentence of each article, which provides a short
description for each individual.</p>
          <p>Since the majority of the first sentences look like “&lt;individual&gt; is a &lt;type&gt;”,
the complement of the copulative verb should be an expression for that type.
Some of the words connected with the complement of the copulative verb are also
important for the expression of the type. Because of this, a subgraph containing
vertices that can be reached starting from the type vertex by going only on edges
corresponding to dependencies of a modifier type is generated. The words from
this subgraph are memorized as a way of expressing the type.</p>
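          <p>A rough sketch of this traversal over modifier edges follows; the set of dependency labels treated as modifiers and the graph representation are our own assumptions, not the exact ones used by the system.</p>

```python
# Dependency labels treated as "modifier" edges (an assumed set).
MODIFIER_DEPS = {"amod", "compound", "nmod", "nummod"}

def type_expression(graph, type_vertex):
    """Collect words reachable from the type vertex over modifier edges only.

    `graph` maps a vertex (word) to a list of (dependency_label, child) pairs.
    """
    words, stack, seen = [], [type_vertex], set()
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        words.append(v)
        for dep, child in graph.get(v, []):
            if dep in MODIFIER_DEPS:
                stack.append(child)
    return words
```

          <p>Applied to the first sentence of a driver's article, starting from the vertex “driver” with modifier children “formula” and “racing”, this would collect the whole expression “formula racing driver”, while non-modifier dependents are left out.</p>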
          <p>We have found that, while some of the extra words are important for that
type (e.g. “formula one racing driver”), others don't add any extra information
(e.g. “English actor”). However, using all these possibilities might alter the results
or might include in the type words that are important for the question in other
ways. These problems and others are solved in a later step using a method that
removes information that is too specific or too general.</p>
          <p>The remaining entries are added in another trie-like structure, where edges
can be lemmas or whole words. When queried, the word edges have priority, but
lemmas are used when matches are not found. This permits, for example, the
existence of plural forms for some words. Type detection is applied to each
graph resulting from the individual detection phase, so at least as many
graphs result from this step.</p>
        </sec>
        <sec id="sec-3-1-3">
          <title>Property Detection</title>
          <p>Out of the three types of DBpedia resources, properties are arguably the most
difficult to detect. This is because there are more properties than types and most
of them are complex and contain more than one word. Moreover, applying the
same approach as for types is tedious because properties can appear anywhere
in a Wikipedia article, and they can appear in many types of sentences, which
makes developing rules for words extraction more difficult.</p>
          <p>
            Extraction. Because of the strong connection to DBpedia, Wikipedia is a good
source for pattern extraction, and was used for this purpose by systems like
WikiFramework [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ] and BOA [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ]. We propose a similar approach that also uses
syntactic dependencies to select the pattern from the sentence.
          </p>
          <p>Because dependency parsing is quite slow, only the sentences from a Wikipedia
page where a property could occur were considered. This was achieved by finding
sentences where both the label of the individual and the value of the property
were found. For data properties, more possible values were looked for, especially
for dates which have many possible formats. Although a lot of relevant sentences
are ignored because the individual most often appears as a reference rather than
by its label, using a coreference system would slow this task down even more, so we
decided not to use coreference resolution at this point in our system.</p>
          <p>From each selected sentence, the path between the discovered subject and
object should be an expression that corresponds to the property between them.
This is not always the case because other related words might be relevant for the
meaning of the property, while sometimes, some words from the path might not
be relevant. Starting from the path, other words that can be reached from the
ones on the path by following edges that usually represent modifier dependencies
are added to the expression.</p>
          <p>Sometimes, the extracted expression contains a lot of unnecessary words. An
example of such a case is “Judd Trump is an English professional snooker player
from Bristol and former world number one”. This is a very common situation,
where the individual is first defined, and then the property value is connected to
the definition. In this case, the individual is considered the word “player” so the
extracted expression will be “from”, which is much more relevant than “is an
English professional snooker player from”.</p>
          <p>Matching. A subgraph is considered to match an expression if it has all the
words in the expression, regardless of their order. Also, the words are compared
by their lemma and POS tag, so all variations of a word can match. In this form,
this problem is a subgraph matching one.</p>
          <p>Finding all the possible matches is done by starting from each word in the
sentence. An inverted index contains all the extracted expressions that contain a
certain word, so a list of candidate expressions is retrieved quickly. For each expression
in the list, a search is performed starting from the vertex to check if all words in
the expressions form a subgraph in the query graph. The validated expressions
enter the next step of the algorithm.</p>
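          <p>A simplified sketch of this lookup, representing each word as a (lemma, POS) pair, might look as follows. The check that the matched words actually form a connected subgraph of the query graph is omitted here, and the function names are ours.</p>

```python
from collections import defaultdict

def build_index(expressions):
    """Inverted index from a (lemma, pos) word to the expressions containing it."""
    index = defaultdict(list)
    for expr in expressions:
        for word in expr:
            index[word].append(expr)
    return index

def find_matches(index, question_words):
    """Return expressions all of whose words occur in the question, in any order."""
    qset = set(question_words)
    matches, seen = [], set()
    for word in qset:
        for expr in index.get(word, []):
            key = tuple(expr)
            if key not in seen and set(expr) <= qset:
                seen.add(key)
                matches.append(expr)
    return matches
```
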
          <p>A score is computed for each matched expression, based on the types that
it connects. In the general case, the subgraph connects two words, which might
have corresponding types. The type can be a detected type, the type of a detected
individual or the type of a question word (who - Agent, when - Date, how many
- Integer, etc.). Both types are then compared with the ones from the database
in order to compute the score.</p>
          <p>Considering that the types of the matched resource are (t1, t2) and the types
from the database are (t1′, t2′), the distance between them is computed as the
sum of the distances between t1 and t1′ and between t2 and t2′. The distance
between two types is computed as the number of nodes between them in the
DBpedia taxonomy. There are two distinct taxonomies, one for resource classes
and one for data types; the distance between two classes from different taxonomies
is considered maximal. Based on the computed distance between the pairs of
types, the score of the matching is calculated with formula (1):</p>
          <p>S(e, tp, p) = Σ_tp′ N(e, tp′, p) / (N(e, tp′) · 2^dist(tp, tp′))
(1)
where:
– e = the matched expression
– tp = the type pair (t1, t2)
– p = the property
– tp′ = a type pair for property p and expression e in the database
– N = the number of appearances in the database for the given parameters
Because the same property appears in the database for different type pairs,
the score of the match is summed over all occurrences. Also, any entry with a
distance larger than a threshold is ignored.</p>
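          <p>Under these definitions, the computation of formula (1) can be sketched as follows; the database is mocked as a plain dictionary keyed per expression (for the property under consideration), and the names are illustrative.</p>

```python
def match_score(db, expression, tp, dist, max_dist):
    """Score a matched expression on type pair `tp`, in the spirit of formula (1).

    `db[expression]` maps a stored type pair tp2 to the counts
    (N(e, tp2, p), N(e, tp2)); `dist` gives the taxonomy distance
    between two type pairs.
    """
    score = 0.0
    for tp2, (n_with_p, n_total) in db.get(expression, {}).items():
        d = dist(tp, tp2)
        if d > max_dist:          # entries too far apart in the taxonomy are ignored
            continue
        # each occurrence contributes its probability, halved per unit of distance
        score += n_with_p / (n_total * 2 ** d)
    return score
```
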
          <p>In some cases, the expression in the sentence does not occur between two
types. One such example is the question “How often did Jane Fonda marry?”.
Here, marry has only one connected type, which is Person, because how often
cannot be linked with a type. Another similar case is represented by questions
containing superlative adjectives, like “Which is the highest mountain?”, where
the words that are supposed to be matched to a property have only one neighbour
in the graph. For these cases, results that match only one type are included in
the list of matches, but with a lower score than matches with two types.</p>
          <p>Since the matches contain only sets of vertices from the query graph,
prepositions cannot be captured by this method because they are mapped as
dependencies in the graph. It would be useful to capture such matches too, because they
appear quite often in questions. In order to match edges too, the same scoring
for pairs of types is used, for every edge in the graph that is not included in
another match and that exists in the database.</p>
          <p>Graphs Generation. In the case of individuals and types, the matching process
selects the longest sequence at discovery time and all smaller included sequences
are ignored. However, in the case of properties, the match consists of a subgraph,
so it is more difficult to determine which matches to ignore. All the detected
matches must be distributed to as few graphs as possible, while having as many
non-overlapping matches as possible in each graph.</p>
          <p>
            Determining if two matches can be in the same graph can be easily computed
by checking if their lists of replaced elements are disjoint. By comparing all pairs
of matches, an undirected graph can be constructed with vertices being the
matches, and the existence of an edge between two vertices meaning that the
two matches do not overlap. All maximal cliques in this new graph represent
the sets of matches that should be grouped in the same graph. However,
enumerating all maximal cliques cannot be done in polynomial time in general,
since a graph may contain exponentially many of them. Several algorithms for
this problem exist, the Bron-Kerbosch algorithm [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ] probably being the most
famous one. The Bron-Kerbosch algorithm has a worst-case complexity of O(3^(n/3)),
which matches the maximum possible number of maximal cliques in an n-vertex
graph, 3^(n/3) [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ].
          </p>
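          <p>For illustration, a minimal Bron-Kerbosch variant (without the pivoting optimization) over an adjacency-set representation could look like this; it is a sketch, not the system's actual implementation:</p>

```python
def bron_kerbosch(R, P, X, adj, cliques):
    """Enumerate all maximal cliques.

    R = current clique, P = candidate vertices, X = excluded vertices;
    `adj` maps a vertex to its set of neighbours.
    """
    if not P and not X:
        cliques.append(set(R))   # R cannot be extended: it is maximal
        return
    for v in list(P):
        bron_kerbosch(R | {v}, P & adj[v], X & adj[v], adj, cliques)
        P.remove(v)
        X.add(v)

def maximal_cliques(adj):
    cliques = []
    bron_kerbosch(set(), set(adj), set(), adj, cliques)
    return cliques
```

          <p>On a graph consisting of a triangle {1, 2, 3} with a pendant vertex 4 attached to 3, this returns exactly the two maximal cliques {1, 2, 3} and {3, 4}.</p>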
          <p>The number of properties in a question depends on the number of vertices in
the input graph, but it is generally very unlikely to be more than a couple (e.g.
at most 3-4 properties). This means that the size of the graph on which the
search for maximal cliques is performed is very small in most situations, so
applying the algorithm will not cause any performance issues, although there is no
theoretical upper limit for the size of the graph. In the vast majority of cases,
only one property is detected per input question, so this algorithm is not even
run.</p>
        </sec>
        <sec id="sec-3-1-4">
          <title>Graph Selection</title>
          <p>The three separate detection steps produce more than one graph for complex
questions, but testing every one of them on the ontology would take too much
time. The rest of the steps in the pipeline are more complex because detected
entities must be validated, while missing ones have to be inferred. All these
actions require complex queries and the number of investigated graphs can be
quite high. Also, the final results of each query are difficult to compare so it is
hard to choose the best interpretation.</p>
          <p>For all these reasons, QAnswer selects the highest scoring graph right after
the matching steps and all the following steps will use only this graph. For
each generated graph, a score has to be computed that reflects how well each
match fits the text and how well the matches fit together. The first part is
the sum of the scores for each matched entity. One problem is the fact that
each type of entity uses a different score formula because of the different
matching methods applied. Scores coming from different formulas are
not trivially comparable because of their different value ranges, but also
because some types of entities might be more important than others, depending
on the question.</p>
          <p>In the case of individuals, the only measure we can use is the importance of
the entity in the DBpedia graph, which is computed as described in the previous
sections. A logarithm is applied in order to make this score comparable to those
for the other types. Also, the number of matched words increases the score, as
can be seen in formula (2):</p>
          <p>Sind(I, match) = ln(1 + importance(I)) · nWords(match)
(2)</p>
          <p>The same formula is applied for types, replacing the importance of
an individual with the number of occurrences of the type in our database:
Stype(T, match) = ln(1 + occurrences(T, match)) · nWords(match)
(3)</p>
          <p>The property score is calculated as defined in formula (1). While the value
for each pair of types of a property represents a probability, the sum over all
existing pairs can produce results greater than 1 for good matches, though usually
lower than the scores for individuals or types. This is appropriate because the
certainty of a property match is lower than that of the other entity types. Thus,
we prefer to assign a lower importance to properties.</p>
          <p>The graph score also contains bonuses for complete triples of (individual,
property, individual) or (individual, property, type) that appear in the graph,
because these account for a higher success rate. The final formula for the score is
(4), where the triples are only the connected ones described above.</p>
          <p>S(g) = Σ_{m ∈ matches} S(m) + Σ_{(x, y, z) ∈ triples} S(x) · S(y) · S(z)
(4)</p>
        </sec>
        <sec id="sec-3-1-5">
          <title>SPARQL generation</title>
          <p>The SPARQL generation is the last step in the pipeline. This component
receives a query graph with all the possible entities set on the nodes, and its job
is to generate a valid SPARQL query based on this graph.</p>
          <p>The process of generating the SPARQL query is done recursively from the
root of the graph (the node with no incoming edges) downwards.</p>
          <p>When a node is visited, it has access to the graph, the parent, the current
SPARQL query and the triple constructed at the parent level.</p>
          <p>At each step we decide how to enrich the query based on the
triple currently being constructed, the edge that has just been visited,
and the type of the current node.</p>
          <p>Determining the answer type. In a QA system, the questions asked by the
users can be of multiple types: they can ask for a list of things, for the
number of certain things, for the date of an event, for a location, etc.</p>
          <p>The type of the question is determined by checking for patterns in the input.
By default, we will list all the results of the SPARQL query. The other types of
questions that we are checking for are the ones that need to count the number
of results (e.g. “How many …?”) or the ones that ask us to validate the query
(e.g. “Is Berlin in Germany?”).</p>
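          <p>A rough pattern-based classifier in this spirit is sketched below; the patterns are illustrative and not the system's actual list:</p>

```python
import re

def question_type(question):
    """Classify a question as COUNT, ASK (yes/no) or LIST (the default)."""
    q = question.strip().lower()
    if re.match(r"how many\b", q):
        return "COUNT"          # wrap the query in a COUNT aggregate
    if re.match(r"(is|are|was|were|did|do|does)\b", q):
        return "ASK"            # generate an ASK query for validation
    return "LIST"               # default: list all SELECT results
```
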
          <p>An important part of the query generation process is accurately determining
what the searched variable is. We do this by analyzing the edges
of the graph. Usually the searched element is determined from the “nsubj”
edges in the dependency graph. Among the resources connected by this edge, we
look for those that have a “det” (determiner) dependency attached, which
indicates that the source connected to them is the one being searched for.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>The system was evaluated on the QALD-5 test corpus. Unlike other systems,
ours did not use any training on the data, so any differences between the results
on the training and test data are probably due to the difficulty of the questions
and not the result of overfitting. However, all testing and tuning of the
system’s parameters were performed on the training data, which may have caused
overfitting on this training set. On the other hand, the results on the training
set were slightly worse than those on the test set: about 0.27 global F1 score
on the training set compared to 0.30 global F1 on the test set.</p>
      <p>Table 1 presents the results of all the participating systems in the multilingual
question answering over DBpedia task. Taking into account the global F1 score,
our system obtained the second best result after Xser, which outperformed all the
other systems by far.</p>
      <p>Out of the 50 test questions, QAnswer only tried to answer those questions
marked as “onlydbo” because possible expressions for types and properties were
extracted only for the “dbpedia.org/ontology” ones. The system returns an
answer for each of those questions, because it is hard to determine when an
interpretation is incorrect.
The system might fail to produce the correct answer for several reasons.
Below, we present some of the problems identified after interpreting the results
on the questions from the QALD-5 test dataset.</p>
      <p>– “Which programming languages were influenced by Perl?”
Our system detected all the needed entities in the question, but generated a
query with the wrong direction of the property influenced. This is expected,
since the property matching algorithm does not take the direction into account,
although in this case it is made very clear by the word by.</p>
      <p>– “Who killed John Lennon?”
The question was labeled with onlydbo, but the information needed to answer
the question required the dbpedia.org/property/conviction property. Also, this
question is very difficult to answer because the entity that needed to be detected
was Death_of_John_Lennon, and not the much more probable John_Lennon.</p>
      <p>– “Which artists were born on the same date as Rachel Stevens?”
Although all the required entities were matched, our system is not yet able to
answer questions where a comparison on dates is needed.</p>
      <p>– “Who is the manager of Real Madrid?”
Because of the long extraction process for properties, we have only searched
for those properties that appeared in previous QALD datasets. The idea that
those were the most important properties was not really true, as many new
properties appeared in the training and test set. Those new properties cannot
be detected by our system, so questions containing them, like the one above,
were not interpreted correctly. We shall solve this problem in the next version
of the system.</p>
      <p>– “Give me the currency of China.”
Sometimes, the graph scoring formula selected a wrong interpretation as the
best one. In this case, currency was labeled as a type instead of a property. To
solve problems like this one, other ways of evaluating an interpretation
must be found, or more than one graph should be evaluated.</p>
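The correct interpretation treats currency as a property of the entity China rather than as a type, which would yield a query like the following sketch (assuming dbo:currency):

```sparql
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

SELECT ?currency WHERE { dbr:China dbo:currency ?currency . }
```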
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>We developed QAnswer, a question answering system that uses expressions
extracted from Wikipedia to match entities in DBpedia. The solution can be
generalized to other ontologies, but the property extraction step must be
executed again for the properties of each new ontology. Moreover, if an individual
is not part of DBpedia, new text sources that describe it should be identified,
since this means that it does not have a Wikipedia page either.</p>
      <p>At the time the system was tested on the QALD-5 test dataset, expressions
had been extracted for only a small part of the properties in DBpedia. The process
should be run for other common properties too, so that a more relevant
evaluation of our method can be performed. The extraction process can also be
improved, because a lot of irrelevant information appears in the extracted
expressions, which may affect the results.</p>
      <p>Methods for learning which graph interpretation is more plausible will be
developed and tested in the future. Such methods could help us detect the correct
type of an entity for a phrase, rather than depending too much on graph scoring,
which cannot reflect whether an interpretation makes sense.</p>
      <p>Acknowledgements. This work has been partly funded by the Sectoral
Operational Programme Human Resources Development 2007-2013 of the Romanian
Ministry of European Funds through the Financial Agreements POSDRU/159/1.5/S/132397
and by POSDRU/155420 - PROSCIENCE. Stefan Ruseti has also been awarded
a student travel grant for CLEF 2015.</p>
    </sec>
  </body>
  <back>
  </back>
</article>