A System Description of Natural Language Query over DBpedia*

                         Nitish Aggarwal and Paul Buitelaar

     Unit for Natural Language Processing, Digital Enterprise Research Institute,
                       National University of Ireland, Galway
                   {nitish.aggarwal,paul.buitelaar}@deri.org




        Abstract. This paper describes our system, which is developed as a first
        step towards implementing a methodology for natural language querying
        over semantically structured information (the Semantic Web). This work
        focuses on the interpretation of natural language queries (NL-Queries) to
        facilitate querying over Linked Data. This interpretation includes query
        annotation with Linked Data concepts (classes and instances), a deep
        linguistic analysis and semantic similarity/relatedness to generate poten-
        tial SPARQL queries for a given NL-Query. We evaluate our approach on
        the QALD-2 test dataset and achieve an F1 score of 0.46, an average
        precision of 0.44 and an average recall of 0.48.


Introduction

The rapid growth of Linked Data offers a wealth of semantic data and enables a
more interactive way of accessing the Web. However, Linked Data also brings
several challenges in providing flexible access over the Web for all users.
Structured query languages such as SPARQL provide the capability of accessing
this data, but these languages are restricted to the vocabulary defining the data.
This data should be easily searchable and consumable, allowing casual users to
query it in their native language, similar to the traditional web of documents
accessed through web search engines.

   In order to facilitate NL-Queries over Linked Data, we implemented a basic
pipeline that includes entity annotation, a deep linguistic analysis and semantic
similarity/relatedness. This pipeline is very similar to the system implemented
*
    This work is supported in part by the European Union under Grant No. 248458
    for the Monnet project and by the Science Foundation Ireland under Grant No.
    SFI/08/CE/I1380 (Lion-2).
    Copyright © 2012 by the paper's authors. Copying permitted only for private and
    academic purposes. This volume is published and copyrighted by its editors.
    In: C. Unger, P. Cimiano, V. Lopez, E. Motta, P. Buitelaar, R. Cyganiak (eds.): Pro-
    ceedings of Interacting with Linked Data (ILD 2012), Workshop co-located with the
    9th Extended Semantic Web Conference, Heraklion, Greece, 28-05-2012, published
    at http://ceur-ws.org

by Freitas et al. [1], which is based on a combination of entity search, a
Wikipedia-based semantic relatedness measure (using Explicit Semantic Analysis)
and spreading activation. However, our work additionally focuses on a deep
linguistic analysis and the categorization of an NL-Query. For example, a given
NL-Query such as "Who designed the Brooklyn Bridge?" is first categorized as a
person-type query, and then the verb "designed" is modified to "designer". We
also further investigate the approaches used for computing semantic similarity
and relatedness.
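
   As a rough illustration of this categorization and verb-to-noun modification
step, the following minimal Python sketch maps the leading wh-word to an answer
type and rewrites known verbs to their nominal form. The wh-word table and the
verb mapping below are illustrative assumptions, not the system's actual lexical
resources.

    # Minimal sketch of query-type categorization and verb nominalization.
    # The two mappings are illustrative assumptions, not the system's
    # actual lexical resources.
    WH_TYPE = {"who": "Person", "where": "Place", "when": "Date"}
    VERB_TO_NOUN = {"designed": "designer", "wrote": "writer", "founded": "founder"}

    def categorize(nl_query: str):
        tokens = nl_query.lower().rstrip("?").split()
        # The leading wh-word suggests the expected answer type.
        qtype = WH_TYPE.get(tokens[0], "Thing")
        # Rewrite verbs to their nominalized form where a mapping is known.
        rewritten = [VERB_TO_NOUN.get(t, t) for t in tokens]
        return qtype, " ".join(rewritten)

    print(categorize("Who designed the Brooklyn Bridge?"))
    # -> ('Person', 'who designer the brooklyn bridge')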

Query Interpretation Approach

In our system, the interpretation of an NL-Query is driven by semantic matching
between the Linked Data vocabulary and the terms appearing in the NL-Query, in
order to construct a SPARQL query. A well-interpreted SPARQL query for a given
NL-Query can overcome the semantic gap between user-described queries and
Linked Data vocabularies.

    This interpretation pipeline includes three components, namely query anno-
tation, a deep linguistic analysis and semantic similarity/relatedness, as shown
in Fig. 1. We describe these components below using an example NL-Query over
the DBpedia dataset.

Fig. 1. Query interpretation pipeline for an example NL-Query "Who is the daughter
of Bill Clinton married to?".
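
   For orientation, the sketch below shows the kind of SPARQL query the pipeline
aims to construct for this example NL-Query, executed against the public DBpedia
endpoint with SPARQLWrapper. The concrete property choices (dbo:child,
dbo:spouse) are illustrative assumptions about the generated output rather than
a guaranteed result of the system.

    # A minimal sketch (assumption): one plausible SPARQL query for
    # "Who is the daughter of Bill Clinton married to?", sent to the
    # public DBpedia endpoint.
    from SPARQLWrapper import SPARQLWrapper, JSON

    QUERY = """
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    SELECT DISTINCT ?spouse WHERE {
      dbr:Bill_Clinton dbo:child ?child .
      ?child dbo:spouse ?spouse .
    }
    """

    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery(QUERY)
    sparql.setReturnFormat(JSON)
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["spouse"]["value"])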

Query Annotation The interpretation process starts by identifying the po-
tential entities, i.e. DBpedia instances and classes, present in the NL-Query. For
identifying these entities, we created two separate Lucene indices, one for the
labels and URIs of all DBpedia instances and the other for all DBpedia classes.
Annotating an NL-Query includes the extraction of keywords by removing stop
words and the identification of possible DBpedia classes, followed by the identifi-
cation of DBpedia resources by performing keyword search over both Lucene
indices. After identifying potential resource labels, we perform disambiguation to
recognize the most appropriate DBpedia resource URI, as there are multiple URIs
for the same DBpedia resource label. The disambiguation is performed by re-
trieving wikiPageRedirects URIs: if the recognized URI redirects to another

DBpedia resource URI, e.g. in our system ”Bill Clinton” is identified as URI
”http://dbpedia.org/resource/BillClinton” which redirects to the right URI of
label ”Bill Clinton” i.e. ”http://dbpedia.org/resource/Bill Clinton”.
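
   A minimal, self-contained sketch of this annotation step is given below; an
in-memory dictionary stands in for the Lucene indices and the wikiPageRedirects
lookup, so all labels and mappings here are illustrative assumptions.

    # Minimal sketch of query annotation: stop-word removal, label lookup
    # and redirect resolution. The in-memory maps stand in for the Lucene
    # indices and the wikiPageRedirects lookup (illustrative assumptions).
    STOP_WORDS = {"who", "is", "the", "of", "to", "a", "an"}

    INSTANCE_INDEX = {  # label -> DBpedia resource URI
        "bill clinton": "http://dbpedia.org/resource/BillClinton",
    }
    REDIRECTS = {  # redirecting URI -> canonical URI
        "http://dbpedia.org/resource/BillClinton":
            "http://dbpedia.org/resource/Bill_Clinton",
    }

    def annotate(nl_query: str):
        keywords = [t for t in nl_query.lower().rstrip("?").split()
                    if t not in STOP_WORDS]
        annotations = {}
        # Try the longest spans first so "bill clinton" beats "bill" + "clinton".
        for length in range(len(keywords), 0, -1):
            for start in range(len(keywords) - length + 1):
                phrase = " ".join(keywords[start:start + length])
                uri = INSTANCE_INDEX.get(phrase)
                if uri and phrase not in annotations:
                    annotations[phrase] = REDIRECTS.get(uri, uri)
        return keywords, annotations

    print(annotate("Who is the daughter of Bill Clinton married to?"))
    # keywords:    ['daughter', 'bill', 'clinton', 'married']
    # annotations: {'bill clinton': 'http://dbpedia.org/resource/Bill_Clinton'}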


Linguistic Analysis A deep linguistic analysis is performed by generating a
parse tree and typed dependencies using the Stanford parser. The generated parse
tree provides phrase extraction, allowing phrases to be identified as potential
DBpedia resources or DBpedia classes. For instance, in our example query, the
phrase "Bill Clinton" is identified as a noun phrase. This suggests performing a
Lucene search over the whole phrase "Bill Clinton" rather than separate searches
for "Bill" and "Clinton".

    We convert the given NL-Query into an ordered list of potential terms by
using the typed dependencies generated by the Stanford parser. To create this
ordered list, we first select a central term among all the identified terms, where
the central term is the most plausible term from which to start matching the
given NL-Query to the vocabulary of the DBpedia graph. This selection is
performed by prioritizing DBpedia resources over DBpedia classes. Then, we
retrieve the direct dependents of this central term by following the generated
typed dependencies and add them to the ordered list. We repeat the same
procedure for all other terms in the list. For instance, in our example NL-Query,
the system first identifies "Bill Clinton" as the central term, then "daughter" as
a direct dependent of "Bill Clinton", followed by "married" as a direct dependent
of "daughter", as shown in Fig. 1.
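
   The following self-contained sketch illustrates how such an ordered list can
be built by a breadth-first walk over the typed dependencies, starting from the
central term. The hard-coded dependency pairs and the central-term choice mirror
the example in Fig. 1 and are assumptions about the parser output, not the
system's actual code.

    from collections import deque

    # Undirected dependency links between identified terms for the example
    # NL-Query (an assumption mirroring Fig. 1, not actual parser output).
    DEPENDENCIES = [("Bill Clinton", "daughter"), ("daughter", "married")]

    def ordered_terms(central: str, dependencies):
        # Build an adjacency map so links can be followed in either direction.
        adjacent = {}
        for a, b in dependencies:
            adjacent.setdefault(a, []).append(b)
            adjacent.setdefault(b, []).append(a)
        # Breadth-first walk from the central term, recording visit order.
        order, seen, queue = [], {central}, deque([central])
        while queue:
            term = queue.popleft()
            order.append(term)
            for neighbour in adjacent.get(term, []):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        return order

    # "Bill Clinton" is preferred as the central term (resource over class).
    print(ordered_terms("Bill Clinton", DEPENDENCIES))
    # -> ['Bill Clinton', 'daughter', 'married']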


Semantic Similarity and Relatedness Semantic similarity can be defined
on the basis of the taxonomic (is-a) relations of two concepts, while relatedness
covers a broader range of relations, e.g. meronymy and antonymy. In our problem
space, we want to obtain the best semantic match between the terms appearing
in the NL-Query and the vocabulary of the DBpedia dataset. However, we cannot
rely solely on semantic similarity measures: as our example NL-Query shows,
relatedness can better map the term "married" onto the retrieved property
"spouse", as these terms are semantically related but not semantically similar.

   To find the best semantic match, we are investigating two approaches to
semantic relatedness, i.e. Wikipedia-based Explicit Semantic Analysis (ESA) [2]
and a semantic relatedness measure based on the WordNet structure [3]. Due to
the computational cost involved in computing relatedness with ESA, we are
currently experimenting with WordNet-based measures only.
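
   The WordNet-based measure we use follows Pirró [3]; as a rough stand-in for
illustration, the sketch below uses NLTK's WordNet interface with a simple
maximum path-based similarity over all synset pairs. This simplified score and
the candidate property names are assumptions, not the measure actually used by
the system.

    # Rough stand-in for a WordNet-based relatedness score: the maximum
    # path similarity over all synset pairs of the two terms. The actual
    # system uses the feature/information-content measure of Pirró [3].
    # Requires the NLTK 'wordnet' corpus (nltk.download('wordnet')).
    from nltk.corpus import wordnet as wn

    def relatedness(term1: str, term2: str) -> float:
        best = 0.0
        for s1 in wn.synsets(term1):
            for s2 in wn.synsets(term2):
                score = s1.path_similarity(s2)
                if score is not None and score > best:
                    best = score
        return best

    # Rank candidate DBpedia properties for the query term "married".
    for candidate in ("spouse", "child", "birthPlace"):
        print(candidate, round(relatedness("married", candidate), 3))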


Evaluation

To evaluate our approach, we calculate the average precision, average recall and
F1 score of the results obtained on the QALD-2 test dataset, which includes 100
NL-Queries over DBpedia. The results are shown in Table 1.

Total   Answered   Right   Partially right   Avg. Precision   Avg. Recall    F1
 100       80        32          7                0.44            0.48      0.46

   Table 1. Evaluation on the QALD-2 test dataset of 100 NL-Queries over DBpedia



Conclusion and Future Work

This paper presented a system for natural language querying over Linked Data,
which includes query annotation, a deep linguistic analysis and semantic simi-
larity/relatedness. Currently, our approach does not fully cover all the types of
queries appearing in the dataset, as it contains more challenging and complex
NL-Queries, such as queries requiring SPARQL aggregation and ASK-type queries.
Future work will concentrate on improving the annotation step by better handling
linguistic variations, and on more sophisticated semantic similarity/relatedness
measures that take contextual information into account.



References
1. Freitas, A., Oliveira, J. G., O’Riain, S., Curry, E., Da Silva, J. C. P.: Querying linked
   data using semantic relatedness: a vocabulary independent approach. In: Proc. of
   NLDB’11 (2011)
2. Gabrilovich, E., Markovitch, S.: Computing Semantic Relatedness using Wikipedia-
   based Explicit Semantic Analysis. In: Proc. of IJCAI 2007
3. Pirró, G.: A semantic similarity metric combining features and intrinsic information
   content. Data Knowl. Eng. 68(11) (2009) 1289–1308