<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Mapping Keywords to Linked Data Resources for Automatic Query Expansion</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Isabelle Augenstein</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anna Lisa Gentile</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Barry Norton</string-name>
          <email>barry.norton@ontotext.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ziqi Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabio Ciravegna</string-name>
          <email>f.ciravegnag@dcs.shef.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Sheffield</institution>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Ontotext</institution>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Linked Data is a gigantic, constantly growing and extremely valuable resource, but its usage is still heavily dependent on (i) the familiarity of end users with RDF's graph data model and its query language, SPARQL, and (ii) knowledge about available datasets and their contents. Intelligent keyword search over Linked Data is currently being investigated as a means to overcome these barriers to entry in a number of different approaches, including semantic search engines and the automatic conversion of natural language questions into structured queries. Our work addresses the specific challenge of mapping keywords to Linked Data resources, and proposes a novel method for this task. By exploiting the graph structure within Linked Data we determine which properties between resources are useful to discover, or directly express, semantic similarity. We also propose a novel scoring function to rank results. Experiments on a publicly available dataset show a 17% improvement in Mean Reciprocal Rank over the state of the art.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Linked Open Data grows at an astonishing rate, and is becoming a gigantic data source
that can be harnessed to power various applications. One of the current challenges for
accessing and consuming Linked Data on the Web is dealing with heterogeneity [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ],
i.e. that data can be distributed and duplicated in different data stores, and described by
different vocabularies. As a result, accessing and consuming Linked Data on the Web
requires end users to be aware of where to find the data (which datasets) and how it is
represented (which vocabularies). Additionally, users need to have working knowledge
of formal query languages such as SPARQL to access the datasets. This creates high
barriers to entry for creating applications that use Linked Data.
      </p>
      <p>
        A simple and convenient way for users to express their information needs [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] is the
usage of natural language queries (NLQs). Therefore, we assume that enabling a user to
use natural language to query Linked Data would offer a conducive alternative to formal
query languages. The first step in translating an NLQ to a SPARQL query is to map the keywords
in the query to Linked Data resources. Most current semantic search approaches offer
limited abstraction from the keywords contained in the query, i.e. the mappings are based
merely on keywords and their lexical derivatives ([
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], also [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]).
      </p>
      <p>
        Approaches that just consider variations on a lexicographic level often fail entirely
to map keywords to Linked Data resources [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. More advanced approaches therefore
enrich their query with semantically similar words. For example, by looking up synonyms
in a dictionary, often WordNet ([
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]), but dictionaries in general contain a
very limited vocabulary and even WordNet has very few named entities.
      </p>
      <p>
        Freitas et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] argue that matching natural language terms to resources in Linked
datasets can easily cross taxonomic boundaries and suggest that expanding the query
with a more general semantically similar class of terms produces better results than
either matching keywords on the mere basis of lexicographic similarity or by
WordNet-based expansion. Semantic similarity describes the strength of semantic association
between two words. It includes the classical relations hypernymy, hyponymy, meronymy
and synonymy. Semantic relatedness is more general than similarity and also covers
antonymy and other semantic associations [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        This idea is further supported by recent studies on exploratory query expansion,
where users are presented with not only semantically similar, but also generally related
keywords [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. However, the information about related keywords is usually presented to
the user but not directly integrated into the ranking of results [
        <xref ref-type="bibr" rid="ref16 ref4">4,16</xref>
        ]. Furthermore, existing
approaches are typically based on a single knowledge source such as WordNet [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], or
Wikipedia [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and are bound to the specific structure of the knowledge source assumed
to be known a priori. We argue that such methods are highly restricted by the scope of
the knowledge source and difficult to adapt across domains.
      </p>
      <p>The major contribution of this paper is the basis for a highly generic, adaptable
automatic query expansion approach that (i) automatically finds and ranks URIs of
related resources for keywords, and (ii) does not impose a priori knowledge about
data sources or knowledge bases. Compared to the foregoing work of Freitas et al. our
contribution is (i) to offer a means to benefit from the multitude of properties between
resources in Linked Data, not just the wikiPageWikiLink property representing the
internal Wikipedia links as used in that foregoing work; (ii) to offer a means to automatically
rank the effectiveness of such properties in expressing a useful notion of semantic
similarity; and (iii) to move beyond simply considering Wikipedia/DBpedia and demonstrate
effectiveness across large datasets.</p>
      <p>Our approach hinges on the ability to take a keyword from the query, w, and expand
this into a larger set of keywords, Ew, based on the labellings of ‘neighbours’ in the whole
graph of Linked Open Data. In the current work this expanded keyword set is then used
to find candidates within a fixed target vocabulary (i.e. we use the whole Linking Open
Data Cloud to help find alternative keywords via similar resources, but evaluate these by
using them to find terms in the DBpedia ontology for the purposes of cross-evaluation).
The training phase of our approach applies supervised learning to identify
which relationships (properties) among those available in Linked Open Data express
useful semantic similarities.</p>
      <p>
Compared to the state of the art, this approach is completely independent of the structure
of the underlying datasets and vocabularies. It is therefore highly generic and adaptable
across different, and growing, domains covered by Linked Open Data. It is also highly
flexible, as we show that it can be used to expand keywords based on any generalised
or specialised semantic relations given suitable training data. We report experiments
on the gold standard dataset used in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Results show a 17% improvement in Mean
Reciprocal Rank with respect to figures reported in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>The paper is structured as follows: Section 2 discusses relevant related work,
Section 3 describes our query expansion method, and Section 4 presents an evaluation of
our method on the DBpedia ontology dataset. We discuss future work directions and
conclude in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Related Work</title>
      <p>
        Freitas et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] discuss the challenges and limitations of querying Linked Data. They
identify the main streams of work addressing them, including those based on
Information Retrieval (IR) methods and those based on Natural Language Processing (NLP).
      </p>
      <p>
IR-based approaches provide, at different levels, an exploration of the underlying
datasets in a web-search style, i.e. one which treats documents as bags of words, where
each word is a mere token. The basic idea behind IR methods is to apply keyword search
directly over Linked Data triples. To achieve this goal the triples are tokenised and
indexed according to one of several possible strategies. Watson [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is among the class
of Semantic Search engines that crawl, analyse and (token) index semantic data and
ontologies. Sindice [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] also provides a token index for textual search over Semantic
Web resources, as well as a SPARQL endpoint over the merged graph-indexed data
it has crawled, and allows (cached) retrieval of resources whether found by textual or
graph (SPARQL) matching. The graph processor extracts and indexes all textual property
values — including, but not limited to labels — for each resource in the graph and
tokenises these. In the case that the property attaching a resource to a textual value is
through an inverse functional property, this allows searching for blank nodes (which
do not have an explicit identifier). SearchWebDB [
        <xref ref-type="bibr" rid="ref15 ref16">15,16</xref>
        ] allows users to formulate
queries in terms of keywords, which are translated to structured queries representing
possible interpretations of the user’s information needs. Returned results are the triples
satisfying those queries, ranked with decreasing syntactic similarity from the original
query keywords. The keyword matching process is limited to the literals associated with
each object, but neighbouring resources are also retrieved to help the user select the best
answer. In a subsequent work, semantically similar entries from WordNet are also used
for the keyword matching process [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Falcons Object Search [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] performs the keyword
matching process against labels, literals and textual descriptions of each object and
immediately linked objects. A drawback of the search is the assumption that keywords
in the query directly match labels for resources, although users can iteratively refine
the query with additional keywords. Ranking of results is performed according to text
similarity (tf-idf), weighted by the popularity of each object. The popularity is a score
calculated at index time, as a logarithmic function of the number of documents where
the object is mentioned.
      </p>
      <p>
NLP-based methods attempt to derive a SPARQL query from a natural language
question (NLQ), returning a single answer or the set of results compatible with the
user question. These approaches mainly rely on linguistic resources to expand and
disambiguate keywords, with the drawbacks of coverage and lack of generality in the
relationships considered mentioned earlier. FREyA [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] is a natural language interface for
querying ontologies. It performs syntactic parsing of the NLQ and involves the user in
the translation process if necessary. The key feature of the approach is that the user’s
choices are used for training the system in order to improve its performance over time.
Ginseng [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and NLP-Reduce [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] are two question answering
systems with natural language interfaces. NLP-Reduce allows the user to enter full natural language questions,
sentence fragments, or just keywords. The question is reduced to a set of query terms,
which are matched against a precomputed index of the ontology lexicon, where
stemming and synonym expansion have been performed for all lexicon terms. Ginseng does
not provide a full natural language interface, but rather guides user input with respect to the
considered ontologies, suggesting terms and relations present in the underlying ontologies.
PowerAqua [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] implements a strategy for (i) combining and (ii) ranking answers from
distributed heterogeneous data sources in the context of a multi-ontology Question
Answering task. The result list integrates triples coming from different data stores, ranked
using a multi-level ranking algorithm, which combines three different criteria (popularity,
confidence and semantic interpretation). Two preprocessing steps are performed to
translate the natural language question. A linguistic component extracts relevant terms
from the question; then a mapping component identifies potentially suitable ontologies
for the terms, searching for occurrences of the terms but also of synonyms, hypernyms and
hyponyms.
      </p>
      <p>
        Freitas et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] argue that combining NLP and IR techniques with a richer
exploitation of the graph structure beyond involved entities, would significantly improve
performance of the overall process. They propose a method which allows end users to
query Linked Data using natural language queries. The search is performed in three steps.
First, they recognise key entities in the query (Named Entities and lexicon terms); then
they retrieve potential DBpedia entries matching each entity; finally they retrieve all
relations for each entry and use a spreading activation algorithm to retrieve triple paths
connecting candidate entities. The evaluation is performed against the DBpedia portion
of the QALD evaluation query set³. As a general approach, we are inclined to follow
directions in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], as we will discuss in Section 5, but the aim of this paper is a focused
evaluation of the preprocessing step which translates and expands terms in the NLQ into a
set of Linked Data resources. Although both NLP-centric and IR-based methods
point out the importance of such a preprocessing step, they do not provide an in vitro
evaluation of this task. Keyword expansion in queries has been extensively addressed in
traditional IR settings [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], but it is still mostly untapped in the context of Linked Data,
as mentioned in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. We follow the evaluation proposed by Freitas et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], who created
a test set for this specific task on Linked Data and elaborate on their proposed method,
by providing a richer exploitation of the graph structure around involved concepts.
(³ http://www.sc.cit-ec.uni-bielefeld.de/sites/www.sc.cit-ec.uni-bielefeld.de/files/sharedtask.pdf)
      </p>
    </sec>
    <sec id="sec-2a">
      <title>3 Query Expansion Method</title>
      <p>Table 1. The labelling properties initially chosen: rdfs:label, foaf:name, dc:title,
skos:prefLabel, skos:altLabel, fb:type.object.name.</p>
      <p>At the start of our approach we have a keyword w for which the user wants to find
representative resources in Linked Data. The dataset⁴ we operate on consists of a set of
resources, R, including a set of properties used to denote specific relationships between
resources, P ⊆ R. Literals are a further subset of resources, L ⊆ R, with a lexical
representation, which are disjoint from P (i.e. P ∩ L = ∅). As usual for RDF, the
data consists of a set of statements, S ⊆ (R ∖ L) × P × R, each being a 3-tuple (triple)
consisting of a ‘subject’, a ‘predicate’ and an ‘object’. In the current application, and the
method described here, this is applied in a more focussed setting, where the intention
is to find concepts (classes and properties) in a fixed vocabulary; we call the target
classes TC ⊆ R, where for each c ∈ TC we have (c, rdf:type, rdfs:Class) ∈ S, and target
properties TP ⊆ P, where for each p ∈ TP we have (p, rdf:type, rdf:Property) ∈ S.
      </p>
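<p>The set definitions above can be made concrete with a small sketch. This is a toy illustration, not the paper's dataset: all triples and URIs below are invented for the example.</p>

```python
# A toy sketch of the formal model above: the dataset is a set of
# (subject, predicate, object) statements S; target classes TC and target
# properties TP are picked out via rdf:type. All triples are illustrative.

RDF_TYPE, RDFS_CLASS, RDF_PROPERTY = "rdf:type", "rdfs:Class", "rdf:Property"

S = {
    ("dbpedia-owl:Spacecraft", RDF_TYPE, RDFS_CLASS),
    ("dbpedia-owl:manufacturer", RDF_TYPE, RDF_PROPERTY),
    ("dbpedia:Voyager_1", RDF_TYPE, "dbpedia-owl:Spacecraft"),
    ("dbpedia:Voyager_1", "rdfs:label", "Voyager 1"),
}

# TC = {c | (c, rdf:type, rdfs:Class) ∈ S}
TC = {s for (s, p, o) in S if p == RDF_TYPE and o == RDFS_CLASS}
# TP = {p | (p, rdf:type, rdf:Property) ∈ S}
TP = {s for (s, p, o) in S if p == RDF_TYPE and o == RDF_PROPERTY}
```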
      <p>We have chosen a set of labelling properties, i.e. properties whose values are expected
to be literals and which might be worthwhile in identifying distinct concepts, Labelling ⊆ P,
shown in Table 1. We claim neither that this set is complete nor that each member is useful;
indeed Table 2 shows that the method itself reveals two further properties we had not
considered (in bold) and that, according to the precision metric we explain below, two
that we did include were not useful in the evaluation.</p>
      <p>In order to find representative concepts we construct from w an expanded set of
keywords, Ew, that improves the chance of finding the most fitting concept in the target
vocabulary according to its labelling (under the properties Labelling). These expanded
sets are more general than ‘synsets’ (sets of synonyms within dictionary-oriented terms)
both in scope, including a huge potential range of named entities, and in the flexibility
of the semantic relationships covered. A distinctive feature of our method is that the data
is not presumed or intended to be governed by the targeted vocabulary; rather, the
results improve as the dataset considered increases beyond the target (as we shall show
in the next section, we cross-evaluate against a method choosing terms from the DBpedia
ontology, but use a much larger portion of the Sindice cache to find the expanded set).</p>
      <p>As well as exploiting the labels of resources across the whole dataset, our algorithm
learns a rank over all properties, not just those used for labelling, according to how
useful they are in expressing useful notions of semantic similarity. This is achieved
by a supervised training period, described in Section 3.2, in which the first stages of
the algorithm, forming the expanded set and then using this to find candidates in the
target vocabulary, are followed by a manual choice between candidates. As a result of this
training, we learn a precision for the properties, and rank then subset these to create an
ordered list R̃el ⊆ P that the final part of the algorithm can then use to automatically
distinguish between the fitness of candidates, as described in Section 3.3.
(⁴ N.B. in terms of the Linking Open Data Cloud this merged graph can, and does in our
evaluation, span several individual datasets, in the sense that they are registered at the Data Hub.)</p>
      <sec id="sec-2-1">
        <title>3.1 Candidate Identification</title>
        <p>In the first stage of the algorithm, i.e. lines 1–7 in Algorithm 1, triples in the dataset
where the keyword w is used as the label of a resource, r, are identified, via any labelling
property labelling1 ∈ Labelling. Next the neighbours of r, which are the objects of any
relationship p ∈ Q (where Q ⊆ P is a parameter) from this subject resource, are found.
These neighbours fall into two groups: the first are themselves literals; the second are not
literals, but may have literals attached via a labelling property labelling2 ∈ Labelling.
The resulting set of labels is called an ‘expanded set’, Ew, relative to the original w.</p>
        <p>Algorithm 1 identifyCandidates(w, Q)
1: for all r such that (r, labelling1, w) ∈ S and labelling1 ∈ Labelling do
2:   for all s such that (r, p, s) ∈ S and p ∈ Q do
3:     if s ∈ L then
4:       Ew ← Ew ∪ {s}
5:     else
6:       Ew ← Ew ∪ {lit | (s, labelling2, lit) ∈ S and labelling2 ∈ Labelling}
7:     end if
8:     Candw ← Candw ∪ {(p, findTargetsFromExpandedLabels(Ew))}
9:   end for
10: end for</p>
      </sec>
      <sec id="sec-2-2">
        <title>3.2 Training</title>
        <p>This expanded set is used, for each examined property p, to find a list of terms from
the target ontology by Algorithm 2.</p>
        <p>Algorithm 2 relies on a function tokenise which splits multi-word labels into a set of
words, i.e. tokenise(“programming language”) = {“programming”, “language”}.</p>
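<p>A compact Python sketch of Algorithm 1 together with tokenise may help. Algorithm 2 itself is only described, not reproduced, in the text, so the token-overlap matching criterion standing in for findTargetsFromExpandedLabels, the triple representation, and all data are illustrative assumptions, not the authors' implementation.</p>

```python
import re

# The labelling properties of Table 1 (illustrative constant names).
LABELLING = {"rdfs:label", "foaf:name", "dc:title",
             "skos:prefLabel", "skos:altLabel", "fb:type.object.name"}

def tokenise(label):
    """Split a multi-word label into its word set, as described above."""
    return set(re.findall(r"\w+", label.lower()))

def find_targets(E_w, target_labels):
    """Stand-in for Algorithm 2: return target-vocabulary terms whose
    tokenised label overlaps the tokens of any expanded keyword.
    The overlap criterion is an assumption for illustration."""
    tokens = set()
    for kw in E_w:
        tokens.update(tokenise(kw))
    return frozenset(term for term, lab in target_labels.items()
                     if tokenise(lab).intersection(tokens))

def identify_candidates(w, Q, S, literals, target_labels):
    """Algorithm 1: build the expanded set E_w from the neighbours of
    resources labelled w, pairing each property p in Q with the targets
    found via E_w."""
    cand = set()
    for (r, lab1, obj) in S:                        # line 1
        if lab1 not in LABELLING or obj != w:
            continue
        E_w = set()
        for (r2, p, s) in S:                        # line 2
            if r2 != r or p not in Q:
                continue
            if s in literals:                       # lines 3-4
                E_w.add(s)
            else:                                   # lines 5-6
                E_w.update(lit for (s2, lab2, lit) in S
                           if s2 == s and lab2 in LABELLING)
            cand.add((p, find_targets(E_w, target_labels)))  # line 8
    return cand
```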
        <p>Finally, having obtained the targets according to the given property, Algorithm 1 pairs
that property with the target resources found, in the set Candw ⊆ P × 𝒫(R), or properly
Q × 𝒫(R). At this stage the method branches according to whether we are training, or
running automatically with respect to an existing training period, which has defined a
ranked short-list R̃el ⊆ P of properties that express a useful notion of semantic similarity.</p>
        <p>In order to train, we select a list of keywords to stand for w, with P (i.e. the whole set of
properties in the dataset) standing for Q, in several iterations of the algorithm presented
in Section 3.1. This training set, W⃗train = {w¹train, w²train, …, wⁿtrain}, consists of nouns and
compound nouns in general use; our only constraint in evaluation was that this set is
disjoint from the keywords used in the test set, as described in Section 4.</p>
        <p>The aim of training is to produce a precision measure for every property in Ptrain,
that is, every property exercised in finding candidates across W⃗train, i.e.:</p>
        <p>Ptrain = {p | (p, T) ∈ Candw, where w ∈ W⃗train}</p>
        <p>The first stage is to manually choose the best (not necessarily unique) among the
candidates under each p that occurs in Candw, for each w ∈ W⃗train. We therefore
first generalise Candw to Cand⃗train, that is, an ordered set matching W⃗train such that each
Candⁱtrain is Candw for w = wⁱtrain.</p>
        <p>Based on the manual selection we then also have an ordered set B⃗est such that
Bestw,p ⊆ Tw,p, where (p, Tw,p) ∈ Candw, for each w ∈ W⃗train.</p>
        <p>In order to compute precision for each p, we now define two functions, hits(p, w) and
candidates(p, w), both defined for each p ∈ Ptrain and w ∈ W⃗train, as follows:
hits(p, wⁱtrain) = { t | t ∈ T, where (p, T) ∈ Candⁱtrain, and t ∈ Bestwⁱtrain,p }
candidates(p, wⁱtrain) = { t | t ∈ T, where (p, T) ∈ Candⁱtrain }
We can then apply the following simple calculation:
prec(p) = ( Σ w∈W⃗train |hits(p, w)| ) / ( Σ w∈W⃗train |candidates(p, w)| )</p>
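<p>As a sketch, the precision calculation can be written directly from the definitions above. The bookkeeping structures and all data below are illustrative, not the authors' implementation.</p>

```python
# prec(p): the summed sizes of hits(p, w) over the training keywords,
# divided by the summed sizes of candidates(p, w).

def prec(p, W_train, hits, candidates):
    total_hits = sum(len(hits(p, w)) for w in W_train)
    total_cands = sum(len(candidates(p, w)) for w in W_train)
    return total_hits / total_cands if total_cands else 0.0

# Toy bookkeeping: per (property, keyword), the manually chosen best
# targets and all candidate targets. Illustrative values only.
best = {("ex:related", "spacecraft"): {"dbpedia-owl:Spacecraft"}}
cands = {("ex:related", "spacecraft"): {"dbpedia-owl:Spacecraft",
                                        "dbpedia-owl:closed"}}
hits = lambda p, w: best.get((p, w), set())
candidates = lambda p, w: cands.get((p, w), set())
```

With this toy data, prec("ex:related", ["spacecraft"], hits, candidates) is 0.5: one hit out of two candidates.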
        <p>We then define some threshold θ, 0 ≤ θ ≤ 1, and use prec to define an ordered (ranked)
subset of the properties shown to encode semantic relatedness, R̃el, as the greatest set
satisfying the following definition: reli ∈ R̃el iff reli ∈ Ptrain ∧ prec(reli) &gt; θ, with the
reli ranked in descending order of precision, i.e. there is no j &gt; i with prec(relj) &gt; prec(reli).</p>
        <sec id="sec-2-2-2">
          <title>3.3 Testing</title>
          <p>We now take any test set of keywords, W⃗test, and apply Algorithm 1
to obtain candidates. In this case, however, we take only properties from R̃el to form the
other parameter, Q. As a result we obtain Cand⃗test, matching the definition in Section 3.2,
over properties Ptest, again with a matching definition (note, by design, Ptest ⊆ R̃el).</p>
          <p>In order to judge the fitness of each candidate resource r in Candⁱtest (i.e. where r ∈ R
and (p, R) ∈ Candⁱtest for some p) to the corresponding keyword, wⁱtest, we combine, as
a numerical product, the precision of the property, p, used to find that candidate and a
tf-idf score for r, computed as follows:
tf(r, wⁱtest) = ( Σ p∈R̃el [1 if r ∈ R, 0 if r ∉ R, where (p, R) ∈ Candⁱtest] )
/ ( 1 + Σ v∈W⃗train Σ p∈R̃el [1 if r ∈ R, 0 if r ∉ R, where (p, R) ∈ Candv] )
And the ‘inverse’ (i.e., logarithmic) document frequency as:
idf(r, W⃗train) = log( |W⃗train| / ( 1 + Σ v∈W⃗train [1 if ∃ p, R such that r ∈ R and (p, R) ∈ Candv; 0 otherwise] ) )</p>
        </sec>
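<p>The final stages described above can be sketched as follows: thresholding and ranking the properties by precision, the tf and idf counts, and the product score for a candidate. All numbers, keys, and candidate sets below are illustrative assumptions, not values from the evaluation.</p>

```python
import math

def rank_properties(prec_by_prop, theta):
    """R-tilde-el: properties with prec above theta, in descending
    order of precision."""
    kept = [(v, p) for p, v in prec_by_prop.items() if v > theta]
    return [p for v, p in sorted(kept, reverse=True)]

def tf(r, test_cands, train_cands, rel):
    """test_cands: property -> candidate set for the test keyword;
    train_cands: training keyword -> (property -> candidate set)."""
    num = sum(1 for p in rel if r in test_cands.get(p, set()))
    den = 1 + sum(1 for by_p in train_cands.values()
                  for p in rel if r in by_p.get(p, set()))
    return num / den

def idf(r, train_cands):
    """Logarithmic inverse frequency of r over the training keywords."""
    n = 1 + sum(1 for by_p in train_cands.values()
                if any(r in s for s in by_p.values()))
    return math.log(len(train_cands) / n)

def score(r, p, prec_by_prop, test_cands, train_cands, rel):
    """Candidate fitness: the numerical product prec(p) * tf * idf."""
    return (prec_by_prop[p] * tf(r, test_cands, train_cands, rel)
            * idf(r, train_cands))
```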
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4 Evaluation</title>
      <p>For evaluation purposes we used the DBpedia ontology as the target vocabulary (i.e.
TC were DBpedia ontology classes and TP were DBpedia ontology properties), and
a dataset from the Sindice cache (therefore S consists of many statements from datasets
registered at the Data Hub⁵, as shown in the Linking Open Data Cloud, together with
further semantic data found by crawling).</p>
      <sec id="sec-3-1">
        <title>4.1 Gold Standard and Evaluation Metric</title>
        <p>
          We derived a gold standard, containing 178 keywords of which 134 have at least one
representation in the DBpedia ontology, from the experiment described in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. When
examining this paper’s set up and results, we detected some minor errors which we have
corrected in our cross-evaluation. In re-evaluating the approach used by Freitas et al.
[
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] the actual figures for their approach did not change significantly after our
corrections (0.6 vs. 0.64 in the original evaluation).
        </p>
        <p>The first class of correction actually rates Freitas et al.’s results higher than their own
evaluation: the keyword beatles was marked as not available in the DBpedia ontology
(N/A), although their results list contained the DBpedia class dbpedia-owl:Artist, which
we score as a hit, crediting them for this in all such cases. Another group of errors
concerns keywords that are marked as N/A, where the correct result is not contained in
the result set but actually available in DBpedia (e.g. dbpedia-owl:manufacturer and
dbpedia-owl:Automobile for the keyword honda).
(⁵ http://datahub.io/)</p>
        <p>Table 2. Properties revealed in training, projected over labelling properties (the bold
marking of newly revealed properties in the original was lost in extraction): foaf:name,
fb-common:topic, rdfs:label, dc:subject, dc:title, sindice:label, rdfs:suBClassOf,
skos:prefLabel, fb-type:object, wn20schema:derivationallyRelated,
wn20schema:containsWordSense, commontag:label.</p>
        <p>Additionally, some results were marked as correct matches although on closer
inspection there are better candidates; for example, for the keyword feline, the match
dbpedia-owl:genus was accepted, although a less general result is dbpedia-owl:Mammal.</p>
        <p>Finally, there were some inconsistencies in Freitas et al.’s self-evaluation between
the result table and their downloadable gold standard (i.e., the correct answer
was marked in the gold standard, but marked as N/A in the result table).</p>
        <p>The updated dataset and the results of all evaluations are available for download⁶.</p>
        <p>
          Freitas et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] evaluate query expansion as an information retrieval task, where
expansion candidates are ranked according to their relevance. The measure used for
evaluation is Mean Reciprocal Rank (MRR) which measures the quality of the ranking
by calculating the inverse rank of the best result. In this study, we follow the same
approach. That is, we apply MRR to the resulting ranked list of candidates given by our
algorithm.
        </p>
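<p>Mean Reciprocal Rank as used in this evaluation can be sketched as follows; the toy queries and gold answers are illustrative, not drawn from the gold standard.</p>

```python
# MRR: the mean, over queries, of 1/rank of the first correct result
# (contributing 0 when no correct result appears in the ranked list).

def mean_reciprocal_rank(ranked_lists, gold):
    total = 0.0
    for query, results in ranked_lists.items():
        for i, r in enumerate(results, start=1):
            if r in gold[query]:
                total += 1.0 / i
                break
    return total / len(ranked_lists)
```

For example, a query answered at rank 1 and another at rank 2 give an MRR of (1 + 1/2) / 2 = 0.75.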
      </sec>
      <sec id="sec-3-2">
        <title>4.2 Approach</title>
        <p>
          Since our approach requires a supervised training phase, we manually created a training
set following the same principles outlined in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. The training set contains 40 keywords.
After applying the training phase, 194 candidate relations are learnt. We used a precision
threshold of 0.045 to cut o the candidate relation list. This resulted in 23 relations
which can be found in Table 3.
        </p>
        <p>As well as producing, when projected over labelling properties (those with plain
literal values), an interesting comparison (Table 2) with our original set of labelling
properties (Table 1), as observed earlier, this list also brings to light a common but useful
error in the Sindice-crawled data, the misspelling suBClassOf, which turns out to have
higher precision (0.168) than the correct predicate subClassOf (0.0656).
(⁶ http://oak.dcs.shef.ac.uk/LODIE/know-a-lod-2013)</p>
        <p>Table 4. Example keyword queries and their ranked candidate lists (the bold marking
of correct results in the original was lost in extraction):
Query [spacecraft]: dbpedia-owl:Spacecraft, dbpedia-owl:spacecraft, dbpedia-owl:satellite,
dbpedia-owl:missions, dbpedia-owl:launches, dbpedia-owl:closed, dbpedia-owl:vehicle,
dbpedia-owl:Rocket.
Query [bass]: dbpedia-owl:Fish, dbpedia-owl:Instrument, dbpedia-owl:instrument,
dbpedia-owl:voice, dbpedia-owl:partner, dbpedia-owl:note, dbpedia-owl:Musical,
dbpedia-owl:lowest.
Query [engine]: dbpedia-owl:engine, dbpedia-owl:gameEngine, dbpedia-owl:Artwork,
dbpedia-owl:AutomobileEngine, dbpedia-owl:Locomotive, dbpedia-owl:fuel,
dbpedia-owl:added, dbpedia-owl:Locomotive.
Query [wife]: dbpedia-owl:spouse, dbpedia-owl:Criminal, dbpedia-owl:person,
dbpedia-owl:status, dbpedia-owl:education, dbpedia-owl:Language, dbpedia-owl:sex,
dbpedia-owl:family.
Query [factory] (query label lost in extraction; inferred from the discussion of
factory - dbpedia-owl:plant below): dbpedia-owl:manufacturer, dbpedia-owl:plant,
dbpedia-owl:Canal, dbpedia-owl:engine, dbpedia-owl:class, dbpedia-owl:Album,
dbpedia-owl:product, dbpedia-owl:assembly.
Query [honda]: dbpedia-owl:manufacturer, dbpedia-owl:discovered, dbpedia-owl:Asteroid,
dbpedia-owl:engine, dbpedia-owl:season, dbpedia-owl:vehicle, dbpedia-owl:Automobile,
dbpedia-owl:Asteroid.</p>
      </sec>
      <sec id="sec-3-3">
        <title>4.3 Results</title>
        <p>In order to illustrate the output of our method, Table 4 lists keyword queries and
their ranked lists. The correct results chosen in the evaluation are marked,
as in the gold standard, in bold. These example queries illustrate that our approach
is able to find resources for expanded keywords based on different kinds of
semantic relationships. These include identity (e.g. spacecraft - dbpedia-owl:spacecraft),
hyponymy (e.g. engine - dbpedia-owl:AutomobileEngine), hypernymy (e.g. wife -
dbpedia-owl:spouse, dbpedia-owl:person, dbpedia-owl:family), meronymy (e.g. honda
- dbpedia-owl:engine), and synonymy (e.g. factory - dbpedia-owl:plant). Our approach
is also able to distinguish different senses for ambiguous keywords (e.g. bass -
dbpedia-owl:Fish, dbpedia-owl:Instrument, dbpedia-owl:instrument). The authors leave
it to the reader’s imagination to explain why the class Criminal is ranked so highly in
the matches for the keyword “wife”.</p>
        <p>
          The results of our evaluation are shown in Tables 5 and 6. Our MRR (LOD Keyword
Expansion) is 0.77, which means most of the best results are located at the first position.
We can further show a 17% improvement in MRR in comparison with the Wikipedia-based
approach (ESA) that Freitas et al. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] followed. Overall, LOD Keyword Expansion is
able to answer 90% of the keyword queries (only taking into account keywords that are
represented in DBpedia), which is 3% more than ESA. We also compare our approach
to a simple String match baseline and a String match baseline enriched with WordNet
synonyms (String match + WordNet) [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Our approach is able to answer twice as many
keyword queries as String match and 38% more than String match + WordNet.
        </p>
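<p>For reference, Mean Reciprocal Rank averages, over all queries, the reciprocal of the (1-based) rank of the first correct result; a minimal sketch with hypothetical toy data, not the evaluation dataset:</p>

```python
def mean_reciprocal_rank(ranked_results, gold):
    """MRR: for each query, take 1/rank of the first correct
    expansion (0 if none is correct), then average over queries."""
    total = 0.0
    for query, results in ranked_results.items():
        for rank, resource in enumerate(results, start=1):
            if resource in gold.get(query, set()):
                total += 1.0 / rank
                break
    return total / len(ranked_results)

# Hypothetical toy data for illustration only:
ranked = {
    "wife": ["dbpedia-owl:spouse", "dbpedia-owl:person"],
    "honda": ["dbpedia-owl:engine", "dbpedia-owl:Automobile"],
}
gold = {
    "wife": {"dbpedia-owl:spouse"},
    "honda": {"dbpedia-owl:Automobile"},
}
print(mean_reciprocal_rank(ranked, gold))  # (1/1 + 1/2) / 2 = 0.75
```

<p>An MRR of 0.77 thus indicates that the first correct result typically sits at or very near rank 1.</p>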
        <p>[Table 5: MRR per model (ESA, LOD Keyword Expansion).]</p>
        <p>[Table 6: Recall per model (String match, String match + WordNet, ESA, LOD Keyword Expansion).]</p>
        <p>
          In this paper, we have described a method for automatic query expansion for Linked
Data resources which is based on using properties between resources within the Linked
Open Data cloud. In our evaluation, we examined how useful these different properties
are for finding semantic similarities and thereby finding alternative (expanded) keywords,
and applied our method to the DBpedia ontology as a target. Compared to the state of
the art [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], we show an improvement of 17% in Mean Reciprocal Rank.
        </p>
        <p>
          The approach described currently bases query expansion on semantic similarity.
Related work discussed above [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] follows a multi-strategy approach; first using string
similarity then, if the desired result is not achieved, lexical expansion with WordNet,
and finally, where these simpler methods have failed, using ESA (as in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]). They show
that the best result is achieved by combining all these approaches. We would speculate,
therefore, that our results could be improved upon with such a hybrid approach, bringing to
bear our general means for semantic similarity only after the lower-hanging fruit has
been picked off.
        </p>
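<p>Such a multi-strategy cascade amounts to a fallback chain over matchers of increasing cost; the sketch below uses hypothetical stand-in matchers (the third layer is where our graph-based expansion would sit):</p>

```python
def cascade_match(keyword, strategies):
    """Try each matching strategy in order of increasing cost and
    return the first non-empty result set (a fallback chain)."""
    for strategy in strategies:
        results = strategy(keyword)
        if results:
            return results
    return set()

# Hypothetical stand-ins for the three layers.
def string_match(kw):
    index = {"spouse": {"dbpedia-owl:spouse"}}
    return index.get(kw, set())

def wordnet_match(kw):
    synonyms = {"wife": {"dbpedia-owl:spouse"}}
    return synonyms.get(kw, set())

def semantic_match(kw):
    graph_expansion = {"wife": {"dbpedia-owl:person"}}
    return graph_expansion.get(kw, set())

LAYERS = [string_match, wordnet_match, semantic_match]
```

<p>Here cascade_match("spouse", LAYERS) is resolved by the cheap string layer, while cascade_match("wife", LAYERS) falls through to the WordNet layer, so the expensive semantic layer runs only when the simpler methods fail.</p>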
        <p>Future work will also apply some of the feedback gained from the approach documented
here. Firstly, as shown in Table 1 versus Table 2, our set of labelling properties could be
adjusted, extending the reach of the method. Secondly, once this set is fixed we could
fine-tune the algorithm by learning over the set of properties that express semantic
similarity separately from the labelling properties (which are currently discarded from
Algorithm 2).</p>
        <p>
          While this work has focused on evaluating the keyword expansion process per se,
we intend to perform an in vivo evaluation of the task, in the form of a semantic search
evaluation. One way we foresee to privilege one concept over another in
simultaneous matching is to exploit Encyclopaedic Knowledge Patterns (EKPs) [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ],
drawing not only on the available relations between candidate concepts, but also on
statistical evidence about which connections are actually used in instance data.
        </p>
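<p>One possible shape of such EKP-based disambiguation is to score each candidate concept by how often its relations are actually instantiated in the data; the counts and names below are purely illustrative:</p>

```python
# Hypothetical instance-usage counts of (concept, property) pairs,
# i.e. the kind of statistical evidence an EKP would supply.
INSTANCE_COUNTS = {
    ("dbpedia-owl:Automobile", "dbpedia-owl:manufacturer"): 12000,
    ("dbpedia-owl:Asteroid", "dbpedia-owl:discovered"): 300,
}

def ekp_score(candidate, context_props):
    # Sum how often the candidate concept actually uses the
    # properties contributed by the query's other matches.
    return sum(INSTANCE_COUNTS.get((candidate, p), 0) for p in context_props)

def disambiguate(candidates, context_props):
    # Privilege the candidate best supported by instance data.
    return max(candidates, key=lambda c: ekp_score(c, context_props))
```

<p>For a query like [honda], a context containing dbpedia-owl:manufacturer would privilege dbpedia-owl:Automobile over dbpedia-owl:Asteroid, whereas dbpedia-owl:discovered would do the opposite.</p>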
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bernstein</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaufmann</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaiser</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Querying the semantic web with ginseng: A guided input natural language search engine</article-title>
          .
          <source>In: 15th Workshop on Information Technologies and Systems</source>
          , Las Vegas, NV. pp.
          <fpage>112</fpage>
          -
          <lpage>126</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Budanitsky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hirst</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Evaluating WordNet-based Measures of Lexical Semantic Relatedness</article-title>
          .
          <source>Journal of Computational Linguistics</source>
          <volume>32</volume>
          (
          <issue>1</issue>
          ),
          <fpage>13</fpage>
          -
          <lpage>47</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Carpineto</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Romano</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>A survey of automatic query expansion in information retrieval</article-title>
          .
          <source>ACM Comput. Surv.</source>
          <volume>44</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1</fpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Cheng</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ge</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Falcons: searching and browsing entities on the semantic web</article-title>
          .
          <source>In: Proceedings of the 17th international conference on World Wide Web</source>
          . pp.
          <fpage>1101</fpage>
          -
          <lpage>1102</lpage>
          . ACM (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Damljanovic</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agatonovic</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cunningham</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>FREyA: An interactive way of querying linked data using natural language</article-title>
          .
          <source>In: The Semantic Web: ESWC 2011 Workshops</source>
          . pp.
          <fpage>125</fpage>
          -
          <lpage>138</lpage>
          . Springer (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>d'Aquin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baldassarre</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gridinoc</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sabou</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Angeletou</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Motta</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Watson: Supporting next generation semantic web applications</article-title>
          .
          <source>In: Proceedings of the WWW/Internet Conference</source>
          , Vila Real, Spain (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Freitas</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Curry</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oliveira</surname>
            ,
            <given-names>J.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>O'Riain</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>A distributional structured semantic space for querying rdf graph data</article-title>
          .
          <source>International Journal of Semantic Computing</source>
          <volume>5</volume>
          (
          <issue>04</issue>
          ),
          <fpage>433</fpage>
          -
          <lpage>462</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Freitas</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Curry</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oliveira</surname>
            ,
            <given-names>J.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>O'Riain</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Querying heterogeneous datasets on the linked data web: Challenges, approaches, and trends</article-title>
          .
          <source>Internet Computing, IEEE</source>
          <volume>16</volume>
          (
          <issue>1</issue>
          ),
          <fpage>24</fpage>
          -
          <lpage>33</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Freitas</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Curry</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>O'Riain</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>A distributional approach for terminological semantic search on the linked data web</article-title>
          .
          <source>In: Proceedings of the 27th Annual ACM Symposium on Applied Computing</source>
          . pp.
          <fpage>384</fpage>
          -
          <lpage>391</lpage>
          . ACM (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Freitas</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oliveira</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>O'Riain</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Curry</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pereira da Silva</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Querying linked data using semantic relatedness: A vocabulary independent approach</article-title>
          .
          <source>Natural Language Processing and Information Systems</source>
          pp.
          <fpage>40</fpage>
          -
          <lpage>51</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Kaufmann</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bernstein</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fischer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>NLP-Reduce: A “naïve” but domain-independent natural language interface for querying ontologies</article-title>
          .
          <source>In: 4th European Semantic Web Conference</source>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Lopez</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nikolov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fernandez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sabou</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uren</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Motta</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Merging and ranking answers in the semantic web: The wisdom of crowds</article-title>
          .
          <source>The Semantic Web</source>
          pp.
          <fpage>135</fpage>
          -
          <lpage>152</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Nuzzolese</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gangemi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Presutti</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ciancarini</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Encyclopedic knowledge patterns from Wikipedia links</article-title>
          .
          <source>The Semantic Web-ISWC</source>
          <year>2011</year>
          pp.
          <fpage>520</fpage>
          -
          <lpage>536</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Oren</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Delbru</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Catasta</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cyganiak</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stenzhorn</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tummarello</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Sindice.com: a document-oriented lookup index for open linked data</article-title>
          .
          <source>International Journal of Metadata, Semantics and Ontologies</source>
          <volume>3</volume>
          (
          <issue>1</issue>
          ),
          <fpage>37</fpage>
          -
          <lpage>52</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Tran</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haase</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Searchwebdb: Data web search on a pay-as-you-go integration infrastructure</article-title>
          .
          <source>Technical report, University of Karlsruhe</source>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Tran</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rudolph</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cimiano</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Top-k exploration of query candidates for efficient keyword search on graph-shaped (RDF) data</article-title>
          .
          <source>In: Data Engineering, 2009. ICDE'09. IEEE 25th International Conference on</source>
          . pp.
          <fpage>405</fpage>
          -
          <lpage>416</lpage>
          . IEEE (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Walter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Unger</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cimiano</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bär</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Evaluation of a layered approach to question answering over linked data</article-title>
          .
          <source>In: The Semantic Web-ISWC</source>
          <year>2012</year>
          , pp.
          <fpage>362</fpage>
          -
          <lpage>374</lpage>
          . Springer (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tran</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haase</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Penin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Searchwebdb: Searching the billion triples</article-title>
          .
          <source>In: Billion Triple Challenge at the International Semantic Web Conference</source>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>