<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using Patterns for Keyword Search in RDF Graphs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sylvaine Nugier EDF R</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chatou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>France sylvaine.nugier@edf.fr</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Hanane Ouksili DAVID lab.</institution>
          ,
          <addr-line>Univ. Versailles</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Stéphane Lopes DAVID lab.</institution>
          ,
          <addr-line>Univ. Versailles</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Zoubida Kedad DAVID lab.</institution>
          ,
          <addr-line>Univ. Versailles</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>An increasing number of RDF datasets are available on the Web. Querying RDF data requires the knowledge of a query language such as SPARQL; it also requires some information describing the content of these datasets. The goal of our work is to facilitate the querying of RDF datasets, and we present an approach for enabling users to search in RDF data using keywords. We introduce the notion of pattern to integrate external knowledge in the search process, which increases the quality of the results.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>A huge volume of RDF datasets is available on the Web,
enabling the design of novel intelligent applications. In order
to query these datasets, users must have information about
their schema to target the relevant resources and properties.
They must also be familiar with a formal query language
such as SPARQL. An alternative approach to extract
information from RDF datasets is keyword search, which is a
straightforward and intuitive way of querying these datasets.</p>
      <p>To illustrate our motivation, let us consider the RDF
graph illustrated in Figure 1, containing data from the AIFB
institute, Karlsruhe University (http://km.aifb.uni-karlsruhe.de/ws/eon2006/ontoeval.zip).
This work was supported by Electricity of France (EDF),
STEP department and the French ANR project CAIR.
2017, Copyright is with the authors. Published in the Workshop
Proceedings of the EDBT/ICDT 2017 Joint Conference (March 21, 2017, Venice,
Italy) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is
permitted under the terms of the Creative Commons license CC-by-nc-nd
4.0.</p>
      <p>Let "Studer Cooperator" be the keyword query which a
user issues to find the scientists who have collaborated with
a scientist named "Studer". To answer this query, existing
approaches typically use a keyword-to-graph mapping
function to identify a set of elements that contain the query
keywords. Then, a connected "minimal" subgraph which
contains the identified elements is returned as an answer. Using
such a mapping function on the example, two elements will be
identified: the resource aifb:Rudi_Studer and the property
swrc:cooperateWith. The only result will be the scientist
aifb:Daniel_Deutch, who is linked to aifb:Rudi_Studer
by the property swrc:cooperateWith. However, aifb:Victor_Vianu
is also his collaborator: indeed, they authored the
same paper, as shown in the graph. Some relevant answers
will never be returned using the above process because the
existing approaches use only the structure of the graph and its
linguistic content. Moreover, the data is often
incomplete, and property values are not necessarily present for
all resources.</p>
      <p>In this paper, we introduce a novel keyword search
approach which returns graphs as a result to a keyword query.
The key contribution of this paper is the use of external
knowledge, expressed as patterns, during the extraction of
the relevant graph fragments, the construction of the result
and the ranking step. Experiments show that our approach
delivers high-quality search results.</p>
      <p>The rest of this paper is organized as follows. The concept
of pattern is defined in Section 2. We detail our solution for
keyword search integrating patterns in Section 3. Section
4 reports the experimental results. Section 5 reviews the
related work. Finally, we conclude the paper and present
some future work in Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>EXPRESSING KNOWLEDGE USING PAT</title>
    </sec>
    <sec id="sec-3">
      <title>TERNS</title>
      <p>The keywords used to query a dataset cannot always be
mapped to a graph element because the corresponding
information might be missing from the dataset. If we take
into account some external knowledge available for the
considered domain, equivalence relations can be found between
the terms used in the query and the elements of a dataset.</p>
      <p>For example, in the AIFB dataset, the
swrc:cooperateWith property between two researchers ri and rj
states that they have already collaborated. If this property
is missing for a pair of researchers, but we know that they
wrote the same paper, we can infer that they have
collaborated even though the swrc:cooperateWith property is absent.
In order to capture these equivalence relations, we introduce
the notion of pattern, defined below.</p>
      <p>The definition of patterns relies on the definitions of
property expression and path expression, which are presented
hereafter.</p>
      <p>A property expression, denoted exp, is a triple (X, p, Y),
where X and Y are either resources, literals or variables and
p is a property. For example, (X, swrc:isAbout, Database)
is a property expression that represents triples having
Database as object and swrc:isAbout as predicate.</p>
      <p>A path expression describes a relation (possibly indirect)
between resources. The relation is a path (a sequence of
properties) between two resources. Note that a property
expression is a special case of path expression where the
path length is 1. More formally, a path expression, denoted
exP, is defined as a triple (X, P, Y), where X and Y are
either resources, literals or variables and P is a SPARQL 1.1
property path (http://www.w3.org/TR/sparql11-property-paths/). For example,</p>
      <p>(aifb:D.B., (owl:sameAs|^owl:sameAs)+, Y)
represents all the resources related to aifb:D.B. by a
sequence of properties which are either owl:sameAs or its
inverse.</p>
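      <p>The semantics of such a path expression can be sketched in a few lines of code. The following is a minimal, illustrative Python evaluation of the (owl:sameAs|^owl:sameAs)+ path over an in-memory set of triples; the toy triples and function names are our own assumptions, and a real implementation would delegate path evaluation to a SPARQL 1.1 engine.</p>

```python
from collections import deque

# Hypothetical toy triples; only the owl:sameAs edges matter here.
triples = {
    ("aifb:D.B.", "owl:sameAs", "aifb:Database"),
    ("aifb:Databases", "owl:sameAs", "aifb:Database"),
}

def same_as_closure(start, triples):
    """Resources reachable from `start` via owl:sameAs or its inverse,
    i.e. the bindings of Y in (start, (owl:sameAs|^owl:sameAs)+, Y)."""
    # Treat owl:sameAs edges as undirected: follow both s->o and o->s.
    adj = {}
    for s, p, o in triples:
        if p == "owl:sameAs":
            adj.setdefault(s, set()).add(o)
            adj.setdefault(o, set()).add(s)
    reached, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt != start and nxt not in reached:
                reached.add(nxt)
                queue.append(nxt)
    return reached

print(same_as_closure("aifb:D.B.", triples))
# both aifb:Database (direct edge) and aifb:Databases (via the inverse edge)
```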
      <p>A pattern is a pair [exp, exP] where exp is a property
expression and exP is a path expression. It represents the
equivalence between the two expressions, one property
expression and one path expression.</p>
      <p>As an example, consider the RDF graph of Figure 1;
the two resources aifb:D.B. and aifb:Database are related
by the property owl:sameAs, which indicates that they are
equivalent. The value of the property swrc:isAbout for the
resource aifb:Art1 is aifb:D.B.; therefore, this property also
has the value aifb:Database. To express this knowledge,
the following pattern is defined:
[(X, swrc:isAbout, Y), (X, swrc:isAbout/(owl:sameAs|^owl:sameAs)+, Y)]
This pattern represents the fact that the property
swrc:isAbout is equivalent to a path composed of a property
swrc:isAbout followed by a sequence of owl:sameAs properties.</p>
      <p>Patterns can be used to capture external knowledge which
can be user-specific, domain-specific or generic. Generic
patterns are valid for any dataset as they are related to the
RDF, RDFS or OWL vocabularies.</p>
    </sec>
    <sec id="sec-4">
      <title>Pattern Evaluation</title>
      <p>Patterns are evaluated on the considered RDF graph to
find the set of triples which satisfy the description given in the
path expression. This set of triples is stored and used
during keyword search.</p>
      <p>Let us consider the subgraph represented in Figure 1 and
the following pattern, which indicates that if two scientists
have authored the same paper, then they have collaborated:
[(X, swrc:cooperateWith, Y), (X, swrc:publication/^swrc:publication, Y)]
The evaluation of this pattern on the subgraph of Figure 1
results in the following triple:
{ &lt;aifb:Rudi_Studer, swrc:cooperateWith, aifb:Victor_Vianu&gt; }
We have introduced other properties to express other kinds
of knowledge. For example, to express that a resource ri
contains some information which complements the one provided
by another resource rj, we have introduced a new property
called pattern:sameResult. Consider the case of two
resources related by a sequence of owl:sameAs properties. We
can state that they contain complementary information as
they represent in fact the same object. This is captured by
the following pattern:
[(X, pattern:sameResult, Y), (X, (owl:sameAs|^owl:sameAs)+, Y)]
The result of its evaluation on the graph of Figure 1 is:
{ &lt;aifb:D.B., pattern:sameResult, aifb:Database&gt; }</p>
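      <p>The evaluation of the collaboration pattern above can be sketched as follows; this is an illustrative Python fragment over hypothetical in-memory triples, not the system's actual implementation.</p>

```python
# Toy triples assumed for illustration: two researchers share a paper.
triples = {
    ("aifb:Rudi_Studer", "swrc:publication", "aifb:Art1"),
    ("aifb:Victor_Vianu", "swrc:publication", "aifb:Art1"),
    ("aifb:Rudi_Studer", "swrc:cooperateWith", "aifb:Daniel_Deutch"),
}

def evaluate_cooperation_pattern(triples):
    """Triples satisfying the path swrc:publication/^swrc:publication."""
    results = set()
    pubs = [(s, o) for s, p, o in triples if p == "swrc:publication"]
    for s1, paper1 in pubs:
        for s2, paper2 in pubs:
            # A forward edge followed by an inverse edge: same paper,
            # distinct authors, hence an inferred collaboration.
            if paper1 == paper2 and s1 != s2:
                results.add((s1, "swrc:cooperateWith", s2))
    return results

store = evaluate_cooperation_pattern(triples)
print(("aifb:Rudi_Studer", "swrc:cooperateWith", "aifb:Victor_Vianu") in store)  # True
```

      <p>The resulting triples would be stored in the pattern store and looked up during keyword search.</p>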
    </sec>
    <sec id="sec-5">
      <title>3. KEYWORD SEARCH USING PATTERNS</title>
      <p>In this section, we provide an overview of our approach,
presented in Figure 2.</p>
      <p>The general framework contains four main components:
(a) Indexation, which is performed independently from the
query, pre-computes some information to speed up query
processing; (b) Fragment Extraction uses the index and a
mapping function to identify a set of elements (either nodes or
edges) that contain the query keywords; an expansion is
also applied in order to identify the relevant fragments of
the graph according to the external knowledge formalized as
patterns, which are stored in the pattern store; (c) Fragment
Aggregation merges the fragments to form a subgraph;
finally, (d) Result Ranking returns a list of ranked subgraphs
representing the results. In the remainder of this section,
each component will be detailed and the underlying
algorithms will be presented.</p>
      <p>An inverted index is used to improve the efficiency of the
relevant fragment extraction phase. Our index contains a
set of documents, each one representing a graph element
(resource, literal or property). The document structure is
illustrated in Table 1.</p>
      <p>The Element field contains the element of the graph which
is represented, that is, a URI for resources and properties or
a literal. Type indicates the type of the element: it can be a
class, an instance, a literal, or a property. Content contains
the text that was extracted from the graph element. It
represents different parts of the element according to its type.
To obtain the textual content of each document, we consider
a representative keyword from the local name of URIs for
resources and properties as document content; for example,
a document is created for the property swrc:hasAuthor,
containing the word Author (which can be generated using
Information Extraction techniques). For literals, we consider
their content as a document. Finally, Fragment represents
the part of the graph which is relevant to the keyword; it is
a subgraph that will be used to construct the result to be
returned to the user. Instantiating this last field differs from
one type of element to another. For all resources, we consider
the corresponding node and we save its URI. In the case of a
literal, the triple &lt;ri, pi, l&gt; is saved, where l is the
considered literal and ri is the resource that it describes. For
properties, one document is created for each triple &lt;ri, pr, rj&gt;,
where pr is the considered property. This triple will be the
fragment corresponding to the document. In Table 1, the
first document represents the literal "MULOProject", the
second represents the class swrc:Project and the others
represent the property swrc:worksAtProject.</p>
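      <p>As an illustration, the document construction described above can be sketched as follows. The helper names and toy triples are assumptions; in particular, this sketch simply takes the local name of a URI as its content, whereas the approach may extract a more representative keyword using Information Extraction techniques.</p>

```python
import re

def local_name(uri):
    """Last segment of a URI or CURIE, split on ':', '/' or '#'."""
    return re.split(r"[:/#]", uri)[-1]

def build_documents(triples, literals, classes):
    """Build one index document per graph element, as in Table 1."""
    docs, seen = [], set()
    for s, p, o in triples:
        # Properties: one document per occurrence; the triple itself
        # is the fragment to return.
        docs.append({"element": p, "type": "property",
                     "content": local_name(p), "fragment": (s, p, o)})
        if o in literals:
            # Literals: the triple <ri, pi, l> is saved as the fragment.
            docs.append({"element": o, "type": "literal",
                         "content": o, "fragment": (s, p, o)})
        for r in (s, o):
            if r not in literals and r not in seen:
                # Resources: the fragment is the node (its URI).
                seen.add(r)
                docs.append({"element": r,
                             "type": "class" if r in classes else "instance",
                             "content": local_name(r), "fragment": r})
    return docs

docs = build_documents(
    {("aifb:MULO", "rdfs:label", "MULOProject"),
     ("aifb:MULO", "rdf:type", "swrc:Project")},
    literals={"MULOProject"}, classes={"swrc:Project"})
print(sorted(d["type"] for d in docs))
# ['class', 'instance', 'literal', 'property', 'property']
```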
    </sec>
    <sec id="sec-6">
      <title>Extracting Relevant Fragments</title>
      <p>Extracting relevant fragments consists in retrieving the
parts of the RDF graph that correspond to the keywords of
the query. Fragment extraction is composed of two steps.</p>
      <p>Graph-query matching retrieves the graph elements
containing one of the query keywords in their textual content
by using the index. To this end, a mapping function and
some standard transformation functions such as
abbreviation and synonym expansion are used. More formally, let G be the
RDF graph and Q = {k1, ..., kn} the keyword query. The
goal is to return a set of lists El = {El1, ..., Eln} where Eli
is the list of fragments elij of the graph G which correspond
to the query keyword ki. We refer to these as keyword
elements. In addition, note that if the keyword element is a
property, as in our example, our approach will not extract
all occurrences of the property in the graph but only the
ones for which either the object or the subject is a relevant
fragment. Otherwise, the result might include irrelevant
answers. Indeed, if the query is "cooperator Schmek", we will
select the property swrc:cooperateWith of the resource
representing the scientist Schmek to get only his cooperators
and not the cooperators of other scientists.</p>
      <p>The expansion of the results makes it possible to integrate parts of
the graph that are relevant to the query even if they do not
contain the query keywords. The expansion is done using
external knowledge formalized as patterns and the result of
their evaluation on the data graph. We mainly use three
types of patterns for the expansion of the results. In the
following, we detail each of these types and show how they
can be instantiated and how they are used when extracting
the fragments.</p>
      <sec id="sec-6-1">
        <title>Patterns “sameResult”</title>
        <p>In some cases, the relevant fragments selected using the
graph-query matching do not provide the complete
information that responds to the keywords issued by the user. For
example, if the user's keyword is "publication", he expects
to get instances of the class swrc:Publication. However, the
matching identifies only this class as a relevant fragment.</p>
        <p>We define a new property, pattern:sameResult, that we
use to create patterns describing the semantic closeness
between two nodes of the graph. These patterns have the
following form:
[(X, pattern:sameResult, Y), (X, P, Y)]
The pattern:sameResult property is used between two
nodes x and y (resources or literals) to indicate that the
information provided by x complements the information
provided by y. If there is a triple</p>
        <p>tr = &lt;x, pattern:sameResult, y&gt;
in the pattern store, and y is a relevant fragment for a
keyword k, then the subgraph (x, P, y) is returned as a relevant
fragment for k instead of y alone, given that the evaluation
of the pattern [(X, pattern:sameResult, Y), (X, P, Y)] has
generated the triple tr.</p>
        <p>The expansion process using triples of the form &lt;x,
pattern:sameResult, y&gt; is described in Algorithm 1. Let y
be the keyword fragment and x another resource of the graph,
such that tr2 = &lt;x, pattern:sameResult, y&gt; is a triple of
the pattern store. The expansion step constructs a
relevant fragment by replacing the resource y by the subgraph
(x, P, y), consisting of the two resources x and y as well as
the path described in the expression P which has led to the
generation of tr2.</p>
        <p>Input: L: triples list of the pattern store,
El = {El1, ..., Eln}: relevant fragment list
Output: Ele = {Ele1, ..., Elen}: relevant fragment list
after expansion
1: Ele ← El
2: for all Eli ∈ El do
3:   for all elij ∈ Eli do
4:     if ∃ x | &lt;x, pattern:sameResult, elij&gt; ∈ L then
5:       Elie ← Elie \ {elij}
6:       Elie ← Elie ∪ {(x, P, elij)} /* P is the pattern's
path expression which generated the triple
&lt;x, pattern:sameResult, elij&gt; */
7:     end if
8:   end for
9: end for
Algorithm 1: Result expansion using pattern:sameResult
It is possible to instantiate this pattern with any property,
according to the user's needs. In this paper, we describe the
generic patterns using properties of the RDF, RDFS and
OWL vocabularies, for example rdfs:subClassOf, rdf:type
and owl:sameAs.</p>
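        <p>A direct transcription of Algorithm 1 in Python might look as follows; the representation of the pattern store (a dictionary mapping each generated triple to the path expression P that produced it) is an assumption made for this sketch.</p>

```python
def expand_same_result(pattern_store, fragment_lists):
    """pattern_store: dict {(x, 'pattern:sameResult', el): P};
    fragment_lists: one list of relevant fragments per keyword."""
    expanded = []
    for El_i in fragment_lists:
        out = []
        for el in El_i:
            # Line 4: is there a triple <x, pattern:sameResult, el>?
            match = next(((x, P) for (x, pred, y), P in pattern_store.items()
                          if pred == "pattern:sameResult" and y == el), None)
            if match is None:
                out.append(el)
            else:
                # Lines 5-6: replace el by the subgraph (x, P, el).
                x, P = match
                out.append((x, P, el))
        expanded.append(out)
    return expanded

store = {("aifb:Pub1", "pattern:sameResult", "swrc:Publication"):
         "rdf:type?/rdfs:subClassOf"}
print(expand_same_result(store, [["swrc:Publication"]]))
# [[('aifb:Pub1', 'rdf:type?/rdfs:subClassOf', 'swrc:Publication')]]
```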
        <p>For the owl:sameAs property, the defined pattern leads
to identifying as a relevant fragment a subgraph composed of x,
the keyword element y and the path relating them, which is
composed of a sequence of owl:sameAs properties, because
x and y are equivalent. The pattern is as follows:
[(X, pattern:sameResult, Y), (X, (owl:sameAs|^owl:sameAs)+, Y)]</p>
        <p>For the rdfs:subClassOf and rdf:type properties, we define
one pattern which considers x, the keyword element y
and the path relating them as a relevant fragment, because
x is an instance or a subclass of the keyword element y. The
defined pattern is as follows:
[(X, pattern:sameResult, Y), (X, rdf:type?/rdfs:subClassOf, Y)]</p>
        <p>These patterns are generic and can be used for all data
sources. Other domain-specific patterns can be defined. For
example, in the case of the SWRC data source, if the user
wants to retrieve a supervisor and his/her student, and
considering that the latter is a keyword element, then the following
pattern can be defined:
[(X, pattern:sameResult, Y), (X, swrc:supervises, Y)]</p>
      </sec>
      <sec id="sec-6-1b">
        <title>Patterns “sameProperty”</title>
        <p>The RDF, RDFS and OWL vocabularies are used in this
category of patterns to identify properties that can replace
other properties in the graph according to the schema of the
domain being studied. Let pr be a resource representing
a property, i.e. one which has rdf:Property as the value of
the rdf:type property. This resource can be linked to other
resources of type rdf:Property by properties indicating a
semantic closeness between them.</p>
        <p>Figure 4 shows an excerpt of an RDF graph with its schema.
We can see that property types are interlinked. The
closeness between property types must be reflected on the
properties in the data. For example, we can take into account that
the swrc:member property of the resource aifb:Hartmut_Schmek
can be replaced by the swrc:employee property,
because the latter is a generalization of the former, as
indicated in the schema part. Therefore, if the swrc:member
property is relevant, then swrc:employee is also relevant.</p>
        <p>The new property pattern:sameProperty is used between two
nodes x and y representing property types. It appears in the
definition of patterns having the following form:
[(X, pattern:sameProperty, Y), (X, P, Y)]
Such a pattern indicates that a property of type y can be replaced by a property of
type x. Therefore, if the triple &lt;x, pattern:sameProperty, y&gt;
is in the pattern store, and if y is a relevant property,
then x is also a relevant property.</p>
        <p>Algorithm 2 describes this type of expansion: the set
of relevant fragments is scanned and, for every relevant
element elij, the algorithm checks for a triple of the form
&lt;pr, pat:sameProperty, elij&gt; in the pattern store (line
4). The existence of this triple means that the pr property
is also relevant, and it is therefore added to the list of relevant
items (line 5). In addition, the algorithm searches the
RDF graph for triples which contain properties of type pr
to add them to the list of relevant fragments (line 8). As
seen previously, properties are relevant if their subject or
object is a relevant fragment. The same constraint holds
for the selection of triples during the expansion (line 7).</p>
        <p>Input: L: triples list of the pattern store, G: RDF graph,
El = {El1, ..., Eln}: relevant fragment list
Output: Ele = {Ele1, ..., Elen}: relevant fragment list
after expansion
1: Ele ← El
2: for all Eli ∈ El do
3:   for all elij ∈ Eli do
4:     if ∃ pr | &lt;pr, pat:sameProperty, elij&gt; ∈ L then
5:       Elie ← Elie ∪ {pr}
6:       for all &lt;x, pr, y&gt; ∈ G do
7:         if ∃ Elk | k ≠ i ∧ (x ∈ Elk ∨ y ∈ Elk) then
8:           Elie ← Elie ∪ {&lt;x, pr, y&gt;}
9:         end if
10:      end for
11:    end if
12:  end for
13: end for
Algorithm 2: Result expansion using pattern:sameProperty</p>
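        <p>Algorithm 2 can be sketched in Python as follows; the data representations and the toy graph are assumptions made for illustration.</p>

```python
def expand_same_property(pattern_store, graph, fragment_lists):
    """pattern_store: set of (pr, 'pat:sameProperty', y) triples;
    graph: set of RDF triples; fragment_lists: one list per keyword."""
    expanded = [list(El) for El in fragment_lists]
    for i, El_i in enumerate(fragment_lists):
        for el in El_i:
            for (pr, pred, y) in pattern_store:
                if pred == "pat:sameProperty" and y == el:
                    expanded[i].append(pr)  # line 5: pr is also relevant
                    for (x, p, o) in graph:
                        if p != pr:
                            continue
                        # Line 7: keep the triple only if its subject or
                        # object is relevant for some other keyword.
                        if any(k != i and (x in El_k or o in El_k)
                               for k, El_k in enumerate(fragment_lists)):
                            expanded[i].append((x, pr, o))  # line 8
    return expanded

graph = {("aifb:Hartmut_Schmek", "swrc:employee", "aifb:AIFB")}
store = {("swrc:employee", "pat:sameProperty", "swrc:member")}
result = expand_same_property(store, graph,
                              [["swrc:member"], ["aifb:Hartmut_Schmek"]])
print(("aifb:Hartmut_Schmek", "swrc:employee", "aifb:AIFB") in result[0])  # True
```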
        <p>The following is an example of this kind of pattern, in which
three properties are used: owl:inverseOf,
owl:equivalentProperty and rdfs:subPropertyOf. The pattern allows a
property x to be considered relevant if it is the equivalent, the inverse
or a sub-property of the keyword element y.</p>
        <p>[(X, pat:sameProperty, Y),
(X, (owl:inverseOf|^owl:inverseOf|
owl:equivalentProperty|
^owl:equivalentProperty|
rdfs:subPropertyOf)+, Y)]</p>
      </sec>
      <sec id="sec-6-2">
        <title>Patterns using domain properties</title>
        <p>A property can be considered as relevant because its URI
contains a query keyword. However, RDF graphs are often
incomplete and properties are not defined for all resources.
In this case, the values of these properties can be derived
using external knowledge. We use patterns to retrieve relevant
fragments which do not necessarily contain the keywords of
the query but which meet the user's needs.</p>
        <p>Consider the following pattern: [(X, p, Y), (X, P, Y)],
and assume that the local name of the property p contains
one of the query keywords. Assume also that there are two
resources ri and rj in the RDF graph such that there is no
triple &lt;ri, p, rj&gt; in the graph. If this triple exists in the
pattern store as a result of the evaluation of the pattern,
because the two resources ri and rj are linked by a path
corresponding to the path expression P, then the subgraph
(ri, P, rj) is a relevant fragment.</p>
        <p>For example, the property swrc:cooperateWith between
two researchers in the SWRC ontology states that they have
already collaborated. However, this property is not always
defined. If we know that two researchers have authored the same
paper, we can infer that they have collaborated, even if the
property swrc:cooperateWith is missing. We formalise this
using the following pattern:
[(X, swrc:cooperateWith, Y), (X, swrc:publication/^swrc:publication, Y)]
This pattern is used by the fragment extraction module,
which considers the subgraph corresponding to the path
(swrc:publication/^swrc:publication) as a relevant
fragment if the property swrc:cooperateWith is a keyword
element. Consider that "Studer Cooperator" is a query issued
to find the collaborators of the researcher named Studer in
the graph of Figure 1. Let us consider the following two
extracted graph elements: the node aifb:Rudi_Studer and
the property swrc:cooperateWith. Without using patterns,
the only result will be the subgraph G1 of Figure 5,
representing Daniel Deutch, a collaborator of Rudi Studer, both
linked by the swrc:cooperateWith property. However, Victor
Vianu is also a collaborator of Rudi Studer as they have
authored the same paper (aifb:Art1). Using the pattern
defined above, our approach will also return the subgraph G2
of Figure 5 as a result.</p>
        <p>Algorithm 3 describes this type of expansion: for every
relevant fragment elij, if a triple of the form &lt;elij, rdf:type,
rdf:Property&gt; exists in the RDF graph (line 3), this means
that elij is a resource representing a property. The
algorithm then finds triples of the form &lt;x, elij, y&gt; (line 5) in
the pattern store. Indeed, in some cases, a relevant
property exists in the schema as an instance of the rdf:Property
class, but no resource is described using this property in the
data graph. For example, in the graph of Figure 4, if the
resource swrc:employs, which corresponds to a property, is
relevant, then the algorithm will retrieve from the pattern store
triples of the form &lt;x, swrc:employs, y&gt;. For each of these triples, the algorithm
will then search for the corresponding
subgraph in the graph G, in order to add it as a relevant
fragment (line 7), provided that either x or y is a relevant
fragment (line 6). In other cases, properties might be used
in the data graph without being defined in the schema. As
explained in Section 3.1, if a property is relevant, the
relevant fragment is composed of all the triples containing this
property. In this case, for each triple of the relevant
fragment, if the corresponding predicate is not marked, then
the algorithm searches for triples having this predicate in
the pattern store (lines 12-14). For each of these triples, the
corresponding subgraph in the graph G is retrieved, in
order to add it as a relevant fragment (line 16), provided that
either x or y is a relevant fragment (line 15).</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Result Construction</title>
      <p>After obtaining the set of relevant fragments for each
query keyword, the result construction process builds the
subgraphs representing the results. Each result is a
connected minimal subgraph which contains for each keyword
ki one corresponding relevant fragment elij .</p>
      <p>We perform the Cartesian product between the different
sets of relevant fragments to construct the combinations.
Then, for each combination, we look for minimal subgraphs
to join the different relevant fragments and construct the result.
In other words, the result should contain the minimum number of
intermediate nodes which are not relevant fragments. To do
that, we use shortest path algorithms. Let us consider two
relevant fragments eli and elj. The shortest path between
the two fragments is the shortest path between two nodes
ni ∈ eli and nj ∈ elj such that there exist no nodes
ni2 ∈ eli and nj2 ∈ elj with ni ≠ ni2, nj ≠ nj2 and
|π(ni2, nj2)| &lt; |π(ni, nj)|, where |π(x, y)| is the size of the
shortest path between nodes x and y. To construct the
result, we combine the shortest paths between all pairs of
fragments. We take into account the existing patterns when
calculating the path between resources. If a pattern
indicates that a path is equivalent to one property, its
distance is reduced to 1.</p>
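      <p>The search for a shortest path joining two fragments can be illustrated with a simple breadth-first search over an undirected view of the graph. This sketch ignores the pattern-based distance reduction described above, and the toy data is hypothetical.</p>

```python
from collections import deque

def shortest_path(graph, sources, targets):
    """BFS over an undirected view of (s, p, o) triples.
    sources/targets: node sets of the two fragments to join."""
    adj = {}
    for s, p, o in graph:
        adj.setdefault(s, []).append(o)
        adj.setdefault(o, []).append(s)
    queue = deque((n, [n]) for n in sources)
    visited = set(sources)
    while queue:
        node, path = queue.popleft()
        if node in targets:
            return path  # first target reached is at minimal distance
        for nxt in adj.get(node, ()):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + [nxt]))
    return None  # the two fragments are not connected

g = {("aifb:Rudi_Studer", "swrc:publication", "aifb:Art1"),
     ("aifb:Victor_Vianu", "swrc:publication", "aifb:Art1")}
print(shortest_path(g, {"aifb:Rudi_Studer"}, {"aifb:Victor_Vianu"}))
# ['aifb:Rudi_Studer', 'aifb:Art1', 'aifb:Victor_Vianu']
```

      <p>The intermediate node aifb:Art1 is exactly the kind of non-relevant node whose number the construction seeks to minimize.</p>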
    </sec>
    <sec id="sec-8">
      <title>Result Ranking</title>
      <p>Since several graph elements may correspond to the same
keyword, several subgraphs are returned to the user. We
define a ranking function to evaluate the relevance degree
of each subgraph. Our approach combines two criteria: (i)
Compactness Relevance, where compact answers are preferred;
it increases when the size of the intermediate part decreases;
and (ii) Mapping Relevance, which is calculated using
standard IR techniques. In our approach, we have used the TF
function. The final score is the combination of the two scores
and it is calculated as follows:
Score(Ri) = α · SizeScore(Ri) + (1 − α) · MappingScore(Ri)
Where SizeScore(Ri) is the compactness relevance of the
result Ri, MappingScore(Ri) is the matching relevance, and
α is a parameter which expresses the relative weight of the
two ranking criteria.</p>
      <sec id="sec-8-1">
        <title>Ranking Criteria</title>
        <sec id="sec-8-1-1">
          <title>Compactness Relevance</title>
          <p>The size of the intermediate part of the graph, added
during the construction of the result, is an important element
to take into account: the relevance of the result increases
as the size of the intermediate part decreases. The size of a
subgraph is the sum of the number of its edges and nodes.
Moreover, in an RDF graph, the edges have specific
semantics. Consequently, the semantic closeness between resources
does not only depend on the size of the path that connects
them, but also on the semantics carried by the properties
which form this path. For example, two resources linked by
a path composed of a sequence of owl:sameAs properties
are semantically closer than two other resources connected
by a shorter path composed of other properties. Patterns
are taken into account when calculating the size of a
subgraph: if this subgraph is composed of a path defined in one
of the patterns, then its size is reduced to the value 1. The
real length of a path which represents a strong semantic link
is not considered as it is, but is reduced. The compactness
relevance is calculated as follows:</p>
          <p>SizeScore(Ri) = size(∪ RelevantFragments) / size(Ri)
Where size(∪ RelevantFragments) is the size of the union of
the relevant fragments and size(Ri) is the size of the result
subgraph Ri after the patterns are taken into account. If
no intermediate part has been used, the size of the union of
the relevant fragments will be equal to the size of the result.</p>
        </sec>
        <sec id="sec-8-1-2">
          <title>Matching Relevance</title>
          <p>We use a classical information retrieval function to assign
a weight to each relevant fragment of the result, namely
TF (Term Frequency): the more frequent the keyword of
the query in the document, the more relevant it is. The
TF function also depends on the size of the document, the
smaller the document containing the keyword of the query,
the more it is focused on the term and therefore the more
relevant it is. The equation below is used to obtain the
weight of a relevant fragment:</p>
          <p>TF(eli, kj) = frequency(kj, eli) / number_words(eli)
Where frequency(kj, eli) is the frequency of the keyword
kj in the document representing the element eli, and
number_words(eli) is the number of words in the document
representing the element eli. The matching score of the
result is obtained by calculating the average of the scores of
the composing relevant fragments; it is computed as follows:
MappingScore(Ri) = (1/n) Σj=1..n TF(elj, kj)
Where kj is a query keyword and n is the number of keywords.</p>
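          <p>The two formulas above can be illustrated as follows; the tokenization and the example documents are assumptions made for the sketch.</p>

```python
def tf(document_words, keyword):
    """frequency(k, el) / number_words(el), case-insensitive."""
    words = [w.lower() for w in document_words]
    return words.count(keyword.lower()) / len(words)

def mapping_score(fragment_docs, keywords):
    """Average TF over the n keywords; fragment_docs[j] is the word
    list of the document matched to keywords[j]."""
    n = len(keywords)
    return sum(tf(doc, k) for doc, k in zip(fragment_docs, keywords)) / n

docs = [["Rudi", "Studer"], ["cooperate", "with"]]
print(mapping_score(docs, ["Studer", "cooperator"]))  # (1/2 + 0) / 2 = 0.25
```

          <p>Note that "cooperator" scores 0 against the second document here because this sketch matches exact words; the approach applies transformation functions (abbreviations, synonyms) before matching.</p>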
        </sec>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>EXPERIMENTAL EVALUATIONS</title>
      <p>
        We have implemented our approach in PatEx, a system
for RDF data exploration [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. We have developed our system
as a desktop application implemented in Java. We have used
the Jena API for the manipulation of RDF data and the
Lucene API (http://lucene.apache.org/core/) to create the index and to search for keyword
elements. The Jung API (http://jung.sourceforge.net/) is used for graph manipulation
and visualization.
      </p>
      <p>In this section, we describe our experiments, which
demonstrate the effectiveness of our approach. All experiments
have been run on an Intel Core i5 with 12GB of RAM. In the
rest of this section, we first describe the datasets used and
the algorithms to which we have compared our approach.
Then, we discuss the results of our evaluations, related to
the performance of our system and the quality of the results.</p>
    </sec>
    <sec id="sec-10">
      <title>Experimental settings</title>
      <sec id="sec-10-1">
        <title>Datasets.</title>
        <p>We have used two datasets: AIFB and conference
(http://data.semanticweb.org/dumps/conferences/). AIFB
is a dataset containing data taken from the AIFB institute,
Karlsruhe University. It is about entities of research communities
such as persons, organizations, publications (bibliographic
metadata) and their relationships. The dataset contains
8,281 entities and 29,223 triples. The conference dataset
contains information about the main conferences and
workshops in the area of Semantic Web research, the papers that
were presented and the people who attended these events. This
dataset contains 571 entities and 1,429 triples.</p>
      </sec>
      <sec id="sec-10-2">
        <title>Algorithms.</title>
        <p>We have compared our approach, which we refer to as WP, to two configurations which do not integrate patterns: (i) the first one, which we refer to as WAP-WAPr, considers node content only; (ii) the second one, which we refer to as WAP-WPr, considers both node and edge content, but without any condition during their extraction.</p>
      </sec>
      <sec id="sec-10-3">
        <title>Metrics.</title>
        <p>We have used two metrics to evaluate the quality of the results. The first one is the top-k precision (P@k), calculated with equation 1; it provides the percentage of relevant results among the top-k results.</p>
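        <p>The P@k computation is straightforward; the following is a minimal Java sketch with illustrative names, where the users' relevance judgments are given as a boolean array over the ranked results.

```java
// Top-k precision: the fraction of relevant results among the first k returned.
public class PrecisionAtK {
    // relevant[i] is true when the result ranked at position i+1 was judged relevant.
    public static double precisionAtK(boolean[] relevant, int k) {
        int hits = 0;
        for (int i = 0; i < k && i < relevant.length; i++) {
            if (relevant[i]) hits++;
        }
        return (double) hits / k;
    }
}
```
</p>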
        <sec id="sec-10-3-1">
          <title>P@k = (Number of relevant results in the top-k) / k (1)</title>
          <p>
            The second metric is the Normalized Discounted Cumulative Gain (NDCG@k); it is calculated using equation 2 [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ]. This metric evaluates both the relevance of the results and their ranking in the list. It is based on the idea that a relevant answer which is ranked too low or placed at the end of the list is less useful to the user.
          </p>
          <p>NDCG@k = (1/|Q|) · Σ_{q∈Q} Z_k · Σ_{i=1}^{k} (2^{rel(i)} − 1) / log(i + 1) (2)

Where Q is the set of queries used, rel(i) is the relevance score of the result at position i, and Z_k is a normalization parameter ensuring a score between 0 and 1.</p>
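          <p>As a concrete reading of equation 2, the following Java sketch computes NDCG@k for a single query, taking Z_k as the inverse of the DCG of an ideally ordered result list; this is a common choice, and the normalization used here is an assumption, not necessarily the paper's exact implementation.

```java
import java.util.Arrays;

// Sketch of NDCG@k for one query, following equation 2.
public class Ndcg {

    // DCG@k = sum_{i=1..k} (2^rel(i) - 1) / log(i + 1)
    public static double dcgAtK(int[] rel, int k) {
        double dcg = 0.0;
        for (int i = 1; i <= k && i <= rel.length; i++) {
            dcg += (Math.pow(2, rel[i - 1]) - 1) / Math.log(i + 1);
        }
        return dcg;
    }

    // NDCG@k = Z_k * DCG@k, with Z_k = 1 / DCG@k of the ideal ordering.
    public static double ndcgAtK(int[] rel, int k) {
        int[] ideal = rel.clone();
        Arrays.sort(ideal);
        // reverse to descending order (highest relevance scores first)
        for (int i = 0; i < ideal.length / 2; i++) {
            int tmp = ideal[i];
            ideal[i] = ideal[ideal.length - 1 - i];
            ideal[ideal.length - 1 - i] = tmp;
        }
        double idealDcg = dcgAtK(ideal, k);
        return idealDcg == 0.0 ? 0.0 : dcgAtK(rel, k) / idealDcg;
    }
}
```

A perfectly ordered result list scores 1; pushing a highly relevant result down the list lowers the score.</p>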
          <p>We have also evaluated the search performance of our approach, measured as the average time to find the top-k answers.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-11">
      <title>Efficiency Evaluation</title>
      <p>We vary the query size from 2 to 5 keywords. For each size, the average execution time over 10 queries covering a wide variety of keywords is shown for the three algorithm configurations in Figures 6a and 6b.</p>
      <p>
        We observe that the execution time increases almost linearly with the number of keywords for both datasets. The reason is that the number of combinations increases, and therefore so does the number of results. But as the size of the queries issued by users is usually small (2.35 keywords on average according to [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]), the search performance of our approach is acceptable. However, we can see that the execution time is higher for the AIFB dataset than for the Conference dataset, because the former is larger. In addition, the execution time of the WP configuration is higher than that of WAP-WAPr, but lower than that of WAP-WPr. Indeed, WAP-WAPr does not consider properties and produces fewer results than the others, while the WAP-WPr configuration considers properties and does not apply any filter to select only the relevant triples. The three configurations are in the same order of magnitude.
      </p>
      <p>The execution time also depends on the type and number of keyword elements. Figure 6c shows the execution time for six queries, with different sizes, on the AIFB dataset. We can notice that the number of keywords in the query is not the only parameter that influences the execution time. Indeed, the query Q2(3: 2L, 2L, 1L), which is longer than Q1(2: 399L, 432L), takes less time (in this notation, the first number is the number of keywords in the query, and the matched elements are denoted L for literals, C for classes and P for properties). This is explained by the fact that Q1 includes more general concepts and thus matches more elements in the RDF graph. We conclude that the execution time depends on the number of keyword elements. We also notice that the execution time depends on the number of relevant fragments returned after the expansion. For example, Q6(2: 1C and 4L, 1P) takes more time than Q4(2: 1C, 1P). The pattern used in this case retrieves the instances of the two classes swrc:Person and swrc:FullProfessor; for the swrc:cooperateWith property, the processing is the same for both queries. As the first class has more instances, we obtain more results, which is why it takes more time. For Q3(2: 1L, 1P), we can observe that the WAP-WPr configuration takes more time than the others, because it matches the swrc:cooperateWith property and this configuration considers all triples having this property without any filter.</p>
      <p>For these experiments, we have used 10 queries that were randomly constructed (with 2 to 5 keywords). We have compared the three configurations of the algorithm described in Section 4.1 and we have asked 4 users to assess the top-k results of each query for the three configurations on a 3-level scale: 3 (perfectly relevant), 1 (relevant) and 0 (irrelevant). To calculate P@k, we assume that the scores 3 and 1 correspond to a relevant result and 0 to an irrelevant one.</p>
      <p>We have used NDCG@k and P@k to evaluate the quality of the results (see Section 4.1). Table 2 reports the average NDCG@k and P@k for the 10 considered queries with k=5, 10 and 20 on the two datasets. We can see from this table that the use of patterns in the algorithm significantly outperforms the two other configurations in terms of NDCG@k and P@k at all levels, for both datasets and for all values of k. This means that our approach finds more relevant results. Furthermore, since NDCG@k takes the ranking position of the results into account, it also shows that our ranking algorithm performs better because it uses patterns during the ranking step.</p>
    </sec>
    <sec id="sec-11b">
      <title>RELATED WORK</title>
      <p>
        Keyword search in RDF data has emerged as a new research topic [
        <xref ref-type="bibr" rid="ref1 ref10 ref4 ref5">4, 10, 1, 5</xref>
        ]. Among the best-known systems are SPARK [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], NAGA [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and Q2Semantic [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. They return either SPARQL queries or subgraphs as answers to a given keyword query.
      </p>
      <p>
        There are mainly two categories of approaches dealing with keyword search over RDF graphs. The first one uses information retrieval techniques. The proposed approaches first define what a document is, for example, a triple [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] or a resource with its neighborhood [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Documents are indexed, then query keywords are mapped to the defined documents. Finally, a ranked list of documents is returned to the user. These works do not merge the relevant fragments corresponding to the query keywords; therefore, each result is often the answer to only a part of the query.
      </p>
      <p>
        In the second category, approaches use graphs to construct results [
        <xref ref-type="bibr" rid="ref11 ref13 ref14">14, 11, 13</xref>
        ]. To do so, they typically use a keyword-to-graph mapping function to identify a set of elements that contain the query keywords. Then, a connected "minimal" subgraph which contains one identified element for each query keyword is returned as an answer. Since several elements can contain the same keyword, several results are returned to the user, and each approach defines a ranking function mainly based on the size of the subgraph, the matching relevance and the coverage. In these works, only the structure of the graph is considered during the construction of the result, unlike our approach, which integrates semantics through patterns. In [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ], the authors have predefined templates for the results based on the ontology or on the types of keyword elements. However, defining these templates might capture some interpretations of the keywords which do not satisfy the user's needs.
      </p>
    </sec>
    <sec id="sec-12">
      <title>CONCLUSION AND FUTURE WORKS</title>
      <p>Figure 6: Execution time evolution. (a) Effect of query size: Conference. (b) Effect of query size: AIFB.</p>
      <p>This paper proposes a novel approach to querying an RDF dataset using keywords; it differs from existing approaches in that it includes external knowledge. We have introduced the notion of pattern to formalise this knowledge. In our approach, an answer to a query is a set of subgraphs containing, for each keyword, at least one relevant fragment. We have described how we can expand relevant fragments to complement them or to find other fragments using patterns. We have also developed a new index covering all types of graph elements, and we have used the patterns during the construction of the results as well as in the ranking function. The experiments show that the patterns improve the quality of the results. The concept of pattern used in this paper could be compared to the use of inference. However, the use of patterns allowed us to introduce knowledge based not only on the schema but also on the data, without dealing with the types of entities or their hierarchy in the ontology. Moreover, we can instantiate one or both ends of the path expression by replacing the variables with resources or literals, which allows us to focus on specific cases.</p>
      <p>
        In future works, we will explore the summarisation of the results produced as an answer to a keyword query. As some results can contain complementary information, we could aggregate them to obtain one result which is relevant to the query. Few works have addressed this problem in the context of keyword search in RDF graphs. The proposal presented in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] is a contribution to solving this problem. We will study the introduction of patterns in the summarisation process to improve the results.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Elbassuoni</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Blanco</surname>
          </string-name>
          .
          <article-title>Keyword search over RDF graphs</article-title>
          .
          <source>In CIKM</source>
          , pages
          <volume>237</volume>
          –
          <fpage>242</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Guha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>McCool</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Miller</surname>
          </string-name>
          .
          <article-title>Semantic search</article-title>
          .
          <source>In WWW</source>
          , pages
          <volume>700</volume>
          –
          <fpage>709</fpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Järvelin</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Kekäläinen</surname>
          </string-name>
          .
          <article-title>IR evaluation methods for retrieving highly relevant documents</article-title>
          .
          <source>In SIGIR</source>
          , pages
          <volume>41</volume>
          –
          <fpage>48</fpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>Kasneci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Suchanek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ifrim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ramanath</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          .
          <article-title>NAGA: Searching and ranking knowledge</article-title>
          .
          <source>In ICDE</source>
          , pages
          <volume>953</volume>
          –
          <fpage>962</fpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>W.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kementsietsidis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Duan</surname>
          </string-name>
          .
          <article-title>Scalable keyword search on large RDF data</article-title>
          .
          <source>TKDE</source>
          ,
          <volume>26</volume>
          :
          <fpage>2774</fpage>
          –
          <fpage>2788</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Ouksili</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Kedad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lopes</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Nugier</surname>
          </string-name>
          .
          <article-title>PatEx: Pattern oriented RDF graphs exploration</article-title>
          .
          <source>In NLDB</source>
          , pages
          <volume>102</volume>
          –
          <fpage>114</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Oyama</surname>
          </string-name>
          .
          <article-title>Query Refinement for Domain-Specific Web Search</article-title>
          .
          <source>PhD thesis</source>
          , Kyoto University,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.</given-names>
            <surname>Pradel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Haemmerle</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Hernandez</surname>
          </string-name>
          .
          <article-title>Swip: A natural language to SPARQL interface implemented with SPARQL</article-title>
          .
          <source>In ICCS</source>
          , pages
          <volume>260</volume>
          –
          <fpage>274</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Rahoman</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Ichise</surname>
          </string-name>
          .
          <article-title>Automatic inclusion of semantics over keyword-based linked data retrieval</article-title>
          .
          <source>IEICE Transactions</source>
          , 97-D:
          <volume>2852</volume>
          –
          <fpage>2862</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rudolph</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Cimiano</surname>
          </string-name>
          .
          <article-title>Top-k exploration of query candidates for efficient keyword search on graph-shaped (RDF) data</article-title>
          .
          <source>In ICDE</source>
          , pages
          <volume>405</volume>
          –
          <fpage>416</fpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>H. F.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. L.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tran</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          .
          <article-title>Q2Semantic: a lightweight keyword interface to semantic search</article-title>
          .
          <source>In ESWC</source>
          , pages
          <volume>584</volume>
          –
          <fpage>598</fpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Srivatsa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Iyengar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Yan</surname>
          </string-name>
          .
          <article-title>Summarizing answer graphs induced by keyword queries</article-title>
          .
          <source>Proc. VLDB Endowment</source>
          ,
          <volume>6</volume>
          (
          <issue>14</issue>
          ):
          <volume>1774</volume>
          –
          <fpage>1785</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sun</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Yan</surname>
          </string-name>
          .
          <article-title>Schemaless and structureless graph querying</article-title>
          .
          <source>VLDB Endow.</source>
          , pages
          <volume>565</volume>
          –
          <fpage>576</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          .
          <article-title>SPARK: Adapting keyword query to semantic search</article-title>
          .
          <source>In ISWC</source>
          , pages
          <volume>694</volume>
          –
          <fpage>707</fpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>