=Paper=
{{Paper
|id=Vol-1979/paper-14
|storemode=property
|title=Keyword Search over Federated RDF Datasets
|pdfUrl=https://ceur-ws.org/Vol-1979/paper-14.pdf
|volume=Vol-1979
|authors=Yenier Izquierdo,Marco A. Casanova,Grettel Garcia,Frederic Dartayre,Carlos Henrique Levy
|dblpUrl=https://dblp.org/rec/conf/er/IzquierdoCGDL17
}}
==Keyword Search over Federated RDF Datasets==
<pdf width="1500px">https://ceur-ws.org/Vol-1979/paper-14.pdf</pdf>
<pre>
          Keyword Search over Federated RDF Datasets

    Yenier T. Izquierdo1, Marco A. Casanova1,2, Grettel M. García1, Frederic Dartayre2,
                                    Carlos H. Levy2
     1
      Department of Informatics – Pontifical Catholic University of Rio de Janeiro, RJ, Brazil
             2
               Instituto TecGraf – Pontifícia Universidade Católica do Rio de Janeiro
                 {yizquierdo,casanova,ggarcia}@inf.puc-rio.br,
                        {fdartayre,levy}@tecgraf.puc-rio.br


         Abstract. This paper describes an algorithm to perform keyword search over
         federated RDF datasets. The algorithm compiles keyword-based queries into fed-
         erated SPARQL queries, without user intervention, under the assumption that the
         RDF datasets and the federation have a schema. The compilation process is ex-
         plained in detail, including how to synthesize external joins between local que-
         ries, how to combine local queries with UNION clauses, and how to construct the
         WHERE and TARGET clauses. The paper then presents the architecture of a sys-
         tem that implements the algorithm. Finally, the paper describes experiments with
         three real-world datasets to validate the implementation and to help understand
         the different situations faced by the compilation process.

         Keywords: Keyword search; federated SPARQL query; RDF.


1        Introduction

The Resource Description Framework (RDF) was adopted as a W3C recommendation
in 1999 and today is a standard for exchanging data on the Web. Indeed, a huge amount
of data has been converted to RDF [15] and published openly on the Web, following
the Linked Data principles [2]. These datasets frequently have an RDF schema, often
under the form of a standard ontology, such as the Music Ontology, and are intercon-
nected.
   The SPARQL Protocol and RDF Query Language (SPARQL) was officially intro-
duced in 2008 to specify queries over RDF datasets. The SPARQL 1.1 Federated Query
extension offers services for executing queries distributed over different RDF datasets.
SPARQL is a sophisticated language, which is difficult to master, though.
   Keyword-based queries offer an alternative way to access databases, which is attrac-
tive when users are unaware of the way data is organized, or do not know the syntax of
the query language. Specifically, keyword-based queries for RDF datasets has been
extensively investigated, but most approaches assume that the RDF triples are stored in
a centralized repository [6,8,10,20].
   By contrast, this paper addresses the problem of processing keyword-based queries
over federated RDF datasets. The main contributions of this paper are: (i) An algorithm
to compile keyword-based queries into federated SPARQL queries, without user inter-
vention, under the assumption that the RDF datasets and the federation have a schema;


Copyright © by the paper’s authors. Copying permitted only for private and academic
purposes.
In: C. Cabanillas, S. España, S. Farshidi (eds.):
Proceedings of the ER Forum 2017 and the ER 2017 Demo track,
Valencia, Spain, November 6th-9th, 2017,
published at http://ceur-ws.org
(ii) An implementation of the algorithm; (iii) Experiments with the implementation to
assess its usability.
    The reminder of this paper is organized as follows. Section 2 summarizes related
work. Section 3 contains basic definitions and introduces the example used throughout
the paper. Section 4 details the algorithm to compile keyword-based queries into fed-
erated SPARQL queries. Section 5 covers experiments with an implementation of the
algorithm. Finally, Section 6 presents the conclusions and proposes future work.


2      Related Work

Keyword Search over RDF Graphs in Centralized Environments. Several ap-
proaches have been developed to help solve the keyword search problem over RDF
datasets. The main challenge addressed in many of these approaches has been to syn-
thesize SPARQL queries from a set of keywords, since users are generally unaware of
the query language and the RDF schema to be queried.
   For example, the Konduit tool [4, 10] provides non-expert users with a way to visu-
ally specify SPARQL queries. However, this tool has the disadvantage that the user
must have a basic knowledge of SPARQL.
   The QUICK (QUery Intent Constructor for Keywords) [20] helps users construct
queries in a given domain. The tool combines the convenience of keyword search with
the expressiveness of semantic queries. Users start with a keyword query and are then
guided through an incremental refinement process to specify the query intention. The
intermediate queries are listed and ranked. QUICK has the drawback that the user must
have some knowledge of the RDF schema.
   SPARKLIS [7] helps users explore and query RDF datasets by interactively explor-
ing the RDF schema, in a questions and answers process. SPARKLIS combines the
fine-grained guidance of faceted search, most of the expressiveness of SPARQL, and
the readability of natural languages. It has the advantage that no specific endpoint con-
figuration is required, since the schema is discovered on the fly. However, unlike our
proposal, in SPARKLIS, the query is built by selecting query elements, and only a sin-
gle RDF dataset is accessed.
   The tools described in [8, 20] synthesize SPARQL queries by exploring the under-
lying RDF schema. Unlike [20], the tool described in [8] is fully automatic, and allows
users to specify a keyword-based query with filters, that involve comparison operators.
Frameworks supporting federated SPARQL queries. Research on querying distrib-
uted RDF datasets with SPARQL-like queries typically explores optimizations based
on structural information (i.e., graph partitioning) [9,13,19]. Furthermore, according to
[15], the tools and systems designed to address federated queries focus mostly on da-
taset selection and join optimization during federated SPARQL query execution.
   Well-known federation systems are ANAPSID [1], FedSearch [11], and FedX [17].
But all these systems focus on improving the performance of the SPARQL query exe-
cution, and not on the construction of the queries.
   A comparison of existing SPARQL federation frameworks can be found in [14,16].
Several frameworks, such as ARQ, Sesame and Virtuoso, have been built on top of
SPARQL query engines supporting SPARQL 1.1, but this field is still far from ma-
turity. The existing frameworks for SPARQL 1.1 Federation Extension support the
SERVICE keyword, but not all of them support the BIDING and VALUES operators.
For instance, ARQ is a query engine processor for Jena that supports federated queries,
providing SERVICE and VALUES operators. The framework implements nested loop
joins to retrieve and combine result from multiple RDF datasets. Also, it provides a set
of Java packages to build SPARQL queries programmatically.
   The work reported in this paper extends the centralized algorithm developed in [8]
to compile keyword-based queries into federated SPARQL queries. The implementa-
tion reported in the paper follows the architecture for federations of SPARQL endpoints
described in [15], uses the Java packages that ARQ provides, and stores the RDF
schema of each available RDF dataset in auxiliary tables.


3      Definitions and Examples

3.1    Basic Definitions

A link between RDF datasets Ti and Tj, with i ¹ j, is an RDF triple of the form (si,p,sj),
where si and sj are URIs occurring in Ti and Tj. A linkset between Ti and Tj is simply a
set of links between Ti and Tj.
    Assume that Ti and Tj have RDF schemas. A linkset schema for a linkset Lk between
Ti and Tj is a triple (ci,p,cj), where ci and cj are classes of the RDF schemas of Ti and Tj.
It restricts the links in Lk to be of the form (si,p,sj), where si and sj are instances of the
classes ci and cj in Ti and Tj, respectively.
    An RDF dataset federation is a collection T = {T1, …, Tm} of RDF datasets and a
collection L = {L1, …, Ln} of linksets between the datasets in T, denoted simply as
T È L. The RDF graph of the RDF dataset federation T È L is the RDF graph induced
by the set of triples T1 È …È Tm È L1 È…È Ln.
    A keyword-based query K is a set of literals, or keywords. An answer for K over
T È L is defined as in [8], except that the RDF graph is that induced by the set of triples
T1 È …È Tm È L1 È…È Ln. A detailed definition is omitted for brevity. The problem
of finding answers for keyword-based queries over RDF dataset federations (or, briefly,
the RDF-FKwS problem) is defined as: “Given an RDF dataset federation T È L and
a keyword-based query K, find an answer for K over T È L”.
    In this paper, we assume that each RDF dataset Ti follows an RDF schema Si, for
1 ≤ i ≤ m, and that each linkset Lj follows a linkset schema Mj, for 1 ≤ j ≤ n. We also
assume that the RDF dataset federation T È L has a federated schema, which consists
of the schemas of its datasets and linksets, a set of metadata matches, and a set of in-
stance match definitions. Section 3.2 provides an example of a federated schema.
    More precisely, a federated schema for the RDF dataset federation T È L is a quad-
ruple F = (s,l,µ,d), where:
1) s = (S1,…,Sm) and l = (M1,…,Mn).
2) µ is a set of metadata matches, that is, a set of pairs of the form (xi,xj) such that xi
      is a class (datatype property or object property) defined in Si and xj is a class (re-
      spectively, datatype property or object property) defined in Sj, for 1 £ i ¹ j £ m; in
      this case, we say that xi and xj match.
3) d is a set of instance match definitions, that is, a set of pairs of the form
      (q,(ci,(pi1,…, pik),cj,(pj1,…,pjk))) such that:
      a) q is an object property;
      b) ci and cj are classes defined in Si and Sj, respectively;
      c) (pi1,…,pik) is a list of datatype properties defined in Si whose domain is ci, and
           likewise for (pj1,…,pjk).
     By (1), the federated schema simply contains all dataset and linkset schemas. By
(2), it admits just one-to-one metadata matches (also called schema mappings).
     An instance match definition (q,(ci,(pi1,…,pik),cj, (pj1,…,pjk))) specifies a set of links
L of the form (si,q,sj) such that si is an instance of class ci in Ti, sj is an instance of class
cj in Tj, and there are triples in Ti of the form (si,piu,vu) and triples in Tj of the form
(sj,pju,vu), for 1 ≤ u ≤ k. That is, (si,q,sj) is a link in L iff the value of property piu for si
is equal to the value of property pju for si, for 1 ≤ u ≤ k. A common type of inter-dataset
property definition specifies that si and sj denote the same real-world entity iff the val-
ues of their properties (pi1,…,pik) and (pj1,…,pjk) are equal; these definitions are usually
called sameAs definitions.
     We could account for more complex metadata matches or instance match defini-
tions at the expenses of a more complex keyword federated translation algorithm, which
we leave for future work. In particular, sameAs definitions could incorporate transfor-
mations (e.g. lower case to upper case) and similarity measures (e.g. Levenshtein dis-
tance), as in tools such as Silk [18].


3.2     An Example

Consider the following RDF datasets: the drug and enzyme data in DBpedia, with
11.198 triples; DrugBank, with 765.936 triples; and the Kegg Drug, with 713.737 tri-
ples. We assume that these datasets have the RDF schemas described in Figure 1.
    Consider a federation of these three datasets with the federated schema described in
Figure 2, where dotted arrows represent metadata matches and dashed arrows indicate
instance match definitions. In more detail, we have:
• RDF schemas: (as described in Figure 1).
• linkset schemas: (the federation has no linksets).
• metadata matches: drug in DBpedia with drug in DrugBank; enzyme in DrugBank
    with Metabolism in Kegg Drug; and drugInteraction in DrugBank with Interaction
    in Kegg Drug.
• instance match definitions: a sameAs definition that relates enzyme instances in
    DrugBank and enzyme instances in DBpedia; a sameAs definition that relates Drug
    instances in DBpedia and drug instances in DrugBank; a definition for a new object
    property that relates drug instances in DrugBank with drug instances in Kegg Drug.
                                     a)      drug     enzyme


   b)                                                 c)
   drug interaction       drug            enzyme                   KEGG drug


                       drug target        reference   Metabolism   Interaction      Target

              Figure 1 - RDF Schema of: (a) DBpedia; (b) DrugBank; (c) Kegg Drug.


                 Figure 2 - Federated Schema of DBpedia, DrugBank and Kegg Drug.


   Table 1 illustrates the types of federated queries the federated translation algorithm
generates. The table is organized as follows:
   • Column 1: examples of keyword-based queries with the time to compile and
        run each query (see Section 5).
   • Column 2: SPARQL queries over the federation that translate the keyword-
        based queries.
   • Column 3: a schematic description of each query.
   Section 4 discusses how to compile the keyword-based query in each row into the
corresponding federated SPARQL query. The rows of the table show:
   • Row 1: a query over DrugBank only.
   • Row 2: a federated query with two local queries over DBpedia and Drugbank
        that are combined with the help of a sameAs definition that relates the DBpedia
        instance for “ibuprofen” with the DrugBank instance for the same drug.
   • Row 3: a federated query with two local queries over Kegg Drug and Drugbank
        that are combined with the help of a UNION clause; the synthesis of the UNION
        clause is possible since drugInteraction in DrugBank matches Interaction in
        Kegg Drug.
   • Row 4: a more complex federated query with three local queries over DBpedia,
        Kegg Drug and DrugBank; the local queries over DBpedia and DrugBank are
        combined with the help of a sameAs definition that relates the DBpedia in-
        stances of type enzyme with DrugBank instances also of type enzyme; the local
        query over Kegg Drug and the query formed by the external join between the
        local queries over DBpedia and DrugBank are combined with the help of a
        UNION clause.
                          Table 1 – Sample keyword-based queries and their translations.
Keywords
Runtime (s)
                               Generated Query                                              Federated SPARQL Query Structure
 indication SELECT ?C_0_0 ?P_0_0
                                                                                                        DrugBank
                                                                                                        DrugBank                                 rdf:type
  backache WHERE{
              SERVICE SILENT <drugbank> {
                ?I_C_0_0 rdf:type drugbank:drugs .                                                                                                                                        s
                                                                                                                              drug
                ?I_C_0_0 drugbank:indication ?P_0_0
  0.34s         FILTER oracle:textContains
                                                                                                                              in                                                     on
                 (?P_0_0, "fuzzy({backache}, 70, 1)", 1)                                                                   ma                                                    ati
                                                                                                                   :do                                                        dic
                                                                                                             s                                                         :i n
                ?I_C_0_0 rdfs:label ?C_0_0                                                               rdf
            } }                                                                                                        :indication                                             “backache”

'drug target' SELECT   ?C_0_0 ?C_1_1                                                               DBpedia
                                                                                                   DBpedia                                rdf:type
 ibuprofen WHERE{
                SERVICE SILENT <dbpedia>{                                                                                        Drug                           d                       “Ibuprofen”
                   ?I_C_0_0 rdf:type dbpedia:Drug .                                                                                                                      :name


                                                                                                                                                        el
                   ?I_C_0_0 dbpedia:name ?sA_1_0                                                                             ai n


                                                                                                                                                     ab
                                                                                                                           om


                                                                                                                                                  s:l
                                                                                                                      :d
                                                                                                                                                                                           sameAs


                                                                                                                                                   f
                   FILTER oracle:textContains                                                                     s


                                                                                                                                                rd
                                                                                                              rdf                                                                         definition
                     (?C_0_0, "{\"ibuprofen\"}", 0)                                                                         :name                   “Ibuprofen”
                   ?I_C_0_0 rdfs:label ?C_0_0    }
  1.32s         SERVICE SILENT <drugbank>{                                                          DrugBank
                                                                                                    DrugBank
                                                                                                                                                                      rdfs:label
                     ?I_C_1_0 drugbank:target ?I_C_1_1 .
                                                                                                                   drug                                     s                           “Ibuprofen”
                     ?I_C_1_0 rdfs:label ?sA_1_0                                                                                          rdf:type
                     FILTER oracle:textContains                                                                                                                        get
                                                                                                                                                                :   tar
                        (?sA_1_0, "{\"ibuprofen\"}", 0)
                     ?I_C_1_1 rdfs:label ?C_1_1     }                                                                                    rdf:type
                                                                                                              drug target                                   r
            }
interaction SELECT    ?U_0                                                                                         Kegg
                                                                                                                   Kegg Drug
                                                                                                                        Drug
            WHERE{
            { SELECT (?C_1_0 AS ?U_0)                                                                                        s                    Interaction
               WHERE{                                                                                                              rdf:type
                                                                                                                                                                rdfs:label
                 SERVICE SILENT <drugbank>{
                   ?I_C_1_0 rdf:type
                                                                                                                                                  Interaction
                              drugbank:drug_interactions.                                                                                                                               Metadata
                   ?I_C_1_0 rdfs:label ?C_1_0    } }                                                                                                                                    Definition
  1.54s       UNION{                                                                                               DrugBank
                                                                                                                   DrugBank
                SELECT (?C_0_0 AS ?U_0)                                                                                            rdf:type
                  WHERE{                                                                                               r
                                                                                                                                         drug_interactions
                   SERVICE SILENT <kegg>{
                                                                                                                                                            rdfs:label
                     ?I_C_0_0 rdf:type kegg:Interaction .
                     ?I_C_0_0 rdfs:label ?C_0_0   }                                                                                           drug interactions
                 } }
            }                                                                                                                      u    = s           È r
interaction SELECT ?U_0 ?U_1                                             DBpedia
                                                                         DBpedia                                                               DrugBank
                                                                                                                                               DrugBank                               rdf:type                           rdf:type

  enzyme WHERE {
                                                                                                                                                                                                                g
                                                                                                                                                                                                              Dru
                                                                                             rdf:type


                                                                                                                                                            drug_interactions
                                                                                                                                                                                                              ion


             { SELECT (?C_2_0 AS ?U_0) (?sA_1_0 AS ?U_1)                                                       s                                                                                   r                s
metabolism                                                                  enzyme                                                                                                                                               drug
                                                                                                                                                                                                           act
                                                                                                                                                                                                        ter


               WHERE {                                                                                                                                                rdfs:label
                                                                                                                                                                                                       :in


                                                                                                              me                               drug interactions
                SERVICE SILENT <drugbank>{                                rdfs:domain                   :na                                                           rdfs:domain                                             rdfs:domain
                                                                                                                                                 :interactionDrug1                                           :enzyme
                   ?I_C_2_1 drugbank:enzyme ?I_C_2_2 .                      :name                       ?sA_1_0                                                                                                                :enzyme
                   ?I_C_2_0 drugbank:interactionDrug ?I_C_2_1 .                                                                                                 enzymes                            t
                                                                                                                                 Metadata                                             rdf:type
                   ?I_C_2_2 rdfs:label ?sA_1_0 .                                                                                 Definition
                                                                                                                                                Metadata             rdfs:label
                                                                                                                                                                                                  rdfs:label
                                                                                                                                                                                  enzymes                           ?sA_1_0
                   ?I_C_2_0 rdfs:label ?C_2_0 .    }                                                                                            Definition
                                                                                                               sameAs
  3.55s          SERVICE SILENT <dbpedia>{                                                                    definition
                   ?I_C_0_0 rdf:type dbpedia:Enzyme .
                                                                                                                                               Kegg
                                                                                                                                               Kegg
                   ?I_C_0_0 dbpedia:name ?sA_1_0 . } } }                                                                                                               rdf:type
                                                                                                                                                                                                          c               Metabolism
                                                                                                                                                Interaction                       b
             UNION {                                                                    =
                                                                                                                                                                                                              rdf:type
                                                                                u0           r            È            b
                                                                                                                                                         rdfs:label :interaction                  :metabolism                   rdfs:label
               SELECT (?C_1_0 AS ?U_0) (?C_1_1 AS ?U_1)                                                                                                                                       a
               WHERE{                                                          u1       =   t⋈s           È            c                          Interaction
                                                                                                                                                                                        rdf:type
                                                                                                                                                                                                                         Metabolism

                 SERVICE SILENT <kegg>{
                                                                                                                                                                                         drug
                   ?I_C_1_0 kegg:metabolism ?I_C_1_2 .
                                                                                                                                                                      rdfs:domain                      rdfs:domain
                   ?I_C_1_0 kegg:interaction ?I_C_1_1 .
                                                                                                                                                                       :interaction                    :metabolism
                   ?I_C_1_1 rdfs:label ?C_1_0 .
                   ?I_C_1_2 rdfs:label ?C_1_1    } } } }
4        The Federated Translation Algorithm

The key problems to compile keywords into federated queries are: (1) How to generate
local queries that cover the input keywords as much as possible; (2) How to combine
the local queries, using the metadata matches and instance match definitions of the fed-
erated schema. The first problem was addressed in [8] and the second problem is the
focus of this section.


4.1      Overview of the Federated Translation Algorithm

This section details the federated translation algorithm that compiles keyword based-
queries into federated SPARQL queries (see Figure 3). It has as input a keyword-based
query K and a federated schema F and returns a federated RDF query Q.
    Stage 1 runs the centralized translation algorithm, for each dataset Ti. The result
may be a local query, Qi, if dataset Ti contributes to answering K, or NULL, otherwise.
It also returns, for each variable vk, with 1 £ k £ n:
    1. the set of keywords Kk that vk covers;
    2. the URIs of the elements that vk binds to, as follows:
         • if vk binds to instances of a class ci, then the URI of ci is returned;
         • if vk binds to property values of a property pi, then the URIs of pi and ci, are
             returned, where ci is the domain of pi.
    The set of keywords that Qi covers is then Ki = K1 È … È Kn. By construction of the
centralized translation algorithm, we have Ki Í K.
    The following sections detail how Stage 2 synthesizes a federated SPARQL query
Q from the local queries. Very briefly, Step 2.1 finds the set of external joins that are
candidates to link the set of local queries. Step 2.2 creates the federated query multi-
graph GF=<VF, EF, WF> such that the nodes are the local queries, the edges are the

    FEDERATED TRANSLATION ALGORITHM
    Input:    A keyword-based query K
              A federated RDF schema F
    Output: A federated SPARQL query Q
    STAGE 1 Compute the set of local SPARQL queries:
      For each endpoint ei:
      1.1. Run the Centralized Translation Algorithm.
      1.2. Return the local SPARQL query Qi or NULL.
    STAGE 2 Synthesize the federated SPARQL query Q:
      2.1. Discover the external joins which are candidates to link the local queries.
      2.2. Create a federated graph GF and compute a maximum spanning forest TF of GF,
            to select the external joins that will be present in the federated query.
      2.3. Insert the triple patterns corresponding to the selected external joins into the respective
            queries.
      2.4. If TF is not connected, then check if it is possible to compute the UNION clauses.
      2.5. Synthesize and return Q.

                     Figure 3 - Outline of the Federated Translation Algorithm.
candidate external joins, and the weights are as in Section 4.3; then, Step 2.2 computes
a maximum directed spanning forest TF of GF. Step 2.3 inserts into the local queries the
triple patterns corresponding to the arcs of TF. If TF is not connected, then Step 2.4 tries
to combine the local queries represented by the connected components of TF through
UNION clauses. Step 2.5 synthesizes the federated SPARQL query Q.


4.2    Overview of the Centralized Translation Algorithm

This section briefly overviews the centralized translation algorithm, introduced in [8].
The algorithm accepts a keyword-based query K and an RDF dataset T, and outputs a
SPARQL query Q, which is a correct interpretation for K, in the sense that any result
of Q is an answer for K over T. It assumes that T follows an RDF schema S.
   The algorithm starts by computing a set of metadata matches and a set of property
value matches between the keywords and elements of T. It organizes the matches into
nucleuses, which have a class c, together with a list of properties whose domain is c
and whose values match keywords.
   The centralized translation algorithm implements two heuristics, called the scoring
and the minimization heuristics. Briefly, the scoring heuristic: (1) considers how good
a match is, say “city” matches “Cities” better than “Sin City”; (2) assigns a higher score
to metadata matches, on the grounds that, if the user specifies a keyword, say “city”,
that matches both a class label, say, “Cities”, and the property value of an instance, say
the film title “Sin City”, then the user is probably more interested in the class labelled
“Cities” than the specific film “Sin City”; (3) assigns a higher score to nucleuses that
cover a larger number of keywords. The heuristic is formalized by defining a score
function for the nucleuses.
   The minimization heuristic tries to generate minimal answers, in two stages. Ideally,
we should try to find the smallest set of nucleuses that covers the largest set of keywords
and that has the largest combined score. However, this is an NP-complete problem (by
a reduction to the bin packing problem). The first stage of the minimization heuristic
then implements a greedy algorithm that prioritizes the nucleuses with the largest scores
that cover a large subset of K. The second stage of the minimization heuristic then con-
nects the classes in such nucleuses, using a small number of equijoins. This is equiva-
lent to generating a Steiner tree ST of the graph of the RDF schema of T that covers the
classes in the prioritized nucleuses. Then, the algorithm uses the edges of ST to generate
equijoin clauses of the SPARQL query Q.


4.3    Computing External Joins

Recall that the result of executing the centralized translation algorithm for a dataset Ti
is a local query Qi or NULL. For the sake of simplicity and without loss of generality,
by reordering the datasets, we may assume that Stage 1 returns queries Q1, Q2, …, Qk,
with k £ n. Let Ci denote the set of classes present in Qi, for 1£i£k. Also, for each ciÎCi,
let score(ci) be the score of the nucleus that contains ci, as computed by the centralized
translation algorithm.
    For each pair of local queries Qi and Qj, with 1 £ i ¹ j £ n, Step 2.1 creates a candidate
external join, denoted (Qi,Qj), between Qi and Qj iff the federated schema has a linkset
schema of the form (ci,p,cj) or an instance match definition of the form (q,(ci,(pi1,…,
pin),cj,(pj1,…,pjn))) such that ciÎCi and cjÎCj. We say that the external join is generated
by the linkset schema or by the instance match definition. Step 2.1 computes the score
of (Qi,Qj) as the sum of the scores of ci and cj.
   Step 2.2 first creates the federated query multigraph GF = <VF, EF, WF>, where:
  • VF = {Q1,…,Qk} is the set of non-null local queries that Stage 1 returns;
  • there is an arc (Qi,Qj) in EF, with score WF((Qi,Qj)) = s, iff (Qi,Qj) is a candidate
       external join, returned by Step 2.1, whose score is s.
   Note that GF is indeed a multigraph, since there might be more than one candidate
external join between the same pair of local queries. Also, GF may not be connected.
   Step 2.2 then computes a maximum spanning forest TF of GF. The arcs in TF repre-
sent the selected external joins that will be used to combine the local queries.
   After selecting the external joins, Step 2.2 computes the score of a local query Qi as:
                       score(Qi) = max{ WF((Qi,Qj)) / (Qi,Qj)ÎTF }
   Then, for each selected external join (Qi,Qj), Step 2.3 synthesizes external join triple
patterns as follows. There are two cases to consider. If the external join is generated by
a linkset schema (ci,p,cj), then a triple pattern of the form (si p sj) is included in Qi. If
the external join is generated by an instance match definition of the form (q,(ci,(pi1,…,
pin),cj,(pj1,…,pjn))), then triple patterns of the forms (si pik vk) and (sj pjk vk) are included
in Qi and Qj. Note that the use of the same variable vk in both triple patterns forces the
property values to be equal.
   For example, row 2 of Table 1 shows a federated query with two local queries, Q1
and Q2, respectively over DBpedia and DrugBank. These queries are combined by ex-
ternal join triple patterns, which are in turn generated by a sameAs definition. Let C1
and C2 be the sets of classes present in Q1 and Q2, respectively. So, corresponding to
the sameAs definition, the triple pattern (?I_C_0_0 dbpedia:name ?sA_1_0) is added
to Q1 and (?I_C_1_0 rdfs:label ?sA_1_0) to Q2. The score of Q1 is given by
                   score(Q1) = WF((Q1,Q2)) = score(Drug)+score(drugs)
where we recall DrugÎC1 and drugsÎC2 and the values of score(Drug) and
score(drugs) are computed by the centralized translation algorithm in Stage 1.


4.4    Computing UNIONs
Recall that Step 2.2 creates a federated graph GF and computes a maximum spanning
forest TF of GF. However, TF may not be connected, that is, TF may consist of several
trees, in which case Step 2.4 tries to compute UNION clauses, under certain conditions.
   Let Q1 and Q2 be two SPARQL queries in different trees of TF. Assume that S1 = {v1,
v2, …, vm} is the set of variables in the TARGET clause of Q1, S2 = {w1, w2, …, wn} is
the set of variables in the TARGET clause of Q2, S1 covers the set of keywords K1 and
S2 covers the set of keywords K2. It is possible to combine Q1 and Q2 with the help of a
UNION clause iff
(1) S1 and S2 have the same number of variables, that is, m=n;
(2) There is a permutation p of 1,…,m such that each pair of variables vi and wp(i) are
     bound to instances of classes (or properties) that match; in this case, we also say
     that vi and wp(i) match.
   To generate a federated SPARQL query with a UNION clause, a bind variable us,
for s=0,…,m, is created.
   For example, row 3 of Table 1 shows a federated query with a UNION clause that
combines two local queries, Q1 and Q2, respectively over DrugBank and Kegg Drug.
Note that, the number of variables in the TARGET clause of Q1 is equal to that of Q2.
Variable ?C_1_0 in Q1 binds to instances of the class drug_interactions and variable
?C_0_0 in Q2 binds to instances of the class Interaction. According to the federated
RDF schema depicts in Figure 2, drug_interactions and Interaction match. Therefore,
variables ?C_1_0 and ?C_0_0 also match. Thus, a variable ?U_0 is created, which binds
to the values in ?C_1_0 and ?C_0_0, and Q1 and Q2 are combined via a UNION clause.


4.5    Defining the WHERE clause

This section schematically describes how the WHERE clause of the federated queries
are synthesized.
   Let “Q1 ⋈ Q2” represent an external join between two queries Q1 and Q2. Assume
that Stage 1 returns queries Q1, Q2, …, Qk, with k £ n. Let Q be the federated SPARQL
query to be constructed, and WQ be the WHERE clause of Q. In general, WQ is a union
of queries combined by external join patterns, which is denoted as
                ()
WQ = #   $%&(⋈'%& 𝑄$,' ).
   For a better understanding of the above definition, consider the following case. Sup-
pose that the local SPARQL queries are Q1, Q2, Q3 and Q4. Assume that Q1 and Q2 are
combined by external join triple patterns, and likewise for Q3 and Q4. Let P1 = Q1 ⋈ Q2
and P2 = Q3 ⋈ Q4. Assume that P1 and P2 can be combined by a UNION clause. Then,
the WHERE clause of the federated query would be:
                         WQ = P1 ⋃ P2 = (Q1 ⋈ Q2) ⋃ (Q3 ⋈ Q4)
   Note that, when P1 and P2 cannot be combined by a UNION clause, that is, when
they do not meet the conditions defined in Section 4.4, then the federated translation
algorithm will generate only one of the queries, P1 or P2.
   For example, row 4 of Table 1 shows a federated query whose WHERE clause cor-
responds to a union of queries combined by external join triple patterns. Indeed, let Q1,
Q2, and Q3 denote the local queries over DBpedia, Kegg Drug and DrugBank, respec-
tively. Observe from row 4 of Table 1 that Q1 and Q2 are combined by external join
triple patterns generated by a sameAs definition, denoted P1 = Q1 ⋈ Q2. Note that:
   (1) Corresponding to the sameAs definition, the triple patterns (?I_C_2_2
        rdfs:label ?sA_1_0) and (?I_C_0_0 dbpedia:name ?sA_1_0) are included
        in Q1 and Q2, respectively;
   (2) Variables ?C_2_0 and ?sA_1_0 cover the keyword set K1 = {“interaction”,
        “enzyme” };
   (3) The results in variable ?C_2_0 represent instances of the class drug_interaction
        in DrugBank;
   (4) The results in variable ?sA_1_0 represent instances of the class enzyme that
        occurs in DBpedia and DrugBank.
   As the example shows, the TARGET clause of Q3 contains variables ?C_1_0 and
?C_1_1, which cover the set of keywords K2 = {“interaction”, “metabolism”}. Variable
?C_1_0 represents instances of the class Interaction, and variable ?C_1_1 binds to the
labels of the instances of the class Metabolism. Then, the TARGET clauses of both P1
and Q3 have the same number of variables and, according the federated schema defined
in Figure 2, variables ?C_2_0 and ?C_1_0 are bound to classes that match. Then, the
results of both variables are bound to the new variable ?U_0. Likewise, variables
?sA_1_0 and ?C_1_1 match, and the results are bound to a variable, ?U_1. So, the
WHERE clause of the federated query is given by WQ = P1 ⋃ Q3 = (Q1 ⋈ Q2) ⋃ Q3.


4.6    Defining the TARGET clause

Let Q be a federated query with TARGET clause SQ and WHERE clause WQ. The con-
struction of SQ consists mainly in the computation of a subset Var(SQ) of the set of
variables Var(WQ) present in WQ. The computation of Var(SQ) depends on WQ and the
coverage of the keywords set K. The different cases are explained in what follows.
Federated Query with a WHERE clause without external join triple patterns or
UNION clauses
Suppose that the centralized translation algorithm outputs a single local query Q1. Then,
the federated SPARQL query Q will be Q1, with an additional “SERVICE SILENT”
clause to query the target dataset, and Var(SQ)=Var(SQ1). Row 1 of Table 1 illustrates
this case.
Federated Query with a WHERE clause with only external join triple patterns
Suppose that the WHERE clause of the federated query Q is of the form
WQ = Q1⋈…⋈Qn.
   To compute the set of variables Var(SQ) of the TARGET clause of Q, a greedy strat-
egy is used, based on the score of the local queries, and taking into account the coverage
of the keywords set K by the variables in Var(SQ).
   Let CQ ={Q1, …, Qm} be the set of local queries. The strategy starts with Var(SQ)=Æ
and a set of covered keywords K’ = Æ. Assume that the local query Qi, with 1 ≤ i ≤ m,
has the highest value score. Then, the variables in Var(SQi) are added to Var(SQ), and
the keywords covered by Var(SQi) are added to K’. If K = K’ or all local queries have
been analyzed, the process stops. Otherwise, the next local query Qj in decreasing score
order is analyzed and, if there is a variable vj ÎVar(SQj) such that vj covers a set of
keywords Kj Í K, and there is a keyword kÎKj such that kÏK’, then vj is added to
Var(SQ) and k is added to K’.
   For example, row 2 of Table 1 shows a federated query Q such that the set of varia-
bles Var(SQ) is equal to Var(SQ1), because Q1 is the query with the highest score and
the variables in Var(SQ1) cover the set of keyword K3.
Federated Query with a WHERE clause with only UNION clauses
Suppose that the WHERE clause WQ of the federated query Q is of the form
WQ = Q1⋃…⋃Qn.
   For brevity, consider only the case of two local queries, that is, Q = Q1⋃Q2. Assume
that SQ1 = {v1, …, vm}, SQ2 = {w1, …, wm}, and there is a permutation p of 1,…,m such
that each pair of variables vi and wp(i) match. Then, a new variable ui is created to bind
the results of variables vi and wp(i) and the TARGET clause SQ of Q is composed of the
bind variables u1,…,um.
   For example, row 3 of Table 1 shows a federated query Q such that the set of varia-
bles Var(SQ) consists of a new bind variable ?U_0.
Federated Query with a WHERE clause with external join triple patterns and
UNION clauses
Suppose that the WHERE clause WQ of the federated query Q is of the form
             ()
WQ = # $%&(⋈'%& 𝑄$,' ).
   The strategy for choosing the variables in the TARGET clause of Q is similar to the
previous case and is based on the structure of WQ. As an example, suppose that
Q = (Q1⋈Q2) ⋃ (Q3⋈Q4) and that the sets of variables of Q1⋈Q2 and Q3⋈Q4 are
Var(SQ1⋈Q2)={v1,…,vm} and Var(SQ3⋈Q4)={w1,…,wm}, respectively. Assume that these
sets satisfy Conditions (1) and (2) defined in Section 4.4 and that they cover the key-
word set K. Then, the results of Q1⋈Q2 and Q3⋈Q4 can be combined via a UNION
clause, and a new variable ui is created to bind the results of variables vi and wp(i), for
i=1,…,m. The set of variables Var(Q) is composed of the new bind variables u1,…,um.
   For example, Row 4 of Table 1 shows a federated query Q such that
WQ = (Q1 ⋈ Q2) ⋃ Q3 and Var(Q)={?U_0, ?U_1}.


5      Implementation and Experiments
We implemented a complete federated keyword search system, which incorporates the
federated translation algorithm. The architecture of the system has the following com-
ponents: (1) a Web interface, which allows users to submit keyword-based queries to
the federation; (2) a Mediator, which orchestrates the processing of the keyword-based
queries; (3) a Storage Component, which stores data and metadata about the RDF da-
tasets of the federation; (4) a Federated Schema Component, which saves the federated
schema of the federation; and (5) the RDF datasets that compose the federation.
   Briefly, the processing of a keyword-based query goes as follows. The Mediator
receives the set of keywords specified by the user and executes the federated translation
algorithm, which: (1) uses the Storage Component to find the data and metadata that
matches the keywords; (2) uses the Federated Schema Component to find the external
joins between the computed subqueries; (3) if necessary, it creates UNION clauses to
combine the result of queries that are not linked by inter-dataset property definitions;
(4) synthesizes the federated SPARQL query. Then, the Mediator executes the feder-
ated SPARQL query and returns the response to the user.
   We ran a suite of keyword-based queries to assess the performance of the federated
translation algorithm. All experiments were conducted using the Web interface of the
system, develop in Java. The APP ran on a desktop machine with OS Windows 7 Ulti-
mate, a quad-core processor Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz, 4 GB of
RAM. In the same desktop machine, all system components were allocated to an Oracle
Database version 12c, configured with a PGA size of 324 MB and an SGA size of 612
MB with 148 MB of cache size and 296 MB of buffer cache.
   As introduced in Section 3.2, three RDF datasets were selected for the experiments:
the drug and enzyme data in DBpedia, with 11.198 triples; DrugBank, with 765.936
triples; and the Kegg Drug, with 713.737 triples. These datasets were stored in different
station nodes, located in the same local network, configured using the Oracle Spatial
and Graph for Semantic Technologies, available in Oracle 12c. Data and metadata
stored in the Storage Component were indexed using Oracle Text to facilitate full text
search and each remote RDF graph was indexed using the B-Tree indexing for RDF
models and entailments supported by the Semantic Network feature of Oracle.
   The keyword-based queries were selected to show the different configurations of the
WHERE clause of the federated query, and the coverage of the set of keywords by the
variables in the TARGET clause. Table 1 shows some of the queries run in the test suite.
In particular, Column 1 of Table 1 shows that all queries were successfully executed in
less than 4 seconds (the time reported is the average of 10 executions for each sample
query). This is quite reasonable, considering: the size of the datasets; the fact that the
system returns 750 results as a limit; and the fact that the subquery results come from
different datasets stored in a local network.
   Indeed, the results obtained by running the complete test suit suggest that the algo-
rithm performs well for the keyword-based search over federated RDF graphs.


6      Conclusions and Future Work

We presented an algorithm, called federated translation algorithm, to perform keyword
search over federated RDF datasets by exploring their schemas. The algorithm extends
the centralized translation algorithm described in [8] and generates a federated
SPARQL query such that: (1) the local queries only access the datasets whose indexed
data and metadata match the keywords; and (2) the variables in the TARGET clause of
the federated SPARQL query cover a subset of the set of keywords submitted.
   We detailed the design decisions that support the construction of the federated
SPARQL query, based on the information of the federated schema and on the local
queries generated by the centralized translation algorithm. We defined the conditions
to combine, with the help of external join patterns and UNION clauses, the local que-
ries. We explained how the TARGET clause is constructed, according to the composi-
tion of the WHERE clause.
   Finally, we briefly described the implementation of a federated keyword search sys-
tem, which includes the federated translation algorithm, and conducted experiments to
test the performance of the system, using three freely accessible RDF databases, with
metadata matches and instance match definitions specified between them. The experi-
ments suggest that the proposed algorithm performs well for keyword-based search
over federated RDF datasets.
   As future work, we are developing a benchmark for RDF federated keyword search
systems, which we plan to use to further test the system described in this paper. The
benchmark will include scenarios with a larger number of RDF datasets and more com-
plex federated schemas. Finally, we plan to extend the current implementation to other
federated RDF storage systems and to make the tool publicly available.
References
1.  Acosta, M., Vidal, M. E., Lampo, T., Castillo, J., & Ruckhaus, E. (2011). ANAPSID:
    an adaptive query processing engine for SPARQL endpoints. ISWC 2011, pp. 18-34.
2. Bizer, C., Heath, T., & Berners-Lee, T. (2008). Linked data: Principles and state of the
    art. WWW 2008, pp. 1-40.
3. Buil-Aranda, C., Arenas, M., Corcho, O., & Polleres, A. (2013). Federating queries in
    SPARQL 1.1: Syntax, semantics and evaluation. Web Semantics: Science, Services
    and Agents on the World Wide Web, 18(1), 1-17.
4. Dragan, L., Möller, K., Handschuh, S., Ambrus, O., & Trüg, S. (2009). Converging
    Web and Desktop Data with Konduit. In Proc. of Scripting and Development for the
    Semantic Web Workshop.
5. Cyganiak, R., Wood, D., & Lanthaler, M. (2014). RDF 1.1 concepts and abstract syn-
    tax. W3C Recommendation, 25, 1-8.
6. Elbassuoni, S., & Blanco, R. (2011). Keyword search over RDF graphs. ACM CIKM
    2011, pp. 237-242.
7. Ferré, S. (2017). Sparklis: an expressive query builder for SPARQL endpoints with
    guidance in natural language. Semantic Web, 8(3), 405-418.
8. García, G. M., Izquierdo, Y. T., Menendez, E. S., Dartayre, F., & Casanova, M. A.
    (2017). RDF Keyword-based Query Technology Meets a Real-World Dataset. EDBT
    2017 – Industrial and Applications Paper, pp. 656-667.
9. Huang, J., Abadi, D. J., & Ren, K. (2011). Scalable SPARQL querying of large RDF
    graphs. Proceedings of the VLDB Endowment, 4(11), 1123-1134.
10. Möller, K., Dragan, L., Ambrus, O., & Handschuh, S. (2008). A visual interface for
    building SPARQL queries in Konduit. ISWC-PD 2008, pp. 68-69.
11. Nikolov, A., Schwarte, A., & Hütter, C. (2013). Fedsearch: Efficiently combining
    structured queries and full-text search in a sparql federation. ISWC 2013, pp. 427-443.
12. Prud’hommeaux, E., & Buil-Aranda, C. (2013). SPARQL 1.1 federated query. W3C
    Recommendation, 21.
13. Quilitz, B., & Leser, U. (2008). Querying distributed RDF data sources with SPARQL.
    ESWC 2008, pp. 524-538.
14. Rakhmawati, N. A., Umbrich, J., Karnstedt, M., Hasnain, A., & Hausenblas, M. (2013).
    A comparison of federation over SPARQL endpoints frameworks. In International
    Conference on Knowledge Engineering and the Semantic Web (pp. 132-146).
15. Rakhmawati, N. A., Umbrich, J., Karnstedt, M., Hasnain, A., & Hausenblas, M. (2013).
    Querying over Federated SPARQL Endpoints---A State of the Art Survey. arXiv pre-
    print arXiv:1306.1723.
16. Saleem, M., Khan, Y., Hasnain, A., Ermilov, I., & Ngonga Ngomo, A. C. (2016). A
    fine-grained evaluation of SPARQL endpoint federation systems. Semantic Web, 7(5),
    493-518.
17. Schwarte, A., Haase, P., Hose, K., Schenkel, R., & Schmidt, M. (2011). Fedx: Optimi-
    zation techniques for federated query processing on linked data. ISWC 2011, pp. 601-
    616.
18. Volz, J., Bizer, C., Gaedke, M., & Kobilarov, G. (2009). Discovering and maintaining
    links on the web of data. ISWC 2009, pp. 650-665.
19. Zeng, K., Yang, J., Wang, H., Shao, B., & Wang, Z. (2013). A distributed graph engine
    for web scale RDF data. Proceedings of the VLDB Endowment 6(4), pp. 265-276).
20. Zenz, G., Zhou, X., Minack, E., Siberski, W., & Nejdl, W. (2009). From keywords to
    semantic queries—Incremental query construction on the Semantic Web. Web Seman-
    tics: Science, Services and Agents on the World Wide Web, 7(3), 166-176.

</pre>