=Paper=
{{Paper
|id=Vol-2161/paper37
|storemode=property
|title=Processing SPARQL Queries on Deep Web Sources
|pdfUrl=https://ceur-ws.org/Vol-2161/paper37.pdf
|volume=Vol-2161
|authors=Andrea Calì,Tommaso Di Noia,Thomas Walker Lynch,Azzurra Ragone
|dblpUrl=https://dblp.org/rec/conf/sebd/CaliNLR18
}}
==Processing SPARQL Queries on Deep Web Sources==
Processing SPARQL Queries on Deep Web Sources

Andrea Calì (1,4), Tommaso Di Noia (2), Thomas W. Lynch (3,1), and Azzurra Ragone (5)

1 Dept. of Comp. Sci. and Inf. Syst., Birkbeck, University of London, UK
2 SisInf Lab, Politecnico di Bari, Italy
3 Reasoning Technology Ltd, UK
4 Oxford-Man Inst. of Quantitative Finance, University of Oxford, UK
5 Independent Researcher

andrea@dcs.bbk.ac.uk, tommaso.dinoia@poliba.it, thomas.lynch@reasoningtechnology.com, azzurra.ragone@gmail.com

SEBD 2018, June 24-27, 2018, Castellaneta Marina, Italy. Copyright held by the author(s).

Abstract. The Deep Web consists of data accessible through dynamic Web pages requested through a Web interface. While Deep Web data sources have usually been modelled as relational in the literature, in many cases it is useful to export Deep Web data as Linked Data sets. In this context, query processing poses algorithmic challenges due to the inherent limitations of Deep Web sources which, requiring some inputs in order to be queried, function as services. In this paper we present a framework and a system to export Deep Web data as Linked Data sets. Further, we characterise a class of SPARQL queries that are executable directly, in spite of the limitations.

==1 Introduction==

The Deep Web (also known as the Hidden Web) [9, 6, 3] consists of structured data that are available as dynamically generated Web pages, produced upon queries usually posed through HTML forms. Deep Web content cannot be indexed by search engines and is therefore not immediately searchable or accessible. The Deep Web is distinct from the so-called Surface Web, the latter being the set of ordinary, static Web pages. It is also known that the Deep Web is orders of magnitude larger than the Surface Web [8]. Deep Web data are normally structured and of great value; however, the limitations in accessing them make them hard to search and query.

Integrating Deep Web sources into a single database poses several challenges. Normally in this approach, which allows for processing structured queries [3] as well as keyword queries [5], sources are federated into a single schema, and one deals with known sources that contain data related to a single domain of interest. Deep Web sources have naturally been modelled as relational tables that can be queried only according to so-called access patterns (or access limitations); more specifically, certain attributes must be selected in the query in order to get an answer; such a selection corresponds to filling the corresponding attribute in the form with a value.

New ways of exposing structured data have recently emerged, which allow the composition of services for the creation of new integrated applications (mash-ups [11]) and knowledge spaces. Among the various technical proposals and approaches for data publication on the Web that have survived to the present day, the two most relevant are RESTful services [7] and Linked Data (LD) [2]. RESTful services provide an agile way of exposing data in a request/response fashion over HTTP, and have been widely adopted by programmers thanks to their ease of implementation [1]. In this context, data are usually returned as XML or JSON documents after the invocation of a service. Among the issues related to the pure RESTful approach we mention the following (see the sketch after this list):

– There is no explicit semantics attached to the returned data.
– There is no unique query language to invoke services. Each service exposes its own API, and APIs differ considerably from each other even when they refer to the same knowledge domain.
– The integration of different data sources is difficult and is often implemented ad hoc.
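As a concrete illustration of the request/response style just described, the following minimal Python sketch invokes a quote service over HTTP GET and parses its JSON response. The endpoint, parameter, and field names are hypothetical, invented purely for illustration.

<pre>
# A minimal sketch of the RESTful request/response style discussed above.
# The endpoint and its parameters are hypothetical, not a real API.
import json
import urllib.request

def get_quote(ticker: str) -> dict:
    # Invoke the service over HTTP GET; the result is a plain JSON
    # document whose field names carry no machine-readable semantics.
    url = f"https://api.example.com/quote?symbol={ticker}"
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))

# A second provider exposing the same information would have a different
# URL scheme and different field names, so integration needs ad-hoc glue.
</pre>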
On the other hand, the Linked Data approach is based on the idea that data can be delivered on the Web together with their explicit semantics, expressed by means of common vocabularies. Following the Linked Data principles (https://www.w3.org/DesignIssues/LinkedData.html), datasets should be made accessible through a SPARQL endpoint. Moreover, by using federated queries (https://www.w3.org/TR/sparql11-federated-query/) an agent is able to automatically integrate data coming from different sources, thus creating a data space at Web scale. Unfortunately, the Linked Data approach also comes with drawbacks, among which we may mention:

– For service providers, the effort of setting up a SPARQL endpoint is greater than that of adopting a RESTful approach. Normally it is much easier to find a JSON-based service than an LD-based one.
– Programmers are usually more familiar with JSON services than with SPARQL endpoints.
– Service providers are usually not interested in exposing all the data they have; instead, they normally want to expose only a small portion of their data.

Based on the above points we can see that, while from the practical point of view the RESTful approach is the most efficient, from the knowledge point of view the Linked Data paradigm represents a more suitable solution. JSON-LD could in principle solve this issue, but that format is not widely adopted yet.

Following the above observations, we built the PoLDo prototype system, which is able to export existing RESTful services, even third-party ones, as a SPARQL endpoint, thus making the underlying Deep Web sources part of the Linked Data cloud. Thanks to a configurable query planner, PoLDo is able to break down a SPARQL query into a sequence of RESTful service invocations and to orchestrate different services to provide a correct answer to the posed query. Starting from the data it retrieves from services, PoLDo builds a temporary RDF dataset used to compute the result set for the original SPARQL query.

In PoLDo, we expose Deep Web sources as services subject to the aforementioned intrinsic restrictions. Rather than limiting ourselves to relational Deep Web sources, we model sources as Linked Data with access limitations, that is, SPARQL access points that can only be queried with queries that contain a certain fixed, ground (variable-free) pattern; an example of such a pattern is sketched below. This poses the problem of determining whether a query can be evaluated over a set of sources (in the form of SPARQL endpoints) while returning complete answers in the presence of the access limitations. For other queries, we can only expect to get partial answers.
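To make the notion of a fixed ground pattern concrete, the following sketch shows two SPARQL queries, as Python string literals, against an imagined source that requires the object of a given predicate to be supplied; the ex: vocabulary and the constants are hypothetical.

<pre>
# Two illustrative SPARQL queries for a source whose access limitation
# requires the object of ex:ticker to be a constant. The ex: vocabulary
# and the data are hypothetical.

ANSWERABLE = """
PREFIX ex: <http://example.org/>
SELECT ?price WHERE {
  ?company ex:ticker "ACME" .  # ground pattern: the required input is supplied
  ?company ex:price  ?price .
}
"""

NOT_DIRECTLY_ANSWERABLE = """
PREFIX ex: <http://example.org/>
SELECT ?ticker ?price WHERE {
  ?company ex:ticker ?ticker . # the required input is left as a variable,
  ?company ex:price  ?price .  # so the source cannot be invoked directly
}
"""
</pre>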
==2 Modelling and Querying==

In this section we present a model for Linked Data sources that expose Deep Web data, with their inherent limitations. We characterise a class of queries that can be evaluated non-recursively on the data while returning the complete set of answers (as if they were evaluated in the absence of limitations).

We first show how Deep Web sources are represented as relational data. When dealing with databases under access limitations, the relations can be annotated with access modes. In the case of RESTful services we have the two access modes i and o (for input and output, respectively), stating that for every tuple r(c1, ..., cn) ∈ DB some of the arguments of r are mapped to the inputs of the service and some to its outputs. Following [4], we denote access modes as a superscript of the relation. As an example, a simple service can be expressed by a relation r1 defined as r1^{oio}(C1, C2, C3), where C2 is an input parameter and C1, C3 are output parameters. Assume we also have another service r2^{ioo}(C3, C4, C5), where C3 is an input parameter and C4, C5 are output parameters. Notice that we use the same name for parameters that are compatible, that is, that have instance values (constants) of the same type; this is expressed by the notion of abstract domain in the Deep Web [3], that is, a domain specifying the class of the objects it represents (e.g. telephone number, person's name, etc.) rather than the concrete domain (string, integer, etc.). The annotations that specify input and output parameters are also called access patterns. Access limitations affect the computation of the result of a query, because not all data, in the form of facts r1(c1, c2, c3) where c1, c2, c3 are constants, may be accessible.

Consider the conjunctive query q1 defined as

q1(X) ← r1^{oio}(Y, a2, X), r2^{ioo}(X, a4, Z).

This query can be processed by executing the services corresponding to the relations r1 and r2 from left to right. Indeed, by invoking the first service with input a2 we obtain as output a set of pairs of constants of the form ⟨c1, c3⟩, each instantiating Y and X respectively, as a consequence of the fact that r1(c1, a2, c3) ∈ DB. For each such pair, the second value c3 is then used as input to invoke the service corresponding to r2, thus obtaining as output a new set of pairs of the form ⟨c4, c5⟩. Among these pairs, due to the selection with a4 on the second atom of the query, we are interested only in those such that c4 = a4. This kind of query is therefore executable and retrieves a complete answer containing all solutions to the query over DB. Notice that even when a query is not executable as written, in some cases it can be executed from left to right after reordering its subgoals (atoms) [4]. Such queries are called feasible (or orderable). Feasible queries can be evaluated so as to return the complete answer to the query.

If we had in addition the relation r3 (and associated service) defined as r3^{oio}(C5, C4, C6) and a query

q2(X) ← r1^{oio}(Y, a2, X), r2^{ioo}(X, a4, Z), r3^{oio}(Z, U, W),

we might not be able to compute all the answers. For example, the instance {r1(d1, a2, d3), r2(d3, a4, d5), r3(d5, d4, d6)} yields the answer ⟨d3⟩ for q2, but this answer cannot be retrieved under the limitations: r3 can only be invoked by supplying a value for its second argument U, and no binding for U is produced by the preceding atoms. In general, the answers q(D) to a query q on a database D (as computed if the data sources had no access limitations) are a superset of the answers ans(q, I, D) that can actually be retrieved through the access patterns, that is, ans(q, I, D) ⊆ q(D), where I is a set of initial constants available to start the extraction of data from the sources. A sketch of the left-to-right execution of q1 is given below.
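The following minimal Python sketch, an illustration under assumed toy data rather than PoLDo's implementation, executes q1 left to right: it invokes r1 with the constant a2, feeds each returned X value into r2, and applies the selection on a4.

<pre>
# A minimal sketch of left-to-right execution of
# q1(X) <- r1^{oio}(Y, a2, X), r2^{ioo}(X, a4, Z)
# over two services with access patterns. The service functions and the
# constants are hypothetical stand-ins for remote calls.

def call_r1(c2):
    """r1^{oio}: the second argument is the input; returns (C1, C3) pairs."""
    db = {"a2": [("c1", "c3"), ("c1b", "c3b")]}
    return db.get(c2, [])

def call_r2(c3):
    """r2^{ioo}: the first argument is the input; returns (C4, C5) pairs."""
    db = {"c3": [("a4", "c5")], "c3b": [("d4", "d5")]}
    return db.get(c3, [])

def eval_q1(a2, a4):
    answers = set()
    for (y, x) in call_r1(a2):      # bind Y and X from the first atom
        for (c4, z) in call_r2(x):  # feed X into the second service
            if c4 == a4:            # selection imposed by the constant a4
                answers.add(x)      # q1 projects on X
    return answers

print(eval_q1("a2", "a4"))          # -> {'c3'}
</pre>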
When considering Linked Data sets, the limitations of HTML forms (or other limitations beyond the relational formalism) are reflected naturally, as we explain below. We refer to the formalisation of SPARQL in [10]. Data are in the form of triples ⟨s, p, o⟩ ∈ (I ∪ B) × I × (I ∪ B ∪ L), where B is the set of blank nodes. We assume a partial relation ρ on predicates, where ρ(p1, p2) means that p1 and p2 have compatible domains; this can be specified through the rdfs:domain of an rdf:Property, but we do not expose this in detail in this preliminary paper. The fact that two predicates have compatible domains reflects the notion of abstract domain in the Deep Web. Basic graph patterns (BGPs) are sets of triple patterns of the form t ∈ (I ∪ L ∪ V) × (I ∪ V) × (I ∪ L ∪ V), where I, V and L are the sets of IRIs, variables and literals, respectively.

In the context of integrating a set of Deep Web data sources and answering queries posed on such sources as if they were a single database, we assume a schema S = {S1, ..., Sn} of Deep Web sources. Analogously to [3], as in a "traditional" data integration setting, we assume that we are integrating a set of known sources related to a domain of interest; we do not address here the problem of automatic discovery of sources, nor the problem of designing a global schema that offers a seamless model of the underlying data. Instead, we take the union of the sources as our database. Each source Si, with 1 ≤ i ≤ n, is an access point that has an associated set Λi of triple patterns called input triple patterns (ITPs). Intuitively, in order to query a source, we need to use a SPARQL query that provides constants for the object (the third element of the corresponding triple) of all such patterns. This reflects the structure of Deep Web sources when represented as relational tables. More specifically, to query a source Si, we need a query that contains all input triple patterns Λi for Si, with the third element (the object) instantiated to a value.

To start our exposition, we assume that queries are of the form q = P1 AND P2 AND ... AND Pn, where, for all i such that 1 ≤ i ≤ n, each Pi is a graph pattern to be evaluated on Si. We do not differentiate here between SELECT and CONSTRUCT query types. We now formalise the limitations on sources and establish a criterion for determining which queries can be evaluated so that all answers are computed. We consider a schema S = {S1, ..., Sn} of Deep Web sources with access limitations Λ = {Λ1, ..., Λn}. For each Si, 1 ≤ i ≤ n, Λi is a set of input triple patterns (ITPs) of the form λ = ⟨Vs, p, Vo⟩ ∈ V × I × V. Let D = {D1, ..., Dn} be an instance for the schema S, where Di is the instance of Si, for all i such that 1 ≤ i ≤ n.

Definition 1. (a) A triple pattern α instantiates an ITP λ = ⟨Vs, p, Vo⟩ if α = µ(λ) for some mapping µ such that µ(Vo) ∈ I ∪ L; notice that in this case µ(p) = p. (b) A source S with a set Λ of input triple patterns can be accessed with a BGP P if for each λ ∈ Λ there is α ∈ P such that α instantiates λ.

==3 Processing Queries==

We now focus on queries that can be executed on an instance D according to the access patterns.

Definition 2. Let t1, t2 be triple patterns defined as t1 = ⟨W1, p1, V⟩ and t2 = ⟨W2, p2, V⟩, with {W1, W2} ⊆ I ∪ L ∪ V and V ∈ V. (a) We say that t1 feeds t2 if ρ(p1, p2), that is, p1 and p2 are compatible (notice that both triple patterns have the same variable as third element). (b) We say that t1 and t2 match if p1 = p2.
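As a concrete reading of Definitions 1 and 2, here is a minimal Python sketch under an assumed encoding of triple patterns as (subject, predicate, object) tuples, with variables written as strings starting with "?"; the compatibility function stands in for ρ, and all names are illustrative.

<pre>
# A minimal sketch (assumed encodings, not PoLDo's data structures) of
# Definitions 1 and 2. Triple patterns are (subject, predicate, object)
# tuples; variables are strings starting with '?'.

def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def instantiates(alpha, itp):
    """Definition 1(a): alpha instantiates ITP <Vs, p, Vo> if it has the
    same predicate and its object is a constant (an IRI or a literal)."""
    return alpha[1] == itp[1] and not is_var(alpha[2])

def can_access(bgp, itps):
    """Definition 1(b): every ITP of the source is instantiated by some
    triple pattern of the BGP."""
    return all(any(instantiates(a, itp) for a in bgp) for itp in itps)

def feeds(t1, t2, compatible):
    """Definition 2(a): t1 and t2 share the same object variable and
    their predicates are compatible according to rho."""
    return is_var(t1[2]) and t1[2] == t2[2] and compatible(t1[1], t2[1])

def matches(t1, t2):
    """Definition 2(b): t1 and t2 have the same predicate."""
    return t1[1] == t2[1]

# Usage with a hypothetical source requiring the object of ex:ticker:
itps = [("?c", "ex:ticker", "?s")]
bgp = [("?c", "ex:ticker", '"ACME"'), ("?c", "ex:price", "?p")]
print(can_access(bgp, itps))  # True: the ITP is instantiated
</pre>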
We now identify a class of queries of the form q = P1 AND P2 AND ... AND Pn that are executable from left to right, after ordering the graph patterns Pi, on an instance D for a schema S; we call such queries orderable, in accordance with the relational terminology for queries under access limitations.

Definition 3. Let q be a query of the form q = P1 AND P2 AND ... AND Pn, posed on a schema S as previously defined. q is said to be orderable if P1, ..., Pn can be ordered as Pi1, ..., Pin such that: (a) for every ITP λ ∈ Λi1, there is α ∈ Pi1 such that α instantiates λ; (b) for all j such that 2 ≤ j ≤ n and for every ITP λ ∈ Λij, either (i) there is α ∈ Pij such that α instantiates λ, or (ii) there are β ∈ Pij and γ ∈ Piℓ, with 1 ≤ ℓ ≤ j − 1, such that (1) γ feeds β and (2) β and λ match.

Notice that this definition is analogous to the corresponding one in the context of relational data under access limitations.

Proposition 1. Every orderable query q = P1 AND P2 AND ... AND Pn on a schema S can be executed on an instance D, after suitably ordering P1, ..., Pn, from left to right, so that all answers q(D) are retrieved.

Proof (sketch). We know q can be ordered as in Definition 3; assume w.l.o.g. that it is already ordered. It is easily seen that P1 can be immediately evaluated on D1. Then, for each j such that 2 ≤ j ≤ n, Pj can be evaluated on Dj according to Λj, because every ITP λ of Λj is either (1) instantiated by some triple pattern in Pj, or (2) instantiated by a triple pattern β′ obtained from some β ∈ Pj (as per Definition 3) by replacing its third element with the value of the third element of µ(γ), where γ ∈ Pℓ, 1 ≤ ℓ ≤ j − 1, feeds β, for every mapping µ resulting from the evaluation of Pℓ on Dℓ.

[Fig. 1: the graph patterns P1, P2 and the ITPs λ1, λ2 of Example 1 (figure omitted).]

Example 1. Consider the query P1 AND P2, where P1 and P2 are as in Figure 1 (we omit variables that appear only once), on a schema S = {S1, S2} with Λ1 = {λ1} and Λ2 = {λ2}, where λ1 and λ2 are again as in Figure 1. The query q is processed by evaluating P1 on S1 directly and then, for each value v obtained for ?X in the evaluation of P1, evaluating the pattern P2′ (see figure) obtained by replacing ?X with v in P2. A sketch of an orderability test in this spirit is given below.
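The following minimal Python sketch, an assumption-laden illustration rather than PoLDo's planner, implements a greedy ordering test in the spirit of Definition 3: at each step it looks for a graph pattern whose ITPs are all either instantiated within the pattern itself or matched by a triple pattern fed from an already placed pattern.

<pre>
# A greedy orderability test in the spirit of Definition 3 (a sketch with
# assumed encodings, not PoLDo's planner). Triple patterns are
# (subject, predicate, object) tuples; variables start with '?'.

def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def itp_satisfied(itp, pattern, placed, compatible):
    # (i) the ITP is instantiated by a triple pattern of `pattern` itself
    if any(t[1] == itp[1] and not is_var(t[2]) for t in pattern):
        return True
    # (ii) some beta in `pattern` matches the ITP and is fed by a triple
    # pattern gamma of an already placed pattern (shared object variable,
    # compatible predicates)
    for beta in pattern:
        if beta[1] != itp[1] or not is_var(beta[2]):
            continue
        for earlier in placed:
            for gamma in earlier:
                if is_var(gamma[2]) and gamma[2] == beta[2] \
                        and compatible(gamma[1], beta[1]):
                    return True
    return False

def order(sources, compatible):
    """sources: list of (graph_pattern, itps) pairs.
    Returns an executable left-to-right order, or None if not orderable."""
    remaining, placed = list(sources), []
    while remaining:
        for pair in remaining:
            pattern, itps = pair
            if all(itp_satisfied(itp, pattern,
                                 [p for p, _ in placed], compatible)
                   for itp in itps):
                placed.append(pair)
                remaining.remove(pair)
                break
        else:
            return None  # no pattern can be placed next
    return [p for p, _ in placed]
</pre>

A greedy strategy suffices in this sketch because placing a pattern never hurts later placements: once a pattern is placed, it only adds triple patterns that can feed others, so an ITP that is satisfiable at some step remains satisfiable afterwards.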
==4 PoLDo==

In this section we briefly present the architecture of PoLDo, a prototype that processes queries over distributed Deep Web sources exposed as Linked Data sets.

[Fig. 2: a high-level view of the architecture of PoLDo (figure omitted).]

The high-level architecture of PoLDo is shown in Figure 2. The engine is responsible for taking the SPARQL query and breaking it down into a sequence of RESTful calls to a remote service. The transformation is made possible by a mapping file that maps Linked Data URIs to the elements of the signature of the remote call. While querying the remote service, PoLDo feeds a local RDF triple store (Jena Fuseki, in the current implementation), which is in charge of processing the actual SPARQL query. More specifically, our system comprises the following components and models.

PoLDo engine. The engine accepts the SPARQL query and extracts all the constants from the graph template in the WHERE clause. Then, using the algorithm sketched in Section 3, it uses the constants to query the external service and obtain the data that will be used to create a local RDF representation of the data space. Thanks to the information encoded in the PoLDo mapping file, the engine is able to feed a local repository of RDF triples. The engine is also capable of exploiting external services to obtain URIs for extracted resources from mapped API services, which often return data related to a resource but not its URI.

Jena. The Jena model is used to save a Linked Data version of the data that are incrementally retrieved from the RESTful service. The availability of a third-party RDF model enables PoLDo to support the full specification of the SPARQL query language. Furthermore, it can return the data in all the formats supported by the Jena ARQ query engine.

PoLDo Mapping Generator. The Mapping Generator is responsible for generating the mapping file. Given a RESTful service, it works in four steps. It analyses the request (HTTP GET) and the response (JSON or XML) given to and by the web service, then extracts all the inputs and outputs. The user is then allowed to manually assign a class of membership to resources. The Mapping Generator queries ontologies (DBpedia and LOV, in our first experimentation) and recommends how to link resources. After user confirmation, the final mapping file is generated and can be used by the engine.

PoLDo mapping file. This file contains information about how to map the URIs of the SPARQL query to the inputs and outputs of the services. It also describes the access patterns, as represented in Section 2, as well as their mutual relations (that is, which triple patterns feed each other).

==References==

1. Vito Walter Anelli, Vito Bellini, Andrea Calì, Giuseppe De Santis, Tommaso Di Noia, and Eugenio Di Sciascio. Querying deep web data sources as linked data. In Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics, WIMS 2017, Amantea, Italy, June 19-22, 2017, pages 32:1–32:7, 2017.
2. Christian Bizer, Tom Heath, and Tim Berners-Lee. Linked data – the story so far. Int. J. Semantic Web Inf. Syst., 5(3):1–22, 2009.
3. Andrea Calì and Davide Martinenghi. Querying data under access limitations. In Proc. of ICDE, pages 50–59, 2008.
4. Andrea Calì and Davide Martinenghi. Querying the deep web. In Proc. of EDBT, pages 724–727, 2010.
5. Andrea Calì, Davide Martinenghi, and Riccardo Torlone. Keyword queries over the deep web. In Proc. of ER, pages 260–268, 2016.
6. Kevin Chen-Chuan Chang, Bin He, and Zhen Zhang. Toward large scale integration: Building a metaquerier over databases on the web. In Proc. of CIDR, pages 44–55, 2005.
7. Roy T. Fielding and Richard N. Taylor. Principled design of the modern web architecture. ACM Transactions on Internet Technology, 2(2):115–150, 2002.
8. Govind Kabra, Zhen Zhang, and Kevin Chen-Chuan Chang. Dewex: An exploration facility for enabling the deep web integration. In Proc. of ICDE, pages 1511–1512, 2007.
9. Jayant Madhavan, Loredana Afanasiev, Lyublena Antova, and Alon Y. Halevy. Harnessing the deep web: Present and future. In Proc. of CIDR, 2009.
10. Jorge Pérez, Marcelo Arenas, and Claudio Gutiérrez. Semantics and complexity of SPARQL. ACM Trans. on Database Systems, 34(3):16:1–16:45, 2009.
11. Ahmet Soylu, Felix Mödritscher, Fridolin Wild, Patrick De Causmaecker, and Piet Desmet. Mashups by orchestration and widget-based personal environments: Key challenges, solution strategies, and an application. Program, 46(4):383–428, 2012.