=Paper= {{Paper |id=Vol-1912/paper10 |storemode=property |title=Querying the Deep Web: Back to the Foundations |pdfUrl=https://ceur-ws.org/Vol-1912/paper10.pdf |volume=Vol-1912 |authors=Andrea Calí,Davide Martinenghi,Igor Razgon,Martín Ugarte |dblpUrl=https://dblp.org/rec/conf/amw/CaliMRU17 }} ==Querying the Deep Web: Back to the Foundations== https://ceur-ws.org/Vol-1912/paper10.pdf
                         Querying the Deep Web:
                         Back to the Foundations

     Andrea Calı́1,4 , Davide Martinenghi2 , Igor Razgon1 , and Martı́n Ugarte3
1                                                  2
    Dept of Comp. Sci. and Inf. Syst.                  Dip. di Elettr., Informaz. e Bioing.
    Birkbeck, Univ. of London, UK                         Politecnico di Milano, Italy
    3                                       4
        Web and Information Tech. Lab.          Oxford-Man Inst. of Quantitative Finance
         Université Libre de Bruxelles                University of Oxford, UK
                                 {andrea,igor}@dcs.bbk.ac.uk
                               davide.martinenghi@polimi.it
                                    mugartec@ulb.ac.be


           Abstract. The Deep Web is the large corpus of data accessible on the
           Web through forms and presented in dynamically-generated pages, but
           not indexable as static pages, and therefore invisible to search engines.
           Deep Web data are usually modelled as relations with so-called access
           limitations, that is, they can be queried only by selecting certain at-
           tributes. In this paper we give some fundamental complexity results on
           the problem of processing conjunctive (select-project-join) queries on re-
           lational data with access limitations.



1        Introduction

The term Deep Web (also called Hidden Web) [7, 5, 6] refers to the data content
that is created dynamically as the result of a specific search on the web. For exam-
ple, when we query a White Pages website, the generated output consists of one
or more pages containing the result of a query posed on an underlying database;
these pages cannot be indexed by search engines. When we query whitepages.com
through a form, we are forced to fill in some fields of the form, for instance the
Name field; the result is then structured as a table. A Deep Web source can be
naturally modelled as a relational table (or a set of relational tables) that can
be queried only according to so-called access patterns, each of which enforces the
selection on some of the attributes (which corresponds to filling the input fields
in the form with values), which are called input attributes. Relational tables
accessible through access patterns are said to have access limitations.
    Processing structured queries over Deep Web sources is the key problem
in the integration of such sources. Interestingly, when Deep Web sources are
modeled, as mentioned, as relations with access limitations, answering a simple
conjunctive (select-project-join) query on such sources requires, in the worst case,
the evaluation of a recursive Datalog query plan. In such plans, values obtained
as output from a source are used as input for other sources; the compatibility
of values is established by assigning to each attribute of a relation a so-called
         ρ1 :       q() ← r̂(X, Y ), ŝ(Z, Y )        ρ5 : dom(Y ) ← ŝ(X, Y )
         ρ2 : r̂(X, Y ) ← dom(X), r(X, Y )            ρ5 : dom(Y ) ← r̂(X, Y )
         ρ3 : ŝ(X, Y ) ← dom(X), s(X, Y )            ρ6 : dom(a)

                        Fig. 1. Datalog program for Example 1


abstract domain, which expresses the type of value (e.g. name, address etc.) as
opposed to the concrete domain (e.g. string, integer etc.).
   In this paper we consider conjunctive queries (CQs) on relational schemata
with access limitations and the two traditional problems associated with them:
query answering and query containment. Such problems have been extensively
studied in the literature; however, for some cases the problem of determining the
complexity is still open. We tackle some of such cases with the following results:

 – We show that CQ answering under access limitations is np-complete with
   respect to combined complexity; thus, the access limitations do not increase
   the complexity of classic query answering (without access limitations).
 – We consider the problem of CQ containment under access limitations, known
   to be co-nexptime-complete in its general form [3, 4] and thought to be ex-
   ptime-complete in the case of queries without constants [2]. We first address
   the case of input-only predicates; we show that in such a case the problem
   is Πp2 -complete. As for the hardness, we show that Πp2 -hardness holds under
   stricter conditions: for predicates of arity 6 2 and two abstract domains.
   Then we address CQ containment for (input-output) binary predicates; we
   conjecture that this problem is also in Πp2 .

2    Query Answering

We assume the reader is familiar with the notions of relational schema and
instance, conjunctive query and Datalog program — otherwise, see for instance
the book of Abiteboul et al. [1]. We consider relational schemata whose predicates
are annotated so as to express whether each argument/attribute is input (needs
to be selected) or output; for instance, riio , of arity 3, has the first two attributes
as input attributes, and the third as output.
    In the presence of access limitations on the sources, queries cannot be usually
evaluated as in the traditional case, as we show below. Given a conjunctive query
q, a schema with access limitations (implicit), an instance D and a set I of initial
constants, the answers to q, denoted ans(q, I, D), are obtained starting from the
constants in I and extracting all possible tuples (by using the constants as input
in all possible ways); with the newly obtained constants again all possible tuples
are extracted, and so on, until no new tuple is extracted – see e.g. [5].

Example 1. Consider a schema with predicates rio and sio (which contain the
facts of D), a set of initial constants I = {a}, and the Boolean CQ q() ←
r(X, Y ), s(Z, Y ). Assume there is a single abstract domain, represented by the
unary predicate dom, associated with all attributes (arguments). The Datalog
program Πq for q is shown in Figure 1 (facts of D omitted). The query is rewritten
over the cache relations r̂, ŝ (rule ρ1 ) defined in the cache rules ρ2 and ρ3 , which
contain the facts extracted according to rules ρ2 and ρ3 .
   We now come to our result on the decision problem of CQ answering under
access limitations; w.l.o.g., we consider Boolean CQs.
Theorem 1. CQ answering under access limitations is np-complete with respect
to combined complexity.
    Proof (sketch). For membership we exhibit a non-deterministic algorithm that
performs 6 |D| steps; at each step guesses one of the 6 |D|W ·|R| possible accesses
to relations. Hardness follows from CQ answering without access limitations.

3    Query Containment
Definition 1. Consider two CQs q1 , q2 over a schema with access limitations,
as well as a set I of initial constants such that const(q1 ) ∪ const(q2 ) ⊆ I ⊆
∆ (const(q) denotes the constants in a query q, while ∆ denotes the infinite
domain of constants); we say that q1 is contained in q2 under access limitations
with respect to I, denoted q1 ⊆I q2 , if, for every database D for R, we have
ans(q1 , I, D) ⊆ ans(q2 , I, D).
    Checking containment amounts to checking containment between two recur-
sive Datalog programs in the special form presented in Section 2. W.l.o.g., we
consider Boolean CQs as in Section 2. We first consider the case of input-only
predicates. An input-only n-predicate r is accessed, in an instance D, with an
n-tuple hti of constants of the appropriate domain, and tells (with a Boolean
result) whether r(hti) ∈ D. Evidently, this restricts the definition of contain-
ment to instances composed solely of constants of the initial set I. The following
lemma has a rather straightforward proof. A tight hardness result follows.
Lemma 1. CQ containment under access limitations with input-only predicates
is in Πp2 .
Theorem 2. CQ containment under access limitations with input-only predi-
cates of arity 6 2 and two abstract domains is Πp2 -hard.
    Proof (sketch). The proof is by reduction from a tighter version of
Generalised-Graph-Colouring [8], which is Σp2 -complete and is defined as
follows: given a graph F and a positive integer k, is there a two-colouring of the
vertices of F that does not contain a monochromatic (on vertices) clique of k
vertices? We reduce Generalised-Graph-Colouring to non-containment un-
der the above stated restrictions using a predicate e/2 for edges and a predicate
col /2 to indicate by col (v, c) that the vertex v has colour c.
    As a corollary we get tight bounds for the input-only case.
Corollary 1. CQ containment under access limitations with input-only predi-
cates is Πp2 -complete.
    Finally, we studied the binary case with both input and output predicates.
Theorem 3. CQ containment under access limitations with binary predicates
is in Πp2 .
    Proof (sketch). The proof uses the crayfish-chase technique of [4] to check
q1 ⊆I q2 in the binary case. Relying on the fact that q2 is “blind” to pairs of
atoms that are more than |q2 | steps apart in a join graph, to check the existence
of a counterexample for containment we guess, by means of the crayfish-chase, a
polynomially bound set of atoms representing a fragment of instance that makes
q1 true; then we check whether no homomorphism maps q2 onto such fragment.

4    Discussion
We have presented some results on our ongoing study of the fundamentals of the
complexity of CQ answering and containment under access limitations. Interest-
ingly, some of the fundamental problems have been overlooked in the literature,
for instance the complexity of CQ answering under access limitations, for which
we gave a tight bound. We also presented results for the input-only case, em-
ploying techniques that, we believe, pave the way to future investigations. The
binary case is interesting as most knowledge representation formalisms rely on
binary relations; we plan to find a tight bound for its complexity, proving our
conjecture. Finally, we shall study CQ answering and containment under access
limitations as well as integrity constraints expressed as ontological rules; this has
applications in the intersection between the Semantic Web and the Deep Web.

Acknowledgments. Martn Ugarte and Andrea Calı́ acknowledge support by the
EU COST action IC1302 “Keystone”. Andrea Calı́ acknowledges partial support
by the EPSRC project “Logic-based Integration and Querying of Unindexed
Data” (EP/E010865/1)”.

References
1. Serge Abiteboul, Richard Hull, and Victor Vianu. Foundations of Databases.
   Addison-Wesley, 1995.
2. Michael Benedikt. Personal communication, 2017.
3. Michael Benedikt, Georg Gottlob, and Pierre Senellart. Determining relevance of
   accesses at runtime. In Proc. of PODS, pages 211–222, 2011.
4. Andrea Calı̀ and Davide Martinenghi. Conjunctive Query Containment under Ac-
   cess Limitations. In Proc. of ER, pages 326–340, 2008.
5. Andrea Calı̀ and Davide Martinenghi. Querying data under access limitations. In
   Proc. of ICDE, pages 50–59, 2008.
6. Kevin Chen-Chuan Chang, Bin He, and Zhen Zhang. Toward large scale integration:
   Building a metaquerier over databases on the web. In Proc. of CIDR, pages 44–55,
   2005.
7. Jayant Madhavan, Loredana Afanasiev, Lyublena Antova, and Alon Y. Halevy. Har-
   nessing the deep web: Present and future. In Proc. of CIDR, 2009.
8. Vladislav Rutenburg. Complexity of generalized graph coloring. In Proc. of MFCS,
   pages 573–581, 1986.