=Paper= {{Paper |id=Vol-1378/paper38 |storemode=property |title=Keyword Search in the Deep Web |pdfUrl=https://ceur-ws.org/Vol-1378/AMW_2015_paper_38.pdf |volume=Vol-1378 |dblpUrl=https://dblp.org/rec/conf/amw/CaliMT15 }} ==Keyword Search in the Deep Web== https://ceur-ws.org/Vol-1378/AMW_2015_paper_38.pdf
               Keyword Search in the Deep Web

             Andrea Calı̀1,4 , Davide Martinenghi2 , and Riccardo Torlone3
     1                                               2
         Birkbeck, University of London, UK           Politecnico di Milano, Italy
              andrea@dcs.bbk.ac.uk                 davide.martinenghi@polimi.it
         3                                4
         Università Roma Tre, Italy          Oxford-Man Inst. of Quantitative Finance
         torlone@dia.uniroma3.it                     University of Oxford, UK


         Abstract. The Deep Web is constituted by data accessible through Web
         pages, but not readily indexable by search engines, as they are returned
         in dynamic pages. In this paper we propose a framework for accessing
         Deep Web sources, represented as relational tables with so-called ac-
         cess limitations, with keyword-based queries. We formalize the notion
         of optimal answer and investigate methods for query processing. To our
         knowledge, this problem has never been studied in a systematic way.


1    Problem definition
Basics. We model data sources as relations of a relational database and we
assume that, albeit autonomous, they have “compatible” attributes. For this,
we assume that the attributes of relations are defined over a set of abstract
domains D = {D1 , . . . , Dm }, which, rather than denoting concrete value types
(such as string or integer), represent data types at a higher level of abstraction
                                                                            Sn
(for instance, car or country). The set of all values is denoted by D = i=1 Di .
    In the following, we shall denote by R(A1 , . . . , Ak ) a (relation) schema, by
dom(A) ∈ D the domain of an attribute A, by r a relation over R, and by
r = {r1 , . . . , rn } a (database) instance of a database schema R = {R1 , . . . , Rn }.

Access limitations. An access pattern Π for a schema R(A1 , . . . , Ak ) is a mapping
sending each attribute Ai into an access mode, which can be either input or
output; Ai is correspondingly called an input (resp., output) attribute for R wrt.
Π. For ease of notation, we shall mark input attributes with an ‘i’ superscript to
distinguish them from the output ones. Let A01 , . . . , A0l be all the input attributes
for R wrt. Π; any tuple hc1 , . . . , cl i such that ci ∈ dom(A0i ) for 1 ≤ i ≤ l is called
a binding for R wrt. Π. An access α consists of an access pattern Π for a schema
R and a binding for R wrt. Π; the output of such an access α on an instance r
is the set T = σA1 =c1 ,...,Al =cl (r). Intuitively, we can only access a relation if we
can provide a value for every input attribute. Given an instance r for a database
schema R, a set of access patterns Π for the relations in R, and a set of values
C ⊆ D, an access path (for R, Π and C) is a sequence of accesses α1 , . . . , αn
on r such that each value in the binding of αi , 1 ≤ i ≤ n, either occurs in the
output of an access αj with j < i or is a value in C. A tuple t in r is said to be
reachable if there exists an access path P such that t is in the output of some
                                                                   t31         A1         t33   A2
       t31         t12         t33      t31            t33

                                              t12
                                                                                          t23
             t21               t23      t21            t23
                                                                         t11        t11
                         t11                   t11


         (a) Access paths              (b) Join graph                      (c) Answers

   Fig. 1. Example 1: Reachable portion, corresponding join graph, and answers.


access in P ; the reachable portion reach(r, Π, C) of r is the set of all reachable
tuples in r given the values in C.

Keyword queries. A keyword query is a set of values in D called keywords.
Example 1 Consider a query q = {k1 , k2 }, a schema (with access patterns Π)
R = {R1 (Ai1 , A2 ), R2 (Ai2 , A1 ), R3 (Ai1 , A2 , A3 )}, and an instance r such that
                                          Ai2 A1                 Ai1 A2 A3
                          Ai1 A2
                                           c c t                  c c k t
                     r1 = k1 c1 t11   r2 = 1 2 21            r3 = 2 1 2 31
                                           c4 c2 t22              c5 c4 k2 t32
                          c2 c3 t12
                                           c1 c6 t23              c6 c7 k2 t33
Figure 1(a) shows the reachable portion of r given the values in q along with the
access paths used to extract it, with dotted lines enclosing outputs of accesses.

Given a set T of tuples, the join graph of T is a node-labelled undirected graph
T = hN, Ei constructed as follows: (i) the nodes N are labelled with tuples of
T , and (ii) there is an arc between two nodes n1 and n2 if the tuples labelling
n1 and n2 have at least one value in common.
Example 1 (cont.) The join graph of reach(r, Π, q) is shown in Figure 1(b).

Definition 1 (Answer). An answer to a keyword query q against a database
instance r over a schema R with access patterns Π is a set of tuples A in
reach(r, Π, q) such that: (1) each c ∈ q occurs in at least one tuple t in A;
(2) the join graph of A is connected; (3) for every subset A0 ⊆ A such that A0
enjoys Condition 1 above, the join graph of A0 is not connected.
It is straightforward to see that there could be several answers to a keyword
query; below we give a widely accepted criterium for ranking such answers [4].
Definition 2. Let A1 , A2 be two answers of a keyword query q on an instance
r of size |A1 | and |A2 | respectively; we say that A1 is better than A2 , denoted
A1  A2 , if |A1 | ≤ |A2 |. The optimal answers are those of minimum size.
Example 1 (cont.) The sets A1 = {t11 , t31 } and A2 = {t11 , t23 , t33 } are an-
swers to q; A1 is better than A2 and is the optimal answer to q.


                                              2
2     Keyword-based answering in the Deep Web
We now present a vanilla algorithm to discuss the computational complexity of
answering a keyword query q in the deep Web modeled as an instance r of a
schema R with access patterns Π. Example 1 shows that, in the worst case,
we need to extract the whole reachable portion to obtain the tuples involved
in an optimal answer. In fact, s = reach(r, Π, q) is actually a connected join
graph, since every tuple in it is in some output of some access path starting
from the values in the query (see for example Figure 1.a), but further paths may
exist between tuples in s (see Figure 1.b). Therefore, query answering requires
in general two main steps, described in Algorithm 1: (i) extract the reachable
portion s of r; (ii) if possible, remove tuples from s so that the obtained set
satisfies the conditions of Definition 1, while minimizing its size.
Algorithm 1: Computing an optimal answer (Answer(q, Π, r))
      Input: Keyword query q, access patterns Π, instance r over R
      Output: Answer A
       1. A := reachableP ortion(r, Π, q); // see Algorithm 2
       2. if A does not contain all values in q then return nil;
       3. else prune(A, q); // see Algorithm 3
       4. return A;

    A simple way of extracting the reachable portion, inspired by the procedure
described in [1], is shown in Algorithm 2. This algorithm may be allowed to
terminate early if the answer is not required to be optimal (flag ω set to f alse),
and thus can stop as soon as the reachable portion contains all the keywords
in the query. This is coherent with the distinct root-based semantics of keyword
search in relational databases, which provides a tradeoff between quality of the
result and efficiency of the method to evaluate it [4].

    Algorithm 2: Reachable portion (reachableP ortion(r, Π, q))
      Input: Instance r over R, access patterns Π, initial values q
      Flag: boolean ω // if ω = true the answer is guaranteed to be optimal
      Output: Reachable portion RP
       1. RP := ∅; C := ∅;
       2. while an access can be made with a new binding b for some R ∈ R wrt. Π using values in C ∪ q
       3.   O := output of access to r over R with binding b;
       4.   RP := RP ∪ O; // cumulating all the obtained tuples into RP
                     S
       5.   C := C ∪ A∈R,t∈O {t(A)}; // cumulating all the obtained values into C
       6.   if C ⊇ q ∧ ¬ω then break;
       7. return RP ;

    Basically, determining an optimal answer from the reachable portion cor-
responds to finding a Steiner tree of its join graph [4], i.e., a minimal-weight
subtree of this graph involving a subset of its nodes. An efficient method for
solving this problem in the context of keyword search over structured data is
presented in [2], where a q-fragment can model our notion of answer. Yet, when
optimality is not required, a simple technique (quadratic in the size of r) to
obtain an answer (steps 2–6 of Algorithm 3) consists in trying to remove any
tuple from the set as long as it contains all the keywords and remains connected.


                                                 3
    Algorithm 3: Pruning (prune(T , q))
      Input: Set of tuples T , keyword query q
      Flag: boolean ω // if ω = true the answer is guaranteed to be optimal
      Output: Minimal set of tuples T
       1. if ω then return a minimal subtree of the join graph of T that contains q;
       2. T 0 := T ; T 00 := ∅;
       3. while T 00 6= T 0
       4.    T 00 := T 0 ;
       5.    for each t ∈ T 00 if T 0 \ {t} is connected and T 0 \ {t} ⊇ q then T 0 := T 0 \ {t};
       6. return T 0 ;




    The extraction of the reachable portion of an instance r with access lim-
itations can be implemented by a Datalog program over r [1], which can be
evaluated in polynomial time in the size of the input [3]. In addition, in [2] it
is shown that the optimal q-fragments of r can be enumerated in ranked-order
with polynomial delay, i.e., the time for printing the next optimal answer is again
polynomial in the size of r. Hence, we can state the following preliminary result.
Theorem 1. An optimal answer to a keyword query against a database instance
with access limitations can be efficiently computed under data complexity.

3     Discussion and future work
In this paper, we have defined the problem of keyword search in the Deep Web
and provided some preliminary insights on query answering in this context. As
future work on the problem in question, we plan to:
 – devise optimization strategies for query answering; in particular, identify
   conditions under which an optimal answer can be derived without extracting
   the whole reachable instance;
 – leverage known values (besides the keywords), modeled as relations with only
   one (output) attribute, to speed up the search for an optimal answer;
 – study the problem in a scenario in which the domains of the keywords are
   known in advance: in this case schema-based techniques can be used;
 – consider the case in which nodes and arcs of the join graph are weighted to
   model source availability and proximity, respectively.

Acknowledgments. Andrea Calı̀ acknowledges support by the EPSRC grant
“Logic-based Integration and Querying of Unindexed Data” (EP/E010865/1).
References
 1. A. Calı̀, D. Martinenghi. Querying Data under Access Limitations. In ICDE,
    pag. 50–59, 2008.
 2. B. Kimelfeld and Y. Sagiv. Finding and approximating top-k answers in keyword
    proximity search. In PODS, pag. 173–182, 2006.
 3. M. Vardi.The complexity of relational query languages. In STOC, pag. 137–146, 1982.
 4. J. Xu Yu, L. Qin, and L. Chang. Search in Relational Databases: A Survey. IEEE
    Data Eng. Bull., 33(1): 67–78, 2010.



                                                       4