=Paper=
{{Paper
|id=Vol-1378/paper38
|storemode=property
|title=Keyword Search in the Deep Web
|pdfUrl=https://ceur-ws.org/Vol-1378/AMW_2015_paper_38.pdf
|volume=Vol-1378
|dblpUrl=https://dblp.org/rec/conf/amw/CaliMT15
}}
==Keyword Search in the Deep Web==
Keyword Search in the Deep Web Andrea Calı̀1,4 , Davide Martinenghi2 , and Riccardo Torlone3 1 2 Birkbeck, University of London, UK Politecnico di Milano, Italy andrea@dcs.bbk.ac.uk davide.martinenghi@polimi.it 3 4 Università Roma Tre, Italy Oxford-Man Inst. of Quantitative Finance torlone@dia.uniroma3.it University of Oxford, UK Abstract. The Deep Web is constituted by data accessible through Web pages, but not readily indexable by search engines, as they are returned in dynamic pages. In this paper we propose a framework for accessing Deep Web sources, represented as relational tables with so-called ac- cess limitations, with keyword-based queries. We formalize the notion of optimal answer and investigate methods for query processing. To our knowledge, this problem has never been studied in a systematic way. 1 Problem definition Basics. We model data sources as relations of a relational database and we assume that, albeit autonomous, they have “compatible” attributes. For this, we assume that the attributes of relations are defined over a set of abstract domains D = {D1 , . . . , Dm }, which, rather than denoting concrete value types (such as string or integer), represent data types at a higher level of abstraction Sn (for instance, car or country). The set of all values is denoted by D = i=1 Di . In the following, we shall denote by R(A1 , . . . , Ak ) a (relation) schema, by dom(A) ∈ D the domain of an attribute A, by r a relation over R, and by r = {r1 , . . . , rn } a (database) instance of a database schema R = {R1 , . . . , Rn }. Access limitations. An access pattern Π for a schema R(A1 , . . . , Ak ) is a mapping sending each attribute Ai into an access mode, which can be either input or output; Ai is correspondingly called an input (resp., output) attribute for R wrt. Π. For ease of notation, we shall mark input attributes with an ‘i’ superscript to distinguish them from the output ones. Let A01 , . . . , A0l be all the input attributes for R wrt. Π; any tuple hc1 , . . . , cl i such that ci ∈ dom(A0i ) for 1 ≤ i ≤ l is called a binding for R wrt. Π. An access α consists of an access pattern Π for a schema R and a binding for R wrt. Π; the output of such an access α on an instance r is the set T = σA1 =c1 ,...,Al =cl (r). Intuitively, we can only access a relation if we can provide a value for every input attribute. Given an instance r for a database schema R, a set of access patterns Π for the relations in R, and a set of values C ⊆ D, an access path (for R, Π and C) is a sequence of accesses α1 , . . . , αn on r such that each value in the binding of αi , 1 ≤ i ≤ n, either occurs in the output of an access αj with j < i or is a value in C. A tuple t in r is said to be reachable if there exists an access path P such that t is in the output of some t31 A1 t33 A2 t31 t12 t33 t31 t33 t12 t23 t21 t23 t21 t23 t11 t11 t11 t11 (a) Access paths (b) Join graph (c) Answers Fig. 1. Example 1: Reachable portion, corresponding join graph, and answers. access in P ; the reachable portion reach(r, Π, C) of r is the set of all reachable tuples in r given the values in C. Keyword queries. A keyword query is a set of values in D called keywords. Example 1 Consider a query q = {k1 , k2 }, a schema (with access patterns Π) R = {R1 (Ai1 , A2 ), R2 (Ai2 , A1 ), R3 (Ai1 , A2 , A3 )}, and an instance r such that Ai2 A1 Ai1 A2 A3 Ai1 A2 c c t c c k t r1 = k1 c1 t11 r2 = 1 2 21 r3 = 2 1 2 31 c4 c2 t22 c5 c4 k2 t32 c2 c3 t12 c1 c6 t23 c6 c7 k2 t33 Figure 1(a) shows the reachable portion of r given the values in q along with the access paths used to extract it, with dotted lines enclosing outputs of accesses. Given a set T of tuples, the join graph of T is a node-labelled undirected graph T = hN, Ei constructed as follows: (i) the nodes N are labelled with tuples of T , and (ii) there is an arc between two nodes n1 and n2 if the tuples labelling n1 and n2 have at least one value in common. Example 1 (cont.) The join graph of reach(r, Π, q) is shown in Figure 1(b). Definition 1 (Answer). An answer to a keyword query q against a database instance r over a schema R with access patterns Π is a set of tuples A in reach(r, Π, q) such that: (1) each c ∈ q occurs in at least one tuple t in A; (2) the join graph of A is connected; (3) for every subset A0 ⊆ A such that A0 enjoys Condition 1 above, the join graph of A0 is not connected. It is straightforward to see that there could be several answers to a keyword query; below we give a widely accepted criterium for ranking such answers [4]. Definition 2. Let A1 , A2 be two answers of a keyword query q on an instance r of size |A1 | and |A2 | respectively; we say that A1 is better than A2 , denoted A1 A2 , if |A1 | ≤ |A2 |. The optimal answers are those of minimum size. Example 1 (cont.) The sets A1 = {t11 , t31 } and A2 = {t11 , t23 , t33 } are an- swers to q; A1 is better than A2 and is the optimal answer to q. 2 2 Keyword-based answering in the Deep Web We now present a vanilla algorithm to discuss the computational complexity of answering a keyword query q in the deep Web modeled as an instance r of a schema R with access patterns Π. Example 1 shows that, in the worst case, we need to extract the whole reachable portion to obtain the tuples involved in an optimal answer. In fact, s = reach(r, Π, q) is actually a connected join graph, since every tuple in it is in some output of some access path starting from the values in the query (see for example Figure 1.a), but further paths may exist between tuples in s (see Figure 1.b). Therefore, query answering requires in general two main steps, described in Algorithm 1: (i) extract the reachable portion s of r; (ii) if possible, remove tuples from s so that the obtained set satisfies the conditions of Definition 1, while minimizing its size. Algorithm 1: Computing an optimal answer (Answer(q, Π, r)) Input: Keyword query q, access patterns Π, instance r over R Output: Answer A 1. A := reachableP ortion(r, Π, q); // see Algorithm 2 2. if A does not contain all values in q then return nil; 3. else prune(A, q); // see Algorithm 3 4. return A; A simple way of extracting the reachable portion, inspired by the procedure described in [1], is shown in Algorithm 2. This algorithm may be allowed to terminate early if the answer is not required to be optimal (flag ω set to f alse), and thus can stop as soon as the reachable portion contains all the keywords in the query. This is coherent with the distinct root-based semantics of keyword search in relational databases, which provides a tradeoff between quality of the result and efficiency of the method to evaluate it [4]. Algorithm 2: Reachable portion (reachableP ortion(r, Π, q)) Input: Instance r over R, access patterns Π, initial values q Flag: boolean ω // if ω = true the answer is guaranteed to be optimal Output: Reachable portion RP 1. RP := ∅; C := ∅; 2. while an access can be made with a new binding b for some R ∈ R wrt. Π using values in C ∪ q 3. O := output of access to r over R with binding b; 4. RP := RP ∪ O; // cumulating all the obtained tuples into RP S 5. C := C ∪ A∈R,t∈O {t(A)}; // cumulating all the obtained values into C 6. if C ⊇ q ∧ ¬ω then break; 7. return RP ; Basically, determining an optimal answer from the reachable portion cor- responds to finding a Steiner tree of its join graph [4], i.e., a minimal-weight subtree of this graph involving a subset of its nodes. An efficient method for solving this problem in the context of keyword search over structured data is presented in [2], where a q-fragment can model our notion of answer. Yet, when optimality is not required, a simple technique (quadratic in the size of r) to obtain an answer (steps 2–6 of Algorithm 3) consists in trying to remove any tuple from the set as long as it contains all the keywords and remains connected. 3 Algorithm 3: Pruning (prune(T , q)) Input: Set of tuples T , keyword query q Flag: boolean ω // if ω = true the answer is guaranteed to be optimal Output: Minimal set of tuples T 1. if ω then return a minimal subtree of the join graph of T that contains q; 2. T 0 := T ; T 00 := ∅; 3. while T 00 6= T 0 4. T 00 := T 0 ; 5. for each t ∈ T 00 if T 0 \ {t} is connected and T 0 \ {t} ⊇ q then T 0 := T 0 \ {t}; 6. return T 0 ; The extraction of the reachable portion of an instance r with access lim- itations can be implemented by a Datalog program over r [1], which can be evaluated in polynomial time in the size of the input [3]. In addition, in [2] it is shown that the optimal q-fragments of r can be enumerated in ranked-order with polynomial delay, i.e., the time for printing the next optimal answer is again polynomial in the size of r. Hence, we can state the following preliminary result. Theorem 1. An optimal answer to a keyword query against a database instance with access limitations can be efficiently computed under data complexity. 3 Discussion and future work In this paper, we have defined the problem of keyword search in the Deep Web and provided some preliminary insights on query answering in this context. As future work on the problem in question, we plan to: – devise optimization strategies for query answering; in particular, identify conditions under which an optimal answer can be derived without extracting the whole reachable instance; – leverage known values (besides the keywords), modeled as relations with only one (output) attribute, to speed up the search for an optimal answer; – study the problem in a scenario in which the domains of the keywords are known in advance: in this case schema-based techniques can be used; – consider the case in which nodes and arcs of the join graph are weighted to model source availability and proximity, respectively. Acknowledgments. Andrea Calı̀ acknowledges support by the EPSRC grant “Logic-based Integration and Querying of Unindexed Data” (EP/E010865/1). References 1. A. Calı̀, D. Martinenghi. Querying Data under Access Limitations. In ICDE, pag. 50–59, 2008. 2. B. Kimelfeld and Y. Sagiv. Finding and approximating top-k answers in keyword proximity search. In PODS, pag. 173–182, 2006. 3. M. Vardi.The complexity of relational query languages. In STOC, pag. 137–146, 1982. 4. J. Xu Yu, L. Qin, and L. Chang. Search in Relational Databases: A Survey. IEEE Data Eng. Bull., 33(1): 67–78, 2010. 4