=Paper=
{{Paper
|id=Vol-1189/paper_18
|storemode=property
|title=Towards a new Foundation for Keyword Search in Relational Databases
|pdfUrl=https://ceur-ws.org/Vol-1189/paper_18.pdf
|volume=Vol-1189
|dblpUrl=https://dblp.org/rec/conf/amw/Torlone14
}}
==Towards a new Foundation for Keyword Search in Relational Databases==
<pdf width="1500px">https://ceur-ws.org/Vol-1189/paper_18.pdf</pdf>
<pre>
    Towards a new Foundation for Keyword Search in
                 Relational Databases
                                    Riccardo Torlone
                                 Università Roma Tre, Italy

1 Introduction
The idea of querying relational databases using keywords emerged a decade ago [4]
as a way to provide an high-level access to data and free the user from the knowledge
of query languages and data organization. The common approach to this problem is as
follows: the database is viewed as a graph G in which the nodes represent tuples and the
edges represent foreign key references between them, a query is a set of strings Q (the
keywords), and the result is a subgraph G′ of G whose nodes contain the keywords in
Q. Usually it is assumed that: (i) all the keywords should appear in the result and (ii) the
result should have minimal size. In this framework, query answering usually relies on
rather complex, graph-based techniques.
    We believe that this approach has a main drawback: it relies on the specific distri-
bution of data in relational tables, which may depend on aspects that are not related to
the actual content of the database. Indeed, today the degree of normalization is usually
based not only on data redundancy, but also on the database workload, which has to
do with the type and frequency of queries and updates. It follows that the result of a
keyword query can change by just modifying the organization of the database (e.g., for
optimization purposes) even if its actual content does not change.
    We then propose a new approach to keyword search in relational databases that re-
lies on the weak instance model [7, 6], an old yet still fascinating tool from relational
database theory in which a database is considered as a whole, regardless of the way in
which data are decomposed in the various relation schemes. In this model, we present a
simple semantics for the result of a keyword query that is independent of the database
schema and show that, by suitably extending a basic definition, we can introduce a rel-
evance criterium among different results of a query. We also discuss the computational
complexity of keyword search by means of a basic method for query answering.

2 Keyword search over weak instances
Let U be a finite set of attributes and R = {R1 (X1 ), . . . , Rn (Xn )} the schema of a
relational database such that the union of the Xi ’s is U . We say that an instance r of
R (globally) satisfies a set of functional dependencies (FDs) F if there is a relation w
on U , called a weak instance for r, that satisfies F and contains the relations of r in its
projections over the respective relation schemes, that is: πXi (w) ⊇ ri , for 1 ≤ i ≤ n.
    Let Tr for r be tableau formed by taking the union of all the relations in r extended
to U by means of unique variables. The representative instance for r, indicated with
RI r , is the tableau obtained by chasing [5] Tr with respect to F .
    Consider for instance a database scheme with relations R1 (ED), R2 (DF ),
R3 (EP ) and the functional dependencies E → D, D → F as constraints. Figures 1
shows a database state on this scheme and the corresponding representative instance. It
                                              r3            RI r Emp Dept Floor Proj
                r1
                              r2          Emp Proj           t1 John CS     1 Nana
            Emp Dept
                          Dept Floor      John Nana          t2 John CS     1 Trudy
            John CS
                           CS    1        John Trudy         t3 Bob EE      5    v1
       r    Bob EE
                           EE    5        Ann Nana           t4 Ann CS      1 Nana
            Ann CS
                          MS 3            Ann Dante          t5 Ann CS      1 Dante
             Jim EE
                                           Jim Dante         t6 Jim EE      5 Dante
                                                             t7 v2 MS 3          v3

                    Fig. 1. A database state and its representative instance.

has been shown that a database state is consistent if and only if the corresponding repre-
sentative instance can be built without encountering contradictions [3]. Also, for every
consistent state r and for every X, the set of total tuples (i.e., without variables) in RI r
on X (called the X-total projection of RI r and denoted by πX↓ (RI r )) is equal to the set
of tuples that appear in the projection on X of every weak instance of r [6]. According
to this definition, πX↓ (RI r ) is the relation over X implied by the current state.
    We assume that a keyword query Q is simply a finite and non-empty set of constants.
Given a tuple t over X ⊆ U , we say that: (i) t covers a set of constants C if, for each
c ∈ C, c = t[A] for some A ∈ X, and (ii) t X-belongs to database r if it belongs to the
X-total projection of the representative instance of r (that is: t ∈ πX↓ (RI r )).
Definition 1 (Base result). A base result (also called 1-result) of a keyword query Q
on a database r is a set of complete, total tuples R such that, for every tuple t ∈ R: (i) t
covers Q and (ii) t X-belongs to r for some X ⊆ U .
For instance, a base result of the keyword query {CS , Nana} over the database r in
Figure 1 is composed by the tuples t1 and t4 of RI r .
    Let us now now refine this notion by assuming that the keywords in the query can
appear in different tuples of the representative instance that are connected through com-
mon values on common attributes. We say that a tableau T is connected if for each
t ∈ T there is another tuple t′ ∈ T that is joinable with t (that is, they share values on
the same attributes) and that a set of total tuples T covers a set of constants C if each
c ∈ C appears in some tuple t ∈ T.
Definition 2 (K-result). A k-result of a keyword query Q on a database r is a minimal
set of total tuples Rk such that: (i) Rk has size k, is connected, and covers Q, and
(ii) every tuple t ∈ Rk x-belongs to r for some X ∈ U .
For instance, the query {Nana, EE } has one 3-result (R3 = {t4 , t5 , t6 }) and no 2 or
1-results. This shows that the parameter k captures the relevance of the result and then
provides an effective tool to order (and possibly limit) the tuples to return to the users.


3 Computing the results of a keyword query
In the framework we have defined, the first question focuses on finding, possibly in an
efficient way, the top-k results of a keyword query and the computational complexity
of this problem. In this section we provide a preliminary result by discussing the Algo-
rithm that follows, which implements a basic, “brute force” technique for solving this
problem.
 Algorithm 1: Computation of the top-k results of a keyword query
   Input : A consistent database state r, a keyword query Q, a limit k > 0
   Output: The Ri -results of Q on r (for 1 ≤ i ≤ k)
 1 Build the representative instance T of r;
 2 foreach tuple t in T do if t covers Q then output t;
 3 for (i = 2; i ≤ k; i ++) do
 4      foreach tuple t in T that covers some c ∈ Q do
 5          search and return the Ri -results including t with a depth-first visit of T from t;
 6          remove t from T


The algorithm consists of three main steps: the construction of the representative in-
stance (line 1), the search for the R1 results (line 2), and the search for the subsequent
Rk results, for k > 1 (lines 3-6). It is known that the first step requires polynomial
time in the size of the database. In step 2 all the tuples of the representative instance are
checked to verify if they (completely) cover the query and so it requires linear time in
the size of RI r , which is proportional to |r|. Finally, step 3 involves, for each tuple of
RI r that covers some keyword in the query, a depth-limited search in a graph G where
the nodes represent the tuples and the edge represent the joinability relationship. In the
worst case, the cost of this task is proportional to the maximum number of k-long paths
in G, which is bounded by |RI r |k . It is then possibile to show the following result.
Theorem 1. Algorithm 1 computes, for some finite k > 0, all the first k-results of a
keyword query Q of size q over a database state r of size n in time O(nq ).
Algorithm 1 can be optimized in several ways. In particular, the representative instance
does not need to be built since, for significant classes of schemas, its total projection on
a set of attributes can be computed efficiently by means of simple SPJ expressions [2].
Using these results, together with a suitable use of an inverted index, we could restrict
our attention only to the relevant portion of the database. These issues and other exten-
sions of the framework presented here will be subject of future studies.
References
 1. A.V. Aho, C. Beeri, and J.D. Ullman. The theory of joins in relational databases. ACM Trans.
    on Database Syst., 4(3):297–314, 1979.
 2. P. Atzeni and E.P.F. Chan. Efficient and optimal query answering on independent schemes.
    Theoretical Computer Science, 77(3):291–308, 1990.
 3. P. Honeyman. Testing satisfaction of functional dependencies. Journal of the ACM,
    29(3):668–677, 1982.
 4. V. Hristidis and Y. Papakonstantinou. Discover: Keyword search in relational databases. In
    VLDB, pages 670–681, 2002.
 5. D. Maier, A.O. Mendelzon, and Y. Sagiv. Testing implications of data dependencies. ACM
    Trans. on Database Syst., 4(4):455–468, 1979.
 6. D. Maier, J.D. Ullman, and M. Vardi. On the foundations of the universal relation model.
    ACM Trans. on Database Syst., 9(2):283–308, 1984.
 7. Y. Sagiv. A characterization of globally consistent databases and their correct access paths.
    ACM Trans. on Database Syst., 8(2):266–286, 1983.

</pre>