Computing the Skyline of a Relational Table Based on a Query Lattice Nicolas Spyratos, Tsuyoshi Sugibuchi, Ekaterina Simonenko, and Carlo Meghini Laboratoire de Recherche en Informatique, Université Paris-Sud 11, France {Nicolas.Spyratos, Tsuyoshi.Sugibuchi, Ekaterina.Simonenko}@lri.fr Istituto di Scienza e Tecnologie della Informazione del CNR, Pisa, Italy Carlo.Meghini@isti.cnr.it Abstract. We propose a novel approach to computing the skyline set of a relational table R, with respect to preferences expressed over one or more numerical attributes. Our approach is based on what we call the query lattice of R, and our basic algorithm constructs the skyline set as the union of the answers to a subset of queries from that lattice - hence without directly accessing the table R. Therefore, in contrast to all existing techniques, our approach is independent of how the table R is implemented or how its tuples are indexed. We demonstrate the general- ity of our approach by computing the skyline set of the join of two tables based on the product of their individual query lattices - therefore without performing the join. The paper presents basic concepts and algorithms leaving experimentation and performance evaluation to a forthcoming paper. Keywords: skyline, relational table, query lattice 1 Introduction In many multicriteria decision-making applications, dominance analysis is an im- portant aspect. As an example, consider a person looking for a vacation package using two criteria, or “attributes”: hotel rating and price. Intuitively, a package P = hr, pi is better than a package P 0 = hr0 , p0 i if P is better than P 0 in one attribute and not worse than P 0 in the other attribute. If this is the case then we say that P dominates P 0 . For example, consider the following three packages: – P1 = h2, 100i, P2 = h3, 130i, P3 = h2, 120i Since a higher rating and a lower price are more preferable, P1 dominates P3 . On the other hand, P1 and P2 don’t dominate each other because P1 has a lower rating than P2 and P2 has a higher price. Similarly, P2 and P3 don’t dominate each other because P2 has a higher rating than P3 and P3 has a lower price. A package, or “tuple” that is not dominated by any other tuple is said to be a skyline tuple or to be in the skyline. The tuples in the skyline are the best 146 N. Spyratos et al. possible trade-offs among the attribute values appearing in the tuples. Thus in our example, packages P1 and P2 are in the skyline, while P3 is not. In order to conduct a skyline analysis, two items must be specified: 1. A set of attributes over which preferences are expressed (such as Hotel Rating and Price, in our example). 2. An ordering of the attribute values (either total or partial) according to which preference is expressed (such as the ordering of the integers for Hotel Rating to express that rating r is preferred to rating r0 if r > r0 ; and similarly for Price to express that price p is preferred to price p0 if p < p0 ). In recent years, skyline analysis has gained considerable interest in the area of information systems in general, and in the area of databases in particular. How- ever, skyline analysis (i.e. computing non dominated points) existed well before the concept appeared in database research; it is known as the maximum vector problem or the Pareto optimum [11][18]. The popularity of skyline analysis in the area of information systems is mainly due to its applicability for decision making applications. Indeed, as information systems store larger and larger volumes of data today, data management and in particular query processing present diffi- cult challenges. From the user viewpoint, large volumes of data imply answers of large size. By returning the best tuples (in terms of user preferences), the skyline query relieves the user from having to deal with answers of large size in order to find the best tuples. The skyline operator was first introduced in [3], where the authors also present two basic algorithms: the Block Nested Loops (BNL) and the Divide and Conquer (D&C). In order to improve the performance of BNL algorithm, the SFS (sort-filter-skyline) algorithm was proposed in [5]. SFS runs on data sorted according to a monotonic function (namely, entropy descending). Such sorting guarantees the non-dominance of each object by those that follow in the order. Therefore, once an object is put into the buffer window, it can be reported as part of the skyline. Not only this makes the SFS algorithm progressive, but also allows to reduce the number of comparisons needed, since SFS compares only against the non-dominated tuples, whereas BNL often compares against dominated tuples [7]. Despite this improvement, all objects have to be scanned by the algorithm at least once. Succeeding approaches tend to avoid scanning the complete data set. Namely, SaLSa [1] uses the minimal coordinate of each object as a sorting function, and during the filter-scan step checks if all remain- ing objects are dominated by a so-called stop object, which can be determined in O(1) from the data accessed so far. The shortcoming is that the performance of SaLSa algorithm is affected by the data distribution and increasing dimen- sionality, since in higher dimensions instances of the problem the pruning power of the stop object is limited [29]. More generally, all sort-based techniques share the same drawback, namely the number of computations during the filter-scan step, as every input object should be compared with the skyline points in the buffer (which can potentially become large). Computing the Skyline of a Relational Table Based on a Query Lattice 147 An alternative to the sort-based techniques is the use of indexes, which allows to avoid scanning all the input objects. The basic idea is to rely on an index in order to determine dominance between tuples, and to exclude tuples from fur- ther processing as early as possible. Two index-based algorithms, Bitmap and Index were first introduced in [21]. Bitmap uses the bitmap encoding of the data so that the dominating points are determined by a bit-wise “and” operation. The Index approach partitions the objects into a set of lists. Each list is sorted by minimum coordinate and indexed by a B-tree. The objects are accessed in batches defined by the values of the minimal coordinate, while the algorithm computes local skylines in each “batch” of the lists and then merges them into a global skyline. However, besides the computation cost of Bitmap, and the neces- sity to construct a B-tree for every combination of dimensions having potential interest for the user, the order in which skyline points are returned by these al- gorithms is fixed and depends on the data distribution, so it cannot be adapted to the users preferences. Two other index-based skyline algorithms, NN (nearest-neighbor) [13] and BBS (branch-and-bound skyline) [19], are based on the observation that the object closest to the origin has to be part of the skyline. Nearest neighbor search is used to retrieve such point by using the R-tree. The pitfall of the NN algorithm is that in order to iteratively find the next nearest neighbors it divides the data set into overlapping partitions, and therefore duplicates have to be removed by traversing the R-tree multiple times. To avoid that, BBS rather accesses par- tially dominated nodes of the R-tree. The main drawback of all index-based approaches is that not all data can be indexed (namely when data is dynamically produced). Also, R-trees and other multidimensional indexes have their own limitations, namely the curse of dimen- sionality. Concerning related work specifically targeting the multi-relational skyline (or skyline join), two progressive algorithms are proposed in [9]. The idea is to com- bine the join with nested-loop and sort-merge algorithms. However, each relation has to accessed multiple times in order to compute the skyline for each join value, and then the global skyline. In addition, each input object has to be scanned at least once. More recent work also considers how to compute the skyline over the join of two or more relational tables without actually computing the join [23]. Apart from its applicability to computing skylines in a centralized database, the interest of such work lies in the fact that it is also applicable to distributed environments. Earlier work related to this topic can be found in [1][2][8][10][20][25][26]. A lattice-based approach to single-relation skyline computation is introduced in [15], with the aim of proposing a data-distribution independent algorithm. A lattice structure is used to answer skyline queries over dimensions with low- cardinality domains, or those that can be mapped to low-cardinality domains (such as Price, that can be mapped to price ranges). The principle is to organize all the values combinations into a lattice based on the dominance relationship, and then to retrieve those that (a) are present in the input data set, and (b) 148 N. Spyratos et al. are not reachable by the dominance relationship from another element of the lattice, also belonging to the data set. However, no early pruning is done, so the entire data set has to be read twice in order to determine the skyline tuples, and the skyline join problem is not investigated. Also, the mapping of the values of a domain to a set of ranges has to be carefully tuned in order to deliver a meaningful skyline result, which is not discussed in the paper. Several variants of skyline were introduced in [19], such as constrained, sub- space and dynamic skyline queries (see also [6][16][22][27][28]). Skyline queries have also been studied in various other domains, outside traditional databases. These include probabilistic skyline computations over uncertain data [17](e.g. data in sensor networks); skyline computations over incomplete data [12](e.g. data with missing values); over data whose attributes have partially-ordered do- mains [4](e.g. preferences expressed by users online); over stream data[14]; or even bandwidth-constrained skyline computations over mobile devices [24]. In this paper, we present a novel approach to computing skylines which rep- resents a major deviation from existing approaches. Indeed, instead of accessing individual tuples in a database table, our approach relies on the definition of skyline as the union of the answers to a set of queries. In doing so, our ba- sic algorithm avoids accessing the table directly: access to the table is through queries, hence independent of how the table is implemented or how its tuples are indexed. Given a relational table R, our approach is based on what we call the query lattice of R; and our basic algorithm constructs the skyline set as the union of the answers to a subset of queries from that lattice - hence without directly accessing the table R. We demonstrate the generality of our approach by computing the skyline of the join of two tables based on the product of their individual query lattices - therefore without performing the join. The paper presents basic con- cepts and algorithms leaving experimentation and performance evaluation to a forthcoming paper. The paper is organized as follows. In section 2 we give some preliminary def- initions and introduce our notation. In section 3 we present our basic algorithm for computing skylines through queries. In section 4 we apply our approach to computing skylines over joins, thus demonstrating the generality of the approach. Finally, in section 5, we offer some concluding remarks and discuss further re- search. 2 Basic definitions Let R be a relational table, with A1 , . . . , An as attributes. Let B = {B1 , . . . , Bk }, k ≤ n, be the set of preference attributes, that is a set of attributes of the table whose domains are numeric and over which preferences are declared. A preference over Bi is an expression of one of two forms: Bi → min or Bi → max . If the preference Bi → min is expressed by a user of the table, then this is interpreted as follows: given two values x and y in the domain of Bi , x is preferred to y or x preceeds y iff x < y; and similarly, if the preference Bi → max Computing the Skyline of a Relational Table Based on a Query Lattice 149 is expressed by a user, then this is interpreted as follows: given two values x and y in the domain of Bi , x is preferred to y or x preceeds y iff x < y; In order to simplify the presentation, and without loss of generality, we shall consider only one form of preference, namely Bi → min. However, all methods discussed in this paper can be applied with any combination of the preferences Bi → min and Bi → max .Therefore, from now on, given two values x and y in the domain of Bi , we shall say that x is preferred to y or x preceeds y iff x < y. Definition 1 (Pareto domination) Let B = {B1 , . . . , Bk } be a set of preference attributes of a relational table R and let s and t be tuples of R. We say that s is equivalent to t, denoted as s ≡ t iff s.Bi = t.Bi for all Bi ∈ B. Moreover, we say that s Pareto dominates t, denoted as s

and ⊥, respectively. These extreme elements are given by: > = hm1 , . . . , mk i ⊥ = hM1 , . . . , Mk i where mi and Mi are the minimum value and the maximum value appearing in the active domain of Bi , respectively. We shall call this (finite) lattice the query lattice, of R and we shall denote it as (Q, ≤), where Q is the (finite) set of preference queries over R. In this paper, we shall use the query lattice for two purposes: (a) as a tool for computing skylines of relational tables, and (b) as a means for comparing our approach to existing approaches. First, however, let’s define skylines formally. Definition 2 The skyline of a table R over preference attributes B, denoted by SKY (R, B), is the set of tuples from R defined as follows: SKY (R, B) = {t ∈ R | 6 ∃s ∈ R : s ≤ t} In other words, the skyline of R is the set of non-dominated tuples of R. As customary, given a query q and a relation R, we will let ans(q, R) stand for the answer of q over R, that is the set of tuples obtained by asking query q against relation R. Moreover, we define a skyline query of R over preference attributes B, to be a query q(R, B) over R whose answer is a skyline of R over B, that is: ans(q(R, B), R) = SKY (R, B) Skyline queries will play a central role in our approach to skyline computation. 3 Computing the skyline of a relational table In this Section, we present an algorithm for computing the skyline of a given table R. Our algorithm obtains the skyline by constructing a skyline query. To find the skyline query, our algorithm traverses part of the query lattice and collects a set of non-dominated queries whose disjunction is the skyline query. In contrast to existing methods that actually construct the skyline, and as such are sensitive to the ways the table R is implemented or how its tuples are accessed, our algorithm obtains the skyline while making no assumptions on how the relation R is accessed by the system while processing the skyline query. Our algorithm uses the notion of successors of a query, defined as follows. First, for each preference attribute Bi and value bi in the domain of Bi , let’s denote as succ(bi ) the successor of bi in the linear order of the domain of Bi . For instance, let Bi be the Hotel Rating attribute of our earlier example, having as domain the interval [1,5]. As this interval is linearly ordered by the < relation, we have succ(1) = 2, succ(2) = 3, and so on. This makes succ a partial function over the domain of Bi , undefined for the maximum Mi . Now, we extend the succ function to preference queries as follows. Computing the Skyline of a Relational Table Based on a Query Lattice 151 Let q be a preference query q = hb1 , . . . , bk i such that q 6= ⊥. This means that bj 6= Mj for at least one j ∈ [1, k], where Mj is the maximum value in the active domain of Bj . With no loss of generality, we shall assume that the m values of q that are not maximal, where 1 ≤ m ≤ k, occur in the first m positions of q, (i.e., q = hb1 , . . . , bm , Mm+1 , . . . , Mk i). Then, the successors of q, succ(q), is defined to be the set of queries succ(q) = {q1 , . . . , qm }, such that, for all 1 ≤ i ≤ m, qi = hb1 , . . . , bi−1 , succ(bi ), bi+1 , . . . , bm , Mm+1 , . . . , Mk i Clearly, the succ function is undefined on the bottom of the lattice ⊥. The following Lemma gives two important properties of the successors of q for the establishment of the correctness of the following algorithm for computing a sky- line query of the table R. Lemma 1. Let q be a preference query. Then, for each query q 0 ∈ succ(q) : 1. q ≤ q 0 2. there is no query q 00 such that q < q 00 < q 0 . We are now ready to give the algorithm for computing a skyline query of a table R. Algorithm Skyline Query over a Single Table (SQST ) Input A non-empty table R, a non-empty set B = {B1 , . . . , Bk } of preference attributes in R, and the projections of R over B1 , . . . , Bk . Data We use the following variables for accumulating data during the execution of the algorithm: – The variable F is a set variable called frontier; it is initialized to empty, and it is used to accumulate the queries whose disjunction will be the result of the algorithm (i.e., whose disjunction will be the skyline query). – The variable C is a set variable containing the set of all current candidate queries; it is initialized to the top > of the query lattice. – The variable S is a set variable used to accumulate successors of current candidate queries. – The variable C 0 is a (auxiliary) set variable for accumulating candidate queries for the next while-loop iteration in the algorithm; it is initialized to empty at the beginning of each iteration. Output A set of preference queries over R, whose disjunction is a skyline query. Method C ← {>}; F ← ∅ while C 6= ∅ do for all c ∈ C such that ans(c, R) 6= ∅ do C ← C\{c}; F ← F ∪ {c} 152 N. Spyratos et al. end for C0 ← ∅ for all c ∈ C do for all s ∈ succ(c) such that no query in F Pareto dominates s do C 0 ← C 0 ∪ {s} end for end for C ← C0 end while return F Informally, the algorithm works as follows: – If the top query > of the lattice is non-empty (i.e.if its answer over R is non- empty) then the algorithm terminates; > is the output of the algorithm, and the answer of > over R is the skyline. – Otherwise, each successor query c of > might be a candidate for contributing to the skyline, and this is checked as follows: • if the query c is non-empty then it is added to F (i.e.to the variable that accumulates the queries contributing to the skyline); • otherwise, each successor of c that is not dominated by a query in F becomes a candidate, by being added to the variable C 0 (i.e.to the vari- able that accumulates all candidate queries for the next iteration). The so selected successors are finally transferred to the variable C. This process is repeated until there is no more candidate left (i.e.until C is empty). Upon termination of the algorithm, F contains all queries q such that: (a) the answer to q is non-empty, and (b) q is not dominated by another query. The disjunction of all queries in F is then the skyline query, and the answer of the skyline query over R is the skyline of R. R HID Price Rating h1 200 2 h2 150 5 h3 100 3 h4 300 2 h5 350 1 h6 100 8 Fig. 1. Example relation and query lattice. Let us illustrate the algorithm by using the example table R in Fig 1. We assume the preference attributes to be P rice and Rating and the preference Computing the Skyline of a Relational Table Based on a Query Lattice 153 to be P rice → min and Rating → min. The projections of R over P rice and Rating, sorted in ascending order, are as follows: – P rice : 100, 150, 200, 300, 350 – Rating : 1, 2, 3, 5, 8 The diagram in Fig 1 shows a part of the query lattice derived from R. In this diagram, queries having non-empty answers are emphasized by bold letters. Queries contributing to the skyline are enclosed by rounded rectangles. Gray triangles in the diagram represent areas dominated by queries in the skyline. As we can see in the diagram, the top of the query lattice is > = h100, 1i. Therefore, we start the first iteration of the algorithm with C = {h100, 1i} and F = ∅. At the end of each subsequent iteration of the while-loop, the contents of C and F change as follows: – end of 1st iteration: C = {h150, 1i, h100, 2i}, F = ∅ – end of 2nd iteration: C = {h200, 1i, h150, 2i, h100, 3i}, F = ∅ In the third iteration, the query h100, 3i ∈ C has a non-empty result therefore h100, 3i leaves C and enters F . We then consider the successors of the queries left in C. h150, 3i ∈ succ(h150, 2i) is dominated by h100, 3i ∈ F therefore h150, 3i is omitted from C 0 , which accumulates candidate queries for the next iteration. – end of 3rd iteration: C = {h300, 1i, h200, 2i}, F = {h100, 3i} – end of 4th iteration: C = {h350, 1i}, F = {h100, 3i, h200, 2i} – end of 5th iteration: C = ∅, F = {h100, 3i, h200, 2i, h350, 1i} (the algorithm stops here) The correctness of the algorithm is easily established by observing that the algorithm explores the lattice completely (this is guaranteed by the second prop- erty of the succ function in the above Lemma), retaining only maximal queries (this is guaranteed by the test on dominance performed by the algorithm and also by the fact that the successors of a query are all dominated by it, as stated by the first property of the succ function in the above Lemma). Formally, we will denote as SQST (R, B) the result of the SQST algorithm having R and B as the input non-empty relation and preference attributes, re- spectively. On the basis of the above observations, we state the following: Proposition 1 For every relation R and preference attributes B over R, _ ans( SQST (R, B), R) = SKY (R, B) We note that, as all queries in F are conjunctive and any two queries in F differ in at least one conjunct, the answers making up the skyline actually form a partition of the skyline. This is an interesting observation when combined with the notion of rank of a query in the query lattice. 154 N. Spyratos et al. Definition 3 (Rank of a query) The rank of a query q in the query lattice is defined as follows: if q is the root query then rank(q) = 0 else rank(q) is the maximum length of path from the root query to q Clearly, the higher the rank of a query the less the tuples in its answer are preferred. Now, as the skyline of R is partitioned by the answers to the queries in F , one can ask new kinds of queries. For example, one can ask the query: “give me the best tuples from the skyline”. The answer to this query will be the answer to the query of lowest rank in F . Similarly, one can ask the query: “give me the ranks of all tuples in the skyline”. This query will return a set of ranks, thereby giving a useful information as to how far are the tuples of the skyline from the most preferred tuples. A detailed discussion of the relationship between “most preferred” and “non-dominated” tuples is given in a forthcoming paper. 4 Skylines of joins We now consider the computation of the skyline over the join of two tables R1 and R2 . To this end, we introduce the necessary concepts. Let R1 and R2 be relations with A11 , . . . , An1 1 and A12 , . . . , An2 2 as attributes, respectively, and let Bi = {Bi1 , . . . , Biki }, ki ≤ ni , be a set of attributes of Ri , called the preference attributes of Ri , for i = 1, 2. We shall denote as (Q1 , ≤1 ) and (Q2 , ≤2 ) the query lattices over R1 and R2 , respectively. Moreover, ⊗i and ⊕i will stand for the least upper bound and the greatest lower bound of any two preference queries over Ri , respectively. Let Ji = {Ji1 , . . . , Jil }, be a set of attributes of Ri disjoint form the preference attributes Bi , for i = 1, 2. A join over J1 and J2 is a relational equijoin R1 ./e R2 , whose join expression e is given by J11 = J21 , . . . , J1l = J2l . In order to simplify the model and with no loss of generality, we will consider the join attributes to be the same for the two relations, that is J1 = J2 , and moreover to consist of a single attribute J, that is e is given by J = J. Intuitively, a preference query over a join R1 ./e R2 is a (k1 +k2 )-tuple whose first k1 elements make up a query in Q1 and whose last k2 elements make up a query in Q2 . In order to simplify notation, we will commit a slight abuse and write hq1 , q2 i to represent a preference query over R1 ./e R2 , where q1 ∈ Q1 and q2 ∈ Q2 . As a consequence, the set of preference queries over R1 ./e R2 , that we shall denote as Q./ , is given by the Cartesian product of the set of preference queries over R1 and R2 , that is: Q./ = Q1 × Q2 Now, let ≤./ be the Pareto preference relation over Q./ . As we have already seen in the previous Section, (Q./ , ≤./ ) is a complete lattice. Moreover, it is not difficult to see that: Computing the Skyline of a Relational Table Based on a Query Lattice 155 Proposition 2 (Q./ , ≤./ ) is the product of (Q1 , ≤1 ) and (Q2 , ≤2 ). That is, let- ting q and q 0 be preference queries in Q./ such that q = hq1 , q2 i and q 0 = hq10 , q20 i, where q1 , q10 ∈ Q1 and q2 , q20 ∈ Q2 , we have: 1. q≤./ q 0 iff q1 ≤1 q10 and q2 ≤2 q20 2. q ⊗./ q 0 = hq1 ⊗1 q10 , q2 ⊗2 q20 i 3. q ⊕./ q 0 = hq1 ⊕1 q10 , q2 ⊕2 q20 i From now on we shall simplify notation by omitting subscripts, unless this creates ambiguity. We shall call join values the set of tuples V = X1 ∩ X2 , where: Xi = πJ (Ri ) for i = 1, 2 By definition of join, a tuple t in Ri contributes to the join if and only if its projection over the join attribute J is in V, that is t.J ∈ V, for i = 1, 2. Likewise, a query q in Qi may contribute to the skyline of the join only if it occurs in the join. For each join value v ∈ V, we define the v−partition of Ri , denoted Si (v), as follows : Si (v) = πBi (σJ=v (Ri )) for i = 1, 2 In practice, each v−partition includes the queries that contribute to the join R1 ./ R2 . In particular: [ πB1 ∪B2 (R1 ./ R2 ) = S1 (v) × S2 (v) v∈V v−partitions play an important role in determining the skyline of the join R1 ./ R2 without computing the join. Proposition 3 A query hq1 , q2 i ∈ πB1 ∪B2 (SKY (R1 ./ R2 , B1 ∪ B2 )) iff for some join value v ∈ V, qi ∈ πBi (SKY (Si (v), Bi )) for i = 1, 2 and for no other v 0 ∈ V there exists skylines qi0 ∈ πBi (SKY (Si (v 0 ), Bi )) such that qi0 ≤ qi for i = 1, 2. Proof: (→) If for some join value v ∈ V, qi ∈ πBi (SKY (Si (v), Bi )) for i = 1, 2 then hq1 , q2 i ∈ πB1 ∪B2 (R1 ./ R2 ), and moreover for 1 in Proposition 2 there exists no other query hq10 , q20 i ∈ πB1 ∪B2 (R1 ./ R2 ) where qi0 ∈ Si (v) for i = 1, 2 such that hq10 , q20 i ≤ hq1 , q2 i. If for no other v 0 ∈ V there exists skylines qi0 ∈ πBi (SKY (Si (v 0 ), Bi )) such that qi0 ≤ qi for i = 1, 2 then again from 1 in Proposition 2 there exists no hq10 , q20 i such that hq10 , q20 i ≤ hq1 , q2 i. Hence hq1 , q2 i ∈ πB1 ∪B2 (SKY (R1 ./ R2 , B1 ∪ B2 )). (←) Suppose not. Then, either (a) for no join value v ∈ V, qi ∈ πBi (SKY (Si (v), Bi )) for i = 1, 2 or (b) there exists a join value v 0 ∈ V and skylines qi0 ∈ πBi (SKY (Si (v 0 ), Bi )) such that qi0 ≤ qi for i = 1, 2. In case (a), there are two sub-cases: (a1) for no join value v ∈ V, qi ∈ Si (v) for i = 1, 2. In this case, hq1 , q2 i 6∈ πB1 ∪B2 (R1 ./ R2 ), against the hypothesis. (a2) for some join value v ∈ V, qi ∈ Si (v), but either q1 6∈ πB1 (SKY (S1 (v), B1 )) or q2 6∈ πB2 (SKY (S2 (v), B2 )). Then let qi0 be such that qi0 ∈ πBi (SKY (Si (v), Bi )). Such qi0 are guaranteed to exist because Si (v) is finite and partially ordered by Pareto preference. By hypothesis, either q10 6= q1 or 156 N. Spyratos et al. q20 6= q2 . Then hq10 , q20 i ∈ πB1 ∪B2 (R1 ./ R2 ) and by 1 in Proposition 2 hq10 , q20 i ≤ hq1 , q2 i, therefore hq1 , q2 i 6∈ πB1 ∪B2 (SKY (R1 ./ R2 , B1 ∪B2 )), against the hypoth- esis. (b) If there exists a join value v 0 ∈ V and skylines qi0 ∈ πBi (SKY (Si (v 0 ), Bi )) such that qi0 ≤ qi for i = 1, 2 then hq10 , q20 i ∈ πB1 ∪B2 (R1 ./ R2 ) and by 1 in Propo- sition 2 hq10 , q20 i ≤ hq1 , q2 i, therefore hq1 , q2 i 6∈ πB1 ∪B2 (SKY (R1 ./ R2 , B1 ∪ B2 )), against the hypothesis. We now provide an algorithm for computing the skyline queries over the join of two tables R1 and R2 (without computing the join.). The algorithm exploits Proposition 3, and uses the SQST algorithm for computing skylines over v- partitions of the given tables. As a result, the algorithm returns two sets of queries, one over R1 , the other over R2 , for selecting from each table the tuples that will generate the skyline of the join R1 ./ R2 , and only those. Algorithm The Skyline Queries over a Join (SQJ ) Input Non-empty relations R1 and R2 , non-empty sets B1 and B2 of preference attributes in R1 and R2 , respectively, and the join values V. Data G is a set variable initialized to empty and used to accumulate all candidate skyline queries, resulting from the Cartesian products of the skyline queries over the same join value. R is a set variable where the skyline of G is computed. Pi and Fi (for i = 1, 2) are auxiliary set variables, used to store v-partitions and final results, respectively. Output Two sets of queries, one over R1 and the other over R2 . Method G←∅ for all v ∈ V do P1 ← SQST (S1 (v), B1 ) P2 ← SQST (S2 (v), B2 ) G ← G ∪ (P1 × P2 × {v}) end for R ← SQST (G, B1 ∪ B2 ) F1 ← πB1 ∪{J} (R) F2 ← πB2 ∪{J} (R) return F 1, F2 We note that the algorithm operates in three passes: 1. In the first pass, it gathers (in G) all candidate results by looping over all join values. This is in fact required by the first condition of Proposition 3, which states that hq1 , q2 i is in a skyline query of the join if both q1 and q2 are skyline queries over the v-partitions for the same join value v. 2. In the second pass, it removes from G the compound queries that are dom- inated by some other compound query, as required by the second condition of Proposition 3. Computing the Skyline of a Relational Table Based on a Query Lattice 157 3. Finally, it slices the compound queries vertically, in order to obtain queries over R1 (these are stored in F1 ) and queries over R2 (in F2 ). Notice that the join attribute J must be transferred all along in order to generate the correct queries. Clearly, there is no other way of proceeding since it is necessary to obtain all candidate queries in order to apply the second condition of Proposition 3. Formally, we will denote as SQJ (R1 , R2 , B1 , B2 )i the i−th result of the SQJ algorithm having R1 and R2 as the input non-empty relations and B1 and B2 as preference attributes, respectively. For readability, we will abbreviate SQJ (R1 , R2 , B1 , B2 )i as SQJ i . On the basis of the above observations and of Proposition 3, we therefore state the following: Proposition 4 For every pair of relations R1 and R2 and preference attributes B1 and B2 over them, _ _ SQST (ans( SQJ 1 , R1 ) ./ ans( SQJ 2 , R2 ), B1 ∪B2 ) = SKY (R1 ./ R2 , B1 ∪B2 ) Let us demonstrate the SQJ algorithm by using table R1 (hotels) and R2 (restaurants) in Fig. 2. Suppose we want to find best combinations of hotels and restaurants in the same “Location”, by minimizing “Price”, “Rating”, “Dis- tance” and “Location” (we took this example from [23]). In this case, the pref- erence attributes are B1 = {P rice, Rating}, B2 = {Distance, Ranking} and the join attribute is J = {Location}. From the intersection of values appearing in the Location attribute in R1 and R2 , we can obtain join values V = {A, B, C}. In the algorithm, firstly we gather candidate skyline queries for each join value. In the example, for a join value A ∈ V , we obtain v-partition S1 (A) and S2 (A). Then we compute skylines P1 , P2 (emphasized in tables S1 , S2 by bold letters) from S1 (A), S2 (A) by the applying the SQST algorithm. Finally we make a Cartesian product P1 × P2 × {A} and append it to G that accumulates candidate skyline queries. After iterating this candidate gathering process for every join value, we apply SQST to G to obtain the skyline (emphasized in tables G by bold letters) over the join R1 ./ R2 . It is important to note the difference of size between R1 ./ R2 and G. In this example, R1 ./ R2 produces 12 tuples but G contains 8 tuples. Therefore, we can compute the skyline from G with less cost than by computing directly from R1 ./ R2 . In the example, hh5 , r5 i is not a skyline in the join because for the join value B, we have h2 dominating h5 and r1 dominating r5 . On the other hand, neither h6 nor r4 are skyline in their table, but they form a skyline in the join because they are skylines in their join group and there is no other group in which both are dominated (h6 is dominated by a query h3 in the A group, whereas r4 is dominated by r2 in the C group, and there is no single group in which both are dominated.) 158 N. Spyratos et al. R1 (Hotels) S1 HID Price Rating Location Price Rating h1 100 8 A 100 8 h2 150 5 B S1 (A) 200 1 P1A = {h100, 8i, h200, 1i} h3 200 1 A 400 2 h4 400 2 A 150 5 S1 (B) P1B = {h150, 5i, h350, 3i} h5 300 7 C 350 3 h6 350 3 B S1 (C) 300 7 P1C = {h300, 7i} R2 (Restaurants) S2 RID Distance Ranking Location Distance Ranking r1 150 4 B 500 1 S2 (A) P2A = {h500, 1i} r2 250 2 C 500 6 r3 500 1 A 150 4 S2 (B) P2B = {h150, 4i, h400, 3i} r4 400 3 B 400 3 r5 200 5 C 250 2 S2 (C) P2C = {h250, 2i, h200, 5i} r6 500 6 A 200 5 G Price Rating Distance Ranking Location 100 8 500 1 A P1A × P2A × {A} 200 1 500 1 A 150 5 150 4 B 150 5 400 3 B P1B × P2B × {B} 350 3 150 4 B 350 3 400 3 B 300 7 250 2 C P1C × P2C × {C} 300 7 200 5 C F1 F2 Price Rating Location Distance Ranking Location 100 8 A 500 1 A 200 1 A 150 4 B 150 5 B 400 3 B 350 3 B 250 2 C 300 7 C Fig. 2. An example of skyline over the join of two tables 5 Concluding Remarks We have seen a novel approach to computing the skyline of a relational table with respect to preferences expressed over one or more attributes with ordered domains. Our approach is based on what we called the query lattice of the table, and our basic algorithm constructs the skyline as the union of the answers to a subset of queries from that lattice — hence without directly accessing the table R. Therefore, in contrast to all existing techniques, our approach is independent of how the table R is implemented or how its tuples are indexed. We have demonstrated the generality of our approach by computing the skyline over the join of two tables based on the product of their individual query lattices — therefore without performing the join. We note that our method is applicable to a computational geometry setting as well. Indeed, a discrete (finite) n-dimensional Euclidean space S can be thought of as a relational table T (S) in which: (a) the attributes are the n dimensions Computing the Skyline of a Relational Table Based on a Query Lattice 159 of S and (b) each tuple of T (S) represents a point in S. Moreover, the answer to a query q = hb1 , . . . , bk i from the query lattice of T (S) is the set of all points of S such that (B1 = b1 ) ∧ . . . ∧ (Bk = bk ), where B1 , . . . , Bk are the corresponding dimensions; in other words, the answer to q is the set of points having the same coordinate values over the dimensions B1 , . . . , Bk . Additionally, our method can be applied to the Cartesian product of two or more spaces through the product lattice of the individual spaces. We are currently pursuing work in two different directions, namely refining skyline analysis and applying our approach to a distributed setting: 1. Refining skyline analysis As we mentioned in section 3, the skyline query returned by our basic al- gorithm is the disjunction of a set of queries from the query lattice, say q1 ∧ . . . ∧ qm ; and the answers to these queries actually partition the skyline into disjoint subsets. Moreover, these queries have different ranks, in general. Therefore it now becomes possible to ask finer queries regarding the skyline such as “give me the best tuples of the skyline” (meaning the answer to the query qi of highest rank), or “return the skyline by presenting the answers to qi ’s in increasing order of rank”, and so on. In this respect, we would also like to investigate in more detail the relationship between “most preferred tuple” and “non-dominated tuple”. 2. Applying our method to a distributed setting As information systems store bigger and bigger volumes of data today, data management and in particular query processing presents difficult challenges. From the user viewpoint, big volumes of data imply answers of large size. By returning the best tuples (in terms of user preferences), the skyline query relieves the user from having to deal with answers of large size; and having the possibility to further refine the skyline (as mentioned above) further contributes in that direction. However, in recent years, data management and data storage have become increasingly distributed, and distribution presents additional challenges for query processing. Adapting the skyline operator to a distributed setting is one of the research lines that we are currently pursuing. We believe that our approach to skyline computation through query lattices is particularly well suited in a distributed environment, where computation can be distributed and recomposed in the form of the product lattice. References 1. I. Bartolini, P. Ciaccia, and M. Patella. SaLSa: computing the skyline without scanning the whole sky. In Proceedings of CIKM, 2006. 2. I. Bartolini, P. Ciaccia, and M. Patella. Efficient sort-based skyline evaluation. ACM Trans. Database Syst., 33(4), 2008. 3. S. Börzsönyi, D. Kossmann, and K. Stocker. The skyline operator. In Proceed- ings of ICDE, pages 421-430, 2001. 4. C.-Y. Chan, P.-K. Eng, K.-L. Tan. Stratified computation of skylines with partially-ordered domains. In Proc. of SIGMOD 2005, pp. 03–214, 2008. 160 N. Spyratos et al. 5. J. Chomicki, P. Godfrey, J. Gryz, and D. Liang. Skyline with presorting. In Proceedings of ICDE, pages 717-816, 2003. 6. B. Cui, H. Lu, Q. Xu, L. Chen, Y. Dai, Y. Zhou. Parallel Distributed Processing of Constrained Skyline Queries by Filtering. In Proc. of ICDE 2008, pp. 546- 555, 2008. 7. P. Godfrey, R. Shipley, and Jarek Gryz. Algorithms and analyses for maximal vector computation. The VLDB Journal, vol 16(1), pp. 5–28, 2007. 8. W. Jin, M. Ester, Z. Hu, and J. Han. The multi-relational skyline operator. In Proceedings of ICDE, 2007. 9. W. Jin, M. Morse, J. Patel, M. Ester, and Z. Hu. Evaluating skylines in the presence of equi-joins. In Proc. of ICDE, 2010. 10. N. Koudas, C. Li, A. K. H. Tung, and R. Vernica. Relaxing join and selection queries. In Proceedings of VLDB, 2006. 11. H.T. Kung, F. Luccio, F.P. Preparata. On finding the maxima of a set of vectors. Journal of the ACM, vol. 22(4), pp. 469–476, 1975. 12. M.E. Khalefa, M.F. Mokbel, J.J. Levandoski. Skyline Query Processing for Incomplete Data. In Proc. of ICDE 2008, pp. 556-565, 2008. 13. D. Kossmann , F. Ramsak , S. Rost. Shooting Stars in the Sky: An Online Algorithm for Skyline Queries. In Proc. of VLDB 2002, pp. 275–286, 2002. 14. X. Lin , Y. Yuan , W. Wangnicta , S. Wales. Stabbing the sky: Efficient skyline computation over sliding windows. In Proc. of ICDE 2005, pp. 502–513, 2005. 15. M. Morse , J. Patel , H. V. Jagadish. Efficient Skyline Computation over Low- Cardinality Domains. In Proc. of VLDB 2007, pp. 267–278, 2007. 16. J. Pei , W. Jin , M. Ester , Y. Tao. Catching the best views of skyline: A semantic approach based on decisive subspaces. In Proc. of VLDB 2005, pp. 253–264, 2005. 17. J. Pei , B. Jiang , X. Lin , Y. Yuan. Probabilistic skylines on uncertain data. In Proc. of VLDB 2007, pp. 15–26, 2007. 18. F.P. Preparata, M.I. Shamos. Computational Geometry Computational Geom- etry, Springer-Verlag, 1985. 19. D. Papadias, Yufei Tao, Greg Fu, Bernhard Seeger. An Optimal and Progressive Algorithm for Skyline Queries. In Proc. of SIGMOD 2003, pp. 467–478, 2003. 20. V. Raghavan, E. Rundensteiner. Progressive result generation for multi-criteria decision support queries. In Proceedings of ICDE, 2010. 21. K.-L. Tan, P.-K. Eng, B.C. Ooi. Efficient Progressive Skyline Computation. In Proc. of VLDB 2001, pp. 301-310, 2001. 22. Y. Tao, X. Xiao, and J. Pei. Subsky: Efficient computation of skylines in sub- spaces. In Proceedings of ICDE, 2006. 23. A. Vlachou, C. Doulkeridis, N. Polyzotis. Skyline query processing over joins. In Proceedings of SIGMOD 2011, pp. 73–84, 2011. 24. A. Vlachou, K. Nørvåg. Bandwidth-constrained distributed skyline computa- tion. In Proc. of MobiDE 2009, pp. 17–24, 2009. 25. D. Sun, S. Wu, J. Li, and A. K. H. Tung. Skyline-join in distributed databases. In ICDE Workshops, 2008. 26. Q. Wan, R. C.-W. Wong, I. F. Ilyas, M. T. Özsu, and Y. Peng. Creating competitive products. PVLDB, vol. 2(1), 898-909, 2009. 27. P. Wu , C. Zhang , Y. Feng , B.Y. Zhao , D. Agrawal , A.E. Abbadi. Parallelizing skyline queries for scalable distribution. In Proc. of EDBT 2006, pp. 112-130, 2006. 28. Y. Yuan , X. Lin , Q. Liu , W. Wang , J.X. Yu , Q. Zhang. Efficient Computation of the Skyline Cube. In Proc. of VLDB 2005, pp. 241–252, 2005. Computing the Skyline of a Relational Table Based on a Query Lattice 161 29. S. Zhang , N. Mamoulis , S.W. Cheung. Scalable skyline computation using object-based space partitioning. In Proc. of SIGMOD 2009, pp. 483-494, 2009.