Computing the Skyline of a Relational Table
            Based on a Query Lattice

    Nicolas Spyratos, Tsuyoshi Sugibuchi, Ekaterina Simonenko, and Carlo
                                  Meghini

     Laboratoire de Recherche en Informatique, Université Paris-Sud 11, France
    {Nicolas.Spyratos, Tsuyoshi.Sugibuchi, Ekaterina.Simonenko}@lri.fr
      Istituto di Scienza e Tecnologie della Informazione del CNR, Pisa, Italy
                            Carlo.Meghini@isti.cnr.it


      Abstract. We propose a novel approach to computing the skyline set
      of a relational table R, with respect to preferences expressed over one
      or more numerical attributes. Our approach is based on what we call
      the query lattice of R, and our basic algorithm constructs the skyline
      set as the union of the answers to a subset of queries from that lattice -
      hence without directly accessing the table R. Therefore, in contrast to all
      existing techniques, our approach is independent of how the table R is
      implemented or how its tuples are indexed. We demonstrate the general-
      ity of our approach by computing the skyline set of the join of two tables
      based on the product of their individual query lattices - therefore without
      performing the join. The paper presents basic concepts and algorithms
      leaving experimentation and performance evaluation to a forthcoming
      paper.

      Keywords: skyline, relational table, query lattice


1   Introduction
In many multicriteria decision-making applications, dominance analysis is an im-
portant aspect. As an example, consider a person looking for a vacation package
using two criteria, or “attributes”: hotel rating and price. Intuitively, a package
P = hr, pi is better than a package P 0 = hr0 , p0 i if P is better than P 0 in one
attribute and not worse than P 0 in the other attribute. If this is the case then
we say that P dominates P 0 .
    For example, consider the following three packages:

 – P1 = h2, 100i, P2 = h3, 130i, P3 = h2, 120i

    Since a higher rating and a lower price are more preferable, P1 dominates P3 .
On the other hand, P1 and P2 don’t dominate each other because P1 has a lower
rating than P2 and P2 has a higher price. Similarly, P2 and P3 don’t dominate
each other because P2 has a higher rating than P3 and P3 has a lower price.
    A package, or “tuple” that is not dominated by any other tuple is said to
be a skyline tuple or to be in the skyline. The tuples in the skyline are the best
146     N. Spyratos et al.

possible trade-offs among the attribute values appearing in the tuples. Thus in
our example, packages P1 and P2 are in the skyline, while P3 is not.
   In order to conduct a skyline analysis, two items must be specified:

 1. A set of attributes over which preferences are expressed (such as Hotel Rating
    and Price, in our example).
 2. An ordering of the attribute values (either total or partial) according to
    which preference is expressed (such as the ordering of the integers for Hotel
    Rating to express that rating r is preferred to rating r0 if r > r0 ; and similarly
    for Price to express that price p is preferred to price p0 if p < p0 ).

    In recent years, skyline analysis has gained considerable interest in the area of
information systems in general, and in the area of databases in particular. How-
ever, skyline analysis (i.e. computing non dominated points) existed well before
the concept appeared in database research; it is known as the maximum vector
problem or the Pareto optimum [11][18]. The popularity of skyline analysis in the
area of information systems is mainly due to its applicability for decision making
applications. Indeed, as information systems store larger and larger volumes of
data today, data management and in particular query processing present diffi-
cult challenges. From the user viewpoint, large volumes of data imply answers of
large size. By returning the best tuples (in terms of user preferences), the skyline
query relieves the user from having to deal with answers of large size in order to
find the best tuples.
    The skyline operator was first introduced in [3], where the authors also
present two basic algorithms: the Block Nested Loops (BNL) and the Divide
and Conquer (D&C). In order to improve the performance of BNL algorithm,
the SFS (sort-filter-skyline) algorithm was proposed in [5]. SFS runs on data
sorted according to a monotonic function (namely, entropy descending). Such
sorting guarantees the non-dominance of each object by those that follow in the
order. Therefore, once an object is put into the buffer window, it can be reported
as part of the skyline. Not only this makes the SFS algorithm progressive, but
also allows to reduce the number of comparisons needed, since SFS compares
only against the non-dominated tuples, whereas BNL often compares against
dominated tuples [7]. Despite this improvement, all objects have to be scanned
by the algorithm at least once. Succeeding approaches tend to avoid scanning
the complete data set. Namely, SaLSa [1] uses the minimal coordinate of each
object as a sorting function, and during the filter-scan step checks if all remain-
ing objects are dominated by a so-called stop object, which can be determined
in O(1) from the data accessed so far. The shortcoming is that the performance
of SaLSa algorithm is affected by the data distribution and increasing dimen-
sionality, since in higher dimensions instances of the problem the pruning power
of the stop object is limited [29].
More generally, all sort-based techniques share the same drawback, namely the
number of computations during the filter-scan step, as every input object should
be compared with the skyline points in the buffer (which can potentially become
large).
      Computing the Skyline of a Relational Table Based on a Query Lattice     147

An alternative to the sort-based techniques is the use of indexes, which allows
to avoid scanning all the input objects. The basic idea is to rely on an index in
order to determine dominance between tuples, and to exclude tuples from fur-
ther processing as early as possible. Two index-based algorithms, Bitmap and
Index were first introduced in [21]. Bitmap uses the bitmap encoding of the data
so that the dominating points are determined by a bit-wise “and” operation.
The Index approach partitions the objects into a set of lists. Each list is sorted
by minimum coordinate and indexed by a B-tree. The objects are accessed in
batches defined by the values of the minimal coordinate, while the algorithm
computes local skylines in each “batch” of the lists and then merges them into a
global skyline. However, besides the computation cost of Bitmap, and the neces-
sity to construct a B-tree for every combination of dimensions having potential
interest for the user, the order in which skyline points are returned by these al-
gorithms is fixed and depends on the data distribution, so it cannot be adapted
to the users preferences.
Two other index-based skyline algorithms, NN (nearest-neighbor) [13] and BBS
(branch-and-bound skyline) [19], are based on the observation that the object
closest to the origin has to be part of the skyline. Nearest neighbor search is
used to retrieve such point by using the R-tree. The pitfall of the NN algorithm
is that in order to iteratively find the next nearest neighbors it divides the data
set into overlapping partitions, and therefore duplicates have to be removed by
traversing the R-tree multiple times. To avoid that, BBS rather accesses par-
tially dominated nodes of the R-tree.
The main drawback of all index-based approaches is that not all data can be
indexed (namely when data is dynamically produced). Also, R-trees and other
multidimensional indexes have their own limitations, namely the curse of dimen-
sionality.
Concerning related work specifically targeting the multi-relational skyline (or
skyline join), two progressive algorithms are proposed in [9]. The idea is to com-
bine the join with nested-loop and sort-merge algorithms. However, each relation
has to accessed multiple times in order to compute the skyline for each join value,
and then the global skyline. In addition, each input object has to be scanned at
least once.
More recent work also considers how to compute the skyline over the join of two
or more relational tables without actually computing the join [23]. Apart from
its applicability to computing skylines in a centralized database, the interest of
such work lies in the fact that it is also applicable to distributed environments.
Earlier work related to this topic can be found in [1][2][8][10][20][25][26].
A lattice-based approach to single-relation skyline computation is introduced
in [15], with the aim of proposing a data-distribution independent algorithm.
A lattice structure is used to answer skyline queries over dimensions with low-
cardinality domains, or those that can be mapped to low-cardinality domains
(such as Price, that can be mapped to price ranges). The principle is to organize
all the values combinations into a lattice based on the dominance relationship,
and then to retrieve those that (a) are present in the input data set, and (b)
148     N. Spyratos et al.

are not reachable by the dominance relationship from another element of the
lattice, also belonging to the data set. However, no early pruning is done, so the
entire data set has to be read twice in order to determine the skyline tuples,
and the skyline join problem is not investigated. Also, the mapping of the values
of a domain to a set of ranges has to be carefully tuned in order to deliver a
meaningful skyline result, which is not discussed in the paper.
Several variants of skyline were introduced in [19], such as constrained, sub-
space and dynamic skyline queries (see also [6][16][22][27][28]). Skyline queries
have also been studied in various other domains, outside traditional databases.
These include probabilistic skyline computations over uncertain data [17](e.g.
data in sensor networks); skyline computations over incomplete data [12](e.g.
data with missing values); over data whose attributes have partially-ordered do-
mains [4](e.g. preferences expressed by users online); over stream data[14]; or
even bandwidth-constrained skyline computations over mobile devices [24].
    In this paper, we present a novel approach to computing skylines which rep-
resents a major deviation from existing approaches. Indeed, instead of accessing
individual tuples in a database table, our approach relies on the definition of
skyline as the union of the answers to a set of queries. In doing so, our ba-
sic algorithm avoids accessing the table directly: access to the table is through
queries, hence independent of how the table is implemented or how its tuples
are indexed.
    Given a relational table R, our approach is based on what we call the query
lattice of R; and our basic algorithm constructs the skyline set as the union of the
answers to a subset of queries from that lattice - hence without directly accessing
the table R. We demonstrate the generality of our approach by computing the
skyline of the join of two tables based on the product of their individual query
lattices - therefore without performing the join. The paper presents basic con-
cepts and algorithms leaving experimentation and performance evaluation to a
forthcoming paper.
    The paper is organized as follows. In section 2 we give some preliminary def-
initions and introduce our notation. In section 3 we present our basic algorithm
for computing skylines through queries. In section 4 we apply our approach to
computing skylines over joins, thus demonstrating the generality of the approach.
Finally, in section 5, we offer some concluding remarks and discuss further re-
search.


2     Basic definitions

Let R be a relational table, with A1 , . . . , An as attributes. Let B = {B1 , . . . , Bk },
k ≤ n, be the set of preference attributes, that is a set of attributes of the table
whose domains are numeric and over which preferences are declared.
    A preference over Bi is an expression of one of two forms: Bi → min or
Bi → max . If the preference Bi → min is expressed by a user of the table, then
this is interpreted as follows: given two values x and y in the domain of Bi , x is
preferred to y or x preceeds y iff x < y; and similarly, if the preference Bi → max
      Computing the Skyline of a Relational Table Based on a Query Lattice         149

is expressed by a user, then this is interpreted as follows: given two values x and
y in the domain of Bi , x is preferred to y or x preceeds y iff x < y;
    In order to simplify the presentation, and without loss of generality, we shall
consider only one form of preference, namely Bi → min. However, all methods
discussed in this paper can be applied with any combination of the preferences
Bi → min and Bi → max .Therefore, from now on, given two values x and y in
the domain of Bi , we shall say that x is preferred to y or x preceeds y iff x < y.

Definition 1 (Pareto domination) Let B = {B1 , . . . , Bk } be a set of preference
attributes of a relational table R and let s and t be tuples of R. We say that s
is equivalent to t, denoted as s ≡ t iff s.Bi = t.Bi for all Bi ∈ B. Moreover, we
say that s Pareto dominates t, denoted as s <P a t, iff s 6≡ t and for all Bi ∈ B,
s.Bi ≤ t.Bi .

    In order to simplify the presentation we will simply say “dominates” instead
of “Pareto dominates”, and we shall drop the subscript in the notation, writing
s < t instead of s <P at.
    We shall call Pareto preference query, or simply preference query over R, any
expression of the form (B1 = b1 ) ∧ . . . ∧ (Bk = bk ), where each bi is a value in the
domain of attribute Bi . For simplicity of notation we shall denote a preference
query simply by hb1 , . . . , bk i.
    Note that hb1 , . . . , bk i denotes also a tuple in the projection of R over the
preference attributes; however, context will always disambiguate. Also note that
a preference query hb1 , . . . , bk i returns the set of tuples in R whose projection
over the preference attributes is the tuple hb1 , . . . , bk i; therefore the answer to
each query of the form hb1 , . . . , bk i is a Pareto equivalence class.
    It is easy to verify that Pareto domination is irreflexive (i.e., s < s is false
for each tuple s) and transitive (i.e., s < t and t < u imply s < u for all tuples
s, t and u), hence a strict order over R. A partial order ≤ over R can be defined
from Pareto domination as follows:

                               s ≤ t iff (s < t or s = t)

for all tuples s and t in R. We shall say that s Pareto preceeds t, or simply that
s preceeds t, whenever s ≤ t.
    Clearly, Pareto precedence defines a partial order also over preference queries.
Moreover, given preference queries s and t we define the following operations:

 – s ⊗ t = hmin{s.B1 , t.B1 }, . . . , min{s.Bk , t.Bk }i
 – s ⊕ t = hmax {s.B1 , t.B1 }, . . . , max {s.Bk , t.Bk }i

It is easy to check that these operations make the set of preference queries over
R into a complete lattice, with ⊗ defining the least upper bound and ⊕ defining
the greatest lower bound of any two preference queries. We notice that this
lattice may be infinite, as some of the domains of the preference attributes may
be infinite. However, if we require that each bi in a preference query be in the
active domain of Bi (i.e.if we require that each bi appear in R), then the lattice
150     N. Spyratos et al.

becomes finite and therefore it has a top and a bottom query, denoted as > and
⊥, respectively. These extreme elements are given by:

                                > = hm1 , . . . , mk i
                                ⊥ = hM1 , . . . , Mk i

where mi and Mi are the minimum value and the maximum value appearing
in the active domain of Bi , respectively. We shall call this (finite) lattice the
query lattice, of R and we shall denote it as (Q, ≤), where Q is the (finite) set
of preference queries over R.
    In this paper, we shall use the query lattice for two purposes: (a) as a tool for
computing skylines of relational tables, and (b) as a means for comparing our
approach to existing approaches. First, however, let’s define skylines formally.

Definition 2 The skyline of a table R over preference attributes B, denoted by
SKY (R, B), is the set of tuples from R defined as follows:

                      SKY (R, B) = {t ∈ R | 6 ∃s ∈ R : s ≤ t}

In other words, the skyline of R is the set of non-dominated tuples of R.
    As customary, given a query q and a relation R, we will let ans(q, R) stand
for the answer of q over R, that is the set of tuples obtained by asking query
q against relation R. Moreover, we define a skyline query of R over preference
attributes B, to be a query q(R, B) over R whose answer is a skyline of R over
B, that is:
                         ans(q(R, B), R) = SKY (R, B)
Skyline queries will play a central role in our approach to skyline computation.


3     Computing the skyline of a relational table
In this Section, we present an algorithm for computing the skyline of a given
table R. Our algorithm obtains the skyline by constructing a skyline query. To
find the skyline query, our algorithm traverses part of the query lattice and
collects a set of non-dominated queries whose disjunction is the skyline query.
    In contrast to existing methods that actually construct the skyline, and as
such are sensitive to the ways the table R is implemented or how its tuples are
accessed, our algorithm obtains the skyline while making no assumptions on how
the relation R is accessed by the system while processing the skyline query.
    Our algorithm uses the notion of successors of a query, defined as follows.
First, for each preference attribute Bi and value bi in the domain of Bi , let’s
denote as succ(bi ) the successor of bi in the linear order of the domain of Bi . For
instance, let Bi be the Hotel Rating attribute of our earlier example, having as
domain the interval [1,5]. As this interval is linearly ordered by the < relation,
we have succ(1) = 2, succ(2) = 3, and so on. This makes succ a partial function
over the domain of Bi , undefined for the maximum Mi . Now, we extend the succ
function to preference queries as follows.
      Computing the Skyline of a Relational Table Based on a Query Lattice                  151

    Let q be a preference query q = hb1 , . . . , bk i such that q 6= ⊥. This means that
bj 6= Mj for at least one j ∈ [1, k], where Mj is the maximum value in the active
domain of Bj . With no loss of generality, we shall assume that the m values of q
that are not maximal, where 1 ≤ m ≤ k, occur in the first m positions of q, (i.e.,
q = hb1 , . . . , bm , Mm+1 , . . . , Mk i). Then, the successors of q, succ(q), is defined
to be the set of queries succ(q) = {q1 , . . . , qm }, such that, for all 1 ≤ i ≤ m,

             qi = hb1 , . . . , bi−1 , succ(bi ), bi+1 , . . . , bm , Mm+1 , . . . , Mk i

Clearly, the succ function is undefined on the bottom of the lattice ⊥. The
following Lemma gives two important properties of the successors of q for the
establishment of the correctness of the following algorithm for computing a sky-
line query of the table R.

Lemma 1. Let q be a preference query. Then, for each query q 0 ∈ succ(q) :
 1. q ≤ q 0
 2. there is no query q 00 such that q < q 00 < q 0 .

   We are now ready to give the algorithm for computing a skyline query of a
table R.
Algorithm Skyline Query over a Single Table (SQST )

Input A non-empty table R, a non-empty set B = {B1 , . . . , Bk } of preference
attributes in R, and the projections of R over B1 , . . . , Bk .

Data We use the following variables for accumulating data during the execution
of the algorithm:
 – The variable F is a set variable called frontier; it is initialized to empty, and
   it is used to accumulate the queries whose disjunction will be the result of
   the algorithm (i.e., whose disjunction will be the skyline query).
 – The variable C is a set variable containing the set of all current candidate
   queries; it is initialized to the top > of the query lattice.
 – The variable S is a set variable used to accumulate successors of current
   candidate queries.
 – The variable C 0 is a (auxiliary) set variable for accumulating candidate
   queries for the next while-loop iteration in the algorithm; it is initialized
   to empty at the beginning of each iteration.

Output A set of preference queries over R, whose disjunction is a skyline query.

Method
 C ← {>}; F ← ∅
 while C 6= ∅ do
    for all c ∈ C such that ans(c, R) 6= ∅ do
      C ← C\{c}; F ← F ∪ {c}
152       N. Spyratos et al.

    end for
    C0 ← ∅
    for all c ∈ C do
      for all s ∈ succ(c) such that no query in F Pareto dominates s do
        C 0 ← C 0 ∪ {s}
      end for
    end for
    C ← C0
  end while
  return F

      Informally, the algorithm works as follows:

 – If the top query > of the lattice is non-empty (i.e.if its answer over R is non-
   empty) then the algorithm terminates; > is the output of the algorithm, and
   the answer of > over R is the skyline.
 – Otherwise, each successor query c of > might be a candidate for contributing
   to the skyline, and this is checked as follows:
     • if the query c is non-empty then it is added to F (i.e.to the variable that
        accumulates the queries contributing to the skyline);
     • otherwise, each successor of c that is not dominated by a query in F
        becomes a candidate, by being added to the variable C 0 (i.e.to the vari-
        able that accumulates all candidate queries for the next iteration). The
        so selected successors are finally transferred to the variable C.

    This process is repeated until there is no more candidate left (i.e.until C is
empty).
    Upon termination of the algorithm, F contains all queries q such that: (a)
the answer to q is non-empty, and (b) q is not dominated by another query. The
disjunction of all queries in F is then the skyline query, and the answer of the
skyline query over R is the skyline of R.


             R
             HID Price Rating
               h1 200       2
               h2 150       5
               h3 100       3
               h4 300       2
               h5 350       1
               h6 100       8

                       Fig. 1. Example relation and query lattice.


   Let us illustrate the algorithm by using the example table R in Fig 1. We
assume the preference attributes to be P rice and Rating and the preference
      Computing the Skyline of a Relational Table Based on a Query Lattice      153

to be P rice → min and Rating → min. The projections of R over P rice and
Rating, sorted in ascending order, are as follows:

 – P rice : 100, 150, 200, 300, 350
 – Rating : 1, 2, 3, 5, 8

    The diagram in Fig 1 shows a part of the query lattice derived from R. In
this diagram, queries having non-empty answers are emphasized by bold letters.
Queries contributing to the skyline are enclosed by rounded rectangles. Gray
triangles in the diagram represent areas dominated by queries in the skyline. As
we can see in the diagram, the top of the query lattice is > = h100, 1i. Therefore,
we start the first iteration of the algorithm with C = {h100, 1i} and F = ∅. At
the end of each subsequent iteration of the while-loop, the contents of C and F
change as follows:

 – end of 1st iteration: C = {h150, 1i, h100, 2i}, F = ∅
 – end of 2nd iteration: C = {h200, 1i, h150, 2i, h100, 3i}, F = ∅

    In the third iteration, the query h100, 3i ∈ C has a non-empty result therefore
h100, 3i leaves C and enters F . We then consider the successors of the queries
left in C. h150, 3i ∈ succ(h150, 2i) is dominated by h100, 3i ∈ F therefore h150, 3i
is omitted from C 0 , which accumulates candidate queries for the next iteration.

 – end of 3rd iteration: C = {h300, 1i, h200, 2i}, F = {h100, 3i}
 – end of 4th iteration: C = {h350, 1i}, F = {h100, 3i, h200, 2i}
 – end of 5th iteration: C = ∅, F = {h100, 3i, h200, 2i, h350, 1i}
   (the algorithm stops here)

    The correctness of the algorithm is easily established by observing that the
algorithm explores the lattice completely (this is guaranteed by the second prop-
erty of the succ function in the above Lemma), retaining only maximal queries
(this is guaranteed by the test on dominance performed by the algorithm and
also by the fact that the successors of a query are all dominated by it, as stated
by the first property of the succ function in the above Lemma).
    Formally, we will denote as SQST (R, B) the result of the SQST algorithm
having R and B as the input non-empty relation and preference attributes, re-
spectively. On the basis of the above observations, we state the following:

Proposition 1 For every relation R and preference attributes B over R,
                      _
                  ans( SQST (R, B), R) = SKY (R, B)

    We note that, as all queries in F are conjunctive and any two queries in F
differ in at least one conjunct, the answers making up the skyline actually form
a partition of the skyline. This is an interesting observation when combined with
the notion of rank of a query in the query lattice.
154     N. Spyratos et al.

Definition 3 (Rank of a query) The rank of a query q in the query lattice is
defined as follows: if q is the root query then rank(q) = 0 else rank(q) is the
maximum length of path from the root query to q

    Clearly, the higher the rank of a query the less the tuples in its answer are
preferred.
    Now, as the skyline of R is partitioned by the answers to the queries in F ,
one can ask new kinds of queries. For example, one can ask the query: “give me
the best tuples from the skyline”. The answer to this query will be the answer
to the query of lowest rank in F . Similarly, one can ask the query: “give me the
ranks of all tuples in the skyline”. This query will return a set of ranks, thereby
giving a useful information as to how far are the tuples of the skyline from the
most preferred tuples. A detailed discussion of the relationship between “most
preferred” and “non-dominated” tuples is given in a forthcoming paper.


4     Skylines of joins

We now consider the computation of the skyline over the join of two tables R1
and R2 . To this end, we introduce the necessary concepts.
    Let R1 and R2 be relations with A11 , . . . , An1 1 and A12 , . . . , An2 2 as attributes,
respectively, and let Bi = {Bi1 , . . . , Biki }, ki ≤ ni , be a set of attributes of Ri ,
called the preference attributes of Ri , for i = 1, 2.
    We shall denote as (Q1 , ≤1 ) and (Q2 , ≤2 ) the query lattices over R1 and R2 ,
respectively. Moreover, ⊗i and ⊕i will stand for the least upper bound and the
greatest lower bound of any two preference queries over Ri , respectively.
    Let Ji = {Ji1 , . . . , Jil }, be a set of attributes of Ri disjoint form the preference
attributes Bi , for i = 1, 2. A join over J1 and J2 is a relational equijoin R1 ./e R2 ,
whose join expression e is given by J11 = J21 , . . . , J1l = J2l . In order to simplify
the model and with no loss of generality, we will consider the join attributes to
be the same for the two relations, that is J1 = J2 , and moreover to consist of a
single attribute J, that is e is given by J = J.
    Intuitively, a preference query over a join R1 ./e R2 is a (k1 +k2 )-tuple whose
first k1 elements make up a query in Q1 and whose last k2 elements make up a
query in Q2 . In order to simplify notation, we will commit a slight abuse and
write hq1 , q2 i to represent a preference query over R1 ./e R2 , where q1 ∈ Q1 and
q2 ∈ Q2 . As a consequence, the set of preference queries over R1 ./e R2 , that we
shall denote as Q./ , is given by the Cartesian product of the set of preference
queries over R1 and R2 , that is:

                                      Q./ = Q1 × Q2

Now, let ≤./ be the Pareto preference relation over Q./ . As we have already
seen in the previous Section, (Q./ , ≤./ ) is a complete lattice. Moreover, it is not
difficult to see that:
      Computing the Skyline of a Relational Table Based on a Query Lattice              155

Proposition 2 (Q./ , ≤./ ) is the product of (Q1 , ≤1 ) and (Q2 , ≤2 ). That is, let-
ting q and q 0 be preference queries in Q./ such that q = hq1 , q2 i and q 0 = hq10 , q20 i,
where q1 , q10 ∈ Q1 and q2 , q20 ∈ Q2 , we have:
 1. q≤./ q 0 iff q1 ≤1 q10 and q2 ≤2 q20
 2. q ⊗./ q 0 = hq1 ⊗1 q10 , q2 ⊗2 q20 i
 3. q ⊕./ q 0 = hq1 ⊕1 q10 , q2 ⊕2 q20 i

   From now on we shall simplify notation by omitting subscripts, unless this
creates ambiguity.
   We shall call join values the set of tuples V = X1 ∩ X2 , where:

                               Xi = πJ (Ri ) for i = 1, 2

By definition of join, a tuple t in Ri contributes to the join if and only if its
projection over the join attribute J is in V, that is t.J ∈ V, for i = 1, 2. Likewise,
a query q in Qi may contribute to the skyline of the join only if it occurs in the
join. For each join value v ∈ V, we define the v−partition of Ri , denoted Si (v),
as follows :
                       Si (v) = πBi (σJ=v (Ri )) for i = 1, 2
In practice, each v−partition includes the queries that contribute to the join
R1 ./ R2 . In particular:
                                            [
                      πB1 ∪B2 (R1 ./ R2 ) =   S1 (v) × S2 (v)
                                               v∈V

v−partitions play an important role in determining the skyline of the join R1 ./
R2 without computing the join.

Proposition 3 A query hq1 , q2 i ∈ πB1 ∪B2 (SKY (R1 ./ R2 , B1 ∪ B2 )) iff for some
join value v ∈ V, qi ∈ πBi (SKY (Si (v), Bi )) for i = 1, 2 and for no other v 0 ∈ V
there exists skylines qi0 ∈ πBi (SKY (Si (v 0 ), Bi )) such that qi0 ≤ qi for i = 1, 2.
Proof: (→) If for some join value v ∈ V, qi ∈ πBi (SKY (Si (v), Bi )) for i = 1, 2
then hq1 , q2 i ∈ πB1 ∪B2 (R1 ./ R2 ), and moreover for 1 in Proposition 2 there
exists no other query hq10 , q20 i ∈ πB1 ∪B2 (R1 ./ R2 ) where qi0 ∈ Si (v) for i =
1, 2 such that hq10 , q20 i ≤ hq1 , q2 i. If for no other v 0 ∈ V there exists skylines
qi0 ∈ πBi (SKY (Si (v 0 ), Bi )) such that qi0 ≤ qi for i = 1, 2 then again from 1
in Proposition 2 there exists no hq10 , q20 i such that hq10 , q20 i ≤ hq1 , q2 i. Hence
hq1 , q2 i ∈ πB1 ∪B2 (SKY (R1 ./ R2 , B1 ∪ B2 )).
(←) Suppose not. Then, either (a) for no join value v ∈ V, qi ∈ πBi (SKY (Si (v), Bi ))
for i = 1, 2 or (b) there exists a join value v 0 ∈ V and skylines qi0 ∈ πBi (SKY (Si (v 0 ), Bi ))
such that qi0 ≤ qi for i = 1, 2. In case (a), there are two sub-cases: (a1) for no join
value v ∈ V, qi ∈ Si (v) for i = 1, 2. In this case, hq1 , q2 i 6∈ πB1 ∪B2 (R1 ./ R2 ),
against the hypothesis. (a2) for some join value v ∈ V, qi ∈ Si (v), but either
q1 6∈ πB1 (SKY (S1 (v), B1 )) or q2 6∈ πB2 (SKY (S2 (v), B2 )). Then let qi0 be such
that qi0 ∈ πBi (SKY (Si (v), Bi )). Such qi0 are guaranteed to exist because Si (v) is
finite and partially ordered by Pareto preference. By hypothesis, either q10 6= q1 or
156       N. Spyratos et al.

q20 6= q2 . Then hq10 , q20 i ∈ πB1 ∪B2 (R1 ./ R2 ) and by 1 in Proposition 2 hq10 , q20 i ≤
hq1 , q2 i, therefore hq1 , q2 i 6∈ πB1 ∪B2 (SKY (R1 ./ R2 , B1 ∪B2 )), against the hypoth-
esis. (b) If there exists a join value v 0 ∈ V and skylines qi0 ∈ πBi (SKY (Si (v 0 ), Bi ))
such that qi0 ≤ qi for i = 1, 2 then hq10 , q20 i ∈ πB1 ∪B2 (R1 ./ R2 ) and by 1 in Propo-
sition 2 hq10 , q20 i ≤ hq1 , q2 i, therefore hq1 , q2 i 6∈ πB1 ∪B2 (SKY (R1 ./ R2 , B1 ∪ B2 )),
against the hypothesis.
    We now provide an algorithm for computing the skyline queries over the join
of two tables R1 and R2 (without computing the join.). The algorithm exploits
Proposition 3, and uses the SQST algorithm for computing skylines over v-
partitions of the given tables. As a result, the algorithm returns two sets of
queries, one over R1 , the other over R2 , for selecting from each table the tuples
that will generate the skyline of the join R1 ./ R2 , and only those.
Algorithm The Skyline Queries over a Join (SQJ )

Input Non-empty relations R1 and R2 , non-empty sets B1 and B2 of preference
attributes in R1 and R2 , respectively, and the join values V.

Data G is a set variable initialized to empty and used to accumulate all candidate
skyline queries, resulting from the Cartesian products of the skyline queries over
the same join value. R is a set variable where the skyline of G is computed. Pi
and Fi (for i = 1, 2) are auxiliary set variables, used to store v-partitions and
final results, respectively.

Output Two sets of queries, one over R1 and the other over R2 .

Method
 G←∅
 for all v ∈ V do
    P1 ← SQST (S1 (v), B1 )
    P2 ← SQST (S2 (v), B2 )
    G ← G ∪ (P1 × P2 × {v})
 end for
 R ← SQST (G, B1 ∪ B2 )
 F1 ← πB1 ∪{J} (R)
 F2 ← πB2 ∪{J} (R)
 return F 1, F2

      We note that the algorithm operates in three passes:
 1. In the first pass, it gathers (in G) all candidate results by looping over all
    join values. This is in fact required by the first condition of Proposition 3,
    which states that hq1 , q2 i is in a skyline query of the join if both q1 and q2
    are skyline queries over the v-partitions for the same join value v.
 2. In the second pass, it removes from G the compound queries that are dom-
    inated by some other compound query, as required by the second condition
    of Proposition 3.
      Computing the Skyline of a Relational Table Based on a Query Lattice        157

 3. Finally, it slices the compound queries vertically, in order to obtain queries
    over R1 (these are stored in F1 ) and queries over R2 (in F2 ). Notice that
    the join attribute J must be transferred all along in order to generate the
    correct queries.

Clearly, there is no other way of proceeding since it is necessary to obtain all
candidate queries in order to apply the second condition of Proposition 3.
    Formally, we will denote as SQJ (R1 , R2 , B1 , B2 )i the i−th result of the SQJ
algorithm having R1 and R2 as the input non-empty relations and B1 and
B2 as preference attributes, respectively. For readability, we will abbreviate
SQJ (R1 , R2 , B1 , B2 )i as SQJ i .
    On the basis of the above observations and of Proposition 3, we therefore
state the following:

Proposition 4 For every pair of relations R1 and R2 and preference attributes
B1 and B2 over them,
             _                       _
SQST (ans(       SQJ 1 , R1 ) ./ ans( SQJ 2 , R2 ), B1 ∪B2 ) = SKY (R1 ./ R2 , B1 ∪B2 )

    Let us demonstrate the SQJ algorithm by using table R1 (hotels) and R2
(restaurants) in Fig. 2. Suppose we want to find best combinations of hotels
and restaurants in the same “Location”, by minimizing “Price”, “Rating”, “Dis-
tance” and “Location” (we took this example from [23]). In this case, the pref-
erence attributes are B1 = {P rice, Rating}, B2 = {Distance, Ranking} and the
join attribute is J = {Location}. From the intersection of values appearing in
the Location attribute in R1 and R2 , we can obtain join values V = {A, B, C}.
    In the algorithm, firstly we gather candidate skyline queries for each join
value. In the example, for a join value A ∈ V , we obtain v-partition S1 (A) and
S2 (A). Then we compute skylines P1 , P2 (emphasized in tables S1 , S2 by bold
letters) from S1 (A), S2 (A) by the applying the SQST algorithm. Finally we
make a Cartesian product P1 × P2 × {A} and append it to G that accumulates
candidate skyline queries.
    After iterating this candidate gathering process for every join value, we apply
SQST to G to obtain the skyline (emphasized in tables G by bold letters) over
the join R1 ./ R2 . It is important to note the difference of size between R1 ./ R2
and G. In this example, R1 ./ R2 produces 12 tuples but G contains 8 tuples.
Therefore, we can compute the skyline from G with less cost than by computing
directly from R1 ./ R2 .
    In the example, hh5 , r5 i is not a skyline in the join because for the join value
B, we have h2 dominating h5 and r1 dominating r5 . On the other hand, neither
h6 nor r4 are skyline in their table, but they form a skyline in the join because
they are skylines in their join group and there is no other group in which both
are dominated (h6 is dominated by a query h3 in the A group, whereas r4 is
dominated by r2 in the C group, and there is no single group in which both are
dominated.)
158    N. Spyratos et al.

R1 (Hotels)                                  S1
HID Price Rating Location                    Price Rating
 h1   100      8        A                     100       8
 h2   150      5        B           S1 (A)    200       1 P1A = {h100, 8i, h200, 1i}
 h3   200      1        A                     400       2
 h4   400      2        A                     150       5
                                    S1 (B)                P1B = {h150, 5i, h350, 3i}
 h5   300      7        C                     350       3
 h6   350      3        B           S1 (C)    300       7 P1C = {h300, 7i}

R2 (Restaurants)                             S2
RID Distance Ranking Location                Distance Ranking
 r1      150       4        B                    500        1
                                    S2 (A)                    P2A = {h500, 1i}
 r2      250       2        C                     500       6
 r3      500       1        A                    150        4
                                    S2 (B)                    P2B = {h150, 4i, h400, 3i}
 r4      400       3        B                    400        3
 r5      200       5        C                    250        2
                                    S2 (C)                    P2C = {h250, 2i, h200, 5i}
 r6      500       6        A                    200        5

G
Price Rating Distance Ranking Location
 100       8     500        1        A
                                       P1A × P2A × {A}
 200       1     500        1        A
 150       5     150        4        B
 150       5     400        3        B
                                       P1B × P2B × {B}
 350       3     150        4        B
 350       3     400        3        B
 300       7     250        2        C
                                       P1C × P2C × {C}
 300       7      200       5        C

F1                          F2
Price Rating Location       Distance Ranking Location
 100       8        A            500       1        A
 200       1        A            150       4        B
 150       5        B            400       3        B
 350       3        B            250       2        C
 300       7        C


              Fig. 2. An example of skyline over the join of two tables


5     Concluding Remarks

We have seen a novel approach to computing the skyline of a relational table
with respect to preferences expressed over one or more attributes with ordered
domains. Our approach is based on what we called the query lattice of the table,
and our basic algorithm constructs the skyline as the union of the answers to a
subset of queries from that lattice — hence without directly accessing the table
R. Therefore, in contrast to all existing techniques, our approach is independent
of how the table R is implemented or how its tuples are indexed. We have
demonstrated the generality of our approach by computing the skyline over the
join of two tables based on the product of their individual query lattices —
therefore without performing the join.
    We note that our method is applicable to a computational geometry setting as
well. Indeed, a discrete (finite) n-dimensional Euclidean space S can be thought
of as a relational table T (S) in which: (a) the attributes are the n dimensions
      Computing the Skyline of a Relational Table Based on a Query Lattice           159

of S and (b) each tuple of T (S) represents a point in S. Moreover, the answer
to a query q = hb1 , . . . , bk i from the query lattice of T (S) is the set of all
points of S such that (B1 = b1 ) ∧ . . . ∧ (Bk = bk ), where B1 , . . . , Bk are the
corresponding dimensions; in other words, the answer to q is the set of points
having the same coordinate values over the dimensions B1 , . . . , Bk . Additionally,
our method can be applied to the Cartesian product of two or more spaces
through the product lattice of the individual spaces. We are currently pursuing
work in two different directions, namely refining skyline analysis and applying
our approach to a distributed setting:

 1. Refining skyline analysis
    As we mentioned in section 3, the skyline query returned by our basic al-
    gorithm is the disjunction of a set of queries from the query lattice, say
    q1 ∧ . . . ∧ qm ; and the answers to these queries actually partition the skyline
    into disjoint subsets. Moreover, these queries have different ranks, in general.
    Therefore it now becomes possible to ask finer queries regarding the skyline
    such as “give me the best tuples of the skyline” (meaning the answer to the
    query qi of highest rank), or “return the skyline by presenting the answers
    to qi ’s in increasing order of rank”, and so on. In this respect, we would also
    like to investigate in more detail the relationship between “most preferred
    tuple” and “non-dominated tuple”.
 2. Applying our method to a distributed setting
    As information systems store bigger and bigger volumes of data today, data
    management and in particular query processing presents difficult challenges.
    From the user viewpoint, big volumes of data imply answers of large size.
    By returning the best tuples (in terms of user preferences), the skyline query
    relieves the user from having to deal with answers of large size; and having
    the possibility to further refine the skyline (as mentioned above) further
    contributes in that direction. However, in recent years, data management and
    data storage have become increasingly distributed, and distribution presents
    additional challenges for query processing. Adapting the skyline operator to a
    distributed setting is one of the research lines that we are currently pursuing.
    We believe that our approach to skyline computation through query lattices
    is particularly well suited in a distributed environment, where computation
    can be distributed and recomposed in the form of the product lattice.


References
    1. I. Bartolini, P. Ciaccia, and M. Patella. SaLSa: computing the skyline without
       scanning the whole sky. In Proceedings of CIKM, 2006.
    2. I. Bartolini, P. Ciaccia, and M. Patella. Efficient sort-based skyline evaluation.
       ACM Trans. Database Syst., 33(4), 2008.
    3. S. Börzsönyi, D. Kossmann, and K. Stocker. The skyline operator. In Proceed-
       ings of ICDE, pages 421-430, 2001.
    4. C.-Y. Chan, P.-K. Eng, K.-L. Tan. Stratified computation of skylines with
       partially-ordered domains. In Proc. of SIGMOD 2005, pp. 03–214, 2008.
160       N. Spyratos et al.

       5. J. Chomicki, P. Godfrey, J. Gryz, and D. Liang. Skyline with presorting. In
          Proceedings of ICDE, pages 717-816, 2003.
       6. B. Cui, H. Lu, Q. Xu, L. Chen, Y. Dai, Y. Zhou. Parallel Distributed Processing
          of Constrained Skyline Queries by Filtering. In Proc. of ICDE 2008, pp. 546-
          555, 2008.
       7. P. Godfrey, R. Shipley, and Jarek Gryz. Algorithms and analyses for maximal
          vector computation. The VLDB Journal, vol 16(1), pp. 5–28, 2007.
       8. W. Jin, M. Ester, Z. Hu, and J. Han. The multi-relational skyline operator. In
          Proceedings of ICDE, 2007.
       9. W. Jin, M. Morse, J. Patel, M. Ester, and Z. Hu. Evaluating skylines in the
          presence of equi-joins. In Proc. of ICDE, 2010.
      10. N. Koudas, C. Li, A. K. H. Tung, and R. Vernica. Relaxing join and selection
          queries. In Proceedings of VLDB, 2006.
      11. H.T. Kung, F. Luccio, F.P. Preparata. On finding the maxima of a set of
          vectors. Journal of the ACM, vol. 22(4), pp. 469–476, 1975.
      12. M.E. Khalefa, M.F. Mokbel, J.J. Levandoski. Skyline Query Processing for
          Incomplete Data. In Proc. of ICDE 2008, pp. 556-565, 2008.
      13. D. Kossmann , F. Ramsak , S. Rost. Shooting Stars in the Sky: An Online
          Algorithm for Skyline Queries. In Proc. of VLDB 2002, pp. 275–286, 2002.
      14. X. Lin , Y. Yuan , W. Wangnicta , S. Wales. Stabbing the sky: Efficient skyline
          computation over sliding windows. In Proc. of ICDE 2005, pp. 502–513, 2005.
      15. M. Morse , J. Patel , H. V. Jagadish. Efficient Skyline Computation over Low-
          Cardinality Domains. In Proc. of VLDB 2007, pp. 267–278, 2007.
      16. J. Pei , W. Jin , M. Ester , Y. Tao. Catching the best views of skyline: A
          semantic approach based on decisive subspaces. In Proc. of VLDB 2005, pp.
          253–264, 2005.
      17. J. Pei , B. Jiang , X. Lin , Y. Yuan. Probabilistic skylines on uncertain data.
          In Proc. of VLDB 2007, pp. 15–26, 2007.
      18. F.P. Preparata, M.I. Shamos. Computational Geometry Computational Geom-
          etry, Springer-Verlag, 1985.
      19. D. Papadias, Yufei Tao, Greg Fu, Bernhard Seeger. An Optimal and Progressive
          Algorithm for Skyline Queries. In Proc. of SIGMOD 2003, pp. 467–478, 2003.
      20. V. Raghavan, E. Rundensteiner. Progressive result generation for multi-criteria
          decision support queries. In Proceedings of ICDE, 2010.
      21. K.-L. Tan, P.-K. Eng, B.C. Ooi. Efficient Progressive Skyline Computation. In
          Proc. of VLDB 2001, pp. 301-310, 2001.
      22. Y. Tao, X. Xiao, and J. Pei. Subsky: Efficient computation of skylines in sub-
          spaces. In Proceedings of ICDE, 2006.
      23. A. Vlachou, C. Doulkeridis, N. Polyzotis. Skyline query processing over joins.
          In Proceedings of SIGMOD 2011, pp. 73–84, 2011.
      24. A. Vlachou, K. Nørvåg. Bandwidth-constrained distributed skyline computa-
          tion. In Proc. of MobiDE 2009, pp. 17–24, 2009.
      25. D. Sun, S. Wu, J. Li, and A. K. H. Tung. Skyline-join in distributed databases.
          In ICDE Workshops, 2008.
      26. Q. Wan, R. C.-W. Wong, I. F. Ilyas, M. T. Özsu, and Y. Peng. Creating
          competitive products. PVLDB, vol. 2(1), 898-909, 2009.
      27. P. Wu , C. Zhang , Y. Feng , B.Y. Zhao , D. Agrawal , A.E. Abbadi. Parallelizing
          skyline queries for scalable distribution. In Proc. of EDBT 2006, pp. 112-130,
          2006.
      28. Y. Yuan , X. Lin , Q. Liu , W. Wang , J.X. Yu , Q. Zhang. Efficient Computation
          of the Skyline Cube. In Proc. of VLDB 2005, pp. 241–252, 2005.
  Computing the Skyline of a Relational Table Based on a Query Lattice      161

29. S. Zhang , N. Mamoulis , S.W. Cheung. Scalable skyline computation using
    object-based space partitioning. In Proc. of SIGMOD 2009, pp. 483-494, 2009.