=Paper= {{Paper |id=Vol-2161/paper20 |storemode=property |title=Beyond Skyline and Ranking Queries: Restricted Skylines (Extended Abstract) |pdfUrl=https://ceur-ws.org/Vol-2161/paper20.pdf |volume=Vol-2161 |authors=Paolo Ciaccia,Davide Martinenghi |dblpUrl=https://dblp.org/rec/conf/sebd/CiacciaM18 }} ==Beyond Skyline and Ranking Queries: Restricted Skylines (Extended Abstract)== https://ceur-ws.org/Vol-2161/paper20.pdf
         Beyond Skyline and Ranking Queries:
        Restricted Skylines (Extended Abstract)

                         Paolo Ciaccia1 and Davide Martinenghi2
              1
               Università di Bologna, Italy, email: paolo.ciaccia@unibo.it
        2
            Politecnico di Milano, Italy, email: davide.martinenghi@polimi.it



        Abstract. Traditionally, skyline and ranking queries have been treated
        separately as alternative ways of discovering interesting data in poten-
        tially large datasets. While ranking queries adopt a specific scoring func-
        tion to rank tuples, skyline queries return the set of non-dominated tuples
        and are independent of attribute scales and scoring functions. Ranking
        queries are thus less general, but cheaper to compute and widely used.
        In this paper, we integrate these two approaches under the unifying
        framework of restricted skylines by applying the notion of dominance
        to a set of scoring functions of interest.


1     Introduction

When simultaneously optimizing different criteria (e.g., the different attributes
in a dataset), a problem known as multi-objective optimization, three approaches
are prevalent [4]: (1) ranking queries, which focus on a scoring function express-
ing the relative importance of the different attributes; (2) lexicographical queries,
which establish a strict priority among the attributes; (3) skyline queries, which
return all the non-dominated objects (t dominates s iff t is no worse than s on all
the attributes, and strictly better on at least one). As argued in the literature [4],
each of these methods has pros and cons (also refer to Table 1).
                  Table 1: Pros and cons of multi-objective optimization approaches.
                    Evaluation criteria ↓ Queries → Ranking Lexicographic Skyline
                    Simplicity of formulation           No       Yes        Yes
                    Overall view of interesting results No       No         Yes
                    Control of result cardinality       Yes      Yes        No
                    Trade-off among attributes          Yes      No         No
                    Relative importance of attributes   Yes      Yes        No
    Ranking queries heavily depend on the particular choice of weights in the
scoring function, and thus fail to offer an overall view of the dataset. Lexi-
cographic queries enforce a linear priority between attributes, thus even the
smallest difference in the most important attribute cannot be compensated
by the other attributes (a problem inherited by newer approaches, such as p-
skylines [5]). Skyline queries provide a good overview of potentially interesting
tuples, but may contain too many objects.
    SEBD 2018, June 24-27, 2018, Castellaneta Marina, Italy. Copyright held by the
    authors.
    Restricted skyline (R-skyline) queries, introduced in [2], consider arbitrary
families of scoring functions (induced, e.g., by constraints on the weights), thus
allowing greater flexibility than skylines and ranking queries. The linchpin of R-
skylines is the novel concept of F-dominance: t F-dominates s when t is always
better than or equal to s according to all the scoring functions in the family F
(and strictly better for at least one scoring function in F).
    In this paper, we present a synthesis of the results in [2], where we intro-
duced two R-skyline operators: nd, characterizing the set of non-F-dominated
tuples; po, referring to the tuples that are potentially optimal, i.e., best according
to some function in F. R-skylines capture in a single framework all the prac-
tically relevant approaches to multi-objective optimization, traditionally dealt
with separately, and enable the study of other scenarios of practical interest. For
example, decision makers may encounter objectives in which the model param-
eters are characterized by complex preferences, perhaps coming from preference
elicitation from a crowd (see, e.g., [3] and references therein for strategies for
collecting preferences between tuples).

2     Restricted Skylines
We consider a relational schema R(A1 , . . . , Ad ), with d ≥ 1, where, w.l.o.g., each
attribute value ranges in [0, 1]; we refer to an instance r over R.
     The skyline of r, denoted Sky(r), is equivalently defined (i) as the set of all
non-dominated tuples (Eq. (1)), or (ii) as the set of potentially optimal tuples,
i.e., those that are better than all the others according to at least one monotone
scoring function [1] (Eq. (2)).
                           Sky(r) = {t ∈ r | @s ∈ r. s ≺ t}                          (1)
                 Sky(r) = {t ∈ r | ∃f ∈ M. ∀s ∈ r. s 6= t → f (t) < f (s)}           (2)
where s ≺ t means “s dominates t”, and M is the set of monotone scoring
functions. While the former view is typically adopted for skylines, the latter is
commonly applied to “top-k” queries (here with k = 1), i.e., those queries whose
goal is to return the k best tuples according to a given scoring function.
    We now introduce R-skylines, whose behavior is similar to Sky, but applied
to a limited set of monotone scoring functions F ⊆ M. To this end, we say that
t F-dominates s, s 6= t, denoted by t ≺F s, iff ∀f ∈ F. f (t) ≤ f (s).
Example 1. Consider the tuples t = h0.5, 0.5i, s = h0, 1i, the monotone scoring
functions f1 (x, y) = x + y and f2 (x, y) = x + 2y, and the set F = {f1 , f2 }. We
have t ≺F s, since f1 (t) = f1 (s) = 1 and f2 (t) = 1.5 < f2 (s) = 2. However,
t 6≺M s, since M includes f3 (x, y) = 2x + y, for which f3 (t) = 1.5 > f3 (s) = 1.
    The non-dominated restricted skyline of r wrt. F, denoted by nd(r; F), is
defined as the set of non-F-dominated tuples:
                         nd(r; F) = {t ∈ r | @s ∈ r. s ≺F t}.                        (3)
    Here, as a convention, we consider lower values to be better than higher ones.


                                            2
Note that the right-hand side of Eq. (3) is similar to that of Eq. (1), where ≺
has been replaced by ≺F . Observe that, clearly, ≺M coincides with ≺.
   The potentially optimal restricted skyline of r wrt. F, denoted by po(r; F),
returns the tuples that are best (top 1) according to some scoring function in F.

                po(r; F) = {t ∈ r | ∃f ∈ F. ∀s ∈ r. s 6= t → f (t) < f (s)}.                        (4)

The right-hand side of Eq. (4) is as in Eq. (2), with F instead of M.
   The following containment relationships between R-skylines and Sky hold:

                                         po(r; F) ⊆ nd(r; F) ⊆ Sky(r).                              (5)

                                   po(r; M) = nd(r; M) = Sky(r).                                    (6)
    The F-dominance region DR(t; F) of a tuple t under F is the set of all points
in [0, 1]d that are F-dominated by t:

                                   DR(t; F) = {s ∈ [0, 1]d | t ≺F s}.                               (7)

Such a region grows larger for smaller sets: DR(t; F1 ) ⊇ DR(t; F2 ), if F1 ⊆ F2 .



                       ��
                 ���                                              ���




                                                                  ���

                                                                                    �(�) =(���)
                 ���


                             ��                                   ��
                 ���
                                  ��                              ���




                 ���
                                       ��                         ���




                                            ��           ��                           �(�) =(���)
                            ���    ���       ���   ���                  ���    ��
                                                                              ���    ���   ���


                 (a) Tuples from Example 2 in                     (b) Region of normalized
                 [0, 1]d , d = 2. F -dominance                    weights such that w1 ≥ w2 (in
                 regions in gray.                                 gray).

Fig. 1: Example 2 – tuples and weights in [0, 1]d , d = 2, C = {w1 ≥ w2 }, F = LC
                                                                                1 , where L1 is the
set of monotone scoring functions that are weighted sums of attribute values.




Example 2. Let F be the set of linear scoring functions f (x, y) = w1 x+w2 y with
w1 ≥ w2 and let r consist of t1 = h0.3, 0.6i, t2 = h0.4, 0.45i, t3 = h0.5, 0.2i, t4 =
h0.6, 0.15i (Figure 1a). We have po(r; F) = {t1 , t3 } ⊆ nd(r; F) = {t1 , t2 , t3 } ⊆
Sky(r) = r. Indeed, no tuple in r dominates any other tuple in r, and so
Sky(r) = r. However, t3 ≺F t4 , since f (t3 ) ≤ f (t4 ) holds for any f ∈ F,
as w1 (0.5 − 0.6) ≤ w2 (0.15 − 0.2) always holds when w1 ≥ w2 . Therefore
   / nd(r; F). Figure 1a shows in gray the region of [0, 1]d whose points (includ-
t4 ∈
ing tuple t4 ) are F-dominated by some tuple in r, i.e., ∪t∈r DR(t; F); Figure 1b
shows in gray the region of normalized weights such that w1 ≥ w2 .


                                                              3
     Finally, with linear scoring functions, as is well known [7], top-1 tuples can
only lie in the boundary of the convex hull of the F-dominated region, thus
t2 ∈/ po(r; F), since there is no function f ∈ F for which both f (t2 ) < f (t1 ) and
f (t2 ) < f (t3 ), as there are no w1 , w2 such that w1 (0.4 − 0.3) < w2 (0.6 − 0.45),
w1 (0.4 − 0.5) < w2 (0.2 − 0.45), and w1 ≥ w2 all hold.


3    Restricted Skylines and Lp Norms
A practically relevant case to consider is that of the weighted Lp norms, defined
as follows, where W = (w1 , . . . , wd ) ∈ W is a normalized weight vector, i.e.,
                                                             Pd
W ⊆ [0, 1]d and, for each W = (w1 , . . . , wd ) ∈ W, we have i=1 wi = 1:


                                  d
                                                        !1/p
                                  X
                      LW
                       p (t) =          wi t[Ai ]   p
                                                               ,   p ∈ N.                         (8)
                                  i=1

We therefore turn our attention to the case in which the set of monotone scoring
functions coincides with the family Lp of weighted Lp norms:

                          Lp = {LW
                                 p | W ∈ W},                   p ∈ N.                             (9)

    The behaviors of nd and po are very different under Lp .
Theorem 1. For every value of p and every r, nd(r; Lp ) = Sky(r).
Thus, any Lp family is “powerful enough” to reveal all skyline points with nd.
However, this does not hold for po, as indicated in the following theorems.

Theorem 2. Let p < p0 , with p, p0 ∈ N. Then, for every r, po(r; Lp ) ⊆ po(r; Lp0 ).

Theorem 3. For each p ∈ N, there exists r such that po(r; Lp ) ⊂ Sky(r).

   The results of Theorems 1, 2 and 3 suggest that, by imposing some constraints
on the weights, one can use any Lp family to smoothly move from the skyline
(no constraints) to top-1 queries (a single weight vector satisfies the constraints).
   Checking F-dominance for Lp norms with linear constraints on the weights
can be done in PTIME. Let LCp indicate the set Lp with linear constraints C =
                                         Pd
{C1 , . . . , Cc } on weights, where Cj = i=1 aji wi ≤ kj (for j ∈ {1, . . . , c}).

Theorem 4 (F-dominance test). t ≺LCp s iff the following linear program-
ming problem in the unknowns W = (w1 , . . . , wd ) has a non-negative solution:
                            Pd                 p
        minimize              i=1 wi (s[Ai ]       − t[Ai ]p )                                   (10)
        subject to         wi ∈ [0, 1]                                      i ∈ {1, . . . , d}
                           Pd
                              i=1 wi = 1
                           Pd
                              i=1 aji wi ≤ kj                           j ∈ {1, . . . , c}.


                                            4
    Computing nd(r; F) using Theorem 4 is likely to be time-consuming, since
a different linear programming (LP) problem needs to be solved for each F-
dominance test. An alternative approach is to explicitly compute the F-dominance
regions of tuples, and then discard those tuples that belong to at least one of
such regions. The advantage of this approach is that the computation of the
F-dominance region of a tuple t can be performed just once, thus independently
of how many F-dominance tests involve t.
    In order to compute DR(t; LCp ), we observe that the subset of W that satisfies
C, denoted W(C), is a convex polytope contained in the unit (d − 1)-simplex.
Theorem 5 (F-dominance region). Let W (1) , . . . , W (q) be W(C)’s vertices,
                (`)         (`)
with W (`) = (w1 , . . . , wd ) for ` ∈ {1, . . . , q}. The dominance region DR(t; LCp )
of a tuple t under LCp is the locus of points s defined by the q inequalities:
                 d                     d
                         (`)                  (`)
                 X                     X
                        wi s[Ai ]p ≥         wi t[Ai ]p ,      ` ∈ {1, . . . , q}.   (11)
                  i=1                  i=1

Example 3. Let d = 2, p = 1, and consider tuples t1 = h0.3, 0.6i, t3 = h0.5, 0.2i,
t4 = h0.6, 0.15i from Example 2. For C = {w1 ≥ w2 } and considering that
w1 + w2 = 1 and 0 ≤ w1 , w2 ≤ 1, the vertices of W(C) are W (1) = (1, 0) and
W (2) = (0.5, 0.5). Figure 1a shows the tuples along with their LC1 -dominance
regions, while Figure 1b shows W(C). By Theorem 5, DR(t3 ; LC1 ) is characterized
by the system of inequalities:
                         {s[A1 ] ≥ 0.5,      s[A1 ] + s[A2 ] ≥ 0.7}.                 (12)
Tuple t4 satisfies (12) and thus t3 ≺LC1 t4 . For tuple t1 , the system becomes:

                         {s[A1 ] ≥ 0.3,      s[A1 ] + s[A2 ] ≥ 0.9}.                 (13)
Here, t4 does not satisfy (13) and therefore t1 6≺LC1 t4 .
As Example 3, Figure 1a and Inequalities (11) suggest, the “shape” of DR(t; LCp )
(modulo cropping in the [0, 1]d hypercube) is independent of t, since the left-hand
sides are the same and the right-hand sides are, for any given t, a constant. The
only significant overhead introduced by this approach is the enumeration of the
vertices of W(C). However, this has to be done just once for all tuples.
    For any set F, po(r; F) can be computed from nd(r; F) by retaining only the
tuples that are not F-dominated by any “virtual” tuple obtained by combining
other tuples in nd(r; F). When F = LCp , this is done by solving an LP problem.
Theorem 6 (po test). Let nd(r; LCp ) = {t1 , t2 , . . . , tσ , t}. Then, t ∈ po(r; LCp )
iff there is no convex combination s of t1 , . . . , tσ such that s ≺LCp t, i.e., iff the
following linear system in the unknowns α = (α1 , . . . , ασ ) is unsatisfiable:
           Pd     (`) Pσ               p
                                            Pd       (`)   p
             i=1 wi (   j=1 αj tj [Ai ] ) ≤  i=1 wi t[Ai ]      ` ∈ {1, . . . , q}   (14)
                                  αj ∈ [0, 1]       j ∈ {1, . . . , σ}
                             Pσ
                               j=1 αj = 1.


                                                5
    Algorithm 1: SVE1F for nd.
 Input: relation r, constraints C, family F = LCp . Output: nd(r; F).
  1. let ND := ∅; let W (1) , . . . , W (q) be the vertices of W(C)
  2. sort r using the coordinates of the centroid of W(C) as weights
  3. for each s in r // candidate F-dominated tuple
  4.   compute left-hand sides of Inequalities (11)
  5.   for each t in ND // candidate F-dominant tuple
  6.      if t ≺ s ∨ t ≺F s then continue to line 3
  7.   let ND := ND ∪ {s}
  8. return ND




   Besides the LCp family, Theorems 4, 5, and 6 also hold for any set F whose
functions are weighted sums of monotone functions of single attributes and mono-
tonic transforms thereof.


4     Algorithms and Experiments

We considered several algorithmic opportunities for the computation of nd.
  – Appropriately sorting the dataset beforehand produces a topological sort
     with respect to the F-dominance relation. With that, if tuple t precedes
     tuple s in the sorted relation, then s 6≺F t.
  – The F-dominance test can be executed by either i) solving an LP problem
     as in Theorem 4 or ii) checking whether t ∈ DR(s; F) as in Theorem 5
     through vertex enumeration of the polytope W(C).
  – We can either i) first compute Sky and then remove F-dominated tuples
     (two phases), or ii) integrate dominance and F-dominance tests (one phase).
     The best choice is to use sorting, vertex enumeration and a single phase.
The corresponding pseudocode is shown in Algorithm 1, called SVE1F. The main
idea is to scan the tuples sortedly and to populate a current window ND of non-
F-dominated tuples. No tuple will ever be removed from ND, as no tuple can
be F-dominated by a tuple found later in the sorted relation. The vertices of
the polytope W(C) are computed just once (line 1). Algorithm 1 interleaves
dominance and F-dominance tests, thus performing, for each new tuple s, a
single pass over ND. Every candidate F-dominated tuple s (line 3) is compared
against every candidate F-dominant tuple t (line 6) to decide whether s should
be added to ND. The F-dominance test of line 6 is done via Theorem 5 and, thus,
it is useful to precompute the left-hand sides of Inequality (11) already at line 4.
     For the computation of po(r; F), we start from the tuples in nd(r; F) and, by
Theorem 6, we discard any tuple t that is F-dominated by a convex combination
of tuples in nd(r; F)\{t}. However, directly checking F-dominance via (14) may
be prohibitively time consuming when σ = |nd(r; F)| − 1 is large. We therefore
try, in Algorithm 2 (POND, i.e. po via nd), to reduce as early as possible the set
of candidate potentially optimal tuples (PO) by adopting the following heuristics:
i) we start with a convex combination of only σ̃ = 2 tuples (line 2), which will
give rise to smaller, faster-to-solve systems (14) for testing F-dominance; as long


                                                  6
  Algorithm 2: POND for po.
 Input: relation r, constraints C, family F = LCp . Output: po(r; F).
 1. let PO := nd(r; F) // via SVE1F
 2. let σ̃ := 2; let lastRound := false
 3. while(¬lastRound)
 4.   if σ̃ ≥ |PO| − 1 then lastRound := true
 5.   for each t in PO in reverse order // candidate F-dominated tuple
 6.     if ∃s. s ≺F t, s is convex comb. of the first min(σ̃, |PO|−1) tuples in PO\{t} then let PO := PO\{t}
 7.   let σ̃ := σ̃ · 2
 8. return PO




as σ̃ < |PO| − 1, this condition is only sufficient for pruning, but not necessary;
after each round, we double σ̃ (line 7); ii) we sortedly enumerate candidate F-
dominated tuples from PO in reverse order (line 5), as the worst tuples wrt. the
ordering are the most likely to be F-dominated; iii) using linear system (14), we
check the existence of a convex combination of the first σ̃ tuples in PO (line 6),
as they are the best wrt. the ordering and thus more likely to F-dominate other
tuples. In the last round (enabled by line 4) all the remaining tuples are checked
against a convex combination of all the other tuples still in po, which is now a
necessary and sufficient condition for pruning, as in Theorem 6.
    In [2], efficiency and effectiveness of the proposed algorithms have been mea-
sured on both real and synthetic scenarios with varying i) data distribution,
ii) dataset size, iii) number of dimensions, and iv) number of constraints. With
default settings (anticorrelated data, N = 100K tuples, d = 6 dimensions,
F = LC1 family with |C| = 3 constraints), our algorithms incur sub-second exe-
cution times. Such times increase as N increases, d increases and |C| decreases.
    Both nd and po reduce the size of the result wrt. Sky (Fig. 2a). For default
values, their effectiveness is remarkable (9.8% of the Sky points are in nd and
only 1% in po). Somewhat surprisingly, computing nd via SVE1F may require
much less time than computing Sky, because F-dominance tests, albeit more
costly than dominance tests, are more effective in pruning redundant tuples.

                                                                                       ���
                                                                ��������� ��� ������




                      ���
  (�����)/��� �����




                      ���                                                              ���                                             ���(���)
                      ���                                                              ���                                             ���(��)

                      ���
                                                       ⨯   ��
                                                                                       ���
                                                                                                                                       ���(��)
                                                                                                                                       ���(���)
                                                       ○ ��
                      ���
                            ⨯○    ⨯○ ⨯○      ⨯○ ⨯○
                                                                                       ���

                                                                                       ���
                                                                                                                                       ���(��)
                                                                                                                                       ���(��)
                      ���
                            ���   ��� ����   ���� ��                                         �   ��   ��� |��| �� |��| ��� |���|����
                                      ����                                                                �
  (a) |nd|/|Sky| and |po|/|Sky| cardinality                     (b) Precision and recall of top-k queries wrt. Sky,
  ratios as dataset size N varies.                              nd, and po. Centroid of the W(C) as weight vector;
                                                                N = 100K tuples in the dataset.
Fig. 2: R-skylines vs. skyline and ranking queries (anticorrelated data with d = 6, F = LC
                                                                                         1 , |C| = 3).




   Let Tk (r; f ) indicate the set of top-k tuples in relation r wrt. a scoring
function f . The precision of Tk (r; f ) wrt. a set S is defined as pre(S) =


                                                                                7
|S ∩ Tk (r; f )|/k, whereas the recall is rec(S) = |S ∩ Tk (r; f )|/|S|. Figure 2b
shows precision and recall when f is the most representative weighted sum in
F (i.e., f ’s weight vector is the centroid of W(C)). Recall grows with k, while
precision decreases. With standard parameter values, when k = |S| (and thus
precision and recall are equal), rec(nd) is 50%, while rec(po) is 38%. Retriev-
ing the entire nd with a top-k query may however require to scan almost the
entire dataset. As expected, top-k queries incur a small fraction of the execution
time of SVE1F (below 7%).


5    Discussion
In this paper, we have presented a framework aiming to unify skyline and ranking
queries. We have done so by introducing two R-skyline operators implementing
the notions of non-dominated (nd) and potentially optimal (po) tuples with
respect to a set of scoring functions F. The greater flexibility of these operators
captures not only standard skyline and top-1 queries, but also constraints on the
weights in the scoring functions, which are highly relevant in practice.
    It is well known that choosing the “right” weights for a scoring function is a
difficult task for users, since it is usually hard to predict the effects on ranking
of changing one or more parameters. Replacing precise values with constraints
on weights, as R-skylines do, is therefore a viable way to alleviate the problem.
    We have shown that both nd and po are very effective in focusing on tuples
of interest, even in very large datasets. In practical cases, computing nd requires
no more than a few seconds. For the more complex problem of computing po,
we have developed heuristics to reduce the number of LP problems to be solved.
    Natural extensions of this framework include the notions of top-k query and
k-skyband, with k > 1 [6].


References
1. J. Chomicki, P. Ciaccia, and N. Meneghetti. Skyline queries, front and back. SIG-
   MOD Record, 42(3):6–18, 2013.
2. P. Ciaccia and D. Martinenghi. Reconciling skyline and ranking queries. PVLDB,
   10(11):1454–1465, 2017.
3. E. Ciceri et al. Crowdsourcing for top-k query processing over uncertain data.
   TKDE, 28(1):41–53, 2016.
4. A. A. Freitas. A critical review of multi-objective optimization in data mining: a
   position paper. SIGKDD Explorations, 6(2):77–86, 2004.
5. D. Mindolin and J. Chomicki. Preference elicitation in prioritized skyline queries.
   VLDB J., 20(2):157–182, 2011.
6. D. Papadias et al. Progressive skyline computation in database systems. TODS,
   30(1):41–82, 2005.
7. M. A. Soliman et al. Ranking with uncertain scoring functions: semantics and
   sensitivity measures. In SIGMOD, pages 805–816, 2011.




                                          8