=Paper=
{{Paper
|id=Vol-2161/paper20
|storemode=property
|title=Beyond Skyline and Ranking Queries: Restricted Skylines (Extended Abstract)
|pdfUrl=https://ceur-ws.org/Vol-2161/paper20.pdf
|volume=Vol-2161
|authors=Paolo Ciaccia,Davide Martinenghi
|dblpUrl=https://dblp.org/rec/conf/sebd/CiacciaM18
}}
==Beyond Skyline and Ranking Queries: Restricted Skylines (Extended Abstract)==
Beyond Skyline and Ranking Queries: Restricted Skylines (Extended Abstract) Paolo Ciaccia1 and Davide Martinenghi2 1 Università di Bologna, Italy, email: paolo.ciaccia@unibo.it 2 Politecnico di Milano, Italy, email: davide.martinenghi@polimi.it Abstract. Traditionally, skyline and ranking queries have been treated separately as alternative ways of discovering interesting data in poten- tially large datasets. While ranking queries adopt a specific scoring func- tion to rank tuples, skyline queries return the set of non-dominated tuples and are independent of attribute scales and scoring functions. Ranking queries are thus less general, but cheaper to compute and widely used. In this paper, we integrate these two approaches under the unifying framework of restricted skylines by applying the notion of dominance to a set of scoring functions of interest. 1 Introduction When simultaneously optimizing different criteria (e.g., the different attributes in a dataset), a problem known as multi-objective optimization, three approaches are prevalent [4]: (1) ranking queries, which focus on a scoring function express- ing the relative importance of the different attributes; (2) lexicographical queries, which establish a strict priority among the attributes; (3) skyline queries, which return all the non-dominated objects (t dominates s iff t is no worse than s on all the attributes, and strictly better on at least one). As argued in the literature [4], each of these methods has pros and cons (also refer to Table 1). Table 1: Pros and cons of multi-objective optimization approaches. Evaluation criteria ↓ Queries → Ranking Lexicographic Skyline Simplicity of formulation No Yes Yes Overall view of interesting results No No Yes Control of result cardinality Yes Yes No Trade-off among attributes Yes No No Relative importance of attributes Yes Yes No Ranking queries heavily depend on the particular choice of weights in the scoring function, and thus fail to offer an overall view of the dataset. Lexi- cographic queries enforce a linear priority between attributes, thus even the smallest difference in the most important attribute cannot be compensated by the other attributes (a problem inherited by newer approaches, such as p- skylines [5]). Skyline queries provide a good overview of potentially interesting tuples, but may contain too many objects. SEBD 2018, June 24-27, 2018, Castellaneta Marina, Italy. Copyright held by the authors. Restricted skyline (R-skyline) queries, introduced in [2], consider arbitrary families of scoring functions (induced, e.g., by constraints on the weights), thus allowing greater flexibility than skylines and ranking queries. The linchpin of R- skylines is the novel concept of F-dominance: t F-dominates s when t is always better than or equal to s according to all the scoring functions in the family F (and strictly better for at least one scoring function in F). In this paper, we present a synthesis of the results in [2], where we intro- duced two R-skyline operators: nd, characterizing the set of non-F-dominated tuples; po, referring to the tuples that are potentially optimal, i.e., best according to some function in F. R-skylines capture in a single framework all the prac- tically relevant approaches to multi-objective optimization, traditionally dealt with separately, and enable the study of other scenarios of practical interest. For example, decision makers may encounter objectives in which the model param- eters are characterized by complex preferences, perhaps coming from preference elicitation from a crowd (see, e.g., [3] and references therein for strategies for collecting preferences between tuples). 2 Restricted Skylines We consider a relational schema R(A1 , . . . , Ad ), with d ≥ 1, where, w.l.o.g., each attribute value ranges in [0, 1]; we refer to an instance r over R. The skyline of r, denoted Sky(r), is equivalently defined (i) as the set of all non-dominated tuples (Eq. (1)), or (ii) as the set of potentially optimal tuples, i.e., those that are better than all the others according to at least one monotone scoring function [1] (Eq. (2)). Sky(r) = {t ∈ r | @s ∈ r. s ≺ t} (1) Sky(r) = {t ∈ r | ∃f ∈ M. ∀s ∈ r. s 6= t → f (t) < f (s)} (2) where s ≺ t means “s dominates t”, and M is the set of monotone scoring functions. While the former view is typically adopted for skylines, the latter is commonly applied to “top-k” queries (here with k = 1), i.e., those queries whose goal is to return the k best tuples according to a given scoring function. We now introduce R-skylines, whose behavior is similar to Sky, but applied to a limited set of monotone scoring functions F ⊆ M. To this end, we say that t F-dominates s, s 6= t, denoted by t ≺F s, iff ∀f ∈ F. f (t) ≤ f (s). Example 1. Consider the tuples t = h0.5, 0.5i, s = h0, 1i, the monotone scoring functions f1 (x, y) = x + y and f2 (x, y) = x + 2y, and the set F = {f1 , f2 }. We have t ≺F s, since f1 (t) = f1 (s) = 1 and f2 (t) = 1.5 < f2 (s) = 2. However, t 6≺M s, since M includes f3 (x, y) = 2x + y, for which f3 (t) = 1.5 > f3 (s) = 1. The non-dominated restricted skyline of r wrt. F, denoted by nd(r; F), is defined as the set of non-F-dominated tuples: nd(r; F) = {t ∈ r | @s ∈ r. s ≺F t}. (3) Here, as a convention, we consider lower values to be better than higher ones. 2 Note that the right-hand side of Eq. (3) is similar to that of Eq. (1), where ≺ has been replaced by ≺F . Observe that, clearly, ≺M coincides with ≺. The potentially optimal restricted skyline of r wrt. F, denoted by po(r; F), returns the tuples that are best (top 1) according to some scoring function in F. po(r; F) = {t ∈ r | ∃f ∈ F. ∀s ∈ r. s 6= t → f (t) < f (s)}. (4) The right-hand side of Eq. (4) is as in Eq. (2), with F instead of M. The following containment relationships between R-skylines and Sky hold: po(r; F) ⊆ nd(r; F) ⊆ Sky(r). (5) po(r; M) = nd(r; M) = Sky(r). (6) The F-dominance region DR(t; F) of a tuple t under F is the set of all points in [0, 1]d that are F-dominated by t: DR(t; F) = {s ∈ [0, 1]d | t ≺F s}. (7) Such a region grows larger for smaller sets: DR(t; F1 ) ⊇ DR(t; F2 ), if F1 ⊆ F2 . �� ��� ��� ��� �(�) =(���) ��� �� �� ��� �� ��� ��� �� ��� �� �� �(�) =(���) ��� ��� ��� ��� ��� �� ��� ��� ��� (a) Tuples from Example 2 in (b) Region of normalized [0, 1]d , d = 2. F -dominance weights such that w1 ≥ w2 (in regions in gray. gray). Fig. 1: Example 2 – tuples and weights in [0, 1]d , d = 2, C = {w1 ≥ w2 }, F = LC 1 , where L1 is the set of monotone scoring functions that are weighted sums of attribute values. Example 2. Let F be the set of linear scoring functions f (x, y) = w1 x+w2 y with w1 ≥ w2 and let r consist of t1 = h0.3, 0.6i, t2 = h0.4, 0.45i, t3 = h0.5, 0.2i, t4 = h0.6, 0.15i (Figure 1a). We have po(r; F) = {t1 , t3 } ⊆ nd(r; F) = {t1 , t2 , t3 } ⊆ Sky(r) = r. Indeed, no tuple in r dominates any other tuple in r, and so Sky(r) = r. However, t3 ≺F t4 , since f (t3 ) ≤ f (t4 ) holds for any f ∈ F, as w1 (0.5 − 0.6) ≤ w2 (0.15 − 0.2) always holds when w1 ≥ w2 . Therefore / nd(r; F). Figure 1a shows in gray the region of [0, 1]d whose points (includ- t4 ∈ ing tuple t4 ) are F-dominated by some tuple in r, i.e., ∪t∈r DR(t; F); Figure 1b shows in gray the region of normalized weights such that w1 ≥ w2 . 3 Finally, with linear scoring functions, as is well known [7], top-1 tuples can only lie in the boundary of the convex hull of the F-dominated region, thus t2 ∈/ po(r; F), since there is no function f ∈ F for which both f (t2 ) < f (t1 ) and f (t2 ) < f (t3 ), as there are no w1 , w2 such that w1 (0.4 − 0.3) < w2 (0.6 − 0.45), w1 (0.4 − 0.5) < w2 (0.2 − 0.45), and w1 ≥ w2 all hold. 3 Restricted Skylines and Lp Norms A practically relevant case to consider is that of the weighted Lp norms, defined as follows, where W = (w1 , . . . , wd ) ∈ W is a normalized weight vector, i.e., Pd W ⊆ [0, 1]d and, for each W = (w1 , . . . , wd ) ∈ W, we have i=1 wi = 1: d !1/p X LW p (t) = wi t[Ai ] p , p ∈ N. (8) i=1 We therefore turn our attention to the case in which the set of monotone scoring functions coincides with the family Lp of weighted Lp norms: Lp = {LW p | W ∈ W}, p ∈ N. (9) The behaviors of nd and po are very different under Lp . Theorem 1. For every value of p and every r, nd(r; Lp ) = Sky(r). Thus, any Lp family is “powerful enough” to reveal all skyline points with nd. However, this does not hold for po, as indicated in the following theorems. Theorem 2. Let p < p0 , with p, p0 ∈ N. Then, for every r, po(r; Lp ) ⊆ po(r; Lp0 ). Theorem 3. For each p ∈ N, there exists r such that po(r; Lp ) ⊂ Sky(r). The results of Theorems 1, 2 and 3 suggest that, by imposing some constraints on the weights, one can use any Lp family to smoothly move from the skyline (no constraints) to top-1 queries (a single weight vector satisfies the constraints). Checking F-dominance for Lp norms with linear constraints on the weights can be done in PTIME. Let LCp indicate the set Lp with linear constraints C = Pd {C1 , . . . , Cc } on weights, where Cj = i=1 aji wi ≤ kj (for j ∈ {1, . . . , c}). Theorem 4 (F-dominance test). t ≺LCp s iff the following linear program- ming problem in the unknowns W = (w1 , . . . , wd ) has a non-negative solution: Pd p minimize i=1 wi (s[Ai ] − t[Ai ]p ) (10) subject to wi ∈ [0, 1] i ∈ {1, . . . , d} Pd i=1 wi = 1 Pd i=1 aji wi ≤ kj j ∈ {1, . . . , c}. 4 Computing nd(r; F) using Theorem 4 is likely to be time-consuming, since a different linear programming (LP) problem needs to be solved for each F- dominance test. An alternative approach is to explicitly compute the F-dominance regions of tuples, and then discard those tuples that belong to at least one of such regions. The advantage of this approach is that the computation of the F-dominance region of a tuple t can be performed just once, thus independently of how many F-dominance tests involve t. In order to compute DR(t; LCp ), we observe that the subset of W that satisfies C, denoted W(C), is a convex polytope contained in the unit (d − 1)-simplex. Theorem 5 (F-dominance region). Let W (1) , . . . , W (q) be W(C)’s vertices, (`) (`) with W (`) = (w1 , . . . , wd ) for ` ∈ {1, . . . , q}. The dominance region DR(t; LCp ) of a tuple t under LCp is the locus of points s defined by the q inequalities: d d (`) (`) X X wi s[Ai ]p ≥ wi t[Ai ]p , ` ∈ {1, . . . , q}. (11) i=1 i=1 Example 3. Let d = 2, p = 1, and consider tuples t1 = h0.3, 0.6i, t3 = h0.5, 0.2i, t4 = h0.6, 0.15i from Example 2. For C = {w1 ≥ w2 } and considering that w1 + w2 = 1 and 0 ≤ w1 , w2 ≤ 1, the vertices of W(C) are W (1) = (1, 0) and W (2) = (0.5, 0.5). Figure 1a shows the tuples along with their LC1 -dominance regions, while Figure 1b shows W(C). By Theorem 5, DR(t3 ; LC1 ) is characterized by the system of inequalities: {s[A1 ] ≥ 0.5, s[A1 ] + s[A2 ] ≥ 0.7}. (12) Tuple t4 satisfies (12) and thus t3 ≺LC1 t4 . For tuple t1 , the system becomes: {s[A1 ] ≥ 0.3, s[A1 ] + s[A2 ] ≥ 0.9}. (13) Here, t4 does not satisfy (13) and therefore t1 6≺LC1 t4 . As Example 3, Figure 1a and Inequalities (11) suggest, the “shape” of DR(t; LCp ) (modulo cropping in the [0, 1]d hypercube) is independent of t, since the left-hand sides are the same and the right-hand sides are, for any given t, a constant. The only significant overhead introduced by this approach is the enumeration of the vertices of W(C). However, this has to be done just once for all tuples. For any set F, po(r; F) can be computed from nd(r; F) by retaining only the tuples that are not F-dominated by any “virtual” tuple obtained by combining other tuples in nd(r; F). When F = LCp , this is done by solving an LP problem. Theorem 6 (po test). Let nd(r; LCp ) = {t1 , t2 , . . . , tσ , t}. Then, t ∈ po(r; LCp ) iff there is no convex combination s of t1 , . . . , tσ such that s ≺LCp t, i.e., iff the following linear system in the unknowns α = (α1 , . . . , ασ ) is unsatisfiable: Pd (`) Pσ p Pd (`) p i=1 wi ( j=1 αj tj [Ai ] ) ≤ i=1 wi t[Ai ] ` ∈ {1, . . . , q} (14) αj ∈ [0, 1] j ∈ {1, . . . , σ} Pσ j=1 αj = 1. 5 Algorithm 1: SVE1F for nd. Input: relation r, constraints C, family F = LCp . Output: nd(r; F). 1. let ND := ∅; let W (1) , . . . , W (q) be the vertices of W(C) 2. sort r using the coordinates of the centroid of W(C) as weights 3. for each s in r // candidate F-dominated tuple 4. compute left-hand sides of Inequalities (11) 5. for each t in ND // candidate F-dominant tuple 6. if t ≺ s ∨ t ≺F s then continue to line 3 7. let ND := ND ∪ {s} 8. return ND Besides the LCp family, Theorems 4, 5, and 6 also hold for any set F whose functions are weighted sums of monotone functions of single attributes and mono- tonic transforms thereof. 4 Algorithms and Experiments We considered several algorithmic opportunities for the computation of nd. – Appropriately sorting the dataset beforehand produces a topological sort with respect to the F-dominance relation. With that, if tuple t precedes tuple s in the sorted relation, then s 6≺F t. – The F-dominance test can be executed by either i) solving an LP problem as in Theorem 4 or ii) checking whether t ∈ DR(s; F) as in Theorem 5 through vertex enumeration of the polytope W(C). – We can either i) first compute Sky and then remove F-dominated tuples (two phases), or ii) integrate dominance and F-dominance tests (one phase). The best choice is to use sorting, vertex enumeration and a single phase. The corresponding pseudocode is shown in Algorithm 1, called SVE1F. The main idea is to scan the tuples sortedly and to populate a current window ND of non- F-dominated tuples. No tuple will ever be removed from ND, as no tuple can be F-dominated by a tuple found later in the sorted relation. The vertices of the polytope W(C) are computed just once (line 1). Algorithm 1 interleaves dominance and F-dominance tests, thus performing, for each new tuple s, a single pass over ND. Every candidate F-dominated tuple s (line 3) is compared against every candidate F-dominant tuple t (line 6) to decide whether s should be added to ND. The F-dominance test of line 6 is done via Theorem 5 and, thus, it is useful to precompute the left-hand sides of Inequality (11) already at line 4. For the computation of po(r; F), we start from the tuples in nd(r; F) and, by Theorem 6, we discard any tuple t that is F-dominated by a convex combination of tuples in nd(r; F)\{t}. However, directly checking F-dominance via (14) may be prohibitively time consuming when σ = |nd(r; F)| − 1 is large. We therefore try, in Algorithm 2 (POND, i.e. po via nd), to reduce as early as possible the set of candidate potentially optimal tuples (PO) by adopting the following heuristics: i) we start with a convex combination of only σ̃ = 2 tuples (line 2), which will give rise to smaller, faster-to-solve systems (14) for testing F-dominance; as long 6 Algorithm 2: POND for po. Input: relation r, constraints C, family F = LCp . Output: po(r; F). 1. let PO := nd(r; F) // via SVE1F 2. let σ̃ := 2; let lastRound := false 3. while(¬lastRound) 4. if σ̃ ≥ |PO| − 1 then lastRound := true 5. for each t in PO in reverse order // candidate F-dominated tuple 6. if ∃s. s ≺F t, s is convex comb. of the first min(σ̃, |PO|−1) tuples in PO\{t} then let PO := PO\{t} 7. let σ̃ := σ̃ · 2 8. return PO as σ̃ < |PO| − 1, this condition is only sufficient for pruning, but not necessary; after each round, we double σ̃ (line 7); ii) we sortedly enumerate candidate F- dominated tuples from PO in reverse order (line 5), as the worst tuples wrt. the ordering are the most likely to be F-dominated; iii) using linear system (14), we check the existence of a convex combination of the first σ̃ tuples in PO (line 6), as they are the best wrt. the ordering and thus more likely to F-dominate other tuples. In the last round (enabled by line 4) all the remaining tuples are checked against a convex combination of all the other tuples still in po, which is now a necessary and sufficient condition for pruning, as in Theorem 6. In [2], efficiency and effectiveness of the proposed algorithms have been mea- sured on both real and synthetic scenarios with varying i) data distribution, ii) dataset size, iii) number of dimensions, and iv) number of constraints. With default settings (anticorrelated data, N = 100K tuples, d = 6 dimensions, F = LC1 family with |C| = 3 constraints), our algorithms incur sub-second exe- cution times. Such times increase as N increases, d increases and |C| decreases. Both nd and po reduce the size of the result wrt. Sky (Fig. 2a). For default values, their effectiveness is remarkable (9.8% of the Sky points are in nd and only 1% in po). Somewhat surprisingly, computing nd via SVE1F may require much less time than computing Sky, because F-dominance tests, albeit more costly than dominance tests, are more effective in pruning redundant tuples. ��� ��������� ��� ������ ��� (�����)/��� ����� ��� ��� ���(���) ��� ��� ���(��) ��� ⨯ �� ��� ���(��) ���(���) ○ �� ��� ⨯○ ⨯○ ⨯○ ⨯○ ⨯○ ��� ��� ���(��) ���(��) ��� ��� ��� ���� ���� �� � �� ��� |��| �� |��| ��� |���|���� ���� � (a) |nd|/|Sky| and |po|/|Sky| cardinality (b) Precision and recall of top-k queries wrt. Sky, ratios as dataset size N varies. nd, and po. Centroid of the W(C) as weight vector; N = 100K tuples in the dataset. Fig. 2: R-skylines vs. skyline and ranking queries (anticorrelated data with d = 6, F = LC 1 , |C| = 3). Let Tk (r; f ) indicate the set of top-k tuples in relation r wrt. a scoring function f . The precision of Tk (r; f ) wrt. a set S is defined as pre(S) = 7 |S ∩ Tk (r; f )|/k, whereas the recall is rec(S) = |S ∩ Tk (r; f )|/|S|. Figure 2b shows precision and recall when f is the most representative weighted sum in F (i.e., f ’s weight vector is the centroid of W(C)). Recall grows with k, while precision decreases. With standard parameter values, when k = |S| (and thus precision and recall are equal), rec(nd) is 50%, while rec(po) is 38%. Retriev- ing the entire nd with a top-k query may however require to scan almost the entire dataset. As expected, top-k queries incur a small fraction of the execution time of SVE1F (below 7%). 5 Discussion In this paper, we have presented a framework aiming to unify skyline and ranking queries. We have done so by introducing two R-skyline operators implementing the notions of non-dominated (nd) and potentially optimal (po) tuples with respect to a set of scoring functions F. The greater flexibility of these operators captures not only standard skyline and top-1 queries, but also constraints on the weights in the scoring functions, which are highly relevant in practice. It is well known that choosing the “right” weights for a scoring function is a difficult task for users, since it is usually hard to predict the effects on ranking of changing one or more parameters. Replacing precise values with constraints on weights, as R-skylines do, is therefore a viable way to alleviate the problem. We have shown that both nd and po are very effective in focusing on tuples of interest, even in very large datasets. In practical cases, computing nd requires no more than a few seconds. For the more complex problem of computing po, we have developed heuristics to reduce the number of LP problems to be solved. Natural extensions of this framework include the notions of top-k query and k-skyband, with k > 1 [6]. References 1. J. Chomicki, P. Ciaccia, and N. Meneghetti. Skyline queries, front and back. SIG- MOD Record, 42(3):6–18, 2013. 2. P. Ciaccia and D. Martinenghi. Reconciling skyline and ranking queries. PVLDB, 10(11):1454–1465, 2017. 3. E. Ciceri et al. Crowdsourcing for top-k query processing over uncertain data. TKDE, 28(1):41–53, 2016. 4. A. A. Freitas. A critical review of multi-objective optimization in data mining: a position paper. SIGKDD Explorations, 6(2):77–86, 2004. 5. D. Mindolin and J. Chomicki. Preference elicitation in prioritized skyline queries. VLDB J., 20(2):157–182, 2011. 6. D. Papadias et al. Progressive skyline computation in database systems. TODS, 30(1):41–82, 2005. 7. M. A. Soliman et al. Ranking with uncertain scoring functions: semantics and sensitivity measures. In SIGMOD, pages 805–816, 2011. 8