Flexible Score Aggregation (Discussion Paper) Paolo Ciaccia1 and Davide Martinenghi2 1 Università di Bologna, Italy, email: paolo.ciaccia@unibo.it 2 Politecnico di Milano, Italy, email: davide.martinenghi@polimi.it Abstract. Ranking objects according to different criteria is a central issue in many data-intensive applications. Yet, no existing solution deals with the case of partially specified score aggregation functions (e.g., a weighted sum with no precisely known weight values). We address multi-source top-k queries with constraints (rather than pre- cise values) on the weights. Our solution is instance optimal and provides increased flexibility with negligible overhead wrt classical top-k queries. 1 Introduction Looking for information relevant to decide about the goodness of items of inter- est (hotels, restaurants, products, services, etc.) has become a common activity, supported by the so-called multi-source top-k queries [6]. For instance, the prob- lem of choosing where to eat (possibly with constraints on price, food type, and location) may be addressed by aggregating different pieces of information from several sources. It is custom to assume that each such source provides a ranked list of results (from best to worst), each coming with a (local) score. Accordingly, objects can be either retrieved sequentially via a sorted access or by providing the object’s identifier and obtaining the corresponding local score (random ac- cess). Then, for each object o in the dataset, a monotone scoring function f is used to aggregate the local scores yielding the (overall) score f (o). ahjdfbeicg i agjchfdeb cfeidabghj R1 9 8 8 7 7 6 6 5 4 4 R2 10 9 8 8 7 6 5 5 5 4 R3 9 8 7 7 6 5 5 5 4 3 Fig. 1: A set of ten restaurants ranked by three sources. Example 1. Consider the scenario depicted in Figure 1, where ten restaurants, a, . . . , j, are ordered by d = 3 rankings. Assume that the overall score is a Pd weighted sum of the local scores, i.e., f (o) = i=1 wi · o[i], where o[i] is the score assigned by Ri to object o (e.g., c[2] = 7), and that the k = 2 best restaurants are sought. In particular, using the weight vector W1 = (1/3, 1/3, 1/3), the top-2 restaurants are a and i, with f (a) ≈ 7.67 and f (i) ≈ 7.33. A different choice of weights, say W2 = (0.2, 0.3, 0.5), results in i and c being the top-2 restaurants. Choosing precise weights in the scoring function is admittedly difficult, and might be the result of learning from users, e.g., via crowdsourcing [4]. A more Copyright c 2019 for the individual papers by the papers authors. Copying permit- ted for private and academic purposes. This volume is published and copyrighted by its editors. SEBD 2019, June 16-19, 2019, Castiglione della Pescaia, Italy. viable alternative is to specify weights only partially through a set of constraints. With reference to Example 1, instead of having fixed weights, it might be in- teresting to “explore” all the alternatives arising when weights vary around W2 , by, say, ±0.1. Then, only a, c, f, and i can be the possible (top-2) answers. We study the problem of answering multi-source top-k queries with partially specified weights. This leads to a family F of scoring functions for ranking ob- jects. Existing solutions to the problem with a single scoring function are Fagin’s algorithm (FA) [6] and the threshold algorithm (TA) [7]. The latter was shown to perform better and to achieve optimal I/O performance. Our proposal, FSA, solves the general case and reduces to TA for classical top-k queries (i.e, when F = {f }) and is instance optimal for any F. When no constraints are present, FSA behaves like FA and returns the so-called k-skyband [9], the case k = 1 being the skyline of the dataset [1]. This paper summarizes the main contributions presented in [3], namely 1) a provably correct and instance optimal algorithm, FSA, for answering multi-source top-k queries, 2) several optimization opportunities for improving the execution time of FSA, 3) experimental evidence that FSA enjoys the flexibility of partially specified weights while incurring little overhead wrt top-k queries. 2 Problem statement Consider a dataset D = {o1 , . . . , oN } where each object o ∈ D has d ≥ 2 partial scores, (o[1], . . . , o[d]). For any scoring function f and dataset D, a top-k query on D using f returns a set of k objects with the highest scores according to f (ties arbitrarily broken). Given objects o and o0 , if o[i] ≥ o0 [i] holds for all i = 1, . . . , d, at least once strictly, then o dominates o0 , written o  o0 . Notice that o  o0 implies f (o) ≥ f (o0 ) ∀f ∈ M (M being the set of all monotone scoring functions). The skyline of D is the set of undominated objects in D: Sky(D) = {o ∈ D | @o0 ∈ D. o0  o}, (1) but can also be viewed as the set of objects that can be top-1 [8], i.e.: o ∈ Sky(D) ⇐⇒ ∃f ∈ M. ∀o0 ∈ D. o0 6= o → f (o) > f (o0 ). The k-skyband of D, written Skyk (D), is the set of all objects in D that are dominated by less than k objects [9], and thus includes all possible top-k sets. As a generalization of dominance for a family F of monotone scoring functions ([2]), we say that object o F-dominates object o0 , o 6= o0 , denoted by o F o0 , iff ∀f ∈ F. f (o) ≥ f (o0 ) and ∃f ∈ F. f (o) > f (o0 ). We define the non-dominated restricted skyband of D wrt F, denoted ndk (D; F), as the set of objects in D that are F-dominated by less than k objects: ndk (D; F) = {o ∈ D | @o1 , . . . , ok ∈ D. o1 F o ∧ . . . ∧ ok F o} (2) Clearly, ndk (D; F) ⊆ Skyk (D). When F reduces to a single function f , then ndk (D, {f }) includes all possible top-k results for f (whereas, in case of ties, top-k queries may nondeterministically discard tuples). From (2) it is evident that if o 6∈ ndk (D; F), then no scoring function f ∈ F can make o top-k. We now address the problem of efficiently computing ndk (D; F) in a multi-source scenario, for any given dataset D and set of monotone scoring functions F. 3 Flexible score aggregation with FSA Fagin’s algorithm FA was the first to address multi-source top-k queries for ar- bitrary scoring functions. It consists of three phases: 1) sorted accesses are exe- cuted; 2) random accesses are executed on all objects met in phase 1; 3) scores of all fetched objects are computed using a scoring function f and the top-k result determined. The stopping condition for phase 1 is to have seen at least k objects on all the d rankings. With the scenario depicted in Figure 1, FA would stop at depth 7 (after 7 sorted accesses per ranking), after seeing a and f on all rankings. The FA algorithm can be adapted to compute Skyk (D), since it stops phase 1 independently of f . Therefore, the set S of objects seen when the algorithm stops includes the top-k objects according to any possible monotonic scoring function (and, thus, Skyk (D), which, in turn, includes ndk (D; F)). The TA algorithm uses a threshold value f (τ ) computed by applying the specific scoring function f to the tuple τ of last scores seen by sorted access on each ranking, which we here call the threshold point. A further difference with respect to FA is that random accesses are not delayed to a separate phase: after performing d sorted accesses (one per ranking), all the necessary random accesses are executed. The algorithm stops when at least k seen objects have a global score no worse than f (τ ). Thus, for each object o in the result, f (o) ≥ f (τ ) holds. In Figure 1, with k = 2 and weights W2 = (0.2, 0.3, 0.5) TA would stop at depth 4, where f (τ ) = 7.3, since f (i) = 7.5 and f (c) = 7.4. Being tailored to a specific f , the stopping criterion of TA is (much) more efficient than the one used by FA, but now, when TA stops, S is only guaranteed to contain the top-k objects according to f , but not necessarily according to other scoring functions. Let us say that o weakly F-dominates o0 , o 6= o0 iff ∀f ∈ F. f (o) ≥ f (o0 ). Then, we can unify FA and TA according to the following observation. Observation 1 FA stops executing sorted accesses when k seen objects o1 , . . . , ok weakly M-dominate τ . TA stops when k seen objects weakly {f }-dominate τ . Our proposal, FSA, applies the notion of F-dominance wrt τ , which includes (M-)dominance and {f }-dominance as special cases. The extraction performed by FSA stops as soon as k objects are seen that F-dominate τ , as all unseen objects are also necessarily F-dominated by these k objects. Notice that FSA uses F-dominance rather than weak F-dominance for not discarding ties from ndk (D; F). The pseudocode of FSA is presented in Figure 1. Note that the descent performed by FSA alternates a single sorted access on the i-th ranking with d − 1 random accesses on the other rankings. This differs from the strategy adopted by TA, which, at each depth, first performs d sorted accesses (one per ranking), thereby retrieving up to d unseen objects, and then completes the scores for these objects with up to d · (d − 1) random accesses. However, both algorithms stop at the same depth. Also note that FA performs all random accesses at the end; however, such an approach would not be possible for TA or FSA, since {f }-dominance and F-dominance generally require knowing all the scores of the objects being compared. Algorithm 1: FSA algorithmic pattern for computing ndk . Input: Rankings R1 , . . . , Rd for D, family of scoring functions F, integer k > 0. Output: ndk (D; F). 1. let S := ∅ // Seen objects 2. do 3. for i in 1 . . . d 4. let ho, σi := sortedAccess(Ri ) // Do a sorted access in each ranking Ri 5. τ [i] := σ // Update score for threshold τ 6. if o ∈ / S then // Object o is new 7. o[i] := σ // Save score for o 8. for j in 1 . . . d // Extract all other scores for o via random access 9. if j 6= i then o[j] := randomAccess(Rj , o) 10. if @o1 , . . . , ok ∈ S. o1 F o ∧ . . . ∧ ok F o then handle(o) 11. while @o1 , . . . , ok ∈ S. o1 F τ ∧ . . . ∧ ok F τ // Stop if k objects F τ 12. clean() // Remove from S all objects F-dominated by ≥ k others 13. return S Let I = hD, F, ki indicate an instance of the problem of computing ndk (D; F) wrt a set of monotone scoring functions F and let I indicate the class of all such problems. An algorithm A is correct (for the computation of ndk ) if, for each instance I = hD, F, ki ∈ I, A returns ndk (D; F). Theorem 1. FSA is correct for the computation of ndk . Performance in top-k contexts is commonly characterized by the number of objects accessed by sorted access by an algorithm A on an instance I, which we indicate as sumDepths(A, I). Algorithm A is instance optimal if there exist constants c1 and c2 such that sumDepths(A, I) ≤ c1 · sumDepths(A0 , I) + c2 for every correct algorithm A0 and instance I ∈ I. Theorem 2. FSA is instance optimal. 4 Algorithmic variants We now discuss efficient implementations of P FSA and focus on the case in which d each scoring function is of the form f (o) = i=1 wi gi (o[i]), where the wi ’s are weights subject to a set C of linear constraints and the gi ’s are monotone. This form includes, e.g., weighted linear sums and weighted Lp norms. The constraints may be of several kinds (see [5] for an overview), such as 1. weak rankings: w1 ≥ w2 ≥ . . . ≥ wd (relative importance of attributes); 2. ratio bounds: w̄i (1−ε) ≤ wi ≤ w̄i (1+ε), 1 ≤ i ≤ d (spread around a weight). Without loss of generality, we consider the space of normalized weight Pd vectors W ⊆ [0, 1]d , such that, for each W = (w1 , . . . , wd ) ∈ W, we have i=1 wi = 1; W(C) denotes the subset of W of weight vectors satisfying constraints C. F-dominance test. There are two main approaches to efficiently check F- dominance [2]. One approach requires solving a linear programming problem encoding the notion of F-dominance. This may be expensive, since every F- dominance test requires solving a different LP problem. A better approach, which we choose for FSA, is based on the F-dominance region of an object o, i.e., the set of all points that o F-dominates. Such a region needs to be computed only once for all the tests in which o is the candidate F-dominant object. Testing membership to the F-dominance region is done by i) computing the vertices of the convex polytope W(C), and ii) checking the following inequalities: d d (`) (`) X X wi gi (o[i]) ≥ wi gi (o0 [i]), ` ∈ {1, . . . , q}, (3) i=1 i=1 (`) (`) where each W (`) = (w1 , . . . , wd ), for 1 ≤ ` ≤ q, is a vertex of W(C), and at least one inequality is strict. Computing vertices may be expensive, but needs to be done only once per dataset. Memoization of the vertex scores. We call the lhs of (3) the vertex score of o for W (`) , denoted by v` (o) (similarly for o0 in the rhs). Each vertex score, once computed, is remembered for subsequent tests (3) involving the same object. We adopt this technique for FSA and call it memoization. Sorting. In FSA, we keep the seen objects topologically sorted wrt the F- dominance relation, so that, in a sequence o1 , . . . , on , no object oi can be F- dominated by an object oj if j > i. We do so by sorting according to weights in the centroid of W(C). Then, when checking whether an object o should be added to the set S of seen objects (line 10 of Algorithm 1), we only look for k objects F-dominating o among those preceding o in the topological sort. Pruning. In FSA, object removal can occur within either handle or clean. With lazy pruning, we do not remove objects in handle and wait to do it when clean is called. With eager pruning, when an object o is inserted in S, handle also removes all objects in S that are F-dominated by o and were already F- dominated by k − 1 other objects; clean does nothing. depth = 3 depth = 4 depth = 5 ` v` (a) v` (i) v` (c) v` (h) v` (f) v` (j) v` (g) v` (e) v` (d) v` (τ3 ) V` v` (τ4 ) V` v` (τ5 ) V` 1 6.6 7.7 7.9 5.0 7.0 5.0 5.8 6.3 5.8 7.4 [f, a] 7.3 [f, a] 6.4 [] 2 7.0 8.0 7.7 5.2 6.7 5.5 6.1 6.1 5.7 7.5 [a, f] 7.4 [a, f] 6.5 [] 3 7.4 7.8 7.2 5.6 6.6 6.0 6.0 6.0 5.8 7.6 [a, c, f] 7.4 [c, f] 6.6 [] 4 7.4 7.3 6.9 5.8 6.8 6.0 5.6 6.1 6.0 7.6 [a, i, c, f] 7.3 [c, f] 6.6 [] 5 6.6 7.2 7.6 5.2 7.2 5.0 5.4 6.4 6.0 7.4 [i, f, a] 7.2 [a] 6.4 [] 6 7.0 7.0 7.1 5.6 7.1 5.5 5.3 6.3 6.1 7.5 [c, f, a, i] 7.2 [c, f, a, i] 6.5 [] Table 1: Vertex scores for Examples 1 and 2, along with threshold point’s vertex scores and vertex lists at depth 3, 4, and 5. Vertex lists. We observe that τ is F-dominated by an object o iff all vertex scores of o are no less than the corresponding vertex scores of τ , and at least one is greater. We then maintain, for each vertex W (`) , a list V` of seen ob- jects whose vertex score is still lower than τ ’s vertex score v` (τ ). An object o is added to V` within the call to handle(o) and, in V` , the objects are kept sorted by vertex score. Additionally, for each object o, we maintain a mask of q bits B1 (o), . . . , Bq (o) such that B` (o) = 1 iff v` (o) ≥ v` (τ ), plus one extra bit B>(o) that is set to 1 iff at least one inequality holds strictly. Analogously, we also main- tain a list V> of those objects o such that B>(o) = 0. In this way, when checking which objects F-dominate τ , it suffices to consider, for each ` ∈ {1, . . . , q}, only the objects in V` ∪ V> and only if v` (τ ) has changed since the last check. Example 2. Consider again the scenario of Example 1, with k = 2 and weight vector (0.2, 0.3, 0.5) ±0.1 on each weight. The vertices of W(C) are W (1) = (0.1, 0.3, 0.6), W (2) = (0.1, 0.4, 0.5), W (3) = (0.2, 0.4, 0.4), W (4) = (0.3, 0.3, 0.4), W (5) = (0.2, 0.2, 0.6), W (6) = (0.3, 0.2, 0.5), with centroid (0.2, 0.3, 0.5). At depth 3, the seen objects are a, i, c, f (which will all appear in the result) and h, j, g, e, which are all F-dominated by both i and c and thus not re- tained. The objects are inserted sortedly in S according to the centroid, with the order: i, c, a, f. Table 1 shows the vertex scores, orderly computed as Pd (`) i=1 wi o[i] (b is not even accessed). To see that, e.g., i F h we observe that v` (i) ≥ v` (h), for 1 ≤ ` ≤ q (and v1 (i) > v1 (h)). At this point, the seen objects S already coincide with nd2 , but FSA cannot stop yet. The threshold point is τ3 = (8, 8, 7); its vertex scores are shown in Table 1, along with the correspond- ing vertex lists (of objects that still have a vertex score lower than τ3 ’s). The bit masks B>(o)B1 (o) . . . B6 (o) are computed accordingly; e.g., we have 1110010 for c since c has a worse vertex score than τ3 on vertices W (3) , W (4) , and W (6) , and thus is present in vertex lists V3 , V4 , and V6 . At depth 4, τ4 = (7, 8, 7), but still no object F-dominates it. Finally, at depth 5, τ5 = (7, 7, 6), which is now F-dominated by all of i, c, a, f, and then FSA can stop. The subsequent clean procedure does not eliminate anything, since none of the remaining objects is F-dominated by k = 2 other objects, thus nd2 (D; F) = {i, c, a, f}. 5 Experiments In this section we evaluate the performance of our implementations of FSA. In particular, we quantify the overhead incurred by the increased flexibility of FSA when compared to the computation of plain k-skybands and top-k queries. For our experiments, we consider three kinds of data distribution (anticorrelated, uniform, and real)1 and scoring functions taken from weighted L1 norms with ratio bounds constraints applied on the weights (1/d(1 − ε) ≤ wi ≤ 1/d(1 + ε)). We report on both effectiveness and efficiency in different scenarios varying by number of rankings d, spread ε, and k, with defaults 4, 20%, and 10, resp. In order to evaluate the effectiveness of our approach, we compare it to k- skybands, which also retrieve non(-F)-dominated objects but in a way that is agnostic with respect to the set of scoring functions F. The introduction of F drastically helps to reduce the result set. Figure 2 shows this effect for NBA. In Figure 2a, we vary the spread parameter ε used in ratio bounds constraints and keep default values for all other parameters. As can be seen, the result sets returned by ndk and Skyk coincide (985 objects) only when the spread is so high 1 NBA: 190862 tuples reporting measures of basketball players’ performance, available at www.quantifan.com; ANT and UNI have 100K tuples. that each weight can vary in the entire [0, 1] interval (“full”). For all other values of ε, the result set of ndk is much smaller (from 71 objects when ε = 50%, to only 12 for ε = 1%). Note that, when ε = 0% (“none”), all weights equal 1/d exactly and, thus, there is only one scoring function in F; in that case, the cardinality of the result set is exactly k = 10. Figure 2b also reports the cardinality of the result sets, but for different values of k (defaults otherwise). The result set returned by Skyk is always almost two orders of magnitude larger than that of ndk , the ratio becoming larger as k increases (from 157/2 when k = 1 to 6076/191 when k = 100). Finally, Figure 2c shows this when the number of rankings d varies (defaults otherwise). As d increases, the ratio between the cardinalities of the results increases (from 89/15 when d = 2 to 27403/63 when d = 8). Similar trends can be observed for the UNI and ANT datasets (omitted in the interest of space). Overall, this shows that, thanks to F, ndk is able to keep the result set within a reasonable cardinality, even for larger or more challenging problem instances, while the cardinality of Skyk soon diverges. ⨯⨯⨯⨯⨯⨯⨯⨯◇ ⨯ ⨯⨯ ��� ⨯ ⨯ ���� ��� ⨯ ⨯⨯ ��� ����������� ��� ����������� ����������� ���� ��� ◇ ◇ �� � ⨯ ��� ◇ ◇◇ ◇ ◇ �� � ���� ⨯⨯ ◇ �� � ⨯ ⨯ ⨯◇ ◇ ◇ ◇ ⨯��� �� ◇ ◇ ���� �� ◇ ◇ ���� ��� ◇ � �� ◇◇◇◇ ���� �% �% �% ��%��%��% ���� � ◇� � � �� �� �� ��� �� � � � � � ε (������) � ���������� (a) Spread ε varies. (b) k varies. (c) Number of rankings d varies. Fig. 2: Cardinality of results for the NBA dataset. Varying parameter: ε in (a), k in (b), d in (c). We consider two variants of the FSA pattern: VEL+ and VEE+, using lazy and, resp., eager pruning, and test them in terms of execution time. Effect of number of rankings d. Both variants exhibit sub-second execution times in default conditions. The “vertex lists” optimization grants savings of up to one order of magnitude, especially for more complex problem instances. Indeed, for d > 4 the number q of vertices increases dramatically, from q = 6 with d = 4 to q = 70 with d = 8. Yet, our optimization allows skipping many (already verified) vertex score comparisons between objects and the threshold point, thus leading to a linear behavior as d varies. Effect of k and spread ε. Figure 3 shows execution times as k and ε vary in all datasets. Clearly, in all scenarios, the problem becomes more challeng- ing as k increases and as ε increases, since, in both cases, more objects need to be extracted and compared. VEL+ and VEE+ show similar performance, with VEE+ slightly prevailing on more challenging instances (e.g., when k = 100 or ε =“full”). In such cases, many objects are kept by a lazy strategy and thus the call to the clean procedure weighs more heavily than the eager pruning of ob- jects. In the other cases, the extra cost of pruning objects early is balanced with the reduced cost of calling clean. We also observe that, in most scenarios, VEE+ and VEL+ incur sub-second execution times, and only exceed 2s when ε =“full” (a case in which the result coincides with the k-skyband) in the ANT dataset. ��� △ ��� △ ◇ ◇ ��� ◇ ���� ���� △ ���� (�) ���� (�) ���� (�) ��� ��� △ ◇ △ ���+ ��� △ ◇ △ ���+ ���� ◇ △ △ ���+ ��� ◇ ���+ ◇ ���+ ◇ ◇ ���+ △ ◇ △ △ ◇ △ ���� ��� △ ◇ ◇ △ ◇ △ ◇ ��� △ ◇ ◇ △ ◇ △ ◇ ���� △◇ ◇ △◇ △ ◇ △ △ � � � �� �� �� ��� � � � �� �� �� ��� � � � �� �� �� ��� � � � (a) UNI: k varies. (b) ANT: k varies. (c) NBA: k varies. △ �� △ △ ���� ���� ◇ �� ◇ ��� ◇ ���� (�) ���� (�) ���� (�) �� ��� ���� ���� △ ���+ �� △ ���+ ��� △ ���+ ���� ◇ ���+ � ◇ ���+ ◇ ���+ ���� ◇ △◇ △◇ △◇ △◇ △◇ △◇ △ � △◇ ◇ △◇ △◇ △◇ △◇ △ △◇ ��� △◇ ◇ △◇ △◇ △◇ △◇ △ △◇ �����% �% �% ��%��%��% ���� ���� �% �% �% ��%��%��% ���� ���� �% �% �% ��%��%��% ���� ε (������) ε (������) ε (������) (d) UNI: spread ε varies. (e) ANT: spread ε varies. (f) NBA: spread ε varies. Fig. 3: CPU times for computing ndk . Datasets: UNI in (a),(d); ANT in (b),(e); NBA in (c),(f). 6 Conclusions We have presented a framework for multi-source top-k queries defined by scoring functions with partially specified weights. This provides the flexibility required in many application scenarios (e.g., online services, crowdsourcing, etc.). Our so- lution, FSA, is a provably instance optimal algorithmic pattern that encompasses both top-k queries (attaining comparable performance) and k-skyband queries. References 1. S. Börzsönyi, D. Kossmann, and K. Stocker. The skyline operator. In ICDE, pages 421–430, 2001. 2. P. Ciaccia and D. Martinenghi. Reconciling skyline and ranking queries. PVLDB, 10(11):1454–1465, 2017. 3. P. Ciaccia and D. Martinenghi. FA + TA < FSA: Flexible score aggregation. In CIKM 2018, pages 57–66, 2018. 4. E. Ciceri et al. Crowdsourcing for top-k query processing over uncertain data. TKDE, 28(1):41–53, 2016. 5. Y. S. Eum et al. Establishing dominance and potential optimality in multi-criteria analysis with imprecise weight and value. Computers & Op. Res., 28:397–409, 2001. 6. R. Fagin. Combining fuzzy information from multiple systems. In PODS, pages 216–226, 1996. 7. R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. In PODS, 2001. 8. N. Meneghetti et al. Output-sensitive evaluation of prioritized skyline queries. In SIGMOD, pages 1955–1967, 2015. 9. D. Papadias et al. Progressive skyline computation in database systems. TODS, 30(1):41–82, 2005.