ceur-ws.org/Vol-2399/paper15.pdf Computational aspects around preference queries Karim Alami supervised by Sofian Maabout Univ. Bordeaux, CNRS, LaBRI, UMR 5800, F-33400 Talence, France karim.alami@u-bordeaux.fr ABSTRACT user ,and consequently, avoids him an endless comparison of Preference queries present two main challenges: difficulty data. for users to express their preferences and the computational The information retrieval regarding user preferences have complexity. For skyline queries, the preferences can be on historically been a ranking problem, i.e., given a set of key attributes, e.g., some user may look for the best flights re- words, retrieve the elements that ”best” match these key- garding price and number of stops, and others may look words, we call it often ”top-k” query . Its adaptation to for the best flights regarding number of stops and dura- relational databases has been addressed in [7]. Users are tion. In addition, preference can be expressed as a (par- requested to put weights on the attributes in order to select tial) order on attributes domains, e.g., some user may pre- points that have the higher values on attributes with the fer flight company A over B while another one may have higher weights. Top-k queries gives the advantage of con- the opposite preference. For top-k queries, users define a trolling the size of the output, however the selection of the score function to rank objects, e.g., users who give more weights remain very hard. There has been several works to importance to price could define the following score func- simplify this process, e.g., selection of a range of weights, tion: price ⇤ 2 + duration of the flight. In general, several elicitation and regret minimizing, among others. Authors in rounds are required before converging towards a satisfying [6] proposed the skyline operator as an alternative to rank- answer where at each round, more precise preferences are ing queries. A skyline query returns a set of points which are given by the user. This is due to the difficulty to figure out not dominated by any other point of the dataset. A point x the precise formulation of the user’s preferences. Therefore, dominates a point y i↵ x is better or equal on all attributes a more or less high number of queries need to be evaluated. and strictly better on at least one attribute. The skyline Our research work aims to make these queries answering query provides the advantage to not rely on a score function faster through dedicated index structures and precomputed however, the size of the output is not controllable and it views. The main challenges when adopting this strategy are requires a quadratic computational time regarding the size (i) lightweight memory consumption and (ii) fast mainte- of the dataset. Many studies have been done to optimize nance process. Our first step was NSC, an index structure the skyline query execution time either by optimizing the that optimizes skyline queries. However, the structure was number of comparisons or by indexation. designed for a static context making it unsuitable when data In previous work [9], the structure NSC has been presented can be inserted/deleted. We redesigned NSC to cope with as an index to optimize skyline queries. Its main idea con- dynamic data and in some cases, we proposed further ap- sists in comparing every record r of a dataset to all remain- proaches when the structure is not suitable. In this paper, ing records r0 and store the subspaces where r0 dominates r we summarize our previous contributions and present some in a form of pair of subspaces compare(r, r0 ) = hX|Y i where perspective research regarding the link between regret min- X represents the attributes where r0 is strictly better than imization queries and what we did so far. r and Y represents the attributes where r and r0 are equal. Now given a subspace Z, a record r belongs to Sky(Z), i.e. the skyline over the subspace Z, if and only if there does 1. INTRODUCTION not exist a pair hX|Y i associated to r that covers Z, i.e. Preference queries aim to retrieve points among a set of Z ✓ XY and Z 6= Y . We denote by cover(hX|Y i) the points that can be considered interesting regarding the pref- set of subspaces covered by hX|Y i. For example, consider erences of the user over a set of parameters. They are effi- a dataset with attributes A, B and C. The pair hAB|Ci cient tools which reduce the amount of data returned to the covers the subspaces {A, B, AB, AC, BC, ABC}. The time complexity of NSC is quadratic wrt the size of the dataset as well as the space complexity. However, not every pair should be kept. Let P airs(r) be the set of pairs associated to r, this set can be minimized by computing a subset Q ✓ P airs(r) such that cover(Q) = cover(P airs(r)), i.e. the set of sub- spaces covered by P airs(r) are covered by Q as well. We say that Q is an equivalent subset of P airs(r), Q ⌘ P airs(r). Proceedings of the VLDB 2019 PhD Workshop, August 26th, 2019. Los Angeles, California The minimization problem is NP-Hard and a polynomial Copyright (C) 2019 for this paper by its authors. Copying permitted for greedy approximate algorithm has been proposed. private and academic purposes. In the literature, works that optimize skyline queries can in P airs(r) and its counter set to 1. In theory, this ap- be divided in three groups, works that (i) design fast al- proach could be seen as building the structure from scratch gorithms without precomputation, (ii) design index struc- at each deletion, however, in practice, only a small fraction tures (iii) materialize the results. We note BSkyTree [11] the of the dataset requires their set of pairs to be recomputed. state of the art algorithm to process skyline queries without In terms of time complexity, let the size of the dataset be n. precomputing. We note HashCube [4], a bitmap like index Let I be the set of records for which their respective set of structure which associates a 2d Boolean vector v to every pairs requires to be recomputed. Identifying I takes O(n) record r where d is the number of attributes. v[i] is set i↵ r time and recomputing P airs(r) 8r 2 I takes O(|I| · n). Re- belongs to Sky(Z) such that Z is encoded by i. HashCube garding memory consumption, NSC size is augmented by a is highly efficient for skyline query answering, however, au- constant due to the additional counter. thors only proposed an algorithm to construct the structure Now for insertions, let r+ be a record to be inserted then from scratch in [5]. No obvious maintenance procedure is we compute P airs(r+ ). Regarding the existing records in noticed. Materialized skyline views are very time and mem- the dataset, let r be one, we compute p+ = compare(r, r+ ). ory consuming as the number of skylines wrt to a dataset However, P airs(r) is already a minimal set of pairs, hence is exponential to the number of dimension. However, they we address the problem: should P airs(r) [ compare(r, r+ ) provide the best query answering performance. be minimized now? or should the minimization process be The experiments performed in [9] assess the construction triggered after a number of pairs appended? In absence of time and memory performance of NSC against its competi- an incremental greedy algorithm, we propose a linear algo- tors as well as query execution time. However, the structure rithm wrt the size of P airs(r), called min inclusion, which was designed for a static context. An insertion/deletion of a takes every pair p 2 P airs(r) and compare it to p+ . The record or a change in an attribute domain order may require pair covered by the other one is discarded. This algorithm rebuilding the structure from scratch. does not provide any approximation guarantee. However, For this thesis, we studied the ability of NSC to deal with in practice, min inclusion allows a good compression ratio updates. Precisely, we addressed its maintenance in case wrt the greedy algorithm. of (i) dynamic data, i.e., points are removed and inserted at any time and (ii) streaming data, i.e., append-only database 2.2 Streaming Data and sliding window model. We redesigned the structure to We consider the following stream model. Records are cope with each context. Presently, we are working on skyline streamed at regular interval time ✓. They have a validity queries optimization over a dataset with nominal attributes period of size ! which can also be seen as a sliding window and dynamic order. We found that NSC is not suitable for over the dataset. Hence, records are considered outdated ! this context, hence we established a novel approach based on timestamps after their arrival time. The semantic of con- views. Finally, we aim to investigate relationship between tinuous queries [15] states that the query result should be multidimensional skylines and regret minimization queries. available and accurate at each time. However, this condition is hardly attainable, especially for skyline queries. We point out two main difficulties for continuous skyline queries. Let 2. SKYLINE QUERIES OPTIMIZATION IN Qsky be a skyline query, then (i) a record r belonging to PRESENCE OF UPDATES Ans(Qsky ) at a timestamp t may leave Ans(Qsky ) at times- tamp t0 > t if a streamed record r0 dominates r. And (ii) a In the following, we present the adaptation of NSC for, record r can join Ans(Qsky ) at timestamp t0 later than its first, dynamic data, i.e., insertion/deletion of one or multiple arrival time if skyline records that dominate r get outdated. records at any time, then, streaming data, i.e., insertion of We note the contribution of [14]. Authors propose to main- records at regular intervals. We studied separately both tain a skyline set DBSky and a set of potentially skyline contexts because they present di↵erent challenges. records DBRest, i.e., records which may join DBSky in the future. They propose an algorithm to maintain these sets 2.1 Dynamic Data together with an event list recording timestamps where a In the case of dynamic data, i.e., insertion/deletion of one record in DBRest may join DBSky. Still, this approach or multiple records at any time, NSC should be updated in makes a big number of records comparison which makes it accordance to the new dataset. The baseline approach is to non scalable wrt dimensionality and data cardinality. More- build NSC structure from scratch, however this approach is over it maintains a single skyline query, hence maintaining costly. several skylines may require an exponential space wrt to the Related work pointed out that dealing with deletion is number of attribute. We note as well the approach presented harder than insertion. We note the work [17] which ad- in [12] which is based on R-trees. Regarding NSC structure, dressed the maintenance of a materialized skyline view in the approaches presented for dynamic data are not suitable case of deletion. For NSC, deletion is a hard task as well. for two main reasons: (i) the timestamp where a record is Let r be the deleted record then for every record r, we no more valid is known and (ii) the frequency of streaming remove from P airs(r) the pair p computed wrt r . How- is often high. Generally, real time data processing is hard ever, p may have exclusively covered some pairs that have in streaming context. Batch processing mode provides a been deleted from P airs(r) during the minimization pro- larger time to process data, however, the query result is not cess. Recovering these pairs requires recomputing P airs(r) accurate with the current dataset. We proposed a frame- from scratch. We propose a solution that detects when work called M SSD which handles three data structure, (i) P airs(r) should be recomputed. For every distinct pair p a bu↵er B, (ii) a main dataset R and (iii) NSCt, a redesigned in P airs(r), we set a counter of how many records are the version for NSC. Incoming records are first bu↵ered during source of p. Hence, we recompute P airs(r) only if p is an interval of time of size k, then inserted into the main dataset R. NSCt only indexes records in R, hence, queries product decomposition of Q. We adopted a novel approach evaluated through NSCt, consider only records in R. We which theoretically and experimentally provided better re- sacrifice the accuracy of the queries, nonetheless, we ensure sults than the above approaches. We define the single partial a fast maintenance that allows faster query answering than order spo over an attribute domain Dom(A) as a partial or- state of the art works. Next we explain the main novelties der where only two values in Dom(A) are comparable and all for NSCt. We organize the set of pairs of a record r as a se- other values are incomparable. Given a query Q with user quence of buckets of pairs P airs(r) = [Buck1 , . . . , Buckm ], preference Q.R, Ans(Q) is the intersection of the skylines where each bucket corresponds to the pairs computed wrt a wrt spos composing Q.R. Every spo skyline query is consid- batch of data. We adopt this approach for the following pur- ered a view whether it is materialized or not. Our experi- pose. Let b1 and b2 be two batches such that their respective ments showed that answering a query through non material- timestamp are T S(b1 ) and T S(b2 ), and T S(b1 ) < T S(b2 ). ized spo views is even faster than online algorithms, mainly As we consider a sliding window data model with an inter- because skyline algorithms are highly weakened by dimen- val of size !, records in b1 get outdated before records in sionality added by mapping a nominal attribute to several b2 . Therefore, let r be a record, the bucket computed wrt b1 numerical attributes. Moreover, this approach, namely the is located before the bucket computed wrt b2 in P airs(r). single partial order decomposition, presents the advantage of During maintenance, Buck1 , i.e. the oldest bucket, is sim- easily selecting the right views in order to evaluate a given ply discarded as it contains pairs computed wrt an outdated query compared to the chain product decomposition which batch. Moreover, during the minimization process, a bucket is an NP Hard Problem. Regarding the memory consump- Bucki is minimized only wrt successor buckets. We for- tion, let |dom(A)| = m, there exists m! total orders wrt malized the problem of buckets minimization for NSCt and dom(A), hence m! views to store for chain product decom- proved its NP-Hardness by a reduction to MSC (Minimum position approach. For refinement approach, the higher the Set Cover) problem. We provided a greedy approximate al- number of views stored, the faster will be query answering. gorithm as well. We present in [2] early experiments we per- Note that there exists an exponential number of partial or- 2 formed. It shows that despite the maintenance time, NSCt ders wrt m. Our approach requires 2 · Cm views as for every answers much more queries during the batch interval than two values, two views are stored. This can be further opti- BSkyTree [11]. mized by selecting only a subset of spo views to materialize. We addressed the following problem: given a workload Q and an integer k, select a set of views of size at most k to 3. SKYLINE QUERIES OPTIMIZATION IN materialize such that the cost of answering the queries in PRESENCE OF DYNAMIC ORDER Q through the views is minimum. We are working on the In many real world applications, users are allowed to ex- proof of the Hardness of this problem. Finally we extended press their preferences over the values of nominal attributes. the work to the case where datasets have several nominal In such case, we say that the attribute domain has a dynamic attributes. order. For example, on a movie platform website, users want the best rated movies but have preferences over the genre as 4. REGRET MINIMIZING QUERIES AND well. Also, on a flight booking website, users are interested in cheap and fast flights but may have preferences over the MULTIDIMENSIONAL SKYLINES airline companies. To ease the comprehension, we consider In this section, we mainly present state of the art of regret datasets with only one nominal attribute A and some num- optimization queries and we place some questions that we ber of numerical attributes. A user preference over A is a plan to investigate during this thesis. partial order that can be written as a set of (a1 , a2 ) such Skyline and Top-k queries share the same objective which that a1 , a2 2 A and a1 is preferred over a2 . In such sce- is selecting the best elements. However, on one hand, skyline nario, NSC is not suitable to answer skyline queries as it queries return a non constrained size of result by relying on is constructed depending on a given partial order over A. just the dominance relation between elements, and on the Hence it should be updated every time a query with a dif- other hand, Top-k queries require a score function from the ferent partial order is issued, or it may require to construct user to restrain the result size to k. NSC for each possible partial order over A. In the litera- Recently [13] presented “regret minimizing set” to avoid ture, there is two major approaches to handle the problem the limitations of skyline and Top-k queries by not requiring of answering skyline queries over datasets with attributes a score function while bounding the result size. The main having dynamic order, (i) algorithms which, given a query idea is to select a representative subset S of a dataset T . Let Q, maps a nominal attribute into a set of virtual numerical f be a score function, k be an integer, then let fk (T ) be the attributes in accordance to the user preference Q.R. Then, score of the kth ranked point using f . The regret of a subset a skyline query is processed over the transformed dataset S wrt f is f1 (T ) f1 (S) and the regret ratio is f1 (Tf1) (Tf)1 (S) . [18]. And (ii) answers the given query through a set of Given a family of functions F , the problem here is to find cached views, each computed wrt a partial order [16, 10]. S of size r which minimizes the maximal regret ratio among These works adopt refinement and chain product decomposi- all f 2 F. A greedy approximate algorithm to solve this tion techniques in order to evaluate an issued query through problem has been proposed in [13]. Consequently, [8] pro- materialized views. Let Q be an issued query and let Q0 be posed a relaxation, namely the k-regret minimizing set: the a view then Ans(Q) ✓ Ans(Q0 ) if Q0 .R ✓ Q.R. We say regret of S considered here is max(fk (T ) f1 (S), 0). More- that Q is a refinement of Q0 . Now let {Q1 , . . . , Qn } a set of over, they proved the NP-hardness of the regret minimiza- views such that Qi .R 8i 2 [1 . . . n] is a chain,Si.e., the val- tion problem, with or without the relaxation. [3] addressed ues of A are totally ordered, then Ans(Q) = i2[1...n] Qi if the query evaluation optimization. They proposed a linear N Q.R = i2[1...m] Qi .R. We say that {Q1 , . . . , Qn } is a chain time dynamic programming algorithm based on the skyline set for a two-dimensional dataset. In addition, for datasets compact maxima representative. In Proceedings of the with more than 3 dimensions, they proposed an approach 2017 ACM SIGMOD Conference, pages 821–834. where they discretise the family of linear function F , into ACM, 2017. a set of function F wrt a parameter given by the user. [4] K. S. Bøgh, S. Chester, D. Sidlauskas, and I. Assent. This discretisation allows a controlled approximation of the Hashcube: A data structure for space and query optimal regret ratio. A notable contribution to the regret efficient skycube compression. In Proceedings of CIKM minimization line of research is given by [1]. They provide conference, pages 1767–1770, 2014. a reformulation of a regret representative set. Let S ⇢ T . [5] K. S. Bøgh, S. Chester, D. Sidlauskas, and I. Assent. Then S is a (k, ✏)-set i↵ 8f 2 F , fk (Tfk) (Tf)1 (S)  ✏. They Template skycube algorithms for heterogeneous formulate two regret minimizing set problems, (i) by size parallelism on multicore and GPU architectures. In minimization, i.e., find the smallest (k, ✏)-set, and (ii) by re- Proc. of SIGMOD Conference, pages 447–462, 2017. gret minimization, compute the (k, ✏)-set of size r and ✏ is [6] S. Börzsönyi, D. Kossmann, and K. Stocker. The minimum. skyline operator. In Proc. of ICDE conf., pages Our previous work addressed two aspects of skyline queries: 421–430, 2001. index structure to optimize multidimensional queries and its [7] S. Chaudhuri and L. Gravano. Evaluating top-k efficient maintenance upon updates. We aim to extend this selection queries. In VLDB, volume 99, pages 397–410, material to optimize regret minimization queries. We plan 1999. to investigate more deeply the relationship between skyline [8] S. Chester, A. Thomo, S. Venkatesh, and sets and regret minimizing sets (RMS). Let T be a dataset S. Whitesides. Computing k-regret minimizing sets. over a set of dimensions D = {D1 , . . . , Dd }. One question Proceedings of the VLDB Endowment, 7(5):389–400, that could be interesting to investigate is to find a relation- 2014. ship between the regret minimization sets when considering [9] N. Hanusse, P. Kamnang-Wanko, and S. Maabout. subspaces of D. More precisely, let X ✓ Y ✓ D and S(X), Computing and summarizing the negative skycube. In resp. S(Y ), be the RMS wrt X, resp. Y . How these two Proc. of CIKM Conference, pages 1733–1742, 2016. sets do compare? Which situations make them comparable? [10] Y.-L. Hsueh, C.-C. Lin, and C.-C. Chang. An efficient Which functions families make them comparable? How can indexing method for skyline computations with the RMS’s wrt all subspaces can be computed efficiently? partially ordered domains. IEEE Transactions on Consider the RMS frequency of a tuple to be the number Knowledge & Data Engineering, pages 963–976, 2017. of RMS’s it belongs to, then retrieving the k most frequent [11] J. Lee and S.-W. Hwang. Scalable skyline tuples could be seen as a way to define the ‘best” k tuples. computation using a balanced pivot selection Let t 2 T . Is the knowledge we have about the skylines to technique. Information Systems, 39:1–21, 2014. which t belongs, can give us hints about its belonging to an [12] M. Morse, J. M. Patel, and W. I. Grosky. Efficient RMS? Let the skyline frequency of a tuple be the number continuous skyline computation. Information Sciences, of skylines it belongs to. Let L be the set of tuples that 177(17):3411–3437, 2007. are at least l skyline frequent, what approximation, if any, [13] D. Nanongkai, A. D. Sarma, A. Lall, R. J. Lipton, and gives the restriction of regret minimizing set computation J. Xu. Regret-minimizing representative databases. wrt L compared to the whole dataset? In case of a positive Proceedings of the VLDB Endowment, answer, it would be interesting to see how to adapt NSC to 3(1-2):1114–1124, 2010. optimize the calculation of the set L and consequently, an approximate regret minimization set. [14] Y. Tao and D. Papadias. Maintaining sliding window skylines on data streams. IEEE Trans. Knowl. Data Eng., 18(2):377–391, 2006. Acknowledgements [15] D. B. Terry, D. Goldberg, D. A. Nichols, and B. M. Experiments presented in this paper were carried out us- Oki. Continuous queries over append-only databases. ing the PlaFRIM experimental testbed, supported by Inria, In Proceedings of the 1992 ACM SIGMOD, pages CNRS (LABRI and IMB), Université de Bordeaux, Bor- 321–330, 1992. deaux INP and Conseil Régional d’Aquitaine [16] R. C.-W. Wong, J. Pei, A. W.-C. Fu, and K. Wang. (see https://www.plafrim.fr/) Online skyline analysis with dynamic preferences on nominal attributes. IEEE Transactions on Knowledge 5. REFERENCES and Data Engineering, 21(1):35–49, 2009. [1] P. K. Agarwal, N. Kumar, S. Sintos, and S. Suri. Efficient algorithms for k-regret minimizing sets. In [17] P. Wu, D. Agrawal, Ö. Egecioglu, and A. El Abbadi. SEA 2017, June 21-23, 2017, London, UK, pages Deltasky: Optimal maintenance of skyline deletions 7:1–7:23, 2017. without exclusive dominance region generation. In Proceedings of ICDE Conference, pages 486–495, 2007. [2] K. Alami and S. Maabout. Multidimensional skylines over streaming data. In International Conference on [18] S. Zhang, N. Mamoulis, D. W. Cheung, and B. Kao. Database Systems for Advanced Applications, pages Efficient skyline evaluation over partially ordered 338–342. Springer, 2019. domains. Proceedings of the VLDB Endowment, 3(1-2):1255–1266, 2010. [3] A. Asudeh, A. Nazi, N. Zhang, and G. Das. Efficient computation of regret-ratio minimizing set: A