ceur-ws.org/Vol-2399/paper15.pdf


           Computational aspects around preference queries

                                                          Karim Alami
                                                  supervised by Sofian Maabout
                                        Univ. Bordeaux, CNRS, LaBRI, UMR 5800, F-33400
                                                         Talence, France
                                                     karim.alami@u-bordeaux.fr


ABSTRACT                                                                  user ,and consequently, avoids him an endless comparison of
Preference queries present two main challenges: difficulty                data.
for users to express their preferences and the computational                 The information retrieval regarding user preferences have
complexity. For skyline queries, the preferences can be on                historically been a ranking problem, i.e., given a set of key
attributes, e.g., some user may look for the best flights re-             words, retrieve the elements that ”best” match these key-
garding price and number of stops, and others may look                    words, we call it often ”top-k” query . Its adaptation to
for the best flights regarding number of stops and dura-                  relational databases has been addressed in [7]. Users are
tion. In addition, preference can be expressed as a (par-                 requested to put weights on the attributes in order to select
tial) order on attributes domains, e.g., some user may pre-               points that have the higher values on attributes with the
fer flight company A over B while another one may have                    higher weights. Top-k queries gives the advantage of con-
the opposite preference. For top-k queries, users define a                trolling the size of the output, however the selection of the
score function to rank objects, e.g., users who give more                 weights remain very hard. There has been several works to
importance to price could define the following score func-                simplify this process, e.g., selection of a range of weights,
tion: price ⇤ 2 + duration of the flight. In general, several             elicitation and regret minimizing, among others. Authors in
rounds are required before converging towards a satisfying                [6] proposed the skyline operator as an alternative to rank-
answer where at each round, more precise preferences are                  ing queries. A skyline query returns a set of points which are
given by the user. This is due to the difficulty to figure out            not dominated by any other point of the dataset. A point x
the precise formulation of the user’s preferences. Therefore,             dominates a point y i↵ x is better or equal on all attributes
a more or less high number of queries need to be evaluated.               and strictly better on at least one attribute. The skyline
Our research work aims to make these queries answering                    query provides the advantage to not rely on a score function
faster through dedicated index structures and precomputed                 however, the size of the output is not controllable and it
views. The main challenges when adopting this strategy are                requires a quadratic computational time regarding the size
(i) lightweight memory consumption and (ii) fast mainte-                  of the dataset. Many studies have been done to optimize
nance process. Our first step was NSC, an index structure                 the skyline query execution time either by optimizing the
that optimizes skyline queries. However, the structure was                number of comparisons or by indexation.
designed for a static context making it unsuitable when data                 In previous work [9], the structure NSC has been presented
can be inserted/deleted. We redesigned NSC to cope with                   as an index to optimize skyline queries. Its main idea con-
dynamic data and in some cases, we proposed further ap-                   sists in comparing every record r of a dataset to all remain-
proaches when the structure is not suitable. In this paper,               ing records r0 and store the subspaces where r0 dominates r
we summarize our previous contributions and present some                  in a form of pair of subspaces compare(r, r0 ) = hX|Y i where
perspective research regarding the link between regret min-               X represents the attributes where r0 is strictly better than
imization queries and what we did so far.                                 r and Y represents the attributes where r and r0 are equal.
                                                                          Now given a subspace Z, a record r belongs to Sky(Z), i.e.
                                                                          the skyline over the subspace Z, if and only if there does
1.    INTRODUCTION                                                        not exist a pair hX|Y i associated to r that covers Z, i.e.
   Preference queries aim to retrieve points among a set of               Z ✓ XY and Z 6= Y . We denote by cover(hX|Y i) the
points that can be considered interesting regarding the pref-             set of subspaces covered by hX|Y i. For example, consider
erences of the user over a set of parameters. They are effi-              a dataset with attributes A, B and C. The pair hAB|Ci
cient tools which reduce the amount of data returned to the               covers the subspaces {A, B, AB, AC, BC, ABC}. The time
                                                                          complexity of NSC is quadratic wrt the size of the dataset as
                                                                          well as the space complexity. However, not every pair should
                                                                          be kept. Let P airs(r) be the set of pairs associated to r, this
                                                                          set can be minimized by computing a subset Q ✓ P airs(r)
                                                                          such that cover(Q) = cover(P airs(r)), i.e. the set of sub-
                                                                          spaces covered by P airs(r) are covered by Q as well. We say
                                                                          that Q is an equivalent subset of P airs(r), Q ⌘ P airs(r).
Proceedings of the VLDB 2019 PhD Workshop, August 26th, 2019. Los
Angeles, California
                                                                          The minimization problem is NP-Hard and a polynomial
Copyright (C) 2019 for this paper by its authors. Copying permitted for   greedy approximate algorithm has been proposed.
private and academic purposes.
   In the literature, works that optimize skyline queries can      in P airs(r) and its counter set to 1. In theory, this ap-
be divided in three groups, works that (i) design fast al-         proach could be seen as building the structure from scratch
gorithms without precomputation, (ii) design index struc-          at each deletion, however, in practice, only a small fraction
tures (iii) materialize the results. We note BSkyTree [11] the     of the dataset requires their set of pairs to be recomputed.
state of the art algorithm to process skyline queries without      In terms of time complexity, let the size of the dataset be n.
precomputing. We note HashCube [4], a bitmap like index            Let I be the set of records for which their respective set of
structure which associates a 2d Boolean vector v to every          pairs requires to be recomputed. Identifying I takes O(n)
record r where d is the number of attributes. v[i] is set i↵ r     time and recomputing P airs(r) 8r 2 I takes O(|I| · n). Re-
belongs to Sky(Z) such that Z is encoded by i. HashCube            garding memory consumption, NSC size is augmented by a
is highly efficient for skyline query answering, however, au-      constant due to the additional counter.
thors only proposed an algorithm to construct the structure           Now for insertions, let r+ be a record to be inserted then
from scratch in [5]. No obvious maintenance procedure is           we compute P airs(r+ ). Regarding the existing records in
noticed. Materialized skyline views are very time and mem-         the dataset, let r be one, we compute p+ = compare(r, r+ ).
ory consuming as the number of skylines wrt to a dataset           However, P airs(r) is already a minimal set of pairs, hence
is exponential to the number of dimension. However, they           we address the problem: should P airs(r) [ compare(r, r+ )
provide the best query answering performance.                      be minimized now? or should the minimization process be
   The experiments performed in [9] assess the construction        triggered after a number of pairs appended? In absence of
time and memory performance of NSC against its competi-            an incremental greedy algorithm, we propose a linear algo-
tors as well as query execution time. However, the structure       rithm wrt the size of P airs(r), called min inclusion, which
was designed for a static context. An insertion/deletion of a      takes every pair p 2 P airs(r) and compare it to p+ . The
record or a change in an attribute domain order may require        pair covered by the other one is discarded. This algorithm
rebuilding the structure from scratch.                             does not provide any approximation guarantee. However,
   For this thesis, we studied the ability of NSC to deal with     in practice, min inclusion allows a good compression ratio
updates. Precisely, we addressed its maintenance in case           wrt the greedy algorithm.
of (i) dynamic data, i.e., points are removed and inserted at
any time and (ii) streaming data, i.e., append-only database       2.2 Streaming Data
and sliding window model. We redesigned the structure to
                                                                      We consider the following stream model. Records are
cope with each context. Presently, we are working on skyline
                                                                   streamed at regular interval time ✓. They have a validity
queries optimization over a dataset with nominal attributes
                                                                   period of size ! which can also be seen as a sliding window
and dynamic order. We found that NSC is not suitable for
                                                                   over the dataset. Hence, records are considered outdated !
this context, hence we established a novel approach based on
                                                                   timestamps after their arrival time. The semantic of con-
views. Finally, we aim to investigate relationship between
                                                                   tinuous queries [15] states that the query result should be
multidimensional skylines and regret minimization queries.
                                                                   available and accurate at each time. However, this condition
                                                                   is hardly attainable, especially for skyline queries. We point
                                                                   out two main difficulties for continuous skyline queries. Let
2.    SKYLINE QUERIES OPTIMIZATION IN                              Qsky be a skyline query, then (i) a record r belonging to
      PRESENCE OF UPDATES                                          Ans(Qsky ) at a timestamp t may leave Ans(Qsky ) at times-
                                                                   tamp t0 > t if a streamed record r0 dominates r. And (ii) a
   In the following, we present the adaptation of NSC for,
                                                                   record r can join Ans(Qsky ) at timestamp t0 later than its
first, dynamic data, i.e., insertion/deletion of one or multiple
                                                                   arrival time if skyline records that dominate r get outdated.
records at any time, then, streaming data, i.e., insertion of
                                                                   We note the contribution of [14]. Authors propose to main-
records at regular intervals. We studied separately both
                                                                   tain a skyline set DBSky and a set of potentially skyline
contexts because they present di↵erent challenges.
                                                                   records DBRest, i.e., records which may join DBSky in the
                                                                   future. They propose an algorithm to maintain these sets
2.1    Dynamic Data                                                together with an event list recording timestamps where a
   In the case of dynamic data, i.e., insertion/deletion of one    record in DBRest may join DBSky. Still, this approach
or multiple records at any time, NSC should be updated in          makes a big number of records comparison which makes it
accordance to the new dataset. The baseline approach is to         non scalable wrt dimensionality and data cardinality. More-
build NSC structure from scratch, however this approach is         over it maintains a single skyline query, hence maintaining
costly.                                                            several skylines may require an exponential space wrt to the
   Related work pointed out that dealing with deletion is          number of attribute. We note as well the approach presented
harder than insertion. We note the work [17] which ad-             in [12] which is based on R-trees. Regarding NSC structure,
dressed the maintenance of a materialized skyline view in          the approaches presented for dynamic data are not suitable
case of deletion. For NSC, deletion is a hard task as well.        for two main reasons: (i) the timestamp where a record is
Let r be the deleted record then for every record r, we            no more valid is known and (ii) the frequency of streaming
remove from P airs(r) the pair p computed wrt r . How-             is often high. Generally, real time data processing is hard
ever, p may have exclusively covered some pairs that have          in streaming context. Batch processing mode provides a
been deleted from P airs(r) during the minimization pro-           larger time to process data, however, the query result is not
cess. Recovering these pairs requires recomputing P airs(r)        accurate with the current dataset. We proposed a frame-
from scratch. We propose a solution that detects when              work called M SSD which handles three data structure, (i)
P airs(r) should be recomputed. For every distinct pair p          a bu↵er B, (ii) a main dataset R and (iii) NSCt, a redesigned
in P airs(r), we set a counter of how many records are the         version for NSC. Incoming records are first bu↵ered during
source of p. Hence, we recompute P airs(r) only if p is            an interval of time of size k, then inserted into the main
dataset R. NSCt only indexes records in R, hence, queries            product decomposition of Q. We adopted a novel approach
evaluated through NSCt, consider only records in R. We               which theoretically and experimentally provided better re-
sacrifice the accuracy of the queries, nonetheless, we ensure        sults than the above approaches. We define the single partial
a fast maintenance that allows faster query answering than           order spo over an attribute domain Dom(A) as a partial or-
state of the art works. Next we explain the main novelties           der where only two values in Dom(A) are comparable and all
for NSCt. We organize the set of pairs of a record r as a se-        other values are incomparable. Given a query Q with user
quence of buckets of pairs P airs(r) = [Buck1 , . . . , Buckm ],     preference Q.R, Ans(Q) is the intersection of the skylines
where each bucket corresponds to the pairs computed wrt a            wrt spos composing Q.R. Every spo skyline query is consid-
batch of data. We adopt this approach for the following pur-         ered a view whether it is materialized or not. Our experi-
pose. Let b1 and b2 be two batches such that their respective        ments showed that answering a query through non material-
timestamp are T S(b1 ) and T S(b2 ), and T S(b1 ) < T S(b2 ).        ized spo views is even faster than online algorithms, mainly
As we consider a sliding window data model with an inter-            because skyline algorithms are highly weakened by dimen-
val of size !, records in b1 get outdated before records in          sionality added by mapping a nominal attribute to several
b2 . Therefore, let r be a record, the bucket computed wrt b1        numerical attributes. Moreover, this approach, namely the
is located before the bucket computed wrt b2 in P airs(r).           single partial order decomposition, presents the advantage of
During maintenance, Buck1 , i.e. the oldest bucket, is sim-          easily selecting the right views in order to evaluate a given
ply discarded as it contains pairs computed wrt an outdated          query compared to the chain product decomposition which
batch. Moreover, during the minimization process, a bucket           is an NP Hard Problem. Regarding the memory consump-
Bucki is minimized only wrt successor buckets. We for-               tion, let |dom(A)| = m, there exists m! total orders wrt
malized the problem of buckets minimization for NSCt and             dom(A), hence m! views to store for chain product decom-
proved its NP-Hardness by a reduction to MSC (Minimum                position approach. For refinement approach, the higher the
Set Cover) problem. We provided a greedy approximate al-             number of views stored, the faster will be query answering.
gorithm as well. We present in [2] early experiments we per-         Note that there exists an exponential number of partial or-
                                                                                                              2
formed. It shows that despite the maintenance time, NSCt             ders wrt m. Our approach requires 2 · Cm    views as for every
answers much more queries during the batch interval than             two values, two views are stored. This can be further opti-
BSkyTree [11].                                                       mized by selecting only a subset of spo views to materialize.
                                                                     We addressed the following problem: given a workload Q
                                                                     and an integer k, select a set of views of size at most k to
3.   SKYLINE QUERIES OPTIMIZATION IN                                 materialize such that the cost of answering the queries in
     PRESENCE OF DYNAMIC ORDER                                       Q through the views is minimum. We are working on the
   In many real world applications, users are allowed to ex-         proof of the Hardness of this problem. Finally we extended
press their preferences over the values of nominal attributes.       the work to the case where datasets have several nominal
In such case, we say that the attribute domain has a dynamic         attributes.
order. For example, on a movie platform website, users want
the best rated movies but have preferences over the genre as         4.    REGRET MINIMIZING QUERIES AND
well. Also, on a flight booking website, users are interested
in cheap and fast flights but may have preferences over the                MULTIDIMENSIONAL SKYLINES
airline companies. To ease the comprehension, we consider               In this section, we mainly present state of the art of regret
datasets with only one nominal attribute A and some num-             optimization queries and we place some questions that we
ber of numerical attributes. A user preference over A is a           plan to investigate during this thesis.
partial order that can be written as a set of (a1 , a2 ) such           Skyline and Top-k queries share the same objective which
that a1 , a2 2 A and a1 is preferred over a2 . In such sce-          is selecting the best elements. However, on one hand, skyline
nario, NSC is not suitable to answer skyline queries as it           queries return a non constrained size of result by relying on
is constructed depending on a given partial order over A.            just the dominance relation between elements, and on the
Hence it should be updated every time a query with a dif-            other hand, Top-k queries require a score function from the
ferent partial order is issued, or it may require to construct       user to restrain the result size to k.
NSC for each possible partial order over A. In the litera-              Recently [13] presented “regret minimizing set” to avoid
ture, there is two major approaches to handle the problem            the limitations of skyline and Top-k queries by not requiring
of answering skyline queries over datasets with attributes           a score function while bounding the result size. The main
having dynamic order, (i) algorithms which, given a query            idea is to select a representative subset S of a dataset T . Let
Q, maps a nominal attribute into a set of virtual numerical          f be a score function, k be an integer, then let fk (T ) be the
attributes in accordance to the user preference Q.R. Then,           score of the kth ranked point using f . The regret of a subset
a skyline query is processed over the transformed dataset            S wrt f is f1 (T ) f1 (S) and the regret ratio is f1 (Tf1) (Tf)1 (S) .
[18]. And (ii) answers the given query through a set of              Given a family of functions F , the problem here is to find
cached views, each computed wrt a partial order [16, 10].            S of size r which minimizes the maximal regret ratio among
These works adopt refinement and chain product decomposi-            all f 2 F. A greedy approximate algorithm to solve this
tion techniques in order to evaluate an issued query through         problem has been proposed in [13]. Consequently, [8] pro-
materialized views. Let Q be an issued query and let Q0 be           posed a relaxation, namely the k-regret minimizing set: the
a view then Ans(Q) ✓ Ans(Q0 ) if Q0 .R ✓ Q.R. We say                 regret of S considered here is max(fk (T ) f1 (S), 0). More-
that Q is a refinement of Q0 . Now let {Q1 , . . . , Qn } a set of   over, they proved the NP-hardness of the regret minimiza-
views such that Qi .R 8i 2 [1 . . . n] is a chain,Si.e., the val-    tion problem, with or without the relaxation. [3] addressed
ues of A are totally ordered, then Ans(Q) = i2[1...n] Qi if          the query evaluation optimization. They proposed a linear
        N
Q.R = i2[1...m] Qi .R. We say that {Q1 , . . . , Qn } is a chain     time dynamic programming algorithm based on the skyline
set for a two-dimensional dataset. In addition, for datasets             compact maxima representative. In Proceedings of the
with more than 3 dimensions, they proposed an approach                   2017 ACM SIGMOD Conference, pages 821–834.
where they discretise the family of linear function F , into             ACM, 2017.
a set of function F wrt a parameter given by the user.               [4] K. S. Bøgh, S. Chester, D. Sidlauskas, and I. Assent.
This discretisation allows a controlled approximation of the             Hashcube: A data structure for space and query
optimal regret ratio. A notable contribution to the regret               efficient skycube compression. In Proceedings of CIKM
minimization line of research is given by [1]. They provide              conference, pages 1767–1770, 2014.
a reformulation of a regret representative set. Let S ⇢ T .          [5] K. S. Bøgh, S. Chester, D. Sidlauskas, and I. Assent.
Then S is a (k, ✏)-set i↵ 8f 2 F , fk (Tfk) (Tf)1 (S)  ✏. They          Template skycube algorithms for heterogeneous
formulate two regret minimizing set problems, (i) by size                parallelism on multicore and GPU architectures. In
minimization, i.e., find the smallest (k, ✏)-set, and (ii) by re-        Proc. of SIGMOD Conference, pages 447–462, 2017.
gret minimization, compute the (k, ✏)-set of size r and ✏ is         [6] S. Börzsönyi, D. Kossmann, and K. Stocker. The
minimum.                                                                 skyline operator. In Proc. of ICDE conf., pages
   Our previous work addressed two aspects of skyline queries:           421–430, 2001.
index structure to optimize multidimensional queries and its         [7] S. Chaudhuri and L. Gravano. Evaluating top-k
efficient maintenance upon updates. We aim to extend this                selection queries. In VLDB, volume 99, pages 397–410,
material to optimize regret minimization queries. We plan                1999.
to investigate more deeply the relationship between skyline          [8] S. Chester, A. Thomo, S. Venkatesh, and
sets and regret minimizing sets (RMS). Let T be a dataset                S. Whitesides. Computing k-regret minimizing sets.
over a set of dimensions D = {D1 , . . . , Dd }. One question            Proceedings of the VLDB Endowment, 7(5):389–400,
that could be interesting to investigate is to find a relation-          2014.
ship between the regret minimization sets when considering           [9] N. Hanusse, P. Kamnang-Wanko, and S. Maabout.
subspaces of D. More precisely, let X ✓ Y ✓ D and S(X),                  Computing and summarizing the negative skycube. In
resp. S(Y ), be the RMS wrt X, resp. Y . How these two                   Proc. of CIKM Conference, pages 1733–1742, 2016.
sets do compare? Which situations make them comparable?             [10] Y.-L. Hsueh, C.-C. Lin, and C.-C. Chang. An efficient
Which functions families make them comparable? How can                   indexing method for skyline computations with
the RMS’s wrt all subspaces can be computed efficiently?                 partially ordered domains. IEEE Transactions on
Consider the RMS frequency of a tuple to be the number                   Knowledge & Data Engineering, pages 963–976, 2017.
of RMS’s it belongs to, then retrieving the k most frequent         [11] J. Lee and S.-W. Hwang. Scalable skyline
tuples could be seen as a way to define the ‘best” k tuples.             computation using a balanced pivot selection
Let t 2 T . Is the knowledge we have about the skylines to               technique. Information Systems, 39:1–21, 2014.
which t belongs, can give us hints about its belonging to an        [12] M. Morse, J. M. Patel, and W. I. Grosky. Efficient
RMS? Let the skyline frequency of a tuple be the number                  continuous skyline computation. Information Sciences,
of skylines it belongs to. Let L be the set of tuples that               177(17):3411–3437, 2007.
are at least l skyline frequent, what approximation, if any,
                                                                    [13] D. Nanongkai, A. D. Sarma, A. Lall, R. J. Lipton, and
gives the restriction of regret minimizing set computation
                                                                         J. Xu. Regret-minimizing representative databases.
wrt L compared to the whole dataset? In case of a positive
                                                                         Proceedings of the VLDB Endowment,
answer, it would be interesting to see how to adapt NSC to
                                                                         3(1-2):1114–1124, 2010.
optimize the calculation of the set L and consequently, an
approximate regret minimization set.                                [14] Y. Tao and D. Papadias. Maintaining sliding window
                                                                         skylines on data streams. IEEE Trans. Knowl. Data
                                                                         Eng., 18(2):377–391, 2006.
Acknowledgements
                                                                    [15] D. B. Terry, D. Goldberg, D. A. Nichols, and B. M.
Experiments presented in this paper were carried out us-                 Oki. Continuous queries over append-only databases.
ing the PlaFRIM experimental testbed, supported by Inria,                In Proceedings of the 1992 ACM SIGMOD, pages
CNRS (LABRI and IMB), Université de Bordeaux, Bor-                      321–330, 1992.
deaux INP and Conseil Régional d’Aquitaine                         [16] R. C.-W. Wong, J. Pei, A. W.-C. Fu, and K. Wang.
(see https://www.plafrim.fr/)                                            Online skyline analysis with dynamic preferences on
                                                                         nominal attributes. IEEE Transactions on Knowledge
5.   REFERENCES                                                          and Data Engineering, 21(1):35–49, 2009.
 [1] P. K. Agarwal, N. Kumar, S. Sintos, and S. Suri.
     Efficient algorithms for k-regret minimizing sets. In          [17] P. Wu, D. Agrawal, Ö. Egecioglu, and A. El Abbadi.
     SEA 2017, June 21-23, 2017, London, UK, pages                       Deltasky: Optimal maintenance of skyline deletions
     7:1–7:23, 2017.                                                     without exclusive dominance region generation. In
                                                                         Proceedings of ICDE Conference, pages 486–495, 2007.
 [2] K. Alami and S. Maabout. Multidimensional skylines
     over streaming data. In International Conference on            [18] S. Zhang, N. Mamoulis, D. W. Cheung, and B. Kao.
     Database Systems for Advanced Applications, pages                   Efficient skyline evaluation over partially ordered
     338–342. Springer, 2019.                                            domains. Proceedings of the VLDB Endowment,
                                                                         3(1-2):1255–1266, 2010.
 [3] A. Asudeh, A. Nazi, N. Zhang, and G. Das. Efficient
     computation of regret-ratio minimizing set: A