=Paper= {{Paper |id=Vol-2399/paper15 |storemode=property |title=Computational Aspects Around Preference Queries |pdfUrl=https://ceur-ws.org/Vol-2399/paper15.pdf |volume=Vol-2399 |authors=Karim Alami |dblpUrl=https://dblp.org/rec/conf/vldb/Alami19 }} ==Computational Aspects Around Preference Queries== https://ceur-ws.org/Vol-2399/paper15.pdf

ceur-ws.org/Vol-2399/paper15.pdf

Computational aspects around preference queries

Karim Alami
supervised by Sofian Maabout
Univ. Bordeaux, CNRS, LaBRI, UMR 5800, F-33400
Talence, France
karim.alami@u-bordeaux.fr

ABSTRACT user ,and consequently, avoids him an endless comparison of
Preference queries present two main challenges: difficulty data.
for users to express their preferences and the computational The information retrieval regarding user preferences have
complexity. For skyline queries, the preferences can be on historically been a ranking problem, i.e., given a set of key
attributes, e.g., some user may look for the best flights re- words, retrieve the elements that ”best” match these key-
garding price and number of stops, and others may look words, we call it often ”top-k” query . Its adaptation to
for the best flights regarding number of stops and dura- relational databases has been addressed in [7]. Users are
tion. In addition, preference can be expressed as a (par- requested to put weights on the attributes in order to select
tial) order on attributes domains, e.g., some user may pre- points that have the higher values on attributes with the
fer flight company A over B while another one may have higher weights. Top-k queries gives the advantage of con-
the opposite preference. For top-k queries, users define a trolling the size of the output, however the selection of the
score function to rank objects, e.g., users who give more weights remain very hard. There has been several works to
importance to price could define the following score func- simplify this process, e.g., selection of a range of weights,
tion: price ⇤ 2 + duration of the flight. In general, several elicitation and regret minimizing, among others. Authors in
rounds are required before converging towards a satisfying [6] proposed the skyline operator as an alternative to rank-
answer where at each round, more precise preferences are ing queries. A skyline query returns a set of points which are
given by the user. This is due to the difficulty to figure out not dominated by any other point of the dataset. A point x
the precise formulation of the user’s preferences. Therefore, dominates a point y i↵ x is better or equal on all attributes
a more or less high number of queries need to be evaluated. and strictly better on at least one attribute. The skyline
Our research work aims to make these queries answering query provides the advantage to not rely on a score function
faster through dedicated index structures and precomputed however, the size of the output is not controllable and it
views. The main challenges when adopting this strategy are requires a quadratic computational time regarding the size
(i) lightweight memory consumption and (ii) fast mainte- of the dataset. Many studies have been done to optimize
nance process. Our first step was NSC, an index structure the skyline query execution time either by optimizing the
that optimizes skyline queries. However, the structure was number of comparisons or by indexation.
designed for a static context making it unsuitable when data In previous work [9], the structure NSC has been presented
can be inserted/deleted. We redesigned NSC to cope with as an index to optimize skyline queries. Its main idea con-
dynamic data and in some cases, we proposed further ap- sists in comparing every record r of a dataset to all remain-
proaches when the structure is not suitable. In this paper, ing records r0 and store the subspaces where r0 dominates r
we summarize our previous contributions and present some in a form of pair of subspaces compare(r, r0 ) = hX|Y i where
perspective research regarding the link between regret min- X represents the attributes where r0 is strictly better than
imization queries and what we did so far. r and Y represents the attributes where r and r0 are equal.
Now given a subspace Z, a record r belongs to Sky(Z), i.e.
the skyline over the subspace Z, if and only if there does
1. INTRODUCTION not exist a pair hX|Y i associated to r that covers Z, i.e.
Preference queries aim to retrieve points among a set of Z ✓ XY and Z 6= Y . We denote by cover(hX|Y i) the
points that can be considered interesting regarding the pref- set of subspaces covered by hX|Y i. For example, consider
erences of the user over a set of parameters. They are effi- a dataset with attributes A, B and C. The pair hAB|Ci
cient tools which reduce the amount of data returned to the covers the subspaces {A, B, AB, AC, BC, ABC}. The time
complexity of NSC is quadratic wrt the size of the dataset as
well as the space complexity. However, not every pair should
be kept. Let P airs(r) be the set of pairs associated to r, this
set can be minimized by computing a subset Q ✓ P airs(r)
such that cover(Q) = cover(P airs(r)), i.e. the set of sub-
spaces covered by P airs(r) are covered by Q as well. We say
that Q is an equivalent subset of P airs(r), Q ⌘ P airs(r).
Proceedings of the VLDB 2019 PhD Workshop, August 26th, 2019. Los
Angeles, California
The minimization problem is NP-Hard and a polynomial
Copyright (C) 2019 for this paper by its authors. Copying permitted for greedy approximate algorithm has been proposed.
private and academic purposes.
In the literature, works that optimize skyline queries can in P airs(r) and its counter set to 1. In theory, this ap-
be divided in three groups, works that (i) design fast al- proach could be seen as building the structure from scratch
gorithms without precomputation, (ii) design index struc- at each deletion, however, in practice, only a small fraction
tures (iii) materialize the results. We note BSkyTree [11] the of the dataset requires their set of pairs to be recomputed.
state of the art algorithm to process skyline queries without In terms of time complexity, let the size of the dataset be n.
precomputing. We note HashCube [4], a bitmap like index Let I be the set of records for which their respective set of
structure which associates a 2d Boolean vector v to every pairs requires to be recomputed. Identifying I takes O(n)
record r where d is the number of attributes. v[i] is set i↵ r time and recomputing P airs(r) 8r 2 I takes O(|I| · n). Re-
belongs to Sky(Z) such that Z is encoded by i. HashCube garding memory consumption, NSC size is augmented by a
is highly efficient for skyline query answering, however, au- constant due to the additional counter.
thors only proposed an algorithm to construct the structure Now for insertions, let r+ be a record to be inserted then
from scratch in [5]. No obvious maintenance procedure is we compute P airs(r+ ). Regarding the existing records in
noticed. Materialized skyline views are very time and mem- the dataset, let r be one, we compute p+ = compare(r, r+ ).
ory consuming as the number of skylines wrt to a dataset However, P airs(r) is already a minimal set of pairs, hence
is exponential to the number of dimension. However, they we address the problem: should P airs(r) [ compare(r, r+ )
provide the best query answering performance. be minimized now? or should the minimization process be
The experiments performed in [9] assess the construction triggered after a number of pairs appended? In absence of
time and memory performance of NSC against its competi- an incremental greedy algorithm, we propose a linear algo-
tors as well as query execution time. However, the structure rithm wrt the size of P airs(r), called min inclusion, which
was designed for a static context. An insertion/deletion of a takes every pair p 2 P airs(r) and compare it to p+ . The
record or a change in an attribute domain order may require pair covered by the other one is discarded. This algorithm
rebuilding the structure from scratch. does not provide any approximation guarantee. However,
For this thesis, we studied the ability of NSC to deal with in practice, min inclusion allows a good compression ratio
updates. Precisely, we addressed its maintenance in case wrt the greedy algorithm.
of (i) dynamic data, i.e., points are removed and inserted at
any time and (ii) streaming data, i.e., append-only database 2.2 Streaming Data
and sliding window model. We redesigned the structure to
We consider the following stream model. Records are
cope with each context. Presently, we are working on skyline
streamed at regular interval time ✓. They have a validity
queries optimization over a dataset with nominal attributes
period of size ! which can also be seen as a sliding window
and dynamic order. We found that NSC is not suitable for
over the dataset. Hence, records are considered outdated !
this context, hence we established a novel approach based on
timestamps after their arrival time. The semantic of con-
views. Finally, we aim to investigate relationship between
tinuous queries [15] states that the query result should be
multidimensional skylines and regret minimization queries.
available and accurate at each time. However, this condition
is hardly attainable, especially for skyline queries. We point
out two main difficulties for continuous skyline queries. Let
2. SKYLINE QUERIES OPTIMIZATION IN Qsky be a skyline query, then (i) a record r belonging to
PRESENCE OF UPDATES Ans(Qsky ) at a timestamp t may leave Ans(Qsky ) at times-
tamp t0 > t if a streamed record r0 dominates r. And (ii) a
In the following, we present the adaptation of NSC for,
record r can join Ans(Qsky ) at timestamp t0 later than its
first, dynamic data, i.e., insertion/deletion of one or multiple
arrival time if skyline records that dominate r get outdated.
records at any time, then, streaming data, i.e., insertion of
We note the contribution of [14]. Authors propose to main-
records at regular intervals. We studied separately both
tain a skyline set DBSky and a set of potentially skyline
contexts because they present di↵erent challenges.
records DBRest, i.e., records which may join DBSky in the
future. They propose an algorithm to maintain these sets
2.1 Dynamic Data together with an event list recording timestamps where a
In the case of dynamic data, i.e., insertion/deletion of one record in DBRest may join DBSky. Still, this approach
or multiple records at any time, NSC should be updated in makes a big number of records comparison which makes it
accordance to the new dataset. The baseline approach is to non scalable wrt dimensionality and data cardinality. More-
build NSC structure from scratch, however this approach is over it maintains a single skyline query, hence maintaining
costly. several skylines may require an exponential space wrt to the
Related work pointed out that dealing with deletion is number of attribute. We note as well the approach presented
harder than insertion. We note the work [17] which ad- in [12] which is based on R-trees. Regarding NSC structure,
dressed the maintenance of a materialized skyline view in the approaches presented for dynamic data are not suitable
case of deletion. For NSC, deletion is a hard task as well. for two main reasons: (i) the timestamp where a record is
Let r be the deleted record then for every record r, we no more valid is known and (ii) the frequency of streaming
remove from P airs(r) the pair p computed wrt r . How- is often high. Generally, real time data processing is hard
ever, p may have exclusively covered some pairs that have in streaming context. Batch processing mode provides a
been deleted from P airs(r) during the minimization pro- larger time to process data, however, the query result is not
cess. Recovering these pairs requires recomputing P airs(r) accurate with the current dataset. We proposed a frame-
from scratch. We propose a solution that detects when work called M SSD which handles three data structure, (i)
P airs(r) should be recomputed. For every distinct pair p a bu↵er B, (ii) a main dataset R and (iii) NSCt, a redesigned
in P airs(r), we set a counter of how many records are the version for NSC. Incoming records are first bu↵ered during
source of p. Hence, we recompute P airs(r) only if p is an interval of time of size k, then inserted into the main
dataset R. NSCt only indexes records in R, hence, queries product decomposition of Q. We adopted a novel approach
evaluated through NSCt, consider only records in R. We which theoretically and experimentally provided better re-
sacrifice the accuracy of the queries, nonetheless, we ensure sults than the above approaches. We define the single partial
a fast maintenance that allows faster query answering than order spo over an attribute domain Dom(A) as a partial or-
state of the art works. Next we explain the main novelties der where only two values in Dom(A) are comparable and all
for NSCt. We organize the set of pairs of a record r as a se- other values are incomparable. Given a query Q with user
quence of buckets of pairs P airs(r) = [Buck1 , . . . , Buckm ], preference Q.R, Ans(Q) is the intersection of the skylines
where each bucket corresponds to the pairs computed wrt a wrt spos composing Q.R. Every spo skyline query is consid-
batch of data. We adopt this approach for the following pur- ered a view whether it is materialized or not. Our experi-
pose. Let b1 and b2 be two batches such that their respective ments showed that answering a query through non material-
timestamp are T S(b1 ) and T S(b2 ), and T S(b1 ) < T S(b2 ). ized spo views is even faster than online algorithms, mainly
As we consider a sliding window data model with an inter- because skyline algorithms are highly weakened by dimen-
val of size !, records in b1 get outdated before records in sionality added by mapping a nominal attribute to several
b2 . Therefore, let r be a record, the bucket computed wrt b1 numerical attributes. Moreover, this approach, namely the
is located before the bucket computed wrt b2 in P airs(r). single partial order decomposition, presents the advantage of
During maintenance, Buck1 , i.e. the oldest bucket, is sim- easily selecting the right views in order to evaluate a given
ply discarded as it contains pairs computed wrt an outdated query compared to the chain product decomposition which
batch. Moreover, during the minimization process, a bucket is an NP Hard Problem. Regarding the memory consump-
Bucki is minimized only wrt successor buckets. We for- tion, let |dom(A)| = m, there exists m! total orders wrt
malized the problem of buckets minimization for NSCt and dom(A), hence m! views to store for chain product decom-
proved its NP-Hardness by a reduction to MSC (Minimum position approach. For refinement approach, the higher the
Set Cover) problem. We provided a greedy approximate al- number of views stored, the faster will be query answering.
gorithm as well. We present in [2] early experiments we per- Note that there exists an exponential number of partial or-
2
formed. It shows that despite the maintenance time, NSCt ders wrt m. Our approach requires 2 · Cm views as for every
answers much more queries during the batch interval than two values, two views are stored. This can be further opti-
BSkyTree [11]. mized by selecting only a subset of spo views to materialize.
We addressed the following problem: given a workload Q
and an integer k, select a set of views of size at most k to
3. SKYLINE QUERIES OPTIMIZATION IN materialize such that the cost of answering the queries in
PRESENCE OF DYNAMIC ORDER Q through the views is minimum. We are working on the
In many real world applications, users are allowed to ex- proof of the Hardness of this problem. Finally we extended
press their preferences over the values of nominal attributes. the work to the case where datasets have several nominal
In such case, we say that the attribute domain has a dynamic attributes.
order. For example, on a movie platform website, users want
the best rated movies but have preferences over the genre as 4. REGRET MINIMIZING QUERIES AND
well. Also, on a flight booking website, users are interested
in cheap and fast flights but may have preferences over the MULTIDIMENSIONAL SKYLINES
airline companies. To ease the comprehension, we consider In this section, we mainly present state of the art of regret
datasets with only one nominal attribute A and some num- optimization queries and we place some questions that we
ber of numerical attributes. A user preference over A is a plan to investigate during this thesis.
partial order that can be written as a set of (a1 , a2 ) such Skyline and Top-k queries share the same objective which
that a1 , a2 2 A and a1 is preferred over a2 . In such sce- is selecting the best elements. However, on one hand, skyline
nario, NSC is not suitable to answer skyline queries as it queries return a non constrained size of result by relying on
is constructed depending on a given partial order over A. just the dominance relation between elements, and on the
Hence it should be updated every time a query with a dif- other hand, Top-k queries require a score function from the
ferent partial order is issued, or it may require to construct user to restrain the result size to k.
NSC for each possible partial order over A. In the litera- Recently [13] presented “regret minimizing set” to avoid
ture, there is two major approaches to handle the problem the limitations of skyline and Top-k queries by not requiring
of answering skyline queries over datasets with attributes a score function while bounding the result size. The main
having dynamic order, (i) algorithms which, given a query idea is to select a representative subset S of a dataset T . Let
Q, maps a nominal attribute into a set of virtual numerical f be a score function, k be an integer, then let fk (T ) be the
attributes in accordance to the user preference Q.R. Then, score of the kth ranked point using f . The regret of a subset
a skyline query is processed over the transformed dataset S wrt f is f1 (T ) f1 (S) and the regret ratio is f1 (Tf1) (Tf)1 (S) .
[18]. And (ii) answers the given query through a set of Given a family of functions F , the problem here is to find
cached views, each computed wrt a partial order [16, 10]. S of size r which minimizes the maximal regret ratio among
These works adopt refinement and chain product decomposi- all f 2 F. A greedy approximate algorithm to solve this
tion techniques in order to evaluate an issued query through problem has been proposed in [13]. Consequently, [8] pro-
materialized views. Let Q be an issued query and let Q0 be posed a relaxation, namely the k-regret minimizing set: the
a view then Ans(Q) ✓ Ans(Q0 ) if Q0 .R ✓ Q.R. We say regret of S considered here is max(fk (T ) f1 (S), 0). More-
that Q is a refinement of Q0 . Now let {Q1 , . . . , Qn } a set of over, they proved the NP-hardness of the regret minimiza-
views such that Qi .R 8i 2 [1 . . . n] is a chain,Si.e., the val- tion problem, with or without the relaxation. [3] addressed
ues of A are totally ordered, then Ans(Q) = i2[1...n] Qi if the query evaluation optimization. They proposed a linear
N
Q.R = i2[1...m] Qi .R. We say that {Q1 , . . . , Qn } is a chain time dynamic programming algorithm based on the skyline
set for a two-dimensional dataset. In addition, for datasets compact maxima representative. In Proceedings of the
with more than 3 dimensions, they proposed an approach 2017 ACM SIGMOD Conference, pages 821–834.
where they discretise the family of linear function F , into ACM, 2017.
a set of function F wrt a parameter given by the user. [4] K. S. Bøgh, S. Chester, D. Sidlauskas, and I. Assent.
This discretisation allows a controlled approximation of the Hashcube: A data structure for space and query
optimal regret ratio. A notable contribution to the regret efficient skycube compression. In Proceedings of CIKM
minimization line of research is given by [1]. They provide conference, pages 1767–1770, 2014.
a reformulation of a regret representative set. Let S ⇢ T . [5] K. S. Bøgh, S. Chester, D. Sidlauskas, and I. Assent.
Then S is a (k, ✏)-set i↵ 8f 2 F , fk (Tfk) (Tf)1 (S)  ✏. They Template skycube algorithms for heterogeneous
formulate two regret minimizing set problems, (i) by size parallelism on multicore and GPU architectures. In
minimization, i.e., find the smallest (k, ✏)-set, and (ii) by re- Proc. of SIGMOD Conference, pages 447–462, 2017.
gret minimization, compute the (k, ✏)-set of size r and ✏ is [6] S. Börzsönyi, D. Kossmann, and K. Stocker. The
minimum. skyline operator. In Proc. of ICDE conf., pages
Our previous work addressed two aspects of skyline queries: 421–430, 2001.
index structure to optimize multidimensional queries and its [7] S. Chaudhuri and L. Gravano. Evaluating top-k
efficient maintenance upon updates. We aim to extend this selection queries. In VLDB, volume 99, pages 397–410,
material to optimize regret minimization queries. We plan 1999.
to investigate more deeply the relationship between skyline [8] S. Chester, A. Thomo, S. Venkatesh, and
sets and regret minimizing sets (RMS). Let T be a dataset S. Whitesides. Computing k-regret minimizing sets.
over a set of dimensions D = {D1 , . . . , Dd }. One question Proceedings of the VLDB Endowment, 7(5):389–400,
that could be interesting to investigate is to find a relation- 2014.
ship between the regret minimization sets when considering [9] N. Hanusse, P. Kamnang-Wanko, and S. Maabout.
subspaces of D. More precisely, let X ✓ Y ✓ D and S(X), Computing and summarizing the negative skycube. In
resp. S(Y ), be the RMS wrt X, resp. Y . How these two Proc. of CIKM Conference, pages 1733–1742, 2016.
sets do compare? Which situations make them comparable? [10] Y.-L. Hsueh, C.-C. Lin, and C.-C. Chang. An efficient
Which functions families make them comparable? How can indexing method for skyline computations with
the RMS’s wrt all subspaces can be computed efficiently? partially ordered domains. IEEE Transactions on
Consider the RMS frequency of a tuple to be the number Knowledge & Data Engineering, pages 963–976, 2017.
of RMS’s it belongs to, then retrieving the k most frequent [11] J. Lee and S.-W. Hwang. Scalable skyline
tuples could be seen as a way to define the ‘best” k tuples. computation using a balanced pivot selection
Let t 2 T . Is the knowledge we have about the skylines to technique. Information Systems, 39:1–21, 2014.
which t belongs, can give us hints about its belonging to an [12] M. Morse, J. M. Patel, and W. I. Grosky. Efficient
RMS? Let the skyline frequency of a tuple be the number continuous skyline computation. Information Sciences,
of skylines it belongs to. Let L be the set of tuples that 177(17):3411–3437, 2007.
are at least l skyline frequent, what approximation, if any,
[13] D. Nanongkai, A. D. Sarma, A. Lall, R. J. Lipton, and
gives the restriction of regret minimizing set computation
J. Xu. Regret-minimizing representative databases.
wrt L compared to the whole dataset? In case of a positive
Proceedings of the VLDB Endowment,
answer, it would be interesting to see how to adapt NSC to
3(1-2):1114–1124, 2010.
optimize the calculation of the set L and consequently, an
approximate regret minimization set. [14] Y. Tao and D. Papadias. Maintaining sliding window
skylines on data streams. IEEE Trans. Knowl. Data
Eng., 18(2):377–391, 2006.
Acknowledgements
[15] D. B. Terry, D. Goldberg, D. A. Nichols, and B. M.
Experiments presented in this paper were carried out us- Oki. Continuous queries over append-only databases.
ing the PlaFRIM experimental testbed, supported by Inria, In Proceedings of the 1992 ACM SIGMOD, pages
CNRS (LABRI and IMB), Université de Bordeaux, Bor- 321–330, 1992.
deaux INP and Conseil Régional d’Aquitaine [16] R. C.-W. Wong, J. Pei, A. W.-C. Fu, and K. Wang.
(see https://www.plafrim.fr/) Online skyline analysis with dynamic preferences on
nominal attributes. IEEE Transactions on Knowledge
5. REFERENCES and Data Engineering, 21(1):35–49, 2009.
[1] P. K. Agarwal, N. Kumar, S. Sintos, and S. Suri.
Efficient algorithms for k-regret minimizing sets. In [17] P. Wu, D. Agrawal, Ö. Egecioglu, and A. El Abbadi.
SEA 2017, June 21-23, 2017, London, UK, pages Deltasky: Optimal maintenance of skyline deletions
7:1–7:23, 2017. without exclusive dominance region generation. In
Proceedings of ICDE Conference, pages 486–495, 2007.
[2] K. Alami and S. Maabout. Multidimensional skylines
over streaming data. In International Conference on [18] S. Zhang, N. Mamoulis, D. W. Cheung, and B. Kao.
Database Systems for Advanced Applications, pages Efficient skyline evaluation over partially ordered
338–342. Springer, 2019. domains. Proceedings of the VLDB Endowment,
3(1-2):1255–1266, 2010.
[3] A. Asudeh, A. Nazi, N. Zhang, and G. Das. Efficient
computation of regret-ratio minimizing set: A