Top-k
               Top-k    search
                     Search  ininProduct
                                  product catalogues
                                           Catalogues

                                   Martin
                               Martin     Šumák,
                                      Šumák andPeter
                                                  PeterGurský
                                                        Gurský
                    Faculty
                     Facultyofof
                               science, University
                                 science,          of Pavol
                                          University        Jozef Šafárik,
                                                        of Pavol           Jesenná 5,
                                                                  Jozef Šafárik
            Jesenná 5, 040 01 Košice,040  01 Košice,
                                         Slovakia       Slovakia
                                                    martin.sumak@student.upjs.sk,
                martin.sumak@student.upjs.sk peter.gursky@upjs.sk
                                     peter.gursky@upjs.sk


           Abstract. In the era of huge datasets, the top-k search becomes an effective way
           to decrease the search time of top-k objects. The original top-k search requires a
           monotone combination function and lists of objects ordered by attribute values.
           Our approach of the top-k search is motivated by complex user preferences over
           product catalogues. Such user preferences are composed of the local user prefe-
           rences of the attributes’ values (user defined arbitrary fuzzy functions, one for
           each attribute) and a user defined monotone combination function. This paper
           compares two different approaches of the top-k search for this type of non-
           monotone query. The first approach uses several B+trees, one for each attribute,
           and it is based on ordered lists. The second approach is new for this type of
           query and requires an R-tree index.


    1 Introduction

    The top-k search becomes more popular with increasing datasets sizes. Users are
    usually interested in a few best objects rather than a large list of objects as a result.
    Top-k algorithms usually follow two main goals. First, they minimize the number of
    source data to be processed. Second, the algorithms try to minimize the number of
    accesses to the sources and the computation time as well.
       Our research in the area of the top-k search is motivated by product catalogue
    search. Users of the common e-catalogues have usually limited options of a preference
    specification. Typically, the only way to simplify product selection is to order the
    products by a single attribute. Sometimes, a user can restrict the object set by an inter-
    val of attribute values.
       Our goal is to enable usage of more complex user preferences. Complex user prefe-
    rences allow user to specify his top-k objects more accurately. On the other hand,
    preferences should be easy to specify and understandable for common user. Our mod-
    el of user preferences consists of local preferences to attribute values and a global
    monotone combination function.
       User’s local preference to the values in domain DA of attribute A is represented by a
    fuzzy function fA: DA → [0; 1]. The fuzzy function gives the value 1 for the attribute
    values the most preferred by user and the value 0 for the attribute values that the user
    does not accept. Values between 0 and 1 represent the rates of the user acceptance of
    the attribute value.


V. Snášel, J. Pokorný, K. Richta (Eds.): Dateso 2011, pp. 1–12, ISBN 978-80-248-2391-1.
2        Martin Šumák, Peter Gurský


   Example 1: Imagine a user searching for a cheap laptop with medium screen size.
Such preferences can be expressed by fuzzy functions similar to the ones shown on
figure 1. We can see that user accepts laptops cheaper than approximately 700 € with
a screen size between 11” and 15,5”. Moreover, we know that preferred screen sizes
are between 12” and 14” and that the cheaper the laptop is the better preference it has.


Fig.1. User’s local preferences to the screen sizes and the laptops prices

   The fuzzy functions can be specified explicitly by user using sliders [4]. Alterna-
tively they can be learned from objects’ ranking or by a click analysis on the catalogue
website [12, 13].
   The second part of user preferences is a global monotone combination function that
combines the fuzzy values of the local preferences to the overall object value. If user
considers m attributes in his preferences, the combination function C: [0; 1]m → R
expresses the overall value of the user preferences to the object. The higher the overall
value, the more preferred the object is.
   Example 2: Let us consider the previous example. Our user can specify that the
price is twice as important as the screen size. We can express this importance as
weights wsize= 1 and wprice= 2. Then the combination function can be expressed as
weighted sum of the fuzzy values:
                  C(fsize(size), fprice(price)) = 1*fsize(size) + 2*fprice(price)
   The monotone combination function is quite difficult to express by a common user.
Therefore the preferred approach uses a predefined combination function (sum, mini-
mum and product) or it can be learned as well as the local preferences [12, 13]. A
learned combination function has usually a form of a weighted sum or a set of mono-
tone fuzzy rules [13].
   Top-k search algorithms like Threshold algorithm (TA), algorithm No Random
Access (NRA) [1] or 3 phased-No Random Access (3P-NRA) can be used with user
preferences mentioned above [4, 12].
   The contribution of this paper is:
      new efficient algorithm for the top-k search based on R-tree (see section 4)
         experimental efficiency comparison of R-tree based algorithm to algorithms
          TA and 3P-NRA (see section 5)
   In [2], algorithms like TA, NRA (or 3P-NRA) are used for searching over several
types of attributes i.e. ordinal, metric, hierarchical, etc. Our new solution based on R-
tree supports only ordinal attributes so far. Ordinal attributes in TA, NRA (or 3P-
NRA) algorithms are handled by B+trees. In this paper we call these algorithms
B+trees based approach.
                                           Top-k Search in Product Catalogues           3


2 Problem definition

For a given set S of objects we have to find k most preferred objects for the user. Each
object O ∊ S has the same m attributes with real values v1(O),…, vm(O), where func-
tion vi: S → R for all i ∊ {1, …, m}. Input obtained from the user consists of m fuzzy
functions f1,…,fm (or less if user does not consider all attributes) and a monotone com-
bination function C. The overall value of object O is C(f1(v1(O)),…, fm(vm(O))). For
example, if C is a weighted sum, user is expected to specify only nonnegative weights
– one for each considered attribute. Then we have:
            C(f1(v1(O)),…, fm(vm(O))) = w1* f1(v1(O)) + … + wm* fm(vm(O))
where w1,…,wm are the weights. The bigger the overall value, the more preferred the
object O is to user. Output is a list of k objects from S ordered from the most preferred
objects to the less preferred ones.


3 B+trees based approach

The original TA [1] and its derivates (NRA, 3P-NRA) require:
     m lists of pairs having form <object identifier, attribute value> previously or-
         dered by attribute values
        a monotone combination function
    In the TA, NRA, 3P-NRA algorithms, the ordered lists are read sequentially and
the values from the lists are used as an input for a monotone combination function.
Typically a top-k algorithm returns top-k objects after processing only a part of the
lists.
    Our approach requires the monotonicity of a combination function C, having fuzzy
values of the local preferences as an input. In the naïve approach, the lists could be
sorted according to the local preferences. Then, having the lists ordered by fuzzy
values of the local preferences and a monotone combination function, we can use any
TA-like algorithm to find top-k objects. Unfortunately, every user usually has different
local preferences as well as different combination function. The sorting prior to every
top-k search is highly ineffective – it is cheaper to do a table scan and compute the
overall values for all objects in S.
    Instead of ordering the lists, a B+tree can be prepared over each attribute of objects
in S [2]. The main idea of this approach is that it does not need an ordered list to offer
the ordered sorted access.
    Example 3. Let us assume that the price and screen size attributes are indexed sepa-
rately each in one B+tree. According to the fuzzy functions shown on figure 1, i.e. the
user local preferences, the B+trees can be traversed to simulate the sorted accesses.
The price tree is traversed from a minimal attribute value in ascending direction using
pointers between leafs of the B+tree. In the case of the screen size two pointers are
created, both starting at the some attribute value with the highest fuzzy value, i.e. 13”
with fuzzy value 1. The first pointer traverses leafs of the B+tree in descending direc-
tion and the second one in ascending direction (see Fig.2).
4        Martin Šumák, Peter Gurský


Fig.2. Traversing B+tree organized by attribute screen size during the sorted access simulation
for TA-like algorithms

    Having simple functions (partially linear), the points to identify the pointers and di-
rections, i.e. the extremes of the fuzzy function, can be found very easily. The entry of
a B+tree leaf with the highest fuzzy value can be identified by simple traversing from
the root to the appropriate leaf.
    At the beginning of the sorted access simulation [2] the entries with the local max-
imums of a fuzzy function are identified, based on which a set of pointers to leaf en-
tries is obtained. Each pointer points to a leaf entry that should be returned as the next
entry in the sorted access simulation. For each pointed entry a fuzzy value is com-
puted. The entry with the highest fuzzy value is returned and its pointer is shifted to
the neighbor entry of the corresponding direction.
    The sorted access simulation allows us to use any TA-like algorithm. Instead of the
original NRA, we prefer to use an improvement of the NRA algorithm called 3P-NRA
with significantly better search time as shown in [2] especially in combination with the
proposed heuristics. Hence we use 3P-NRA in our experiments in section 5.


4 Searching over R-tree

A contribution of this paper is a top-k search algorithm based on R-tree. As shown in
the experiments, this approach is much more effective than B+trees based approach.
   Since each object O can be represented by the point p(O) = (v1(O),…, vm(O)) in m-
dimensional space (note that p: S → Rm) the set S of objects can be stored in multidi-
mensional R-tree index [8, 9]. The figure 3 shows an example of R-tree index used for
point data. For searching top-k objects we adapted algorithm Incremental Nearest
Neighbor (INN) [10] (also known as sorted access) commonly used for searching k-
nearest neighbors. For formal description of our algorithm we first have to define
some concepts.

Definition 1: Point P in m-dimensional space is vector P = (P1,..., Pm) ∊ Rm.

Remark 1: Having O ∊ S and point P such that P = p(O) then Pi = vi(O) for all
i ∊ {1,…, m}.

Definition 2: Rectangle R (parallel with axes) is a pair of points R = (L, H) where
Li ≤ Hi for all i ∊ {1,…, m}.
                                             Top-k Search in Product Catalogues            5


Fig.3. Example of R-tree for two-dimensional point data and capacity of nodes equal to 4

   For the implementation and the tests the more effective variant R*-tree [9] was
used. R*-tree has the same structure and properties as the R-tree, it differs only in the
way of construction. However, the algorithm for k-nearest neighbors and for top-k
search is the same for both of them.
   Real values of objects’ attributes come from different ranges (screen sizes of lap-
tops are from 8” to 17” while the prices are from 400 € to 3000 €) and they are in
different units. Therefore we will not build an R-tree index over real values of
attributes but we will build it over linear normalized values where all values are within
interval [0; 1]. The user defined local preferences (fuzzy functions) will be adjusted
according to such normalized data. The normalization is necessary for effective utili-
zation of R*-tree because of R*-tree aims e.g. to have as quadratic nodes as possible.
   INN algorithm offers the objects within R-tree in the incremental way and in order
from the nearest ones from a query point Q. In top-k search the input is not a single
point but it consists of the user’s preferences. Objects on the output should be ordered
from best for the user. This can be easily achieved by ordering the priority queue
within INN algorithm by some other value – not by minimal distance from Q but by
maximal overall value (using combination function). For a node N we have to com-
pute the maximal possible overall value for any object in a sub-tree rooted by the node
N. For this purpose we define function h.

Definition 3: Function h: S U V → R, where V is a set of nodes of R-tree, is defined
as follows: h(E) = C(y1,…, ym) where for all i ∊ {1, …, m} yi = fi(vi(E)) if E ∊ S or
yi = max{fi(x): Li ≤ x ≤ Hi} if E ∊ V and (L, H) is minimal bounding rectangle of E.

   Even though user’s preferences can be specified by arbitrary fuzzy functions, nev-
ertheless it must be possible to compute the maximum of the fuzzy function on an
arbitrary interval and also to compute a value in any permissible x. We prefer to work
with the partially linear fuzzy functions. They are easy to define by users using sliders
in a graphic interface and they are also simple in computations.
   The figure 4 shows a graphical representation of an example of function h. The
value h(O) of object O is greater than the value h(N) of node N. Object O is therefore
better for user than any possible object in a sub-tree of node N (any point in rectangle
(L, H)).
6        Martin Šumák, Peter Gurský


Fig.4. Graphical representation of the function h(E) = C(y1, y2) = y1 + 2 * y2 over normalized
data having two attributes of objects with user-defined fuzzy functions f1 and f2 on the right

Algorithm preferential top-k search over R-tree:
Input: R-tree index of objects from S, fuzzy functions
       f1,…, fm, combination function C and number k
Output: ordered list of k objects with the highest value
        of the h function
1. pqueue = empty priority queue ordered by the value of
   the h function of its elements in descending order
2. result = empty list of objects
3. add the root node of the R-tree in the empty pqueue
4. while the result does not contain k objects do
          a. let E be the first element of the pqueue, remove
             E from the pqueue
          b. if E is an inner node then add all its child
             nodes to the pqueue
          c. if E is a leaf node then add all its objects to
             the pqueue
          d. if E is an object then add object E to the end
             of the result
5. return result


   Algorithm starts with the root node as the only element in the priority queue. At the
beginning all the objects are taken into account (root contains all the objects in its sub-
tree). The idea of the algorithm is to remove the node at the top of the priority queue,
insert all its child nodes instead (or objects if the node was a leaf) with respect to
ordering. This is repeated until some object appears on the top of the priority queue.
Then the object is put to the end of the result list (the next best object to user). Priority
queue ensures that on the top there is an element (i.e. a node or an object) with the
maximal value of the function h among all elements in the priority queue. Moreover
the priority queue can be organized to prefer object to a node with the equal value of
function h. If the element on the top of the priority queue is a node, then its sub-tree
                                           Top-k Search in Product Catalogues            7


can contain an object with higher value of function h than any other object present in
the priority queue. On the other side, if there is an object on the top of the priority
queue, its value of function h is higher than or equal to the value of function h of any
other object in the priority queue and also of any object in sub-trees of nodes in the
priority queue. Hence the first object appeared on the top of the priority queue is the
best of all objects. The second object appeared on the top of the priority queue is the
best of all remaining objects (i.e. the second best of all objects) and so on.
   Inserting all child nodes to the priority queue requires the data (rectangles of child
nodes) from the node itself only. Therefore it is possible to store only pointers to
nodes in the priority queue for better efficiency and load real nodes when the informa-
tion about their child nodes or objects is needed.


5 Tests and results

In the tests the time and number of IOs of the following three algorithms are com-
pared: TA, 3P-NRA and the preferential top-k search over R-tree. Since R*-tree is the
most efficient variant of R-tree, we used it in all the following tests. The utilization of
R*-tree nodes was within the range from 30 to 90. In the tests we used three different
distributions of artificial random data: Gaussian, exponential and uniform. We used
sets of 100 000 and 1000 000 objects. The combination function was weighted sum
where weights were chosen randomly from interval [1; 5]. Fuzzy functions for the
evaluation of objects were chosen also randomly from 4 types (familiarly called: as-
cending, descending, hill, valley) used in [21]. All the tests were done on computer
with 2GB of RAM, 2core Intel Centrino processor and SSD hard disk. The purpose of
the tests was to reveal the influence of the data distribution, user preferences or the
number of considered attributes on time and number of IOs. Presented values are the
average values of 5 identical tests.
8        Martin Šumák, Peter Gurský


Fig.5. Search time of the top-1, 5, 10, 20, 50 objects in ms over 100 000 random objects with
10 attributes of exponential (top-left), Gaussian (top-right) and uniform (bottom) distribution
with all 10 attributes in a query

   First set of charts (figure 5) describes the search time of top-k objects. We can ob-
serve significant speedup of searching when data are indexed in R*-tree and the query
uses all 10 attributes of objects. The dependence of number of IOs copies the time
dependence for all three algorithms.
   The question is whether the R*-tree based algorithm will be so fast if the query
would contain only some of all indexed attributes. TA and 3P-NRA algorithms can be
considered faster trivially (because they read fewer ordered lists). The R*-tree based
algorithm always searches over the R*-tree structure containing all the attributes inde-
pendently of the user query.
   Next set of charts (figures 6, 7) describes the search time of the top-10 and the top-
50 objects of 10 attributes with less than 10 attributes used in a query.
                                              Top-k Search in Product Catalogues               9


Fig.6. Search time of the top-10 (left) and the top-50 (right) objects in ms over 100 000 random
objects of 10 attributes with uniform distribution with 2, 3, 5, 7, 9 attributes in a query


Fig.7. Search time of the top-10 (left) and the top-50 (right) objects in ms over 1 000 000
random objects of 10 attributes with exponential distribution with 2, 3, 5, 7, 10 attributes in a
query

   As we can see on the charts above the R*-tree based algorithm is fast even for a
small number of attributes in the query. In farther tests we observe that this property of
R*-tree based algorithm is preserved also for a bigger set of objects (1 000 000), more
attributes (20) and also for other distributions (exponential, Gaussian).


Fig.8. Search time of the top-10 (left) and the top-50 (right) objects in ms over 1 000 000
random objects of 20 attributes with Gaussian distribution with 2, 3, 4, 6, 8, 10, 12, 15, 20
attributes in a query
10       Martin Šumák, Peter Gurský


Fig.9. Number of IOs in the search of the top-10 (left) and the top-20 (right) objects over
1 000 000 random objects of 20 attributes with Gaussian distribution with 2, 3, 4, 6, 8, 10, 12,
15, 20 attributes in a query

   The tests show that R*-tree based approach is much faster than the B+trees based
approach. The remarkable speedup is reached by decreasing the number of IOs – from
several B+trees to one R*-tree.


6 Related work

This paper studies the top-k search for complex user preferences composed of arbi-
trary local preferences and a monotone combination function. Local preferences and
combination function together generate a non-monotone function giving the overall
value.
   The area of the top-k search was extensively researched in the last 7-11 years. The
main stream of the top-k search algorithms [1, 14, 15, 16, 17], we call them TA-like
algorithms, considers having a monotone combination function and possibly distri-
buted ordered lists for each attribute.
   The query composed of local preferences and monotone combination function was
introduced in [2]. As shown in section 3 the simulation of sorted access to the lists
allows us to use TA-like algorithms.
   A special branch of top-k algorithms is considered to be embedded in RDBMS
[18, 19, 20]. These approaches are concerned with augmenting the query optimizer to
consider rank-joins during plan evaluation. Optimization can be effective especially in
the case of very selective attributes. The rank-join algorithms require ordered data on
input similarly to TA-like algorithms. The way of ordering is not considered or the
ordering of attribute domains is used implicitly.
   The approaches in [5, 6] do the top-k search with an arbitrary (also non-monotone)
query analyzing the aggregation function with mathematical methods. If a ranking
function analysis over any domain sub-region is possible (to find the maximum and
possibly recognize monotonicity), according to authors this approach is able to find
the top-k objects in an effective way. In our opinion this analysis is rather difficult to
be done for an arbitrary function. However, we could not proceed in further analysis
of this approach because the source codes or a deeper description of the algorithms
are not available.
                                            Top-k Search in Product Catalogues           11


   R-tree is designed especially for multidimensional range queries and similarity
search (nearest neighbor queries). Moreover R-tree can be effectively used for top-k
queries. The idea of using R-tree for the top-k search is briefly presented in [7] where
authors compare the top-k search over R-tree and over their new index called “Ranked
Join Index”. They consider only a simple weighted sum as a top-k query. Note that our
top-k query, consisting of m fuzzy functions f1,…,fm and a monotone combination
function C, is more complex. The composition of fuzzy functions and a combination
function is not monotone in general, thus our approach has a higher expressive power.
Since the “Ranked Join Index” requires a monotone top-k query, it cannot be used
with our query.


7 Conclusions and future work

Unlike separate ordered lists in the original TA-like algorithms our approach uses
single R-tree or R*-tree. Since R-tree is a multidimensional index, each attribute can
be represented as one dimension. A disadvantage in comparison to the original TA-
like algorithms is that we must know the values of all objects’ attributes and all the
data must be stored locally in one structure. We must have the whole R-tree prepared
prior to the query (we emphasize that we have to prepare the R*-tree only once, not
prior to each query). We assume that a condition of locally accessible data prepared in
advance is performable for almost all attributes. Nevertheless in the case of dynamic,
remote or not ordinal attributes, the TA-like algorithms and the R-tree based approach
can be combined. Since algorithm for the top-k search over R-tree can be used to
produce an ordered list, we can carry out the top-k search using a TA-like algorithm.
One ordered list for a TA-like algorithm can be simulated by the top-k search algo-
rithm over R-tree and several (not so many) ordered lists can be obtained in other way,
i.e. simulation over remote B+tree. In this case we can dramatically decrease the num-
ber of ordered lists read by a TA-like algorithm and consequently the overall time of
the top-k search as well. This is the idea of our future work.


Acknowledgement

This work was partially supported by VEGA 1/0131/09 and by the Agency of the
Slovak Ministry of Education for the Structural Funds of the EU, under project ITMS:
26220120007.


References

1.   R. Fagin, A. Lotem, M. Naor, “Optimal Aggregation Algorithms for Middleware”, in
     Proc. ACM PODS, 2001
2.   P. Gurský, R. Pázman, P. Vojtáš, “On supporting wide range of attribute types for top-k
     search”, Computing and Informatics, Vol. 28, no. 4, 2009, ISSN 1335-9150, p. 483-513..
12        Martin Šumák, Peter Gurský


3.    T. Horváth, “A Model of User Preference Learning for Content-Based Recommender
      Systems”, Computing and informatics Vol. 28 (2009), No. 4: SAV, Slovakia, 2009, ISSN
      1335-9150, p: 453-481.
4.    V. Vaneková, P. Vojtáš, “Fuzziness as a Model of User Preference in Semantic Web
      Search”, in Proc. IFSA-EUSFLAT 2009, pp. 998-1003
5.    Z. Zhang, S. Hwang, K. Chang, M. Wang, C. Lang, Y. Chang, “Boolean + Ranking:
      Querying a Database by K-Constrained Optimization,” In SIGMOD 2006.
6.    D. Xin, J. Han, K. Chang, “Progressive and Selective Merge: Computing Top-K with Ad-
      Hoc Ranking Functions,” in SIGMOD 2007.
7.    P. Tsaparas, T. Palpanas, Y. Kotidis, N. Koudas, D. Srivastava, “Ranked Join Indices”,
      ICDE, pp.277, 2003
8.    A. Guttman, “R-Trees: A Dynamic Index Structure for Spatial Searching”, SIGMOD
      Conference 1984
9.    N. Beckmann, H.P. Kriegel, R. Schneider, B. Seeger, “The R*-Tree: An Efficient and
      Robust Access Method for Points and Rectangles”, SIGMOD Conference 1990: 322-331
10.   G. R. Hjaltason , H. Samet, “Distance browsing in spatial databases”, ACM Transactions
      on Database Systems (TODS), v.24 n.2, p.265-318, June 1999
11.   G. R. Hjaltason, H. Samet, “Ranking in Spatial Databases”, Proceedings of the 4th Inter-
      national Symposium on Advances in Spatial Databases, p.83-95, August 06-09, 1995
12.   P. Gurský, T. Horváth, R. Novotný, V. Vaneková, P. Vojtáš, “UPRE: User preference
      based search system”, In IEEE/WIC/ACM Web Inteligence (2006)
13.   T. Horváth, P. Vojtáš, “Ordinal Classification with Monotonicity Constraints”, In Proc.
      6th Industrial Conference on Data Mining ICDM (2006)
14.   R. Akbarinia, E. Pacitti, P. Valduriez, “Best Position Algorithms for Top-k Queries”, in
      VLDB, 2007.
15.   H. Bast, D.Majumdar, R. Schenkel, M. Theobald, G. Weikum, “IOTop-k: Index-Access
      Optimized Top-k Query Processing”. In VLDB, 2006.
16.   W. Balke, U. Güntzer, “Multi-Objective Query Processing for Database Systems”, in
      VLDB, 2004.
17.   U. Güntzer, W. Balke, W. Kiessling, “Towards Efficient Multi-Feature Queries in Hetero-
      geneous Enviroments”, in ITCC, 2001.
18.   F. Ilyas, W. Aref, A. Elmagarmid, “Supporting Top-k Join Queries in Relational Data-
      base”, in VLDB, 2003.
19.   F. Ilyas, R. Shah, W.G. Aref, J. S. Vitter, A.K. Elmagarmid, “Rank-Aware Query Optimi-
      zation”, in SIGMOD, 2004.
20.   C. Li, K. Chang, I. F. Ilyas, S. Song, “RankSQL: Query Algebra and Optimization for
      Relational Top-k Queries”, in SIGMOD, 2005.
21.   P. Gurský, “Towards better semantics in the multifeature querying”, proceedings of Date-
      so 2006, ISBN 80-248-1025-5, pages 63-73, 2006