CEUR-WS Vol-129, paper 1: ”Comparison of parallel and random approach to a candidate list in the multifeature querying”, https://ceur-ws.org/Vol-129/paper1.pdf (dblp: https://dblp.org/rec/conf/dateso/Gursky05)
       Comparison of parallel and random approach to
         a candidate list in the multifeature querying∗


                          Peter Gurský

        Institute of Computer Science, Faculty of Science,
  P.J. Šafárik University in Košice, Jesenná 9, 040 01 Košice,
                         Slovak Republic
                         gursky@upjs.sk



         Abstract. In the field of multifeature querying it is possible to
         use many heuristics to retrieve the top k objects with a low number
         of accesses to the sources. When the sources contain many equal values,
         it is often hard to choose which source should be accessed next. In this
         paper we compare the previous random approach with the parallel
         approach to the set of current candidate sources.


      Key words: multifeature querying, top-k objects, aggregation


1      Introduction

We often want to find the best object or the top k objects in a possibly huge
set of objects. Whether one object is better than another is based on the
properties of the objects. Such properties are typically fuzzy. For example,
when we want to find the top k hotels, we can look at the distance from the
beach, the price per night, the number of stars, travel expenses, etc. We need
to specify how to compute the overall score of each object to obtain an order
of the objects. Moreover, all these particular data can be accessible via
different sources (web services).
    There are several algorithms in this area solving this problem. Ronald Fagin
introduced ”Fagin’s algorithm”, which solved this problem for the first time [6].
Fagin et al. [2] presented the ”threshold” algorithm, which made the search much
faster. Güntzer et al. [1] defined the ”quick-combine” algorithm using the first
heuristic. Other heuristics were presented by P. Gurský and R. Lencses [4].
    The ”quick-combine” algorithm was originally developed for multimedia
data. Such data are typically continuous, i.e. it is very unusual to have two
objects with the same value of a property. The experimental comparison of the
heuristics [4] showed that the heuristic used in [1] is quite ineffective when
∗
     This work was partially supported by the grant VEGA 1/0385/03 and by the state
     research and development task ”Tools for acquiring, organizing and maintaining
     knowledge in an environment of heterogeneous information sources” of the cross-
     sectional state programme ”Building the Information Society”.


K. Richta, V. Snášel, J. Pokorný (Eds.): Dateso 2005, pp. 1–8, ISBN 80-01-03204-3.


the sources have few discretized values, e.g. the number of stars of hotels. In this
paper we show a possible improvement of the performance of this heuristic
over discretized data. We also try to use the same approach with the other
relevant heuristics presented in [4].
    In chapter 2 we describe a formal model of the data. In chapter 3 we present
a generalized version of all the mentioned algorithms and compare three different
heuristics. The experimental comparison of the heuristics is shown in chapter
4. Chapter 5 concludes our paper.


2    Model

Assume we have a finite set of objects; the cardinality of this set is N . Every
object x has m attributes x1 , . . . , xm . All objects (or identifiers of objects) are
stored in lists L1 , . . . , Lm , each of length N . Objects in list Li are ordered
descending by the values of attribute xi . We can define two functions to access
objects in the lists. Let x be an object. Then si (x) is the grade (or score, rank)
of object x in the list Li , and ri (j) is the object at the j-th position of the list
Li . Using the function si (x) we can realize the random access1 to the lists. The
second type of access we will use is the sorted access2 . Using this type of access
the grades of objects are obtained by proceeding through the list sequentially
from the top.
    We also have a monotone aggregation function F , which combines the grades
of an object x from the lists L1 , . . . , Lm . The overall value of an object x is
denoted S(x) and is computed as F (s1 (x), . . . , sm (x)).
    Our task is to find the top k objects with the highest overall grades. We also
want to minimize time and space. That means we want to use as few sorted and
random accesses as possible.
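As an illustration, the model above can be sketched in a few lines of Python. The data, the function names (s, r, F, S) and the choice of the arithmetic mean as the aggregation function are our own illustrative assumptions, not taken from the paper:

```python
# A minimal sketch of the data model of Section 2 (illustrative, our own names).
# Each list L_i holds (object, grade) pairs sorted descending by grade.
from typing import List, Tuple

# Example: N = 4 objects, m = 2 attribute lists.
lists: List[List[Tuple[str, float]]] = [
    [("a", 0.9), ("b", 0.8), ("c", 0.4), ("d", 0.1)],  # L1, sorted descending
    [("c", 1.0), ("a", 0.7), ("d", 0.6), ("b", 0.2)],  # L2, sorted descending
]

def s(i: int, x: str) -> float:
    """Random access s_i(x): the grade of object x in list L_i."""
    return dict(lists[i])[x]

def r(i: int, j: int) -> str:
    """Sorted access r_i(j): the object at the j-th (1-based) position of L_i."""
    return lists[i][j - 1][0]

def F(grades: List[float]) -> float:
    """A monotone aggregation function, here the arithmetic mean."""
    return sum(grades) / len(grades)

def S(x: str) -> float:
    """Overall grade S(x) = F(s_1(x), ..., s_m(x))."""
    return F([s(i, x) for i in range(len(lists))])
```

With these toy data, for example, S("a") aggregates the grades 0.9 and 0.7 of object a from the two lists.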


3    Generalized Threshold algorithm and heuristics

For each list Li , let ui = si (ri (zi )) be the value of the attribute of the last
object seen under sorted access, where zi is the number of that position. Define
the threshold value τ to be F (u1 , . . . , um ). Because we assume a monotone
aggregation function F and lists sorted descending by their values, the threshold
τ is a value that none of the still unseen objects can exceed [2]. Hence when all
objects in the top k list have values greater than or equal to the threshold, this
top k list is final and there is no unseen object with a greater value. This
property is essential for the correctness of the algorithm.
    Let z = (z1 , . . . , zm ) be a vector which assigns to each i = 1, . . . , m the
position in list Li last seen under sorted access. Let H be a heuristic that decides
which list (or lists) should be accessed next under sorted access (notice that
1
  Random access - direct access via an indexing mechanism. Please do not confuse
  this term with the term ”random approach”. The random approach to a set means
  that we choose one element of the set randomly.
2
  Sorted access - sequential access to a sorted list.


heuristics can change during the computation). Moreover, assume that H is
such that for all j ≤ m we have H(z)j = zj or H(z)j = zj + 1, and there is at
least one i ≤ m such that H(z)i = zi + 1. The set {i ≤ m : H(z)i = zi + 1} is
called the set of candidate lists (or simply candidates) for the next sorted access.
    In this paper we use three types of heuristics.
    The first heuristic (denoted H1) does the sorted access in all m lists in
parallel. That is, for each i ≤ m we have H(z)i = zi + 1. This heuristic was first
presented by Fagin et al. [2] in the ”Threshold algorithm”. We use this kind of
heuristic only (if ever) in the first phase of the computation to retrieve the
beginnings of the lists. The next two heuristics are used in the rest of the
computation.
    The use of the ((δF/δx) ∗ ∆x) heuristic (H2) was first presented by Güntzer
et al. [1] as part of the ”Quick-combine” algorithm. Consider the following
inequality: to keep the algorithm correct, for each object x in the final top k
list the following must hold:

                               S(x) = F (s1 (x), . . . , sm (x)) ≥ τ                              (1)

    Hence, once we can show that this inequality holds, we have the final top
k list. Obviously, there are two ways to make (1) hold: increase the left side
or decrease the right side. Heuristic H2 tries to decrease τ as fast as possible.
The criterion for a list Li , i ≤ m, to be added to the set of candidates for the
next sorted access is to have ∆i maximal, where ∆i is defined as:

   ∆i = (δF/δxi )− (s1 (r1 (z1 )), . . . , sm (rm (zm ))) ∗ (si (ri (zi − p)) − si (ri (zi )))   (2)

    The constant p is some suitable (small) natural number. Hence ∆i is the
product of the partial derivative of the aggregation function F from the left
at the point (s1 (r1 (z1 )), . . . , sm (rm (zm ))) and the expected change of values in
p steps (the ∆x factor) of the i-th list. When we have ∆i for each i ≤ m, we can
set the value of H(z)i . Heuristic H2 sets H(z)i = zi + 1 if ∆i = max{∆j ; j ≤ m};
otherwise H(z)i = zi . The only condition we require of F is continuity from
the left.
    The ((δF/δx) ∗ x) heuristic (H3) is a variation of the previous one. This
heuristic was first presented by Gurský et al. [4]. Instead of the ∆x factor, H3
uses an x factor, i.e. the last seen value in the i-th list. The criterion for this
heuristic is:
                 χi = (δF/δxi )− (s1 (r1 (z1 )), . . . , sm (rm (zm ))) ∗ si (ri (zi ))          (3)
     The criterion (3) computes the partial derivative of F from the left at
the point (s1 (r1 (z1 )), . . . , sm (rm (zm ))) and multiplies it by the last seen value
si (ri (zi )) in Li . This heuristic sets H(z)i = zi + 1 if χi = max{χj ; j ≤ m}; oth-
erwise H(z)i = zi . Again we need continuity of the function F from the left.
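For the common case where F is a weighted mean, the left derivative δF/δxi is simply the weight wi, and the two criteria can be sketched as follows. This is our own illustrative code (function names, signatures and data are hypothetical, not from the paper):

```python
# Candidate selection under the H2 (Delta_i) and H3 (chi_i) criteria when F is
# a weighted mean, so dF/dx_i = w_i. Illustrative sketch with our own names.

def h2_candidates(weights, last_seen, seen_p_steps_ago):
    """Delta_i = w_i * (s_i(r_i(z_i - p)) - s_i(r_i(z_i))); lists maximizing it."""
    deltas = [w * (prev - cur)
              for w, prev, cur in zip(weights, seen_p_steps_ago, last_seen)]
    best = max(deltas)
    return [i for i, d in enumerate(deltas) if d == best]

def h3_candidates(weights, last_seen):
    """chi_i = w_i * s_i(r_i(z_i)); lists maximizing it."""
    chis = [w * v for w, v in zip(weights, last_seen)]
    best = max(chis)
    return [i for i, c in enumerate(chis) if c == best]

# With many equal (discretised) values the Delta-x factor is often zero,
# so H2 frequently yields several tied candidate lists at once:
print(h2_candidates([0.5, 0.5], last_seen=[3, 3], seen_p_steps_ago=[3, 3]))  # [0, 1]
print(h3_candidates([0.7, 0.3], last_seen=[3, 3]))                           # [0]
```

The tied-candidate case in the last lines is exactly the situation where the choice between the random and the parallel approach, compared later in the paper, matters.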




   [Figure: performance of the parallel approach relative to the random approach
   (= 100%) for the heuristics δF/δx(x), δF/δx(∆x) and x/∆x switch, plotted
   for k = 1, 5, 10, 20, 30, 50, 100.]

     Fig. 1. Change of performance using parallel approach (benchmark data)


    When the aggregation function is a simple weighted mean, the derivative
used by these heuristics is a constant, more precisely the weight of the at-
tribute.
    Now we can describe the generalized Threshold algorithm:

0. z := (0, . . . , 0); in the case of the ((δF/δx) ∗ ∆x) or ∆x/x algorithms, set a
   suitable small natural number p.
1. Set the heuristic H:
     • ((δF/δx) ∗ x): H := H3
     • ((δF/δx) ∗ ∆x): if any zi < p then H := H1; otherwise H := H2
     • ∆x/x: if any zi < p then H := H1; otherwise if H = H1 then H := H2; if
        H = H2 then H := H3; and if H = H3 then H := H2.
2. • parallel approach: Do the sorted access in parallel to each of the sorted
        lists, to all positions where H(z)i = zi + 1. Put zi = H(z)i .
     • random approach: Do the sorted access to one randomly chosen sorted
        list Li with H(z)i = zi + 1. Put zi = zi + 1 and for each j ≤ m, j ≠ i do
        nothing.
3. First control: Compute the threshold value τ . As soon as at least k objects
   have been seen whose grade is at least equal to τ , go to step 6.
4. For every object x that was seen under sorted access in step 2, do the
   random access to the other lists to find the grade si (x) of the object in every
   list. Then compute the grade S(x) = F (s1 (x), . . . , sm (x)) of object x. If this
   grade is one of the k highest ones seen so far, then remember object x and
   its grade S(x).



   [Figure: performance of the parallel approach relative to the random approach
   (= 100%) for the heuristics δF/δx(x), δF/δx(∆x) and x/∆x switch, plotted
   for k = 1, 5, 10, 20, 30, 50, 100.]

        Fig. 2. Change of performance using parallel approach (artificial data)


 5. Second control: As soon as at least k objects have been seen whose grade is
    at least equal to τ , go to step 6; otherwise go to step 1.
 6. Let Y be a set containing the k objects that have been seen with the highest
    grades. The output is then the graded set {(x, S(x)) : x ∈ Y }.
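The steps above can be sketched in Python. This is a simplified illustration, not the authors' implementation: it uses only the parallel heuristic H1 in step 2 (so there is no heuristic switching), the data model of Section 2, and our own toy data and names:

```python
# A minimal sketch of the generalized threshold algorithm with heuristic H1
# (one parallel sorted access in every list per round). Illustrative only.
import heapq
from typing import Callable, List, Tuple

def top_k(lists: List[List[Tuple[str, float]]],
          F: Callable[[List[float]], float],
          k: int) -> List[Tuple[str, float]]:
    m = len(lists)
    grade = [dict(L) for L in lists]   # random-access tables s_i(x)
    z = [0] * m                        # last position seen in each list
    best = {}                          # object -> overall grade S(x)

    while True:
        # Step 2 (parallel approach): one sorted access in every list.
        seen_now = set()
        for i in range(m):
            if z[i] < len(lists[i]):
                seen_now.add(lists[i][z[i]][0])
                z[i] += 1
        if not seen_now:               # all lists exhausted
            return heapq.nlargest(k, best.items(), key=lambda kv: kv[1])
        # Step 4: random access for each newly seen object, compute S(x).
        for x in seen_now:
            if x not in best:
                best[x] = F([grade[i][x] for i in range(m)])
        # Steps 3/5: threshold tau = F(u_1, ..., u_m) from the last seen values.
        tau = F([lists[i][z[i] - 1][1] for i in range(m)])
        top = heapq.nlargest(k, best.items(), key=lambda kv: kv[1])
        if len(top) == k and all(g >= tau for _, g in top):
            return top                 # step 6: the final graded set

lists = [
    [("a", 0.9), ("b", 0.8), ("c", 0.4), ("d", 0.1)],
    [("c", 1.0), ("a", 0.7), ("d", 0.6), ("b", 0.2)],
]
mean = lambda gs: sum(gs) / len(gs)
print(top_k(lists, mean, k=1))  # the single best object by the mean grade
```

Replacing the step-2 loop with an access to a single candidate list would give the random-approach variant; the heuristics H2 and H3 would narrow the set of lists accessed in each round.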

    The individual algorithms differ in the heuristic they set in step 1 and in
the approach they use in step 2. In short, the ((δF/δx) ∗ x) algorithm uses
heuristic H3 directly from the beginning. The Quick-combine (or ((δF/δx) ∗
∆x)) algorithm uses heuristic H1 for the first p steps in each list and then
heuristic H2. The ∆x/x algorithm uses heuristic H1 for the first p steps in each
list, too. After that it switches between the heuristics H2 and H3.
    The original algorithms choose the random approach in step 2. All the men-
tioned algorithms can have their ”parallel variant” too. In our experiments we
compare all 6 possible variants.


4     Experiments
4.1   Testing data
Benchmark data The first sort of data is real data that comes from randomly
generated queries in an information retrieval system supported by a relational
database. We used different combinations of local and global weights as different
ways of weighting the occurrences of terms in documents, to generate 6 sets
of benchmark data. Each set includes over 25 000 objects (documents) with
50 different attributes (terms). To measure all the particular experimental results we



   [Figure: total number of sorted and random accesses (SA+RA) for the heuristics
   δF/δx(x), δF/δx(∆x)-par, x/∆x switch and δF/δx(∆x), plotted for
   k = 1, 5, 10, 20, 30, 50, 100.]

                        Fig. 3. Number of all accesses (benchmark data)


compare the average values from the 6 sets of these benchmark data, using
aggregation functions with randomly chosen weights for each list. The histograms
of the data are exponential: there is a very low number of objects with high
values and a lot of objects with low values. Such a distribution is typical in the
information retrieval area, but also in many other areas.


Artificial data The second sort of data was generated with various distributions
of values. We used 2 exponential and 2 logarithmic distributions with 10 000
objects and 6 types of aggregation functions. The values of the attributes were
rounded to 10 discrete values. Such a discretisation is quite common in real data,
e.g. the number of stars of hotels, rating classes of companies, and other human-
produced ratings. A finer discretisation can be e.g. the prices of some product or
an estimate of the length of a trip. Continuous data are typical for some physical
experiments, precise measurements or multimedia data. In this paper we focus
on the discrete data. Using different combinations of the source data and the
aggregation functions we obtained 16 different inputs for the algorithms. In the
final results we use the averages of the particular results.


4.2      Results

In our experiments we wanted to compare the random and the parallel approach
to the set of candidates. In Figures 1 and 2 we take the random approach as the
base (= 100%). We can see the performance of the parallel approach compared



   [Figure: total number of sorted and random accesses (SA+RA) for the heuristics
   δF/δx(x), δF/δx(∆x)-par, x/∆x switch and δF/δx(∆x), plotted for
   k = 1, 5, 10, 20, 30, 50, 100.]

                           Fig. 4. Number of all accesses (artificial data)




with the random approach for each heuristic. Both the artificial and the bench-
mark data show that a different approach to the candidates does not change
anything for the (δF/δx) ∗ x heuristic. For the x/∆x heuristic the random ap-
proach seems to be better in most cases.

    In our results, the new parallel approach clearly improved the performance
of the (δF/δx) ∗ ∆x heuristic over discretised values. This heuristic was originally
developed for multimedia data. Using our improvement this heuristic can be
quite efficient over discretized data too, but as we can see in Figures 3 and 4,
when we work over discretized values the new approach to the (δF/δx) ∗ ∆x
heuristic is still not better than the quite stable x/∆x heuristic.

    Why do we have such results? The parallel version of the (δF/δx) ∗ x algo-
rithm has almost the same results as its random variant, perhaps because there
was almost no situation in which we had more than one candidate in the candi-
date set. On the other hand, the (δF/δx) ∗ ∆x algorithm had many such situa-
tions. The reason for this is the fact that the expression (si (ri (zi − p)) − si (ri (zi )))
almost always equals zero for small p and discrete values in the lists. As the
experiments show, when this heuristic does not know which list to prefer, it is
better to access all the lists. For the x/∆x heuristic the random approach seems
to be better when k is greater than or equal to 5.


5   Conclusions

We proposed the new parallel approach to the candidate set and compared it
with the previous random approach. We experimentally showed that this type
of approach improves the (δF/δx) ∗ ∆x heuristic over discrete data. On the other
hand, the x/∆x heuristic keeps its first place in the lowest number of accesses,
as was shown in [4].


References
 1. U. Güntzer, W. Balke, W. Kiessling: Optimizing Multi-Feature Queries for Image
    Databases, Proceedings of the 26th VLDB Conference, Cairo, Egypt, 2000
 2. R. Fagin: Combining fuzzy information from multiple systems, J. Comput. System
    Sci., 58:83-99, 1999
 3. R. Fagin, A. Lotem, M. Naor: Optimal Aggregation Algorithms for Middleware,
    Proc. 20th ACM Symposium on Principles of Database Systems, pages 102-113,
    2001
 4. P. Gurský, R. Lencses: Aspects of integration of ranked distributed data, Proc.
    Datakon, ISBN 80-210-3516-1, pages 221-230, 2004
 5. R. Fagin: Combining Fuzzy Information: an Overview, ACM SIGMOD Record 31,
    Database principles column, pages 109-118, 2002
 6. R. Fagin: Combining fuzzy information from multiple systems, 15th ACM Sym-
    posium on Principles of Database Systems, pages 216-226, 1996
 7. P. Gurský, R. Lencses, P. Vojtáš: Algorithms for user dependent integration of
    ranked distributed information, technical report, 2004