                      Probabilistic Iterative Expansion of Candidates
                               in Mining Frequent Itemsets

                                 Attila Gyenesei and Jukka Teuhola
          Turku Centre for Computer Science, Dept. of Inf. Technology, Univ. of Turku, Finland
                                 Email: {gyenesei,teuhola}@it.utu.fi


                        Abstract

A simple new algorithm is suggested for frequent itemset mining, using item probabilities as the basis for generating candidates. The method first finds all the frequent items, and then generates an estimate of the frequent sets, assuming item independence. The candidates are stored in a trie where each path from the root to a node represents one candidate itemset. The method expands the trie iteratively, until all frequent itemsets are found. Expansion is based on scanning through the data set in each iteration cycle, and extending the subtries based on observed node frequencies. Trie probing can be restricted to only those nodes which possibly need extension. The number of candidates is usually quite moderate: for dense datasets 2-4 times the number of final frequent itemsets, for non-dense sets somewhat more. In practical experiments the method has been observed to make clearly fewer passes than the well-known Apriori method. As for speed, our non-optimised implementation is in some cases faster, in some others slower than the comparison methods.

1. Introduction

We study the well-known problem of finding frequent itemsets in a transaction database, see [2]. A transaction in this case means a set of so-called items. For example, a supermarket basket is represented as a transaction, where the purchased products represent the items. The database may contain millions of such transactions. Frequent itemset mining is the task of finding those subsets of items that occur in at least a given minimum number of transactions. This is an important basic task, applicable in solving more advanced data mining problems, for example discovering association rules [2]. What makes the task difficult is that the number of potential frequent itemsets is exponential in the number of distinct items.

In this paper, we follow the notations of Goethals [7]. The overall set of items is denoted by I. Any subset X ⊆ I is called an itemset. If X has k items, it is called a k-itemset. A transaction is an itemset identified by a tid. A transaction with itemset Y is said to support itemset X if X ⊆ Y. The cover of an itemset X in a database D is the set of transactions in D that support X. The support of itemset X is the size of its cover in D. The relative frequency (probability) of itemset X with respect to D is

    P(X, D) = Support(X, D) / |D|                                                  (1)

An itemset X is frequent if its support is greater than or equal to a given threshold σ. We can also express the condition using a relative threshold for the frequency: P(X, D) ≥ σrel, where 0 ≤ σrel ≤ 1.
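
To make these definitions concrete, the following C sketch computes Support(X, D) and the relative frequency of formula (1) for a toy database (the four transactions used later in Fig. 2). It is only an illustration of the definitions, not part of our implementation; all names are ours.

    #include <stdio.h>

    /* A transaction: a sorted array of item numbers. */
    struct transaction { const int *items; int n; };

    /* Does transaction t support itemset x (sorted, k items), i.e. x ⊆ t? */
    static int supports(const struct transaction *t, const int *x, int k)
    {
        int i = 0, j = 0;
        while (i < k && j < t->n) {
            if (x[i] == t->items[j])     { i++; j++; }
            else if (x[i] > t->items[j]) { j++; }
            else return 0;               /* item x[i] is missing from t */
        }
        return i == k;
    }

    /* Support(X, D): the number of transactions in D that support X. */
    static int support(const struct transaction *d, int ntrans,
                       const int *x, int k)
    {
        int c = 0;
        for (int t = 0; t < ntrans; t++)
            if (supports(&d[t], x, k)) c++;
        return c;
    }

    int main(void)
    {
        int t0[] = {0, 3}, t1[] = {1, 2}, t2[] = {0, 1, 3}, t3[] = {1};
        struct transaction d[] = { {t0, 2}, {t1, 2}, {t2, 3}, {t3, 1} };
        int x[] = {0, 3};
        int s = support(d, 4, x, 2);
        printf("Support = %d, P(X, D) = %.3f\n", s, (double)s / 4); /* 2, 0.500 */
        return 0;
    }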

There are variants of the basic 'all-frequent-itemsets' problem, namely the maximal and closed itemset mining problems, see [1, 4, 5, 8, 12]. However, here we restrict ourselves to the basic task.

A large number of algorithms have been suggested for frequent itemset mining during the last decade; for surveys, see [7, 10, 15]. Most of the algorithms share the same general approach: generate a set of candidate itemsets, count their frequencies in D, and use the obtained information in generating more candidates, until the complete set is found. The methods differ mainly in the order and extent of candidate generation. The most famous is probably the Apriori algorithm, developed independently by Agrawal et al. [3] and Mannila et al. [11]. It is a representative of breadth-first candidate generation: it first finds all frequent 1-itemsets, then all frequent 2-itemsets, etc. The core of the method is clever pruning of candidate k-itemsets for which there exists a non-frequent (k-1)-subset. This is an application of the obvious monotonicity property: all subsets of a frequent itemset must also be frequent. Apriori is essentially based on this property.

The other main candidate generation approach is depth-first order, of which the best-known representatives are Eclat [14] and FP-growth [9] (though the 'candidate' concept in the context of FP-growth is disputable). These two are generally considered to be among the fastest algorithms for frequent itemset mining. However, we shall mainly use Apriori as a reference method, because it is technically closer to ours.

Most of the suggested methods are analytical in the sense that they are based on logical inductions to restrict the number of candidates to be checked. Our approach (called PIE) is probabilistic: based on relative item frequencies, we compute estimates for itemset frequencies in candidate generation. More precisely, we generate iteratively improving approximations (candidate itemsets) to the solution. Our general endeavour has been to develop a relatively simple method, with fast basic steps and few iteration cycles, at the cost of a somewhat increased number of candidates. However, another goal is that the method should be robust, i.e. it should work reasonably fast for all kinds of datasets.

2. Method description

Our method can be characterized as a generate-and-test algorithm, such as Apriori. However, our candidate generation is based on probabilistic estimates of the supports of itemsets. The testing phase is rather similar to Apriori, but involves special book-keeping to lay a basis for the next generation phase.

We start with a general description of the main steps of the algorithm. The first thing to do is to determine the frequencies of all items in the dataset, and select the frequent ones for subsequent processing. If there are m frequent items, we internally identify them by numbers 0, …, m-1. For each item i, we use its probability (relative frequency) P(i) in the generation of candidates for frequent itemsets.

The candidates are represented as a trie structure, which is normal in this context, see [7]. Each node is labelled by one item, and a path of labels from the root to a node represents an itemset. The root itself represents the empty itemset. The paths are sorted, so that a subtrie rooted by item i can contain only items > i. Note also that several nodes in the trie can have the same item label, but not on a single path. A complete trie, storing all subsets of the whole itemset, would have 2^m nodes and be structurally a binomial tree [13], where on level j there are (m choose j) nodes; see Fig. 1 for m = 4.

[Figure 1. The complete trie for 4 items.]

The trie is used for book-keeping purposes. However, it is important not to build the complete trie, but only some upper part of it, so that the nodes (i.e. their root paths) represent reasonable candidates for frequent sets. In our algorithm, the first approximation for candidate itemsets is obtained by computing estimates of their probabilities, assuming independence of item occurrences. This means that, for example, for an itemset {x, y, z} the estimated probability is the product P(x)P(y)P(z). Nodes are created in the trie from the root down along all paths as long as the path-related probability is not less than the threshold σrel. Note that the probability values are monotonically non-increasing on the way down. Fig. 2 shows an example of the initial trie for a given set of transactions (with m = 4). Those nodes of the complete trie (Fig. 1) that do not exist in the actual trie are called virtual nodes, and are marked with dashed circles in Fig. 2.

[Figure 2. An initial trie for the transaction set {(0, 3), (1, 2), (0, 1, 3), (1)}, with minimum support threshold σ = 1/6. The virtual nodes with probabilities < 1/6 are shown using dashed lines.]
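
As a sketch of this initial generation step, the following C fragment creates exactly those nodes whose estimated probability (the product of the item probabilities along the root path) reaches σrel. It is a simplified illustration, not our measured implementation; the node layout and all names are chosen for this presentation only. Run with the item probabilities of Fig. 2 (1/2, 3/4, 1/4, 1/2) and σrel = 1/6, it produces the solid nodes of Fig. 2.

    #include <stdlib.h>

    /* One trie node; the path of item labels from the root spells out
       a candidate itemset. */
    struct node {
        int item;               /* item label 0..m-1, or -1 for the root */
        int support, pending;   /* filled in by the counting phase       */
        int ready;              /* 'ready' flag of Section 3             */
        struct node **child;    /* child[j] != NULL iff child j is real  */
    };

    static int m;               /* number of frequent items     */
    static double *prob;        /* prob[i] = P(i)               */
    static double sigma_rel;    /* relative frequency threshold */

    static struct node *new_node(int item)
    {
        struct node *x = calloc(1, sizeof *x);
        x->item = item;
        x->child = calloc(m, sizeof *x->child); /* m-item-1 slots would do */
        return x;
    }

    /* Create all descendants of x whose path probability reaches
       sigma_rel; items with a smaller product stay virtual. */
    static void build(struct node *x, double path_prob)
    {
        for (int j = x->item + 1; j < m; j++) {
            double p = path_prob * prob[j];    /* independence assumption */
            if (p >= sigma_rel) {
                x->child[j] = new_node(j);
                build(x->child[j], p);
            }
        }
    }

    int main(void)
    {
        double p[] = { 0.5, 0.75, 0.25, 0.5 }; /* item probabilities, Fig. 2 */
        m = 4; prob = p; sigma_rel = 1.0 / 6;
        struct node *root = new_node(-1);
        build(root, 1.0);                      /* the solid trie of Fig. 2 */
        return 0;
    }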

The next step is to read the transactions and count the true number of occurrences for each node (i.e. the related path support) in the trie. Simultaneously, for each visited node, we maintain a counter called pending support (PS): the number of transactions for which at least one virtual child of the node would match. The pending support will be our criterion for the expansion of the node: if PS(x) ≥ σ, then it is possible that a virtual child of node x is frequent, and the node must be expanded. If there are no such nodes, the algorithm is ready, and the result can be read from the trie: all nodes with support ≥ σ represent frequent itemsets.
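
A minimal sketch of this counting pass is given below, reusing the node layout of the previous sketch. For each transaction (a sorted array of frequent-item numbers), every matched node gets its support increased, and a node for which some remaining transaction item is a virtual child gets its pending support increased. Again, this is our illustration of the idea, not the actual code.

    /* Node layout as in the previous sketch. */
    struct node { int item, support, pending, ready; struct node **child; };
    static int m;

    /* Count one transaction t (sorted, n frequent-item numbers) into the
       subtrie of x; t[from..n-1] are the items still usable below x.
       Initial call: count(root, t, 0, n). */
    static void count(struct node *x, const int *t, int from, int n)
    {
        x->support++;                    /* t supports the root path of x */
        int virtual_match = 0;
        for (int i = from; i < n; i++) {
            if (x->child[t[i]])
                count(x->child[t[i]], t, i + 1, n); /* real child: descend */
            else
                virtual_match = 1;       /* virtual child t[i] would match */
        }
        if (virtual_match) x->pending++; /* PS(x): at least one virtual hit */
    }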

Trie expansion starts the next cycle, and we iterate until the stopping condition holds. However, we must be very careful in the expansion: which virtual nodes should we materialize (and how deep, recursively), in order to avoid trie 'explosion', and yet approach the final solution? Here we apply item probabilities again. In principle, we could take advantage of all information available in the current trie (frequencies of subsets, etc.), as is done in the Apriori algorithm and many others. However, we prefer a simpler calculation, based on the global probabilities of items.

Suppose that we have a node x with pending support PS(x) ≥ σ. Assume that it has virtual child items v0, v1, …, vs-1 with global probabilities P(v0), P(v1), …, P(vs-1). Every transaction contributing to PS(x) has a match with at least one of v0, v1, …, vs-1. The local probability (LP) for a match with vi is computed as follows:

    LP(vi) = P(vi matches | one of v0, v1, …, vs-1 matches)
           = P((vi matches) ∧ (one of v0, v1, …, vs-1 matches)) / P(one of v0, v1, …, vs-1 matches)
           = P(vi matches) / P(one of v0, v1, …, vs-1 matches)
           = P(vi) / (1 − (1 − P(v0))(1 − P(v1)) ⋯ (1 − P(vs-1)))                  (2)

Using this formula, we get an estimated support ES(vi):

    ES(vi) = LP(vi) ⋅ PS(Parent(vi))                                               (3)

If ES(vi) ≥ σ, then we conclude that vi is expected to be frequent. However, in order to guarantee a finite number of iterations in the worst case, we have to relax this condition a bit. Since the true distribution may be very skewed, almost the whole pending support may belong to only one virtual child. To ensure convergence, we apply the following condition for child expansion in the kth iteration,

    ES(vi) ≥ α^k σ                                                                 (4)

with some constant α between 0 and 1. In the worst case this will eventually (when k is high enough) result in expansion, to get rid of a PS-value ≥ σ. In our tests, we used the heuristic value α = average probability of frequent items. The reasoning behind this choice is that it speeds up the local expansion growth by one level, on the average (k levels for α^k). This acceleration restricts the number of iterations efficiently. The largest extensions are applied only to the 'skewest' subtries, so that the total size of the trie remains tolerable. Another approach to choosing α would be a statistical analysis to determine confidence bounds for ES. However, this is left for future work.

Fig. 3 shows an example of trie expansion, assuming that the minimum support threshold σ = 80, α = 0.8, and k = 1. The item probabilities are assumed to be P(y) = 0.7, P(z) = 0.5, and P(v) = 0.8. Node t has a pending support of 100, related to its two virtual children, y and z. This means that 100 transactions contained the path from the root to t, plus either or both of items y and z, so we have to test for expansion. Our formula gives y a local probability LP(y) = 0.7 / (1−(1−0.7)(1−0.5)) ≈ 0.82, so the estimated support is 82 > α⋅σ = 64, and we expand y. However, the local probability of z is only ≈ 0.59, so its estimated support is 59, and it will not be expanded.

[Figure 3. An example of expansion for probabilities P(y) = 0.7, P(z) = 0.5, and P(v) = 0.8. Node t has PS = 100; its virtual children y and z get ES = 82 and ES = 59, so only y is materialized. Below y, ES(z) = 41 and ES(v) = 65.6, so only v is created.]

When a virtual node (y) has been materialized, we immediately test also its expansion, based on its ES-value, recursively. However, in the recursive steps we cannot apply formula (2), because we have no evidence about the children of y. Instead, we apply the unconditional probabilities of z and v in the estimation: ES(z) = 82⋅0.5 = 41 < α⋅σ = 64, and ES(v) = 82⋅0.8 = 65.6 > 64. Node v is materialized, but z is not. The expansion test continues down from v. Thus, both in the initialization of the trie and in its expansion phases, we can create several new levels (i.e. longer candidates) at a time, contrary to e.g. the base version of Apriori. It is true that Apriori, too, can be modified to create several candidate levels at a time, but at the cost of an increased number of candidates.

After the expansion phase the iteration continues with the counting phase, and new values for the node supports and pending supports are determined. The two phases alternate until all pending supports are less than σ. We have given our method the name 'PIE', reflecting this Probabilistic Iterative Expansion property.
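
The expansion test of formulas (2)-(4) can be sketched in C as follows (node layout as above; the recursive deepening with unconditional probabilities is omitted). With the numbers of Fig. 3 it reproduces LP(y) ≈ 0.82 and ES(y) = 82. The code is illustrative only; the names and globals are ours.

    #include <math.h>   /* pow(); link with -lm */

    struct node { int item, support, pending, ready; struct node **child; };
    static int m;             /* number of frequent items              */
    static double *prob;      /* global item probabilities P(i)        */
    static double alpha;      /* average probability of frequent items */
    static double sigma;      /* minimum support threshold             */

    /* Formula (2): the probability that virtual child v of x matches,
       given that at least one virtual child of x matches. */
    static double local_prob(const struct node *x, int v)
    {
        double none = 1.0;    /* P(no virtual child of x matches) */
        for (int j = x->item + 1; j < m; j++)
            if (!x->child[j])
                none *= 1.0 - prob[j];
        return prob[v] / (1.0 - none);
    }

    /* Formulas (3) and (4): materialize virtual child v of node x in
       iteration iter iff ES(v) = LP(v) * PS(x) >= alpha^iter * sigma. */
    static int should_expand(const struct node *x, int v, int iter)
    {
        double es = local_prob(x, v) * x->pending;
        return es >= pow(alpha, iter) * sigma;
    }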

3. Elaboration

The basic version described above does a lot of extra work. One observation is that as soon as the pending support of some node x is smaller than σ, we can often 'freeze' the whole subtrie, because it will not give us anything new; we call such a node 'ready'. The readiness of nodes can be checked easily with a recursive process: a node x is ready if PS(x) < σ and all its real children are ready. The readiness can be utilized to reduce work in both the counting and expansion phases. In counting, we process one transaction at a time and scan its item subsets down the trie, but only until the first ready node on each path. Also the expansion procedure is skipped for ready nodes. Finally, a simple stopping condition is when the root becomes ready.
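
The recursive readiness test can be sketched as follows (node layout as above). The counting and expansion procedures would then simply return immediately at any node whose ready flag is set, and the main loop stops when check_ready(root) returns true. A hedged sketch, not our exact code:

    struct node { int item, support, pending, ready; struct node **child; };
    static int m;
    static double sigma;

    /* A node is ready iff PS < sigma and all of its real children are
       ready; ready subtries need no further counting or expansion. */
    static int check_ready(struct node *x)
    {
        if (x->ready) return 1;
        int all = (x->pending < sigma);
        for (int j = x->item + 1; j < m; j++)
            if (x->child[j] && !check_ready(x->child[j]))
                all = 0;              /* keep recursing: mark descendants */
        return x->ready = all;
    }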

Another tailoring, not yet implemented, relates to the observation that most of the frequent itemsets are found in the first few iterations, and a lot of I/O effort is spent on finding the last few frequent sets. For those, not all transactions are needed in solving the frequency. In the counting phase, we can distinguish between relevant and irrelevant transactions. A transaction is irrelevant if it does not increase the pending support value of any non-ready node. If the number of relevant transactions is small enough, we can store them separately (in main memory or a temporary file) during the next scanning phase.

Our implementation of the trie is quite simple; saving memory is considered, but not as the first preference. The child linkage is implemented as an array of pointers, and the frequent items are renumbered to 0, …, m-1 (if there are m frequent items) so that they can be used as indices to the array. A minor improvement is that for item i we need only m-i-1 pointers, corresponding to the possible children i+1, …, m-1.
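
The following fragment illustrates this indexing arithmetic: a node labelled i allocates only m-i-1 child slots, and the slot of item j is j-i-1 (illustrative names again; a root label of -1 gives the root all m slots):

    #include <stdlib.h>

    static int m;                        /* number of frequent items */

    struct node {
        int item, support, pending, ready;
        struct node **child;             /* m - item - 1 slots       */
    };

    static struct node *new_node(int item)
    {
        struct node *x = calloc(1, sizeof *x);
        x->item = item;
        int slots = m - item - 1;        /* possible children item+1..m-1 */
        x->child = slots > 0 ? calloc(slots, sizeof *x->child) : NULL;
        return x;
    }

    /* The child slot of item j under node x. */
    static struct node **slot(struct node *x, int j)
    {
        return &x->child[j - x->item - 1];
    }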

The main restriction of the current implementation is the assumption that the trie fits in the main memory. Compression of nodes would help to some extent: now we reserve a pointer for every possible child node, but most of them are null. Storing only non-null pointers saves memory, but makes the trie scanning slower. Also, we could release the ready nodes as soon as they are detected, in order to make room for expansions. Of course, before releasing, the related frequent itemsets should be reported. However, a fully general solution should work for any main memory and trie size. Some kind of external representation should be developed, but this is left for future work.

A high-level pseudocode of the current implementation is given in the following. The recursive parts are not coded explicitly, but should be rather obvious.

Algorithm PIE − Probabilistic iterative expansion of candidates in frequent itemset mining

Input: A transaction database D, the minimum support threshold σ.
Output: The complete set of frequent itemsets.

     1. // Initial steps.
     2. scan D and collect the set F of frequent items;
     3. α := average probability of items in F;
     4. iter := 0;
     5. // The first generation of candidates, based on item probabilities.
     6. create a PIE-trie P so that it contains all such ordered subsets
        S ⊆ F for which Π(Prob(s∈S)) ⋅ |D| ≥ σ; // Frequency test
     7. set the status of all nodes of P to not-ready;
     8. // The main loop: alternating count, test and expand.
     9. loop
    10.   // Scan the database and check readiness.
    11.   scan D and count the support and pending support values
          for non-ready nodes in P;
    12.   iter := iter + 1;
    13.   for each node p ∈ P do
    14.     if pending_support(p) < σ then
    15.       if p is a leaf then set p ready
    16.       else if the children of p are ready then
    17.         set p ready;
    18.   if root(P) is ready then exit loop;
    19.   // Expansion phase: Creation of subtries on the basis of
          // observed pending supports.
    20.   for each non-ready node p in P do
    21.     if pending_support(p) ≥ σ then
    22.       for each virtual child v of p do
    23.         compute local_prob(v) by formula (2);
    24.         estim_support(v) := local_prob(v) ⋅ pending_support(p);
    25.         if estim_support(v) ≥ α^iter σ then
    26.           create node v as the child of p;
    27.           add such ordered subsets S ⊆ F\{1..v} as descendant
                  paths of v, for which
                  Π(Prob(s∈S)) ⋅ estim_support(v) ≥ α^iter σ;
    28. // Gather up results from the trie.
    29. return the paths for nodes p in P such that support(p) ≥ σ;
    30. end

4. Experimental results

For verifying the usability of our PIE algorithm, we used four of the test datasets made available to the Workshop on Frequent Itemset Mining Implementations (FIMI'03) [6]. The test datasets and some of their properties are described in Table 1. They represent rather different kinds of domains, and we wanted to include both dense and non-dense datasets, as well as various numbers of items.

    Table 1. Test dataset description

        Dataset         #Transactions    #Items
        Chess                   3 196        75
        Mushroom                8 124       119
        T40I10D100K           100 000       942
        Kosarak               900 002    41 270

For the PIE method, the interesting statistics to be collected are the number of candidates, the depth of the trie, and the number of iterations. These results are given in Table 2 for selected values of σ for the 'Chess' dataset. We chose values of σ that keep the number of frequent itemsets reasonable (extremely high numbers are probably useless for any application). The table also shows the number of frequent items and frequent sets, to enable comparison with the number of candidates. For this dense dataset, the number of candidates varies between 2 and 4 times the number of frequent itemsets. For non-dense datasets the ratio is usually larger. Table 2 also shows the values of the 'security parameter' α, being the average probability of frequent items. Considering I/O performance, we can see that the number of iteration cycles (= number of file scans) is quite small, compared to the base version of the Apriori method, for which the largest frequent itemset dictates the number of iterations. This is roughly the same as the trie depth, as shown in Table 2.

    Table 2. Statistics from the PIE algorithm for dataset 'Chess'.

          σ    #Frequent   #Frequent   Alpha   #Candidates   Trie    #Iterations   #Apriori's
                  items        sets                          depth                 iterations
        3 000        12          155   0.970           400      6             3             6
        2 900        13          473   0.967         1 042      8             4             7
        2 800        16        1 350   0.953         2 495      8             4             8
        2 700        17        3 134   0.947         5 218      9             4             8
        2 600        19        6 135   0.934        10 516     10             4             9
        2 500        22       11 493   0.914        18 709     11             4            10
        2 400        23       20 582   0.907        47 515     12             4            11
        2 300        24       35 266   0.900       131 108     13             4            12
        2 200        27       59 181   0.877       216 943     14             5            13

The PIE method can also be characterized by describing the development of the trie during the iterations. The most interesting figures are the number of nodes and the number of ready nodes, given in Table 3. Especially the number of ready nodes implies that even though we have rather many candidates (= nodes in the trie), large parts of them are not touched in the later iterations.

    Table 3. Development of the trie for dataset 'Chess', with three different values of σ.

          σ      Iteration   #Frequent sets found    #Nodes   #Ready nodes
        2 600            1                  4 720     4 766          2 021
                         2                  6 036     9 583          9 255
                         3                  6 134    10 296         10 173
                         4                  6 135    10 516         10 516
        2 400            1                 15 601    15 760          5 219
                         2                 20 344    34 995         25 631
                         3                 20 580    47 203         46 952
                         4                 20 582    47 515         47 515
        2 200            1                 44 022    44 800          1 210
                         2                 58 319   112 370         64 174
                         3                 59 176   206 292        196 782
                         4                 59 181   216 931        216 922
                         5                 59 181   216 943        216 943

For the speed comparison, we chose the Apriori and FP-growth implementations provided by Bart Goethals [6]. The results for the four test datasets and for different minimum support thresholds are shown in Table 4. The processor used in the experiments was a 1.5 GHz Pentium 4, with 512 MB of main memory. We used a g++ compiler with the optimizing switch -O6. The PIE algorithm was coded in C.

    Table 4. Comparison of execution times (in seconds) of three frequent
    itemset mining programs for four test datasets.

    (a) Chess
          σ     #Freq. sets    Apriori   FP-growth       PIE
        3 000           155      0.312       0.250     0.125
        2 900           473      0.469       0.266     0.265
        2 800         1 350      0.797       0.297     1.813
        2 700         3 134      1.438       0.344     6.938
        2 600         6 135      3.016       0.438    14.876
        2 500        11 493     10.204       0.610    26.360
        2 400        20 582     21.907       0.829    78.325
        2 300        35 266     42.048       1.156   203.828
        2 200        59 181     73.297       1.766   315.562

    (b) Mushroom
          σ     #Freq. sets    Apriori   FP-growth       PIE
        5 000            41      0.375       0.391     0.062
        4 500            97      0.437       0.406     0.094
        4 000           167      0.578       0.438     0.141
        3 500           369      0.797       0.500     0.297
        3 000           931      1.062       0.546     1.157
        2 500         2 365      1.781       0.610     6.046
        2 000         6 613      3.719       0.750    27.047
        1 500        56 693     55.110       1.124   153.187

    (c) T40I10D100K
          σ     #Freq. sets    Apriori   FP-growth       PIE
       20 000             5      2.797       6.328     0.797
       18 000             9      2.828       6.578     1.110
       16 000            17      3.001       7.250     1.156
       14 000            24      3.141       8.484     1.187
       12 000            48      3.578      14.750     1.906
       10 000            82      4.296      23.874     4.344
        8 000           137      7.859      41.203    11.796
        6 000           239     20.531      72.985    29.671
        4 000           440     35.282     114.953    68.672

    (d) Kosarak
          σ     #Freq. sets    Apriori   FP-growth       PIE
       20 000           121     27.970      30.141     5.203
       18 000           141     28.438      31.296     6.110
       16 000           167     29.016      32.765     7.969
       14 000           202     29.061      33.516     9.688
       12 000           267     29.766      34.875    12.032
       10 000           376     34.906      37.657    18.016
        8 000           575     35.891      41.657    30.453
        6 000         1 110     39.656      51.922    70.376

We can see that in some situations the PIE algorithm is the fastest, in some others the slowest. This is probably a general observation: the performance of most frequent itemset mining algorithms is highly dependent on the data set and the threshold. It seems that PIE is at its best for sparse datasets (such as T40I10D100K and Kosarak), but not so good for very dense datasets (such as 'Chess' and 'Mushroom'). Its speed for large thresholds probably results from the simplicity of the algorithm. For smaller thresholds, the trie gets large and the counting starts to consume more time, especially with a small main memory size.

One might guess that our method is at its best for random datasets, because those would correspond to our assumption about independent item occurrences. We tested this with a dataset of 100 000 transactions, each of which contained 20 random items out of 30 possible. The results were rather interesting: for all tested minimum support thresholds, we found all the frequent itemsets in the first iteration. However, verification of the completeness required one or two additional iterations, with a clearly higher number of candidates, consuming a majority of the total time. Table 5 shows the time and the number of candidates both after the first and after the final iteration. The stepwise growth of the values reveals the levelwise growth of the trie. Apriori also worked well for this dataset, being in most cases faster than PIE. Results for FP-growth (not shown) were naturally much slower, because randomness prevents a compact representation of the transactions.

    Table 5. Statistics from the PIE algorithm for a random dataset.

                          PIE: after iteration 1      PIE: after the last iteration      Apriori
          σ     #Freq.   #Freq.    Time                                       Time                Time
                 sets     sets     (sec.)   #Cand.    #Cand.   #Iterations    (sec.)   #Iterations (sec.)
        50 000      30       30     0.500       30       464             2     2.234           2   3.953
        44 000      42       42     2.016      465       509             3     2.704           3   5.173
        43 800     124      124     1.875      465     1 247             3    10.579           3   6.015
        43 700     214      214     1.876      465     1 792             3    20.250           3   7.235
        43 600     331      331     1.891      465     2 775             3    37.375           3   9.657
        43 500     413      413     1.860      465     3 530             3    48.953           3  11.876
        40 000     465      465     1.844      465     4 443             2    62.000           3  13.875
        28 400     522      522    60.265    4 525     4 900             3    64.235           4  15.016
        28 300     724      724    61.422    4 525     5 989             3    82.140           4  15.531
        28 200   1 270    1 270    61.469    4 525     8 697             3   115.250           4  19.265
        28 100   2 223    2 223    61.734    4 525    13 608             3   167.047           4  31.266
        28 000   3 357    3 357    60.969    4 525    19 909             3   219.578           4  69.797

We wish to point out that our implementation was an initial version, with no special tricks for speed-up. We are convinced that the code details can be improved to make the method still more competitive. For example, buffering of transactions (or temporary files) was not used to enhance the I/O performance.

5. Conclusions and future work

A probability-based approach was suggested for frequent itemset mining, as an alternative to the 'analytic' methods common today. It has been observed to be rather robust, working reasonably well for various kinds of datasets. The number of candidate itemsets does not 'explode', so the data structure (a trie) can be kept in the main memory in most practical cases.

The number of iterations is smallest for random datasets, because candidate generation is based on just that assumption. For skewed datasets, the number of iterations may grow somewhat. This is partly due to our simplifying premise that the items are independent. This point could be tackled by making use of the conditional probabilities obtainable from the trie. Initial tests did not show any significant advantage over the basic approach, but a more sophisticated probabilistic analysis might imply some ways to restrict the number of candidates. The exploration of these elaborations, as well as the tuning of the buffering, data structure, and parameters, is left for future work.

References

 [1] R. Agrawal, C. Aggarwal, and V.V.V. Prasad, "Depth First Generation of Long Patterns", in R. Ramakrishnan, S. Stolfo, R. Bayardo, and I. Parsa (eds.), Proc. of the Int. Conf. on Knowledge Discovery and Data Mining, ACM, Aug. 2000, pp. 108-118.

 [2] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules Between Sets of Items in Large Databases", in P. Buneman and S. Jajodia (eds.), Proc. of the ACM SIGMOD Int. Conf. on Management of Data, May 1993, pp. 207-216.

 [3] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules in Large Databases", in J.B. Bocca, M. Jarke, and C. Zaniolo (eds.), Proc. of the 20th VLDB Conf., Sept. 1994, pp. 487-499.

 [4] R.J. Bayardo, "Efficiently Mining Long Patterns from Databases", in L.M. Haas and A. Tiwary (eds.), Proc. of the ACM SIGMOD Int. Conf. on Management of Data, June 1998, pp. 85-93.

 [5] D. Burdick, M. Calimlim, and J. Gehrke, "MAFIA: A Maximal Frequent Itemset Algorithm for Transactional Databases", Proc. of the IEEE Int. Conf. on Data Engineering, April 2001, pp. 443-452.

 [6] Frequent Itemset Mining Implementations (FIMI'03) Workshop website, http://fimi.cs.helsinki.fi, 2003.

 [7] B. Goethals, "Efficient Frequent Pattern Mining", PhD thesis, University of Limburg, Belgium, Dec. 2002.

 [8] K. Gouda and M.J. Zaki, "Efficiently Mining Maximal Frequent Itemsets", in N. Cercone, T.Y. Lin, and X. Wu (eds.), Proc. of the 2001 IEEE International Conference on Data Mining, Nov. 2001, pp. 163-170.

 [9] J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns Without Candidate Generation", in W. Chen, J. Naughton, and P.A. Bernstein (eds.), Proc. of the ACM SIGMOD Int. Conf. on Management of Data, 2000, pp. 1-12.

[10] J. Hipp, U. Güntzer, and G. Nakhaeizadeh, "Algorithms for Association Rule Mining - a General Survey and Comparison", ACM SIGKDD Explorations 2, July 2000, pp. 58-65.

[11] H. Mannila, H. Toivonen, and A.I. Verkamo, "Efficient Algorithms for Discovering Association Rules", in U.M. Fayyad and R. Uthurusamy (eds.), Proc. of the AAAI Workshop on Knowledge Discovery in Databases, July 1994, pp. 181-192.

[12] J. Pei, J. Han, and R. Mao, "CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets", Proc. of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, May 2000, pp. 21-30.

[13] J. Vuillemin, "A Data Structure for Manipulating Priority Queues", Comm. of the ACM 21(4), 1978, pp. 309-314.

[14] M.J. Zaki, "Scalable Algorithms for Association Mining", IEEE Transactions on Knowledge and Data Engineering 12(3), 2000, pp. 372-390.

[15] Z. Zheng, R. Kohavi, and L. Mason, "Real World Performance of Association Rule Algorithms", in F. Provost and R. Srikant (eds.), Proc. of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001, pp. 401-406.