=Paper=
{{Paper
|id=Vol-90/paper-5
|storemode=property
|title=Probabilistic Iterative Expansion of Candidates in Mining Frequent Itemsets
|pdfUrl=https://ceur-ws.org/Vol-90/gyenesei.pdf
|volume=Vol-90
|dblpUrl=https://dblp.org/rec/conf/fimi/GyeneseiT03
}}
==Probabilistic Iterative Expansion of Candidates in Mining Frequent Itemsets==
Attila Gyenesei and Jukka Teuhola
Turku Centre for Computer Science, Dept. of Inf. Technology, Univ. of Turku, Finland
Email: {gyenesei,teuhola}@it.utu.fi
Abstract

A simple new algorithm is suggested for frequent itemset mining, using item probabilities as the basis for generating candidates. The method first finds all the frequent items, and then generates an estimate of the frequent sets, assuming item independence. The candidates are stored in a trie where each path from the root to a node represents one candidate itemset. The method expands the trie iteratively, until all frequent itemsets are found. Expansion is based on scanning through the dataset in each iteration cycle, and extending the subtries based on observed node frequencies. Trie probing can be restricted to only those nodes which possibly need extension. The number of candidates is usually quite moderate: for dense datasets 2-4 times the number of final frequent itemsets, for non-dense datasets somewhat more. In practical experiments the method has been observed to make clearly fewer passes than the well-known Apriori method. As for speed, our non-optimised implementation is in some cases faster, in some others slower than the comparison methods.

1. Introduction

We study the well-known problem of finding frequent itemsets in a transaction database, see [2]. A transaction in this case means a set of so-called items. For example, a supermarket basket is represented as a transaction, where the purchased products represent the items. The database may contain millions of such transactions. Frequent itemset mining is the task of finding those subsets of items that occur in at least a given minimum number of transactions. This is an important basic task, applicable in solving more advanced data mining problems, for example discovering association rules [2]. What makes the task difficult is that the number of potential frequent itemsets is exponential in the number of distinct items.

In this paper, we follow the notations of Goethals [7]. The overall set of items is denoted by I. Any subset X ⊆ I is called an itemset. If X has k items, it is called a k-itemset. A transaction is an itemset identified by a tid. A transaction with itemset Y is said to support itemset X if X ⊆ Y. The cover of an itemset X in a database D is the set of transactions in D that support X. The support of itemset X is the size of its cover in D. The relative frequency (probability) of itemset X with respect to D is

P(X, D) = Support(X, D) / |D|    (1)

An itemset X is frequent if its support is greater than or equal to a given threshold σ. We can also express the condition using a relative threshold for the frequency: P(X, D) ≥ σ_rel, where 0 ≤ σ_rel ≤ 1. There are variants of the basic 'all-frequent-itemsets' problem, namely the maximal and closed itemset mining problems, see [1, 4, 5, 8, 12]. However, here we restrict ourselves to the basic task.
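To make these definitions concrete, the following small C program counts the support and relative frequency of an itemset. This is a minimal sketch under our own naming (is_subset, support, and rel_frequency are illustrative helpers, not part of the PIE implementation); it uses the four example transactions that appear later in Fig. 2.

```c
#include <stdio.h>

/* Test X ⊆ Y for two sorted item arrays. */
static int is_subset(const int *x, int nx, const int *y, int ny)
{
    int i = 0, j = 0;
    while (i < nx && j < ny) {
        if (x[i] == y[j]) { i++; j++; }
        else if (x[i] > y[j]) j++;
        else return 0;                  /* x[i] is missing from y */
    }
    return i == nx;
}

/* Support(X, D): the number of transactions that contain X. */
static int support(const int *x, int nx,
                   const int *const *db, const int *len, int ntrans)
{
    int count = 0;
    for (int t = 0; t < ntrans; t++)
        if (is_subset(x, nx, db[t], len[t]))
            count++;
    return count;
}

/* Formula (1): P(X, D) = Support(X, D) / |D|. */
static double rel_frequency(const int *x, int nx,
                            const int *const *db, const int *len, int ntrans)
{
    return (double)support(x, nx, db, len, ntrans) / ntrans;
}

int main(void)
{
    /* The transaction set of Fig. 2: {0,3}, {1,2}, {0,1,3}, {1}. */
    const int t0[] = {0, 3}, t1[] = {1, 2}, t2[] = {0, 1, 3}, t3[] = {1};
    const int *db[] = {t0, t1, t2, t3};
    const int len[] = {2, 2, 3, 1};
    const int x[] = {0, 3};
    printf("support = %d, P = %.2f\n",   /* prints: support = 2, P = 0.50 */
           support(x, 2, db, len, 4), rel_frequency(x, 2, db, len, 4));
    return 0;
}
```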
A large number of algorithms have been suggested for frequent itemset mining during the last decade; for surveys, see [7, 10, 15]. Most of the algorithms share the same general approach: generate a set of candidate itemsets, count their frequencies in D, and use the obtained information in generating more candidates, until the complete set is found. The methods differ mainly in the order and extent of candidate generation. The most famous is probably the Apriori algorithm, developed independently by Agrawal et al. [3] and Mannila et al. [11]. It is a representative of breadth-first candidate generation: it first finds all frequent 1-itemsets, then all frequent 2-itemsets, etc. The core of the method is clever pruning of candidate k-itemsets for which there exists a non-frequent (k-1)-subset. This is an application of the obvious monotonicity property: all subsets of a frequent itemset must also be frequent. Apriori is essentially based on this property.

The other main candidate generation approach is depth-first order, of which the best-known representatives are Eclat [14] and FP-growth [9] (though the 'candidate' concept in the context of FP-growth is disputable). These two are generally considered to be among the fastest algorithms for frequent itemset mining. However, we shall mainly use Apriori as a reference method, because it is technically closer to ours.
Most of the suggested methods are analytical in the sense that they are based on logical inductions to restrict the number of candidates to be checked. Our approach (called PIE) is probabilistic, based on relative item frequencies, from which we compute estimates of itemset frequencies during candidate generation. More precisely, we generate iteratively improving approximations (candidate itemsets) of the solution. Our general endeavour has been to develop a relatively simple method, with fast basic steps and few iteration cycles, at the cost of a somewhat increased number of candidates. However, another goal is that the method should be robust, i.e. it should work reasonably fast for all kinds of datasets.

2. Method description

Our method can be characterized as a generate-and-test algorithm, such as Apriori. However, our candidate generation is based on probabilistic estimates of the supports of itemsets. The testing phase is rather similar to Apriori, but involves special book-keeping to lay a basis for the next generation phase.
We start with a general description of the main steps of the algorithm. The first thing to do is to determine the frequencies of all items in the dataset, and select the frequent ones for subsequent processing. If there are m frequent items, we internally identify them by the numbers 0, …, m-1. For each item i, we use its probability (relative frequency) P(i) in the generation of candidates for frequent itemsets.

The candidates are represented as a trie structure, which is normal in this context, see [7]. Each node is labelled by one item, and a path of labels from the root to a node represents an itemset. The root itself represents the empty itemset. The paths are sorted, so that a subtrie rooted by item i can contain only items > i. Note also that several nodes in the trie can have the same item label, but not on a single path. A complete trie, storing all subsets of the whole itemset, would have 2^m nodes and be structurally a binomial tree [13], where level j contains C(m, j) nodes; see Fig. 1 for m = 4.

[Figure 1. The complete trie for 4 items.]

The trie is used for book-keeping purposes. However, it is important to avoid building the complete trie; we build only some upper part of it, so that the nodes (i.e. their root paths) represent reasonable candidates for frequent sets. In our algorithm, the first approximation of the candidate itemsets is obtained by computing estimates of their probabilities, assuming independence of item occurrences. This means that, for example, for an itemset {x, y, z} the estimated probability is the product P(x)P(y)P(z). Nodes are created in the trie from the root down, along all paths for which the path-related probability is not less than the threshold σ_rel. Note that the probability values are monotonically non-increasing on the way down. Fig. 2 shows an example of the initial trie for a given set of transactions (with m = 4). Those nodes of the complete trie (Fig. 1) that do not exist in the actual trie are called virtual nodes, and they are marked with dashed circles in Fig. 2.

[Figure 2. An initial trie for the transaction set {(0, 3), (1, 2), (0, 1, 3), (1)}, with minimum support threshold σ = 1/6. The virtual nodes with probabilities < 1/6 are shown using dashed lines.]
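The initial generation can be sketched in C as follows. This is an illustrative reconstruction under the stated independence assumption, not the authors' code: the node layout (one pointer slot per possible child, matching the implementation notes in Section 3) and all names are our assumptions.

```c
#include <stdlib.h>

typedef struct node {
    int item;                  /* item label of this node                */
    int ready;                 /* set by the readiness check (Sec. 3)    */
    long support, pending;     /* filled in by the counting phase        */
    struct node **children;    /* one slot per possible child item,
                                  i.e. items item+1 .. m-1               */
} node_t;

static int m;                  /* number of frequent items               */
static double *p;              /* p[i] = P(i), global item probability   */
static double sigma_rel;       /* relative minimum support threshold     */

static node_t *new_node(int item)
{
    node_t *n = calloc(1, sizeof *n);
    n->item = item;
    n->children = calloc(m - item - 1 > 0 ? m - item - 1 : 1,
                         sizeof *n->children);
    return n;
}

/* Expand node n (labelled 'item', with estimated path probability
   path_p) by all children whose estimated probability still reaches
   sigma_rel; the estimates only shrink on the way down. */
static void expand_initial(node_t *n, int item, double path_p)
{
    for (int c = item + 1; c < m; c++) {
        double child_p = path_p * p[c];     /* independence assumption */
        if (child_p >= sigma_rel) {
            n->children[c - item - 1] = new_node(c);
            expand_initial(n->children[c - item - 1], c, child_p);
        }
    }
}

/* Usage: root = new_node(-1); expand_initial(root, -1, 1.0);
   With item == -1, the root's children are exactly the items i with
   P(i) >= sigma_rel, as on the first level of Fig. 2. */
```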
The next step is to read the transactions and count the true number of occurrences (i.e. the related path support) for each node in the trie. Simultaneously, for each visited node, we maintain a counter called the pending support (PS), which is the number of transactions for which at least one virtual child of the node would match. The pending support will be our criterion for expanding a node: if PS(x) ≥ σ, then it is possible that a virtual child of node x is frequent, and the node must be expanded. If there are no such nodes, the algorithm is ready, and the result can be read from the trie: all nodes with support ≥ σ represent frequent itemsets.
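The counting phase might look as follows in C, reusing the node type of the previous sketch (again our reconstruction, not the original implementation; the readiness-based pruning of Section 3 is omitted here). For one transaction we descend the trie, incrementing the support of every matched node and the pending support of every node for which some virtual child matches.

```c
/* Count one transaction t[0..nt-1] (a sorted item array) into the
   subtrie rooted at n; 'item' is n's label (-1 for the root) and
   'from' indexes the first transaction item still usable as a child. */
static void count_transaction(node_t *n, int item,
                              const int *t, int nt, int from)
{
    int virtual_hit = 0;
    n->support++;                          /* path(n) ⊆ transaction    */
    for (int k = from; k < nt; k++) {
        int c = t[k];                      /* candidate child item     */
        node_t *child = n->children[c - item - 1];
        if (child != NULL)
            count_transaction(child, c, t, nt, k + 1);
        else
            virtual_hit = 1;               /* a virtual child matches  */
    }
    if (virtual_hit)
        n->pending++;
}

/* Usage, once per transaction: count_transaction(root, -1, t, nt, 0);
   after the scan, PS(x) >= sigma marks node x for the expansion test. */
```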
Trie expansion starts the next cycle, and we iterate until the stopping condition holds. However, we must be very careful in the expansion: which virtual nodes should we materialize (and how deep, recursively), in order to avoid a trie 'explosion', yet still approach the final solution? Here we apply item probabilities again. In principle, we could take advantage of all the information available in the current trie (frequencies of subsets, etc.), as is done in the Apriori algorithm and many others. However, we prefer a simpler calculation, based on the global probabilities of items.

Suppose that we have a node x with pending support PS(x) ≥ σ. Assume that it has virtual child items v_0, v_1, …, v_{s-1} with global probabilities P(v_0), P(v_1), …, P(v_{s-1}). Every transaction contributing to PS(x) has a match with at least one of v_0, v_1, …, v_{s-1}. The local probability (LP) of a match with v_i is computed as follows:
LP(v_i) = P(v_i matches | one of v_0, v_1, …, v_{s-1} matches)
        = P((v_i matches) ∧ (one of v_0, v_1, …, v_{s-1} matches)) / P(one of v_0, v_1, …, v_{s-1} matches)
        = P(v_i matches) / P(one of v_0, v_1, …, v_{s-1} matches)
        = P(v_i) / (1 - (1 - P(v_0))(1 - P(v_1)) ⋯ (1 - P(v_{s-1})))    (2)

Using this formula, we get an estimated support ES(v_i):

ES(v_i) = LP(v_i) ⋅ PS(Parent(v_i))    (3)
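Formulas (2) and (3) translate directly into code. The following C helpers are a straightforward rendering of them (the names are ours); prob[] holds the global probabilities P(v_0), …, P(v_{s-1}) of the parent's virtual children.

```c
/* Formula (2): probability that virtual child v_i matches, given that
   at least one of the virtual children v_0 .. v_{s-1} matches. */
static double local_prob(const double *prob, int s, int i)
{
    double none = 1.0;                 /* P(no virtual child matches) */
    for (int j = 0; j < s; j++)
        none *= 1.0 - prob[j];
    return prob[i] / (1.0 - none);
}

/* Formula (3): estimated support of v_i from its parent's PS. */
static double estim_support(const double *prob, int s, int i,
                            long parent_pending_support)
{
    return local_prob(prob, s, i) * parent_pending_support;
}

/* Example (cf. Fig. 3): prob = {0.7, 0.5} and PS = 100 give
   local_prob = 0.7 / (1 - 0.3*0.5) ≈ 0.82 and estim_support ≈ 82. */
```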
If ES(v_i) ≥ σ, then we conclude that v_i is expected to be frequent. However, in order to guarantee a finite number of iterations in the worst case, we have to relax this condition a bit. Since the true distribution may be very skewed, almost the whole pending support may belong to only one virtual child. To ensure convergence, we apply the following condition for child expansion in the kth iteration:

ES(v_i) ≥ α^k ⋅ σ    (4)

with some constant α between 0 and 1. In the worst case this will eventually (when k is high enough) result in expansion, to get rid of a PS-value ≥ σ. In our tests, we used the heuristic value α = average probability of the frequent items. The reasoning behind this choice is that it speeds up the local expansion growth by one level, on the average (k levels for α^k). This acceleration restricts the number of iterations efficiently. The largest extensions are applied only to the 'skewest' subtries, so that the total size of the trie remains tolerable. Another approach to choosing α would be a statistical analysis to determine confidence bounds for ES. However, this is left for future work.

Fig. 3 shows an example of trie expansion, assuming that the minimum support threshold σ = 80, α = 0.8, and k = 1. The item probabilities are assumed to be P(y) = 0.7, P(z) = 0.5, and P(v) = 0.8. Node t has a pending support of 100, related to its two virtual children, y and z. This means that 100 transactions contained the path from the root to t, plus either or both of the items y and z, so we have to test for expansion. Our formula gives y the local probability LP(y) = 0.7 / (1-(1-0.7)(1-0.5)) ≈ 0.82, so the estimated support is 82 > α⋅σ = 64, and we expand y. However, the local probability of z is only ≈ 0.59, so its estimated support is 59, and it will not be expanded.

[Figure 3. An example of expansion for probabilities P(y) = 0.7, P(z) = 0.5, and P(v) = 0.8. Node t has PS = 100; its virtual children get ES(y) = 82 (materialized) and ES(z) = 59 (rejected), and below y the estimates are ES(z) = 41 and ES(v) = 65.6.]

When a virtual node (y) has been materialized, we immediately test its expansion as well, based on its ES-value, recursively. However, in the recursive steps we cannot apply formula (2), because we have no evidence about the children of y. Instead, we apply the unconditional probabilities of z and v in the estimation: ES(z) = 82⋅0.5 = 41 < α⋅σ = 64, and ES(v) = 82⋅0.8 = 65.6 > 64. Node v is materialized, but z is not. The expansion test continues down from v. Thus, both in the initialization of the trie and in its expansion phases, we can create several new levels (i.e. longer candidates) at a time, contrary to e.g. the base version of Apriori. It is true that Apriori can also be modified to create several candidate levels at a time, but at the cost of an increased number of candidates.

After the expansion phase the iteration continues with the counting phase, and new values for the node supports and pending supports are determined. The two phases alternate until all pending supports are less than σ. We have given our method the name 'PIE', reflecting this Probabilistic Iterative Expansion property.
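The expansion decision, including the relaxed threshold of formula (4) and the recursive descent with unconditional probabilities, could be sketched as follows, reusing the declarations of the earlier sketches (an illustrative reconstruction, not the authors' code).

```c
#include <math.h>

/* Formula (4): materialize in iteration k when the estimated support
   reaches the relaxed threshold alpha^k * sigma. */
static int should_expand(double est_support, double alpha, int k,
                         double sigma)
{
    return est_support >= pow(alpha, k) * sigma;
}

/* Recursive part of the expansion: once node v (labelled 'item') has
   been created with estimate es, deeper items c are appended while
   es * P(c) still passes the same relaxed test, as in the Fig. 3
   walk-through (82 * 0.8 = 65.6 >= 64 creates v; 82 * 0.5 = 41 < 64
   rejects z). */
static void expand_deeper(node_t *v, int item, double es,
                          double alpha, int k, double sigma)
{
    for (int c = item + 1; c < m; c++) {
        double child_es = es * p[c];       /* unconditional estimate  */
        if (should_expand(child_es, alpha, k, sigma)) {
            v->children[c - item - 1] = new_node(c);
            expand_deeper(v->children[c - item - 1], c, child_es,
                          alpha, k, sigma);
        }
    }
}
```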
3. Elaboration

The basic version described above does a lot of extra work. One observation is that as soon as the pending support of some node x is smaller than σ, we can often 'freeze' the whole subtrie, because it will not give us anything new; we call such a node 'ready'. The readiness of nodes can be checked easily with a recursive process: a node x is ready if PS(x) < σ and all its real children are ready. The readiness can be utilized to reduce work in both the counting and expansion phases. In counting, we process one transaction at a time and scan its item subsets down the trie, but only until the first ready node on each path. Also the expansion procedure is skipped for ready nodes. Finally, a simple stopping condition is that the root becomes ready.
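The readiness check can be written directly from this definition; the sketch below reuses the node type of the earlier sketches, with its ready flag (our reconstruction, not the authors' code).

```c
/* A node is ready when PS < sigma and all of its real children are
   ready; ready subtrees are skipped in later counting and expansion
   phases.  Returns the readiness of n after the update. */
static int update_readiness(node_t *n, int item, long sigma)
{
    int all_children_ready = 1;
    for (int c = item + 1; c < m; c++) {
        node_t *child = n->children[c - item - 1];
        if (child != NULL && !update_readiness(child, c, sigma))
            all_children_ready = 0;
    }
    n->ready = (n->pending < sigma) && all_children_ready;
    return n->ready;
}

/* The whole algorithm stops when update_readiness(root, -1, sigma)
   returns 1, i.e. when the root itself becomes ready. */
```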
Another tailoring, not yet implemented, relates to the observation that most of the frequent itemsets are found in the first few iterations, and a lot of I/O effort is spent on finding the last few frequent sets. For those, not all transactions are needed in resolving the frequencies. In the counting phase, we can distinguish between relevant and irrelevant transactions. A transaction is irrelevant if it does not increase the pending support value of any non-ready node. If the number of relevant transactions is small enough, we can store them separately (in main memory or in a temporary file) during the next scanning phase.
Our implementation of the trie is quite simple; saving memory has been considered, but not as the first preference. The child linkage is implemented as an array of pointers, and the frequent items are renumbered to 0, …, m-1 (if there are m frequent items) so that they can be used as indices into the array. A minor improvement is that for item i we need only m-i-1 pointers, corresponding to the possible children i+1, …, m-1.

The main restriction of the current implementation is the assumption that the trie fits in main memory. Compression of nodes would help to some extent: currently we reserve a pointer for every possible child node, but most of them are null. Storing only the non-null pointers would save memory, but make trie scanning slower. Also, we could release the ready nodes as soon as they are detected, in order to make room for expansions; of course, before releasing, the related frequent itemsets should be reported. However, a fully general solution should work for any main memory and trie size. Some kind of external representation should be developed, but this is left for future work.
A high-level pseudocode of the current implementation is given in the following. The recursive parts are not coded explicitly, but should be rather obvious.

Algorithm PIE − Probabilistic iterative expansion of candidates in frequent itemset mining
Input: A transaction database D, the minimum support threshold σ.
Output: The complete set of frequent itemsets.

1.  // Initial steps.
2.  scan D and collect the set F of frequent items;
3.  α := average probability of items in F;
4.  iter := 0;
5.  // The first generation of candidates, based on item probabilities.
6.  create a PIE-trie P so that it contains all such ordered subsets S ⊆ F for which Π(Prob(s ∈ S)) ⋅ |D| ≥ σ;  // frequency test
7.  set the status of all nodes of P to not-ready;
8.  // The main loop: alternating count, test and expand.
9.  loop
10.   // Scan the database and check readiness.
11.   scan D and count the support and pending support values for the non-ready nodes in P;
12.   iter := iter + 1;
13.   for each node p ∈ P do
14.     if pending_support(p) < σ then
15.       if p is a leaf then set p ready
16.       else if the children of p are ready then
17.         set p ready;
18.   if root(P) is ready then exit loop;
19.   // Expansion phase: creation of subtries on the basis of the observed pending supports.
20.   for each non-ready node p in P do
21.     if pending_support(p) ≥ σ then
22.       for each virtual child v of p do
23.         compute local_prob(v) by formula (2);
24.         estim_support(v) := local_prob(v) ⋅ pending_support(p);
25.         if estim_support(v) ≥ α^iter ⋅ σ then
26.           create node v as the child of p;
27.           add such ordered subsets S ⊆ F\{1..v} as descendant paths of v, for which Π(Prob(s ∈ S)) ⋅ estim_support(v) ≥ α^iter ⋅ σ;
28. // Gather up the results from the trie.
29. return the paths of the nodes p in P such that support(p) ≥ σ;
30. end
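Tying the earlier sketches together, the main loop might be driven roughly as follows. This is our reconstruction only: rewind_db(), read_next_transaction(), MAX_ITEMS, and expand_trie() (the traversal that applies the expansion test of Section 2 to every non-ready node) are hypothetical placeholders, not functions of the actual implementation.

```c
#define MAX_ITEMS 1024   /* assumed upper bound on transaction length */

/* One possible driver for the count/test/expand cycle of the
   pseudocode, using the helpers sketched in Sections 2 and 3.
   A real implementation would also reset the counters of non-ready
   nodes before each scan and skip ready subtries while counting. */
static void pie_main_loop(node_t *root, long sigma, double alpha)
{
    int t[MAX_ITEMS], nt, iter = 0;

    for (;;) {
        /* Counting phase: one full scan of the database (lines 10-12). */
        rewind_db();                               /* hypothetical I/O */
        while ((nt = read_next_transaction(t)) >= 0)
            count_transaction(root, -1, t, nt, 0);
        iter++;

        /* Test phase: stop when the root becomes ready (lines 13-18). */
        if (update_readiness(root, -1, sigma))
            break;

        /* Expansion phase (lines 19-27): materialize virtual children
           v with estim_support(v) >= alpha^iter * sigma under every
           non-ready node whose pending support reaches sigma. */
        expand_trie(root, sigma, alpha, iter);     /* hypothetical */
    }
    /* Finally (lines 28-29), report the root paths of all nodes with
       support >= sigma. */
}
```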
4. Experimental results

For verifying the usability of our PIE algorithm, we used four of the test datasets made available to the Workshop on Frequent Itemset Mining Implementations (FIMI'03) [6]. The test datasets and some of their properties are described in Table 1. They represent rather different kinds of domains, and we wanted to include both dense and non-dense datasets, as well as various numbers of items.

Table 1. Test dataset description

| Dataset     | #Transactions | #Items |
|-------------|---------------|--------|
| Chess       | 3 196         | 75     |
| Mushroom    | 8 124         | 119    |
| T40I10D100K | 100 000       | 942    |
| Kosarak     | 900 002       | 41 270 |
For the PIE method, the interesting statistics to be collected are the number of candidates, the depth of the trie, and the number of iterations. These results are given in Table 2 for selected values of σ for the 'Chess' dataset. We chose values of σ that keep the number of frequent itemsets reasonable (extremely high numbers are probably useless for any application). The table also shows the numbers of frequent items and frequent sets, to enable comparison with the number of candidates. For this dense dataset, the number of candidates varies between 2-4 times the number of frequent itemsets. For non-dense datasets the ratio is usually larger. Table 2 also shows the values of the 'security parameter' α, being the average probability of the frequent items. Considering I/O performance, we can see that the number of iteration cycles (= number of file scans) is quite small, compared to the base version of the Apriori method, for which the largest frequent itemset dictates the number of iterations; the latter is roughly the same as the trie depth, as shown in Table 2.
Table 2. Statistics from the PIE algorithm for dataset 'Chess'.

| σ     | #Frequent items | #Frequent sets | Alpha | #Candidates | Trie depth | #Iterations | #Apriori's iterations |
|-------|-----------------|----------------|-------|-------------|------------|-------------|-----------------------|
| 3 000 | 12 | 155    | 0.970 | 400     | 6  | 3 | 6  |
| 2 900 | 13 | 473    | 0.967 | 1 042   | 8  | 4 | 7  |
| 2 800 | 16 | 1 350  | 0.953 | 2 495   | 8  | 4 | 8  |
| 2 700 | 17 | 3 134  | 0.947 | 5 218   | 9  | 4 | 8  |
| 2 600 | 19 | 6 135  | 0.934 | 10 516  | 10 | 4 | 9  |
| 2 500 | 22 | 11 493 | 0.914 | 18 709  | 11 | 4 | 10 |
| 2 400 | 23 | 20 582 | 0.907 | 47 515  | 12 | 4 | 11 |
| 2 300 | 24 | 35 266 | 0.900 | 131 108 | 13 | 4 | 12 |
| 2 200 | 27 | 59 181 | 0.877 | 216 943 | 14 | 5 | 13 |

The PIE method can also be characterized by describing the development of the trie during the iterations. The most interesting figures are the number of nodes and the number of ready nodes, given in Table 3. Especially the number of ready nodes implies that even though we have rather many candidates (= nodes in the trie), large parts of them are not touched in the later iterations.

Table 3. Development of the trie for dataset 'Chess', with three different values of σ.

| σ     | Iteration | #Frequent sets found | #Nodes  | #Ready nodes |
|-------|-----------|----------------------|---------|--------------|
| 2 600 | 1 | 4 720  | 4 766   | 2 021   |
|       | 2 | 6 036  | 9 583   | 9 255   |
|       | 3 | 6 134  | 10 296  | 10 173  |
|       | 4 | 6 135  | 10 516  | 10 516  |
| 2 400 | 1 | 15 601 | 15 760  | 5 219   |
|       | 2 | 20 344 | 34 995  | 25 631  |
|       | 3 | 20 580 | 47 203  | 46 952  |
|       | 4 | 20 582 | 47 515  | 47 515  |
| 2 200 | 1 | 44 022 | 44 800  | 1 210   |
|       | 2 | 58 319 | 112 370 | 64 174  |
|       | 3 | 59 176 | 206 292 | 196 782 |
|       | 4 | 59 181 | 216 931 | 216 922 |
|       | 5 | 59 181 | 216 943 | 216 943 |

For speed comparison, we chose the Apriori and FP-growth implementations provided by Bart Goethals [6]. The results for the four test datasets and for different minimum support thresholds are shown in Table 4. The processor used in the experiments was a 1.5 GHz Pentium 4, with 512 MB of main memory. We used the g++ compiler with the optimization switch -O6. The PIE algorithm was coded in C.
Table 4. Comparison of execution times (in seconds) of three frequent itemset mining programs for four test datasets.

(a) Chess

| σ     | #Freq. sets | Apriori | FP-growth | PIE     |
|-------|-------------|---------|-----------|---------|
| 3 000 | 155    | 0.312  | 0.250 | 0.125   |
| 2 900 | 473    | 0.469  | 0.266 | 0.265   |
| 2 800 | 1 350  | 0.797  | 0.297 | 1.813   |
| 2 700 | 3 134  | 1.438  | 0.344 | 6.938   |
| 2 600 | 6 135  | 3.016  | 0.438 | 14.876  |
| 2 500 | 11 493 | 10.204 | 0.610 | 26.360  |
| 2 400 | 20 582 | 21.907 | 0.829 | 78.325  |
| 2 300 | 35 266 | 42.048 | 1.156 | 203.828 |
| 2 200 | 59 181 | 73.297 | 1.766 | 315.562 |

(b) Mushroom

| σ     | #Freq. sets | Apriori | FP-growth | PIE     |
|-------|-------------|---------|-----------|---------|
| 5 000 | 41     | 0.375  | 0.391 | 0.062   |
| 4 500 | 97     | 0.437  | 0.406 | 0.094   |
| 4 000 | 167    | 0.578  | 0.438 | 0.141   |
| 3 500 | 369    | 0.797  | 0.500 | 0.297   |
| 3 000 | 931    | 1.062  | 0.546 | 1.157   |
| 2 500 | 2 365  | 1.781  | 0.610 | 6.046   |
| 2 000 | 6 613  | 3.719  | 0.750 | 27.047  |
| 1 500 | 56 693 | 55.110 | 1.124 | 153.187 |

(c) T40I10D100K

| σ      | #Freq. sets | Apriori | FP-growth | PIE    |
|--------|-------------|---------|-----------|--------|
| 20 000 | 5   | 2.797  | 6.328   | 0.797  |
| 18 000 | 9   | 2.828  | 6.578   | 1.110  |
| 16 000 | 17  | 3.001  | 7.250   | 1.156  |
| 14 000 | 24  | 3.141  | 8.484   | 1.187  |
| 12 000 | 48  | 3.578  | 14.750  | 1.906  |
| 10 000 | 82  | 4.296  | 23.874  | 4.344  |
| 8 000  | 137 | 7.859  | 41.203  | 11.796 |
| 6 000  | 239 | 20.531 | 72.985  | 29.671 |
| 4 000  | 440 | 35.282 | 114.953 | 68.672 |

(d) Kosarak

| σ      | #Freq. sets | Apriori | FP-growth | PIE    |
|--------|-------------|---------|-----------|--------|
| 20 000 | 121   | 27.970 | 30.141 | 5.203  |
| 18 000 | 141   | 28.438 | 31.296 | 6.110  |
| 16 000 | 167   | 29.016 | 32.765 | 7.969  |
| 14 000 | 202   | 29.061 | 33.516 | 9.688  |
| 12 000 | 267   | 29.766 | 34.875 | 12.032 |
| 10 000 | 376   | 34.906 | 37.657 | 18.016 |
| 8 000  | 575   | 35.891 | 41.657 | 30.453 |
| 6 000  | 1 110 | 39.656 | 51.922 | 70.376 |

We can see that in some situations the PIE algorithm is the fastest, in some others the slowest. This is probably a general observation: the performance of most frequent itemset mining algorithms is highly dependent on the dataset and the threshold. It seems that PIE is at its best for sparse datasets (such as T40I10D100K and Kosarak), but not so good for very dense datasets (such as 'Chess' and 'Mushroom'). Its speed for large thresholds probably results from the simplicity of the algorithm. For smaller thresholds, the trie gets large and the counting starts to consume more time, especially with a small main memory size.

One might guess that our method is at its best for random datasets, because those would correspond to our assumption about independent item occurrences. We tested this with a dataset of 100 000 transactions, each of which contained 20 random items out of 30 possible. The results were rather interesting: for all tested minimum support thresholds, we found all the frequent itemsets in the first iteration. However, verification of the completeness required one or two additional iterations, with a clearly higher number of candidates, consuming a majority of the total time. Table 5 shows the time and number of candidates both after the first and after the final iteration. The stepwise growth of the values reveals the levelwise growth of the trie. Apriori also worked well for this dataset, being in most cases faster than PIE. Results for FP-growth (not shown) are naturally much slower, because randomness prevents a compact representation of the transactions.
Table 5. Statistics from the PIE algorithm for a random dataset.

| σ      | #Freq. sets | PIE, iter. 1: #Freq. sets | Time (sec.) | #Cand. | PIE, final: #Cand. | #Iterations | Time (sec.) | Apriori: #Iterations | Time (sec.) |
|--------|-------------|---------------------------|-------------|--------|--------------------|-------------|-------------|----------------------|-------------|
| 50 000 | 30    | 30    | 0.500  | 30    | 464    | 2 | 2.234   | 2 | 3.953  |
| 44 000 | 42    | 42    | 2.016  | 465   | 509    | 3 | 2.704   | 3 | 5.173  |
| 43 800 | 124   | 124   | 1.875  | 465   | 1 247  | 3 | 10.579  | 3 | 6.015  |
| 43 700 | 214   | 214   | 1.876  | 465   | 1 792  | 3 | 20.250  | 3 | 7.235  |
| 43 600 | 331   | 331   | 1.891  | 465   | 2 775  | 3 | 37.375  | 3 | 9.657  |
| 43 500 | 413   | 413   | 1.860  | 465   | 3 530  | 3 | 48.953  | 3 | 11.876 |
| 40 000 | 465   | 465   | 1.844  | 465   | 4 443  | 2 | 62.000  | 3 | 13.875 |
| 28 400 | 522   | 522   | 60.265 | 4 525 | 4 900  | 3 | 64.235  | 4 | 15.016 |
| 28 300 | 724   | 724   | 61.422 | 4 525 | 5 989  | 3 | 82.140  | 4 | 15.531 |
| 28 200 | 1 270 | 1 270 | 61.469 | 4 525 | 8 697  | 3 | 115.250 | 4 | 19.265 |
| 28 100 | 2 223 | 2 223 | 61.734 | 4 525 | 13 608 | 3 | 167.047 | 4 | 31.266 |
| 28 000 | 3 357 | 3 357 | 60.969 | 4 525 | 19 909 | 3 | 219.578 | 4 | 69.797 |

We wish to point out that our implementation was an initial version, with no special tricks for speed-up. We are convinced that the code details can be improved to make the method still more competitive. For example, buffering of the transactions (or temporary files) was not used to enhance the I/O performance.

5. Conclusions and future work

A probability-based approach was suggested for frequent itemset mining, as an alternative to the 'analytic' methods common today. It has been observed to be rather robust, working reasonably well for various kinds of datasets. The number of candidate itemsets does not 'explode', so the data structure (a trie) can be kept in main memory in most practical cases.

The number of iterations is smallest for random datasets, because candidate generation is based on just that assumption. For skewed datasets, the number of iterations may grow somewhat. This is partly due to our simplifying premise that the items are independent. This point could be tackled by making use of the conditional probabilities obtainable from the trie. Initial tests did not show any significant advantage over the basic approach, but a more sophisticated probabilistic analysis might imply some ways to restrict the number of candidates. The exploration of these elaborations, as well as the tuning of the buffering, data structure, and parameters, is left for future work.
References

[1] R. Agrawal, C. Aggarwal, and V.V.V. Prasad, "Depth First Generation of Long Patterns", in R. Ramakrishnan, S. Stolfo, R. Bayardo, and I. Parsa (eds.), Proc. of the Int. Conf. on Knowledge Discovery and Data Mining, ACM, Aug. 2000, pp. 108-118.

[2] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules Between Sets of Items in Large Databases", in P. Buneman and S. Jajodia (eds.), Proc. of the ACM SIGMOD Int. Conf. on Management of Data, May 1993, pp. 207-216.

[3] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules in Large Databases", in J.B. Bocca, M. Jarke, and C. Zaniolo (eds.), Proc. of the 20th VLDB Conf., Sept. 1994, pp. 487-499.

[4] R.J. Bayardo, "Efficiently Mining Long Patterns from Databases", in L.M. Haas and A. Tiwary (eds.), Proc. of the ACM SIGMOD Int. Conf. on Management of Data, June 1998, pp. 85-93.

[5] D. Burdick, M. Calimlim, and J. Gehrke, "MAFIA: a Maximal Frequent Itemset Algorithm for Transactional Databases", Proc. of the IEEE Int. Conf. on Data Engineering, April 2001, pp. 443-452.

[6] Frequent Itemset Mining Implementations (FIMI'03) Workshop website, http://fimi.cs.helsinki.fi, 2003.

[7] B. Goethals, "Efficient Frequent Pattern Mining", PhD thesis, University of Limburg, Belgium, Dec. 2002.

[8] K. Gouda and M.J. Zaki, "Efficiently Mining Maximal Frequent Itemsets", in N. Cercone, T.Y. Lin, and X. Wu (eds.), Proc. of the 2001 IEEE Int. Conf. on Data Mining, Nov. 2001, pp. 163-170.

[9] J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns Without Candidate Generation", in W. Chen, J. Naughton, and P.A. Bernstein (eds.), Proc. of the ACM SIGMOD Int. Conf. on Management of Data, 2000, pp. 1-12.

[10] J. Hipp, U. Güntzer, and G. Nakhaeizadeh, "Algorithms for Association Rule Mining - a General Survey and Comparison", ACM SIGKDD Explorations 2, July 2000, pp. 58-65.

[11] H. Mannila, H. Toivonen, and A.I. Verkamo, "Efficient Algorithms for Discovering Association Rules", in U.M. Fayyad and R. Uthurusamy (eds.), Proc. of the AAAI Workshop on Knowledge Discovery in Databases, July 1994, pp. 181-192.

[12] J. Pei, J. Han, and R. Mao, "CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets", Proc. of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, May 2000, pp. 21-30.

[13] J. Vuillemin, "A Data Structure for Manipulating Priority Queues", Comm. of the ACM 21(4), 1978, pp. 309-314.

[14] M.J. Zaki, "Scalable Algorithms for Association Mining", IEEE Transactions on Knowledge and Data Engineering 12(3), 2000, pp. 372-390.

[15] Z. Zheng, R. Kohavi, and L. Mason, "Real World Performance of Association Rule Algorithms", in F. Provost and R. Srikant (eds.), Proc. of the Seventh ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 2001, pp. 401-406.