=Paper=
{{Paper
|id=Vol-90/paper-10
|storemode=property
|title=MAFIA: A Performance Study of Mining Maximal Frequent Itemsets
|pdfUrl=https://ceur-ws.org/Vol-90/burdick.pdf
|volume=Vol-90
|dblpUrl=https://dblp.org/rec/conf/fimi/BurdickCFGY03
}}
==MAFIA: A Performance Study of Mining Maximal Frequent Itemsets==
Doug Burdick (University of Wisconsin-Madison, who0ps99@cs.wisc.edu), Manuel Calimlim (Cornell University, calimlim@cs.cornell.edu), Jason Flannick (Stanford University, flannick@cs.stanford.edu), Johannes Gehrke (Cornell University, johannes@cs.cornell.edu), and Tomi Yiu (Cornell University, ty42@cornell.edu)
Abstract

We present a performance study of the MAFIA algorithm for mining maximal frequent itemsets from a transactional database. In a thorough experimental analysis, we isolate the effects of individual components of MAFIA, including search space pruning techniques and adaptive compression. We also compare our performance with previous work by running tests on very different types of datasets. Our experiments show that MAFIA performs best when mining long itemsets and outperforms other algorithms on dense data by a factor of three to thirty.

1 Introduction

MAFIA uses a vertical bitmap representation for support counting and effective pruning mechanisms for searching the itemset lattice [6]. The algorithm is designed to mine maximal frequent itemsets (MFI), but by changing some of the pruning tools, MAFIA can also generate all frequent itemsets (FI) and closed frequent itemsets (FCI).

MAFIA assumes that the entire database (and all data structures used for the algorithm) completely fits into main memory. Since all algorithms for finding association rules, including algorithms that work with disk-resident databases, are CPU-bound, we believe that our study sheds light on some important performance bottlenecks.

In a thorough experimental evaluation, we first quantify the effect of each individual pruning component on the performance of MAFIA. Because of our strong pruning mechanisms, MAFIA performs best on dense datasets where large subtrees can be removed from the search space. On shallow datasets, MAFIA is competitive though not always the fastest algorithm. On dense datasets, our results indicate that MAFIA outperforms other algorithms by a factor of three to thirty.

(Research for this paper was done while Doug Burdick and Jason Flannick were at Cornell University.)

2 Search Space Pruning

MAFIA uses the lexicographic subset tree originally presented by Rymon [9] and adopted by both Agarwal [3] and Bayardo [4]. The itemset identifying each node will be referred to as the node's head, while possible extensions of the node are called the tail. In a pure depth-first traversal of the tree, the tail contains all items lexicographically larger than any element of the head. With a dynamic reordering scheme, the tail contains only the frequent extensions of the current node. Notice that all items that can appear in a subtree are contained in the subtree root's head union tail (HUT), a set formed by combining all elements of the head and tail.

In the simplest itemset traversal, we traverse the lexicographic tree in pure depth-first order. At each node n, each element in the node's tail is generated and counted as a 1-extension. If the support of n's head joined with the 1-extension is less than the minimum support, then we can stop by the Apriori principle, since any itemset extending that 1-extension would have an infrequent subset.

For each candidate itemset, we need to check if a superset of the candidate itemset is already in the MFI. If no superset exists, then we add the candidate itemset to the MFI. It is important to note that with the depth-first traversal, itemsets already inserted into the MFI will be lexicographically ordered earlier.

2.1 Parent Equivalence Pruning (PEP)

One method of pruning involves comparing the transaction sets of each parent/child pair. Let x be a node n's head and y be an element in n's tail. If t(x) ⊆ t(y), then any transaction containing x also contains y. Since we only
want the maximal frequent itemsets, it is not necessary to count itemsets containing x and not y. Therefore, we can move item y from the node's tail to the node's head.

2.2 FHUT

Another type of pruning is superset pruning. We observe that at node n, the largest possible frequent itemset contained in the subtree rooted at n is n's HUT (head union tail), as observed by Bayardo [4]. If n's HUT is discovered to be frequent, we never have to explore any subsets of the HUT and thus can prune out the entire subtree rooted at node n. We refer to this method of pruning as FHUT (Frequent Head Union Tail) pruning.

2.3 HUTMFI

There are two methods for determining whether an itemset x is frequent: direct counting of the support of x, and checking if a superset of x has already been declared frequent; FHUT uses the former method. The latter approach determines if a superset of the HUT is in the MFI. If a superset does exist, then the HUT must be frequent and the subtree rooted at the corresponding node can be pruned away. We call this type of superset pruning HUTMFI.

2.4 Dynamic Reordering

The benefit of dynamically reordering the children of each node based on support instead of following the lexicographic order is significant. An algorithm that trims the tail to only frequent extensions at a higher level will save a lot of computation. The order of the tail elements is also an important consideration. Ordering the tail elements by increasing support will keep the search space as small as possible. This heuristic was first used by Bayardo [4].

In Section 5.3.1, we quantify the effects of the algorithmic components by analyzing different combinations of pruning mechanisms.

3 MAFIA Extensions

MAFIA is designed and optimized for mining maximal frequent itemsets, but the general framework can be used to mine all frequent itemsets and closed frequent itemsets.

The algorithm can easily be extended to mine all frequent itemsets. The main changes required are suppressing the pruning tools (PEP, FHUT, HUTMFI) and adding all frequent nodes in the itemset lattice to the set FI without any superset checking. Itemsets are counted using the same techniques as in the regular MAFIA algorithm.

MAFIA can also be used to mine closed frequent itemsets. An itemset is closed if there are no supersets with the same support. PEP is the only type of pruning used when mining for frequent closed itemsets (FCI). Recall from Section 2.1 that PEP moves all extensions with the same support from the tail to the head of each node. Any items remaining in the tail must have a lower support and thus lead to different closed itemsets. Note that we must still check for supersets in the previously discovered FCI.

4 Optimizations

4.1 Effective MFI Superset Checking

In order to enumerate the exact set of maximal frequent itemsets, before adding any itemset to the MFI we must check the entire MFI to ensure that no superset of the itemset has already been found. This check is done often, and significant performance improvements can be realized if it is done efficiently. To ensure this, we adopt the progressive focusing technique introduced by Gouda and Zaki [7].

The basic idea is that while the entire MFI may be large, at any given node only a fraction of the MFI are possible supersets of the itemset at the node. We therefore maintain for each node an LMFI (Local MFI), which is the subset of the MFI that contains supersets of the current node's itemset. For more details on the LMFI concept, please see the paper by Gouda and Zaki [7].

4.2 Support Counting and Bitmap Compression

MAFIA uses a vertical bitmap representation for the database [6]. In a vertical bitmap, there is one bit for each transaction in the database. If item i appears in transaction t, then bit t of the bitmap for item i is set to one; otherwise, the bit is set to zero. This naturally extends to itemsets. Generation of new itemset bitmaps involves bitwise-ANDing bitmap(X) with the bitmap for a 1-itemset {y} and storing the result in bitmap(X ∪ {y}). For each byte in bitmap(X ∪ {y}), the number of 1's in the byte is determined using a pre-computed table. Summing these lookups gives the support of X ∪ {y}.

4.3 Compression and Projected Bitmaps

The weakness of a vertical representation is the sparseness of the bitmaps, especially at the lower support levels. Since every transaction has a bit in vertical bitmaps, there are many zeros, because both the absence and presence of the itemset in a transaction need to be represented. However, note that we only need information about transactions containing the itemset to count the support of the subtree rooted at node n. So, conceptually we can remove the bit for transaction t from the bitmaps at node n if t does not contain n's head. This is
Dataset          T        I      ATL
T10I4D100K       100,000  1,000  10
T40I10D100K      100,000  1,000  40
BMS-POS          515,597  1,657  6.53
BMS-WebView-1    59,602   497    2.51
BMS-WebView-2    3,340    161    4.62
chess            3,196    76     37
connect4         67,557   130    43
pumsb            49,046   7,117  74
pumsb-star       49,046   7,117  50

T = Number of transactions
I = Number of items
ATL = Average transaction length

Figure 1. Dataset Statistics

Figure 2. Itemset Lengths for shallow, artificial datasets (MFI itemset length distribution for T10I4D100K and T40I10D100K; frequency in % vs. itemset length)
a form of lossless compression on the vertical bitmaps to speed up calculations.
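As a concrete illustration of the two ideas above, the sketch below builds vertical bitmaps over a toy database, counts support with a pre-computed byte table, and projects a bitmap relative to an itemset. The toy database and helper names are our own illustrative assumptions, not the paper's actual code.

```python
# Sketch of MAFIA's vertical bitmaps (Section 4.2) and bitmap
# projection (Section 4.3).  The toy database and helper names are
# illustrative assumptions, not the paper's actual code.

transactions = [
    {"a", "b", "c"},
    {"a", "c"},
    {"a", "b"},
    {"b", "c"},
    {"a", "b", "c"},
]
NUM_TXNS = len(transactions)

def item_bitmap(item):
    # Bit t of an item's bitmap is 1 iff transaction t contains the
    # item; a Python int serves as an arbitrary-length bitmap.
    bits = 0
    for t, txn in enumerate(transactions):
        if item in txn:
            bits |= 1 << t
    return bits

# Pre-computed table: number of 1's in each possible byte value.
ONES_IN_BYTE = [bin(b).count("1") for b in range(256)]

def support(bitmap):
    # Sum the table lookups over the bytes of the bitmap.
    count = 0
    while bitmap:
        count += ONES_IN_BYTE[bitmap & 0xFF]
        bitmap >>= 8
    return count

# bitmap(X ∪ {y}) is the bitwise AND of bitmap(X) and bitmap({y}).
bm_ab = item_bitmap("a") & item_bitmap("b")
assert support(bm_ab) == 3

def project(base, other):
    # Lossless compression: drop the bits of `other` for transactions
    # absent from `base`, so the projected bitmap is shorter and denser.
    out, pos = 0, 0
    for t in range(NUM_TXNS):
        if (base >> t) & 1:
            out |= ((other >> t) & 1) << pos
            pos += 1
    return out

# Projection preserves supports within the subtree of itemset {a}.
proj_c = project(item_bitmap("a"), item_bitmap("c"))
assert support(proj_c) == support(item_bitmap("a") & item_bitmap("c"))
```

Real implementations AND whole machine words at a time; the per-bit loop in `project` only makes the relabeling of transaction positions explicit.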
4.3.1 Adaptive Compression

Determining when to compress the bitmaps is not as simple as it first appears. Each 1-extension bitmap in the tail of the node must be projected relative to the node's itemset, and the cost of projection may outweigh the benefits of using the compressed bitmaps. The best approach is to compress only when we know that the savings from using the compressed bitmaps outweigh the cost of projection.

We use an adaptive approach to determine when to apply compression. At each node, we estimate both the cost of compression and the benefits of using the compressed bitmaps instead of the full bitmaps. When the benefits outweigh the costs, compression is chosen for that node and the subtree rooted at that node.

Figure 3. Itemset Lengths for shallow, real datasets (MFI itemset length distribution for BMS-POS, BMS-WebView-1, and BMS-WebView-2; frequency in % vs. itemset length)
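Before turning to the experiments, the pruning machinery of Sections 2-3 can be summarized in a deliberately simplified miner: depth-first traversal, PEP, dynamic reordering, and the exact MFI superset check. FHUT, HUTMFI, progressive focusing, and the bitmap machinery are omitted, and the toy database is an illustrative assumption rather than MAFIA's actual implementation.

```python
# A deliberately simplified maximal-frequent-itemset miner in the
# spirit of Sections 2-3.  Tidsets stand in for the vertical bitmaps;
# the toy database and names are illustrative assumptions only.

MINSUP = 2
transactions = [
    {"a", "b", "c"},
    {"a", "b", "c"},
    {"a", "b"},
    {"b", "c"},
]

def tidset(itemset):
    # IDs of the transactions that contain every item of `itemset`.
    return {t for t, txn in enumerate(transactions) if itemset <= txn}

def mine(head, tail, mfi):
    head_tids = tidset(head)
    exts = []
    for y in tail:
        y_tids = tidset(head | {y})
        if len(y_tids) < MINSUP:
            continue                  # Apriori principle: prune y
        if y_tids == head_tids:
            head = head | {y}         # PEP: y occurs wherever head does
        else:
            exts.append(y)
    # Dynamic reordering: visit frequent extensions by increasing support.
    exts.sort(key=lambda y: len(tidset(head | {y})))
    for k, y in enumerate(exts):
        mine(head | {y}, exts[k + 1:], mfi)
    # Superset check: record head only if no MFI element already covers it.
    if head and not any(head <= m for m in mfi):
        mfi.append(head)

items = sorted({i for txn in transactions for i in txn})
mfi = []
mine(set(), items, mfi)
print(mfi)   # the only maximal frequent itemset here is {a, b, c}
```

Because children are explored before their parent is checked against the MFI, every non-maximal head finds a superset already recorded, which is the property the depth-first order guarantees.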
5 Experimental Results

5.1 Datasets

To test MAFIA, we used three different types of data. The first group of datasets is sparse; the frequent itemset patterns are short and thus nodes in the itemset tree will have small tails and few branches. We first used artificial datasets that were created using the data generator from IBM Almaden [1]. Stats for these datasets can be found in Figure 1 under T10I4D100K and T40I10D100K. The distribution of maximal frequent itemsets is displayed in Figure 2. For all datasets, the minimum support was chosen to yield around 100,000 elements in the MFI. Note that both T10I4 and T40I10 have very high concentrations of itemsets around two and three items long, with T40I10 having another smaller peak around eight to nine items.

Figure 4. Itemset Lengths for dense, real datasets (MFI itemset length distribution for chess, connect4, pumsb, and pumsb_star; frequency in % vs. itemset length)
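Itemset-length distributions like those in Figures 2-4 are easy to derive from a mined MFI; a small sketch, where the tiny MFI is made-up illustrative data rather than a result from the paper:

```python
# Sketch of the itemset-length frequency distribution plotted in
# Figures 2-4, computed from a made-up MFI (illustrative data only).
from collections import Counter

mfi = [{"a", "b"}, {"b", "c"}, {"a", "c", "d"}, {"d"}]

lengths = Counter(len(s) for s in mfi)
for length in sorted(lengths):
    pct = 100.0 * lengths[length] / len(mfi)
    print(f"length {length}: {pct:.1f}% of maximal itemsets")
```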
The second dataset type is click stream data from two different e-commerce websites (BMS-WebView-1 and BMS-WebView-2), where each transaction is a web session and each item is a product page view; this data was provided by Blue Martini [8]. BMS-POS contains point-of-sale data from an electronics retailer with the item-ids corresponding to product categories. Figure 3 shows that BMS-POS and BMS-WebView-1 have very similar normal-curve itemset distributions, with the average length of a maximal frequent itemset around five to six items. On the other hand, BMS-WebView-2 has a right-skewed distribution; there is a sharp incline until three items and then a more gradual decline on the right tail.

Finally, the last datasets used for analysis are the dense datasets. They are characterized by very long itemset patterns that peak around 10-25 items (see Figure 4). Chess and Connect4 are gathered from game state information and are available from the UCI Machine Learning Repository [5]. The Pumsb dataset is census data from PUMS (Public Use Microdata Sample). Pumsb-star is the same dataset as Pumsb except all items with 80% support or more have been removed, making it less dense and easier to mine. Figure 4 shows that Chess and Pumsb have nearly identical itemset distributions that are normal around 10-12 items long. Connect4 and Pumsb-star are somewhat left-skewed with a slower incline that peaks around 20-23 items and then a sharp decline in the length of the frequent itemsets.

5.2 Other Algorithms

5.2.1 DepthProject

DepthProject demonstrated an order of magnitude improvement over previous algorithms for mining maximal frequent itemsets [2]. MAFIA was originally designed with DepthProject as the primary benchmark for comparison, and we have implemented our own version of the DepthProject algorithm for testing.

The primary differences between MAFIA and DepthProject are the database representation (and consequently the support counting) and the application of pruning tools. DepthProject uses a horizontal database layout while MAFIA uses a vertical bitmap format, and supports of itemsets are counted very differently. Both algorithms use some form of compression when the bitmaps become sparse. However, DepthProject also utilizes a specialized counting technique called bucketing for the lower levels of the itemset lattice. When the tail of a node is small enough, bucketing will count the entire subtree with one pass over the data. Since bucketing counts all of the nodes in a subtree, many itemsets that MAFIA will prune out will be counted with DepthProject. For more details on the DepthProject algorithm, please refer to the paper by Agarwal and Aggarwal [2].

5.2.2 GenMax

GenMax is a new algorithm by Gouda and Zaki for finding maximal itemset patterns [7]. GenMax introduced a novel concept for finding supersets in the MFI called progressive focusing. The newest version of MAFIA has incorporated this technique with the LMFI update. GenMax also uses diffset propagation for fast support counting. Both algorithms use similar methods for itemset lattice exploration and pruning of the search space.

5.3 Experimental Analysis

We performed three types of experiments to analyze the performance of MAFIA. First, we analyze the effect of each pruning component of the MAFIA algorithm to demonstrate how the algorithm works to trim the search space of the itemset lattice. The second set of experiments examines the savings generated by using compression to speed up support counting. Finally, we compare the performance of MAFIA against other current algorithms on all three types of data (see Section 5.1). In general, MAFIA works best on dense data with long itemsets, though the algorithm is still competitive on even very shallow data.

These experiments were conducted on a 1500 MHz Pentium with 1 GB of memory running Red Hat Linux 9.0. All code was written in C++ and compiled using gcc version 3.2 with all optimizations enabled.

5.3.1 Algorithmic Component Analysis

First, we present a full analysis of each pruning component of the MAFIA algorithm (see Section 2 for algorithmic details). There are three types of pruning used to trim the tree: FHUT, HUTMFI, and PEP. FHUT and HUTMFI are both forms of superset pruning and thus will tend to "overlap" in their efficacy for reducing the search space. In addition, dynamic reordering can significantly reduce the size of the search space by removing infrequent items from each node's tail.

Figures 5 and 6 show the effects of each component of the MAFIA algorithm on the Connect4 dataset at 40% minimum support. The components of the algorithm are represented in a cube format with the running times (in seconds) and the number of itemsets counted during the MAFIA search. The top of the cube shows the time for a simple traversal where the full search space is explored, while the bottom of the cube corresponds to all three pruning methods being used. Two separate cubes (with and without dynamic reordering) rather than one giant cube are presented for readability.

Note that all of the pruning components yield great savings in running time compared to using no pruning. Applying a single pruning mechanism runs two to three orders of
Figure 5. Pruning Components for Connect4 at 40% support without reordering (time / itemsets counted):

NONE:   8,423.85s / 341,515,395
FHUT:   173.62s / 7,523,948
HUTMFI: 101.54s / 4,471,023
PEP:    20.56s / 847,439
FH+HM:  101.25s / 4,429,998
FH+PEP: 9.84s / 409,741
HM+PEP: 2.67s / 102,759
ALL:    2.48s / 96,871

Figure 6. Pruning Components for Connect4 at 40% support with reordering (time / itemsets counted):

NONE:   12,158.15s / 339,923,486
FHUT:   15.56s / 609,993
HUTMFI: 14.98s / 609,100
PEP:    9.89s / 296,685
FH+HM:  14.78s / 608,222
FH+PEP: 1.82s / 63,027
HM+PEP: 1.74s / 62,307
ALL:    1.72s / 62,244
magnitude faster, while using all of the pruning tools is four orders of magnitude faster than no pruning.

Several of the pruning components seem to overlap in trimming the search space. In particular, HUTMFI and FHUT yield very similar results, since they use the same type of superset pruning but with different methods of implementation. It is interesting to see that adding FHUT when HUTMFI is already performed yields very little savings; i.e., from HUTMFI to FH+HM or from HM+PEP to ALL, the running times do not significantly change. HUTMFI first checks for the frequency of a node's HUT by looking for a frequent superset in the MFI, while FHUT will explore the leftmost branch of the subtree rooted at that node. Apparently, there are very few cases where a superset of a node's HUT is not in the MFI, but the HUT is frequent.

PEP has the largest impact of the three pruning methods. Most of the running time of the algorithm occurs at the lower levels of the tree, where the border between frequent and infrequent itemsets exists. Near this border, many of the itemsets have the same exact support right above the minimum support, and thus PEP is more likely to trim out large sections of the tree at the lower levels.

Dynamically reordering the tail also has dramatic savings (cf. Figure 5 with Figure 6). At the top of each cube, it is interesting to note that without any pruning mechanisms, dynamic reordering will actually run slower than static ordering. Fewer itemsets get counted, but the cost of reordering so many nodes outweighs the savings of counting fewer nodes.

However, once pruning is applied, dynamic reordering runs nearly an order of magnitude faster than the static ordering. PEP is more effective since the tail is trimmed as early in the tree as possible; all of the extensions with the same support are moved from the tail to the head in one step at the start of the subtree. Also, FHUT and HUTMFI have much more impact. With dynamic reordering, subtrees generated from the end of the tail have the itemsets with the highest supports, and thus the HUT is more likely to be frequent.

5.3.2 Effects of Compression in MAFIA

Adaptive compression uses cost estimation to determine when it is appropriate to compress the bitmaps. Since the cost estimate adapts to each dataset, adaptive compression is always better than using no compression. Results on different types of data show that adaptive compression is at least 25% faster at higher supports, and at lower supports up to an order of magnitude faster.

Figures 7 and 8 display the effect of compression on sparse data. First, we analyze the sparse, artificial datasets T10I4 and T40I10 that are characterized by very short itemsets, where the average length of maximal frequent itemsets is only 2-6 items. Because these datasets are so sparse with small subtrees, at higher supports compression is not
Figure 7. Compression on sparse datasets (running time vs. minimum support for T10I4D100K and T40I10D100K, with and without adaptive compression)

often used and thus has a negligible effect. But as the support drops and the subtrees grow larger, the effect of compression is enhanced, and adaptive compression becomes nearly 3-10 times faster.

Next are the results on the sparse, real datasets: BMS-POS, BMS-WebView-1, and BMS-WebView-2 in Figure 8. Note that for BMS-POS, adaptive compression follows the exact same pattern as the synthetic datasets, with the difference growing from negligible to over 10 times better. BMS-WebView-1 follows the same general pattern except for an anomalous spike in the running times without compression around .05%. However, for BMS-WebView-2 compression has a very small impact and is only really effective at the lowest supports. Recall from Figure 3 that BMS-WebView-2 has a right-skewed distribution of frequent itemsets, which may help explain the different compression effect.

Figure 8. Compression on more sparse datasets (running time vs. minimum support for BMS-POS, BMS-WebView-1, and BMS-WebView-2, with and without adaptive compression)

The final group of datasets is found in Figure 9 and shows the results of compression on dense, real data. The results on Chess and Pumsb indicate that very few compressed bitmaps were used; apparently, the adaptive compression algorithm determined compression to be too expensive. As a result, adaptive compression is only around 15-30% better than using no compression at all. On the other hand, the Connect4 and Pumsb-star datasets use a much higher ratio of compressed bitmaps, and adaptive compression is more than three times faster than no compression.

It is interesting to note that Chess and Pumsb both have left-skewed distributions (see Figure 4) while Connect4 and Pumsb-star follow a more normal distribution of itemsets. The results indicate that when the data is skewed (left or right), adaptive compression is not as effective. Still, even in the worst case adaptive compression will use the cost estimate to determine that compression should not be chosen and thus is at least as fast as never compressing at all. In the best case, compression can significantly speed up support
counting by over an order of magnitude.

Figure 9. Compression on dense datasets (running time vs. minimum support for Chess, Pumsb, Connect4, and Pumsb-star, with and without adaptive compression)

Figure 10. Performance on sparse datasets (running time vs. minimum support for MAFIA, DepthProject, and GenMax on T10I4D100K)

5.3.3 Performance Comparisons

Figures 10 and 11 show the results of comparing MAFIA with DepthProject and GenMax on sparse data. MAFIA is always faster than DepthProject, growing from twice as fast at the higher supports to more than 20 times faster at the lowest supports tested. GenMax demonstrates the best performance of the three algorithms at higher supports and is around two to three times faster than MAFIA. However, note that as the support drops and the itemsets become longer, MAFIA passes GenMax in performance to become the fastest algorithm.

The performances for the sparse, real datasets are found in Figure 11. MAFIA has the worst performance on BMS-WebView-2 at higher supports, though it eventually passes DepthProject as the support lowers. BMS-POS and BMS-WebView-1 follow a similar pattern to the synthetic datasets, where MAFIA is always better than DepthProject, and GenMax is better than MAFIA until the lower supports where they cross over. In fact, at the lowest supports for BMS-WebView-1, MAFIA is an order of magnitude better than GenMax and over 50 times faster than DepthProject. It is clear that MAFIA performs best when the itemsets are longer, though even for sparse data MAFIA is within two to three times the running times of DepthProject and GenMax.

The dense datasets in Figure 12 support the idea that MAFIA runs the fastest on longer itemsets. For all supports on the dense datasets, MAFIA has the best performance. MAFIA runs around two to five times faster than GenMax on Connect4, Pumsb, and Pumsb-star and over five to ten times faster on Chess. DepthProject is by far the slowest algorithm on all of the dense datasets and runs between ten and thirty times worse than MAFIA across all supports.
Figure 11. Performance on more sparse datasets (running time vs. minimum support for MAFIA, DepthProject, and GenMax on T40I10D100K, BMS-POS, BMS-WebView-1, and BMS-WebView-2)

Figure 12. Performance on dense datasets (running time vs. minimum support for MAFIA, DepthProject, and GenMax on Chess, Pumsb, Connect4, and Pumsb_star)
6 Conclusion

In this paper we present a detailed performance analysis of MAFIA. The breakdown of the algorithmic components shows that powerful pruning techniques such as parent-equivalence pruning and superset checking are very beneficial in reducing the search space. We also show that adaptive compression/projection of the vertical bitmaps dramatically cuts the cost of counting supports of itemsets. Our experimental results demonstrate that MAFIA is highly optimized for mining long itemsets and on dense data consistently outperforms GenMax by a factor of two to ten and DepthProject by a factor of ten to thirty.
Acknowledgements: We would like to thank Ramesh
Agarwal and Charu Aggarwal for discussing DepthProject
and giving us advice on its implementation. We also thank
Jayant Haritsa for his insightful comments on the MAFIA
algorithm, Jiawei Han for helping in our understanding of
CLOSET and providing us the executable of the FP-Tree
algorithm, and Mohammed Zaki for making the source code
of GenMax available.
References

[1] Data generator available at http://www.almaden.ibm.com/software/quest/Resources/.
[2] R. C. Agarwal, C. C. Aggarwal, and V. V. V. Prasad. Depth first generation of long patterns. In Knowledge Discovery and Data Mining, pages 108–118, 2000.
[3] R. C. Agarwal, C. C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation of frequent item sets. Journal of Parallel and Distributed Computing, 61(3):350–371, 2001.
[4] R. J. Bayardo. Efficiently mining long patterns from databases. In SIGMOD, pages 85–93, 1998.
[5] C. Blake and C. Merz. UCI repository of machine learning databases, 1998.
[6] D. Burdick, M. Calimlim, and J. Gehrke. MAFIA: A maximal frequent itemset algorithm for transactional databases. In ICDE 2001, Heidelberg, Germany, 2001.
[7] K. Gouda and M. J. Zaki. Efficiently mining maximal frequent itemsets. In ICDM, pages 163–170, 2001.
[8] R. Kohavi, C. Brodley, B. Frasca, L. Mason, and Z. Zheng. KDD-Cup 2000 organizers' report: Peeling the onion. SIGKDD Explorations, 2(2):86–98, 2000. http://www.ecn.purdue.edu/KDDCUP.
[9] R. Rymon. Search through systematic set enumeration. In International Conference on Principles of Knowledge Representation and Reasoning, pages 539–550, 1992.