=Paper= {{Paper |id=Vol-90/paper-10 |storemode=property |title=MAFIA: A Performance Study of Mining Maximal Frequent Itemsets |pdfUrl=https://ceur-ws.org/Vol-90/burdick.pdf |volume=Vol-90 |dblpUrl=https://dblp.org/rec/conf/fimi/BurdickCFGY03 }} ==MAFIA: A Performance Study of Mining Maximal Frequent Itemsets== https://ceur-ws.org/Vol-90/burdick.pdf
         MAFIA: A Performance Study of Mining Maximal Frequent Itemsets

              Doug Burdick*                                       Manuel Calimlim                     Jason Flannick†
     University of Wisconsin-Madison                              Cornell University                 Stanford University
         who0ps99@cs.wisc.edu                                  calimlim@cs.cornell.edu            flannick@cs.stanford.edu
                                        Johannes Gehrke                              Tomi Yiu
                                       Cornell University                        Cornell University
                                    johannes@cs.cornell.edu                      ty42@cornell.edu


                            Abstract

   We present a performance study of the MAFIA algorithm for mining maximal frequent itemsets from a transactional database. In a thorough experimental analysis, we isolate the effects of individual components of MAFIA, including search space pruning techniques and adaptive compression. We also compare our performance with previous work by running tests on very different types of datasets. Our experiments show that MAFIA performs best when mining long itemsets and outperforms other algorithms on dense data by a factor of three to thirty.

1 Introduction

   MAFIA uses a vertical bitmap representation for support counting and effective pruning mechanisms for searching the itemset lattice [6]. The algorithm is designed to mine maximal frequent itemsets (MFI), but by changing some of the pruning tools, MAFIA can also generate all frequent itemsets (FI) and frequent closed itemsets (FCI).
   MAFIA assumes that the entire database (and all data structures used for the algorithm) completely fits into main memory. Since all algorithms for finding association rules, including algorithms that work with disk-resident databases, are CPU-bound, we believe that our study sheds light on some important performance bottlenecks.
   In a thorough experimental evaluation, we first quantify the effect of each individual pruning component on the performance of MAFIA. Because of our strong pruning mechanisms, MAFIA performs best on dense datasets where large subtrees can be removed from the search space. On shallow datasets, MAFIA is competitive though not always the fastest algorithm. On dense datasets, our results indicate that MAFIA outperforms other algorithms by a factor of three to thirty.

   * Research for this paper done while at Cornell University.
   † Research for this paper done while at Cornell University.

2 Search Space Pruning

   MAFIA uses the lexicographic subset tree originally presented by Rymon [9] and adopted by both Agarwal [3] and Bayardo [4]. The itemset identifying each node will be referred to as the node's head, while possible extensions of the node are called the tail. In a pure depth-first traversal of the tree, the tail contains all items lexicographically larger than any element of the head. With a dynamic reordering scheme, the tail contains only the frequent extensions of the current node. Notice that all items that can appear in a subtree are contained in the subtree root's head union tail (HUT), the set formed by combining all elements of the head and tail.
   In the simplest itemset traversal, we traverse the lexicographic tree in pure depth-first order. At each node n, each element in the node's tail is generated and counted as a 1-extension. If the support of n's head together with a 1-extension is less than the minimum support (minsup), then we can stop by the Apriori principle, since any itemset extending that 1-extension would have an infrequent subset.
   For each candidate itemset, we need to check if a superset of the candidate itemset is already in the MFI. If no superset exists, then we add the candidate itemset to the MFI. It is important to note that with the depth-first traversal, itemsets already inserted into the MFI are lexicographically ordered earlier.

2.1 Parent Equivalence Pruning (PEP)

   One method of pruning involves comparing the transaction sets of each parent/child pair. Let x be a node's head and y an element of the node's tail. If t(x) ⊆ t(y), then any transaction containing x also contains y. Since we only want the maximal frequent itemsets, it is not necessary to count itemsets containing x but not y. Therefore, we can move item y from the node's tail to the node's head.

2.2 FHUT

   Another type of pruning is superset pruning. We observe that at node n, the largest possible frequent itemset contained in the subtree rooted at n is n's HUT (head union tail), as observed by Bayardo [4]. If n's HUT is discovered to be frequent, we never have to explore any subsets of the HUT and thus can prune out the entire subtree rooted at node n. We refer to this method of pruning as FHUT (Frequent Head Union Tail) pruning.

2.3 HUTMFI

   There are two methods for determining whether an itemset X is frequent: directly counting the support of X, and checking whether a superset of X has already been declared frequent; FHUT uses the former method. The latter approach determines if a superset of the HUT is in the MFI. If such a superset does exist, then the HUT must be frequent and the subtree rooted at the corresponding node can be pruned away. We call this type of superset pruning HUTMFI.

2.4 Dynamic Reordering

   The benefit of dynamically reordering the children of each node based on support, instead of following the lexicographic order, is significant. An algorithm that trims the tail to only the frequent extensions at a higher level will save a lot of computation. The order of the tail elements is also an important consideration. Ordering the tail elements by increasing support keeps the search space as small as possible. This heuristic was first used by Bayardo [4].
   In Section 5.3.1, we quantify the effects of the algorithmic components by analyzing different combinations of pruning mechanisms.

3 MAFIA Extensions

   MAFIA is designed and optimized for mining maximal frequent itemsets, but the general framework can be used to mine all frequent itemsets and frequent closed itemsets.
   The algorithm can easily be extended to mine all frequent itemsets. The main changes required are suppressing the pruning tools (PEP, FHUT, HUTMFI) and adding all frequent nodes in the itemset lattice to the set FI without any superset checking. Itemsets are counted using the same techniques as in the regular MAFIA algorithm.
   MAFIA can also be used to mine frequent closed itemsets. An itemset is closed if there are no supersets with the same support. PEP is the only type of pruning used when mining for frequent closed itemsets (FCI). Recall from Section 2.1 that PEP moves all extensions with the same support from the tail to the head of each node. Any items remaining in the tail must have a lower support and thus lead to different closed itemsets. Note that we must still check for supersets in the previously discovered FCI.

4 Optimizations

4.1 Effective MFI Superset Checking

   In order to enumerate the exact set of maximal frequent itemsets, before adding any itemset to the MFI we must check the entire MFI to ensure that no superset of the itemset has already been found. This check is done often, and significant performance improvements can be realized if it is done efficiently. To ensure this, we adopt the progressive focusing technique introduced by Gouda and Zaki [7].
   The basic idea is that while the entire MFI may be large, at any given node only a fraction of the MFI are possible supersets of the itemset at the node. We therefore maintain for each node an LMFI (Local MFI), the subset of the MFI that contains supersets of the current node's itemset. For more details on the LMFI concept, please see the paper by Gouda and Zaki [7].

4.2 Support Counting and Bitmap Compression

   MAFIA uses a vertical bitmap representation for the database [6]. In a vertical bitmap, there is one bit for each transaction in the database. If item i appears in transaction j, then bit j of the bitmap for item i is set to one; otherwise, the bit is set to zero. This naturally extends to itemsets. Generating the bitmap for a new itemset involves bitwise-ANDing bitmap(X) with the bitmap for a 1-itemset {y} and storing the result in bitmap(X ∪ {y}). For each byte in bitmap(X ∪ {y}), the number of 1's in the byte is determined using a pre-computed table. Summing these lookups gives the support of X ∪ {y}.

4.3 Compression and Projected Bitmaps

   The weakness of a vertical representation is the sparseness of the bitmaps, especially at the lower support levels. Since every transaction has a bit in vertical bitmaps, there are many zeros, because both the absence and the presence of the itemset in a transaction need to be represented. However, note that we only need information about the transactions containing the itemset X to count the support of the subtree rooted at node n. So, conceptually, we can remove the bit for transaction t from bitmap(X) if t does not contain X. This is a form of lossless compression on the vertical bitmaps to speed up calculations.

        Dataset          T          I        ATL
        T10I4D100K       100,000    1,000    10
        T40I10D100K      100,000    1,000    40
        BMS-POS          515,597    1,657    6.53
        BMS-WebView-1    59,602     497      2.51
        BMS-WebView-2    3,340      161      4.62
        chess            3,196      76       37
        connect4         67,557     130      43
        pumsb            49,046     7,117    74
        pumsb-star       49,046     7,117    50

        T = number of transactions
        I = number of items
        ATL = average transaction length

              Figure 1. Dataset Statistics

   [Figure 2. Itemset Lengths for shallow, artificial datasets: MFI itemset length distribution (frequency in % vs. itemset length) for T10I4D100K and T40I10D100K]
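To make the vertical-bitmap machinery of Section 4.2 and the projection idea of Section 4.3 concrete, here is a minimal Python sketch. This is illustrative only (the authors' implementation is in C++); the MSB-first bit layout is an arbitrary choice, and the projection shown works at byte granularity rather than the bit granularity described above.

```python
# Sketch of MAFIA-style vertical bitmaps: one bit per transaction, AND to
# extend itemsets, and a precomputed per-byte table for fast support counts.

# Precomputed table: number of 1-bits in each possible byte value.
ONES = [bin(b).count("1") for b in range(256)]

def make_bitmap(transactions, item, n):
    """Bitmap for `item` over `n` transactions (8 transactions per byte)."""
    bits = bytearray((n + 7) // 8)
    for t, items in enumerate(transactions):
        if item in items:
            bits[t // 8] |= 1 << (7 - t % 8)   # MSB-first within each byte
    return bits

def and_bitmaps(a, b):
    """bitmap(X ∪ {y}) = bitmap(X) AND bitmap({y})."""
    return bytearray(x & y for x, y in zip(a, b))

def support(bitmap):
    """Sum per-byte table lookups to count the set bits."""
    return sum(ONES[b] for b in bitmap)

def project(child_bitmap, parent_bitmap):
    """Byte-granularity stand-in for projection: keep only the bytes that
    cover transactions the parent itemset appears in."""
    return bytearray(c for c, p in zip(child_bitmap, parent_bitmap) if p)

transactions = [{1, 2}, {1, 3}, {2, 3}, {1, 2, 3}]
b1 = make_bitmap(transactions, 1, 4)
b2 = make_bitmap(transactions, 2, 4)
b12 = and_bitmaps(b1, b2)
print(support(b12))  # prints 2: transactions {1,2} and {1,2,3} contain both items
```

A real implementation would AND and popcount machine words rather than single bytes, but the structure of the computation is the same.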
   [Figure 3. Itemset Lengths for shallow, real datasets: MFI itemset length distribution (frequency in % vs. itemset length) for BMS-POS, BMS-WebView-1, and BMS-WebView-2]

4.3.1 Adaptive Compression

   Determining when to compress the bitmaps is not as simple as it first appears. Each 1-extension bitmap in the tail of the node must be projected relative to the itemset X, and the cost of projection may outweigh the benefits of using the compressed bitmaps. The best approach is to compress only when we know that the savings from using the compressed bitmaps outweigh the cost of projection.
   We use an adaptive approach to determine when to apply compression. At each node, we estimate both the cost of compression and the benefit of using the compressed bitmaps instead of the full bitmaps. When the benefits outweigh the costs, compression is chosen for that node and the subtree rooted at that node.
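The shape of this cost/benefit decision can be sketched as follows. The section above does not spell out the exact cost model, so the quantities and formulas here (projection cost proportional to tail size times bitmap length, savings proportional to the number of remaining bitmap ANDs in the subtree) are hypothetical placeholders, not the authors' actual estimator.

```python
# Hypothetical sketch of the adaptive-compression decision at a node.
# Compare the one-time cost of projecting every tail bitmap against the
# cumulative savings from running all future ANDs on shorter bitmaps.

def should_compress(num_tail_items, bitmap_bytes, surviving_bytes,
                    est_ands_in_subtree):
    """Return True if projected (compressed) bitmaps are estimated to pay off.

    num_tail_items      -- tail bitmaps that must each be projected (cost side)
    bitmap_bytes        -- current bitmap length in bytes
    surviving_bytes     -- bytes that survive projection, i.e. those covering
                           transactions that actually contain the itemset X
    est_ands_in_subtree -- rough count of future bitmap ANDs in the subtree
    """
    projection_cost = num_tail_items * bitmap_bytes       # one pass per tail bitmap
    savings_per_and = bitmap_bytes - surviving_bytes      # shorter bitmaps from now on
    expected_savings = est_ands_in_subtree * savings_per_and
    return expected_savings > projection_cost

# Deep, sparse node with lots of remaining work -> compress.
print(should_compress(10, 10_000, 500, 2_000))  # True
# Near the root: almost-full bitmap, little remaining work -> keep full bitmaps.
print(should_compress(10, 10_000, 9_500, 100))  # False
```

The point of the adaptive scheme is exactly this: the same dataset can favor compression deep in the tree (sparse bitmaps, many descendants) and reject it near the root, so the decision is re-evaluated per node.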

5 Experimental Results

5.1 Datasets

   To test MAFIA, we used three different types of data. The first group of datasets is sparse; the frequent itemset patterns are short, and thus nodes in the itemset tree will have small tails and few branches. We first used artificial datasets created with the data generator from IBM Almaden [1]. Statistics for these datasets can be found in Figure 1 under T10I4D100K and T40I10D100K. The distribution of maximal frequent itemsets is displayed in Figure 2. For all datasets, the minimum support was chosen to yield around 100,000 elements in the MFI. Note that both T10I4 and T40I10 have very high concentrations of itemsets around two and three items long, with T40I10 having another smaller peak around eight to nine items.

   [Figure 4. Itemset Lengths for dense, real datasets: MFI itemset length distribution (frequency in % vs. itemset length) for chess, connect4, pumsb, and pumsb_star]
   The second dataset type is click stream data from two different e-commerce websites (BMS-WebView-1 and BMS-WebView-2), where each transaction is a web session and each item is a product page view; this data was provided by Blue Martini [8]. BMS-POS contains point-of-sale data from an electronics retailer, with the item-ids corresponding to product categories. Figure 3 shows that BMS-POS and BMS-WebView-1 have very similar normal-curve itemset distributions, with the average length of a maximal frequent itemset around five to six items. On the other hand, BMS-WebView-2 has a right-skewed distribution; there is a sharp incline until three items and then a more gradual decline on the right tail.
   Finally, the last datasets used for analysis are the dense datasets. They are characterized by very long itemset patterns that peak around 10-25 items (see Figure 4). Chess and Connect4 are gathered from game state information and are available from the UCI Machine Learning Repository [5]. The Pumsb dataset is census data from PUMS (Public Use Microdata Sample). Pumsb-star is the same dataset as Pumsb except that all items with 80% support or more have been removed, making it less dense and easier to mine. Figure 4 shows that Chess and Pumsb have nearly identical itemset distributions that are normal around 10-12 items long. Connect4 and Pumsb-star are somewhat left-skewed, with a slower incline that peaks around 20-23 items and then a sharp decline in the length of the frequent itemsets.

5.2 Other Algorithms

5.2.1 DepthProject

   DepthProject demonstrated an order of magnitude improvement over previous algorithms for mining maximal frequent itemsets [2]. MAFIA was originally designed with DepthProject as the primary benchmark for comparison, and we have implemented our own version of the DepthProject algorithm for testing.
   The primary differences between MAFIA and DepthProject are the database representation (and consequently the support counting) and the application of pruning tools. DepthProject uses a horizontal database layout while MAFIA uses a vertical bitmap format, and supports of itemsets are counted very differently. Both algorithms use some form of compression when the bitmaps become sparse. However, DepthProject also utilizes a specialized counting technique called bucketing for the lower levels of the itemset lattice. When the tail of a node is small enough, bucketing will count the entire subtree with one pass over the data. Since bucketing counts all of the nodes in a subtree, many itemsets that MAFIA will prune out will be counted with DepthProject. For more details on the DepthProject algorithm, please refer to the paper by Agarwal and Aggarwal [2].

5.2.2 GenMax

   GenMax is a new algorithm by Gouda and Zaki for finding maximal itemset patterns [7]. GenMax introduced a novel concept for finding supersets in the MFI called progressive focusing. The newest version of MAFIA has incorporated this technique with the LMFI update. GenMax also uses diffset propagation for fast support counting. Both algorithms use similar methods for itemset lattice exploration and pruning of the search space.

5.3 Experimental Analysis

   We performed three types of experiments to analyze the performance of MAFIA. First, we analyze the effect of each pruning component of the MAFIA algorithm to demonstrate how the algorithm works to trim the search space of the itemset lattice. The second set of experiments examines the savings generated by using compression to speed support counting. Finally, we compare the performance of MAFIA against other current algorithms on all three types of data (see Section 5.1). In general, MAFIA works best on dense data with long itemsets, though the algorithm is still competitive on even very shallow data.
   These experiments were conducted on a 1500 MHz Pentium with 1 GB of memory running RedHat Linux 9.0. All code was written in C++ and compiled using gcc version 3.2 with all optimizations enabled.

5.3.1 Algorithmic Component Analysis

   First, we present a full analysis of each pruning component of the MAFIA algorithm (see Section 2 for algorithmic details). There are three types of pruning used to trim the tree: FHUT, HUTMFI, and PEP. FHUT and HUTMFI are both forms of superset pruning and thus will tend to "overlap" in their efficacy for reducing the search space. In addition, dynamic reordering can significantly reduce the size of the search space by removing infrequent items from each node's tail.
   Figures 5 and 6 show the effects of each component of the MAFIA algorithm on the Connect4 dataset at 40% minimum support. The components of the algorithm are represented in a cube format with the running times (in seconds) and the number of itemsets counted during the MAFIA search. The top of the cube shows the time for a simple traversal where the full search space is explored, while the bottom of the cube corresponds to all three pruning methods being used. Two separate cubes (with and without dynamic reordering) rather than one giant cube are presented for readability.

        Pruning        Time (s)      Itemsets counted
        NONE           8,423.85      341,515,395
        FHUT             173.62        7,523,948
        HUTMFI           101.54        4,471,023
        PEP               20.56          847,439
        FH+HM            101.25        4,429,998
        FH+PEP             9.84          409,741
        HM+PEP             2.67          102,759
        ALL                2.48           96,871

   Figure 5. Pruning Components for Connect4 at 40% support without reordering

        Pruning        Time (s)      Itemsets counted
        NONE          12,158.15      339,923,486
        FHUT              15.56          609,993
        HUTMFI            14.98          609,100
        PEP                9.89          296,685
        FH+HM             14.78          608,222
        FH+PEP             1.82           63,027
        HM+PEP             1.74           62,307
        ALL                1.72           62,244

   Figure 6. Pruning Components for Connect4 at 40% support with reordering

   Note that all of the pruning components yield great savings in running time compared to using no pruning. Applying a single pruning mechanism runs two to three orders of magnitude faster, while using all of the pruning tools is four orders of magnitude faster than no pruning.
   Several of the pruning components seem to overlap in trimming the search space. In particular, HUTMFI and FHUT yield very similar results, since they use the same type of superset pruning but with different methods of implementation. It is interesting to see that adding FHUT when HUTMFI is already performed yields very little savings; i.e., from HUTMFI to FH+HM or from HM+PEP to ALL, the running times do not significantly change. HUTMFI first checks for the frequency of a node's HUT by looking for a frequent superset in the MFI, while FHUT will explore the leftmost branch of the subtree rooted at that node. Apparently, there are very few cases where a superset of a node's HUT is not in the MFI but the HUT is frequent.
   PEP has the largest impact of the three pruning methods. Most of the running time of the algorithm is spent at the lower levels of the tree, where the border between frequent and infrequent itemsets lies. Near this border, many of the itemsets have exactly the same support, just above the minimum support, and thus PEP is more likely to trim out large sections of the tree at the lower levels.
   Dynamically reordering the tail also yields dramatic savings (cf. Figure 5 with Figure 6). At the top of each cube, it is interesting to note that without any pruning mechanisms, dynamic reordering will actually run slower than static ordering. Fewer itemsets get counted, but the cost of reordering so many nodes outweighs the savings of counting fewer nodes.
   However, once pruning is applied, dynamic reordering runs nearly an order of magnitude faster than the static ordering. PEP is more effective since the tail is trimmed as early in the tree as possible; all of the extensions with the same support are moved from the tail to the head in one step at the start of the subtree. Also, FHUT and HUTMFI have much more impact: with dynamic reordering, subtrees generated from the end of the tail have the itemsets with the highest supports, and thus the HUT is more likely to be frequent.

5.3.2 Effects of Compression in MAFIA

   Adaptive compression uses cost estimation to determine when it is appropriate to compress the bitmaps. Since the cost estimate adapts to each dataset, adaptive compression is always better than using no compression. Results on different types of data show that adaptive compression is at least 25% faster at higher supports and, at lower supports, up to an order of magnitude faster.
   Figures 7 and 8 display the effect of compression on sparse data. First, we analyze the sparse, artificial datasets T10I4 and T40I10, which are characterized by very short itemsets, where the average length of maximal frequent itemsets is only 2-6 items. Because these datasets are so sparse with small subtrees, at higher supports compression is not
often used and thus has a negligible effect. But as the support drops and the subtrees grow larger, the effect of compression is enhanced, and adaptive compression becomes nearly 3-10 times faster.

   [Figure 7. Compression on sparse datasets: running time (s, log scale) vs. minimum support (%) for T10I4D100K and T40I10D100K, with no compression (NONE) and adaptive compression (ADAPTIVE)]

   [Figure 8 (partial): corresponding compression results for BMS-POS and BMS-WebView-1, running time (s, log scale) vs. minimum support (%), NONE vs. ADAPTIVE]

   Next are the results on the sparse, real datasets: BMS-POS, BMS-WebView-1, and BMS-WebView-2 in Figure 8. Note that for BMS-POS, adaptive compression follows the exact same pattern as the synthetic datasets, with the difference growing from negligible to over 10 times better. BMS-WebView-1 follows the same general pattern except for an anomalous spike in the running times without compression around .05%. However, for BMS-WebView-2, compression has a very small impact and is only really effective at the lowest supports. Recall from Figure 3 that BMS-WebView-2 has a right-skewed distribution of fre-
quent itemsets, which may help explain the different com-
pression effect.
                                                                                                                                                       10
    The final group of datasets is found in Figure 9 and
shows the results of compression on dense, real data. The
results on Chess and Pumsb indicate that very few com-                                                                                                 1
                                                                              0.12    0.1           0.08         0.06         0.04          0.02   0
pressed bitmaps were used; apparently, the adaptive com-                                                   Min Sup (%)
pression algorithm determined compression to be too ex-
                                                                                         Compression on BMS-WebView-2
pensive. As a result, adaptive compression is only around                                                                                              10000
15-30% better than using no compression at all. On the                               NONE
                                                                                     ADAPTIVE
other hand, the Connect4 and Pumsb-star datasets use a
much higher ratio of compressed bitmaps and adaptive com-
                                                                                                                                                       1000
pression is more than three times faster than no compres-
                                                                                                                                                             Time (s)




sion.
    It is interesting to note that Chess and Pumsb both have
                                                                                                                                                       100
left-skewed distributions (see Figure 4) while Connect4 and
Pumsb-star follow a more normal distribution of itemsets.
The results indicate that when the data is skewed (left or
                                                                                                                                                       10
right), adaptive compression is not as effective. Still, even                 0.06    0.05          0.04         0.03         0.02          0.01   0
in the worst case adaptive compression will use the cost es-                                               Min Sup (%)
timate to determine that compression should not be chosen
and thus is at least as fast as never compressing at all. In the              Figure 8. Compression on more sparse
best case, compression can significantly speed up support                     datasets
                Compression on Chess                                                    Time Comparison on T10I4D100K
                                                                                                                                      10000
                                                          1000
                                                                                      MAFIA
     NONE                                                                             DP
     ADAPTIVE                                                                         GENMAX
                                                                                                                                      1000

                                                          100




                                                                                                                                             Time (s)
                                                                Time (s)
                                                                                                                                      100



                                                          10
                                                                                                                                      10



                                                                                                                                      1
                                                          1
                                                                               0.06   0.05     0.04      0.03       0.02   0.01   0
35   30      25        20       15       10       5   0
                                                                                                      Min Sup (%)
                      Min Sup (%)

                Compression on Pumsb
                                                          10000                Figure 10. Performance on sparse datasets
     NONE
     ADAPTIVE


                                                          1000
                                                                           counting by over an order of magnitude.
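The cost-based decision described above can be sketched as follows. This is only our illustration of the idea, not MAFIA's actual implementation: the helper names (`should_compress`, `project`) and the cost constants are invented for the example.

```python
def popcount_support(bitmap):
    # Support of an itemset = number of set bits in its vertical bitmap.
    return bin(bitmap).count("1")

def project(bitmap, base_bitmap):
    # "Compress" a bitmap by keeping only the bit positions where the
    # parent itemset's bitmap is set; the result is shorter, so later
    # ANDs and popcounts in the subtree touch fewer words.
    out, pos = 0, 0
    while base_bitmap:
        if base_bitmap & 1:
            out |= (bitmap & 1) << pos
            pos += 1
        bitmap >>= 1
        base_bitmap >>= 1
    return out

def should_compress(n_transactions, support, expected_counts,
                    project_cost=1.0, count_cost=0.1):
    # Hypothetical cost model: pay the one-time projection cost only if
    # it is recovered by cheaper support counts over the subtree.
    full = expected_counts * count_cost * n_transactions
    compressed = (project_cost * n_transactions
                  + expected_counts * count_cost * support)
    return compressed < full
```

For example, projecting the bitmap of an item (0b10010) onto the transactions of a parent itemset (0b10110) yields the 3-bit relative bitmap 0b101, whose popcount (2) is the support of the combined itemset. Because `should_compress` never chooses projection when the estimate says it will not pay off, the adaptive scheme degrades gracefully to the uncompressed case.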
                                                                           5.3.3 Performance Comparisons
Figures 10 and 11 show the results of comparing MAFIA with DepthProject and GenMax on sparse data. MAFIA is always faster than DepthProject, growing from twice as fast at the higher supports to more than 20 times faster at the lowest supports tested. GenMax demonstrates the best performance of the three algorithms at higher supports and is around two to three times faster than MAFIA. However, note that as the support drops and the itemsets become longer, MAFIA passes GenMax to become the fastest algorithm.
   The performance results for the sparse, real datasets are found in Figure 11. MAFIA has the worst performance on BMS-WebView-2 at higher supports, though it eventually passes DepthProject as the support lowers. BMS-POS and BMS-WebView-1 follow a pattern similar to the synthetic datasets, where MAFIA is always better than DepthProject, and GenMax is better than MAFIA until the lower supports, where they cross over. In fact, at the lowest supports for BMS-WebView-1, MAFIA is an order of magnitude better than GenMax and over 50 times faster than DepthProject. It is clear that MAFIA performs best when the itemsets are longer, though even on sparse data MAFIA is within two to three times the running times of DepthProject and GenMax.
   The dense datasets in Figure 12 support the idea that MAFIA runs fastest on longer itemsets. For all supports on the dense datasets, MAFIA has the best performance. MAFIA runs around two to five times faster than GenMax on Connect4, Pumsb, and Pumsb-star and five to ten times faster on Chess. DepthProject is by far the slowest algorithm on the dense datasets, running ten to thirty times slower than MAFIA across all supports.

[Figure 9. Compression on dense datasets: running time (s) vs. minimum support (%) on Chess, Pumsb, Connect4, and Pumsb-star, comparing NONE and ADAPTIVE.]
[Figure 11. Performance on more sparse datasets: running time (s) vs. minimum support (%) on T40I10D100K, BMS-POS, BMS-WebView-1, and BMS-WebView-2 for MAFIA, DepthProject (DP), and GenMax.]

[Figure 12. Performance on dense datasets: running time (s) vs. minimum support (%) on Chess, Pumsb, Connect4, and Pumsb-star for MAFIA, DepthProject (DP), and GenMax.]
6 Conclusion

   In this paper we present a detailed performance analysis of MAFIA. The breakdown of the algorithmic components shows that powerful pruning techniques such as parent-equivalence pruning and superset checking are very beneficial in reducing the search space. We also show that adaptive compression/projection of the vertical bitmaps dramatically cuts the cost of counting the supports of itemsets. Our experimental results demonstrate that MAFIA is highly optimized for mining long itemsets and on dense data consistently outperforms GenMax by a factor of two to ten and DepthProject by a factor of ten to thirty.
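As an illustration of the superset-checking idea mentioned above, here is a minimal sketch (ours, not the paper's code): before extending a candidate itemset, test whether it is already contained in a known maximal frequent itemset (MFI); if so, the entire subtree below it can be pruned, since every itemset in it is frequent but cannot be maximal.

```python
def is_subsumed(candidate, mfi_list):
    # True if the candidate itemset is a subset of an itemset already
    # known to be maximal frequent; its whole subtree can then be pruned.
    return any(candidate <= mfi for mfi in mfi_list)

mfi = [frozenset({1, 2, 3, 5}), frozenset({2, 4})]
print(is_subsumed(frozenset({1, 3, 5}), mfi))  # True: prune this subtree
print(is_subsumed(frozenset({1, 4}), mfi))     # False: must be explored
```

A real implementation would index the MFI list rather than scan it linearly; the scan here is only for clarity.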
   Acknowledgements: We would like to thank Ramesh
Agarwal and Charu Aggarwal for discussing DepthProject
and giving us advice on its implementation. We also thank
Jayant Haritsa for his insightful comments on the MAFIA
algorithm, Jiawei Han for helping in our understanding of
CLOSET and providing us the executable of the FP-Tree
algorithm, and Mohammed Zaki for making the source code
of GenMax available.

References

[1] Data generator available at
    http://www.almaden.ibm.com/software/quest/Resources/.
[2] R. C. Agarwal, C. C. Aggarwal, and V. V. V. Prasad. Depth
    first generation of long patterns. In Knowledge Discovery and
    Data Mining, pages 108–118, 2000.
[3] R. C. Agarwal, C. C. Aggarwal, and V. V. V. Prasad. A
    tree projection algorithm for generation of frequent item sets.
    Journal of Parallel and Distributed Computing, 61(3):350–
    371, 2001.
[4] R. J. Bayardo.        Efficiently mining long patterns from
    databases. In SIGMOD, pages 85–93, 1998.
[5] C. Blake and C. Merz. UCI repository of machine learning
    databases, 1998.
[6] D. Burdick, M. Calimlim, and J. Gehrke. Mafia: A maxi-
    mal frequent itemset algorithm for transactional databases. In
    ICDE 2001, Heidelberg, Germany, 2001.
[7] K. Gouda and M. J. Zaki. Efficiently mining maximal fre-
    quent itemsets. In ICDM, pages 163–170, 2001.
[8] R. Kohavi, C. Brodley, B. Frasca, L. Mason, and
    Z. Zheng.       KDD-Cup 2000 organizers’ report: Peel-
    ing the onion. SIGKDD Explorations, 2(2):86–98, 2000.
    http://www.ecn.purdue.edu/KDDCUP.
[9] R. Rymon. Search through systematic set enumeration. In
    International Conference on Principles of Knowledge Repre-
    sentation and Reasoning, pages 539–550, 1992.