Reducing the Main Memory Consumptions of FPmax* and FPclose

                                           Gösta Grahne and Jianfei Zhu
                                               Concordia University
                                                 Montreal, Canada
                                          {grahne, j zhu}@cs.concordia.ca


                        Abstract                               2. Improving the Main Memory Requirements

   In [4], we gave FPgrowth*, FPmax* and FPclose for              We first give a detailed introduction to the main memory
mining all, maximal and closed frequent itemsets, respec-      requirements of the two algorithm implementations in [4].
tively. In this short paper, we describe two approaches        Then two approaches for improving the main memory con-
for improving the main memory consumptions of FPmax*           sumption are introduced. Since FPmax* and FPclose have
and FPclose. Experimental results show that the two ap-        similar main memory requirements, here we only consider
proaches successfully reduce the main memory require-          main memory improvements in the implementation of FP-
ments of the two algorithms, and that in particular one of     max*.
the approaches does not incur any practically significant
extra running time.                                            The Basic Case

                                                                  In [4], when implementing FPgrowth*, FPmax* and FP-
1. Introduction                                                close, each node P of an FP-tree, an MFI-tree or a CFI-tree
                                                               has 4 pointers that point to its parent node, left-child node,
    In FIMI’03 [2], many implementations of algorithms for     right-sibling node and the next node that corresponds to the
mining all, maximal and closed frequent itemsets were sub-     same itemname as P . The left-child node and right-sibling
mitted and tested independently by the organizers. The ex-     node pointers are set for the tree construction. The parent
perimental results in [3] showed that our algorithms FP-       node pointer and next node pointer in an FP-tree are used
growth*, FPmax* and FPclose [4] have great performance         for finding the conditional pattern base of an item. In MFI-
on most datasets, and that FPmax* and FPclose were among       trees and CFI-trees, they are used for maximality checking
the fastest implementations. Our experimental results in [7]   and closedness testing.
also showed that the three algorithms are among the algo-         In all algorithms, the FP-tree T∅ constructed from the
rithms that consume the least amount of main memory when       original database D is always stored in main memory dur-
running them on dense datasets.                                ing the execution of the algorithms. For FPmax*, during
    However, we also found in [7] that FPmax* and FPclose      the recursive calls, many small FP-trees and MFI-trees will
require much more main memory than other algorithms in         be constructed. The biggest MFI-tree is M∅ whose size in-
[2] especially when the datasets are sparse. This is because   creases slowly. At the end of the call of FPmax*(T∅ , M∅ ),
the FP-trees constructed from the sparse datasets sometimes    M∅ stores all maximal frequent itemsets mined from D.
are fairly big, and they are stored in main memory for the        We can see that the main memory requirement of FP-
entire execution of the algorithms FPmax* and FPclose.         max* in the basic case is at least the size of T∅ plus the size
The sizes of some auxiliary data structures for storing max-   of M∅ which contains all maximal frequent itemsets in D.
imal and closed frequent itemsets, the MFI-trees and the
CFI-trees also always increase even though many nodes in       Approach 1: Trimming the FP-trees and MFI-trees
the trees become useless.                                      Continuously
    In this short paper, we describe two approaches for re-
ducing the main memory usages of FPmax* and FPclose.              To see if we can reduce the main memory requirement
We also give the experimental results which show that FP-      of FPmax*, let’s analyze FPmax* first.
max* with either of the approaches needs less main memory         Suppose during the execution of FPmax*, an FP-tree T
for running on both synthetic dataset and real dataset.        and its corresponding MFI-tree M are constructed. The
items in T and M are i1 , i2 , . . . , in in decreasing order of    Approach 2: Trimming the FP-trees and MFI-trees
their frequency. Note that the header tables of T and M             Once
have the same items and item order. Starting from the least
frequent item in , FPmax* mines maximal frequent item-                  In approach 2, we use the main memory management
sets from T . A candidate frequent itemset X is compared            technique by trimming the FP-trees and MFI-trees only
with the maximal frequent itemsets in M . If X is max-              once. We still assume that main memory consumption of
imal, X is inserted into M . When processing the item               the recursively constructed FP-trees and MFI-trees can be
ik , FPmax* needs the frequency information that contains           neglected, and only the FP-tree T∅ and the MFI-tree M∅ are
only items i1 , i2 , . . . , ik−1 , and the frequency information   trimmed.
of ik+1 , . . . , in will not be used any more. In other words,         Suppose the items in T∅ and M∅ are i1 , i2 , . . . , in . In our
in T , only the nodes that correspond to i1 , i2 , . . . , ik are   implementation, we allocate a chunk of main memory for
useful, and the nodes corresponding to ik+1 , . . . , in can be     those nodes in T∅ and M∅ that correspond to ibn/2c , . . . , in .
deleted from T . If a candidate maximal frequent itemset            The size of the chunk is changeable. During the execution
X is found, X must be a subset of i1 , i2 , . . . , ik . Thus       of FPmax*, T∅ and M∅ are not trimmed until item ibn/2c in
in M , only the nodes corresponding to i1 , i2 , . . . , ik are     T∅ .header is processed. The main memory of the chunk is
used for maximality checking, and the nodes correspond-             freed and all notes in the chunk are deleted at that time.
ing to ik+1 , . . . , in will never be used, and therefore can be       In this approach, before processing ibn/2c and freeing the
deleted.                                                            chunk, T∅ and a partial M∅ are stored in the main memory.
                                                                    On the average, the size of M∅ is half of the size of the full
    Based on the above analysis, we can reduce the main
                                                                    M∅ . After freeing the chunk, new nodes for new maximal
memory requirement of FPmax* by continuously trimming
                                                                    frequent itemsets are inserted and they are never trimmed.
the FP-trees and MFI-trees. After processing an item ik , all
                                                                    However, considering the fact that MFI-tree structure is a
ik -nodes in T and M are deleted. This can be done by fol-
                                                                    compact data structure, the new nodes are for the bn/2c
lowing the head of the link list from ik in the header tables
                                                                    most frequent items, and M∅ already has many branches for
T.header and M.header. Remember that the children of a
                                                                    those nodes before trimming, we can expect that the size of
node are organized by a right-sibling linked list. To speed
                                                                    M∅ will be a little bit more than half of the size of the com-
up the deletions we make this list doubly linked, i.e. each
                                                                    plete M∅ . Therefore the peak main memory consumption is
node has pointers both to its right and left siblings.
                                                                    a little bit more than the size of T∅ plus half of the size of
   Before calling FPmax*, T∅ has to be stored in the main           M∅ . Compared with approach 1, the FPmax* with approach
memory. By deleting all nodes that will not be used any             2 is faster but consumes somewhat more main memory.
more, the sizes of FP-trees, especially the size of T∅ , be-
come smaller and smaller. The sizes of the MFI-trees still          3. Experimental Evaluation
increase because new nodes for new maximal frequent item-
sets are inserted, however, since obsolete nodes are also
                                                                       We now present a comparison of the runtime and main
deleted, the MFI-trees will grow more slowly. At the end
                                                                    memory consumptions of the basic case and the two ap-
of the call of FPmax*, the sizes of T∅ and M∅ are all zero.
                                                                    proaches. We ran the three implementations of FPmax* on
We assume that the sizes of the recursively constructed FP-
                                                                    many synthetic and real datasets. The synthetic datasets are
trees and MFI-trees are always far smaller than the size of
                                                                    sparse datasets, and the real datasets are all dense. Due to
the top-level trees T∅ and M∅ , and that the main memory
                                                                    the lack of space, only the results for one synthetic dataset
consumption of these trees can be neglected. Besides T∅ ,
                                                                    and one real dataset are shown here.
the main memory also stores M∅ . At the initial call of FP-
                                                                       The synthetic dataset T20I10D200K was generated from
max*, the size of M∅ is zero. Then M∅ never reaches its
                                                                    the application on the website [1]. It contains 200,000 trans-
full size because of the trimming. We estimate that the av-
                                                                    actions and 1000 items. The real dataset pumsb* was down-
erage main memory requirement of FPmax* with approach
                                                                    loaded from the FIMI’03 website [2]. It was produced from
1 is the size of T∅ plus half of the size of M∅ .
                                                                    census data of Public Use Microdata Sample (PUMS).
   In [4], we mentioned that we can allocate a chunk of                All experiments were performed on a 1GHz Pentium III
main memory for an FP-tree, and delete all nodes in the             with 512 MB of memory running RedHat Linux 7.3.
FP-tree at a time by deleting the chunk. Time is saved by              Figure 1 shows the runtime and the main memory usage
avoiding deleting the nodes in the FP-tree one by one. Obvi-        of running FPmax* with the implementations of the basic
ously, this technique can not be used parallel with approach        case and the two approaches on the dataset T20I10D200K.
1. Therefore, FPmax* with approach 1 will be slower than            As expected, in the runtime graph, FPmax* with approach
the basic FPmax*, but its peak main memory requirement              1 took the longest time. Its runtime is almost twice the run-
will be smaller than that of the basic FPmax*.                      time of the basic case and approach 2. However, approach 1
                                                                                                                                   T20I10D200K
                                            T20I10D200K
               100                                                             100
                                                                                                       100
                                                                                                                  Basic
                                                                                                                                                                      100
                                                                                                                                                                            shows great speed and less main memory requirement. We
                           Basic                                                                       95                                                             95

                           Approach 1                                                                  90
                                                                                                                  Approach 1

                                                                                                                  Approach 2
                                                                                                                                                                      90    are currently considering implementing FPmax* and FP-
                           Approach 2                                                                  85                                                             85
                                                                                                                                                                            close using a Patrica trie.


                                                                                     Main Memory (M)
 Runtime (s)


                10                                                             10                      80                                                             80

                                                                                                       75                                                             75

                                                                                                       70                                                             70


                 1                                                             1
                                                                                                       65

                                                                                                       60
                                                                                                                                                                      65

                                                                                                                                                                      60
                                                                                                                                                                            References
                     1          0.75            0.5            0.25        0                                 1         0.75            0.5            0.25        0
                                        Minimum Support (%)                                                                    Minimum Support (%)


                                       Runtime                                                    Main Memory Consumption                                                       [1] http://www.almaden.ibm.com/software
                                                                                                                                                                                    /quest/Resources/index.shtml.
                                                                Figure 1. T20I10
                                                                                                                                                                                [2] http://fimi.cs.helsinki.fi.

consumes the least amount of main memory. The peak main                                                                                                                         [3] B. Goethals and M. J. Zaki (Eds.). Proceed-
memory of approach 1 is always less than the basic case for                                                                                                                         ings of the First IEEE ICDM Workshop on
about 10 megabytes, or about 15%. The speed of approach                                                                                                                             Frequent Itemset Mining Implementations (FIMI
2 is similar to that of the basic case, since approach 2 only                                                                                                                       ’03). CEUR Workshop Proceedings, Vol 80
trims the FP-tree T∅ and the MFI-tree M∅ once. The main                                                                                                                               http://CEUR-WS.org/Vol-90.
memory consumption of approach 2 is similar to that of ap-
proach 1, which means the approach 2 successfully saves                                                                                                                         [4] G. Grahne and J. Zhu. Efficiently using prefix-
main memory.                                                                                                                                                                        trees in mining frequent itemsets. In Proceed-
                                                                                                                                                                                    ings of the 1st IEEE ICDM Workshop on Frequent
                100
                                              Pumsb*
                                                                               100                      25
                                                                                                                                     Pumsb*
                                                                                                                                                                      25
                                                                                                                                                                                    Itemset Mining Implementations (FIMI’03), Mel-
                            Basic
                            Approach 1
                                                                                                        20
                                                                                                                   Basic

                                                                                                                   Approach 1                                         20
                                                                                                                                                                                    bourne, FL, Nov. 2003.
                            Approach 2
                                                                                                                   Approach 2
                                                                                     Main Memory (M)
  Runtime (s)


                                                                                                        15                                                            15
                10                                                             10                                                                                               [5] J. Han, J. Pei, and Y. Yin. Mining frequent patterns
                                                                                                        10                                                            10


                                                                                                         5                                                            5
                                                                                                                                                                                    without candidate generation. In Proceeding of
                 1                                                             1                         0                                                            0
                                                                                                                                                                                    Special Interest Group on Management of Data ,
                      60   50          40        30       20
                                        Minimum Support (%)
                                                                      10   0                                 60   50          40        30       20
                                                                                                                               Minimum Support (%)
                                                                                                                                                             10   0
                                                                                                                                                                                    pages 1-12, Dallas, TX, May 2000.
                                       Runtime                                                    Main Memory Consumption
                                                                                                                                                                                [6] A. Pietracaprina and D. Zandolin. Mining fre-
                                                               Figure 2. pumsb*                                                                                                     quent itemsets using Patricia tries. In Proceedings
                                                                                                                                                                                    of the 1st Workshop on Frequent Itemset Mining
   The runtime and main memory usage of running FP-                                                                                                                                 Implementations (FIMI’03), Melbourne, FL, Nov.
max* on real dataset pumsb* are shown in Figure 2. The                                                                                                                              2003.
results are similar to those results on synthetic dataset.
                                                                                                                                                                                [7] J. Zhu. Efficiently mining frequent itemsets from
Dataset pumsb* is a very dense dataset, its FP-trees and
                                                                                                                                                                                    very large databases. Ph.D. thesis, Sept. 2004.
MFI-trees have very good compactness, and there are not
many nodes in the trees. Therefore, in the two graphs in
Figure 2, the differences of the runtime and main memory
consumptions for the basic case and the two approaches are
not very big.

4. Conclusions

    We have analyzed the main memory requirements of
the FPmax* and FPclose implementation in [4]. Two ap-
proaches for reducing the main memory requirements of
FPmax* and FPclose are introduced. Experimental results
show that both approach 1 and approach 2 successfully de-
crease the main memory requirement of FPmax*. While the
continuous trimming of the trees in approach 1 slows down
the algorithm, the “one-time-trimming” used in approach 2
shows speed similar to the original method.
    We also noticed that the PatriciaMine in [6] using Pa-
tricia trie structure to implement the FP-growth method [5]