Reducing the Main Memory Consumptions of FPmax* and FPclose Gösta Grahne and Jianfei Zhu Concordia University Montreal, Canada {grahne, j zhu}@cs.concordia.ca Abstract 2. Improving the Main Memory Requirements In [4], we gave FPgrowth*, FPmax* and FPclose for We first give a detailed introduction to the main memory mining all, maximal and closed frequent itemsets, respec- requirements of the two algorithm implementations in [4]. tively. In this short paper, we describe two approaches Then two approaches for improving the main memory con- for improving the main memory consumptions of FPmax* sumption are introduced. Since FPmax* and FPclose have and FPclose. Experimental results show that the two ap- similar main memory requirements, here we only consider proaches successfully reduce the main memory require- main memory improvements in the implementation of FP- ments of the two algorithms, and that in particular one of max*. the approaches does not incur any practically significant extra running time. The Basic Case In [4], when implementing FPgrowth*, FPmax* and FP- 1. Introduction close, each node P of an FP-tree, an MFI-tree or a CFI-tree has 4 pointers that point to its parent node, left-child node, In FIMI’03 [2], many implementations of algorithms for right-sibling node and the next node that corresponds to the mining all, maximal and closed frequent itemsets were sub- same itemname as P . The left-child node and right-sibling mitted and tested independently by the organizers. The ex- node pointers are set for the tree construction. The parent perimental results in [3] showed that our algorithms FP- node pointer and next node pointer in an FP-tree are used growth*, FPmax* and FPclose [4] have great performance for finding the conditional pattern base of an item. In MFI- on most datasets, and that FPmax* and FPclose were among trees and CFI-trees, they are used for maximality checking the fastest implementations. Our experimental results in [7] and closedness testing. also showed that the three algorithms are among the algo- In all algorithms, the FP-tree T∅ constructed from the rithms that consume the least amount of main memory when original database D is always stored in main memory dur- running them on dense datasets. ing the execution of the algorithms. For FPmax*, during However, we also found in [7] that FPmax* and FPclose the recursive calls, many small FP-trees and MFI-trees will require much more main memory than other algorithms in be constructed. The biggest MFI-tree is M∅ whose size in- [2] especially when the datasets are sparse. This is because creases slowly. At the end of the call of FPmax*(T∅ , M∅ ), the FP-trees constructed from the sparse datasets sometimes M∅ stores all maximal frequent itemsets mined from D. are fairly big, and they are stored in main memory for the We can see that the main memory requirement of FP- entire execution of the algorithms FPmax* and FPclose. max* in the basic case is at least the size of T∅ plus the size The sizes of some auxiliary data structures for storing max- of M∅ which contains all maximal frequent itemsets in D. imal and closed frequent itemsets, the MFI-trees and the CFI-trees also always increase even though many nodes in Approach 1: Trimming the FP-trees and MFI-trees the trees become useless. Continuously In this short paper, we describe two approaches for re- ducing the main memory usages of FPmax* and FPclose. To see if we can reduce the main memory requirement We also give the experimental results which show that FP- of FPmax*, let’s analyze FPmax* first. max* with either of the approaches needs less main memory Suppose during the execution of FPmax*, an FP-tree T for running on both synthetic dataset and real dataset. and its corresponding MFI-tree M are constructed. The items in T and M are i1 , i2 , . . . , in in decreasing order of Approach 2: Trimming the FP-trees and MFI-trees their frequency. Note that the header tables of T and M Once have the same items and item order. Starting from the least frequent item in , FPmax* mines maximal frequent item- In approach 2, we use the main memory management sets from T . A candidate frequent itemset X is compared technique by trimming the FP-trees and MFI-trees only with the maximal frequent itemsets in M . If X is max- once. We still assume that main memory consumption of imal, X is inserted into M . When processing the item the recursively constructed FP-trees and MFI-trees can be ik , FPmax* needs the frequency information that contains neglected, and only the FP-tree T∅ and the MFI-tree M∅ are only items i1 , i2 , . . . , ik−1 , and the frequency information trimmed. of ik+1 , . . . , in will not be used any more. In other words, Suppose the items in T∅ and M∅ are i1 , i2 , . . . , in . In our in T , only the nodes that correspond to i1 , i2 , . . . , ik are implementation, we allocate a chunk of main memory for useful, and the nodes corresponding to ik+1 , . . . , in can be those nodes in T∅ and M∅ that correspond to ibn/2c , . . . , in . deleted from T . If a candidate maximal frequent itemset The size of the chunk is changeable. During the execution X is found, X must be a subset of i1 , i2 , . . . , ik . Thus of FPmax*, T∅ and M∅ are not trimmed until item ibn/2c in in M , only the nodes corresponding to i1 , i2 , . . . , ik are T∅ .header is processed. The main memory of the chunk is used for maximality checking, and the nodes correspond- freed and all notes in the chunk are deleted at that time. ing to ik+1 , . . . , in will never be used, and therefore can be In this approach, before processing ibn/2c and freeing the deleted. chunk, T∅ and a partial M∅ are stored in the main memory. On the average, the size of M∅ is half of the size of the full Based on the above analysis, we can reduce the main M∅ . After freeing the chunk, new nodes for new maximal memory requirement of FPmax* by continuously trimming frequent itemsets are inserted and they are never trimmed. the FP-trees and MFI-trees. After processing an item ik , all However, considering the fact that MFI-tree structure is a ik -nodes in T and M are deleted. This can be done by fol- compact data structure, the new nodes are for the bn/2c lowing the head of the link list from ik in the header tables most frequent items, and M∅ already has many branches for T.header and M.header. Remember that the children of a those nodes before trimming, we can expect that the size of node are organized by a right-sibling linked list. To speed M∅ will be a little bit more than half of the size of the com- up the deletions we make this list doubly linked, i.e. each plete M∅ . Therefore the peak main memory consumption is node has pointers both to its right and left siblings. a little bit more than the size of T∅ plus half of the size of Before calling FPmax*, T∅ has to be stored in the main M∅ . Compared with approach 1, the FPmax* with approach memory. By deleting all nodes that will not be used any 2 is faster but consumes somewhat more main memory. more, the sizes of FP-trees, especially the size of T∅ , be- come smaller and smaller. The sizes of the MFI-trees still 3. Experimental Evaluation increase because new nodes for new maximal frequent item- sets are inserted, however, since obsolete nodes are also We now present a comparison of the runtime and main deleted, the MFI-trees will grow more slowly. At the end memory consumptions of the basic case and the two ap- of the call of FPmax*, the sizes of T∅ and M∅ are all zero. proaches. We ran the three implementations of FPmax* on We assume that the sizes of the recursively constructed FP- many synthetic and real datasets. The synthetic datasets are trees and MFI-trees are always far smaller than the size of sparse datasets, and the real datasets are all dense. Due to the top-level trees T∅ and M∅ , and that the main memory the lack of space, only the results for one synthetic dataset consumption of these trees can be neglected. Besides T∅ , and one real dataset are shown here. the main memory also stores M∅ . At the initial call of FP- The synthetic dataset T20I10D200K was generated from max*, the size of M∅ is zero. Then M∅ never reaches its the application on the website [1]. It contains 200,000 trans- full size because of the trimming. We estimate that the av- actions and 1000 items. The real dataset pumsb* was down- erage main memory requirement of FPmax* with approach loaded from the FIMI’03 website [2]. It was produced from 1 is the size of T∅ plus half of the size of M∅ . census data of Public Use Microdata Sample (PUMS). In [4], we mentioned that we can allocate a chunk of All experiments were performed on a 1GHz Pentium III main memory for an FP-tree, and delete all nodes in the with 512 MB of memory running RedHat Linux 7.3. FP-tree at a time by deleting the chunk. Time is saved by Figure 1 shows the runtime and the main memory usage avoiding deleting the nodes in the FP-tree one by one. Obvi- of running FPmax* with the implementations of the basic ously, this technique can not be used parallel with approach case and the two approaches on the dataset T20I10D200K. 1. Therefore, FPmax* with approach 1 will be slower than As expected, in the runtime graph, FPmax* with approach the basic FPmax*, but its peak main memory requirement 1 took the longest time. Its runtime is almost twice the run- will be smaller than that of the basic FPmax*. time of the basic case and approach 2. However, approach 1 T20I10D200K T20I10D200K 100 100 100 Basic 100 shows great speed and less main memory requirement. We Basic 95 95 Approach 1 90 Approach 1 Approach 2 90 are currently considering implementing FPmax* and FP- Approach 2 85 85 close using a Patrica trie. Main Memory (M) Runtime (s) 10 10 80 80 75 75 70 70 1 1 65 60 65 60 References 1 0.75 0.5 0.25 0 1 0.75 0.5 0.25 0 Minimum Support (%) Minimum Support (%) Runtime Main Memory Consumption [1] http://www.almaden.ibm.com/software /quest/Resources/index.shtml. Figure 1. T20I10 [2] http://fimi.cs.helsinki.fi. consumes the least amount of main memory. The peak main [3] B. Goethals and M. J. Zaki (Eds.). Proceed- memory of approach 1 is always less than the basic case for ings of the First IEEE ICDM Workshop on about 10 megabytes, or about 15%. The speed of approach Frequent Itemset Mining Implementations (FIMI 2 is similar to that of the basic case, since approach 2 only ’03). CEUR Workshop Proceedings, Vol 80 trims the FP-tree T∅ and the MFI-tree M∅ once. The main http://CEUR-WS.org/Vol-90. memory consumption of approach 2 is similar to that of ap- proach 1, which means the approach 2 successfully saves [4] G. Grahne and J. Zhu. Efficiently using prefix- main memory. trees in mining frequent itemsets. In Proceed- ings of the 1st IEEE ICDM Workshop on Frequent 100 Pumsb* 100 25 Pumsb* 25 Itemset Mining Implementations (FIMI’03), Mel- Basic Approach 1 20 Basic Approach 1 20 bourne, FL, Nov. 2003. Approach 2 Approach 2 Main Memory (M) Runtime (s) 15 15 10 10 [5] J. Han, J. Pei, and Y. Yin. Mining frequent patterns 10 10 5 5 without candidate generation. In Proceeding of 1 1 0 0 Special Interest Group on Management of Data , 60 50 40 30 20 Minimum Support (%) 10 0 60 50 40 30 20 Minimum Support (%) 10 0 pages 1-12, Dallas, TX, May 2000. Runtime Main Memory Consumption [6] A. Pietracaprina and D. Zandolin. Mining fre- Figure 2. pumsb* quent itemsets using Patricia tries. In Proceedings of the 1st Workshop on Frequent Itemset Mining The runtime and main memory usage of running FP- Implementations (FIMI’03), Melbourne, FL, Nov. max* on real dataset pumsb* are shown in Figure 2. The 2003. results are similar to those results on synthetic dataset. [7] J. Zhu. Efficiently mining frequent itemsets from Dataset pumsb* is a very dense dataset, its FP-trees and very large databases. Ph.D. thesis, Sept. 2004. MFI-trees have very good compactness, and there are not many nodes in the trees. Therefore, in the two graphs in Figure 2, the differences of the runtime and main memory consumptions for the basic case and the two approaches are not very big. 4. Conclusions We have analyzed the main memory requirements of the FPmax* and FPclose implementation in [4]. Two ap- proaches for reducing the main memory requirements of FPmax* and FPclose are introduced. Experimental results show that both approach 1 and approach 2 successfully de- crease the main memory requirement of FPmax*. While the continuous trimming of the trees in approach 1 slows down the algorithm, the “one-time-trimming” used in approach 2 shows speed similar to the original method. We also noticed that the PatriciaMine in [6] using Pa- tricia trie structure to implement the FP-growth method [5]