AFOPT: An Efficient Implementation of Pattern Growth Approach∗

Guimei Liu, Hongjun Lu
Department of Computer Science, Hong Kong University of Science & Technology, Hong Kong, China
{cslgm, luhj}@cs.ust.hk

Jeffrey Xu Yu
Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong, China
{yu}@se.cuhk.edu.hk

Wei Wang, Xiangye Xiao
Department of Computer Science, Hong Kong University of Science & Technology, Hong Kong, China
{fervvac, xiaoxy}@cs.ust.hk

∗ This work was partly supported by the Research Grant Council of the Hong Kong SAR, China (Grant HKUST6175/03E, CUHK4229/01E).

Abstract

In this paper, we revisit the frequent itemset mining (FIM) problem and focus on studying the pattern growth approach. Existing pattern growth algorithms differ in several dimensions: (1) item search order; (2) conditional database representation; (3) conditional database construction strategy; and (4) tree traversal strategy. They adopt different strategies along these dimensions, and several adaptive algorithms have been proposed to find good strategies for general situations. In this paper, we describe the implementation techniques of an adaptive pattern growth algorithm, called AFOPT, which demonstrated good performance on all tested datasets. We also extend the algorithm to mine closed and maximal frequent itemsets. Comprehensive experiments were conducted to demonstrate the efficiency of the proposed algorithms.

1 Introduction

Since the frequent itemset mining (FIM) problem was first addressed [2], a large number of FIM algorithms have been proposed. There is a pressing need to completely characterize and understand the algorithmic performance space of the FIM problem so that we can choose and integrate the best strategies to achieve good performance in general cases. Existing FIM algorithms can be classified into two categories: the candidate generate-and-test approach and the pattern growth approach.
In each iteration of the candidate generate-and-test approach, pairs of frequent k-itemsets are joined to form candidate (k+1)-itemsets, and the database is then scanned to verify their supports. The resultant frequent (k+1)-itemsets are used as the input for the next iteration. The drawbacks of this approach are: (1) it needs to scan the database multiple times, in the worst case as many times as the maximal length of the frequent itemsets; (2) it generates a large number of candidate itemsets, many of which are proved to be infrequent after scanning the database; and (3) subset checking is a costly operation, especially when itemsets are very long. The pattern growth approach avoids the cost of generating and testing a large number of candidate itemsets by growing a frequent itemset from its prefix. It constructs a conditional database for each frequent itemset t such that all the itemsets that have t as prefix can be mined using only the conditional database of t.

The basic operations in the pattern growth approach are counting frequent items and constructing new conditional databases. Therefore, the number of conditional databases constructed during the mining process and the mining cost of each individual conditional database have a direct effect on the performance of a pattern growth algorithm. The total number of conditional databases mainly depends on the order in which the search space is explored. The traversal and construction cost of a conditional database depends on its size, its representation format (tree-based or array-based) and its construction strategy (physical or pseudo). If the conditional databases are represented by a tree structure, the traversal strategy of the tree also matters. In this paper, we investigate these aspects of the pattern growth approach and try to find out what the good strategies for a pattern growth algorithm are.

The rest of the paper is organized as follows: Section 2 revisits the FIM problem and introduces some related work; Section 3 describes an efficient pattern growth algorithm—AFOPT; Section 4 and Section 5 extend the AFOPT algorithm to mine frequent closed itemsets and maximal frequent itemsets respectively; Section 6 shows experimental results; finally, Section 7 concludes the paper.

2 Problem Revisit and Related Work

In this section, we first briefly review the FIM problem and the candidate generate-and-test approach, then focus on studying the algorithmic performance space of the pattern growth approach.

2.1 Problem revisit

Given a transactional database D, let I be the set of items appearing in it. Any combination of the items in I can be frequent in D, and these combinations form the search space of the FIM problem. The search space can be represented using a set enumeration tree [14, 1, 4, 5, 7]. For example, given a set of items I = {a, b, c, d, e} sorted in lexicographic order, the search space can be represented by a tree as shown in Figure 1. The root of the search space tree represents the empty set, and each node at level l (the root is at level 0, its children are at level 1, and so on) represents an l-itemset. The candidate extensions of an itemset p are defined as the set of items after the last item of p. For example, items d and e are candidate extensions of ac, while b is not a candidate extension of ac because b is before c. The frequent extensions of p are those candidate extensions of p that can be appended to p to form a longer frequent itemset. In the rest of this paper, we use cand_exts(p) and freq_exts(p) to denote the set of candidate extensions and frequent extensions of p respectively.

[Figure 1. Search space tree: the set enumeration tree over I = {a, b, c, d, e}.]
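As a concrete illustration of these definitions, the following small Python sketch (our own, not from the paper; the item order and function names are assumptions for exposition) computes cand_exts(p) over a fixed lexicographic item order.

```python
# Candidate extensions in a set enumeration tree (illustrative sketch).
ITEMS = ['a', 'b', 'c', 'd', 'e']  # items sorted in lexicographic order

def cand_exts(p):
    """Candidate extensions of itemset p: all items after the last item of p."""
    if not p:
        return list(ITEMS)
    last = max(ITEMS.index(i) for i in p)  # position of the last item of p
    return ITEMS[last + 1:]

print(cand_exts(['a', 'c']))  # ['d', 'e'] -- b is before c, so it is excluded
```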
2.2 Candidate generate-and-test approach

Frequent itemset mining can be viewed as a set containment join between the transactional database and the search space of FIM. The candidate generate-and-test approach essentially uses a block nested loop join, i.e., the search space is the inner relation and it is divided into blocks according to itemset length. Different from a simple block nested loop join, in the candidate generate-and-test approach the output of the previous pass is used as seeds to generate the next block. For example, in the k-th pass of the Apriori algorithm, the transaction database and the candidate k-itemsets are joined to generate the frequent k-itemsets. The frequent k-itemsets are then used to generate the next block—the candidate (k+1)-itemsets. Given the large amount of memory available nowadays, it is a waste of memory to put only a single length of itemsets into memory. It is desirable to fully utilize the available memory by putting longer and possibly frequent itemsets into memory at an earlier stage to reduce the number of database scans. The first FIM algorithm, AIS [2], tries to estimate the frequencies of longer itemsets using the output of the current pass, and includes in the next block those itemsets that are estimated to be frequent, or that are not themselves estimated to be frequent but all of whose subsets are frequent or estimated to be frequent. The problem with the AIS algorithm is that it does not fully utilize the pruning power of the Apriori property, so many unnecessary candidate itemsets are generated and tested. The DIC algorithm [3] improves on the Apriori algorithm. It starts counting the support of an itemset shortly after all the subsets of that itemset are determined to be frequent, rather than waiting until the next pass. However, the DIC algorithm cannot guarantee the full utilization of memory. The candidate generate-and-test approach thus faces a trade-off: on one hand, memory is not fully utilized and it is desirable to put as many candidate itemsets as possible into memory to reduce the number of database scans; on the other hand, the set containment test is a costly operation, and putting itemsets into memory at an earlier stage carries the risk of counting support for many unnecessary candidate itemsets.

2.3 Pattern growth approach

The pattern growth approach adopts the divide-and-conquer methodology. The search space is divided into disjoint sub search spaces. For example, the search space shown in Figure 1 can be divided into 5 disjoint sub search spaces: (1) itemsets containing a; (2) itemsets containing b but not a; (3) itemsets containing c but not a, b; (4) itemsets containing d but not a, b, c; and (5) itemsets containing only e. Accordingly, the database is divided into 5 partitions, and each partition is called a conditional database. The conditional database of item i, denoted as Di, includes all the transactions containing item i. All the items before i are eliminated from each transaction. All the frequent itemsets containing i can be mined from Di without accessing other information. Each conditional database is divided recursively following the same procedure. The pattern growth approach not only reduces the number of database scans, but also avoids the costly set containment test operation.
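To make the partitioning concrete, here is a minimal sketch (our own illustration; the function and variable names are assumptions) of how the conditional database Di is derived from a database under a given item order.

```python
def conditional_db(db, i, item_order):
    """D_i: the transactions containing item i, with i and all items
    before i (in item_order) removed from each transaction."""
    rank = {item: r for r, item in enumerate(item_order)}
    return [[x for x in t if rank[x] > rank[i]] for t in db if i in t]

db = [['a', 'c', 'd'], ['b', 'c', 'e'], ['a', 'b', 'c', 'e']]
print(conditional_db(db, 'b', ['a', 'b', 'c', 'd', 'e']))
# [['c', 'e'], ['c', 'e']] -- all itemsets containing b are mined from D_b alone
```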
Two basic operations in the pattern growth approach are counting frequent items and constructing new conditional databases. Therefore, the total number of conditional databases constructed and the mining cost of each individual conditional database are key factors that affect the performance of a pattern growth algorithm. The total number of conditional databases mainly depends on the order in which the search space is explored; this order is called the item search order in this paper. Some structures for representing conditional databases can also help reduce the total number of conditional databases. For example, if a conditional database is represented by a tree structure and the tree has only one branch, then all the frequent itemsets in the conditional database can be enumerated directly from the branch, and there is no need to construct new conditional databases. The mining cost of a conditional database depends on its size, its representation and its construction strategy. The traversal strategy also matters if the conditional database is represented using a tree structure.

Item Search Order. When we divide the search space, all items are sorted in some order. This order is called the item search order. The sub search space of an item contains all the items after it in the item search order, but no item before it. Two item search orders have been proposed in the literature: static lexicographic order and dynamic ascending frequency order. Static lexicographic order orders the items lexicographically. It is a fixed order—all the sub search spaces use the same order. The Tree Projection algorithm [15] and the H-Mine algorithm [12] adopt this order. Dynamic ascending frequency order reorders the frequent items in every conditional database in ascending order of their frequencies. The most infrequent item is the first item, and all the other items are its candidate extensions; the most frequent item is the last item and has no candidate extensions. FP-growth [6], AFOPT [9] and most maximal frequent itemset mining algorithms [7, 1, 4, 5] adopt this order. The number of conditional databases constructed by an algorithm can differ greatly under different item search orders.
Ascending frequency order is capable of minimizing the number and/or the size of the conditional databases constructed in subsequent mining. Intuitively, an itemset with higher frequency will possibly have more frequent extensions than an itemset with lower frequency. If we put the most infrequent item in front, then although its candidate extension set is large, its frequent extension set cannot be very large. The frequencies of successive items increase while the sizes of their candidate extension sets decrease, so we only need to build smaller and/or fewer conditional databases in subsequent mining. Table 1 shows the total number of conditional databases constructed (#cdb columns), the total running time and the maximal memory usage when the three orders are adopted in the framework of the AFOPT algorithm described in this paper. The three item search orders compared are dynamic ascending frequency order (Asc columns), lexicographic order (Lex columns) and dynamic descending frequency order (Des columns). The minimum support threshold on each dataset is shown in the first column. On the first three datasets, ascending frequency order needs to build more conditional databases than the other two orders, but its total running time and maximal memory usage are lower, which implies that the conditional databases constructed using ascending frequency order are smaller. On the remaining datasets, ascending frequency order builds fewer conditional databases and needs less running time and maximal memory, especially on the dense datasets connect-4 and mushroom.

Table 1. Comparison of Three Item Search Orders (Bucket Size = 0)

                           Asc                           Lex                           Des
Datasets                #cdb    time     max mem      #cdb    time     max mem      #cdb    time     max mem
T10I4D100k (0.01%)     53688    4.52s    5199 kb     47799    4.89s    5471 kb     36725    5.32s    5675 kb
T40I10D100k (0.5%)    311999   30.42s   17206 kb    310295   33.83s   20011 kb    309895   43.37s   21980 kb
BMS-POS (0.05%)       115202   27.83s   17294 kb     53495  127.45s   38005 kb     39413  147.01s   40206 kb
BMS-WebView-1 (0.06%)  33186    0.69s     731 kb     65378    1.12s     901 kb     79571    2.16s     918 kb
chess (45%)           312202    2.68s     574 kb    617401    8.46s    1079 kb    405720  311.19s    2127 kb
connect-4 (75%)        12242    1.31s      38 kb    245663    2.65s      57 kb    266792   14.27s     113 kb
mushroom (5%)           9838    0.34s    1072 kb    258068    3.11s     676 kb    464903  272.30s    2304 kb
pumsb (70%)           272373    3.87s     383 kb    649096   12.22s     570 kb    469983   16.62s    1225 kb

Agrawal et al. proposed an efficient support counting technique, called bucket counting, to reduce the total number of conditional databases [1]. The basic idea is that if the number of items in a conditional database is small enough, we can maintain a counter for every combination of the items instead of constructing a conditional database for each frequent item. The bucket counting can be implemented very efficiently compared with conditional database construction and traversal.
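The following Python sketch (our own illustration of the idea, not the authors' implementation) realizes bucket counting: each transaction is mapped to a bitmask over the k remaining items, one counter per mask is incremented, and a superset-sum pass then turns the 2^k mask counts into the support of every item combination.

```python
def bucket_count(cond_db, items):
    """Support of every combination of `items` (k = len(items) assumed small)."""
    k = len(items)
    pos = {item: j for j, item in enumerate(items)}
    bucket = [0] * (1 << k)
    for t in cond_db:                      # one pass over the conditional database
        mask = 0
        for x in t:
            if x in pos:
                mask |= 1 << pos[x]
        bucket[mask] += 1
    # Superset-sum pass: bucket[s] becomes the number of transactions whose
    # mask is a superset of s, i.e. the support of the itemset encoded by s.
    for j in range(k):
        for s in range(1 << k):
            if not s & (1 << j):
                bucket[s] += bucket[s | (1 << j)]
    return bucket

print(bucket_count([['m', 'a'], ['a'], ['m', 'a']], ['m', 'a']))
# [3, 2, 3, 2]: supports of {}, {m}, {a}, {m,a}
```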
Conditional Database Representation. The traversal and construction cost of a conditional database heavily depends on its representation. Different data structures have been proposed to store conditional databases, e.g., tree-based structures such as FP-tree [6] and AFOPT-tree [9], and array-based structures such as the hyper-structure [12]. Tree-based structures are capable of reducing traversal cost because duplicated transactions can be merged and different transactions can share the storage of their prefixes, but they incur high construction cost, especially when the dataset is sparse and large. Array-based structures incur little construction cost but need much more traversal cost because the traversal cost of different transactions cannot be shared. Choosing between tree-based and array-based structures is therefore a trade-off. In general, tree-based structures are suitable for dense databases because there can be lots of prefix sharing among transactions, and array-based structures are suitable for sparse databases.

Conditional Database Construction Strategy. Constructing every conditional database physically can be expensive, especially when successive conditional databases do not shrink much. An alternative is to pseudo-construct them, i.e., to use pointers pointing to transactions in upper-level conditional databases. However, pseudo-construction cannot reduce traversal cost as effectively as physical construction. The item ascending frequency search order makes the subsequent conditional databases shrink rapidly, so it is beneficial to use the physical construction strategy together with the ascending frequency order.

Tree Traversal Strategy. The traversal cost of a tree is minimal using the top-down traversal strategy. The FP-growth algorithm [6] uses ascending frequency order to explore the search space, while the FP-tree is constructed according to descending frequency order. Hence the FP-tree has to be traversed using the bottom-up strategy. As a result, the FP-tree has to maintain parent links and node links at each node for bottom-up traversal, which increases the construction cost of the tree. The AFOPT algorithm [9] uses ascending frequency order both for search space exploration and for prefix-tree construction, so it can use the top-down traversal strategy and does not need to maintain additional pointers at each node. The advantage of FP-tree is that it can be more compact than AFOPT-tree, because descending frequency order increases the possibility of prefix sharing. The ascending frequency order adopted by AFOPT may lead to many single branches in the tree; this problem is alleviated by using arrays to store the single branches of the AFOPT-tree.
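To illustrate why top-down traversal needs so little per-node state, here is a minimal AFOPT-style node in Python (our own sketch of the structure described above; field names are assumptions): only a child and a sibling pointer are kept, whereas an FP-tree node would additionally carry a parent pointer and a node link.

```python
class AfoptNode:
    """A node of an AFOPT-style prefix tree: child + sibling pointers only."""
    __slots__ = ('item', 'count', 'child', 'sibling')

    def __init__(self, item, count=0):
        self.item, self.count = item, count
        self.child = None     # first child (next item on the prefix path)
        self.sibling = None   # next sibling (alternative item at this level)

def top_down(node, prefix=()):
    """Top-down depth-first walk: yields every (prefix itemset, count) pair."""
    while node is not None:
        itemset = prefix + (node.item,)
        yield itemset, node.count
        yield from top_down(node.child, itemset)
        node = node.sibling
```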
Existing pattern growth algorithms mainly differ in the several dimensions discussed above. Table 2 lists existing pattern growth algorithms and their strategies on the four dimensions. AFOPT [9] is an efficient FIM algorithm developed by our group. We discuss its technical details in the next three sections.

Table 2. Pattern Growth Algorithms

Algorithms            Item Search Order      CondDB Format     CondDB Construction   Tree Traversal
Tree-Projection [15]  static lexicographic   array             adaptive              -
FP-growth [6]         dynamic frequency      FP-tree           physical              bottom-up
H-Mine [12]           static lexicographic   hyper-structure   pseudo                -
OP [10]               adaptive               adaptive          adaptive              bottom-up
PP-mine [17]          static lexicographic   PP-tree           pseudo                top-down
AFOPT [9]             dynamic frequency      adaptive          physical              top-down
CLOSET+ [16]          dynamic frequency      FP-tree           adaptive              adaptive

3 Mining All Frequent Itemsets

We discussed several trade-offs faced by a pattern growth algorithm in the last section. Some implications of the above discussion are: (1) use a tree structure on dense databases and an array structure on sparse databases; (2) use dynamic ascending frequency order on dense databases and/or when the minimum support threshold is low, since it can dramatically reduce the number and/or the size of the successive conditional databases; and (3) if dynamic ascending frequency order is adopted, use the physical construction strategy, because the size of the conditional databases will shrink quickly. In this section, we describe our algorithm AFOPT, which takes these three implications into consideration. The distinct features of the AFOPT algorithm include: (1) it uses three different structures to represent conditional databases: arrays for sparse conditional databases, AFOPT-tree for dense conditional databases, and buckets for counting frequent itemsets containing only the top-k frequent items, where k is a parameter to control the number of buckets used; several parameters are introduced to control when to use arrays or AFOPT-tree; (2) it adopts the dynamic ascending frequency order; and (3) the conditional databases are constructed physically on all levels, no matter whether they are represented by AFOPT-tree or arrays.

3.1 Framework

Given a transactional database D and a minimum support threshold, the AFOPT algorithm scans the original database twice to mine all frequent itemsets. In the first scan, all frequent items in D are counted and sorted in ascending order of their frequencies, denoted as F = {i1, i2, ..., im}. We perform another database scan to construct a conditional database for each ij ∈ F, denoted as Dij. During the second scan, the infrequent items in each transaction t are removed and the remaining items are sorted according to their orders in F. Transaction t is put into Dij if the first item of t after sorting is ij. The remaining mining is performed on the conditional databases only; there is no need to access the original database.

We first perform mining on Di1 to mine all the itemsets containing i1. Mining on an individual conditional database follows the same process as mining on the original database. After the mining on Di1 is finished, Di1 can be discarded. Because Di1 also contains other items, the transactions in it are inserted into the remaining conditional databases: given a transaction t in Di1, if the next item after i1 in t is ij, then t is inserted into Dij. This step is called push-right. Sorting the items in ascending order of their frequencies ensures that every time a small conditional database is pushed right. The pseudo-code of the AFOPT-all algorithm is shown in Algorithm 1.

Algorithm 1 AFOPT-all Algorithm
Input:
  p is a frequent itemset
  Dp is the conditional database of p
  min_sup is the minimum support threshold
Description:
 1: Scan Dp to count the frequent items, F = {i1, i2, ..., in};
 2: Sort the items in F in ascending order of their frequencies;
 3: for all items i ∈ F do
 4:   D_{p∪{i}} = ∅;
 5: for all transactions t ∈ Dp do
 6:   remove the infrequent items from t, and sort the remaining items according to their orders in F;
 7:   let i be the first item of t; insert t into D_{p∪{i}};
 8: for all items i ∈ F do
 9:   Output s = p ∪ {i};
10:   AFOPT-all(s, Ds, min_sup);
11:   PushRight(Ds);
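For concreteness, here is a compact array-based rendering of Algorithm 1 in Python (a sketch under our own naming, not the authors' code; it keeps whole transactions in plain lists rather than choosing among arrays, AFOPT-tree and buckets).

```python
from collections import Counter

def afopt_all(cond_db, prefix, min_sup, out):
    """Mine all frequent itemsets extending `prefix` from cond_db.

    cond_db: list of transactions (lists of items); min_sup: absolute count.
    """
    freq = Counter()
    for t in cond_db:
        freq.update(set(t))
    # Frequent items in ascending order of frequency (lines 1-2).
    F = sorted((i for i, c in freq.items() if c >= min_sup),
               key=lambda i: freq[i])
    rank = {i: r for r, i in enumerate(F)}
    # Physical construction: each transaction goes to the conditional
    # database of its first frequent item (lines 3-7).
    sub_dbs = {i: [] for i in F}
    for t in cond_db:
        t = sorted((x for x in t if x in rank), key=rank.get)
        if t:
            sub_dbs[t[0]].append(t)
    for i in F:                                  # lines 8-11
        s = prefix + [i]
        out.append((s, freq[i]))
        afopt_all([t[1:] for t in sub_dbs[i]], s, min_sup, out)
        for t in sub_dbs[i]:                     # PushRight(D_s)
            rest = t[1:]
            if rest:
                sub_dbs[rest[0]].append(rest)
    return out

print(afopt_all([['a', 'b'], ['a', 'b'], ['b']], [], 2, []))
# [(['a'], 2), (['a', 'b'], 2), (['b'], 3)]
```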
3.2 Conditional database representation

Algorithm 1 is independent of the representation of the conditional databases. We choose proper representations according to the density of the conditional databases. Three structures are used: (1) arrays, (2) AFOPT-tree, and (3) buckets. As aforementioned, these three structures are suitable for different situations. The bucket counting technique is proper and extremely efficient when the number of distinct frequent items is around 10. The tree structure is beneficial when the conditional databases are dense. The array structure is favorable when the conditional databases are sparse. We use four parameters to control when to use these three structures, as follows: (1) frequent itemsets containing only the top-bucket_size frequent items are counted using buckets; (2) if the minimum support threshold is greater than tree_min_sup, or the average support of all frequent items is no less than tree_avg_sup, then all the remaining conditional databases are represented using AFOPT-tree; otherwise (3) the conditional databases of the next tree_alphabet_size most frequent items are represented using AFOPT-tree, and the rest of the conditional databases are represented using arrays.

Figure 2 shows a transactional database D and the initial conditional databases constructed with min_sup = 40%. There are 6 frequent items {c:3, d:3, p:4, f:5, m:5, a:6}. Figure 2(b) shows the projected database after removing the infrequent items and sorting. The parameters for conditional database construction are set as follows: bucket_size = 2, tree_alphabet_size = 2, tree_min_sup = 50%, tree_avg_sup = 60%. The frequent itemsets containing only m and a are counted using buckets of size 4 (= 2^bucket_size). The conditional databases of f and p are represented by AFOPT-tree, and the conditional databases of items c and d are represented using arrays. From our experience, the bucket_size parameter can take a value around 10, and a value between 20 and 200 is safe for the tree_alphabet_size parameter. We set tree_min_sup to 5% and tree_avg_sup to 10% in our experiments.

[Figure 2. Conditional DB Representation: (a) the example database D with transactions 1: a,b,c,f,m,p; 2: a,d,e,f,g; 3: a,b,f,m,n; 4: a,c,e,f,m,p; 5: d,f,n,p; 6: a,c,h,m,p; 7: a,d,m,s; (b) the projected transactions 1: c,p,f,m,a; 2: d,f,a; 3: f,m,a; 4: c,p,f,m,a; 5: d,p,f; 6: c,p,m,a; 7: d,m,a; (c) the initial conditional databases.]
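The four-parameter policy can be summarized as a small dispatch function. The sketch below is our own reading of the rules above (parameter names follow the paper; the code itself is illustrative, not the actual AFOPT implementation).

```python
def choose_representation(F, supports, n_trans, min_sup,
                          bucket_size=8, tree_alphabet_size=20,
                          tree_min_sup=0.05, tree_avg_sup=0.10):
    """Map each frequent item to 'bucket', 'afopt-tree' or 'array'.

    F: frequent items in ascending order of frequency; supports: absolute
    counts; min_sup: the relative minimum support threshold.
    """
    split = max(0, len(F) - bucket_size)
    rep = {i: 'bucket' for i in F[split:]}   # top-bucket_size frequent items
    rest = F[:split]
    avg_sup = (sum(supports[i] for i in F) / (len(F) * n_trans)) if F else 0
    if min_sup > tree_min_sup or avg_sup >= tree_avg_sup:
        rep.update((i, 'afopt-tree') for i in rest)
    else:
        # Next tree_alphabet_size most frequent items get AFOPT-trees;
        # the remaining sparse conditional databases get arrays.
        rep.update((i, 'afopt-tree') for i in rest[-tree_alphabet_size:])
        rep.update((i, 'array') for i in rest[:-tree_alphabet_size])
    return rep
```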
Table 3 shows the size, the construction time (build column) and, where applicable, the push-right time of the initial structure constructed from the original database by the AFOPT, H-Mine and FP-growth algorithms. We set bucket_size to 8 and tree_alphabet_size to 20 for the AFOPT algorithm. The initial structure of AFOPT includes all three structures. The array structure in the AFOPT algorithm simply stores all the items in a transaction. Each node in the hyper-structure stores three pieces of information: an item, a pointer to the next item in the same transaction, and a pointer to the same item in another transaction; therefore the size of the hyper-structure is approximately 3 times that of the array structure used in AFOPT. A node in the AFOPT-tree maintains only a child pointer and a sibling pointer, while an FP-tree node maintains two more pointers for bottom-up traversal: a parent pointer and a node link. AFOPT consumes the least amount of space on almost all tested datasets.

Table 3. Comparison of Initial Structures

                          AFOPT                          H-Mine                         FP-growth
Datasets                size       build   pushright    size       build   pushright    size       build
T10I4D100k (0.01%)      5116 kb    0.55s   0.37s       11838 kb    0.68s   0.19s       20403 kb    1.83s
T40I10D100k (0.5%)     16535 kb    1.85s   1.91s       46089 kb    2.10s   1.42s      104272 kb    6.16s
BMS-POS (0.05%)        17264 kb    2.11s   1.43s       38833 kb    2.58s   1.00s       47376 kb    6.64s
BMS-WebView-1 (0.06%)    711 kb    0.12s   0.01s        1736 kb    0.17s   0.01s        1682 kb    0.27s
chess (45%)              563 kb    0.04s   0.01s        1150 kb    0.05s   0.03s        1339 kb    0.12s
connect-4 (75%)           35 kb    0.73s   0.01s       22064 kb    1.15s   0.55s          92 kb    2.08s
mushroom (5%)           1067 kb    0.08s   0.04s        2120 kb    0.10s   0.03s         988 kb    0.17s
pumsb (70%)              375 kb    0.82s   0.02s       17374 kb    1.15s   0.43s        1456 kb    2.26s

4 Mining Frequent Closed Itemsets

The complete set of frequent itemsets can be very large, and it has been shown to contain much redundant information [11, 18]. Some works [11, 18, 13, 16, 8] put efforts into mining frequent closed itemsets to reduce the output size. An itemset is closed if all of its supersets have a lower support than it. The set of frequent closed itemsets is the minimum informative set of frequent itemsets. In this section, we describe how to extend Algorithm 1 to mine only frequent closed itemsets. For more details, please refer to [8].

4.1 Removing non-closed itemsets

Non-closed itemsets can be removed either in a postprocessing phase or during the mining process. The second strategy helps avoid unnecessary mining cost. Non-closed frequent itemsets are removed based on the following two lemmas (see [8] for the proofs).

Lemma 1 In Algorithm 1, an itemset p is closed if and only if two conditions hold: (1) no existing frequent itemset is a superset of p and is as frequent as p; (2) all the items in Dp have a lower support than p.

Lemma 2 In Algorithm 1, if a frequent itemset p is not closed because condition (1) in Lemma 1 does not hold, then none of the itemsets mined from Dp can be closed.

We check whether there exists q such that p ⊂ q and sup(p) = sup(q) before mining Dp. If such a q exists, then based on Lemma 2 there is no need to mine Dp (line 10). Thus the identification of a non-closed itemset not only reduces the output size, but also avoids unnecessary mining cost. Based on pruning condition (2) in Lemma 1, we can check whether an item i ∈ F appears in every transaction of Dp. If such an i exists, then there is no need to consider the frequent itemsets that do not contain i when mining Dp. In other words, we can directly perform mining on D_{p∪{i}} instead of Dp (lines 3-4). The effort of mining D_{p∪{j}}, j ≠ i, is saved. The pseudo-code for mining frequent closed itemsets is shown in Algorithm 2.

Algorithm 2 AFOPT-close Algorithm
Input:
  p is a frequent itemset
  Dp is the conditional database of p
  min_sup is the minimum support threshold
Description:
 1: Scan Dp to count the frequent items, F = {i1, i2, ..., in};
 2: Sort the items in F in ascending order of their frequencies;
 3: I = {i | i ∈ F and support(p ∪ {i}) = support(p)};
 4: F = F − I; p = p ∪ I;
 5: for all transactions t ∈ Dp do
 6:   remove the infrequent items from t, and sort the remaining items according to their orders in F;
 7:   let i be the first item of t; insert t into D_{p∪{i}};
 8: for all items i ∈ F do
 9:   s = p ∪ {i};
10:   if s is closed then
11:     Output s;
12:     AFOPT-close(s, Ds, min_sup);
13:   PushRight(Ds);

4.2 Closed itemset checking

During the mining process, we store all existing frequent closed itemsets in a tree structure, called the Condensed Frequent Pattern tree, or CFP-tree for short [8], and use the CFP-tree to check whether an itemset is closed. An example CFP-tree is shown in Figure 3(b); it stores all the frequent closed itemsets listed in Figure 3(a), which are mined from the database shown in Figure 2(a) with support 40%.

Each CFP-tree node is a variable-length array, and all the items in the same node are sorted in ascending order of their frequencies. A path in the tree starting from an entry in the root node represents a frequent itemset. The CFP-tree has two properties: the left containment property and the Apriori property. The Apriori property says that the support of any child of a CFP-tree entry cannot be greater than the support of that entry. The left containment property says that the item of an entry E can only appear in the subtrees pointed to by entries before E, or in E itself. A superset of an itemset p with support s can be efficiently searched for in the CFP-tree based on these two properties: the Apriori property can be exploited to prune subtrees pointed to by entries with support less than s, and the left containment property can be utilized to prune subtrees that do not contain all the items in p. We also maintain a hash-bitmap in each entry to indicate whether an item appears in the subtree pointed to by that entry, to further reduce the searching cost. The superset search algorithm is shown in Algorithm 3. BinarySearch(cnode, s) returns the first entry in a CFP-tree node with support no less than s. Algorithm 3 does not require the whole CFP-tree to be in main memory, because it is also very efficient on disk. Moreover, the CFP-tree structure is a compact representation of the frequent closed itemsets, so it has a higher chance of being held in memory than a flat representation.

[Figure 3. CFP-tree and Two-layer Hash Map: (a) the frequent closed itemsets d:3, p:4, f:5, a:6, pf:3, fa:4, ma:5, fma:3, pma:3, cpma:3; (b) the CFP-tree storing them; (c) the two-layer hash map.]

Algorithm 3 Search Superset Algorithm
Input:
  l is a frequent itemset
  cnode is the CFP-tree node pointed to by l
  s is the minimum support threshold
  I is the set of items to be contained in the superset
Description:
 1: if I = ∅ then
 2:   return true;
 3: Ē = the first entry of cnode such that Ē.item ∈ I;
 4: E′ = BinarySearch(cnode, s);
 5: for all entries E ∈ cnode between E′ and Ē do
 6:   l′ = l ∪ {E.item};
 7:   if E.child ≠ NULL AND all the items in I − {E.item} are in E.subtree then
 8:     found = Search Superset(l′, E.child, s, I − {E.item});
 9:     if found then
10:       return true;
11:   else if I − {E.item} = ∅ then
12:     return true;
13: return false;

Although searching in the CFP-tree is very efficient, it is still costly when the CFP-tree is large. Inspired by the two-layer structure adopted by the CLOSET+ algorithm [16] for subset checking, we use a two-layer hash map to check whether an itemset is closed before searching in the CFP-tree. The two-layer hash map is shown in Figure 3(c). We maintain a hash map for each item; the hash map of item i is denoted by i.hashmap. The length of the hash map of an item i is set to min{sup(i) − min_sup, max_hashmap_len}, where max_hashmap_len is a parameter to control the maximal size of the hash maps and min_sup = min{sup(i) | i is frequent}.
Given an itemset p = {i1, i2, ..., il}, p is mapped to ij.hashmap[(sup(p) − min_sup) % max_hashmap_len] for j = 1, 2, ..., l. An entry in a hash map records the maximal length of the itemsets mapped to it. For example, itemset {c, p, m, a} sets the first entry of c.hashmap, p.hashmap, m.hashmap and a.hashmap to 4. Figure 3(c) shows the status of the two-layer hash map before mining Df. An itemset p must be closed if any of the entries it is mapped to contains a lower value than its length; in such cases there is no need to search in the CFP-tree. The hash map of an item i can be released after all the frequent itemsets containing i are mined, because it will not be used in later mining. For example, when mining Df, the hash maps of items c, d and p can be deleted.
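A minimal rendering of this two-layer hash map in Python (our own sketch; the paper's actual data layout may differ in detail) looks as follows: insert records every mined closed itemset, and surely_closed implements the early acceptance test above.

```python
class TwoLayerHashMap:
    """Closedness pre-check: one small hash map per frequent item."""

    def __init__(self, supports, max_hashmap_len=64):
        # Following the text: min_sup here is min{sup(i) | i is frequent}.
        self.base = min(supports.values())
        # Length of i.hashmap: min{sup(i) - min_sup, max_hashmap_len},
        # clamped to at least 1 so the modulus below is always valid.
        self.size = {i: max(1, min(s - self.base, max_hashmap_len))
                     for i, s in supports.items()}
        self.maps = {i: [0] * n for i, n in self.size.items()}

    def _slot(self, i, sup):
        return (sup - self.base) % self.size[i]

    def insert(self, itemset, sup):
        """Record a mined closed itemset: each probed entry keeps the
        maximal length of any itemset mapped to it."""
        for i in itemset:
            slot = self._slot(i, sup)
            self.maps[i][slot] = max(self.maps[i][slot], len(itemset))

    def surely_closed(self, itemset, sup):
        """True if some probed entry holds a value below |itemset|: no
        superset with equal support was ever inserted, so the itemset is
        certainly closed. If False, the CFP-tree must still be searched."""
        return any(self.maps[i][self._slot(i, sup)] < len(itemset)
                   for i in itemset)
```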
5 Mining Maximal Frequent Itemsets

The problem of mining maximal frequent itemsets can be viewed as follows: given a minimum support threshold min_sup, find a border through the search space tree such that all the nodes below the border are infrequent and all the nodes above the border are frequent. The goal of maximal frequent itemset mining is to find this border while counting support for as few itemsets as possible. Existing maximal mining algorithms [19, 7, 1, 4, 5] adopt various pruning techniques to reduce the search space to be explored.

5.1 Pruning techniques

The most effective techniques are based on the following two lemmas, which prune a whole branch from the search space tree.

Lemma 3 Given a frequent itemset p, if p ∪ cand_exts(p) is frequent but not maximal, then none of the frequent itemsets mined from Dp and from p's right siblings' conditional databases can be maximal, because all of them are subsets of p ∪ cand_exts(p).

Lemma 4 Given a frequent itemset p, if p ∪ freq_exts(p) is frequent but not maximal, then none of the frequent itemsets mined from Dp can be maximal, because all of them are subsets of p ∪ freq_exts(p).

Based on Lemma 3, before mining Dp we can first check whether p ∪ cand_exts(p) is frequent but not maximal. This can be done by two techniques.

Superset Pruning Technique: check whether there exists some maximal frequent itemset that is a superset of p ∪ cand_exts(p). As in frequent closed itemset mining, such subset checking can be challenging when the number of maximal itemsets is large. We discuss this issue in the next subsection.

Lookahead Technique: check whether p ∪ cand_exts(p) is frequent while counting the frequent items in Dp. If Dp is represented by AFOPT-tree, the lookahead operation can be accomplished by simply looking at the left-most branch of the AFOPT-tree: if p ∪ cand_exts(p) is frequent, then the length of the left-most branch is equal to |cand_exts(p)| and the support of the leaf node of the left-most branch is no less than min_sup.

If the superset pruning technique and the lookahead technique fail, then based on Lemma 4 we can use the superset pruning technique to check whether p ∪ freq_exts(p) is frequent but not maximal. Two other techniques are adopted in our algorithm.

Excluding items appearing in every transaction of Dp from subsequent mining: as in frequent closed itemset mining, if an item i appears in every transaction of Dp, then a frequent itemset q mined from Dp and not containing i cannot be maximal, because q ∪ {i} is frequent.

Single Path Trimming: if Dp is represented by AFOPT-tree and the tree has only one child i, then we can append i to p and remove it from subsequent mining.

5.2 Subset checking

When doing superset pruning, checking against all maximal frequent itemsets can be costly when the number of maximal itemsets is large. Zaki et al. proposed a progressive focusing technique for subset checking [5]. The observation behind progressive focusing is that only the maximal frequent itemsets containing p can be supersets of p ∪ cand_exts(p) or p ∪ freq_exts(p). The set of maximal frequent itemsets containing p is called the local maximal frequent itemsets with respect to p, denoted as LMFIp. When checking whether p ∪ cand_exts(p) or p ∪ freq_exts(p) is a subset of some existing maximal frequent itemset, we only need to check it against LMFIp. The frequent itemsets in LMFIp can come either from p's parent's LMFI or from p's left siblings' LMFIs. The construction of the LMFIs is very similar to the construction of conditional databases and consists of two steps: (1) projecting: after all the frequent items F in Dp are counted, every s ∈ LMFIp is put into LMFI_{p∪{i}}, where i is the first item in F that appears in s; (2) push-right: after all the maximal frequent itemsets containing p are mined, every s ∈ LMFIp is put into LMFIq if q is the first right sibling of p containing an item in s. In our implementation, we use the pseudo projection technique to generate the LMFIs, i.e., LMFIp is a collection of pointers pointing to those maximal itemsets containing p.
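The projecting step can be sketched as follows (our own illustration; names are hypothetical): the LMFI of p is distributed to p's children according to the first frequent item of each stored maximal itemset, and the analogous push-right step later moves each itemset to the first right sibling containing one of its items.

```python
def project_lmfi(lmfi_p, F_rank):
    """Distribute LMFI_p over p's children (pseudo projection: the maximal
    itemsets are shared by reference, not copied).

    F_rank maps each frequent item of D_p to its position in the search order.
    """
    children = {}
    for s in lmfi_p:
        hits = [i for i in s if i in F_rank]
        if hits:
            first = min(hits, key=F_rank.get)  # first item of F appearing in s
            children.setdefault(first, []).append(s)
    return children

# A candidate p+cand_exts(p) is then checked only against the relevant
# children[...] list, instead of against all maximal frequent itemsets.
```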
6 Experimental results

In this section, we compare the performance of our algorithms with other FIM algorithms. All the experiments were conducted on a 1GHz Pentium III with 256MB memory running Mandrake Linux.

Table 4 shows some statistical information about the datasets used in the performance study. All the datasets were downloaded from the FIMI'03 workshop web site. The fifth and sixth columns are the maximal and the average transaction length. These statistics provide a rough description of the density of the datasets.

Table 4. Datasets

Data Sets               Size     #Trans    #Items   MaxTL   AvgTL
T10I4D100k (0.01%)      3.93M    100000      870      30    10.10
T40I10D100k (0.5%)     15.12M    100000      942      78    39.61
BMS-POS (0.05%)        11.62M    515597     1657     165     6.53
BMS-WebView-1 (0.06%)   1.28M     59601      497     267     2.51
chess (45%)             0.34M      3196       75      37    37.00
connect-4 (75%)         9.11M     67557      129      43    43.00
mushroom (5%)           0.56M      8124      119      23    23.00
pumsb (70%)            16.30M     49046     2113      74    74.00

6.1 Mining all frequent itemsets

We compared the efficiency of the AFOPT-all algorithm with the Apriori, DCI, FP-growth, H-Mine and Eclat algorithms. The Apriori and Eclat implementations we used are by Christian Borgelt. DCI was downloaded from its web site. We obtained the source code of FP-growth from its authors. H-Mine was implemented by ourselves; we ran H-Mine only on several sparse datasets, since it was designed for sparse datasets and it switches to FP-tree on dense datasets. Figure 4 shows the running time of all the algorithms over the datasets shown in Table 4. When the minimum support threshold is very low, an intolerable number of frequent itemsets can be generated, so when the minimum support threshold reached some very low value, we turned off the output. This minimum support value is called the output threshold, and the output thresholds are shown on top of each figure.

[Figure 4. Performance Comparison of FI Mining Algorithms on (a) T10I4D100k (output threshold 0.01%), (b) T40I10D100k (0.5%), (c) BMS-WebView-1 (0.006%), (d) BMS-POS (0.05%), (e) chess (45%), (f) connect-4 (75%), (g) mushroom (5%), (h) pumsb (70%).]
With a high minimum support threshold, all the algorithms showed comparable performance. When the minimum support threshold was lowered, the gaps between the algorithms increased. The two candidate generate-and-test approaches, Apriori and DCI, showed satisfactory performance on several sparse datasets, but took thousands of seconds to terminate on dense datasets due to the high cost of generating and testing a large number of candidate itemsets. H-Mine demonstrated performance similar to FP-growth on dataset T10I4D100k, but it was slower than FP-growth on the other three sparse datasets: H-Mine uses the pseudo construction strategy, which cannot reduce traversal cost as effectively as the physical construction strategy. Eclat uses vertical mining techniques; support counting is performed efficiently by joining transaction id lists, but Eclat does not scale well with respect to the number of transactions in a database. The running time of AFOPT-all was rather stable over all the tested datasets, and it outperformed the other algorithms.

6.2 Scalability

We studied the scalability of our algorithm by varying the parameters of the IBM synthetic data generator along two dimensions: the number of transactions was varied from 200k to 1000k, and the average transaction length was varied from 10 to 50. The default values of these two parameters were set to 1000k and 40 respectively. We compared our algorithm with the DCI algorithm; the other algorithms took a long time to finish on large datasets, so we excluded them from the comparison. Figure 5 shows the results when varying the two parameters.

[Figure 5. Scalability Study (minimum support = 0.1%): (a) varying the number of transactions, (b) varying the average transaction length.]

6.3 Mining frequent closed itemsets

We compared AFOPT-close with the MAFIA [4] and Apriori algorithms. Both algorithms have an option to generate only closed itemsets; we denote them as Apriori-close and MAFIA-close respectively in the figures. MAFIA was downloaded from its web site. We compared with Apriori-close only on sparse datasets, because Apriori-close requires a very long time to terminate on dense datasets. On several sparse datasets, AFOPT-close and Apriori-close showed comparable performance, and both were orders of magnitude faster than MAFIA-close. MAFIA-close uses the vertical mining technique and represents tid lists by bitmaps. AFOPT-close showed better performance on the tested dense datasets due to its adaptive nature and the efficient subset checking techniques described in Section 4. On dense datasets, AFOPT-close uses the tree structure to store conditional databases; the tree structure has apparent advantages on dense datasets because many transactions share their prefixes.

[Figure 6. Performance Comparison of FCI Mining Algorithms on (a) T10I4D100k, (b) T40I10D100k, (c) BMS-WebView-1, (d) BMS-POS, (e) chess, (f) connect-4, (g) mushroom, (h) pumsb.]
6.4 Mining maximal frequent itemsets

We compared AFOPT-max with the MAFIA and Apriori algorithms. The Apriori algorithm also has an option to produce only maximal frequent itemsets; it is denoted as Apriori-max in the figures. Again, we compared with it only on sparse datasets. Apriori-max explores the search space in breadth-first order and finds short frequent itemsets first; maximal frequent itemsets are generated in a post-processing phase. Therefore Apriori-max is infeasible when the number of frequent itemsets is large, even though it adopts some pruning techniques during the mining process. AFOPT-max and MAFIA generate frequent itemsets in depth-first order, so long frequent itemsets are mined first, and all the subsets of a long maximal frequent itemset can be pruned from further consideration using the superset pruning and lookahead techniques. AFOPT-max uses a tree structure to represent dense conditional databases. The AFOPT-tree introduces more pruning capability than tid lists or tid bitmaps. For example, if a conditional database can be represented by a single branch in the AFOPT-tree, then that single branch is the only possible maximal itemset in the conditional database. AFOPT-max also benefits from the progressive focusing technique for superset pruning. MAFIA was very efficient on small datasets, e.g., chess and mushroom, where the bitmaps are short.

[Figure 7. Performance Comparison of MFI Mining Algorithms on (a) T10I4D100k, (b) T40I10D100k, (c) BMS-WebView-1, (d) BMS-POS, (e) chess, (f) connect-4, (g) mushroom, (h) pumsb.]

7 Conclusions

In this paper, we revisited the frequent itemset mining problem and focused on investigating the algorithmic performance space of the pattern growth approach. We identified four dimensions in which existing pattern growth algorithms differ: (1) item search order: static lexicographic order or dynamic ascending frequency order; (2) conditional database representation: tree-based structure or array-based structure; (3) conditional database construction strategy: physical construction or pseudo construction; and (4) tree traversal strategy: bottom-up or top-down. Existing algorithms adopt different strategies on these four dimensions in order to reduce the total number of conditional databases and the mining cost of each individual conditional database.

We described an efficient pattern growth algorithm, AFOPT, in this paper. It adaptively uses three different structures: arrays, AFOPT-tree and buckets, to represent conditional databases according to the density of each conditional database. Several parameters were introduced to control which structure should be used for a specific conditional database. We showed that the adaptive conditional database representation strategy requires less space than using an array-based or a tree-based structure alone. We also extended the AFOPT algorithm to mine closed and maximal frequent itemsets, and described how to incorporate pruning techniques into the AFOPT framework. Efficient subset checking techniques for both closed and maximal frequent itemset mining were presented. A set of experiments was conducted to show the efficiency of the proposed algorithms.

References

[1] R. Agrawal, C. Aggarwal, and V. Prasad. Depth first generation of long patterns. In SIGKDD, 2000.
[2] R. Agrawal, T. Imielinski, and A. N. Swami. Mining association rules between sets of items in large databases. In SIGMOD, 1993.
[3] S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD, 1997.
[4] D. Burdick, M. Calimlim, and J. Gehrke. MAFIA: A maximal frequent itemset algorithm for transactional databases. In ICDE, 2001.
[5] K. Gouda and M. J. Zaki. GenMax: Efficiently mining maximal frequent itemsets. In ICDM, 2001.
[6] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In SIGMOD, 2000.
[7] R. J. Bayardo Jr. Efficiently mining long patterns from databases. In SIGMOD, 1998.
[8] G. Liu, H. Lu, W. Lou, and J. X. Yu. On computing, storing and querying frequent patterns. In SIGKDD, 2003.
[9] G. Liu, H. Lu, Y. Xu, and J. X. Yu. Ascending frequency ordered prefix-tree: Efficient mining of frequent patterns. In DASFAA, 2003.
[10] J. Liu, Y. Pan, K. Wang, and J. Han. Mining frequent item sets by opportunistic projection. In SIGKDD, 2002.
[11] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. In ICDT, 1999.
[12] J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang. H-Mine: Hyper-structure mining of frequent patterns in large databases. In ICDM, 2001.
[13] J. Pei, J. Han, and R. Mao. CLOSET: An efficient algorithm for mining frequent closed itemsets. In DMKD, 2000.
[14] R. Rymon. Search through systematic set enumeration. In Proc. of the KR Conference, 1992.
[15] R. C. Agarwal, C. C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for finding frequent itemsets. Journal of Parallel and Distributed Computing, 61(3):350–371, 2001.
[16] J. Wang, J. Pei, and J. Han. CLOSET+: Searching for the best strategies for mining frequent closed itemsets. In SIGKDD, 2003.
[17] Y. Xu, J. X. Yu, G. Liu, and H. Lu. From path tree to frequent patterns: A framework for mining frequent patterns. In ICDM, pages 514–521, 2002.
[18] M. J. Zaki and C. Hsiao. CHARM: An efficient algorithm for closed itemset mining. In SDM, 2002.
[19] M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms for fast discovery of association rules. In SIGKDD, 1997.