CT-PRO: A Bottom-Up Non Recursive Frequent Itemset Mining Algorithm Using Compressed FP-Tree Data Structure

Yudho Giri Sucahyo, Raj P. Gopalan
Department of Computing, Curtin University of Technology
Kent St, Bentley, Western Australia 6102
{sucahyoy, raj}@cs.curtin.edu.au

Abstract

Frequent itemset mining (FIM) is an essential part of association rules mining. Its application for other data mining tasks has also been recognized. It has been an active research area and a large number of algorithms have been developed. In this paper, we propose another pattern growth algorithm which uses a more compact data structure named Compressed FP-Tree (CFP-Tree). The number of nodes in a CFP-Tree can be as few as half the number in the corresponding FP-Tree. We also describe the implementation of CT-PRO, which utilizes the CFP-Tree for FIM. CT-PRO traverses the CFP-Tree bottom-up and generates the frequent itemsets following the pattern growth approach non-recursively. Experiments show that CT-PRO performs better than OpportuneProject, FP-Growth, and Apriori. A further experiment is conducted to determine the feasible performance range of CT-PRO; the result shows that CT-PRO has a larger performance range than the others. CT-PRO also performs better than LCM and kDCI, which are known as the two best algorithms in the FIMI Repository 2003.

1. Introduction

Since its introduction in [1], the problem of efficiently generating frequent itemsets has been an active research area and a large number of algorithms have been developed for it; for surveys, see [2-4]. Frequent itemset mining (FIM) is an essential part of association rules mining (ARM). Since FIM is computationally expensive, the general performance of ARM is determined by it. The frequent itemset concept has also been extended to many other data mining tasks such as classification [5, 6], clustering [7], and sequential pattern discovery [8].

The data structures used play an important role in the performance of FIM algorithms. The various data structures used by FIM algorithms can be categorized as either array-based or tree-based. An example of a successful array-based algorithm for FIM is H-Mine [9]. It uses a data structure named H-struct, which is a combination of arrays and hyper-links. It was shown in [9] that H-struct works well for sparse datasets, as H-Mine outperforms FP-Growth [10] on these datasets (note that both H-Mine and FP-Growth follow the same pattern growth method). However, the hyper-structure is not efficient on dense datasets and therefore H-Mine switches to FP-Growth for such datasets.

FP-Growth [10] shows good performance on dense datasets as it uses a compact data structure named FP-Tree. FP-Tree is a prefix tree with links between nodes containing the same item. A tree data structure is suitable for dense datasets, since many transactions share common prefixes and the database can therefore be compactly represented. However, for sparse datasets the tree becomes bigger and bushier, and its construction and traversal costs are then higher than those of array-based data structures.

The strengths of H-Mine and FP-Growth were combined in the recent pattern growth FIM algorithm named OpportuneProject (OP) [11]. OP is an adaptive algorithm that opportunistically chooses an array-based or a tree-based data structure depending on the sub-database characteristics.

In this paper, we describe our new data structure named Compressed FP-Tree (CFP-Tree) and also the implementation of our new FIM algorithm named CT-PRO, which was first introduced in [12]. Here we compare the compactness of the CFP-Tree with the FP-Tree at several support levels on various datasets generated using the synthetic data generator [13]. The performance of CT-PRO is compared with Apriori [14, 15], FP-Growth [10], and OP [11].

Sample datasets such as the real-world BMS datasets [3] or the UCI Machine Learning Repository datasets [16] do not cover the full range of densities from sparse to dense. Some algorithms may work well for a certain dataset but may not be feasible when the dimensions of the database change (i.e. number of transactions, number of items, average number of items per transaction, etc.). Therefore, a further study has been done in this paper to show the feasible performance range of the algorithms. The more extensive testing of the algorithms is carried out using a set of databases with varying numbers of both transactions and average items per transaction. For each dataset, all the algorithms are tested at supports of 10% to 90% in increments of 10%. The experimental results are reported in detail.

To show how well CT-PRO compares with the algorithms in the FIMI Repository 2003 [17], the two best algorithms from the last workshop, LCM [18] and kDCI [19], are selected for comparison. The result shows that CT-PRO outperforms these and therefore all others.

The structure of the rest of this paper is as follows: In Section 2, we introduce the CFP-Tree data structure and report the results of experiments evaluating its compactness. In Section 3, we describe the CT-PRO algorithm with a running example. We discuss the complexity of the CT-PRO algorithm in Section 4. The performance of the algorithm on various datasets is compared against other algorithms in Section 5. Section 6 contains the conclusions of our study.

2. Compressed FP-Tree Data Structure

In this section, a new tree-based data structure, named Compressed FP-Tree (CFP-Tree), is introduced. It is a variant of the CT-Tree data structure that we introduced in [20], with the following major differences: items are sorted in descending order of their frequency (instead of ascending order, as in CT-Tree) and there is a link to the next node with the same item (such links are not present in CT-Tree). The CFP-Tree is defined as follows:

Definition 1 (Compressed FP-Tree or CFP-Tree). A Compressed FP-Tree is a prefix tree with the following properties:
1. It consists of an ItemTable and a tree whose root represents the index of the item with the highest frequency, with a set of subtrees as the children of the root.
2. The ItemTable contains all frequent items sorted in descending order of their frequency. Each entry in the ItemTable consists of four fields: (1) index, (2) item-id, (3) frequency of the item, and (4) a pointer to the root of the subtree of each frequent item.
3. If I = {i1, i2, …, ik} is the set of frequent items in a transaction, then, after the items are mapped to their index-ids, the transaction is inserted into the Compressed FP-Tree starting from the root of the subtree to which i1 in the ItemTable points.
4. The root of the Compressed FP-Tree is level 0 of the tree.
5. Each node in the Compressed FP-Tree consists of four fields: node-id, a pointer to the next sibling, a pointer to the next node with the same id, and a count array where each entry corresponds to the number of occurrences of an itemset. If C = {C0, C1, …, Ck} is the set of counts in the count array attached to a node, and the index of the array starts from zero, then Ci is the count of a transaction with an itemset along the path from the node at level i to the node where Ci is located.

The following lemma provides the worst-case space complexity of a CFP-Tree.

Lemma 1. Let n be the number of frequent items in the database for a certain support threshold. The number of nodes of the CFP-Tree is bounded by 2^(n-1), which is half of the maximum for a full prefix tree.
Rationale. Let IF = {iF1, …, iFn} be the set of distinct items in a CFP-Tree, where iF1, iF2, …, iFn are lexicographically ordered. The maximum numbers of nodes under the subtrees iF1, iF2, …, iFn are 2^(n-1), 2^(n-2), …, 2^0 respectively. Since the CFP-Tree is actually the subtree iF1, the maximum number of nodes of the CFP-Tree is 2^(n-1).

Compared to the FP-Tree, the CFP-Tree has some important differences, as follows:
1. FP-Tree stores the item id in the tree while, in the CFP-Tree, item ids are mapped to an ascending sequence of integers that is actually the array index in the ItemTable.
2. The FP-Tree is compressed by removing identical subtrees of a complete FP-Tree and succinctly storing their information in the remaining nodes. All subtrees of the root of the FP-Tree (except the leftmost branch) are collected together at the leftmost branch to form the CFP-Tree.
3. Each node in the FP-Tree (except the root) consists of three fields: item-id, count and node-link. Count registers the number of transactions represented by the portion of the path reaching this node. Node-link links to the next node with the same item-id. Each node in the CFP-Tree consists of three fields: item-id, count array and node-link. The count array contains counts for item subsets on the path from the root to that node; the index of each cell in the array corresponds to the level number of the corresponding node above.
4. FP-Tree has a HeaderTable consisting of two fields: item-id and a pointer to the first node in the FP-Tree carrying that item-id. CFP-Tree has an ItemTable consisting of four fields: index, item-id, count of the item and a pointer to the root of the subtree of each item. The root of each subtree has a link to the next node with the same id. Both the HeaderTable and the ItemTable store only frequent items.

Figure 1 shows the FP-Tree and the CFP-Tree for a sample database. In this case, the FP-Tree is a complete tree for items 1-4. In this example, the number of nodes in the FP-Tree is twice that of the corresponding CFP-Tree. However, most datasets do not have such an extreme characteristic as in this example.

3. CT-PRO Algorithm

In this section, a new method that traverses the tree in a
bottom-up strategy, implemented non-recursively, is presented. The CFP-Tree data structure is used to compactly represent the transactions in memory. The algorithm is called CT-PRO and it has three steps: finding the frequent items, constructing the CFP-Tree, and mining. Algorithm 1 shows the first two steps of CT-PRO.

Figure 2 shows the compactness of the CFP-Tree compared to the FP-Tree on several synthetic datasets at various support levels (the characteristics of the datasets are explained later in Section 5.2). The CFP-Tree has a smaller number of nodes than the FP-Tree in all cases.

Sample database for Figure 1 (Tid: Items): 1: 1 2 3 4; 2: 2 4; 3: 1 3 4; 4: 3; 5: 2 3; 6: 2; 7: 1 4; 8: 1 2 3; 9: 3 4; 10: 4; 11: 1; 12: 2 3 4; 13: 1 2; 14: 1 2 4; 15: 1 3.
[Figure 1: the FP-Tree (with its HeaderTable) and the CFP-Tree (with its ItemTable) for this database; each of the items 1-4 occurs in 8 transactions, so the FP-Tree is a complete prefix tree with 15 nodes while the CFP-Tree has 8.]
Figure 1: FP-Tree and CFP-Tree

[Figure 2: bar chart of the number of nodes in the CFP-Tree and the FP-Tree on the datasets I100T50KA10 through I100T100KA50 at support levels between 20% and 80%.]
Figure 2: Compactness of CFP-Tree Compared to FP-Tree on Various Synthetic Datasets at Various Support Levels

Algorithm 1 CT-PRO Algorithm: Step 1 and Step 2
input: Database D, Support Threshold σ
output: CFP-Tree

1  begin
2    // Step 1: Identify frequent items
3    for each transaction t ∈ D
4      for each item i ∈ t
5        if i ∈ ItemTable
6          Increment count of i
7        else
8          Insert i into GlobalItemTable with count = 1
9        end if
10     end for
11   end for
12   Sort GlobalItemTable in frequency descending order
13   Assign an index to each frequent item in the GlobalItemTable
14   // Step 2: Construct CFP-Tree
15   Construct the leftmost branch of the tree
16   for each transaction t ∈ D
17     Initialize mappedTrans
18     for each frequent item i ∈ t
19       mappedTrans = mappedTrans ∪ GetIndex(i)
20     end for
21     Sort mappedTrans in ascending order of item ids
22     InsertToCFPTree(mappedTrans)
23   end for
24 end
25 Procedure InsertToCFPTree(mappedTrans)
26   firstItem := mappedTrans[1]
27   currNode := root of subtree pointed to by ItemTable[firstItem]
28   for each subsequent item i ∈ mappedTrans
29     if currNode has a child representing i
30       Increment count[firstItem-1] of the child node
31     else
32       Create a child node and set its count[firstItem-1] to 1
33       Link the node to its respective node-link
34     end if
35   end for
36 end

The pointer in the GlobalItemTable also acts as the start of the links to other nodes with the same item ids (indicated by the dashed lines in Figure 3d). For illustration, at each node we also show the index of the array, the transaction represented by each index entry and its count. In the implementation of the CFP-Tree, however, only the second column, which represents the count, is stored.

Figure 3 (a) Sample Database (Tid: Items): 1: 3 4 5 6 7 9; 2: 1 3 4 5 13; 3: 1 2 4 5 7 11; 4: 1 3 4 8; 5: 1 3 4 10.
(b) Frequent Items: 1: 3 4 5 7; 2: 1 3 4 5; 3: 1 4 5 7; 4: 1 3 4; 5: 1 3 4.
(c) Mapped: 1: 1 2 4 5; 2: 1 2 3 4; 3: 1 3 4 5; 4: 1 2 3; 5: 1 2 3.
(d) GlobalItemTable (index: item, count): 1: 4, 5; 2: 3, 4; 3: 1, 4; 4: 5, 3; 5: 7, 2.
[Figure 3d also shows the Global CFP-Tree built from the mapped transactions, with the count arrays at each node.]
Figure 3: CFP-Tree for the Sample Dataset

The mining process of CT-PRO is shown in Algorithm 2 and illustrated by the following example.

Example 1. Let the CFP-Tree shown in Figure 3d be the input for the mining step of CT-PRO, and suppose the user wants to get all the frequent itemsets with a minimum support of two transactions (or 40%).
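To make the three steps concrete, the following is a simplified, illustrative Python sketch of the CT-PRO pipeline on the sample database of Figure 3a. It is not the authors' implementation: it replaces the CFP-Tree and its count arrays with flat lists of mapped transactions, and it enumerates co-occurring index combinations directly instead of building LocalCFP-Trees. The function names are our own. The step structure, however, follows Algorithms 1 and 2: discover frequent items, map transactions to indexes sorted by descending frequency, and then mine projections starting from the least frequent index.

```python
from collections import Counter
from itertools import combinations

# Sample database from Figure 3a; minimum support = 2 transactions (40%).
DB = [
    [3, 4, 5, 6, 7, 9],
    [1, 3, 4, 5, 13],
    [1, 2, 4, 5, 7, 11],
    [1, 3, 4, 8],
    [1, 3, 4, 10],
]
MINSUP = 2

def global_item_table(db, minsup):
    # Step 1: one pass over the database to count items; frequent items
    # are indexed in descending order of frequency (the GlobalItemTable).
    counts = Counter(i for t in db for i in t)
    order = sorted((i for i in counts if counts[i] >= minsup),
                   key=lambda i: -counts[i])
    return {item: idx for idx, item in enumerate(order, start=1)}

def map_transactions(db, table):
    # Step 2 (front end): drop infrequent items, map the rest to their
    # GlobalItemTable indexes, and sort ascending (most frequent first).
    return [sorted(table[i] for i in t if i in table) for t in db]

def mine(db, minsup):
    # Mining: process indexes from the least to the most frequent; for
    # each, project the mapped transactions containing that index and
    # count the combinations of the co-occurring (smaller) indexes.
    # This flat projection stands in for the LocalCFP-Tree traversal.
    table = global_item_table(db, minsup)
    trans = map_transactions(db, table)
    item_of = {idx: item for item, idx in table.items()}
    result = {}
    for idx in sorted(table.values(), reverse=True):
        tails = [t[:t.index(idx)] for t in trans if idx in t]
        result[frozenset([item_of[idx]])] = len(tails)
        combo_counts = Counter()
        for tail in tails:
            for r in range(1, len(tail) + 1):
                combo_counts.update(combinations(tail, r))
        for combo, c in combo_counts.items():
            if c >= minsup:
                itemset = frozenset(item_of[i] for i in combo) | {item_of[idx]}
                result[itemset] = c
    return result

freq = mine(DB, MINSUP)
```

On this database the projection of the least frequent item (item 7) yields the itemsets 7, 74, 75 and 754, each with support 2, matching the running example; in all, 17 frequent itemsets are found.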
Suppose the user wants to mine all frequent itemsets from the transaction database shown in Figure 3a with a support threshold of two transactions (or 40%). First, we need to identify the frequent items by reading the database once (lines 3-11). The frequent items are stored in frequency descending order in the GlobalItemTable (line 12). In a second pass over the database, only the frequent items are selected from each transaction (see Figure 3b), mapped to their index ids in the GlobalItemTable on the fly, sorted in ascending order of their index ids (see Figure 3c) and inserted into the CFP-Tree (see Figure 3d).

Algorithm 2 CT-PRO Algorithm: Mining Part
input: CFP-Tree
output: Frequent Itemsets FP

1  Procedure Mining
2    for each frequent item i ∈ GlobalItemTable, from the least to the most frequent
3      Initialize LocalFrequentPatternTree x with i as the root
4      ConstructLocalItemTable(i)
5      for each frequent item j ∈ LocalItemTable
6        Attach j as a child of x
7      end for
8      ConstructLocalCFPTree(i)
9      RecMine(x)
10     Traverse the LocalFrequentPatternTree to print the frequent itemsets
11   end for
12 end
13 Procedure ConstructLocalItemTable(i)
14   for each occurrence of node i in the CFP-Tree
15     for each item j on the path to the root
16       if j ∈ LocalItemTable
17         Increment count of j
18       else
19         Insert j into LocalItemTable with count = 1
20       end if
21     end for
22   end for
23 end
24 Procedure ConstructLocalCFPTree(i)
25   for each occurrence of node i in the CFP-Tree
26     Initialize mappedTrans
27     for each frequent item j ∈ LocalItemTable on the path to the root
28       mappedTrans = mappedTrans ∪ GetIndex(j)
29     end for
30     Sort mappedTrans in ascending order of item ids
31     InsertToCFPTree(mappedTrans)
32   end for
33 end
34 Procedure RecMine(x)
35   for each child i of x
36     Set all counts in LocalItemTable to 0
37     for each occurrence of node i in the LocalCFPTree
38       for each item j on the path to the root
39         Increment count of j in the LocalItemTable
40       end for
41     end for
42     for each frequent item k ∈ LocalItemTable
43       Attach k as a child of i
44     end for
45     RecMine(i)
46   end for
47 end

[Figure 4: the LocalCFP-Tree, LocalItemTable and LocalFrequentPatternTree at each step of the mining process, for the projections of indexes 5, 4, 3 and 2.]
Figure 4: Local CFP-Tree during the Mining Process

Figure 4 shows the LocalCFP-Tree and the LocalFrequentPatternTree at each step of the mining process. CT-PRO starts from the least frequent item (index: 5, item: 7) in the GlobalItemTable (line 2). Item 7 is frequent and it becomes the root of the LocalFrequentPatternTree (line 3). Then CT-PRO creates a projection of all transactions ending with index 5. This projection is represented by a LocalCFP-Tree and contains only locally frequent items. Traversing the node-link of index 5 in the GlobalCFP-Tree identifies the local frequent items that occur together with it. There are three nodes of index 5, and the path to the root from each node is traversed, counting the other indexes that occur together with index 5 (lines 13-23). In all, we have 1 (2), 2 (1), 3 (1) and 4 (2) for index 5. As indexes 1 and 4 (item ids 4 and 5) are locally frequent, they are registered in the LocalItemTable and assigned new index ids (see Figure 4a). They also become the children of the LocalFrequentPatternTree's root (lines 5-7). Together, the root and its children form the frequent itemsets of length two.

After the local frequent items for the projection have been identified, the node-link in the GlobalCFP-Tree is re-traversed and the path to the root from each node is revisited to get the local frequent items occurring together with index 5 in the transactions. These local frequent items are mapped to their indexes in the LocalItemTable on the fly, sorted in ascending order of their index ids and inserted into the LocalCFP-Tree (lines 24-33). The first path of index 5 returns nothing. From the second path of index 5, a transaction 14 (1) is inserted into the LocalCFP-Tree, and another transaction 14 (1) from the third path of index 5 is also inserted. In total, there are two occurrences of transaction 14. Indexes 1 and 4 in the GlobalItemTable represent items 4 and 5 respectively. The indexes of items 4 and 5 in the LocalItemTable are 1 and 2 respectively, so the transaction 14 is inserted as transaction 12 in the LocalCFP-Tree. As the item indexes in the GlobalItemTable and the LocalItemTable are different, the item id is always maintained for output purposes.

Longer frequent itemsets, with length greater than two, are extracted by calling the procedure RecMine (line 9). For simplicity, we have described this procedure (lines 34-47) using recursion but, in the program, it is implemented as a non-recursive procedure. Starting from the least frequent item in the LocalItemTable (line 35), the node-link is traversed (lines 37-41). For each node, the path to the root in the LocalCFP-Tree is traversed, counting the other items that occur together with the current item. For example, in Figure 4a, traversing the node-link of node 2 will return the index 1 (2) and, since it is frequent, an entry is created and attached as the child of index 2 in the LocalFrequentPatternTree (lines 42-44). All frequent itemsets containing item 7 can then be extracted by traversing the LocalFrequentPatternTree (line 10): 7 (2), 75 (2), 754 (2), 74 (2). The process continues with the next items in the GlobalItemTable, indexes 4, 3 and 2; finally, when the mining process reaches the root of the tree of Figure 3d, it outputs 4 (5).

One major advantage of CT-PRO compared to FP-Growth is that CT-PRO avoids the cost of creating conditional FP-Trees. FP-Growth needs to create a conditional FP-Tree at each step of its recursive mining. This overhead adversely affects its performance, as the number of conditional FP-Trees corresponds to the number of frequent itemsets. In CT-PRO, for each frequent item (not each frequent itemset), only one LocalCFP-Tree is created and traversed non-recursively to extract all frequent itemsets beginning with that frequent item.

4. Time Complexity

In this section, the best-case and worst-case time complexity of the CT-PRO algorithm is presented. Let I = {i1, i2, …, in} be the set of all n items, let the transaction database D be {t1, t2, …, tm}, and let v be the total number of items in all transactions.

Lemma 2. In the best case, the cost of generating frequent itemsets is O(v + n).
Proof. The best case for the CT-PRO algorithm occurs when there is no frequent item. The algorithm has to read the v items in all transactions and maintain the counts of the n items. The counts of all n items are stored in the ItemTable and checked to determine whether there is any frequent item. If there is no frequent item, the process stops.

Lemma 3. In the worst case, the cost of generating frequent itemsets is

  (v + n) + (v + 2^(n-1)) + Σ_{k=2..n} ((2^(n-k) - 1) 2^(n-k) + 2^(k-2)) = O(2^(2n)).

Proof. The worst case happens when all n items are frequent and all combinations of them are present in the m transactions. CT-PRO has three steps: finding the frequent items, constructing the CFP-Tree, and mining. The cost of finding the frequent items has been given in Lemma 2. The worst case for the GlobalCFP-Tree corresponds to a situation where all the possible paths exist. In constructing the GlobalCFP-Tree, all the transactions in the database are read (the cost is v) and inserted into the tree (the total number of nodes is 2^(n-1)). For the mining process, for each frequent item fk, where 2 ≤ k ≤ n, 2^(k-2) nodes in the GlobalCFP-Tree are visited to construct a LocalCFP-Tree. The LocalCFP-Tree has (2^(n-k) - 1) paths that correspond to, at most, 2^(n-k) candidate itemsets. So the worst-case mining cost is

  Σ_{k=2..n} ((2^(n-k) - 1) 2^(n-k) + 2^(k-2)).

Therefore, the total worst-case cost of CT-PRO is

  (v + n) + (v + 2^(n-1)) + Σ_{k=2..n} ((2^(n-k) - 1) 2^(n-k) + 2^(k-2)) = O(2^(2n)).

5. Experimental Evaluation

This section contains three sub-sections. In Section 5.1, we compare CT-PRO against other well-known algorithms, including Apriori [14, 15], FP-Growth [10] and the recently proposed OpportuneProject (OP) [11], on the various datasets available at the FIMI Repository 2003 [17]. In Section 5.2, we report the results of more comprehensive testing to determine the feasible performance range of the algorithms. Finally, in Section 5.3, we compare CT-PRO with the two best algorithms in the FIMI Repository 2003 [17], LCM [18] and kDCI [19].

5.1. Comparison with Apriori, FP-Growth and OpportuneProject

Six real datasets are used in this experiment, including two dense datasets: Chess and Connect4; two less dense datasets: Mushroom and Pumsb*; and two sparse datasets: BMS-WebView-1 and BMS-WebView-2. The first four datasets were originally taken from the UCI ML Repository [16] and the last two were donated by Blue Martini Software [3]. All the datasets are also available at the FIMI Repository 2003 [17].

We used the implementation of Apriori created by Christian Borgelt [21], enabling the use of its prefix tree data structure. As for FP-Growth, we used the executable code available from its authors [10].
However, for comparing the number of nodes of the FP-Tree against our proposed data structure, we modified the source code of FP-Growth provided by Bart Goethals in [22]. For OpportuneProject (OP), we used the executable code available from its author, Junqiang Liu [11]. All the algorithms were implemented and compiled using MS Visual C++ 6.0. All the experiments (except the comparisons with the algorithms on the FIMI Repository 2003 website [17]) were performed on a Pentium III PC, 866 MHz, with 512 MB RAM and a 110 GB hard disk, running MS Windows 2000. The runtime reported in all our charts is the total execution time, the period between input and output. It also includes the time for constructing all the data structures used in all the programs.

Figure 5 shows the results of the experiment on the various datasets. All the charts use a logarithmic scale for runtime along the y-axis. We did not plot a result if the runtime was more than 10,000 seconds. For a comprehensive evaluation of the algorithms' performance, rather than showing where our algorithm performed best at some of the support levels, all the algorithms were extensively tested on the various datasets: at support levels of 10% to 90% for the dense datasets (e.g. Connect4, Chess, Pumsb*, Mushroom), at support levels of 0.1% to 1% for the sparse dataset BMS-WebView-1, and at support levels of 0.01% to 0.1% for the sparse dataset BMS-WebView-2.

CT-PRO outperforms the others at all support thresholds on the Connect4, Chess, Mushroom and Pumsb* datasets. On the sparse dataset BMS-WebView-1, CT-PRO is the runner-up, after OP, with only small performance differences (0.4 seconds against 0.49 seconds at a support level of 0.1%, and 5.18 seconds against 7.69 seconds at 0.06%). Below the support level of 0.06%, none of the algorithms could mine the BMS-WebView-1 dataset. On the sparse dataset BMS-WebView-2, a remarkable result is obtained. Apriori, which is known as a traditional FIM algorithm, outperforms FP-Growth at all support levels. CT-PRO is the fastest from a support threshold of 1% down to 0.4%, and becomes the runner-up, after OP, at support levels of 0.3% down to 0.1%, with small performance differences.

From these results, we can claim that, on dense datasets, CT-PRO generally outperforms the others. On sparse datasets, the high cost of tree construction reduces CT-PRO to runner-up. However, as the gap is very small, we can say that CT-PRO also works well on sparse datasets.

5.2. Determining the Feasible Performance Range

As mentioned earlier, sample datasets such as the real-world BMS datasets [3] and the UCI Machine Learning Repository datasets [16], which are also available at the FIMI Repository 2003 [17], have their own static characteristics and thus do not cover the full range of densities. An algorithm that works well for one dataset may not have the same degree of performance on other datasets with different dimensions. Dimensions, here, could be the number of transactions, number of items, average number of items per transaction, denseness or sparseness, etc. In this section, a more comprehensive evaluation of the performance of the various algorithms is presented.

We generated ten datasets using the synthetic data generator [13]. The first five datasets contained 100 items with 50,000 transactions, and an average number of items per transaction of 10, 25, 50, 75, and 100. The second five datasets contained 100 items with 100,000 transactions, also with an average number of items per transaction of 10, 25, 50, 75, and 100. CT-PRO, Apriori, FP-Growth and OP were tested extensively on these datasets at support levels of 10% to 90%, in increments of 10%.

Figure 6 shows the performance comparisons of the algorithms on these datasets. The dataset name shows its characteristics. For example, I100T100KA10 means there are 100 items, and 100,000 transactions with an average of 10 items per transaction. The experimental results show that the performance characteristics on databases of 50,000 and of 100,000 transactions are quite similar; however, the runtime increases with the number of transactions. As the average number of items increases and/or the support level decreases, at some point every algorithm 'hits the wall' (i.e. takes too long to complete).

The Apriori algorithm is very feasible for sparse datasets (with an average number of items per transaction of 10 and 25). Its performance is good, as it consistently performs better than FP-Growth at all support levels. Although Apriori is slower than CT-PRO and OP on these two sparse datasets, its runtime is still acceptable to the user (it needs only 60 seconds to mine the I100T50KA25 dataset at a support level of 10%). However, on the datasets with an average number of items per transaction of 50, 75, and 100, Apriori performs worst, and it can only mine down to support levels of 30%, 50%, and 70% respectively. These results confirm that, for dense datasets, if the support levels used are low, Apriori is infeasible.

FP-Growth performs worst at all support levels on the datasets with a low average number of items per transaction (i.e. 10 and 25). The fact that FP-Growth does not outperform Apriori on these two datasets shows that Apriori is more feasible than FP-Growth for sparse datasets. However, FP-Growth performs significantly better than Apriori for the larger average numbers of items per transaction.
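The dataset naming convention used in this section (e.g. I100T100KA10) can also be handled programmatically. The following sketch, with function names of our own choosing, parses such names and generates a crude stand-in dataset; note that it draws uniform random transactions, unlike the pattern-based model of the actual synthetic generator [13], so it only mimics the dimensions, not the density structure.

```python
import random
import re

def parse_dataset_name(name):
    # I<items>T<transactions>[K]A<average items per transaction>,
    # e.g. I100T100KA10 -> 100 items, 100,000 transactions, average 10.
    m = re.fullmatch(r"I(\d+)T(\d+)(K?)A(\d+)", name)
    if m is None:
        raise ValueError("not a dataset name: " + name)
    items = int(m.group(1))
    transactions = int(m.group(2)) * (1000 if m.group(3) else 1)
    avg_items = int(m.group(4))
    return items, transactions, avg_items

def generate_dataset(n_items, n_trans, avg_len, seed=0):
    # Uniform stand-in for the synthetic generator: every transaction
    # is a random sample of avg_len distinct items from the universe.
    rng = random.Random(seed)
    items = range(1, n_items + 1)
    return [rng.sample(items, min(avg_len, n_items)) for _ in range(n_trans)]

n_items, n_trans, avg_len = parse_dataset_name("I100T50KA25")
db = generate_dataset(n_items, 1000, avg_len)  # a 1,000-transaction slice
```

A harness built this way can loop over supports of 10% to 90% in increments of 10%, as in the experiments above.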
10000 1000 Runtime (s) logarithmic Runtime (s) logarithmic Connect4 Pumsb* 1000 100 100 10 10 1 1 10 20 30 40 50 60 70 80 90 10 20 30 40 50 60 70 Support (%) Support (%) 10000 100 Runtime (s) logarithmic Runtime (s) logarithmic Chess BMS-WebView-1 1000 100 10 10 1 1 0.1 0.01 0.1 10 20 30 40 50 60 70 80 90 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 Support (%) Support (%) 100 10 Runtime (s) logarithmic Runtime (s) logarithmic Mushroom BMS-WebView-2 10 1 1 0.1 0.1 10 20 30 40 50 60 70 80 90 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Support (%) Support (%) Figure 5: Performance Evaluation of CT-PRO Against Others on Various Datasets Both CT-PRO and OP have larger feasible 5.3. Comparison with Best Algorithms in the performance ranges compared to the other algorithms. OP FIMI Repository 2003 does not perform well on the sparse datasets I100T50KA10 and I100T100KA10. Its performance was For comparison with the best algorithms in the FIMI even worse than Apriori on this dataset. However, it Repository 2003 [17], we ported our algorithm CT-PRO performs better than Apriori and FP-Growth on other to a Linux operating system and compared it with their datasets. On the datasets with an average number of items two best algorithms: LCM [18] and kDCI [19]. We per transaction of 50, 75, and 100, FP-Growth, CT-PRO performed the experiments on a PC AMD Athlon XP and OP can mine down to a support level of 20%, 40%, 2000+ 1.6 GHz, 1 GB RAM, 2 GB Swap with 40GB Hard and 50% respectively. Disk running Fedora Core 1. All programs were compiled CT-PRO can be considered the best among all other using g++ compiler. algorithms as it generally performs the best at most Figure 7 shows the performance comparisons of support levels. However, as the support level gets lower, algorithms that were submitted to FIMI 2003 on Chess its performance is similar to OP. Only at a support level of and Connect4 datasets. The figures are taken from [17]. 
10%, OP occasionally runs slightly faster than CT-PRO On Chess dataset, kDCI is the best at a support level of (e.g. at a support level of 10% on the I100T50KA25 90% to a support level of 70%. Below that, LCM dataset). outperforms others. On Connect4 dataset, at a support 10 Runtime (s) logarithmic 10 Runtime (s) logarithmic I100T50KA10 I100T100KA10 1 1 0.1 0.1 10 20 30 40 50 60 70 80 90 10 20 30 40 50 60 70 80 90 Support (%) Support (%) 100 100 Runtime (s) logarithmic Runtime (s) logarithmic I100T50KA25 I100T100KA25 10 10 1 1 0.1 0.1 10 20 30 40 50 60 70 80 90 10 20 30 40 50 60 70 80 90 Support (%) Support (%) 10000 10000 Runtime (s) logarithmic Runtime (s) logarithmic 1000 I100T50KA50 I100T100KA50 1000 100 100 10 1 10 0.1 1 10 20 30 40 50 60 70 80 90 10 20 30 40 50 60 70 80 90 Support (%) Support (%) 10000 10000 Runtime (s) logarithmic Runtime (s) logarithmic I100T50KA75 I100T100KA75 1000 1000 100 100 10 10 1 1 10 20 30 40 50 60 70 80 90 10 20 30 40 50 60 70 80 90 Support (%) Support (%) 10000 10000 I100T50KA100 I100T100KA100 Runtime (s) logarithmic Runtime (s) logarithmic 1000 1000 100 100 10 10 1 1 10 20 30 40 50 60 70 80 90 10 20 30 40 50 60 70 80 90 Support (%) Support (%) Figure 6: Performance Evaluation on Various Synthetic Datasets level of 95% to a support level of 70%, kDCI is the best. Figure 8 shows the performance comparisons on Chess Below that, LCM outperforms others. We can conclude and Connect4 datasets. From these results, CT-PRO that for higher support levels, kDCI is the best, but for always outperforms others at high support levels. For lower support levels, LCM is the best. These best two lower support levels, the performances of these three algorithms are compared with CT-PRO. algorithms are similar. 
Since LCM and kDCI are the best algorithms in the FIMI Repository 2003 on the Chess and Connect4 datasets, we can conclude that CT-PRO outperforms all the other algorithms available in the FIMI Repository 2003 [17] on these datasets.

The kDCI algorithm [19] is a multiple-heuristics hybrid algorithm that is able to adapt its behaviour during execution. It extends the DCI (Direct Count and Intersect) algorithm [23] with adaptability to dataset-specific features. kDCI is also a resource-aware algorithm that can decide its mining strategy based on the hardware characteristics of the computing platform used. Moreover, kDCI uses the counting inference strategy originally proposed in [24].

The LCM (Linear time Closed itemset Miner) algorithm [18] uses the parent-child relationship defined on frequent itemsets. Its search tree technique is adapted from the algorithms for generating maximal bipartite cliques [25, 26] based on reverse search [27, 28]. In enumerating all frequent itemsets, LCM uses hybrid techniques involving occurrence deliver or diffsets [29], according to the density of the database.

Figure 8: Performance Comparisons of CT-PRO, LCM and kDCI on Chess and Connect4 Datasets (runtime in seconds, logarithmic scale, against support in %)

6. Conclusions

In this paper, we have described a new tree-based data structure named CFP-Tree that is more compact than the FP-Tree used in the FP-Growth algorithm. Depending on the database characteristics, the number of nodes in an FP-Tree can be up to twice as many as in the corresponding CFP-Tree for a given database. The CFP-Tree is used in our new algorithm CT-PRO for mining all frequent itemsets. CT-PRO divides the CFP-Tree into several projections, each also represented as a CFP-Tree, and then mines all frequent itemsets in each projection.
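The projection-based divide-and-conquer idea can be illustrated with a highly simplified sketch. To be clear about what is and is not the paper's method: CT-PRO traverses CFP-Tree projections bottom-up and non-recursively, whereas this sketch uses plain Python lists and recursion purely to show how projecting the database on each item lets every frequent itemset be enumerated exactly once:

```python
from collections import Counter

def pattern_growth(transactions, min_count, prefix=()):
    """Simplified projection-based pattern growth over plain lists.

    Illustrative only: the real CT-PRO mines CFP-Tree projections
    without recursion; here lists and recursion stand in for both.
    """
    counts = Counter(item for t in transactions for item in set(t))
    frequent = []
    for item, count in sorted(counts.items()):
        if count < min_count:
            continue
        itemset = prefix + (item,)
        frequent.append(itemset)
        # Project the database on `item`: keep only the transactions
        # containing it, restricted to items ordered after it, so no
        # itemset is generated twice.
        projection = [
            [i for i in t if i > item]
            for t in transactions if item in t
        ]
        frequent.extend(pattern_growth(projection, min_count, itemset))
    return frequent

db = [["a", "b", "c"], ["a", "c"], ["a", "d"], ["b", "c"]]
print(pattern_growth(db, 2))
# [('a',), ('a', 'c'), ('b',), ('b', 'c'), ('c',)]
```

Each call on a projection corresponds to mining one sub-database; in CT-PRO the projections are themselves CFP-Trees and the traversal is managed iteratively rather than through the call stack.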
Figure 7: Performance Comparisons of Algorithms available in the FIMI 2003 Repository on Chess and Connect4 Datasets [17]

CT-PRO was explained in detail using a running example, and the best-case and worst-case time complexities of the algorithm were also presented. Performance comparisons of CT-PRO against other well-known algorithms, including Apriori [14, 15], FP-Growth [10], and OpportuneProject (OP) [11], were also reported. The results show that CT-PRO outperforms the other algorithms at all support levels on dense datasets and also works well on sparse datasets.

Extensive experiments to measure the feasible performance range of the algorithms were also presented in this paper. A synthetic data generator was used to generate several datasets with varying numbers of transactions and average numbers of items per transaction. The best available algorithms, including CT-PRO, Apriori, FP-Growth, and OP, were then tested on those datasets. The results show that CT-PRO generally outperforms the others.

In addition, to relate our research to the last workshop on frequent itemset mining implementations [17], we selected the two best algorithms (LCM and kDCI) from the FIMI Repository 2003 and compared their performance with CT-PRO. It was shown that CT-PRO performed better than the others.

[8] R. Agrawal and R. Srikant, "Mining Sequential Patterns", Proceedings of the 11th International Conference on Data Engineering (ICDE), Taipei, Taiwan, 1995.
[9] J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang, "H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases", Proceedings of the IEEE International Conference on Data Mining (ICDM), San Jose, California, 2001.
[10] J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns without Candidate Generation", Proceedings of the ACM SIGMOD International Conference on Management of Data, Dallas, TX, 2000.
[11] J. Liu, Y. Pan, K. Wang, and J. Han, "Mining Frequent Item Sets by Opportunistic Projection", Proceedings of ACM SIGKDD, Edmonton, Alberta, Canada, 2002.
[12] R. P. Gopalan and Y. G.
Sucahyo, "High Performance Frequent Pattern Extraction using Compressed FP-Trees", Proceedings of the SIAM International Workshop on High Performance and Distributed Mining (HPDM), Orlando, USA, 2004.
[13] IBM, "Synthetic Data Generation Code for Associations and Sequential Patterns", Intelligent Information Systems, IBM Almaden Research Center, 2002, http://www.almaden.ibm.com/software/quest/Resources/index.shtml.
[14] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules", Proceedings of the 20th International Conference on Very Large Data Bases, Santiago, Chile, 1994.
[15] C. Borgelt, "Efficient Implementations of Apriori and Eclat", Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI), Melbourne, Florida, 2003.
[16] C. L. Blake and C. J. Merz, "UCI Repository of Machine Learning Databases", Irvine, CA: University of California, Department of Information and Computer Science, 1998.
[17] FIMI, "FIMI Repository", 2003, http://fimi.cs.helsinki.fi.
[18] T. Uno, T. Asai, Y. Uchida, and H. Arimura, "LCM: An Efficient Algorithm for Enumerating Frequent Closed Item Sets", Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI), Melbourne, Florida, 2003.

7. Acknowledgement

We are very grateful to Jian Pei for providing the executable code of FP-Growth, Bart Goethals for providing his FP-Growth program, Christian Borgelt for the Apriori program, and Junqiang Liu for the OpportuneProject program.

8. References

[1] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules between Sets of Items in Large Databases", Proceedings of the ACM SIGMOD International Conference on Management of Data, Washington DC, 1993, pp. 207-216.
[2] J. Hipp, U. Guntzer, and G.
Nakhaeizadeh, "Algorithms for Association Rule Mining - A General Survey and Comparison", SIGKDD Explorations, vol. 2, pp. 58-64, July 2000.
[3] Z. Zheng, R. Kohavi, and L. Mason, "Real World Performance of Association Rule Algorithms", Proceedings of the 7th International Conference on Knowledge Discovery and Data Mining (KDD), New York, 2001.
[4] B. Goethals, "Efficient Frequent Pattern Mining", PhD Thesis, University of Limburg, Belgium, 2002.
[5] B. Liu, W. Hsu, and Y. Ma, "Integrating Classification and Association Rule Mining", Proceedings of ACM SIGKDD, New York, NY, 1998.
[6] Y. G. Sucahyo and R. P. Gopalan, "Building a More Accurate Classifier Based on Strong Frequent Patterns", Proceedings of the 17th Australian Joint Conference on Artificial Intelligence, Cairns, Australia, 2004.
[7] K. Wang, X. Chu, and B. Liu, "Clustering Transactions Using Large Items", Proceedings of ACM CIKM, USA, 1999.
[19] C. Lucchese, S. Orlando, P. Palmerini, R. Perego, and F. Silvestri, "kDCI: a Multi-Strategy Algorithm for Mining Frequent Sets", Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI), Melbourne, Florida, 2003.
[20] Y. G. Sucahyo and R. P. Gopalan, "CT-ITL: Efficient Frequent Item Set Mining Using a Compressed Prefix Tree with Pattern Growth", Proceedings of the 14th Australasian Database Conference, Adelaide, Australia, 2003.
[21] C. Borgelt and R. Kruse, "Induction of Association Rules: Apriori Implementation", Proceedings of the 15th Conference on Computational Statistics, Berlin, Germany, 2002.
[22] B. Goethals, "Home page of Bart Goethals", 2003, http://www.cs.helsinki.fi/u/goethals.
[23] S. Orlando, P. Palmerini, R. Perego, and F. Silvestri, "Adaptive and Resource-Aware Mining of Frequent Sets", Proceedings of the IEEE International Conference on Data Mining (ICDM), Maebashi City, Japan, 2002.
[26] T.
Uno, "Fast Algorithms for Enumerating Cliques in Huge Graphs", Research Group of Computation, IEICE, Kyoto University, 2003.
[24] Y. Bastide, R. Taouil, N. Pasquier, G. Stumme, and L. Lakhal, "Mining Frequent Patterns with Counting Inference", ACM SIGKDD Explorations, vol. 2, pp. 66-75, December 2000.
[25] T. Uno, "A Practical Fast Algorithm for Enumerating Cliques in Huge Bipartite Graphs and Its Implementation", Proceedings of the 89th Special Interest Group of Algorithms, Information Processing Society, Japan, 2003.
[27] D. Avis and K. Fukuda, "Reverse Search for Enumeration", Discrete Applied Mathematics, vol. 65, pp. 21-46, 1996.
[28] T. Uno, "A New Approach for Speeding Up Enumeration Algorithms", Proceedings of ISAAC'98, 1998.
[29] M. J. Zaki and C. Hsiao, "CHARM: An Efficient Algorithm for Closed Itemset Mining", Proceedings of the SIAM International Conference on Data Mining (SDM), Arlington, VA, 2002.