COFI-tree Mining: A New Approach to Pattern Growth with Reduced Candidacy Generation

Mohammad El-Hajj
Department of Computing Science
University of Alberta, Edmonton, AB, Canada
mohammad@cs.ualberta.ca

Osmar R. Zaïane
Department of Computing Science
University of Alberta, Edmonton, AB, Canada
zaiane@cs.ualberta.ca

Abstract

Existing association rule mining algorithms suffer from many problems when mining massive transactional datasets. Some of these major problems are: (1) the repetitive I/O disk scans, (2) the huge computation involved during the candidacy generation, and (3) the high memory dependency. This paper presents the implementation of our frequent itemset mining algorithm, COFI, which achieves its efficiency by applying four new ideas. First, it mines using a compact memory-based data structure. Second, for each frequent item, a relatively small independent tree is built summarizing co-occurrences. Third, clever pruning reduces the search space drastically. Finally, a simple and non-recursive mining process reduces the memory requirements, as only minimal candidacy generation and counting is needed to generate all relevant frequent patterns.

1 Introduction

Frequent pattern discovery has become a common topic of investigation in the data mining research area. Its main theme is to discover the sets of items that occur together more than a given threshold defined by the decision maker. A well-known application domain that counts on frequent pattern discovery is market basket analysis. In most cases, when the support threshold is low and the number of frequent patterns "explodes", the discovery of these patterns becomes problematic for reasons such as high memory dependencies, huge search space, and massive I/O required. However, new studies have recently been proposed to reduce the memory requirements [8] and to decrease the I/O dependencies [7]; still, more promising issues need to be investigated, such as pruning techniques to reduce the search space. In this paper we introduce a new method for frequent pattern discovery that is based on the Co-Occurrence Frequent Item tree concept [8, 9]. The newly proposed method uses a pruning technique that dramatically saves memory space. These relatively small trees are constructed based on a memory-based structure called the FP-Tree [11]. This data structure is studied in detail in the following sections. In short, we introduced in [8] the COFI-tree structure and an algorithm to mine it. In [7] we presented a disk-based data structure, the inverted matrix, that replaces the memory-based FP-Tree and scales interactive frequent pattern mining significantly. Our contributions in this paper are the introduction of a clever pruning technique based on an interesting property drawn from our top-down approach, and some implementation tricks and issues. We included the pruning in the algorithm that builds the tree, so that the pruning is done on the fly.

1.1 Problem Statement

The problem of mining association rules over market basket analysis was introduced in [2]. The problem consists of finding associations between items or itemsets in transactional data. The data could be retail sales in the form of customer transactions or even medical images [16]. Association rules have been shown to be useful for other applications such as recommender systems, diagnosis, decision support, telecommunication, and even supervised classification [5]. Formally, as defined in [3], the problem is stated as follows: Let I = {i1, i2, ..., im} be a set of literals, called items, where m is considered the dimensionality of the problem. Let D be a set of transactions, where each transaction T is a set of items such that T ⊆ I. A unique identifier TID is given to each transaction. A transaction T is said to contain X, a set of items in I, if X ⊆ T. An association rule is an implication of the form "X ⇒ Y", where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅. An itemset X is said to be large or frequent if its support s is greater than or equal to a given minimum support threshold σ. An itemset X satisfies a constraint C if and only if C(X) is true. The rule X ⇒ Y has a support s in the transaction set D if s% of the transactions in D contain X ∪ Y. In other words, the support of the rule is the probability that X and Y hold together among all the possible presented cases. It is said that the rule X ⇒ Y holds in the transaction set D with confidence c if c% of the transactions in D that contain X also contain Y. In other words, the confidence of the rule is the conditional probability that the consequent Y is true under the condition of the antecedent X.
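For illustration, consider a small worked example with invented numbers (not taken from the running example used later in this paper): if D contains 1,000 transactions, 250 of which contain X and 200 of which contain both X and Y, then the rule X ⇒ Y has support s = 200/1000 = 20% and confidence c = 200/250 = 80%. With a minimum support threshold σ = 25% the itemset X ∪ Y would not be frequent, whereas with σ = 10% it would be.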
The problem of discovering all association rules from a set of transactions D consists of generating the rules that have a support and confidence greater than a given threshold. These rules are called strong rules. This association-mining task can be broken into two steps:
1. A step for finding all frequent k-itemsets, known for its extreme I/O scan expense and massive computational costs;
2. A straightforward step for generating strong rules.
In this paper and our attached code, we focus exclusively on the first step: generating frequent itemsets.

1.2 Related Work

Several algorithms have been proposed in the literature to address the problem of mining association rules [12, 10]. One of the key algorithms, which seems to be the most popular in many applications for enumerating frequent itemsets, is the apriori algorithm [3]. This apriori algorithm also forms the foundation of most known algorithms. It uses an anti-monotone property stating that for a k-itemset to be frequent, all its (k-1)-itemsets have to be frequent. The use of this fundamental property reduces the computational cost of candidate frequent itemset generation. However, in the case of extremely large input sets with a big frequent 1-itemset, the Apriori algorithm still suffers from two main problems: repeated I/O scanning and high computational cost. One major hurdle observed with most real datasets is the sheer size of the candidate frequent 2-itemsets and 3-itemsets.

TreeProjection is an efficient algorithm presented in [1]. This algorithm builds a lexicographic tree in which each node of the tree presents a frequent pattern. The authors report that their algorithm is one order of magnitude faster than the existing techniques in the literature. Another innovative approach to discovering frequent patterns in transactional databases, FP-Growth, was proposed by Han et al. in [11]. This algorithm creates a compact tree structure, the FP-Tree, representing frequent patterns, that alleviates the multi-scan problem and improves candidate itemset generation. The algorithm requires only two full I/O scans of the dataset to build the prefix tree in main memory and then mines this structure directly. The authors of this algorithm report that it is faster than the Apriori and TreeProjection algorithms. Mining the FP-Tree structure is done recursively by building conditional trees that are of the same order of magnitude in number as the frequent patterns. This massive creation of conditional trees makes this algorithm not scalable to mine large datasets beyond a few million transactions. In [14] the same authors propose a new algorithm, H-mine, that invokes FP-Tree to mine condensed data. This algorithm is still not scalable, as reported by its authors in [13].

1.3 Preliminaries, Motivations and Contributions

The Co-Occurrence Frequent Item tree (or COFI-tree for short) and the COFI algorithm presented in this paper are based on our previous work in [7, 8]. The main motivation of our current research is the pruning technique that reduces the memory space needed by the COFI-trees. The presented algorithm is done in two phases, in which phase 1 requires two full I/O scans of the transactional database to build the FP-Tree structure [11]. The second phase starts by building small Co-Occurrence Frequent trees for each frequent item. These trees are pruned first to eliminate any non-frequent items with respect to the COFI-tree base frequent item. Finally the mining process is executed.

The remainder of this paper is organized as follows: Section 2 describes the Frequent Pattern tree, its design and construction. Section 3 illustrates the design, construction, pruning, and mining of the Co-Occurrence Frequent Item trees. Experimental results are given in Section 4. Section 5 presents the implementation of this algorithm. Finally, Section 6 concludes by discussing some issues and highlighting our future work.
2 Frequent Pattern Tree: Design and Construction

The COFI-tree approach we propose consists of two main stages. Stage one is the construction of a modified Frequent Pattern tree. Stage two is the repetitive building of small data structures, the actual mining of these data structures, and their release.

2.1 Construction of the Frequent Pattern Tree

The goal of this stage is to build the compact data structure called the Frequent Pattern Tree [11]. This construction is done in two phases, where each phase requires a full I/O scan of the dataset. A first initial scan of the database identifies the frequent 1-itemsets. The goal is to generate an ordered list of frequent items that is used when building the tree in the second phase.

This phase starts by enumerating the items appearing in the transactions. After enumerating these items (i.e. after reading the whole dataset), infrequent items with a support less than the support threshold are weeded out and the remaining frequent items are sorted by their frequency. This list is organized in a table, called the header table, where the items and their respective supports are stored along with pointers to the first occurrence of each item in the frequent pattern tree. Phase 2 then constructs the frequent pattern tree.

[Figure 1. Steps of phase 1. Step 1 counts all items: A 11, B 10, C 10, D 9, G 4, E 8, H 3, F 7, M 3, N 3, O 3, P 3, Q 2, R 2, I 3, K 3, L 3, J 3. Step 2 keeps only the frequent items: A 11, B 10, C 10, D 9, E 8, F 7. Step 3 sorts them by support: F 7, E 8, D 9, C 10, B 10, A 11.]

Phase 2 of constructing the Frequent Pattern tree structure is the actual building of this compact tree. This phase requires a second complete I/O scan of the dataset. For each transaction read, only the set of frequent items present in the header table is collected and sorted in descending order according to their frequency. These sorted transaction items are used in constructing the FP-Tree as follows: for the first item of the sorted transaction, check if it exists as one of the children of the root. If it exists, then increment the support of this node. Otherwise, add a new node for this item as a child of the root node with support 1. Then consider the current item-node as the new temporary root and repeat the same procedure with the next item of the sorted transaction. During the process of adding any new item-node to the FP-Tree, a link is maintained between this item-node in the tree and its entry in the header table. The header table holds one pointer per item that points to the first occurrence of this item in the FP-Tree structure.
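To make the phase-2 insertion procedure concrete, the following is a minimal C++ sketch of inserting one filtered, sorted transaction into an FP-Tree. The names (FPNode, HeaderEntry, insertTransaction) are ours for illustration only; the submitted code described in Section 5 uses different structures (child/brother linked lists rather than the map used here).

#include <map>
#include <vector>

// Minimal FP-Tree node for this sketch; -1 marks the root.
struct FPNode {
    long item = -1;                    // item identifier
    long count = 0;                    // support of the path ending at this node
    FPNode* parent = nullptr;          // upward link (our FP-Tree links are bi-directional)
    std::map<long, FPNode*> children;  // children indexed by item
    FPNode* nextSameItem = nullptr;    // horizontal chain of nodes with the same item
};

// One header-table entry: global support and first occurrence in the tree.
struct HeaderEntry {
    long support = 0;
    FPNode* first = nullptr;
};

// Insert a transaction that has already been filtered to frequent items and
// sorted in descending order of global support (phase 2 of the construction).
void insertTransaction(FPNode* root, const std::vector<long>& sortedItems,
                       std::map<long, HeaderEntry>& header) {
    FPNode* current = root;
    for (long item : sortedItems) {
        auto it = current->children.find(item);
        if (it != current->children.end()) {
            it->second->count += 1;               // shared prefix: increment support
            current = it->second;
        } else {
            FPNode* node = new FPNode();          // new sub-path starts here
            node->item = item;
            node->count = 1;
            node->parent = current;
            node->nextSameItem = header[item].first;  // chain into this item's list
            header[item].first = node;
            current->children[item] = node;
            current = node;
        }
    }
}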
Table 1. Transactional database
T.No.  Items
T1     A G D C B
T2     B C H E D
T3     B D E A M
T4     C E F A N
T5     A B N O P
T6     A C Q R G
T7     A C H I G
T8     L E F K B
T9     A F M N O
T10    C F P G R
T11    A D B H I
T12    D E B K L
T13    M D C G O
T14    C F P Q J
T15    B D E F I
T16    J E B A D
T17    A K E F C
T18    C D L B A

2.2 Illustrative Example

For illustration, we use an example with the transactions shown in Table 1. Let the minimum support threshold be set to 4. Phase 1 starts by accumulating the support of all items that occur in the transactions. Step 2 of phase 1 removes all non-frequent items, in our example (G, H, I, J, K, L, M, N, O, P, Q and R), leaving only the frequent items (A, B, C, D, E, and F). Finally, all frequent items are sorted according to their support to generate the sorted frequent 1-itemset. This last step, shown in Figure 1, ends phase 1 of the COFI-tree algorithm and starts the second phase. In phase 2, the first transaction (A, G, D, C, B) is filtered to consider only the frequent items that occur in the header table (i.e. A, D, C and B). This frequent list is sorted according to the items' supports (A, B, C and D). This ordered transaction generates the first path of the FP-Tree, with all item-node supports initially equal to 1. A link is established between each item-node in the tree and its corresponding item entry in the header table. The same procedure is executed for the second transaction (B, C, H, E, and D), which yields a sorted frequent item list (B, C, D, E) that forms the second path of the FP-Tree. Transaction 3 (B, D, E, A, and M) yields the sorted frequent item list (A, B, D, E) that shares the same prefix (A, B) with an existing path in the tree. The supports of the item-nodes (A and B) are incremented by 1, making the support of (A) and (B) equal to 2, and a new sub-path is created with the remaining items on the list (D, E), all with support equal to 1. The same process occurs for all transactions until we build the FP-Tree for the transactions given in Table 1. Figure 2 shows the result of the tree building process. Notice that in our tree structure, contrary to the original FP-tree [11], our links are bi-directional. This, and other differences presented later, are used by our mining algorithm.

[Figure 2. Frequent Pattern Tree built from the transactions of Table 1.]

3 Co-Occurrence Frequent-Item-trees: Construction, Pruning and Mining

Our approach for computing frequencies relies first on building independent, relatively small trees for each frequent item in the header table of the FP-Tree, called COFI-trees. A pruning technique is applied to remove all non-frequent items with respect to the main frequent item of the tested COFI-tree. Then we mine each one of the trees separately as soon as it is built, minimizing the candidacy generation and without building conditional sub-trees recursively.
The trees are discarded as soon as they are mined. At any given time, only one COFI-tree is present in main memory. In our following examples we always assume that we are building the COFI-trees based on the modified FP-Tree data structure presented above.

3.1 Pruning the COFI-trees

Pruning can be done after building a tree or, even better, while building it. We opted for pruning on the fly since the overhead is minimal but the consequence is a drastic reduction in memory requirements. We will discuss the pruning idea, then present the building algorithm that performs the pruning on the fly.

In this section we introduce a new anti-monotone property called the global frequent/local non-frequent property. This property is similar to the Apriori one in the sense that it eliminates at the ith level all non-frequent items that will not participate in the (i+1) level of candidate itemset generation. The difference between the two properties is that we extended our property to also eliminate frequent items which are among the i-itemsets and which we are sure will not participate in the (i+1) candidate set. The Apriori property states that all nonempty subsets of a frequent itemset must also be frequent. An example is given later in this section to illustrate both properties. In our approach, we are trying to find all frequent patterns with respect to one frequent item, which is the base item of the tested COFI-tree. We already know that all items that participate in the creation of the COFI-tree are frequent with respect to the global transaction database, but that does not mean that they are also locally frequent with respect to the base item of the COFI-tree. The global frequent/local non-frequent property states that all nonempty subsets of a frequent itemset with respect to the item A of the A-COFI-tree must also be frequent with respect to item A. For each frequent item A we traverse the FP-Tree to find all frequent items that occur with A in at least one transaction (or branch in the FP-Tree), together with their number of occurrences. All items that are locally frequent with item A will participate in building the A-COFI-tree; the other globally frequent but locally non-frequent items will not participate in the creation of the A-COFI-tree. In our example we can find that all items that participate in the creation of the F-COFI-tree are locally non-frequent with respect to item F, as the supports of all these items are not greater than the support threshold σ, which is equal to 4 (Figure 3). From knowing this, there is no need to mine the F-COFI-tree; we already know that no frequent patterns other than the item F will be generated. We can extend our knowledge at this stage to know that item F will not appear in any of the frequent patterns. The COFI-tree for item E indicates that only items D and B are frequent with respect to item E, which means that there is no need to test patterns such as EC and EA. The COFI-tree for item D indicates that item C will be eliminated, as it is not frequent with respect to item D. The C-COFI-tree ignores item B for the same reason. To sum up, the Apriori property states that in our example of 6 frequent 1-itemsets we need to generate 15 candidate 2-itemsets, which are (A,B), (A,C), (A,D), (A,E), (A,F), (B,C), (B,D), (B,E), (B,F), (C,D), (C,E), (C,F), (D,E), (D,F), (E,F). Using our property we have eliminated (not generated or counted) 9 patterns, which are (A,E), (A,F), (B,C), (B,F), (C,D), (C,E), (C,F), (D,F), (E,F), leaving only 6 patterns to test, which are (A,B), (A,C), (A,D), (B,D), (B,E), (D,E).
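As a sketch of how this property can be applied while collecting the items for an A-COFI-tree, the following C++ fragment walks the horizontal chain of item A in the (bi-directional) FP-Tree, credits every ancestor item with the support of the A node below it, and keeps only the locally frequent ones. The names (localSupport, locallyFrequentItems) and the minimal FPNode view are ours for illustration and are not the authors' submitted code.

#include <map>
#include <vector>

// Minimal view of an FP-Tree node, as needed for this sketch; -1 marks the root.
struct FPNode {
    long item;
    long count;
    FPNode* parent;        // upward link
    FPNode* nextSameItem;  // horizontal chain of nodes labelled with the same item
};

// Follow the chain of item A and count, for each item above A on a path, how
// often it co-occurs with A; each ancestor is credited with the support of the
// A node of that path (the frequency of the branch is the frequency of A).
std::map<long, long> localSupport(FPNode* firstA) {
    std::map<long, long> counts;
    for (FPNode* a = firstA; a != nullptr; a = a->nextSameItem)
        for (FPNode* p = a->parent; p != nullptr && p->item != -1; p = p->parent)
            counts[p->item] += a->count;
    return counts;
}

// Keep only the items that are locally frequent with respect to A: only these
// participate in building the A-COFI-tree (the pruning done on the fly).
// This assumes the "support >= sigma" convention of Section 1.1.
std::vector<long> locallyFrequentItems(FPNode* firstA, long sigma) {
    std::vector<long> keep;
    for (const auto& entry : localSupport(firstA))
        if (entry.second >= sigma) keep.push_back(entry.first);
    return keep;
}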
3.2 Construction of the Co-Occurrence Frequent-Item-trees

The small COFI-trees we build are similar to the conditional FP-Trees [11] in general, in the sense that they have a header with ordered frequent items and horizontal pointers pointing to a succession of nodes containing the same frequent item, and the prefix tree per se with paths representing sub-transactions. However, the COFI-trees have bi-directional links in the tree allowing bottom-up scanning as well, and the nodes contain not only the item label and a frequency counter, but also a participation counter, as explained later in this section. The COFI-tree for a given frequent item x contains only nodes labeled with items that are more frequent than or as frequent as x.
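The node and header layout just described can be summarized in the following C++ sketch; the type and field names (CofiNode, CofiHeaderEntry, CofiTree) are ours for illustration and do not correspond to the identifiers of the submitted code discussed in Section 5.

#include <vector>

// A COFI-tree node: item label, a support counter, a participation counter used
// during mining, bi-directional vertical links, and a horizontal link to the
// next node labelled with the same item.
struct CofiNode {
    long item;
    long supportCount = 0;            // how often this path was observed
    long participationCount = 0;      // how much of that support has already been
                                      // consumed by generated patterns
    CofiNode* parent = nullptr;       // upward link
    std::vector<CofiNode*> children;  // downward links
    CofiNode* nextSameItem = nullptr; // horizontal chain for this item
};

// One header cell: a locally frequent item, its local support, and a pointer to
// its first node in the tree; the header is kept in ascending order of global
// support.
struct CofiHeaderEntry {
    long item;
    long localSupport;
    CofiNode* firstNode = nullptr;
};

// The COFI-tree of one frequent item: a root labelled with that item plus the
// header list of the locally frequent items participating in the tree.
struct CofiTree {
    CofiNode root;
    std::vector<CofiHeaderEntry> header;
};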
To illustrate the idea of the COFI-trees, we explain step by step the process of creating COFI-trees for the FP-Tree of Figure 2. With our example, the first Co-Occurrence Frequent Item tree is built for item F, as it is the least frequent item in the header table. In this tree for F, all frequent items which are more frequent than F and share transactions with F participate in building the tree. They can be found by following the chain of item F in the FP-Tree structure. The F-COFI-tree starts with the root node containing the item in question; then a scan of part of the FP-Tree is applied, following the chain of the F item in the FP-Tree. The first branch FA has a frequency of 1, as the frequency of the branch is the frequency of the test item, which is F. The goal of this traversal is to count the frequency of each frequent item with respect to item F. By doing so we can find that item E occurs 4 times, D occurs 2 times, C occurs 4 times, B 2 times, and A 3 times; by applying the anti-monotone constraint property we can predict that item F will never appear in any frequent pattern except itself. Consequently there is no need to continue building the F-COFI-tree.

The next frequent item to test is E. The same process is done to compute the frequency of each frequent item with respect to item E. From this we can find that only two globally frequent items are also locally frequent, which are (D:5 and B:6). For each sub-transaction or branch in the FP-Tree containing item E with other locally frequent items that are more frequent than E, which are parent nodes of E, a branch is formed starting from the root node E. The support of this branch is equal to the support of the E node in its corresponding branch in the FP-Tree. If multiple frequent items share the same prefix, they are merged into one branch and a counter for each node of the tree is adjusted accordingly. Figure 3 illustrates all COFI-trees for the frequent items of Figure 2. In Figure 3, the rectangular nodes are nodes from the tree with an item label and two counters. The first counter is a support-count for that node, while the second counter, called the participation-count, is initialized to 0 and is used by the mining algorithm discussed later. Each node also has a horizontal link which points to the next node that has the same item-name in the tree, and a bi-directional vertical link that links a child node with its parent and a parent with its child. The bi-directional pointers facilitate the mining process by making the traversal of the tree easier. The squares are cells from the header table, as with the FP-Tree. This is a list made of all frequent items that participate in building the tree structure, sorted in ascending order of their global support. Each entry in this list contains the item-name, an item-counter, and a pointer to the first node in the tree that has the same item-name.

[Figure 3. COFI-trees for the frequent items of Figure 2.]

To explain the COFI-tree building process, we highlight the building steps for the E-COFI-tree in Figure 3. Frequent item E is read from the header table and its first location in the FP-Tree is found using the pointer in the header table. The first location of item E indicates that it shares a branch with items CA, with support = 2; since none of these items are locally frequent, only the support of the E root node is incremented by 2. The second node of item E indicates that it shares items DBA with support equal to 2 for this branch, as the support of the E item is considered the support of this branch (following the upper links for this item). Two nodes are created, for items D and B, with support equal to 2; D is a child node of B, and B is a child node of E. The third location of E indicates EDB:1, which shares an existing branch in the E-COFI-tree, and all counters are adjusted accordingly. A new branch of EB:1 is created, as the support of E is 1 for the fourth occurrence of E. The final occurrence, EDB:2, uses an existing branch and only counters are adjusted. Like with FP-Trees, the header constitutes a list of all frequent items to maintain the location of the first entry of each item in the COFI-tree. A link is also made for each node in the tree that points to the next location of the same item in the tree, if it exists. The mining process is the last step done on the E-COFI-tree before removing it and creating the next COFI-tree for the next item in the header table.

3.3 Mining the COFI-trees

The COFI-trees of all frequent items are not constructed together. Each tree is built, mined, then discarded before the next COFI-tree is built. The mining process is done for each tree independently with the purpose of finding all frequent k-itemset patterns in which the item at the root of the tree participates.

The steps to produce the frequent patterns related to item E, for example (the F-COFI-tree will not be mined, based on the pruning results found in the previous step), are illustrated in Figure 4. From each branch of the tree, using the support-count and the participation-count, candidate frequent patterns are identified and stored temporarily in a list. The non-frequent ones are discarded at the end, when all branches are processed. The mining process for the E-COFI-tree starts from the most locally frequent item in the header table of the tree, which is item B. Item B exists in two branches in the E-COFI-tree, which are (B:5, D:5 and E:8) and (B:1 and E:8). The frequency of each branch is the frequency of the first item in the branch minus the participation value of the same node. Item B in the first branch has a frequency value of 5 and a participation value of 0, which makes the frequency of the first pattern EDB equal to 5. The participation values of all nodes in this branch are incremented by 5, which is the frequency of this pattern. From the first pattern EDB:5 we need to generate all sub-patterns in which item E participates, which are ED:5, EB:5, and EDB:5. The second branch that contains B generates the pattern EB:1. EB already exists and its counter is adjusted to become 6. The COFI-tree of item E can be removed at this time and another tree can be generated and tested to produce all the frequent patterns related to its root node. The same process is executed to generate the frequent patterns. The D-COFI-tree (Figure 5) is created after the E-COFI-tree. Mining this tree generates the following frequent patterns: DBA:5, DA:5, and DB:8. The same process occurs for the remaining trees, which produce AC:6 for the C-COFI-tree and BA:6 for the B-COFI-tree.

[Figure 4. Steps needed to generate frequent patterns related to item E (resulting frequent patterns: ED:5, EB:6, EDB:5).]

[Figure 5. Steps needed to generate frequent patterns related to item D (resulting frequent patterns: DBA:5, DB:8, DA:5).]

The following is our algorithm for building and mining the COFI-trees with pruning.
Algorithm COFI: Creating with pruning and Mining COFI-trees
Input: modified FP-Tree, a minimum support threshold σ
Output: Full set of frequent patterns
Method:
1. A = the least frequent item in the header table of the FP-Tree
2. While (there are still frequent items) do
   2.1 Count the frequency of all items that share a path with item (A). The frequency of all items that share the same path is the same as the frequency of the (A) item
   2.2 Remove all non-locally-frequent items from the frequent list of item (A)
   2.3 Create a root node for the (A)-COFI-tree with both frequency-count and participation-count = 0
       2.3.1 C is the path of locally frequent items on the path from item A to the root
       2.3.2 Items on C form a prefix of the (A)-COFI-tree
       2.3.3 If the prefix is new then set frequency-count = frequency of the (A) node and participation-count = 0 for all nodes in the path
       Else
       2.3.4 Adjust the frequency-count of the already existing part of the path
       2.3.5 Adjust the pointers of the header list if needed
       2.3.6 Find the next node of item A in the FP-Tree and go to 2.3.1
   2.4 MineCOFI-tree(A)
   2.5 Release the (A)-COFI-tree
   2.6 A = next frequent item from the header table
3. Goto 2

Function: MineCOFI-tree(A)
1. nodeA = select next node // Selection of nodes starts with the node of the most locally frequent item and follows its chain, then the next less frequent item with its chain, until we reach the least frequent item in the header list of the (A)-COFI-tree
2. While there are still nodes do
   2.1 D = set of nodes from nodeA to the root
   2.2 F = nodeA.frequency-count − nodeA.participation-count
   2.3 Generate all candidate patterns X from items in D. Patterns that do not contain A are discarded
   2.4 Patterns in X that do not exist in the A-Candidate List are added to it with frequency = F; otherwise their frequency is incremented by F
   2.5 Increment the value of participation-count by F for all items in D
   2.6 nodeA = select next node
3. Goto 2
4. Based on the support threshold σ, remove non-frequent patterns from the A-Candidate List
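The branch-processing step of MineCOFI-tree can be sketched in C++ as follows. The node fields follow the CofiNode sketch given earlier (re-declared here so the fragment is self-contained), the A-Candidate List is stood in for by a std::map keyed on sorted item vectors, and the final removal of non-frequent patterns (step 4) is not shown; none of these names come from the authors' submitted code.

#include <algorithm>
#include <cstddef>
#include <map>
#include <vector>

struct CofiNode {            // minimal node view, as sketched in Section 3.2
    long item;
    long supportCount;
    long participationCount;
    CofiNode* parent;
};

using Pattern = std::vector<long>;              // items of a pattern, kept sorted
using CandidateList = std::map<Pattern, long>;  // pattern -> accumulated frequency

// Process one branch of the A-COFI-tree, starting from `node` and walking up to
// the root: compute the yet unconsumed frequency F, credit F to every candidate
// pattern containing A that can be drawn from the branch, then mark the branch
// as having participated with F.
void mineBranch(CofiNode* node, long rootItem, CandidateList& candidates) {
    long f = node->supportCount - node->participationCount;
    if (f <= 0) return;                         // branch already fully consumed

    std::vector<CofiNode*> path;                // nodes from `node` up to the root
    std::vector<long> items;                    // the non-root items on that path
    for (CofiNode* p = node; p != nullptr; p = p->parent) {
        path.push_back(p);
        if (p->item != rootItem) items.push_back(p->item);
    }

    // Enumerate every non-empty subset of the non-root items; together with the
    // root item A each subset forms one candidate pattern (patterns without A
    // are never generated).  COFI-trees are small after pruning, so this subset
    // enumeration stays cheap.
    const std::size_t n = items.size();
    for (std::size_t mask = 1; mask < (std::size_t(1) << n); ++mask) {
        Pattern pattern{rootItem};
        for (std::size_t i = 0; i < n; ++i)
            if (mask & (std::size_t(1) << i)) pattern.push_back(items[i]);
        std::sort(pattern.begin(), pattern.end());
        candidates[pattern] += f;               // add new pattern or increment by F
    }

    for (CofiNode* p : path) p->participationCount += f;
}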
4 Experimental Studies

To study the COFI-tree mining strategies we have conducted several experiments on a variety of data sizes, comparing our approach with the well-known FP-Growth [11] algorithm written by its original authors. The experiments were conducted on a 2.6 GHz CPU machine with 2 GB of memory running the Windows 2000 operating system. Transactions were generated using the IBM synthetic data generator [4]. We have conducted several types of experiments to test the effect of changing the support, transaction size, dimension, and transaction length. The first set of experiments was run on a transaction database of 500K transactions with a dimension of 10K, where the average transaction length was 12. We varied the support from an absolute value of 500 down to 2, for which the number of frequent patterns generated varied from 15K to 3400K patterns. FP-Growth could not mine the last experiment in this set as it used all available memory space. In all experiments the COFI-tree approach outperforms the FP-Growth approach. The major accomplishment of our approach is in the memory space saved. Our algorithm outperforms FP-Growth by one order of magnitude in terms of memory space requirements. We have also tested the memory space used during the mining process only (i.e., isolating the memory space used to create the FP-Tree, which both FP-Growth and our COFI-tree algorithm build). We found that the COFI-tree approach also outperforms FP-Growth by one order of magnitude in terms of the memory space used by the COFI-trees compared with the conditional trees used by FP-Growth during the mining process. Figure 6A presents the time needed to mine 500K transactions using different support levels. Figure 6B depicts the memory needed during the mining process of the previous experiments. Figure 6C illustrates the memory needed by the COFI-trees and conditional trees during the mining process.

[Figure 6. Mining dataset of 500K transactions (no. of transactions = 500K, dimension = 10K, average no. of items per transaction = 12): (A) runtime in seconds, (B) total memory requirement in KB, (C) memory requirement in KB without the FP-tree, each as a function of the support %.]

Other experiments were conducted to test the effect of changing the dimension, transaction size, and transaction length using the same support, which is 0.05%. Some of these experiments are represented in Figure 7. Figures 7A and 7B represent the time needed during the mining process. Figures 7C and 7D represent the memory space needed during the whole mining process. Figures 7E and 7F represent the memory space needed by the COFI-trees or conditional trees during the mining process. In these experiments we varied the dimension, which is the number of distinct items, from 5K to 10K, the average transaction length from 12 to 24 items per transaction, and the number of transactions from 10K to 500K. All these experiments showed that our approach is one order of magnitude better than the FP-Growth approach in terms of memory usage. We also ran experiments using the public UCI datasets provided on the FIMI workshop website, which are Mushroom, Chess, Connect, Pumsb, T40I10D100K, and T10I4D100K. The COFI algorithm scales relatively well vis-à-vis the support threshold on these datasets. Results are not reported here for lack of space. Our approach revealed good results with high support values on all datasets. However, like other approaches, in cases of low support values, where the number of frequent patterns increases significantly, our approach faces some difficulties. For such cases it is recommended to consider discovering closed itemsets or maximal patterns instead of just frequent itemsets. The sheer number of frequent itemsets becomes overwhelming, and some argue even useless. Closed itemsets and maximal itemsets represent all frequent patterns while eliminating the redundant ones. For illustration, Table 2 compares the CPU time and memory requirements of COFI and FP-Growth on the T10I4D100K dataset.

Table 2. Time and Memory Scalability with respect to support on the T10I4D100K dataset
              Time in seconds       Memory in KB
Support %     COFI   FP-Growth      COFI   FP-Growth
0.50          1.5    3.0            18     173
0.25          1.7    5.2            19     285
0.10          2.7    12.3           26     289
0.05          14.0   20.9           19     403
5 Implementations

The COFI-tree program submitted with this paper is C++ code. The executable runs with 3 parameters, which are: (1) the path to the input file name, (2) a positive integer that represents the absolute support, and (3) an optional file name for the output patterns. This code generates ALL frequent patterns from the provided input file. The code scans the database twice. The goal of the first database scan is to find the frequency of each item in the transactional database. These frequencies are stored in a data structure called Candidate-Items. Each entry of these candidate items is a structure called ItemsStructure that is made of two long integers representing the item and its frequency. All frequent items are then stored in a special data structure called F1-Items. This data structure is sorted in descending order based on the frequency of each item. To access the location of each item we map it to a specific location using a new data structure called FindInHashTable. In brief, since we do not know the number of unique items at runtime, and thus cannot create an array for counting the items, rather than keeping a linked list of items we create blocks of p items. The number p could arbitrarily be 100 or 1000. Indeed, following links in a linked list each time to find and increment a counter could be expensive. Instead, blocks of items are easily indexed. In the worst case, we could lose the space of p − 1 unused items.
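A possible reading of this block-based counting is sketched below in C++. The class and member names (BlockedCounter, ItemSlot, slotOf) are ours, and the std::unordered_map merely plays the role that the paper assigns to FindInHashTable; this is not the submitted code.

#include <cstddef>
#include <unordered_map>
#include <vector>

// Item counters kept in blocks of P slots.  Each newly seen item is assigned
// the next free slot; a fresh block is allocated only when the current one is
// full, so at most P-1 slots are ever wasted, and a counter is reached by
// simple index arithmetic instead of walking a linked list.
class BlockedCounter {
public:
    explicit BlockedCounter(std::size_t blockSize = 1000) : p_(blockSize) {}

    void increment(long item) {
        std::size_t slot = slotOf(item);
        blocks_[slot / p_][slot % p_].frequency += 1;
    }

    long count(long item) const {
        auto it = slots_.find(item);
        if (it == slots_.end()) return 0;
        return blocks_[it->second / p_][it->second % p_].frequency;
    }

private:
    struct ItemSlot { long item = 0; long frequency = 0; };  // cf. ItemsStructure

    std::size_t slotOf(long item) {
        auto it = slots_.find(item);
        if (it != slots_.end()) return it->second;
        if (nextSlot_ / p_ == blocks_.size())   // current block is full (or none yet)
            blocks_.emplace_back(p_);           // allocate a fresh block of P slots
        std::size_t slot = nextSlot_++;
        blocks_[slot / p_][slot % p_].item = item;
        slots_.emplace(item, slot);
        return slot;
    }

    std::size_t p_;                                  // block size (e.g. 100 or 1000)
    std::size_t nextSlot_ = 0;
    std::unordered_map<long, std::size_t> slots_;    // item id -> assigned slot
    std::vector<std::vector<ItemSlot>> blocks_;      // blocks of P slots each
};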
The second scan starts by eliminating all non-frequent items from each transaction read and then sorting this transaction based on the frequency of each frequent item. This process occurs in the Sort-Transaction method. The FP-tree is built based on the sub-transactions made of the frequent items. The FP-tree data structure is a tree of n children. The structure struct FPTTree { long Element; long counter; FPTTree* child; FPTTree* brother; FPTTree* father; FPTTree* next; } has been used to create each node of this tree, where a link is created between each node and its first child, and the brother link is maintained to create a linked list of all children of the same node. This linked list is kept ordered based on the frequency of each item. The header list is maintained using the structure FrequentStruc { long Item; long Frequency; long COFIFrequency; long COFIFrequency1; FPTTree* first; COFITree* firstCOFI; }. After building the FP-tree we start building the first COFI-tree by selecting the item with the least frequency from the frequent list. A scan is made of the FP-tree, starting from the linked list of this item, to find the frequency of the other items with respect to this item. After that, the COFI-tree is created based only on the locally frequent items. Finally, frequent patterns are generated and stored in the FrequentTree data structure. All nodes that have a support greater than or equal to the given support represent a frequent pattern. The COFI-tree and the FrequentTree are removed from memory and the next COFI-tree is created, until we have mined all frequent trees.

One interesting implementation improvement is that the participation counter was also added to the header table of the COFI-tree; this counter accumulates the participation of the item in all patterns already discovered in the current COFI-tree. The difference between the participation in the node and the participation in the header is that the counter in the node counts the participation of the node's item in all paths where the node appears, while the new counter in the COFI-tree header counts the participation of the item globally in the tree. This trick does not compromise the effectiveness and usefulness of the participation counting. One main advantage of this counter is that it looks ahead to see whether all nodes of a specific item have already been traversed or not, reducing unneeded scans of the COFI-tree.
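One plausible form of this look-ahead, under the field naming used in our earlier sketches (not the submitted code), is to compare the participation accumulated in a header cell with the item's local support; once the two are equal, every occurrence of that item has already been consumed and its chain need not be scanned again:

// Header cell of a COFI-tree, extended with the cumulative participation of the
// item across all patterns generated so far in the current tree.
struct CofiHeaderEntry {
    long item;
    long localSupport;       // frequency of the item within this COFI-tree
    long participation = 0;  // participation accumulated over the whole tree
};

// If the item's global participation has caught up with its local support,
// nothing is left to mine along its chain and the scan can be skipped.
inline bool chainFullyConsumed(const CofiHeaderEntry& h) {
    return h.participation >= h.localSupport;
}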
6 Conclusion and future work

The COFI algorithm, based on the COFI-tree structure we propose in this paper, is one order of magnitude better than the FP-Growth algorithm in terms of memory usage, and sometimes in terms of speed. This algorithm achieves these results thanks to: (1) the non-recursive technique used during the mining process, in which a simple traversal of the COFI-tree generates the full set of frequent patterns; (2) the pruning method that is used to remove all locally non-frequent items, leaving the COFI-tree with only locally frequent items.

The major advantage of our algorithm COFI over FP-Growth is that it needs a significantly smaller memory footprint, and thus can mine larger transactional databases with less main memory available. The fundamental difference is that COFI tries to find a compromise between the fully pattern-growth approach that FP-Growth adopts and the total candidacy generation approach that apriori is known for. COFI grows targeted patterns but performs a reduced and focused generation of candidates during the mining. This is to avoid the recursion that FP-Growth uses, which is notorious for blowing the stack with large datasets.

We have developed algorithms for closed itemset mining and maximal itemset mining based on our COFI-tree approach. However, their efficient implementations were not ready by the deadline of this workshop. These efficient algorithms and experimental results will be compared to existing algorithms such as CHARM [17], MAFIA [6] and CLOSET+ [15], and will be reported in the future.

7 Acknowledgments

This research is partially supported by a Research Grant from NSERC, Canada.

References

[1] R. Agarwal, C. Aggarwal, and V. Prasad. A tree projection algorithm for generation of frequent itemsets. Parallel and Distributed Computing, 2000.
[2] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. 1993 ACM-SIGMOD Int. Conf. Management of Data, pages 207–216, Washington, D.C., May 1993.
[3] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. 1994 Int. Conf. Very Large Data Bases, pages 487–499, Santiago, Chile, September 1994.
[4] IBM Almaden. Quest synthetic data generation code. http://www.almaden.ibm.com/cs/quest/syndata.html.
[5] M.-L. Antonie and O. R. Zaïane. Text document categorization by term association. In IEEE International Conference on Data Mining, pages 19–26, December 2002.
[6] C. Burdick, M. Calimlim, and J. Gehrke. MAFIA: A maximal frequent itemset algorithm for transactional databases. In IEEE International Conference on Data Mining (ICDM'01), April 2001.
[7] M. El-Hajj and O. R. Zaïane. Inverted matrix: Efficient discovery of frequent items in large datasets in the context of interactive mining. In Proc. 2003 Int'l Conf. on Data Mining and Knowledge Discovery (ACM SIGKDD), August 2003.
[8] M. El-Hajj and O. R. Zaïane. Non recursive generation of frequent k-itemsets from frequent pattern tree representations. In Proc. of 5th International Conference on Data Warehousing and Knowledge Discovery (DaWaK 2003), September 2003.
[9] M. El-Hajj and O. R. Zaïane. Parallel association rule mining with minimum inter-processor communication. In Fifth International Workshop on Parallel and Distributed Databases (PaDD 2003), in conjunction with the 14th Int'l Conf. on Database and Expert Systems Applications (DEXA 2003), September 2003.
[10] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco, CA, 2001.
[11] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In ACM-SIGMOD, Dallas, 2000.
[12] J. Hipp, U. Guntzer, and G. Nakaeizadeh. Algorithms for association rule mining - a general survey and comparison. ACM SIGKDD Explorations, 2(1):58–64, June 2000.
[13] J. Liu, Y. Pan, K. Wang, and J. Han. Mining frequent item sets by opportunistic projection. In Eighth ACM SIGKDD International Conf. on Knowledge Discovery and Data Mining, pages 229–238, Edmonton, Alberta, August 2002.
[14] J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang. H-Mine: Hyper-structure mining of frequent patterns in large databases. In ICDM, pages 441–448, 2001.
[15] J. Wang, J. Han, and J. Pei. CLOSET+: Searching for the best strategies for mining frequent closed itemsets. In 9th ACM SIGKDD International Conf. on Knowledge Discovery and Data Mining, July 2003.
[16] O. R. Zaïane, J. Han, and H. Zhu. Mining recurrent items in multimedia with progressive resolution refinement. In Int. Conf. on Data Engineering (ICDE 2000), pages 461–470, San Diego, CA, February 2000.
[17] M. Zaki and C.-J. Hsiao. CHARM: An efficient algorithm for closed itemset mining. In 2nd SIAM International Conference on Data Mining, April 2002.
[Figure 7. Mining datasets of different sizes (D = dimension, L = average number of items in one transaction; support = 0.05%). (A) Time in seconds, D=5K, L=12; (B) time in seconds, D=10K, L=24; (C) memory usage in KB, D=5K, L=12; (D) memory usage in KB, D=10K, L=24; (E) memory usage in KB of the COFI-trees/conditional trees, D=5K, L=12; (F) memory usage in KB of the COFI-trees/conditional trees, D=10K, L=24; each as a function of the number of transactions (10K to 500K).]