CT-PRO: A Bottom-Up Non Recursive Frequent Itemset Mining Algorithm Using Compressed FP-Tree Data Structure

Yudho Giri Sucahyo, Raj P. Gopalan
Department of Computing, Curtin University of Technology
Kent St, Bentley, Western Australia 6102
{sucahyoy, raj}@cs.curtin.edu.au

Abstract

Frequent itemset mining (FIM) is an essential part of association rules mining. Its application for other data mining tasks has also been recognized. It has been an active research area and a large number of algorithms have been developed. In this paper, we propose another pattern growth algorithm which uses a more compact data structure named Compressed FP-Tree (CFP-Tree). The number of nodes in a CFP-Tree can be as few as half the number in the corresponding FP-Tree. We also describe the implementation of CT-PRO, which utilizes the CFP-Tree for FIM. CT-PRO traverses the CFP-Tree bottom-up and generates the frequent itemsets following the pattern growth approach non-recursively. Experiments show that CT-PRO performs better than OpportuneProject, FP-Growth, and Apriori. A further experiment is conducted to determine the feasible performance range of CT-PRO; the result shows that CT-PRO has a larger performance range than the others. CT-PRO also performs better than LCM and kDCI, which are known as the two best algorithms in the FIMI Repository 2003.

1. Introduction

Since its introduction in [1], the problem of efficiently generating frequent itemsets has been an active research area and a large number of algorithms have been developed for it; for surveys, see [2-4]. Frequent itemset mining (FIM) is an essential part of association rules mining (ARM). Since FIM is computationally expensive, the general performance of ARM is determined by it. The frequent itemset concept has also been extended to many other data mining tasks such as classification [5, 6], clustering [7], and sequential pattern discovery [8].

The data structures used play an important role in the performance of FIM algorithms. The various data structures used by FIM algorithms can be categorized as either array-based or tree-based. An example of a successful array-based algorithm for FIM is H-Mine [9]. It uses a data structure named H-struct, which is a combination of arrays and hyper-links. It was shown in [9] that H-struct works well for sparse datasets, as H-Mine outperforms FP-Growth [10] on these datasets (note that both H-Mine and FP-Growth follow the same pattern growth method). However, the hyper-structure is not efficient on dense datasets and therefore H-Mine switches to FP-Growth for such datasets.

FP-Growth [10] shows good performance on dense datasets as it uses a compact data structure named FP-Tree. FP-Tree is a prefix tree with links between nodes containing the same item. A tree data structure is suitable for dense datasets, since many transactions share common prefixes and the database can therefore be compactly represented. However, for sparse datasets the tree becomes bigger and bushier, and its construction and traversal costs are then higher than those of array-based data structures.

The strengths of H-Mine and FP-Growth were combined in the recent pattern growth FIM algorithm named OpportuneProject (OP) [11]. OP is an adaptive algorithm that opportunistically chooses an array-based or a tree-based data structure depending on the sub-database characteristics.

In this paper, we describe our new data structure named Compressed FP-Tree (CFP-Tree) and also the implementation of our new FIM algorithm named CT-PRO, which was first introduced in [12]. Here we compare the compactness of the CFP-Tree with the FP-Tree at several support levels on various datasets generated using the synthetic data generator [13]. The performance of CT-PRO is compared with Apriori [14, 15], FP-Growth [10], and OP [11].

Sample datasets such as the real-world BMS datasets [3] or the UCI Machine Learning Repository datasets [16] do not cover the full range of densities from sparse to dense. Some algorithms may work well for a certain dataset but may not be feasible when the dimensions of the database change (i.e. number of transactions, number of items, average number of items per transaction, etc.). Therefore, a further study has been done in this paper to show the feasible performance range of the algorithms. The more extensive testing of the algorithms is carried out using a set of databases with varying numbers of both transactions and average items per transaction. For each dataset, all the algorithms are tested at supports of 10% to 90% in increments of 10%. The experimental results are reported in detail.

To show how well CT-PRO compares with the algorithms in the FIMI Repository 2003 [17], the two best algorithms from the last workshop, LCM [18] and kDCI [19], are selected for comparison. The result shows that CT-PRO outperforms these and therefore all others.

The structure of the rest of this paper is as follows: In Section 2, we introduce the CFP-Tree data structure and report the results of experiments evaluating its compactness. In Section 3, we describe the CT-PRO algorithm with a running example. We discuss the complexity of the CT-PRO algorithm in Section 4. The performance of the algorithm on various datasets is compared against other algorithms in Section 5. Section 6 contains the conclusions of our study.

2. Compressed FP-Tree Data Structure

In this section, a new tree-based data structure, named Compressed FP-Tree (CFP-Tree), is introduced. It is a variant of the CT-Tree data structure that we introduced in [20], with the following major differences: items are sorted in descending order of their frequency (instead of ascending order, as in CT-Tree) and there is a link to the next node with the same item (such links are not present in CT-Tree). The CFP-Tree is defined as follows:

Definition 1 (Compressed FP-Tree or CFP-Tree). A Compressed FP-Tree is a prefix tree with the following properties:
1. It consists of an ItemTable and a tree whose root represents the index of the item with the highest frequency, with a set of subtrees as the children of the root.
2. The ItemTable contains all frequent items sorted in descending order of their frequency. Each entry in the ItemTable consists of four fields: (1) index, (2) item-id, (3) frequency of the item, and (4) a pointer to the root of the subtree of each frequent item.
3. If I = {i1, i2, …, ik} is the set of frequent items in a transaction, then, after the items are mapped to their index-ids, the transaction is inserted into the Compressed FP-Tree starting from the root of the subtree to which i1 in the ItemTable points.
4. The root of the Compressed FP-Tree is level 0 of the tree.
5. Each node in the Compressed FP-Tree consists of four fields: node-id, a pointer to the next sibling, a pointer to the next node with the same id, and a count array where each entry corresponds to the number of occurrences of an itemset. If C = {C0, C1, …, Ck} is the set of counts in the count array attached to a node, and the index of the array starts from zero, then Ci is the count of a transaction with an itemset along the path from the node at level i to the node where Ci is located.

The following lemma provides the worst-case space complexity of a CFP-Tree.

Lemma 1. Let n be the number of frequent items in the database for a certain support threshold. The number of nodes of the CFP-Tree is bounded by 2^(n-1), which is half of the maximum for a full prefix tree.
Rationale. Let IF = {iF1, …, iFn} be the set of distinct items in a CFP-Tree, where iF1, iF2, …, iFn are lexicographically ordered. The maximum numbers of nodes under the subtrees iF1, iF2, …, iFn are 2^(n-1), 2^(n-2), …, 2^0 respectively. Since the CFP-Tree is actually the subtree iF1, the maximum number of nodes of the CFP-Tree is 2^(n-1).

Compared to the FP-Tree, the CFP-Tree has some important differences, as follows:
1. FP-Tree stores the item id in the tree while, in the CFP-Tree, item ids are mapped to an ascending sequence of integers that is actually the array index in the ItemTable.
2. The FP-Tree is compressed by removing identical subtrees of a complete FP-Tree and succinctly storing their information in the remaining nodes. All subtrees of the root of the FP-Tree (except the leftmost branch) are collected together at the leftmost branch to form the CFP-Tree.
3. Each node in the FP-Tree (except the root) consists of three fields: item-id, count and node-link. Count registers the number of transactions represented by the portion of the path reaching this node. Node-link links to the next node with the same item-id. Each node in the CFP-Tree consists of three fields: item-id, count array and node-link. The count array contains counts for item subsets on the path from the root to that node; the index of each cell in the array corresponds to the level number of the corresponding node above.
4. FP-Tree has a HeaderTable consisting of two fields: item-id and a pointer to the first node in the FP-Tree carrying that item-id. CFP-Tree has an ItemTable consisting of four fields: index, item-id, count of the item and a pointer to the root of the subtree of each item. The root of each subtree has a link to the next node with the same id. Both the HeaderTable and the ItemTable store only frequent items.

Figure 1 shows the FP-Tree and the CFP-Tree for a sample database. In this case, the FP-Tree is a complete tree for items 1-4. In this example, the number of nodes in the FP-Tree is twice that of the corresponding CFP-Tree. However, most datasets do not have such an extreme characteristic as in this example.

3. CT-PRO Algorithm

In this section, a new method that traverses the tree in a
bottom-up strategy, implemented non-recursively, is presented. The CFP-Tree data structure is used to compactly represent the transactions in memory. The algorithm is called CT-PRO and it has three steps: finding the frequent items, constructing the CFP-Tree, and mining. Algorithm 1 shows the first two steps of CT-PRO.

Figure 2 shows the compactness of the CFP-Tree compared to the FP-Tree on several synthetic datasets at various support levels (the characteristics of the datasets are explained later in Section 5.2). The CFP-Tree has a smaller number of nodes than the FP-Tree in all cases.

Sample database for Figure 1 (Tid: Items): 1: 1 2 3 4; 2: 2 4; 3: 1 3 4; 4: 3; 5: 2 3; 6: 2; 7: 1 4; 8: 1 2 3; 9: 3 4; 10: 4; 11: 1; 12: 2 3 4; 13: 1 2; 14: 1 2 4; 15: 1 3.
[Figure 1: the FP-Tree (with its HeaderTable) and the CFP-Tree (with its ItemTable) for this database; each of the items 1-4 occurs in 8 transactions, so the FP-Tree is a complete prefix tree with 15 nodes while the CFP-Tree has 8.]
Figure 1: FP-Tree and CFP-Tree

[Figure 2: bar chart of the number of nodes in the CFP-Tree and the FP-Tree on the datasets I100T50KA10 through I100T100KA50 at support levels between 20% and 80%.]
Figure 2: Compactness of CFP-Tree Compared to FP-Tree on Various Synthetic Datasets at Various Support Levels

Algorithm 1 CT-PRO Algorithm: Step 1 and Step 2
input: Database D, Support Threshold σ
output: CFP-Tree

1  begin
2    // Step 1: Identify frequent items
3    for each transaction t ∈ D
4      for each item i ∈ t
5        if i ∈ ItemTable
6          Increment count of i
7        else
8          Insert i into GlobalItemTable with count = 1
9        end if
10     end for
11   end for
12   Sort GlobalItemTable in frequency descending order
13   Assign an index to each frequent item in the GlobalItemTable
14   // Step 2: Construct CFP-Tree
15   Construct the leftmost branch of the tree
16   for each transaction t ∈ D
17     Initialize mappedTrans
18     for each frequent item i ∈ t
19       mappedTrans = mappedTrans ∪ GetIndex(i)
20     end for
21     Sort mappedTrans in ascending order of item ids
22     InsertToCFPTree(mappedTrans)
23   end for
24 end
25 Procedure InsertToCFPTree(mappedTrans)
26   firstItem := mappedTrans[1]
27   currNode := root of subtree pointed to by ItemTable[firstItem]
28   for each subsequent item i ∈ mappedTrans
29     if currNode has a child representing i
30       Increment count[firstItem-1] of the child node
31     else
32       Create a child node and set its count[firstItem-1] to 1
33       Link the node to its respective node-link
34     end if
35   end for
36 end

The pointer in the GlobalItemTable also acts as the start of the links to other nodes with the same item ids (indicated by the dashed lines in Figure 3d). For illustration, at each node we also show the index of the array, the transaction represented by each index entry and its count. In the implementation of the CFP-Tree, however, only the second column, which represents the count, is stored.

Figure 3 (a) Sample Database (Tid: Items): 1: 3 4 5 6 7 9; 2: 1 3 4 5 13; 3: 1 2 4 5 7 11; 4: 1 3 4 8; 5: 1 3 4 10.
(b) Frequent Items: 1: 3 4 5 7; 2: 1 3 4 5; 3: 1 4 5 7; 4: 1 3 4; 5: 1 3 4.
(c) Mapped: 1: 1 2 4 5; 2: 1 2 3 4; 3: 1 3 4 5; 4: 1 2 3; 5: 1 2 3.
(d) GlobalItemTable (index: item, count): 1: 4, 5; 2: 3, 4; 3: 1, 4; 4: 5, 3; 5: 7, 2.
[Figure 3d also shows the Global CFP-Tree built from the mapped transactions, with the count arrays at each node.]
Figure 3: CFP-Tree for the Sample Dataset

The mining process of CT-PRO is shown in Algorithm 2 and illustrated by the following example.

Example 1. Let the CFP-Tree shown in Figure 3d be the input for the mining step of CT-PRO, and suppose the user wants to get all the frequent itemsets with a minimum support of two transactions (or 40%).
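To make the three steps concrete, the following is a simplified, illustrative Python sketch of the CT-PRO pipeline on the sample database of Figure 3a. It is not the authors' implementation: it replaces the CFP-Tree and its count arrays with flat lists of mapped transactions, and it enumerates co-occurring index combinations directly instead of building LocalCFP-Trees. The function names are our own. The step structure, however, follows Algorithms 1 and 2: discover frequent items, map transactions to indexes sorted by descending frequency, and then mine projections starting from the least frequent index.

```python
from collections import Counter
from itertools import combinations

# Sample database from Figure 3a; minimum support = 2 transactions (40%).
DB = [
    [3, 4, 5, 6, 7, 9],
    [1, 3, 4, 5, 13],
    [1, 2, 4, 5, 7, 11],
    [1, 3, 4, 8],
    [1, 3, 4, 10],
]
MINSUP = 2

def global_item_table(db, minsup):
    # Step 1: one pass over the database to count items; frequent items
    # are indexed in descending order of frequency (the GlobalItemTable).
    counts = Counter(i for t in db for i in t)
    order = sorted((i for i in counts if counts[i] >= minsup),
                   key=lambda i: -counts[i])
    return {item: idx for idx, item in enumerate(order, start=1)}

def map_transactions(db, table):
    # Step 2 (front end): drop infrequent items, map the rest to their
    # GlobalItemTable indexes, and sort ascending (most frequent first).
    return [sorted(table[i] for i in t if i in table) for t in db]

def mine(db, minsup):
    # Mining: process indexes from the least to the most frequent; for
    # each, project the mapped transactions containing that index and
    # count the combinations of the co-occurring (smaller) indexes.
    # This flat projection stands in for the LocalCFP-Tree traversal.
    table = global_item_table(db, minsup)
    trans = map_transactions(db, table)
    item_of = {idx: item for item, idx in table.items()}
    result = {}
    for idx in sorted(table.values(), reverse=True):
        tails = [t[:t.index(idx)] for t in trans if idx in t]
        result[frozenset([item_of[idx]])] = len(tails)
        combo_counts = Counter()
        for tail in tails:
            for r in range(1, len(tail) + 1):
                combo_counts.update(combinations(tail, r))
        for combo, c in combo_counts.items():
            if c >= minsup:
                itemset = frozenset(item_of[i] for i in combo) | {item_of[idx]}
                result[itemset] = c
    return result

freq = mine(DB, MINSUP)
```

On this database the projection of the least frequent item (item 7) yields the itemsets 7, 74, 75 and 754, each with support 2, matching the running example; in all, 17 frequent itemsets are found.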
Suppose the user wants to mine all frequent itemsets from the transaction database shown in Figure 3a with a support threshold of two transactions (or 40%). First, we need to identify the frequent items by reading the database once (lines 3-11). The frequent items are stored in frequency descending order in the GlobalItemTable (line 12). In a second pass over the database, only the frequent items are selected from each transaction (see Figure 3b), mapped to their index ids in the GlobalItemTable on the fly, sorted in ascending order of their index ids (see Figure 3c) and inserted into the CFP-Tree (see Figure 3d).

Algorithm 2 CT-PRO Algorithm: Mining Part
input: CFP-Tree
output: Frequent Itemsets FP

1  Procedure Mining
2    for each frequent item i ∈ GlobalItemTable, from the least to the most frequent
3      Initialize LocalFrequentPatternTree x with i as the root
4      ConstructLocalItemTable(i)
5      for each frequent item j ∈ LocalItemTable
6        Attach j as a child of x
7      end for
8      ConstructLocalCFPTree(i)
9      RecMine(x)
10     Traverse the LocalFrequentPatternTree to print the frequent itemsets
11   end for
12 end
13 Procedure ConstructLocalItemTable(i)
14   for each occurrence of node i in the CFP-Tree
15     for each item j on the path to the root
16       if j ∈ LocalItemTable
17         Increment count of j
18       else
19         Insert j into LocalItemTable with count = 1
20       end if
21     end for
22   end for
23 end
24 Procedure ConstructLocalCFPTree(i)
25   for each occurrence of node i in the CFP-Tree
26     Initialize mappedTrans
27     for each frequent item j ∈ LocalItemTable on the path to the root
28       mappedTrans = mappedTrans ∪ GetIndex(j)
29     end for
30     Sort mappedTrans in ascending order of item ids
31     InsertToCFPTree(mappedTrans)
32   end for
33 end
34 Procedure RecMine(x)
35   for each child i of x
36     Set all counts in LocalItemTable to 0
37     for each occurrence of node i in the LocalCFPTree
38       for each item j on the path to the root
39         Increment count of j in the LocalItemTable
40       end for
41     end for
42     for each frequent item k ∈ LocalItemTable
43       Attach k as a child of i
44     end for
45     RecMine(i)
46   end for
47 end

[Figure 4: the LocalCFP-Tree, LocalItemTable and LocalFrequentPatternTree at each step of the mining process, for the projections of indexes 5, 4, 3 and 2.]
Figure 4: Local CFP-Tree during the Mining Process

Figure 4 shows the LocalCFP-Tree and the LocalFrequentPatternTree at each step of the mining process. CT-PRO starts from the least frequent item (index: 5, item: 7) in the GlobalItemTable (line 2). Item 7 is frequent and it becomes the root of the LocalFrequentPatternTree (line 3). Then CT-PRO creates a projection of all transactions ending with index 5. This projection is represented by a LocalCFP-Tree and contains only locally frequent items. Traversing the node-link of index 5 in the GlobalCFP-Tree identifies the local frequent items that occur together with it. There are three nodes of index 5, and the path to the root from each node is traversed, counting the other indexes that occur together with index 5 (lines 13-23). In all, we have 1 (2), 2 (1), 3 (1) and 4 (2) for index 5. As indexes 1 and 4 (item ids 4 and 5) are locally frequent, they are registered in the LocalItemTable and assigned new index ids (see Figure 4a). They also become the children of the LocalFrequentPatternTree's root (lines 5-7). Together, the root and its children form the frequent itemsets of length two.

After the local frequent items for the projection have been identified, the node-link in the GlobalCFP-Tree is re-traversed and the path to the root from each node is revisited to get the local frequent items occurring together with index 5 in the transactions. These local frequent items are mapped to their indexes in the LocalItemTable on the fly, sorted in ascending order of their index ids and inserted into the LocalCFP-Tree (lines 24-33). The first path of index 5 returns nothing. From the second path of index 5, a transaction 14 (1) is inserted into the LocalCFP-Tree, and another transaction 14 (1) from the third path of index 5 is also inserted. In total, there are two occurrences of transaction 14. Indexes 1 and 4 in the GlobalItemTable represent items 4 and 5 respectively. The indexes of items 4 and 5 in the LocalItemTable are 1 and 2 respectively, so the transaction 14 is inserted as transaction 12 in the LocalCFP-Tree. As the item indexes in the GlobalItemTable and the LocalItemTable are different, the item id is always maintained for output purposes.

Longer frequent itemsets, with length greater than two, are extracted by calling the procedure RecMine (line 9). For simplicity, we have described this procedure (lines 34-47) using recursion but, in the program, it is implemented as a non-recursive procedure. Starting from the least frequent item in the LocalItemTable (line 35), the node-link is traversed (lines 37-41). For each node, the path to the root in the LocalCFP-Tree is traversed, counting the other items that occur together with the current item. For example, in Figure 4a, traversing the node-link of node 2 will return the index 1 (2) and, since it is frequent, an entry is created and attached as the child of index 2 in the LocalFrequentPatternTree (lines 42-44). All frequent itemsets containing item 7 can then be extracted by traversing the LocalFrequentPatternTree (line 10): 7 (2), 75 (2), 754 (2), 74 (2). The process continues with the next items in the GlobalItemTable, indexes 4, 3 and 2; finally, when the mining process reaches the root of the tree of Figure 3d, it outputs 4 (5).

One major advantage of CT-PRO compared to FP-Growth is that CT-PRO avoids the cost of creating conditional FP-Trees. FP-Growth needs to create a conditional FP-Tree at each step of its recursive mining. This overhead adversely affects its performance, as the number of conditional FP-Trees corresponds to the number of frequent itemsets. In CT-PRO, for each frequent item (not each frequent itemset), only one LocalCFP-Tree is created and traversed non-recursively to extract all frequent itemsets beginning with that frequent item.

4. Time Complexity

In this section, the best-case and worst-case time complexity of the CT-PRO algorithm is presented. Let I = {i1, i2, …, in} be the set of all n items, let the transaction database D be {t1, t2, …, tm}, and let v be the total number of items in all transactions.

Lemma 2. In the best case, the cost of generating frequent itemsets is O(v + n).
Proof. The best case for the CT-PRO algorithm occurs when there is no frequent item. The algorithm has to read the v items in all transactions and maintain the counts of the n items. The counts of all n items are stored in the ItemTable and checked to determine whether there is any frequent item. If there is no frequent item, the process stops.

Lemma 3. In the worst case, the cost of generating frequent itemsets is

  (v + n) + (v + 2^(n-1)) + Σ_{k=2..n} ((2^(n-k) - 1) 2^(n-k) + 2^(k-2)) = O(2^(2n)).

Proof. The worst case happens when all n items are frequent and all combinations of them are present in the m transactions. CT-PRO has three steps: finding the frequent items, constructing the CFP-Tree, and mining. The cost of finding the frequent items has been given in Lemma 2. The worst case for the GlobalCFP-Tree corresponds to a situation where all the possible paths exist. In constructing the GlobalCFP-Tree, all the transactions in the database are read (the cost is v) and inserted into the tree (the total number of nodes is 2^(n-1)). For the mining process, for each frequent item fk, where 2 ≤ k ≤ n, 2^(k-2) nodes in the GlobalCFP-Tree are visited to construct a LocalCFP-Tree. The LocalCFP-Tree has (2^(n-k) - 1) paths that correspond to, at most, 2^(n-k) candidate itemsets. So the worst-case mining cost is

  Σ_{k=2..n} ((2^(n-k) - 1) 2^(n-k) + 2^(k-2)).

Therefore, the total worst-case cost of CT-PRO is

  (v + n) + (v + 2^(n-1)) + Σ_{k=2..n} ((2^(n-k) - 1) 2^(n-k) + 2^(k-2)) = O(2^(2n)).

5. Experimental Evaluation

This section contains three sub-sections. In Section 5.1, we compare CT-PRO against other well-known algorithms, including Apriori [14, 15], FP-Growth [10] and the recently proposed OpportuneProject (OP) [11], on the various datasets available at the FIMI Repository 2003 [17]. In Section 5.2, we report the results of more comprehensive testing to determine the feasible performance range of the algorithms. Finally, in Section 5.3, we compare CT-PRO with the two best algorithms in the FIMI Repository 2003 [17], LCM [18] and kDCI [19].

5.1. Comparison with Apriori, FP-Growth and OpportuneProject

Six real datasets are used in this experiment, including two dense datasets: Chess and Connect4; two less dense datasets: Mushroom and Pumsb*; and two sparse datasets: BMS-WebView-1 and BMS-WebView-2. The first four datasets were originally taken from the UCI ML Repository [16] and the last two were donated by Blue Martini Software [3]. All the datasets are also available at the FIMI Repository 2003 [17].

We used the implementation of Apriori created by Christian Borgelt [21], enabling the use of its prefix tree data structure. As for FP-Growth, we used the executable code available from its authors [10].
However, for comparing the number of nodes of the FP-Tree against our proposed data structure, we modified the source code of FP-Growth provided by Bart Goethals in [22]. For OpportuneProject (OP), we used the executable code available from its author, Junqiang Liu [11]. All the algorithms were implemented and compiled using MS Visual C++ 6.0. All the experiments (except the comparisons with the algorithms on the FIMI Repository 2003 website [17]) were performed on a Pentium III PC, 866 MHz, with 512 MB RAM and a 110 GB hard disk, running MS Windows 2000. The runtime reported in all our charts is the total execution time, the period between input and output. It also includes the time for constructing all the data structures used in all the programs.

Figure 5 shows the results of the experiment on the various datasets. All the charts use a logarithmic scale for runtime along the y-axis. We did not plot a result if the runtime was more than 10,000 seconds. For a comprehensive evaluation of the algorithms' performance, rather than showing where our algorithm performed best at some of the support levels, all the algorithms were extensively tested on the various datasets: at support levels of 10% to 90% for the dense datasets (e.g. Connect4, Chess, Pumsb*, Mushroom), at support levels of 0.1% to 1% for the sparse dataset BMS-WebView-1, and at support levels of 0.01% to 0.1% for the sparse dataset BMS-WebView-2.

CT-PRO outperforms the others at all support thresholds on the Connect4, Chess, Mushroom and Pumsb* datasets. On the sparse dataset BMS-WebView-1, CT-PRO is the runner-up, after OP, with only small performance differences (0.4 seconds against 0.49 seconds at a support level of 0.1%, and 5.18 seconds against 7.69 seconds at 0.06%). Below the support level of 0.06%, none of the algorithms could mine the BMS-WebView-1 dataset. On the sparse dataset BMS-WebView-2, a remarkable result is obtained. Apriori, which is known as a traditional FIM algorithm, outperforms FP-Growth at all support levels. CT-PRO is the fastest from a support threshold of 1% down to 0.4%, and becomes the runner-up, after OP, at support levels of 0.3% down to 0.1%, with small performance differences.

From these results, we can claim that, on dense datasets, CT-PRO generally outperforms the others. On sparse datasets, the high cost of tree construction reduces CT-PRO to runner-up. However, as the gap is very small, we can say that CT-PRO also works well on sparse datasets.

5.2. Determining the Feasible Performance Range

As mentioned earlier, sample datasets such as the real-world BMS datasets [3] and the UCI Machine Learning Repository datasets [16], which are also available at the FIMI Repository 2003 [17], have their own static characteristics and thus do not cover the full range of densities. An algorithm that works well for one dataset may not have the same degree of performance on other datasets with different dimensions. Dimensions, here, could be the number of transactions, number of items, average number of items per transaction, denseness or sparseness, etc. In this section, a more comprehensive evaluation of the performance of the various algorithms is presented.

We generated ten datasets using the synthetic data generator [13]. The first five datasets contained 100 items with 50,000 transactions, and an average number of items per transaction of 10, 25, 50, 75, and 100. The second five datasets contained 100 items with 100,000 transactions, also with an average number of items per transaction of 10, 25, 50, 75, and 100. CT-PRO, Apriori, FP-Growth and OP were tested extensively on these datasets at support levels of 10% to 90%, in increments of 10%.

Figure 6 shows the performance comparisons of the algorithms on these datasets. The dataset name shows its characteristics. For example, I100T100KA10 means there are 100 items, and 100,000 transactions with an average of 10 items per transaction. The experimental results show that the performance characteristics on databases of 50,000 and of 100,000 transactions are quite similar; however, the runtime increases with the number of transactions. As the average number of items increases and/or the support level decreases, at some point every algorithm 'hits the wall' (i.e. takes too long to complete).

The Apriori algorithm is very feasible for sparse datasets (with an average number of items per transaction of 10 and 25). Its performance is good, as it consistently performs better than FP-Growth at all support levels. Although Apriori is slower than CT-PRO and OP on these two sparse datasets, its runtime is still acceptable to the user (it needs only 60 seconds to mine the I100T50KA25 dataset at a support level of 10%). However, on the datasets with an average number of items per transaction of 50, 75, and 100, Apriori performs worst, and it can only mine down to support levels of 30%, 50%, and 70% respectively. These results confirm that, for dense datasets, if the support levels used are low, Apriori is infeasible.

FP-Growth performs worst at all support levels on the datasets with a low average number of items per transaction (i.e. 10 and 25). The fact that FP-Growth does not outperform Apriori on these two datasets shows that Apriori is more feasible than FP-Growth for sparse datasets. However, FP-Growth performs significantly better than Apriori for the larger average numbers of items per transaction.
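The dataset naming convention used in this section (e.g. I100T100KA10) can also be handled programmatically. The following sketch, with function names of our own choosing, parses such names and generates a crude stand-in dataset; note that it draws uniform random transactions, unlike the pattern-based model of the actual synthetic generator [13], so it only mimics the dimensions, not the density structure.

```python
import random
import re

def parse_dataset_name(name):
    # I<items>T<transactions>[K]A<average items per transaction>,
    # e.g. I100T100KA10 -> 100 items, 100,000 transactions, average 10.
    m = re.fullmatch(r"I(\d+)T(\d+)(K?)A(\d+)", name)
    if m is None:
        raise ValueError("not a dataset name: " + name)
    items = int(m.group(1))
    transactions = int(m.group(2)) * (1000 if m.group(3) else 1)
    avg_items = int(m.group(4))
    return items, transactions, avg_items

def generate_dataset(n_items, n_trans, avg_len, seed=0):
    # Uniform stand-in for the synthetic generator: every transaction
    # is a random sample of avg_len distinct items from the universe.
    rng = random.Random(seed)
    items = range(1, n_items + 1)
    return [rng.sample(items, min(avg_len, n_items)) for _ in range(n_trans)]

n_items, n_trans, avg_len = parse_dataset_name("I100T50KA25")
db = generate_dataset(n_items, 1000, avg_len)  # a 1,000-transaction slice
```

A harness built this way can loop over supports of 10% to 90% in increments of 10%, as in the experiments above.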
10000 1000 Runtime (s) logarithmic Runtime (s) logarithmic Connect4 Pumsb* 1000 100 100 10 10 1 1 10 20 30 40 50 60 70 80 90 10 20 30 40 50 60 70 Support (%) Support (%) 10000 100 Runtime (s) logarithmic Runtime (s) logarithmic Chess BMS-WebView-1 1000 100 10 10 1 1 0.1 0.01 0.1 10 20 30 40 50 60 70 80 90 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 Support (%) Support (%) 100 10 Runtime (s) logarithmic Runtime (s) logarithmic Mushroom BMS-WebView-2 10 1 1 0.1 0.1 10 20 30 40 50 60 70 80 90 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Support (%) Support (%) Figure 5: Performance Evaluation of CT-PRO Against Others on Various Datasets Both CT-PRO and OP have larger feasible 5.3. Comparison with Best Algorithms in the performance ranges compared to the other algorithms. OP FIMI Repository 2003 does not perform well on the sparse datasets I100T50KA10 and I100T100KA10. Its performance was For comparison with the best algorithms in the FIMI even worse than Apriori on this dataset. However, it Repository 2003 [17], we ported our algorithm CT-PRO performs better than Apriori and FP-Growth on other to a Linux operating system and compared it with their datasets. On the datasets with an average number of items two best algorithms: LCM [18] and kDCI [19]. We per transaction of 50, 75, and 100, FP-Growth, CT-PRO performed the experiments on a PC AMD Athlon XP and OP can mine down to a support level of 20%, 40%, 2000+ 1.6 GHz, 1 GB RAM, 2 GB Swap with 40GB Hard and 50% respectively. Disk running Fedora Core 1. All programs were compiled CT-PRO can be considered the best among all other using g++ compiler. algorithms as it generally performs the best at most Figure 7 shows the performance comparisons of support levels. However, as the support level gets lower, algorithms that were submitted to FIMI 2003 on Chess its performance is similar to OP. Only at a support level of and Connect4 datasets. The figures are taken from [17]. 
10%, OP occasionally runs slightly faster than CT-PRO On Chess dataset, kDCI is the best at a support level of (e.g. at a support level of 10% on the I100T50KA25 90% to a support level of 70%. Below that, LCM dataset). outperforms others. On Connect4 dataset, at a support 10 Runtime (s) logarithmic 10 Runtime (s) logarithmic I100T50KA10 I100T100KA10 1 1 0.1 0.1 10 20 30 40 50 60 70 80 90 10 20 30 40 50 60 70 80 90 Support (%) Support (%) 100 100 Runtime (s) logarithmic Runtime (s) logarithmic I100T50KA25 I100T100KA25 10 10 1 1 0.1 0.1 10 20 30 40 50 60 70 80 90 10 20 30 40 50 60 70 80 90 Support (%) Support (%) 10000 10000 Runtime (s) logarithmic Runtime (s) logarithmic 1000 I100T50KA50 I100T100KA50 1000 100 100 10 1 10 0.1 1 10 20 30 40 50 60 70 80 90 10 20 30 40 50 60 70 80 90 Support (%) Support (%) 10000 10000 Runtime (s) logarithmic Runtime (s) logarithmic I100T50KA75 I100T100KA75 1000 1000 100 100 10 10 1 1 10 20 30 40 50 60 70 80 90 10 20 30 40 50 60 70 80 90 Support (%) Support (%) 10000 10000 I100T50KA100 I100T100KA100 Runtime (s) logarithmic Runtime (s) logarithmic 1000 1000 100 100 10 10 1 1 10 20 30 40 50 60 70 80 90 10 20 30 40 50 60 70 80 90 Support (%) Support (%) Figure 6: Performance Evaluation on Various Synthetic Datasets level of 95% to a support level of 70%, kDCI is the best. Figure 8 shows the performance comparisons on Chess Below that, LCM outperforms others. We can conclude and Connect4 datasets. From these results, CT-PRO that for higher support levels, kDCI is the best, but for always outperforms others at high support levels. For lower support levels, LCM is the best. These best two lower support levels, the performances of these three algorithms are compared with CT-PRO. algorithms are similar. 
Since LCM and kDCI are the best algorithms in the FIMI Repository 2003 on the Chess and Connect4 datasets, we can conclude that CT-PRO outperforms all the other algorithms available in the FIMI Repository 2003 [17] on these datasets.

The kDCI algorithm [19] is a multiple-heuristics hybrid algorithm that is able to adapt its behaviour during execution. It extends the DCI (Direct Count and Intersect) algorithm [23] with adaptability to dataset-specific features. kDCI is also a resource-aware algorithm that can decide its mining strategy based on the hardware characteristics of the computing platform used. Moreover, kDCI uses the counting inference strategy originally proposed in [24].

The LCM (Linear time Closed itemset Miner) algorithm [18] uses the parent-child relationship defined on frequent itemsets. Its search tree technique is adapted from the algorithms for generating maximal bipartite cliques [25, 26] based on reverse search [27, 28]. In enumerating all frequent itemsets, LCM uses hybrid techniques involving occurrence deliver or diffsets [29], according to the density of the database.

Figure 8: Performance Comparisons of CT-PRO, LCM and kDCI on Chess and Connect4 Datasets (runtime in seconds, logarithmic scale, against support in %)

6. Conclusions

In this paper, we have described a new tree-based data structure named CFP-Tree that is more compact than the FP-Tree used in the FP-Growth algorithm. Depending on the database characteristics, the number of nodes in an FP-Tree can be up to twice as many as in the corresponding CFP-Tree for a given database. The CFP-Tree is used in our new algorithm CT-PRO for mining all frequent itemsets. CT-PRO divides the CFP-Tree into several projections, each also represented as a CFP-Tree, and then mines all frequent itemsets in each projection.
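The projection-based divide-and-conquer idea can be illustrated with a highly simplified sketch. To be clear about what is and is not the paper's method: CT-PRO traverses CFP-Tree projections bottom-up and non-recursively, whereas this sketch uses plain Python lists and recursion purely to show how projecting the database on each item lets every frequent itemset be enumerated exactly once:

```python
from collections import Counter

def pattern_growth(transactions, min_count, prefix=()):
    """Simplified projection-based pattern growth over plain lists.

    Illustrative only: the real CT-PRO mines CFP-Tree projections
    without recursion; here lists and recursion stand in for both.
    """
    counts = Counter(item for t in transactions for item in set(t))
    frequent = []
    for item, count in sorted(counts.items()):
        if count < min_count:
            continue
        itemset = prefix + (item,)
        frequent.append(itemset)
        # Project the database on `item`: keep only the transactions
        # containing it, restricted to items ordered after it, so no
        # itemset is generated twice.
        projection = [
            [i for i in t if i > item]
            for t in transactions if item in t
        ]
        frequent.extend(pattern_growth(projection, min_count, itemset))
    return frequent

db = [["a", "b", "c"], ["a", "c"], ["a", "d"], ["b", "c"]]
print(pattern_growth(db, 2))
# [('a',), ('a', 'c'), ('b',), ('b', 'c'), ('c',)]
```

Each call on a projection corresponds to mining one sub-database; in CT-PRO the projections are themselves CFP-Trees and the traversal is managed iteratively rather than through the call stack.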
Figure 7: Performance Comparisons of Algorithms available in the FIMI 2003 Repository on Chess and Connect4 Datasets [17]

CT-PRO was explained in detail using a running example, and the best-case and worst-case time complexities of the algorithm were also presented. Performance comparisons of CT-PRO against other well-known algorithms, including Apriori [14, 15], FP-Growth [10], and OpportuneProject (OP) [11], were also reported. The results show that CT-PRO outperforms the other algorithms at all support levels on dense datasets and also works well on sparse datasets.

Extensive experiments to measure the feasible performance range of the algorithms were also presented in this paper. A synthetic data generator was used to generate several datasets with varying numbers of transactions and average numbers of items per transaction. The best available algorithms, including CT-PRO, Apriori, FP-Growth, and OP, were then tested on those datasets. The results show that CT-PRO generally outperforms the others.

In addition, to relate our research to the last workshop on frequent itemset mining implementations [17], we selected the two best algorithms (LCM and kDCI) from the FIMI Repository 2003 and compared their performance with CT-PRO. It was shown that CT-PRO performed better than the others.

[8] R. Agrawal and R. Srikant, "Mining Sequential Patterns", Proceedings of the 11th International Conference on Data Engineering (ICDE), Taipei, Taiwan, 1995.
[9] J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang, "H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases", Proceedings of the IEEE International Conference on Data Mining (ICDM), San Jose, California, 2001.
[10] J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns without Candidate Generation", Proceedings of the ACM SIGMOD International Conference on Management of Data, Dallas, TX, 2000.
[11] J. Liu, Y. Pan, K. Wang, and J. Han, "Mining Frequent Item Sets by Opportunistic Projection", Proceedings of ACM SIGKDD, Edmonton, Alberta, Canada, 2002.
[12] R. P. Gopalan and Y. G.
Sucahyo, "High Performance Frequent Pattern Extraction using Compressed FP-Trees", Proceedings of the SIAM International Workshop on High Performance and Distributed Mining (HPDM), Orlando, USA, 2004.
[13] IBM, "Synthetic Data Generation Code for Associations and Sequential Patterns", Intelligent Information Systems, IBM Almaden Research Center, 2002, http://www.almaden.ibm.com/software/quest/Resources/index.shtml.
[14] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules", Proceedings of the 20th International Conference on Very Large Data Bases, Santiago, Chile, 1994.
[15] C. Borgelt, "Efficient Implementations of Apriori and Eclat", Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI), Melbourne, Florida, 2003.
[16] C. L. Blake and C. J. Merz, "UCI Repository of Machine Learning Databases", Irvine, CA: University of California, Department of Information and Computer Science, 1998.
[17] FIMI, "FIMI Repository", 2003, http://fimi.cs.helsinki.fi.
[18] T. Uno, T. Asai, Y. Uchida, and H. Arimura, "LCM: An Efficient Algorithm for Enumerating Frequent Closed Item Sets", Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI), Melbourne, Florida, 2003.

7. Acknowledgement

We are very grateful to Jian Pei for providing the executable code of FP-Growth, Bart Goethals for providing his FP-Growth program, Christian Borgelt for the Apriori program, and Junqiang Liu for the OpportuneProject program.

8. References

[1] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules between Sets of Items in Large Databases", Proceedings of the ACM SIGMOD International Conference on Management of Data, Washington DC, 1993, pp. 207-216.
[2] J. Hipp, U. Guntzer, and G.
Nakhaeizadeh, "Algorithms for Association Rule Mining - A General Survey and Comparison", SIGKDD Explorations, vol. 2, pp. 58-64, July 2000.
[3] Z. Zheng, R. Kohavi, and L. Mason, "Real World Performance of Association Rule Algorithms", Proceedings of the 7th International Conference on Knowledge Discovery and Data Mining (KDD), New York, 2001.
[4] B. Goethals, "Efficient Frequent Pattern Mining", PhD Thesis, University of Limburg, Belgium, 2002.
[5] B. Liu, W. Hsu, and Y. Ma, "Integrating Classification and Association Rule Mining", Proceedings of ACM SIGKDD, New York, NY, 1998.
[6] Y. G. Sucahyo and R. P. Gopalan, "Building a More Accurate Classifier Based on Strong Frequent Patterns", Proceedings of the 17th Australian Joint Conference on Artificial Intelligence, Cairns, Australia, 2004.
[7] K. Wang, X. Chu, and B. Liu, "Clustering Transactions Using Large Items", Proceedings of ACM CIKM, USA, 1999.
[19] C. Lucchese, S. Orlando, P. Palmerini, R. Perego, and F. Silvestri, "kDCI: a Multi-Strategy Algorithm for Mining Frequent Sets", Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI), Melbourne, Florida, 2003.
[20] Y. G. Sucahyo and R. P. Gopalan, "CT-ITL: Efficient Frequent Item Set Mining Using a Compressed Prefix Tree with Pattern Growth", Proceedings of the 14th Australasian Database Conference, Adelaide, Australia, 2003.
[21] C. Borgelt and R. Kruse, "Induction of Association Rules: Apriori Implementation", Proceedings of the 15th Conference on Computational Statistics, Berlin, Germany, 2002.
[22] B. Goethals, "Home page of Bart Goethals", 2003, http://www.cs.helsinki.fi/u/goethals.
[23] S. Orlando, P. Palmerini, R. Perego, and F. Silvestri, "Adaptive and Resource-Aware Mining of Frequent Sets", Proceedings of the IEEE International Conference on Data Mining (ICDM), Maebashi City, Japan, 2002.
[26] T.
Uno, "Fast Algorithms for Enumerating Cliques in Huge Graphs", Research Group of Computation, IEICE, Kyoto University, 2003.
[24] Y. Bastide, R. Taouil, N. Pasquier, G. Stumme, and L. Lakhal, "Mining Frequent Patterns with Counting Inference", ACM SIGKDD Explorations, vol. 2, pp. 66-75, December 2000.
[25] T. Uno, "A Practical Fast Algorithm for Enumerating Cliques in Huge Bipartite Graphs and Its Implementation", Proceedings of the 89th Special Interest Group of Algorithms, Information Processing Society, Japan, 2003.
[27] D. Avis and K. Fukuda, "Reverse Search for Enumeration", Discrete Applied Mathematics, vol. 65, pp. 21-46, 1996.
[28] T. Uno, "A New Approach for Speeding Up Enumeration Algorithms", Proceedings of ISAAC'98, 1998.
[29] M. J. Zaki and C. Hsiao, "CHARM: An Efficient Algorithm for Closed Itemset Mining", Proceedings of the SIAM International Conference on Data Mining (SDM), Arlington, VA, 2002.