<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CT-PRO: A Bottom-Up Non Recursive Frequent Itemset Mining Algorithm Using Compressed FP-Tree Data Structure</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yudho Giri Sucahyo</string-name>
          <email>sucahyoy@cs.curtin.edu.au</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raj P. Gopalan</string-name>
          <email>raj@cs.curtin.edu.au</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computing, Curtin University of Technology Kent St, Bentley Western Australia 6102</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules between Sets of Items in Large Databases", Proceedings of ACM SIGMOD International Conference on Management of Data</institution>
          ,
          <addr-line>Washington DC, 1993, pp. 207-216. J. Hipp, U. Guntzer, and G. Nakhaeizadeh</addr-line>
          ,
          <institution>"Algorithms for Association Rule Mining - A General Survey and Comparison"</institution>
          ,
          <addr-line>SIGKDD Explorations, vol. 2, pp. 58-64</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Frequent itemset mining (FIM) is an essential part of association rule mining. Its application to other data mining tasks has also been recognized. It has been an active research area and a large number of algorithms have been developed. In this paper, we propose another pattern growth algorithm which uses a more compact data structure named Compressed FP-Tree (CFP-Tree). The number of nodes in a CFP-Tree can be as low as half the number in the corresponding FP-Tree. We also describe the implementation of CT-PRO, which utilizes the CFP-Tree for FIM. CT-PRO traverses the CFP-Tree bottom-up and generates the frequent itemsets following the pattern growth approach non-recursively. Experiments show that CT-PRO performs better than OpportuneProject, FP-Growth, and Apriori. A further experiment is conducted to determine the feasible performance range of CT-PRO, and the results show that CT-PRO has a larger performance range compared to the others. CT-PRO also performs better than LCM and kDCI, which are known as the two best algorithms in the FIMI Repository 2003.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Since its introduction in [1], the problem of efficiently
generating frequent itemsets has been an active research
area and a large number of algorithms have been
developed for it; for surveys, see [2-4]. Frequent itemset
mining (FIM) is an essential part of association rule
mining (ARM). Since FIM is computationally expensive,
it largely determines the overall performance of ARM. The
frequent itemset concept has also been extended for many
other data mining tasks such as classification [5, 6],
clustering [7], and sequential pattern discovery [8].</p>
      <p>The data structures used play an important role in the
performance of FIM algorithms. The various data
structures used by FIM algorithms can be categorized as
either array-based or tree-based. An example of a
successful array-based algorithm for FIM is H-Mine [9]. It
uses a data structure named H-struct, which is a
combination of arrays and hyper-links. It was shown in
[9] that H-struct works well for sparse datasets as H-Mine
outperforms FP-Growth [10] on these datasets (note that
both H-Mine and FP-Growth follow the same pattern
growth method). However, the hyper-structure is not
efficient on dense datasets and therefore H-Mine switches
to FP-Growth for such datasets.</p>
      <p>FP-Growth [10] shows good performance on dense
datasets as it uses a compact data structure named the
FP-Tree. The FP-Tree is a prefix tree with links between nodes
containing the same item. A tree data structure is suitable
for dense datasets since many transactions will share
common prefixes so that the database could be compactly
represented. However, for sparse datasets the tree will be
bigger and bushier, and therefore its construction cost and
traversal cost will be higher than array-based data
structures.</p>
      <p>The strengths of H-Mine and FP-Growth were
combined in the recent pattern growth FIM algorithm
named OpportuneProject (OP) [11]. OP is an adaptive
algorithm that opportunistically chooses an array-based or
a tree-based data structure depending on the sub-database
characteristics.</p>
      <p>In this paper, we describe our new data structure
named Compressed FP-Tree (CFP-Tree) and also the
implementation of our new FIM algorithm named
CT-PRO that was first introduced in [12]. Here we compare the
compactness of the CFP-Tree with the FP-Tree at several support
levels on various datasets generated using the
synthetic data generator [13]. The performance of
CT-PRO is compared with Apriori [14, 15], FP-Growth [10],
and OP [11].</p>
      <p>Sample datasets such as real-world BMS datasets [3] or
UCI Machine Learning Repository datasets [16] do not
cover the full range of densities from sparse to dense.
Some algorithms may work well for a certain dataset but
may not be feasible when the dimensions of the database
change (i.e. number of transactions, number of items,
average number of items per transaction, etc.). Therefore, a
further study is presented in this paper to show the
feasible performance range of the algorithms. The more
extensive testing of the algorithms is carried out using a
set of databases with varying number of both transactions
and average number of items per transaction. For each
dataset, all the algorithms are tested on supports of 10% to
90% in increments of 10%. The experimental results are
reported in detail.</p>
      <p>
        To show how well CT-PRO compares with the algorithms
in the <xref ref-type="bibr" rid="ref3">FIMI Repository 2003</xref> [17], the two best algorithms from
the last workshop, LCM [18] and kDCI [19], are selected
for comparison. The results show that CT-PRO
outperforms these and therefore all the others.
      </p>
      <p>The structure of the rest of this paper is as follows: In
Section 2, we introduce the CFP-Tree data structure and
report the results of experiments evaluating its
compactness. In Section 3, we describe the CT-PRO
algorithm with a running example. We discuss the
complexity of the CT-PRO algorithm in Section 4. The
performance of the algorithm on various datasets is
compared against other algorithms in Section 5. Section 6
contains the conclusions of our study.</p>
    </sec>
    <sec id="sec-1b">
      <title>2. Compressed FP-Tree Data Structure</title>
      <p>In this section, a new tree-based data structure, named
Compressed FP-Tree (CFP-Tree), is introduced. It is a
variant of the CT-Tree data structure that we introduced in
[20], with the following major differences: items are sorted
in descending order of their frequency (instead of
ascending order, as in CT-Tree) and each node carries a link to the
next node with the same item (such links are not
present in CT-Tree). The CFP-Tree is defined as follows:</p>
      <p>Definition 1 (Compressed FP-Tree or CFP-Tree). A
Compressed FP-Tree is a prefix tree with the following
properties:
1. It consists of an ItemTable and a tree whose root
represents the index of the item with the highest
frequency, with a set of subtrees as the children of the
root.
2. The ItemTable contains all frequent items sorted in
descending order of their frequency. Each entry in
the ItemTable consists of four fields: (1) index, (2)
item-id, (3) frequency of the item, and (4) a pointer
to the root of the subtree of each frequent
item.
3. If I = {i1, i2, ... ik} is the set of frequent items in a
transaction, after being mapped to their index ids,
the transaction is inserted into the
Compressed FP-Tree starting from the root of the
subtree to which i1 in the ItemTable points.
4. The root of the Compressed FP-Tree is level 0 of
the tree.
5. Each node in the Compressed FP-Tree consists of
four fields: node-id, a pointer to the next sibling, a
pointer to the next node with the same id, and a count
array where each entry corresponds to the number of
occurrences of an itemset. If C = {C0, C1, ... Ck} is the
set of counts in the count array attached to a node,
with array indexes starting from zero, then Ci is
the count of the transactions whose itemset lies along the
path from the node at level i to the node where Ci is
located.</p>
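      <p>For illustration, the following is a minimal C++ sketch (ours, not the authors' code) of the node and ItemTable entry implied by Definition 1. All field names are assumptions, and the first-child pointer is an extra navigation field implied by the next-sibling pointer.</p>
      <preformat><![CDATA[
// A sketch of the CFP-Tree node and ItemTable entry of Definition 1.
#include <vector>

struct CFPNode {
    int node_id;                  // item index of this node
    CFPNode* sibling;             // next sibling under the same parent
    CFPNode* next_same;           // next node with the same item index
    CFPNode* child;               // first child (assumed, for navigation)
    std::vector<int> count;       // count[i]: occurrences of the itemset on
                                  // the path from the level-i node to here
};

struct ItemEntry {
    int index;                    // 1 = most frequent item
    int item_id;                  // original item id in the database
    int freq;                     // support count of the item
    CFPNode* subtree;             // root of this item's subtree
};

// The ItemTable is an array of entries in descending frequency order.
using ItemTable = std::vector<ItemEntry>;
]]></preformat>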
      <p>The following lemma provides the worst-case space
complexity of a CFP-Tree.</p>
      <p>Lemma 1. Let n be the number of frequent items in the
database for a certain support threshold. The number of
nodes of the CFP-Tree is bounded by 2^(n-1), which is half of
the maximum for a full prefix tree.</p>
      <p>Rationale. Let IF = {iF1, ... iFn} be the set of distinct items in a
CFP-Tree, where iF1, iF2, ... iFn are lexicographically
ordered. The maximum numbers of nodes under the subtrees of
iF1, iF2, ... iFn are 2^(n-1), 2^(n-2), ..., 2^0 respectively. Since the
CFP-Tree is actually the subtree of iF1, the maximum number
of nodes of the CFP-Tree is 2^(n-1).</p>
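      <p>In LaTeX notation, the counting argument behind Lemma 1 can be restated as follows (our restatement of the rationale above):</p>
      <disp-formula><tex-math><![CDATA[
% A full prefix tree over n items has one node per non-empty subset;
% the CFP-Tree collapses everything into the subtree of i_F1, which
% holds the subsets containing i_F1, i.e. at most 2^(n-1) nodes.
\[
N_{\text{full}} \;=\; \sum_{j=1}^{n} 2^{\,n-j} \;=\; 2^{n}-1,
\qquad
N_{\text{CFP-Tree}} \;\le\; 2^{\,n-1} \;\approx\; \tfrac{1}{2}\,N_{\text{full}}.
\]
]]></tex-math></disp-formula>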
      <p>Compared to the FP-Tree, the CFP-Tree has some important
differences, as follows:
1. The FP-Tree stores the item id in the tree while, in the
CFP-Tree, item ids are mapped to an ascending sequence
of integers that is actually the array index in the
ItemTable.
2. The FP-Tree is compressed by removing identical
subtrees of a complete FP-Tree and succinctly storing
the information from them in the remaining nodes.
All subtrees of the root of the FP-Tree (except the
leftmost branch) are collected together at the leftmost
branch to form the CFP-Tree.
3. Each node in the FP-Tree (except the root) consists
of three fields: item-id, count and node-link. Count
registers the number of transactions represented by
the portion of the path reaching this node. Node-link
links to the next node with the same item-id. Each
node in the CFP-Tree consists of three fields: item-id,
count array and node-link. The count array contains
counts for item subsets in the path from the root to
that node. The index of the cells in the array
corresponds to the level numbers of the nodes above.
4. The FP-Tree has a HeaderTable consisting of two fields:
item-id and a pointer to the first node in the FP-Tree
carrying the nodes with the same item-id. The CFP-Tree
has an ItemTable consisting of four fields: index,
item-id, count of the item and a pointer to the root of
the subtree of each item. The root of each subtree has
a link to the next node with the same item. Both
HeaderTable and ItemTable store only frequent
items.</p>
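      <p>For contrast with the CFP-Tree node sketched above, point 3 corresponds to an FP-Tree node of roughly the following shape (again our sketch with assumed field names; parent and children pointers are navigation details that vary by implementation):</p>
      <preformat><![CDATA[
// Classic FP-Tree node per point 3: item-id, a single count, node-link.
#include <vector>

struct FPTreeNode {
    int item_id;                       // item id stored directly in the tree
    int count;                         // transactions on the path to here
    FPTreeNode* node_link;             // next node with the same item-id
    FPTreeNode* parent;                // assumed navigation pointers
    std::vector<FPTreeNode*> children;
};
]]></preformat>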
      <p>Figure 1 shows the FP-Tree and the CFP-Tree for a
sample database. In this case, the FP-Tree is a complete
tree for items 1-4. In this example, the number of nodes in
the FP-Tree is twice that of the corresponding CFP-Tree.
However, most datasets do not have such an extreme
characteristic as in this example.</p>
      <p>Figure 2 shows the compactness of CFP-Tree
compared to FP-Tree on several synthetic datasets at
various support levels (the characteristics of the datasets
are explained later in Section 5.2). CFP-Tree has a
smaller number of nodes compared to FP-Tree in all
cases.</p>
    </sec>
    <sec id="sec-2">
      <title>3. CT-PRO Algorithm</title>
      <p>In this section, a new method that traverses the tree
bottom-up and is implemented non-recursively is
presented. The CFP-Tree data structure is used to
compactly represent transactions in memory. The
algorithm is called CT-PRO and it has three steps:
finding the frequent items, constructing the CFP-Tree, and
mining. Algorithm 1 shows the first two steps of CT-PRO.
</p>
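      <p>The listing of Algorithm 1 is not reproduced legibly in this version of the paper. As a stand-in, the following is a minimal C++ sketch of the two steps the text describes: one database scan to rank the frequent items, and a second scan that keeps only the frequent items of each transaction, maps them to indexes, sorts them in ascending index order and inserts them into a prefix tree. All names are ours, and the CFP-Tree's per-node count arrays are simplified to a single end-of-transaction counter.</p>
      <preformat><![CDATA[
// Simplified sketch of Algorithm 1 (not the authors' listing).
#include <algorithm>
#include <cstddef>
#include <map>
#include <memory>
#include <vector>

struct Node {
    int count = 0;                                   // transactions ending here
    std::map<int, std::unique_ptr<Node>> child;      // keyed by item index
};

// Pass 1: count items, keep the frequent ones, rank by frequency.
std::map<int, int> rankFrequentItems(const std::vector<std::vector<int>>& db,
                                     int minsup) {
    std::map<int, int> freq;
    for (const auto& t : db)
        for (int x : t) ++freq[x];
    std::vector<std::pair<int, int>> keep;           // (item, frequency)
    for (const auto& [x, f] : freq)
        if (f >= minsup) keep.push_back({x, f});
    std::sort(keep.begin(), keep.end(),
              [](const auto& a, const auto& b) { return a.second > b.second; });
    std::map<int, int> index;                        // item id -> index
    for (std::size_t i = 0; i < keep.size(); ++i)
        index[keep[i].first] = static_cast<int>(i) + 1;
    return index;                                    // 1 = most frequent
}

// Pass 2: map, sort ascending, insert each transaction into the tree.
void buildTree(const std::vector<std::vector<int>>& db,
               const std::map<int, int>& index, Node& root) {
    for (const auto& t : db) {
        std::vector<int> ids;                        // frequent items only
        for (int x : t)
            if (auto it = index.find(x); it != index.end())
                ids.push_back(it->second);
        std::sort(ids.begin(), ids.end());           // ascending index order
        Node* cur = &root;
        for (int id : ids) {
            auto& next = cur->child[id];
            if (!next) next = std::make_unique<Node>();
            cur = next.get();
        }
        ++cur->count;                                // transaction ends here
    }
}
]]></preformat>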
      <p>Suppose the user wants to mine all frequent itemsets
from the transaction database shown in Figure 3a with a
support threshold of two transactions (or 40%). First, we
need to identify frequent items by reading the database
once (lines 3-11). The frequent items are stored in
frequency descending order in the GlobalItemTable (line
12). In a second pass over the database, only frequent
items are selected from each transaction (see Figure 3b),
mapped to their index id in GlobalItemTable on-the-fly,
sorted in ascending order of their index id (see Figure 3c)
and inserted into the CFP-Tree (see Figure 3d). The
pointer in GlobalItemTable also acts as the start of the
links to other nodes with the same item ids (indicated by
the dashed lines in Figure 3d). For illustration, at each
node we also show the index of the array, the transaction
represented at each index entry and its count. In the
implementation of CFP-Tree, however, only the second
column that represents the count is stored.</p>
      <p>Mining starts with the least frequent item in the
GlobalItemTable, here index 5. Its node-link is traversed to visit all
nodes of index 5, and the path to the root for each node is
traversed counting the other indexes that occur together
with index 5 (lines 13-23). In all, we have 1 (2), 2 (1), 3
(1) and 4 (2) for index 5. As indexes 1 and 4 (item ids 4 and 5) are
locally frequent, they are registered in the LocalItemTable
and assigned new index ids (see Figure 4a). They also
become children of the LocalFrequentPatternTree's root
(lines 5-7). Together, the root and its children form
frequent itemsets of length two.</p>
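      <p>A sketch of this projection-counting step (our illustration; the parent pointer and the single per-path count are simplifying assumptions over the node layout sketched earlier):</p>
      <preformat><![CDATA[
// Follow the node-link chain of the current item (e.g. index 5) and
// walk each node's path toward the root, accumulating how often every
// other index occurs together with it.
#include <map>

struct LinkedNode {
    int id;                  // item index (0 = artificial root)
    int path_count;          // occurrences of the path ending here
    LinkedNode* parent;      // toward the root (assumed field)
    LinkedNode* next_same;   // node-link: next node with this index
};

std::map<int, int> countCooccurrences(LinkedNode* first) {
    std::map<int, int> freq;                    // index -> local support
    for (LinkedNode* n = first; n; n = n->next_same)
        for (LinkedNode* p = n->parent; p && p->id != 0; p = p->parent)
            freq[p->id] += n->path_count;
    return freq;
}
]]></preformat>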
      <p>After the local frequent items for the projection have been
identified, the node-links in the GlobalCFP-Tree are
retraversed and the path to the root from each node is
revisited to get the local frequent items occurring together
with index 5 in the transactions. These local frequent
items are mapped to their index in the LocalItemTable
on-the-fly, sorted in ascending order of their index id and
inserted into the LocalCFP-Tree (lines 24-33 of Algorithm 2,
the mining part of CT-PRO, which takes the CFP-Tree as input
and outputs the frequent itemsets FP). The first
path of index 5 returns nothing. From the second path of
index 5, a transaction 14 (1) is inserted into the
LocalCFP-Tree and another transaction 14 (1) from the
third path of index 5 is also inserted. In total, there are two
occurrences of transaction 14. Indexes 1 and 4 in the
GlobalItemTable represent items 4 and 5 respectively.
The indexes of items 4 and 5 in the LocalItemTable are 1
and 2 respectively, so transaction 14 is inserted as
transaction 12 in the LocalCFP-Tree. As the item indexes in
the GlobalItemTable and LocalItemTable are different,
the item id is always maintained for output purposes.</p>
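      <p>The remapping just described fits in a few lines (our sketch): with a LocalItemTable mapping global index 1 to local index 1 and global index 4 to local index 2, the global path 1-4 becomes the local transaction 1-2.</p>
      <preformat><![CDATA[
// Map a path of global indexes to local indexes, keeping only locally
// frequent items, then sort ascending before inserting into the
// LocalCFP-Tree. Names are ours.
#include <algorithm>
#include <map>
#include <vector>

std::vector<int> toLocal(const std::vector<int>& global_path,
                         const std::map<int, int>& local_index) {
    std::vector<int> out;
    for (int g : global_path) {
        auto it = local_index.find(g);
        if (it != local_index.end()) out.push_back(it->second);
    }
    std::sort(out.begin(), out.end());   // ascending local index order
    return out;
}
// Example from the text: toLocal({1, 4}, {{1, 1}, {4, 2}}) yields {1, 2}.
]]></preformat>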
      <p>Longer frequent itemsets, with length greater than two,
are extracted by calling the procedure RecMine (line 9).
For simplicity, we have described this procedure (lines
34-47) using recursion but, in the program, it is
implemented as a non-recursive procedure. Starting from
the least frequent item in the LocalItemTable, (line 35),
the node-link is traversed (lines 37-41). For each node, the
path to the root in the LocalCFP-Tree is traversed
counting the other items that are together with the current
item. For example, in Figure 4a, traversing the node-link
of node 2 will return the index 1 (2) and, since it is
frequent, an entry is created and attached as the child of
index 2 in the LocalFrequentPatternTree (lines 42-44).
All frequent itemsets containing item 7 can be extracted
by traversing the LocalFrequentPatternTree (line 10): 7
(2), 75 (2), 754 (2), 74 (2).</p>
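      <p>The paper states that RecMine is described recursively only for readability and is implemented without recursion. The following is a minimal sketch of such a non-recursive traversal of a frequent-pattern tree using an explicit stack (our illustration, not the authors' listing):</p>
      <preformat><![CDATA[
// Enumerate all frequent itemsets stored in a frequent-pattern tree
// with an explicit stack instead of recursion.
#include <cstdio>
#include <stack>
#include <utility>
#include <vector>

struct PatternNode {
    int item;                                // item id (0 = artificial root)
    int support;                             // support of prefix + item
    std::vector<PatternNode*> children;
};

void emitAllPatterns(PatternNode* root) {
    std::stack<std::pair<PatternNode*, std::vector<int>>> st;
    st.push({root, {}});
    while (!st.empty()) {
        auto [node, prefix] = st.top();      // copy the prefix per branch
        st.pop();
        if (node->item != 0) {
            prefix.push_back(node->item);
            for (int it : prefix) std::printf("%d", it);
            std::printf(" (%d)\n", node->support);
        }
        for (PatternNode* c : node->children) st.push({c, prefix});
    }
}
// On the running example, this emits the itemsets
// 7 (2), 75 (2), 754 (2), 74 (2) (in some traversal order).
]]></preformat>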
      <p>The process is continued to mine the next item in the
GlobalItemTable in the GlobalCFP-Tree with indexes 4,
3, 2 and finally, when the mining process reaches the root
of the tree of Figure 3d, it outputs 4 (5).</p>
      <p>One major advantage of CT-PRO compared to
FP-Growth is that CT-PRO avoids the cost of creating
conditional FP-Trees. FP-Growth needs to create a
conditional FP-Tree at each step of its recursive mining.
This overhead adversely affects its performance, as the
number of conditional FP-Trees corresponds to the
number of frequent itemsets. In CT-PRO, for each
frequent item (not each frequent itemset), only one
LocalCFP-Tree is created and traversed non-recursively to extract all
frequent itemsets beginning with that frequent item.</p>
    </sec>
    <sec id="sec-3">
      <title>4. Time Complexity</title>
      <p>In this section, the best-case and worst-case time
complexity of the CT-PRO algorithm is presented. Let I = {i1,
i2, ..., in} be the set of all n items, let the transaction database
D be {t1, t2, ..., tm}, and let v be the total number of items
in all transactions.</p>
      <p>Lemma 2. In the best-case, the cost of generating frequent
itemsets is O(v + n).</p>
      <p>Proof. The best case for the CT-PRO algorithm occurs
when there is no frequent item. The algorithm has to read
v items in all transactions and add to the counts of the n items.
The counts of all n items are stored in the ItemTable and
checked to determine whether there is any frequent item.
If there is no frequent item, the process stops.</p>
      <p>Lemma 3. In the worst-case, the cost of generating
frequent itemsets is</p>
      <p>(v + n) + (v + 2^n - 1) + Σ from k=2 to n of ((2^(n-k) - 1) · 2^(n-k) + 2^(k-2)) = O(2^(2n)).</p>
      <p>Proof. The worst case happens when all n items are
frequent and all combinations of them are present in the m
transactions. CT-PRO has three steps: finding frequent
items, constructing the CFP-Tree, and mining. The cost of
finding frequent items is given by Lemma 2.
The worst case for the GlobalCFP-Tree corresponds to a
situation where all possible paths exist. In
constructing the GlobalCFP-Tree, all the transactions in
the database are read (the cost is v) and inserted into the
tree (the total number of nodes is 2^n - 1). For the mining
process, for each frequent item fk where 2 ≤ k ≤ n, 2^(k-2)
nodes in the GlobalCFP-Tree are visited to construct a
LocalCFP-Tree. The LocalCFP-Tree has 2^(n-k) - 1
paths that correspond to, at most, 2^(n-k) candidate
itemsets. So the worst-case mining cost is
Σ from k=2 to n of ((2^(n-k) - 1) · 2^(n-k) + 2^(k-2)).
Therefore, the total worst-case cost of CT-PRO is
(v + n) + (v + 2^n - 1) + Σ from k=2 to n of ((2^(n-k) - 1) · 2^(n-k) + 2^(k-2)) = O(2^(2n)).</p>
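      <p>To see why the total is O(2^(2n)), note that the quadratic terms of the sum form a geometric series dominated by the k = 2 term (our restatement, in LaTeX notation):</p>
      <disp-formula><tex-math><![CDATA[
\[
\sum_{k=2}^{n}\Bigl((2^{\,n-k}-1)\,2^{\,n-k} + 2^{\,k-2}\Bigr)
\;\le\; \sum_{k=2}^{n} 2^{\,2(n-k)} + \sum_{k=2}^{n} 2^{\,k-2}
\;\le\; \tfrac{4}{3}\,2^{\,2n-4} + 2^{\,n-1}
\;=\; O\bigl(2^{2n}\bigr).
\]
]]></tex-math></disp-formula>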
    </sec>
    <sec id="sec-4">
      <title>5. Experimental Evaluation</title>
      <p>
        This section contains three sub-sections. In Section 5.1,
we compare CT-PRO against other well-known algorithms,
including Apriori [14, 15], FP-Growth [10] and the
recently proposed OpportuneProject (OP) [11], on the
various datasets available at the <xref ref-type="bibr" rid="ref3">FIMI Repository 2003</xref>
        [17].
In Section 5.2, we report the results of more
comprehensive testing to determine the feasible
performance range of the algorithms. Finally, in Section
5.3, we compare CT-PRO with the two best algorithms in
the <xref ref-type="bibr" rid="ref3">FIMI Repository 2003</xref>
        [17], LCM [18] and kDCI [19].
      </p>
      <sec id="sec-4-1">
        <title>5.1. Comparison with Apriori, FP-Growth and OpportuneProject</title>
      <p>
        Six real datasets are used in this experiment, including
two dense datasets: Chess and Connect4; two less dense
datasets: Mushroom and Pumsb*; and two sparse datasets:
BMS-WebView-1 and BMS-WebView-2. The first four
datasets are originally taken from the UCI ML Repository
[16] and the last two datasets were donated by Blue Martini
Software [3]. All the datasets are also available at the <xref ref-type="bibr" rid="ref3">FIMI
Repository 2003</xref>
        [17].
      </p>
      <p>We used the implementation of Apriori created by
Christian Borgelt [21], with the prefix tree data
structure enabled. As for FP-Growth, we used the
executable code available from its authors [10]. However,
for comparing the number of nodes of FP-Tree to our
proposed data structure, we modified the source code of
FP-Growth provided by Bart Goethals in [22]. For
OpportuneProject (OP), we used the executable code
available from its author, Junqiang Liu [11].</p>
      <p>
        All the algorithms were implemented and compiled
using MS Visual C++ 6.0. All the experiments
        <xref ref-type="bibr" rid="ref1 ref4">(except the
comparisons with algorithms on the FIMI Repository 2003
website [17])</xref>
        were performed on a Pentium III PC 866
MHz with 512 MB RAM and a 110 GB hard disk running
MS Windows 2000. All runtimes reported in
our charts are the total execution time, i.e. the period between
input and output, which includes the time for constructing
all the data structures used in the programs.
      </p>
      <p>Figure 5 shows the results of the experiment on various
datasets. All the charts use a logarithmic scale for run time
along the y-axis on the left of the chart. We did not plot
the results in the chart if the runtime was more than
10,000 seconds. For a comprehensive evaluation of the
algorithm’s performance, rather than showing where our
algorithm performed best at some of the support levels, all
the algorithms were extensively tested on various datasets
with a support level of 10% to 90% for dense datasets
(e.g. Connect4, Chess, Pumsb*, Mushroom), a support
level of 0.1% to 1% for the sparse dataset
BMS-WebView1, and a support level of 0.01% to 0.1% for the sparse
dataset BMS-WebView-2. As the average number of items
increases and/or the support level decreases, at some
point, every algorithm ‘hits the wall’ (i.e. takes too long to
complete).</p>
      <p>CT-PRO outperforms others at all support thresholds
on the Connect4, Chess, Mushroom and Pumsb* datasets.
On the sparse dataset BMS-WebView-1, CT-PRO is a
runner-up, after OP, with only small performance
differences (0.4 seconds to 0.49 seconds at a support level
of 0.1% and 5.18 seconds to 7.69 seconds at 0.06%).
Below the support level of 0.06%, none of the algorithms
could mine the BMS-WebView-1 dataset. On the sparse
dataset BMS-WebView-2, a remarkable result is obtained.
Apriori, which is known as a traditional FIM algorithm,
outperforms FP-Growth at all support levels. CT-PRO is
the fastest from a support threshold of 1% down to 0.4%
and becomes the runner-up, after OP, at a support level of
0.3% down to 0.1% with small performance differences.</p>
      <p>From these results, we can claim that, on dense
datasets, CT-PRO generally outperforms the others. On sparse
datasets, the high cost of tree construction reduces
CT-PRO to runner-up. However, as the gap is very small, we
can say that CT-PRO also works well for sparse datasets.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Determining the</title>
    </sec>
    <sec id="sec-6">
      <title>Feasible</title>
    </sec>
    <sec id="sec-7">
      <title>Performance</title>
      <p>
        As mentioned earlier, sample datasets such as the
real-world BMS datasets [3] and the UCI Machine Learning
Repository datasets [16], which are also available at the <xref ref-type="bibr" rid="ref3">FIMI
Repository 2003</xref>
        [17], have their own static characteristics
and thus do not cover the full range of densities. An
algorithm that works well for one dataset may not have the
same degree of performance on other datasets with
different dimensions. Dimensions, here, could be the
number of transactions, number of items, average number
of items per transaction, denseness or sparseness, etc. In
this section, a more comprehensive evaluation of the
performance of various algorithms is presented.
      </p>
      <p>We generated ten datasets using the synthetic data
generator [13]. The first five datasets contained 100 items
with 50,000 transactions, and an average number of items
per transaction of 10, 25, 50, 75, and 100. The second five
datasets contained 100 items with 100,000 transactions,
also with an average number of items per transaction of
10, 25, 50, 75, and 100. CT-PRO, Apriori, FP-Growth
and OP were tested extensively on these datasets at a
support level of 10% to 90%, in increments of 10%.</p>
      <p>Figure 6 shows the performance comparisons of the
algorithms on various datasets. The dataset name shows
its characteristics. For example, I100T100KA10 means
there are 100 items, and 100,000 transactions with an
average of 10 items per transaction. The experimental
results show that the performance characteristics on
databases of 50,000 to 100,000 transactions are quite
similar. However, the runtime increases with the number
of transactions.</p>
      <p>The Apriori algorithm is very feasible for sparse
datasets (with an average number of items per
transaction of 10 and 25). Its performance is good, as it
consistently performs better than FP-Growth at all support
levels. Although Apriori is slower than CT-PRO and OP
on the two sparse datasets, its runtime is still acceptable
to the user (it needs only 60 seconds to mine the
I100T50KA25 dataset at a support level of 10%).
However, on the datasets with an average number of items
per transaction of 50, 75, and 100, Apriori performs worst
and can only mine down to support levels of 30%,
50%, and 70% respectively. These results confirm that,
for dense datasets, if the support levels used are low,
Apriori is infeasible.</p>
      <p>FP-Growth performs worst at all support levels on the
datasets with a low average number of items per
transaction (i.e. 10 and 25). The fact that FP-Growth does
not outperform Apriori on these two datasets shows that
Apriori is more feasible than FP-Growth for sparse
datasets. However, FP-Growth performs significantly
better than Apriori for the larger average number of items
in transactions.
</p>
      <p>
Both CT-PRO and OP have larger feasible
performance ranges compared to the other algorithms. OP
does not perform well on the sparse datasets
I100T50KA10 and I100T100KA10; its performance was
even worse than Apriori on these datasets. However, it
performs better than Apriori and FP-Growth on the other
datasets. On the datasets with an average number of items
per transaction of 50, 75, and 100, FP-Growth, CT-PRO
and OP can mine down to support levels of 20%, 40%,
and 50% respectively.</p>
      <p>CT-PRO can be considered the best among all the
algorithms as it generally performs best at most
support levels. However, as the support level gets lower,
its performance is similar to OP's. Only at a support level of
10% does OP occasionally run slightly faster than CT-PRO
(e.g. on the I100T50KA25 dataset).</p>
    </sec>
    <sec id="sec-5-3">
      <title>5.3. Comparison with Best Algorithms in the FIMI Repository 2003</title>
      <p>
        For comparison with the best algorithms in the <xref ref-type="bibr" rid="ref3">FIMI
Repository 2003</xref>
        [17], we ported our algorithm CT-PRO
to the Linux operating system and compared it with the
two best algorithms there: LCM [18] and kDCI [19]. We
performed the experiments on a PC with an AMD Athlon XP
2000+ 1.6 GHz CPU, 1 GB RAM, 2 GB swap and a 40 GB hard
disk, running Fedora Core 1. All programs were compiled
using the g++ compiler.
      </p>
      <p>
        Figure 7 shows the performance comparisons of the
algorithms
        <xref ref-type="bibr" rid="ref3">that were submitted to FIMI 2003</xref>
        on the Chess
and Connect4 datasets. The figures are taken from [17].
On the Chess dataset, kDCI is the best from a support level of
90% down to 70%. Below that, LCM
outperforms the others. On the Connect4 dataset, at a support
level of 95% down to 70%, kDCI is the best.
Below that, LCM outperforms the others. We can conclude
that for higher support levels kDCI is the best, while for
lower support levels LCM is the best. These two best
algorithms are compared with CT-PRO below.</p>
      <p>The kDCI algorithm [19] is a multiple-heuristics
hybrid algorithm that is able to adapt its behaviour during
execution. It is an extension of the DCI (Direct Count
and Intersect) algorithm [23], adding adaptability to
dataset-specific features. kDCI is also a resource-aware
algorithm which can decide its mining strategy based on the
hardware characteristics of the computing platform used.
Moreover, kDCI also uses the counting inference strategy
originally proposed in [24].</p>
        <p>The LCM (Linear time Closed itemset Miner)
algorithm [18] uses the parent-child relationship defined
on frequent itemsets. The search tree technique is adapted
from the algorithms for generating maximal bipartite
cliques [25, 26] based on reverse search [27, 28]. In
enumerating all frequent itemsets, LCM uses hybrid
techniques involving occurrence deliver or diffsets [29]
according to the density of the database.</p>
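      <p>For background on the diffset technique mentioned here (from [29], not specific to LCM): a diffset stores only the difference between a candidate's tidset and its parent's, so support can be computed without materializing full tidsets. In symbols, with t(X) the set of transaction ids containing X and σ the support:</p>
      <disp-formula><tex-math><![CDATA[
\[
d(PXY) \;=\; t(PX)\setminus t(PY),
\qquad
\sigma(PXY) \;=\; \sigma(PX) - \lvert d(PXY)\rvert .
\]
]]></tex-math></disp-formula>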
      <p>
        Figure 8 shows the performance comparisons on the Chess
and Connect4 datasets. From these results, CT-PRO
outperforms the others at high support levels. For
lower support levels, the performances of the three
algorithms are similar. Since LCM and kDCI are the best
algorithms in the <xref ref-type="bibr" rid="ref3">FIMI Repository 2003</xref>
        on the Chess and
Connect4 datasets, we can conclude that CT-PRO
outperforms all the other algorithms available in the <xref ref-type="bibr" rid="ref3">FIMI
Repository 2003</xref>
        [17] on these datasets.
      </p>
    </sec>
    <sec id="sec-8">
      <title>6. Conclusions</title>
      <p>In this paper, we have described a new tree-based data
structure named CFP-Tree that is more compact than the
FP-Tree used in the FP-Growth algorithm. Depending on the
database characteristics, the number of nodes in an
FP-Tree can be up to twice as many as in the corresponding
CFP-Tree for a given database. The CFP-Tree is used in our
new algorithm named CT-PRO for mining all frequent
itemsets. CT-PRO divides the CFP-Tree into several
projections, each also represented as a CFP-Tree, and then
conquers each projection to mine the frequent itemsets in
it.</p>
      <p>CT-PRO was explained in detail using a running
example, and the best-case and worst-case time complexity
of the algorithm was also presented. Performance
comparisons of CT-PRO against other well-known
algorithms, including Apriori [14, 15], FP-Growth [10]
and OpportuneProject (OP) [11], were also reported. The
results show that CT-PRO outperforms the other algorithms at
all support levels on dense datasets and also works well on
sparse datasets.</p>
      <p>Extensive experiments to measure the feasible
performance range of the algorithms were also presented in
this paper. A synthetic data generator was used to generate
several datasets with varying numbers of transactions
and average numbers of items per transaction. The
best available algorithms, including CT-PRO, Apriori,
FP-Growth and OP, were then tested on those datasets. The results
show that CT-PRO generally outperforms the others.</p>
      <p>
        In addition, to relate our research to the last workshop
on frequent itemset mining implementations [17], we
selected the two best algorithms (LCM and kDCI) from the
<xref ref-type="bibr" rid="ref3">FIMI Repository 2003</xref>
        and compared their performance with
CT-PRO. It was shown that CT-PRO performed better
than the others.
      </p>
    </sec>
    <sec id="sec-9">
      <title>7. Acknowledgement</title>
      <p>We are very grateful to Jian Pei for providing the
executable code of FP-Growth, Bart Goethals for
providing his FP-Growth program, Christian Borgelt for
the Apriori program, and Junqiang Liu for the
OpportuneProject program.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>S. Orlando, P. Palmerini, R. Perego, and F. Silvestri, "Adaptive and Resource-Aware Mining of Frequent Sets", Proceedings of the IEEE International Conference on Data Mining (ICDM), Maebashi City, Japan, 2002.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Lakhal</surname>
          </string-name>
          ,
          <article-title>"Mining Frequent Patterns with Counting Inference"</article-title>
          ,
          <source>ACM SIGKDD Explorations</source>
          , vol.
          <volume>2</volume>
          , pp.
          <fpage>66</fpage>
          -
          <lpage>75</lpage>
          ,
          <year>December 2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>T.</given-names>
            <surname>Uno</surname>
          </string-name>
          ,
          <article-title>"A Practical Fast Algorithm for Enumerating Cliques in Huge Bipartite Graphs and Its Implementation"</article-title>
          ,
          <source>Proceedings of the 89th Special Interest Group of Algorithms, Information Processing Society</source>
          , Japan,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>T.</given-names>
            <surname>Uno</surname>
          </string-name>
          ,
          <article-title>"Fast Algorithms for Enumerating Cliques in Huge Graphs"</article-title>
          , Research Group of Computation, IEICE, Kyoto University, 2003.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>D.</given-names>
            <surname>Avis</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Fukuda</surname>
          </string-name>
          ,
          <article-title>"Reverse Search for Enumeration"</article-title>
          ,
          <source>Discrete Applied Mathematics</source>
          , vol.
          <volume>65</volume>
          , pp.
          <fpage>21</fpage>
          -
          <lpage>46</lpage>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>T.</given-names>
            <surname>Uno</surname>
          </string-name>
          ,
          <article-title>"A New Approach for Speeding Up Enumeration Algorithms"</article-title>
          ,
          <source>Proceedings of ISAAC'98</source>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>M. J. Zaki</surname>
            and
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Hsiao</surname>
          </string-name>
          ,
          <article-title>"CHARM: An Efficient Algorithm for Closed Itemset Mining"</article-title>
          ,
          <source>Proceedings of SIAM International Conference on Data Mining (SDM)</source>
          , Arlington,
          <string-name>
            <surname>VA</surname>
          </string-name>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>