=Paper=
{{Paper
|id=Vol-90/paper-15
|storemode=property
|title=COFI-tree Mining: A New Approach to Pattern Growth with Reduced Candidacy Generation
|pdfUrl=https://ceur-ws.org/Vol-90/elhajj.pdf
|volume=Vol-90
|dblpUrl=https://dblp.org/rec/conf/fimi/ZaianeE03
}}
==COFI-tree Mining: A New Approach to Pattern Growth with Reduced Candidacy Generation==
COFI-tree Mining: A New Approach to Pattern Growth with Reduced Candidacy Generation

Mohammad El-Hajj and Osmar R. Zaïane
Department of Computing Science, University of Alberta, Edmonton, AB, Canada
mohammad@cs.ualberta.ca, zaiane@cs.ualberta.ca
Abstract

Existing association rule mining algorithms suffer from many problems when mining massive transactional datasets. Some of these major problems are: (1) the repetitive I/O disk scans, (2) the huge computation involved during the candidacy generation, and (3) the high memory dependency. This paper presents the implementation of our frequent itemset mining algorithm, COFI, which achieves its efficiency by applying four new ideas. First, it can mine using compact memory-based data structures. Second, for each frequent item, a relatively small independent tree is built summarizing co-occurrences. Third, clever pruning reduces the search space drastically. Finally, a simple and non-recursive mining process reduces the memory requirements, as minimal candidacy generation and counting is needed to generate all relevant frequent patterns.

1 Introduction

Frequent pattern discovery has become a common topic of investigation in the data mining research area. Its main theme is to discover the sets of items that occur together more often than a given threshold defined by the decision maker. A well-known application domain that counts on frequent pattern discovery is market basket analysis. In most cases, when the support threshold is low and the number of frequent patterns "explodes", the discovery of these patterns becomes problematic for reasons such as high memory dependencies, a huge search space, and the massive I/O required. Recently, new studies have been proposed to reduce the memory requirements [8] and to decrease the I/O dependencies [7], but promising issues still need to be investigated, such as pruning techniques to reduce the search space. In this paper we introduce a new method for frequent pattern discovery that is based on the Co-Occurrence Frequent Item tree concept [8, 9]. The newly proposed method uses a pruning technique that dramatically saves memory space. These relatively small trees are constructed based on a memory-based structure called the FP-Tree [11]. This data structure is studied in detail in the following sections. In short, we introduced in [8] the COFI-tree structure and an algorithm to mine it. In [7] we presented a disk-based data structure, the inverted matrix, that replaces the memory-based FP-Tree and scales interactive frequent pattern mining significantly. Our contributions in this paper are the introduction of a clever pruning technique based on an interesting property drawn from our top-down approach, and some implementation tricks and issues. We included the pruning in the algorithm that builds the tree, so that the pruning is done on the fly.

1.1 Problem Statement

The problem of mining association rules over market basket analysis was introduced in [2]. The problem consists of finding associations between items or itemsets in transactional data. The data could be retail sales in the form of customer transactions or even medical images [16]. Association rules have been shown to be useful for other applications such as recommender systems, diagnosis, decision support, telecommunication, and even supervised classification [5]. Formally, as defined in [3], the problem is stated as follows: Let I = {i1, i2, ..., im} be a set of literals, called items, where m is considered the dimensionality of the problem. Let D be a set of transactions, where each transaction T is a set of items such that T ⊆ I. A unique identifier TID is given to each transaction. A transaction T is said to contain X, a set of items in I, if X ⊆ T. An association rule is an implication of the form "X ⇒ Y", where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅. An itemset X is said to be large or frequent if its support s is greater than or equal to a given minimum support threshold σ. An itemset X satisfies a constraint C if and only if C(X) is true. The rule X ⇒ Y has a support s in the transaction set D if s% of the transactions in D contain X ∪ Y. In other words, the support of the
rule is the probability that X and Y hold together among all the possible presented cases. It is said that the rule X ⇒ Y holds in the transaction set D with confidence c if c% of the transactions in D that contain X also contain Y. In other words, the confidence of the rule is the conditional probability that the consequent Y is true under the condition of the antecedent X. The problem of discovering all association rules from a set of transactions D consists of generating the rules that have a support and confidence greater than a given threshold. These rules are called strong rules. This association-mining task can be broken into two steps:

1. A step for finding all frequent k-itemsets, known for its extreme I/O scan expense and massive computational costs;
2. A straightforward step for generating strong rules.

In this paper and our attached code, we focus exclusively on the first step: generating frequent itemsets.

1.2 Related Work

Several algorithms have been proposed in the literature to address the problem of mining association rules [12, 10]. One of the key algorithms, which seems to be the most popular in many applications for enumerating frequent itemsets, is the Apriori algorithm [3]. This algorithm also forms the foundation of most known algorithms. It uses an anti-monotone property stating that for a k-itemset to be frequent, all its (k-1)-itemsets have to be frequent. The use of this fundamental property reduces the computational cost of candidate frequent itemset generation. However, in the case of extremely large input sets with a big frequent 1-itemset, the Apriori algorithm still suffers from the two main problems of repeated I/O scanning and high computational cost. One major hurdle observed with most real datasets is the sheer size of the candidate frequent 2-itemsets and 3-itemsets.

TreeProjection is an efficient algorithm presented in [1]. This algorithm builds a lexicographic tree in which each node of the tree presents a frequent pattern. The authors report that their algorithm is one order of magnitude faster than the existing techniques in the literature. Another innovative approach for discovering frequent patterns in transactional databases, FP-Growth, was proposed by Han et al. in [11]. This algorithm creates a compact tree structure, the FP-Tree, representing frequent patterns, which alleviates the multi-scan problem and improves candidate itemset generation. The algorithm requires only two full I/O scans of the dataset to build the prefix tree in main memory, and then mines this structure directly. The authors of this algorithm report that it is faster than the Apriori and TreeProjection algorithms. Mining the FP-Tree structure is done recursively by building conditional trees that are of the same order of magnitude in number as the frequent patterns. This massive creation of conditional trees makes the algorithm not scalable for mining large datasets beyond a few million transactions. In [14] the same authors propose a new algorithm, H-mine, that invokes FP-Tree to mine condensed data. This algorithm is still not scalable, as reported by its authors in [13].

1.3 Preliminaries, Motivations and Contributions

The Co-Occurrence Frequent Item tree (or COFI-tree for short) and the COFI algorithm presented in this paper are based on our previous work in [7, 8]. The main motivation of our current research is the pruning technique that reduces the memory space needed by the COFI-trees. The presented algorithm works in two phases, in which phase 1 requires two full I/O scans of the transactional database to build the FP-Tree structure [11]. The second phase starts by building small Co-Occurrence Frequent Item trees for each frequent item. These trees are pruned first to eliminate any non-frequent items with respect to the COFI-tree's base frequent item. Finally, the mining process is executed.

The remainder of this paper is organized as follows: Section 2 describes the Frequent Pattern tree design and construction. Section 3 illustrates the design, construction, pruning, and mining of the Co-Occurrence Frequent Item trees. Section 4 presents the implementation of this algorithm. Experimental results are given in Section 5. Finally, Section 6 concludes by discussing some issues and highlighting our future work.

2 Frequent Pattern Tree: Design and Construction

The COFI-tree approach we propose consists of two main stages. Stage one is the construction of a modified Frequent Pattern tree. Stage two is the repetitive building of small data structures, the actual mining of these data structures, and their release.

2.1 Construction of the Frequent Pattern Tree

The goal of this stage is to build the compact data structure called the Frequent Pattern Tree [11]. This construction is done in two phases, where each phase requires a full I/O scan of the dataset. A first initial scan of the database identifies the frequent 1-itemsets. The goal is to generate an ordered list of frequent items that will be used when building the tree in the second phase.

The first phase starts by enumerating the items appearing in the transactions. After enumerating these items (i.e. after reading the whole dataset), infrequent items with a support less than the support threshold are weeded out and the remaining frequent items are sorted by their frequency. This
list is organized in a table, called the header table, where the items and their respective supports are stored along with pointers to the first occurrence of each item in the frequent pattern tree. Phase 2 then constructs the frequent pattern tree.

Table 1. Transactional database
T1: A G D C B
T2: B C H E D
T3: B D E A M
T4: C E F A N
T5: A B N O P
T6: A C Q R G
T7: A C H I G
T8: L E F K B
T9: A F M N O
T10: C F P G R
T11: A D B H I
T12: D E B K L
T13: M D C G O
T14: C F P Q J
T15: B D E F I
T16: J E B A D
T17: A K E F C
T18: C D L B A

Figure 1. Steps of phase 1 (Step 1: raw support counts of all items; Step 2: infrequent items removed, leaving A 11, B 10, C 10, D 9, E 8, F 7; Step 3: frequent items sorted by support, F 7, E 8, D 9, C 10, B 10, A 11)

Phase 2 of constructing the Frequent Pattern tree structure is the actual building of this compact tree. This phase requires a second complete I/O scan of the dataset. For each transaction read, only the set of frequent items present in the header table is collected and sorted in descending order according to their frequency. These sorted transaction items are used in constructing the FP-Tree as follows: for the first item of the sorted transaction, check whether it exists as one of the children of the root. If it exists, increment the support of this node. Otherwise, add a new node for this item as a child of the root node with a support of 1. Then consider the current item-node as the new temporary root and repeat the same procedure with the next item of the sorted transaction. During the process of adding any new item-node to the FP-Tree, a link is maintained between this item-node in the tree and its entry in the header table. The header table holds one pointer per item that points to the first occurrence of this item in the FP-Tree structure.

2.2 Illustrative Example

For illustration, we use an example with the transactions shown in Table 1. Let the minimum support threshold be set to 4. Phase 1 starts by accumulating the support of all items that occur in the transactions. Step 2 of phase 1 removes all non-frequent items, in our example (G, H, I, J, K, L, M, N, O, P, Q and R), leaving only the frequent items (A, B, C, D, E, and F). Finally, all frequent items are sorted according to their support to generate the sorted frequent 1-itemset. This last step ends phase 1 (Figure 1) of the COFI-tree algorithm and starts the second phase. In phase 2, the first transaction (A, G, D, C, B) is filtered to consider only the frequent items that occur in the header table (i.e. A, D, C and B). This frequent list is sorted according to the items' supports (A, B, C and D). This ordered transaction generates the first path of the FP-Tree, with all item-node supports initially equal to 1. A link is established between each item-node in the tree and its corresponding item entry in the header table. The same procedure is executed for the second transaction (B, C, H, E, and D), which yields a sorted frequent item list (B, C, D, E) that forms the second path of the FP-Tree. Transaction 3 (B, D, E, A, and M) yields the sorted frequent item list (A, B, D, E), which shares the same prefix (A, B) with an existing path in the tree. The supports of item-nodes (A and B) are incremented by 1, making the support of (A) and (B) equal to 2, and a new sub-path is created with the remaining items on the list (D, E), all with support equal to 1. The same process occurs for all transactions until the FP-Tree for the transactions given in Table 1 is built. Figure 2 shows the result of the tree-building process. Notice that in our tree structure, contrary to the original FP-Tree [11], the links are bi-directional. This, and other differences presented later, are used by our mining algorithm.

3 Co-Occurrence Frequent-Item-trees: Construction, Pruning and Mining

Our approach for computing frequencies relies first on building independent, relatively small trees for each frequent item in the header table of the FP-Tree, called COFI-trees. A pruning technique is applied to remove all non-frequent items with respect to the main frequent item of the tested COFI-tree. Then we mine each of the trees separately as soon as they are built, minimizing the candidacy generation and without building conditional sub-trees recursively. The trees are discarded as soon as they are mined.
Figure 2. Frequent Pattern Tree (the root has children A 11, B 4 and C 3; the header table lists F 7, E 8, D 9, C 10, B 10, A 11, each entry linked into its chain of item-nodes)

At any given time, only one COFI-tree is present in main memory. In the following examples we always assume that we are building the COFI-trees based on the modified FP-Tree data structure presented above.

3.1 Pruning the COFI-trees

Pruning can be done after building a tree or, even better, while building it. We opted for pruning on the fly, since the overhead is minimal but the consequence is a drastic reduction in memory requirements. We will discuss the pruning idea, then present the building algorithm that performs the pruning on the fly.

In this section we introduce a new anti-monotone property called the global frequent/local non-frequent property. This property is similar to the Apriori one in the sense that it eliminates, at the ith level, all non-frequent items that will not participate in the (i+1) level of candidate itemset generation. The difference between the two properties is that we extended our property to also eliminate frequent items which are among the i-itemset and which we are sure will not participate in the (i+1) candidate set. The Apriori property states that all nonempty subsets of a frequent itemset must also be frequent. An example is given later in this section to illustrate both properties. In our approach, we are trying to find all frequent patterns with respect to one frequent item, which is the base item of the tested COFI-tree. We already know that all items that participate in the creation of the COFI-tree are frequent with respect to the global transaction database, but that does not mean that they are also locally frequent with respect to the base item of the COFI-tree. The global frequent/local non-frequent property states that all nonempty subsets of a frequent itemset with respect to the item A of the A-COFI-tree must also be frequent with respect to item A. For each frequent item A, we traverse the FP-Tree to find all frequent items that occur with A in at least one transaction (or branch in the FP-Tree), together with their numbers of occurrences. All items that are locally frequent with item A will participate in building the A-COFI-tree; other globally frequent but locally non-frequent items will not participate in the creation of the A-COFI-tree. In our example we can find that all items that participate in the creation of the F-COFI-tree are locally non-frequent with respect to item F, as the supports of all these items do not exceed the support threshold σ, which is equal to 4 (Figure 3). From this, there is no need to mine the F-COFI-tree: we already know that no frequent pattern other than the item F itself will be generated. We can extend our knowledge at this stage to conclude that item F will not appear in any of the frequent patterns. The COFI-tree for item E indicates that only items D and B are frequent with respect to item E, which means that there is no need to test patterns such as EC and EA. The COFI-tree for item D indicates that item C will be eliminated, as it is not frequent with respect to item D. The C-COFI-tree ignores item B for the same reason. To sum up, in our example of 6 frequent 1-itemsets, the Apriori property requires generating 15 candidate 2-itemsets, which are (A,B), (A,C), (A,D), (A,E), (A,F), (B,C), (B,D), (B,E), (B,F), (C,D), (C,E), (C,F), (D,E), (D,F), (E,F). Using our property, we have eliminated (i.e. neither generated nor counted) 9 patterns, which are (A,E), (A,F), (B,C), (B,F), (C,D), (C,E), (C,F), (D,F), (E,F), leaving only 6 patterns to test, which are (A,B), (A,C), (A,D), (B,D), (B,E), (D,E).

3.2 Construction of the Co-Occurrence Frequent-Item-trees

The small COFI-trees we build are similar to the conditional FP-Trees [11] in general, in the sense that they have a header with ordered frequent items and horizontal pointers pointing to a succession of nodes containing the same frequent item, and the prefix tree per se with paths representing sub-transactions. However, the COFI-trees have bi-directional links in the tree allowing bottom-up scanning as well, and the nodes contain not only the item label and a frequency counter, but also a participation counter, as explained later in this section. The COFI-tree for a given frequent item x contains only nodes labeled with items that are more frequent than or as frequent as x.

To illustrate the idea of the COFI-trees, we will explain step by step the process of creating COFI-trees for the FP-Tree of Figure 2. In our example, the first Co-Occurrence Frequent Item tree is built for item F, as it is the least frequent item in the header table. In this tree for F, all frequent items which are more frequent than F and share transactions with F participate in building the tree. They can be found by following the chain of item F in the FP-Tree structure. The F-COFI-tree starts with the root node containing the item in question; then a scan of part of the FP-Tree is applied, following the chain of the F item in the FP-Tree. The first branch, FA, has a frequency of 1, as the frequency of a branch is the frequency of the test item, which is F. The goal of this traversal is to count the frequency of each frequent item with respect to item F. By doing so we can find that item E occurs 4 times, D occurs 2 times, C occurs 4 times, B 2 times, and A 3 times. By applying the anti-monotone constraint property, we can predict that item F will never appear in any frequent pattern except by itself. Consequently, there is no need to continue building the F-COFI-tree.

The next frequent item to test is E. The same process is done to compute the frequency of each frequent item with respect to item E. From this we can find that only two globally frequent items are also locally frequent, which are (D:5 and B:6). For each sub-transaction or branch in the FP-Tree containing item E together with other locally frequent items that are more frequent than E (which are parent nodes of E), a branch is formed starting from the root node E. The support of this branch is equal to the support of the E node in its corresponding branch in the FP-Tree. If multiple frequent items share the same prefix, they are merged into one branch and a counter for each node of the tree is adjusted accordingly. Figure 3 illustrates all COFI-trees for the frequent items of Figure 2. In Figure 3, the rectangular nodes are nodes from the tree with an item label and two counters. The first counter is a support-count for that node, while the second counter, called the participation-count, is initialized to 0 and is used by the mining algorithm discussed later. Each node also has a horizontal link which points to the next node that has the same item-name in the tree, and a bi-directional vertical link that links a child node with its parent and a parent with its child. The bi-directional pointers facilitate the mining process by making the traversal of the tree easier. The squares are actually cells from the header table, as with the FP-Tree. This is a list made of all frequent items that participate in building the tree structure, sorted in ascending order of their global support. Each entry in this list contains the item-name, an item-counter, and a pointer to the first node in the tree that has the same item-name.

Figure 3. COFI-trees (the F-, E-, D-, C- and B-COFI-trees; each node shows its item label with a support-count and a participation-count, e.g. the E-COFI-tree root E (8 0) with the branches D (5 0)-B (5 0) and B (1 0))

To explain the COFI-tree building process, we will highlight the building steps for the E-COFI-tree in Figure 3. Frequent item E is read from the header table and its first location in the FP-Tree is found using the pointer in the header table. The first location of item E indicates that it shares a branch with items CA, with support = 2; since none of these items are locally frequent, only the support of the E root node is incremented by 2. The second node of item E indicates that it shares items DBA, with support equal to 2 for this branch, as the support of the E-item is considered the support of this branch (following the upper links for this item). Two nodes are created, for items D and B, with support equal to 2; D is a child node of B, and B is a child node of E. The third location of E indicates EDB:1, which shares an existing branch in the E-COFI-tree, so all counters are adjusted accordingly. A new branch, EB:1, is created, as the support of E = 1 for the fourth occurrence of E. The final occurrence, EDB:2, uses an existing branch, and only the counters are adjusted. As with FP-Trees, the header constitutes a list of all frequent items that maintains the location of the first entry for each item in the COFI-tree. A link is also made from each node in the tree to the next location of the same item in the tree, if it exists. The mining process is the last step done on the E-COFI-tree before removing it and creating the next COFI-tree for the next item in the header table.
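The local-frequency counting that drives the pruning of Sections 3.1 and 3.2 can be sketched as follows. This is an illustrative helper with names of our own (`local_counts`); for simplicity it rescans the transactions of Table 1, whereas the paper gathers the same counts by following the chain of the base item's nodes in the FP-Tree.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// For a base item x, count how often each globally frequent item ranked
// above x (i.e. at least as frequent as x) co-occurs with x. These local
// counts drive the global frequent/local non-frequent pruning: only items
// that remain frequent *with respect to x* take part in the x-COFI-tree.
std::map<char, int> local_counts(const std::vector<std::string>& db,
                                 char base, const std::string& ranked_above) {
    std::map<char, int> counts;
    for (char item : ranked_above) counts[item] = 0;
    for (const auto& t : db) {
        if (t.find(base) == std::string::npos) continue;  // base not present
        for (char item : ranked_above)
            if (t.find(item) != std::string::npos) ++counts[item];
    }
    return counts;
}
```

For base item E of Table 1, with the more frequent items D, C, B and A, this yields D:5, C:3, B:6 and A:4 — the header counts of the E-COFI-tree in Figure 3 — so only D and B survive the local-frequency test, and patterns such as EC and EA are never generated or counted.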
Figure 4. Steps needed to generate frequent patterns related to item E (the E-COFI-tree after each mining step, with the participation-counts updated; the resulting frequent patterns are ED:5, EB:6, EDB:5)

Figure 5. Steps needed to generate frequent patterns related to item D (the D-COFI-tree after each mining step; the resulting frequent patterns are DBA:5, DB:8, DA:5)

3.3 Mining the COFI-trees

The COFI-trees of all frequent items are not constructed together. Each tree is built, mined, then discarded before the next COFI-tree is built. The mining process is done for each tree independently, with the purpose of finding all frequent k-itemset patterns in which the item at the root of the tree participates.

The steps to produce the frequent patterns related to item E, for example (the F-COFI-tree will not be mined, based on the pruning results we found in the previous step), are illustrated in Figure 4. From each branch of the tree, using the support-count and the participation-count, candidate frequent patterns are identified and stored temporarily in a list. The non-frequent ones are discarded at the end, when all branches have been processed. The mining process for the E-COFI-tree starts from the most locally frequent item in the header table of the tree, which is item B. Item B exists in two branches in the E-COFI-tree, which are (B:5, D:5 and E:8) and (B:1 and E:8). The frequency of each branch is the frequency of the first item in the branch minus the participation value of the same node. Item B in the first branch has a frequency value of 5 and a participation value of 0, which makes the frequency of the first pattern, EDB, equal to 5. The participation values for all nodes in this branch are incremented by 5, which is the frequency of this pattern. From the first pattern, EDB:5, we need to generate all sub-patterns in which item E participates, which are ED:5, EB:5, and EDB:5. The second branch that has B generates the pattern EB:1. EB already exists, so its counter is adjusted to become 6. The COFI-tree of item E can be removed at this time, and the next tree can be generated and tested to produce all the frequent patterns related to its root node. The same process is executed to generate the frequent patterns. The D-COFI-tree (Figure 5) is created after the E-COFI-tree. Mining this tree generates the following frequent patterns: DBA:5, DA:5, and DB:8. The same process occurs for the remaining trees, which produce AC:6 for the C-COFI-tree and BA:6 for the B-COFI-tree.

The following is our algorithm for building and mining the COFI-trees with pruning.

Algorithm COFI: Creating with pruning and mining COFI-trees
Input: a modified FP-Tree and a minimum support threshold σ
Output: the full set of frequent patterns
Method:
1. A = the least frequent item in the header table of the FP-Tree
2. While there are still frequent items, do
   2.1 Count the frequency of all items that share a path with item (A). The frequencies of all items that share the same path are the same as the frequency of the (A) items
   2.2 Remove all non-locally-frequent items from the frequent list of item (A)
   2.3 Create a root node for the (A)-COFI-tree with both frequency-count and participation-count = 0
       2.3.1 C is the path of locally frequent items on the path from item A to the root
       2.3.2 Items on C form a prefix of the (A)-COFI-tree
       2.3.3 If the prefix is new, then set frequency-count = frequency of the (A) node and participation-count = 0 for all nodes in the path
       Else
       2.3.4 Adjust the frequency-count of the already existing part of the path
       2.3.5 Adjust the pointers of the header list if needed
       2.3.6 Find the next node for item A in the FP-Tree and go to 2.3.1
   2.4 MineCOFI-tree(A)
   2.5 Release the (A)-COFI-tree
   2.6 A = the next frequent item from the header table
3. Goto 2

Function MineCOFI-tree(A):
1. nodeA = select next node // selection of nodes starts with the node of the most locally frequent item, following its chain, then the next less frequent item with its chain, until we reach the least frequent item in the header list of the (A)-COFI-tree
2. While there are still nodes, do
   2.1 D = the set of nodes from nodeA to the root
   2.2 F = nodeA.frequency-count − nodeA.participation-count
   2.3 Generate all candidate patterns X from the items in D; patterns that do not contain A are discarded
   2.4 Patterns in X that do not exist in the A-Candidate-List are added to it with frequency = F; otherwise their frequency is incremented by F
   2.5 Increment the participation-count by F for all items in D
   2.6 nodeA = select next node
3. Goto 2
4. Based on the support threshold σ, remove non-frequent patterns from the A-Candidate-List

4 Experimental Studies

Figure 6. Mining a dataset of 500K transactions (no. of transactions = 500K, dimension = 10K, average no. of items per transaction = 12): (A) runtime in seconds, (B) total memory requirement, and (C) memory requirement without the FP-Tree, for COFI vs. FP-Growth at supports ranging from 0.1% down to 0.0004%

To study the COFI-tree mining strategies, we have conducted several experiments on a variety of data sizes, comparing our approach with the well-known FP-Growth [11] algorithm written by its original authors. The experiments were conducted on a 2.6 GHz CPU machine with 2 GB of memory running the Windows 2000 operating system. Transactions were generated using the IBM synthetic data generator [4]. We have conducted several types of experiments to test the effect of changing the support, transaction size, dimension, and transaction length. The first set of experiments was run on a transaction database of 500K transactions, with a dimension of 10K and an average transaction length of 12. We varied the support from an absolute value of 500 down to 2, for which the number of frequent patterns generated varied from 15K to 3400K patterns. FP-Growth could not mine the last experiment in this set, as it used all available memory space. In all experiments the COFI-tree approach outperforms the FP-Growth approach. The major accomplishment of our approach is in the memory space saved. Our algorithm outperforms FP-Growth by one order of magnitude in terms of memory space requirements. We have also tested the memory space used during the mining process only (i.e., isolating the memory space used to create the FP-Tree by both FP-Growth and our COFI-tree FP-Tree-based algorithm). We have found that the COFI-tree approach also outperforms FP-Growth by one order of magnitude in terms of the memory space used by the COFI-trees compared with the conditional trees used by FP-Growth during the mining process. Figure 6A presents the time needed to mine 500K transactions using different support levels. Figure 6B depicts the memory needed during the mining process of the previous experiments. Figure 6C illustrates the memory needed by
An optional file name for the out patterns. This code gen-
Table 2. Time and Memory Scalability with re- erates ALL frequent patterns from the provided input file.
spect to support on the T10I4D100K dataset The code scans the database twice. The goal of the first
database scan is to find the frequency of each item in this
Time in Seconds Memory in KB transactional database. These frequencies are stored in a
Support % COFI FP-Growth COFI FP-Growth data structure called Candidate-Items. Each entry of this
0.50 1.5 3.0 18 173 candidate items is a structure called ItemsStructure that is
0.25 1.7 5.2 19 285 made of two long integers representing the item and its fre-
0.10 2.7 12.3 26 289 quency. All frequent items are then stored in a special data
0.05 14.0 20.9 19 403 structure called F1-Items. This data structure is sorted in
descending order based on the frequency of each item. To
access the location of each item we map it with a specific
Other experiments were conducted to test the effect of changing the dimension, the transaction size, and the transaction length at the same support of 0.05%. Some of these experiments are presented in Figure 7. Figures 7A and 7B report the time needed for the mining process. Figures 7C and 7D report the memory needed for the whole mining process. Figures 7E and 7F report the memory needed by the COFI-trees or conditional trees during the mining process. In these experiments we varied the dimension, i.e. the number of distinct items, from 5K to 10K, the average transaction length from 12 to 24 items per transaction, and the number of transactions from 10K to 500K. All these experiments showed that our approach is one order of magnitude better than the FP-Growth approach in terms of memory usage.
We also ran experiments using the public UCI datasets provided on the FIMI workshop website: Mushroom, Chess, Connect, Pumsb, T40I10D100K, and T10I4D100K. The COFI algorithm scales relatively well vis-à-vis the support threshold on these datasets; the results are not reported here for lack of space. Our approach yielded good results at high support values on all datasets. However, like other approaches, at low support values, where the number of frequent patterns increases significantly, our approach faces some difficulties. For such cases it is recommended to discover closed itemsets or maximal patterns instead of all frequent itemsets: the sheer number of frequent itemsets becomes overwhelming, and some argue even useless, while closed and maximal itemsets represent all frequent patterns with the redundant ones eliminated. For illustration, Table 2 compares the CPU time and memory requirements of COFI and FP-Growth on the T10I4D100K dataset.
5 Implementations
The COFI-tree program submitted with this paper is C++ code. The executable runs with 3 parameters: (1) the path to the input file; (2) a positive integer giving the absolute support; (3) an optional file name for the output patterns. The code generates ALL frequent patterns from the provided input file.
The code scans the database twice. The goal of the first database scan is to find the frequency of each item in the transactional database. These frequencies are stored in a data structure called Candidate-Items; each of its entries is an ItemsStructure made of two long integers representing an item and its frequency. All frequent items are then stored in a special data structure called F1-Items, which is sorted in descending order of item frequency. To access the location of each item we map it to a specific slot using a new data structure called FindInHashTable. In brief, since we do not know the number of unique items before run time, and thus cannot pre-allocate an array for counting the items, we create blocks of p items rather than keeping a linked list of items; p could arbitrarily be 100 or 1000. Following links in a linked list each time to find and increment a counter would be expensive, whereas blocks of items are easily indexed. In the worst case, we lose the space of p − 1 unused entries.
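The block-based counting idea above can be sketched as follows. The class and member names here are illustrative stand-ins for the paper's Candidate-Items, ItemsStructure, and FindInHashTable structures, assuming a hash map provides the item-to-slot mapping:

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Counters are allocated in blocks of P entries because the number of
// unique items is unknown at run time; a hash map assigns each newly
// seen item a flat slot, and slot/P, slot%P index into the blocks.
struct ItemCount {                  // analogous to the paper's ItemsStructure
    long item = 0;
    long frequency = 0;
};

class BlockedCounter {
    static constexpr std::size_t P = 100;        // block size, e.g. 100 or 1000
    std::vector<std::vector<ItemCount>> blocks;  // grows one block at a time
    std::unordered_map<long, std::size_t> index; // item -> slot (FindInHashTable role)
public:
    void increment(long item) {
        auto it = index.find(item);
        std::size_t slot;
        if (it == index.end()) {                 // first occurrence of this item
            slot = index.size();
            if (slot % P == 0)                   // current block is full
                blocks.push_back(std::vector<ItemCount>(P));
            index[item] = slot;
            blocks[slot / P][slot % P].item = item;
        } else {
            slot = it->second;
        }
        blocks[slot / P][slot % P].frequency++;  // O(1) indexed access, no list walk
    }
    long frequency(long item) const {
        auto it = index.find(item);
        if (it == index.end()) return 0;
        return blocks[it->second / P][it->second % P].frequency;
    }
};
```

With p = 100, at most 99 counter slots are ever wasted, while every increment is a direct indexed access instead of a linked-list traversal.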
The second scan starts by eliminating all non-frequent items from each transaction read, and then sorting the transaction by the frequency of each frequent item; this is done in the Sort-Transaction method. The FP-tree is then built from the resulting sub-transactions of frequent items. The FP-tree data structure is a tree of n children. The structure struct FPTTree { long Element; long counter; FPTTree* child; FPTTree* brother; FPTTree* father; FPTTree* next; } is used to create each node of this tree: a link is created between each node and its first child, and the brother link maintains a linked list of all children of the same node, built in order of item frequency. The header list is maintained using the structure FrequentStruc { long Item; long Frequency; long COFIFrequency; long COFIFrequency1; FPTTree* first; COFITree* firstCOFI; }.
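A minimal, compilable sketch of this node structure and of child insertion is shown below. The field names follow the paper's FPTTree, but the insertChild helper is our own illustration (the paper does not give this routine), and for simplicity it pushes new children at the head of the sibling list rather than keeping it frequency-ordered:

```cpp
// Node of the FP-tree: children of a node form a singly linked list
// through the brother pointer, starting at the parent's child pointer.
struct FPTTree {
    long Element = 0;           // item id
    long counter = 0;           // support count of this path prefix
    FPTTree* child = nullptr;   // first child
    FPTTree* brother = nullptr; // next sibling
    FPTTree* father = nullptr;  // parent, used when walking a path upward
    FPTTree* next = nullptr;    // next node holding the same item (header chain)
};

// Return the child of `parent` carrying `item`, creating it if absent,
// and bump its counter as one more sub-transaction passes through it.
FPTTree* insertChild(FPTTree* parent, long item) {
    for (FPTTree* c = parent->child; c != nullptr; c = c->brother)
        if (c->Element == item) { c->counter++; return c; }
    FPTTree* n = new FPTTree;
    n->Element = item;
    n->counter = 1;
    n->father = parent;
    n->brother = parent->child; // prepend to the sibling list
    parent->child = n;
    return n;
}
```

Inserting a sorted sub-transaction then amounts to calling insertChild repeatedly, each time descending into the returned node.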
After building the FP-tree, we build the first COFI-tree by selecting the least frequent item in the frequent list. A scan is made over the FP-tree, starting from the linked list of this item, to find the frequency of the other items with respect to this item. The COFI-tree is then created from only the locally frequent items. Finally, frequent patterns are generated and stored in the FrequentTree data structure; every node with support greater than or equal to the given support represents a frequent pattern. The COFI-tree and the FrequentTree are then removed from memory, and the next COFI-tree is created, until all COFI-trees have been mined.
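The per-item step above, counting co-occurrence frequencies with respect to the selected item and keeping only the locally frequent ones, can be illustrated by the toy function below. For clarity it scans a plain transaction list instead of walking the FP-tree from the item's header chain, and all names are ours:

```cpp
#include <algorithm>
#include <iterator>
#include <map>
#include <vector>

// For a target item, count how often every other item co-occurs with
// it, then prune the counts that fall below the (local) support. In the
// actual implementation this information comes from FP-tree paths, not
// from rescanning transactions.
std::map<long, long> locallyFrequent(
        const std::vector<std::vector<long>>& transactions,
        long target, long minSupport) {
    std::map<long, long> freq;  // co-occurrence counts w.r.t. target
    for (const auto& t : transactions) {
        if (std::find(t.begin(), t.end(), target) == t.end()) continue;
        for (long item : t)
            if (item != target) freq[item]++;
    }
    // Keep only the locally frequent items; these form the COFI-tree.
    for (auto it = freq.begin(); it != freq.end(); )
        it = (it->second < minSupport) ? freq.erase(it) : std::next(it);
    return freq;
}
```

Items surviving this pruning are exactly those that can still appear in a frequent pattern together with the target item, which is why the COFI-tree built from them stays small.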
One interesting implementation improvement is that the participation counter was also added to the header table of the COFI-tree; this counter accumulates the participation of the item in all patterns already discovered in the current COFI-tree. The difference between the participation in a node and the participation in the header is that the counter in a node counts the participation of the node's item in all paths in which that node appears, while the new counter in the COFI-tree header counts the participation of the item globally in the tree. This trick does not compromise the effectiveness or usefulness of participation counting. One main advantage of this counter is that it allows looking ahead to see whether all nodes of a specific item have already been traversed, thereby avoiding unneeded scans of the COFI-tree.
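A sketch of that look-ahead test, with illustrative types of our own (the paper keeps these counters in its header entries, and we assume the test is that the accumulated participation has reached the item's total frequency):

```cpp
#include <vector>

// Header entry of a COFI-tree: total frequency of the item in the tree
// and the participation accumulated over all patterns found so far.
struct CofiHeaderEntry {
    long item;
    long frequency;
    long participation;
};

// Once participation catches up with frequency, every occurrence of the
// item has been consumed by discovered patterns, so scans that start
// from this item's nodes can be skipped.
bool fullyProcessed(const CofiHeaderEntry& e) {
    return e.participation >= e.frequency;
}

// How many items of this COFI-tree still need traversal.
long remainingItems(const std::vector<CofiHeaderEntry>& header) {
    long n = 0;
    for (const auto& e : header)
        if (!fullyProcessed(e)) ++n;
    return n;
}
```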
6 Conclusion and future work

The COFI algorithm, based on the COFI-tree structure we propose in this paper, is one order of magnitude better than the FP-Growth algorithm in terms of memory usage, and sometimes in terms of speed. The algorithm achieves these results thanks to: (1) the non-recursive technique used during the mining process, in which a simple traversal of the COFI-tree generates the full set of frequent patterns; and (2) the pruning method used to remove all locally non-frequent items, leaving the COFI-tree with only locally frequent items.
The major advantage of our algorithm COFI over FP-Growth is that it needs a significantly smaller memory footprint, and thus can mine larger transactional databases with less main memory available. The fundamental difference is that COFI seeks a compromise between the fully pattern-growth approach that FP-Growth adopts and the total candidacy-generation approach that Apriori is known for. COFI grows targeted patterns but performs a reduced and focused generation of candidates during the mining. This avoids the recursion that FP-Growth uses, which is notorious for blowing the stack on large datasets.
We have developed algorithms for closed itemset mining and maximal itemset mining based on our COFI-tree approach. However, their efficient implementations were not ready by the deadline of this workshop. These algorithms and their experimental results will be compared to existing algorithms such as CHARM [17], MAFIA [6], and CLOSET+ [15], and will be reported in the future.
7 Acknowledgments

This research is partially supported by a Research Grant from NSERC, Canada.

References

[1] R. Agarwal, C. Aggarwal, and V. Prasad. A tree projection algorithm for generation of frequent itemsets. Parallel and Distributed Computing, 2000.
[2] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. 1993 ACM-SIGMOD Int. Conf. Management of Data, pages 207–216, Washington, D.C., May 1993.
[3] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. 1994 Int. Conf. Very Large Data Bases, pages 487–499, Santiago, Chile, September 1994.
[4] IBM Almaden. Quest synthetic data generation code. http://www.almaden.ibm.com/cs/quest/syndata.html.
[5] M.-L. Antonie and O. R. Zaïane. Text document categorization by term association. In IEEE International Conference on Data Mining, pages 19–26, December 2002.
[6] C. Burdick, M. Calimlim, and J. Gehrke. MAFIA: A maximal frequent itemset algorithm for transactional databases. In IEEE International Conference on Data Mining (ICDM 01), April 2001.
[7] M. El-Hajj and O. R. Zaïane. Inverted matrix: Efficient discovery of frequent items in large datasets in the context of interactive mining. In Proc. 2003 Int'l Conf. on Data Mining and Knowledge Discovery (ACM SIGKDD), August 2003.
[8] M. El-Hajj and O. R. Zaïane. Non recursive generation of frequent k-itemsets from frequent pattern tree representations. In Proc. of the 5th International Conference on Data Warehousing and Knowledge Discovery (DaWaK 2003), September 2003.
[9] M. El-Hajj and O. R. Zaïane. Parallel association rule mining with minimum inter-processor communication. In Fifth International Workshop on Parallel and Distributed Databases (PaDD 2003), in conjunction with the 14th Int'l Conf. on Database and Expert Systems Applications (DEXA 2003), September 2003.
[10] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco, CA, 2001.
[11] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In ACM-SIGMOD, Dallas, 2000.
[12] J. Hipp, U. Güntzer, and G. Nakhaeizadeh. Algorithms for association rule mining: a general survey and comparison. ACM SIGKDD Explorations, 2(1):58–64, June 2000.
[13] J. Liu, Y. Pan, K. Wang, and J. Han. Mining frequent item sets by opportunistic projection. In Eighth ACM SIGKDD International Conf. on Knowledge Discovery and Data Mining, pages 229–238, Edmonton, Alberta, August 2002.
[14] J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang. H-Mine: Hyper-structure mining of frequent patterns in large databases. In ICDM, pages 441–448, 2001.
[15] J. Wang, J. Han, and J. Pei. CLOSET+: Searching for the best strategies for mining frequent closed itemsets. In 9th ACM SIGKDD International Conf. on Knowledge Discovery and Data Mining, July 2003.
[16] O. R. Zaïane, J. Han, and H. Zhu. Mining recurrent items in multimedia with progressive resolution refinement. In Int. Conf. on Data Engineering (ICDE 2000), pages 461–470, San Diego, CA, February 2000.
[17] M. Zaki and C.-J. Hsiao. CHARM: An efficient algorithm for closed itemset mining. In 2nd SIAM International Conference on Data Mining, April 2002.
[Figure 7. Mining datasets of different sizes. D = dimension, L = average number of items per transaction; support = 0.05%. Panels (A) D=5K, L=12 and (B) D=10K, L=24 plot the time in seconds; panels (C) D=5K, L=12 and (D) D=10K, L=24 plot the memory usage in KB; panels (E) D=5K, L=12 and (F) D=10K, L=24 plot the memory used by the COFI-trees or conditional trees; each panel compares COFI and FP-Growth against the number of transactions (10K to 500K).]