=Paper= {{Paper |id=Vol-90/paper-15 |storemode=property |title=COFI-tree Mining: A New Approach to Pattern Growth with Reduced Candidacy Generation |pdfUrl=https://ceur-ws.org/Vol-90/elhajj.pdf |volume=Vol-90 |dblpUrl=https://dblp.org/rec/conf/fimi/ZaianeE03 }} ==COFI-tree Mining: A New Approach to Pattern Growth with Reduced Candidacy Generation== https://ceur-ws.org/Vol-90/elhajj.pdf
COFI-tree Mining: A New Approach to Pattern Growth with Reduced Candidacy Generation

Mohammad El-Hajj and Osmar R. Zaïane
Department of Computing Science, University of Alberta, Edmonton, AB, Canada
mohammad@cs.ualberta.ca, zaiane@cs.ualberta.ca

Abstract

Existing association rule mining algorithms suffer from many problems when mining massive transactional datasets. Some of these major problems are: (1) the repetitive I/O disk scans, (2) the huge computation involved during candidacy generation, and (3) the high memory dependency. This paper presents the implementation of our frequent itemset mining algorithm, COFI, which achieves its efficiency by applying four new ideas. First, it can mine using compact memory-based data structures. Second, for each frequent item, a relatively small independent tree is built summarizing co-occurrences. Third, clever pruning reduces the search space drastically. Finally, a simple and non-recursive mining process reduces the memory requirements, as only minimal candidacy generation and counting is needed to generate all relevant frequent patterns.

1 Introduction

Frequent pattern discovery has become a common topic of investigation in the data mining research area. Its main theme is to discover the sets of items that occur together more than a given threshold defined by the decision maker. A well-known application domain that counts on frequent pattern discovery is market basket analysis. In most cases, when the support threshold is low and the number of frequent patterns "explodes", the discovery of these patterns becomes problematic for reasons such as high memory dependencies, huge search space, and massive I/O requirements. Recent studies have proposed ways to reduce the memory requirements [8] and to decrease the I/O dependencies [7], but more promising issues still need to be investigated, such as pruning techniques to reduce the search space. In this paper we introduce a new method for frequent pattern discovery that is based on the Co-Occurrence Frequent Item tree concept [8, 9]. The new proposed method uses a pruning technique that dramatically saves memory space. These relatively small trees are constructed based on a memory-based structure called the FP-Tree [11]. This data structure is studied in detail in the following sections. In short, we introduced in [8] the COFI-tree structure and an algorithm to mine it. In [7] we presented a disk-based data structure, the inverted matrix, that replaces the memory-based FP-Tree and scales interactive frequent pattern mining significantly. Our contributions in this paper are the introduction of a clever pruning technique based on an interesting property drawn from our top-down approach, and some implementation tricks and issues. We included the pruning in the algorithm that builds the tree, so that the pruning is done on the fly.

1.1 Problem Statement

The problem of mining association rules in the context of market basket analysis was introduced in [2]. The problem consists of finding associations between items or itemsets in transactional data. The data could be retail sales in the form of customer transactions or even medical images [16]. Association rules have been shown to be useful for other applications such as recommender systems, diagnosis, decision support, telecommunication, and even supervised classification [5]. Formally, as defined in [3], the problem is stated as follows: Let I = {i1, i2, ..., im} be a set of literals, called items, where m is considered the dimensionality of the problem. Let D be a set of transactions, where each transaction T is a set of items such that T ⊆ I. A unique identifier TID is given to each transaction. A transaction T is said to contain X, a set of items in I, if X ⊆ T. An association rule is an implication of the form "X ⇒ Y", where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅. An itemset X is said to be large or frequent if its support s is greater than or equal to a given minimum support threshold σ. An itemset X satisfies a constraint C if and only if C(X) is true. The rule X ⇒ Y has a support s in the transaction set D if s% of the transactions in D contain X ∪ Y. In other words, the support of the
rule is the probability that X and Y hold together among all the possible presented cases. It is said that the rule X ⇒ Y holds in the transaction set D with confidence c if c% of the transactions in D that contain X also contain Y. In other words, the confidence of the rule is the conditional probability that the consequent Y is true under the condition of the antecedent X. The problem of discovering all association rules from a set of transactions D consists of generating the rules that have a support and confidence greater than a given threshold. These rules are called strong rules. This association-mining task can be broken into two steps:
1. A step for finding all frequent k-itemsets, known for its extreme I/O scan expense and massive computational costs;
2. A straightforward step for generating strong rules.
In this paper and our attached code, we focus exclusively on the first step: generating frequent itemsets.

1.2 Related Work

Several algorithms have been proposed in the literature to address the problem of mining association rules [12, 10]. One of the key algorithms, which seems to be the most popular in many applications for enumerating frequent itemsets, is the Apriori algorithm [3]. This algorithm also forms the foundation of most known algorithms. It uses an anti-monotone property stating that for a k-itemset to be frequent, all its (k-1)-itemsets have to be frequent. The use of this fundamental property reduces the computational cost of candidate frequent itemset generation. However, in the case of extremely large input sets with a big frequent 1-itemset, the Apriori algorithm still suffers from two main problems: repeated I/O scanning and high computational cost. One major hurdle observed with most real datasets is the sheer size of the candidate frequent 2-itemsets and 3-itemsets.

TreeProjection is an efficient algorithm presented in [1]. This algorithm builds a lexicographic tree in which each node represents a frequent pattern. The authors report that their algorithm is one order of magnitude faster than the existing techniques in the literature. Another innovative approach for discovering frequent patterns in transactional databases, FP-Growth, was proposed by Han et al. in [11]. This algorithm creates a compact tree structure, the FP-Tree, representing frequent patterns, which alleviates the multi-scan problem and improves candidate itemset generation. The algorithm requires only two full I/O scans of the dataset to build the prefix tree in main memory, and then mines this structure directly. The authors report that their algorithm is faster than the Apriori and TreeProjection algorithms. Mining the FP-Tree structure is done recursively by building conditional trees that are of the same order of magnitude in number as the frequent patterns. This massive creation of conditional trees makes the algorithm not scalable to mine large datasets beyond a few million transactions. In [14] the same authors propose a new algorithm, H-mine, that invokes the FP-Tree to mine condensed data. This algorithm is still not scalable, as reported by its authors in [13].

1.3 Preliminaries, Motivations and Contributions

The Co-Occurrence Frequent Item tree (or COFI-tree for short) and the COFI algorithm presented in this paper are based on our previous work in [7, 8]. The main motivation of our current research is the pruning technique that reduces the memory space needed by the COFI-trees. The presented algorithm works in two phases, in which phase 1 requires two full I/O scans of the transactional database to build the FP-Tree structure [11]. The second phase starts by building a small Co-Occurrence Frequent tree for each frequent item. These trees are first pruned to eliminate any items that are non-frequent with respect to the COFI-tree's base frequent item. Finally, the mining process is executed.

The remainder of this paper is organized as follows: Section 2 describes the Frequent Pattern tree, its design and construction. Section 3 illustrates the design, construction, pruning, and mining of the Co-Occurrence Frequent Item trees. Section 4 presents the implementation procedure of this algorithm. Experimental results are given in Section 5. Finally, Section 6 concludes by discussing some issues and highlighting our future work.

2 Frequent Pattern Tree: Design and Construction

The COFI-tree approach we propose consists of two main stages. Stage one is the construction of a modified Frequent Pattern tree. Stage two is the repetitive building of small data structures, the actual mining of these data structures, and their release.

2.1 Construction of the Frequent Pattern Tree

The goal of this stage is to build the compact data structure called the Frequent Pattern Tree [11]. This construction is done in two phases, where each phase requires a full I/O scan of the dataset. A first initial scan of the database identifies the frequent 1-itemsets. The goal is to generate an ordered list of frequent items that will be used when building the tree in the second phase.

This phase starts by enumerating the items appearing in the transactions. After enumerating these items (i.e. after reading the whole dataset), infrequent items with a support less than the support threshold are weeded out, and the remaining frequent items are sorted by their frequency. This
list is organized in a table, called the header table, where the items and their respective supports are stored along with pointers to the first occurrence of each item in the frequent pattern tree. Phase 2 then constructs the frequent pattern tree.

Table 1. Transactional database

T.No.  Items
T1     A G D C B
T2     B C H E D
T3     B D E A M
T4     C E F A N
T5     A B N O P
T6     A C Q R G
T7     A C H I G
T8     L E F K B
T9     A F M N O
T10    C F P G R
T11    A D B H I
T12    D E B K L
T13    M D C G O
T14    C F P Q J
T15    B D E F I
T16    J E B A D
T17    A K E F C
T18    C D L B A

Figure 1. Steps of phase 1. (Step 1 counts every item: A:11, B:10, C:10, D:9, E:8, F:7, G:4, H:3, I:3, J:3, K:3, L:3, M:3, N:3, O:3, P:3, Q:2, R:2. Step 2 keeps only the frequent items. Step 3 sorts them by support: A:11, B:10, C:10, D:9, E:8, F:7.)

Phase 2 of constructing the Frequent Pattern tree structure is the actual building of this compact tree. This phase requires a second complete I/O scan of the dataset. For each transaction read, only the set of frequent items present in the header table is collected and sorted in descending order of frequency. These sorted transaction items are used in constructing the FP-Tree as follows: for the first item of the sorted transaction, check whether it exists as one of the children of the root. If it exists, increment the support of this node. Otherwise, add a new node for this item as a child of the root node with a support of 1. Then, consider the current item node as the new temporary root and repeat the same procedure with the next item of the sorted transaction. During the process of adding any new item-node to the FP-Tree, a link is maintained between this item-node in the tree and its entry in the header table. The header table holds one pointer per item, pointing to the first occurrence of this item in the FP-Tree structure.

2.2 Illustrative Example

For illustration, we use an example with the transactions shown in Table 1. Let the minimum support threshold be set to 4. Phase 1 starts by accumulating the support of all items that occur in the transactions. Step 2 of phase 1 removes all non-frequent items, in our example (G, H, I, J, K, L, M, N, O, P, Q and R), leaving only the frequent items (A, B, C, D, E, and F). Finally, all frequent items are sorted according to their support to generate the sorted frequent 1-itemset. This last step ends phase 1 of the COFI-tree algorithm (Figure 1) and starts the second phase. In phase 2, the first transaction (A, G, D, C, B) is filtered to consider only the frequent items that occur in the header table (i.e. A, D, C and B). This frequent list is sorted according to the items' supports (A, B, C and D). This ordered transaction generates the first path of the FP-Tree, with all item-node supports initially equal to 1. A link is established between each item-node in the tree and its corresponding item entry in the header table. The same procedure is executed for the second transaction (B, C, H, E, and D), which yields a sorted frequent item list (B, C, D, E) that forms the second path of the FP-Tree. Transaction 3 (B, D, E, A, and M) yields the sorted frequent item list (A, B, D, E) that shares the same prefix (A, B) with an existing path in the tree. The supports of item-nodes (A and B) are incremented by 1, making the support of (A) and (B) equal to 2, and a new sub-path is created with the remaining items of the list (D, E), all with support equal to 1. The same process occurs for all transactions until we build the FP-Tree for the transactions given in Table 1. Figure 2 shows the result of the tree building process. Notice that in our tree structure, contrary to the original FP-Tree [11], our links are bi-directional. This, and other differences presented later, are used by our mining algorithm.

3 Co-Occurrence Frequent-Item-trees: Construction, Pruning and Mining

Our approach for computing frequencies relies first on building independent, relatively small trees, called COFI-trees, for each frequent item in the header table of the FP-Tree. A pruning technique is applied to remove all items that are non-frequent with respect to the main frequent item of the tested COFI-tree. Then we mine each of the trees separately as soon as they are built, minimizing candidacy generation and without recursively building conditional sub-trees. The trees are discarded as soon as they are mined. At
Figure 2. Frequent Pattern Tree. (The header table lists A:11, B:10, C:10, D:9, E:8, F:7; the root has children A:11, B:4 and C:3, with sub-paths built from the transactions of Table 1.)
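The two-phase construction of Sections 2.1 and 2.2 can be sketched as follows. This is a minimal illustration, not our actual implementation: the function name `build_fp_tree` and the dict-based node representation `[support, children]` are ours, the header-table pointers and bi-directional links of the real structure are omitted for brevity, and the threshold test uses the "greater than or equal to σ" definition of Section 1.1.

```python
from collections import defaultdict

# Table 1 transactions (each string lists one transaction's items).
TABLE1 = ["AGDCB", "BCHED", "BDEAM", "CEFAN", "ABNOP", "ACQRG",
          "ACHIG", "LEFKB", "AFMNO", "CFPGR", "ADBHI", "DEBKL",
          "MDCGO", "CFPQJ", "BDEFI", "JEBAD", "AKEFC", "CDLBA"]

def build_fp_tree(transactions, min_support):
    # Phase 1 (first I/O scan): count every item, weed out the
    # infrequent ones, and sort the survivors by descending support.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    frequent = {i: c for i, c in counts.items() if c >= min_support}
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {i: r for r, i in enumerate(order)}

    # Phase 2 (second I/O scan): keep only the frequent items of each
    # transaction, sort them by rank, and insert the result as a path;
    # transactions sharing a prefix share nodes and bump their counts.
    root = {}  # child item -> [support, children]
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in frequent), key=rank.get):
            child = node.setdefault(item, [0, {}])
            child[0] += 1
            node = child[1]
    return root, order
```

On Table 1 this reproduces the shape of Figure 2: the sorted frequent 1-itemset is (A, B, C, D, E, F), and the root's children carry supports A:11, B:4 and C:3.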


any given time, only one COFI-tree is present in main memory. In the following examples we always assume that we are building the COFI-trees based on the modified FP-Tree data structure presented above.

3.1 Pruning the COFI-trees

Pruning can be done after building a tree or, even better, while building it. We opted for pruning on the fly, since the overhead is minimal but the consequence is a drastic reduction in memory requirements. We will discuss the pruning idea, then present the building algorithm that performs the pruning on the fly.

In this section we introduce a new anti-monotone property called the global frequent/local non-frequent property. This property is similar to the Apriori one in the sense that it eliminates at the ith level all non-frequent items that will not participate in the (i+1) level of candidate itemset generation. The difference between the two properties is that we extended our property to also eliminate frequent items which are among the i-itemset and which, we are sure, will not participate in the (i+1) candidate set. The Apriori property states that all nonempty subsets of a frequent itemset must also be frequent. An example is given later in this section to illustrate both properties. In our approach, we are trying to find all frequent patterns with respect to one frequent item, which is the base item of the tested COFI-tree. We already know that all items that participate in the creation of the COFI-tree are frequent with respect to the global transaction database, but that does not mean that they are also locally frequent with respect to the base item of the COFI-tree. The global frequent/local non-frequent property states that all nonempty subsets of a frequent itemset with respect to the item A of the A-COFI-tree must also be frequent with respect to item A. For each frequent item A, we traverse the FP-Tree to find all frequent items that occur with A in at least one transaction (or branch of the FP-Tree), along with their numbers of occurrences. All items that are locally frequent with item A will participate in building the A-COFI-tree; the other globally frequent but locally non-frequent items will not participate in the creation of the A-COFI-tree. In our example we can find that all items that participate in the creation of the F-COFI-tree are locally non-frequent with respect to item F, as the supports of all these items are not greater than the support threshold σ, which is equal to 4 (Figure 3). Knowing this, there is no need to mine the F-COFI-tree: we already know that no frequent pattern other than the item F itself will be generated. We can extend our knowledge at this stage to conclude that item F will not appear in any of the frequent patterns. The COFI-tree for item E indicates that only items D and B are frequent with respect to item E, which means that there is no need to test patterns such as EC and EA. The COFI-tree for item D indicates that item C will be eliminated, as it is not frequent with respect to item D. The C-COFI-tree ignores item B for the same reason. To sum up, with our example of 6 frequent 1-itemsets, the Apriori property requires generating 15 candidate 2-itemsets, namely (A,B), (A,C), (A,D), (A,E), (A,F), (B,C), (B,D), (B,E), (B,F), (C,D), (C,E), (C,F), (D,E), (D,F), (E,F). Using our property, we have eliminated (neither generated nor counted) 9 patterns, namely (A,E), (A,F), (B,C), (B,F), (C,D), (C,E), (C,F), (D,F), (E,F), leaving only 6 patterns to test: (A,B), (A,C), (A,D), (B,D), (B,E), (D,E).

3.2 Construction of the Co-Occurrence Frequent-Item-trees

The small COFI-trees we build are similar to the conditional FP-Trees [11] in the sense that they have a header with ordered frequent items and horizontal pointers pointing to a succession of nodes containing the same frequent item, and a prefix tree per se with paths representing sub-transactions. However, the COFI-trees have bi-directional links in the tree, allowing bottom-up scanning as well, and the nodes contain not only the item label and a frequency counter, but also a participation counter, as explained later in this section. The COFI-tree for a given frequent item x contains only nodes labeled with items that are more frequent than, or as frequent as, x.

To illustrate the idea of the COFI-trees, we will explain step by step the process of creating COFI-trees for the FP-Tree of Figure 2. In our example, the first Co-Occurrence Frequent Item tree is built for item F, as it is the least frequent item in the header table. In this tree for F, all frequent items which are more frequent than F and share transac-
tions with F participate in building the tree. These items can be found by following the chain of item F in the FP-Tree structure. The F-COFI-tree starts with the root node containing the item in question; then a scan of part of the FP-Tree is applied, following the chain of the F item in the FP-Tree. The first branch, FA, has a frequency of 1, as the frequency of a branch is the frequency of the test item, which is F. The goal of this traversal is to count the frequency of each frequent item with respect to item F. By doing so we find that item E occurs 4 times, D occurs 2 times, C occurs 4 times, B 2 times, and A 3 times. By applying the anti-monotone property we can predict that item F will never appear in any frequent pattern except by itself. Consequently, there is no need to continue building the F-COFI-tree.

The next frequent item to test is E. The same process is done to compute the frequency of each frequent item with respect to item E. From this we find that only two globally frequent items are also locally frequent, namely (D:5 and B:6). For each sub-transaction or branch of the FP-Tree containing item E together with other locally frequent items that are more frequent than E (which are parent nodes of E), a branch is formed starting from the root node E. The support of this branch is equal to the support of the E node in its corresponding branch of the FP-Tree. If multiple frequent items share the same prefix, they are merged into one branch and a counter for each node of the tree is adjusted accordingly. Figure 3 illustrates all COFI-trees for the frequent items of Figure 2. In Figure 3, the rectangular nodes are nodes of the tree with an item label and two counters. The first counter is a support count for that node, while the second counter, called the participation count, is initialized to 0 and is used by the mining algorithm discussed later. Each node also carries a horizontal link, which points to the next node that has the same item-name in the tree, and a bi-directional vertical link that links a child node with its parent and a parent with its child. The bi-directional pointers facilitate the mining process by making the traversal of the tree easier. The squares are cells from the header table, as with the FP-Tree. This is a list made of all frequent items that participate in building the tree structure, sorted in ascending order of their global support. Each entry in this list contains the item-name, an item-counter, and a pointer to the first node in the tree that has the same item-name.

Figure 3. COFI-trees. (F-COFI-tree: header E:4, D:2, C:4, B:2, A:3, all pruned, root F(7,0). E-COFI-tree: header D:5, B:6; root E(8,0) with child D(5,0), D's child B(5,0), and a second child B(1,0). D-COFI-tree: header B:8, A:5; root D(9,0) with child B(8,0) and B's child A(5,0). C-COFI-tree: header A:6; root C(10,0) with child A(6,0). B-COFI-tree: header A:6; root B(10,0) with child A(6,0).)

To explain the COFI-tree building process, we will highlight the building steps for the E-COFI-tree in Figure 3. Frequent item E is read from the header table, and its first location in the FP-Tree is found using the pointer in the header table. The first location of item E indicates that it shares a branch with items C and A, with support = 2; since none of these items are locally frequent, only the support of the E root node is incremented by 2. The second node of item E indicates that it shares items D, B and A with a support of 2 for this branch, as the support of the E-item is considered the support of this branch (following the upward links for this item). Two nodes are created for items D and B, with supports equal to 2: B is a child node of D, and D is a child node of E. The third location of E indicates the sub-transaction EDB:1, which shares an existing branch of the E-COFI-tree; all counters are adjusted accordingly. A new branch EB:1 is created, as the support of E is 1 for the fourth occurrence of E. The final occurrence, EDB:2, uses an existing branch, and only counters are adjusted. As with FP-Trees, the header constitutes a list of all frequent items that maintains the location of the first entry of each item in the COFI-tree. A link is also made from each node in the tree to the next location of the same item in the tree, if it exists. The mining process is the last step performed on the E-COFI-tree before removing it and creating the next COFI-tree for the next item in the header table.
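The construction and on-the-fly pruning of this section can be sketched as follows. The names (`build_cofi_tree`, `SUPPORT`) are ours, and for brevity the sketch derives the base item's branches directly from the transactions of Table 1 rather than by chasing the item's chain through the FP-Tree; the resulting counts are the same. Following the running example, an item whose local count does not exceed σ is treated as locally non-frequent, and branch items are kept in ascending order of global support, as in Figure 3.

```python
from collections import defaultdict

# Global supports of the frequent 1-itemsets (Figure 1, step 3).
SUPPORT = {"A": 11, "B": 10, "C": 10, "D": 9, "E": 8, "F": 7}
TABLE1 = ["AGDCB", "BCHED", "BDEAM", "CEFAN", "ABNOP", "ACQRG",
          "ACHIG", "LEFKB", "AFMNO", "CFPGR", "ADBHI", "DEBKL",
          "MDCGO", "CFPQJ", "BDEFI", "JEBAD", "AKEFC", "CDLBA"]

def build_cofi_tree(transactions, item, support, sigma):
    order = sorted(support, key=lambda i: (-support[i], i))
    rank = {i: r for r, i in enumerate(order)}
    # Branches of `item`: for each transaction containing it, the
    # globally frequent items that are more frequent than `item`
    # (its ancestors along the item's chain in the FP-Tree).
    branches = [[i for i in t if i in support and rank[i] < rank[item]]
                for t in transactions if item in t]
    # Global frequent / local non-frequent property: a globally
    # frequent item enters the tree only if it is also locally
    # frequent with respect to `item`.
    local = defaultdict(int)
    for b in branches:
        for i in b:
            local[i] += 1
    keep = {i for i, c in local.items() if c > sigma}
    # Merge the pruned branches into a prefix tree; each node carries
    # a support counter and a participation counter (initially 0).
    root = {}  # child item -> [support, participation, children]
    for b in branches:
        node = root
        for i in sorted((i for i in b if i in keep), key=rank.get, reverse=True):
            child = node.setdefault(i, [0, 0, {}])
            child[0] += 1
            node = child[2]
    return dict(local), keep, root
```

For item E with σ = 4, this yields the local counts D:5, B:6, C:3 and A:4, keeps only D and B, and builds the E-COFI-tree of Figure 3 (D:5 with child B:5, plus a lone B:1); for item F, every co-occurring item is pruned, so the F-COFI-tree need not be mined.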
   [Figure 4. Steps needed to generate frequent patterns related to item E. Tree diagrams omitted; the figure traces the E-COFI-tree through three steps of frequency-count and participation-count updates, ending with the frequent patterns ED:5, EB:6, and EDB:5.]

   [Figure 5. Steps needed to generate frequent patterns related to item D. Tree diagrams omitted; the figure traces the D-COFI-tree through the same steps, ending with the frequent patterns DBA:5, DB:8, and DA:5.]

3.3 Mining the COFI-trees

   The COFI-trees of all frequent items are not constructed together. Each tree is built, mined, then discarded before the next COFI-tree is built. The mining process is done for each tree independently, with the purpose of finding all frequent k-itemset patterns in which the item at the root of the tree participates.

   The steps to produce the frequent patterns related to item E, for example, are illustrated in Figure 4 (the F-COFI-tree will not be mined, based on the pruning results found in the previous step). From each branch of the tree, using the support-count and the participation-count, candidate frequent patterns are identified and stored temporarily in a list. The non-frequent ones are discarded at the end, once all branches are processed. The mining process for the E-COFI-tree starts from the most locally frequent item in the header table of the tree, which is item B. Item B exists in two branches of the E-COFI-tree, namely (B:5, D:5, E:8) and (B:1, E:8). The frequency of each branch is the frequency of its first item minus the participation value of the same node. Item B in the first branch has a frequency value of 5 and a participation value of 0, which gives the first pattern, EDB, a frequency of 5. The participation values of all nodes in this branch are then incremented by 5, the frequency of this pattern. From the first pattern, EDB: 5, we need to generate all sub-patterns in which item E participates, which are ED: 5, EB: 5, and EDB: 5. The second branch containing B generates the pattern EB: 1; EB already exists, and its counter is adjusted to become 6. The COFI-tree of item E can be removed at this point, and the next tree can be generated and tested to produce all the frequent patterns related to its root node. The D-COFI-tree (Figure 5) is created after the E-COFI-tree; mining this tree generates the frequent patterns DBA: 5, DA: 5, and DB: 8. The same process occurs for the remaining trees, which produce AC: 6 for the C-COFI-tree and BA: 6 for the B-COFI-tree.

   The following is our algorithm for building and mining the COFI-trees with pruning.

   Algorithm COFI: Creating with pruning and mining COFI-trees
   Input: modified FP-Tree, a minimum support threshold σ
   Output: full set of frequent patterns
   Method:
   1. A = the least frequent item in the header table of the FP-Tree
   2. While (there are still frequent items) do
      2.1 Count the frequency of all items that share a path with item (A); the frequencies of all items on the same path are the same as the frequency of the (A) node
      2.2 Remove all non-locally-frequent items from the frequent list of item (A)
      2.3 Create a root node for the (A)-COFI-tree with both frequency-count and participation-count = 0
         2.3.1 C is the path of locally frequent items on the path from item A to the root
         2.3.2 Items on C form a prefix of the (A)-COFI-tree
         2.3.3 If the prefix is new then set frequency-count = frequency of the (A) node and participation-count = 0 for all nodes in the path
            Else
         2.3.4 Adjust the frequency-count of the already existing part of the path
         2.3.5 Adjust the pointers of the header list if needed
         2.3.6 Find the next node for item A in the FP-tree and go to 2.3.1
      2.4 MineCOFI-tree(A)
      2.5 Release the (A)-COFI-tree
      2.6 A = next frequent item from the header table
   3. Goto 2

   Function: MineCOFI-tree(A)
   1. nodeA = select next node // selection starts with the nodes of the most locally frequent item, following its chain, then the next less frequent item with its chain, until the least frequent item in the header list of the (A)-COFI-tree is reached
   2. While there are still nodes do
      2.1 D = set of nodes from nodeA to the root
      2.2 F = nodeA.frequency-count − nodeA.participation-count
      2.3 Generate all candidate patterns X from the items in D; patterns that do not contain A are discarded
      2.4 Patterns in X that do not exist in the A-Candidate List are added to it with frequency = F; otherwise their frequency is incremented by F
      2.5 Increment the participation-count by F for all items in D
      2.6 nodeA = select next node
   3. Goto 2
   4. Based on the support threshold σ, remove non-frequent patterns from the A-Candidate List
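The branch-by-branch counting of MineCOFI-tree can be sketched as follows. This is a simplified, flattened model under our own naming (`Branch`, `mineBranches`): we assume each branch is handed over with the counts of its starting node, so the shared-node participation bookkeeping of the real algorithm is folded into the per-branch counts.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical flattened view of one COFI-tree branch: the items from a
// node up to the root (with the root item A among them), plus the counts
// of the node the traversal starts from.
struct Branch {
    std::vector<char> items;          // e.g. {'B','D','E'} for branch B-D-E
    int frequencyCount;
    int participationCount;
};

// Sketch of MineCOFI-tree: F = frequency-count - participation-count is
// the number of not-yet-counted occurrences of the branch; every subset
// of the branch that contains the root item A becomes a candidate whose
// frequency is increased by F.
std::map<std::string, int>
mineBranches(std::vector<Branch> branches, char rootItem) {
    std::map<std::string, int> candidates;
    for (auto& b : branches) {
        int f = b.frequencyCount - b.participationCount;
        if (f <= 0) continue;                       // branch fully counted
        b.participationCount += f;                  // step 2.5: mark as counted
        int n = static_cast<int>(b.items.size());
        for (int mask = 1; mask < (1 << n); ++mask) {  // all non-empty subsets
            std::string pattern;
            bool hasRoot = false;
            for (int i = 0; i < n; ++i)
                if (mask & (1 << i)) {
                    pattern += b.items[i];
                    hasRoot |= (b.items[i] == rootItem);
                }
            if (hasRoot && pattern.size() > 1)      // discard patterns without A
                candidates[pattern] += f;           // step 2.4: add or increment
        }
    }
    return candidates;
}
```

On the two E-COFI-tree branches of the example above, {B, D, E} with counts (5, 0) and {B, E} with counts (1, 0), this yields ED: 5, EB: 6, and EDB: 5, matching the text.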
   [Figure 6. Mining a dataset of 500K transactions (dimension = 10K, average number of items per transaction = 12). Charts omitted; panel (A) shows the runtime, panel (B) the total memory requirement, and panel (C) the memory requirement without the FP-tree, for COFI and FP-Growth at supports from 0.1% down to 0.0004%.]

4 Experimental Studies

   To study the COFI-tree mining strategies, we conducted several experiments on a variety of data sizes, comparing our approach with the well-known FP-Growth [11] algorithm as written by its original authors. The experiments were conducted on a 2.6 GHz CPU machine with 2 GB of memory running the Win2000 operating system. Transactions were generated using the IBM synthetic data generator [4]. We conducted several types of experiments to test the effect of changing the support, transaction size, dimension, and transaction length. The first set of experiments used a transaction database of 500K transactions, with a dimension of 10K and an average transaction length of 12. We varied the support from an absolute value of 500 down to 2, for which the number of generated frequent patterns varied from 15K to 3400K. FP-Growth could not mine the last experiment in this set, as it exhausted all available memory. In all experiments the COFI-tree approach outperforms the FP-Growth approach. The major accomplishment of our approach is the memory space saved: our algorithm outperforms FP-Growth by one order of magnitude in terms of memory requirements. We also measured the memory used during the mining process alone (i.e., isolating the memory used to create the FP-Tree, which both FP-Growth and our FP-Tree-based COFI algorithm build), and found that the COFI-trees again need one order of magnitude less memory than the conditional trees used by FP-Growth during the mining. Figure 6A presents the time needed to mine the 500K transactions at different support levels. Figure 6B depicts the memory needed during the mining process of the previous experiments. Figure 6C illustrates the memory needed by
   Table 2. Time and memory scalability with respect to support on the T10I4D100K dataset

               Time in seconds        Memory in KB
   Support %   COFI   FP-Growth     COFI   FP-Growth
     0.50       1.5       3.0         18      173
     0.25       1.7       5.2         19      285
     0.10       2.7      12.3         26      289
     0.05      14.0      20.9         19      403

the COFI-trees and conditional trees during the mining process. Other experiments were conducted to test the effect of changing the dimension, transaction size, and transaction length at the same support, 0.05%. Some of these experiments are represented in Figure 7. Figures 7A and 7B represent the time needed during the mining process. Figures 7C and 7D represent the memory space needed during the whole mining process. Figures 7E and 7F represent the memory space needed by the COFI-trees or conditional trees during the mining process. In these experiments we varied the dimension, which is the number of distinct items, from 5K to 10K; the average transaction length from 12 to 24 items; and the number of transactions from 10K to 500K. All these experiments confirm that our approach is one order of magnitude better than FP-Growth in terms of memory usage.

   We also ran experiments using the public UCI datasets provided on the FIMI workshop website: Mushroom, Chess, Connect, Pumsb, T40I10D100K, and T10I4D100K. The COFI algorithm scales relatively well vis-à-vis the support threshold on these datasets. Results are not reported here for lack of space. Our approach gives good results with high support values on all datasets. However, as with other approaches, at low support values, where the number of frequent patterns increases significantly, our approach faces some difficulties. For such cases it is recommended to discover closed itemsets or maximal patterns instead of all frequent itemsets: the sheer number of frequent itemsets becomes overwhelming, and some argue even useless, whereas closed itemsets and maximal itemsets represent all frequent patterns while eliminating the redundant ones. For illustration, Table 2 compares the CPU time and memory requirements of COFI and FP-Growth on the T10I4D100K dataset.

5 Implementations

   The COFI-tree program submitted with this paper is C++ code. The executable runs with three parameters: (1) the path to the input file; (2) a positive integer that represents the absolute support; (3) an optional file name for the output patterns. This code generates ALL frequent patterns from the provided input file.

   The code scans the database twice. The goal of the first database scan is to find the frequency of each item in the transactional database. These frequencies are stored in a data structure called Candidate-Items. Each entry of Candidate-Items is a structure called ItemsStructure, made of two long integers representing an item and its frequency. All frequent items are then stored in a special data structure called F1-Items, which is sorted in descending order of item frequency. To access the location of each item, we map it to a specific location using a new data structure called FindInHashTable. In brief, since we do not know the number of unique items before runtime, and thus cannot create an array for counting the items, we create blocks of p items rather than a linked list of individual items. The number p could arbitrarily be 100 or 1000. Following links in a linked list each time to find and increment a counter could be expensive, whereas blocks of items are easily indexed. In the worst case, we lose the space of p − 1 unused entries.

   The second scan starts by eliminating all non-frequent items from each transaction read, and then sorts the transaction based on the frequency of each remaining frequent item. This happens in the Sort-Transaction method. The FP-tree is then built from the sub-transactions made of the frequent items. The FP-tree data structure is a tree of n children. The structure struct FPTTree { long Element; long counter; FPTTree* child; FPTTree* brother; FPTTree* father; FPTTree* next; } is used to create each node of this tree: a link is created between each node and its first child, and the brother link is maintained to create a linked list of all children of the same node. This linked list is kept ordered by the frequency of each item. The header list is maintained using the structure FrequentStruc { long Item; long Frequency; long COFIFrequency; long COFIFrequency1; FPTTree* first; COFITree* firstCOFI; }. After building the FP-tree, we build the first COFI-tree by selecting the item with the least frequency from the frequent list. A scan of the FP-tree is made, starting from the linked list of this item, to find the frequencies of the other items with respect to it. The COFI-tree is then created from only the locally frequent items. Finally, frequent patterns are generated and stored in the FrequentTree data structure; every node with support greater than or equal to the given support represents a frequent pattern. The COFI-tree and the FrequentTree are removed from memory and the next COFI-tree is created, until all COFI-trees are mined.

   One interesting implementation improvement is that the participation counter was also added to the header table of the COFI-tree; this counter accumulates the participation of the item in all patterns already discovered in the current COFI-tree. The difference between the participation count in a node and the one in the header is that the node counter counts the participation of the node's item in the paths where that node appears, while the header counter counts the participation of the item globally in the tree. This trick does not compromise the effectiveness and usefulness of the participation counting. Its main advantage is that it allows looking ahead to see whether all nodes of a specific item have already been traversed, reducing unneeded scans of the COFI-tree.

6 Conclusion and future work

   The COFI algorithm, based on the COFI-tree structure we propose in this paper, is one order of magnitude better than the FP-Growth algorithm in terms of memory usage, and sometimes in terms of speed. The algorithm achieves these results thanks to: (1) the non-recursive technique used during the mining process, in which a simple traversal of the COFI-tree generates the full set of frequent patterns; and (2) the pruning method used to remove all locally non-frequent items, leaving the COFI-tree with only locally frequent ones.

   The major advantage of our algorithm COFI over FP-Growth is that it needs a significantly smaller memory footprint, and thus can mine larger transactional databases with less main memory available. The fundamental difference is that COFI seeks a compromise between the fully pattern-growth approach that FP-Growth adopts and the total candidacy-generation approach that Apriori is known for: COFI grows targeted patterns but performs a reduced and focused generation of candidates during the mining. This avoids the recursion that FP-Growth uses, which is notorious for blowing the stack on large datasets.

   We have developed algorithms for closed itemset mining and maximal itemset mining based on our COFI-tree approach. However, their efficient implementations were not ready by the deadline of this workshop. These algorithms and their experimental results will be compared to existing algorithms such as CHARM [17], MAFIA [6] and CLOSET+ [15], and will be reported in the future.

7 Acknowledgments

   This research is partially supported by a Research Grant from NSERC, Canada.

References

 [1] R. Agarwal, C. Aggarwal, and V. Prasad. A tree projection algorithm for generation of frequent itemsets. Parallel and Distributed Computing, 2000.
 [2] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. 1993 ACM-SIGMOD Int. Conf. Management of Data, pages 207–216, Washington, D.C., May 1993.
 [3] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. 1994 Int. Conf. Very Large Data Bases, pages 487–499, Santiago, Chile, September 1994.
 [4] IBM Almaden. Quest synthetic data generation code. http://www.almaden.ibm.com/cs/quest/syndata.html.
 [5] M.-L. Antonie and O. R. Zaïane. Text document categorization by term association. In IEEE International Conference on Data Mining, pages 19–26, December 2002.
 [6] C. Burdick, M. Calimlim, and J. Gehrke. MAFIA: A maximal frequent itemset algorithm for transactional databases. In IEEE International Conference on Data Mining (ICDM 01), April 2001.
 [7] M. El-Hajj and O. R. Zaïane. Inverted matrix: Efficient discovery of frequent items in large datasets in the context of interactive mining. In Proc. 2003 Int'l Conf. on Data Mining and Knowledge Discovery (ACM SIGKDD), August 2003.
 [8] M. El-Hajj and O. R. Zaïane. Non-recursive generation of frequent k-itemsets from frequent pattern tree representations. In Proc. of 5th International Conference on Data Warehousing and Knowledge Discovery (DaWaK'2003), September 2003.
 [9] M. El-Hajj and O. R. Zaïane. Parallel association rule mining with minimum inter-processor communication. In Fifth International Workshop on Parallel and Distributed Databases (PaDD'2003), in conjunction with the 14th Int'l Conf. on Database and Expert Systems Applications (DEXA 2003), September 2003.
[10] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco, CA, 2001.
[11] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In ACM-SIGMOD, Dallas, 2000.
[12] J. Hipp, U. Guntzer, and G. Nakhaeizadeh. Algorithms for association rule mining - a general survey and comparison. ACM SIGKDD Explorations, 2(1):58–64, June 2000.
[13] J. Liu, Y. Pan, K. Wang, and J. Han. Mining frequent item sets by opportunistic projection. In Eighth ACM SIGKDD International Conf. on Knowledge Discovery and Data Mining, pages 229–238, Edmonton, Alberta, August 2002.
[14] J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang. H-Mine: Hyper-structure mining of frequent patterns in large databases. In ICDM, pages 441–448, 2001.
[15] J. Wang, J. Han, and J. Pei. CLOSET+: Searching for the best strategies for mining frequent closed itemsets. In 9th ACM SIGKDD International Conf. on Knowledge Discovery and Data Mining, July 2003.
[16] O. R. Zaïane, J. Han, and H. Zhu. Mining recurrent items in multimedia with progressive resolution refinement. In Int. Conf. on Data Engineering (ICDE'2000), pages 461–470, San Diego, CA, February 2000.
[17] M. Zaki and C.-J. Hsiao. CHARM: An efficient algorithm for closed itemset mining. In 2nd SIAM International Conference on Data Mining, April 2002.
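The block-of-p-items counting scheme described in the implementation section can be sketched as follows. This is a simplified illustration with our own names (`Block`, `countItem`); the actual code locates each item's slot through the FindInHashTable structure rather than by the linear scan shown here.

```cpp
#include <memory>

// Block-based counter in the spirit described in the text: instead of one
// linked-list node per distinct item, allocate blocks of P (item, frequency)
// slots, wasting at most P - 1 slots in the last block.
constexpr int P = 100;            // block size; 100 or 1000 in the text

struct ItemsStructure {           // one slot: an item id and its frequency
    long item = -1;
    long frequency = 0;
};

struct Block {
    ItemsStructure slots[P];
    int used = 0;                 // slots filled so far in this block
    std::unique_ptr<Block> next;  // next block, allocated on demand
};

// Increment the counter of `item`, appending a fresh slot (or a fresh
// block) when the item has not been seen before.
void countItem(Block& head, long item) {
    for (Block* b = &head; ; b = b->next.get()) {
        for (int i = 0; i < b->used; ++i)
            if (b->slots[i].item == item) { ++b->slots[i].frequency; return; }
        if (b->used < P) {                      // free slot in this block
            b->slots[b->used] = ItemsStructure{item, 1};
            ++b->used;
            return;
        }
        if (!b->next)                           // grow by one block at a time
            b->next = std::make_unique<Block>();
    }
}
```

As the text notes, the worst case wastes the space of P − 1 unused slots, which is a small, fixed overhead compared with chasing one pointer per item.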
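The two node structures quoted in the implementation description, reformatted as compilable declarations. The field comments are inferred from the surrounding text; the definition of COFITree is not given in the paper, so it is only forward-declared here.

```cpp
// FP-tree node, as described in the text: each node links to its first
// child, to a brother (the children of a node form a linked list ordered
// by item frequency), to its father, and to the next node of the same
// item (the header-list chain used to walk all occurrences of an item).
struct FPTTree {
    long Element;         // the item stored in this node
    long counter;         // support count accumulated at this node
    FPTTree* child;       // first child
    FPTTree* brother;     // next sibling in the ordered child list
    FPTTree* father;      // parent, used to walk a path up to the root
    FPTTree* next;        // next FP-tree node holding the same item
};

struct COFITree;          // COFI-tree node type; definition not in the text

// Header-list entry, as given in the text.
struct FrequentStruc {
    long Item;
    long Frequency;
    long COFIFrequency;
    long COFIFrequency1;
    FPTTree* first;       // first FP-tree node of this item
    COFITree* firstCOFI;  // first COFI-tree node of this item
};
```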
   [Figure 7. Scalability experiments at support = 0.05% (D = dimension, L = average number of items in one transaction). Charts omitted; panels (A) and (B) plot the runtime, and panels (C) and (D) the memory usage, against the number of transactions (10K to 500K) for COFI and FP-Growth, with panels (A) and (C) at D = 5K, L = 12 and panels (B) and (D) at D = 10K, L = 24.]




                                                                                                                          500

                                     250
                                                                                                                          400
                                     200
                                                                                                                          300
                                     150
                                                                                                                          200
                                     100

                                      50                                                                                  100


                                       0                                                           0
                     10                                   50            100                                  10                        50     100           500
Number of Transactions in K                                                                 Number of Transactions in K




                                                     (E) D=5K, L=12                                                                         (F) D=10K, L=24

                                                               Figure 7. Mining dataset of different sizes