=Paper= {{Paper |id=Vol-126/paper-4 |storemode=property |title=A Space Optimization for FP-Growth |pdfUrl=https://ceur-ws.org/Vol-126/ozkural.pdf |volume=Vol-126 |dblpUrl=https://dblp.org/rec/conf/fimi/OzkuralA04 }} ==A Space Optimization for FP-Growth== https://ceur-ws.org/Vol-126/ozkural.pdf
                               A Space Optimization for FP-Growth

                                       Eray Özkural and Cevdet Aykanat
                                    Department of Computer Engineering
                                   Bilkent University 06800 Ankara, Turkey
                                      {erayo,aykanat}@cs.bilkent.edu.tr


Abstract

The frequency mining problem comprises the core of several data mining algorithms. Among frequent pattern discovery algorithms, FP-Growth employs a unique search strategy using compact structures, resulting in a high performance algorithm that requires only two database passes. We introduce an enhanced version of this algorithm called FP-Growth-Tiny which can mine larger databases due to a space optimization eliminating the need for intermediate conditional pattern bases. We present the algorithms required for directly constructing a conditional FP-Tree in detail. The experiments demonstrate that our implementation has a running time comparable to the original algorithm while reducing memory use up to twofold.

1. Introduction

Frequency mining is the discovery of all frequent patterns in a transaction or relational database. Frequent pattern discovery comprises the core of several data mining algorithms such as association rule mining and sequence mining [10], dominating the running time of these algorithms. The problem involves a transaction database T = {X | X ⊆ I} that consists of a set of transactions, each of which is drawn from a set I of items. The mining algorithm finds all patterns that occur with a frequency satisfying a given absolute support threshold ε. In practice, the number of items |I| is on the order of 10^3 or more. The number of transactions is much larger, at least 10^5. A pattern is X ⊆ I, a subset of I, and the set of all patterns is 2^I. The frequency function f(T, x) = |{x ∈ Y | Y ∈ T}| computes the number of times a given item x ∈ I occurs in the transaction set T; it is extended to sets of items as f(T, X) = |{X ⊆ Y | Y ∈ T}| to compute the frequency of a pattern.

Frequency mining is the discovery of all frequent patterns in a transaction set with a frequency of support threshold ε or more. The set of all frequent patterns is F(T, ε) = {X | X ∈ 2^I ∧ f(T, X) ≥ ε}. In the algorithms, the set of frequent items F = {x ∈ I | f(T, x) ≥ ε} may require special consideration. A significant property of frequency mining known as downward closure states that if X ∈ F(T, ε) then ∀Y ⊂ X, Y ∈ F(T, ε) [2].

An inherent limitation of frequency mining is the amount of main memory available [8]. In this paper, we present a space optimization to the FP-Growth algorithm and we demonstrate its impact on performance with experiments on synthetic and real-world datasets. In the next section, we give the background on the FP-Growth algorithm. Section 3 and Section 4 explain our algorithm and implementation. Section 5 presents the experiments, following which we offer our conclusions.

2. Background

2.1. Compact structures

Compact data structures have been used for efficient storage and query/update of candidate item sets in frequency mining algorithms. SEAR [12], Spear [12], and DIC [6] use tries (also known as prefix trees), while FP-Growth [10] uses the FP-Tree, which is an enhanced trie structure.

Using concise structures can reduce both the running time and the memory size requirements of an algorithm.
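As a toy illustration of such compact storage, a counted trie can be sketched in C++ (the language of our implementation); the types and names below are illustrative, not the actual code:

```cpp
#include <map>
#include <vector>

// A toy counted trie: each node records how many inserted item
// strings pass through it, so shared prefixes are stored once.
struct TrieNode {
    int count = 0;
    std::map<char, TrieNode*> children;  // child nodes (leaked in this sketch)
};

// Insert one item string, incrementing the counts along its path.
void insertString(TrieNode& root, const std::vector<char>& items) {
    TrieNode* node = &root;
    for (char item : items) {
        TrieNode*& child = node->children[item];
        if (child == nullptr) child = new TrieNode();
        child->count += 1;
        node = child;
    }
}
```

Inserting {f, c, a} and {f, b} stores the shared prefix f only once, with count 2; this prefix sharing is exactly what the FP-Tree exploits.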
Tries are well known structures that are widely used for storing strings and have decent query/update performance. The aforementioned algorithms exploit this property of the data structure for better performance. Tries are also efficient in storage: a large number of strings can be stored in this dictionary type which would not otherwise fit into main memory. For frequency mining algorithms both properties are critical, as our goal is to achieve efficient and scalable algorithms. In particular, the scalability of these structures is quite high [10], as they allow an algorithm to track the frequency information of candidate patterns for very large databases. The FP-Tree structure in FP-Growth allows the algorithm to maintain in main memory all frequency information obtained from two database passes. Using the FP-Tree structure has also resulted in novel search strategies.

A notable work on compact structures is [15], in which a binary-trie based summary structure for representing transaction sets is proposed. The trie is further compressed using Patricia tries. Although significant savings in storage and improvements in query time are reported, the effectiveness of the scheme in a frequency mining algorithm remains to be seen. In another work in the FIMI 2003 workshop [3], an algorithm called PatriciaMine using Patricia tries has been proposed [13]. The performance of PatriciaMine has been shown to be consistently good in the extensive benchmark studies of the FIMI workshop [3]; it was one of the fastest algorithms although it was not the most efficient. For many applications, average case performance may be more important than performing well in a small number of cases; therefore, further research on PatriciaMine would be worthwhile.

In this paper, we introduce an optimized version of FP-Growth. A closer analysis of it is in order.

2.2. FP-Growth algorithm

The FP-Growth algorithm uses the frequent pattern tree (FP-Tree) structure. FP-Tree is an improved trie structure such that each item set is stored as a string in the trie along with its frequency. At each node of the trie, item, count and next fields are stored. The items on the path from the root of the trie to a node constitute the item set stored at the node, and the count is the frequency of this item set. The node link next is a pointer to the next node with the same item in the FP-Tree. The field parent holds a pointer to the parent node, null for the root. Additionally, we maintain a header table which stores the heads of node links accessing the linked list that spans all nodes with the same item. FP-Tree stores only frequent items. At the root of the trie is a null item, and strings are inserted into the trie by sorting item sets in a unique¹ decreasing frequency order [10].

¹ All strings must be inserted in the same order; the order of items with the same frequency must be the same.

Table 1 shows a sample transaction set and frequent items in descending frequency order. Figure 1 illustrates the FP-Tree of the sample transaction set in Table 1. As shown in [10], FP-Tree carries complete information required for frequency mining in a compact manner; the height of the tree is bounded by the maximal number of frequent items in a transaction. Make-FP-Tree (Algorithm 1) constructs an FP-Tree from a given transaction set T and support threshold ε as described.

   Transaction                      Ordered Frequent Items
   t1 = {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
   t2 = {a, b, c, f, l, m, o}       {f, c, a, b, m}
   t3 = {b, f, h, j, o}             {f, b}
   t4 = {b, c, k, s, p}             {c, b, p}
   t5 = {a, f, c, e, l, p, m, n}    {f, c, a, m, p}

   Table 1. A sample transaction set

[Figure 1. An FP-Tree structure for the transaction set of Table 1: a header table (f, c, a, b, m, p) holds node links into a trie rooted at a null node, with paths f:4-c:3-a:3-m:2-p:2, f:4-c:3-a:3-b:1-m:1, f:4-b:1 and c:1-b:1-p:1.]

Algorithm 1 Make-FP-Tree(DB, ε)
 1: Compute F and f(x) where x ∈ F
 2: Sort F in frequency decreasing order as L
 3: Create the root of an FP-Tree T with label "null"
 4: for all transactions t_i ∈ DB do
 5:   Sort the frequent items in t_i according to L. Let the sorted list be [p|P], where p is the head of the list and P the rest.
 6:   Insert-Trie([p|P], T)
 7: end for

In Algorithm 3 we describe FP-Growth, which has innovative features such as:

   1. A novel search strategy
   2. Effective use of a summary structure
   3. Two database passes

FP-Growth turns the k-length frequent pattern mining problem into "a sequence of k frequent 1-item set mining problems via a set of conditional pattern bases" [10]. It is proposed that with FP-Growth there is "no need to generate any combinations of candidate sets in the entire mining process". With an FP-Tree Tree given as input, the algorithm generates all frequent patterns. There are two points in the algorithm that should be explained: the single path case and conditional pattern bases. If an FP-Tree has only a single path, then an optimization is to consider all combinations of items in the path (the single path case is the basis of recursion in FP-Growth). Otherwise, the algorithm constructs for each item a_i in the header table a conditional pattern base and an FP-Tree Tree_β based on this structure for recursive frequency mining. A conditional pattern base is simply a compact representation of a derivative database in which only a_i and its prefix paths in the original Tree occur. Consider the path <f:4, c:3, a:3, m:2, p:2> in Tree. For mining patterns including m in this path, we need to consider only the prefix path of m, since the nodes after m will be mined elsewhere (in this case only p). In the prefix path <f:4, c:3, a:3> any pattern including m can occur at most as frequently as m itself, therefore we may adjust the frequencies in the prefix path as <f:2, c:2, a:2>, which is called a transformed prefix path [10]. The set of transformed prefix paths of a_i forms a small database of patterns which co-occur with a_i and thus contains complete information required for mining patterns including a_i. Therefore, recursively mining the conditional pattern bases for all a_i in Tree is equivalent to mining Tree (which is equivalent to mining the complete DB). Tree_β in FP-Growth is the FP-Tree of the conditional pattern base.

FP-Growth is indeed remarkable with its unique divide and conquer approach. Nevertheless, it may be profitable for us to view it as generating candidates despite the title of [10]. The conditional pattern base is a set of candidates among which only some turn out to be frequent. The main innovation however remains intact: FP-Growth takes advantage of a tailored data structure to solve the frequency mining problem with a divide-and-conquer method and with demonstrated efficiency and scalability. Besides, the conditional pattern base is guaranteed to be smaller than the original tree, which is a desirable property. An important distinction of this algorithm is that, when examined within the taxonomy of algorithms, it employs a unique search strategy. When the item sets tested are considered, it is seen that this algorithm is neither DFS nor BFS. The classification of FP-Growth in Figure 3 of [11] may be slightly misleading. As Hipp later mentions in [11], "FP-Growth does not follow the nodes of the tree . . . , but directly descends to some part of the itemsets in the search space". In fact, the part is so well defined that it would be unfair to classify FP-Growth as conducting a DFS. It does not even start with item sets of small length and proceed to longer item sets. Rather, it considers a set of patterns at the same time by taking advantage of the data structure. This unique search strategy makes it hard to classify FP-Growth in the context of traditional uninformed search algorithms.

3. An improved version of FP-Growth

During experiments with large databases, we have observed that FP-Growth was costly in terms of memory use. Thus, we have experimented with improvements to the original algorithm. In this section, we propose FP-Growth-Tiny (Algorithm 4), which is an enhancement of FP-Growth featuring a space optimization and minor improvements. An important optimization eliminates the need for intermediate conditional pattern bases.
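The single path case of FP-Growth (every non-empty combination of the nodes on the path, supported by the minimum count in the combination) can be sketched in C++ as follows; the types are illustrative, not the paper's code:

```cpp
#include <algorithm>
#include <limits>
#include <map>
#include <set>
#include <vector>

// A node on a single FP-Tree path: its item and its count.
struct PathNode { char item; int count; };

// Enumerate every non-empty combination of the path's nodes;
// the support of a combination is the minimum count among its nodes.
std::map<std::set<char>, int>
singlePathPatterns(const std::vector<PathNode>& path) {
    std::map<std::set<char>, int> patterns;
    const int n = static_cast<int>(path.size());
    for (int mask = 1; mask < (1 << n); ++mask) {  // all non-empty subsets
        std::set<char> items;
        int support = std::numeric_limits<int>::max();
        for (int i = 0; i < n; ++i) {
            if (mask & (1 << i)) {
                items.insert(path[i].item);
                support = std::min(support, path[i].count);
            }
        }
        patterns[items] = support;
    }
    return patterns;
}
```

For the path <f:4, c:3, a:3> this enumerates seven patterns, e.g. {f} with support 4 and {f, c, a} with support 3.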
A minor improvement comes from not outputting all combinations of the single path in the basis of recursion. Instead, we output a representation of this task, since subsequent algorithms can take advantage of a compact representation for generating association rules and so forth.² Another improvement is pruning the infrequent nodes of the single path.

² For the FIMI workshop, we output all patterns separately as required. It can be argued that a meaningful mining of all frequent patterns must output them one by one.

In the following subsection, the space optimization is discussed.

Algorithm 2 Insert-Trie([p|P], T)
 1: if T has a child N such that item[N] = item[p] then
 2:   count[N] ← count[N] + 1
 3: else
 4:   Create a new node N with count = 1, parent linked to T, and node link next linked to the nodes with the same item
 5: end if
 6: if P ≠ ∅ then
 7:   Insert-Trie(P, N)
 8: end if

Algorithm 3 FP-Growth(Tree, α)
 1: if Tree contains a single path P then
 2:   for all combinations β of the nodes in path P do
 3:     generate pattern β ∪ α with support = the minimum support of the nodes in β
 4:   end for
 5: else
 6:   for all a_i in the header table of Tree do
 7:     generate pattern β ← a_i ∪ α with support = support[a_i]
 8:     construct β's conditional pattern base and then β's conditional FP-Tree Tree_β
 9:     if Tree_β ≠ ∅ then
10:       FP-Growth(Tree_β, β)
11:     end if
12:   end for
13: end if

Algorithm 4 FP-Growth-Tiny(Tree, α)
 1: if Tree contains a single path P then
 2:   prune the infrequent nodes of P
 3:   if |P| > 0 then
 4:     output "all patterns in 2^P and α"
 5:   end if
 6: else
 7:   for all a_i in the header table of Tree do
 8:     Tree_β ← Cons-Conditional-FP-Tree(Tree, a_i)
 9:     output pattern β ← a_i ∪ α with count = f(a_i) in Tree
10:     if Tree_β ≠ ∅ then
11:       FP-Growth-Tiny(Tree_β, β)
12:     end if
13:   end for
14: end if

3.1. Eliminating conditional pattern base construction

The conditional tree Tree_β can be constructed directly from Tree without an intermediate conditional pattern base. The conditional pattern base in FP-Growth can be implemented as a set of patterns. A pattern in FP-Growth consists of a set of symbols and an associated count. With a counting algorithm and retrieval/insertion of patterns directly into the FP-Tree structure, we can eliminate the need for such a pattern base. Algorithm 5 constructs a conditional FP-Tree from a given Tree and a symbol s for which the transformed prefix paths are computed.

The improved procedure first counts the symbols in the conditional tree without generating an intermediate structure and constructs the set of frequent items. Then, each transformed prefix path is computed as a pattern retrieved from Tree and inserted into Tree_β.

Count-Prefix-Path, presented in Algorithm 6, scans the prefix paths of a given node. Since the pattern corresponding to the transformed prefix path has the count of the node, it simply adds that count to the count of all symbols in the prefix path. This step is required for constructing a conditional FP-Tree directly, since an FP-Tree is based on the decreasing frequency order of F. This small algorithm allows us to compute the counts of the symbols in the conditional tree in an efficient way, and it was the key observation that made the optimization possible.

Algorithm 7 retrieves the transformed prefix path for a given node, excluding the node itself, and Algorithm 8 inserts a pattern into the FP-Tree. Get-Pattern computes the transformed prefix path as described in [10].
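The counting step behind Count-Prefix-Path (Algorithm 6) can be sketched in C++ as follows, with a minimal illustrative node type: walking the parent pointers from a node, the node's count is added to the tally of every item on its prefix path.

```cpp
#include <map>

// Minimal node type for the sketch; only the fields needed here.
struct Node {
    char item;
    int count;
    Node* parent;  // nullptr above the root
};

// Add the count of `node` to the tally of every item on its prefix
// path (the node itself and the root are excluded), mirroring the
// loop of Algorithm 6.
void countPrefixPath(const Node* node, std::map<char, int>& symCount) {
    const int prefixCount = node->count;
    for (const Node* p = node->parent; p != nullptr && p->parent != nullptr;
         p = p->parent)
        symCount[p->item] += prefixCount;
}
```

For the path <f:4, c:3, a:3, m:2, p:2>, calling this on the m node adds 2 to the counts of a, c and f, matching the transformed prefix path <f:2, c:2, a:2> of the earlier example.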
Insert-Pattern prunes the items not present in the frequent item set F of Tree (which does not have to be identical to the F of the calling procedure), sorts the pattern in decreasing frequency order to maintain the FP-Tree properties, and adds the obtained string to the FP-Tree structure. The addition is similar to the insertion of a single string, with the difference that inserting a pattern is equivalent to inserting the symbol string of the pattern count[pattern] times.

Algorithm 5 Cons-Conditional-FP-Tree(Tree, s)
 1: table ← itemtable[Tree]
 2: list ← table[s]
 3: Tree′ ← Make-FP-Tree
 4: ⊲ Count symbols without generating an intermediate structure
 5: node ← list
 6: while node ≠ null do
 7:   Count-Prefix-Path(node, count[Tree′])
 8:   node ← next[node]
 9: end while
10: for all sym ∈ range[count] do
11:   if count[sym] ≥ ε then
12:     F[Tree′] ← F[Tree′] ∪ sym
13:   end if
14: end for
15: ⊲ Insert the conditional patterns into Tree_β
16: node ← list
17: while node ≠ null do
18:   pattern ← Get-Pattern(node)
19:   Insert-Pattern(Tree′, pattern)
20:   node ← next[node]
21: end while
22: return Tree′

Algorithm 6 Count-Prefix-Path(node, count)
 1: prefixcount ← count[node]
 2: node ← parent[node]
 3: while parent[node] ≠ null do
 4:   count[symbol[node]] ← count[symbol[node]] + prefixcount
 5:   node ← parent[node]
 6: end while

Algorithm 7 Get-Pattern(node)
 1: pattern ← Make-Pattern
 2: if parent[node] ≠ null then
 3:   count[pattern] ← count[node]
 4:   currnode ← parent[node]
 5:   while parent[currnode] ≠ null do
 6:     symbols[pattern] ← symbols[pattern] ∪ symbol[currnode]
 7:     currnode ← parent[currnode]
 8:   end while
 9: else
10:   count[pattern] ← 0
11: end if
12: return pattern

Algorithm 8 Insert-Pattern(Tree, pattern)
 1: pattern ← pattern ∩ F[Tree]
 2: Sort pattern in a predetermined frequency decreasing order
 3: Add the pattern to the structure

The optimization in Algorithm 5 makes FP-Growth more efficient and scalable by avoiding additional iterations and cutting down storage requirements. An implementation that uses an intermediate conditional pattern base structure will scan the tree once, constructing a linked list with the transformed prefix paths in it. Then, it will construct the frequent item set from the linked list and, in a second iteration, insert all transformed prefix paths with a procedure similar to Insert-Pattern. Such an implementation would have to copy the transformed prefix paths twice, and iterate over all prefix paths three times: once in the tree, and twice in the conditional pattern list. In contrast, our optimized procedure does not execute any expensive copying operations and needs to scan the pattern bases in the tree only twice. Besides efficiency, the elimination of the extra storage requirement is significant because it allows FP-Growth to mine more complicated data sets with the same amount of memory.

An idea similar to our algorithm was independently explored in FP-Growth* by making use of information in 2-items [9]. In their implementation, Grahne and Zhu have used strategies based on 2-items to improve running time and memory usage, and they have reported favorable performance, which has also been demonstrated in the benchmarks of the FIMI '03 workshop [3].

4. Implementation notes

We have made a straightforward implementation of FP-Growth-Tiny and licensed it under the GNU GPL for public use. It has been written in C++, using the GNU g++ compiler version 3.2.2.
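The unique decreasing frequency order that the FP-Tree requires (footnote 1) can be obtained by sorting on frequency and breaking ties with the item index; a hedged C++ sketch, where the integer item encoding and the `freq` array are assumptions of the sketch rather than our actual code:

```cpp
#include <algorithm>
#include <vector>

// Sort a transaction into the unique frequency-decreasing order:
// items with higher frequency come first, and ties are broken by
// the item index so that every transaction sorts identically.
// freq[i] is assumed to hold the count of item i.
void sortUnique(std::vector<int>& transaction, const std::vector<int>& freq) {
    std::sort(transaction.begin(), transaction.end(),
              [&](int a, int b) {
                  if (freq[a] != freq[b]) return freq[a] > freq[b];
                  return a < b;  // tie-break: smaller index first
              });
}
```

Because ties are broken deterministically, all transactions are inserted in the very same order and no explicit list L needs to be maintained.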
For variable length arrays, we used vector from the standard library. For storing transactions, patterns and other structures representable as strings, we have used efficient variable length arrays. We used set to store item sets in certain places where it would be fast to do so; otherwise, we have used sorted arrays to implement sets.

No low level memory or I/O optimizations were employed.

A shortcoming of the pattern growth approach is that it does not seem to be very memory efficient. We store many fields per node and the algorithm consumes a lot of memory in practice.

The algorithm has a detail which required special code: sorting the frequent items in a transaction according to an order L, in line 2 of Algorithm 1 and line 2 of Algorithm 8. For preserving the FP-Tree properties, all transactions must be inserted in the very same order.³ The items are sorted first in order of decreasing frequency and secondarily in order of indices to achieve a unique frequency decreasing order. Using this procedure, we are not obliged to maintain an L.

³ For patterns also in our implementation.

5. Performance study

In this section we report on our experiments demonstrating the performance of FP-Growth-Tiny. We have measured the performance of Algorithm 3 and Algorithm 4 on a 2.4 GHz Pentium 4 Linux system with 1 GB of memory and a common 7200 RPM IDE hard disk. Both algorithms were run on four synthetic and five real-world databases with varying support thresholds. The implementation of the original FP-Growth algorithm is due to Bart Goethals.⁴

⁴ Goethals has made his implementation publicly available at http://www.cs.helsinki.fi/u/goethals/

We describe the data sets used for our experiments in the next two subsections. Following that, we present our performance experiments and interpret the results briefly, comparing the performance of the improved algorithm with the original one.

5.1. Synthetic data

We have used the association rule generator described in [2] for synthetic data. The synthetic databases in our evaluation have been selected from [17] and [16]. These databases have been derived from previous studies [1, 2, 14]. Table 2 explains the symbols we use for denoting the parameters of the association rule generator tool. The experimental databases are depicted in Table 3. In all synthetic databases, |I| is 1000 and |F_max| is 2000. The original algorithm could not process T20.I6.D1137 in memory, therefore the number of transactions was decreased to 450K.

   |T|        Number of transactions in the transaction set
   |t|_avg    Average size of a transaction t_i
   |f_m|_avg  Average length of a maximal pattern f_m
   |I|        Number of items in the transaction set
   |F_max|    Number of maximal frequent patterns

   Table 2. Dataset parameters

   Name            |T|           |t|_avg   |f_m|_avg
   T10.I6.1600K    1.6 × 10^6    10        6
   T10.I4.1024K    1.024 × 10^6  10        4
   T15.I4.367K     3.67 × 10^5   15        4
   T20.I6.450K     4.5 × 10^5    20        6

   Table 3. Synthetic data sets

5.2. Real-world data

We have used five publicly available datasets from the FIMI repository. accidents is traffic accident data [7]. retail is market basket data from an anonymous Belgian retail store [5]. The bms-webview1, bms-webview2 and bms-pos datasets are from a benchmark study described in [4]. Some statistics of the datasets are presented in Table 4.

5.3. Memory consumption and running time

The memory consumption and running time of FP-Growth-Tiny and FP-Growth are plotted for varying relative supports from 0.25% to 0.75% in Figure 2 and Figure 3 for the synthetic databases, and in Figure 4 and Figure 5 for the real-world databases, except for the accidents database, which is denser and should be mined at supports of 10% and more. The implementations were run
[Plots omitted: heap use (Mbytes) and running time (sec) versus support (%) for the T10.I6.1600K, T10.I4.1024K, T15.I4.367K, and T20.I6.450K databases.]

Figure 2. Memory consumption of FP-GROWTH-TINY and FP-GROWTH on synthetic databases

Figure 3. Running time performance of FP-GROWTH-TINY and FP-GROWTH on synthetic databases
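The support axes in these plots are relative, i.e., percentages of the transaction count |T|. The absolute occurrence threshold that a miner actually compares frequencies against follows directly from the definitions in Section 1; a minimal sketch (the helper name is ours, and the example sizes are taken from Table 4):

```python
import math

def absolute_support(num_transactions, relative_percent):
    """Convert a relative support threshold (in percent of |T|)
    to the minimum absolute occurrence count of a frequent pattern."""
    return math.ceil(num_transactions * relative_percent / 100.0)

# retail has |T| = 8.82 x 10^4 = 88200 transactions (Table 4);
# at 0.75% relative support a pattern must occur at least this often:
print(absolute_support(88200, 0.75))  # 662
```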
[Plots omitted: heap use (Mbytes) and running time (sec) versus support (%) for the accidents, retail, bms-pos, bms-webview1, and bms-webview2 datasets.]

Figure 4. Memory consumption of FP-GROWTH-TINY and FP-GROWTH on real-world databases

Figure 5. Running time performance of FP-GROWTH-TINY and FP-GROWTH on real-world databases
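Statistics of the kind reported in Table 4, namely |T| (number of transactions), |I| (number of distinct items), and |t|avg (average transaction length), can be computed in a single pass over a transaction database. A minimal sketch, assuming the common one-transaction-per-line, whitespace-separated item-id format:

```python
def dataset_stats(lines):
    """One pass over a transaction database: returns |T| (number of
    transactions), |I| (number of distinct items), and |t|avg (average
    transaction length), as reported in Table 4."""
    num_transactions = 0
    total_length = 0
    items = set()
    for line in lines:
        transaction = line.split()
        if not transaction:
            continue  # skip blank lines
        num_transactions += 1
        total_length += len(transaction)
        items.update(transaction)
    return num_transactions, len(items), total_length / num_transactions

# Tiny in-memory example: three transactions over five items.
print(dataset_stats(["1 2 3", "2 3", "1 3 4 5"]))  # (3, 5, 3.0)
```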
    Name            |T|            |I|     |t|avg
    accidents       3.41 × 10^5      469    33.81
    retail          8.82 × 10^4    16470    10.31
    bms-pos         5.16 × 10^5     1657     6.53
    bms-webview1    5.96 × 10^4    60978     2.51
    bms-webview2    7.75 × 10^4   330286     4.62

            Table 4. Real-world data sets

inside a typical KDE desktop session. The running time is measured as the wall-clock time of the system call. The memory usage is measured with the GNU glibc tool memusage, considering only the maximum heap size, since stack use is much smaller than heap use.
   The plots for the synthetic datasets are similar among themselves, while we observe more variation for the real-world datasets. Memory is saved on all databases except bms-webview2, which requires 2.74 times the memory used by FP-GROWTH; this has an adverse effect on running time, as discussed below. On the others, memory usage is reduced to as little as 41.5% of the original, on the accidents database at 4% support, i.e., 2.4 times smaller than FP-GROWTH.
   Due to the optimization, our implementation can process larger databases than the vanilla version. For most problem instances, memory consumption has been reduced more than twofold compared to the original algorithm. An advantage of our approach is that, with the same amount of memory, we can process more complicated databases.5 Overall, the experiments show that the conditional pattern base construction, which we have eliminated, carries a significant space cost during the recursive construction of conditional FP-trees.
   The running time behaviors of the two algorithms are quite similar on average. Our algorithm tends to perform better at higher support thresholds, while at lower thresholds the performance gap narrows. FP-GROWTH-TINY runs faster except on bms-webview1, bms-webview2, and the lower thresholds of T10.I4.1024K. On the bms-webview1 database, FP-GROWTH-TINY runs 10-27% slower; on the bms-webview2 database, FP-GROWTH-TINY slows down by a factor of 5.56 at the 0.25% support threshold, and a slowdown is observed at the other support thresholds as well (down to 50%). On T10.I4.1024K we see a 12% slowdown at 0.25% support and a 2% slowdown at 0.3% support. On the other problem instances FP-GROWTH-TINY runs faster, by up to 28.5% on the retail dataset at 0.75% support.
   In the figures, we observe a relation between memory saving and decreased running time. We had expected that improving space utilization would remarkably decrease the running time; however, we have not observed as large an improvement in running time as we would have liked. On the other hand, our trials show a significant improvement in memory use compared to vanilla FP-GROWTH, allowing us to mine larger and more complicated datasets with the same amount of memory.
   The adverse situation with bms-webview1 and bms-webview2 shows that the performance study must be extended to determine whether the undesirable behavior recurs at larger scales, since these are both sparse data sets coming from the same source. At any rate, a closer inspection of FP-GROWTH-TINY seems necessary. We anticipate that the benchmark studies at the FIMI workshop will illustrate its performance more precisely.

5   Note that FP-GROWTH uses a compressed representation of frequency information, whose size may be thought of as related to the complexity of the dataset.

6. Conclusions

   We have presented our version of FP-GROWTH, which sports multiple improvements, in Section 3. An optimization over the original algorithm eliminates a large intermediate structure required in the recursive step of the published FP-GROWTH algorithm, in addition to two other minor improvements.
   In Section 5, we have reported the results of our performance experiments on synthetic and real-world databases. The performance of the optimized algorithm has been compared with a publicly available FP-GROWTH implementation. We have observed a more than twofold improvement in memory utilization over the vanilla algorithm. In the best case, memory size has become 2.4 times smaller, while in the worst case no memory saving was possible, on a small real-world database. Typically, our implementation makes better use of memory, enabling it to mine larger and more complicated databases that cannot be processed by the original algorithm. The running time behavior of the two algorithms is quite similar on average; FP-GROWTH-TINY runs up to 28.5% faster, though it may run slower in a minority of instances.
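The measurement methodology of Section 5.3 (wall-clock running time plus peak heap size via the GNU glibc memusage tool) can be approximated in-process where memusage is unavailable. A rough sketch, using ru_maxrss (peak resident set size, not heap proper) as a coarse stand-in; the `measure` helper and the dummy workload are our own illustrative names, not part of the paper's implementation:

```python
import resource
import time

def measure(workload):
    """Run a callable and return (wall-clock seconds, peak RSS).
    ru_maxrss is reported in kilobytes on Linux (bytes on macOS);
    it upper-bounds, rather than equals, memusage's peak-heap figure."""
    start = time.perf_counter()
    workload()
    elapsed = time.perf_counter() - start
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return elapsed, peak

# Placeholder workload standing in for a mining run:
elapsed, peak = measure(lambda: sum(i * i for i in range(10**6)))
print(f"{elapsed:.3f} s, peak RSS {peak}")
```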
References                                                        [14] A. Savasere, E. Omiecinski, and S. B. Navathe. An ef-
                                                                       ficient algorithm for mining association rules in large
 [1] R. Agrawal and J. C. Shafer. Parallel mining of associ-           databases. In The VLDB Journal, pages 432–444, 1995.
     ation rules. IEEE Trans. On Knowledge And Data En-           [15] D.-Y. Yang, A. Johar, A. Grama, and W. Szpankowski.
     gineering, 8:962–969, 1996.                                       Summary structures for frequency queries on large
 [2] R. Agrawal and R. Srikant. Fast algorithms for min-
     ing association rules. In J. B. Bocca, M. Jarke, and
     C. Zaniolo, editors, Proc. 20th Int. Conf. Very Large
     Data Bases, VLDB, pages 487–499. Morgan Kaufmann,
     September 1994.
 [3] M. J. Z. Bart Goethals. Fimi ’03: Workshop on fre-                321–330, 1997.
     quent itemset mining implementations. In Proceedings         [17] M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li.
     of the ICDM 2003 Workshop on Frequent Itemset Min-                Parallel algorithms for discovery of association rules.
     ing Implementations, Melbourne, Florida, USA, 2003.               Data Mining and Knowledge Discovery, 1(4):343–373,
 [4] Z. Z. Blue. Real world performance of association rule            1997.
     algorithms. In KDD 2001.
 [5] T. Brijs, G. Swinnen, K. Vanhoof, and G. Wets. Us-
     ing association rules for product assortment decisions:
     A case study. In Knowledge Discovery and Data Min-
     ing, pages 254–260, 1999.
 [6] S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dy-
     namic itemset counting and implication rules for market
     basket data. In J. Peckham, editor, SIGMOD 1997, Pro-
     ceedings ACM SIGMOD International Conference on
     Management of Data, May 13-15, 1997, Tucson, Ari-
     zona, USA, pages 255–264. ACM Press, 05 1997.
 [7] K. Geurts, G. Wets, T. Brijs, and K. Vanhoof. Pro-
     filing high frequency accident locations using associ-
     ation rules. In Proceedings of the 82nd Annual Trans-
     portation Research Board, page 18pp, Washington DC
     (USA), January 2003.
 [8] B. Goethals. Memory issues in frequent itemset min-
     ing. In Proceedings of the 2004 ACM Symposium on
     Applied Computing (SAC’04), Nicosia, Cyprus, March
     2004.
 [9] G. Grahne and J. Zhu. Efficiently using prefix-trees in
     mining frequent itemsets. In Proceedings of the First
     IEEE ICDM Workshop on Frequent Itemset Mining Im-
     plementations (FIMI’03).
[10] J. Han, J. Pei, and Y. Yin. Mining frequent patterns
     without candidate generation. In W. Chen, J. Naughton,
     and P. A. Bernstein, editors, 2000 ACM SIGMOD Intl.
     Conference on Management of Data, pages 1–12. ACM
     Press, May 2000.
[11] J. Hipp, U. Güntzer, and G. Nakhaeizadeh. Algorithms
     for association rule mining – a general survey and com-
     parison. SIGKDD Explorations, 2(1):58–64, July 2000.
[12] A. Mueller. Fast sequential and parallel algorithms for
     association rule mining: A comparison. Technical Re-
     port CS-TR-3515, College Park, MD, 1995.
[13] A. Pietracaprina and D. Zandolin. Mining frequent
     itemsets using patricia tries. In Proceedings of the First
     IEEE ICDM Workshop on Frequent Itemset Mining Im-
     plementations (FIMI’03).