<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Space Optimization for FP-Growth</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Eray Özkural</string-name>
          <email>erayo@cs.bilkent.edu.tr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cevdet Aykanat</string-name>
          <email>aykanat@cs.bilkent.edu.tr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Engineering, Bilkent University, 06800</institution>
          <addr-line>Ankara</addr-line>
          ,
          <country country="TR">Turkey</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The frequency mining problem forms the core of several data mining algorithms. Among frequent pattern discovery algorithms, FP-GROWTH employs a unique search strategy over compact structures, resulting in a high-performance algorithm that requires only two database passes. We introduce an enhanced version of this algorithm, called FP-GROWTH-TINY, which can mine larger databases owing to a space optimization that eliminates the need for intermediate conditional pattern bases. We present in detail the algorithms required for constructing a conditional FP-Tree directly. The experiments demonstrate that our implementation has running time performance comparable to the original algorithm while reducing memory use by up to a factor of two.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Frequency mining is the discovery of all frequent
patterns in a transaction or relational database.
Frequent pattern discovery comprises the core of several
data mining algorithms such as association rule
mining and sequence mining [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], dominating the running
time of these algorithms. The problem involves a
transaction database T = {X | X ⊆ I} that consists of a set
of transactions, each of which is drawn from a set I of
items. The mining algorithm finds all patterns that
occur with a frequency satisfying a given absolute
support threshold ε. In practice, the number of items |I| is
on the order of 10^3 or more, and the number of
transactions is much larger, at least 10^5. A pattern is
X ⊆ I, a subset of I, and the set of all patterns is 2^I.
      </p>
      <sec id="sec-1-1">
        <p>The frequency function f(T, x) = |{Y ∈ T | x ∈ Y}|
computes the number of times a given item x ∈ I
occurs in the transaction set T; it is extended to sets of
items as f(T, X) = |{Y ∈ T | X ⊆ Y}| to compute the
frequency of a pattern.</p>
        <p>Frequency mining is the discovery of all frequent
patterns in a transaction set with a frequency at or above the
support threshold ε. The set of all frequent
patterns is F(T, ε) = {X | X ∈ 2^I ∧ f(T, X) ≥ ε}. In
the algorithms, the set of frequent items F = {x ∈ I | f(T, x) ≥ ε}
may require special consideration.</p>
      </sec>
      <sec id="sec-1-2">
        <p>
          A significant property of frequency mining known as
downward closure states that if X ∈ F(T, ε) then
∀Y ⊂ X, Y ∈ F(T, ε) [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
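        <p>As an illustrative sketch (our own, not from the paper), the definitions above can be realized by a naive enumeration over 2^I; the function names and the toy transaction set are ours:</p>

```python
from itertools import combinations

def frequency(T, X):
    """f(T, X): the number of transactions Y in T with X ⊆ Y."""
    X = frozenset(X)
    return sum(1 for Y in T if X <= Y)

def frequent_patterns(T, eps):
    """F(T, eps): all patterns in 2^I with frequency at least eps
    (naive enumeration; exponential in |I|)."""
    I = sorted(set().union(*T))
    F = set()
    for k in range(1, len(I) + 1):
        for X in combinations(I, k):
            if frequency(T, X) >= eps:
                F.add(frozenset(X))
    return F

T = [frozenset("fcamp"), frozenset("fcabm"), frozenset("fb"),
     frozenset("cbp"), frozenset("fcamp")]
print(frequency(T, {"f", "c"}))  # 3 transactions contain both f and c
```

        <p>This brute-force version only serves to make f(T, X), F(T, ε) and downward closure concrete; the algorithms below avoid such enumeration.</p>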
        <p>
          An inherent limitation of frequency mining is the
amount of main memory available [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. In this paper,
we present a space optimization to the FP-GROWTH
algorithm and demonstrate its impact on performance
with experiments on synthetic and real-world datasets.
In the next section, we give the background on the
FP-GROWTH algorithm. Section 3 and Section 4 explain
our algorithm and implementation. Section 5 presents
the experiments, following that we offer our
conclusions.
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <sec id="sec-2-1">
        <title>2.1. Compact structures</title>
        <p>
          Compact data structures have been used for efficient
storage and query/update of candidate item sets in
frequency mining algorithms. SEAR [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], SPEAR [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ],
and DIC [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] use tries (also known as prefix trees) while
FP-GROWTH [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] uses FP-Tree which is an enhanced
trie structure.
        </p>
        <p>
          Using concise structures can reduce both running
time and memory size requirements of an algorithm.
Tries are well known structures that are widely used
for storing strings and have decent query/update
performance. The aforementioned algorithms exploit this
property of the data structure for better performance.
Tries are also efficient in storage: a large number of
strings that would not otherwise fit into main memory
can be stored in this dictionary type. For
frequency mining algorithms both properties are
critical as our goal is to achieve efficient and scalable
algorithms. In particular, the scalability of these
structures is quite high [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] as they allow an algorithm to
track the frequency information of the candidate
patterns for very large databases. The FP-Tree structure in
FP-GROWTH allows the algorithm to maintain all
frequency information in the main memory obtained from
two database passes. Using the FP-Tree structure has
also resulted in novel search strategies.
        </p>
        <p>
          A notable work on compact structures is [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] in
which a binary-trie based summary structure for
representing transaction sets is proposed. The trie is further
compressed using Patricia tries. Although significant
savings in storage and improvements in query time are
reported, the effectiveness of the scheme in a frequency
mining algorithm remains to be seen. In another work
in FIMI 2003 workshop [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], an algorithm called
PATRICIAMINE using Patricia tries has been proposed [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
The performance of PATRICIAMINE has been shown to
be consistently good in the extensive benchmark
studies of FIMI workshop [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]; it was one of the fastest
algorithms although it was not the most efficient. For
many applications, the average case performance may
be more important than performing well in a small
number of cases; therefore, further research on
PATRICIAMINE would be worthwhile.
        </p>
        <p>In this paper, we introduce an optimized version of
FP-GROWTH. A closer analysis of it is in order.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. FP-Growth algorithm</title>
        <p>
          The FP-GROWTH algorithm uses the frequent
pattern tree (FP-Tree) structure. FP-Tree is an improved
trie structure such that each itemset is stored as a string
in the trie along with its frequency. At each node of the
trie, item, count and next fields are stored. The items
of the path from the root of the trie to a node
constitute the item set stored at the node and the count is
the frequency of this item set. The node link next is a
pointer to the next node with the same item in the
FPTree. Field parent holds a pointer to the parent node,
null for root. Additionally, we maintain a header table
which stores heads of node links accessing the linked
list that spans all same items. FP-Tree stores only
frequent items. At the root of the trie is a null item, and
strings are inserted in the trie by sorting item sets in a
unique1 decreasing frequency order [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
        </p>
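        <p>The node fields and header table described above can be sketched as follows (Python is used for illustration; the paper's actual implementation is in C++, and the class names here are our own):</p>

```python
class FPNode:
    """One FP-Tree node: the item, its count, the node-link `next`,
    a parent pointer (null for the root), and child links."""
    def __init__(self, item=None, parent=None):
        self.item = item          # None encodes the root's null item
        self.count = 0
        self.parent = parent
        self.next = None          # next node carrying the same item
        self.children = {}        # item -> child node

class FPTree:
    def __init__(self):
        self.root = FPNode()
        self.header = {}          # header table: item -> head of node-link list

    def insert(self, items, count=1):
        """Insert one transaction whose items are already sorted
        in the unique decreasing frequency order."""
        node = self.root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
                child.next = self.header.get(item)   # extend node-link chain
                self.header[item] = child
            child.count += count
            node = child

tree = FPTree()
for t in [list("fcamp"), list("fcabm"), list("fb"), list("cbp"), list("fcamp")]:
    tree.insert(t)
# the root's f-branch now has count 4, as in the sample of Table 1
```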
        <p>
          Table 1 shows a sample transaction set and frequent
items in descending frequency order. Figure 1
illustrates the FP-Tree of sample transaction set in Table 1.
As shown in [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], FP-Tree carries complete
information required for frequency mining and in a compact
manner; the height of the tree is bounded by
maximal number of frequent items in a transaction.
MAKEFP-TREE (Algorithm 1) constructs an FP-Tree from a
given transaction set T and support threshold as
described.
        </p>
        <p>Table 1. A sample transaction set and the frequent items of each transaction in descending frequency order:
Transaction | Ordered Frequent Items
t1 = {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
t2 = {a, b, c, f, l, m, o} | {f, c, a, b, m}
t3 = {b, f, h, j, o} | {f, b}
t4 = {b, c, k, s, p} | {c, b, p}
t5 = {a, f, c, e, l, p, m, n} | {f, c, a, m, p}</p>
        <p>
          In Algorithm 3 we describe FP-GROWTH which has
innovative features such as:
1. Novel search strategy
2. Effective use of a summary structure
3. Two database passes
FP-GROWTH turns the frequency k-length pattern
mining problem into “a sequence of k-frequent 1-item
set mining problems via a set of conditional pattern
bases” [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. It is proposed that with FP-GROWTH there
is “no need to generate any combinations of candidate
sets in the entire mining process”. With an FP-Tree
T ree given as input the algorithm generates all
frequent patterns. There are two points in the algorithm
that should be explained: the single path case and
conditional pattern bases. If an FP-Tree has only a single
path, then an optimization is to consider all
combinations of items in the path (single path case is the
basis of recursion in FP-GROWTH).
        </p>
        <p>1 All strings must be inserted in the same order; the order of items
with the same frequency must be the same.</p>
        <p>[Figure 1: The FP-Tree of the sample transaction set in Table 1. A header table with
entries f, c, a, b, m, p links the nodes of each item; nodes store counts such as
f:4, c:3, a:3, m:2, p:2, and b:1, b:1, m:1 on the side branches.]</p>
        <p>
          Otherwise, the
algorithm constructs for each item ai in the header table a
conditional pattern base and an FP-Tree T ree β based
on this structure for recursive frequency mining.
Conditional pattern base is simply a compact
representation of a derivative database in which only ai and its
prefix paths in the original T ree occur. Consider path
&lt; f : 4, c : 3, a : 3, m : 2, p : 2 &gt; in T ree. For
mining patterns including m in this path, we need to
consider only the prefix path of m since the nodes after m
will be mined elsewhere (in this case only p). In the
prefix path &lt; f : 4, c : 3, a : 3 &gt; any pattern including m
can have frequency at most equal to the frequency of m;
therefore we may adjust the frequencies in the prefix path
to &lt; f : 2, c : 2, a : 2 &gt;, which is called a
transformed prefix path [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. The set of transformed prefix
paths of ai forms a small database of patterns which
co-occur with ai and thus contains the complete information
required for mining patterns including ai. Therefore,
recursively mining conditional pattern bases for all ai
in T ree is equivalent to mining T ree (which is in turn
equivalent to mining the complete DB). T reeβ in FP-GROWTH
is the FP-Tree of the conditional pattern base.
        </p>
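        <p>A small sketch (our own) of the transformed-prefix-path computation on the example path above:</p>

```python
def transformed_prefix_path(path, item):
    """Given a root-to-leaf `path` of (item, count) pairs, return the prefix
    path of `item` with all counts adjusted down to item's own count."""
    idx = next(i for i, (x, _) in enumerate(path) if x == item)
    count = path[idx][1]
    return [(x, count) for x, _ in path[:idx]]

path = [("f", 4), ("c", 3), ("a", 3), ("m", 2), ("p", 2)]
print(transformed_prefix_path(path, "m"))  # [('f', 2), ('c', 2), ('a', 2)]
```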
        <p>
          FP-GROWTH is indeed remarkable with its unique
divide and conquer approach. Nevertheless, it may be
profitable for us to view it as generating candidates
despite the title of [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. The conditional pattern base is
a set of candidates among which only some of them
turn out to be frequent. The main innovation
however remains intact: FP-GROWTH takes advantage of
a tailored data structure to solve the frequency
mining problem with a divide-and-conquer method and
with demonstrated efficiency and scalability. Besides,
the conditional pattern base is guaranteed to be smaller
than the original tree, which is a desirable property. An
important distinction of this algorithm is that, when
examined within the taxonomy of algorithms, it employs
a unique search strategy. When the item sets tested
are considered, it is seen that this algorithm is neither
DFS nor BFS. The classification for FP-GROWTH in
Figure 3 of [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] may be slightly misleading. As Hipp
later mentions in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], “FP-Growth does not follow the
nodes of the tree . . . , but directly descends to some part
of the itemsets in the search space”. In fact, the part is
so well defined that it would be unfair to classify
FP-GROWTH as conducting a DFS. It does not even start
with item sets of small length and proceed to longer
item sets. Rather, it considers a set of patterns at the
same time by taking advantage of the data structure.
This unique search strategy makes it hard to classify
FP-GROWTH in the context of traditional uninformed
search algorithms.
        </p>
        <p>Algorithm 1 MAKE-FP-TREE(DB, ε)
1: Compute F and f(x) for all x ∈ F
2: Sort F in decreasing frequency order as L
3: Create the root of an FP-Tree T with label “null”
4: for all transactions ti ∈ DB do
5: Sort the frequent items in ti according to L. Let the
sorted list be [p|P] where p is the head of the
list and P the rest.
6: INSERT-TRIE([p|P], T)
7: end for</p>
        <sec id="sec-2-2-1">
          <title>3. An improved version of FP-Growth</title>
          <p>During experiments with large databases, we have
observed that FP-GROWTH was costly in terms of
memory use. Thus, we have experimented with
improvements to the original algorithm. In this section,
we propose FP-GROWTH-TINY (Algorithm 4) which
is an enhancement of FP-GROWTH featuring a space
optimization and minor improvements. An important
optimization eliminates the need for intermediate
conditional pattern bases. A minor improvement comes
from not outputting all combinations of the single path
in the basis of recursion. Instead, we output a
representation of this task since subsequent algorithms can
take advantage of a compact representation for
generating association rules and so forth.2 Another
improvement is pruning the infrequent nodes of the single path.</p>
          <p>Algorithm 2 INSERT-TRIE([p|P], T)
1: if T has a child N such that item[N] = item[p] then
2: count[N] ← count[N] + 1
3: else
4: Create a new node N with count = 1, parent
linked to T and node-link linked to nodes with
the same item via next
5: end if
6: if P ≠ ∅ then
7: INSERT-TRIE(P, N)
8: end if</p>
          <p>Algorithm 3 FP-GROWTH(T ree, α)
1: if T ree contains a single path P then
2: for all combinations β of the nodes in path P do
3: generate pattern β ∪ α with support = minimum
support of nodes in β
4: end for
5: else
6: for all ai in the header table of T ree do
7: generate pattern β ← ai ∪ α with support =
support[ai]
8: construct β’s conditional pattern base and
then β’s conditional FP-Tree T reeβ
9: if T reeβ ≠ ∅ then
10: FP-GROWTH(T reeβ, β)
11: end if
12: end for
13: end if</p>
          <p>In the following subsection, the space optimization
is discussed.</p>
          <p>3.1. Eliminating conditional pattern base construction</p>
          <p>The conditional tree T reeβ can be constructed
directly from T ree without an intermediate
conditional pattern base.</p>
          <p>2 For the FIMI workshop, we output all patterns separately as
required. It can be argued that a meaningful mining of all frequent
patterns must output them one by one.</p>
          <p>Algorithm 4 FP-GROWTH-TINY(T ree, α)
1: if T ree contains a single path P then
2: prune the infrequent nodes of P
3: if |P| &gt; 0 then
4: output “all patterns in 2^P and α”
5: end if
6: else
7: for all ai in the header table of T ree do
8: T reeβ ← CONS-CONDITIONAL-FP-TREE(T ree, ai)
9: output pattern β ← ai ∪ α with count = f(ai) in T ree
10: if T reeβ ≠ ∅ then
11: FP-GROWTH-TINY(T reeβ, β)
12: end if
13: end for
14: end if</p>
          <p>The conditional pattern base in
FP-GROWTH can be implemented as a set of
patterns. A pattern in FP-GROWTH consists of a set of
symbols and an associated count. With a counting
algorithm and retrieval/insertion of patterns directly into
the FP-Tree structure, we can eliminate the need for
such a pattern base. Algorithm 5 constructs a
conditional FP-Tree from a given T ree and a symbol
s for which the transformed prefix paths are
computed.</p>
          <p>The improved procedure first counts the symbols in
the conditional tree without generating an
intermediate structure and constructs the set of frequent items.
Then, each transformed prefix path is computed as a
pattern retrieved from T ree and inserted into T reeβ.</p>
          <p>COUNT-PREFIX-PATH presented in Algorithm 6
scans the prefix paths of a given node. Since the
pattern corresponding to the transformed prefix path has
the count of the node, it simply adds the count to the
count of all symbols in the prefix path. This step is
required for construction of a conditional FP-Tree
directly since an FP-Tree is based on the
decreasing frequency order of F . This small algorithm allows
us to compute the counts of the symbols in the
conditional tree in an efficient way, and was the key
observation in making the optimization possible.</p>
          <p>
            Algorithm 7 retrieves a transformed prefix path for
a given node excluding the node itself, and Algorithm 8
inserts a pattern into the FP-Tree. GET-PATTERN
computes the transformed prefix path as described in [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ].
INSERT-PATTERN prunes the items not present in the
frequent item set F of T ree (which does not have to
be identical to the F of the calling procedure), sorts the
pattern in decreasing frequency order to maintain
FP-Tree properties, and adds the obtained string to the
FP-Tree structure. The addition is similar to the insertion of a
single string, with the difference that inserting a
pattern is equivalent to inserting the symbol string of
the pattern count[pattern] times.</p>
          <p>Algorithm 5 CONS-CONDITIONAL-FP-TREE(T ree, symbol)
1: table ← itemtable[T ree]
2: list ← table[symbol]
3: T reeβ ← MAKE-FP-TREE
4: Count symbols without generating an intermediate structure
5: node ← list
6: while node ≠ null do
7: COUNT-PREFIX-PATH(node, count[T reeβ])
8: node ← next[node]
9: end while
10: for all sym ∈ range[count] do
11: if count[sym] ≥ ε then
12: F[T reeβ] ← F[T reeβ] ∪ sym
13: end if
14: end for
15: Insert conditional patterns into T reeβ
16: node ← list
17: while node ≠ null do
18: pattern ← GET-PATTERN(node)
19: INSERT-PATTERN(T reeβ, pattern)
20: node ← next[node]
21: end while
22: return T reeβ</p>
          <p>Algorithm 6 COUNT-PREFIX-PATH(node, count)
1: prefixcount ← count[node]
2: node ← parent[node]
3: while parent[node] ≠ null do
4: count[symbol[node]] ← count[symbol[node]] + prefixcount
5: node ← parent[node]
6: end while</p>
          <p>Algorithm 7 GET-PATTERN(node)
1: pattern ← MAKE-PATTERN
2: if parent[node] ≠ null then
3: count[pattern] ← count[node]
4: currnode ← parent[node]
5: while parent[currnode] ≠ null do
6: symbols[pattern] ← symbols[pattern] ∪ symbol[currnode]
7: currnode ← parent[currnode]
8: end while
9: else
10: count[pattern] ← 0
11: end if
12: return pattern</p>
          <p>Algorithm 8 INSERT-PATTERN(T ree, pattern)
1: pattern ← pattern ∩ F [T ree]
2: Sort pattern in a predetermined frequency
decreasing order
3: Add the pattern to the structure</p>
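          <p>A condensed sketch (our own) of how Algorithms 5-8 fit together: one pass over the node-links of the given symbol counts the conditional frequencies (COUNT-PREFIX-PATH), and a second pass retrieves each transformed prefix path and inserts it into the conditional tree, with no intermediate pattern base. The class and function names are ours; `order` stands for the unique decreasing-frequency order:</p>

```python
class Node:
    """FP-Tree node: item, count, parent, node-link, children."""
    def __init__(self, item=None, parent=None):
        self.item, self.count, self.parent = item, 0, parent
        self.next, self.children = None, {}

class Tree:
    """Minimal FP-Tree with a header table of node-links."""
    def __init__(self):
        self.root, self.header = Node(), {}

    def insert(self, items, count=1):
        node = self.root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                child.next = self.header.get(item)
                self.header[item] = child
            child.count += count
            node = child

def _prefix_items(node):
    n = node.parent
    while n.item is not None:       # stop at the root (null item)
        yield n.item
        n = n.parent

def cons_conditional_fptree(tree, symbol, eps, order):
    """Build the conditional FP-Tree of `symbol` directly from `tree`,
    without an intermediate conditional pattern base (cf. Algorithm 5)."""
    # Pass 1 (COUNT-PREFIX-PATH): walk the node-links of `symbol` and add
    # each node's count to every item on its prefix path.
    counts = {}
    node = tree.header.get(symbol)
    while node is not None:
        for item in _prefix_items(node):
            counts[item] = counts.get(item, 0) + node.count
        node = node.next
    frequent = {x for x, c in counts.items() if c >= eps}
    # Pass 2 (GET-PATTERN + INSERT-PATTERN): retrieve each transformed
    # prefix path, prune infrequent items, sort in the unique order, insert.
    cond = Tree()
    node = tree.header.get(symbol)
    while node is not None:
        path = [x for x in _prefix_items(node) if x in frequent]
        path.sort(key=lambda x: (-counts[x], order[x]))
        cond.insert(path, node.count)
        node = node.next
    return cond

order = {x: i for i, x in enumerate("fcabmp")}  # global decreasing-frequency order
tree = Tree()
for t in [list("fcamp"), list("fcabm"), list("fb"), list("cbp"), list("fcamp")]:
    tree.insert(t)
cond = cons_conditional_fptree(tree, "m", 3, order)
# cond is the single path f:3 -> c:3 -> a:3 (item b pruned as infrequent)
```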
          <p>The optimization in Algorithm 5 makes
FP-GROWTH more efficient and scalable by avoiding
additional iterations and cutting down storage
requirements. An implementation that uses an
intermediate conditional pattern base structure will scan
the tree once, constructing a linked list with
transformed prefix paths in it. Then, it will construct the
frequent item set from the linked list, and in a
second iteration insert all transformed prefix paths
with a procedure similar to INSERT-PATTERN. Such
an implementation would have to copy the
transformed prefix paths twice, and iterate over all
prefix paths three times, once in the tree, and twice
in the conditional pattern list. In contrast, our
optimized procedure does not execute any expensive
copying operations and it needs to scan the
pattern bases only twice in the tree. Besides efficiency,
the elimination of extra storage requirement is
significant because it allows FP-GROWTH to mine more
complicated data sets with the same amount of
memory.</p>
          <p>
            An idea similar to our algorithm was independently
explored in FP-GROWTH∗ by making use of
information in 2-items [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ]. In their implementations, Grahne
and Zhu have used strategies based on 2-items to
improve running time and memory usage, and they have
reported favorable performance, which has also been
demonstrated in the benchmarks of the FIMI ’03
workshop [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ].
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Implementation notes</title>
      <p>We have made a straightforward implementation of
FP-GROWTH-TINY and licensed it under GNU GPL
for public use. It has been written in C++, using the GNU
g++ compiler, version 3.2.2.</p>
      <p>For variable length arrays, we used vector&lt;T&gt; from the
standard library. For storing transactions, patterns and
other structures representable as strings we have used
efficient variable length arrays. We used set&lt;T&gt; to
store item sets in certain places where it would be fast
to do so, otherwise we have used sorted arrays to
implement sets.</p>
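      <p>For example, a set stored as a sorted array supports a linear-time intersection (as needed when pruning a pattern against F in Algorithm 8); this sketch is our own illustration:</p>

```python
def intersect_sorted(a, b):
    """Set intersection of two sorted arrays via a two-pointer merge."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

print(intersect_sorted([1, 3, 5, 7], [3, 4, 5, 8]))  # [3, 5]
```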
      <p>No low level memory or I/O optimizations were
employed.</p>
      <p>A shortcoming of the pattern growth approach is
that it does not seem to be very memory efficient. We
store many fields per node and the algorithm consumes
a lot of memory in practice.</p>
      <p>The algorithm has a detail which required special
code: sorting the frequent items in a transaction
according to an order L, in line 5 of Algorithm 1 and line 2
of Algorithm 8. For preserving FP-Tree properties, all
transactions must be inserted in the very same order.3
The items are sorted first in order of decreasing
frequency and secondarily in order of indices to achieve
a unique frequency decreasing order. Using this
procedure, we are not obliged to maintain an L.</p>
    </sec>
    <sec id="sec-4">
      <title>5. Performance study</title>
      <p>In this section we report on our experiments
demonstrating the performance of FP-GROWTH-TINY. We
have measured the performance of Algorithm 3 and
Algorithm 4 on a 2.4 GHz Pentium 4 Linux system with
1 GB memory and a common 7200 RPM IDE hard disk.
Both algorithms were run on four synthetic and five
real-world databases with varying support thresholds.
The implementation of the original FP-GROWTH
algorithm is due to Bart Goethals.4</p>
      <p>We describe the data sets used for our experiments
in the next two subsections. Following that, we present
our performance experiments and interpret the results
briefly, comparing the performance of the improved
algorithm with the original one.</p>
      <sec id="sec-4-1">
        <title>5.1. Synthetic data</title>
        <p>
          We have used the association rule generator
described in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] for synthetic data. Synthetic databases in
our evaluation have been selected from [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] and [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ].
These databases have been derived from previous
studies [
          <xref ref-type="bibr" rid="ref1 ref14 ref2">1, 2, 14</xref>
          ]. Table 2 explains the symbols we use
for denoting the parameters of the association rule
generator tool. The experimental databases are depicted in
Table 3. In all synthetic databases, |I| is 1000, and
|Fmax| is 2000. The original algorithm could not
process T20.I6.D1137 in memory; therefore the number of
transactions was decreased to 450K.
        </p>
        <p>3 For patterns also in our implementation.</p>
        <p>4 Goethals has made his implementation publicly available at
http://www.cs.helsinki.fi/u/goethals/</p>
        <p>Table 2. Parameters of the association rule generator:
|T | : Number of transactions in the transaction set
|t|avg : Average size of a transaction ti
|fm|avg : Average length of a maximal pattern fm
I : Number of items in the transaction set
|Fmax| : Number of maximal frequent patterns</p>
        <p>
          We have used five publicly available datasets in the
FIMI repository. accidents is traffic accident data
[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. retail is market basket data from an
anonymous Belgian retail store [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. The bms-webview1,
bms-webview2 and bms-pos datasets are from a
benchmark study described in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Some statistics of
the datasets are presented in Table 4.
        </p>
        <p>5.3. Memory consumption and running time</p>
        <p>The memory consumption and running time of
FPGROWTH-TINY and FP-GROWTH are plotted for
varying relative supports from 0.25% to 0.75% in Figure 2
and Figure 3 for synthetic databases and Figure 4 and
Figure 5 for real-world databases except for accidents
database which is a denser database that should be
mined at 10% and more. The implementations were run
inside a typical KDE desktop session. The running time
is measured as the wall-clock time of the system call.
The memory usage is measured using the GNU glibc
tool memusage, considering only the maximum heap
size, since stack use is much smaller than heap size.
        </p>
        <p>[Figures 2 and 3: heap use (Mbytes) and running time (sec) of FP-Growth-Tiny and
FP-Growth at support thresholds 0.25% to 0.75% on the synthetic databases.
Figures 4 and 5: the same measurements on the real-world databases,
including the running time for accidents.]</p>
        <p>The plots for synthetic datasets are similar among
themselves, while we observe more variation in
real-world datasets. Memory is saved in all databases,
except in bms-webview2, which requires 2.74 times the
memory used in FP-GROWTH; this has an adverse
effect on running time as discussed below. In others, we
observe that memory usage reduces down to 41.5% in
accidents database with 4% support, which is 2.4 times
smaller than FP-GROWTH.</p>
        <p>Due to the optimization, our implementation can
process larger databases than the vanilla version. For
most problem instances, the memory consumption has
been reduced more than twofold compared to the
original algorithm. An advantage of our approach is that
with the same amount of memory, we can process more
complicated databases.5 The experiments overall show
that the conditional pattern base construction which we
have eliminated has a significant space cost during the
recursive construction of conditional FP-Trees.</p>
        <p>The running time behaviors of the two algorithms are
quite similar on the average. Our algorithm tends to
perform better at higher support
thresholds, while at lower thresholds the performance gap
narrows. FP-GROWTH-TINY runs faster
except in bms-webview1, bms-webview2 and lower
thresholds of T10.I4.1024K. In bms-webview1
database, FP-GROWTH-TINY runs 10-27% slower;
in the bms-webview2 database we observe that
FP-GROWTH-TINY has slowed down by a factor of 5.56
for the 0.25% support threshold, and a slowdown is
observed also for other support thresholds (down to
50%).</p>
        <p>5 Note that FP-GROWTH uses a compressed representation of
frequency information, whose size may be thought of as related to
the complexity of the dataset.</p>
        <p>In T10.I4.1024K we see a 12% slowdown for
0.25% support and a 2% slowdown for 0.3%
support. In other problem instances FP-GROWTH-TINY
runs faster, up to 28.5% for the retail dataset at 0.75%
support.</p>
        <p>In the figures, we observe a relation between
memory saving and decreased running time. We had
expected that improving space utilization would
remarkably decrease the running time. However, we have
not observed as large an improvement in running time as we
would have liked. On the other hand, our
trials show significant improvement in memory use
compared to vanilla FP-GROWTH, allowing us to mine
more complicated/larger datasets with the same amount
of memory.</p>
        <p>The adverse situation with bms-webview1 and
bms-webview2 shows that the performance study must be
extended to determine whether the undesirable
behavior recurs at a large scale, since these are both sparse
data sets coming from the same source. At any rate, a
closer inspection of FP-GROWTH-TINY seems
necessary. We anticipate that the benchmark studies at the
FIMI workshop will illustrate its performance more
precisely.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusions</title>
      <p>We have presented our version of FP-GROWTH
which sports multiple improvements in Section 3. An
optimization over the original algorithm eliminates a
large intermediate structure required in the recursive
step of the published FP-GROWTH algorithm in
addition to two other minor improvements.</p>
      <p>In Section 5, we have reported the results of our
performance experiments on synthetic and real-world
databases. The performance of the optimized
algorithm has been compared with a publicly available
FP-GROWTH implementation. We have observed more
than twofold improvement in memory utilization over
the vanilla algorithm. In the best case, memory size
has become 2.4 times smaller, while in the worst case
memory saving was not possible in a small real-world
database. Typically, our implementation makes better
use of memory, enabling it to mine larger and more
complicated databases that cannot be processed by
the original algorithm. The running time behavior of
both algorithms is quite similar on the average;
FP-GROWTH-TINY runs up to 28.5% faster,
though it may run slower in a minority of instances.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Shafer</surname>
          </string-name>
          .
          <article-title>Parallel mining of association rules</article-title>
          .
          <source>IEEE Trans. On Knowledge And Data Engineering</source>
          ,
          <volume>8</volume>
          :
          <fpage>962</fpage>
          -
          <lpage>969</lpage>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Srikant</surname>
          </string-name>
          .
          <article-title>Fast algorithms for mining association rules</article-title>
          . In
          <string-name>
            <given-names>J. B.</given-names>
            <surname>Bocca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jarke</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Zaniolo</surname>
          </string-name>
          , editors,
          <source>Proc. 20th Int. Conf. Very Large Data Bases, VLDB</source>
          , pages
          <fpage>487</fpage>
          -
          <lpage>499</lpage>
          . Morgan Kaufmann,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Goethals</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Zaki</surname>
          </string-name>
          .
          <article-title>FIMI '03: Workshop on frequent itemset mining implementations</article-title>
          .
          <source>In Proceedings of the ICDM 2003 Workshop on Frequent Itemset Mining Implementations</source>
          , Melbourne, Florida, USA,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kohavi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Mason</surname>
          </string-name>
          .
          <article-title>Real world performance of association rule algorithms</article-title>
          .
          <source>In KDD</source>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brijs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Swinnen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Vanhoof</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Wets</surname>
          </string-name>
          .
          <article-title>Using association rules for product assortment decisions: A case study</article-title>
          .
          <source>In Knowledge Discovery and Data Mining</source>
          , pages
          <fpage>254</fpage>
          -
          <lpage>260</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Brin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Motwani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Ullman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Tsur</surname>
          </string-name>
          .
          <article-title>Dynamic itemset counting and implication rules for market basket data</article-title>
          . In J. Peckham, editor,
          <source>SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data, May 13-15</source>
          ,
          <year>1997</year>
          , Tucson, Arizona, USA, pages
          <fpage>255</fpage>
          -
          <lpage>264</lpage>
          . ACM Press,
          <year>May 1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K.</given-names>
            <surname>Geurts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wets</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Brijs</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Vanhoof</surname>
          </string-name>
          .
          <article-title>Profiling high frequency accident locations using association rules</article-title>
          .
          <source>In Proceedings of the 82nd Annual Transportation Research Board</source>
          , 18 pp., Washington, DC (USA),
          <year>January 2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Goethals</surname>
          </string-name>
          .
          <article-title>Memory issues in frequent itemset mining</article-title>
          .
          <source>In Proceedings of the 2004 ACM Symposium on Applied Computing (SAC'04)</source>
          , Nicosia, Cyprus,
          <year>March 2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Grahne</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          .
          <article-title>Efficiently using prefix-trees in mining frequent itemsets</article-title>
          .
          <source>In Proceedings of the First IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI'03).</source>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pei</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yin</surname>
          </string-name>
          .
          <article-title>Mining frequent patterns without candidate generation</article-title>
          . In
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Naughton</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          , editors,
          <source>2000 ACM SIGMOD Intl. Conference on Management of Data</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          . ACM Press, May
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hipp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Güntzer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Nakhaeizadeh</surname>
          </string-name>
          .
          <article-title>Algorithms for association rule mining - a general survey and comparison</article-title>
          .
          <source>SIGKDD Explorations</source>
          ,
          <volume>2</volume>
          (
          <issue>1</issue>
          ):
          <fpage>58</fpage>
          -
          <lpage>64</lpage>
          ,
          <year>July 2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mueller</surname>
          </string-name>
          .
          <article-title>Fast sequential and parallel algorithms for association rule mining: A comparison</article-title>
          .
          <source>Technical Report CS-TR-3515</source>
          , College Park, MD,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Pietracaprina</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Zandolin</surname>
          </string-name>
          .
          <article-title>Mining frequent itemsets using patricia tries</article-title>
          .
          <source>In Proceedings of the First IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI'03).</source>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Savasere</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Omiecinski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Navathe</surname>
          </string-name>
          .
          <article-title>An efficient algorithm for mining association rules in large databases</article-title>
          .
          <source>In The VLDB Journal</source>
          , pages
          <fpage>432</fpage>
          -
          <lpage>444</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>D.-Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Johar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Grama</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Szpankowski</surname>
          </string-name>
          .
          <article-title>Summary structures for frequency queries on large transaction sets</article-title>
          .
          <source>In Data Compression Conference</source>
          , pages
          <fpage>420</fpage>
          -
          <lpage>429</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Zaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Parthasarathy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>A localized algorithm for parallel association mining</article-title>
          .
          <source>In ACM Symposium on Parallel Algorithms and Architectures</source>
          , pages
          <fpage>321</fpage>
          -
          <lpage>330</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Zaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Parthasarathy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ogihara</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>Parallel algorithms for discovery of association rules</article-title>
          .
          <source>Data Mining and Knowledge Discovery</source>
          ,
          <volume>1</volume>
          (
          <issue>4</issue>
          ):
          <fpage>343</fpage>
          -
          <lpage>373</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>