=Paper=
{{Paper
|id=Vol-126/paper-4
|storemode=property
|title=A Space Optimization for FP-Growth
|pdfUrl=https://ceur-ws.org/Vol-126/ozkural.pdf
|volume=Vol-126
|dblpUrl=https://dblp.org/rec/conf/fimi/OzkuralA04
}}
==A Space Optimization for FP-Growth==
Eray Özkural and Cevdet Aykanat
Department of Computer Engineering
Bilkent University 06800 Ankara, Turkey
{erayo,aykanat}@cs.bilkent.edu.tr
Abstract

Frequency mining comprises the core of several data mining algorithms. Among frequent pattern discovery algorithms, FP-Growth employs a unique search strategy using compact structures, resulting in a high performance algorithm that requires only two database passes. We introduce an enhanced version of this algorithm called FP-Growth-Tiny which can mine larger databases due to a space optimization eliminating the need for intermediate conditional pattern bases. We present the algorithms required for directly constructing a conditional FP-Tree in detail. The experiments demonstrate that our implementation has a running time comparable to the original algorithm while reducing memory use up to twofold.

1. Introduction

Frequency mining is the discovery of all frequent patterns in a transaction or relational database. Frequent pattern discovery comprises the core of several data mining algorithms such as association rule mining and sequence mining [10], dominating the running time of these algorithms. The problem involves a transaction database T = {X | X ⊆ I} that consists of a set of transactions, each of which is drawn from a set I of items. The mining algorithm finds all patterns that occur with a frequency satisfying a given absolute support threshold ε. In practice, the number of items |I| is on the order of 10^3 or more, and the number of transactions is much larger, at least 10^5. A pattern is X ⊆ I, a subset of I, and the set of all patterns is 2^I. The frequency function f(T, x) = |{x ∈ Y | Y ∈ T}| computes the number of times a given item x ∈ I occurs in the transaction set T; it is extended to sets of items as f(T, X) = |{X ⊆ Y | Y ∈ T}| to compute the frequency of a pattern.

Frequency mining is the discovery of all patterns in a transaction set with a frequency at or above the support threshold ε. The set of all frequent patterns is F(T, ε) = {X | X ∈ 2^I ∧ f(T, X) ≥ ε}. In the algorithms, the set of frequent items F = {x ∈ I | f(T, x) ≥ ε} may require special consideration. A significant property of frequency mining known as downward closure states that if X ∈ F(T, ε) then ∀Y ⊂ X, Y ∈ F(T, ε) [2].

An inherent limitation of frequency mining is the amount of main memory available [8]. In this paper, we present a space optimization to the FP-Growth algorithm and demonstrate its impact on performance with experiments on synthetic and real-world datasets. In the next section, we give the background on the FP-Growth algorithm. Section 3 and Section 4 explain our algorithm and implementation. Section 5 presents the experiments; following that, we offer our conclusions.

2. Background

2.1. Compact structures

Compact data structures have been used for efficient storage and query/update of candidate item sets in frequency mining algorithms. SEAR [12], SPEAR [12], and DIC [6] use tries (also known as prefix trees), while FP-Growth [10] uses the FP-Tree, an enhanced trie structure.

Using concise structures can reduce both the running time and the memory requirements of an algorithm. Tries are well known structures.
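To make the dictionary behavior of a counting trie concrete, the following short Python sketch (an illustration of the general technique, not the paper's code) stores item sets in a canonical order so that sets sharing a prefix share nodes, with the count kept at the final node:

```python
class Trie:
    def __init__(self):
        self.count = 0      # how many inserted sets end at this node
        self.children = {}  # item -> child Trie node

    def insert(self, itemset):
        """Insert one item set; a canonical order gives one path per set."""
        node = self
        for item in sorted(itemset):
            node = node.children.setdefault(item, Trie())
        node.count += 1

    def lookup(self, itemset):
        """Return how many times exactly this set was inserted."""
        node = self
        for item in sorted(itemset):
            if item not in node.children:
                return 0
            node = node.children[item]
        return node.count

root = Trie()
for t in [{'b', 'a'}, {'a', 'b'}, {'a', 'c'}]:
    root.insert(t)
```

Note that lookup counts exact insertions of a set rather than subset frequency; the FP-Tree described below extends precisely this node-sharing idea with per-node counts and node links.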
They are widely used for storing strings and have decent query/update performance. The aforementioned algorithms exploit this property of the data structure for better performance. Tries are also efficient in storage: a large number of strings can be stored in this dictionary type which would not otherwise fit into main memory. For frequency mining algorithms both properties are critical, as our goal is to achieve efficient and scalable algorithms. In particular, the scalability of these structures is quite high [10], as they allow an algorithm to track the frequency information of the candidate patterns for very large databases. The FP-Tree structure in FP-Growth allows the algorithm to maintain all frequency information, obtained in two database passes, in main memory. Using the FP-Tree structure has also resulted in novel search strategies.

A notable work on compact structures is [15], in which a binary-trie based summary structure for representing transaction sets is proposed; the trie is further compressed using Patricia tries. Although significant savings in storage and improvements in query time are reported, the effectiveness of the scheme in a frequency mining algorithm remains to be seen. In another work in the FIMI 2003 workshop [3], an algorithm called PatriciaMine using Patricia tries has been proposed [13]. The performance of PatriciaMine has been shown to be consistently good in the extensive benchmark studies of the FIMI workshop [3]; it was one of the fastest algorithms, although it was not the most efficient. For many applications, average case performance may be more important than performing well in a small number of cases; therefore further research on PatriciaMine would be worthwhile.

In this paper, we introduce an optimized version of FP-Growth.

2.2. FP-Growth algorithm

The FP-Growth algorithm uses the frequent pattern tree (FP-Tree) structure. The FP-Tree is an improved trie in which each item set is stored as a string along with its frequency. Each node of the trie stores item, count and next fields. The items on the path from the root of the trie to a node constitute the item set stored at the node, and count is the frequency of this item set. The node link next is a pointer to the next node with the same item in the FP-Tree. The field parent holds a pointer to the parent node, null for the root. Additionally, we maintain a header table which stores the heads of the node links, giving access to the linked list that spans all nodes with the same item. The FP-Tree stores only frequent items. At the root of the trie is a null item, and strings are inserted into the trie by sorting item sets in a unique decreasing frequency order [10] (all strings must be inserted in the same order; the order of items with the same frequency must be the same).

Table 1 shows a sample transaction set and frequent items in descending frequency order. Figure 1 illustrates the FP-Tree of the sample transaction set in Table 1. As shown in [10], the FP-Tree carries complete information required for frequency mining in a compact manner; the height of the tree is bounded by the maximal number of frequent items in a transaction. MAKE-FP-TREE (Algorithm 1) constructs an FP-Tree from a given transaction set T and support threshold ε.

Transaction                      Ordered Frequent Items
t1 = {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
t2 = {a, b, c, f, l, m, o}       {f, c, a, b, m}
t3 = {b, f, h, j, o}             {f, b}
t4 = {b, c, k, s, p}             {c, b, p}
t5 = {a, f, c, e, l, p, m, n}    {f, c, a, m, p}

Table 1. A sample transaction set

root
├─ f:4
│  ├─ c:3
│  │  └─ a:3
│  │     ├─ m:2
│  │     │  └─ p:2
│  │     └─ b:1
│  │        └─ m:1
│  └─ b:1
└─ c:1
   └─ b:1
      └─ p:1

Header table: f, c, a, b, m, p (node links connect all nodes with the same item)

Figure 1. An FP-Tree structure

Algorithm 1 MAKE-FP-TREE(DB, ε)
1: Compute F and f(x) for each x ∈ F
2: Sort F in frequency decreasing order as L
3: Create the root of an FP-Tree T with label "null"
4: for all transactions t_i ∈ T do
5:   Sort the frequent items in t_i according to L. Let the sorted list be [p|P], where p is the head of the list and P the rest.
6:   INSERT-TRIE([p|P], T)
7: end for

In Algorithm 3 we describe FP-Growth, which has innovative features such as:

1. Novel search strategy
2. Effective use of a summary structure
3. Two database passes

A closer analysis of it is in order. FP-Growth turns the frequent k-length pattern mining problem into "a sequence of k frequent 1-itemset mining problems via a set of conditional pattern bases" [10]. It is proposed that with FP-Growth there is "no need to generate any combinations of candidate sets in the entire mining process". With an FP-Tree Tree given as input, the algorithm generates all frequent patterns. Two points in the algorithm should be explained: the single path case and conditional pattern bases. If an FP-Tree has only a single path, an optimization is to consider all combinations of items in the path (the single path case is the basis of recursion in FP-Growth). Otherwise, the algorithm constructs, for each item a_i in the header table, a conditional pattern base and an FP-Tree Tree_β based on this structure for recursive frequency mining. A conditional pattern base is simply a compact representation of a derivative database in which only a_i and its prefix paths in the original Tree occur. Consider the path <f:4, c:3, a:3, m:2, p:2> in Tree. For mining patterns including m in this path, we need to consider only the prefix path of m, since the nodes after m will be mined elsewhere (in this case, only p). In the prefix path <f:4, c:3, a:3> any pattern including m can have frequency equal to the frequency of m; therefore we may adjust the frequencies in the prefix path as <f:2, c:2, a:2>, which is called a transformed prefix path [10]. The set of transformed prefix paths of a_i forms a small database of patterns which co-occur with a_i, and thus contains complete information required for mining patterns including a_i. Therefore, recursively mining the conditional pattern bases of all a_i in Tree is equivalent to mining Tree (which is equivalent to mining the complete DB). Tree_β in FP-Growth is the FP-Tree of the conditional pattern base.

FP-Growth is indeed remarkable with its unique divide and conquer approach. Nevertheless, it may be profitable to view it as generating candidates despite the title of [10]: the conditional pattern base is a set of candidates, among which only some turn out to be frequent. The main innovation, however, remains intact: FP-Growth takes advantage of a tailored data structure to solve the frequency mining problem with a divide-and-conquer method, with demonstrated efficiency and scalability. Besides, the conditional pattern base is guaranteed to be smaller than the original tree, which is a desirable property. An important distinction of this algorithm is that, when examined within the taxonomy of algorithms, it employs a unique search strategy. When the item sets tested are considered, it is seen that this algorithm is neither DFS nor BFS. The classification of FP-Growth in Figure 3 of [11] may be slightly misleading. As Hipp mentions later in [11], "FP-Growth does not follow the nodes of the tree ..., but directly descends to some part of the itemsets in the search space". In fact, the part is so well defined that it would be unfair to classify FP-Growth as conducting a DFS. It does not even start with item sets of small length and proceed to longer item sets. Rather, it considers a set of patterns at the same time by taking advantage of the data structure. This unique search strategy makes it hard to classify FP-Growth in the context of traditional uninformed search algorithms.

3. An improved version of FP-Growth

During experiments with large databases, we have observed that FP-Growth was costly in terms of memory use. Thus, we have experimented with improvements to the original algorithm. In this section, we propose FP-Growth-Tiny (Algorithm 4), an enhancement of FP-Growth featuring a space optimization and minor improvements. An important optimization eliminates the need for intermediate conditional pattern bases; a minor improvement concerns the single-path case.
The minor improvement comes from not outputting all combinations of the single path in the basis of recursion. Instead, we output a representation of this task, since subsequent algorithms can take advantage of a compact representation for generating association rules and so forth. (For the FIMI workshop, we output all patterns separately as required; it can be argued that a meaningful mining of all frequent patterns must output them one by one.) Another improvement is pruning the infrequent nodes of the single path. In the following subsection, the space optimization is discussed.

Algorithm 2 INSERT-TRIE([p|P], T)
1: if T has a child N such that item[N] = item[p] then
2:   count[N] ← count[N] + 1
3: else
4:   Create a new node N with count = 1, parent linked to T, and node-link next linked to the nodes with the same item
5: end if
6: if P ≠ ∅ then
7:   INSERT-TRIE(P, N)
8: end if

Algorithm 3 FP-GROWTH(Tree, α)
1: if Tree contains a single path P then
2:   for all combinations β of the nodes in path P do
3:     generate pattern β ∪ α with support = the minimum support of the nodes in β
4:   end for
5: else
6:   for all a_i in the header table of Tree do
7:     generate pattern β ← a_i ∪ α with support = support[a_i]
8:     construct β's conditional pattern base and then β's conditional FP-Tree Tree_β
9:     if Tree_β ≠ ∅ then
10:      FP-GROWTH(Tree_β, β)
11:    end if
12:  end for
13: end if

Algorithm 4 FP-GROWTH-TINY(Tree, α)
1: if Tree contains a single path P then
2:   prune infrequent nodes of P
3:   if |P| > 0 then
4:     output "all patterns in 2^P and α"
5:   end if
6: else
7:   for all a_i in the header table of Tree do
8:     Tree_β ← CONS-CONDITIONAL-FP-TREE(Tree, a_i)
9:     output pattern β ← a_i ∪ α with count = f(a_i) in Tree
10:    if Tree_β ≠ ∅ then
11:      FP-GROWTH-TINY(Tree_β, β)
12:    end if
13:  end for
14: end if

3.1. Eliminating conditional pattern base construction

The conditional tree Tree_β can be constructed directly from Tree without an intermediate conditional pattern base. The conditional pattern base in FP-Growth can be implemented as a set of patterns; a pattern in FP-Growth consists of a set of symbols and an associated count. With a counting algorithm and retrieval/insertion of patterns directly into the FP-Tree structure, we can eliminate the need for such a pattern base. Algorithm 5 constructs a conditional FP-Tree from a given Tree and a symbol s for which the transformed prefix paths are computed.

The improved procedure first counts the symbols in the conditional tree without generating an intermediate structure and constructs the set of frequent items. Then, each transformed prefix path is computed as a pattern retrieved from Tree and inserted into Tree_β.

Algorithm 5 CONS-CONDITIONAL-FP-TREE(Tree, s)
1: table ← itemtable[Tree]
2: list ← table[s]
3: Tree′ ← MAKE-FP-TREE
4: {Count symbols without generating an intermediate structure}
5: node ← list
6: while node ≠ null do
7:   COUNT-PREFIX-PATH(node, count[Tree′])
8:   node ← next[node]
9: end while
10: for all sym ∈ range[count] do
11:   if count[sym] ≥ ε then
12:     F[Tree′] ← F[Tree′] ∪ sym
13:   end if
14: end for
15: {Insert the conditional patterns into Tree′}
16: node ← list
17: while node ≠ null do
18:   pattern ← GET-PATTERN(node)
19:   INSERT-PATTERN(Tree′, pattern)
20:   node ← next[node]
21: end while
22: return Tree′

COUNT-PREFIX-PATH, presented in Algorithm 6, scans the prefix paths of a given node. Since the pattern corresponding to the transformed prefix path has the count of the node, it simply adds that count to the counts of all symbols in the prefix path. This step is required for the direct construction of a conditional FP-Tree, since an FP-Tree is based on the decreasing frequency order of F. This small algorithm allows us to compute the counts of the symbols in the conditional tree in an efficient way, and was the key observation in making the optimization possible.

Algorithm 6 COUNT-PREFIX-PATH(node, count)
1: prefixcount ← count[node]
2: node ← parent[node]
3: while parent[node] ≠ null do
4:   count[symbol[node]] ← count[symbol[node]] + prefixcount
5:   node ← parent[node]
6: end while

Algorithm 7 retrieves the transformed prefix path of a given node, excluding the node itself, and Algorithm 8 inserts a pattern into the FP-Tree. GET-PATTERN computes the transformed prefix path as described in [10]. INSERT-PATTERN prunes the items not present in the frequent item set F of Tree (which need not be identical to the F of the calling procedure), sorts the pattern in decreasing frequency order to maintain the FP-Tree properties, and adds the obtained string to the FP-Tree structure. The addition is similar to the insertion of a single string, with the difference that inserting a pattern is equivalent to inserting the symbol string of the pattern count[pattern] times.

Algorithm 7 GET-PATTERN(node)
1: pattern ← MAKE-PATTERN
2: if parent[node] ≠ null then
3:   count[pattern] ← count[node]
4:   currnode ← parent[node]
5:   while parent[currnode] ≠ null do
6:     symbols[pattern] ← symbols[pattern] ∪ symbol[currnode]
7:     currnode ← parent[currnode]
8:   end while
9: else
10:   count[pattern] ← 0
11: end if
12: return pattern

Algorithm 8 INSERT-PATTERN(Tree, pattern)
1: pattern ← pattern ∩ F[Tree]
2: Sort pattern in a predetermined frequency decreasing order
3: Add the pattern to the structure

The optimization in Algorithm 5 makes FP-Growth more efficient and scalable by avoiding additional iterations and cutting down storage requirements. An implementation that uses an intermediate conditional pattern base structure will scan the tree once, constructing a linked list of transformed prefix paths. Then, it will construct the frequent item set from the linked list, and in a second iteration insert all transformed prefix paths with a procedure similar to INSERT-PATTERN. Such an implementation would have to copy the transformed prefix paths twice, and iterate over all prefix paths three times: once in the tree, and twice in the conditional pattern list. In contrast, our optimized procedure does not execute any expensive copying operations, and it needs to scan the prefix paths only twice, in the tree. Besides efficiency, the elimination of the extra storage requirement is significant because it allows FP-Growth to mine more complicated data sets with the same amount of memory.

An idea similar to our algorithm was independently explored in FP-Growth* by making use of information in 2-items [9]. In their implementation, Grahne and Zhu have used strategies based on 2-items to improve running time and memory usage, and they have reported favorable performance, which has also been demonstrated in the benchmarks of the FIMI '03 workshop [3].

4. Implementation notes

We have made a straightforward implementation of FP-Growth-Tiny and licensed it under the GNU GPL for public use. It has been written in C++, using the GNU g++ compiler, version 3.2.2.
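To illustrate the direct construction of Section 3.1 (Algorithms 5-7) outside the C++ code, here is a hedged Python sketch: one pass over the node links of symbol s accumulates symbol counts along prefix paths (as in Algorithm 6); a second pass re-reads each transformed prefix path, prunes infrequent symbols, sorts into a predetermined order and inserts it (as in Algorithms 7 and 8). No intermediate pattern-base list is materialized. All names are illustrative:

```python
class Node:
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}   # item -> child Node
        self.next = None     # node-link to next node with the same item

def insert_path(root, header, items, count=1):
    """Insert an already-ordered item string with the given count."""
    node = root
    for x in items:
        child = node.children.get(x)
        if child is None:
            child = Node(x, node)
            child.next, header[x] = header.get(x), child
            node.children[x] = child
        child.count += count
        node = child

def cond_fp_tree(header, s, eps):
    """Conditional FP-Tree of symbol s, built without a pattern base."""
    # Pass 1 (cf. Algorithm 6): push each s-node's count up its prefix path.
    freq = {}
    n = header.get(s)
    while n is not None:
        p = n.parent
        while p is not None and p.item is not None:
            freq[p.item] = freq.get(p.item, 0) + n.count
            p = p.parent
        n = n.next
    F = {x for x, c in freq.items() if c >= eps}
    # Pass 2 (cf. Algorithms 7 and 8): prune, sort and insert each path.
    root, hdr = Node(None, None), {}
    n = header.get(s)
    while n is not None:
        path, p = [], n.parent
        while p is not None and p.item is not None:
            if p.item in F:
                path.append(p.item)
            p = p.parent
        path.sort(key=lambda x: (-freq[x], x))  # predetermined unique order
        insert_path(root, hdr, path, n.count)
        n = n.next
    return root, hdr, F

# Transactions of Table 1, already reduced to ordered frequent items:
root, header = Node(None, None), {}
for t in [['f','c','a','m','p'], ['f','c','a','b','m'], ['f','b'],
          ['c','b','p'], ['f','c','a','m','p']]:
    insert_path(root, header, t)
croot, chdr, cF = cond_fp_tree(header, 'm', 3)
```

With Table 1 and ε = 3, the conditional tree of m comes out as a single path with all counts equal to 3, matching the transformed prefix paths of Section 2.2 (ordered a, c, f by this sketch's alphabetical tie-break).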
For variable length arrays, we used vector from the standard library. For storing transactions, patterns and other structures representable as strings, we have used efficient variable length arrays. We used set to store item sets in certain places where it would be fast to do so; otherwise we have used sorted arrays to implement sets. No low level memory or I/O optimizations were employed.

A shortcoming of the pattern growth approach is that it does not seem to be very memory efficient. We store many fields per node, and the algorithm consumes a lot of memory in practice.

The algorithm has a detail which required special code: sorting the frequent items in a transaction according to an order L, in line 2 of Algorithm 1 and line 2 of Algorithm 8. For preserving the FP-Tree properties, all transactions must be inserted in the very same order (likewise for patterns in our implementation). The items are sorted first in order of decreasing frequency and secondarily in order of indices to achieve a unique frequency decreasing order. Using this procedure, we are not obliged to maintain an L.

5. Performance study

In this section we report on our experiments demonstrating the performance of FP-Growth-Tiny. We have measured the performance of Algorithm 3 and Algorithm 4 on a 2.4 GHz Pentium 4 Linux system with 1 GB memory and a common 7200 RPM IDE hard disk. Both algorithms were run on four synthetic and five real-world databases with varying support thresholds. The implementation of the original FP-Growth algorithm is due to Bart Goethals, who has made it publicly available at http://www.cs.helsinki.fi/u/goethals/.

We describe the data sets used for our experiments in the next two subsections. Following that, we present our performance experiments and interpret the results briefly, comparing the performance of the improved algorithm with the original one.

5.1. Synthetic data

We have used the association rule generator described in [2] for synthetic data. The synthetic databases in our evaluation have been selected from [17] and [16]; they have been derived from previous studies [1, 2, 14]. Table 2 explains the symbols we use for denoting the parameters of the association rule generator tool. The experimental databases are depicted in Table 3. In all synthetic databases, |I| is 1000 and |Fmax| is 2000. The original algorithm could not process T20.I6.D1137 in memory; therefore the number of transactions was decreased to 450K.

|T|       Number of transactions in the transaction set
|t|avg    Average size of a transaction t_i
|fm|avg   Average length of a maximal pattern f_m
|I|       Number of items in the transaction set
|Fmax|    Number of maximal frequent patterns

Table 2. Dataset parameters

Name           |T|            |t|avg   |fm|avg
T10.I6.1600K   1.6 × 10^6     10       6
T10.I4.1024K   1.024 × 10^6   10       4
T15.I4.367K    3.67 × 10^5    15       4
T20.I6.450K    4.5 × 10^5     20       6

Table 3. Synthetic data sets

5.2. Real-world data

We have used five publicly available datasets in the FIMI repository. accidents is traffic accident data [7]. retail is market basket data from an anonymous Belgian retail store [5]. The bms-webview1, bms-webview2 and bms-pos datasets are from a benchmark study described in [4]. Some statistics of the datasets are presented in Table 4.

5.3. Memory consumption and running time

The memory consumption and running time of FP-Growth-Tiny and FP-Growth are plotted for varying relative supports from 0.25% to 0.75% in Figure 2 and Figure 3 for the synthetic databases, and in Figure 4 and Figure 5 for the real-world databases, except for the accidents database, which is denser and should be mined at supports of 10% and more. Both implementations were run under the same conditions.
[Figures 2 and 3 plot heap use (MB) and running time (sec) against support (%) for FP-Growth-Tiny and FP-Growth on T10.I6.1600K, T10.I4.1024K, T15.I4.367K and T20.I6.450K.]

Figure 2. Memory consumption of FP-Growth-Tiny and FP-Growth on synthetic databases

Figure 3. Running time performance of FP-Growth-Tiny and FP-Growth on synthetic databases
[Figures 4 and 5 plot heap use (MB) and running time (sec) against support (%) for FP-Growth-Tiny and FP-Growth on accidents, retail, bms-pos, bms-webview1 and bms-webview2.]

Figure 4. Memory consumption of FP-Growth-Tiny and FP-Growth on real-world databases

Figure 5. Running time performance of FP-Growth-Tiny and FP-Growth on real-world databases
Name           |T|           |I|      |t|avg
accidents      3.41 × 10^5   469      33.81
retail         8.82 × 10^4   16470    10.31
bms-pos        5.16 × 10^5   1657     6.53
bms-webview1   5.96 × 10^4   60978    2.51
bms-webview2   7.75 × 10^4   330286   4.62

Table 4. Real-world data sets

The implementations were run inside a typical KDE desktop session. The running time is measured as wall-clock time. The memory usage is measured using the GNU glibc tool memusage, considering only the maximum heap size, since stack use is much smaller than heap use.

The plots for the synthetic datasets are similar among themselves, while we observe more variation in the real-world datasets. Memory is saved in all databases except bms-webview2, which requires 2.74 times the memory used by FP-Growth; this has an adverse effect on running time, as discussed below. In the others, we observe that memory usage is reduced, down to 41.5% in the accidents database at 4% support, which is 2.4 times smaller than FP-Growth.

Due to the optimization, our implementation can process larger databases than the vanilla version. For most problem instances, the memory consumption has been reduced more than twofold compared to the original algorithm. An advantage of our approach is that with the same amount of memory, we can process more complicated databases. (Note that FP-Growth uses a compressed representation of frequency information, whose size may be thought of as related to the complexity of the dataset.) The experiments overall show that the conditional pattern base construction which we have eliminated has a significant space cost during the recursive construction of conditional FP-Trees.

The running time behaviors of the two algorithms are quite similar on average. Our algorithm tends to perform better and is faster at higher support thresholds, while at lower thresholds the performance gap narrows. FP-Growth-Tiny runs faster except in bms-webview1, bms-webview2 and at lower thresholds of T10.I4.1024K. In the bms-webview1 database, FP-Growth-Tiny runs 10-27% slower; in the bms-webview2 database we observe that FP-Growth-Tiny has slowed down by a factor of 5.56 at the 0.25% support threshold, and slowdown is observed at the other support thresholds as well (down to 50%). In T10.I4.1024K we see a 12% slowdown at 0.25% support and a 2% slowdown at 0.3% support. In the other problem instances FP-Growth-Tiny runs faster, up to 28.5% for the retail dataset at 0.75% support.

In the figures, we observe a relation between memory saving and decreased running time. We had expected that improving space utilization would remarkably decrease the running time; however, we have not observed as large an improvement in running time as we would have liked. On the other hand, our trials show significant improvement in memory use contrasted with vanilla FP-Growth, allowing us to mine more complicated and larger datasets with the same amount of memory.

The adverse situation with bms-webview1 and bms-webview2 shows that the performance study must be extended to determine whether the undesirable behavior recurs at a larger scale, since these are both sparse data sets coming from the same source. At any rate, a closer inspection of FP-Growth-Tiny seems necessary. We anticipate that the benchmark studies at the FIMI workshop will illustrate its performance more precisely.

6. Conclusions

We have presented our version of FP-Growth, which sports multiple improvements described in Section 3. An optimization over the original algorithm eliminates a large intermediate structure required in the recursive step of the published FP-Growth algorithm, in addition to two other minor improvements.

In Section 5, we have reported the results of our performance experiments on synthetic and real-world databases. The performance of the optimized algorithm has been compared with a publicly available FP-Growth implementation. We have observed more than twofold improvement in memory utilization over the vanilla algorithm. In the best case, memory size has become 2.4 times smaller, while in the worst case memory saving was not possible in a small real-world database. Typically, our implementation makes better use of memory, enabling it to mine larger and more complicated databases that cannot be processed by the original algorithm. The running time behavior of both algorithms is quite similar on average; FP-Growth-Tiny runs up to 28.5% faster, though it may run slower in a minority of instances.
References

[1] R. Agrawal and J. C. Shafer. Parallel mining of association rules. IEEE Trans. on Knowledge and Data Engineering, 8:962–969, 1996.
[2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In J. B. Bocca, M. Jarke, and C. Zaniolo, editors, Proc. 20th Int. Conf. Very Large Data Bases, VLDB, pages 487–499. Morgan Kaufmann, 1994.
[3] B. Goethals and M. J. Zaki. FIMI '03: Workshop on frequent itemset mining implementations. In Proceedings of the ICDM 2003 Workshop on Frequent Itemset Mining Implementations, Melbourne, Florida, USA, 2003.
[4] Z. Zheng, R. Kohavi, and L. Mason. Real world performance of association rule algorithms. In KDD 2001.
[5] T. Brijs, G. Swinnen, K. Vanhoof, and G. Wets. Using association rules for product assortment decisions: A case study. In Knowledge Discovery and Data Mining, pages 254–260, 1999.
[6] S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In J. Peckham, editor, SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data, May 13–15, 1997, Tucson, Arizona, USA, pages 255–264. ACM Press, May 1997.
[7] K. Geurts, G. Wets, T. Brijs, and K. Vanhoof. Profiling high frequency accident locations using association rules. In Proceedings of the 82nd Annual Transportation Research Board, 18 pp., Washington DC (USA), January 2003.
[8] B. Goethals. Memory issues in frequent itemset mining. In Proceedings of the 2004 ACM Symposium on Applied Computing (SAC'04), Nicosia, Cyprus, March 2004.
[9] G. Grahne and J. Zhu. Efficiently using prefix-trees in mining frequent itemsets. In Proceedings of the First IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI'03), 2003.
[10] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In W. Chen, J. Naughton, and P. A. Bernstein, editors, 2000 ACM SIGMOD Intl. Conference on Management of Data, pages 1–12. ACM Press, May 2000.
[11] J. Hipp, U. Güntzer, and G. Nakhaeizadeh. Algorithms for association rule mining – a general survey and comparison. SIGKDD Explorations, 2(1):58–64, July 2000.
[12] A. Mueller. Fast sequential and parallel algorithms for association rule mining: A comparison. Technical Report CS-TR-3515, College Park, MD, 1995.
[13] A. Pietracaprina and D. Zandolin. Mining frequent itemsets using Patricia tries. In Proceedings of the First IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI'03), 2003.
[14] A. Savasere, E. Omiecinski, and S. B. Navathe. An efficient algorithm for mining association rules in large databases. In The VLDB Journal, pages 432–444, 1995.
[15] D.-Y. Yang, A. Johar, A. Grama, and W. Szpankowski. Summary structures for frequency queries on large transaction sets. In Data Compression Conference, pages 420–429, 2000.
[16] M. J. Zaki, S. Parthasarathy, and W. Li. A localized algorithm for parallel association mining. In ACM Symposium on Parallel Algorithms and Architectures, pages 321–330, 1997.
[17] M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithms for discovery of association rules. Data Mining and Knowledge Discovery, 1(4):343–373, 1997.