<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">A Space Optimization for FP-Growth</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Eray</forename><surname>Özkural</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Engineering</orgName>
								<orgName type="institution">Bilkent University</orgName>
								<address>
									<postCode>06800</postCode>
									<settlement>Ankara</settlement>
									<country key="TR">Turkey</country>
								</address>
							</affiliation>
						</author>
						<author role="corresp">
							<persName><forename type="first">Cevdet</forename><surname>Aykanat</surname></persName>
							<email>aykanat@cs.bilkent.edu.tr</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Engineering</orgName>
								<orgName type="institution">Bilkent University</orgName>
								<address>
									<postCode>06800</postCode>
									<settlement>Ankara</settlement>
									<country key="TR">Turkey</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">A Space Optimization for FP-Growth</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">EC51DF3D93F3D6FB8FCF1F620C755C1F</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T05:45+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The frequency mining problem comprises the core of several data mining algorithms. Among frequent pattern discovery algorithms, FP-GROWTH employs a unique search strategy over compact data structures, resulting in a high-performance algorithm that requires only two database passes. We introduce an enhanced version of this algorithm, called FP-GROWTH-TINY, which can mine larger databases thanks to a space optimization that eliminates the need for intermediate conditional pattern bases. We present in detail the algorithms required for constructing a conditional FP-Tree directly. The experiments demonstrate that our implementation has a running time comparable to the original algorithm while reducing memory use by up to a factor of two.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Frequency mining is the discovery of all frequent patterns in a transaction or relational database. Frequent pattern discovery comprises the core of several data mining algorithms such as association rule mining and sequence mining <ref type="bibr" target="#b9">[10]</ref>, dominating the running time of these algorithms. The problem involves a transaction database T = {X | X ⊆ I} that consists of a set of transactions, each of which is drawn from a set I of items. The mining algorithm finds all patterns that occur with a frequency satisfying a given absolute support threshold ε. In practice, the number of items |I| is on the order of 10^3 or more, and the number of transactions is much larger, at least 10^5. A pattern is X ⊆ I, a subset of I, and the set of all patterns is 2^I. The frequency function f(T, x) = |{Y ∈ T | x ∈ Y}| computes the number of transactions in T in which a given item x ∈ I occurs; it is extended to sets of items as f(T, X) = |{Y ∈ T | X ⊆ Y}| to compute the frequency of a pattern.</p><p>Frequency mining is thus the discovery of all patterns in a transaction set with frequency at least the support threshold ε. The set of all frequent patterns is F(T, ε) = {X ∈ 2^I | f(T, X) ≥ ε}. In the algorithms, the set of frequent items F = {x ∈ I | f(T, x) ≥ ε} may require special consideration. A significant property of frequency mining known as downward closure states that if X ∈ F(T, ε) then ∀Y ⊂ X, Y ∈ F(T, ε) <ref type="bibr" target="#b1">[2]</ref>.</p><p>An inherent limitation of frequency mining is the amount of main memory available <ref type="bibr" target="#b7">[8]</ref>. In this paper, we present a space optimization to the FP-GROWTH algorithm and demonstrate its impact on performance with experiments on synthetic and real-world datasets. In the next section, we give the background on the FP-GROWTH algorithm. Section 3 and Section 4 explain our algorithm and implementation. Section 5 presents the experiments, after which we offer our conclusions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Background</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Compact structures</head><p>Compact data structures have been used for efficient storage and query/update of candidate item sets in frequency mining algorithms. SEAR <ref type="bibr" target="#b11">[12]</ref>, SPEAR <ref type="bibr" target="#b11">[12]</ref>, and DIC <ref type="bibr" target="#b5">[6]</ref> use tries (also known as prefix trees), while FP-GROWTH <ref type="bibr" target="#b9">[10]</ref> uses the FP-Tree, an enhanced trie structure.</p><p>Using concise structures can reduce both the running time and the memory requirements of an algorithm. Tries are well-known structures that are widely used for storing strings and have decent query/update performance; the aforementioned algorithms exploit this property for better performance. Tries are also space-efficient: this dictionary structure can hold a large number of strings that would not otherwise fit into main memory. Both properties are critical for frequency mining, where the goal is efficient and scalable algorithms. In particular, the scalability of these structures is quite high <ref type="bibr" target="#b9">[10]</ref>, as they allow an algorithm to track the frequency information of candidate patterns for very large databases. The FP-Tree structure in FP-GROWTH allows the algorithm to maintain in main memory all frequency information obtained from two database passes. Using the FP-Tree structure has also led to novel search strategies.</p><p>A notable work on compact structures is <ref type="bibr" target="#b14">[15]</ref>, in which a binary-trie based summary structure for representing transaction sets is proposed; the trie is further compressed using Patricia tries. Although significant savings in storage and improvements in query time are reported, the effectiveness of the scheme in a frequency mining algorithm remains to be seen. In another work in the FIMI 2003 workshop <ref type="bibr" target="#b2">[3]</ref>, an algorithm called PATRICIAMINE using Patricia tries has been proposed <ref type="bibr" target="#b12">[13]</ref>. The performance of PATRICIAMINE has been shown to be consistently good in the extensive benchmark studies of the FIMI workshop <ref type="bibr" target="#b2">[3]</ref>: it was among the fastest algorithms, although not the best in every instance. For many applications, average-case performance may matter more than excelling in a small number of cases; therefore further research on PATRICIAMINE would be worthwhile.</p><p>In this paper, we introduce an optimized version of FP-GROWTH. A closer analysis of it is in order.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">FP-Growth algorithm</head><p>The FP-GROWTH algorithm uses the frequent pattern tree (FP-Tree) structure. The FP-Tree is an enhanced trie in which each item set is stored as a string along with its frequency. Each node of the trie stores item, count and next fields. The items on the path from the root to a node constitute the item set stored at that node, and the count is the frequency of this item set. The node link next is a pointer to the next node with the same item in the FP-Tree. The field parent holds a pointer to the parent node, null for the root. Additionally, a header table stores the heads of the node links, giving access to the linked list that spans all nodes with the same item. The FP-Tree stores only frequent items. At the root of the trie is a null item, and strings are inserted into the trie by sorting item sets in a unique<ref type="foot" target="#foot_0">1</ref> decreasing frequency order <ref type="bibr" target="#b9">[10]</ref>.</p><p>Table <ref type="table" target="#tab_2">1</ref> shows a sample transaction set and its frequent items in descending frequency order. Figure <ref type="figure" target="#fig_0">1</ref> illustrates the FP-Tree of the sample transaction set in Table <ref type="table" target="#tab_2">1</ref>. As shown in <ref type="bibr" target="#b9">[10]</ref>, the FP-Tree carries the complete information required for frequency mining in a compact manner; the height of the tree is bounded by the maximal number of frequent items in a transaction. MAKE-FP-TREE (Algorithm 1) constructs an FP-Tree from a given transaction set T and support threshold ε as described.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Transaction</head><p>Ordered Frequent Items</p><formula xml:id="formula_0">t 1 = {f, a, c, d, g, i, m, p} {f, c, a, m, p} t 2 = {a, b, c, f, l, m, o} {f, c, a, b, m} t 3 = {b, f, h, j, o} {f, b} t 4 = {b, c, k, s, p} {c, b, p} t 5 = {a, f, c, e, l, p, m, n} {f, c, a, m, p}</formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1. A sample Transaction Set</head><p>In Algorithm 3 we describe FP-GROWTH, which has innovative features such as:</p><p>1. A novel search strategy 2. Effective use of a summary structure 3. Two database passes</p><p>FP-GROWTH turns the frequent k-length pattern mining problem into "a sequence of k-frequent 1-item set mining problems via a set of conditional pattern bases" <ref type="bibr" target="#b9">[10]</ref>. It is proposed that with FP-GROWTH there is "no need to generate any combinations of candidate sets in the entire mining process". With an FP-Tree Tree given as input, the algorithm generates all frequent patterns. Two points in the algorithm should be explained: the single path case and conditional pattern bases. If an FP-Tree has only a single path, then an optimization is to consider all combinations of items in the path (the single path case is the basis of recursion in FP-GROWTH). Otherwise, for each item a_i in the header table, the algorithm constructs a conditional pattern base and an FP-Tree Tree_β based on this structure for recursive frequency mining. A conditional pattern base is simply a compact representation of a derivative database in which only a_i and its prefix paths in the original Tree occur. Consider the path &lt;f : 4, c : 3, a : 3, m : 2, p : 2&gt; in Tree. For mining patterns including m in this path, we need to consider only the prefix path of m, since the nodes after m will be mined elsewhere (in this case only p). In the prefix path &lt;f : 4, c : 3, a : 3&gt; any pattern including m can have frequency at most the frequency of m, therefore we may adjust the frequencies in the prefix path as &lt;f : 2, c : 2, a : 2&gt;, which is called a transformed prefix path <ref type="bibr" target="#b9">[10]</ref>. 
The set of transformed prefix paths of a_i forms a small database of patterns which co-occur with a_i, and thus contains the complete information required for mining patterns including a_i. Therefore, recursively mining the conditional pattern bases of all a_i in Tree is equivalent to mining Tree (which is in turn equivalent to mining the complete DB). Tree_β in FP-GROWTH is the FP-Tree of the conditional pattern base.</p><p>FP-GROWTH is indeed remarkable with its unique divide-and-conquer approach. Nevertheless, it may be profitable to view it as generating candidates despite the title of <ref type="bibr" target="#b9">[10]</ref>: the conditional pattern base is a set of candidates, only some of which turn out to be frequent. The main innovation however remains intact: FP-GROWTH takes advantage of a tailored data structure to solve the frequency mining problem with a divide-and-conquer method and with demonstrated efficiency and scalability. Besides, the conditional pattern base is guaranteed to be smaller than the original tree, which is a desirable property. An important distinction of this algorithm is that, within the taxonomy of algorithms, it employs a unique search strategy. When the item sets tested are considered, it is seen that this algorithm conducts neither a DFS nor a BFS. The classification for FP-GROWTH in Figure <ref type="figure">3</ref> of <ref type="bibr" target="#b10">[11]</ref> may be slightly misleading. As Hipp later mentions in <ref type="bibr" target="#b10">[11]</ref>, "FP-Growth does not follow the nodes of the tree . . . , but directly descends to some part of the itemsets in the search space". In fact, that part is so well defined that it would be unfair to classify FP-GROWTH as conducting a DFS. It does not even start with item sets of small length and proceed to longer ones; rather, it considers a set of patterns at the same time by taking advantage of the data structure. 
This unique search strategy makes it hard to classify FP-GROWTH in the context of traditional uninformed search algorithms.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Algorithm 1 MAKE-FP-TREE(DB, ε)</head><p>1: Compute F and f(x) for all x ∈ F 2: Sort F in frequency decreasing order as L 3: Create root of an FP-Tree T with label "null" 4: for all transactions t_i ∈ DB do 5: Sort the frequent items in t_i according to L. Let the sorted list be [p|P] where p is the head of the list and P the rest. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">An improved version of FP-Growth</head><p>During experiments with large databases, we have observed that FP-GROWTH was costly in terms of memory use. Thus, we have experimented with improvements to the original algorithm. In this section, we propose FP-GROWTH-TINY (Algorithm 4), an enhancement of FP-GROWTH featuring a space optimization and minor improvements. The important optimization eliminates the need for intermediate conditional pattern bases. A minor improvement comes from not outputting all combinations of the single path in the basis of recursion. Instead, we output a representation of this task, since subsequent algorithms can take advantage of a compact representation for generating association rules and so forth.<ref type="foot">2</ref> Another improvement is pruning the infrequent nodes of the single path.</p><formula xml:id="formula_1">Algorithm 2 INSERT-TRIE([p|P], T) 1: if T has a child N such that item[N] = item[p] then 2: count[N] ← count[N] + 1</formula><p>In the following subsection, the space optimization is discussed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Eliminating conditional pattern base construction</head><p>The conditional tree Tree_β can be constructed directly from Tree without an intermediate conditional pattern base. The conditional pattern base in FP-GROWTH can be implemented as a set of patterns, where a pattern consists of a set of symbols and an associated count. With a counting algorithm and retrieval/insertion of patterns directly into the FP-Tree structure, we can eliminate the need for such a pattern base. Algorithm 5 constructs a conditional FP-Tree from a given Tree and a symbol s for which the transformed prefix paths are computed.</p><p>(Footnote 2: For the FIMI workshop, we output all patterns separately as required. It can be argued that a meaningful mining of all frequent patterns must output them one by one.)</p><p>Algorithm 4 FP-GROWTH-TINY(Tree, α)</p><p>1: if Tree contains a single path P then 2: prune infrequent nodes of P</p><formula xml:id="formula_2">3: if |P| &gt; 0 then 4: output "all patterns in 2^P and α" 5: end if 6: else 7: for all a_i in header table of Tree do 8: Tree_β ← CONS-CONDITIONAL-FP-TREE(Tree, a_i)</formula><formula xml:id="formula_3">9: output pattern β ← a_i ∪ α with count = f(a_i) 10: if Tree_β ≠ ∅ then 11: FP-GROWTH-TINY(Tree_β, β) 12: end if</formula><p>13: end for 14: end if</p><p>The improved procedure first counts the symbols in the conditional tree without generating an intermediate structure and constructs the set of frequent items. Then, each transformed prefix path is computed as a pattern retrieved from Tree and inserted into Tree_β.</p><p>COUNT-PREFIX-PATH, presented in Algorithm 6, scans the prefix paths of a given node. Since the pattern corresponding to a transformed prefix path has the count of the node, it simply adds that count to the count of every symbol in the prefix path. This step is required for constructing a conditional FP-Tree directly, since an FP-Tree is based on the decreasing frequency order of F. This small algorithm allows us to compute the counts of the symbols in the conditional tree efficiently, and it was the key observation that made the optimization possible.</p><p>Algorithm 7 retrieves the transformed prefix path for a given node, excluding the node itself, and Algorithm 8 inserts a pattern into the FP-Tree. GET-PATTERN computes the transformed prefix path as described in <ref type="bibr" target="#b9">[10]</ref>. INSERT-PATTERN prunes the items not present in the frequent item set F of Tree (which does not have to be identical to the F of the calling procedure), sorts the pattern in decreasing frequency order to maintain the FP-Tree properties, and adds the obtained string to the FP-Tree structure. The addition is similar to the insertion of a single string, with the difference that insertion of a pattern is equivalent to inserting the symbol string of the pattern count[pattern] times.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Algorithm 8 INSERT-PATTERN(Tree, pattern)</head><p>1: pattern ← pattern ∩ F[Tree] 2: Sort pattern in a predetermined frequency decreasing order 3: Add the pattern to the structure</p><p>The optimization in Algorithm 5 makes FP-GROWTH more efficient and scalable by avoiding additional iterations and cutting down storage requirements. An implementation that uses an intermediate conditional pattern base structure will scan the tree once, constructing a linked list holding the transformed prefix paths. Then, it will construct the frequent item set from the linked list, and in a second iteration insert all transformed prefix paths with a procedure similar to INSERT-PATTERN. Such an implementation would have to copy the transformed prefix paths twice, and iterate over all prefix paths three times: once in the tree, and twice in the conditional pattern list. In contrast, our optimized procedure does not execute any expensive copying operations and needs to scan the pattern bases only twice, in the tree. Besides efficiency, the elimination of the extra storage requirement is significant because it allows FP-GROWTH to mine more complicated data sets with the same amount of memory.</p><p>An idea similar to our algorithm was independently explored in FP-GROWTH* by making use of information in 2-items <ref type="bibr" target="#b8">[9]</ref>. In their implementations, Grahne and Zhu have used strategies based on 2-items to improve running time and memory usage, and they have reported favorable performance, which has also been demonstrated in the benchmarks of the FIMI '03 workshop <ref type="bibr" target="#b2">[3]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Implementation notes</head><p>We have made a straightforward implementation of FP-GROWTH-TINY and licensed it under the GNU GPL for public use. It was written in C++, using the GNU g++ compiler version 3.2.2.</p><p>For variable-length arrays, we used vector&lt;T&gt; from the standard library. For storing transactions, patterns and other structures representable as strings, we used these efficient variable-length arrays. We used set&lt;T&gt; to store item sets in the places where it would be fast to do so; otherwise we used sorted arrays to implement sets.</p><p>No low-level memory or I/O optimizations were employed.</p><p>A shortcoming of the pattern growth approach is that it does not seem to be very memory-efficient: many fields are stored per node, and the algorithm consumes a lot of memory in practice.</p><p>One detail of the algorithm required special care in the code: sorting the frequent items in a transaction according to an order L, in line 2 of Algorithm 1 and line 2 of Algorithm 8. To preserve the FP-Tree properties, all transactions must be inserted in the very same order.<ref type="foot" target="#foot_1">3</ref> The items are sorted primarily in order of decreasing frequency and secondarily in order of item indices to achieve a unique frequency decreasing order. Using this procedure, we are not obliged to maintain L.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Performance study</head><p>In this section we report on our experiments demonstrating the performance of FP-GROWTH-TINY. We have measured the performance of Algorithm 3 and Algorithm 4 on a 2.4 GHz Pentium 4 Linux system with 1 GB of memory and a common 7200 RPM IDE hard disk. Both algorithms were run on four synthetic and five real-world databases with varying support thresholds. The implementation of the original FP-GROWTH algorithm is due to Bart Goethals.<ref type="foot" target="#foot_2">4</ref> We describe the data sets used for our experiments in the next two subsections. Following that, we present our performance experiments and interpret the results briefly, comparing the performance of the improved algorithm with the original one.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Synthetic data</head><p>We have used the association rule generator described in <ref type="bibr" target="#b1">[2]</ref> for synthetic data. The synthetic databases in our evaluation have been selected from <ref type="bibr" target="#b16">[17]</ref> and <ref type="bibr" target="#b15">[16]</ref>; they were derived from previous studies <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2,</ref><ref type="bibr" target="#b13">14]</ref>. Table <ref type="table" target="#tab_3">2</ref> explains the symbols we use for denoting the parameters of the association rule generator tool. The experimental databases are listed in Table <ref type="table" target="#tab_4">3</ref>. In all synthetic databases, |I| is 1000 and |F_max| is 2000. The original algorithm could not process T20.I6.D1137 in memory; therefore the number of transactions was decreased to 450K.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Real-world data</head><p>We have used five publicly available datasets from the FIMI repository. accidents is traffic accident data <ref type="bibr" target="#b6">[7]</ref>. retail is market basket data from an anonymous Belgian retail store <ref type="bibr" target="#b4">[5]</ref>. The bms-webview1, bms-webview2 and bms-pos datasets are from a benchmark study described in <ref type="bibr" target="#b3">[4]</ref>. Some statistics of the datasets are presented in Table <ref type="table" target="#tab_5">4</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">Memory consumption and running time</head><p>The memory consumption and running time of FP-GROWTH-TINY and FP-GROWTH are plotted for relative supports varying from 0.25% to 0.75% in Figure <ref type="figure">2</ref> and Figure <ref type="figure">3</ref> for synthetic databases, and in Figure <ref type="figure">4</ref> and Figure <ref type="figure">5</ref> for real-world databases, except for the accidents database, which is denser and should be mined at supports of 10% and more. The implementations were run inside a typical KDE desktop session. The running time is measured as the wall-clock time of the system call, and memory usage with the GNU glibc tool memusage, considering only the maximum heap size since stack use is much smaller than heap size.</p><p>The plots for the synthetic datasets are similar among themselves, while we observe more variation on the real-world datasets. Memory is saved in all databases except bms-webview2, which requires 2.74 times the memory used by FP-GROWTH; this has an adverse effect on running time as discussed below. In the others, we observe that memory usage is reduced, down to 41.5% in the accidents database with 4% support, which is 2.4 times smaller than FP-GROWTH.</p><p>Due to the optimization, our implementation can process larger databases than the vanilla version. For most problem instances, the memory consumption has been reduced more than twofold compared to the original algorithm. An advantage of our approach is that with the same amount of memory, we can process more complicated databases. <ref type="foot" target="#foot_3">5</ref> Overall, the experiments show that the conditional pattern base construction which we have eliminated has a significant space cost during the recursive construction of conditional FP-Trees.</p><p>The running time behaviors of the two algorithms are quite similar on the average. Our algorithm tends to perform better and is faster at higher support thresholds, while at lower thresholds the performance gap narrows. FP-GROWTH-TINY runs faster except on bms-webview1, bms-webview2 and the lower thresholds of T10.I4.1024K. 
On the bms-webview1 database, FP-GROWTH-TINY runs 10-27% slower; on the bms-webview2 database we observe that FP-GROWTH-TINY slows down by a factor of 5.56 at the 0.25% support threshold, and slowdown is observed at the other support thresholds as well (down to 50%). On T10.I4.1024K we see a 12% slowdown at 0.25% support and a 2% slowdown at 0.3% support. In the other problem instances FP-GROWTH-TINY runs faster, up to 28.5% faster on the retail dataset at 0.75% support.</p><p>In the figures, we observe a relation between memory saving and decreased running time. We had expected that improving space utilization would markedly decrease the running time; however, we have not observed as large an improvement in running time as we would have liked. On the other hand, our trials show a significant improvement in memory use compared to vanilla FP-GROWTH, allowing us to mine more complicated or larger datasets with the same amount of memory.</p><p>The adverse situation with bms-webview1 and bms-webview2 shows that the performance study must be extended to determine whether the undesirable behavior recurs at a larger scale, since these are both sparse data sets coming from the same source. At any rate, a closer inspection of FP-GROWTH-TINY seems necessary. We anticipate that the benchmark studies at the FIMI workshop will illustrate its performance more precisely.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusions</head><p>We have presented our version of FP-GROWTH, which sports multiple improvements described in Section 3. An optimization over the original algorithm eliminates a large intermediate structure required in the recursive step of the published FP-GROWTH algorithm, in addition to two other minor improvements.</p><p>In Section 5, we have reported the results of our performance experiments on synthetic and real-world databases. The performance of the optimized algorithm has been compared with a publicly available FP-GROWTH implementation. We have observed more than twofold improvements in memory utilization over the vanilla algorithm. In the best case, memory use was 2.4 times smaller, while in the worst case no memory saving was possible on a small real-world database. Typically, our implementation makes better use of memory, enabling it to mine larger and more complicated databases that cannot be processed by the original algorithm. The running time behaviors of the two algorithms are quite similar on the average; FP-GROWTH-TINY runs up to 28.5% faster, although it may run slower on a minority of instances.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 .</head><label>1</label><figDesc>Figure 1. An FP-Tree Structure.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>6 :</head><label>6</label><figDesc>INSERT-TRIE([p|P ])7: end for</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>8 :</head><label>8</label><figDesc>construct β's conditional pattern base and then β's conditional FP-Tree T ree β 9: if T ree β = ∅ then 10: FP-GROWTH(T ree β , β)</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head></head><label></label><figDesc>Figure 2. Memory consumption of FP-GROWTH-TINY and FP-GROWTH on synthetic databases</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 .Figure 5 .</head><label>45</label><figDesc>Figure 4. Memory consumption of FP-GROWTH-TINY and FP-GROWTH on realworld databases</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>Create new node N with count = 1, parent linked to T and node-link linked to nodes with the same item via next 5: end if 6: if P = ∅ then</figDesc><table><row><cell></cell><cell>1</cell></row><row><cell cols="2">3: else</cell></row><row><cell>4:</cell><cell></cell></row><row><cell>7:</cell><cell>INSERT-TRIE(P, N )</cell></row><row><cell cols="2">8: end if</cell></row><row><cell cols="2">Algorithm 3 FP-GROWTH(T ree, α)</cell></row><row><cell cols="2">1: if T ree contains a single path P then</cell></row><row><cell>2:</cell><cell>for all combination β of the nodes in path P do</cell></row><row><cell>3:</cell><cell>generate pattern β ∪ α with support minimum</cell></row><row><cell></cell><cell>support of nodes in β</cell></row><row><cell>4:</cell><cell>end for</cell></row><row><cell cols="2">5: else</cell></row></table><note>6:for all a i in header table of T ree do7: generate pattern β ← a i ∪ α with support = support[a i ]</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head></head><label></label><figDesc>for all a i in header table of T ree do</figDesc><table><row><cell>5:</cell><cell>end if</cell></row><row><cell cols="2">6: else</cell></row><row><cell>7:</cell><cell></cell></row><row><cell>8:</cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head></head><label></label><figDesc>Construction of the conditional FP-tree, with the helper procedures COUNT-PREFIX-PATH (Algorithm 6) and GET-PATTERN (Algorithm 7)</figDesc><table /><note>
1: table ← itemtable[Tree]
2: list ← table[symbol]
3: Tree′ ← MAKE-FP-TREE
4: {Count symbols without generating an intermediate structure}
5: node ← list
6: while node ≠ null do
7:   COUNT-PREFIX-PATH(node, count[Tree′])
8:   node ← next[node]
9: end while
10: for all sym ∈ range[count] do
11:   if count[sym] ≥ minsup then
12:     F[Tree′] ← F[Tree′] ∪ sym
13:   end if
14: end for
15: {Insert conditional patterns to Tree β}
16: node ← list
17: while node ≠ null do
18:   pattern ← GET-PATTERN(node)
19:   INSERT-PATTERN(Tree′, pattern)
20:   node ← next[node]
21: end while
22: return Tree′

Algorithm 6 COUNT-PREFIX-PATH(node, count)
1: prefixcount ← count[node]
2: node ← parent[node]
3: while parent[node] ≠ null do
4:   count[symbol[node]] ← count[symbol[node]] + prefixcount
5:   node ← parent[node]
6: end while

Algorithm 7 GET-PATTERN(node)
1: pattern ← MAKE-PATTERN
2: if parent[node] ≠ null then
3:   count[pattern] ← count[node]
4:   currnode ← parent[node]
5:   while parent[currnode] ≠ null do
6:     symbols[pattern] ← symbols[pattern] ∪ symbol[currnode]
7:     currnode ← parent[currnode]
8:   end while
9: else
10:   count[pattern] ← 0
11: end if
12: return pattern
</note></figure>
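Algorithms 6 and 7 (COUNT-PREFIX-PATH and GET-PATTERN) can be sketched in Python. This is a minimal illustration under our own assumptions: the Node class and its field names are ours, not the paper's implementation, and None stands in for the pseudocode's null parent of the root.

```python
# Minimal sketch of COUNT-PREFIX-PATH (Algorithm 6) and GET-PATTERN (Algorithm 7),
# assuming an FP-tree node with `symbol`, `count`, and `parent` fields.
# All names here are ours, not the paper's implementation.

class Node:
    def __init__(self, symbol, count, parent=None):
        self.symbol = symbol
        self.count = count
        self.parent = parent  # None marks the FP-tree root

def count_prefix_path(node, counts):
    """Add this node's count to every symbol on its prefix path (Algorithm 6)."""
    prefix_count = node.count
    node = node.parent
    while node is not None and node.parent is not None:  # the root carries no symbol
        counts[node.symbol] = counts.get(node.symbol, 0) + prefix_count
        node = node.parent

def get_pattern(node):
    """Collect the prefix path of `node` as one conditional pattern (Algorithm 7)."""
    if node.parent is None:        # the root itself yields an empty pattern
        return set(), 0
    count = node.count
    symbols = set()
    curr = node.parent
    while curr.parent is not None:  # stop before the root
        symbols.add(curr.symbol)
        curr = curr.parent
    return symbols, count
```

For a path root → a(3) → b(2) → c(2), counting from c credits 2 to both a and b, and get_pattern on c yields the conditional pattern ({'a', 'b'}, 2); no intermediate transaction list is materialized, which is the point of counting directly on the tree.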
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 2. Dataset parameters</head><label>2</label><figDesc>Notation for dataset parameters</figDesc><table><row><cell>|T |</cell><cell>Number of transactions in transaction set</cell></row><row><cell>|t| avg</cell><cell>Average size of a transaction t i</cell></row><row><cell>|f m | avg</cell><cell>Average length of maximal pattern f m</cell></row><row><cell>I</cell><cell>Number of items in transaction set</cell></row><row><cell>|F max |</cell><cell>Number of maximal frequent patterns</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 3. Synthetic data sets</head><label>3</label><figDesc></figDesc><table><row><cell>Name</cell><cell>|T |</cell><cell>|t| avg</cell><cell>|f m | avg</cell></row><row><cell>T10.I6.1600K</cell><cell>1.6 × 10 6</cell><cell>10</cell><cell>6</cell></row><row><cell>T10.I4.1024K</cell><cell>1.024 × 10 6</cell><cell>10</cell><cell>4</cell></row><row><cell>T15.I4.367K</cell><cell>3.67 × 10 5</cell><cell>15</cell><cell>4</cell></row><row><cell>T20.I6.450K</cell><cell>4.5 × 10 5</cell><cell>20</cell><cell>6</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 4. Real-world data sets</head><label>4</label><figDesc>Measurements were taken inside a typical KDE desktop session. The running time is measured as the wall-clock time of the system call. The memory usage is measured using the GNU glibc tool memusage, considering only the maximum heap size, since stack use is much smaller than heap size.</figDesc><table><row><cell>Name</cell><cell>|T |</cell><cell>|I|</cell><cell>|t| avg</cell></row><row><cell>accidents</cell><cell>3.41 × 10 5</cell><cell>469</cell><cell>33.81</cell></row><row><cell>retail</cell><cell>8.82 × 10 4</cell><cell>16470</cell><cell>10.31</cell></row><row><cell>bms-pos</cell><cell>5.16 × 10 5</cell><cell>1657</cell><cell>6.53</cell></row><row><cell>bms-webview1</cell><cell>5.96 × 10 4</cell><cell>60978</cell><cell>2.51</cell></row><row><cell>bms-webview2</cell><cell>7.75 × 10 4</cell><cell>330286</cell><cell>4.62</cell></row></table></figure>
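The timing methodology above (running time as the wall-clock time of the system call that spawns the miner) can be sketched as follows. The command line in the example is a placeholder, not the paper's actual invocation.

```python
# Sketch of the timing methodology: running time measured as the wall-clock
# time of spawning the miner as a child process. The command below is a
# placeholder; the paper's actual binary and arguments are not shown here.
import subprocess
import sys
import time

def timed_run(cmd):
    """Run `cmd` as a subprocess and return its wall-clock running time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=False)
    return time.perf_counter() - start

# Stand-in for something like timed_run(["./fp-growth", "dataset.dat", "minsup"])
elapsed = timed_run([sys.executable, "-c", "pass"])
```

Peak heap usage would be measured separately, e.g. by wrapping the same command with the GNU glibc memusage tool mentioned in the caption.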
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">All strings must be inserted in the same order; the order of items with the same frequency must be the same.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1">For patterns also in our implementation.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_2">Goethals has made his implementation publicly available at http://www.cs.helsinki.fi/u/goethals/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_3">Note that FP-GROWTH uses a compressed representation of frequency information, whose size may be thought of as related to the complexity of the dataset.</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Parallel mining of association rules</title>
		<author>
			<persName><forename type="first">R</forename><surname>Agrawal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">C</forename><surname>Shafer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Trans. On Knowledge And Data Engineering</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="page" from="962" to="969" />
			<date type="published" when="1996">1996</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Fast algorithms for mining association rules</title>
		<author>
			<persName><forename type="first">R</forename><surname>Agrawal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Srikant</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. 20th Int. Conf. Very Large Data Bases, VLDB</title>
				<editor>
			<persName><forename type="first">J</forename><forename type="middle">B</forename><surname>Bocca</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Jarke</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Zaniolo</surname></persName>
		</editor>
		<meeting>20th Int. Conf. Very Large Data Bases, VLDB</meeting>
		<imprint>
			<publisher>Morgan Kaufmann</publisher>
			<biblScope unit="page" from="487" to="499" />
			<date type="published" when="1994">1994</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">FIMI &apos;03: Workshop on frequent itemset mining implementations</title>
		<author>
			<persName><forename type="first">B</forename><surname>Goethals</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Zaki</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the ICDM 2003 Workshop on Frequent Itemset Mining Implementations</title>
				<meeting>the ICDM 2003 Workshop on Frequent Itemset Mining Implementations<address><addrLine>Melbourne, Florida, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Real world performance of association rule algorithms</title>
		<author>
			<persName><forename type="first">Z</forename><forename type="middle">Z</forename><surname>Blue</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">KDD</title>
				<imprint>
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Using association rules for product assortment decisions: A case study</title>
		<author>
			<persName><forename type="first">T</forename><surname>Brijs</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Swinnen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Vanhoof</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Wets</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Knowledge Discovery and Data Mining</title>
				<imprint>
			<date type="published" when="1999">1999</date>
			<biblScope unit="page" from="254" to="260" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Dynamic itemset counting and implication rules for market basket data</title>
		<author>
			<persName><forename type="first">S</forename><surname>Brin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Motwani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Ullman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Tsur</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data</title>
				<editor>
			<persName><forename type="first">J</forename><surname>Peckham</surname></persName>
		</editor>
		<meeting><address><addrLine>Tucson, Arizona, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM Press</publisher>
			<date type="published" when="1997">May 13-15, 1997</date>
			<biblScope unit="page">5</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Profiling high frequency accident locations using association rules</title>
		<author>
			<persName><forename type="first">K</forename><surname>Geurts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Wets</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Brijs</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Vanhoof</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 82nd Annual Meeting of the Transportation Research Board</title>
				<meeting>the 82nd Annual Meeting of the Transportation Research Board<address><addrLine>Washington DC, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2003-01">January 2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Memory issues in frequent itemset mining</title>
		<author>
			<persName><forename type="first">B</forename><surname>Goethals</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2004 ACM Symposium on Applied Computing (SAC&apos;04)</title>
				<meeting>the 2004 ACM Symposium on Applied Computing (SAC&apos;04)<address><addrLine>Nicosia, Cyprus</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2004-03">March 2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Efficiently using prefix-trees in mining frequent itemsets</title>
		<author>
			<persName><forename type="first">G</forename><surname>Grahne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the First IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI&apos;03)</title>
				<meeting>the First IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI&apos;03)</meeting>
		<imprint>
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Mining frequent patterns without candidate generation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ACM SIGMOD Intl. Conference on Management of Data</title>
				<editor>
			<persName><forename type="first">W</forename><surname>Chen</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Naughton</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><forename type="middle">A</forename><surname>Bernstein</surname></persName>
		</editor>
		<imprint>
			<publisher>ACM Press</publisher>
			<date type="published" when="2000-05">May 2000</date>
			<biblScope unit="page" from="1" to="12" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Algorithms for association rule mining -a general survey and comparison</title>
		<author>
			<persName><forename type="first">J</forename><surname>Hipp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Güntzer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Nakhaeizadeh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">SIGKDD Explorations</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="58" to="64" />
			<date type="published" when="2000-07">July 2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">Fast sequential and parallel algorithms for association rule mining: A comparison</title>
		<author>
			<persName><forename type="first">A</forename><surname>Mueller</surname></persName>
		</author>
		<idno>CS-TR-3515</idno>
		<imprint>
			<date type="published" when="1995">1995</date>
			<pubPlace>College Park, MD</pubPlace>
		</imprint>
		<respStmt>
			<orgName>University of Maryland</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Technical Report</note>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Mining frequent itemsets using patricia tries</title>
		<author>
			<persName><forename type="first">A</forename><surname>Pietracaprina</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zandolin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the First IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI&apos;03)</title>
				<meeting>the First IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI&apos;03)</meeting>
		<imprint>
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">An efficient algorithm for mining association rules in large databases</title>
		<author>
			<persName><forename type="first">A</forename><surname>Savasere</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Omiecinski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">B</forename><surname>Navathe</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. 21st Int. Conf. Very Large Data Bases, VLDB</title>
		<imprint>
			<biblScope unit="page" from="432" to="444" />
			<date type="published" when="1995">1995</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Summary structures for frequency queries on large transaction sets</title>
		<author>
			<persName><forename type="first">D.-Y</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Johar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Grama</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Szpankowski</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Data Compression Conference</title>
				<imprint>
			<date type="published" when="2000">2000</date>
			<biblScope unit="page" from="420" to="429" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">A localized algorithm for parallel association mining</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Zaki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Parthasarathy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Li</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ACM Symposium on Parallel Algorithms and Architectures</title>
				<imprint>
			<date type="published" when="1997">1997</date>
			<biblScope unit="page" from="321" to="330" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Parallel algorithms for discovery of association rules</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Zaki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Parthasarathy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ogihara</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Li</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Data Mining and Knowledge Discovery</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="343" to="373" />
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
