<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Probabilistic Iterative Expansion of Candidates in Mining Frequent Itemsets</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Attila</forename><surname>Gyenesei</surname></persName>
							<email>gyenesei@it.utu.fi</email>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Turku Centre for Computer Science</orgName>
								<orgName type="department" key="dep2">Dept. of Inf. Technology</orgName>
								<orgName type="institution">Univ. of Turku</orgName>
								<address>
									<country key="FI">Finland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jukka</forename><surname>Teuhola</surname></persName>
							<email>teuhola@it.utu.fi</email>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Turku Centre for Computer Science</orgName>
								<orgName type="department" key="dep2">Dept. of Inf. Technology</orgName>
								<orgName type="institution">Univ. of Turku</orgName>
								<address>
									<country key="FI">Finland</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Probabilistic Iterative Expansion of Candidates in Mining Frequent Itemsets</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">FEAC35F91B0EAC416C65F168C3275430</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T05:54+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>A simple new algorithm is suggested for frequent itemset mining, using item probabilities as the basis for generating candidates. The method first finds all the frequent items, and then generates an estimate of the frequent sets, assuming item independence. The candidates are stored in a trie where each path from the root to a node represents one candidate itemset. The method expands the trie iteratively, until all frequent itemsets are found. Expansion is based on scanning through the data set in each iteration cycle, and extending the subtries based on observed node frequencies. Trie probing can be restricted to only those nodes which possibly need extension. The number of candidates is usually quite moderate; for dense datasets 2-4 times the number of final frequent itemsets, for non-dense sets somewhat more. In practical experiments the method has been observed to make clearly fewer passes than the well-known Apriori method. As for speed, our non-optimised implementation is in some cases faster, in some others slower than the comparison methods.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>We study the well-known problem of finding frequent itemsets from a transaction database, see <ref type="bibr" target="#b1">[2]</ref>. A transaction in this case means a set of so-called items. For example, a supermarket basket is represented as a transaction, where the purchased products represent the items. The database may contain millions of such transactions. The frequent itemset mining is a task, where we should find those subsets of items that occur at least in a given minimum number of transactions. This is an important basic task, applicable in solving more advanced data mining problems, for example discovering association rules <ref type="bibr" target="#b1">[2]</ref>. What makes the task difficult is that the number of potential frequent itemsets is exponential in the number of distinct items.</p><p>In this paper, we follow the notations of Goethals <ref type="bibr" target="#b6">[7]</ref>. The overall set of items is denoted by I. Any subset X ⊆ I is called an itemset. If X has k items, it is called a k-itemset. A transaction is an itemset identified by a tid. A transaction with itemset Y is said to support itemset X, if X ⊆ Y. The cover of an itemset X in a database D is the set of transactions in D that support X. The support of itemset X is the size of its cover in D. The relative frequency (probability) of itemset X with respect to D is</p><formula xml:id="formula_0">D D X Support D X P ) , ( ) , ( =<label>(1)</label></formula><p>An itemset X is frequent if its support is greater than or equal to a given threshold σ. We can also express the condition using a relative threshold for the frequency: P(X, D) ≥ σ rel , where 0 ≤ σ rel ≤ 1. There are variants of the basic 'all-frequent-itemsets' problem, namely the maximal and closed itemset mining problems, see [1, 4, 5,  8, 12]. However, here we restrict ourselves to the basic task.</p><p>A large number of algorithms have been suggested for frequent itemset mining during the last decade; for surveys, see [7, 10, 15]. Most of the algorithms share the same general approach: generate a set of candidate itemsets, count their frequencies in D, and use the obtained information in generating more candidates, until the complete set is found. The methods differ mainly in the order and extent of candidate generation. The most famous is probably the Apriori algorithm, developed independently by Agrawal et al. <ref type="bibr" target="#b2">[3]</ref> and Mannila et al. <ref type="bibr" target="#b10">[11]</ref>. It is a representative of breadth-first candidate generation: it first finds all frequent 1-itemsets, then all frequent 2-itemsets, etc. The core of the method is clever pruning of candidate k-itemsets, for which there exists a non-frequent k-1-subset. This is an application of the obvious monotonicity property: All subsets of a frequent itemset must also be frequent. Apriori is essentially based on this property.</p><p>The other main candidate generation approach is depthfirst order, of which the best-known representatives are Eclat <ref type="bibr" target="#b13">[14]</ref> and FP-growth <ref type="bibr" target="#b8">[9]</ref> (though the 'candidate' concept in the context of FP-growth is disputable). These two are generally considered to be among the fastest algorithms for frequent itemset mining. However, we shall mainly use Apriori as a reference method, because it is technically closer to ours.</p><p>Most of the suggested methods are analytical in the sense that they are based on logical inductions to restrict the number of candidates to be checked. Our approach (called PIE) is probabilistic, based on relative item frequencies, using which we compute estimates for itemset frequencies in candidate generation. More precisely, we generate iteratively improving approximations (candidate itemsets) to the solution. Our general endeavour has been to develop a relatively simple method, with fast basic steps and few iteration cycles, at the cost of somewhat increased number of candidates. However, another goal is that the method should be robust, i.e. it should work reasonably fast for all kinds of datasets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Method description</head><p>Our method can be characterized as a generate-and-test algorithm, such as Apriori. However, our candidate generation is based on probabilistic estimates of the supports of itemsets. The testing phase is rather similar to Apriori, but involves special book-keeping to lay a basis for the next generation phase.</p><p>We start with a general description of the main steps of the algorithm. The first thing to do is to determine the frequencies of all items in the dataset, and select the frequent ones for subsequent processing. If there are m frequent items, we internally identify them by numbers 0, …, m-1. For each item i, we use its probability (relative frequency) P(i) in the generation of candidates for frequent itemsets.</p><p>The candidates are represented as a trie structure, which is normal in this context, see <ref type="bibr" target="#b6">[7]</ref>. Each node is labelled by one item, and a path of labels from the root to a node represents an itemset. The root itself represents the empty itemset. The paths are sorted, so that a subtrie rooted by item i can contain only items &gt; i. Note also that several nodes in the trie can have the same item label, but not on a single path. A complete trie, storing all subsets of the whole itemset, would have 2 m nodes and be structurally a binomial tree <ref type="bibr" target="#b12">[13]</ref>, where on level j there are ) ( m j nodes, see Fig. <ref type="figure" target="#fig_1">1</ref> for m = 4. The trie is used for book-keeping purposes. However, it is important to avoid building the complete trie, but only some upper part of it, so that the nodes (i.e. their root paths) represent reasonable candidates for frequent sets. In our algorithm, the first approximation for candidate itemsets is obtained by computing estimates for their probabilities, assuming independence of item occurrences. It means that, for example, for an itemset {x, y, z} the estimated probability is the product P(x)P(y)P(z). Nodes are created in the trie from root down along all paths as long as the path-related probability is not less that the threshold σ rel . Note that the probability values are monotonically non-increasing on the way down.   shows an example of the initial trie for a given set of transactions (with m = 4). Those nodes of the complete trie (Fig. <ref type="figure" target="#fig_1">1</ref>) that do not exist in the actual trie are called virtual nodes, and marked with dashed circles in Fig. <ref type="figure" target="#fig_2">2</ref>.</p><p>The next step is to read the transactions and count the true number of occurrences for each node (i.e. the related path support) in the trie. Simultaneously, for each visited node, we maintain a counter called pending support (PS), being the number of transactions for which at least one  virtual child of the node would match. The pending support will be our criterion for the expansion of the node: If PS(x) ≥ σ, then it is possible that a virtual child of node x is frequent, and the node must be expanded. If there are no such nodes, the algorithm is ready, and the result can be read from the trie: All nodes with support ≥ σ represent frequent itemsets. Trie expansion starts the next cycle, and we iterate until the stopping condition holds. However, we must be very careful in the expansion: which virtual nodes should we materialize (and how deep, recursively), in order to avoid trie 'explosion', but yet approach the final solution? Here we apply item probabilities, again. In principle, we could take advantage of all information available in the current trie (frequencies of subsets, etc.), as is done in the Apriori algorithm and many others. However, we prefer simpler calculation, based on global probabilities of items.</p><p>Suppose that we have a node x with pending support PS(x) ≥ σ. Assume that it has virtual child items v 0 , v 1 , …, v s-1 with global probabilities P(v 0 ), P(v 1 ), …, P(v s-1 ). Every transaction contributing to PS(x) has a match with at least one of v 0 , v 1 , …, v s-1 . The local probability (LP) for a match with v i is computed as follows:</p><formula xml:id="formula_1">) ( i v LP ) 1 0 ( matches , , v One of v matches | i v P … = matches) , , v P(One of v matches , v One of v matches i v P … … ∧ = 1 0 )) 1 0 ( ) (( ) 1 0 ( ) ( matches , , v One of v P matches i v P … = )) ( 1 ( )) 1 ( 1 ))( 0 ( 1 ( 1 ) ( s v P v P v P i v P − − − − = K<label>(2)</label></formula><p>Using this formula, we get an estimated support ES(v i ):</p><p>)</p><formula xml:id="formula_2">) ( ( ) ( ) ( i V Parent PS i v LP i v ES =<label>(3)</label></formula><p>If ES(v i ) ≥ σ, then we conclude that v i is expected to be frequent. However, in order to guarantee a finite number of iterations in the worst case, we have to relax this condition a bit. Since the true distribution may be very skewed, almost the whole pending support may belong to only one virtual child. To ensure convergence, we apply the following condition for child expansion in the k th iteration,</p><formula xml:id="formula_3">σ α k i v ES ≥ ) (<label>(4)</label></formula><p>with some constant α between 0 and 1. In the worst case this will eventually (when k is high enough) result in expansion, to get rid of a PS-value ≥ σ. In our tests, we used the heuristic value α = average probability of frequent items. The reasoning behind this choice is that it speeds up the local expansion growth by one level, on the average (k levels for α k ). This acceleration restricts the number of iterations efficiently. The largest extensions are applied only to the 'skewest' subtries, so that the total size of the trie remains tolerable. Another approach to choose α would be to do a statistical analysis to determine confidence bounds for ES. However, this is left for future work. Fig. <ref type="figure" target="#fig_4">3</ref> shows an example of trie expansion, assuming that the minimum support threshold σ = 80, α = 0.8, and k = 1. The item probabilities are assumed to be P(y) = 0.7, P(z) = 0.5, and P(v) = 0.8. Node t has a pending support of 100, related to its two virtual children, y and z. This means that 100 transactions contained the path from root to t, plus either or both of items y and z, so we have to test for expansion. Our formula gives y a local probability LP(y) = 0.7 / (1−(1−0.7)(1−0.5)) ≈ 0.82, so the estimated support is 82 &gt; α⋅σ = 64, and we expand y. However, the local probability of z is only ≈ 0.59, so its estimated support is 59, and it will not be expanded. When a virtual node (y) has been materialized, we immediately test also its expansion, based on its ES-value, recursively. However, in the recursive steps we cannot apply formula (2), because we have no evidence of the children of y. Instead, we apply the unconditional probabilities of z and v in estimation: LP(z) = 82⋅0.5 = 41 &lt; α⋅σ = 64, and LP(v) = 82⋅0.8 = 65.6 &gt; 64. Node v is materialized, but z is not. Expansion test continues down from v. Thus, both in initialization of the trie and in its expansion phases, we can create several new levels (i.e. longer candidates) at a time, contrary to e.g. the base version of Apriori. It is true that also Apriori can be modified to create several candidate levels at a time, but at the cost of increased number of candidates.</p><p>After the expansion phase the iteration continues with the counting phase, and new values for node supports and pending supports are determined. The two phases alternate</p><formula xml:id="formula_4">t … PS=100 ES=82 ES=59 ES=41 ES=65.6 x y z z v</formula><p>until all pending supports are less than σ. We have given our method the name 'PIE', reflecting this Probabilistic Iterative Expansion property.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Elaboration</head><p>The above described basic version does a lot of extra work. One observation is that as soon as the pending support of some node x is smaller than σ, we can often 'freeze' the whole subtrie, because it will not give us anything new; we call it 'ready'. The readiness of nodes can be checked easily with a recursive process: A node x is ready if PS(x) &lt; σ and all its real children are ready. The readiness can be utilized to reduce work both in counting and expansion phases. In counting, we process one transaction at a time and scan its item subsets down the trie, but only until the first ready node on each path. Also the expansion procedure is skipped for ready nodes. Finally, a simple stopping condition is when the root becomes ready.</p><p>Another tailoring, not yet implemented, relates to the observation that most of the frequent itemsets are found in the first few iterations, and a lot of I/O effort is spent to find the last few frequent sets. For those, not all transactions are needed in solving the frequency. In the counting phase, we can distinguish between relevant and irrelevant transactions. A transaction is irrelevant, if it does not increase the pending support value of any nonready node. If the number of relevant transactions is small enough, we can store them separately (in main memory or temporary file) during the next scanning phase.</p><p>Our implementation of the trie is quite simple; saving memory is considered, but not as the first preference. The child linkage is implemented as an array of pointers, and the frequent items are renumbered to 0, …, m-1 (if there are m frequent items) to be able to use them as indices to the array. A minor improvement is that for item i, we need only m-i-1 pointers, corresponding to the possible children i+1, …, m-1.</p><p>The main restriction of the current implementation is the assumption that the trie fits in the main memory. Compression of nodes would help to some extent: Now we reserve a pointer for every possible child node, but most of them are null. Storing only non-null pointers saves memory, but makes the trie scanning slower. Also, we could release the ready nodes as soon as they are detected, in order to make room for expansions. Of course, before releasing, the related frequent itemsets should be reported. However, a fully general solution should work for any main memory and trie size. Some kind of external representation should be developed, but this is left for future work.</p><p>A high-level pseudocode of the current implementation is given in the following. The recursive parts are not coded explicitly, but should be rather obvious.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Algorithm PIE − Probabilistic iterative expansion of candidates in frequent itemset mining</head><p>Input: A transaction database D, the minimum support threshold σ. Output: The complete set of frequent itemsets. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experimental results</head><p>For verifying the usability of our PIE algorithm, we used four of the test datasets made available to the Workshop on Frequent Itemset Mining Implementations (FIMI'03) <ref type="bibr" target="#b5">[6]</ref>. The test datasets and some of their properties are described in Table <ref type="table" target="#tab_2">1</ref>. They represent rather different kinds of domains, and we wanted to include both dense and non-dense datasets, as well as various numbers of items. For the PIE method, the interesting statistics to be collected are the number of candidates, depth of the trie, and the number of iterations. These results are given in Table <ref type="table" target="#tab_4">2</ref> for selected values of σ, for the 'Chess' dataset. We chose values of σ that keep the number of frequent itemsets reasonable (extremely high numbers are probably useless for any application). The table shows also the number of frequent items and frequent sets, to enable comparison with the number of candidates. For this dense dataset, the number of candidates varies between 2-4 times the number of frequent itemsets. For non-dense datasets the ratio is usually larger. Table <ref type="table" target="#tab_4">2</ref> shows also the values of the 'security parameter' α, being the average probability of frequent items. Considering I/O performance, we can see that the number of iteration cycles (= number of file scans) is quite small, compared to the base version of the Apriori method, for which the largest frequent itemset dictates the number of iterations. This is roughly the same as the trie depth, as shown in Table <ref type="table" target="#tab_4">2</ref>.</p><p>The PIE method can also be characterized by describing the development of the trie during the iterations. The most interesting figures are the number of nodes and the number of ready nodes, given in Table <ref type="table" target="#tab_3">3</ref>. Especially the number of ready nodes implies that even though we have rather many candidates (= nodes in the trie), large parts of them are not touched in the later iterations. For speed comparison, we chose the Apriori and FPgrowth implementations, provided by Bart Goethals <ref type="bibr" target="#b5">[6]</ref>. The results for the four test datasets and for different minimum support thresholds are shown in Table <ref type="table" target="#tab_5">4</ref>. The processor used in the experiments was a 1.5 GHz Pentium 4, with 512 MB main memory. We used a g++ compiler, using optimizing switch -O6. The PIE algorithm was coded in C. We can see that in some situations the PIE algorithm is the fastest, in some others the slowest. This is probably a general observation: the performance of most frequent itemset mining algorithms is highly dependent on the data set and threshold. It seems that PIE is at its best for sparse datasets (such as T40I10D100K and Kosarak), but not so good for very dense datasets (such as 'Chess' and 'Mushroom'). Its speed for large thresholds probably results from the simplicity of the algorithm. For smaller thresholds, the trie gets large and the counting starts to consume more time, especially with a small main memory size.</p><p>One might guess that our method is at its best for random data sets, because those would correspond to our assumption about independent item occurrences. We tested this with a dataset of 100 000 transactions, each of which contained 20 random items out of 30 possible. The results were rather interesting: For all tested thresholds for minimum support, we found all the frequent itemsets in the first iteration. However, verification of the completeness required one or two additional iterations, with a clearly higher number of candidates, consuming a majority of the total time. Table <ref type="table">5</ref> shows the time and number of candidates both after the first and after the final iteration. The stepwise growth of the values reveals the levelwise growth of the trie. Apriori worked well also for this dataset, being in most cases faster than PIE. Results for FP-growth (not shown) are naturally much slower, because randomness prevents a compact representation of the transactions.</p><p>We wish to point out that our implementation was an initial version, with no special tricks for speed-up. We are convinced that the code details can be improved to make the method still more competitive. For example, buffering of transactions (or temporary files) were not used to enhance the I/O performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusions and future work</head><p>A probability-based approach was suggested for frequent itemset mining, as an alternative to the 'analytic' methods common today. It has been observed to be rather robust, working reasonably well for various kinds of datasets. The number of candidate itemsets does not 'explode', so that the data structure (trie) can be kept in the main memory in most practical cases.</p><p>The number of iterations is smallest for random datasets, because candidate generation is based on just that assumption. For skewed datasets, the number of iterations may somewhat grow. This is partly due to our simplifying premise that the items are independent. This point could be tackled by making use of the conditional probabilities obtainable from the trie. Initial tests did not show any significant advantage over the basic approach, but a more</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>Fig.2</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1 .</head><label>1</label><figDesc>Figure 1. The complete trie for 4 items.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 .</head><label>2</label><figDesc>Figure 2. An initial trie for the transaction set {(0, 3), (1, 2), (0, 1, 3), (1)}, with minimum support threshold σ = 1/6. The virtual nodes with probabilities &lt; 1/6 are shown using dashed lines.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 3 .</head><label>3</label><figDesc>Figure 3. An example of expansion for probabilities P(y) = 0.7, P(z) = 0.5, and P(v) = 0.8.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 1 . Test dataset description</head><label>1</label><figDesc></figDesc><table><row><cell>Dataset</cell><cell cols="2">#Transactions #Items</cell></row><row><cell>Chess</cell><cell>3 196</cell><cell>75</cell></row><row><cell>Mushroom</cell><cell>8 124</cell><cell>119</cell></row><row><cell>T40I10D100K</cell><cell>100 000</cell><cell>942</cell></row><row><cell>Kosarak</cell><cell>900 002</cell><cell>41 270</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 3 . Development of the trie for dataset 'Chess', with three different values of</head><label>3</label><figDesc></figDesc><table><row><cell>σ.</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 2 . Statistics from the PIE algorithm for dataset 'Chess'.</head><label>2</label><figDesc></figDesc><table><row><cell>σ</cell><cell>#Frequent items</cell><cell>#Frequent sets</cell><cell>Alpha</cell><cell>#Candidates</cell><cell>Trie depth</cell><cell>#Iterations</cell><cell>#Apriori's iterations</cell></row><row><cell>3 000</cell><cell>12</cell><cell>155</cell><cell>0.970</cell><cell>400</cell><cell>6</cell><cell>3</cell><cell>6</cell></row><row><cell>2 900</cell><cell>13</cell><cell>473</cell><cell>0.967</cell><cell>1 042</cell><cell>8</cell><cell>4</cell><cell>7</cell></row><row><cell>2 800</cell><cell>16</cell><cell>1 350</cell><cell>0.953</cell><cell>2 495</cell><cell>8</cell><cell>4</cell><cell>8</cell></row><row><cell>2 700</cell><cell>17</cell><cell>3 134</cell><cell>0.947</cell><cell>5 218</cell><cell>9</cell><cell>4</cell><cell>8</cell></row><row><cell>2 600</cell><cell>19</cell><cell>6 135</cell><cell>0.934</cell><cell>10 516</cell><cell>10</cell><cell>4</cell><cell>9</cell></row><row><cell>2 500</cell><cell>22</cell><cell>11 493</cell><cell>0.914</cell><cell>18 709</cell><cell>11</cell><cell>4</cell><cell>10</cell></row><row><cell>2 400</cell><cell>23</cell><cell>20 582</cell><cell>0.907</cell><cell>47 515</cell><cell>12</cell><cell>4</cell><cell>11</cell></row><row><cell>2 300</cell><cell>24</cell><cell>35 266</cell><cell>0.900</cell><cell>131 108</cell><cell>13</cell><cell>4</cell><cell>12</cell></row><row><cell>2 200</cell><cell>27</cell><cell>59 181</cell><cell>0.877</cell><cell>216 943</cell><cell>14</cell><cell>5</cell><cell>13</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 4 . Comparison of execution times (in seconds) of three frequent itemset mining programs for four test datasets.</head><label>4</label><figDesc></figDesc><table><row><cell>(a) Chess</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell>σ</cell><cell>#Freq. sets</cell><cell>Apriori</cell><cell>FP-growth</cell><cell>PIE</cell></row><row><cell>3 000</cell><cell>155</cell><cell>0.312</cell><cell>0.250</cell><cell>0.125</cell></row><row><cell>2 900</cell><cell>473</cell><cell>0.469</cell><cell>0.266</cell><cell>0.265</cell></row><row><cell>2 800</cell><cell>1 350</cell><cell>0.797</cell><cell>0.297</cell><cell>1.813</cell></row><row><cell>2 700</cell><cell>3 134</cell><cell>1.438</cell><cell>0.344</cell><cell>6.938</cell></row><row><cell>2 600</cell><cell>6 135</cell><cell>3.016</cell><cell>0.438</cell><cell>14.876</cell></row><row><cell>2 500</cell><cell>11 493</cell><cell>10.204</cell><cell>0.610</cell><cell>26.360</cell></row><row><cell>2 400</cell><cell>20 582</cell><cell>21.907</cell><cell>0.829</cell><cell>78.325</cell></row><row><cell>2 300</cell><cell>35 266</cell><cell>42.048</cell><cell cols="2">1.156 203.828</cell></row><row><cell>2 200</cell><cell>59 181</cell><cell>73.297</cell><cell cols="2">1.766 315.562</cell></row><row><cell cols="2">(b) Mushroom</cell><cell></cell><cell></cell><cell></cell></row><row><cell>σ</cell><cell>#Freq. sets</cell><cell>Apriori</cell><cell>FP-growth</cell><cell>PIE</cell></row><row><cell>5 000</cell><cell>41</cell><cell>0.375</cell><cell>0.391</cell><cell>0.062</cell></row><row><cell>4 500</cell><cell>97</cell><cell>0.437</cell><cell>0.406</cell><cell>0.094</cell></row><row><cell>4 000</cell><cell>167</cell><cell>0.578</cell><cell>0.438</cell><cell>0.141</cell></row><row><cell>3 500</cell><cell>369</cell><cell>0.797</cell><cell>0.500</cell><cell>0.297</cell></row><row><cell>3 000</cell><cell>931</cell><cell>1.062</cell><cell>0.546</cell><cell>1.157</cell></row><row><cell>2 500</cell><cell>2 365</cell><cell>1.781</cell><cell>0.610</cell><cell>6.046</cell></row><row><cell>2 000</cell><cell>6 613</cell><cell>3.719</cell><cell>0.750</cell><cell>27.047</cell></row><row><cell>1 500</cell><cell>56 693</cell><cell>55.110</cell><cell cols="2">1.124 153.187</cell></row><row><cell cols="2">(c) T40I10D100K</cell><cell></cell><cell></cell><cell></cell></row><row><cell>σ</cell><cell>#Freq. sets</cell><cell>Apriori</cell><cell>FP-growth</cell><cell>PIE</cell></row><row><cell>20 000</cell><cell>5</cell><cell>2.797</cell><cell>6.328</cell><cell>0.797</cell></row><row><cell>18 000</cell><cell>9</cell><cell>2.828</cell><cell>6.578</cell><cell>1.110</cell></row><row><cell>16 000</cell><cell>17</cell><cell>3.001</cell><cell>7.250</cell><cell>1.156</cell></row><row><cell>14 000</cell><cell>24</cell><cell>3.141</cell><cell>8.484</cell><cell>1.187</cell></row><row><cell>12 000</cell><cell>48</cell><cell>3.578</cell><cell>14.750</cell><cell>1.906</cell></row><row><cell>10 000</cell><cell>82</cell><cell>4.296</cell><cell>23.874</cell><cell>4.344</cell></row><row><cell>8 000</cell><cell>137</cell><cell>7.859</cell><cell>41.203</cell><cell>11.796</cell></row><row><cell>6 000</cell><cell>239</cell><cell>20.531</cell><cell>72.985</cell><cell>29.671</cell></row><row><cell>4 000</cell><cell>440</cell><cell cols="2">35.282 114.953</cell><cell>68.672</cell></row><row><cell>(c) Kosarak</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell>σ</cell><cell>#Freq. sets</cell><cell>Apriori</cell><cell>FP-growth</cell><cell>PIE</cell></row><row><cell>20 000</cell><cell>121</cell><cell>27.970</cell><cell>30.141</cell><cell>5.203</cell></row><row><cell>18 000</cell><cell>141</cell><cell>28.438</cell><cell>31.296</cell><cell>6.110</cell></row><row><cell>16 000</cell><cell>167</cell><cell>29.016</cell><cell>32.765</cell><cell>7.969</cell></row><row><cell>14 000</cell><cell>202</cell><cell>29.061</cell><cell>33.516</cell><cell>9.688</cell></row><row><cell>12 000</cell><cell>267</cell><cell>29.766</cell><cell>34.875</cell><cell>12.032</cell></row><row><cell>10 000</cell><cell>376</cell><cell>34.906</cell><cell>37.657</cell><cell>18.016</cell></row><row><cell>8 000</cell><cell>575</cell><cell>35.891</cell><cell>41.657</cell><cell>30.453</cell></row><row><cell>6 000</cell><cell>1 110</cell><cell>39.656</cell><cell>51.922</cell><cell>70.376</cell></row></table></figure>
		</body>
		<back>
			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0" />			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Depth First Generation of Long Patterns</title>
		<author>
			<persName><forename type="first">R</forename><surname>Agrawal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Aggarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">V V</forename><surname>Prasad</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the Int. Conf. on Knowledge Discovery and Data Mining</title>
				<editor>
			<persName><forename type="first">R</forename><surname>Ramakrishnan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Stolfo</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Bayardo</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">I</forename><surname>Parsa</surname></persName>
		</editor>
		<meeting>of the Int. Conf. on Knowledge Discovery and Data Mining</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2000-08">Aug. 2000</date>
			<biblScope unit="page" from="108" to="118" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Mining Association Rules Between Sets of Items in Large Databases</title>
		<author>
			<persName><forename type="first">R</forename><surname>Agrawal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Imielinski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Swami</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of ACM SIGMOD Int. Conf. of Management of Data</title>
				<editor>
			<persName><forename type="first">P</forename><surname>Buneman</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Jajodia</surname></persName>
		</editor>
		<meeting>of ACM SIGMOD Int. Conf. of Management of Data</meeting>
		<imprint>
			<date type="published" when="1993-05">May 1993</date>
			<biblScope unit="page" from="207" to="216" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Fast Algorithms for Mining Association Rules in Large Databases</title>
		<author>
			<persName><forename type="first">R</forename><surname>Agrawal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Srikant</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the 20th VLDB Conf</title>
				<editor>
			<persName><forename type="first">J</forename><forename type="middle">B</forename><surname>Bocca</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Jarke</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Zaniolo</surname></persName>
		</editor>
		<meeting>of the 20th VLDB Conf</meeting>
		<imprint>
			<date type="published" when="1994-09">Sept. 1994</date>
			<biblScope unit="page" from="487" to="499" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Efficiently Mining Long Patterns from Databases</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">J</forename><surname>Bayardo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the ACM SIGMOD Int. Conf. on Management of Data</title>
				<editor>
			<persName><forename type="first">L</forename><forename type="middle">M</forename><surname>Haas</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Tiwary</surname></persName>
		</editor>
		<meeting>of the ACM SIGMOD Int. Conf. on Management of Data</meeting>
		<imprint>
			<date type="published" when="1998-06">June 1998</date>
			<biblScope unit="page" from="85" to="93" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">MAFIA: a Maximal Frequent Itemset Algorithm for Transactional Databases</title>
		<author>
			<persName><forename type="first">D</forename><surname>Burdick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Calimlim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gehrke</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of IEEE Int. Conf. on Data Engineering</title>
				<meeting>of IEEE Int. Conf. on Data Engineering</meeting>
		<imprint>
			<date type="published" when="2001-04">April 2001</date>
			<biblScope unit="page" from="443" to="552" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<ptr target="http://fimi.cs.helsinki.fi" />
		<title level="m">Frequent Itemset Mining Implementations (FIMI&apos;03) Workshop website</title>
				<imprint>
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Efficient Frequent Pattern Mining</title>
		<author>
			<persName><forename type="first">B</forename><surname>Goethals</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2002-12">Dec. 2002</date>
			<pubPlace>Belgium</pubPlace>
		</imprint>
		<respStmt>
			<orgName>University of Limburg</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">PhD thesis</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Efficiently Mining Maximal Frequent Itemsets</title>
		<author>
			<persName><forename type="first">K</forename><surname>Gouda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Zaki</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of 2001 IEEE International Conference on Data Mining</title>
				<editor>
			<persName><forename type="first">N</forename><surname>Cercone</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><forename type="middle">Y</forename><surname>Lin</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">X</forename><surname>Wu</surname></persName>
		</editor>
		<meeting>of 2001 IEEE International Conference on Data Mining</meeting>
		<imprint>
			<date type="published" when="2001-11">Nov. 2001</date>
			<biblScope unit="page" from="163" to="170" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Mining Frequent Patterns Without Candidate Generation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of ACM SIGMOD Int. Conf. on Management of Data</title>
				<editor>
			<persName><forename type="first">W</forename><surname>Chen</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Naughton</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><forename type="middle">A</forename><surname>Bernstein</surname></persName>
		</editor>
		<meeting>of ACM SIGMOD Int. Conf. on Management of Data</meeting>
		<imprint>
			<date type="published" when="2000">2000</date>
			<biblScope unit="page" from="1" to="12" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Algorithms for Association Rule Mining -a General Survey and Comparison</title>
		<author>
			<persName><forename type="first">J</forename><surname>Hipp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Güntzer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Nakhaeizadeh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM SIGKDD Explorations</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="58" to="65" />
			<date type="published" when="2000-07">July 2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Efficient Algorithms for Discovering Association Rules</title>
		<author>
			<persName><forename type="first">H</forename><surname>Mannila</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Toivonen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">I</forename><surname>Verkamo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the AAAI Workshop on Knowledge Discovery in Databases</title>
				<editor>
			<persName><forename type="first">U</forename><forename type="middle">M</forename><surname>Fayyad</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Uthurusamy</surname></persName>
		</editor>
		<meeting>of the AAAI Workshop on Knowledge Discovery in Databases</meeting>
		<imprint>
			<date type="published" when="1994-07">July 1994</date>
			<biblScope unit="page" from="181" to="192" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Closet: An Efficient Algorithm for Mining Frequent Closed Itemsets</title>
		<author>
			<persName><forename type="first">J</forename><surname>Pei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Mao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery</title>
				<meeting>of ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery</meeting>
		<imprint>
			<date type="published" when="2000-05">May 2000</date>
			<biblScope unit="page" from="21" to="30" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">A Data Structure for Manipulating Priority Queues</title>
		<author>
			<persName><forename type="first">J</forename><surname>Vuillemin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Comm. of the ACM</title>
		<imprint>
			<biblScope unit="volume">21</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="309" to="314" />
			<date type="published" when="1978">1978</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Scalable Algorithms for Association Mining</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Zaki</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Knowledge and Data Engineering</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="372" to="390" />
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Real World Performance of Association Rule Algorithms</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kohavi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Mason</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</title>
				<editor>
			<persName><forename type="first">F</forename><surname>Provost</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Srikant</surname></persName>
		</editor>
		<meeting>of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</meeting>
		<imprint>
			<date type="published" when="2001">2001</date>
			<biblScope unit="page" from="401" to="406" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
