Probabilistic Iterative Expansion of Candidates in Mining Frequent Itemsets

Attila Gyenesei (gyenesei@it.utu.fi)
Turku Centre for Computer Science, Dept. of Inf. Technology, Univ. of Turku, Finland

Jukka Teuhola (teuhola@it.utu.fi)
Turku Centre for Computer Science, Dept. of Inf. Technology, Univ. of Turku, Finland


A simple new algorithm is suggested for frequent itemset mining, using item probabilities as the basis for generating candidates. The method first finds all the frequent items, and then generates an estimate of the frequent sets, assuming item independence. The candidates are stored in a trie where each path from the root to a node represents one candidate itemset. The method expands the trie iteratively, until all frequent itemsets are found. Expansion is based on scanning through the data set in each iteration cycle, and extending the subtries based on observed node frequencies. Trie probing can be restricted to only those nodes which possibly need extension. The number of candidates is usually quite moderate; for dense datasets 2-4 times the number of final frequent itemsets, for non-dense sets somewhat more. In practical experiments the method has been observed to make clearly fewer passes than the well-known Apriori method. As for speed, our non-optimised implementation is in some cases faster, in some others slower than the comparison methods.

Introduction

We study the well-known problem of finding frequent itemsets in a transaction database, see [2]. A transaction in this case means a set of so-called items. For example, a supermarket basket is represented as a transaction, where the purchased products are the items. The database may contain millions of such transactions. Frequent itemset mining is the task of finding those subsets of items that occur in at least a given minimum number of transactions. This is an important basic task, applicable in solving more advanced data mining problems, for example discovering association rules [2]. What makes the task difficult is that the number of potential frequent itemsets is exponential in the number of distinct items.

In this paper, we follow the notations of Goethals [7]. The overall set of items is denoted by I. Any subset X ⊆ I is called an itemset. If X has k items, it is called a k-itemset. A transaction is an itemset identified by a tid. A transaction with itemset Y is said to support itemset X, if X ⊆ Y. The cover of an itemset X in a database D is the set of transactions in D that support X. The support of itemset X is the size of its cover in D. The relative frequency (probability) of itemset X with respect to D is

$$P(X, D) = \frac{\mathrm{Support}(X, D)}{|D|} \qquad (1)$$

An itemset X is frequent if its support is greater than or equal to a given threshold σ. We can also express the condition using a relative threshold for the frequency: P(X, D) ≥ σ_rel, where 0 ≤ σ_rel ≤ 1. There are variants of the basic 'all-frequent-itemsets' problem, namely the maximal and closed itemset mining problems, see [1, 4, 5, 8, 12]. However, here we restrict ourselves to the basic task.
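The definitions of cover, support and relative frequency can be illustrated with a minimal Python sketch. The toy database `D` and the function names are hypothetical, chosen only for this illustration.

```python
def cover(itemset, database):
    """Return the tids of the transactions in `database` that support `itemset`."""
    wanted = set(itemset)
    return {tid for tid, items in database.items() if wanted <= set(items)}


def support(itemset, database):
    """Support = size of the cover."""
    return len(cover(itemset, database))


def relative_frequency(itemset, database):
    """P(X, D) = Support(X, D) / |D|, as in formula (1)."""
    return support(itemset, database) / len(database)


# Hypothetical toy database: tid -> set of items.
D = {
    1: {"bread", "milk"},
    2: {"bread", "beer"},
    3: {"bread", "milk", "beer"},
    4: {"milk"},
}

if __name__ == "__main__":
    X = {"bread", "milk"}
    print(cover(X, D), support(X, D), relative_frequency(X, D))
```

With an absolute threshold σ = 2, the itemset {bread, milk} would be frequent in this toy database, since two of the four transactions support it.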

A large number of algorithms have been suggested for frequent itemset mining during the last decade; for surveys, see [7, 10, 15]. Most of the algorithms share the same general approach: generate a set of candidate itemsets, count their frequencies in D, and use the obtained information in generating more candidates, until the complete set is found. The methods differ mainly in the order and extent of candidate generation. The most famous is probably the Apriori algorithm, developed independently by Agrawal et al. [3] and Mannila et al. [11]. It is a representative of breadth-first candidate generation: it first finds all frequent 1-itemsets, then all frequent 2-itemsets, etc. The core of the method is clever pruning of those candidate k-itemsets that have a non-frequent (k−1)-subset. This is an application of the obvious monotonicity property: all subsets of a frequent itemset must also be frequent. Apriori is essentially based on this property.
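The monotonicity-based pruning can be sketched in a few lines of Python. This is only an illustration of the pruning rule, not the Apriori implementation used in the experiments: candidates are enumerated by brute force rather than by Apriori's join step, and the function name is hypothetical.

```python
from itertools import combinations


def pruned_candidates(frequent_prev, k):
    """Candidate k-itemsets all of whose (k-1)-subsets are frequent.

    `frequent_prev` is an iterable of the frequent (k-1)-itemsets.
    """
    frequent_prev = {frozenset(s) for s in frequent_prev}
    if not frequent_prev:
        return set()
    items = sorted(set().union(*frequent_prev))
    candidates = set()
    for combo in combinations(items, k):
        # Monotonicity: one non-frequent (k-1)-subset rules the candidate out.
        if all(frozenset(sub) in frequent_prev
               for sub in combinations(combo, k - 1)):
            candidates.add(frozenset(combo))
    return candidates


if __name__ == "__main__":
    frequent_2 = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"b", "d"}]
    print(pruned_candidates(frequent_2, 3))
```

In the usage example only {a, b, c} survives: {a, b, d} is pruned because {a, d} is not frequent, and {b, c, d} because {c, d} is not.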

The other main candidate generation approach is depth-first order, of which the best-known representatives are Eclat [14] and FP-growth [9] (though the 'candidate' concept in the context of FP-growth is disputable). These two are generally considered to be among the fastest algorithms for frequent itemset mining. However, we shall mainly use Apriori as a reference method, because it is technically closer to ours.

Most of the suggested methods are analytical in the sense that they are based on logical inductions to restrict the number of candidates to be checked. Our approach (called PIE) is probabilistic, based on relative item frequencies, using which we compute estimates for itemset frequencies in candidate generation. More precisely, we generate iteratively improving approximations (candidate itemsets) to the solution. Our general endeavour has been to develop a relatively simple method, with fast basic steps and few iteration cycles, at the cost of somewhat increased number of candidates. However, another goal is that the method should be robust, i.e. it should work reasonably fast for all kinds of datasets.

Method description

Our method can be characterized as a generate-and-test algorithm, such as Apriori. However, our candidate generation is based on probabilistic estimates of the supports of itemsets. The testing phase is rather similar to Apriori, but involves special book-keeping to lay a basis for the next generation phase.

We start with a general description of the main steps of the algorithm. The first thing to do is to determine the frequencies of all items in the dataset, and select the frequent ones for subsequent processing. If there are m frequent items, we internally identify them by numbers 0, …, m-1. For each item i, we use its probability (relative frequency) P(i) in the generation of candidates for frequent itemsets.

The candidates are represented as a trie structure, which is normal in this context, see [7]. Each node is labelled by one item, and a path of labels from the root to a node represents an itemset. The root itself represents the empty itemset. The paths are sorted, so that a subtrie rooted by item i can contain only items > i. Note also that several nodes in the trie can have the same item label, but not on a single path. A complete trie, storing all subsets of the whole itemset, would have 2^m nodes and be structurally a binomial tree [13], where on level j there are $\binom{m}{j}$ nodes, see Fig. 1 for m = 4.

The trie is used for book-keeping purposes. However, it is important to avoid building the complete trie, and to build only some upper part of it, so that the nodes (i.e. their root paths) represent reasonable candidates for frequent sets. In our algorithm, the first approximation for candidate itemsets is obtained by computing estimates for their probabilities, assuming independence of item occurrences. It means that, for example, for an itemset {x, y, z} the estimated probability is the product P(x)P(y)P(z). Nodes are created in the trie from the root down along all paths as long as the path-related probability is not less than the threshold σ_rel. Note that the probability values are monotonically non-increasing on the way down. Fig. 2 shows an example of the initial trie for a given set of transactions (with m = 4). Those nodes of the complete trie (Fig. 1) that do not exist in the actual trie are called virtual nodes, and marked with dashed circles in Fig. 2.
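The initial candidate generation under the independence assumption can be sketched as follows. This is a minimal illustration: the function name and the representation of the trie as a list of root paths are assumptions made here, and the item probabilities in the usage example are computed from the transaction set given in the caption of Fig. 2.

```python
def initial_candidates(item_probs, sigma_rel):
    """List the root paths whose estimated probability is >= sigma_rel.

    `item_probs[i]` is P(i) for frequent item i (items numbered 0..m-1).
    A path is a tuple of increasing item numbers; its estimate is the
    product of the item probabilities (independence assumption).
    """
    m = len(item_probs)
    paths = []

    def descend(path, estimate):
        start = path[-1] + 1 if path else 0
        for item in range(start, m):
            child_estimate = estimate * item_probs[item]
            # Estimates never increase downwards, so a failing child
            # means its whole subtrie stays virtual.
            if child_estimate >= sigma_rel:
                paths.append(path + (item,))
                descend(path + (item,), child_estimate)

    descend((), 1.0)
    return paths


if __name__ == "__main__":
    # Transactions (0,3), (1,2), (0,1,3), (1): P(0)=2/4, P(1)=3/4, P(2)=1/4, P(3)=2/4
    for path in initial_candidates([0.5, 0.75, 0.25, 0.5], 1 / 6):
        print(path)
```

For these probabilities the sketch keeps nine paths; for instance {0, 1, 3} is kept (estimate 0.1875), while {0, 2} is left virtual (estimate 0.125).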

The next step is to read the transactions and count the true number of occurrences for each node (i.e. the related path support) in the trie. Simultaneously, for each visited node, we maintain a counter called pending support (PS), being the number of transactions for which at least one virtual child of the node would match. The pending support will be our criterion for the expansion of the node: If PS(x) ≥ σ, then it is possible that a virtual child of node x is frequent, and the node must be expanded. If there are no such nodes, the algorithm is ready, and the result can be read from the trie: All nodes with support ≥ σ represent frequent itemsets. Trie expansion starts the next cycle, and we iterate until the stopping condition holds. However, we must be very careful in the expansion: which virtual nodes should we materialize (and how deep, recursively), in order to avoid trie 'explosion', but yet approach the final solution? Here we apply item probabilities, again. In principle, we could take advantage of all information available in the current trie (frequencies of subsets, etc.), as is done in the Apriori algorithm and many others. However, we prefer simpler calculation, based on global probabilities of items.
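The counting phase, including the pending support book-keeping, might be sketched as below. The trie is represented simply as a set of root paths, transactions are assumed to contain frequent items only, and all names are hypothetical; the authors' actual data structure is described later in the text.

```python
def count_supports(trie_paths, transactions):
    """One scan over the data: return (support, pending) keyed by root path.

    pending[path] counts the transactions that match `path` and at least
    one of its virtual children (children missing from the trie).
    """
    trie = set(trie_paths) | {()}          # the root is the empty path
    support = dict.fromkeys(trie, 0)
    pending = dict.fromkeys(trie, 0)

    def visit(path, row, start):
        support[path] += 1
        matched_virtual = False
        for pos in range(start, len(row)):
            child = path + (row[pos],)
            if child in trie:
                visit(child, row, pos + 1)
            else:
                matched_virtual = True
        if matched_virtual:
            pending[path] += 1             # at most once per transaction

    for t in transactions:
        visit((), sorted(set(t)), 0)
    return support, pending


if __name__ == "__main__":
    sup, ps = count_supports([(0,), (1,), (2,)], [(0, 1), (0, 1, 2), (1, 2), (2,)])
    print(sup)
    print(ps)
```

In the usage example, node (0,) gets pending support 2, because two transactions contain item 0 together with a missing child (1 or 2). With σ = 2 this node would therefore be considered for expansion.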

Suppose that we have a node x with pending support PS(x) ≥ σ. Assume that it has virtual child items v_0, v_1, …, v_{s-1} with global probabilities P(v_0), P(v_1), …, P(v_{s-1}). Every transaction contributing to PS(x) has a match with at least one of v_0, v_1, …, v_{s-1}. The local probability (LP) for a match with v_i is computed as follows:

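$$
\begin{aligned}
LP(v_i) &= P(v_i \text{ matches} \mid \text{one of } v_0, \ldots, v_{s-1} \text{ matches}) \\
&= \frac{P\big((v_i \text{ matches}) \wedge (\text{one of } v_0, \ldots, v_{s-1} \text{ matches})\big)}{P(\text{one of } v_0, \ldots, v_{s-1} \text{ matches})} \\
&= \frac{P(v_i \text{ matches})}{P(\text{one of } v_0, \ldots, v_{s-1} \text{ matches})} \\
&= \frac{P(v_i)}{1 - (1 - P(v_0))(1 - P(v_1)) \cdots (1 - P(v_{s-1}))}
\end{aligned}
\qquad (2)
$$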

Using this formula, we get an estimated support ES(v_i):

$$ES(v_i) = LP(v_i) \cdot PS(\mathrm{Parent}(v_i)) \qquad (3)$$

If ES(v_i) ≥ σ, then we conclude that v_i is expected to be frequent. However, in order to guarantee a finite number of iterations in the worst case, we have to relax this condition a bit. Since the true distribution may be very skewed, almost the whole pending support may belong to only one virtual child. To ensure convergence, we apply the following condition for child expansion in the k-th iteration,

$$ES(v_i) \geq \alpha^k \sigma \qquad (4)$$

with some constant α between 0 and 1. In the worst case this will eventually (when k is high enough) result in expansion, to get rid of a PS-value ≥ σ. In our tests, we used the heuristic value α = average probability of frequent items. The reasoning behind this choice is that it speeds up the local expansion growth by one level, on the average (k levels for α^k). This acceleration restricts the number of iterations efficiently. The largest extensions are applied only to the 'skewest' subtries, so that the total size of the trie remains tolerable. Another approach to choose α would be to do a statistical analysis to determine confidence bounds for ES. However, this is left for future work.

Fig. 3 shows an example of trie expansion, assuming that the minimum support threshold σ = 80, α = 0.8, and k = 1. The item probabilities are assumed to be P(y) = 0.7, P(z) = 0.5, and P(v) = 0.8. Node t has a pending support of 100, related to its two virtual children, y and z. This means that 100 transactions contained the path from root to t, plus either or both of items y and z, so we have to test for expansion. Our formula gives y a local probability LP(y) = 0.7 / (1−(1−0.7)(1−0.5)) ≈ 0.82, so the estimated support is 82 > α⋅σ = 64, and we expand y. However, the local probability of z is only ≈ 0.59, so its estimated support is 59, and it will not be expanded. When a virtual node (y) has been materialized, we immediately test also its expansion, based on its ES-value, recursively. However, in the recursive steps we cannot apply formula (2), because we have no evidence of the children of y. Instead, we apply the unconditional probabilities of z and v in estimation: ES(z) = 82⋅0.5 = 41 < α⋅σ = 64, and ES(v) = 82⋅0.8 = 65.6 > 64. Node v is materialized, but z is not. The expansion test continues down from v.

Thus, both in initialization of the trie and in its expansion phases, we can create several new levels (i.e. longer candidates) at a time, contrary to e.g. the base version of Apriori. It is true that Apriori, too, can be modified to create several candidate levels at a time, but at the cost of an increased number of candidates.
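The expansion test of formulas (2)-(4), together with the recursive step that uses unconditional probabilities, can be sketched as follows and checked against the numbers of the example with σ = 80, α = 0.8 and k = 1. The function names and the list-based item ordering are assumptions made for this illustration. Unlike the text, the sketch does not round intermediate values, so it works with ES(y) ≈ 82.35 instead of 82; the expansion decisions are the same.

```python
from math import prod


def local_probabilities(virtual_probs):
    """Formula (2): LP(v_i) = P(v_i) / (1 - prod(1 - P(v_j)))."""
    any_match = 1.0 - prod(1.0 - p for p in virtual_probs.values())
    return {item: p / any_match for item, p in virtual_probs.items()}


def expand(pending_support, ordered_items, virtual, sigma, alpha, k):
    """Return {path below the node: estimated support} for materialised nodes.

    `ordered_items` is a list of (item, probability) pairs for the items
    that may follow the node, in trie order; `virtual` names those that
    are virtual children of the node.
    """
    threshold = alpha ** k * sigma                       # condition (4)
    lp = local_probabilities({i: p for i, p in ordered_items if i in virtual})
    created = {}

    def descend(path, es, remaining):
        # Below a new node there is no evidence yet, so the
        # unconditional item probabilities are used.
        for idx, (item, p) in enumerate(remaining):
            child_es = es * p
            if child_es >= threshold:
                created[path + (item,)] = child_es
                descend(path + (item,), child_es, remaining[idx + 1:])

    for idx, (item, _) in enumerate(ordered_items):
        if item in virtual:
            es = lp[item] * pending_support              # formula (3)
            if es >= threshold:
                created[(item,)] = es
                descend((item,), es, ordered_items[idx + 1:])
    return created


if __name__ == "__main__":
    items = [("y", 0.7), ("z", 0.5), ("v", 0.8)]
    print(expand(100, items, {"y", "z"}, sigma=80, alpha=0.8, k=1))
```

Under these assumptions the sketch materialises y and, below it, v, while both occurrences of z stay virtual, in agreement with the example.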

After the expansion phase the iteration continues with the counting phase, and new values for node supports and pending supports are determined. The two phases alternate until all pending supports are less than σ. We have given our method the name 'PIE', reflecting this Probabilistic Iterative Expansion property.

Elaboration

The above described basic version does a lot of extra work. One observation is that as soon as the pending support of some node x is smaller than σ, we can often 'freeze' the whole subtrie, because it will not give us anything new; we call it 'ready'. The readiness of nodes can be checked easily with a recursive process: A node x is ready if PS(x) < σ and all its real children are ready. The readiness can be utilized to reduce work both in counting and expansion phases. In counting, we process one transaction at a time and scan its item subsets down the trie, but only until the first ready node on each path. Also the expansion procedure is skipped for ready nodes. Finally, a simple stopping condition is when the root becomes ready.
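The recursive readiness check can be sketched as below. The `Node` class with its fields is a hypothetical stand-in for the trie node, holding only what the check needs.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    pending_support: int = 0
    children: dict = field(default_factory=dict)   # item -> Node (real children only)
    ready: bool = False


def mark_ready(node, sigma):
    """Set the `ready` flag in the whole subtrie and return the flag of `node`.

    A node is ready if PS(node) < sigma and all its real children are ready.
    """
    # Visit every child (no short-circuit), so that all flags get updated.
    children_ready = [mark_ready(child, sigma) for child in node.children.values()]
    node.ready = node.pending_support < sigma and all(children_ready)
    return node.ready


if __name__ == "__main__":
    leaf = Node(pending_support=0)
    b = Node(pending_support=0, children={2: leaf})
    a = Node(pending_support=5)
    root = Node(children={0: a, 1: b})
    print(mark_ready(root, sigma=3), a.ready, b.ready)
```

In the usage example the subtrie under `b` is frozen, while `a` (pending support 5 ≥ 3) keeps the root non-ready, so the iteration would continue.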

Another tailoring, not yet implemented, relates to the observation that most of the frequent itemsets are found in the first few iterations, and a lot of I/O effort is spent to find the last few frequent sets. For those, not all transactions are needed to determine the frequencies. In the counting phase, we can distinguish between relevant and irrelevant transactions. A transaction is irrelevant if it does not increase the pending support value of any non-ready node. If the number of relevant transactions is small enough, we can store them separately (in main memory or a temporary file) during the next scanning phase.

Our implementation of the trie is quite simple; saving memory is considered, but not as the first preference. The child linkage is implemented as an array of pointers, and the frequent items are renumbered to 0, …, m-1 (if there are m frequent items) to be able to use them as indices to the array. A minor improvement is that for item i, we need only m-i-1 pointers, corresponding to the possible children i+1, …, m-1.
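The array-based child linkage can be illustrated with a small Python sketch. The class and method names are hypothetical, a Python list stands in for the pointer array, and the root is given the pseudo-item −1 here so that the same size formula yields m slots for it.

```python
class ArrayNode:
    """Trie node whose children are addressed directly by item number."""

    def __init__(self, item, m):
        self.item = item      # item label; -1 denotes the root in this sketch
        self.m = m            # number of frequent items
        self.support = 0
        # Only items item+1 .. m-1 can follow, so m - item - 1 slots suffice.
        self.children = [None] * (m - item - 1)

    def slot(self, child_item):
        """Array index of the child labelled `child_item`."""
        return child_item - self.item - 1

    def child(self, child_item):
        return self.children[self.slot(child_item)]

    def add_child(self, child_item):
        node = ArrayNode(child_item, self.m)
        self.children[self.slot(child_item)] = node
        return node


if __name__ == "__main__":
    root = ArrayNode(-1, m=4)
    n1 = root.add_child(1)
    n1.add_child(3)
    print(len(root.children), len(n1.children), n1.child(3).item, n1.child(2))
```

Direct indexing makes each trie step a constant-time lookup, at the price of reserving slots that mostly stay empty.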

The main restriction of the current implementation is the assumption that the trie fits in the main memory. Compression of nodes would help to some extent: Now we reserve a pointer for every possible child node, but most of them are null. Storing only non-null pointers saves memory, but makes the trie scanning slower. Also, we could release the ready nodes as soon as they are detected, in order to make room for expansions. Of course, before releasing, the related frequent itemsets should be reported. However, a fully general solution should work for any main memory and trie size. Some kind of external representation should be developed, but this is left for future work.

A high-level pseudocode of the current implementation is given in the following. The recursive parts are not coded explicitly, but should be rather obvious.

Algorithm PIE − Probabilistic iterative expansion of candidates in frequent itemset mining

Input: A transaction database D, the minimum support threshold σ.
Output: The complete set of frequent itemsets.
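As a rough, runnable illustration of the steps described in the text, the following Python sketch puts the phases together. It is not the authors' pseudocode or their C implementation: the trie is a plain dictionary keyed by root paths, the readiness optimisation is omitted, and all names are assumptions made for this sketch.

```python
from math import prod


def pie(transactions, sigma, alpha=None):
    """Return {itemset as sorted tuple: support} for all frequent itemsets."""
    n = len(transactions)
    counts = {}
    for t in transactions:
        for item in set(t):
            counts[item] = counts.get(item, 0) + 1
    items = sorted(i for i, c in counts.items() if c >= sigma)
    if not items:
        return {}
    rank = {item: r for r, item in enumerate(items)}
    prob = {i: counts[i] / n for i in items}
    if alpha is None:                        # heuristic used in the tests
        alpha = sum(prob.values()) / len(items)
    rows = [sorted(i for i in set(t) if i in rank) for t in transactions]
    trie = {(): 0}                           # root path -> support of last scan

    def grow(path, est, threshold):
        """Materialise descendants using unconditional item probabilities."""
        first = rank[path[-1]] + 1 if path else 0
        for item in items[first:]:
            child, child_est = path + (item,), est * prob[item]
            if child not in trie and child_est >= threshold:
                trie[child] = 0
                grow(child, child_est, threshold)

    grow((), float(n), sigma)                # initial trie, independence assumed
    k = 0
    while True:
        k += 1
        support = dict.fromkeys(trie, 0)
        pending = dict.fromkeys(trie, 0)

        def scan(path, row, start):
            support[path] += 1
            missed = False
            for pos in range(start, len(row)):
                child = path + (row[pos],)
                if child in trie:
                    scan(child, row, pos + 1)
                else:
                    missed = True            # a virtual child matches
            if missed:
                pending[path] += 1

        for row in rows:                     # counting phase
            scan((), row, 0)
        trie.update(support)
        hot = [p for p, ps in pending.items() if ps >= sigma]
        if not hot:                          # stopping condition
            return {p: s for p, s in trie.items() if p and s >= sigma}
        threshold = alpha ** k * sigma       # condition (4)
        for path in hot:                     # expansion phase
            first = rank[path[-1]] + 1 if path else 0
            virtual = [i for i in items[first:] if path + (i,) not in trie]
            any_match = 1.0 - prod(1.0 - prob[v] for v in virtual)
            for v in virtual:
                es = prob[v] / any_match * pending[path]   # formulas (2), (3)
                if es >= threshold:
                    trie[path + (v,)] = 0
                    grow(path + (v,), es, threshold)


if __name__ == "__main__":
    data = [(0, 3), (1, 2), (0, 1, 3), (1,)]
    print(pie(data, sigma=2))
```

The sketch relies on the argument given in the text: when no pending support reaches σ, every frequent itemset must already be a node of the trie, so the result can be read off directly.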

Experimental results

For verifying the usability of our PIE algorithm, we used four of the test datasets made available to the Workshop on Frequent Itemset Mining Implementations (FIMI'03) [6]. The test datasets and some of their properties are described in Table 1. They represent rather different kinds of domains, and we wanted to include both dense and non-dense datasets, as well as various numbers of items. For the PIE method, the interesting statistics to be collected are the number of candidates, depth of the trie, and the number of iterations. These results are given in Table 2 for selected values of σ, for the 'Chess' dataset. We chose values of σ that keep the number of frequent itemsets reasonable (extremely high numbers are probably useless for any application). The table shows also the number of frequent items and frequent sets, to enable comparison with the number of candidates. For this dense dataset, the number of candidates varies between 2-4 times the number of frequent itemsets. For non-dense datasets the ratio is usually larger. Table 2 shows also the values of the 'security parameter' α, being the average probability of frequent items. Considering I/O performance, we can see that the number of iteration cycles (= number of file scans) is quite small, compared to the base version of the Apriori method, for which the largest frequent itemset dictates the number of iterations. This is roughly the same as the trie depth, as shown in Table 2.

The PIE method can also be characterized by describing the development of the trie during the iterations. The most interesting figures are the number of nodes and the number of ready nodes, given in Table 3. Especially the number of ready nodes implies that even though we have rather many candidates (= nodes in the trie), large parts of them are not touched in the later iterations.

For speed comparison, we chose the Apriori and FP-growth implementations provided by Bart Goethals [6]. The results for the four test datasets and for different minimum support thresholds are shown in Table 4. The processor used in the experiments was a 1.5 GHz Pentium 4, with 512 MB main memory. We used a g++ compiler, with the optimizing switch -O6. The PIE algorithm was coded in C. We can see that in some situations the PIE algorithm is the fastest, in some others the slowest. This is probably a general observation: the performance of most frequent itemset mining algorithms is highly dependent on the data set and threshold. It seems that PIE is at its best for sparse datasets (such as T40I10D100K and Kosarak), but not so good for very dense datasets (such as 'Chess' and 'Mushroom'). Its speed for large thresholds probably results from the simplicity of the algorithm. For smaller thresholds, the trie gets large and the counting starts to consume more time, especially with a small main memory size.

One might guess that our method is at its best for random data sets, because those would correspond to our assumption about independent item occurrences. We tested this with a dataset of 100 000 transactions, each of which contained 20 random items out of 30 possible. The results were rather interesting: For all tested thresholds for minimum support, we found all the frequent itemsets in the first iteration. However, verification of the completeness required one or two additional iterations, with a clearly higher number of candidates, consuming a majority of the total time. Table 5 shows the time and number of candidates both after the first and after the final iteration. The stepwise growth of the values reveals the levelwise growth of the trie. Apriori worked well also for this dataset, being in most cases faster than PIE. Results for FP-growth (not shown) are naturally much slower, because randomness prevents a compact representation of the transactions.
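A dataset of this kind could be generated along the following lines. The generator below is an assumption about how such data may be produced, not the authors' tool. The two helper functions compare the independence estimate with the exact expected relative frequency of a fixed k-itemset when each transaction is a uniformly random 20-subset of 30 items.

```python
import random
from math import prod


def random_dataset(n_transactions, n_items=30, per_transaction=20, seed=0):
    """Transactions of `per_transaction` distinct random items each."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_items), per_transaction))
            for _ in range(n_transactions)]


def independence_estimate(k, n_items=30, per_transaction=20):
    """Estimated relative frequency of a k-itemset, assuming independent items."""
    return (per_transaction / n_items) ** k


def exact_frequency(k, n_items=30, per_transaction=20):
    """Expected relative frequency of a fixed k-itemset in such data."""
    return prod((per_transaction - j) / (n_items - j) for j in range(k))


if __name__ == "__main__":
    data = random_dataset(1000)
    print(len(data), len(data[0]))
    for k in (1, 2, 5):
        print(k, independence_estimate(k), exact_frequency(k))
```

Because the items of a transaction are drawn without replacement, the independence estimate slightly overestimates the frequency of longer itemsets in such data; the gap grows with k.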

We wish to point out that our implementation was an initial version, with no special tricks for speed-up. We are convinced that the code details can be improved to make the method still more competitive. For example, buffering of transactions (or temporary files) was not used to enhance the I/O performance.

Conclusions and future work

A probability-based approach was suggested for frequent itemset mining, as an alternative to the 'analytic' methods common today. It has been observed to be rather robust, working reasonably well for various kinds of datasets. The number of candidate itemsets does not 'explode', so that the data structure (trie) can be kept in the main memory in most practical cases.

The number of iterations is smallest for random datasets, because candidate generation is based on just that assumption. For skewed datasets, the number of iterations may somewhat grow. This is partly due to our simplifying premise that the items are independent. This point could be tackled by making use of the conditional probabilities obtainable from the trie. Initial tests did not show any significant advantage over the basic approach, but a more


Figure 1. The complete trie for 4 items.

Figure 2. An initial trie for the transaction set {(0, 3), (1, 2), (0, 1, 3), (1)}, with minimum support threshold σ = 1/6. The virtual nodes with probabilities < 1/6 are shown using dashed lines.

Figure 3. An example of expansion for probabilities P(y) = 0.7, P(z) = 0.5, and P(v) = 0.8.

Table 1. Test dataset description.

| Dataset | #Transactions | #Items |
|---|---|---|
| Chess | 3 196 | 75 |
| Mushroom | 8 124 | 119 |
| T40I10D100K | 100 000 | 942 |
| Kosarak | 900 002 | 41 270 |

Table 3. Development of the trie for dataset 'Chess', with three different values of σ.

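Table 2. Statistics from the PIE algorithm for dataset 'Chess'.

| σ | #Frequent items | #Frequent sets | Alpha | #Candidates | Trie depth | #Iterations | #Apriori's iterations |
|---|---|---|---|---|---|---|---|
| 3 000 | 12 | 155 | 0.970 | 400 | 6 | 3 | 6 |
| 2 900 | 13 | 473 | 0.967 | 1 042 | 8 | 4 | 7 |
| 2 800 | 16 | 1 350 | 0.953 | 2 495 | 8 | 4 | 8 |
| 2 700 | 17 | 3 134 | 0.947 | 5 218 | 9 | 4 | 8 |
| 2 600 | 19 | 6 135 | 0.934 | 10 516 | 10 | 4 | 9 |
| 2 500 | 22 | 11 493 | 0.914 | 18 709 | 11 | 4 | 10 |
| 2 400 | 23 | 20 582 | 0.907 | 47 515 | 12 | 4 | 11 |
| 2 300 | 24 | 35 266 | 0.900 | 131 108 | 13 | 4 | 12 |
| 2 200 | 27 | 59 181 | 0.877 | 216 943 | 14 | 5 | 13 |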

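Table 4. Comparison of execution times (in seconds) of three frequent itemset mining programs for four test datasets.

(a) Chess

| σ | #Freq. sets | Apriori | FP-growth | PIE |
|---|---|---|---|---|
| 3 000 | 155 | 0.312 | 0.250 | 0.125 |
| 2 900 | 473 | 0.469 | 0.266 | 0.265 |
| 2 800 | 1 350 | 0.797 | 0.297 | 1.813 |
| 2 700 | 3 134 | 1.438 | 0.344 | 6.938 |
| 2 600 | 6 135 | 3.016 | 0.438 | 14.876 |
| 2 500 | 11 493 | 10.204 | 0.610 | 26.360 |
| 2 400 | 20 582 | 21.907 | 0.829 | 78.325 |
| 2 300 | 35 266 | 42.048 | 1.156 | 203.828 |
| 2 200 | 59 181 | 73.297 | 1.766 | 315.562 |

(b) Mushroom

| σ | #Freq. sets | Apriori | FP-growth | PIE |
|---|---|---|---|---|
| 5 000 | 41 | 0.375 | 0.391 | 0.062 |
| 4 500 | 97 | 0.437 | 0.406 | 0.094 |
| 4 000 | 167 | 0.578 | 0.438 | 0.141 |
| 3 500 | 369 | 0.797 | 0.500 | 0.297 |
| 3 000 | 931 | 1.062 | 0.546 | 1.157 |
| 2 500 | 2 365 | 1.781 | 0.610 | 6.046 |
| 2 000 | 6 613 | 3.719 | 0.750 | 27.047 |
| 1 500 | 56 693 | 55.110 | 1.124 | 153.187 |

(c) T40I10D100K

| σ | #Freq. sets | Apriori | FP-growth | PIE |
|---|---|---|---|---|
| 20 000 | 5 | 2.797 | 6.328 | 0.797 |
| 18 000 | 9 | 2.828 | 6.578 | 1.110 |
| 16 000 | 17 | 3.001 | 7.250 | 1.156 |
| 14 000 | 24 | 3.141 | 8.484 | 1.187 |
| 12 000 | 48 | 3.578 | 14.750 | 1.906 |
| 10 000 | 82 | 4.296 | 23.874 | 4.344 |
| 8 000 | 137 | 7.859 | 41.203 | 11.796 |
| 6 000 | 239 | 20.531 | 72.985 | 29.671 |
| 4 000 | 440 | 35.282 | 114.953 | 68.672 |

(d) Kosarak

| σ | #Freq. sets | Apriori | FP-growth | PIE |
|---|---|---|---|---|
| 20 000 | 121 | 27.970 | 30.141 | 5.203 |
| 18 000 | 141 | 28.438 | 31.296 | 6.110 |
| 16 000 | 167 | 29.016 | 32.765 | 7.969 |
| 14 000 | 202 | 29.061 | 33.516 | 9.688 |
| 12 000 | 267 | 29.766 | 34.875 | 12.032 |
| 10 000 | 376 | 34.906 | 37.657 | 18.016 |
| 8 000 | 575 | 35.891 | 41.657 | 30.453 |
| 6 000 | 1 110 | 39.656 | 51.922 | 70.376 |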

References

[1] R. Agrawal, C. Aggarwal, V.V.V. Prasad, "Depth First Generation of Long Patterns", in R. Ramakrishnan, S. Stolfo, R. Bayardo, I. Parsa (eds.), Proc. of the Int. Conf. on Knowledge Discovery and Data Mining, ACM, Aug. 2000.

[2] R. Agrawal, T. Imielinski, A. Swami, "Mining Association Rules Between Sets of Items in Large Databases", in P. Buneman, S. Jajodia (eds.), Proc. of ACM SIGMOD Int. Conf. of Management of Data, May 1993.

[3] R. Agrawal, R. Srikant, "Fast Algorithms for Mining Association Rules in Large Databases", in J.B. Bocca, M. Jarke, C. Zaniolo (eds.), Proc. of the 20th VLDB Conf., Sept. 1994.

[4] R.J. Bayardo, "Efficiently Mining Long Patterns from Databases", in L.M. Haas, A. Tiwary (eds.), Proc. of the ACM SIGMOD Int. Conf. on Management of Data, June 1998.

[5] D. Burdick, M. Calimlim, J. Gehrke, "MAFIA: a Maximal Frequent Itemset Algorithm for Transactional Databases", Proc. of IEEE Int. Conf. on Data Engineering, April 2001.

[6] Frequent Itemset Mining Implementations (FIMI'03) Workshop website, 2003.

[7] B. Goethals, "Efficient Frequent Pattern Mining", PhD thesis, University of Limburg, Belgium, Dec. 2002.

[8] K. Gouda, M.J. Zaki, "Efficiently Mining Maximal Frequent Itemsets", in N. Cercone, T.Y. Lin, X. Wu (eds.), Proc. of 2001 IEEE International Conference on Data Mining, Nov. 2001.

[9] J. Han, J. Pei, Y. Yin, "Mining Frequent Patterns Without Candidate Generation", in W. Chen, J. Naughton, P.A. Bernstein (eds.), Proc. of ACM SIGMOD Int. Conf. on Management of Data, 2000.

[10] J. Hipp, U. Güntzer, N. Nakhaeizadeh, "Algorithms for Association Rule Mining - a General Survey and Comparison", ACM SIGKDD Explorations 2, July 2000.

[11] H. Mannila, H. Toivonen, A.I. Verkamo, "Efficient Algorithms for Discovering Association Rules", in U.M. Fayyad, R. Uthurusamy (eds.), Proc. of the AAAI Workshop on Knowledge Discovery in Databases, July 1994.

[12] J. Pei, J. Han, R. Mao, "Closet: An Efficient Algorithm for Mining Frequent Closed Itemsets", Proc. of ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, May 2000.

[13] J. Vuillemin, "A Data Structure for Manipulating Priority Queues", Comm. of the ACM 21(4), 1978.

[14] M.J. Zaki, "Scalable Algorithms for Association Mining", IEEE Transactions on Knowledge and Data Engineering 12(3), 2000.

[15] Z. Zheng, R. Kohavi, L. Mason, "Real World Performance of Association Rule Algorithms", in F. Provost, R. Srikant (eds.), Proc. of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001.