LCM: An Efficient Algorithm for Enumerating Frequent Closed Item Sets

Takeaki Uno (1), Tatsuya Asai (2), Yuzo Uchida (2), Hiroki Arimura (2)
(1) National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan. e-mail: uno@nii.jp
(2) Department of Informatics, Kyushu University, 6-10-1 Hakozaki, Fukuoka 812-0053, Japan. e-mail: {t-asai,y-uchida,arim}@i.kyushu-u.ac.jp

Abstract: In this paper, we propose three algorithms, LCMfreq, LCM, and LCMmax, for mining all frequent item sets, frequent closed item sets, and maximal frequent item sets, respectively, from transaction databases. The main theoretical contribution is that we construct tree-shaped traversal routes composed only of frequent closed item sets, induced by a parent-child relationship defined on frequent closed item sets. By traversing these routes in a depth-first manner, LCM finds all frequent closed item sets in polynomial time per item set, without storing previously obtained closed item sets in memory. Moreover, we introduce several algorithmic techniques that exploit the sparse and dense structures of the input data. Algorithms for enumerating all frequent item sets and maximal frequent item sets are obtained from LCM as variants. Computational experiments on real world and synthetic databases, comparing our algorithms with previous ones, show that our algorithms are fast on large real world datasets with natural distributions, such as the KDD-cup 2000 datasets, and on many synthetic databases.

1. Introduction

Frequent item set mining is one of the fundamental problems in data mining and has many applications such as association rule mining [1], inductive databases [9], and query expansion [12].

Let E = {1, ..., n} be the universe of items. A subset X of E is called an item set. T is a set of transactions over E, i.e., each T in T is composed of items of E. For an item set X, let T(X) = { t ∈ T | X ⊆ t } be the set of transactions including X. Each transaction of T(X) is called an occurrence of X. For a given constant α ≥ 0, an item set X is called frequent if |T(X)| ≥ α. If a frequent item set is included in no other frequent item set, it is said to be maximal. For a transaction set S ⊆ T, let I(S) = ∩_{T∈S} T. If an item set X satisfies I(T(X)) = X, then X is called a closed item set. We denote by F and C the sets of all frequent item sets and all frequent closed item sets, respectively.
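As a concrete illustration of these definitions, the following Python sketch (an illustration only, over a small hypothetical database; it is not part of the original implementation) computes T(X) and I(S) and tests whether an item set is frequent or closed, directly from the definitions.

    # Illustration of the basic definitions on a small hypothetical database.

    def occurrences(database, X):
        """T(X): the transactions of the database that include the item set X."""
        X = frozenset(X)
        return [t for t in database if X <= t]

    def common_items(transactions):
        """I(S): the items shared by every transaction in S (empty S gives the empty set here)."""
        sets = [frozenset(t) for t in transactions]
        return frozenset.intersection(*sets) if sets else frozenset()

    def is_frequent(database, X, alpha):
        """X is frequent iff |T(X)| >= alpha."""
        return len(occurrences(database, X)) >= alpha

    def is_closed(database, X):
        """X is closed iff I(T(X)) = X."""
        return common_items(occurrences(database, X)) == frozenset(X)

    if __name__ == "__main__":
        database = [frozenset(t) for t in
                    [{1, 2, 5}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}, {1, 3}]]
        print(is_frequent(database, {2, 5}, alpha=3))   # True: {2, 5} has 4 occurrences
        print(is_closed(database, {2, 5}))              # True: I(T({2, 5})) = {2, 5}
        print(is_closed(database, {2}))                 # False: every occurrence of {2} also contains 5

For instance, in this toy database the item set {2} is frequent but not closed, while {2, 5} is its closure.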
In this paper, we propose an efficient algorithm LCM for enumerating all frequent closed item sets. LCM is an abbreviation of Linear time Closed item set Miner. Existing algorithms for this task basically enumerate frequent item sets while cutting off unnecessary frequent item sets by pruning. However, the pruning is not complete, so these algorithms still generate unnecessary frequent item sets and do extra work on them. In LCM, we define a parent-child relationship between frequent closed item sets. The relationship induces tree-shaped traversal routes composed only of the frequent closed item sets. Our algorithm traverses these routes, and hence takes time linear in the number of frequent closed item sets. The algorithm is derived from algorithms for enumerating maximal bipartite cliques [14, 15], which are designed based on the reverse search technique [3, 16].

In addition to the search tree technique for closed item sets, we use several techniques to speed up the update of the occurrences of item sets. One technique is occurrence deliver, which simultaneously computes the occurrence sets of all successors of the current item set during a single scan of the current occurrence set. The other is the diffsets technique proposed in [18]. Since there is a trade-off between these two methods, the former being fast for sparse data and the latter for dense data, we developed a hybrid algorithm combining them. In some iterations, we make a decision based on an estimate of their computation times, so our algorithm can use the appropriate method for the dense parts and the sparse parts of the input.

We also consider the problems of enumerating all frequent item sets and maximal frequent item sets, and derive two algorithms, LCMfreq and LCMmax, from LCM. LCMmax is obtained from LCM by adding an explicit check of maximality. LCMfreq is not merely LCM without the check of closedness: it also achieves substantial speed-up by using closed item set discovery techniques, because it enumerates only representatives of groups of frequent item sets and generates the other frequent item sets from the representatives.

From computational experiments on real and artificial datasets with the previous algorithms, we observed that our algorithms LCMfreq, LCM, and LCMmax significantly outperform the previous algorithms on real world datasets with natural distributions, such as the BMS-Web-View-1 and BMS-POS datasets in the KDD-CUP 2000 collection, as well as large synthetic datasets such as IBM T10I4D100K. The performance of our algorithms is similar to that of the other algorithms on hard datasets such as the Connect and Chess datasets from the UCI Machine Learning Repository, and less significant than MAFIA on some of them; however, LCM works with less memory than the other algorithms.

The organization of the paper is as follows. In Section 2, we explain our tree enumeration method for frequent closed item sets and our algorithm LCM. In Section 3, we describe several algorithmic techniques for speeding up and saving memory. Sections 4 and 5 give LCMmax and LCMfreq for maximal and all frequent item sets, respectively. Implementation techniques are described in Section 6, and the results of computational experiments are reported in Section 7. Finally, we conclude in Section 8.

2. Enumerating Frequent Closed Item Sets

In this section, we introduce a parent-child relationship between frequent closed item sets in C, and describe our algorithm LCM for enumerating them.

Recent efficient algorithms for frequent item sets, e.g., [4, 17, 18], use a tree-shaped search structure for F, called the set enumeration tree [4], defined as follows. Let X = {x_1, ..., x_n} be an item set written as an ordered sequence such that x_1 < ... < x_n, where the tail of X is tail(X) = x_n ∈ E. Let X, Y be item sets. For an index i, X(i) = X ∩ {1, ..., i}. X is a prefix of Y if X = Y(i) holds for i = tail(X). Then, the parent-child relation P for the set enumeration tree for F is defined as X = P(Y) iff Y = X ∪ {i} for some i > tail(X), or equivalently, X = Y \ {tail(Y)}. The whole search space for F thus forms a prefix tree (or trie) with this edge relation P.

Now, we define the parent-child relation P for closed item sets in C as follows. For X ∈ C, we define the parent of X by P(X) = I(T(X(i(X) − 1))), where i(X) is the minimum item i such that T(X) = T(X(i)) but T(X) ≠ T(X(i − 1)). If Y is the parent of X, we say X is a child of Y. Let ⊥ = I(T(∅)) be the smallest item set in C, called the root. For any X ∈ C \ {⊥}, its parent P(X) is always defined and belongs to C. An illustration is given in Fig. 1.

Figure 1: An example of the parent of X: the parent of X is obtained by deleting the items larger than i(X) (the gray area in the figure) and taking the closure.

For any X ∈ C and its parent Y, the proper inclusion Y ⊂ X holds, since T(X(i(X) − 1)) properly includes T(X). Thus, the relation P is acyclic, and its graph representation forms a tree. By traversing the tree in a depth-first manner, we can enumerate all the frequent closed item sets in linear time in the size of the tree, which is equal to the number of frequent closed item sets in C. In addition, we need not store the tree in memory. Starting from the root ⊥, we find a child X of the root and go to X. In the same way, we go to a child of X. When we arrive at a leaf of the tree, we backtrack and find another child. Repeating this process, we eventually find all frequent closed item sets in C.
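The parent can be computed operationally from this definition. The short Python sketch below (an illustration continuing the toy conventions of the previous sketch, not the paper's implementation) determines i(X) and the parent P(X) = I(T(X(i(X) − 1))) of a closed item set X.

    # Computing i(X) and the parent P(X) = I(T(X(i(X)-1))) directly from the definitions.

    def occurrences(database, X):
        """T(X): the transactions including X."""
        return [t for t in database if frozenset(X) <= t]

    def common_items(transactions):
        """I(S): the items common to all transactions of S (no transactions -> empty set)."""
        return frozenset.intersection(*map(frozenset, transactions)) if transactions else frozenset()

    def prefix(X, i):
        """X(i) = X ∩ {1, ..., i}."""
        return frozenset(e for e in X if e <= i)

    def core_index(database, X):
        """i(X): the minimum item i of X with T(X(i)) = T(X)."""
        target = occurrences(database, X)
        for i in sorted(X):
            if occurrences(database, prefix(X, i)) == target:
                return i
        return 0                      # only the root ⊥ falls through; treated as 0 here

    def parent(database, X):
        """P(X) = I(T(X(i(X) - 1))) for a non-root closed item set X."""
        return common_items(occurrences(database, prefix(X, core_index(database, X) - 1)))

    if __name__ == "__main__":
        database = [frozenset(t) for t in
                    [{1, 2, 5}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}, {1, 3}]]
        print(sorted(parent(database, {2, 5})))   # []: the parent of {2, 5} is the root ⊥ = I(T(∅))

For the toy database used above, parent(database, {2, 5}) returns the root ⊥ = I(T(∅)) = ∅, matching the fact that {2, 5} is a child of the root.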
To find the children of the current frequent closed item set, we use the following lemma. For an item set X and an index i, let X[i] = X ∪ H, where H is the set of items j ∈ I(T(X ∪ {i})) satisfying j ≥ i.

Lemma 1  X' is a child of X ∈ C (i.e., X' ∈ C and the parent of X' is X) if and only if
(cond1) X' = X[i] for some i > i(X),
(cond2) X' = I(T(X')) (X' is a closed item set), and
(cond3) X' is frequent.

Proof: Suppose that X' = X[i] satisfies the conditions (cond1), (cond2) and (cond3). Then, X' ∈ C. Since T(X(i − 1)) = T(X) and T(X(i − 1) ∪ {i}) = T(X') hold, i(X') = i holds. Hence, X' is a child of X. Conversely, suppose that X' is a child of X. Then, (cond2) and (cond3) hold. From the definition of i(X'), T(X(i(X')) ∪ {i(X')}) = T(X') holds. Hence, X' = X[i(X')] holds. We also have i(X') > i(X) since T(X'(i(X') − 1)) = T(X). Hence (cond1) holds.

Clearly, T(X[i]) = T(X ∪ {i}) holds if (cond2), i.e., I(T(X[i])) = X[i], holds. Note that no child X' of X satisfies X[i] = X[i'] for i ≠ i', since the minimum items of X[i] \ X and X[i'] \ X are i and i', respectively. Using this lemma, we construct the following algorithm scheme which, given a closed item set X, enumerates all descendants of X in the search tree for closed item sets.

Algorithm LCM ( X : frequent closed item set )
1. Output X
2. For each i > i(X) do
3.   If X[i] is frequent and X[i] = I(T(X[i])) then call LCM( X[i] )
4. End for

Theorem 1  Let 0 < σ < 1 be a minimum support. Given the root closed item set ⊥ = I(T(∅)), Algorithm LCM enumerates all frequent closed item sets in linear time in the number of frequent closed item sets in C.
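The scheme above can be transcribed almost literally. The following Python sketch (an unoptimized illustration over the same toy database conventions, not the authors' C implementation) checks (cond1)-(cond3) directly from the definitions and performs the depth-first traversal.

    # A direct, unoptimized rendering of the LCM scheme above (illustration only).

    def occurrences(db, X):
        return [t for t in db if frozenset(X) <= t]          # T(X)

    def common_items(ts):
        return frozenset.intersection(*map(frozenset, ts)) if ts else frozenset()   # I(S)

    def core_index(db, X):
        target = occurrences(db, X)                           # i(X): smallest i with T(X(i)) = T(X)
        for i in sorted(X):
            if occurrences(db, {e for e in X if e <= i}) == target:
                return i
        return 0

    def closure_extension(db, X, i):
        occ = occurrences(db, set(X) | {i})                   # X[i] = X ∪ {j ∈ I(T(X ∪ {i})) : j ≥ i}
        return frozenset(X) | {j for j in common_items(occ) if j >= i} if occ else None

    def lcm(db, items, alpha, X=None, out=None):
        if out is None:                                       # first call: start from the root ⊥
            out = []
            X = common_items(db)
            if len(db) < alpha:
                return out
        out.append(X)                                         # 1. output X
        for i in sorted(items):
            if i <= core_index(db, X) or i in X:              # 2. consider only i > i(X)
                continue
            Xi = closure_extension(db, X, i)
            if Xi is None or len(occurrences(db, Xi)) < alpha:
                continue                                      # (cond3): X[i] must be frequent
            if common_items(occurrences(db, Xi)) != Xi:
                continue                                      # (cond2): X[i] must be closed
            lcm(db, items, alpha, Xi, out)                    # 3. recurse into the child X[i]
        return out

    if __name__ == "__main__":
        db = [frozenset(t) for t in [{1, 2, 5}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}, {1, 3}]]
        for C in lcm(db, {1, 2, 3, 5}, alpha=2):
            print(sorted(C))

On the toy database this prints the seven frequent closed item sets, each exactly once, following the tree traversal described above.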
The existing enumeration algorithms for frequent closed item sets are based on backtracking: they traverse a tree composed of all frequent item sets in F, and skip some item sets by pruning the tree. Since the pruning is not complete, however, these algorithms still generate unnecessary frequent item sets. The algorithm in [10] directly generates only closed item sets with the closure operation I(T(·)), as ours does, but its method may generate duplicated closed item sets and needs an expensive duplicate check.

Our algorithm, on the other hand, traverses a tree composed only of frequent closed item sets, and each iteration is not as heavy as in the previous algorithms. Hence, our algorithm runs fast in practice. If we consider our algorithm as a modification of the usual backtracking algorithm, each iteration of our algorithm re-orders the items larger than i(X) so that the items not included in X follow the items included in X. Note that the parent X is not a prefix of X[i] in a recursive call. The check of (cond2) can be considered as a pruning of non-closed item sets.

3. Reducing Computation Time

The computation time of LCM described in the previous section is linear in |C|, with a factor depending on T(X) for each closed item set X ∈ C. However, it still takes a long time if implemented in a straightforward way. In this section, we introduce some techniques based on the sparse and dense structures of the input data.

Occurrence Deliver.  First, we introduce the technique called occurrence deliver, which reduces the construction time of T(X[i]) needed to check (cond3). This technique is particularly efficient when |T(X[i])| is much smaller than |T(X)|. In the usual way, T(X[i]) is obtained from T(X) in O(|T(X)|) time by removing all transactions not including i, based on the equality T(X[i]) = T(X ∪ {i}) = T(X) ∩ T({i}) (this method is known as down-project). Thus, the total computation for all children takes |E| scans and O(|T(X)| · |E|) time.

Instead of this, we build the occurrence lists J[i] := T(X[i]) for all i = i(X), ..., |E| simultaneously, by scanning the transactions in T(X) only once, as follows. We initialize J[i] = ∅ for all i = i(X), ..., |E|. For each T ∈ T(X) and for each i ∈ T with i > i(X), we insert T to J[i]. See Fig. 2 for an illustration, where jQ[i] stands for J[i]. This correctly computes J[i] for all i, in total time proportional to the sizes of the transactions in T(X). Furthermore, we need not make a recursive call of LCM for X[i] if T(X[i]) = ∅ (this is often called lookahead [4]). In our experiments on the BMS instances, occurrence deliver reduces the computation time up to 1/10 in some cases.

Figure 2: Occurrence deliver and right-first sweep. In the figure, J[i] is written as JQ[i]. For each occurrence T of X, occurrence deliver inserts T to J[i] for each i ∈ T. When the algorithm generates the recursive call with respect to X[|E| − 2], the recursive calls with respect to X[|E| − 1] and X[|E|] have already terminated, and J[|E| − 1] and J[|E|] have been cleared. The recursive call for X[|E| − 2] uses only J[|E| − 1] and J[|E|], and hence the algorithm re-uses them in the recursive call.
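The single scan is easy to picture in code. The sketch below (a Python illustration with list-based occurrence sets; the paper's implementation is in C over arrays) delivers all occurrence lists J[i] of one parent in one pass over T(X).

    # Occurrence deliver (illustration): build J[i] = T(X ∪ {i}) for every i > i(X)
    # in a single scan of T(X), instead of one scan per candidate item i.

    from collections import defaultdict

    def occurrence_deliver(occ_X, core_i):
        """occ_X is T(X) (a list of item sets); core_i is i(X)."""
        J = defaultdict(list)
        for t in occ_X:              # one pass over the current occurrence set
            for i in t:
                if i > core_i:       # only items that can extend X are delivered
                    J[i].append(t)   # t is an occurrence of X ∪ {i}
        return J                     # |J[i]| is the support of X ∪ {i}

    if __name__ == "__main__":
        occ_X = [frozenset(t) for t in [{1, 2, 5}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]]
        J = occurrence_deliver(occ_X, core_i=2)
        print({i: len(ts) for i, ts in sorted(J.items())})   # {3: 2, 5: 4}

Each occurrence is touched once per item it contains, so all children are delivered in time proportional to the total size of T(X), rather than with |E| separate scans.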
Right-first sweep.  The occurrence deliver method needs eager computation of the occurrence sets J[i] = T(X[i]) for all children before expanding any one of them. A simple implementation may therefore require much more memory than the ordinary lazy computation of T(X[i]) as in [17]. However, we can reduce the memory usage with a method called the right-first sweep, as follows. Given a parent X, we make the recursive calls for X[i] in decreasing order of i = |E|, ..., i(X) (see Fig. 2). At each call for X[i], we collect the memory allocated before for J[i + 1], ..., J[|E|] and re-use it for J[i]. After the call for X[i] terminates, the memory for J[i] is released for future use by J[j] with j < i. Since |J[i]| = |T(X[i])| ≤ |T({i})| for any i and X, the total memory Σ_i |J[i]| is bounded by the input size ||T|| = Σ_{T∈T} |T|, and thus it suffices to allocate the memory for J once, as a global variable.

Diffsets.  In the case that |T(X[i])| is nearly equal to |T(X)|, occurrence deliver is less efficient than the diffsets proposed in [18]. Instead of the occurrence set T(X[i]) itself, the diffset technique maintains DJ[i] = T(X) \ T(X[i]) for the recursive call with respect to X[i]. The diffsets of the children are obtained for each j > i by setting DJ[j] to DJ[j] \ DJ[i], in time O(Σ_{i>i(X), X[i]∈F} (|T(X)| − |T(X[i])|)). Diffsets are needed only for the items i such that X[i] is frequent. By diffsets, the computation time for instances such as connect, chess, and pumsb, in which |T(X[i])| is as large as |T(X)|, is reduced to 1/100.

Hybrid Computation.  As we saw in the preceding subsections, occurrence deliver is fast when |T(X[i])| is much smaller than |T(X)|, while the diffsets of [18] are fast when |T(X[i])| is close to |T(X)|. Therefore, LCM dynamically decides which of occurrence deliver and diffsets to use. To do this, we compare two quantities on X:
A(X) = Σ_i |T(X ∪ {i})|  and
B(X) = Σ_{i: X∪{i}∈F} (|T(X)| − |T(X ∪ {i})|).
For some fixed constant α > 1, we use occurrence deliver if A(X) < αB(X), and the diffsets otherwise. We make this decision only at the child iterations of the root set ⊥. In some instances, the hybrid reduces the computation time up to 1/3. The hybrid technique is also useful in reducing the memory used by diffsets, as follows. Although the memory B(X) used by diffsets is not bounded by the input size ||T|| in the worst case, the hybrid ensures that B(X) does not exceed A(X) ≤ ||T||, because the diffsets are chosen only when A(X) ≥ αB(X).

Checking the closedness in occurrence deliver.  Another key point is to efficiently check the closedness X[i] = I(T(X[i])) (cond2). The straightforward computation of the closure I(T(X[i])) takes much time, since it requires access to the whole sets T(X[j]), j < i, and i is usually as large as |E|. By definition, (cond2) is violated iff there exists some j ∈ {1, ..., i − 1} such that j ∈ T for every T ∈ T(X ∪ {i}). We first choose a transaction T* ∈ T(X ∪ {i}) of minimum size, and test whether j ∈ T for increasing j ∈ T*. This results in an algorithm running in O(Σ_{j∈T*} m(X[i], j)) time, where m(X', j) is the maximum index m such that all of the first m transactions of T(X') include j; this is much faster than the straightforward algorithm. To test whether j ∈ T quickly for large transactions, we make columns of an adjacency matrix only for the transactions of size larger than (Σ_{T∈T} |T|)/δ, where δ is a constant. This uses at most O(δ × Σ_{T∈T} |T|) memory, which is linear in the input size.

Checking the closedness in diffsets.  In the case that |T(X[i])| is nearly equal to |T(X)|, the above check cannot be done in a short time. In this case, we keep the diffsets DJ[j] for all j < i such that X[j] is frequent. Maintaining DJ for all such j is a heavy task, so we discard unnecessary DJ's as follows. If I(T(X ∪ {j})) includes an item that is included in no transaction of T(X[i']), i' > i(X), then for any descendant X' of X, j is not included in I(T(X'[j'])) for any j' > i(X'). Hence, we no longer have to keep DJ[j] for such j. Let NC(X) be the set of items j such that X[j] is frequent and every transaction of T(X) \ T(X ∪ {j}) is included in some T(X[j']), j' > i(X). Then, the computation time for checking (cond2) can be written as a sum taken only over the items j ∈ NC(X) with j < i.
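The diffsets and the two estimators can be illustrated in a few lines. In the Python sketch below (an illustration only; the threshold ratio plays the role of the constant α > 1 of the hybrid rule, and its value here is an arbitrary assumption), the diffsets of the children are derived from T(X), and A(X) and B(X) decide which representation to use.

    # Diffsets and the hybrid estimators (illustration only).
    # DJ[i] = T(X) \ T(X ∪ {i}) replaces the occurrence list when children lose
    # only a few occurrences; A(X) and B(X) estimate the two costs.

    def child_diffsets(occ_X, core_i):
        """DJ[i] = T(X) \\ T(X ∪ {i}) for every item i > i(X) occurring in T(X)."""
        items = {i for t in occ_X for i in t if i > core_i}
        return {i: [t for t in occ_X if i not in t] for i in items}

    def choose_method(occ_X, core_i, min_sup, ratio=2.0):
        """Follow the rule: occurrence deliver if A(X) < ratio * B(X), else diffsets."""
        items = {i for t in occ_X for i in t if i > core_i}
        support = {i: sum(1 for t in occ_X if i in t) for i in items}
        A = sum(support.values())                                           # Σ_i |T(X ∪ {i})|
        B = sum(len(occ_X) - s for s in support.values() if s >= min_sup)   # Σ over frequent X ∪ {i}
        return "occurrence deliver" if A < ratio * B else "diffsets"

    if __name__ == "__main__":
        occ_X = [frozenset(t) for t in [{1, 2, 5}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]]
        print({i: len(d) for i, d in sorted(child_diffsets(occ_X, 2).items())})   # {3: 2, 5: 0}
        print(choose_method(occ_X, core_i=2, min_sup=2))                          # 'diffsets'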
The iteration of LCM that uses occurrence deliver is summarized as follows; LCM_Iter2 is the corresponding iteration that maintains the diffsets DJ instead of the occurrence lists.

LCM_Iter ( X, T(X), i(X) )    /* occurrence deliver */
1. Output X
2. For each t ∈ T(X) do
3.   For each j ∈ t, j > i(X), insert t to J[j]
4. For each j with J[j] ≠ ∅, in decreasing order of j, do
5.   If |J[j]| ≥ α and (cond2) holds then call LCM_Iter( I(J[j]), J[j], j )
6.   Delete J[j]
7. End for

LCM_Iter2 ( X, T(X), i(X), DJ )    /* the same iteration using the diffsets DJ */

4. Enumerating Maximal Frequent Item Sets

Since every maximal frequent item set is closed, it suffices to check the maximality of each frequent closed item set found by LCM. LCMmax is obtained from LCM by adding this explicit check: a frequent closed item set X is maximal if and only if no X ∪ {j} is frequent. The check does not increase the complexity of an iteration, but it does increase the computation time. In the case of occurrence deliver, we generate T(X ∪ {j}) for all j in the same way as occurrence deliver, and check the maximality. This takes O(Σ_{j>i(X)} |T(X[j])| + Σ_{j>i(X), X[j]∈F} Σ_{j'∈T*(X)} m(X[j], j')) time, or, when the diffsets are used, O(Σ_{i>i(X), X[i]∈F} (|T(X)| − |T(X[i])|)) time plus the cost of checking (cond2) over NC(X), for each frequent closed item set X, with memory linear in the input size.

5. Enumerating All Frequent Item Sets

The techniques described above are also useful for enumerating all frequent item sets. LCMfreq is based on a backtracking algorithm for enumerating frequent item sets which adds items one by one in lexicographic order. Suppose that we currently have a frequent item set X and find another frequent item set X ∪ {i}. Let S = X[i]. Then, by Lemma 2, for any frequent item set X' including X and not intersecting S \ X, any item set including X' and included in X' ∪ S is also frequent. Conversely, any frequent item set including X is generated from some X' not intersecting S \ X. Hence, we enumerate only representatives including X and not intersecting S \ X, and generate the other frequent item sets by adding to each representative each subset of S \ X. This method can be considered as "decomposing" a closed item set class into several sublattices (hypercubes), each of whose maximal and minimal elements are X' ∪ S and X', respectively (see Fig. 3). We call this technique hypercube decomposition.

Figure 3: Hypercube decomposition: the technique decomposes a closed item set class into several sublattices (gray rectangles).

Suppose that we are currently processing a representative X' including X, and are about to generate a recursive call with respect to X' ∪ {j}. Then, if (X'[j] \ X') \ S ≠ ∅, X' and S ∪ (X'[j] \ X') satisfy the condition of Lemma 2. Hence, we add X'[j] \ X' to S. We describe LCMfreq as follows.

Algorithm LCMfreq ( X : representative, S : item set, i : item )
1. Output all item sets R with X ⊆ R ⊆ X ∪ S
2. For each j > i, j ∉ X ∪ S, do
3.   If X ∪ {j} is frequent then call LCMfreq( X ∪ {j}, S ∪ (X[j] \ (X ∪ {j})), j )
4. End for

For some synthetic instances in which the frequent closed item sets are much fewer than the frequent item sets, the average size of S is up to 5. In these cases, the algorithm finds 2^|S| = 32 frequent item sets at once, and hence the computation time is reduced considerably by this improvement.
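The expansion of one representative is the only step of LCMfreq that has no counterpart in LCM, and it is easy to state concretely. The Python sketch below (an illustration, not the authors' code; the example representative and S are hypothetical) generates the 2^|S \ X| item sets that a representative X stands for, as in step 1 of the pseudocode above.

    # Hypercube decomposition (illustration): a representative X together with the
    # item set S stands for every item set R with X ⊆ R ⊆ X ∪ S; by Lemma 2 all of
    # them are frequent, so they are output without being visited one by one.

    from itertools import combinations

    def expand_representative(X, S):
        """Generate all item sets R with X ⊆ R ⊆ X ∪ S (step 1 of LCMfreq)."""
        X = frozenset(X)
        free = sorted(set(S) - X)
        for k in range(len(free) + 1):
            for extra in combinations(free, k):
                yield X | frozenset(extra)

    if __name__ == "__main__":
        # Hypothetical example: a representative {2} with S = {2, 5, 7} stands for
        # 2^2 = 4 frequent item sets.
        for R in expand_representative({2}, {2, 5, 7}):
            print(sorted(R))    # [2], [2, 5], [2, 7], [2, 5, 7]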
To check the frequency of all the X ∪ {j}, we can use the occurrence deliver and diffsets used for LCM. LCMfreq does not require the check of (cond2), hence the computation time of each iteration is O(Σ_{j>i(X)} |T(X[j])|) with occurrence deliver, and O(Σ_{j>i(X), X[j]∈F} |T(X) \ T(X[j])|) with diffsets. Since the computation times change, we use different estimators for the hybrid. In almost all cases, once Σ_{j>i(X), X[j]∈F} |T(X) \ T(X[j])| becomes smaller than Σ_{j>i(X)} |T(X[j])|, the same condition holds in every iteration generated by the recursive calls; hence, once it becomes smaller, we change to diffsets. Note that these estimators can be computed in a short time by using the result of occurrence deliver.

Theorem 4  LCMfreq enumerates all frequent item sets of F in O(Σ_{j>i(X)} |T(X[j])|) time or O(Σ_{j>i(X), X[j]∈F} |T(X) \ T(X[j])|) time for each frequent set X, within O(Σ_{T∈T} |T|) space.

In particular, LCMfreq requires one integer for each item of each transaction, which is required just to store the input data. The other memory LCMfreq uses is bounded by O(|T| + |E|). Experimentally, an iteration of LCMfreq with input frequent set X takes O(|T(X)| + |X|) or O((size of diffsets) + |X|) steps on average. In some sense, this is optimal, since we have to take O(|X|) time for the output, and O(|T(X)|) or O((size of diffsets)) time to check the frequency of X.

6. Implementation

In this section, we explain our implementation. First, we explain the data structures of our algorithm. A transaction T of the input data is stored in an array of length |T|. Each cell of the array stores the index of an item of T. For example, t = {4, 2, 7} is stored in an array with 3 cells, [2, 4, 7]. We sort the elements of the array so that we can take {i, ..., |E|} ∩ T in time linear in |{i, ..., |E|} ∩ T|. J is also stored in arrays in the same way. We do not need doubly linked lists or binary trees, which take much more time to operate. To reduce the practical computation time, we sort the transactions by their sizes, and the items by the number of transactions including them. Experimentally, this reduces Σ_{j>i(X)} |T(X ∪ {j})|; in some cases, the computation time has been reduced by a factor of 1/3.
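The array layout can be mirrored in a few lines. The sketch below (a Python illustration of the idea; the actual implementation is the C code described above) stores a transaction as a sorted array, so that {i, ..., |E|} ∩ T is a suffix of the array located by a single binary search.

    # Storing a transaction as a sorted array (illustration of the layout above):
    # the set {i, ..., |E|} ∩ T is then a contiguous suffix of the array.

    from bisect import bisect_left

    def store(transaction):
        """t = {4, 2, 7} is stored as the sorted array [2, 4, 7]."""
        return sorted(transaction)

    def suffix_from(stored, i):
        """Return {i, ..., |E|} ∩ T as a slice of the stored array."""
        return stored[bisect_left(stored, i):]

    if __name__ == "__main__":
        t = store({4, 2, 7})
        print(t)                   # [2, 4, 7]
        print(suffix_from(t, 3))   # [4, 7]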
7. Computational Experiments

To examine the practical efficiency of our algorithms, we ran experiments on real and synthetic datasets, which are made available on the FIMI'03 site. In the following, we report the results of the experiments.

Table 1: The datasets. AvTrSz means the average transaction size.

Dataset          #items  #Trans   AvTrSz  #FI           #FCI          #MFI          Minsup (%)
BMS-Web-View1    497     59,602   2.51    3.9K-NA       3.9K-1241K    2.1K-129.4K   0.1-0.01
BMS-Web-View2    3,340   77,512   4.62    24K-9897K     23K-755K      3.9K-118K     0.1-0.01
BMS-POS          1,657   517,255  6.5     122K-33400K   122K-21885K   30K-4280K     0.1-0.01
T10I4D100K       1,000   100,000  10.0    15K-335K      14K-229K      7.9K-114K     0.15-0.025
T40I10D100K      1,000   100,000  39.6    -             -             -             2-0.5
pumsb            7,117   49,046   74.0    -             -             -             95-60
pumsb_star       7,117   49,046   50.0    -             -             -             50-10
mushroom         120     8,124    23.0    -             -             -             20-0.1
connect          130     67,577   43.0    -             -             -             95-40
chess            76      3,196    37.0    -             -             -             90-30

7.1 Datasets and Methods

We implemented our three algorithms LCMfreq, LCM, and LCMmax in C, compiled with gcc 3.2. The algorithms were tested on the datasets shown in Table 1, available from the FIMI'03 homepage (1), which include: T10I4D100K and T40I10D100K from the IBM Almaden Quest research group; chess, connect, mushroom, pumsb, and pumsb_star from the UCI ML repository (2) and PUMSB; and BMS-WebView-1, BMS-WebView-2, and BMS-POS from KDD-CUP 2000 (3).

We compare our algorithms LCMfreq, LCM, and LCMmax with the following frequent item set mining algorithms: the implementations of FP-growth [7], Eclat [17], and Apriori [1, 2] by Bart Goethals (4). We also compare the LCM algorithms with the implementation of Mafia [6], a fast maximal frequent pattern miner, by the University of Cornell's Database group (5). The versions of Mafia with the frequent item set, frequent closed item set, and maximal frequent item set options are denoted by mafia-fi, mafia-fci, and mafia-mfi, respectively. Although we had also planned to make a performance comparison with Charm, the state-of-the-art frequent closed item set miner, we gave up the comparison this time due to the time constraint.

All experiments were run on a PC with a Pentium4 2.8GHz CPU, 1GB of memory, and a 7200 RPM hard disk of 180GB. In the experiments, LCMfreq, LCM, and LCMmax use at most 123MB, 300MB, and 300MB of memory, respectively. Note that LCM and LCMmax can save memory by decreasing δ.

(1) http://fimi.cs.helsinki.fi/testdata.html
(2) http://www.ics.uci.edu/~mlearn/MLRepository.html
(3) http://www.ecn.purdue.edu/KDDCUP/
(4) http://www.cs.helsinki.fi/u/goethals/software/
(5) University of Cornell Database group, Himalaya Data Mining Tools, http://himalaya-tools.sourceforge.net/

7.2 Results

Figures 6 through 12 show the running time with varying minimum supports for the seven algorithms, namely LCMfreq, LCM, LCMmax, FP-growth, eclat, apriori, and mafia-mfi, on the nine datasets described in the previous subsection. In the following, we refer to all, maximal, and closed frequent item set mining simply as all, maximal, and closed.

Results on Synthetic Data

Figure 4 shows the running time with the minimum support ranging from 0.15% to 0.025% on the IBM-Artificial T10I4D100K dataset. From this plot, we see that most algorithms finish within a few tens of minutes, and their behaviors are quite similar as the minimum support increases. In Figure 4, LCMmax, LCM, and LCMfreq are all about twice as fast as FP-growth on the T10I4D100K dataset. On the other hand, Mafia-mfi, Mafia-fci, and Mafia-fi are slower than every other algorithm. In Figure 5, Mafia-mfi is the fastest for maximal, and LCMfreq is the fastest for all, for minimum supports less than 1% on the T40I10D100K dataset.

Figure 4: Running time of the algorithms on IBM-Artificial T10I4D100K.
Figure 5: Running time of the algorithms on IBM-Artificial T40I10D100K.

Results on KDD-CUP datasets

Figures 6 through 8 show the running time with the minimum support ranging from 0.1% to 0.01% on the three real world datasets BMS-WebView-1, BMS-WebView-2, and BMS-POS. In these figures, we can observe that the LCM algorithms outperform the others in almost all cases, especially for lower minimum supports. In particular, LCM was the best among the seven algorithms in a wide range of minimum supports from 0.1% to 0.01% on all three datasets.

Figure 6: Running time of the algorithms on BMS-WebView-1.
Figure 7: Running time of the algorithms on BMS-WebView-2.
Figure 8: Running time of the algorithms on BMS-POS.

For higher minimum supports, ranging from 0.1% to 0.06%, the performances of all algorithms are similar, and the LCM family has slightly better performance. For lower minimum supports, ranging from 0.04% to 0.01%, Eclat and Apriori are much slower than every other algorithm.
LCM outperforms the others. Some frequent item set miners, such as Mafia-fi and Mafia-fci, run out of the 1GB of main memory at these minimum supports on the BMS-WebView-1, BMS-WebView-2, and BMS-POS datasets. LCMfreq works quite well for higher minimum supports, but takes more than 30 minutes for minimum supports below 0.04% on BMS-Web-View-1; in these cases, the number of frequent item sets is quite large, over 100,000,000,000. Interestingly, Mafia-mfi's performance is stable over the whole range of minimum supports from 0.1% to 0.01%.

In summary, the LCM family algorithms perform significantly well on the real world datasets BMS-WebView-1, BMS-WebView-2, and BMS-POS.

Results on UCI-ML repository and PUMSB datasets

Figures 9 through 12 show the running time on the middle sized datasets pumsb, pumsb*, and kosarak, and the small sized datasets connect, chess, and mushroom. These datasets, taken from machine learning domains, are small but hard datasets for the frequent pattern mining task, since they have many frequent patterns even with high minimum supports, e.g., from 50% to 90%. These datasets were originally built for classification tasks and have slightly different characteristics from large business datasets such as BMS-Web-View-1 or BMS-POS.

Figure 9: Running time of the algorithms on pumsb.
Figure 10: Running time of the algorithms on pumsb*.
Figure 11: Running time of the algorithms on kosarak.
Figure 12: Running time of the algorithms on mushroom.
Figure 13: Running time of the algorithms on connect.
Figure 14: Running time of the algorithms on chess.

In these figures, we see that Mafia-mfi constantly outperforms every other maximal frequent item set mining algorithm for a wide range of minimum supports, except on pumsb*. On the other hand, Apriori is much slower than the other algorithms. On the mining of all frequent item sets, LCMfreq is faster than the other algorithms. On the mining of frequent closed item sets, there seems to be no consistent tendency in the performance results. However, LCM does not store the obtained solutions in memory, while the other algorithms do. Thus, in the sense of memory saving, LCM has an advantage.

8. Conclusion

In this paper, we presented an efficient algorithm LCM for mining frequent closed item sets, based on a parent-child relationship defined on frequent closed item sets. The technique is taken from the algorithms for enumerating maximal bipartite cliques [14, 15] based on reverse search [3]. In theory, we demonstrated that LCM exactly enumerates the set of frequent closed item sets within polynomial time per closed item set in the total input size. In practice, we showed by experiments that our algorithms run fast on several real world datasets such as BMS-WebView-1. We also presented the variants LCMfreq and LCMmax of LCM for computing all and maximal frequent item sets. LCMfreq uses the new schemes hybrid and hypercube decomposition, and these schemes work well for many problems.

Acknowledgement

We gratefully thank Prof. Ken Satoh of National Institute of Informatics. This research was supported by a group research fund of National Institute of Informatics, JAPAN.

References

[1] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules in Large Databases," In Proc. VLDB '94, pp. 487-499, 1994.

[2] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen and A. I. Verkamo, "Fast Discovery of Association Rules," In Advances in Knowledge Discovery and Data Mining, MIT Press, pp. 307-328, 1996.
[3] D. Avis and K. Fukuda, "Reverse Search for Enumeration," Discrete Applied Mathematics, Vol. 65, pp. 21-46, 1996.

[4] R. J. Bayardo Jr., "Efficiently Mining Long Patterns from Databases," In Proc. SIGMOD '98, pp. 85-93, 1998.

[5] E. Boros, V. Gurvich, L. Khachiyan, and K. Makino, "On the Complexity of Generating Maximal Frequent and Minimal Infrequent Sets," In Proc. STACS 2002, pp. 133-141, 2002.

[6] D. Burdick, M. Calimlim, J. Gehrke, "MAFIA: A Maximal Frequent Itemset Algorithm for Transactional Databases," In Proc. ICDE 2001, pp. 443-452, 2001.

[7] J. Han, J. Pei, Y. Yin, "Mining Frequent Patterns without Candidate Generation," In Proc. SIGMOD '00, pp. 1-12, 2000.

[8] R. Kohavi, C. E. Brodley, B. Frasca, L. Mason and Z. Zheng, "KDD-Cup 2000 Organizers' Report: Peeling the Onion," SIGKDD Explorations, 2(2), pp. 86-98, 2000.

[9] H. Mannila, H. Toivonen, "Multiple Uses of Frequent Sets and Condensed Representations," In Proc. KDD '96, pp. 189-194, 1996.

[10] N. Pasquier, Y. Bastide, R. Taouil, L. Lakhal, "Discovering Frequent Closed Itemsets for Association Rules," In Proc. ICDT '99, pp. 398-416, 1999.

[11] J. Pei, J. Han, R. Mao, "CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets," In Proc. ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery 2000, pp. 21-30, 2000.

[12] B. Possas, N. Ziviani, W. Meira Jr., B. A. Ribeiro-Neto, "Set-Based Model: A New Approach for Information Retrieval," In Proc. SIGIR '02, pp. 230-237, 2002.

[13] S. Tsukiyama, M. Ide, H. Ariyoshi and I. Shirakawa, "A New Algorithm for Generating All the Maximal Independent Sets," SIAM Journal on Computing, Vol. 6, pp. 505-517, 1977.

[14] T. Uno, "A Practical Fast Algorithm for Enumerating Cliques in Huge Bipartite Graphs and Its Implementation," 89th Special Interest Group of Algorithms, Information Processing Society of Japan, 2003.

[15] T. Uno, "Fast Algorithms for Enumerating Cliques in Huge Graphs," Research Group of Computation, IEICE, Kyoto University, pp. 55-62, 2003.

[16] T. Uno, "A New Approach for Speeding Up Enumeration Algorithms," In Proc. ISAAC '98, pp. 287-296, 1998.

[17] M. J. Zaki, "Scalable Algorithms for Association Mining," Knowledge and Data Engineering, 12(2), pp. 372-390, 2000.

[18] M. J. Zaki, C. Hsiao, "CHARM: An Efficient Algorithm for Closed Itemset Mining," In Proc. SDM '02, SIAM, pp. 457-473, 2002.

[19] Z. Zheng, R. Kohavi and L. Mason, "Real World Performance of Association Rule Algorithms," In Proc. SIGKDD '01, pp. 401-406, 2001.