<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Detailed Description of an Algorithm for Enumeration of Maximal Frequent Sets with Irredundant Dualization</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Takeaki Uno</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ken Satoh</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Institute of Informatics</institution>
          ,
          <addr-line>2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2000</year>
      </pub-date>
      <volume>978</volume>
      <issue>1996</issue>
      <fpage>21</fpage>
      <lpage>30</lpage>
      <abstract>
        <p>We describe an implementation of an algorithm for enumerating all maximal frequent sets using irredundant dualization, which is an improved version of the algorithm of Gunopulos et al. The algorithm of Gunopulos et al. solves many dualization problems and takes a long computation time. We interleave dualization with the main algorithm, and reduce the computation time for dualization to that of a single dualization. This also reduces the space complexity. Moreover, we accelerate the computation by using the sparseness of the input.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Let E be an item set and T be a set of transactions defined on E. For an item set S ⊆ E, we denote the set of transactions including S by X(S), and define the frequency of S as |X(S)|. For a given constant α, an item set S is said to be frequent if |X(S)| ≥ α. A frequent item set included in no other frequent item set is said to be maximal. An item set that is not frequent is called infrequent, and an infrequent item set including no other infrequent item set is said to be minimal.</p>
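      <p>As a toy illustration of these definitions, the frequent, maximal frequent, and minimal infrequent sets of a small example can be enumerated by brute force. The data and helper names below are hypothetical, not from the paper; this is only a sketch of the definitions, not the paper's algorithm.</p>

```python
from itertools import combinations

# Hypothetical toy data: item set E and transactions T.
E = {1, 2, 3, 4}
T = [{1, 2, 3}, {1, 2}, {2, 3, 4}, {1, 2, 3}]
ALPHA = 2  # frequency threshold α

def frequency(S):
    """|X(S)|: number of transactions including S."""
    return sum(1 for t in T if S.issubset(t))

def is_frequent(S):
    return frequency(S) >= ALPHA

# Brute force over all subsets of E (fine for |E| = 4).
subsets = [frozenset(c) for r in range(len(E) + 1)
           for c in combinations(sorted(E), r)]
frequent = [S for S in subsets if is_frequent(S)]
infrequent = [S for S in subsets if not is_frequent(S)]

# Maximal frequent sets: included in no other frequent set.
maximal = [S for S in frequent
           if not any(S != F and S.issubset(F) for F in frequent)]
# Minimal infrequent sets: including no other infrequent set.
minimal_infreq = [S for S in infrequent
                  if not any(I != S and I.issubset(S) for I in infrequent)]
```

      <p>On this toy data, Bd+ = {{1, 2, 3}} and Bd− = {{4}}.</p>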
      <p>This paper describes in detail an implementation of the algorithm for enumerating all maximal frequent sets using dualization presented in [SatohUno03].</p>
      <p>The algorithm is an improved version of that of
Gunopulos et al. [Gunopulos97a, Gunopulos97b].</p>
      <p>The algorithm computes maximal frequent sets by computing minimal transversals of a hypergraph, that is, minimal hitting sets, or, in other words, by the dualization of a monotone function [Fredman96]. By dualization, the algorithm finds all minimal item sets not included in any currently obtained maximal frequent set. If one of these minimal item sets is frequent, the algorithm finds a new maximal frequent set including it. In this way, the algorithm avoids checking all frequent item sets. However, the algorithm solves dualization problems many times, hence it is not fast for practical purposes. Moreover, it uses the dualization algorithm of Fredman and Khachiyan [Fredman96], which is said to be slow in practice.</p>
      <p>We improved the algorithm in [SatohUno03] by using the incremental dualization algorithms proposed by Kavvadias and Stavropoulos [Kavvadias99] and by Uno [Uno02]. We developed an algorithm that interleaves dualization with finding maximal frequent sets. Roughly speaking, our algorithm solves one dualization problem of size |Bd+|, where Bd+ is the set of maximal frequent sets, while the algorithm of Gunopulos et al. solves |Bd+| dualization problems of sizes from 1 through |Bd+|. This reduces the computation time to roughly 1/|Bd+| of the original.</p>
      <p>To reduce the computation time further, we used Uno’s dualization algorithm [Uno02]. The experimental computation time of Uno’s algorithm is linear in the number of outputs, and O(|E|) per output, while that of Kavvadias and Stavropoulos seems to be O(|E|²) per output. This reduces the computation time to roughly 1/|E| of the original. Moreover, we add an improvement based on the sparseness of the input. By this, the experimental computation time per output is reduced to O(ave(Bd+)), where ave(Bd+) is the average size of the maximal frequent sets. In summary, we reduced the computation time to roughly ave(Bd+) / (|Bd+| × |E|²) of that of the combination of the algorithm of Gunopulos et al. and the algorithm of Kavvadias and Stavropoulos.</p>
      <p>In the following sections, we describe our algorithm and the computational results. Section 2 describes the algorithm of Gunopulos et al., and Section 3 describes our algorithm and Uno’s algorithm. Section 4 explains our improvement using sparseness. Computational experiments on the FIMI’03 instances are shown in Section 5, and we conclude the paper in Section 6.</p>
      <p>Figure 1: Dualize and Advance [Gunopulos97a]
1. Bd+ := {go_up(∅)}
2. Compute MHS(B̄d+).
3. If no set in MHS(B̄d+) is frequent, output MHS(B̄d+).
4. If there exists a frequent set S in MHS(B̄d+), set Bd+ := Bd+ ∪ {go_up(S)} and go to 2.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Enumerating maximal frequent sets by dualization</title>
      <p>In this section, we describe the algorithm of Gunopulos et al. Explanations can also be found in [Gunopulos97a, Gunopulos97b, SatohUno03]; however, those are written in general terms. Here we explain the algorithm in terms of frequent set mining.</p>
      <p>Let Bd− be the set of minimal infrequent sets. For a family H of subsets of E, a hitting set HS of H is a set such that S ∩ HS ≠ ∅ for every S ∈ H. If a hitting set includes no other hitting set, it is said to be minimal. We denote the set of all minimal hitting sets of H by MHS(H). We denote the complement of a subset S w.r.t. E by S̄, and for a family H we denote {S̄ | S ∈ H} by H̄.</p>
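      <p>These hitting-set notions can be sketched the same way by brute force. The helper names below are hypothetical; this is illustration only, not an efficient dualization method.</p>

```python
from itertools import combinations

def is_hitting_set(hs, H):
    """hs hits every S in H, i.e. S ∩ hs ≠ ∅ for all S."""
    return all(S.intersection(hs) for S in H)

def MHS(H, E):
    """All minimal hitting sets of H, by brute force over subsets of E."""
    hitting = [frozenset(c) for r in range(len(E) + 1)
               for c in combinations(sorted(E), r)
               if is_hitting_set(set(c), H)]
    return [hs for hs in hitting
            if not any(h != hs and h.issubset(hs) for h in hitting)]

def complement_family(H, E):
    """H̄ = {S̄ | S ∈ H}, complements taken w.r.t. E."""
    return [frozenset(E.difference(S)) for S in H]
```

      <p>For example, MHS([{1, 2}, {2, 3}, {3, 4}], {1, 2, 3, 4}) consists of {1, 3}, {2, 3}, and {2, 4}.</p>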
      <p>There is a strong connection between the maximal
frequent sets and the minimal infrequent sets by the
minimal hitting set operation.</p>
      <p>Proposition 1 [Mannila96] Bd− = MHS(B̄d+)</p>
      <p>Using the following proposition, Gunopulos et al.
proposed an algorithm called Dualize and Advance
shown in Fig. 1 to compute the maximal frequent
sets [Gunopulos97a].</p>
      <p>Proposition 2 [Gunopulos97a] Let Bd′+ ⊆ Bd+. Then, for every S ∈ MHS(B̄d′+), either S ∈ Bd− or S is frequent (but not both).</p>
      <p>In the above algorithm, go_up(S) for a subset S of E is a maximal frequent set, computed as follows.</p>
      <p>1. Select an element e from S̄ and check the frequency of S ∪ {e}.
2. If it is frequent, set S := S ∪ {e} and go to 1.
3. Otherwise, if there is no element e in S̄ such that S ∪ {e} is frequent, return S.</p>
      <p>Proposition 3 [Gunopulos97a] The number of frequency checks in the “Dualize and Advance” algorithm to compute Bd+ is at most |Bd+| · |Bd−| + |Bd+| · |E|².</p>
      <p>Basically, the algorithm of Gunopulos et al. solves dualization problems of sizes from 1 through |Bd+|. Although we can terminate a dualization when we find a new maximal frequent set, we may check each minimal infrequent item set again and again. This is one of the reasons that the algorithm of Gunopulos et al. is not fast in practice. In the next section, we propose a new algorithm obtained by interleaving go_up into a dualization algorithm. The algorithm basically solves one dualization problem of size |Bd+|.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Description of our algorithm</title>
      <p>The key lemma of our algorithm is the following.</p>
      <p>Lemma 1 [SatohUno03] Let Bd1+ and Bd2+ be subsets of Bd+. If Bd1+ ⊆ Bd2+, then MHS(B̄d1+) ∩ Bd− ⊆ MHS(B̄d2+) ∩ Bd−.</p>
      <p>Suppose that we have already found the minimal hitting sets of B̄d′+ for a subset Bd′+ of the maximal frequent sets. The above lemma means that if we add a maximal frequent set to Bd′+, any minimal hitting set we have found that corresponds to a minimal infrequent set is still a minimal infrequent set. Therefore, if we can use an algorithm that visits each minimal hitting set based on an incremental addition of the maximal frequent sets one by one, we no longer have to check the same minimal hitting set again even when maximal frequent sets are newly found. The dualization algorithms proposed by Kavvadias and Stavropoulos [Kavvadias99] and Uno [Uno02] are such algorithms. Using these algorithms, we reduce the number of checks.</p>
      <p>Let us describe Uno’s algorithm [Uno02], an improved version of Kavvadias and Stavropoulos’s algorithm [Kavvadias99]. First we introduce some notation. A set S ∈ H is called critical for e ∈ hs if S ∩ hs = {e}. We denote the family of critical sets for e w.r.t. hs and H by crit(e, hs). Note that mhs is a minimal hitting set of H if and only if crit(e, mhs) is non-empty for every e ∈ mhs. Suppose that H = {S1, ..., Sm}, and let MHSi be MHS({S1, ..., Si}) (1 ≤ i ≤ m). We simply denote MHS(H) by MHS. A hitting set hs for {S1, ..., Si} is minimal if and only if crit(e, hs) ∩ {S1, ..., Si} ≠ ∅ for every e ∈ hs.</p>
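      <p>A small sketch of crit and this minimality test (brute force, hypothetical helper names, illustration only):</p>

```python
def crit(e, hs, edges):
    """Sets of `edges` critical for e w.r.t. hs: those S with S ∩ hs = {e}."""
    return [S for S in edges if S.intersection(hs) == {e}]

def is_minimal_hitting_set(hs, edges):
    """hs hits every edge, and every e in hs has a critical set."""
    hits_all = all(S.intersection(hs) for S in edges)
    return hits_all and all(crit(e, hs, edges) for e in hs)
```

      <p>For the edges {1, 2}, {2, 3}, {3, 4}: the set {2, 3} is a minimal hitting set, while {2, 3, 4} is a hitting set that is not minimal (the element 4 has no critical set).</p>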
      <p>Lemma 2 [Uno02] For any mhs ∈ MHSi (1 ≤ i ≤ m), there exists exactly one minimal hitting set mhs′ ∈ MHSi−1 satisfying one of the following conditions (but not both):
• mhs′ = mhs,
• mhs′ = mhs \ {e} where crit(e, mhs) ∩ {S1, ..., Si} = {Si}.</p>
      <p>We call mhs′ the parent of mhs, and mhs a child of mhs′. Since the parent-child relationship is not cyclic, its graph representation forms a forest in which each connected component is a tree rooted at a minimal hitting set of MHS1. We consider the trees as traversal routes defined over all minimal hitting sets of all MHSi. These traversal routes can be traced in a depth-first manner by generating the children of the currently visited minimal hitting set, hence we can enumerate all minimal hitting sets of MHS in time linear in Σi |MHSi|. Although Σi |MHSi| can be exponential in |MHS|, such cases are expected to be exceptional in practice. Experimentally, Σi |MHSi| is linear in |MHS|.</p>
      <p>To find the children of a minimal hitting set, we use the following proposition, which follows immediately from the above lemma.</p>
      <sec id="sec-3-1">
        <title>Proposition 4 [Uno02]</title>
        <p>Any child mhs′ of mhs ∈ MHSi satisfies one of the following conditions:
(1) mhs′ = mhs,
(2) mhs′ = mhs ∪ {e}.
In particular, no mhs has both a child satisfying (1) and a child satisfying (2).</p>
        <p>If mhs ∩ Si+1 ≠ ∅ then mhs ∈ MHSi+1 and (1) holds. If mhs ∩ Si+1 = ∅, then mhs ∉ MHSi+1, and (2) can hold for some e ∈ Si+1. If mhs′ = mhs ∪ {e} is a child of mhs, then for every e′ ∈ mhs there is an Sj ∈ crit(e′, mhs), j ≤ i, such that e ∉ Sj. From these observations, we obtain the algorithm described in Fig. 2.</p>
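      <p>The parent-child traversal can be sketched as a recursive enumerator. This is a simplified reading of the scheme above, not the paper’s optimized implementation: crit is recomputed from scratch instead of being updated incrementally.</p>

```python
def enumerate_mhs(edges):
    """Enumerate all minimal hitting sets of `edges` depth-first along the
    parent-child tree: at level i, a child either equals its parent (when
    the parent already hits S_{i+1}) or adds one element of S_{i+1}."""
    results = []

    def rec(i, mhs):
        if i == len(edges):
            results.append(frozenset(mhs))
            return
        if edges[i].intersection(mhs):      # condition (1): mhs stays minimal
            rec(i + 1, mhs)
            return
        for e in sorted(edges[i]):          # condition (2): try mhs ∪ {e}
            # every e2 in mhs must keep a critical set S (S ∩ mhs = {e2})
            # with e not in S, so that mhs ∪ {e} remains minimal
            ok = all(any(e not in S
                         for S in edges[:i] if S.intersection(mhs) == {e2})
                     for e2 in mhs)
            if ok:
                rec(i + 1, mhs.union({e}))

    rec(0, set())
    return results
```

      <p>By Lemma 2 each minimal hitting set has exactly one parent, so the traversal visits each one exactly once.</p>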
        <p>An iteration of the algorithm in Fig. 2 takes:
• O(|mhs|) time for line 1,
• O(|Si+1 ∪ mhs|) time for line 2,
• O((|E| − |mhs|) × Σe′∈mhs |crit(e′, mhs) ∩ {S1, ..., Si}|) time for lines 3 to 5, except for the computation of crit.</p>
        <p>To compute crit quickly, we store crit(e, mhs) in memory and update it when we generate a recursive call. Note that this takes O(m) memory. Since crit(e′, mhs ∪ {e}) is obtained from crit(e′, mhs) by removing the sets including e (i.e., crit(e′, mhs ∪ {e}) = {S | S ∈ crit(e′, mhs), e ∉ S}), crit(e′, mhs ∪ {e}) for all e′ can be computed in O(m) time. Hence the computation time of an iteration is bounded by O(|E| × m).</p>
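      <p>This incremental update can be sketched with a plain dict (a hypothetical shape: crit_map maps each e′ ∈ mhs to its list of critical sets):</p>

```python
def update_crit(crit_map, e):
    """crit(e2, mhs ∪ {e}) = {S in crit(e2, mhs) : e not in S}:
    adding e to mhs removes, for each e2, the critical sets containing e."""
    return {e2: [S for S in sets_ if e not in S]
            for e2, sets_ in crit_map.items()}
```

      <p>The total work is proportional to the number of stored sets, i.e. O(m), matching the bound above.</p>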
        <p>Based on this dualization algorithm, we developed an enumeration algorithm for the maximal frequent sets.</p>
        <p>First, the algorithm sets the input H of the dualization problem to the empty set. Then the algorithm solves the dualization in the same way as the above algorithm. When a minimal hitting set mhs is found, the algorithm checks its frequency. If mhs is frequent, the algorithm finds a maximal frequent set S including it, and adds S̄ to H as a new element. Now mhs is no longer a minimal hitting set, since S̄ ∩ mhs = ∅. The algorithm then continues generating recursive calls to find the minimal hitting sets of the updated H. In the case that mhs is infrequent, by Lemma 1, mhs continues to be a minimal hitting set even when H is updated. Hence, we backtrack and find other minimal hitting sets.</p>
        <p>Figure 3: Irredundant Border Enumerator
global integer bdpnum; sets bd0+, bd1+, ...;
main()
begin
  bdpnum := 0;
  construct bdp(0, ∅);
  output all the bdj+ (0 ≤ j &lt; bdpnum);
end
construct bdp(i, mhs)
begin
  if i == bdpnum then begin /* mhs is a minimal hitting set of {b̄d0+, ..., b̄d(bdpnum−1)+} */
    if mhs is not frequent, return; /* new Bd− element is found */
    bd(bdpnum)+ := go_up(mhs); /* new Bd+ element is found */
    bdpnum := bdpnum + 1;
  end
  if b̄di+ ∩ mhs ≠ ∅ then construct bdp(i + 1, mhs)
  else
    for every e ∈ b̄di+ do
      if mhs ∪ {e} is a minimal hitting set of {b̄d0+, ..., b̄di+} then construct bdp(i + 1, mhs ∪ {e});
  return;
end</p>
        <p>When the algorithm terminates, H̄ is the set of maximal frequent sets, and the set of all minimal hitting sets the algorithm found is the set of minimal infrequent sets. The recursion tree the algorithm generates is a subtree of the recursion tree obtained by Uno’s dualization algorithm on input B̄d+, the set of complements of the maximal frequent sets.</p>
        <p>This algorithm is described in Fig. 3. We call the
algorithm Irredundant Border Enumerator (IBE
algorithm, for short).</p>
        <p>Theorem 1 The computation time of IBE is O(Dual(B̄d+) + |Bd+| · g), where Dual(B̄d+) is the computation time of Uno’s algorithm for dualizing B̄d+, and g is the computation time of go_up.</p>
        <p>Note also that the space complexity of the IBE algorithm is O(ΣS∈Bd+ |S|), since all we need to memorize is Bd+; once a set in Bd− is checked, it no longer needs to be recorded. On the other hand, Gunopulos et al. [Gunopulos97a] suggest the use of Fredman and Khachiyan’s algorithm [Fredman96], which needs O(ΣS∈(Bd+∪Bd−) |S|) space, since that algorithm needs both Bd+ and Bd− at the last stage.</p>
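        <p>Putting the pieces together, a compact sketch of the IBE idea on toy data (hypothetical helpers, in Python rather than the paper’s C): H holds the complements of the maximal frequent sets found so far, and go_up is invoked whenever a frequent minimal hitting set appears.</p>

```python
# Hypothetical toy data.
E = {1, 2, 3, 4}
T = [{1, 2, 3}, {1, 2}, {2, 3, 4}, {1, 2, 3}]
ALPHA = 2

def is_frequent(S):
    return sum(1 for t in T if set(S).issubset(t)) >= ALPHA

def go_up(S):
    S = set(S)
    while True:
        for e in sorted(E.difference(S)):
            if is_frequent(S.union({e})):
                S.add(e)
                break
        else:
            return frozenset(S)

def ibe():
    H = []                      # complements of maximal frequent sets found so far
    bd_pos, bd_neg = [], []     # Bd+ and Bd−

    def rec(i, mhs):
        if i == len(H):                       # mhs is an MHS of the current H
            if not is_frequent(mhs):
                bd_neg.append(frozenset(mhs)) # new Bd− element: backtrack
                return
            bd_pos.append(go_up(mhs))         # new Bd+ element
            H.append(frozenset(E.difference(bd_pos[-1])))
            # fall through: keep searching against the enlarged H
        if H[i].intersection(mhs):
            rec(i + 1, mhs)
            return
        for e in sorted(H[i]):
            if all(any(e not in S
                       for S in H[:i] if S.intersection(mhs) == {e2})
                   for e2 in mhs):
                rec(i + 1, mhs.union({e}))

    rec(0, set())
    return bd_pos, bd_neg
```

        <p>On this toy data, the run finds Bd+ = {{1, 2, 3}} and Bd− = {{4}}; growing H inside the recursion is exactly the interleaving of dualization with go_up described above.</p>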
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Using sparseness</title>
      <p>In this section, we speed up the dualization phase of our algorithm by using the sparseness of H. In real data, the sizes of the maximal frequent sets are usually small; they are often bounded by a constant. We use this sparse structure to accelerate the algorithm.</p>
      <p>Figure 4: the improved dualization algorithm using sparseness
global S1, ..., Sm;
compute mhs(i, mhs) /* mhs is a minimal hitting set of {S1, ..., Si} */
begin
1 if uncov(mhs) = ∅ then output mhs and return;
2 i := minimum index of uncov(mhs);
3 for every e′ ∈ mhs do
4   increase the counter of each item in ∪S∈crit(e′,mhs) S̄ by one
5 for every e ∈ Si s.t. its counter is increased by |mhs| do /* items included in ∪S∈crit(e′,mhs) S̄ for every e′ ∈ mhs */
6   compute mhs(i + 1, mhs ∪ {e});
  return;
end</p>
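      <p>The counting trick in lines 3 to 5 can be sketched as follows (hypothetical helper names; crit_map maps each e′ ∈ mhs to its critical sets, and complements are taken w.r.t. E):</p>

```python
from collections import Counter

def valid_additions(mhs, crit_map, E):
    """Items e outside mhs lying in the union of the complements of the
    critical sets of every e2 in mhs, found by counting: bump a counter
    for each item in that union, once per e2; items whose counter reaches
    |mhs| qualify as additions that keep mhs ∪ {e} minimal."""
    counter = Counter()
    for e2 in mhs:
        union_of_complements = set()
        for S in crit_map[e2]:
            union_of_complements.update(E.difference(S))
        counter.update(union_of_complements)
    return {e for e in E.difference(mhs) if counter[e] == len(mhs)}
```

      <p>In the IBE setting the complements S̄ are maximal frequent sets, so the unions above are expected to be small.</p>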
      <p>First, we consider a way to reduce the computation time of an iteration. Let us look at the algorithm described in Fig. 2. The bottleneck of an iteration of the algorithm is lines 3 to 5, which check the existence of a critical set Sj ∈ crit(e′, mhs), j &lt; i, such that e ∉ Sj.</p>
      <p>To check this condition for an item e ∉ mhs, we spend O(Σe′∈mhs |crit(e′, mhs)|) time, hence this check for all e ∉ mhs takes O((|E| − |mhs|) × Σe′∈mhs |crit(e′, mhs)|) time.</p>
      <p>Instead of this, we compute ∪S∈crit(e′,mhs) S̄ for each e′ ∈ mhs. An item e satisfies the condition of the “if” at line 4 if and only if e ∈ ∪S∈crit(e′,mhs) S̄ for every e′ ∈ mhs. To compute ∪S∈crit(e′,mhs) S̄ for all e′ ∈ mhs, we take O(Σe′∈mhs ΣS∈crit(e′,mhs) |S̄|) time. In the case of the IBE algorithm, each S̄ is a maximal frequent set, hence the average size of |S̄| is expected to be small.</p>
      <p>The sizes of minimal infrequent sets are not greater
than the maximum size of maximal frequent sets, and
they are usually smaller than the average size of the
maximal frequent sets. Hence, |mhs| is also expected
to be small.</p>
      <p>Second, we reduce the number of iterations. For mhs ⊆ E, we define uncov(mhs) as the set of S ∈ H satisfying S ∩ mhs = ∅. If mhs ∩ Si ≠ ∅, the iteration inputting mhs and i does nothing but generate a recursive call with i increased by one. This type of iteration should be skipped; only the iterations executing lines 3 to 5 are crucial. Hence, in each iteration, we set i to the minimum index in uncov(mhs). As a result, we need not execute line 2, and the number of iterations is reduced from Σi |MHSi| to |∪i MHSi|. We describe the improved algorithm in Fig. 4.</p>
      <p>In our implementation, when we generate a recursive call, we allocate memory for each variable used in that call. Hence, the memory required by the algorithm can be up to O(|E| × m). However, experimentally, the required memory is always linear in the input size. Note that the worst-case memory complexity could be reduced by a more sophisticated algorithm.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Experiments</title>
      <p>In this section, we show the results of computational experiments with our algorithm. We implemented our algorithm in the C programming language and examined the instances of FIMI2003. For the instances of KDD-cup 2000 [KDDcup00], we compared our results with the computational experiments on CHARM [Zaki02], closed [Pei00], FP-growth [Han00], and Apriori [Agrawal96] shown in [Zheng01]. The experiments in [Zheng01] were done on a PC with a Duron 550MHz CPU and 1GB of RAM. Our experiments were done on a PC with a Pentium III 500MHz CPU and 256MB of RAM, which is a little slower than a Duron 550MHz CPU. The results are shown in Figs. 4 – 14. Note that our algorithm uses at most 170MB of memory on all of the following instances. We also show the numbers of frequent sets, frequent closed/maximal item sets, and minimal infrequent sets.</p>
      <p>In our experiments, the IBE algorithm takes approximately O(|Bd−| × ave(Bd+)) time, while the computation time of the other algorithms depends deeply on the number of frequent sets, the number of frequent closed item sets, and the minimum support. Recall that ave(Bd+) is the average size of the maximal frequent sets. On some instances, our IBE algorithm performs rather well compared to the other algorithms. In these cases, the number of maximal frequent item sets is smaller than the number of frequent item sets. The IBE algorithm seems to give good performance on difficult problems in which the number of maximal frequent sets is much smaller than the numbers of frequent item sets and frequent closed item sets.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, we described in detail the implementation of our algorithm proposed in [SatohUno03], and gave some experimental results on test data.</p>
      <sec id="sec-6-1">
        <title>Acknowledgments</title>
        <p>We are grateful to Heikki Mannila for useful discussions about this research.</p>
        <p>[Figures: running-time (sec) plots of IBE on the test instances; maximum memory use: 74MB]</p>
        <p>pumsb: #items 7117, #transactions 49046, ave. size of transaction 74
support 45000 44000 43000 42000</p>
        <p>IBE 301 582 1069 1840
#freq. sets 1163 2993 7044 15757
#closed sets 685 1655 3582 7013
#max. freq. sets 144 288 541 932
#min. infreq. sets 7482 7737 8402 9468</p>
        <p>maximum use of memory: 70MB
pumsb star: #items 7117, #transactions 49046, ave. size of transaction 50
support 30000 25000 20000 15000 10000 5000</p>
        <p>IBE 8 19 59 161 556 2947
#freq. sets 165 627 21334 356945 &gt;2G
#closed sets 66 221 2314 14274 111849
#max. freq. sets 4 17 81 315 1666 15683
#min. infreq. sets 7143 7355 8020 9635 19087 98938</p>
        <p>maximum use of memory: 44MB
kosarak: #items 41217, #transactions 990002, ave. size of transaction 8
support 3000 2500 2000 1500 1000</p>
        <p>IBE 226 294 528 759 2101
#freq. sets 4894 8561 34483 219725 711424
#closed sets 4865 8503 31604 157393 496675
#max. freq. sets 792 1146 2858 4204 16231
#min. infreq. sets 87974 120591 200195 406287 875391</p>
        <p>maximum use of memory: 170MB
mushroom: #items 120, #transactions 8124, ave. size of transaction 23
support 30 20 10 5 2</p>
        <p>IBE 132 231 365 475 433
#freq. sets 505205917 781458545 1662769667 &gt;2G &gt;2G
#closed sets 91122 109304 145482 181243 230585
#max. freq. sets 15232 21396 30809 34131 27299
#min. infreq. sets 66085 79481 81746 69945 31880
maximum use of memory 47MB
connect: #items 130, #transactions 67577, ave. size of transaction 43
support 63000 60000 57000 54000 51000 48000 45000</p>
        <p>IBE 229 391 640 893 1154 1381 1643
#freq. sets 6327 41143 171239 541911 1436863 -
#closed sets 1566 4372 9041 15210 23329 -
#max. freq. sets 152 269 464 671 913 1166 1466
#min. infreq. sets 297 486 703 980 1291 1622 1969
maximum use of memory: 60MB
chess: #items 76, #transactions 3196, ave. size of transaction 37
support 2200 1900 1600 1300 1000</p>
        <p>IBE 19 61 176 555 2191
#freq. sets 59181 278734 1261227 5764922 29442848
#closed sets 28358 106125 366529 1247700 4445373
#max. freq. sets 1047 3673 11209 35417 114382
#min. infreq. sets 1725 5202 14969 46727 152317
maximum use of memory: 50MB</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [Agrawal96]
          <string-name>
            <surname>Agrawal</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mannila</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Srikant</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toivonen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Verkamo</surname>
            ,
            <given-names>A. I.</given-names>
          </string-name>
          , “Fast Discovery of Association Rules”,
          <string-name>
            <given-names>U. M.</given-names>
            <surname>Fayyad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Piatetsky-Shapiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Smyth</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Uthurusamy</surname>
          </string-name>
          , (eds),
          <source>Advances in Knowledge Discovery and Data Mining</source>
          , chapter
          <volume>12</volume>
          , pp.
          <fpage>307</fpage>
          -
          <lpage>328</lpage>
          (
          <year>1996</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [Fredman96]
          <string-name>
            <surname>Fredman</surname>
            ,
            <given-names>M. L.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Khachiyan</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , “
          <article-title>On the Complexity of Dualization of Monotone Disjunctive Normal Forms”</article-title>
          ,
          <source>Journal of Algorithms</source>
          <volume>21</volume>
          (
          <issue>3</issue>
          ), pp.
          <fpage>618</fpage>
          -
          <lpage>628</lpage>
          (
          <year>1996</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Gunopulos97a]
          <string-name>
            <surname>Gunopulos</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khardon</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mannila</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Toivonen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <article-title>“Data mining, Hypergraph Transversals, and Machine Learning”</article-title>
          ,
          <source>Proc. of PODS'97</source>
          , pp.
          <fpage>209</fpage>
          -
          <lpage>216</lpage>
          (
          <year>1997</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [Gunopulos97b]
          <string-name>
            <surname>Gunopulos</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mannila</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Saluja</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , “
          <article-title>Discovering All Most Specific Sentences using Randomized Algorithms”</article-title>
          ,
          <source>Proc. of ICDT'97</source>
          , pp.
          <fpage>215</fpage>
          -
          <lpage>229</lpage>
          (
          <year>1997</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [Han00]
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pei</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yin</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <source>“Mining Frequent Patterns without Candidate Generation,” SIGMOD Conference</source>
          <year>2000</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          ,
          <year>2000</year>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [Kavvadias99]
          <string-name>
            <surname>Kavvadias</surname>
            ,
            <given-names>D. J.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Stavropoulos</surname>
            ,
            <given-names>E. C.</given-names>
          </string-name>
          , “
          <article-title>Evaluation of an Algorithm for the Transversal Hypergraph Problem”</article-title>
          , Algorithm Engineering, pp
          <fpage>72</fpage>
          -
          <lpage>84</lpage>
          (
          <year>1999</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [KDDcup00]
          <string-name>
            <surname>Kohavi</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brodley</surname>
            ,
            <given-names>C. E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frasca</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mason</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <source>“KDD-Cup 2000 Organizers' Report: Peeling the Onion,” SIGKDD Explorations</source>
          ,
          <volume>2</volume>
          (
          <issue>2</issue>
          ), pp.
          <fpage>86</fpage>
          -
          <lpage>98</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>