=Paper= {{Paper |id=Vol-90/paper-9 |storemode=property |title=LCM: An Efficient Algorithm for Enumerating Frequent Closed Item Sets |pdfUrl=https://ceur-ws.org/Vol-90/uno.pdf |volume=Vol-90 |dblpUrl=https://dblp.org/rec/conf/fimi/UnoAUA03 }}
                           LCM: An Efficient Algorithm for
                        Enumerating Frequent Closed Item Sets

               Takeaki Uno (1), Tatsuya Asai (2), Yuzo Uchida (2), Hiroki Arimura (2)

           (1) National Institute of Informatics
               2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan
               e-mail: uno@nii.jp
           (2) Department of Informatics, Kyushu University
               6-10-1 Hakozaki, Fukuoka 812-0053, Japan
               e-mail: {t-asai, y-uchida, arim}@i.kyushu-u.ac.jp


Abstract:

In this paper, we propose three algorithms, LCMfreq, LCM, and LCMmax, for mining all frequent item sets, frequent closed item sets, and maximal frequent item sets, respectively, from transaction databases. The main theoretical contribution is that we construct tree-shaped transversal routes composed only of frequent closed item sets, induced by a parent-child relationship defined on frequent closed item sets. By traversing these routes in a depth-first manner, LCM finds all frequent closed item sets in polynomial time per item set, without storing previously obtained closed item sets in memory. Moreover, we introduce several algorithmic techniques that exploit the sparse and dense structures of the input data. Algorithms for enumerating all frequent item sets and maximal frequent item sets are obtained from LCM as its variants. In computational experiments on real-world and synthetic databases comparing their performance with the previous algorithms, we found that our algorithms are fast on large real-world datasets with natural distributions, such as the KDD-Cup 2000 datasets, and on many other synthetic databases.

1. Introduction

Frequent item set mining is one of the fundamental problems in data mining and has many applications such as association rule mining [1], inductive databases [9], and query expansion [12].

Let E be the universe of items, consisting of the items 1, ..., n. A subset X of E is called an item set. T is a set of transactions over E, i.e., each T ∈ T is composed of items of E. For an item set X, let T(X) = { t ∈ T | X ⊆ t } be the set of transactions including X. Each transaction of T(X) is called an occurrence of X. For a given constant α ≥ 0, an item set X is called frequent if |T(X)| ≥ α. If a frequent item set is included in no other frequent item set, it is said to be maximal. For a transaction set S ⊆ T, let I(S) = ∩_{T∈S} T. If an item set X satisfies I(T(X)) = X, then X is called a closed item set. We denote by F and C the sets of all frequent item sets and all frequent closed item sets, respectively.

In this paper, we propose an efficient algorithm, LCM, for enumerating all frequent closed item sets. LCM is an abbreviation of Linear time Closed item set Miner. Existing algorithms for this task basically enumerate frequent item sets while cutting off unnecessary frequent item sets by pruning. However, the pruning is not complete, hence those algorithms still operate on unnecessary frequent item sets and do extra work. In LCM, we define a parent-child relationship between frequent closed item sets. The relationship induces tree-shaped transversal routes composed only of the frequent closed item sets. Our algorithm traverses these routes, and hence takes time linear in the number of frequent closed item sets. The algorithm is obtained from the algorithms for enumerating maximal bipartite cliques [14, 15], which are designed based on the reverse search technique [3, 16].

In addition to the search tree technique for closed item sets, we use several techniques to speed up the update of the occurrences of item sets. One technique is occurrence deliver, which simultaneously computes the occurrence sets of all the successors of the current item set during a single scan of the current occurrence set. The other is the diffsets proposed in [18]. Since there is a trade-off between these two methods (the former is fast for sparse data while the latter is fast for dense data), we developed a hybrid algorithm combining them. In some iterations, we make
a decision based on an estimate of their computation time, hence our algorithm can use the appropriate one for the dense and sparse parts of the input.

We also consider the problems of enumerating all frequent sets and maximal frequent sets, and derive two algorithms, LCMfreq and LCMmax, from LCM. LCMmax is obtained from LCM by adding an explicit check of maximality. LCMfreq is not merely LCM without the check of closedness; it also achieves substantial speed-up using closed item set discovery techniques, because it enumerates only the representatives of groups of frequent item sets, and generates the other frequent item sets from the representatives.

From computer experiments on real and artificial datasets with the previous algorithms, we observed that our algorithms LCMfreq, LCM, and LCMmax significantly outperform the previous algorithms on real-world datasets with natural distributions, such as the BMS-Web-View-1 and BMS-POS datasets in the KDD-CUP 2000 datasets, as well as on large synthetic datasets such as IBM T10I4D100K. The performance of our algorithms is similar to that of the other algorithms on hard datasets such as the Connect and Chess datasets from the UCI Machine Learning Repository, and less significant than MAFIA; however, LCM works with smaller memory than the other algorithms.

The organization of the paper is as follows. In Section 2, we explain our tree enumeration method for frequent closed item sets and our algorithm LCM. In Section 3, we describe several algorithmic techniques for speeding up and saving memory. Then, Sections 4 and 5 give LCMmax and LCMfreq for maximal and all frequent item sets, respectively. Techniques for implementation are described in Section 6, and the results of computational experiments are reported in Section 7. Finally, we conclude in Section 8.

2. Enumerating Frequent Closed Item Sets

In this section, we introduce a parent-child relationship between frequent closed item sets in C, and describe our algorithm LCM for enumerating them.

Recent efficient algorithms for frequent item sets, e.g., [4, 17, 18], use a tree-shaped search structure for F, called the set enumeration tree [4], defined as follows. Let X = {x1, ..., xn} be an item set, regarded as an ordered sequence such that x1 < ... < xn, where the tail of X is tail(X) = xn ∈ E. Let X, Y be item sets. For an index i, X(i) = X ∩ {1, ..., i}. X is a prefix of Y if X = Y(i) holds for i = tail(X). Then, the parent-child relation P for the set enumeration tree for F is defined as X = P(Y) iff Y = X ∪ {i} for some i > tail(X), or equivalently, X = Y \ {tail(Y)}. The whole search space for F thus forms a prefix tree (or trie) with this edge relation P.

Now, we define the parent-child relation P for closed item sets in C as follows. For X ∈ C, we define the parent of X by P(X) = I(T(X(i(X) − 1))), where i(X) is the minimum item i such that T(X) = T(X(i)) but T(X) ≠ T(X(i − 1)). If Y is the parent of X, we say X is a child of Y. Let ⊥ = I(T(∅)) be the smallest item set in C, called the root. For any X ∈ C \ {⊥}, its parent P(X) is always defined and belongs to C. An illustration is given in Fig. 1.

Figure 1: An example of the parent of X: the parent of X is obtained by deleting the items larger than i(X) (in the gray area) and taking the closure.

For any X ∈ C and its parent Y, the proper inclusion Y ⊂ X holds, since T(X) ⊂ T(X(i(X) − 1)). Thus, the relation P is acyclic, and its graph representation forms a tree. By traversing the tree in a depth-first manner, we can enumerate all the frequent closed item sets in time linear in the size of the tree, which is equal to the number of frequent closed item sets in C. In addition, we need not store the tree in memory. Starting from the root ⊥, we find a child X of the root, and go to X. In the same way, we go to a child of X. When we arrive at a leaf of the tree, we backtrack and find another child. Repeating this process, we eventually find all frequent closed item sets in C.

To find the children of the current frequent closed item set, we use the following lemma. For an item set X and an index i, let X[i] = X ∪ H, where H is the set of items j ∈ I(T(X ∪ {i})) satisfying j ≥ i.

Lemma 1 X′ is a child of X ∈ C (i.e., X′ ∈ C and the parent of X′ is X) if and only if
(cond1) X′ = X[i] for some i > i(X),
(cond2) X′ = I(T(X′)) (X′ is a closed item set), and
(cond3) X′ is frequent.
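The definitions above translate almost directly into code. The following Python sketch (ours, purely illustrative; the function names and database representation are assumptions, not the authors' implementation) computes T(X), the closure I(T(X)), and the parent P(X) = I(T(X(i(X) − 1))) for a toy database:

```python
# Toy sketch of the Section 2 definitions (illustrative only, not the authors' code).
# A database is a list of transactions; each transaction is a frozenset of items.

def occurrences(db, X):
    """T(X): the transactions that include the item set X."""
    return [t for t in db if X <= t]

def closure(db, X):
    """I(T(X)): the intersection of all occurrences of X."""
    occ = occurrences(db, X)
    if not occ:
        return frozenset()  # convention here for an item set with no occurrences
    result = set(occ[0])
    for t in occ[1:]:
        result &= t
    return frozenset(result)

def parent(db, X, root):
    """P(X) = I(T(X(i(X) - 1))), where i(X) is the minimum item i
    with T(X) = T(X(i)); returns None for the root."""
    if X == root:
        return None
    occ = occurrences(db, X)
    for i in sorted(X):
        prefix = frozenset(j for j in X if j <= i)
        if occurrences(db, prefix) == occ:  # the first such i is i(X)
            below = frozenset(j for j in X if j < i)  # X(i(X) - 1)
            return closure(db, below)
    return None
```

For instance, on the database {1,2}, {1,2,3}, {2,3}, the root is I(T(∅)) = {2}, and the closed set {1,2} has the root as its parent.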
Proof: Suppose that X′ = X[i] satisfies the conditions (cond1), (cond2), and (cond3). Then X′ ∈ C. Since T(X(i − 1)) = T(X) and T(X(i − 1) ∪ {i}) = T(X′) hold, i(X′) = i holds; hence X′ is a child of X. Conversely, suppose that X′ is a child of X. Then (cond2) and (cond3) hold. From the definition of i(X′), T(X(i(X′)) ∪ {i(X′)}) = T(X′) holds; hence X′ = X[i(X′)] holds. We also have i(X′) > i(X), since T(X′(i(X′) − 1)) = T(X). Hence (cond1) holds.

Clearly, T(X[i]) = T(X ∪ {i}) holds if (cond2) I(T(X[i])) = X[i] holds. Note that no child X′ of X satisfies X[i] = X[i′] for i ≠ i′, since the minimum items of X[i] \ X and X[i′] \ X are i and i′, respectively. Using this lemma, we construct the following algorithm scheme which, given a closed item set X, enumerates all its descendants in the search tree for closed item sets.

Algorithm LCM (X : frequent closed item set)
1. Output X
2. For each i > i(X) do
3.   If X[i] is frequent and X[i] = I(T(X[i])) then Call LCM( X[i] )
4. End for

Theorem 1 Let 0 < σ < 1 be a minimum support. Given the root closed item set ⊥ = I(T(∅)), algorithm LCM enumerates all frequent closed item sets in time linear in the number of frequent closed item sets in C.

The existing enumeration algorithms for frequent closed item sets are based on backtracking: they traverse a tree composed of all frequent item sets in F, and skip some item sets by pruning the tree. Since the pruning is not complete, however, these algorithms generate unnecessary frequent item sets. The algorithm in [10] directly generates only closed item sets with the closure operation I(T(·)), as ours does, but their method may generate duplicated closed item sets and needs an expensive duplication check.

On the other hand, our algorithm traverses a tree composed only of frequent closed item sets, and each iteration is not as heavy as in the previous algorithms. Hence, our algorithm runs fast in practice. If we consider our algorithm as a modification of the usual backtracking algorithm, each iteration re-orders the items larger than i(X) such that the items not included in X follow the items included in X. Note that the parent X is not a prefix of X[i] in a recursive call. The check of (cond2) can be considered as a pruning of non-closed item sets.

Figure 2: Occurrence deliver and right-first sweep. In the figure, J[i] is written as jQ[i]. For each occurrence T of X, occurrence deliver inserts T into J[i] for each i ∈ T. When the algorithm generates the recursive call with respect to X[|E| − 2], the recursive calls with respect to X[|E| − 1] and X[|E|] have terminated and J[|E| − 1] and J[|E|] have been cleared. The recursive call of X[|E| − 2] uses only J[|E| − 1] and J[|E|], and hence the algorithm re-uses them in the recursive call.

3. Reducing Computation Time

The computation time of LCM described in the previous section is linear in |C|, with a factor depending on T(X) for each closed item set X ∈ C. However, this still takes a long time if implemented in a straightforward way. In this section, we introduce some techniques based on the sparse and dense structures of the input data.

Occurrence deliver. First, we introduce a technique called occurrence deliver for reducing the construction time of T(X[i]), which is needed to check (cond3). This technique is particularly efficient when |T(X[i])| is much smaller than |T(X)|. In the usual approach, T(X[i]) is obtained from T(X) in O(|T(X)|) time by removing all transactions not including i, based on the equality T(X[i]) = T(X ∪ {i}) = T(X) ∩ T({i}) (this method is known as down-project). Thus, the total computation for all children takes |E| scans and O(|T(X)| · |E|) time.

Instead, we build the occurrence lists J[i] := T(X[i]) for all i = i(X), ..., |E| simultaneously, by scanning the transactions in T(X) once, as follows. We initialize J[i] = ∅ for all i = i(X), ..., |E|. For each T ∈ T(X) and for each i ∈ T with i > i(X), we insert T into J[i]. See Fig. 2 for an illustration, where we write jQ[i] for J[i]. This correctly computes J[i] for all i in total time O(Σ_{T∈T(X)} |T|). Furthermore, we need not make a recursive call of LCM for X[i] if T(X[i]) = ∅ (this is often called lookahead [4]). In our experiments on BMS instances, occurrence deliver reduces the computation time up to 1/10 in some cases.
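The single-scan construction of the occurrence lists can be sketched as follows; this is a minimal Python illustration of the idea, not the paper's C implementation, and the names are ours:

```python
from collections import defaultdict

def occurrence_deliver(occ_X, i_X):
    """Build the occurrence lists J[i] = T(X ∪ {i}) for all items i > i(X)
    in one scan of T(X), instead of |E| separate down-project scans.

    occ_X : the transactions in T(X), each a sorted tuple of item ids
    i_X   : the index i(X); only larger items are delivered
    """
    J = defaultdict(list)
    for t in occ_X:          # single scan over the current occurrence set
        for i in t:
            if i > i_X:
                J[i].append(t)
    return J
```

The total work is proportional to the total size of the transactions in T(X), and each J[i] comes out in the same relative order as T(X).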
Right-first sweep. The occurrence deliver method needs eager computation of the occurrence sets J[i] = T(X[i]) for all children before expanding any of them. A simple implementation may require more memory than the ordinary lazy computation of T(X[i]) as in [17]. However, we can reduce the memory usage with a method called the right-first sweep, as follows.

Given a parent X, we make the recursive call for X[i] in decreasing order, for each i = |E|, ..., i(X) (see Fig. 2). At each call for X[i], we collect the memory allocated before for J[i + 1], ..., J[|E|] and re-use it for J[i]. After the call for X[i] terminates, the memory for J[i] is released for future use in J[j], j < i. Since |J[i]| = |T(X[i])| ≤ |T({i})| for any i and X, the total memory Σ_i |J[i]| is bounded by the input size ||T|| = Σ_{T∈T} |T|, and thus it is sufficient to allocate the memory for J once, as a global variable.

Diffsets. In the case that |T(X[i])| is nearly equal to |T(X)|, we use the diffsets proposed in [18]: the diffset DJ[j] is updated for j > i by setting DJ[j] to DJ[j] \ DJ[i], in time O(Σ_{i>i(X),X[i]∈F} (|T(X)| − |T(X[i])|)). Diffsets are needed only for the items i such that X[i] is frequent. With diffsets, the computation time for instances such as connect, chess, and pumsb, in which |T(X[i])| is as large as |T(X)|, is reduced to 1/100.

Hybrid computation. As we saw in the preceding subsections, our occurrence deliver is fast when |T(X[i])| is much smaller than |T(X)|, while the diffset of [18] is fast when |T(X[i])| is close to |T(X)|. Therefore, our LCM dynamically decides which of occurrence deliver and diffsets to use. To do this, we compare two quantities on X:
  A(X) = Σ_i |T(X ∪ {i})| and
  B(X) = Σ_{i:X∪{i}∈F} (|T(X)| − |T(X ∪ {i})|).
For some fixed constant α > 1, we decide to use occurrence deliver if A(X) < αB(X), and the diffset otherwise. We make this decision only at the child iterations of the root set ⊥, since the decision itself takes time; in our experiments it reduces the computation time up to 1/3. The hybrid technique is also useful in reducing the memory space for diffsets, as follows. Although the memory B(X) used by diffsets is not bounded by the input size ||T|| in the worst case, the hybrid ensures that B(X) does not exceed A(X) ≤ ||T||, because the diffset is chosen only when A(X) ≥ αB(X).

Checking the closedness in occurrence deliver. Another key point is to efficiently check the closedness X[i] = I(T(X[i])) (cond2). The straightforward computation of the closure I(T(X[i])) takes much time, since it requires access to all the sets T(X[j]), j < i, and i is usually as large as |E|.

By definition, (cond2) is violated iff there exists some j ∈ {1, ..., i − 1} such that j ∈ T for every T ∈ T(X ∪ {i}). We first choose a transaction T* ∈ T(X ∪ {i}) of minimum size, and test whether j ∈ T for increasing j ∈ T*. This results in an algorithm running in O(Σ_{j∈T*} m(X[i], j)) time, where m(X′, j) is the maximum index m such that all of the first m transactions of T(X′) include j; this is much faster than the straightforward algorithm. For dense instances, we make columns of the adjacency matrix only for the transactions of size larger than (Σ_{T∈T} |T|)/δ, where δ is a constant. This uses at most O(δ × Σ_{T∈T} |T|) memory, linear in the input size.

Checking the closedness in diffsets. In the case that |T(X[i])| is nearly equal to |T(X)|, the above check is not done in a short time. In this case, we keep the diffset DJ[j] for all j < i such that X[i] is frequent. Maintaining DJ for all i is a heavy task, thus we discard unnecessary DJ's as follows. If T(X ∪ {j}) includes an item included in no T(X[i′]), i′ > i(X), then for any descendant X′ of X, j ∉ I(T(X′[j′])) for any j′ > i(X′). Hence, we no longer have to keep DJ[j] for such j. Let NC(X) be the set of items j such that X[j] is frequent and any item of T(X) \ T(X ∪ {j}) is included in some T(X[j′]), j′ > i(X). Then, the computation time for checking (cond2) is written as O(Σ_{j∈NC(X),j<i} |T(X) \ T(X[j])|).

The iteration of LCM with occurrence deliver is written as follows.

LCM Iter( X, T(X), i(X) ) /* occurrence deliver */
...
3. For each t ∈ T(X) and each j ∈ t with j > i(X), insert t to J[j]
4. For each j with J[j] ≠ ∅, in decreasing order
5.   If |J[j]| ≥ α and (cond2) holds then Call LCM Iter( T(J[j]), J[j], j )
6.   Delete J[j]
7. End for

LCM Iter2( X, T(X), i(X), DJ ) /* diffset */
...

Figure 3: Hypercube decomposition, which decomposes a closed item set class into several sublattices (gray rectangles).

... complexity, but increases the computation time. In the case of occurrence deliver, we generate T(X ∪ {j}) for all j in the same way as occurrence deliver, and check the maximality. This takes O(Σ_{j>i(X)} |T(X[j])| + Σ_{j>i(X),X[j]∈F} Σ_{j′∈T*(X)} m(X[j], j′)) time, or O(Σ_{i>i(X),X[i]∈F} (|T(X)| − |T(X[i])|) + Σ_{j∈NC(X),j<i} |T(X) \ T(X[j])|) time, for each frequent closed item set X, with memory linear in the input size.
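The hybrid rule above compares A(X) = Σ_i |T(X ∪ {i})| against αB(X), with B(X) = Σ_{i:X∪{i}∈F} (|T(X)| − |T(X ∪ {i})|). A minimal sketch of that decision follows; the function name and the default value of α are our assumptions, since the paper fixes only that α > 1:

```python
def choose_update_method(occ_sizes, occ_X_size, min_sup, alpha=2.0):
    """Decide between occurrence deliver (good for sparse data) and
    diffsets (good for dense data).

    occ_sizes  : dict mapping item i -> |T(X ∪ {i})| for the candidate items
    occ_X_size : |T(X)|
    min_sup    : the frequency threshold (support count)
    alpha      : trade-off constant > 1 (the value 2.0 is our assumption)
    """
    A = sum(occ_sizes.values())
    # B sums diffset sizes only over the frequent extensions X ∪ {i}
    B = sum(occ_X_size - s for s in occ_sizes.values() if s >= min_sup)
    return "occurrence_deliver" if A < alpha * B else "diffsets"
```

When the extensions have tiny occurrence sets (sparse case) A stays small and occurrence deliver wins; when they are nearly as large as T(X) (dense case) the diffsets are small and win instead.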
frequent item sets, which adds items one by one in lexicographical order. Suppose that we currently have a frequent item set X, and find another frequent item set X ∪ {i}. Let S = X[i]. Then, according to the above lemma, we can observe that for any frequent item set X′ including X and not intersecting S \ X, any item set including X′ and included in X′ ∪ S is also frequent. Conversely, any frequent item set including X is generated from such an X′ not intersecting S \ X. Hence, we enumerate only the representatives including X and not intersecting S \ X, and generate the other frequent item sets by adding each subset of S \ X. This method can be considered as "decomposing" the classes of closed item sets into several sublattices (hypercubes), whose maximal and minimal elements are S and X′, respectively (see Fig. 3). We call this technique hypercube decomposition.

Suppose that we are currently operating on a representative X′ including X, and are going to generate a recursive call with respect to X′ ∪ {j}. Then, if (X′[i] \ X′) \ S ≠ ∅, X′ and S ∪ (X′[i] \ X′) satisfy the condition of Lemma 2. Hence, we add X′[i] \ X′ to S.

We describe LCMfreq as follows.

Algorithm LCMfreq ( X : representative, S : item set, i : item )
1. Output all item sets R, X ⊆ R ⊆ X ∪ S
2. For each j > i, j ∉ X ∪ S do
3.   If X ∪ {j} is frequent then Call LCMfreq( X ∪ {j}, S ∪ (X[j] \ (X ∪ {j})), j )
4. End for

For some synthetic instances in which the frequent closed item sets are fewer than the frequent item sets, the average size of S is up to 5. In these cases, the algorithm finds 2^|S| = 32 frequent item sets at once, hence the computation time is much reduced by this improvement.

To check the frequency of all X ∪ {j}, we can use the occurrence deliver and diffsets used for LCM. LCMfreq does not require the check of (cond2), hence the computation time of each iteration is O(Σ_{j>i(X)} |T(X[j])|) for occurrence deliver, and O(Σ_{j>i(X),X[j]∈F} |T(X) \ T(X[j])|) for diffsets. Since the computation times change, we use other estimators for the hybrid. In almost all cases, once Σ_{j>i(X),X[j]∈F} |T(X) \ T(X[j])| becomes smaller than Σ_{j>i(X)} |T(X[j])|, the condition holds in any iteration generated by a recursive call; then we change to diffsets. Note that these estimators can be computed in a short time by using the result of occurrence deliver.

Theorem 4 LCMfreq enumerates all frequent sets of F in O(Σ_{j>i(X)} |T(X[j])|) time or O(Σ_{j>i(X),X[j]∈F} |T(X) \ T(X[j])|) time for each frequent set X, within O(Σ_{T∈T} |T|) space.

In particular, LCMfreq requires one integer for each item of each transaction, which is required to store the input data. The other memory LCMfreq uses is bounded by O(|T| + |E|).

Experimentally, an iteration of LCMfreq on a frequent set X takes O(|T(X)| + |X|) or O((size of diffsets) + |X|) steps on average. In some sense, this is optimal, since we have to take O(|X|) time to output X, and O(|T(X)|) or O((size of diffsets)) time to check the frequency of X.

6. Implementation

In this section, we explain our implementation. First, we explain the data structures of our algorithm. A transaction T of the input data is stored in an array of length |T|. Each cell of the array stores the index of an item of T. For example, t = {4, 2, 7} is stored in an array with 3 cells, [2, 4, 7]. We sort the elements of the array so that we can compute {i, ..., |E|} ∩ T in time linear in |{i, ..., |E|} ∩ T|. J is also stored in arrays in the same way. We need no doubly linked lists or binary trees, which take much time to operate on.

To reduce the practical computation time, we sort the transactions by their sizes, and the items by the number of transactions including them. Experimentally, this reduces Σ_{j>i(X)} |T(X ∪ {j})|. In some cases, the computation time is thereby reduced to 1/3.

7. Computational Experiments

To examine the practical efficiency of our algorithms, we ran experiments on real and synthetic datasets, which are made available on the FIMI'03 site. In the following, we report the results of the experiments.
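Step 1 of LCMfreq, the hypercube expansion, outputs every item set between X and X ∪ S at once, with no further frequency checks. It can be sketched as follows (an illustrative Python sketch under our naming, not the authors' C implementation):

```python
from itertools import combinations

def hypercube_expand(X, S):
    """Yield every item set R with X ⊆ R ⊆ X ∪ S (step 1 of LCMfreq).

    X : a representative frequent item set (a frozenset)
    S : an item set such that every set in the interval [X, X ∪ S] is
        frequent; all 2^|S \\ X| sets are therefore output directly.
    """
    free = sorted(S - X)
    for r in range(len(free) + 1):
        for extra in combinations(free, r):
            yield X | frozenset(extra)
```

For X = {1} and S \ X = {2, 3}, this yields the four sets {1}, {1,2}, {1,3}, {1,2,3}, matching the 2^|S\X| count used in the analysis above.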
                       Table 1: The datasets. AvTrSz means the average transaction size

                  Dataset #items #Trans AvTrSz     #FI         #FCI        #MFI   Minsup (%)
           BMS-Web-View1     497 59,602   2.51 3.9K–NA      3.9K–1241K 2.1K–129.4K 0.1–0.01
           BMS-Web-View2 3,340 77,512     4.62 24K–9897K     23K–755K   3.9K–118K  0.1–0.01
                BMS-POS 1,657 517,255      6.5 122K–33400K 122K–21885K 30K–4280K   0.1–0.01
              T10I4D100K 1,000 100,000    10.0 15K–335K      14K–229K   7.9K–114K 0.15–0.025
             T40I10D100K 1,000 100,000    39.6      -            -          -        2–0.5
                   pumsb 7,117 49,046     74.0      -            -          -        95–60
               pumsb star 7,117 49,046    50.0      -            -          -        50–10
                mushroom     120 8,124    23.0      -            -          -       20–0.1
                  connect    130 67,577   43.0      -            -          -        95–40
                    chess     76   3,196   37.0      -            -          -        90–30



7.1    Datasets and Methods

We implemented our three algorithms LCMfreq, LCM, and LCMmax in C, and compiled them with gcc 3.2.
   The algorithms were tested on the datasets shown in Table 1, available from the FIMI'03 homepage1, which include: T10I4D100K and T40I10D100K from the IBM Almaden Quest research group; chess, connect, mushroom, pumsb, and pumsb_star from the UCI ML repository2 and PUMSB; and BMS-WebView-1, BMS-WebView-2, and BMS-POS from KDD-CUP 20003.
   We compare our algorithms LCMfreq, LCM, and LCMmax with the following frequent item set mining algorithms: the implementations of FP-growth [7], Eclat [17], and Apriori [1, 2] by Bart Goethals4, and the implementation of Mafia [6], a fast maximal frequent pattern miner, by the Cornell University Database group5. The versions of Mafia with the frequent item set, frequent closed item set, and maximal frequent item set options are denoted by mafia-fi, mafia-fci, and mafia-mfi, respectively. Although we had also planned a performance comparison with CHARM [18], the state-of-the-art frequent closed item set miner, we gave up that comparison this time due to time constraints.
   All experiments were run on a PC with a Pentium 4 2.8GHz CPU, 1GB of memory, and a 180GB 7200rpm hard disk. In the experiments, LCMfreq, LCM, and LCMmax use at most 123MB, 300MB, and 300MB of memory, respectively. Note that LCM and LCMmax can reduce their memory use by decreasing δ.

7.2    Results

Figures 4 through 12 show the running times, with varying minimum support, of the seven algorithms, namely LCMfreq, LCM, LCMmax, FP-growth, Eclat, Apriori, and mafia-mfi, on the nine datasets described in the previous subsection. In the following, we refer to all, maximal, and closed frequent item set mining simply as all, maximal, and closed.

Results on Synthetic Data

Figure 4 shows the running time with minimum support ranging from 0.15% to 0.025% on the IBM-Artificial T10I4D100K dataset. From this plot, we see that most algorithms finish within around a few tens of minutes, and their behaviors are quite similar as the minimum support increases. In Figure 4, LCMmax, LCM, and LCMfreq are all twice as fast as FP-growth on the T10I4D100K dataset. On the other hand, mafia-mfi, mafia-fci, and mafia-fi are slower than every other algorithm. In Figure 5, mafia-mfi is the fastest for maximal, and LCMfreq is the fastest for all, for minimum support less than 1% on the IBM T40I10D100K dataset.

Results on KDD-CUP datasets

Figures 6 through 8 show the running time with minimum support ranging from 0.1% to 0.01% on the three real-world datasets BMS-WebView-1, BMS-WebView-2, and BMS-POS. In these figures, we can observe that the LCM algorithms outperform the others in almost all cases, especially for lower minimum support. In particular, LCM was the best among the seven algorithms over the whole range of minimum support from 0.1% to 0.01% on all three datasets.

  1 http://fimi.cs.helsinki.fi/testdata.html
  2 http://www.ics.uci.edu/~mlearn/MLRepository.html
  3 http://www.ecn.purdue.edu/KDDCUP/
  4 http://www.cs.helsinki.fi/u/goethals/software/
  5 Cornell University Database group, Himalaya Data Mining Tools, http://himalaya-tools.sourceforge.net/
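The measurements in Section 7.2 are wall-clock running times recorded at a decreasing sequence of minimum supports. A minimal harness of the kind used to produce such curves might look as follows; the `benchmark` function and the `miners` mapping are placeholders for illustration, not the actual codes benchmarked in the paper.

```python
import time

def benchmark(miners, database, minsups):
    """Run each miner at every minimum support; return seconds per run.

    miners:  dict mapping an algorithm name to a callable(database, minsup)
    minsups: minimum supports in decreasing order, e.g. [0.1, 0.05, 0.01] (%)
    """
    results = {name: [] for name in miners}
    for minsup in minsups:
        for name, mine in miners.items():
            start = time.perf_counter()
            mine(database, minsup)          # output is discarded; time only
            results[name].append(time.perf_counter() - start)
    return results
```

Plotting `results[name]` against `minsups` with a logarithmic time axis reproduces the shape of the running-time figures below.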

Figure 4: Running time of the algorithms on IBM-Artificial T10I4D100K

Figure 5: Running time of the algorithms on IBM-Artificial T40I10D100K
Figure 6: Running time of the algorithms on BMS-WebView-1

Figure 7: Running time of the algorithms on BMS-WebView-2

Figure 8: Running time of the algorithms on BMS-POS

   For higher minimum support, ranging from 0.1% to 0.06%, the performances of all algorithms are similar, with the LCM family slightly better. For lower minimum support, ranging from 0.04% to 0.01%, Eclat and Apriori are much slower than every other algorithm, and LCM outperforms the others. Some frequent item set miners, such as mafia-fi and mafia-fci, run out of the 1GB of main memory at these minimum supports on BMS-WebView-1, BMS-WebView-2, and BMS-POS. LCMfreq works quite well for higher minimum support, but takes more than 30 minutes for minimum support below 0.04% on BMS-WebView-1; in these cases, the number of frequent item sets is quite large, over 100,000,000,000. Interestingly, mafia-mfi's performance is stable over the wide range of minimum support from 0.1% to 0.01%.
   In summary, the LCM family of algorithms performs significantly well on the real-world datasets BMS-WebView-1, BMS-WebView-2, and BMS-POS.

Results on UCI-ML repository and PUMSB datasets

Figures 9 through 12 show the running time on the middle-sized datasets pumsb, pumsb*, and kosarak, and the small datasets connect, chess, and mushroom. These datasets, taken from machine learning domains, are small but hard for the frequent pattern mining task, since they have many frequent patterns even at high minimum supports, e.g., from 50% to 90%. They were originally built for classification tasks and have slightly different characteristics than large business datasets such

Figure 9: Running time of the algorithms on pumsb

Figure 10: Running time of the algorithms on pumsb*

Figure 11: Running time of the algorithms on kosarak

Figure 12: Running time of the algorithms on mushroom

as BMS-WebView-1 or BMS-POS.
   In these figures, we see that mafia-mfi constantly outperforms every other maximal frequent item set mining algorithm over a wide range of minimum supports, except on pumsb*. On the other hand, Apriori is much slower than the other algorithms. For mining all frequent item sets, LCMfreq is faster than the other algorithms. For mining frequent closed item sets, there seems to be no consistent tendency in the performance results. However, LCM does not store the obtained solutions in memory, while the other algorithms do; in the sense of memory saving, LCM therefore has an advantage.

Figure 13: Running time of the algorithms on connect

Figure 14: Running time of the algorithms on chess

8           Conclusion

In this paper, we presented an efficient algorithm, LCM, for mining frequent closed item sets, based on a parent-child relationship defined on frequent closed item sets. This technique is taken from algorithms for enumerating maximal bipartite cliques [14, 15] based on reverse search [3]. In theory, we demonstrated that LCM exactly enumerates the set of frequent closed item sets, in time polynomial per closed item set in the total input size. In practice, we showed by experiments that our algorithms run fast on several real-world datasets such as BMS-WebView-1. We also presented LCMfreq and LCMmax, variants of LCM for computing all and maximal frequent item sets, respectively. LCMfreq uses the new schemes of hybrid and hypercube decomposition, which work well on many problems.

Acknowledgement

We gratefully thank Prof. Ken Satoh of the National Institute of Informatics. This research was supported by a group research fund of the National Institute of Informatics, Japan.

References

 [1] R. Agrawal and R. Srikant, “Fast Algorithms for Mining Association Rules in Large Databases,” In Proc. VLDB ’94, pp. 487–499, 1994.

 [2] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen and A. I. Verkamo, “Fast Discovery of Association Rules,” In Advances in Knowledge Discovery and Data Mining, MIT Press, pp. 307–328, 1996.

 [3] D. Avis and K. Fukuda, “Reverse Search for Enumeration,” Discrete Applied Mathematics, Vol. 65, pp. 21–46, 1996.

 [4] R. J. Bayardo Jr., “Efficiently Mining Long Patterns from Databases,” In Proc. SIGMOD ’98, pp. 85–93, 1998.

 [5] E. Boros, V. Gurvich, L. Khachiyan, and K. Makino, “On the Complexity of Generating Maximal Frequent and Minimal Infrequent Sets,” In Proc. STACS 2002, pp. 133–141, 2002.

 [6] D. Burdick, M. Calimlim, J. Gehrke, “MAFIA: A Maximal Frequent Itemset Algorithm for Transactional Databases,” In Proc. ICDE 2001, pp. 443–452, 2001.

 [7] J. Han, J. Pei, Y. Yin, “Mining Frequent Patterns without Candidate Generation,” In Proc. SIGMOD ’00, pp. 1–12, 2000.

 [8] R. Kohavi, C. E. Brodley, B. Frasca, L. Mason and Z. Zheng, “KDD-Cup 2000 Organizers’ Report: Peeling the Onion,” SIGKDD Explorations, 2(2), pp. 86–98, 2000.

 [9] H. Mannila, H. Toivonen, “Multiple Uses of Frequent Sets and Condensed Representations,” In Proc. KDD ’96, pp. 189–194, 1996.

[10] N. Pasquier, Y. Bastide, R. Taouil, L. Lakhal, “Discovering Frequent Closed Itemsets for Association Rules,” In Proc. ICDT ’99, pp. 398–416, 1999.

[11] J. Pei, J. Han, R. Mao, “CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets,” In Proc. ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 21–30, 2000.

[12] B. Possas, N. Ziviani, W. Meira Jr., B. A. Ribeiro-Neto, “Set-based Model: A New Approach for Information Retrieval,” In Proc. SIGIR ’02, pp. 230–237, 2002.

[13] S. Tsukiyama, M. Ide, H. Ariyoshi and I. Shirakawa, “A New Algorithm for Generating All the Maximal Independent Sets,” SIAM Journal on Computing, Vol. 6, pp. 505–517, 1977.

[14] T. Uno, “A Practical Fast Algorithm for Enumerating Cliques in Huge Bipartite Graphs and Its Implementation,” 89th Special Interest Group of Algorithms, Information Processing Society of Japan, 2003.

[15] T. Uno, “Fast Algorithms for Enumerating Cliques in Huge Graphs,” Research Group of Computation, IEICE, Kyoto University, pp. 55–62, 2003.

[16] T. Uno, “A New Approach for Speeding Up Enumeration Algorithms,” In Proc. ISAAC ’98, pp. 287–296, 1998.

[17] M. J. Zaki, “Scalable Algorithms for Association Mining,” IEEE Transactions on Knowledge and Data Engineering, 12(2), pp. 372–390, 2000.

[18] M. J. Zaki, C. Hsiao, “CHARM: An Efficient Algorithm for Closed Itemset Mining,” In Proc. SDM ’02, SIAM, pp. 457–473, 2002.

[19] Z. Zheng, R. Kohavi and L. Mason, “Real World Performance of Association Rule Algorithms,” In Proc. SIGKDD ’01, pp. 401–406, 2001.
