<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LCM: An Efficient Algorithm for Enumerating Frequent Closed Item Sets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Takeaki Uno</string-name>
          <email>uno@nii.jp</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tatsuya Asai</string-name>
          <email>t-asai@i.kyushu-u.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yuzo Uchida</string-name>
          <email>y-uchida@i.kyushu-u.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hiroki Arimura</string-name>
          <email>arim@i.kyushu-u.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Informatics, Kyushu University</institution>
          ,
          <addr-line>6-10-1 Hakozaki, Fukuoka 812-0053</addr-line>
          ,
          <country country="JP">JAPAN</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>National Institute of Informatics</institution>
          ,
          <addr-line>Hitotsubashi, Chiyoda-ku, Tokyo</addr-line>
          ,
          <country country="JP">JAPAN</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2003</year>
      </pub-date>
      <fpage>287</fpage>
      <lpage>296</lpage>
      <abstract>
        <p>In this paper, we propose three algorithms LCMfreq, LCM, and LCMmax for mining all frequent item sets, frequent closed item sets, and maximal frequent item sets, respectively, from transaction databases. The main theoretical contribution is that we construct tree-shaped transversal routes composed only of frequent closed item sets, induced by a parent-child relationship defined on frequent closed item sets. By traversing the routes in a depth-first manner, LCM finds all frequent closed item sets in polynomial time per item set, without storing previously obtained closed item sets in memory. Moreover, we introduce several algorithmic techniques that exploit the sparse and dense structures of the input data. Algorithms for enumerating all frequent item sets and maximal frequent item sets are obtained from LCM as its variants. Computational experiments on real world and synthetic databases comparing their performance to previous algorithms show that our algorithms are fast on large real world datasets with natural distributions, such as the KDD-cup 2000 datasets, and on many synthetic databases.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Frequent item set mining is one of the
fundamental problems in data mining and has many
applications such as association rule mining [1], inductive
databases [
        <xref ref-type="bibr" rid="ref8">9</xref>
        ], and query expansion [
        <xref ref-type="bibr" rid="ref11">12</xref>
        ].
      </p>
      <p>Let E be the universe of items, consisting of items
1, ..., n. A subset X of E is called an item set. T
is a set of transactions over E, i.e., each T ∈ T
is composed of items of E. For an item set X, let
T (X) = { t ∈ T | X ⊆ t } be the set of transactions
including X. Each transaction of T (X) is called
an occurrence of X.</p>
      <p>
        In this paper, we propose an efficient algorithm
LCM for enumerating all frequent closed item sets.
LCM is an abbreviation of Linear time Closed item
set Miner. Existing algorithms for this task basically
enumerate frequent item sets while cutting off
unnecessary ones by pruning. However, the
pruning is not complete, so these algorithms still
process unnecessary frequent item sets and spend
extra work on them. In LCM, we define a parent-child
relationship between frequent closed item sets. The
relationship induces tree-shaped transversal routes composed
only of the frequent closed item sets. Our
algorithm traverses these routes, hence takes time linear
in the number of frequent closed item sets. This
algorithm is obtained from the algorithms for
enumerating maximal bipartite cliques [14, 15], which are
designed based on the reverse search technique [
        <xref ref-type="bibr" rid="ref2">3, 16</xref>
        ].
      </p>
      <p>In addition to the search tree technique for closed
item sets, we use several techniques to speed up the
update of the occurrences of item sets. One technique
is occurrence deliver, which simultaneously computes
the occurrence sets of all the successors of the
current item set during a single scan over the current
occurrence set. The other is diffsets, proposed in [18].
Since there is a trade-off between these two methods,
the former being fast for sparse data while the latter
is fast for dense data, we developed a hybrid
algorithm combining them. In some iterations, we make
a decision based on an estimation of their
computation time, hence our algorithm can use the appropriate
one for the dense parts and sparse parts of the input.</p>
      <p>We also consider the problems of enumerating all
frequent sets and maximal frequent sets, and derive
two algorithms LCMfreq and LCMmax from LCM.
LCMmax is obtained from LCM by adding an
explicit check of maximality. LCMfreq is not merely
LCM without the check of closedness; it also
achieves substantial speed-up using closed item set
discovery techniques, because it enumerates only the
representatives of groups of frequent item sets and
generates the other frequent item sets from the
representatives.</p>
      <p>From computational experiments on real and artificial
datasets against the previous algorithms, we observed
that our algorithms LCMfreq, LCM, and LCMmax
significantly outperform the previous algorithms on
real world datasets with natural distributions, such
as the BMS-Web-View-1 and BMS-POS datasets in the
KDD-CUP 2000 datasets, as well as large
synthetic datasets such as IBM T10I4D100K. The
performance of our algorithms is similar to the other
algorithms on hard datasets such as the Connect and Chess
datasets from the UCI Machine Learning Repository,
and less significant than MAFIA; however, LCM works
with less memory than the other algorithms.</p>
      <p>The organization of the paper is as follows. In
Section 2, we explain our tree enumeration method for
frequent closed item sets and our algorithm LCM. In
Section 3, we describe several algorithmic techniques
for speeding up and saving memory. Then, Section 4
and 5 give LCMmax and LCMfreq for maximal and
all frequent item sets, respectively. Techniques for
implementation are described in Section 6, and the
results of computational experiments are reported in
Section 7. Finally, we conclude in Section 8.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Enumerating Frequent Closed Item Sets</title>
      <sec id="sec-2-1">
        <title/>
        <p>In this section, we introduce a parent-child
relationship between frequent closed item sets in C, and
describe our algorithm LCM for enumerating them.</p>
        <p>
          Recent efficient algorithms for frequent item sets,
e.g.,[
          <xref ref-type="bibr" rid="ref3">4, 17, 18</xref>
          ], use a tree-shaped search structure for
F , called the set enumeration tree [
          <xref ref-type="bibr" rid="ref3">4</xref>
          ] defined as
follows. Let X = {x1, . . . , xn} be an item set as an
ordered sequence such that x1 &lt; · · · &lt; xn, where the
tail of X is tail(X) = xn ∈ E. Let X, Y be item
sets. For an index i, X(i) = X ∩ {1, . . . , i}. X is a
prefix of Y if X = Y (i) holds for i = tail(X). Then,
the parent-child relation P for the set enumeration
tree for F is defined as X = P (Y ) iff Y = X ∪ {i} for
some i &gt; tail(X), or equivalently, X = Y \{tail(Y )}.
Then, the whole search space for F forms a prefix
tree (or trie) with this edge relation P .
        </p>
        <p>Now, we define the parent-child relation P for
closed item sets in C as follows. For X ∈ C, we
define the parent of X by P (X) = I(T (X(i(X) − 1))),
where i(X) is the minimum item i such that T (X) =
T (X(i)) but T (X) ≠ T (X(i − 1)). If Y is the parent
of X, we say X is a child of Y . Let ⊥ = I(T (∅)) be
the smallest item set in C, called the root. For any
X ∈ C \ {⊥}, its parent P (X) is always defined and
belongs to C. An illustration is given in Fig. 1.</p>
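As a concrete illustration of these definitions, the closure operator I(T (·)) and the parent P (X) can be sketched as follows (illustrative Python, not the authors' C implementation; the function names and the toy database in the usage note are our own assumptions):

```python
def occurrences(transactions, X):
    """T(X): the transactions that include the item set X."""
    return [t for t in transactions if X <= t]

def closure(transactions, X):
    """I(T(X)): the items common to every occurrence of X.
    Assumes X occurs in at least one transaction."""
    return set.intersection(*occurrences(transactions, X))

def parent(transactions, X, root):
    """P(X) = I(T(X(i(X)-1))), where i(X) is the minimum i with
    T(X) = T(X(i)) but T(X) != T(X(i-1))."""
    if X == root:
        return None  # the root has no parent
    occ_X = occurrences(transactions, X)
    for i in sorted(X):  # i(X) is always an item of X
        if occurrences(transactions, {j for j in X if j <= i}) == occ_X:
            # i = i(X); the parent is the closure of the prefix below i
            return closure(transactions, {j for j in X if j < i})
```

For example, with transactions [{1,2,3}, {1,2}, {1,3}, {2,3}], the root I(T (∅)) is ∅ and the parent of the closed set {1,2} is {1}.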
        <p>For any X ∈ C and its parent Y , the proper
inclusion Y ⊂ X holds since T (X(i(X) − 1)) ⊂ T (X).
Thus, the relation P is acyclic, and its graph
representation forms a tree. By traversing the tree in
a depth-first manner, we can enumerate all the
frequent closed item sets in linear time in the size of the
tree, which is equal to the number of the frequent
closed item sets in C. In addition, we need not store
the tree in memory. Starting from the root ⊥, we
find a child X of the root, and go to X . In the same
way, we go to a child of X . When we arrive at a leaf
of the tree, we backtrack, and find another child.
Repeating this process, we eventually find all frequent
closed item sets in C.</p>
        <p>To find the children of the current frequent closed
item set, we use the following lemma. For an item set
X and an index i, let X[i] = X ∪ H where H is the
set of the items j ∈ I(T (X ∪ {i})) satisfying j ≥ i.
Lemma 1 X′ is a child of X ∈ C (X′ ∈ C and the
parent of X′ is X) if and only if
(cond1) X′ = X[i] for some i &gt; i(X),
(cond2) X′ = I(T (X′)) (X′ is a closed item set),
(cond3) X′ is frequent.
Proof: Suppose that X′ = X[i] satisfies the
conditions (cond1), (cond2) and (cond3). Then, X′ ∈ C.
Since T (X′(i − 1)) = T (X) and T (X′(i − 1) ∪ {i}) =
T (X′) hold, i(X′) = i holds. Hence, X′ is a
child of X. Conversely, suppose that X′ is a child of X. Then,
(cond2) and (cond3) hold. From the definition of
i(X′), T (X′(i(X′) − 1) ∪ {i(X′)}) = T (X′) holds. Hence,
X′ = X[i(X′)] holds. We also have i(X′) &gt; i(X)
since T (X′(i(X′) − 1)) = T (X). Hence (cond1)
holds.</p>
        <p>Clearly, T (X[i]) = T (X ∪ {i}) holds if (cond2)
I(T (X[i])) = X[i] holds. Note that no child X′ of X
satisfies X[i] = X[i′], i ≠ i′, since the minimum items
of X[i]\X and X[i′]\X are i and i′, respectively.
Using this lemma, we construct the following algorithm
scheme which, given a closed item set X, enumerates all
its descendants in the search tree for closed item sets.</p>
        <sec id="sec-2-1-1">
          <title>Algorithm LCM ( X : frequent closed item set )</title>
          <p>1. Output X
2. For each i &gt; i(X) do
3. If X[i] is frequent and X[i] = I(T (X[i])) then Call LCM( X[i] )
4. End for</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>Theorem 1</title>
          <p>Let 0 &lt; σ &lt; 1 be a minimum support. Algorithm
LCM enumerates, given the root closed item set
⊥ = I(T (∅)), all frequent closed item sets in time
linear in the number of frequent closed item sets in C.</p>
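The whole scheme can be sketched as a runnable program (illustrative Python, not the authors' C implementation; the identifiers and the small test database are our own):

```python
def lcm(transactions, minsup, items):
    """Depth-first enumeration of frequent closed item sets via the
    parent-child tree: no set of visited item sets is stored."""
    result = []

    def occ(S):
        return [t for t in transactions if S <= t]

    def rec(X, i_X):
        result.append(frozenset(X))
        for i in sorted(items):
            if i <= i_X or i in X:
                continue
            o = occ(X | {i})
            if len(o) < minsup:          # (cond3) fails
                continue
            clo = set.intersection(*o)   # I(T(X ∪ {i}))
            X_i = X | {h for h in clo if h >= i}
            if X_i == clo:               # (cond2): X[i] is closed
                rec(X_i, i)              # i becomes i(X[i])

    root = set.intersection(*occ(set()))  # the root ⊥ = I(T(∅))
    rec(root, 0)
    return result
```

On the database [{1,2}, {1,2}, {2,3}] with minimum support count 1, this visits exactly the closed sets {2}, {1,2}, and {2,3}, each once.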
          <p>
            The existing enumeration algorithms for frequent
closed item sets are based on backtracking, which
traverses a tree composed of all frequent item
sets in F and skips some item sets by pruning the
tree. Since the pruning is not complete, however,
these algorithms generate unnecessary frequent item
sets. On the other hand, the algorithm in [
            <xref ref-type="bibr" rid="ref9">10</xref>
            ] directly
generates only closed item sets with the closure
operation I(T (·)) as ours does, but their method may generate
duplicated closed item sets and needs an expensive
duplicate check.
          </p>
          <p>On the other hand, our algorithm traverses a tree
composed only of frequent closed item sets, and each
iteration is not as heavy as in the previous algorithms.
Hence, our algorithm runs fast in practice. If we
consider our algorithm as a modification of the usual
backtracking algorithm, each iteration of our algorithm
re-orders the items larger than i(X) so that the
items not included in X follow the items included in
X. Note that the parent X is not a prefix of X[i] in
a recursive call. The check of (cond2) can be
considered as a pruning of non-closed item sets.</p>
          <p>[Fig. 1 (figure omitted): an item set X, its occurrences T (X), its parent, and the occurrences of the parent.]</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Reducing Computation Time</title>
      <p>The computation time of LCM described in the
previous section is linear in |C|, with a factor depending
on T (X) for each closed item set X ∈ C. However,
this still takes a long time if it is implemented in a
straightforward way. In this section, we introduce
some techniques based on the sparse and dense structures
of the input data.</p>
      <p>Occurrence Deliver. First, we introduce a
technique called occurrence deliver for reducing
the construction time for T (X[i]), which is needed
to check (cond3). This technique is particularly
efficient in the case that |T (X[i])| is much smaller than
|T (X)|. In the usual way, T (X[i]) is obtained from
T (X) in O(|T (X)|) time by removing all
transactions not including i, based on the equality T (X[i]) =
T (X ∪ {i}) = T (X) ∩ T ({i}) (this method is known
as down-project). Thus, the total computation for all
children takes |E| scans and O(|T (X)| · |E|) time.</p>
      <p>
        Instead of this, we build, for all i = i(X), . . . , |E|,
the occurrence lists J [i] := T (X[i]) simultaneously
by scanning the transactions in T (X) once, as
follows. We initialize J [i] = ∅ for all i = i(X), . . . , |E|.
For each T ∈ T (X) and for each i ∈ T (i &gt; i(X)),
we insert T to J [i]. See Fig. 2 for an illustration.
This correctly computes J [i]
for all i in O(|T (X)|) total time. Furthermore,
we need not make a recursive call of LCM for X[i] if
T (X[i]) = ∅ (this is often called lookahead [
        <xref ref-type="bibr" rid="ref3">4</xref>
        ]). In
our experiments on BMS instances, occurrence
deliver reduces the computation time to as little as 1/10 in
some cases.
      </p>
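The single-scan computation above can be sketched as follows (illustrative Python; the names are our own, and transactions are assumed to be stored as sorted tuples of items):

```python
from collections import defaultdict

def occurrence_deliver(occ_X, i_X):
    """One scan over T(X) fills J[i] = T(X ∪ {i}) for every i > i(X)."""
    J = defaultdict(list)
    for T in occ_X:
        for i in T:
            if i > i_X:
                J[i].append(T)   # deliver T to each item it contains
    return J
```

The total work is one pass over the items of the transactions in T (X), rather than |E| separate intersections.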
      <p>Right-first sweep. The occurrence deliver
method needs eager computation of the occurrence
sets J [i] = T (X[i]) for all children before
expanding any of them. A simple implementation of it may
require much more memory than the ordinary lazy
computation of T (X[i]) as in [17]. However, we can reduce
the memory usage using a method called the
right-first sweep as follows.</p>
      <p>Given a parent X, we make the recursive call for
X[i] in decreasing order for each i = |E|, . . . , i(X)
(see Fig. 2). At each call for X[i], we collect the
memory allocated before for J [i + 1], . . . , J [|E|] and
re-use it for J [i]. After the call for X[i]
terminates, the memory for J [i] is released for future
use in J [j] for j &lt; i. Since |J [i]| = |T (X[i])| ≤
|T ({i})| for any i and X, the total memory Σi |J [i]|
is bounded by the input size ||T || = ΣT ∈T |T |, and
thus, it is sufficient to allocate the memory for J at
once as a global variable.</p>
      <p>Diffsets. In the case that |T (X[i])| is nearly equal
to |T (X)|, we use the diffset technique proposed in
[18]. The diffset for index i is DJ [i] = T (X)\T (X[i]),
where T (X[i]) = T (X ∪ {i}). Then, the frequency
of X[i] is obtained by |T (X[i])| = |T (X)| − |DJ [i]|.
When we generate a recursive call with respect to X[i], we
update DJ [j], j &gt; i, by setting DJ [j] to DJ [j] \
DJ [i], in O(Σi&gt;i(X),X[i]∈F (|T (X)| − |T (X[i])|)) time.
Diffsets are needed only for i such that X[i] is
frequent. With diffsets, the computation time for
instances such as connect, chess, and pumsb, where
|T (X[i])| is as large as |T (X)|, is reduced to as little as 1/100.</p>
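A minimal sketch of the diffset computation (illustrative Python; the helper name is our own) shows how the frequency of X[i] follows from |T (X)| − |DJ [i]|:

```python
def diffsets(occ_X, candidates):
    """DJ[i] = T(X) \ T(X ∪ {i}) for each candidate item i; the
    frequency of X[i] is then |T(X)| - |DJ[i]|."""
    return {i: [t for t in occ_X if i not in t] for i in candidates}
```

When |T (X[i])| is close to |T (X)|, the lists DJ [i] are short, which is exactly the dense case where this representation wins over occurrence deliver.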
      <p>Hybrid Computation. As we saw in the
preceding subsections, our occurrence deliver is fast when
|T (X[i])| is much smaller than |T (X)|, while the
diffset of [18] is fast when |T (X[i])| is close to
|T (X)|. Therefore, our LCM dynamically decides
which of occurrence deliver and diffsets to use.
To do this, we compare two quantities on X :
A(X) = Σi |T (X ∪ {i})| and
B(X) = Σi:X∪{i}∈F (|T (X)| − |T (X ∪ {i})|).</p>
      <p>For some fixed constant α &gt; 1, we decide to use
occurrence deliver if A(X) &lt; αB(X) and the
diffset otherwise. We make this decision only at the
child iterations of the root set ⊥, since this decision
takes much time. Empirically, restricting the range
i ∈ {1, . . . , |E|} of the index i in A(X) and B(X)
to i ∈ {i(X) + 1, . . . , |E|} results in a significant
speedup. In experiments on BMS instances, we observe
that the hybrid technique reduces the computation
time to as little as 1/3. The hybrid technique is also useful
for reducing the memory space of diffsets, as follows.
Although the memory B(X) used by diffsets is not
bounded by the input size ||T || in the worst case,
the hybrid ensures that B(X) does not exceed
A(X) ≤ ||T ||, because the diffset is chosen only when
A(X) ≥ αB(X).</p>
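The decision rule can be sketched as follows (illustrative Python; for brevity this version ignores the restriction of B(X) to frequent extensions described above, which is our simplification):

```python
def choose_method(occ_X, candidates, alpha):
    """Return which update method the hybrid would pick for X,
    comparing A(X) = Σ|T(X ∪ {i})| against α·B(X)."""
    A = sum(sum(1 for t in occ_X if i in t) for i in candidates)
    B = sum(len(occ_X) - sum(1 for t in occ_X if i in t) for i in candidates)
    return "occurrence deliver" if A < alpha * B else "diffsets"
```

Sparse data gives small A(X) (few surviving occurrences) and picks occurrence deliver; dense data gives small B(X) (small diffsets) and picks diffsets.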
      <sec id="sec-3-1">
        <title>Checking the closedness in occurrence deliver</title>
        <p>Another key is to efficiently check the
closedness X[i] = I(T (X[i])) (cond2). The
straightforward computation of the closure I(T (X[i])) takes
much time since it requires access to the whole
sets T (X[j]), j &lt; i, and i is usually as large as |E|.</p>
        <p>By definition, (cond2) is violated iff there exists
some j ∈ {1, . . . , i − 1} such that j ∈ T for
every T ∈ T (X ∪ {i}). We first choose a
transaction T ∗(X ∪ {i}) ∈ T (X ∪ {i}) of minimum size, and
test whether j ∈ T for increasing j ∈ T ∗(X ∪ {i}). This
results in an O(Σj∈T ∗(X∪{i}) m(X[i], j)) time algorithm,
where m(X, j) is the maximum index m such that
all of the first m transactions of T (X) include j.
This is much faster than the straightforward
algorithm with O(Σj&lt;i |T (X ∪ {i} ∪ {j})|) time.</p>
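The test can be sketched as follows (illustrative Python; for clarity this version checks each candidate j against all of T (X ∪ {i}) rather than using the early-termination counters m(X, j) of the text):

```python
def is_closed(occ, X_i, i):
    """Check (cond2): X[i] (= X_i) is closed iff no item j < i outside
    X_i occurs in every transaction of occ = T(X ∪ {i})."""
    smallest = min(occ, key=len)      # scan a minimum-size occurrence
    for j in sorted(smallest):
        if j >= i:
            break
        if j not in X_i and all(j in t for t in occ):
            return False              # j extends X[i] without changing T
    return True
```

Only items of the smallest occurrence can possibly belong to the closure, which keeps the candidate set small.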
        <p>In fact, this efficient check requires the adjacency
matrix (sometimes called a bitmap) representing the
inclusion relationship between items and transactions.
However, the adjacency matrix requires O(|T | × |E|)
memory, which is quite hard to store for large
instances. Hence, we make columns of the adjacency
matrix only for transactions of size larger than
(ΣT ∈T |T |)/δ, where δ is a constant. This uses at
most O(δ × ΣT ∈T |T |) memory, linear in the input size.</p>
        <p>Checking the closedness in diffsets. In the
case that |T (X[i])| is nearly equal to |T (X)|, the
above check cannot be done in a short time. In this case,
we keep the diffset DJ [j] for all j &lt; i, j ∉ X, such
that X[j] is frequent. Maintaining DJ for all such j is
a heavy task, so we discard unnecessary DJ ’s as
follows. If T (X ∪ {j}) includes a transaction included
in no T (X[i′]), i′ &gt; i(X), then for any descendant
X′ of X, j ∉ I(T (X′[j′])) for any j′ &gt; i(X).
Hence, we no longer have to keep DJ [j] for such
j. Let N C(X) be the set of items j such that X[j]
is frequent and any transaction of T (X) \ T (X ∪ {j}) is
included in some T (X[j′]), j′ &gt; i(X). Then, the
computation time for checking (cond2) is written
as O(Σj∈NC(X),j&lt;i |T (X) \ T (X ∪ {j})|). By
checking (cond2) in these ways, the computation time for
checking (cond2) is reduced to between 1/10 and 1/100.</p>
        <p>Detailed Algorithm. We present below a
description of the algorithm LCM, which recursively
computes (X, T (X), i(X)) simultaneously.
global: J , DJ</p>
        <p>/* Global sets of lists */</p>
      </sec>
      <sec id="sec-3-2">
        <title>Algorithm LCM()</title>
        <p>1. X := I(T (∅)) /* The root ⊥ */
2. For i := 1 to |E|
3. If X[i] satisfies (cond2) and (cond3) then
   Call LCM Iter( X[i], T (X[i]), i ) or
   Call LCM Iter2( X[i], T (X[i]), i, DJ )
   based on the decision criteria
4. End for</p>
      </sec>
      <sec id="sec-3-3">
        <title>LCM Iter( X, T (X), i(X) ) /* occurrence deliver */</title>
        <p>1. Output X
2. For each T ∈ T (X)
3. For each j ∈ T , j &gt; i(X), insert T to J [j]
4. For each j with J [j] ≠ ∅, in decreasing order
5. If |J [j]| ≥ α and (cond2) holds then
   Call LCM Iter( T (J [j]), J [j], j )
6. Delete J [j]
7. End for</p>
      </sec>
      <sec id="sec-3-4">
        <title>LCM Iter2( X, T (X), i(X), DJ ) /* diffset */</title>
        <p>1. Output X
2. For each i such that X[i] is frequent
3. If X[i] satisfies (cond2) then
4. For each j such that X[i] ∪ {j} is frequent,
   DJ [j] := DJ [j] \ DJ [i]
5. Call LCM Iter2( X[i], T (X[i]), i, DJ )
6. End if
7. End for</p>
        <p>Theorem 2 Algorithm LCM enumerates all
frequent closed item sets in O(Σj&gt;i(X) |T (X[j])| +
Σj&gt;i(X),X[j]∈F Σj′∈T ∗(X) m(X[j], j′)) time,
or in O(Σi&gt;i(X),X[i]∈F (|T (X)| − |T (X[i])|) +
Σj∈NC(X),j&lt;i |T (X) \ T (X ∪ {j})|) time, for
each frequent closed item set X, with memory linear
in the input size.</p>
        <sec id="sec-3-4-3">
          <title>4. Enumerating Maximal Frequent Sets</title>
          <p>In this section, we explain an enumeration algorithm
for maximal frequent sets that uses frequent
closed item set enumeration. The main idea is very
simple. Since any maximal frequent item set is a
frequent closed item set, we enumerate frequent closed
item sets and output only those that are maximal
frequent sets. A frequent closed item set X is a
maximal frequent set if and only if X ∪ {i} is
infrequent for every i ∉ X. By adding this check to LCM,
we obtain LCMmax.</p>
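The maximality check can be sketched directly from this characterization (illustrative Python; the function name and a support count threshold are our own conventions):

```python
def is_maximal(transactions, X, items, minsup):
    """X is a maximal frequent set iff every one-item extension
    X ∪ {i}, i ∉ X, drops below the minimum support."""
    def support(S):
        return sum(1 for t in transactions if S <= t)
    return all(support(X | {i}) < minsup for i in items if i not in X)
```

In LCMmax the supports of the extensions are obtained from the occurrence deliver or diffset structures rather than recomputed per item, as described in the next paragraph.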
          <p>This modification does not increase the memory
complexity, but it increases the computation time. In the
case of occurrence deliver, we generate T (X ∪ {j}) for
all j in the same way as the occurrence deliver, and
check the maximality. This takes O(Σj&lt;i(X) |T (X ∪
{j})|) time. In the case of the difference update, we do
not discard the diffsets unnecessary for closed item set
enumeration: we keep the diffsets DJ [j] for all j such that
X ∪ {j} is frequent. To update and maintain them,
we spend O(Σj,X∪{j}∈F |T (X) \ T (X ∪ {j})|) time.
Note that we need not check the maximality
if X has a child.</p>
          <p>Theorem 3 Algorithm LCMmax enumerates all
maximal frequent item sets in O(Σi |T (X ∪ {i})|)
time, or in O(Σi,X∪{i}∈F (|T (X)| − |T (X ∪ {i})|)) time,
for each frequent closed item set X, with memory
linear in the input size.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Enumerating Frequent Sets</title>
      <p>In this section, we describe an enumeration
algorithm for frequent item sets. The key idea of our
algorithm is that we classify the frequent item sets
into groups and enumerate the representative of each
group. Each group is composed of frequent item sets
included in the class of a closed item set. This idea
is based on the following lemma.</p>
      <p>Lemma 2 Suppose that frequent item sets X and
S ⊃ X satisfy T (X) = T (S). Then, for any item
set X′ including X, T (X′) = T (X′ ∪ S).</p>
      <p>In particular, T (X′) = T (R) holds for any X′ ⊆
R ⊆ X′ ∪ S, hence all such R are included in the same class
of a closed item set. Hence, any frequent item set X′
is generated from X′ \ (S \ X). We call X′ \ (S \ X)
a representative.</p>
      <p>Let us consider a backtracking algorithm finding
frequent item sets which adds items one by one in
lexicographic order. Suppose that we currently have a
frequent item set X, and find another frequent item
set X ∪ {i}. Let S = X[i]. Then, according to the
above lemma, we can observe that for any frequent
item set X′ including X and not intersecting S \ X,
any item set including X′ and included in X′ ∪ S is
also frequent. Conversely, any frequent item set
including X is generated from a representative not
intersecting S \ X. Hence, we enumerate only
representatives including X and not intersecting S \ X,
and generate the other frequent item sets by adding
each subset of S \ X. This method can be considered
to “decompose” the classes of closed item sets into
several sublattices (hypercubes) whose maximal and
minimal elements are X′ ∪ S and X′, respectively
(see Fig. 3). We call this technique hypercube decomposition.</p>
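The expansion of one hypercube can be sketched as follows (illustrative Python; by Lemma 2 every set in the hypercube shares the occurrence set of the representative, so no frequency checks are needed):

```python
from itertools import combinations

def hypercube(X, extra):
    """All item sets R with X ⊆ R ⊆ X ∪ extra, where extra = S \ X.
    Lemma 2 guarantees they are all frequent if X is."""
    ext = sorted(extra)
    return [X | set(c) for r in range(len(ext) + 1)
            for c in combinations(ext, r)]
```

For example, hypercube({1}, {2,3}) yields the 2^2 = 4 sets {1}, {1,2}, {1,3}, {1,2,3} in one shot.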
      <p>Suppose that we are currently operating on a
representative X′ including X, and going to
generate a recursive call with respect to X′ ∪ {j}. Then, if
(X′[j] \ X′) \ S ≠ ∅, X′ and S ∪ (X′[j] \ X′) satisfy
the condition of Lemma 2. Hence, we add X′[j] \ X′
to S.</p>
      <p>We describe LCMfreq as follows.</p>
      <p>Algorithm LCMfreq ( X : representative, S : item set, i : item )
1. Output all item sets R, X ⊆ R ⊆ X ∪ S
2. For each j &gt; i, j ∉ X ∪ S
3. If X ∪ {j} is frequent then
   Call LCMfreq ( X ∪ {j}, S ∪ (X[j] \ (X ∪ {j})), j )
4. End for</p>
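The scheme can be sketched as a runnable program (illustrative Python, not the authors' C implementation; identifiers and the toy database are our own):

```python
from itertools import combinations

def lcm_freq(transactions, minsup, items):
    """Enumerate all frequent item sets via representatives: each
    recursive call outputs a whole hypercube [X, X ∪ S] at once."""
    results = set()

    def occ(S):
        return [t for t in transactions if S <= t]

    def rec(X, S, last):
        ext = sorted(S - X)                       # output the hypercube
        for r in range(len(ext) + 1):
            for c in combinations(ext, r):
                results.add(frozenset(X | set(c)))
        for j in sorted(items):                   # j > last, j ∉ X ∪ S
            if j <= last or j in X or j in S:
                continue
            occ_j = occ(X | {j})
            if len(occ_j) < minsup:
                continue
            clo = set.intersection(*occ_j)        # I(T(X ∪ {j}))
            rec(X | {j}, S | {h for h in clo if h > j}, j)

    rec(set(), set(), 0)
    return results
```

On [{1,2}, {1,2}, {2,3}] with minimum support count 2 this returns exactly the four frequent sets ∅, {1}, {2}, {1,2}, and {1,2} is emitted from the hypercube of representative {1} without a separate frequency check.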
      <p>For some synthetic instances in which frequent
closed item sets are fewer than frequent item sets,
the average size of S is up to 5. In these cases, the
algorithm finds up to 2^|S| = 32 frequent item sets at once,
hence the computation time is much reduced by this
improvement.</p>
      <p>To check the frequency of all X ∪ {j}, we can
use the occurrence deliver and diffsets used for LCM.
LCMfreq does not require the check of (cond2),
hence the computation time of each iteration is
O(Σj&gt;i(X) |T (X[j])|) for occurrence deliver,
and O(Σj&gt;i(X),X[j]∈F |T (X) \ T (X[j])|) for
diffsets. Since the computation times change, we
use different estimators for the hybrid. In almost all
cases, once Σj&gt;i(X),X[j]∈F |T (X) \ T (X[j])|
becomes smaller than Σj&gt;i(X) |T (X[j])|, the
condition holds in every iteration generated by a recursive
call. Hence, the algorithm first starts with
occurrence deliver, and compares the two in each iteration.
If Σj&gt;i(X),X[j]∈F |T (X) \ T (X[j])| becomes smaller,
then we switch to diffsets. Note that these
estimators can be computed in a short time using the result
of occurrence deliver.</p>
      <p>Theorem 4 LCMfreq enumerates all frequent
sets of F in O(Σj&gt;i(X) |T (X[j])|) time or
O(Σj&gt;i(X),X[j]∈F |T (X) \ T (X[j])|) time for
each frequent set X, within O(ΣT ∈T |T |) space.</p>
      <p>In particular, LCMfreq requires one integer for each
item of any transaction, which is required to store the
input data. The other memory LCMfreq uses is bounded
by O(|T | + |E|).</p>
      <p>Experimentally, an iteration of LCMfreq on an
input frequent set X takes O(|T (X)| + |X|)
or O((size of diffset) + |X|) steps on average. In
some sense, this is optimal, since we have to take
O(|X|) time to output X, and O(|T (X)|) time or
O(size of diffset) time to check the frequency of X.</p>
    </sec>
    <sec id="sec-5">
      <title>6. Implementation</title>
      <p>
        In this section, we explain our implementation. First,
we explain the data structure of our algorithm. A
transaction T of the input data is stored in an array of
length |T |. Each cell of the array stores the index of
an item of T . For example, t = {4, 2, 7} is stored in
an array with 3 cells, [2, 4, 7]. We sort the elements of
the array so that we can take {i, ..., |E|} ∩ T in time
linear in |{i, ..., |E|} ∩ T |. J is also stored in arrays in the
same way. We need no doubly linked lists
or binary trees, which take much time to operate.
      </p>
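The effect of keeping each transaction sorted can be sketched as follows (illustrative Python; the C implementation would use a plain array and pointer arithmetic):

```python
from bisect import bisect_left

def suffix_items(t_sorted, i):
    """Items of a sorted transaction that are >= i: one binary
    search locates the start, and the rest is a contiguous slice."""
    return t_sorted[bisect_left(t_sorted, i):]
```

So taking {i, ..., |E|} ∩ T touches only the cells of the answer, which is what makes occurrence deliver's inner loop cheap.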
      <p>To reduce the practical computation time, we sort
the transactions by their sizes, and the items by the
number of transactions including them. Experimentally,
this reduces Σj&gt;i(X) |T (X ∪ {j})|. In some cases, the
computation time is reduced to 1/3 by this ordering.</p>
    </sec>
    <sec id="sec-6">
      <title>7. Computational Experiments</title>
      <p>To examine the practical efficiency of our algorithms,
we ran experiments on real and synthetic
datasets, which are made available on the FIMI’03 site.
In the following, we report the results of the
experiments.
We implemented our three algorithms
LCMfreq, LCM, and LCMmax
in C and compiled them with gcc 3.2.</p>
      <p>The algorithms were tested on the datasets shown
in Table 1, available from the FIMI’03 homepage1,
which include: T10I4D100K and T40I10D100K from the
IBM Almaden Quest research group; chess,
connect, mushroom, pumsb, and pumsb star from the UCI ML
repository2 and PUMSB; and BMS-WebView-1,
BMS-WebView-2, and BMS-POS from KDD-CUP 20003.</p>
      <p>
        We compare our algorithms LCMfreq, LCM, and
LCMmax with the following frequent item set mining
algorithms: implementations of FP-growth [
        <xref ref-type="bibr" rid="ref6">7</xref>
        ], Eclat [17], and
Apriori [1, 2] by Bart Goethals4. We also
compare the LCM algorithms with the implementation of
Mafia [
        <xref ref-type="bibr" rid="ref5">6</xref>
        ], a fast maximal frequent pattern miner, by
the Cornell University Database group5. The
versions of Mafia with the frequent item set, frequent closed
item set, and maximal frequent item set options
are denoted by mafia-fi, mafia-fci, and mafia-mfi,
respectively. Although we had also planned a
performance comparison with Charm, the
state-of-the-art frequent closed item set miner, we gave up that
comparison this time due to time constraints.
      </p>
      <p>All experiments were run on a PC with a
Pentium4 2.8GHz processor, 1GB of memory, and a 7200 RPM
180GB hard disk. In the experiments, LCMfreq,
LCM and LCMmax use at most 123MB, 300MB, and
300MB of memory, respectively. Note that LCM and
LCMmax can reduce their memory use by decreasing δ.
1http://fimi.cs.helsinki.fi/testdata.html
2http://www.ics.uci.edu/ mlearn/MLRepository.html
3http://www.ecn.purdue.edu/KDDCUP/
4http://www.cs.helsinki.fi/u/goethals/software/
5University of Cornell Database group, Himalaya Data
Mining Tools, http://himalaya-tools.sourceforge.net/</p>
      <sec id="sec-6-1">
        <title>Results</title>
        <p>Figures 6 through 12 show the running time
with varying minimum supports for the seven
algorithms, namely LCMfreq, LCM, LCMmax,
FP-growth, Eclat, Apriori, and mafia-mfi, on the nine datasets
described in the previous subsection. In the
following, we refer to all, maximal, and closed frequent item set
mining simply as all, maximal, and closed.</p>
        <sec id="sec-6-1-1">
          <title>Results on Synthetic Data</title>
          <p>Figure 4 shows the running time with minimum
support ranging from 0.15% to 0.025% on the IBM-Artificial
T10I4D100K dataset. From this plot, we see that
most algorithms run within around a few tens of
minutes, and their behaviors are quite similar as the
minimum support increases. In Figure 4, all of LCMmax,
LCM, and LCMfreq are twice as fast as FP-growth
on the IBM T10I4D100K dataset. On the other hand,
Mafia-mfi, Mafia-fci, and Mafia-fi are slower than
every other algorithm. In Figure 5, Mafia-mfi is fastest
for maximal, and LCMfreq is fastest for all, for
minimum support less than 1% on the IBM T10I4D100K
dataset.</p>
        </sec>
        <sec id="sec-6-1-2">
          <title>Results on KDD-CUP datasets</title>
          <p>Figures 6 through 8 show the running time
with minimum supports ranging from
0.1% to 0.01% on the three real world datasets
BMS-WebView-1, BMS-WebView-2, and BMS-POS.
In the figures, we can observe that the LCM algorithms
outperform the others in almost all cases, especially for
lower minimum supports. In particular, LCM was the
best among the seven algorithms over a wide range of
minimum supports from 0.1% to 0.01% on all datasets.</p>
          <p>For higher minimum supports, ranging from 0.1% to
0.06%, the performances of all algorithms are similar,
with the LCM family performing slightly better.
For lower minimum supports, ranging from 0.04% to
0.01%, Eclat and Apriori are much slower than every
other algorithm, and LCM outperforms the others. Some
frequent item set miners, such as Mafia-fi and
Mafia-fci, run out of the 1GB of main memory at these
minimum supports on the BMS-WebView-1,
BMS-WebView-2, and BMS-POS datasets. LCMfreq works quite well
for higher minimum supports, but takes more than 30
minutes for minimum supports below 0.04% on
BMS-WebView-1; in these cases, the number of frequent
item sets is quite large, over 100,000,000,000.
Interestingly, Mafia-mfi’s performance is stable over a wide
range of minimum supports from 0.1% to 0.01%.</p>
          <p>In summary, the LCM family of algorithms
performs very well on the real-world datasets
BMS-WebView-1, BMS-WebView-2, and BMS-POS.</p>
        </sec>
        <sec id="sec-6-1-3">
          <title>Results on UCI-ML repository and PUMSB datasets</title>
          <p>Figures 9 through 12 show the running
times on the middle-sized datasets pumsb, pumsb*,
and kosarak, and on the small-sized datasets connect, chess,
and mushroom. These datasets, taken from machine
learning domains, are small but hard for the
frequent pattern mining task, since they have many
frequent patterns even at high minimum supports,
e.g., from 50% to 90%. They were originally
built for classification tasks and have somewhat
different characteristics from large business datasets such
as BMS-WebView-1 or BMS-POS.</p>
          <p>In these figures, we see that Mafia-mfi consistently
outperforms every other maximal frequent item set
mining algorithm over a wide range of minimum
supports, except on pumsb*. On the other hand, Apriori
is much slower than the other algorithms. For mining
all frequent item sets, LCMfreq is faster than the
other algorithms. For mining frequent closed
item sets, there seems to be no consistent tendency
in the performance results. However, LCM does not
store the obtained solutions in memory, while the
other algorithms do; thus, in the sense of
memory saving, LCM has an advantage.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
      <p>
        In this paper, we presented an efficient algorithm, LCM,
for mining frequent closed item sets, based on a
parent-child relationship defined on frequent closed item
sets. This technique is taken from the algorithms
for enumerating maximal bipartite cliques [14, 15],
which are based on reverse search [
        <xref ref-type="bibr" rid="ref2">3</xref>
        ]. In theory, we
showed that LCM enumerates exactly the set of
frequent closed item sets in polynomial time per
closed item set in the total input size. In practice, we
showed by experiments that our algorithms run fast on
several real-world datasets such as BMS-WebView-1.
We also presented LCMfreq and LCMmax, variants of
LCM for computing all and maximal frequent item
sets, respectively. LCMfreq uses the new schemes of hybrid and
hypercube decomposition, and these schemes work well on
many problems.
      </p>
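      <p>As a rough illustration of this closure-based, depth-first search, the following sketch (hypothetical code, not the authors' implementation; the visited-set pruning below is ours, whereas LCM itself avoids storing any solutions by using the parent-child relationship) enumerates frequent closed item sets by intersecting occurrence sets:

```python
# Hypothetical sketch of closure-based enumeration of frequent
# closed item sets. Illustration only: LCM replaces the `found`
# set below with its parent-child relationship, so it needs no
# stored solutions.

def closure(occurrences):
    """The closure of an item set: the intersection of all
    transactions containing it (its occurrence set)."""
    it = iter(occurrences)
    result = set(next(it))
    for t in it:
        result.intersection_update(t)
    return frozenset(result)

def frequent_closed(transactions, minsup):
    """Enumerate all frequent closed item sets depth-first."""
    universe = set().union(*transactions)
    found = set()

    def expand(occurrences):
        clo = closure(occurrences)
        if clo in found:          # prune sets reached more than once
            return
        found.add(clo)
        for item in universe.difference(clo):
            occ = [t for t in occurrences if item in t]
            if len(occ) >= minsup:    # extension is still frequent
                expand(occ)

    if len(transactions) >= minsup:
        expand(list(transactions))
    return found
```

For example, on the four transactions {1,2,3}, {1,2}, {1,3}, {2,3} with minimum support 2, the sketch yields seven closed sets: the empty set, the three singletons, and the three pairs.</p>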
    </sec>
    <sec id="sec-8">
      <title>Acknowledgement</title>
      <p>We gratefully thank Prof. Ken Satoh of the National
Institute of Informatics. This research was
supported by a group research fund of the National Institute
of Informatics, Japan.</p>
    </sec>
    <sec id="sec-9">
      <title>References</title>
      <p>[1] R. Agrawal and R. Srikant, “Fast Algorithms for
Mining Association Rules in Large Databases,”
In Proc. VLDB ’94, pp. 487–499, 1994.
[2] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen
and A. I. Verkamo, “Fast Discovery of Association
Rules,” In Advances in Knowledge Discovery and
Data Mining, MIT Press, pp. 307–328, 1996.
[17] M. J. Zaki, “Scalable algorithms for
association mining,” Knowledge and Data Engineering,
12(2), pp. 372–390, 2000.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mannila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Srikant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Toivonen</surname>
          </string-name>
          and
          <string-name>
            <given-names>A. I.</given-names>
            <surname>Verkamo</surname>
          </string-name>
          , “
          <article-title>Fast Discovery of Association Rules,”</article-title>
          <source>In Advances in Knowledge Discovery and Data Mining</source>
          , MIT Press, pp.
          <fpage>307</fpage>
          -
          <lpage>328</lpage>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Avis</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Fukuda</surname>
          </string-name>
          , “Reverse Search for Enumeration,”
          <source>Discrete Applied Mathematics</source>
          , Vol.
          <volume>65</volume>
          , pp.
          <fpage>21</fpage>
          -
          <lpage>46</lpage>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Bayardo Jr.</surname>
          </string-name>
          ,
          <article-title>Efficiently Mining Long Patterns from Databases</article-title>
          ,
          <source>In Proc. SIGMOD'98</source>
          , pp.
          <fpage>85</fpage>
          -
          <lpage>93</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Boros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Gurvich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Khachiyan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Makino</surname>
          </string-name>
          , “
          <article-title>On the Complexity of Generating Maximal Frequent and Minimal Infrequent Sets,”</article-title>
          <source>In Proc. STACS</source>
          <year>2002</year>
          , pp.
          <fpage>133</fpage>
          -
          <lpage>141</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Burdick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Calimlim</surname>
          </string-name>
          , J. Gehrke, “
          <article-title>MAFIA: A Maximal Frequent Itemset Algorithm for Transactional Databases,”</article-title>
          <source>In Proc. ICDE</source>
          <year>2001</year>
          , pp.
          <fpage>443</fpage>
          -
          <lpage>452</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yin</surname>
          </string-name>
          , “
          <article-title>Mining Frequent Patterns without Candidate Generation,”</article-title>
          <source>In Proc. SIGMOD'00</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          ,
          <year>2000</year>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Kohavi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Brodley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Frasca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Mason</surname>
          </string-name>
          and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zheng</surname>
          </string-name>
          , “
          <article-title>KDD-Cup 2000 Organizers' Report: Peeling the Onion,”</article-title>
          <source>SIGKDD Explorations</source>
          ,
          <volume>2</volume>
          (
          <issue>2</issue>
          ), pp.
          <fpage>86</fpage>
          -
          <lpage>98</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.</given-names>
            <surname>Mannila</surname>
          </string-name>
          , H. Toivonen, “
          <article-title>Multiple Uses of Frequent Sets and Condensed Representations,”</article-title>
          <source>In Proc. KDD'96</source>
          , pp.
          <fpage>189</fpage>
          -
          <lpage>194</lpage>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>N.</given-names>
            <surname>Pasquier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bastide</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Taouil</surname>
          </string-name>
          , L. Lakhal,
          <article-title>Discovering frequent closed itemsets for association rules</article-title>
          ,
          <source>In Proc. ICDT'99</source>
          , pp.
          <fpage>398</fpage>
          -
          <lpage>416</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pei</surname>
          </string-name>
          , J. Han,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mao</surname>
          </string-name>
          , “
          <article-title>CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets,”</article-title>
          <source>ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery</source>
          <year>2000</year>
          , pp.
          <fpage>21</fpage>
          -
          <lpage>30</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>B.</given-names>
            <surname>Possas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ziviani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. Meira</given-names>
            <surname>Jr.</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. A.</given-names>
            <surname>Ribeiro-Neto</surname>
          </string-name>
          , “
          <article-title>Set-based model: a new approach for information retrieval,”</article-title>
          <source>In Proc. SIGIR'02</source>
          , pp.
          <fpage>230</fpage>
          -
          <lpage>237</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Tsukiyama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ide</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ariyoshi</surname>
          </string-name>
          and
          <string-name>
            <given-names>I.</given-names>
            <surname>Shirakawa</surname>
          </string-name>
          , “
          <article-title>A New Algorithm for Generating All the Maximum Independent Sets,”</article-title>
          <source>SIAM Journal on Computing</source>
          , Vol.
          <volume>6</volume>
          , pp.
          <fpage>505</fpage>
          -
          <lpage>517</lpage>
          ,
          <year>1977</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>