=Paper= {{Paper |id=Vol-126/paper-8 |storemode=property |title=LCM ver. 2: Efficient Mining Algorithms for Frequent/Closed/Maximal Itemsets |pdfUrl=https://ceur-ws.org/Vol-126/uno.pdf |volume=Vol-126 |dblpUrl=https://dblp.org/rec/conf/fimi/UnoKA04 }} ==LCM ver. 2: Efficient Mining Algorithms for Frequent/Closed/Maximal Itemsets== https://ceur-ws.org/Vol-126/uno.pdf
                    LCM ver. 2: Efficient Mining Algorithms for
                       Frequent/Closed/Maximal Itemsets

                          Takeaki Uno1, Masashi Kiyomi1, Hiroki Arimura2
    1 National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan
      e-mail: uno@nii.jp, masashi@grad.nii.ac.jp
    2 Information Science and Technology, Hokkaido University,
      Kita 14-jo Nishi 9-chome, 060-0814 Sapporo, Japan
      e-mail: arim@ist.hokudai.ac.jp


Abstract: For a transaction database, a frequent itemset is an itemset included in at least a specified number of transactions. A frequent itemset P is maximal if P is included in no other frequent itemset, and closed if P is included in no other itemset included in exactly the same transactions as P. The problems of finding these frequent itemsets are fundamental in data mining, and the applications need fast implementations for solving them. In this paper, we propose the efficient algorithms LCM (Linear time Closed itemset Miner), LCMfreq and LCMmax for these problems. We show the efficiency of our algorithms by computational experiments comparing them with existing algorithms.

1 Introduction

Frequent itemset mining is one of the fundamental problems in data mining and has many applications such as association rule mining, inductive databases, and query expansion. These applications need fast implementations of frequent itemset mining. In this paper, we propose the second versions of LCM, LCMfreq and LCMmax, for enumerating closed, all, and maximal frequent itemsets. LCM is an abbreviation of Linear time Closed itemset Miner.

At FIMI03 [7], we proposed the first version of LCM, which enumerates frequent closed itemsets. LCM uses the prefix preserving closure extension (ppc extension for short), which is an extension from a closed itemset to another closed itemset. The extension induces a search tree on the set of frequent closed itemsets, so we can completely enumerate closed itemsets without duplication. Generating a ppc extension needs no previously obtained closed itemset. Hence, the memory use of LCM does not depend on the number of frequent closed itemsets, even if there are many frequent closed itemsets.

The time complexity of LCM is theoretically bounded by a linear function in the number of frequent closed itemsets, while that of the existing algorithms is not. We further developed algorithms for the frequency counting: occurrence deliver and a hybrid with diffsets. They reduce the practical computation time considerably. Moreover, the framework of LCM is simple. Generating ppc extensions needs no sophisticated data structure such as binary trees; LCM is implemented with only arrays. Therefore, LCM is fast, and outperforms other algorithms on some sparse datasets.

However, LCM does not have any routine for reducing the database, while many existing algorithms have one. Thus, the performance of LCM is not good for dense datasets with large minimum supports, which involve many unnecessary items and transactions. At FIMI03, we also proposed modifications of LCM, LCMfreq and LCMmax, for enumerating all frequent itemsets and maximal frequent itemsets. Although they are fast for some instances, if LCM is not fast for an instance, they are not fast for that instance either. Existing maximal frequent itemset mining algorithms have efficient pruning methods to reduce the number of iterations, while LCMmax does not. This is another reason for the slowness of LCMmax.

This paper proposes the second version of the LCM algorithms. We added database reduction to LCM, so that problems on dense datasets can be solved in a short time. The second version of LCMmax includes a pruning method, so the computation time is reduced when the number of maximal frequent itemsets is small.
We further developed new algorithms for checking the maximality of a frequent itemset and for taking the closure of an itemset. We compare the performance of the LCM algorithms and the other algorithms submitted to FIMI03 by computational experiments. In many instances, the LCM algorithms outperform the other algorithms.

The organization of the paper is as follows. Section 2 introduces preliminaries. The main algorithms and practical techniques of the LCM algorithms are described in Section 3. Section 4 shows the results of computational experiments, and Section 5 concludes the paper.

2 Preliminaries

Let I = {1, ..., n} be the set of items. A transaction database on I is a set T = {t1, ..., tm} such that each ti is included in I. Each ti is called a transaction. We denote by ||T|| the sum of the sizes of all transactions in T, that is, the size of the database T. A set P ⊆ I is called an itemset.

For an itemset P, a transaction including P is called an occurrence of P. The denotation of P, denoted by T(P), is the set of the occurrences of P. |T(P)| is called the frequency of P, and denoted by frq(P). For a given constant θ, called a minimum support, an itemset P is frequent if frq(P) ≥ θ. If a frequent itemset P is included in no other frequent itemset, P is called maximal. For any itemsets P and Q, T(P ∪ Q) = T(P) ∩ T(Q) holds, and if P ⊆ Q then T(Q) ⊆ T(P). An itemset P is called closed if no other itemset Q satisfies T(P) = T(Q), P ⊆ Q.

Given a set S ⊆ T of transactions, let I(S) be the set of items common to all transactions in S, i.e., I(S) = ∩_{T∈S} T. Then, we define the closure of an itemset P in T, denoted by clo(P), by I(T(P)) (= ∩_{t∈T(P)} t). For every pair of itemsets P and Q, the following properties hold [13, 14].

 (1) If P ⊆ Q, then clo(P) ⊆ clo(Q).
 (2) If T(P) = T(Q), then clo(P) = clo(Q).
 (3) clo(clo(P)) = clo(P).
 (4) clo(P) is the unique smallest closed itemset including P.
 (5) An itemset P is a closed itemset if and only if clo(P) = P.

For an itemset P and an item i ∈ P, let P(i) = P ∩ {1, ..., i} be the subset of P consisting only of elements no greater than i, called the i-prefix of P. An itemset Q is a closure extension of an itemset P if Q = clo(P ∪ {i}) holds for some i ∉ P. If Q is a closure extension of P, then Q ⊃ P, and frq(Q) < frq(P). We call the item with the maximum index in P the tail of P, and denote it by tail(P).

3 Algorithms for Efficient Enumeration

In this section, we explain the techniques used in the second versions of the LCM algorithms. We explain them one by one in the following subsections, comparing them to the techniques used by the other algorithms. The new techniques used in the second version are:

3.2. new database reduction (reducing the frequency counting cost)
3.6. database reduction for fast closedness checking
3.8. database reduction for fast maximality checking
3.7. new pruning algorithm for backtracking-based maximal frequent itemset mining.

The techniques also used in the first versions are:

3.4. occurrence deliver (computing frequencies in linear time)
3.5. ppc extension (generating closed itemsets with neither memory nor duplication)
3.3. hypercube decomposition (fast enumeration by grouping frequent itemsets by equivalence class).

Techniques already used in existing algorithms or in the first version are cited with references to the previous papers.

3.1 Enumerating Frequent Itemsets

Any itemset included in a frequent itemset is itself frequent. Thereby, the property "frequent" is monotone. From this, we can construct any frequent itemset from the empty set by adding items one by one without passing through any infrequent itemset. Roughly speaking, the existing algorithms are classified into two groups, and algorithms in both groups use this property.

The first group is the so-called apriori or level-by-level algorithms [1, 2]. Let Dk be the set of frequent itemsets of size k. Apriori algorithms start with D0, which is {∅}, and compute Dk from Dk−1 in increasing order of k from k = 1. Any itemset in Dk is obtained from an itemset of Dk−1 by adding an item. Apriori algorithms add every item to each itemset of Dk−1, and choose the frequent itemsets among the results. If Dk = ∅ holds for some k, then Dk′ = ∅ holds for any k′ > k. Thus, apriori algorithms stop at such a k. This is the scheme of apriori algorithms.
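As an illustration of the definitions in Section 2 and the level-by-level scheme above, here is a minimal Python sketch (the toy database, θ value, and variable names are illustrative, not taken from any LCM implementation):

```python
# Toy transaction database: each transaction is a set of items (illustrative).
db = [{1, 2, 3}, {1, 2}, {2, 3}, {1, 2, 3}, {2}]
theta = 2  # minimum support

def denotation(P):
    """T(P): the occurrences of itemset P, i.e. the transactions including P."""
    return [t for t in db if P <= t]

def frq(P):
    """frq(P) = |T(P)|."""
    return len(denotation(P))

# Level-by-level (apriori) scheme: D[k] holds the frequent itemsets of size k.
items = sorted(set().union(*db))
D = [[frozenset()]]  # D_0 = {emptyset}
while D[-1]:
    # Build D_k by extending each itemset of D_{k-1} with every item,
    # then keeping only the frequent results.
    candidates = {P | {e} for P in D[-1] for e in items if e not in P}
    D.append([P for P in candidates if frq(P) >= theta])

frequent = [P for level in D for P in level]
```

The loop stops at the first empty level D_k, mirroring the termination rule of the apriori scheme: once D_k = ∅, every larger level is empty as well.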
The other group is the so-called backtracking algorithms [3, 18, 19]. A backtracking algorithm is based on recursive calls. An iteration of a backtracking algorithm inputs a frequent itemset P, and generates new itemsets by adding every item to P. Then, for each generated itemset that is frequent, the iteration makes a recursive call with respect to it. To avoid duplication, an iteration of a backtracking algorithm adds only items with indices larger than the tail of P. We describe the framework of backtracking algorithms as follows.

ALGORITHM BackTracking (P: current solution)
1. Output P
2. For each e ∈ I, e > tail(P) do
3.   If P ∪ {e} is frequent then
       call BackTracking (P ∪ {e})

An execution of a backtracking algorithm gives a tree structure such that the vertices of the tree are iterations, and edges connect two iterations if one of the iterations calls the other. If an iteration I recursively calls another iteration I′, then we say that I is the parent of I′, and I′ is a child of I. For an iteration, the itemset received from the parent is called the current solution.

Apriori algorithms use much memory for storing Dk in memory, while backtracking algorithms use less memory since they keep only the current solution. Backtracking algorithms need no computation for maintaining previously obtained itemsets, so their computation time is generally short. However, apriori algorithms have advantages for the frequency counting.

The LCM algorithms are based on backtracking algorithms, and use efficient techniques for the frequency counting: occurrence deliver and anytime database reduction, described below. Hence, the LCM algorithms compute the frequencies efficiently without keeping previously obtained itemsets in memory.

3.2 Maintaining Databases

In the existing studies, database reduction is said to be important for reducing the computation time. It reduces the input database according to the following rules:

1. remove each item included in fewer than θ transactions
2. remove each item included in all transactions
3. merge identical transactions into one.

Database reduction performs well when the minimum support is large, and many existing algorithms use it. The LCM algorithms also use database reduction.

In the existing studies, the input databases are often stored and maintained using an FP-tree (frequent pattern tree), which is a version of a prefix tree (trie) [9]. By using an FP-tree, we can search for specified transactions in the datasets efficiently. An FP-tree compresses the common prefixes, so we can decrease the memory use. In addition, an FP-tree can detect identical transactions, so we can merge them into one. This merge accelerates the frequency counting. For these reasons, FP-trees are used in many algorithms and implementations.

Although the FP-tree has many advantages, we do not use it in the implementation of LCM; we use simple arrays instead. The main reason is that LCM does not have to search for transactions in the database. The main operation of LCM is tracing the transactions in the denotation of the current solution. Thus, we do not need sophisticated data structures for searching.

The other reason is the computation time of the initialization. If we use a standard binary tree for implementing the FP-tree, the initialization of the input database takes O(||T|| + |T| log |T|) time for constructing the FP-tree in memory. Compared to this, LCM detects identical transactions and stores the database in memory within time linear in the database size. This is because LCM uses radix sort for this task, which sorts the transactions into lexicographic order in linear time. In general, the datasets of data mining problems have many transactions, and each transaction has few items. Thus, ||T|| is usually smaller than |T| log |T|, and LCM has an advantage. The constant factors of the computation time of binary tree operations are also relatively larger than those of array operations; LCM has an advantage at this point too. Again, we recall that LCM never searches for transactions, so each operation required by LCM can be done in constant time.

The FP-tree has an advantage in reducing the memory use. This memory reduction can also reduce the computation time of the frequency counting. To check the efficiency of the reduction, we measured the reduction ratio achieved by the FP-tree for some datasets examined in FIMI03. The result is shown in Table 1. Each cell shows the ratio of the number of items needed to be stored by arrays and by the FP-tree. Usually, the input database is reduced in each iteration, hence we sum up the numbers over all iterations to compute the ratio.
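The three reduction rules of Section 3.2 can be sketched in Python as follows (a minimal sketch on a toy database; the function name and data are illustrative):

```python
from collections import Counter

def reduce_database(db, theta):
    """Apply the three database reduction rules of Section 3.2."""
    db = [frozenset(t) for t in db]
    count = Counter(e for t in db for e in t)
    # Rule 1: drop items occurring in fewer than theta transactions.
    # Rule 2: drop items occurring in every transaction.
    keep = {e for e, c in count.items() if theta <= c < len(db)}
    # Rule 3: merge identical transactions, keeping a multiplicity count
    # so that frequencies can still be computed on the reduced database.
    return Counter(t & keep for t in db)

reduced = reduce_database([{1, 2, 3}, {1, 2, 3}, {2, 3, 4}, {2, 5}], theta=2)
```

Note that after rule 3 each remaining transaction carries a multiplicity, so frequency counts on the reduced database must be weighted by it.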
In the results of our experiments, the ratio was not greater than 3 in many instances. If |I| is small, |T| is large, and the dataset has randomness, as in accidents, the ratio was up to 6. Generally, a binary tree uses three times as much memory as an array. Thus, for many datasets, the performance of the FP-tree does not seem much better than that of an array, in both memory use and computation time.

3.3 Hypercube Decomposition

LCM finds a number of frequent itemsets at once to reduce the computation time [18]. Since the itemsets obtained at once compose a hypercube in the itemset lattice, we call the technique hypercube decomposition. For a frequent itemset P, let H(P) be the set of items e satisfying e > tail(P) and T(P) = T(P ∪ {e}). Then, for any Q ⊆ H(P), T(P ∪ Q) = T(P) holds, and P ∪ Q is frequent. LCMfreq uses this property. For two itemsets P and P ∪ Q, we say that P′ is between P and P ∪ Q if P ⊆ P′ ⊆ P ∪ Q. In the iteration with respect to P, we output all P′ between P and P ∪ H(P). This saves a factor of about 2^{|H(P)|} in the frequency counting.

To avoid duplication, we do not make recursive calls with respect to items included in H(P). Instead of making these recursive calls, we output the frequent itemsets including items of H(P) in the recursive calls with respect to items not included in H(P). When the algorithm makes a recursive call with respect to e ∉ H(P), we pass H(P) to it. In the recursive call, we output all itemsets between P ∪ {e} and P ∪ {e} ∪ H(P) ∪ H(P ∪ {e}). Since any itemset Q satisfies T(P ∪ Q ∪ H(P)) = T(P ∪ Q), the itemsets output in the recursive calls are frequent. We describe hypercube decomposition as follows.

ALGORITHM HypercubeDecomposition (P: current solution, S: itemset)
S′ := S ∪ H(P)
Output all itemsets including P and included in P ∪ S′
For each item e ∈ I \ (P ∪ S′), e > tail(P) do
  If P ∪ {e} is frequent then
    call HypercubeDecomposition (P ∪ {e}, S′)
End for

3.4 Frequency Counting

Generally, the heaviest part of frequent itemset mining is the frequency counting, which is to count the number of transactions including a newly generated itemset. To reduce the computation time, existing algorithms use down project. For an itemset P, down project computes its denotation T(P) by using two subsets P1 and P2 of P. If P = P1 ∪ P2, then T(P) = T(P1) ∩ T(P2). Under the condition that the items of P1 and P2 are sorted by their indices, the intersection can be computed in O(|T(P1)| + |T(P2)|) time. Down project uses this property, and computes the denotations quickly. Moreover, if |T(P1)| < θ or |T(P2)| < θ holds, we can see that P can never be frequent. This also helps to reduce the computation time.

Apriori-type algorithms accelerate the frequency counting by finding a good pair P1 and P2 of subsets of P, such that |T(P1)| + |T(P2)| is small, or either P1 or P2 is infrequent. A backtracking algorithm adds an item e to the current solution P in each iteration, and computes the denotation. By using T({e}), the computation time for the frequency counting is reduced to O(Σ_{e>tail(P)} (|T(P)| + |T({e})|)).

The bitmap method [5] is a technique for speeding up the computation of the intersection in down project. It uses a bitmap image (the characteristic vector) of the denotations. Taking the intersection requires O(|T|) time with a bitmap. However, a 32-bit CPU can take the intersection of 32 bits at once, so roughly speaking the computation time is reduced to 1/32. This method has a disadvantage for sparse datasets, and is not orthogonal to the anytime database reduction described below. From the results of the experiments in FIMI03, the bitmap method seems not to be good for sparse large datasets.

The LCM algorithms use another method for the frequency counting, called occurrence deliver [18, 19]. Occurrence deliver computes the denotations of P ∪ {e} for e = tail(P) + 1, ..., |I| at once by tracing the transactions in T(P). It uses a bucket for each item e to be added, and sets the buckets to the empty set at the beginning. Then, for each transaction t ∈ T(P), occurrence deliver inserts t into the bucket of e for each e ∈ t, e > tail(P). After these insertions, the bucket of e is equal to T(P ∪ {e}). For each transaction t, occurrence deliver takes O(|t ∩ {tail(P)+1, ..., |I|}|) time. Thus, the computation time is O(Σ_{T∈T(P)} |T ∩ {tail(P)+1, ..., |I|}|) = O(|T(P)| + Σ_{e>tail(P)} |T(P ∪ {e})|). This time complexity is smaller than that of down project. We describe the pseudo code of occurrence deliver in the following.

ALGORITHM OccurrenceDeliver (T: database, P: itemset)
1. Set Bucket[e] := ∅ for each item e > tail(P)
2. For each transaction t ∈ T(P) do
3.   For each item e ∈ t, e > tail(P) do
4.     Insert t into Bucket[e]
5.   End for
6. End for
7. Output Bucket[e] for all e > tail(P)

 dataset                                    chess   accidents   BMS-WebView2   T40I10D100K
 minimum support                              40%         30%          0.05%          0.1%
 reduction factor by FP-tree                 2.27        6.01            1.9          1.57
 reduction factor by hypercube decomp.       6.25           1           1.21             1
 reduction factor by apriori (best)          1.11        1.34           1.35          2.85
Table 1: Efficiency test of FP-tree, hypercube decomposition, and apriori: the reduction factor of FP-tree is (sum of # of elements in the database reduced by LCM) / (sum of # of elements in the database reduced by FP-tree) over all iterations; the reduction factor of hypercube decomposition is the average number of frequent itemsets output in an iteration; and the reduction factor of apriori is (sum of Σ_{e>tail(P)} |T(P ∪ {e})|) / (sum of Σ_{e∈F(P)} |T(P ∪ {e})|) over all iterations.

Let F(P) be the set of items e such that e > tail(P) and P ∪ {e} is frequent. Apriori algorithms may find out in a short time that P ∪ {e} is infrequent; thus, in the best case, their computation time can be reduced to O(Σ_{e∈F(P)} |T(P ∪ {e})|). If Σ_{e>tail(P), e∉F(P)} |T(P ∪ {e})| is large, occurrence deliver will be slow.

To decrease Σ_{e>tail(P), e∉F(P)} |T(P ∪ {e})|, the LCM algorithms sort the indices of the items e in increasing order of |T({e})|. As we can see in Table 1, this sort reduces Σ_{e>tail(P), e∉F(P)} |T(P ∪ {e})| to 1/4 of Σ_{e>tail(P)} |T(P ∪ {e})| in many cases. Since apriori algorithms take much time to maintain previously obtained itemsets, the possibility of a speedup by apriori algorithms is not so large.

The LCM algorithms further speed up the frequency counting by iteratively reducing the database. Suppose that an iteration I of a backtracking algorithm receives a frequent itemset P from its parent. Then, in any descendant iteration of I, no item with an index smaller than tail(P) is added. Hence, any such item can be removed from the database during the execution of the descendant iterations. Similarly, the transactions not including P never include the current solution of any descendant iteration, so such transactions can be removed during the execution of the descendant iterations. Indeed, infrequent items can be removed, and identical transactions can be merged.

According to this, the LCM algorithms recursively reduce the database during the execution of the recursive calls. Before each recursive call, the LCM algorithms generate a reduced database according to the above discussion, and pass it to the recursive call. We call this technique anytime database reduction.

Anytime database reduction reduces the computation time of the iterations located at the lower levels of the recursion tree. In the recursion tree, many iterations are on the lower levels and few iterations are on the upper levels. Thus, anytime database reduction is expected to be efficient. In our experiments, anytime database reduction works quite well. Table 2 shows the efficiency of anytime database reduction. We sum up over all iterations the sizes of the databases received from the parent, both with anytime database reduction and without it. Each cell shows the sum. The reduction ratio is large especially if the dataset is dense and the minimum support is large.

3.5 Prefix Preserving Closure Extension

Many existing algorithms for mining closed itemsets are based on frequent itemset mining. That is, the algorithms enumerate frequent itemsets, and output those that are closed. This approach is efficient when the number of frequent itemsets and the number of frequent closed itemsets do not differ much.
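The frequent-then-filter approach above can be sketched as follows, using the closure definition of Section 2 (an illustrative sketch, not the LCM method; `frequent_itemsets` stands in for any frequent itemset enumerator, here a naive power-set scan):

```python
from itertools import combinations

db = [{1, 2, 3}, {1, 2}, {1, 2, 3}, {2, 3}]  # toy database (illustrative)
theta = 2
items = sorted(set().union(*db))

def denotation(P):
    """T(P): the transactions that include itemset P."""
    return [t for t in db if P <= t]

def closure(P):
    """clo(P): the intersection of all occurrences of P.
    P is assumed frequent, so its denotation is non-empty."""
    occ = denotation(P)
    result = set(occ[0])
    for t in occ[1:]:
        result &= t
    return frozenset(result)

def frequent_itemsets():
    """Naive frequent itemset enumerator: scan all non-empty subsets."""
    for r in range(1, len(items) + 1):
        for P in map(frozenset, combinations(items, r)):
            if len(denotation(P)) >= theta:
                yield P

# Output only the closed itemsets among the frequent ones.
closed = [P for P in frequent_itemsets() if closure(P) == P]
```

By property (5) of Section 2, the filter `closure(P) == P` keeps exactly the closed itemsets; the cost of the enumeration itself is what ppc extension avoids.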
 dataset and                         connect          pumsb      BMS-WebView2         T40I10D100K
 minimum support                        50%             60%              0.1%                0.03%
 Database reduction                188319235      2125460007          2280260           1704927639
 Anytime database reduction           538931         7777187           521576             77371534
 Reduction factor                      349.4           273.2               4.3                 22.0
Table 2: Accumulated number of transactions in database in all iterations

thus they will be not efficient. Many pruning meth-              ate the closure of P . By adding to P all items e such
ods have been developed for speeding up, however                 that f rq(P ) = f rq(P ∪ {e}), we can construct the
they are not complete. Thus, the computation time                closure of P . We call the second closure operation.
is not bounded by a linear function in the number of                LCM uses closure operations for generating ppc ex-
frequent closed itemsets. There is a possibility of over         tensions. Similar to the frequency counting, we use
linear increase of computation time in the number of             database reduction for closure operation. Suppose
output.                                                          that the current solution is P , the reduced database
   LCM uses prefix preserving closure extension (ppc-            is composed of transactions S1 , ..., Sh , and each Sl is
extension in short) for generating closed itemsets[18,           obtained from transactions T1l , ..., Tkl of the original
19]. For a closed itemset P , we define the closure              database. For eachTSl , we define the interior inter-
tail clo tail(P ) by the item i of the minimum index             section In(Sl ) by T ∈{T l ,...,T l } T . Here the closure
                                                                                   T        1     k
satisfying clo(P (i)) = P . clo tail(P ) is always in-           of P is equal to S∈{S1 ,...,Sh } In(S). Thus, by using
cluded in P. We say that P′ is a ppc extension of P if P′ = clo(P ∪ {e}) and P′(e − 1) = P(e − 1) hold for an item e > clo_tail(P). Let P0 be the itemset satisfying T(P0) = T. Any closed itemset P′ ≠ P0 is a ppc extension of another closed itemset P, and such P is unique for P′. Moreover, the frequency of P is strictly larger than that of P′; hence ppc extension induces a rooted tree on the frequent closed itemsets. LCM starts from P0 and finds all frequent closed itemsets in a depth-first manner by recursively generating ppc extensions. The proofs concerning ppc extension are described in [18, 19].
   By ppc extension, the time complexity is bounded by a linear function in the number of frequent closed itemsets. Hence, the computation time of LCM is never super-linear in the number of frequent closed itemsets.

3.6    Closure Operation

To enumerate closed itemsets, we have to check whether the current solution P is a closed itemset or not. In the existing studies, there are two methods for this task. The first method is to store in memory the previously obtained itemsets that are currently maximal among the itemsets having the identical denotation. In this method, we find frequent itemsets one by one, and store them in memory while removing any itemset included in another itemset having the identical denotation. After finding all frequent itemsets, only closed itemsets remain in memory. We call this storage method. The second method is to generate the closure of the current itemset directly. When transactions are merged during database reduction, each merged transaction keeps the intersection of the original transactions it represents; we call this its interior intersection. By using the interior intersections, we can efficiently construct the closure of P.
   When we merge transactions to reduce the database, the interior intersections can be updated efficiently by taking the intersection of their interior intersections. In the same way as for frequency counting, we can remove infrequent items from the interior intersections for further reduction. The computation time for the closure operation in LCM depends on the size of the database, but not on the number of previously obtained itemsets. Thus, storage method has an advantage if the number of frequent closed itemsets is small. However, for instances with many frequent closed itemsets, which take a long time to solve, LCM has the advantage.

3.7    Enumerating Maximal Frequent Itemsets

Many existing algorithms for maximal frequent itemset enumeration are based on the enumeration of frequent itemsets. In a breadth-first or depth-first manner, they enumerate frequent itemsets and output the maximal itemsets among them. To reduce the computation time, the algorithms prune unnecessary itemsets and recursive calls.
   Similar to these algorithms, LCMmax enumerates closed itemsets by backtracking, and outputs the maximal itemsets among them. It uses a pruning to cut off unnecessary branches of the recursion. The pruning is based on a re-ordering of the indices of the items in each iteration. We explain the re-ordering in the following.
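As an illustration, the ppc extension recursion and the closure operation described above can be sketched in Python. This is a minimal sketch, not LCM's array-based ANSI C implementation: the helper names (`occurrences`, `closure`, `lcm`) are ours, closures are computed naively from occurrence sets, and database reduction and interior intersections are omitted.

```python
def occurrences(db, P):
    """T(P): the transactions that include every item of P."""
    return [t for t in db if P <= t]

def closure(db, P):
    """clo(P): the intersection of all transactions in T(P)."""
    return frozenset.intersection(*occurrences(db, P))

def lcm(db, P, last, minsup, out):
    """Enumerate frequent closed itemsets by ppc extension from the closed itemset P."""
    out.append(P)
    for e in sorted(frozenset().union(*db)):
        if e <= last or e in P:
            continue
        if len(occurrences(db, P | {e})) < minsup:   # frequency check
            continue
        C = closure(db, P | {e})
        # ppc condition C(e-1) = P(e-1): the prefix below e is preserved,
        # so each closed itemset is generated from exactly one parent.
        if {i for i in C if i < e} == {i for i in P if i < e}:
            lcm(db, C, e, minsup, out)

db = [frozenset(t) for t in ([1, 2, 5], [1, 2, 3, 5], [2, 3], [1, 2, 5], [2, 4])]
root = frozenset.intersection(*db)   # P0, the closed itemset with T(P0) = T
closed = []
lcm(db, root, 0, 2, closed)          # minsup = 2 yields {2}, {1,2,5}, {2,3}
```

Because each recursive call strictly decreases the frequency and no previously found itemsets are stored, the memory use of this recursion is independent of the number of closed itemsets, as stated above.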
   Let us consider a backtracking algorithm for enumerating frequent itemsets. Let P be the current solution of an iteration of the algorithm. Suppose that P′ is a maximal frequent itemset including P. LCMmax assigns new indices to the items with indices larger than tail(P) so that any item in P′ has a larger index than any item not in P′. Note that this re-ordering of indices has no effect on the correctness of the algorithm.
   Let e > tail(P) be an item in P′, and consider the recursive call with respect to P ∪ {e}. Any frequent itemset P̂ found in the recursive call is included in P′, since every item having an index larger than e is included in P′, and the recursive call adds to P only items with indices larger than e. From this, we can see that, by the re-ordering of indices, recursive calls with respect to items in P′ ∩ H generate no maximal frequent itemset other than P′.
   According to this, an iteration of LCMmax chooses an item e∗ ∈ H and generates a recursive call with respect to P ∪ {e∗} to obtain a maximal frequent itemset P′. It then re-orders the indices of the items other than e∗ as above, and generates recursive calls with respect to each e > tail(P) not included in P′ ∪ {e∗}. In this way, we save the computation time for finding P′ and, by finding a large itemset, increase the efficiency of this approach. In the following, we describe LCMmax.

ALGORITHM LCMmax (P: itemset, H: items to be added)
1. H′ := the set of items e in H s.t. P ∪ {e} is frequent
2. If H′ = ∅ then
3.    If P ∪ {e} is infrequent for any item e then
         output P; return
4.    End if
5. End if
6. Choose an item e∗ ∈ H′; H′ := H′ \ {e∗}
7. LCMmax (P ∪ {e∗}, H′)
8. P′ := the frequent itemset of maximum size
      found in the recursive call in step 7
9. For each item e ∈ H \ P′ do
10.    H′ := H′ \ {e}
11.    LCMmax (P ∪ {e}, H′)
12. End for

3.8    Checking Maximality

When LCMmax finds a frequent itemset P, it checks whether the current solution is maximal or not. We call this operation the maximality check. The maximality check is a heavy task, and thus many existing algorithms avoid it. They store in memory the maximal itemsets among the previously obtained frequent itemsets, and update them whenever a new itemset is found. When such an algorithm terminates, having obtained all frequent itemsets, only the maximal frequent itemsets remain in memory. We call this storage method. If the number of maximal frequent itemsets is small, storage method is efficient. However, if the number is large, storage method needs much memory. Moreover, whenever a frequent itemset is newly found, storage method checks whether the itemset is included in some itemset in memory; if the number of frequent itemsets is large, this operation takes a long time.
   To avoid the disadvantage of storage method, LCMmax performs an explicit maximality check. LCMmax checks maximality by looking for an item e such that P ∪ {e} is frequent: if and only if such an e exists, P is not maximal. To perform this check efficiently, we use the reduced database. Let us consider an iteration of LCMmax with respect to a frequent itemset P. The LCM algorithms reduce the database by anytime database reduction for frequency counting. Suppose that the reduced database is composed of transactions S_1, ..., S_h, and that each S_l is obtained by merging transactions T_1^l, ..., T_k^l of the original database. Let H be the set of items to be added in the iteration, and suppose that we remove from H all items e such that P ∪ {e} is infrequent. Then, for any l, T_1^l ∩ H = T_2^l ∩ H = ... = T_k^l ∩ H holds. For an item e and a transaction S_l, we define the weight w(e, S_l) as the number of transactions among T_1^l, ..., T_k^l that include e. The frequency of P ∪ {e} is then the sum of w(e, S) over S ∈ {S_1, ..., S_h}. Thus, by using the weights, we can check maximality in time linear in the size of the reduced database.
   When we merge transactions to reduce the database, the weights can be updated easily: for each item e, we take the sum of w(e, S) over all transactions S to be merged. In the same way as for frequency counting, we can remove infrequent items from the database for the maximality check, for further reduction.
   The computation time for the maximality check in LCMmax depends on the size of the database, but not on the number of previously obtained itemsets. Thus, storage method has an advantage if the number of maximal frequent itemsets is small, but for instances with many maximal frequent itemsets, which take a long time to solve, LCMmax has the advantage.
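The weight-based maximality check above can be sketched as follows. This is a simplified illustration under our own naming (`reduce_db` and `is_maximal` are hypothetical helpers, not the paper's C routines): the conditional database T(P) is merged on the remaining candidate items H, each merged transaction S keeps the weights w(e, S), and P is maximal iff no candidate item reaches the minimum support.

```python
from collections import Counter

def reduce_db(occ, H):
    """Merge transactions of the conditional database T(P) that agree on the
    candidate items H; keep w(e, S), the number of original transactions
    merged into S that contain item e."""
    groups = {}
    for t in occ:
        groups.setdefault(frozenset(t & H), []).append(t)
    return [(key, Counter(e for t in ts for e in t & H))
            for key, ts in groups.items()]

def is_maximal(reduced, H, minsup):
    """P is maximal iff no e in H makes P ∪ {e} frequent; the frequency of
    P ∪ {e} is the sum of w(e, S) over the merged transactions S."""
    total = Counter()
    for _, w in reduced:
        total.update(w)
    return all(total[e] < minsup for e in H)

db = [frozenset(t) for t in ([1, 2, 5], [1, 2, 3, 5], [2, 3], [1, 2, 5], [2, 4])]
P = frozenset({1, 2, 5})
occ = [t for t in db if P <= t]       # T(P)
H = {3, 4}                            # candidate extension items outside P
print(is_maximal(reduce_db(occ, H), H, 2))   # True: P is maximal at minsup 2
```

As in the text, the check runs in time linear in the size of the reduced database and needs no store of previously found itemsets.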
[Figure 1 shows log-scale plots of CPU time (sec) versus minimum support (%) on BMS-WebView-2, with one panel each for all, closed, and maximal frequent itemset mining; the curves compare afopt, fpgrowth, lcm, mafia, and, for all-itemset mining, dci and patriciamine, at minimum supports from 0.11% down to 0.01%.]

                                    Figure 1: Results 1
4    Computational Experiments

In this section, we show the results of our computational experiments. We implemented our three algorithms LCM, LCMfreq, and LCMmax. They are coded in ANSI C and compiled by gcc. The experiments were executed on a notebook PC with an AMD Athlon XP 1600+ and 224MB of memory. The performance of the LCM algorithms is compared with the algorithms that scored well at FIMI '03: fpgrowth [8], afopt [11], MAFIA [5, 6], kDCI [12], and PATRICIAMINE [16]. We note that kDCI and PATRICIAMINE are only for all frequent itemset mining. To reduce the time for the experiments, we stopped an execution when an algorithm took more than 10 minutes; we do not plot a point if the computation time exceeded 10 minutes or the run terminated abnormally. The results are displayed in Figures 1 and 2. In each graph, the horizontal axis is the minimum support, and the vertical axis is the CPU time on a log scale.
   From the performances of the implementations, the instances were classified into three groups within which the results are similar. Due to space limitations, we show one instance as a representative of each group.
   The first group is composed of BMS-WebView1, BMS-WebView2, BMS-POS, T10I4D100K, kosarak, and retail. These datasets have many items and transactions but are sparse. We call these datasets sparse datasets. We chose BMS-WebView2 as the representative.
   The second group is composed of datasets taken
[Figure 2 shows log-scale plots of CPU time (sec) versus minimum support (%) for chess (all, closed, maximal) and accidents (all, closed, maximal) frequent itemset mining, comparing the same algorithms as Figure 1, at minimum supports from 90% down to 10–20%.]

                                    Figure 2: Results 2
from UCI-Machine Learning Repository1 , connect,              5    Conclusion
chess, mushrooms, pumsb, and pumsb-star. These
datasets have many transactions but few items. We             In this paper, we proposed a fast implementation of
call these datasets middle density datasets. As a rep-        LCM for enumerating frequent closed itemsets, which
resentative, we show the result of chess.                     is based on prefix preserving closure extension. We
   The third group is accidents. It is different from         further gave implementations LCMfreq and LCM-
any other dataset. It has huge number of transac-             max for enumerating all frequent itemsets and max-
tions, but few items. Transactions includes many              imal frequent itemsets by modifying LCM. We show
items, so the dataset is very dense. We call this             by computational experiments that our implements
dataset very dense dataset.                                   of LCM, LCMfreq and LCMmax perform above the
                                                              other algorithms for many datasets, especially for
   In almost instances and minimum supports, LCM
                                                              sparse datasets. There is a possibility of speeding up
algorithms perform well. When the minimum sup-
                                                              LCM algorithms by developing more efficient maxi-
port is large, LCM algorithms are the fastest for all
                                                              mality checking algorithms, or developing a hybrid of
instances, because of the fast initialization. For all
                                                              array and FP-tree like data structures.
instances with any minimum support, LCM outper-
forms other closed itemset mining algorithms. This
shows the efficiency of ppc extension.                        Acknowledgment
   For sparse datasets, LCM algorithms are the
fastest, for any minimum support. The efficiency of           This research is supported by joint-research funds of
FP-tree is not large, and occurrence deliver works ef-        National Institute of Informatics.
ficiently. The performances of afopt and fp-growth
are quite similar for these problems. They are the            References
second bests, and 2 to 10 times slower than LCM
algorithms. For enumerating frequent closed item-              [1] R. Agrawal and R. Srikant, “Fast Algorithms for
sets, they take much time when the number of closed                Mining Association Rules in Large Databases,”
itemsets is large. Although PATRICIAMINE is fast                   In Proceedings of VLDB ’94, pp. 487–499, 1994.
as much as fp-groth and afopt, it abnormally ter-
minated for some instances. kDCI is slow when the              [2] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen
number of frequent itemsets is large. MAFIA was the
slowest for these instances, for any minimum support.
   For middle-density datasets, LCM is the fastest
for all instances on closed itemset mining. On all
frequent itemset mining and maximal frequent itemset
mining, LCMfreq and LCMmax are the fastest for large
minimum supports, for any dataset. For small minimum
supports, LCMfreq and LCMmax are the fastest on half
of the instances. For the other instances, the results
vary from case to case: each algorithm won in some cases.
   For accidents, the LCM algorithms are the fastest
when the minimum support is large. For small supports,
LCM(closed) is the fastest; however, LCMfreq and
LCMmax are slower than fp-growth. For this dataset,
the FP-tree is very effective, and the compression
ratio is up to 6. Bitmaps are also efficient because
of the density. Hence, the computation time for
the frequency counting is short in the execution of
existing implementations. However, by the ppc exten-
sion, LCM keeps an advantage for closed itemset min-
ing; hence LCM(closed) is the fastest.

   1 http://www.ics.uci.edu/~mlearn/MLRepository.html
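The advantage of the ppc extension rests on the definition of closedness used throughout the paper: an itemset is closed iff it equals the intersection of all transactions that contain it. The following is a naive brute-force reference miner illustrating that definition only; it is not the LCM algorithm or its ppc extension, and it is exponential in the number of items. All names (`closure`, `frequent_closed`, the toy database `db`) are illustrative.

```python
from functools import reduce
from itertools import combinations

def closure(itemset, transactions):
    """Intersection of all transactions containing `itemset`."""
    occ = [t for t in transactions if itemset <= t]
    return frozenset(reduce(lambda a, b: a & b, occ)) if occ else frozenset()

def frequent_closed(transactions, minsup):
    """Enumerate frequent closed itemsets by exhaustive search.

    For illustration only: LCM enumerates the same family without
    storing previously found itemsets, via ppc extension.
    """
    items = sorted(set().union(*transactions))
    result = set()
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            s = frozenset(cand)
            support = sum(1 for t in transactions if s <= t)
            # closed and frequent: closure(s) == s and support >= minsup
            if support >= minsup and closure(s, transactions) == s:
                result.add(s)
    return result

db = [frozenset(t) for t in ([1, 2, 3], [1, 2], [1, 3], [2, 3], [1, 2, 3])]
print(sorted(map(sorted, frequent_closed(db, minsup=3))))
# → [[1], [1, 2], [1, 3], [2], [2, 3], [3]]
```

Here {1, 2, 3} is excluded because its support is 2 < 3, while every frequent subset happens to equal its own closure and is therefore closed.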