=Paper= {{Paper |id=Vol-126/paper-3 |storemode=property |title=DCI Closed: A Fast and Memory Efficient Algorithm to Mine Frequent Closed Itemsets |pdfUrl=https://ceur-ws.org/Vol-126/lucchese.pdf |volume=Vol-126 |dblpUrl=https://dblp.org/rec/conf/fimi/LuccheseOP04 }} ==DCI Closed: A Fast and Memory Efficient Algorithm to Mine Frequent Closed Itemsets== https://ceur-ws.org/Vol-126/lucchese.pdf
                 DCI Closed: a Fast and Memory Efficient Algorithm
                         to Mine Frequent Closed Itemsets

                        Claudio Lucchese                                         Salvatore Orlando
                         ISTI “A. Faedo”                                    Computer Science Dept.
             Consiglio Nazionale delle Ricerche (CNR)                         Università Ca’ Foscari
                            Pisa, Italy.                                           Venezia, Italy.
             email: claudio.lucchese@isti.cnr.it                          email: orlando@dsi.unive.it
                                                   Raffaele Perego
                                                  ISTI “A. Faedo”
                                      Consiglio Nazionale delle Ricerche (CNR)
                                                     Pisa, Italy.
                                      email: raffaere.perego@isti.cnr.it


                         Abstract                                  cover all the itemsets having support higher than (or
                                                                   equal to) a given threshold min supp.
   One of the main problems raising up in the frequent                The paper is organized as follows. In Sect. 2 we in-
closed itemsets mining problem is the duplicate detection.         troduce closed itemsets and describe a framework for
In this paper we propose a general technique for promptly          mining them. This framework is shared by all the algo-
detecting and discarding duplicate closed itemsets, with-          rithms surveyed in Sect. 3. In Sect. 4 we formalize the
out the need of keeping in the main memory the whole set           problem of duplicates and propose our technique. Sec-
of closed patterns.                                                tion 5 proposes an implementation of our technique
   Our approach can be exploited with substantial perfor-          and discusses the experimental results obtained. Fol-
mance benefits by any algorithm that adopts a vertical             low some concluding remarks.
representation of the dataset. We implemented our tech-
nique within a new depth-first closed itemsets mining al-          2. Closed itemsets
gorithm. The experimental evaluation demonstrates that
our algorithm outperforms other state of the art algo-                The concept of closed itemset is based on the two
rithms like CLOSET+ and FPCLOSE.                                   following functions f and g:

                                                                              f (T ) = {i ∈ I | ∀t ∈ T, i ∈ t}
                                                                               g(I) = {t ∈ D | ∀i ∈ I, i ∈ t}
1. Introduction
                                                                   where T and I, T ⊆ D and I ⊆ I are, respectively,
   Frequent itemsets mining is the most important and              subsets of all the transactions and items occurring in
demanding task in many data mining applications. To                dataset D. Function f returns the set of itemsets in-
describe the mining problem we introduce the follow-               cluded in all the transactions in T , while function g re-
ing notation. Let I = {a1 , ..., aM } be a finite set of           turns the set of transactions supporting a given item-
items, and D a finite set of transactions (the dataset),           set I. Since the set of transaction g(I) can be repre-
where each transaction t ∈ D is a list of distinct items           sented by a list of transaction identifiers, we refer to
t = {i0 , i1 , ..., iT }, ij ∈ I. A k-itemset is a sequence of k   g(I) as the tid-list of I. We can introduce the follow-
distinct items I = {i0 , i1 , ..., ik } | ij ∈ I, sorted on the    ing definition:
basis of some total order relation between item literals.
The number of transactions in the dataset including an             Definition 1 An itemset I is said to be closed if and
itemset I is defined as the support of I (or supp(I)).             only if
Mining all the frequent itemsets from D requires to dis-                      c(I) = f (g(I)) = f ◦ g(I) = I
where the composite function f ◦ g is called Galois oper-                                 The algorithms for mining frequent closed itemsets
ator or closure operator.                                                              adopt a strategy based on two main steps: Search space
                                                                                       browsing, and Closure calculation. In fact, they browse
                                                                                       the search space by traversing the lattice of frequent
   The closure operator defines a set of equivalence                                   itemsets from an equivalence class to another, and they
classes over the lattice of frequent itemsets: two item-                               calculate the closure of the frequent itemsets visited in
sets belong to the same equivalence class iff they have                                order to determine the maximal elements (closed item-
the same closure, i.e. they are supported by the same                                  sets) of the corresponding equivalence classes. Let us
set of transactions. We can also show that an itemset I                                analyze in some depth these two phases.
is closed if no superset of I with the same support ex-
                                                                                       Browsing the search space. The goal of an effec-
ists. Thus, a closed itemset is also the maximal itemset
                                                                                       tive browsing strategy should be to devise a spanning
of an equivalence class. Mining all these maximal ele-
                                                                                       tree over the lattice of frequent itemsets, visiting ex-
ments of each equivalence class corresponds to mine all
                                                                                       actly a single itemset in each equivalence class. We
the closed itemsets.
                                                                                       could in fact mine all the closed itemsets by calculat-
   Figure 1.(a) shows the lattice of frequent itemsets                                 ing the closure of just an itemset per equivalence class.
derived from the simple dataset reported in Fig. 1.(b),                                Let us call the itemsets used to compute closures dur-
mined with min supp = 1. We can see that the item-                                     ing the visit closure generators.
sets with the same closure are grouped in the same                                        Some algorithms choose the minimal elements (or
equivalence class. Each equivalence class contains ele-                                key patterns) of each equivalence class as closure gen-
ments sharing the same supporting transactions, and                                    erators. Key patterns form a lattice, and this lattice can
closed itemsets are their maximal elements. Note that                                  be easily traversed with a simple apriori-like algorithm.
the number of closed itemsets (five) is remarkably lower                               Unfortunately, an equivalence class can have more than
than the number of frequent itemsets (fifteen).                                        one minimal element leading to the same closed item-
                                                                                       set. For example, the closed itemset {ABCD} of Fig.
                                                                                       1 may be mined twice, since it can be obtained as clo-
                                         ABCD 1                                        sure of the two minimal elements of its equivalence class
                                                                                       {AD} and {CD}.
                                                                                          Other algorithms use instead a different technique
               ABC2            ABD1               ACD1                BCD1             that we call closure climbing. As soon as a generator
                                                                                       is devised, its closure is computed, and new genera-
                                                                                       tors are built as supersets of the closed itemset discov-
        AB 2          AC 2         AD 1         BC 2             BD 2           CD 1   ered. Since closed itemsets are maximal elements, this
                                                                                       strategy always guarantees to jump from an equiva-
                                                                                       lence class to another. Unfortunately, it does not guar-
                A 2              B 3                 C 3                 D 2           antee that the new generators belong to equivalence
                                                                                       classes that were not previously visited.
                                           Ø                                              Regardless of the strategy adopted, some kind of
                                                                        Equivalence
                                                                                       duplicate check has thus to be introduced. A naive ap-
                 B 3                     ABD1
                                                  SUPPORT
                        Frequent
                        Closed Itemset             Frequent itemset
                                                                        Class
                                                                                       proach to check for duplicates is to search for each gen-
                                     (a)                                               erated closed itemset among all the ones mined so far.
                                                                                       Indeed, in order to avoid to perform a lot of expen-
                  TID                    items                                         sive closure operations, several algorithms exploit the
                   1             B       D                                             following lemma:
                   2             A       B C                D                          Lemma 1 Given two itemsets X and Y , if X ⊂ Y and
                   3             A       B C                                           supp(X) = supp(Y ) (i.e. |g(X)| = |g(Y )|), then c(X) =
                   4             C                                                     c(Y ).
                                     (b)
                                                                                       Proof. If X ⊂ Y , then g(Y ) ⊆ g(X). Since |g(Y )| =
  Figure 1. (a) Lattice of frequent itemsets with                                      |g(X)| then g(Y ) = g(X). g(X) = g(Y ) ⇒ f (g(X)) =
  closed itemsets and equivalence classes given by                                     f (g(Y )) ⇒ c(X) = c(Y ).
  the dataset (b).                                                                     
                                                                                          Therefore, given a generator X, if we find an al-
                                                                                       ready mined closed itemsets Y that set-includes X,
and the supports of Y and X are identical, we can            minimal elements of each equivalence class. Since a k-
conclude that c(X) = c(Y ). Hence we can prune the           itemset is a key pattern if and only if no one of its
generator X without actually calculating its closure.        (k − 1)-subsets has the same support, minimal ele-
Also this duplicates checking strategy is however ex-        ments are discovered with an intensive subset check-
pensive, both in time and space. In time because we          ing. In its second step, A-CLOSE calculates the closure
may need to search for the inclusion of each genera-         of all the minimal generators previously found. Since a
tor in a huge number of closed itemsets, in space be-        single equivalence class may have more than one min-
cause to perform it we need to keep all the closed item-     imal itemsets, redundant closures may be computed.
sets in the main memory. To reduce such costs, closed        A-CLOSE performance suffers from the high cost of
sets can be stored in compact prefix tree structures, in-    the off-line closure calculation and the huge number of
dexed by one or more levels of hashing.                      subset searches.
                                                                The authors of FP-Growth [2] (J. Han, et al.) pro-
Calculating Closures. To calculate the closure of an
                                                             posed CLOSET [6] and CLOSET+ [7]. These two algo-
itemset X, we have to apply the Galois operator c. Ap-
                                                             rithms inherit from FP-Growth the compact FP-Tree
plying c requires to intersect all the transactions of the
                                                             data structure and the exploration technique based on
dataset including X. Another way to calculate the clo-
                                                             recursive conditional projections of the FP-Tree. Fre-
sure is given by the following lemma:
                                                             quent single items are detected after a first scan of the
Lemma 2 Given an itemset X and an item i, if g(X) ⊆          dataset, and with another scan the pruned transactions
g(i) ⇒ i ∈ c(X).                                             are inserted in the FP-Tree stored in the main memory.
                                                             With a depth first browsing of the FP-Tree and recur-
Proof. Since g(X ∪ i) = g(X) ∩ g(i), g(X) ⊆ g(i) ⇒           sive conditional FP-Tree projections, CLOSET mines
g(X ∪ i) = g(X). Therefore, if g(X ∪ i) = g(X) then          closed itemsets by closure climbing, and growing up
f (g(X ∪ i)) = f (g(X)) ⇒ c(X ∪ i) = c(X) ⇒ i ∈ c(X).        frequent closed itemsets with items having the same
                                                            support in the conditional dataset. Duplicates are dis-
    From the above lemma, we know that if g(X) ⊆ g(i),       covered with subset checking by exploiting Lemma 2.
then i ∈ c(X). Therefore, by performing this inclusion       Thus, all closed sets previously discovered are kept in
check for all the items in I not included in X, we can       the main memory, and are indexed by a two level hash.
incrementally compute c(X). Note that, since the set         CLOSET+ is similar to CLOSET, but exploits an
g(i) can be represented by the tid-list associated with      adaptive behaviour in order to fit both sparse and dense
i, this suggests the adoption of a vertical format for the   datasets. As regards the duplicate detection technique,
input dataset in order to efficiently implement the in-      CLOSET+ introduces a new one for sparse datasets
clusion check: g(X) ⊆ g(i).                                  named upward checking. This technique consists in the
    The closure calculation can be performed off-line or     intersection of every path of the initial FP-Tree leading
on-line. In the first case we firstly retrieve the com-      to a candidate closed itemset X, if such intersection is
plete set of generators, and then we calculate their clo-    empty then X is actually closed. The rationale for us-
sures. In the second case, as soon as a new generator        ing it only in sparse dataset is that the transactions
is discovered, its closure is computed on-the-fly.           are short, a thus the intersections can be performed
    The algorithms that compute closures on-line are         quickly. Note that with dense dataset, where the trans-
generally more efficient. This is because they can adopt     actions are usually longer, closed itemsets equivalence
the closure climbing strategy, according to which new        classes are large and the number of duplicates is high,
generators are created recursively from closed itemsets.     such technique is not used because of its inefficiency,
These generators are likely longer than key patterns,        and CLOSET+ steps back using the same strategy of
which are the minimal itemsets of the equivalence class      CLOSET, i.e. storing every mined closed itemset.
and thus are the shorter possible generators. Obviously,        FPCLOSE [1], which is a variant of CLOSET+, re-
the longer the generator is, the fewer checks (on fur-       sulted to be the best algorithm for closed itemsets min-
ther items to add) are needed to get its closure.            ing presented at the ICDM 2003 Frequent Itemset Min-
                                                             ing Implementations Workshop.
3. Related Works                                                CHARM [9] (M. Zaki, et al.) performs a bottom-up
                                                             depth-first browsing of a prefix tree of frequent item-
    The first algorithm proposed for mining closed item-     sets built incrementally. As soon as a frequent itemset
sets was A-CLOSE [5] (N. Pasquier, et al.). A-CLOSE          is generated, its tid-list is compared with those of the
first browses level-wise the frequent itemsets lattice by    other itemsets having the same parent. If one tid-list
means of an Apriori-like strategy, and mines all the         includes another one, the associated nodes are merged
since both the itemsets surely belong to the same equiv-                  that do not result order preserving according to the def-
alence class. Itemset tid-lists are stored in each node of                inition below.
the tree by using the diff-set technique [8]. Since differ-
                                                                          Definition 2 A generator X = Y ∪ i, where Y is a
ent paths can however lead to the same closed itemset,
                                                                          closed itemset and i 6∈ Y , is said to be order preserv-
also in this case a duplicates pruning strategy is imple-
                                                                          ing one iff i ≺ (c(X) \ X).
mented. CHARM adopts a technique similar to that of
CLOSET, by storing in the main memory the closed                             The following Theorem shows that, for any closed
itemsets indexed by single level hash.                                    itemset Y , it is possible to find a sequence of order pre-
   According to our classification, A-CLOSE exploits a                    serving generators in order to climb a sequence of clo-
key pattern browsing strategy and performs off-line clo-                  sure itemsets and arrive at Y . The following Corollary
sure calculations, while CHARM, CLOSET+ and FP-                           states that this sequence is unique.
CLOSE are different implementations of the same clo-                      Theorem 1 For each closed itemset Y 6= c (∅), there ex-
sure climbing strategy with incremental closure com-                      ists a sequence of n (n ≥ 1) items i0 ≺ i1 ≺ ... ≺ in−1
putation.                                                                 such that

4. Removing duplicate generators of                                       {gen0 , gen1 , . . . , genn−1 } = {Y0 ∪i0 , Y1 ∪i1 , . . . , Yn−1 ∪in−1 }
   closed itemsets
                                                                          where the various geni are order preserving generators,
   In this Section we discuss a particular visit of the lat-              with Y0 = c (∅), Yj+1 = c(Yj ∪ ij ) ∀j ∈ [0, n − 1] and
tice of frequent sets used by our algorithm to identify                   Y = Yn .
unique generators of each equivalence class, and com-
pute all the closed patterns through the minimum num-                     Proof. First of all, we show that given a generic gener-
ber of closure calculations.                                              ator gen ⊆ Y , c(gen) ⊆ Y . More formally, if ∃Y 0 such
   In our algorithm, we use closure climbing to browse                    that Y 0 is a closed itemset, and Y 0 ⊂ Y , and we extend
the search space, find generators and compute their                       Y 0 with an item i ∈ Y \ Y 0 to obtain gen = Y 0 ∪ i ⊆ Y ,
closure. As soon as a generator is found, its closure is                  then ∀j ∈ c(gen), j ∈ Y .
computed, and new generators are built as supersets                          Note that g(Y ) ⊆ g(gen) because gen ⊆ Y . More-
of the closed itemset discovered so far. So, each gen-                    over, if j ∈ c(gen), then g(c(gen)) ⊆ g(j). Thus, since
erator gen browsed by our algorithm can be generally                      g(Y ) ⊆ g(gen), then g(Y ) ⊆ g(j) also holds, so that
represented as gen = Y ∪ i, where Y is a closed item-                     j ∈ c(Y ) too. So, if j 6∈ Y hold, Y would not be closed,
set, and i, i 6∈ Y is an item in I 1 .                                    and this is in contradiction with the hypothesis.
   Looking at Figure 1.(a), we can unfortunately dis-                        As regards the proof of the Theorem, we show it by
cover multiple generators gen = Y ∪ i, whose closures                     constructing a sequence of closed itemsets and associ-
produce an identical closed itemset. For example, we                      ated generators having the properties stated above.
have four generators, {A}, {A, B}, {A, C} and {B, C},                        We have that Y0 = c (∅). All the items in Y0 ap-
whose closure is equal to the closed itemsets {A, B, C}.                  pear in every transaction of the dataset and therefore
Note that all these generators have the form Y ∪ i,                       by definition of closure they must be included also in
since they can be obtained by adding a single items to                    Y , i.e. Y0 ⊆ Y .
a smaller closed itemset, namely ∅, {B} and {C}.                             Since Y0       6= Y by definition, we choose
   The technique exploited by our algorithm to detect                     i0 = min≺ (Y \ Y0 ), i.e. i0 is the smallest item
duplicate generators exploits a total lexicographic or-                   in {Y \ Y0 } with respect to the lexicographic or-
der relation ≺ between all the itemsets of our search                     der ≺, in order to create the first order preserv-
space2 . Since there exist a relation ≺ between each pair                 ing generator {Y0 ∪ i0 }. Afterwards we calculate
of k-itemsets, in order to avoid duplicate closed item-                   Y1 = c(Y0 ∪ i0 ) = c(gen0 ).
sets, we do not compute the closure of the generators                        Once Y1 is found, if Y1 = Y we stop.
                                                                             Otherwise we choose i1 = min≺ (Y \ Y1 ), where
1   For each closed itemset Y 0 6= c (∅), it is straightforward to
                                                                          i0 ≺ i1 by construction, in order to build the next or-
    show that there must exists at least a generator having the           der preserving generator gen1 = Y1 ∪ i1 and we calcu-
    form gen = Y ∪ i, where Y , Y ⊂ Y 0 , is a closed itemset, i 6∈ Y ,   late Y2 = c(Y1 ∪ i1 ) = c(gen1 ).
    and Y 0 = c(gen).
                                                                             Once Y2 is found, if Y2 = Y we stop, otherwise we
2   This lexicographic order is induced by an order relation be-
    tween single item literals, according to which each k-itemset         iterate, by choosing i2 = min≺ (Y \ Y2 ), and so on.
    I can be considered as a sorted set of k distinct items                  Note that each generator genj = {Yj ∪ ij } is order
    {i0 , i1 , ..., ik }.                                                 preserving, because c({Yj ∪ ij }) = Yj+1 ⊆ Y and ij is
the minimum item in {Y \Yj } by construction, i.e. ij ≺                  set of closed itemsets already mined to the tid-lists as-
{Yj+1 \ {Yj ∪ ij }}.                                                     sociated with single items. This new technique is not
                                                                        resource demanding, because frequent closed itemsets
                                                                         need not to be stored in the main memory during the
Corollary 1 For each closed itemset Y 6= c (∅),                          computation, and it is not time demanding, because the
the sequence of order preserving generators                              order preserving check is cheaper than searching the set
{gen0 , gen1 , . . . , genn } = {Y0 ∪ i0 , Y1 ∪ i1 , . . . , Yn ∪ in }   of closed itemsets mined so far. Note that CLOSET+
as stated in Theorem 1 is unique.                                        needs the initial FP-tree as an additional requirement
Proof. Since all the items in Y0 appear in every transac-                the current FP-tree in use, and morover does not use
tion of the dataset, by definition of closure, they must                 its upward checking tchnique with dense datasets.
be included also in Y , we have that Y0 = c (∅).
   During the construction of the sequence of genera-                    5. The DCI Closed algorithm.
tors, suppose that we choose ij 6= min≺ (Y \ Yj ) to con-
struct generator genj . Since genj and all the following                    The pseudo-code of the recursive procedure
generators must be order preserving, it should be im-                    DCI Closed() is shown in Algorithm 1. The pro-
possible to obtain Y , since we can not consider any-                    cedure receives three parameters: a closed item-
more the item i = min≺ (Y \ Yj ) ∈ Y in any other gen-                   sets CLOSED SET, and two sets of items, i.e. the
erator or closure in order to respect the order preserv-                 PRE SET and POST SET.
ing property.                                                               The procedure will output all the non-duplicate
                                                                        closed itemsets that properly contain CLOSED SET.
   Looking at Figure 1.(a), for each closed itemset we                   In particular, the goal of the procedure is to deeply
can easily identify those unique sequences of order pre-                 explore each valid new generator obtained from
serving generators. For example, for the the closed                      CLOSED SET by extending it with all the ele-
itemset Y = {A, B, C, D}, we have Y0 = c(∅) = ∅,                         ment in POST SET.
gen0 = ∅ ∪ A, Y1 = c(gen0 ) = {A, B, C}, gen1 =                             Before calling procedure DCI Closed(), the dataset
{A, B, C} ∪ D, and, finally, Y = c(gen1 ). Another ex-                   D is scanned to determine the frequent single items
ample regards the closed itemset Y = {B, D}, where                       F1 ⊆ I, and to build the bitwise vertical dataset
we have Y0 = c(∅) = ∅, gen0 = ∅ ∪ B, Y1 = c(gen0 ) =                     VD containing the various tid-lists g(i), ∀i ∈ F1 .
B, gen1 = B ∪ D, and, finally, Y = c(gen1 ).                             The procedure is thus called by passing as arguments
                                                                         CLOSED SET = c(∅), PRE SET = ∅, and POST SET
   In order to exploit the results of Theorem 1, we need                 = F1 \ c(∅). Note that the itemset c(∅) contains, if
a fast way to check whether a generator is order pre-                    any, the items that occur in all the transactions of the
serving.                                                                 dataset D.
Lemma 3 Let gen = Y ∪ i be a generator of a closed                          The procedure builds all the possible generators,
itemset where Y is a closed itemset and i 6∈ Y , and let                 by extending CLOSED SET with the various items
pre-set(gen) = {j ≺ i | j 6∈ gen}. gen is not order pre-                 in POST SET (lines 2–3). The infrequent and dupli-
serving, iff ∃j ∈ pre-set(gen), such that g(gen) ⊆ g(j).                 cate generators (i.e., the not order preserving ones) are
                                                                         however discarded as invalid (lines 4-5). Note that the
Proof. If g(gen) ⊆ g(j), then j ∈ c(gen). Since, by hy-                  items i ∈ P OST SET used to obtain those invalid gen-
pothesis, j ≺ i, it is not true that i ≺ (c(gen) \ gen) be-              erators will no longer be considered in the following
cause j ∈ (c(gen) \ gen).                                                recursive calls. Only the valid generators are then ex-
                                                                        tended to compute their closure (lines 6–15). It is worth
   The previous Lemma introduces the concept of                          noting that each generator new gen ← CLOSED SET
pre-set(gen), where gen = {Y ∪ i} is a genera-                           ∪ i is strictly extended according to the order preserv-
tor, and gives a way to check the order preserv-                         ing property, i.e. by using all items j ∈ POST SET
ing property of gen by scanning all the g(j), for all                    such that i ≺ j (line 8). Note that all the items j, i ≺ j,
j ∈ pre-set(gen).                                                        which do not belong to c(new gen) are included in the
   We have thus contributed a deep study on the the                      new POST SET (line 12) and are used for the next re-
problem of duplicates in mining frequent closed item-                    cursive call. At the end of this process, a new closed set
sets. By reformulating the duplicates problem as the                     (CLOSED SETN ew ← c(new gen)) is obtained (line
problem of visiting the lattice of frequent itemsets, ac-                15). From this new closed set, new generators and cor-
cording to a total (lexicographic) order, we have moved                  responding closed sets can be build, by recursively call-
the dependencies of the order preserving check from the                  ing the procedure DCI Closed() (line 16). Finally, it
Algorithm 1 DCI-closed pseudocode
 1: procedure DCI Closed(CLOSED SET, PRE SET, POST SET)
 2:    for all i ∈ POST SET do                                                          . Try to create a new generator
 3:       new gen ← CLOSED SET ∪ i
 4:       if supp(new gen) ≥ min supp then                                                          . new gen is frequent
 5:           if is dup(new gen, PRE SET) = FALSE then                                                . Duplication check
 6:               CLOSED SETN ew ← new gen
 7:               POST SETN ew ← ∅
 8:               for all j ∈ POST SET, i ≺ j do                                         . Compute closure of new gen
 9:                  if g(new gen) ⊆ g(j) then
10:                      CLOSED SETN ew ← CLOSED SETN ew ∪ j
11:                  else
12:                      POST SETN ew ← POST SETN ew ∪ j
13:                  end if
14:               end for
15:               Write out CLOSED SETN ew and its support
16:               DCI Closed(CLOSED SETN ew , PRE SET, POST SETN ew )
17:               PRE SET ← PRE SET ∪ i
18:           end if
19:       end if
20:    end for
21: end procedure
22:
23:
24: function is dup(new gen, PRE SET)
25:    for all j ∈ PRE SET do                                                                           . Duplicate check
26:       if g(new gen) ⊆ g(j) then
27:           return FALSE                                                          . new gen is not order preserving!!
28:       end if
29:      end for
30:      return TRUE
31: end function
is worth to point out that, in order to force the lexico-    dure to only build new generators X ∪ j, where i ≺ j
graphic order of the visit, the two for all’s (line 2 and    (according to the hypotheses of Theorem 1.
line 8) have to extract items from POST SET while re-           The composition of the new PRE SET depends in-
specting this order.                                         stead on the valid generators3 that precedes new gen =
    Before recursively calling the procedure, it is neces-   Y ∪ i in the lexicographic order. If all the generators
sary to prepare the suitable PRE SET and POST SET            were valid, it would simply be composed of all the items
to be passed to the new recursive call of the procedure.     j that precede i in the lexicographic order, and j 6∈ X =
Upon each recursive call to the procedure, the size of       c(new gen). In other words, the new PRE SET would
the new POST SET is monotonically decreased, while           be the complement set of X ∪ POST SETnew .
the new PRE SET’s size is instead increased.                    While the composition of POST SET guarantees
    As regards the composition of the new POST SET,          that the various generators will be produced accord-
assume that the closed set X =CLOSED SETnew                  ing to the lexicographic order ≺, the composition of
passed to the procedure (line 16) has been obtained          PRE SET guarantees that duplicate generators will be
by computing the closure of a generator new gen =            pruned by function is dup().
Y ∪ i (c(new gen)), where Y =CLOSED SET and
                                                                Since we have shown that for each closed itemset
i ∈ POST SET. The POST SETnew to be passed to
                                                             Y exists one and only one sequence of order preserv-
the recursive call of the procedure is built as the set of
                                                             ing generators and since our algorithm clearly explores
all the items that follow i in the lexicographic order and
                                                             every possible order preserving generator from every
that have not been already included in X. More for-
mally, POST SETnew = {j ∈ F1 | i ≺ j and j 6∈ X}.
This condition allows the recursive call of the proce-       3   The ones that have passed the frequency and duplicate tests.
closed itemset, we have that the algorithm is complete        we may know that some large portions of the bitwise
and does not produce any duplicate.                           tidlists g(X) are however strictly included in g(j). Let
                                                              g(X)j be the portion of the bitwise tidlist g(X) strictly
5.0.1. Some optimizations exploited in the al-                included in the corresponding portion of g(j), namely
gorithm. We adopted a large amount of optimizations           g(j). Hence, since g(X)j ⊆ g(j), it is straightforward to
to reduce the cost of the bitwise intersections, needed
                                                              show that g(X ∪ Y )j ⊆ g(j) continues to hold, for all
for the duplication and closure computations (line 10
                                                              itemset Y used to extend X, because g(X ∪ Y ) ⊆ g(X)
and 34). For the sake of simplicity, these optimizations
                                                              holds . So, when we extend X to obtain a new genera-
are not reported in the pseudo-code shown in Algo-
                                                              tor, we can limit the inclusion check of the various g(j)
rithm 1.
                                                              to the complementary portions of tid-lists g(j), thus
   DCI-CLOSED inherits the internal representation of
                                                              strongly reducing the cost of them.
our previous works DCI[4] and kDCI[3].The dataset is
stored in the main memory using a vertical bitmap rep-        5.0.2. Dealing with sparse datasets. It is possi-
resentation. With two successive scans of the dataset,        ble to show that in sparse datasets the number of closed
a bitmap matrix DM ×N is stored in the main memory.           itemsets is nearly equal to the number of frequent ones,
The D(i, j) bit is set to 1 if and only if the j -th trans-   so near that they are often the same. This means that
action contains the i -th frequent single item. Row i of      the techniques for mining closed itemsets are of no use,
the matrix thus represent the tid-list of item i.             because almost every duplicate checking or closure cal-
   The columns of D are then reordered to profit of           culating procedure is likely to fail.
data correlation. This is possible and highly worthwhile         For this reason, in case of sparse datasets, we pre-
when we mine dense datasets. As in [3][4], columns are        ferred to exploit our frequent itemset mining algorithm
reordered to create a submatrix E of D having all its         [3] with an additional closedness test over the frequent
rows identical. Every operation (e.g. intersection ones)      itemset discovered. Given a new frequent itemset X,
involving rows in the submatrix E will be performed           every of it subset of length |X| − 1 with the same sup-
only once, thus gaining strong performance improve-           port as X is marked as non closed. Experiments showed
ments.                                                        that this approach is fruitful (see Fig. 2.b).
   This kind of representation fits with our framework,
because the three main operations, i.e. support count,        5.0.3. Space complexity. For all the algorithms re-
closure, and duplicates check, can be fastly performed        quiring to keep in the main memory the whole set of
with cheap bit-wise AND/OR operation.                         closed itemsets to perform the duplicate check, the size
                                                              of the output is actually a lower bound on their space
    Besides the DCI optimizations, specifically tailored      complexity. Conversely, we will show that the amount
for sparse and dense datasets, we exploited more spe-         of memory required by an implementation based on
cific techniques made possible by the depth-first visit       our duplicate discarding technique is independent of
of the lattice of itemsets.                                   the size of the output. To some extent, its memory oc-
    In order to determine that the itemset X is closed,       cupation depends on those data structures that also
the tidlist g(X) must have been compared with all the         need to be maintained in memory by other algorithms
g(j)’s, for all items j contained in the pre-list (post-      that visit depth-first the lattice and exploit tid-list in-
list) of X, i.e. the items that precede (follows) all         tersections adopting a vertical datasets.
items included in X according to a lexicographic or-             The main information needed to be kept in the
der. The PRE SET must have been accessed for check-           main memory is the tid-list of each node in the cur-
ing duplicate generators, and the POST SET for com-           rent path along the lattice, and the tid-list of every
puting the closure. In particular, for all j ∈ PRE SET        frequent single item. In this way we are able to browse
∪ POST SET, we already know that g(X) * g(j), oth-            the search space intersecting nodes with single item
erwise those items j must have been included in X.            tid-lists, and to discard duplicates checking the order
    Therefore, since g(X) must have already been com-         preserving property.
pared with all the g(j), for all items j contained in            The worst case of memory occupation happens when
the PRE SET (POST SET) of X, we may save some                 the number of generators and frequent single items is
important information regarding each comparison be-           maximal: this occurs when c(∅) = ∅ and every item-
tween g(j) and g(X). Such information will be used            set is frequent and closed. If N is the number of fre-
to reduce the cost of the following use of g(j), when         quent single items, the deepest path has N nodes, and
these tidlists g(j) will have to be exploited to look for     since one of this node is a single item, the total num-
further closed itemsets that include/extend X. In par-        ber of tid-lists to be kept in the main memory is 2N −1.
ticular, even if, for all j, it is true that g(X) * g(j),     Since the length of a tid-list is equal to the number of
                                                               CONNECT                                                                                           RETAIL
                              4                                                                                              2
                         10                                                                                             10
                                   FPCLOSE                                                                                        FPCLOSE
                                   CLOSET+                                                                                        CLOSET+
                                   DCI−CLOSED                                                                                     DCI−CLOSED




                              3
                         10
      size (MegaBytes)




                                                                                                          time (sec.)
                                                                                                                             1
                                                                                                                        10




                              2
                         10




                              1                                                                                              0
                         10                                                                                             10
                              45     40         35        30      25       20        15        10   5                    0.12          0.1          0.08          0.06           0.04        0.02    0
                                                               support %                                                                                        support %


                                                                   (a)                                                                                      (b)
                                                                CHESS                                                                                           PUMSB
                          3                                                                                              3
                         10                                                                                             10
                                   FPCLOSE                                                                                        FPCLOSE
                                   CLOSET+                                                                                        CLOSET+
                                   DCI−CLOSED                                                                                     DCI−CLOSED




                          2                                                                                              2
                         10                                                                                             10
     tempo (sec.)




                                                                                                         time (sec.)




                          1                                                                                              1
                         10                                                                                             10




                          0                                                                                              0
                         10                                                                                             10
                              45          40         35           30            25        20        15                       75      70        65          60               55          50     45    40
                                                               support %                                                                                        support %


                                                                    (c)                                                                                     (d)
                                                               CONNECT                                                                                          PUMSB*
                              3                                                                                              3
                         10                                                                                             10
                                   FPCLOSE                                                                                        FPCLOSE
                                   CLOSET+                                                                                        CLOSET+
                                   DCI−CLOSED                                                                                     DCI−CLOSED




                              2                                                                                              2
                         10                                                                                             10
      tempo (sec.)




                                                                                                          time (sec.)




                              1                                                                                              1
                         10                                                                                             10




                              0                                                                                              0
                         10                                                                                             10
                              45     40         35        30      25       20        15        10   5                        22       20       18          16               14          12      10   8
                                                               support %                                                                                        support %


                                                                    (e)                                                                                         (f)


Figure 2. (a) Memory occupation on the connect dataset as a function of the minimum support threshold.
(b-f) Execution times of FPCLOSE, CLOSET+, and DCI-CLOSET as a function of the minimum support
threshold on various publicly available datasets.
transactions T in the dataset, the space complexity of        The experimental evaluation demonstrated that our
our algorithm is                                           approach is very effective. The proposed implementa-
                                                           tion outperforms FPCLOSE and CLOSET+ in all the
                  O ((2N − 1) × T ) .                      test conducted and requires orders of magnitude less
                                                           memory.
   Figure 2.(a) plots memory occupation of FPCLOSE,
CLOSET+ and our algorithm DCI-CLOSED when                  References
mining the connect dataset as a function of the sup-
port threshold. The experimental results agree with our    [1] Gosta Grahne and Jianfei Zhu. Efficiently using prefix-
estimates: whereas FPCLOSE and CLOSET+ mem-                    trees in mining frequent itemsets. In Proceedings of the
ory occupation grows exponentially because of the huge         IEEE ICDM Workshop on Frequent Itemset Mining Im-
number of closed itemsets generated, our implementa-           plementations, November 2003.
tion needs much less memory (up to two order of mag-       [2] Jiawei Han, Jian Pei, and Yiwen Yin. Mining frequent
nitude) because its occupation depends linearly on N .         patterns without candidate generation. In Proc. SIG-
                                                               MOD ’00, pages 1–12, 2000.
5.1. Experimental results                                  [3] Claudio Lucchese, Salvatore Orlando, Paolo Palmerini,
                                                               Raffaele Perego, and Fabrizio Silvestri. kdci: a multi-
    We tested our implementation on a suite of pub-            strategy algorithm for mining frequent sets. In Proceed-
licly available dense datasets (chess, connect, pumsb,         ings of the IEEE ICDM Workshop on Frequent Itemset
                                                               Mining Implementations, November 2003.
pumsb*), and compared the performances with those of
two well known state of the art algorithms: FPCLOSE        [4] S. Orlando, P. Palmerini, R. Perego, and F. Silvestri.
[1], and CLOSET+ [7]. FPCLOSE is publicly available            Adaptive and resource-aware mining of frequent sets. In
as http://fimi.cs.helsinki.fi/src/fimi06.html,                 Proc. The 2002 IEEE International Conference on Data
                                                               Mining (ICDM02), page 338345, 2002.
while the Windows binary executable of CLOSET+
was kindly provided us from the authors. Since FP-         [5] Nicolas Pasquier, Yves Bastide, Rafik Taouil, and Lotfi
CLOSE was already proved to outperform CHARM in                Lakhal. Discovering frequent closed itemsets for associa-
every dataset, we did not used CHARM in our tests.             tion rules. In Proc. ICDT ’99, 1999.
    The experiments were conducted on a Windows XP         [6] Jian Pei, Jiawei Han, and Runying Mao. Closet: An ef-
PC equipped with a 2.8GHz Pentium IV and 512MB                 ficient algorithm for mining frequent closed itemsets. In
of RAM memory. The algorithms FPCLOSE and DCI-                 SIGMOD International Workshop on Data Mining and
                                                               Knowledge Discovery, May 2000.
CLOSED were compiled with the gcc compiler avail-
able in the cygwin environment.                            [7] Jian Pei, Jiawei Han, and Jianyong Wang. Closet+:
    As shown in Fig. 2.(b-f), DCI-CLOSED outperforms           Searching for the best strategies for mining frequent
both algorithms in all the tests conducted. CLOSET+            closed itemsets. In SIGKDD ’03, August 2003.
performs quite well on the connect dataset with low        [8] Mohammed J. Zaki and Karam Gouda. Fast vertical min-
supports, but in any other case it is about two orders         ing using diffsets. In Technical Report 01-1, Computer
of magnitude slower. FPCLOSE is effective in pumsb*,           Science Dept., Rensselaer Polytechnic Institute, March
where it is near to DCI-CLOSED, but it is at one or-           2001.
der of magnitude slower in all the other tests.            [9] Mohammed J. Zaki and Ching-Jui Hsiao. Charm: An ef-
                                                               ficient algorithm for closed itemsets mining. In 2nd SIAM
                                                               International Conference on Data Mining, April 2002.
6. Conclusions
   In this paper we provide a deep study on the prob-
lem of mining frequent closed itemsets, formalizing a
general framework fitting every mining algorithm. Use
such framework we were able to analyse the problem
of duplicates rising in this new mining problem.
   We have proposed a technique for promptly detect-
ing and discarding duplicates, without the need of keep-
ing in the main memory the whole set of closed pat-
terns, and we implemented this technique into a new
algorithm which uses a vertical bitmap representation
of the dataset.